On ESXi 7.x, one of my VMs caused the whole ESXi host to hang. The only way to recover was a hard reset. The root cause was the passthrough NVIDIA Quadro P600 adapter I had configured on the VM. During VM shutdown, the PCI reset applied to the device caused the host to hang.
The following articles helped me: a Reddit thread and a VMware KB article.
Some of the PCI reset types are:
- Function Level Reset (FLR)
- Secondary Bus Reset
- Link Disable/Enable
- Device power state transition (D0 > D3hot > D0; non-standard reset method)
To resolve the issue:
Find the vendor ID and device ID
Use the command esxcli hardware pci list and note the Vendor ID and Device ID fields of the passthrough device (0x10de and 0x1cb2 here).
$ esxcli hardware pci list
0000:04:00.0
Address: 0000:04:00.0
Segment: 0x0000
Bus: 0x04
Slot: 0x00
Function: 0x0
VMkernel Name:
Vendor Name: NVIDIA Corporation
Device Name: GP107GL [Quadro P600]
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x1cb2
SubVendor ID: 0x10de
SubDevice ID: 0x11bd
Device Class: 0x0300
Device Class Name: VGA compatible controller
Programming Interface: 0x00
Revision ID: 0xa1
Interrupt Line: 0x0b
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x3001
Module ID: 49
Module Name: pciPassthru
Chassis: 0
Physical Slot: 4
Slot Description: Slot4
Device Layer Bus Address: s00000004.00
Passthru Capable: true
Parent Device: PCI 0:0:3:0
Dependent Device: PCI 0:0:3:0
Reset Method: Bridge reset
FPT Sharable: true
NUMA Node: -1
Extended Device ID: 0
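On a host with many devices the full listing gets long. As a minimal sketch, the shell function below filters that output down to the vendor and device ID of any device whose Device Name matches a pattern. pci_ids is a hypothetical helper name of mine, not part of esxcli, and it assumes the field order shown above, where Device Name comes before Vendor ID and Device ID in each record.

```shell
# pci_ids PATTERN
# Reads `esxcli hardware pci list` output on stdin and prints
# "vendor-id device-id" (hex, without the 0x prefix) for each device
# whose "Device Name" line matches PATTERN.
# pci_ids is a hypothetical helper, not an esxcli feature.
pci_ids() {
  awk -v pat="$1" '
    # Start of the interesting part of a record: remember whether it matches.
    /Device Name:/ { hit = ($0 ~ pat) }
    hit && /Vendor ID:/ && !/SubVendor/ {
      vendor = $NF; sub(/^0x/, "", vendor)
    }
    hit && /Device ID:/ && !/SubDevice/ && !/Extended/ {
      device = $NF; sub(/^0x/, "", device)
      print vendor, device
      hit = 0   # ignore the remaining fields of this record
    }
  '
}

# Example (on the ESXi host):
#   esxcli hardware pci list | pci_ids "Quadro P600"
```

For the listing above this prints 10de 1cb2, which are the two values passthru.map expects in its first two columns.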
Edit the /etc/vmware/passthru.map file
[root@dell-t5600:~] cat /etc/vmware/passthru.map
# passthrough attributes for devices
#
# file format: vendor-id device-id resetMethod fptShareable
# vendor/device id: xxxx (in hex) (ffff can be used for wildchar match)
# reset methods: flr, d3d0, link, bridge, default
# fptShareable: true/default, false
#
# Description:
#
# - fptShareable: when set to true means the PCI device can be shared.
# Sharing refers to using multiple functions of a multi‐function
# device in different contexts. That is, sharing between two
# virtual machines or between a virtual machine and VMkernel.
#
# - resetMethod: override for the type of reset to apply to a PCI device.
# Bus reset and link reset prevent functions in a multi-function
# device from being assigned to different virtual machines, or from
# being assigned between the VMkernel and virtual machines. In
# some devices it's possible to use PCI power management capability
# D3->D0 transitions to reset the device. In the absence of the
# override, the VMkernel decides the type of PCI reset to apply
# based on the device's capabilities. The VMkernel prioritizes
# function level reset (flr).
#
# Restrictions:
#
# - PCI SR-IOV physical and virtual functions (PFs/VFs) are not allowed
# in the list below. Those must support function-level-reset and
# must be shareable.
#
# Intel 82579LM Gig NIC can be reset with d3d0
8086 1502 d3d0 default
# Intel 82598 10Gig cards can be reset with d3d0
8086 10b6 d3d0 default
8086 10c6 d3d0 default
8086 10c7 d3d0 default
8086 10c8 d3d0 default
8086 10dd d3d0 default
# Broadcom 57710/57711/57712 10Gig cards are not shareable
14e4 164e default false
14e4 164f default false
14e4 1650 default false
14e4 1662 link false
# Qlogic 8Gb FC card can not be shared
1077 2532 default false
# LSILogic 1068 based SAS controllers
1000 0056 d3d0 default
1000 0058 d3d0 default
# NVIDIA
10de ffff bridge false
# AMD FCH SATA Controller [AHCI mode]
1022 7901 d3d0 default
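To check which entry applies to a given device, the file can be scanned for lines matching its vendor and device ID. The function below is a rough sketch; passthru_entry is a hypothetical name. It assumes IDs are written in lowercase hex without the 0x prefix, and that ffff in either ID field acts as a wildcard, as the header comment suggests. How the VMkernel resolves several matching lines is not something I have verified, so every match is printed.

```shell
# passthru_entry VENDOR DEVICE [FILE]
# Prints "resetMethod fptShareable" for every passthru.map line that
# matches the given IDs, treating ffff as a wildcard (an assumption
# based on the comment at the top of the file).
# passthru_entry is a hypothetical helper, not a VMware tool.
passthru_entry() {
  awk -v v="$1" -v d="$2" '
    $1 ~ /^#/ || NF < 4 { next }   # skip comments, blank and incomplete lines
    ($1 == v || $1 == "ffff") && ($2 == d || $2 == "ffff") {
      print $3, $4                 # resetMethod fptShareable
    }
  ' "${3:-/etc/vmware/passthru.map}"
}

# Example (on the ESXi host):
#   passthru_entry 10de 1cb2
```

With the file shown above, the Quadro P600 (10de 1cb2) is caught by the NVIDIA wildcard line, so the function prints bridge false.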
Change “bridge” to “link”
As NVIDIA was already listed, and as the device ID ffff covers all devices, I just had to change the reset type from bridge to link.
From:
# NVIDIA
10de ffff bridge false
To:
# NVIDIA
10de ffff link false
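The same edit can be scripted instead of done by hand in vi. This is only a sketch: set_nvidia_link_reset is a hypothetical function name, and it assumes the NVIDIA line is the stock wildcard entry (10de ffff bridge ...). It keeps a backup next to the original and rewrites only that one line.

```shell
# set_nvidia_link_reset [FILE]
# Switches the NVIDIA wildcard entry from bridge reset to link reset.
# FILE defaults to /etc/vmware/passthru.map; a copy is saved as FILE.bak.
# set_nvidia_link_reset is a hypothetical helper, not a VMware tool.
set_nvidia_link_reset() {
  map_file="${1:-/etc/vmware/passthru.map}"
  cp "$map_file" "$map_file.bak" || return 1   # keep a backup of the original
  # Only the matching line is changed (awk rejoins its fields with single
  # spaces); every other line is copied through unchanged.
  awk '$1 == "10de" && $2 == "ffff" && $3 == "bridge" { $3 = "link" } { print }' \
    "$map_file.bak" > "$map_file"
  grep '^10de' "$map_file"                     # show the resulting NVIDIA entry
}

# Example (on the ESXi host):
#   set_nvidia_link_reset
```

I would expect the host to need a reboot before the new reset method is used, although that is my assumption rather than something I confirmed separately.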