
MI300X GPU problems

Use this information to resolve problems that are related to GPU and heat sink modules and the GPU baseboard.

Health check for GPUs

Note

Use one of the following utilities to check the GPU health status. Make sure that the GPU driver is up to date, because the driver package includes these utilities. The latest driver is available on the Drivers and Software download website for ThinkSystem SR685a V3.

For more information about the System Management Interface (SMI), see AMD System Management Interface.

  • rocm-smi

    Run rocm-smi to confirm that all eight GPUs are online.

    Figure 1. rocm-smi
  • rocm-smi --showrasinfo

    Run rocm-smi --showrasinfo to display the RAS information, including error counters, of the eight GPUs.

    Figure 2. rocm-smi --showrasinfo
  • rocm-smi --showhw

    Run rocm-smi --showhw to display hardware details of the eight GPUs.

    Figure 3. rocm-smi --showhw
  • rocm-smi -a

    Run rocm-smi -a to display all available status information for the eight GPUs.

    Figure 4. rocm-smi -a
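The first check above (all eight GPUs online) can be scripted. The following is a minimal sketch, not part of the vendor procedure. It assumes that rocm-smi is on the PATH and that its --json output lists each GPU under a top-level key such as "card0"; verify that key format against your driver version. The names EXPECTED_GPUS, count_gpus, and check_gpu_count are hypothetical.

```python
import json
import subprocess

EXPECTED_GPUS = 8  # the checks above expect eight GPUs online


def count_gpus(smi_json: str) -> int:
    """Count per-GPU entries in rocm-smi JSON output.

    Assumption: each GPU appears under a top-level key that starts
    with "card" (for example "card0"); other keys are ignored.
    """
    data = json.loads(smi_json)
    return sum(1 for key in data if key.startswith("card"))


def check_gpu_count() -> bool:
    """Run rocm-smi and report whether all expected GPUs are visible."""
    result = subprocess.run(
        ["rocm-smi", "--showid", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    found = count_gpus(result.stdout)
    print(f"{found} of {EXPECTED_GPUS} GPUs detected")
    return found == EXPECTED_GPUS
```

Call check_gpu_count() on the server itself. If it returns False, continue with the event-log checks in the next section.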

System fails to detect a specific GPU

When one of the following events appears in the XCC web event log, the system has failed to detect one or more GPUs.

  • When event FQXSPIO0015M: Fault in slot [PhysicalConnectorSystemElementName] on system [ComputerSystemElementName]. appears, see FQXSPIO0015M to solve the problem.
  • When event FQXSFIO0010M: An Uncorrectable PCIe Error has Occurred at Bus [arg1] Device [arg2] Function [arg3]. The Vendor ID for the device is [arg4] and the Device ID is [arg5]. The physical [arg6] number is [arg7]. appears, see FQXSFIO0010M to solve the problem.
    Note
    Parameters:
    • [arg1] Bus
    • [arg2] Device
    • [arg3] Function
    • [arg4] VID
    • [arg5] DID
    • [arg6] Slot/Bay
    • [arg7] Instance number
  • When event FQXSPUN0019M: Sensor [SensorElementName] has transitioned to critical from a less severe state. appears, see FQXSPUN0019M to solve the problem.
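When many log entries must be triaged, the fields of the FQXSFIO0010M message can be extracted with a regular expression built from the template quoted above. This is a sketch under assumptions: the exact rendering of the values in a real log (for example, hexadecimal prefixes) may differ, and the function name parse_pcie_error and the sample values in the usage note are made up for illustration.

```python
import re
from typing import Dict, Optional

# Pattern derived from the FQXSFIO0010M message template:
# [arg1] Bus, [arg2] Device, [arg3] Function, [arg4] VID, [arg5] DID,
# [arg6] Slot/Bay, [arg7] Instance number.
PCIE_ERROR = re.compile(
    r"Bus (?P<bus>\S+) Device (?P<device>\S+) Function (?P<function>\S+)\. "
    r"The Vendor ID for the device is (?P<vid>\S+) "
    r"and the Device ID is (?P<did>\S+)\. "
    r"The physical (?P<kind>\S+) number is (?P<instance>\S+?)\.?$"
)


def parse_pcie_error(message: str) -> Optional[Dict[str, str]]:
    """Return the seven message arguments as a dict, or None if no match."""
    match = PCIE_ERROR.search(message.strip())
    return match.groupdict() if match else None
```

For a hypothetical entry that ends in "... at Bus 05 Device 00 Function 0. The Vendor ID for the device is 1002 and the Device ID is 74A1. The physical Slot number is 17.", the returned dict carries "kind" as "Slot" and "instance" as "17", which can then be matched against the slot mapping table.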
Note
The following table shows the mapping between the slot numbering in XCC and the physical GPU sockets.

Slot numbering in XCC    Physical GPU socket
Slot 17                  OAM 7
Slot 18                  OAM 6
Slot 19                  OAM 4
Slot 20                  OAM 5
Slot 21                  OAM 3
Slot 22                  OAM 2
Slot 23                  OAM 0
Slot 24                  OAM 1

[Figure: Location of the GPU sockets]
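For scripted triage, the slot mapping above can be kept as a small lookup table. This is a minimal sketch; the values are copied from the table, while the names XCC_SLOT_TO_OAM and oam_for_slot are hypothetical.

```python
# XCC slot number -> physical OAM socket number, copied from the table.
XCC_SLOT_TO_OAM = {
    17: 7,
    18: 6,
    19: 4,
    20: 5,
    21: 3,
    22: 2,
    23: 0,
    24: 1,
}


def oam_for_slot(slot: int) -> int:
    """Translate an XCC slot number into the physical OAM socket number."""
    try:
        return XCC_SLOT_TO_OAM[slot]
    except KeyError:
        raise ValueError(
            f"Slot {slot} is not a GPU slot (expected 17 to 24)"
        ) from None
```

For example, an event that reports a fault in Slot 19 points to the GPU in socket OAM 4.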