GPU problems

Use this information to resolve problems that are related to H100/H200 GPUs.

Health check for GPUs

Note
  • Use one of the following utilities to check the GPU health status. The utilities are included with the GPU driver, so make sure that the driver is up to date. The latest driver is available at the Drivers and Software download website for ThinkSystem SR680a V3.

    For more information about the System Management Interface (SMI), see NVIDIA System Management Interface.

  • The following table shows the mapping information between module IDs and physical GPU sockets.

    Module ID    Physical GPU socket
    1            SXM 1
    2            SXM 2
    3            SXM 3
    4            SXM 4
    5            SXM 5
    6            SXM 6
    7            SXM 7
    8            SXM 8

    [Figure: location of the GPU sockets]
  • nvidia-smi

    Run the nvidia-smi utility to display the eight GPUs online.

    Figure 1. nvidia-smi
  • nvidia-smi -L

    Run nvidia-smi -L to list the eight online GPUs together with their UUIDs.

    Figure 2. nvidia-smi -L
  • nvidia-smi -q --id=1 -f <output file name>

    Run nvidia-smi -q --id=1 -f <output file name> to export the GPU inventory information to a file.

    Replace <output file name> with the path of the file in which to store the output. For example: nvidia-smi -q --id=1 -f /tmp/queryoam1.txt.

    Figure 3. nvidia-smi -q --id=1 -f <output file name>
  • nvidia-smi --id=0 -q -d ECC,PAGE_RETIREMENT

    Run nvidia-smi --id=0 -q -d ECC,PAGE_RETIREMENT to report ECC (Error Checking and Correction) errors and the status of retired pages. The output is similar to the following:

    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
            SRAM Threshold Exceeded : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2 : 0
            SRAM SM : 0
            SRAM Microcontroller : 0
            SRAM PCIE : 0
            SRAM Other : 0
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
  • nvidia-smi pci --getErrorCounters

    Run nvidia-smi pci --getErrorCounters to display the error counters of all eight GPUs.

    Figure 4. nvidia-smi pci --getErrorCounters
  • nvidia-smi pci --getErrorCounters --id=<id number>

    Run nvidia-smi pci --getErrorCounters --id=<id number> to display the error counters of a specific GPU.

    Replace <id number> with the ID of the GPU to check. For example: nvidia-smi pci --getErrorCounters --id=2.

    Figure 5. nvidia-smi pci --getErrorCounters --id=<id number>
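
The checks above can also be scripted. The following Python sketch is illustrative only and is not part of the Lenovo or NVIDIA tooling. It parses the text printed by nvidia-smi -L to find which of the eight expected GPU indices are missing, and scans the output of nvidia-smi -q -d ECC,PAGE_RETIREMENT for nonzero ECC counters. The function names (parse_gpu_list, missing_gpu_indices, nonzero_ecc_counters) are hypothetical, and the parsing assumes the output layout shown in this section; other driver versions may format the output differently.

```python
import re
import shutil
import subprocess

EXPECTED_GPUS = 8  # this section expects eight GPUs online

# Assumed line format of `nvidia-smi -L`: "GPU <index>: <name> (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9A-Fa-f-]+)\)$")

# Section headings taken from the sample ECC output in this section.
ECC_SECTIONS = {"Volatile", "Aggregate", "Aggregate Uncorrectable SRAM Sources"}


def parse_gpu_list(text):
    """Return [(index, name, uuid), ...] parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def missing_gpu_indices(gpus, expected=EXPECTED_GPUS):
    """Return the zero-based GPU indices in range(expected) that were not listed."""
    present = {index for index, _, _ in gpus}
    return [i for i in range(expected) if i not in present]


def nonzero_ecc_counters(text):
    """Return {(section, counter): count} for nonzero counters in ECC sections."""
    section = None
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" not in line:
            section = line  # a heading such as "Volatile" or "Aggregate"
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if section in ECC_SECTIONS and value.isdigit() and int(value) > 0:
            found[(section, key.strip())] = int(value)
    return found


def run(args):
    """Run a command and return its standard output as text."""
    return subprocess.run(args, capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; install or update the GPU driver first")
    else:
        gpus = parse_gpu_list(run(["nvidia-smi", "-L"]))
        print("Missing GPU indices:", missing_gpu_indices(gpus))
        for index, name, _ in gpus:
            report = run(["nvidia-smi", f"--id={index}", "-q", "-d", "ECC,PAGE_RETIREMENT"])
            print(f"GPU {index} ({name}) nonzero ECC counters:", nonzero_ecc_counters(report))
```

An empty list of missing indices and empty counter dictionaries correspond to the healthy state shown in the sample output above. Any other result is a prompt to inspect the full nvidia-smi output by hand, not a diagnosis in itself.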

System fails to detect a specific GPU

When one of the following events appears in the XCC web event log, the system has failed to detect one or more GPUs.

  • If the event "FQXSPIO0015M: Fault in slot [PhysicalConnectorSystemElementName] on system [ComputerSystemElementName]." appears, see FQXSPIO0015M to solve the problem.
  • If the event "FQXSFIO0010M: An Uncorrectable PCIe Error has Occurred at Bus [arg1] Device [arg2] Function [arg3]. The Vendor ID for the device is [arg4] and the Device ID is [arg5]. The physical [arg6] number is [arg7]." appears, see FQXSFIO0010M to solve the problem.
    Note
    Parameters:
    • [arg1] Bus
    • [arg2] Device
    • [arg3] Function
    • [arg4] VID
    • [arg5] DID
    • [arg6] Slot/Bay
    • [arg7] Instance number
  • If the event "FQXSPUN0019M: Sensor [SensorElementName] has transitioned to critical from a less severe state." appears, see FQXSPUN0019M to solve the problem.
Note
The following table shows the mapping information between the slot numbering in XCC and physical GPU sockets.

Slot numbering in XCC    Physical GPU socket
Slot 17                  SXM 5
Slot 18                  SXM 7
Slot 19                  SXM 8
Slot 20                  SXM 6
Slot 21                  SXM 1
Slot 22                  SXM 3
Slot 23                  SXM 4
Slot 24                  SXM 2

[Figure: location of the GPU sockets]
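
To show how the slot table can be applied to an event, the following Python sketch extracts the parameters from an FQXSFIO0010M message and looks up the physical GPU socket for the reported slot. It is a minimal illustration, not a Lenovo tool. The pattern is built from the message template quoted above, but how XCC renders each value (for example, hexadecimal or decimal) is an assumption, and the names XCC_SLOT_TO_SXM, PCIE_EVENT, and socket_for_event are hypothetical.

```python
import re

# XCC slot number -> physical GPU socket, copied from the table above.
XCC_SLOT_TO_SXM = {
    17: "SXM 5", 18: "SXM 7", 19: "SXM 8", 20: "SXM 6",
    21: "SXM 1", 22: "SXM 3", 23: "SXM 4", 24: "SXM 2",
}

# Built from the FQXSFIO0010M template; the value formats are assumptions.
PCIE_EVENT = re.compile(
    r"An Uncorrectable PCIe Error has Occurred at Bus (?P<bus>\S+) "
    r"Device (?P<device>\S+) Function (?P<function>\S+)\. "
    r"The Vendor ID for the device is (?P<vid>\S+) "
    r"and the Device ID is (?P<did>\S+)\. "
    r"The physical (?P<kind>\S+) number is (?P<instance>\d+)\."
)


def socket_for_event(message):
    """Return the SXM socket for a slot-type FQXSFIO0010M event, else None."""
    match = PCIE_EVENT.search(message)
    if match is None or match.group("kind").lower() != "slot":
        return None
    return XCC_SLOT_TO_SXM.get(int(match.group("instance")))


if __name__ == "__main__":
    # Made-up sample message; the bus, device, and ID values are placeholders.
    sample = (
        "FQXSFIO0010M: An Uncorrectable PCIe Error has Occurred at Bus 1B "
        "Device 00 Function 0. The Vendor ID for the device is 10DE and the "
        "Device ID is 2330. The physical Slot number is 19."
    )
    print(socket_for_event(sample))
```

A return value of None means that the message did not match the template or that the slot number is not one of the GPU slots in the table, in which case the event does not point to a GPU socket.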