GPU problems

Use this information to resolve problems that are related to GPUs and the GPU board.

Note
Make sure to update the GPU driver, which includes the nvidia-smi utility required for GPU problem determination. The latest driver can be found on the Drivers and Software download website for ThinkSystem SD650-N V3.

Health check for GPUs and GPU board

The following ipmitool sensor output indicates that the GPUs and the GPU board are in a normal state.

$ ipmitool -I lanplus -H 192.168.70.125 -U USERID -P PASSW0RD sdr elist | grep GPU
GPU Board Power | 8Ch | ok | 21.4 | 250 Watts
GPU Board | E9h | ok | 11.8 | Transition to OK
GPU CPUs | EAh | ok | 11.9 | Transition to OK
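The check above can also be scripted. The following is a minimal Python sketch, assuming the `sdr elist` output keeps the five pipe-separated columns shown above (name, sensor number, status, entity ID, reading). The function names `parse_sdr_elist` and `gpu_sensors_ok` are hypothetical helpers, not part of ipmitool or any Lenovo tool.

```python
def parse_sdr_elist(text):
    """Parse `ipmitool sdr elist` output into a list of dicts."""
    sensors = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        name, number, status, entity, reading = fields
        sensors.append({
            "name": name,
            "number": number,
            "status": status,
            "entity": entity,
            "reading": reading,
        })
    return sensors


def gpu_sensors_ok(text):
    """Return True if at least one GPU sensor is listed and all report `ok`."""
    gpu = [s for s in parse_sdr_elist(text) if "GPU" in s["name"]]
    return bool(gpu) and all(s["status"] == "ok" for s in gpu)


# Sample copied from the output shown above.
SAMPLE = """\
GPU Board Power | 8Ch | ok | 21.4 | 250 Watts
GPU Board | E9h | ok | 11.8 | Transition to OK
GPU CPUs | EAh | ok | 11.9 | Transition to OK
"""

if __name__ == "__main__":
    print(gpu_sensors_ok(SAMPLE))
```

In practice the text would come from the ipmitool command itself, for example by capturing its standard output with `subprocess.run(..., capture_output=True, text=True)`.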
The summary output of the nvidia-smi utility shows 4 GPUs online.
Figure 1. nvidia-smi
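The GPU count can also be checked without reading the summary table by eye. The sketch below parses the output of `nvidia-smi -L`, which prints one `GPU <index>: <name> (UUID: ...)` line per detected GPU. The helper names, the placeholder model name, and the placeholder UUIDs in the sample are assumptions made for illustration only.

```python
import re

# One line per GPU, e.g. "GPU 0: <model name> (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+)\)$")


def list_gpus(text):
    """Return (index, name) tuples parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus


def all_gpus_online(text, expected=4):
    """Return True if exactly `expected` GPUs are listed."""
    return len(list_gpus(text)) == expected


# Placeholder sample: model name and UUIDs are made up for the example.
SAMPLE_L = "\n".join(
    f"GPU {i}: NVIDIA Placeholder-Model (UUID: GPU-00000000-0000-0000-0000-00000000000{i})"
    for i in range(4)
)

if __name__ == "__main__":
    print(list_gpus(SAMPLE_L))
    print(all_gpus_online(SAMPLE_L))
```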

System fails to detect the GPU board

When the event "Sensor GPU Board has transitioned to critical from a less severe state" appears in the XCC web event log, the system has failed to detect the GPU board. Complete the following steps to resolve the problem.

  1. Power cycle the system.
  2. Check power input related events in XCC and SMM2 (see SMM2 - Power).
  3. Check the system temperature and water flow. Look for leakage, and disconnect then reconnect the water cooling system.
  4. Reboot the system, and run the IPMI health check (see Health check for GPUs and GPU board).
  5. One of the following indicates the problem has been solved:
    • FQXSPUN0017I (Sensor GPU Board has transitioned to normal state) in XCC messages
    • Sensor GPU Board has transitioned to normal state in web log
    However, if the problem persists, complete the following steps:
    1. Collect XCC service data (see Collecting service data).
    2. Contact Lenovo Service.
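The recovery check in step 5 can be sketched in Python as follows. This assumes the event messages have already been exported from XCC as plain strings in chronological order (oldest first); the function name `sensor_recovered` and the exact message layout, including the `FQXSPUN0017I:` prefix in the usage example, are assumptions for illustration.

```python
def sensor_recovered(messages, sensor="GPU Board"):
    """Return True if the latest transition event for `sensor` is
    'normal state', False if it is anything else, and None if the
    log contains no transition event for that sensor."""
    prefix = f"Sensor {sensor} has transitioned to "
    state = None
    for msg in messages:
        idx = msg.find(prefix)
        if idx != -1:
            # Keep only the text after the prefix, without a trailing period.
            state = msg[idx + len(prefix):].strip().rstrip(".")
    if state is None:
        return None
    return state == "normal state"


if __name__ == "__main__":
    log = [
        "Sensor GPU Board has transitioned to critical from a less severe state",
        "FQXSPUN0017I: Sensor GPU Board has transitioned to normal state.",
    ]
    print(sensor_recovered(log))
```

A result of `False` or `None` after the reboot corresponds to the "problem persists" branch, where service data is collected and Lenovo Service is contacted.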

System fails to detect a specific GPU

When the event "Sensor GPU CPUs has transitioned to critical from a less severe state" appears in the XCC web event log, the system has failed to detect one or more specific GPUs. Complete the following steps to resolve the problem.

  1. Check the XCC events to see whether the retimer is over temperature. If it is, skip the next step.
  2. Download the latest firmware from Data Center Support site (Lenovo Data Center Support for ThinkSystem SD650-N V3), and update the firmware.
  3. Reboot the system, and run the IPMI health check (see Health check for GPUs and GPU board).
  4. If the event Sensor GPU Board has transitioned to normal state appears in the XCC web event log, it indicates the problem has been solved.

    However, if the problem persists, complete the following steps.
    1. Check the XCC web event log to identify the defective unit and problem type (see XCC GPU sensor specifications).
    2. Collect XCC service data (see Collecting service data).
    3. Run nvidia-smi for diagnosis (see NVIDIA System Management Interface for details).
      Note
      Make sure to update the GPU driver, which includes the nvidia-smi utility required for GPU problem determination. The latest driver can be found on the Drivers and Software download website for ThinkSystem SD650-N V3.
    4. Run nvidia-bug-report.sh (embedded tool in NVIDIA driver).
    5. Contact Lenovo Service.

XCC GPU sensor specifications

When an event appears in the XCC web event log, refer to the following tables to identify the defective unit and problem type. For example:

6 | 01/08/2021 | 14:34:53 | 0x0020 | Add-in Card GPU Board | Transition to Critical from less severe | Asserted | 0xA2F60F
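The last field of the sample event can be split into its bytes before looking them up. The sketch below assumes that the trailing hex value holds Event Data 1, 2, and 3 in that order (most significant byte first); this is inferred from the sample, not stated by the source. The bit layout of Event Data 1 (byte 2 and byte 3 content flags in the upper bits, event offset in the lower four bits) follows the IPMI specification. The function name `split_event_data` is hypothetical.

```python
def split_event_data(word):
    """Split a 3-byte event data value such as 0xA2F60F into its fields.

    Accepts an int or a hex string (with or without the 0x prefix).
    """
    value = int(word, 16) if isinstance(word, str) else word
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError("expected a 3-byte value")
    ed1 = (value >> 16) & 0xFF
    ed2 = (value >> 8) & 0xFF
    ed3 = value & 0xFF
    return {
        "offset": ed1 & 0x0F,         # event offset, e.g. 02h
        "ed2_type": (ed1 >> 6) & 0x3,  # 2 = OEM code in Evt Data2
        "ed3_type": (ed1 >> 4) & 0x3,  # 2 = OEM code in Evt Data3
        "ed2": ed2,
        "ed3": ed3,
    }


if __name__ == "__main__":
    fields = split_event_data("0xA2F60F")
    print({k: f"{v:02X}h" for k, v in fields.items()})
```

Under these assumptions, the sample yields offset 02h (Transition to Critical from less severe), Evt Data2 F6h, and Evt Data3 0Fh.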
Table 1. XCC GPU sensor specifications 1/2
Sensor name | Sensor Number | Sensor Type | Sensor Reading Type | Entity ID | Instance/Type
GPU board | E9h | 17h | 07h | 0Bh | 01h
GPU CPUs | EAh | 17h | 07h | 0Bh | 02h

Reading Mask (data set to sensor) for GPU board:

00h: Transition to OK

02h: Transition to Critical from less severe

  • Evt Data2 (see note 1):
    • F1h: GPU Power Brake (no Evt Data3)
    • F2h: PIB Thermaltrip (no Evt Data3)
    • F6h: GPU core thermal alert
    • F8h: PIB over temp
  • Evt Data3:
    • XXh: GPU CORE index, for example 01h: core 1
    • 07h: core 1 + core 2 + core 3

Reading Mask (data set to sensor) for GPU CPUs:

02h: Transition to Critical from less severe

  • Evt Data2:
    • B#h: Thermal alert
    • BBh: Presence and Power status
    • 21h: PCIe link status
    • E0h: GPU count from SMBIOS
    • 3Ah: Card-Health sensor
  • Evt Data3:
    • XXh: GPU CORE index, for example 01h: core 1
    • 0Ch: core 3 + core 4
    • When Evt Data2 is B#h, Evt Data3 is the VR ID.
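The lookup described by Table 1 can be sketched in Python. This is an illustration under stated assumptions, not a Lenovo tool: the names `describe` and `cores_from_mask` are hypothetical, Evt Data3 is treated as a bitmask with bit 0 meaning core 1 (consistent with 07h and 0Ch in the table), and any Evt Data2 value that is not listed is reported as unlisted instead of being guessed.

```python
# Evt Data2 codes copied from Table 1.
BOARD_ED2 = {
    0xF1: "GPU Power Brake",
    0xF2: "PIB Thermaltrip",
    0xF6: "GPU core thermal alert",
    0xF8: "PIB over temp",
}
GPUS_ED2 = {
    0xBB: "Presence and Power status",
    0x21: "PCIe link status",
    0xE0: "GPU count from SMBIOS",
    0x3A: "Card-Health sensor",
}


def cores_from_mask(ed3):
    """Translate an Evt Data3 bitmask into 1-based core numbers."""
    return [bit + 1 for bit in range(8) if ed3 & (1 << bit)]


def describe(sensor, ed2, ed3):
    """Return (problem, detail) for a 'Transition to Critical' event."""
    if sensor == "GPU board":
        problem = BOARD_ED2.get(ed2, "unlisted or combined value")
        if ed2 in (0xF1, 0xF2):
            return problem, None  # the table marks these as having no Evt Data3
        return problem, {"cores": cores_from_mask(ed3)}
    if sensor == "GPU CPUs":
        if ed2 in GPUS_ED2:  # exact codes first, so BBh is not read as B#h
            return GPUS_ED2[ed2], {"cores": cores_from_mask(ed3)}
        if ed2 >> 4 == 0xB:  # B#h: thermal alert, Evt Data3 carries the VR ID
            return "Thermal alert", {"vr_id": ed3}
        return "unlisted value", {"cores": cores_from_mask(ed3)}
    raise ValueError(f"unknown sensor: {sensor}")


if __name__ == "__main__":
    # Evt Data2 = F6h and Evt Data3 = 0Fh, as in the sample event 0xA2F60F.
    print(describe("GPU board", 0xF6, 0x0F))
```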

Table 2. XCC GPU sensor specifications 2/2
Sensor name | SEL Logged Assertions | SEL Logged De-assertions | Thresholds Settable (B20) | LED ‘ON’ Request when Assertion (F = Fault LED) | LED ‘OFF’ Request when De-assertion (F = Fault LED)
GPU board | 02h | 02h | N/A | 00h - None; 02h - F | 00h - None; 02h - F
GPU CPUs | 02h | 02h | N/A | 02h - F | 02h - F
1. Evt Data2 values can be summed into a single value. For example, F7h: F1 + F2 + F4, and F3h: F1 + F2.
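The note above can be illustrated with a short Python sketch. It assumes the combination works as a bitwise OR of the low four bits while the high nibble stays fixed, which matches both examples in the note (F7h and F3h) but is not stated explicitly by the source. The function name `split_combined` is hypothetical. Because Table 1 lists F6h as a code of its own, look for an exact match in the table before splitting a value this way.

```python
def split_combined(ed2):
    """Split a combined Evt Data2 value into single-flag components.

    Assumes the low nibble is a set of flags and the high nibble is fixed.
    """
    high = ed2 & 0xF0
    return [high | bit for bit in (0x1, 0x2, 0x4, 0x8) if ed2 & bit]


if __name__ == "__main__":
    for value in (0xF7, 0xF3):
        parts = " + ".join(f"{p:02X}h" for p in split_combined(value))
        print(f"{value:02X}h: {parts}")
```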