
GPU problems

Use this information to resolve problems that are related to GPUs and the GPU board.

Note
Make sure to update the GPU driver, which includes the nvidia-smi utility required for GPU problem determination. The latest driver is available from the Drivers and Software download website for ThinkSystem SD665-N V3.

Health check for GPUs and GPU board

The following sensor status reported by ipmitool indicates that the GPUs and the GPU board are in a normal state.

$ ipmitool -I lanplus -H 192.168.70.125 -U USERID -P PASSW0RD sdr elist | grep GPU
GPU Board Power | 8Ch | ok | 21.4 | 250 Watts
GPU Board | E9h | ok | 11.8 | Transition to OK
GPU CPUs | EAh | ok | 11.9 | Transition to OK
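The health check above can be scripted. The following is a minimal sketch, assuming the `sdr elist` output uses the pipe-separated layout shown above with the status in the third column; `check_gpu_sensors` is a hypothetical helper name, not part of ipmitool.

```shell
# check_gpu_sensors (hypothetical helper): reads `ipmitool ... sdr elist`
# output on stdin and flags every GPU sensor whose status column is not "ok".
# Assumption: fields are separated by "|" and the status is the third field.
check_gpu_sensors() {
    awk -F'|' '
        /GPU/ {
            seen++
            name = $1; status = $3
            gsub(/^ +| +$/, "", name)      # trim surrounding spaces
            gsub(/^ +| +$/, "", status)
            if (status != "ok") {
                printf "NOT OK: %s (status: %s)\n", name, status
                bad++
            }
        }
        END {
            if (seen == 0) { print "No GPU sensors found"; exit 2 }
            if (bad > 0) exit 1
            print "All GPU sensors ok"
        }
    '
}

# Usage on a management host (not run here):
#   ipmitool -I lanplus -H 192.168.70.125 -U USERID -P PASSW0RD sdr elist | check_gpu_sensors
```

The function returns 0 when every GPU sensor reports ok, 1 when at least one does not, and 2 when no GPU sensor lines are present, so it can be used in a monitoring script.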
The summary output of the nvidia-smi utility shows 4 GPUs online.
Figure 1. nvidia-smi
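The GPU count can also be checked without reading the summary table by eye. The sketch below assumes one line per GPU from `nvidia-smi --query-gpu=... --format=csv,noheader`; `count_gpus` and `expect_gpus` are hypothetical helper names used only for illustration.

```shell
# count_gpus (hypothetical helper): counts non-empty lines on stdin,
# one line per GPU as printed by the nvidia-smi query shown under "Usage".
count_gpus() {
    grep -c '[^[:space:]]'
}

# expect_gpus N (hypothetical helper): succeeds only when exactly N GPUs
# are listed on stdin.
expect_gpus() {
    expected=$1
    # grep -c exits non-zero when the count is 0, so ignore its status.
    found=$(count_gpus || true)
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found GPUs online"
    else
        echo "MISMATCH: expected $expected GPUs, found $found"
        return 1
    fi
}

# Usage on the node (not run here):
#   nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv,noheader | expect_gpus 4
```

A mismatch here is a reason to look in the XCC web event log for a GPU CPUs sensor event.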

System fails to detect the GPU board

When the event Sensor GPU Board has transitioned to critical from a less severe state appears in the XCC web event log, it indicates that the system has failed to detect the GPU board. Complete the following steps to resolve the problem.

  1. Power cycle the system.
  2. Check power input related events in XCC and SMM2 (see SMM2 - Power).
  3. Check the system temperature and water flow. Look for leakage, and disconnect then reconnect the water cooling system.
  4. Reboot the system, and run the IPMI health check (see Health check for GPUs and GPU board).
  5. One of the following indicates the problem has been solved:
    • FQXSPUN0017I (Sensor GPU Board has transitioned to normal state) in XCC messages
    • Sensor GPU Board has transitioned to normal state in web log
    However, if the problem persists, complete the following steps:
    1. Collect XCC service data (see Collecting service data).
    2. Contact Lenovo Service.

System fails to detect a specific GPU

When the event Sensor GPU CPUs has transitioned to critical from a less severe state appears in the XCC web event log, it indicates the system fails to detect one or more specific GPUs. Go through the following steps to solve the problem.

  1. Check the XCC events to see whether the retimer is over temperature. If it is, skip the next step.
  2. Download the latest firmware from the Data Center Support site (Lenovo Data Center Support for ThinkSystem SD665-N V3), and update the firmware.
  3. Reboot the system, and run the IPMI health check (see Health check for GPUs and GPU board).
  4. If the event Sensor GPU Board has transitioned to normal state appears in the XCC web event log, it indicates the problem has been solved.

    However, if the problem persists, complete the following steps.
    1. Check the XCC web event log to identify the defective unit and problem type (see XCC GPU sensor specifications).
    2. Collect XCC service data (see Collecting service data).
    3. Run nvidia-smi for diagnosis (see NVIDIA System Management Interface for details).
      Note
      Make sure to update the GPU driver, which includes the nvidia-smi utility required for GPU problem determination. The latest driver is available from the Drivers and Software download website for ThinkSystem SD665-N V3.
    4. Run nvidia-bug-report.sh (a tool included with the NVIDIA driver).
    5. Contact Lenovo Service.
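The nvidia-smi and nvidia-bug-report.sh steps above can be gathered into one collection run before contacting service. This is a rough sketch under stated assumptions: `collect_gpu_diagnostics` and the output file names are hypothetical, nvidia-bug-report.sh is assumed to write its log archive into the current directory (its default behavior, which usually requires root), and XCC service data still has to be collected separately.

```shell
# collect_gpu_diagnostics [DIR] (hypothetical helper): saves nvidia-smi output
# and the NVIDIA bug report into DIR. File names are assumptions for this sketch.
collect_gpu_diagnostics() {
    outdir=${1:-gpu-diagnostics}
    mkdir -p "$outdir" || return 1

    if command -v nvidia-smi >/dev/null 2>&1; then
        # Summary view first, then the full per-GPU query (-q).
        nvidia-smi > "$outdir/nvidia-smi.txt" 2>&1 || echo "nvidia-smi summary failed" >&2
        nvidia-smi -q > "$outdir/nvidia-smi-q.txt" 2>&1 || echo "nvidia-smi -q failed" >&2
    else
        echo "nvidia-smi not found; update the GPU driver first" > "$outdir/nvidia-smi.txt"
    fi

    if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
        # Run from inside the output directory so the log archive lands there.
        (cd "$outdir" && nvidia-bug-report.sh > nvidia-bug-report.stdout 2>&1) ||
            echo "nvidia-bug-report.sh failed" >&2
    else
        echo "nvidia-bug-report.sh not found" > "$outdir/nvidia-bug-report.txt"
    fi

    echo "Diagnostics written to $outdir"
}

# Usage on the node (not run here):
#   collect_gpu_diagnostics /tmp/gpu-diagnostics
```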

XCC GPU sensor specifications

When you see an event in the XCC web event log, refer to the following table to identify the defective unit and problem type. For example:

6 | 01/08/2021 | 14:34:53 | 0x0020 | Add-in Card GPU Board | Transition to Critical from less severe | Asserted | 0xA2F60F
Table 1. XCC GPU sensor specifications
Sensor Name: GPU CPUs
  Sensor Number: EAh

  02h - Transition to Critical from less severe

  Evt Data2:
    • B0h: Thermal alert
    • BBh: Presence and Power status
    • B1h: GPU interrupt info
    • 21h: PCIe link status
    • E0h: GPU count from SMBIOS

  Evt Data3:
    • XXh: GPU CORE index, 01h: core 1
    • 07h: core 3 + core 4

  Sensor Type: 17h
  Sensor Reading Type: 07h
  Entity ID: 0Bh
  Instance/Type: 02h
  SEL Logged Assertions: 02h
  SEL Logged De-assertions: 02h
  Thresholds De-assertions:

  LED ‘ON’ Request when Assertion (F = Fault LED):
    02h - F

  LED ‘OFF’ Request when De-assertion (F = Fault LED):
    02h - F

Sensor Name: GPU Board
  Sensor Number: E9h

  00h - Transition to OK
  02h - Transition to Critical from less severe

  Evt Data2:
    • F1h: GPU Thermaltrip (no evt3)
    • F2h: PIB Thermaltrip (no evt3)
    • F6h: GPU core thermal alert

  Evt Data3:
    • XXh: GPU CORE index, 01h: core 1
    • 07h: core 3 + core 4
    • If Evt2: F4h, 01h: Overtemp flag asserted

  Sensor Type: 17h
  Sensor Reading Type: 07h
  Entity ID: 0Bh
  Instance/Type: 01h
  SEL Logged Assertions: 02h
  SEL Logged De-assertions: 02h
  Thresholds De-assertions: N/A

  LED ‘ON’ Request when Assertion (F = Fault LED):
    00h - None
    02h - F

  LED ‘OFF’ Request when De-assertion (F = Fault LED):
    00h - None
    02h - F
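
As a worked illustration of reading the table, the event data 0xA2F60F from the example event can be split into three bytes. The sketch below assumes the bytes appear in the order Evt Data1, Evt Data2, Evt Data3, and that the low four bits of Evt Data1 carry the event offset, as in standard IPMI event data. `decode_event_data` is a hypothetical helper, and the lookup covers only the codes listed in Table 1. Evt Data3 is printed without interpretation because the table gives only two examples of its encoding.

```shell
# decode_event_data 0xAABBCC (hypothetical helper): splits the event data into
# Evt Data1/2/3 and looks up the codes listed in Table 1.
# Assumptions: byte order is Data1, Data2, Data3; the low nibble of Data1 is
# the event offset.
decode_event_data() {
    raw=${1#0x}
    raw=${raw#0X}
    case $raw in
        [0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]) ;;
        *) echo "expected six hex digits, got: $1" >&2; return 1 ;;
    esac
    raw=$(printf '%s' "$raw" | tr 'abcdef' 'ABCDEF')
    d1=$(printf '%s' "$raw" | cut -c1-2)
    d2=$(printf '%s' "$raw" | cut -c3-4)
    d3=$(printf '%s' "$raw" | cut -c5-6)

    offset=$(( 0x$d1 & 0x0F ))
    case $offset in
        0) state="Transition to OK" ;;
        2) state="Transition to Critical from less severe" ;;
        *) state="not listed in Table 1" ;;
    esac

    case $d2 in
        B0) cause="Thermal alert" ;;
        BB) cause="Presence and Power status" ;;
        B1) cause="GPU interrupt info" ;;
        21) cause="PCIe link status" ;;
        E0) cause="GPU count from SMBIOS" ;;
        F1) cause="GPU Thermaltrip" ;;
        F2) cause="PIB Thermaltrip" ;;
        F6) cause="GPU core thermal alert" ;;
        *)  cause="not listed in Table 1" ;;
    esac

    printf 'Evt Data1: %sh (offset %02Xh: %s)\n' "$d1" "$offset" "$state"
    printf 'Evt Data2: %sh (%s)\n' "$d2" "$cause"
    printf 'Evt Data3: %sh\n' "$d3"
}

# Usage:
#   decode_event_data 0xA2F60F
```

For the example event, this reports offset 02h (Transition to Critical from less severe) and Evt Data2 F6h (GPU core thermal alert), which matches the GPU Board sensor named in the event.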