GPU problems
Use this information to resolve problems that are related to GPUs and the GPU board.
Health check for GPUs and GPU board
The following ipmitool sensor status indicates that the GPUs and the GPU board are in a normal state.
$ ipmitool -I lanplus -H 192.168.70.125 -U USERID -P PASSW0RD sdr elist | grep GPU
GPU Board Power | 8Ch | ok | 21.4 | 250 Watts
GPU Board | E9h | ok | 11.8 | Transition to OK
GPU CPUs | EAh | ok | 11.9 | Transition to OK
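The check above can also be scripted. The following is a minimal sketch, not part of the Lenovo tooling: check_gpu_sensors is a hypothetical helper name, and it assumes the pipe-separated sdr elist layout shown above, with the sensor status in the third column. It reads the sensor listing on standard input, so it can be tried on saved output without access to the XCC.

```shell
# Hypothetical helper: read `ipmitool sdr elist` output on stdin and succeed
# only if at least one GPU sensor is listed and every GPU sensor reports "ok".
check_gpu_sensors() {
    awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    /GPU/ {
        seen++
        # Column 3 is the sensor status; anything other than "ok" is reported.
        if (trim($3) != "ok") {
            bad++
            printf "NOT OK: %s (%s)\n", trim($1), trim($3)
        }
    }
    END { if (seen == 0 || bad > 0) exit 1 }'
}

# Usage (same connection parameters as the example above):
#   ipmitool -I lanplus -H 192.168.70.125 -U USERID -P PASSW0RD sdr elist | check_gpu_sensors
```

The function returns a non-zero exit status when no GPU sensor is listed or when any GPU sensor reports a status other than ok, which makes it usable in a monitoring script.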
To check GPU health, you can use Intel® XPU Manager, a GPU monitoring and management tool that simplifies GPU administration. The download and further information are available at Intel® XPU Manager.
System fails to detect the GPU board
When the event "Sensor GPU Board has transitioned to critical from a less severe state" appears in the XCC web event log, the system has failed to detect the GPU board. Complete the following steps to resolve the problem.
- Power cycle the system.
- Check power input related events in XCC and SMM2 (see SMM2 - Power).
- Check the system temperature and water flow. Look for leaks, then disconnect and reconnect the water cooling system.
- Reboot the system, and run the IPMI health check (see Health check for GPUs and GPU board).
- Either of the following indicates that the problem has been solved:
  - FQXSPUN0017I (Sensor GPU Board has transitioned to normal state) in the XCC messages
  - "Sensor GPU Board has transitioned to normal state" in the web event log
However, if the problem persists, complete the following steps:
- Collect XCC service data (see Collecting service data).
- Contact Lenovo Service.
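Because the event log can contain both the critical event and a later recovery event, it is the most recent transition that shows whether the problem has been solved. The following is a minimal sketch under stated assumptions: last_sensor_state is a hypothetical helper name, the event log has been saved as plain text with one event per line and the oldest event first, and the messages use the wording quoted in this section.

```shell
# Hypothetical helper: print the most recent state logged for a sensor.
# Assumptions: plain-text event log on stdin, one event per line, oldest first.
# Prints "normal", "critical", or "unknown" if no matching event is found.
last_sensor_state() {
    awk -v s="Sensor $1 has transitioned to " '
    index($0, s) {
        # Text that follows the fixed part of the message.
        rest = substr($0, index($0, s) + length(s))
        if (index(rest, "normal") == 1) state = "normal"
        else if (index(rest, "critical") == 1) state = "critical"
    }
    END { print (state == "" ? "unknown" : state) }'
}

# Usage (xcc_events.txt is a hypothetical file name for the saved log):
#   last_sensor_state "GPU Board" < xcc_events.txt
```

The same helper can be called with "GPU CPUs" as the argument to follow the sensor that reports individual GPUs.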
System fails to detect a specific GPU
When the event "Sensor GPU CPUs has transitioned to critical from a less severe state" appears in the XCC web event log, the system has failed to detect one or more specific GPUs. Complete the following steps to resolve the problem.
- Check the XCC events to see whether the retimer is over temperature. If it is, skip the next step.
- Download the latest firmware from the Data Center Support site (Lenovo Data Center Support for ThinkSystem SD650-I V3), and update the firmware.
- Reboot the system, and run the IPMI health check (see Health check for GPUs and GPU board).
If the event "Sensor GPU Board has transitioned to normal state" appears in the XCC web event log, the problem has been solved.
However, if the problem persists, complete the following steps:
- Check the XCC web event log to identify the defective unit and the problem type (see XCC GPU sensor specifications).
- Collect XCC service data (see Collecting service data).
- Run xpu-smi for diagnosis (see Intel® XPU Manager for details).
- Contact Lenovo Service.
XCC GPU sensor specifications
When an event appears in the XCC web event log, refer to the following table to identify the defective unit and the problem type. For example:
6 | 01/08/2021 | 14:34:53 | 0x0020 | Add-in Card GPU Board | Transition to Critical from less severe | Asserted | 0xA2F60F
Sensor Name | Data | Value | Event offset
GPU CPUs | Sensor Number | EAh | 02h - Transition to Critical from less severe; Evt Data2: not listed; Evt Data3: not listed
GPU CPUs | Sensor Type | 17h |
GPU CPUs | Sensor Reading Type | 07h |
GPU CPUs | Entity ID | 0Bh |
GPU CPUs | Instance/Type | 02h |
GPU CPUs | SEL Logged Assertions | 02h |
GPU CPUs | SEL Logged De-assertions | 02h |
GPU CPUs | Thresholds De-assertions | N/A |
GPU CPUs | LED ‘ON’ Request when Assertion (F = Fault LED) | 02h - F |
GPU CPUs | LED ‘OFF’ Request when De-assertion (F = Fault LED) | 02h - F |
GPU Board | Sensor Number | EAh | 00h - Transition to OK; 02h - Transition to Critical from less severe; Evt Data2: not listed; Evt Data3: not listed
GPU Board | Sensor Type | 17h |
GPU Board | Sensor Reading Type | 07h |
GPU Board | Entity ID | 0Bh |
GPU Board | Instance/Type | 01h |
GPU Board | SEL Logged Assertions | 02h |
GPU Board | SEL Logged De-assertions | 02h |
GPU Board | Thresholds De-assertions | N/A |
GPU Board | LED ‘ON’ Request when Assertion (F = Fault LED) | 00h - None; 02h - F |
GPU Board | LED ‘OFF’ Request when De-assertion (F = Fault LED) | 00h - None; 02h - F |
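To match an event line against the table, it can help to split the line into its fields first. The following is a minimal sketch: decode_sel_line is a hypothetical helper name, and it assumes the field order of the example event line in this section, with the last field holding Event Data 1 to 3 as one hexadecimal string. Per the IPMI convention for discrete events, the low nibble of Event Data 1 is the event offset that the table lists as 00h or 02h.

```shell
# Hypothetical helper: split one pipe-separated event line into named fields.
# Assumption: field 8 holds Event Data 1..3 as a single hex string (0xA2F60F).
decode_sel_line() {
    printf '%s\n' "$1" | awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    {
        data = trim($8)
        sub(/^0[xX]/, "", data)           # drop the 0x prefix
        ed1 = substr(data, 1, 2)          # Event Data 1
        ed2 = substr(data, 3, 2)          # Event Data 2
        ed3 = substr(data, 5, 2)          # Event Data 3
        printf "sensor=%s\n", trim($5)
        printf "event=%s\n", trim($6)
        printf "direction=%s\n", trim($7)
        # The low nibble of Event Data 1 is the event offset.
        printf "offset=0%sh\n", substr(ed1, 2, 1)
        printf "evt_data2=%s\n", ed2
        printf "evt_data3=%s\n", ed3
    }'
}

# Usage:
#   decode_sel_line '6 | 01/08/2021 | 14:34:53 | 0x0020 | Add-in Card GPU Board | Transition to Critical from less severe | Asserted | 0xA2F60F'
```

For the example event line, the offset field comes out as 02h, which corresponds to the "Transition to Critical from less severe" entry in the table.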