GPU performance problems

If GPU temperatures rise too high, the GPUs self-throttle, which reduces performance. Under normal operation this should not occur, because the XClarity Controller (XCC) actively monitors GPU temperatures and adjusts the system fans accordingly.

However, additional scenarios will cause the GPUs to enter an Emergency Power Reduction (Power Brake) state, which will impact performance:
  • A loss of power.

  • A Power Supply Throttle assertion (typically encountered if a power supply is too hot).

  • Inlet temperature exceeds the supported ASHRAE specification (for example, 35°C for ASHRAE A2).

  • Inlet temperature exceeds 27°C in combination with a fan failure.
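
The four scenarios listed above can be sketched as a simple check. This is only an illustration of the conditions as described here, not the controller's actual logic: the function name, parameter names, and reason strings are hypothetical, and the 35°C default is the ASHRAE A2 example from the list.

```python
from typing import List

# Example ASHRAE A2 inlet limit cited in the list above (assumed default).
ASHRAE_A2_INLET_LIMIT_C = 35.0
# Inlet limit that applies when a fan has failed, per the list above.
FAN_FAILURE_INLET_LIMIT_C = 27.0


def power_brake_reasons(
    power_lost: bool,
    psu_throttle_asserted: bool,
    inlet_temp_c: float,
    fan_failed: bool,
    ashrae_limit_c: float = ASHRAE_A2_INLET_LIMIT_C,
) -> List[str]:
    """Return the listed scenarios that apply (hypothetical helper)."""
    reasons = []
    if power_lost:
        reasons.append("loss of power")
    if psu_throttle_asserted:
        reasons.append("power supply throttle assertion")
    if inlet_temp_c > ashrae_limit_c:
        reasons.append("inlet temperature above ASHRAE limit")
    if fan_failed and inlet_temp_c > FAN_FAILURE_INLET_LIMIT_C:
        reasons.append("inlet temperature above 27°C with fan failure")
    return reasons


# Usage: a 30°C inlet is within the A2 limit, but not once a fan has failed.
print(power_brake_reasons(False, False, 30.0, fan_failed=True))
```

An empty list means none of the listed scenarios applies, in which case reduced performance is more likely to come from ordinary thermal self-throttling.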

To determine whether any of these scenarios has occurred, check the System Error LED and the XClarity Controller event log for errors related to redundancy, a degraded state, or a PCIe Power Brake.
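
If the event log has been exported as plain text, the search described above can be sketched as a keyword filter. This is a minimal sketch under stated assumptions: the keyword list is taken from the sentence above, the function name is hypothetical, and the sample messages are invented placeholders rather than real XClarity Controller event text.

```python
from typing import Iterable, List

# Keywords taken from the guidance above; matching is case-insensitive.
KEYWORDS = ("redundancy", "degraded", "power brake")


def find_power_brake_events(messages: Iterable[str]) -> List[str]:
    """Return the messages that mention any keyword (hypothetical helper)."""
    return [m for m in messages if any(k in m.lower() for k in KEYWORDS)]


# Invented sample messages, for illustration only.
sample_log = [
    "Power supply redundancy lost",
    "User login successful",
    "PCIe Power Brake asserted",
]

for message in find_power_brake_events(sample_log):
    print(message)
```

A plain substring match like this can miss events that are worded differently, so treat it as a first pass and review the full log if nothing matches.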

Complete the following steps to resolve the issue:
  1. Make sure that two 2000W power supplies are installed, powered, and operational (no errors).

  2. Check the XClarity Controller event log for events related to fan failures. If fan failure events are logged, replace the failing fan.

  3. Check the ambient temperature of the datacenter where the server is installed.

  4. Check the PCIe power brake mode.