Issues related to health monitoring of server hardware

These topics address issues that are related to health monitoring of a server or its hardware components.

It is essential to keep current with the latest system firmware for BIOS/UEFI, BMC/IMM and other components. For the latest system firmware, visit the Lenovo Support Portal Web site.

Duplicate active alerts are generated for certain memory and processor events
Operations Manager generates two duplicate active alerts when receiving certain memory and processor events because the same event is handled by multiple monitors.
Workaround: No workaround is available at this time.
Not all hardware events are reportable events for all systems
Health monitoring is dependent on hardware capability, the firmware support level, and the management software support level. As an example, some systems might have more than one physical power supply, but not all of the power supplies are instrumented or manageable.
Hardware health events are specific to hardware platforms. Not all hardware events are supported as reportable events for all hardware platforms.
This is normal behavior for Lenovo XClarity Integrator Hardware Management Pack.
Workaround: To achieve full coverage of hardware health for your IT infrastructure, upgrade to a newer system equipped with a Baseboard Management Controller (BMC) service processor, a Remote Supervisor Adapter (RSA) II, or an Integrated Management Module (IMM). Also install the latest supported firmware for the management controller.

Running out of temporary disk space on a managed system can prevent health monitoring and event alerting from working

Lenovo XClarity Integrator Hardware Management Pack monitors system health through client-side scripts and requires temporary working disk space on a managed system. The temporary working disk space is managed by Operations Manager Health Service. If that disk space is depleted, the scripts in Hardware Management Pack cannot run, and therefore are not able to correctly detect and report the health state to Operations Manager.

The temporary working disk space is, by default, allocated from the %TEMP% folder on the managed system for the Local System Account.

Note

The Local System Account is the user account under which the Operations Manager Health Service runs. There is no known recommendation for the minimum amount of disk space that you should reserve for managed systems.

When this situation occurs, the Windows event logs on the managed system for Operations Manager contain entries similar to the following examples.

Example 1

Event Type: Warning
Event Source: Health Service Modules
Event Category: None
Event ID: 10363
Date: 4/20/08
Time: 17:24:04
User: N/A
Computer: A-X3650-RAID
Description: Module was unable to enumerate the WMI data

Error: 0x80041032
Details: Call cancelled

One or more workflows were affected by this.

Workflow name: many
Instance name: many
Instance ID: many
Management group: scomgrp1

For more information, see Microsoft Support – Events and Errors Message Center Web site.

Example 2

Event Type: Error
Event Source: Health Service Modules
Event Category: None
Date:   04/20/08
Event ID: 9100
Time:   17:25:33
User:   N/A
Computer: A-X3650-RAID
Description: An error occurred on line 105 when executing script  
'MOM Backward Compatibility Service State Monitoring Script' 
Source: Microsoft VBScript runtime error 
Description: ActiveX component can't create object: 'GetObject'

One or more workflows were affected by this.

Workflow name: System.Mom.BackwardCompatibility.ServiceStateMonitoring 
Instance name: a-x3650-raid.Lab54.local 
Instance ID: {EE77E6E4-5DC5-F316-A0CA-502E4CBFCB97}
Management group: scomgrp1

For more information, see Microsoft Support – Events and Errors Message Center Web site.
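The two examples above share an event source and differ in event ID, so exported event log text can be screened for them. The following is a minimal Python sketch, not part of Hardware Management Pack. It assumes the log has been exported as plain text in the same field layout as the examples, and the function name find_suspect_events is hypothetical. Other event IDs might also indicate the same condition.

```python
import re

# Event IDs and source taken from the two examples above.
# Assumption: other IDs may also appear when temporary disk space runs out.
SUSPECT_EVENT_IDS = {10363, 9100}
SOURCE = "Health Service Modules"


def find_suspect_events(log_text):
    """Return (event_id, computer) pairs for matching entries in exported log text."""
    hits = []
    # Assumption: every entry begins with an "Event Type:" line, as in the examples.
    for entry in log_text.split("Event Type:"):
        source = re.search(r"(?m)^Event Source:[ \t]*(.+?)[ \t]*$", entry)
        event_id = re.search(r"(?m)^Event ID:[ \t]*(\d+)", entry)
        computer = re.search(r"(?m)^Computer:[ \t]*(.+?)[ \t]*$", entry)
        if not (source and event_id):
            continue
        if source.group(1) == SOURCE and int(event_id.group(1)) in SUSPECT_EVENT_IDS:
            hits.append((int(event_id.group(1)), computer.group(1) if computer else None))
    return hits
```

A match only indicates that the entry resembles the examples. Confirm the cause by checking the free disk space on the managed system.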

Workaround: Monitor the free disk space in the %TEMP% folder on the managed system for the Local System Account, and increase the free disk space as needed.
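The free-space check in this workaround can be scripted. The following is a minimal Python sketch under stated assumptions: the threshold MIN_FREE_BYTES is a hypothetical value (no recommended minimum is known), and the function name temp_space_report is illustrative. The script reports on the temporary folder of the account that runs it, so it must run under the Local System Account, or be given the path explicitly, to check the folder that the Health Service uses.

```python
import shutil
import tempfile

# Hypothetical threshold: no recommended minimum is documented.
MIN_FREE_BYTES = 500 * 1024 * 1024


def temp_space_report(path=None, min_free_bytes=MIN_FREE_BYTES):
    """Report free space on the volume that holds the temporary folder."""
    # tempfile.gettempdir() follows TMPDIR, TEMP, and TMP for the current account.
    path = path or tempfile.gettempdir()
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "free_bytes": usage.free,
        "low": usage.free < min_free_bytes,
    }


if __name__ == "__main__":
    report = temp_space_report()
    status = "LOW" if report["low"] else "ok"
    print(f"{report['path']}: {report['free_bytes']} bytes free ({status})")
```

Running the script on a schedule and alerting when the status is LOW is one way to notice the condition before the monitoring scripts stop working.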

Some hardware alerts require a manual reset of the health state
Lenovo XClarity Integrator Hardware Management Pack can automatically reset the health state of hardware components for most of the hardware alerts. Resets occur when there is enough specific information in the alerts to determine if it is appropriate to reset the health state of the component.
However, there are cases where information about the physical condition is too generic for Hardware Management Pack to determine if the physical condition has been resolved, or if the problem is a security concern that warrants manual acknowledgement by an IT administrator.
The following examples are categories of physical hardware problems that require manual resets of health states:
- Problems that indicate a potential security breach to physical systems
- Hardware problems related to RAID or disk drives
- Hardware problems that do not contain enough specific information, such as a generic processor error
- Hardware problems that are hardware-platform specific, such as the condition of a too-hot processor not detected through a temperature sensor outside of the processor chip
Workaround: Refer to the knowledge articles about Hardware Management Pack for each monitor and alert to learn if an alert or the state of a monitor requires a manual health reset.
Alerts and events of an offline managed system are not visible in the Operations Manager Console until the managed system comes back online and reconnects to Operations Manager
Each of the alerts, events, and state changes of an agent-based managed system depends on the local Microsoft Health Service of the managed system communicating with the Operations Manager server. If the network connection between the Operations Manager server and the managed system is disrupted, or if the managed system goes offline for some reason, no alerts or events are communicated to the Operations Manager server.
When the network connection resumes, the alerts and events previously recorded locally on the managed system flow to the Operations Manager server.
When communication between the managed systems and the Operations Manager server is fully established, Operations Manager views might contain outdated alerts and events from previously disconnected systems.
Workaround: None needed.
Disconnected NICs on managed systems are reported with an offline error, even if they are disabled in Windows
For NICs that have been disabled in Windows (either through the Control Panel or other means), Lenovo XClarity Integrator Hardware Management Pack still reports the error and the alert for the physically disconnected NIC, despite it being explicitly disabled.
Hardware Management Pack monitors the physical condition of the NICs without taking into consideration their relationship with the Windows system.
Workaround: No workaround is available at this time; however, you can disable the NIC offline alert monitor to ignore these errors. For information about how to disable a monitor, see the Operations Manager online help.
Different versions of IBM Director Agent might report a different severity for the same hardware events
Some hardware events might be reported as critical errors by Director Core Services 5.20.31, while the same events might be reported as warnings by Director Platform Agent 6.2.1 and later versions.
Workaround: No workaround is available at this time.
All of the events generated with the WinEvent tool are reported under one monitor
The only purpose of the WinEvent tool (WinEvent.exe), which is part of the Director Agent 5.20.x, is to validate the connection of a managed system with Operations Manager through Lenovo XClarity Integrator Hardware Management Pack. WinEvent does not fully populate all of the relevant information needed to simulate real-world hardware events. Therefore, all of the events generated with WinEvent are reported under one monitor in Hardware Management Pack.
Workaround: No workaround is available at this time.
Outstanding errors that are generated through WinEvent from IBM Director Agent 5.10.x are reported continuously by regular health checkup monitors (even after they are manually cleared in Operations Manager)
In IBM Director Agent 5.10.x, an error generated through the WinEvent tool (WinEvent.exe) also affects the internal health state maintained inside of the Director Agent for the corresponding hardware component. The saved state affects the resulting health state reported by the regular health checkup monitor for that component. Consequently, even after that error is manually cleared in Operations Manager, the regular health checkup monitor still reports the error until the error is cleared at the Director Agent level.
In IBM Director Agent 5.20.x and later versions, events generated through WinEvent do not affect the health state maintained inside of the Director Agent for the hardware component.
Workaround: Use WinEvent.exe to generate the pairing event (that is, the same event ID) with severity level 0 to clear the error state maintained in the Director Agent for the hardware component. Alternatively, clear all of the outstanding errors generated through WinEvent.exe by deleting the IBM\director\cimom\data\health.dat file and all of the IBM\director\cimom\data\health.dat\*.evt files on the managed system, and then restarting the system.
No events are generated in Operations Manager for logging on or off of the Remote Supervisor Adapter II
No events are generated in Operations Manager when logging on or off of the Remote Supervisor Adapter II.
Workaround: Install the latest firmware for the Remote Supervisor Adapter II.
No alerts are generated in Operations Manager when the RSA-II event log exceeds the capacity threshold or it is full
No alerts are generated in the Operations Manager when the RSA-II event log either exceeds the capacity threshold or is full.
Workaround: Install the latest firmware for the Remote Supervisor Adapter II.
Uninstalling the OSA IPMI driver does not result in the expected "software missing" error
Uninstalling the OSA IPMI driver from a managed system results in a "software failed" warning, not the "software missing" error, until the system reboots. This occurs because the OSA IPMI driver is not Windows Plug-and-Play compliant. Until a reboot, the driver is still present in the Windows system kernel, even though it has been removed.
Workaround: For systems that are listed on the IBM support site, use the Microsoft IPMI driver to replace the OSA IPMI driver. The Microsoft IPMI driver can be installed on Windows Server 2003 R2 as an optional hardware management feature, while the driver is installed automatically on Windows Server 2008 and later versions.
External hardware knowledge articles about Hardware Management Pack are not available on an Operations Manager management server that does not have Hardware Management Pack installed
If you are using the Operations Manager Console on a server that does not have Lenovo XClarity Integrator Hardware Management Pack installed, the external knowledge pages about hardware alerts are not available.
Hardware Management Pack must be installed locally for these IBM knowledge pages to be accessible from the Operations Manager Console.
Workaround: To access the hardware knowledge articles, use the Operations Manager 2007 Console on a management server that has Hardware Management Pack installed.
The System x Power Data Chart is not available for multinode servers
The System x Power Data Chart function for monitoring power information on multinode servers is not supported in this release for these systems: System x3850 X5, System x iDataPlex® dx360 M4.
Workaround: Use traditional methods to monitor power data.
