Run-time cpu deconfiguration (cpu gard), Service processor system monitoring - surveillance, System firmware surveillance – IBM 6C4 User Manual

During boot time, the service processor does not configure processors or memory
DIMMs that are marked “bad.”

If a processor or memory DIMM is deconfigured, the processor or memory DIMM
remains offline for subsequent reboots until it is replaced or repeat gard is disabled.
The repeat gard function also provides the user with the option of manually
deconfiguring a processor or memory DIMM, or re-enabling a previously deconfigured
processor or memory DIMM.

For information about configuring or deconfiguring a processor, see the Processor
Configuration/Deconfiguration Menu on page 46. For information on configuring or
deconfiguring a memory DIMM, see the Memory Configuration/Deconfiguration Menu on
page 47. Both of these menus are submenus under the System Information Menu. You
can enable or disable CPU Repeat Gard or Memory Repeat Gard using the Processor
Configuration/Deconfiguration Menu.
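
The repeat gard behavior described above can be illustrated with a minimal sketch. It is not service processor code: the `Component` class, its fields, and the `components_to_configure` function are hypothetical names chosen for illustration. The sketch assumes only what the text states, namely that a part marked "bad" stays offline at boot until it is replaced or repeat gard is disabled.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A processor or memory DIMM as seen by a hypothetical boot routine."""
    name: str
    marked_bad: bool = False  # set when the part was deconfigured earlier
    replaced: bool = False    # set when the part has been physically replaced


def components_to_configure(components, repeat_gard_enabled=True):
    """Return the names of the parts that would be brought online at boot."""
    online = []
    for part in components:
        if part.replaced:
            # A replaced part is treated as good again.
            online.append(part.name)
        elif part.marked_bad and repeat_gard_enabled:
            # Repeat gard keeps a previously deconfigured part offline.
            continue
        else:
            online.append(part.name)
    return online


parts = [
    Component("CPU0"),
    Component("CPU1", marked_bad=True),
    Component("DIMM3", marked_bad=True, replaced=True),
]
print(components_to_configure(parts))                             # ['CPU0', 'DIMM3']
print(components_to_configure(parts, repeat_gard_enabled=False))  # ['CPU0', 'CPU1', 'DIMM3']
```

On the real system these choices are made through the service processor menus, not through code.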

Run-Time CPU Deconfiguration (CPU Gard)

L1 instruction cache recoverable errors, L1 data cache correctable errors, and L2 cache
correctable errors are monitored by the processor runtime diagnostics (PRD) code
running in the service processor. When a predefined error threshold is met, an error log
with warning severity and threshold-exceeded status is returned to AIX. At the same
time, PRD marks the CPU for deconfiguration at the next boot. AIX then attempts to
migrate all resources associated with that processor to another processor and stops
the defective processor.
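
The threshold mechanism described above can be sketched as a per-CPU error counter. This is an illustration only, not the PRD implementation: the class name, the method names, and the threshold value of 3 are assumptions, because the manual does not state the actual threshold.

```python
from collections import Counter

ERROR_THRESHOLD = 3  # hypothetical value; the manual does not give the real one


class RecoverableErrorMonitor:
    """Counts recoverable cache errors per CPU (illustrative model only)."""

    def __init__(self, threshold=ERROR_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()
        self.error_log = []                    # entries returned to the operating system
        self.deconfigure_at_next_boot = set()  # CPUs marked for deconfiguration

    def record_error(self, cpu):
        """Record one recoverable error; return True when the threshold is first met."""
        self.counts[cpu] += 1
        if self.counts[cpu] >= self.threshold and cpu not in self.deconfigure_at_next_boot:
            # Log with warning severity and mark the CPU in the same step.
            self.error_log.append(
                {"cpu": cpu, "severity": "warning", "status": "threshold exceeded"}
            )
            self.deconfigure_at_next_boot.add(cpu)
            return True
        return False


monitor = RecoverableErrorMonitor()
for _ in range(3):
    crossed = monitor.record_error("CPU1")
print(crossed, sorted(monitor.deconfigure_at_next_boot))  # True ['CPU1']
```

The sketch stops at marking the CPU. Migrating resources and stopping the processor are done by AIX and are not modeled here.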

Service Processor System Monitoring - Surveillance

Surveillance is a function in which the service processor monitors the system, and the
system monitors the service processor. This monitoring is accomplished by periodic
samplings called heartbeats.

Surveillance is available during the following phases:

• System firmware bringup (automatic)

• Operating system runtime (optional)

System Firmware Surveillance

System firmware surveillance is automatically enabled during system power-on. It
cannot be disabled by the user, and the surveillance interval and surveillance delay
cannot be changed by the user.

If the service processor detects no heartbeats for a set period of time during system
IPL, it cycles the system power to attempt a reboot. The maximum number of retries
is set from the service processor menus. If the fail condition persists, the service
processor leaves the machine powered on, logs an error, and displays menus to the
user. If Call-out is enabled, the service processor calls to report the failure and displays
the operating-system surveillance failure code on the operator panel.
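
The retry behavior described above can be sketched as a simple loop. This is a hedged illustration, not firmware logic: the function name, the `boot_attempt` callable, and the event strings are hypothetical. The sketch also assumes one initial IPL attempt followed by up to `max_retries` power cycles, which the manual does not spell out.

```python
def run_firmware_surveillance(boot_attempt, max_retries, call_out_enabled=False):
    """Model the service processor's response to missing heartbeats during IPL.

    boot_attempt: callable returning True if heartbeats were seen within the
    surveillance interval on that attempt (hypothetical stand-in for the IPL).
    Returns the list of actions taken, in order.
    """
    events = []
    # One initial attempt plus up to max_retries power cycles (assumption).
    for attempt in range(max_retries + 1):
        if boot_attempt():
            events.append("heartbeat detected")
            return events
        if attempt < max_retries:
            events.append("power cycle")
    # The fail condition persisted: the machine stays powered on.
    events.append("log error")
    events.append("display menus")
    if call_out_enabled:
        events.append("call out")
    return events


heartbeats = iter([False, False, True])
print(run_firmware_surveillance(lambda: next(heartbeats), max_retries=2))
# ['power cycle', 'power cycle', 'heartbeat detected']
```

When every attempt fails, the returned list ends with the error log, the menus, and, if call-out is enabled, the call-out step.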
