4 self-healing – IBM P5 570 User Manual

Chapter 3. Capacity on Demand, RAS, and manageability

operating system has lost control. Mutual surveillance also enables the operating system to
monitor for service processor activity and to request a service processor repair action if
necessary.

Environmental monitoring

Environmental monitoring related to power, fans, and temperature is performed by the
System Power Control Network (SPCN). Environmental critical and non-critical conditions
generate Early Power-Off Warning (EPOW) events. Critical events (for example, a Class 5 AC
power loss) trigger appropriate signals from hardware to affected components so as to
prevent any data loss without operating-system or firmware involvement. Non-critical
environmental events are logged and reported using Event Scan.

The operating system cannot program or access the temperature threshold using the SP.

EPOW events can trigger the following actions:

Temperature monitoring, which increases the fan rotation speed when the ambient
temperature is above a preset operating range.

Temperature monitoring warns the system administrator of potential environmental-related
problems. It also performs an orderly system shutdown when the operating temperature
exceeds a critical level.

Voltage monitoring provides warning and an orderly system shutdown when the voltage is
out of operational specification.
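
The escalation described above can be pictured as a simple threshold classifier. The following is only an illustrative sketch: in the p5-570 this logic lives in the SPCN and firmware, the operating system cannot access the thresholds, and every name and numeric limit below (`Thresholds`, `classify_temperature`, `classify_voltage`, the degree and volt values) is a hypothetical placeholder rather than a value from this manual.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    RAISE_FAN_SPEED = "raise_fan_speed"
    WARN_ADMIN = "warn_admin"
    ORDERLY_SHUTDOWN = "orderly_shutdown"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits; the manual does not publish numeric values.
    fan_boost_c: float = 30.0   # top of the preset operating range
    warning_c: float = 40.0     # administrator is warned above this
    critical_c: float = 50.0    # orderly shutdown above this
    volt_min: float = 11.4      # lower edge of the operational specification
    volt_max: float = 12.6      # upper edge of the operational specification


def classify_temperature(temp_c: float, t: Thresholds = Thresholds()) -> Action:
    """Return the most severe action that applies to a temperature reading."""
    if temp_c > t.critical_c:
        return Action.ORDERLY_SHUTDOWN
    if temp_c > t.warning_c:
        return Action.WARN_ADMIN
    if temp_c > t.fan_boost_c:
        return Action.RAISE_FAN_SPEED
    return Action.NONE


def classify_voltage(volts: float, t: Thresholds = Thresholds()) -> Action:
    """Out-of-specification voltage leads to a warning and an orderly shutdown;
    only the shutdown (the most severe step) is returned here."""
    if volts < t.volt_min or volts > t.volt_max:
        return Action.ORDERLY_SHUTDOWN
    return Action.NONE


if __name__ == "__main__":
    for reading in (25.0, 35.0, 45.0, 55.0):
        print(reading, classify_temperature(reading).value)
```

Returning a single, most severe action keeps the sketch short; a real controller would typically also keep the fans at high speed while warning or shutting down.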

3.2.4 Self-healing

For a system to be self-healing, it must be able to recover from a failing component by first
detecting and isolating the failed component, taking it offline, fixing or isolating it, and
reintroducing the fixed or replacement component into service without any application
disruption. Examples include:

Bit steering to redundant memory in the event of a failed memory module to keep the
server operational

Bit-scattering, thus allowing for error correction and continued operation in the presence
of a complete chip failure (Chipkill™ recovery)

Single-bit error correction using ECC without reaching error thresholds for main, L2, and
L3 cache memory

L3 cache line deletes extended from 2 to 10 for additional self-healing

ECC extended to inter-chip connections on fabric and processor bus

Memory scrubbing to help prevent soft-error memory faults

Dynamic processor deallocation, in which a deallocated processor can be replaced by an
unused CoD processor to keep the system operational
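
To make the bit-scattering idea in the list above concrete, here is a minimal sketch of why spreading each ECC word across chips lets single-bit correction survive a whole-chip failure. It is not the ECC scheme used in the p5-570: the textbook Hamming(7,4) code is swapped in for illustration, and the seven-chip layout and all function names are assumptions made for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming code word (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(word):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def scatter(words):
    """Hypothetical layout: chip i holds bit i of every code word."""
    return [[word[i] for word in words] for i in range(7)]


def gather(chips):
    """Reassemble code words by reading one bit from each chip."""
    return [[chips[i][w] for i in range(7)] for w in range(len(chips[0]))]


def fail_chip(chips, k):
    """Model a complete failure of chip k by inverting every bit it holds."""
    damaged = [list(chip) for chip in chips]
    damaged[k] = [bit ^ 1 for bit in damaged[k]]
    return damaged


if __name__ == "__main__":
    data = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
    chips = scatter([hamming74_encode(d) for d in data])
    for k in range(7):
        recovered = [hamming74_decode(w) for w in gather(fail_chip(chips, k))]
        print(f"chip {k} failed, data intact: {recovered == data}")
```

Because each chip contributes only one bit to any given code word, losing a chip costs every word at most one bit, which a single-error-correcting code can repair.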

Memory reliability, fault tolerance, and integrity

The p5-570 uses Error Checking and Correcting (ECC) circuitry for system memory to correct
single-bit memory failures and to detect double-bit memory failures. Detection of double-bit
memory failures helps maintain data integrity. Furthermore, the memory chips are organized
such that the failure of any specific memory module only affects a single bit within a
four-bit ECC word (bit-scattering), thus allowing for error correction and continued
operation in the presence of a complete chip failure (Chipkill recovery). The memory DIMMs
also utilize memory scrubbing and thresholding to determine when spare memory modules
within each bank of memory should be used to replace ones that have exceeded their
threshold of error count
