5 n+1 redundancy, 6 fault masking, 7 resource deallocation – IBM P5 570 User Manual


p5-570 Technical Overview and Introduction

(dynamic bit-steering). Memory scrubbing is the process of reading the contents of the
memory during idle time and checking and correcting any single-bit errors that have
accumulated by passing the data through the ECC logic. This function is a hardware function
on the memory controller chip and does not influence normal system memory performance.

3.2.5 N+1 redundancy

The use of redundant parts allows the p5-570 to remain operational with full resources:

Redundant spare memory bits in L1, L2, L3, and main memory

Redundant fans

Redundant power supplies

3.2.6 Fault masking

If corrections and retries succeed and do not exceed threshold limits, the system remains
operational with full resources and no client or IBM customer engineer intervention is
required:

CEC bus retry and recovery

PCI-X bus recovery

ECC Chipkill soft error

3.2.7 Resource deallocation

If recoverable errors exceed threshold limits, resources can be deallocated with the system
remaining operational, allowing deferred maintenance at a convenient time.

Dynamic or persistent deallocation

Dynamic deallocation of potentially failing components is non-disruptive, allowing the system
to continue to run. Persistent deallocation occurs when a failed component is detected; that
component is then deactivated at a subsequent reboot.

Dynamic deallocation functions include:

Processor

L3 cache line delete

Partial L2 cache deallocation

PCI-X bus and slots

For dynamic processor deallocation, the service processor performs a predictive failure
analysis based on any recoverable processor errors that have been recorded. If these
transient errors exceed a defined threshold, the event is logged and the processor is
deallocated from the system while the operating system continues to run. This feature
(named CPU Guard) enables maintenance to be deferred until a suitable time. Processor
deallocation can occur only if there are sufficient functional processors (at least two).
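
The decision rule described above can be illustrated with a minimal shell sketch. It is not the service processor's actual logic: the function name should_deallocate, its arguments, and the threshold value used in the calls are hypothetical and chosen only for illustration.

```shell
# Hypothetical sketch of the CPU Guard decision rule described in the text.
# Usage: should_deallocate ERRORS THRESHOLD FUNCTIONAL_CPUS
#   ERRORS          recoverable errors recorded for one processor
#   THRESHOLD       assumed error limit (the real value is not given here)
#   FUNCTIONAL_CPUS number of functional processors in the system
should_deallocate() {
    errors=$1
    threshold=$2
    functional=$3
    # Deallocate only if the threshold is exceeded AND at least two
    # functional processors are available.
    if [ "$errors" -gt "$threshold" ] && [ "$functional" -ge 2 ]; then
        echo "deallocate"
    else
        echo "keep"
    fi
}

should_deallocate 12 10 4   # threshold exceeded, enough processors
should_deallocate 12 10 1   # threshold exceeded, too few processors
should_deallocate 3 10 4    # errors within the threshold
```

The first call prints deallocate; the other two print keep, because either the processor count or the error count does not satisfy the rule.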

To verify whether CPU Guard has been enabled, run the following command:

lsattr -El sys0 | grep cpuguard

If CPU Guard is enabled, the output will be similar to:

cpuguard enable CPU Guard True
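
For scripted checks, the attribute value can be extracted from that output rather than read by eye. The following is a minimal sketch that assumes the output has the column layout shown above; the helper name cpuguard_state is hypothetical, and the sample line stands in for real lsattr output so that the snippet runs on any system.

```shell
# Hypothetical helper: reads lsattr-style output on stdin and prints the
# value column (second field) of the cpuguard attribute.
cpuguard_state() {
    awk '$1 == "cpuguard" { print $2 }'
}

# Sample line copied from the expected output above. On an AIX system,
# pipe in the output of `lsattr -El sys0` instead.
sample='cpuguard enable CPU Guard True'
state=$(printf '%s\n' "$sample" | cpuguard_state)

if [ "$state" = "enable" ]; then
    echo "CPU Guard is enabled"
else
    echo "CPU Guard is not enabled (state: ${state:-unknown})"
fi
```

Matching on the first field instead of using a plain grep avoids false matches if the string cpuguard ever appears in another attribute's description.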
