IBM POWER 730 User Manual


IBM Europe, Middle East, and Africa Hardware Announcement ZG10-0214

IBM is a registered trademark of International Business Machines Corporation


• Fan speed is controlled by monitoring actual temperatures on critical components

and adjusting accordingly. If internal component temperatures reach critical

levels, the system will shut down immediately, regardless of fan speed. When a

redundant fan fails, the system calls out the failing fan and continues running.

When a nonredundant fan fails, the system shuts down immediately.
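The shutdown and call-out rules in the bullet above can be read as a small decision procedure. The following is only an illustrative sketch, not system firmware: the `Fan` class, the `fan_policy` function, and the temperature threshold used in the example call are hypothetical names and values chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Fan:
    name: str          # hypothetical fan label
    redundant: bool    # True if another fan can cover for this one
    failed: bool = False


def fan_policy(component_temps_c, critical_temp_c, fans):
    """Return (action, called_out_fans) for one monitoring pass.

    component_temps_c maps a component name to its measured temperature.
    critical_temp_c is an assumed single threshold for all components.
    """
    # A critical temperature forces shutdown regardless of fan speed.
    if any(t >= critical_temp_c for t in component_temps_c.values()):
        return "shutdown", []

    failed = [f for f in fans if f.failed]

    # Losing a nonredundant fan also forces an immediate shutdown.
    if any(not f.redundant for f in failed):
        return "shutdown", [f.name for f in failed]

    # A failed redundant fan is called out and the system keeps running.
    return "continue", [f.name for f in failed]
```

For example, `fan_policy({"cpu": 60}, 95, [Fan("fan1", True, failed=True)])` reports that the system continues while `fan1` is called out for service.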

Availability enhancement functions

The POWER7 family of systems continues to offer and introduce significant

enhancements designed to increase system availability.

POWER7 processor functions

As in POWER6, the POWER7 processor has the ability to do processor instruction

retry and alternate processor recovery for a number of core-related faults. This

significantly reduces exposure to both hard (logic) and soft (transient) errors in

the processor core. Soft failures in the processor core are transient (intermittent)

errors, often due to cosmic rays or other sources of radiation, and generally are not

repeatable. When an error is encountered in the core, the POWER7 processor will

first automatically retry the instruction. If the source of the error was truly transient,

the instruction will succeed and the system will continue as before. On IBM systems

prior to POWER6, this error would have caused a checkstop.

Hard failures are more difficult, being true logical errors that will be replicated

each time the instruction is repeated. Retrying the instruction will not help in this

situation. As in POWER6, POWER7 processors have the ability to extract the failing

instruction from the faulty core and retry it elsewhere in the system for a number

of faults, after which the failing core is dynamically deconfigured and called out for

replacement. These systems are designed to avoid a full system outage.
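The two recovery paths described above (retry on the same core for soft errors, then alternate processor recovery for hard errors) can be modeled in a few lines. This is a conceptual simulation only, assuming a simplified world in which every fault is recoverable by one of the two paths; the `Core`, `CoreFault`, `Checkstop`, and `run_with_recovery` names are hypothetical and do not correspond to any IBM interface.

```python
class CoreFault(Exception):
    """Raised by a simulated core when an instruction hits a fault."""


class Checkstop(Exception):
    """Raised when no recovery path is left in this simplified model."""


class Core:
    def __init__(self, name, faults_remaining=0):
        self.name = name
        # 1 models a soft (transient) error; float("inf") models a hard error.
        self.faults_remaining = faults_remaining
        self.configured = True

    def execute(self, instruction):
        if self.faults_remaining > 0:
            self.faults_remaining -= 1
            raise CoreFault(self.name)
        return instruction()


def run_with_recovery(instruction, core, spares):
    """Run an instruction, returning (result, how_it_completed)."""
    # First pass is the original execution, second pass is the automatic retry.
    for outcome in ("ok", "retried"):
        try:
            return core.execute(instruction), outcome
        except CoreFault:
            continue

    # The retry failed too, so treat the fault as hard: run the instruction
    # elsewhere, then deconfigure the failing core so it can be replaced.
    for spare in spares:
        if not spare.configured:
            continue
        try:
            result = spare.execute(instruction)
        except CoreFault:
            continue
        core.configured = False
        return result, "moved to " + spare.name

    raise Checkstop("no core could complete the instruction")
```

In this model a core created with `faults_remaining=1` completes on the retry, while a core created with `faults_remaining=float("inf")` hands its instruction to a spare and is marked as deconfigured.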

POWER7 single processor checkstopping

As in POWER6, POWER7 provides single processor checkstopping. This significantly

reduces the probability of a fault in any one processor affecting total system

availability.

Partition availability priority

Also available is the ability to assign availability priorities to partitions. If an

alternate processor recovery event requires spare processor resources in order

to protect a workload, when no other means of obtaining the spare resources is

available, the system will determine which partition has the lowest priority and

attempt to claim the needed resource. On a properly configured POWER7

processor-based server, this allows that capacity to be first obtained from, for example, a test

partition instead of a financial accounting system.
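The selection rule above amounts to picking the lowest-priority partition that still has something to give. A minimal sketch follows, assuming that a larger number means a higher availability priority and that each partition is described by a plain dictionary; the `choose_donor` function, the field names, and the priority values are hypothetical.

```python
def choose_donor(partitions):
    """Return the name of the partition to reclaim a core from, or None.

    Each partition is a dict with "name", "priority" (larger = more
    important, an assumption of this sketch) and "cores" (cores it holds).
    """
    # Only partitions that actually hold processor resources can donate.
    candidates = [p for p in partitions if p["cores"] > 0]
    if not candidates:
        return None
    # The lowest-priority partition gives up capacity first.
    return min(candidates, key=lambda p: p["priority"])["name"]


partitions = [
    {"name": "accounting", "priority": 200, "cores": 4},
    {"name": "test", "priority": 10, "cores": 2},
]
donor = choose_donor(partitions)
```

With these example values the test partition is chosen before the accounting partition, which mirrors the scenario in the text.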

POWER7 cache availability

The POWER processor-based line of servers continues to be at the forefront of

cache availability enhancements. The L3 cache is now integrated on the POWER7

processor. The POWER7 processor provides both L2 and L3 cache line delete

functions.

Special uncorrectable error handling

Special Uncorrectable Error (SUE) handling was an IBM innovation introduced for

POWER5™ processors, where an uncorrectable error in memory or cache does not

immediately cause the system to terminate. Rather, the system tags the data and

determines whether it will ever be used again. If the tagged data is never used, the error will not

force a checkstop.
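The idea of tagging bad data instead of terminating immediately can be illustrated with a toy memory model. This is a conceptual sketch only: the `Memory` class and the `TaggedDataConsumed` exception are hypothetical, and the sketch assumes that overwriting a tagged location makes the earlier error irrelevant. How the real hardware and firmware react when tagged data is consumed is not specified in the text above.

```python
class TaggedDataConsumed(Exception):
    """Stands in for whatever the platform does when tagged data is used."""


class Memory:
    def __init__(self):
        self.cells = {}
        self.tagged = set()   # addresses holding uncorrectable data

    def write(self, addr, value):
        self.cells[addr] = value
        # Fresh data replaces the bad data, so the tag no longer applies
        # (an assumption of this sketch).
        self.tagged.discard(addr)

    def mark_uncorrectable(self, addr):
        # Detection only tags the location; nothing is terminated here.
        self.tagged.add(addr)

    def read(self, addr):
        # The error only matters if the tagged data is actually used.
        if addr in self.tagged:
            raise TaggedDataConsumed(hex(addr))
        return self.cells[addr]
```

In this model an uncorrectable error at an address that is later overwritten, or never read again, has no visible effect; only a read of the tagged address raises the exception.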

PCI extended error handling

PCI extended error handling (EEH)-enabled adapters respond to a special data

packet generated from the affected PCI slot hardware by calling system firmware,

which will examine the affected bus, allow the device driver to reset it, and continue

without a system reboot. For Linux, EEH support extends to the majority of
