Two independent power sources, Mainframe-class RAS features – NEC INTEL 5800/1000 User Manual


Mainframe-class RAS Features

Enhanced error detection of the high-speed interconnect

Intricate error handling through multi-bit error detection and resending of errored data

Because higher-speed interconnects are implemented to increase
system performance, there is a higher probability that
interference noise will cause errors along these
interconnects. One method of handling these interconnect errors
is to disable the errored interconnect and operate in a
degraded mode.

In addition to the above method, the Express5800/1000 series servers
have implemented a methodology prevalent in supercomputers,
whereby intricate multi-bit error detection is carried out and
errored data is resent upon detection of an error. This allows
the Express5800/1000 series servers to handle the intermittent
errors that occur along the high-speed interconnects without
impacting system performance.
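The manual does not describe the link-level protocol, so the following is only a minimal sketch of the detect-and-resend idea, assuming a CRC-32 checksum (Python's `zlib.crc32`) as the error detection code and a simulated noisy link that flips one random bit. The function names, the flip probability, and the retry limit are hypothetical. CRC-32 detects every single-bit error, so each corrupted frame is rejected and resent rather than passed on as bad data.

```python
import random
import zlib
from typing import Optional, Tuple

CRC_LEN = 4  # CRC-32 checksum appended to each frame, in bytes


def make_frame(payload: bytes) -> bytes:
    """Append a big-endian CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(CRC_LEN, "big")


def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if the checksum matches, or None if an error is detected."""
    payload, crc = frame[:-CRC_LEN], frame[-CRC_LEN:]
    if zlib.crc32(payload).to_bytes(CRC_LEN, "big") == crc:
        return payload
    return None


def noisy_link(frame: bytes, rng: random.Random, flip_prob: float) -> bytes:
    """Hypothetical link model: with probability flip_prob, flip one random bit."""
    if rng.random() < flip_prob:
        data = bytearray(frame)
        bit = rng.randrange(len(data) * 8)
        data[bit // 8] ^= 1 << (bit % 8)
        return bytes(data)
    return frame


def send_with_retry(payload: bytes, rng: random.Random,
                    flip_prob: float = 0.3, max_retries: int = 8) -> Tuple[bytes, int]:
    """Resend the frame until it arrives intact; return (payload, attempts used)."""
    frame = make_frame(payload)
    for attempt in range(1, max_retries + 1):
        received = check_frame(noisy_link(frame, rng, flip_prob))
        if received is not None:
            return received, attempt
    # Persistent errors: a real system might fall back to disabling the link.
    raise RuntimeError(f"frame still corrupted after {max_retries} attempts")


if __name__ == "__main__":
    rng = random.Random(0)
    data, attempts = send_with_retry(b"cache line 0x40", rng)
    print(data, "delivered after", attempts, "attempt(s)")
```

Only intermittent errors are absorbed this way: if every attempt fails, the sketch gives up, which is where the degraded-mode approach described earlier would take over.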

Two independent power sources

Avoid system shutdown due to failures of the power distribution units

The previous 32-processor and 16-processor models supported
two independent power supplies, whereas the 8-processor
model did not. This feature is now available on the new 8-processor
system (1080Rf), so that the system can continue operating even
in the event of a failure within a power distribution unit.
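The manual gives no availability figures, so the following is only a rough illustration of why a second independent power source helps. It assumes the two feeds fail independently and that either one alone can power the system; the system then loses power only when both are down at once, so its availability is 1 - (1 - A)^2 for a per-feed availability A. The 0.999 value is hypothetical, not an NEC specification.

```python
def redundant_availability(unit_availability: float, units: int = 2) -> float:
    """Availability of `units` independent, redundant units where any single unit suffices.

    The system is down only when every unit is down at the same time,
    which (assuming independent failures) has probability (1 - A) ** units.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if units < 1:
        raise ValueError("need at least one unit")
    return 1.0 - (1.0 - unit_availability) ** units


HOURS_PER_YEAR = 24 * 365

single = 0.999  # hypothetical availability of one power feed, not an NEC figure
dual = redundant_availability(single)
print(f"single feed: {(1 - single) * HOURS_PER_YEAR:.2f} h/year expected downtime")
print(f"dual feeds:  {(1 - dual) * HOURS_PER_YEAR * 3600:.1f} s/year expected downtime")
```

Under these assumptions, expected downtime drops from about 8.8 hours a year to about 32 seconds. Real power failures are often correlated (for example, a shared building feed), which is why the manual also recommends a UPS.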

Autonomic reporting of error logs with pinpoint prognosis of failed components

Realization of mainframe-class platform serviceability

The Express5800/1000 series servers are equipped with a service
processor that handles server management and platform error
handling. The service processor can be considered the core
component supporting the RAS features of the system. One
feature of the service processor is its ability to analyze detailed logs
(BID: built-in diagnosis) that are collected by the chipset in the
event of an error. The BID diagnoses the location of the
error and pinpoints the FRU (Field Replaceable Unit) to be replaced,
so that the time required to replace the component and recover the
system can be minimized.
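How the BID actually analyzes the chipset logs is not described here. As a hypothetical sketch only, assume each log record carries a timestamp and the ID of the reporting component, and that a lookup table maps component IDs to FRUs. The sketch treats the earliest logged error as the origin (on the assumption that later errors are side effects of it) and returns that component's FRU. All component IDs and FRU names below are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ErrorRecord:
    timestamp_ns: int  # when the chipset logged the error
    source: str        # hypothetical ID of the component that reported it
    error_type: str    # e.g. "uncorrectable", "link_crc"


# Hypothetical mapping from reporting component to Field Replaceable Unit.
FRU_MAP = {
    "cpu0": "CPU module 0",
    "dimm3": "memory DIMM 3",
    "xbar_link_a": "crossbar cable A",
}


def pinpoint_fru(records: Iterable[ErrorRecord]) -> Optional[str]:
    """Return the FRU of the earliest logged error, or None if no source is known.

    Assumption: later errors are side effects of the first one, so the
    earliest record is treated as the origin of the failure.
    """
    known = [r for r in records if r.source in FRU_MAP]
    if not known:
        return None
    origin = min(known, key=lambda r: r.timestamp_ns)
    return FRU_MAP[origin.source]


if __name__ == "__main__":
    log = [
        ErrorRecord(1_200, "cpu0", "uncorrectable"),
        ErrorRecord(1_000, "dimm3", "uncorrectable"),
        ErrorRecord(1_500, "xbar_link_a", "link_crc"),
    ]
    print(pinpoint_fru(log))  # memory DIMM 3
```

Pointing at a single FRU, rather than a list of suspects, is what shortens the replace-and-recover cycle the paragraph describes.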

In the event of a failure, the Express5800/1000 series servers
can also automatically send detailed error logs
to maintenance personnel, further reducing the time
required to resolve a system error. Furthermore, to minimize
the possibility of a critical error, the diagnostics engine can
proactively predict errors rather than only react to them.
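One way to predict errors rather than react to them is to watch the retry tendency of a component and flag it once retries within a time window exceed a threshold, before a fatal failure occurs. The sketch below assumes a sliding time window; the class name, window length, and threshold are hypothetical and are not the actual parameters of the diagnostics engine.

```python
from collections import deque


class RetryTrendMonitor:
    """Flags a component when the number of retries inside a sliding time
    window exceeds a threshold (hypothetical preventive-maintenance trigger)."""

    def __init__(self, window_s: float, threshold: int) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self.events: deque = deque()  # retry timestamps, oldest first

    def record(self, t: float) -> bool:
        """Record a retry at time t (seconds, non-decreasing).

        Returns True if the retries within the last window_s seconds
        now exceed the threshold.
        """
        self.events.append(t)
        # Drop retries that have fallen out of the window.
        while self.events and self.events[0] <= t - self.window_s:
            self.events.popleft()
        return len(self.events) > self.threshold


if __name__ == "__main__":
    monitor = RetryTrendMonitor(window_s=60.0, threshold=3)
    for t in (0, 10, 20, 30):
        flagged = monitor.record(t)
    print(flagged)  # True: 4 retries within 60 s exceeds the threshold of 3
```

Occasional retries age out of the window and never trigger an alert; only a rising rate does, which separates a degrading part from ordinary intermittent noise.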

Implementation of an Uninterruptible Power Supply (UPS) can
further increase availability. The two independent power sources
feature is standard on the 1320Xf and is available as an
option on the 1160Xf and 1080Rf.

[Figure: Autonomic error reporting flow. Customer environment: the chipset collects a detailed hardware error log including transaction history; the Service Processor Manager and the Diagnostics Agent diagnose the retry tendency and confirm whether a threshold was exceeded. The error information is sent via email over the Internet as an encrypted message. Maintenance Group: the error information summary is analyzed to determine the cause of the failure, and the development team may be contacted for assistance. Development Group: if required, the detail log is analyzed further. Result: preventive maintenance and failed component replacement.]
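The manual does not specify the format of the emailed error report. The sketch below only shows, under that caveat, how a report with a short summary in the body and the detailed log as an attachment could be assembled with Python's standard `email.message.EmailMessage`. The function name, subject format, and addresses are hypothetical, and the encryption of the message mentioned in the reporting flow is not reproduced here.

```python
from email.message import EmailMessage


def build_error_report(host: str, summary: str, detail_log: bytes,
                       to_addr: str, from_addr: str) -> EmailMessage:
    """Assemble an error report: summary in the body, detailed log attached.

    Hypothetical format; encryption of the message is intentionally omitted.
    """
    msg = EmailMessage()
    msg["Subject"] = f"[Hardware error] {host}: {summary}"
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg.set_content(f"Host: {host}\nSummary: {summary}\n")
    # Attach the raw chipset log as binary data.
    msg.add_attachment(detail_log, maintype="application",
                       subtype="octet-stream", filename="detail_log.bin")
    return msg


if __name__ == "__main__":
    report = build_error_report("server01", "uncorrectable memory error",
                                b"\x01\x02\x03", "maint@example.com",
                                "sp@example.com")
    print(report["Subject"])
```

Sending is left out; with a reachable mail server, `smtplib.SMTP(...).send_message(report)` is one standard way to deliver such a message.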

[Figure: Error detection in the chipset. Without check features, a failure in the logic circuits or ECC lets bad data pass and the error cannot be detected. With error detection circuits that are themselves checked (circuit check), a 1-bit error is detected and reported. Bad data resulting from a simple error, such as a single-bit error, cannot be blocked if a failure exists within the error detection circuits themselves. Diagnosing the error detection circuits at every system boot ensures data integrity.]