Ras design philosophy, The dual-core intel, Itanium – NEC INTEL 5800/1000 User Manual



RAS Design Philosophy

Realization of mainframe-class continuous operation through the pursuit of reliability and availability in a single server construct

Mainframe-class RAS Features

[Figure: Dependable Server Technology. Labels shown: Clustering; Improved system availability; Improved reliability and availability as a standalone server.]

Continuous operations through failures: Redundant components, error prediction, and error correction allow for continuous operation.

Minimized spread of failures: Technology to minimize the effects of hardware failures on the system. Reduction of performance degradation and multi-node shutdown.

Smooth recovery after failures: Ability to replace failed components without shutting down operations.

Generally, in order to achieve reliability and availability on an
open server, clustering would be implemented. However,
clustering comes with a price tag. To keep costs at a minimum,
the Express5800/1000 series servers were designed to
achieve a high level of reliability and availability, but within a
single server.

The Express5800/1000 series server’s powerful RAS features
were developed through the pursuit of dependable server
technology.

Continuous operation through failures, minimized spread of
failures, and smooth recovery after failures were the goals
that led to the implementation of technologies such as memory
mirroring, increased redundancy of intricate components, and
modularization. Through these technologies, a mainframe level
of continuous operation was achieved.

[Figure: RAS comparison. Levels shown: Mainframe Level, Conventional open server Level, PC Server Level. Categories: Reliability, Availability, Serviceability.]

Components shown: Center plane, Chipset, Clock, Core I/O, PCI card, Memory, CPU, L3 cache, Power, HDD

RAS features shown for these components:

No chipset on the center plane
ECC protection of main data paths
Intricate error detection of the high-speed interconnects
Partial chipset degradation / Dynamic recovery
Hot Pluggable *4 (shown for six components)
Duplexed *1
16 processor domain segmentation *2
Core I/O Relief
ECC protection
SDDC Memory
Memory Mirroring *1
Intel® Cache Safe Technology *3
N+1 Redundant
Two independent power sources
Software RAID
Hardware RAID

*1 Available only on the 1320Xf/1160Xf
*2 Available only on the 1320Xf
*3 Intel® technology designed to avoid cache based failures
*4 Replacement of failed component without shutting down other partitions.

The Dual-Core Intel® Itanium® processor MCA (Machine Check Architecture)

The framework for hardware, firmware and OS error handling

The Dual-Core Intel® Itanium® processor, designed for high-end
enterprise servers, not only excels in performance but is also
rich in RAS features. At the core of the processor’s RAS
feature set is the error handling framework called MCA.

MCA provides a three-stage error handling mechanism: hardware,
firmware, and operating system. In the first stage, the CPU and
chipset attempt to handle errors through ECC (Error Correcting
Code) and parity protection. If the hardware cannot handle the
error, it is passed to the second stage, where the firmware
attempts to resolve the issue. In the third stage, if the error
cannot be handled by the first two stages, the operating system
runs recovery procedures based on the error report and error log
it received. In the event of a critical error, the system
automatically resets to significantly reduce the possibility of
a system failure.
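
The staged escalation described above can be sketched as a simple chain of handlers. This is an illustrative model only, not the actual MCA interface: the Severity levels, the ErrorEvent class, and the handler function names are all hypothetical, chosen to mirror the hardware, firmware, and operating system stages in the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    """Hypothetical severity classes, one per stage that can resolve them."""
    HW_CORRECTABLE = 1   # fixed by ECC/parity in the CPU or chipset
    FW_RECOVERABLE = 2   # resolved by firmware
    OS_RECOVERABLE = 3   # needs OS recovery procedures
    CRITICAL = 4         # no stage can handle it


@dataclass
class ErrorEvent:
    source: str
    severity: Severity
    log: list = field(default_factory=list)  # stages that saw the error


def hardware_stage(event: ErrorEvent) -> bool:
    # Stage 1: CPU and chipset ECC and parity protection.
    event.log.append("hardware")
    return event.severity is Severity.HW_CORRECTABLE


def firmware_stage(event: ErrorEvent) -> bool:
    # Stage 2: firmware attempts to resolve what the hardware could not.
    event.log.append("firmware")
    return event.severity is Severity.FW_RECOVERABLE


def os_stage(event: ErrorEvent) -> bool:
    # Stage 3: the OS runs recovery based on the report and log it received.
    event.log.append("os")
    return event.severity is Severity.OS_RECOVERABLE


STAGES = (
    ("hardware", hardware_stage),
    ("firmware", firmware_stage),
    ("os", os_stage),
)


def handle(event: ErrorEvent) -> str:
    """Pass the event up the stages; reset if none can handle it."""
    for name, stage in STAGES:
        if stage(event):
            return name
    event.log.append("reset")  # critical error: automatic system reset
    return "reset"


if __name__ == "__main__":
    for sev in Severity:
        ev = ErrorEvent(source="memory", severity=sev)
        print(sev.name, "->", handle(ev), ev.log)
```

Each stage records that it saw the error before deciding whether it can resolve it, so the log on the event shows how far the error travelled before it was handled or a reset was triggered.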

[Figure: MCA error handling layers]

Application Layer

Operating System: The OS logs the error, and then starts the recovery process. The firmware and OS aid in the correction of complex platform errors to restore the system.

Firmware: Seamlessly handles the error. Error details are logged, and then a report flow is defined for the OS.

Hardware: CPU and chipset ECC and parity protection. Detects and corrects a wide range of hardware errors for main data structures.
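
To illustrate how error-correcting codes let hardware detect and correct a single flipped bit without involving firmware or the OS, here is a minimal sketch using the textbook Hamming(7,4) code. This is an assumption made for illustration: the manual does not state which ECC scheme the CPU or chipset uses, and server hardware generally uses wider codes. The function names encode and decode are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p4 d2 d3 d4
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was detected, otherwise the
    1-based position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome names the faulty position
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[4] ^= 1                      # simulate a single-bit fault
    data, pos = decode(word)
    print(data, pos)
```

The syndrome computed in decode is zero for an intact codeword and otherwise equals the position of the flipped bit, which is why a single-bit error can be repaired in place.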
