Partial chipset degradation – NEC INTEL 5800/1000 User Manual

Page 7


Memory Mirroring

Continuous operation even in the event of a non-correctable memory error

The Express5800/1000 series server supports high-level memory
RAS features to ensure that the server can rapidly detect memory
errors, reduce multi-bit errors and continue to operate even in
the event of memory chip or memory controller failures. Memory
scan, memory chip sparing (SDDC*) and memory scrubbing are
examples of those features.

A memory scan is run on all loaded memory modules at each OS
boot. If the system detects a memory failure, the failed component
is immediately isolated and detached from the system, preventing
possible downtime during business operations.

Chip sparing (SDDC*) memory is a memory system loaded with
several DRAM chips that can correct errors at the chip level. If
a failure occurs in the memory, the error is corrected
immediately so that operation can continue.

Memory scrubbing checks memory content regularly (every few
milliseconds) during operation without affecting performance.
When an error is detected, it is corrected and then reported.
The scrubbing function is effective in detecting errors in a timely
manner which ultimately results in the reduction of multi-bit errors.
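
The scrubbing idea can be illustrated with a minimal sketch. It assumes a toy Hamming(7,4) code protecting 4-bit words, which is an assumption chosen for brevity and not the ECC actually used by the Express5800/1000 hardware; the names `encode`, `decode`, `syndrome` and `scrub` are hypothetical. The point it shows is that a scrub pass finds and repairs a single-bit error before a second error can accumulate in the same word.

```python
# Toy memory scrubber. ASSUMPTION: Hamming(7,4) stands in for the real ECC.
DATA_POSITIONS = (3, 5, 6, 7)    # 1-indexed codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # 1-indexed codeword positions holding parity bits


def syndrome(code):
    """XOR of the 1-indexed positions of all set bits; 0 means no error seen."""
    s = 0
    for pos in range(1, 8):
        if code[pos - 1]:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value as a 7-bit codeword (list of 0/1)."""
    code = [0] * 7
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos - 1] = (nibble >> i) & 1
    s = syndrome(code)
    for p in PARITY_POSITIONS:
        if s & p:                # set parity so the overall syndrome becomes 0
            code[p - 1] = 1
    return code


def decode(code):
    """Extract the 4-bit value from a codeword."""
    return sum(code[pos - 1] << i for i, pos in enumerate(DATA_POSITIONS))


def scrub(memory):
    """Check every word, fix single-bit errors in place, and report them.

    Returns a list of (address, bit_position) pairs that were corrected.
    """
    report = []
    for addr, code in enumerate(memory):
        s = syndrome(code)
        if s:                    # non-zero syndrome names the flipped position
            code[s - 1] ^= 1     # correct first ...
            report.append((addr, s))  # ... then report
    return report


memory = [encode(v) for v in (0x3, 0xA, 0xF)]
memory[1][4] ^= 1                # inject a single-bit fault (position 5, word 1)
report = scrub(memory)
print(report)                               # [(1, 5)]
print([decode(word) for word in memory])    # [3, 10, 15]
```

In this toy code, two flipped bits in the same word would no longer be corrected properly, which is why scrubbing often enough to catch errors while they are still single-bit reduces multi-bit errors.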

With memory mirroring, the same data is continuously written to
two separate memory blocks instead of one (available only on the
1160Xf and 1320Xf). Because the data exists on two independent
blocks, operations continue without interruption even in the
event of a non-correctable error.
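
As a rough software analogy of the mirroring behavior described above, the following sketch writes every value to two blocks and falls back to the second copy when the first reports a non-correctable error. The class names (`MemoryBlock`, `MirroredMemory`, `UncorrectableError`) and the `failed` set used to simulate faults are hypothetical; the real mechanism is implemented in hardware.

```python
# Toy model of memory mirroring. ASSUMPTION: all names here are illustrative.
class UncorrectableError(Exception):
    """Raised when a block cannot return valid data for an address."""


class MemoryBlock:
    def __init__(self, size):
        self.cells = [0] * size
        self.failed = set()      # addresses simulated as non-correctable

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        if addr in self.failed:
            raise UncorrectableError(addr)
        return self.cells[addr]


class MirroredMemory:
    def __init__(self, size):
        self.primary = MemoryBlock(size)
        self.mirror = MemoryBlock(size)

    def write(self, addr, value):
        # Every write goes to both independent blocks.
        self.primary.write(addr, value)
        self.mirror.write(addr, value)

    def read(self, addr):
        # Serve from the primary; fall back to the mirror copy on failure.
        try:
            return self.primary.read(addr)
        except UncorrectableError:
            return self.mirror.read(addr)


mem = MirroredMemory(4)
mem.write(2, 99)
mem.primary.failed.add(2)        # simulate a non-correctable error
print(mem.read(2))               # 99, served from the mirror
```

The caller of `read` never sees the failure, which mirrors the claim that operations continue without interruption.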

Partial Chipset degradation

Avoid multi-partition shutdowns resulting from chipset failures

In certain instances when multiple server partitions share a
common crossbar controller, effects of a single partition failure
may result in a multi-partition shutdown. To resolve this issue, the
Express5800/1000 series servers have been designed to allow for
the partial degradation of chipsets.

Within each of the LSI chips, which make up the chipset, multiple
LSI sub-units exist. These sub-units are connected to other
sub-units located on separate LSI chips. The combined sub-units
together make up a single partition. If an error were to occur on an
LSI sub-unit, that sub-unit alone can be degraded to isolate the
failure to a single partition, thus preventing the failure from
spreading to other partitions.

Furthermore, the downed partition can automatically reboot
itself, after isolating the failed subsystem, to resume operations
in a degraded mode without the intervention of a system
administrator. This is made possible, on the Express5800/1000
series servers, by the redundant paths between the Cells and the
IO.
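
The isolation logic can be sketched as a small model. It assumes two crossbar controllers ("A" and "B"), each holding one sub-unit per partition, so that each partition has two redundant paths; this layout and the names `SubUnit`, `Chipset` and `handle_failure` are illustrative assumptions, not the server's actual firmware interface.

```python
# Toy model of partial chipset degradation. ASSUMPTION: names and layout
# are illustrative only.
from dataclasses import dataclass


@dataclass
class SubUnit:
    controller: str      # crossbar controller the sub-unit belongs to
    partition: int       # partition served by this sub-unit
    healthy: bool = True


@dataclass
class Chipset:
    sub_units: list

    def healthy_paths(self, partition):
        """Controllers that still offer a working path for a partition."""
        return [u.controller for u in self.sub_units
                if u.partition == partition and u.healthy]

    def handle_failure(self, failed):
        """Degrade only the failed sub-unit, then try to reboot its partition."""
        failed.healthy = False                 # isolate just this sub-unit
        paths = self.healthy_paths(failed.partition)
        return {
            "partition": failed.partition,
            "rebooted": bool(paths),           # reboot needs a redundant path
            "paths": paths,
        }


chipset = Chipset([
    SubUnit("A", 0), SubUnit("B", 0),
    SubUnit("A", 1), SubUnit("B", 1),
])
result = chipset.handle_failure(chipset.sub_units[0])   # sub-unit A/0 fails
print(result)                      # partition 0 reboots using controller B
print(chipset.healthy_paths(1))    # ['A', 'B'], partition 1 is not affected
```

Because the failure is recorded per sub-unit rather than per crossbar controller, the other partition keeps both of its paths.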

[Figure: partial chipset degradation. Labels: Cell 0 and Cell 1; Crossbar Controller A and Crossbar Controller B, each divided into sub-units; PCIBox 0 and PCIBox 1. n specifies the partition number. Sub-units within the chipset are shown; additional sub-sets exist in actuality. Sequence shown: a failure occurs at a sub-unit of the crossbar controller; Partition 0 is shut down so that the failed component can be isolated; Partition 0 is rebooted. The remainder is marked "Not affected".]

[Figure: memory mirroring. Labels: Unit of degradation on the Express5800/1000 Series; Cell Controller; CPU (x4); Memory I/F and Memory Controller (x4 each); Data 0 to Data 3, each shown twice (Mirror); "Components covered by the memory mirroring"; "Components covered by the standard chip sparing".]

This construct allows for continuous operation through all
non-correctable memory errors, not only in the memory itself
but also in the memory interfaces and in the memory controllers.

* Single Device Data Correction
