1 reliability, 2 availability, Reliability – FUJITSU SPARC ENTERPRISE SERVER M9000 User Manual

SPARC Enterprise M8000/M9000 Servers Overview Guide • December 2010

The RAS functions of M8000/M9000 servers minimize system downtime by checking for
errors at appropriate locations and by centrally monitoring and controlling that
error checking.

In addition, M8000/M9000 servers can be configured with clustering software or
centralized management software to enhance the RAS functions.

Any scheduled system halt, such as periodic maintenance or a system configuration
change, can also be performed without affecting resources that are in operation.
This can significantly improve service uptime.

2.4.1 Reliability

Reliability represents the length of time the server can operate normally without
failure.

Reliability is equally important for both hardware and software.

To improve quality, appropriate components must be selected with consideration
given to the product service life and the required response in case of a failure.
In evaluations such as stress tests that check the service life, components and
products are inspected to determine whether they meet the target reliability levels.

Furthermore, software errors are triggered not only by program errors but also by
hardware errors.

M8000/M9000 servers provide the following functions to achieve high reliability.

The XSCF periodically checks whether software such as the Oracle Solaris OS is
running in the domains (host watchdog monitoring).

Memory patrol is performed periodically to detect memory soft errors and stuck
faults, even in memory areas not normally used. This prevents faulty memory from
being used and thereby prevents the system failures it would cause.

Because ECC protects data along all paths, including the computing units,
registers, cache memory, and the system bus, all 1-bit errors are automatically
corrected by hardware to ensure data integrity.
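
The idea of correcting any 1-bit error can be illustrated with a small sketch. The code below uses a textbook Hamming(7,4) code as a stand-in. This is an assumption made for illustration only: the guide does not state which ECC code the M8000/M9000 hardware uses, and the function names are hypothetical.

```python
# Illustrative only: textbook Hamming(7,4), not the servers' actual ECC scheme.
# Codeword positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.

def hamming74_encode(data_bits):
    """Encode 4 data bits (each 0 or 1) into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Correct at most one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position of the corrected bit, or 0 if no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # syndrome read as a binary number
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a 1-bit error at position 5
    data, position = hamming74_correct(word)
    print(data, position)
```

In this sketch the syndrome identifies the position of the flipped bit, so the original data is recovered without software involvement, which mirrors the hardware-level correction described above.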

2.4.2 Availability

Availability is characterized by how rarely a server fails and how quickly it can
be recovered after a failure. It is expressed as the percentage of time during
which the system is usable.
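
As a rough sketch of how such a percentage might be computed, the following helpers divide usable time by total time. The function names and the MTBF/MTTR formulation are assumptions added for illustration and do not appear in the guide; the second function uses the commonly cited steady-state formula MTBF / (MTBF + MTTR).

```python
def availability_percent(uptime_hours, downtime_hours):
    """Percentage of the total time during which the system was usable."""
    total = uptime_hours + downtime_hours
    if uptime_hours < 0 or downtime_hours < 0 or total == 0:
        raise ValueError("hours must be non-negative and not both zero")
    return 100.0 * uptime_hours / total


def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair (MTTR), both hypothetical inputs here."""
    return availability_percent(mtbf_hours, mttr_hours)


if __name__ == "__main__":
    hours_per_year = 365 * 24  # 8760 hours
    downtime = 8.76            # hypothetical yearly downtime in hours
    print(availability_percent(hours_per_year - downtime, downtime))
```

With the hypothetical figures above, 8.76 hours of downtime in a year corresponds to roughly 99.9 percent availability. Both fewer failures (longer MTBF) and faster recovery (shorter MTTR) raise the result, which matches the two factors named in the definition.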
