Sr870bn4 machine check error handling, Classification of errors, Error types – Dell PowerEdge 7250 User Manual



SR870BN4 Error Reference Guide


Revision 1.0


4. SR870BN4 Machine Check Error Handling

This section gives an overview of the implementation of machine check error handling on the
SR870BN4 server system. For additional details about Itanium-based system error generation
and error handling, refer to the Itanium™ Processor Family Error Handling Guide (document
number: 249278-002) and the Itanium™ System Abstraction Layer Specification (document
number: 245359-005). Both documents can be downloaded from the web at
developer.intel.com.

4.1 Classification of Errors

Error events are classified by the processor and platform into three basic groups. This section
provides a summary of the different error types and signaling methods defined by the IPF
Machine Check architecture and implemented in the SR870BN4 platform.

4.1.1 Error Types

Fatal: A fatal error is an error where the state has been corrupted, and the error may or
may not be contained. The platform will signal a fatal error when the integrity of the
platform or subsystem cannot be determined. These errors cannot be corrected by
hardware, firmware, or system software, and a reset of the system or subsystem is
required.

Recoverable/Uncorrectable: An error has been detected that cannot be corrected by
hardware or firmware. However, the operating integrity of platform hardware and system
state has been maintained. These errors may or may not be recoverable (determined by
system software capabilities).

Correctable: An error has been detected and corrected by hardware, or by
processor/platform firmware.
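
The three error types above can be read as a small decision rule. The following is a minimal sketch in Python, not code from the SR870BN4 firmware: the names `ErrorType`, `classify`, and `FOLLOW_UP`, and the two boolean inputs, are hypothetical and exist only to illustrate the classification.

```python
from enum import Enum


class ErrorType(Enum):
    """The three error groups described in section 4.1.1."""
    FATAL = "fatal"
    RECOVERABLE = "recoverable/uncorrectable"
    CORRECTABLE = "correctable"


def classify(corrected_by_hw_or_fw: bool, integrity_maintained: bool) -> ErrorType:
    """Map two illustrative flags (assumptions, not real registers) to an error type."""
    # Corrected by hardware or by processor/platform firmware.
    if corrected_by_hw_or_fw:
        return ErrorType.CORRECTABLE
    # Not corrected, but platform hardware and system state are still intact.
    if integrity_maintained:
        return ErrorType.RECOVERABLE
    # State is corrupted, or integrity cannot be determined.
    return ErrorType.FATAL


# Follow-up for each type, paraphrased from the text above.
FOLLOW_UP = {
    ErrorType.CORRECTABLE: "none required; already corrected",
    ErrorType.RECOVERABLE: "recovery depends on system software capabilities",
    ErrorType.FATAL: "reset of the system or subsystem",
}
```

For example, `classify(False, True)` yields `ErrorType.RECOVERABLE`, and `FOLLOW_UP` then indicates that the outcome depends on system software.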

4.1.2 Error Signaling

Corrected Machine Check Interrupt (CMCI): Corrected processor errors are signaled
as a CMCI to system software. For example, L1 tag parity errors on shared lines or
thermal events are corrected by the processor (logic or the PAL). System software
must ensure that the interrupt handler for CMCI executes on the same processor that
signaled the corrected error event.

Corrected Platform Event Interrupt (CPEI): These interrupts are signaled by the
platform or the SAL. They cover errors that are corrected by the platform (such as
a single-bit ECC error in memory) and errors that are not correctable by the platform. In
either case, the error is contained (i.e., data poisoning), and the platform can still
function reliably. One example of an uncorrected error is a 2XECC error detected on a
write to memory.

Machine Check Events: A processor machine check occurs when the processor
detects a fatal or recoverable error during execution of instructions or when the
processor is signaled by the platform to enter machine check.
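
The signaling rules above can be summarized in a short sketch. This is an illustrative Python model only, not an SAL or PAL interface: `Source`, `Signal`, `select_signal`, and `cmci_affinity_ok` are hypothetical names. The final branch of `select_signal` is an assumption, since the text does not state exactly which platform errors cause the platform to signal a machine check.

```python
from enum import Enum


class Source(Enum):
    PROCESSOR = "processor"
    PLATFORM = "platform"  # platform hardware or the SAL


class Signal(Enum):
    CMCI = "Corrected Machine Check Interrupt"
    CPEI = "Corrected Platform Event Interrupt"
    MACHINE_CHECK = "Machine Check"


def select_signal(source: Source, corrected: bool, contained: bool = False) -> Signal:
    """Pick the signaling method for an error, following section 4.1.2."""
    if source is Source.PROCESSOR:
        # Corrected processor errors raise a CMCI; fatal or recoverable
        # processor errors cause a machine check.
        return Signal.CMCI if corrected else Signal.MACHINE_CHECK
    # Platform errors that are corrected, or uncorrected but contained
    # (data poisoning), are reported through a CPEI.
    if corrected or contained:
        return Signal.CPEI
    # Assumption: an uncorrected, uncontained platform error is one where
    # the platform signals the processor to enter machine check.
    return Signal.MACHINE_CHECK


def cmci_affinity_ok(signaling_cpu: int, handler_cpu: int) -> bool:
    """The CMCI handler must execute on the processor that signaled the event."""
    return signaling_cpu == handler_cpu
```

Under these assumptions, a single-bit ECC error corrected by the platform maps to `Signal.CPEI`, and a CMCI raised on processor 2 but handled on processor 3 fails the `cmci_affinity_ok` check.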
