Apple Qmaster 2 and Compressor 2 (Distributed Processing Setup) User Manual

Chapter 4

Creating and Administering Clusters

Recovery and Failure Notification Features

The Apple Qmaster distributed processing system has a number of built-in features
designed to attempt recovery when a problem occurs, and to notify you when it attempts
a recovery.

Recovery Features

The recovery actions described next take place automatically when a failure occurs in the
Apple Qmaster distributed processing system. You, as the administrator, do not need to
enable or configure these features.

If a service stops unexpectedly
If either the cluster controller service or the processing service enabled on a service
node stops unexpectedly, the Apple Qmaster distributed processing system restarts the
service. To avoid the risk of endless stopping and restarting, the system restarts the
failed service a maximum of four times. The first two times, it restarts the service right
away. If the service stops abruptly a third or fourth time, the system restarts it only if
it had been running for at least 10 seconds before it stopped.
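
The restart rule above can be expressed as a small decision function. The following Python sketch is for illustration only and is not Apple Qmaster code: the function name, parameter names, and constant names are hypothetical, and only the limits (four restarts, two immediate restarts, 10 seconds of uptime) come from the text.

```python
# Illustrative sketch only; not part of Apple Qmaster. All names are hypothetical.
MAX_RESTARTS = 4            # from the text: at most four restarts
IMMEDIATE_RESTARTS = 2      # from the text: the first two restarts happen right away
MIN_UPTIME_SECONDS = 10.0   # from the text: uptime required for the third and fourth restart


def should_restart(restarts_so_far: int, uptime_seconds: float) -> bool:
    """Decide whether a service that just stopped should be restarted.

    restarts_so_far: number of restarts already performed for this service.
    uptime_seconds: how long the service ran before this stop.
    """
    if restarts_so_far >= MAX_RESTARTS:
        # The restart limit is used up; stop trying.
        return False
    if restarts_so_far < IMMEDIATE_RESTARTS:
        # First and second restart: no uptime condition.
        return True
    # Third and fourth restart: only if the service ran long enough.
    return uptime_seconds >= MIN_UPTIME_SECONDS
```

For example, under these assumptions a service that stops for the third time after running for 3 seconds is not restarted, while one that ran for 12 seconds is.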

If a batch is interrupted
When a service stops suddenly while in the middle of processing an Apple Qmaster
batch, the cluster controller resubmits the interrupted batch in a way that prevents the
reprocessing of any batch segments that were complete before the service stopped.
The cluster controller delays resuming the batch for about a minute from the time it
loses contact with the service.
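
As a rough illustration of this behavior, the Python sketch below selects only the unfinished segments for resubmission and waits before resuming. It is an assumption-laden sketch, not the actual cluster controller logic: the segment IDs, function names, and the exact 60-second value (the text says only "about a minute") are hypothetical.

```python
# Illustrative sketch only; not part of Apple Qmaster. All names are hypothetical.
RESUME_DELAY_SECONDS = 60.0  # "about a minute" in the text; the exact value is an assumption


def segments_to_resubmit(segments: dict[str, bool]) -> list[str]:
    """Return the IDs of segments that still need processing.

    segments maps a segment ID to True if that segment was complete
    before the service stopped, and False otherwise.
    """
    return [segment_id for segment_id, complete in segments.items() if not complete]


def can_resume(seconds_since_contact_lost: float) -> bool:
    """Return True once the delay since contact was lost has elapsed."""
    return seconds_since_contact_lost >= RESUME_DELAY_SECONDS
```

With a batch whose first segment finished before the interruption, only the remaining segments would be returned for reprocessing.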

If a batch fails
When the service is running but a batch fails to process, a service exception occurs.
When this happens, the cluster controller resubmits the batch immediately. It resubmits
the batch a maximum of two times. If the job fails on the third submission (the original
submission plus two resubmissions), the distributed processing system stops resubmitting
the job. In the Batch Monitor, the job is moved to the History table, where the status
column indicates that a failure occurred.
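
The resubmission limit can be pictured as a bounded retry loop. The Python sketch below is illustrative and is not Apple Qmaster code: the function name, the `process` callback, and the status strings are hypothetical, and only the limit of two resubmissions comes from the text.

```python
# Illustrative sketch only; not part of Apple Qmaster. All names are hypothetical.
from typing import Callable

MAX_RESUBMISSIONS = 2  # from the text: at most two resubmissions after the first attempt


def run_with_resubmission(process: Callable[[], bool]) -> tuple[str, int]:
    """Call process() until it succeeds or the resubmission limit is reached.

    process returns True on success and False on failure.
    Returns a (status, submissions) pair, where submissions counts the
    original submission plus any resubmissions.
    """
    submissions = 0
    while submissions < 1 + MAX_RESUBMISSIONS:
        submissions += 1
        if process():
            return ("succeeded", submissions)
    # Three submissions failed; no further resubmission is attempted.
    return ("failed", submissions)
```

In this sketch, a "failed" result after three submissions corresponds to the point at which the manual says the failure appears in the History table.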
