8 job accounting, 9 fault tolerance, 10 security – HP XC System 2.x Software User Manual



Example 6-8: Reporting Reasons for Downed, Drained, and Draining Nodes

$ sinfo -R
REASON            NODELIST
Memory errors     dev[0,5]
Not Responding    dev8
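
When a script needs only the node list for one reason, output like that in Example 6-8 can be filtered. The following is a minimal sketch and not part of the HP XC software: the function name nodes_for_reason is hypothetical, and it assumes that the reason text starts each line and that the node list is the last field, as in Example 6-8. The column spacing in the sample text is also an assumption.

```shell
# Hypothetical helper: print the NODELIST field for lines whose REASON
# column starts with the given text. Reads `sinfo -R`-style output on stdin.
nodes_for_reason() {
    awk -v reason="$1" 'NR > 1 && index($0, reason) == 1 { print $NF }'
}

# Sample text taken from Example 6-8 (column spacing assumed).
sample='REASON            NODELIST
Memory errors     dev[0,5]
Not Responding    dev8'

printf '%s\n' "$sample" | nodes_for_reason "Memory errors"   # prints dev[0,5]
```

On a live system the same function could be fed directly, for example with sinfo -R | nodes_for_reason "Not Responding". Because the match is a simple prefix test, a reason that begins with the same text as another reason would also be selected.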

6.8 Job Accounting

HP XC System Software provides an extension to SLURM for job accounting. The sacct
command displays job accounting data in a variety of forms for your analysis. Job accounting
data is stored in a log file; the sacct command filters that log file to report on your jobs,
jobsteps, status, and errors. See your system administrator if job accounting is not configured
on your system.

You can find detailed information on the sacct command and job accounting data in the
sacct(1) manpage.

6.9 Fault Tolerance

SLURM can handle a variety of failure modes without terminating workloads, including crashes
of the node running the SLURM controller. User jobs may be configured to continue execution
despite the failure of one or more nodes on which they are executing (refer to Section 6.4.5.1 for
further information). The command controlling a job may detach from and reattach to the parallel
tasks at any time. A node allocated to a job is available for reuse as soon as the jobs allocated
to that node terminate. If some nodes fail to complete job termination in a timely fashion because
of hardware or software problems, only the scheduling of those tardy nodes is affected.

6.10 Security

SLURM has a simple security model:

Any user of the system can submit jobs to execute. Any user can cancel his or her own jobs.
Any user can view SLURM configuration and state information.

Only privileged users can modify the SLURM configuration, cancel any job, or
perform other restricted activities. Privileged users in SLURM include root users and
SlurmUser (as defined in the SLURM configuration file).

If permission to modify SLURM configuration is required by others, set-uid programs may
be used to grant specific permissions to specific users.

SLURM accomplishes security by means of communication authentication, job authentication,
and user authorization.

Refer to SLURM documentation for further information about SLURM security features.
