6 using slurm, 1 introduction, 2 slurm commands – HP XC System 2.x Software User Manual

Page 71: Table 6-1: slurm commands, Chapter 6, Using slurm


6 Using SLURM

6.1 Introduction

HP XC uses the Simple Linux Utility for Resource Management (SLURM) for system resource
management and job scheduling. SLURM is a reliable, efficient, open source, fault-tolerant
job and compute resource manager with features that make it suitable for large-scale,
high-performance computing environments. SLURM can report on machine status and perform
partition management, job management, and job scheduling.

The SLURM Reference Manual is available on the HP XC Documentation CD-ROM and from
the following Web site:

http://www.llnl.gov/LCdocs/slurm/

As a system resource manager, SLURM has the following key functions:

Allocate exclusive and/or non-exclusive access to resources (compute nodes) to users for
some duration of time so they can perform work

Provide a framework for starting, executing, and monitoring work (normally a parallel
job) on the set of allocated nodes

Arbitrate conflicting requests for resources by managing a queue of pending work
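The three functions above can be illustrated with a toy model. The sketch below is not SLURM's implementation; the class name ToyResourceManager, its methods, the node names, and the strict first-come, first-served policy are all assumptions made purely for illustration. It allocates exclusive nodes to a job, starts the job when enough nodes are free, and holds conflicting requests in a pending queue.

```python
from collections import deque


class ToyResourceManager:
    """Hypothetical, simplified model of a resource manager (not SLURM code)."""

    def __init__(self, nodes):
        self.free = list(nodes)   # nodes not allocated to any job
        self.running = {}         # job name -> list of allocated nodes
        self.pending = deque()    # queued (job name, node count) requests

    def submit(self, job, node_count):
        """Queue a request, then start whatever can run."""
        self.pending.append((job, node_count))
        self._schedule()

    def finish(self, job):
        """Release a job's nodes and let pending work start."""
        self.free.extend(self.running.pop(job))
        self._schedule()

    def _schedule(self):
        # Assumed policy: strict first-come, first-served. The request at the
        # head of the queue blocks later requests until enough nodes are free.
        while self.pending and self.pending[0][1] <= len(self.free):
            job, node_count = self.pending.popleft()
            self.running[job] = [self.free.pop(0) for _ in range(node_count)]


rm = ToyResourceManager(["n1", "n2", "n3", "n4"])
rm.submit("jobA", 3)      # starts at once on n1, n2, n3
rm.submit("jobB", 2)      # only one node is free, so jobB waits
print(rm.running)         # {'jobA': ['n1', 'n2', 'n3']}
print(list(rm.pending))   # [('jobB', 2)]
rm.finish("jobA")         # frees three nodes, so jobB can now start
print(rm.running)         # {'jobB': ['n4', 'n1']}
```

A real resource manager also weighs priorities, partitions, and time limits; this model keeps only the allocate, start, and queue behavior named in the list.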

Section 1.4.3 describes the interaction between SLURM and LSF.

6.2 SLURM Commands

Users interact with SLURM through its command line utilities. SLURM has the following basic
commands: srun, scancel, squeue, sinfo, and scontrol, which can run on any node in the
HP XC system. These commands are summarized in Table 6-1 and described in the following
sections.

Table 6-1: SLURM Commands

Command    Function

srun       Submits jobs to run under SLURM management. srun is used to submit a job
           for execution, allocate resources, attach to an existing allocation, or
           initiate job steps. srun can:

           - Submit a batch job and then terminate
           - Submit an interactive job and then persist to shepherd the job as it
             runs
           - Allocate resources to a shell and then spawn that shell for use in
             running subordinate jobs

squeue     Displays the queue of running and waiting jobs (or "job steps"), including
           the JobID (used for scancel), and the nodes assigned to each running job.
           It has a wide variety of filtering, sorting, and formatting options. By
           default, it reports the running jobs in priority order and then the
           pending jobs in priority order.

scancel    Cancels a pending or running job or job step. It can also be used to send
           a specified signal to all processes on all nodes associated with a job.
           Only job owners or administrators can cancel jobs.
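The default squeue ordering and the scancel permission rule described in Table 6-1 can be sketched in a few lines. This is an illustrative model only, not SLURM code: the Job record, its fields, the convention that a larger number means higher priority, the "root" administrator name, and the helper function names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    owner: str
    state: str      # "RUNNING" or "PENDING"
    priority: int   # assumption: a larger value means higher priority


def default_queue_order(jobs):
    """Running jobs in priority order, then pending jobs in priority order."""
    def key(job):
        state_rank = 0 if job.state == "RUNNING" else 1
        return (state_rank, -job.priority)
    return sorted(jobs, key=key)


def may_cancel(user, job, admins=frozenset({"root"})):
    """Only the job owner or an administrator may cancel a job."""
    return user == job.owner or user in admins


jobs = [
    Job(101, "alice", "PENDING", 50),
    Job(102, "bob", "RUNNING", 10),
    Job(103, "alice", "RUNNING", 40),
    Job(104, "carol", "PENDING", 90),
]
print([j.job_id for j in default_queue_order(jobs)])  # [103, 102, 104, 101]
print(may_cancel("bob", jobs[0]))    # False: bob does not own job 101
print(may_cancel("root", jobs[0]))   # True: administrators may cancel any job
```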
