HP XC System 2.x Software User Manual
1.4 Run-Time Environment

In the HP XC environment, LSF-HPC, SLURM, and HP-MPI work together to provide a powerful, flexible, and extensive run-time environment. This section describes LSF-HPC, SLURM, and HP-MPI, and how these components work together to provide the HP XC run-time environment.

1.4.1 SLURM

SLURM (Simple Linux Utility for Resource Management) is a resource management system that is integrated into the HP XC system. SLURM is suitable for use on both large and small Linux clusters. It was developed by Lawrence Livermore National Laboratory and Linux NetworX. As a resource manager, SLURM allocates exclusive or non-exclusive access to resources (application/compute nodes) for users to perform work, and provides a framework to start, execute, and monitor work (normally a parallel job) on the set of allocated nodes. A SLURM system consists of two daemons, one configuration file, and a set of commands and APIs. The central controller daemon, slurmctld, maintains the global state and directs operations. A slurmd daemon is deployed to each compute node and responds to job-related requests, such as launching jobs, signaling, and terminating jobs. End users and system software (such as LSF-HPC) communicate with SLURM by means of commands or APIs, for example to allocate resources, launch parallel jobs on allocated resources, and kill running jobs.
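As a concrete illustration, the following fragment shows typical user-level interaction with SLURM from the command line. These are standard SLURM commands; the task count and job ID shown are placeholders:

```shell
# Display the state of nodes and partitions known to SLURM
sinfo

# Launch a 4-task parallel job (here, simply hostname) on compute nodes
srun -n 4 hostname

# List running and pending jobs
squeue

# Cancel a job by its SLURM job ID (12345 is a placeholder)
scancel 12345
```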

SLURM groups compute nodes (the nodes where jobs are run) together into partitions. The
HP XC system can have one or several partitions. When HP XC is installed, a single partition
of compute nodes is created by default for LSF batch jobs. The system administrator has the
option of creating additional partitions. For example, another partition could be created for
interactive jobs.
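As a sketch of what such a configuration might look like, the fragment below shows hypothetical partition definitions in the SLURM configuration file (slurm.conf); the node names, counts, and time limit are illustrative, not taken from an actual HP XC installation:

```
# Default partition used by LSF-HPC for batch jobs
PartitionName=lsf Nodes=n[1-16] Default=YES State=UP

# Hypothetical additional partition for interactive work
PartitionName=interactive Nodes=n[17-20] MaxTime=60 State=UP
```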

1.4.2 Load Sharing Facility (LSF-HPC)

The Load Sharing Facility for High Performance Computing (LSF-HPC) from Platform
Computing Corporation is a batch system resource manager that has been integrated with
SLURM for use on the HP XC system. LSF-HPC for SLURM is included with the HP XC
System Software, and is an integral part of the HP XC environment. LSF-HPC interacts with
SLURM to obtain and allocate available resources, and to launch and control all the jobs
submitted to LSF-HPC. LSF-HPC accepts, queues, schedules, dispatches, and controls all the
batch jobs that users submit, according to policies and configurations established by the HP
XC site administrator. On an HP XC system, LSF-HPC for SLURM is installed and runs on
one HP XC node, known as the LSF-HPC execution host.
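For example, a user submits a batch job to LSF-HPC with the bsub command and monitors it with the other standard LSF commands. The script name, output file, and job ID below are placeholders:

```shell
# Submit a script as a batch job requesting 8 processors,
# writing standard output to myjob.out
bsub -n 8 -o myjob.out ./myjob.sh

# Check the status of submitted jobs
bjobs

# Kill a job by its LSF job ID (1001 is a placeholder)
bkill 1001
```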

A complete description of LSF-HPC is provided in Chapter 7. In addition, for your convenience,
the HP XC documentation CD contains LSF Version 6.0 manuals from Platform Computing.

1.4.3 How LSF-HPC and SLURM Interact

In the HP XC environment, LSF-HPC cooperates with SLURM to combine LSF-HPC’s
powerful scheduling functionality with SLURM’s scalable parallel job launching capabilities.
LSF-HPC acts primarily as a workload scheduler on top of the SLURM system, providing
policy and topology-based scheduling for end users. SLURM provides an execution and
monitoring layer for LSF-HPC. LSF-HPC uses SLURM to obtain system topology information, which informs its scheduling decisions, and to launch jobs on allocated resources.

When a job is submitted to LSF-HPC, LSF-HPC schedules the job based on its resource requirements and communicates with SLURM to allocate the required HP XC compute nodes for the job from the SLURM lsf partition. LSF-HPC provides node-level scheduling for parallel jobs, and CPU-level scheduling for serial jobs. Because of node-level scheduling, a parallel job may be allocated more CPUs than it requested, depending on its resource request; the srun or mpirun -srun launch commands within the job still honor the original CPU request.
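A sketch of this interaction, assuming a hypothetical system with 2 CPUs per node: a job requesting 6 processors may be allocated 3 whole nodes, yet the launch command inside the job still starts exactly the requested number of tasks. The application names are placeholders:

```shell
# Request 6 processors for an MPI job; LSF-HPC allocates whole nodes
# from the SLURM lsf partition, possibly more CPUs than requested
bsub -n 6 -I mpirun -srun ./my_mpi_app

# Likewise, inside a job script, srun launches 6 tasks
# even if the allocation contains more than 6 CPUs
srun -n 6 ./my_app
```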
