GPU Allocation using SLURM

All users must use Slurm to request GPUs to prevent resource contention.

Slurm is used on all the PICSL lambda machines to handle GPU allocation fairly. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters, similar in function to LSF on the BSC PMACS cluster.

Slurm provides srun for running interactive commands and sbatch for non-interactive commands.

Interactive Sessions

If you want to run an interactive task (e.g., train a deep learning network from the command line) using one GPU, execute

srun --gres gpu:1 --pty /bin/bash

This will launch a terminal for you on one of the GPU machines and set the environment variable CUDA_VISIBLE_DEVICES to the number of the GPU that has been allocated to you, e.g., CUDA_VISIBLE_DEVICES=1. It is a very good idea to use a terminal manager like tmux or screen to ensure a persistent session.
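
Once inside the session, it can help to confirm which GPU was granted before starting work. The following is a minimal sketch, assuming a bash shell; the function name list_allocated_gpus is hypothetical and is not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print one allocated GPU index per
# line, given a CUDA_VISIBLE_DEVICES value as the first argument.
list_allocated_gpus() {
    local devices="$1"
    if [ -z "$devices" ]; then
        echo "no GPU allocated" >&2
        return 1
    fi
    # Split the comma-delimited list into one index per line.
    echo "$devices" | tr ',' '\n'
}

# Inside an srun session this prints the indices Slurm assigned, if any.
list_allocated_gpus "${CUDA_VISIBLE_DEVICES:-}" || true
```

If the function reports that no GPU is allocated, the shell was most likely started without --gres.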

Non-Interactive Sessions

If you want to run a non-interactive script or a program on three GPUs, execute

sbatch --gres gpu:3 myprogram

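and Slurm will allocate three GPUs for myprogram and set the environment variable CUDA_VISIBLE_DEVICES to a comma-delimited list of the allocated GPUs, e.g., CUDA_VISIBLE_DEVICES="1,2,3" for /dev/nvidia1, /dev/nvidia2, /dev/nvidia3. Note that sbatch expects myprogram to be a batch script that begins with an interpreter line such as #!/bin/bash; a compiled binary has to be called from such a script.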
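
As an illustration of what myprogram could look like, here is a minimal sketch of a batch script. The #SBATCH lines carry the same options that would otherwise go on the sbatch command line; the output file names and the count_gpus helper are assumptions made for this example, not site conventions.

```shell
#!/bin/bash
#SBATCH --gres=gpu:3
#SBATCH -o myprogram-%j.out
#SBATCH --error=myprogram-%j.err

# Hypothetical helper: count the GPUs named in a comma-delimited
# CUDA_VISIBLE_DEVICES value (an empty value counts as zero).
count_gpus() {
    echo "$1" | awk -F',' '{ print NF }'
}

echo "GPUs visible to this job: ${CUDA_VISIBLE_DEVICES:-none}"
echo "GPU count: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"

# Replace this comment with the real workload, e.g. a training command.
```

With the options in the script, the job can be submitted as "sbatch myprogram". Slurm replaces %j in the file names with the job ID.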

For PMACS/LSF Users

The equivalents of your familiar commands ibash and xbash are:

ibash = srun --pty bash
xbash = salloc --nodelist lambda-recon bash -c 'ssh -Y $(scontrol show hostnames | head -n 1)'

Additional Options

A more complete synopsis of srun and sbatch is:

srun [-c N] [--gres gpu:N] [--{mem|mem-per-cpu|mem-per-gpu} N[K|M|G|T]] [--nodelist lambda-{picsl|recon|clam}] [--pty] command
sbatch [-c N] [--gres gpu:N] [--{mem|mem-per-cpu|mem-per-gpu} N[K|M|G|T]] [-o stdout-file] [--error stderr-file] command

By default, GPUs requested with --gres gpu:N are allocated exclusively to your job; no other job can use them until yours finishes.
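
To show how the options in the synopsis combine, here is a minimal sketch that only prints an srun command line instead of running it. The function name print_srun_cmd and the example values (4 CPUs, 1 GPU, 16G of memory) are assumptions chosen for illustration; the node name comes from the list above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): print an srun command line for an
# interactive shell with the requested CPUs, GPUs, memory, and node.
# Arguments: cpus gpus mem node   (e.g. 4 1 16G lambda-recon)
print_srun_cmd() {
    local cpus="$1" gpus="$2" mem="$3" node="$4"
    echo "srun -c $cpus --gres gpu:$gpus --mem $mem --nodelist $node --pty /bin/bash"
}

# Print the command for review; copy it to the prompt to run it.
print_srun_cmd 4 1 16G lambda-recon
```

Printing the command first makes it easy to check the resource request before it reaches the scheduler.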

See https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/ for more complete documentation.

-Gaylord