Slurm Batch System

The Slurm system manages the department batch queue. Slurm runs jobs on the departmental and research compute nodes.

For a video lecture discussing the CS Cluster and slurm, visit the Cluster Class link.

You can also access the repository directly at the following location.

https://gitlab.cs.duke.edu/wjs/cs-cluster-talk


The CS Slurm setup organizes compute resources into two queues.

  • The compsci queue contains all the nodes with access to the department NFS filesystem where most user home and project directories live. If unsure, use the compsci queue.
  • The GPU hosts are in the compsci-gpu queue. Please do not submit general batch jobs there; those nodes are reserved for GPU computing.
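To target a queue explicitly, pass the partition name with the -p (or --partition) flag to sbatch. A sketch, where job.sh and gpu_job.sh are hypothetical script names:

```shell
# Submit a batch job to the compsci partition explicitly
sbatch -p compsci job.sh

# Submit a job to the compsci-gpu partition, requesting one GPU
sbatch -p compsci-gpu --gres=gpu:1 gpu_job.sh
```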

Additional queues exist to hold computers owned by specific research groups.

  • Donald Lab users can use the grisman queue for priority access to the grisman cluster.

All interaction with the queuing system must be done from one of the cluster head nodes. To access the head nodes, ssh to sbatch.cs.duke.edu using your NetID and NetID password.
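For example, assuming a NetID of netid123 (a placeholder):

```shell
# Log in to a cluster head node; authenticate with your NetID password
ssh netid123@sbatch.cs.duke.edu
```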

For the basics of Slurm operation, see the following sections.

Job scripts

All jobs submitted to Slurm must be shell scripts, and must be submitted from one of the cluster head nodes. Slurm scans the script text for option flags; the same flags can be given on the sbatch command line or embedded in the script. Lines in the script beginning with #SBATCH will be interpreted as containing Slurm flags.

The following job runs the program hostname. The script passes Slurm the -D flag to run the job in the current working directory where sbatch was executed. This is equivalent to running: sbatch -D . job.sh.

#!/bin/sh
#SBATCH -D .
#SBATCH --time=1
hostname
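Assuming the script above is saved as job.sh (the filename is an assumption), a typical submit-and-check cycle from a head node looks like:

```shell
sbatch -D . job.sh        # Slurm replies with the assigned job id
squeue -u $USER           # check the job's state while it is queued or running
cat slurm-<jobid>.out     # by default, stdout is written to slurm-<jobid>.out
```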

Defaults

By default, each job gets a time limit of 4 days and 30G of memory per node. If you need more, specify it in the parameters of the batch script.
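A script can override both defaults with additional #SBATCH lines; the limits and program name below are illustrative, not recommendations:

```shell
#!/bin/sh
#SBATCH --time=7-00:00:00   # seven days instead of the default four
#SBATCH --mem=64G           # 64G per node instead of the default 30G
./my_long_job               # hypothetical program
```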

scontrol show partition compsci

PartitionName=compsci
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=4-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=90-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=linux[1-50]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=1600 TotalNodes=50 SelectTypeParameters=NONE
   DefMemPerNode=30000 MaxMemPerNode=UNLIMITED

Examples

List running jobs
squeue
List jobs belonging to an account
squeue -A account
List a user's running jobs
squeue -u user -t RUNNING
List compute partitions
sinfo
List compute nodes
sinfo -N
Show a node's resource attributes
sinfo -Nel
Submit a job
sbatch script.sh
Interactive session on a GPU host
srun -p compsci-gpu --gres=gpu:1 --pty bash -i
Detailed job information
scontrol show jobid -dd jobid
Direct a job to node linux41, requesting 10G of memory per GPU, and pass the GPU_SET=2 environment variable to the program
export GPU_SET=2; sbatch -w linux41 --mem-per-gpu=10g -p compsci-gpu job.sh
Delete a job
scancel jobid

Here is a sample script that will run in the compsci partition.

#!/bin/csh -f
#SBATCH --mem=1G
#SBATCH --output=matlab.out
#SBATCH --error=slurm.err
matlab -nodisplay < myfile.m
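Assuming the script above is saved as matlab_job.sh (the filename is an assumption), submit it and check the files named in its #SBATCH directives:

```shell
sbatch matlab_job.sh
cat matlab.out    # program output, per --output
cat slurm.err     # error messages, per --error
```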

Please be aware that compute cluster machines are not backed up. Copy any important data to filesystems that are backed up to avoid losing it. In addition, be cognizant that this is a shared resource: minimize traffic to shared resources such as network disk space. If you need to read and write lots of data, copy it to local disk, compute the results there, and store the results on longer-term storage.
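The local-disk workflow described above can be sketched as a batch script; the paths, scratch location under /tmp, and program name are all assumptions for illustration:

```shell
#!/bin/sh
#SBATCH --time=60
#SBATCH --mem=8G

# Stage input data onto node-local disk to avoid hammering NFS
SCRATCH=$(mktemp -d /tmp/$USER.XXXXXX)
cp ~/project/input.dat "$SCRATCH"/

# Compute against the local copy
cd "$SCRATCH"
~/project/bin/analyze input.dat > results.out   # hypothetical program

# Store results back on longer-term storage, then clean up local disk
cp results.out ~/project/
rm -rf "$SCRATCH"
```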