CSL: Sun Grid Engine at Duke Computer Science

The Sun Grid Engine (SGE) system manages the department batch queue. Grid Engine runs jobs on the departmental and research compute nodes.

The CS Grid Engine setup organizes compute resources into two queues.

  • The compsci queue contains all the nodes with access to the department NFS filesystem where most user home and project directories live. If unsure, use the compsci queue.
  • The GPU hosts are in the compsci-gpu queue, where GPUs are requested as a consumable resource (gpu). Please do not submit general batch jobs here; these hosts are reserved for GPU computing.

Two additional queues exist to hold computers owned by specific research groups.

  • Donald Lab users can use the grisman queue for priority access to the grisman cluster.
  • The architecture group can use the platypus queue to direct jobs to the platypus cluster, or the low-priority architecture queue to send jobs to both platypus and compsci hosts.

Jobs queued in compsci are low-priority jobs in SGE parlance. Low-priority jobs have the advantage that they can also run on the nodes owned by research groups, such as the architecture group or the Donald Lab. This gives low-priority jobs the largest pool of potential machines to run on. However, if a high-priority job is submitted when all resources are in use, a low-priority job is slowed down by 95% so that the high-priority job gets 95% of the CPU.
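
As a rough worked example of that slowdown, the sketch below estimates the wall-clock time of a low-priority job that has been pushed aside. The 5% share follows from the paragraph above; the 2 CPU-hour workload and the variable names are made up for illustration, and the estimate assumes the job stays slowed down for its entire run.

```shell
#!/bin/sh
# Hypothetical workload: a single-core job that needs 2 CPU-hours.
cpu_hours=2
# CPU share left to a low-priority job while a high-priority job runs.
share_percent=5

# Wall-clock hours if the job keeps only that share for its entire run.
wall_hours=$(( cpu_hours * 100 / share_percent ))
echo "${cpu_hours} CPU-hours at ${share_percent}% share: about ${wall_hours} wall-clock hours"
```

In practice the slowdown lasts only while the high-priority job is competing for the same node, so this is a worst-case figure.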

For the basics of Grid Engine operation, please see the following links

Job scripts

All jobs submitted to Grid Engine must be shell scripts, and must be submitted from one of the cluster machines. Grid Engine scans the script text for qsub option flags. The same flags can be given on the qsub command line or embedded in the script. Lines in the script beginning with #$ are interpreted as containing qsub flags.

The following job runs the program hostname. The script passes Grid Engine the -cwd flag so that the job runs in the directory that was current when qsub was executed. This is equivalent to putting the flag on the command line: qsub -cwd job.sh.

#!/bin/sh
#$ -cwd 

hostname
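
A slightly fuller job script might embed several flags at once. The sketch below is illustrative rather than a site recommendation: -N, -j, and -o are standard qsub options, but the job name and log file name are made up, and JOB_ID is the environment variable Grid Engine sets inside a running job.

```shell
#!/bin/sh
# Run in the directory the job was submitted from.
#$ -cwd
# Job name shown by qstat (hypothetical name).
#$ -N hostname_job
# Merge stderr into stdout and write both to one log file (hypothetical file name).
#$ -j y
#$ -o hostname_job.log

# JOB_ID is set by Grid Engine; fall back to "local" when run by hand.
job_label="${JOB_ID:-local}"
echo "job ${job_label} running on $(hostname)"
```

Because the #$ lines are ordinary comments to the shell, the script can also be run directly with sh as a quick check before it is submitted.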

Examples

List running jobs
qstat
List jobs belonging to a user
qstat -u user
List running jobs and MPI slaves
qstat -g t
List compute nodes
qhost
Show a node's SGE resource attributes
qhost -F -h linux1
Submit a job
qsub job.sh
Direct a job to a queue
qsub -q compsci job.sh
Direct a job to the GPU queue and request 2 GPUs
qsub -l gpu=2 -q compsci-gpu job.sh
Direct a job to a node
qsub -q compsci@linux1 job.sh
Delete a job
qdel [job number from qstat]
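
The commands above can be combined into a small submit-and-check helper. This is a minimal sketch, not a supported tool: extract_job_id is a hypothetical function, and it assumes qsub prints its usual confirmation line of the form Your job 12345 ("job.sh") has been submitted. Array jobs print a different message and are not handled.

```shell
#!/bin/sh
# Pull the numeric job ID out of qsub's confirmation line (read from stdin).
# Assumed message format: Your job 12345 ("job.sh") has been submitted
extract_job_id() {
    sed -n 's/^Your job \([0-9][0-9]*\).*/\1/p'
}

# Submit only where qsub and job.sh exist, so the sketch is harmless elsewhere.
if command -v qsub >/dev/null 2>&1 && [ -f job.sh ]; then
    job_id=$(qsub -q compsci job.sh | extract_job_id)
    echo "submitted job ${job_id}"
    # Show this user's jobs, including the one just submitted.
    qstat -u "$USER"
fi
```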

Here is a sample mpich2 job for Grid Engine. This script runs in the grisman_mpich2 parallel environment with 2 slave processes.

#!/bin/csh -f
# ---------------------------
# job name 
#$ -N MPI_Job
#
# pe request
#$ -pe grisman_mpich2 2
#
# Operate in current working directory
#$ -cwd
#
# ---------------------------

# The script runs under csh, so set the variable with setenv rather than export
setenv MPIEXEC_RSH /usr/bin/rsh

mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines my_mpiprogram
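
The script above relies on two things Grid Engine provides to a parallel job: NSLOTS, the number of slots granted, and a machine file under TMPDIR. On many Grid Engine MPI setups the parallel environment's start script builds that machine file from the host list in PE_HOSTFILE; whether the parallel environment here does so is an assumption. The sketch below, written in sh rather than csh, shows that conversion with a hypothetical helper, pe_hostfile_to_machines, which can help when debugging where MPI processes will land.

```shell
#!/bin/sh
# Expand a Grid Engine PE host file into one host name per granted slot.
# Assumed line format: <host> <slots> <queue> <processor range>
pe_hostfile_to_machines() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Inside a parallel job, report what was granted; do nothing elsewhere.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "slots granted: ${NSLOTS:-unknown}"
    pe_hostfile_to_machines "$PE_HOSTFILE"
fi
```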