The Sun Grid Engine (SGE) system manages the department's batch queues, running jobs on the departmental and research compute nodes.
The CS SGE setup organizes compute resources into two queues.
- The compsci queue contains all the nodes with access to the department NFS filesystem where most user home and project directories live. If unsure, use the compsci queue.
- The GPU hosts are in the compsci-gpu queue (gpu consumables). Please do not submit general batch jobs to this queue; its nodes should be reserved for GPU computing.
Two additional queues exist to hold computers owned by specific research groups.
- Donald Lab users can use the grisman queue for priority access to the grisman cluster.
Jobs queued in compsci are low-priority jobs in SGE parlance. Low-priority jobs have the advantage that they can run on nodes owned by research groups, such as the architecture group or the Donald Lab. This gives low-priority jobs the largest pool of potential machines to run on. However, if a high-priority job is submitted when all resources are in use, a low-priority job will be throttled by 95% so that the high-priority job receives 95% of the CPU.
For the basics of Grid Engine operation, please see the following links
All jobs submitted to Grid Engine must be shell scripts, and must be submitted from one of the cluster machines. Grid Engine will scan the script text for qsub option flags. The same flags can be given on the qsub command line or embedded in the script. Lines in the script beginning with #$ will be interpreted as containing qsub flags.
The following job runs the program hostname. The script passes Grid Engine the -cwd flag, which runs the job in the working directory from which qsub was executed. This is equivalent to running: qsub -cwd job.sh.
#!/bin/sh
#$ -cwd
hostname
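Since lines beginning with #$ are ordinary comments to the shell, you can list a script's embedded qsub flags with plain text tools. A small sketch that recreates the sample script above as job.sh and prints its embedded flags:

```shell
#!/bin/sh
# Recreate the sample job script from above as job.sh.
cat > job.sh <<'EOF'
#!/bin/sh
#$ -cwd
hostname
EOF

# Lines starting with "#$" carry the embedded qsub options; strip the prefix.
grep '^#\$' job.sh | sed 's/^#\$ *//'   # prints: -cwd
```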
- List running jobs
- qstat
- List jobs belonging to a user
- qstat -u user
- List running jobs and MPI slaves
- qstat -g t
- List compute nodes
- qhost
- Show a node's SGE resource attributes
- qhost -F -h linux1
- Submit a job
- qsub job.sh
- Direct a job to a queue
- qsub -q compsci job.sh
- Direct a job to a queue where the 2nd GPU has 10 GB of free memory, while passing the environment variable GPU_SET=2 to the program
- qsub -l cuda.2.freeMemory=10000 -v GPU_SET=2 -q compsci job.sh
- Direct a job to a node
- qsub -q compsci@linux1 job.sh
- Delete a job
- qdel [job number from qstat]
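The commands above submit single jobs; SGE also supports task arrays via qsub -t (e.g. qsub -t 1-10 array_job.sh), where each task reads its index from $SGE_TASK_ID. A hedged sketch of such a script, defaulting the variable so it also runs outside SGE:

```shell
#!/bin/sh
#$ -cwd
# Sketch of an SGE array-job script; submit with: qsub -t 1-10 array_job.sh
# SGE sets $SGE_TASK_ID per task; default it so the script runs standalone too.
TASK=${SGE_TASK_ID:-1}
echo "processing chunk $TASK"
```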
Here is a sample mpich2 job on Grid Engine. This script runs in the grisman_mpich2 parallel environment with 2 slave processes.
#!/bin/csh -f
# ---------------------------
# job name
#$ -N MPI_Job
#
# pe request
#$ -pe grisman_mpich2 2
#
# Operate in current working directory
#$ -cwd
#
# ---------------------------
setenv MPIEXEC_RSH /usr/bin/rsh
mpiexec -rsh -nopm -n $NSLOTS -machinefile $TMPDIR/machines my_mpiprogram
You can use the program cluster_scan or qstat to monitor the cluster.
Please be aware that compute cluster machines are not backed up. Copy any important data to backed-up filesystems to avoid losing it. In addition, remember that this is a shared resource: please minimize traffic to shared resources such as network disk space. If you need to read and write lots of data, copy it to the node's local disk, compute the results there, and store the results on longer-term storage.
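A minimal sketch of the copy-compute-copy-back pattern described above. The input file name and the sort command are placeholders for real data and a real program; $TMPDIR is the per-job scratch directory SGE provides, with a /tmp fallback so the sketch runs standalone:

```shell
#!/bin/sh
#$ -cwd
# Sketch only: input.dat and sort stand in for real data and a real program.

# Placeholder input so the sketch runs standalone; in practice input.dat exists.
[ -f input.dat ] || echo "example data" > input.dat

# Stage input onto the node's local disk (SGE sets $TMPDIR per job).
SCRATCH=${TMPDIR:-/tmp}/stage.$$
mkdir -p "$SCRATCH"
cp input.dat "$SCRATCH/"

# Compute against the local copy instead of reading and writing over NFS.
( cd "$SCRATCH" && sort input.dat > output.dat )

# Move results back to longer-term (backed-up) storage, then clean up scratch.
cp "$SCRATCH/output.dat" .
rm -rf "$SCRATCH"
```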