
This document covers more advanced job submission options with an emphasis on submitting parallel jobs. Submission of parallel jobs to SGE requires the specification of a "parallel environment". A parallel environment is a construct that tells SGE how to allocate processors and nodes when a parallel job is submitted. A parallel environment can also run scripts that set up a proper environment for the job so there are times when a PE will be used for a specific application.

Types of parallel jobs

Before discussing what the SGE parallel environments look like, it is useful to describe the different types of parallel jobs that can be run. 

  1. Shared Memory. This is a type of parallel job that runs multiple threads or processes on a single multi-core machine. OpenMP programs are a type of shared memory parallel program.

    The OMP_NUM_THREADS variable is set to '1' by default. If your code can take advantage of threading, set OMP_NUM_THREADS equal to the number of job slots (cores) requested per node.

  2. Distributed Memory. This type of parallel job runs multiple processes over multiple processors with communication between them. This can be on a single machine but is typically thought of as going across multiple machines. There are several methods of achieving this via a message passing protocol but the most common, by far, is MPI (Message Passing Interface). There are several implementations of MPI but we have standardized on OpenMPI. This integrates very well with SGE by
    1. natively parsing the hosts file provided by SGE
    2. yielding process control to SGE
  3. Hybrid Shared/Distributed Memory. This type of parallel job uses distributed memory parallelism across compute nodes, and shared memory parallelism within each compute node. There are several methods for achieving this but the most common is OpenMP/MPI.

Shared memory

Shared memory jobs are fairly straightforward to set up with SGE. There is a shared memory parallel environment called smp. This PE is set up to ensure that all of the requested slots reside on a single node.

It is important, therefore, to know how many slots are available in the machines. For example, a node with 16 cores will have 16 slots available in the queue instances of that host. Note that it may be necessary to specify the number of processors and/or threads that your computation will use, either in the input file or in an environment variable. Check the documentation of your software package for instructions on that. Once the number of processors and/or threads is determined, simply pass that information to SGE with the smp parallel environment.

qsub -pe smp 12 myscript.job
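
For an OpenMP code, the thread count can be set in the job script to match the slot request. The following is a minimal sketch, assuming an OpenMP program named myprogram.

OpenMP job script (sketch)
#!/bin/sh
#$ -pe smp 12
# use all 12 requested cores for OpenMP threads
export OMP_NUM_THREADS=12
./myprogram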

Distributed memory

Distributed memory jobs are a little more involved to set up than shared memory jobs, but they are well supported in SGE, particularly using MPI. The MPI implementations will get the hostfile (also called the machinefile) from SGE, as well as the number of slots/cores requested. As long as the number of processor cores requested is equal to the number of processor cores that you want the job to run on, an MPI job script can be as simple as

Basic MPI script
#!/bin/sh
mpirun myprogram

In the above case, the number of processors only needs to be specified as a parameter to the SGE parallel environment at submission time.
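
For example, if the script above is saved as mpi.job (an illustrative name), it could be submitted with one of the parallel environments described below; this sketch assumes 16-core nodes and the 16cpn PE.

qsub -pe 16cpn 32 mpi.job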

If you want to run your job with fewer processors than specified via SGE, or if you want control over how processes are distributed on the allocated nodes, then you will need to use a PE that fixes the number of cores per node. In the case of OpenMPI, there are also mpirun flags that can override the SGE allocation rules.

There are several SGE parallel environments that can be used for distributed memory parallel jobs; MPI is the most common case. Note that parallel applications must make use of the host/machine file that SGE makes available. The parallel environments are:

  • orte (Open Run Time Environment)
    This will grab whatever slots are available anywhere on the cluster. 

    Make sure this is what you want as it is usually not the best option.

  • Xcpn, where X is the number of cores per node
  • various application specific parallel environments

The orte PE is a common parallel environment normally associated with OpenMPI. While common, it has a very limited set of use cases and is generally not the best choice. With it you specify the number of slots that your job needs and SGE will find available slots anywhere on the cluster, which may reside on nodes where other jobs fill the remaining slots. It is useful if you do not care whether your job processes run on nodes shared with other users' job processes and just need slots that can be allocated as quickly as possible. However, it is entirely possible that a job sharing a node with yours could bring the node, and your job, down, so please consider this when requesting the orte PE. Since MPI jobs tend to be timing sensitive, it is usually desirable to run them on nodes that are not shared with other job processes. This is the purpose of the Xcpn parallel environments. These parallel environments will only allocate nodes that have all slots free, which means the total number of slots requested must be a multiple of X. For example, to request 32 processor cores for an MPI job on 16-core nodes:

qsub -pe 16cpn 32 myscript.job

This will request exactly 2 nodes.

If your job can only take advantage of 24 processes but you want to make sure that you are allocated entire nodes, you can submit as below.

Using fewer processes than requested slots
#!/bin/sh
#$ -pe 16cpn 32 
mpirun -n 24 -bynode myprogram

That would allocate two nodes with 32 cores total but would only use 12 cores on each node. The -n 24 flag to mpirun limits the number of processes launched, and the -bynode option distributes them in round-robin fashion, 12 processes to the first node and 12 to the second. With OpenMPI, there are additional mpirun parameters that can further control the distribution. As far as SGE is concerned there are 32 slots allocated on those nodes, so they will not be used for anything else. Something like the above may also be needed if your job cannot run on all of the processors of a node due to memory constraints.
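
As one example of those additional parameters, OpenMPI's -npernode flag can make the per-node placement explicit rather than relying on round-robin distribution. The following is a sketch under the same 16cpn allocation.

#!/bin/sh
#$ -pe 16cpn 32
# place exactly 12 of the 24 processes on each of the 2 allocated nodes
mpirun -n 24 -npernode 12 myprogram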

The orte PE cannot be used for jobs that need to run fewer processes than the number of slots allocated. This is because the distribution of job slots across nodes is unknown a priori.

Hybrid Shared/Distributed memory

This type of parallel job is the most complex to set up. As OpenMP/MPI is the most common type of hybrid job, it will be used as the example. The first thing to determine is the appropriate mix of OpenMP threads and MPI processes. That will be very application dependent and may require some experimentation. Once these are determined, the number of OpenMP threads to use per process needs to be specified with the OMP_NUM_THREADS environment variable. The number of MPI processes is specified as an option to the mpirun command. The total number of threads and processes is specified as the total number of slots for the SGE parallel environment.

Since one cannot know a priori how many nodes will be allocated using the orte PE, it will not work for hybrid parallel applications.

If it is determined that a 32 core job should run with 2 MPI processes and 16 OpenMP threads per process, the job script would look like

myscript.job
#!/bin/sh
# Example hybrid parallel setup
#$ -pe 16cpn 32
export OMP_NUM_THREADS=16
mpirun -x OMP_NUM_THREADS -n 2 -bynode myprogram

Other parallel environments

In addition to the SGE parallel environments mentioned above, there are a few other special purpose PEs, mostly for commercial software.

  • fluent → for running Ansys Fluent. This has hooks in it to handle fluent startup/shutdown, as well as limits for license tokens.
  • gaussian-sm → This is for running Gaussian09 in shared memory mode.
  • gaussian-linda → This is for running Gaussian09 in distributed memory and hybrid shared/distributed memory modes.

Resource requests

The Argon cluster is a heterogeneous cluster, meaning that it consists of different node types with varying amounts and types of resources. There are many resources that SGE keeps track of and most of them can be used in job submissions. However, the resource designations for machines based on CPU type, memory amount, and GPU are the most likely to be used in practice. Note that there can be very different performance characteristics for different types of GPUs and CPUs. The Argon cluster is split between two data centers:

  • ITF → Information Technology Facility
  • LC → Lindquist Center

As we expand the capacity of the cluster, the datacenter selection could be important for multi-node jobs, such as those that use MPI, that require a high speed interconnect fabric. The compute nodes used in those jobs would need to be in the same data center. Currently, all nodes with the OmniPath fabric are located in the LC datacenter, and all nodes with the InfiniPath fabric are in the ITF datacenter. If you are running MPI jobs in a queue with multiple node types, make sure that a single job does not run across mixed fabrics. The best way to do that is to include a fabric resource request in your job submission.
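
For example, a multi-node MPI job could be restricted to the OmniPath nodes; this is a sketch using the fabric resource listed in the table below.

qsub -pe 16cpn 32 -l fabric=omnipath myscript.job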

The investor queues have more limited variability in machine types. However, the all.q queue contains all machines, and when running jobs in that queue it may be desirable to request specific machine types. The following table lists the options for the more common resource requests. They are selected with the '-l resource' flag to qsub. Note that these requests filter nodes, restricting the pool of available nodes to a more specific subset. They are only required if you have a need for specific resources.

Full Resource Name      Shortcut Resource Name    Notes
std_mem (deprecated)    sm (deprecated)           use mem_128G
mid_mem (deprecated)    mm (deprecated)           use mem_256G
high_mem (deprecated)   hm (deprecated)           use mem_512G
mem_64G                 64G
mem_96G                 96G
mem_128G                128G
mem_192G                192G
mem_256G                256G
mem_512G                512G
cpu_arch                cpu_arch                  broadwell, skylake_silver, skylake_gold, sandybridge, ivybridge
datacenter              dc                        ITF, LC
fabric                  fabric                    none*, omnipath, infinipath
gpu                     gpu                       Select any available GPU type.
gpu_k20                 k20
gpu_k80                 k80
gpu_p100                p100
gpu_p40                 p40
gpu_titanv              titanv
gpu_1080ti              1080ti
ngpus                   ngpus                     Specify the number of GPU devices that you wish to use

* no high speed interconnect fabric
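
As an illustration, several of these filters can be combined in one submission. This is a sketch; the resource names and values come from the table above, and the memory resources are assumed to be Boolean, like the GPU resources described below.

qsub -pe smp 16 -l mem_256G=true -l cpu_arch=skylake_gold -l dc=LC myscript.job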

GPU resources

If you wish to use a compute node that contains a GPU then it must be explicitly requested in some form. The table above lists the Boolean resources for selecting a specific GPU type, as well as the generic gpu resource for selecting any available type.

For example, if you run a job in the all.q queue and want to use a node with a GPU, but do not care which type,

qsub -l ngpus=1

If you specifically wanted to use a node with a P100 GPU,

qsub -l gpu_p100=true

or use the shortcut,

qsub -l p100=true

In all cases, requesting any of the GPU Boolean resources will set the ngpus resource value to 1 to signify to the scheduler that 1 GPU device is required. If your job needs more than one GPU then that can be specified explicitly with the ngpus resource. For example,

qsub -l ngpus=2


Note that requesting one of the *-GPU queues will automatically set ngpus=1 if that resource is not otherwise set. However, you will have to know what types of GPUs are in those queues if you need a specific type. Investor queues that have a mix of GPU and non-GPU nodes (i.e., those without the -GPU suffix) require an explicit GPU request. Since ngpus is a consumable resource, once a GPU device is in use it is not available to other jobs on that node until it is freed up. If you wish to run non-GPU jobs on the node in tandem with a GPU job, then specify gpu=false for the non-GPU job(s).
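
A minimal sketch of such a non-GPU submission:

qsub -l gpu=false myscript.job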


In addition to the ngpus resource there are some other non-Boolean resources for GPU nodes that could be useful to you. With the exception of requesting free memory on a GPU device, these are informational.

Resource                Description                                     Requestable
gpu.ncuda               number of CUDA GPUs on the host                 NO
gpu.nopencl             number of OpenCL GPUs on the host               NO
gpu.ndev                total number of GPUs on the host                NO
gpu.cuda.N.mem_free     free memory on CUDA GPU N                       YES
gpu.cuda.N.procs        number of processes on CUDA GPU N               NO
gpu.cuda.N.clock        maximum clock speed of CUDA GPU N (in MHz)      NO
gpu.cuda.N.util         compute utilization of CUDA GPU N (in %)        NO
gpu.opencl.N.clock      maximum clock speed of OpenCL GPU N (in MHz)    NO
gpu.opencl.N.mem        global memory of OpenCL GPU N                   NO
gpu.names               semi-colon-separated list of GPU model names    NO

For example, to request a node with at least 2G of memory available on the first GPU device:

qsub -l gpu.cuda.0.mem_free=2G

When there is more than one GPU device on a node, your job will only be presented with unused devices. Thus, if a node has two GPU devices and your job requests one, ngpus=1, then the job will only see a single free device. If the node is shared, then a second job requesting a single GPU will only see the device that is left available. Thus, you should not have to specify which GPU device to use for your job.

Memory

There are times when it is necessary to manipulate memory usage via a request for more job slots (processor cores) than your job will use. Job slots are one of the resources that are requested when submitting a job to the system. As a general rule, the number of job slots requested should be equal to or greater than the number of processes/threads that will actually consume resources. For example, a node that has 64G of memory and 16 processor cores has ~4G of memory per core. For serial jobs that need more than the amount provided per core on the respective systems, you will need to use the smp PE (parallel environment). For MPI jobs, you will have to use one of the PEs that request whole nodes. The orte PE should absolutely not be used in cases where you have to request more job slots than you will use.

Jobs that do not request whole compute nodes in the UI and all.q queues will have per-process memory limits set by default. If a job exceeds its memory limit it will abort, and it will need to be resubmitted with a slot request sufficient to allocate the memory needed for the job. If your job is terminated, it may be because it hit the memory limit. To verify, run the following command

qacct -j $JOBID

and look for the maxvmem field. If it is at the memory limit, then you will need to modify your submission. To monitor memory usage during a job, use the following command:

qstat -j $JOBID | grep usage

Look at the vmem and maxvmem entries to see if these are getting near a limit. For example

qstat -j 411298|grep usage
usage    1:                 cpu=00:23:19, mem=246.85762 GBs, io=0.01576, vmem=433.180M, maxvmem=433.180M

Note that the memory usage info is for the entire job. If your job is using multiple processes, then adjust accordingly. If your job requested more job slots than it uses, base the per-process memory usage on the number of processes actually running.

There are several ways of requesting memory resources. If you have a large memory job then you may want to make sure that there is enough free memory on a machine before attempting to schedule the job. Note that these are not limits and are not enforced but only refer to the amounts at the time of scheduling.

#$ -l mf=2G -- request machines with 2G of free memory.
#$ -l sf=2G -- request machines with 2G of free swap.
#$ -l mt=12G -- request machines with a total of 12G of memory.

If the job is a serial (single processor) job and uses more memory than the amount available for a single core, then you will need to request more job slots (processor cores). This is done by specifying a parallel environment to get more cores and the memory that is loosely associated with them. Since a serial job only runs on a single machine, the smp PE is the one to use. The choice is whether to allow the node to be shared with other jobs. If node sharing is okay, then the following will work for a serial job that uses 6GB of memory on a 64G machine.

#$ -pe smp 2
myprogram ...

If everyone else running jobs on the node is requesting the appropriate number of processor cores as above, then all jobs should fit in memory. For MPI jobs the same principle applies, but note that the orte PE should not be used as the node allocation cannot be controlled. So a 12 process MPI job that uses 8GB per process would have to use a request like the following on 64G nodes.

#$ -pe 16cpn 32
mpirun -n 12 -bynode program ...

The above would allocate (2) 16-core nodes and put 6 processes on each node for a total of 48G used per node. There can be no node sharing in this scenario.

hosts

# to run on a specific machine
#$ -l h=compute-2-36

Queues

Queue requests are governed by access lists. Please make sure you have access to a queue before requesting it, or your job will fail to launch. To see which queues you have access to, run the following command.

whichq
# to request the CGRER investor queue.
#$ -q CGRER

If you have access to multiple queues then you can combine the resources by specifying multiple queues for the queue resource.

#$ -q CGRER,COE

One should not specify the all.q queue in a queue submission request along with an investor queue. In that scenario, the processes that land in the investor queue could immediately evict processes that land in the all.q queue, which will have the undesired effect of suspending the processes in the investor queue as well.

Also, the development queue should only be used for testing and not used for production jobs.

Job priorities

If you have not been using the cluster much recently, your jobs will likely have a higher priority than those of users who are submitting lots of jobs, and your job should move towards the front of the queue. Conversely, if you have been submitting a lot of jobs, your job priorities will be lowered. Priority also depends on the number of job slots requested, as parallel jobs get a higher priority in proportion to the number of requested slots. This is done in an effort to ensure that larger slot count jobs do not get starved while waiting for resources to free up.

Checkpoint/Restart

Many applications that run long-running computations will create checkpoint files. These checkpoint files are then used to restart a job at the point where the checkpoint was last written. This capability is particularly useful for jobs submitted to the all.q queue, as those jobs may be evicted so that investor jobs can run. SGE provides a mechanism to take advantage of checkpoints if your application supports them. SGE creates an environment variable (RESTARTED) that can be checked if the user specified that a job is restartable. To take advantage of the application's checkpoint capability, the job script should check the value of the $RESTARTED environment variable. A simple "if...then" block should be sufficient.

if [ "$RESTARTED" = "1" ]; then
    # some commands to set up the computation for a restart
fi

Additionally, the job needs to be submitted with the '-ckpt user' option to the qsub command. This is what tells SGE that it is possible to restart a job. Without that option, SGE will not look for the $RESTARTED environment variable. Even if your application does not perform checkpoint operations, the checkpoint/restart capability of SGE can still be used to restart your job. In this case, it would just restart from the beginning. This has the effect of migrating the job to available nodes (if there are any) after a job eviction or simply re-queueing the job. 
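
For example, a restartable job could be submitted like this (myscript.job stands in for a job script that checks $RESTARTED as above):

qsub -ckpt user myscript.job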

Not all jobs can be restarted and those that can need to be set up to do so. It is your responsibility to set up the commands in the job script file that will allow the computation to restart. In some cases, this is simply using the latest output file as an input file. In other cases an input file with different options is needed.

Job Dependencies

There are times when you want to make sure one job completes before another one starts. This can be accomplished by specifying job dependencies with the -hold_jid flag of qsub. That flag takes a JOB_ID as a parameter; the job with that JOB_ID must complete before the current job submission can be launched. So, assuming that job A needs to complete before job B can begin,

qsub test_A.job
Your job 3808141 ("test_A.job") has been submitted

That will return a JOB_ID in the output. Use that in the subsequent job submission

qsub -hold_jid 3808141 test_B.job

You will see something like the following in the qstat output

qstat -u gpjohnsn
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
3808141 0.50233 test_A.job gpjohnsn     r     01/15/2014 15:40:30 sandbox@compute-6-174.local        1        
3808142 0.00000 test_B.job gpjohnsn     hqw   01/15/2014 15:40:53                                    1        

The test_B job will not begin until the test_A job is complete. There is a handy way to capture the JOB_ID automatically using the -terse flag of qsub.

hold_jid=$(qsub -terse test_A.job)
qsub -hold_jid $hold_jid test_B.job

The first line will capture the output of the JOB_ID and put it in the hold_jid variable. You can use whatever legal variable name you want for that. The second line will make use of the previously stored variable containing the JOB_ID.

If the first job is an array job you will have to do a little more filtering on the output from the first job.

hold_jid=$(qsub -terse test_A.job | awk -F. '{print $1}')
qsub -hold_jid $hold_jid test_B.job

If both jobs are array jobs then you can specify array dependencies with the -hold_jid_ad flag to qsub. In that case the array jobs have to be the same size, and the dependencies are on the tasks such that when task 1 of test_A is complete, task 1 of test_B can start.
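
A sketch of this, assuming both test_A.job and test_B.job are submitted as array jobs of the same size (the -t range here is illustrative):

hold_jid=$(qsub -terse -t 1-10 test_A.job | awk -F. '{print $1}')
qsub -t 1-10 -hold_jid_ad $hold_jid test_B.job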
