
This document covers more advanced job submission options with an emphasis on submitting parallel jobs. Submitting parallel jobs to SGE requires the specification of a "parallel environment". A parallel environment (PE) is a construct that tells SGE how to allocate processors and nodes when a parallel job is submitted. A parallel environment can also run scripts that set up a proper environment for the job, so there are times when a PE is tied to a specific application.

Types of parallel jobs

Before discussing what the SGE parallel environments look like, it is useful to describe the different types of parallel jobs that can be run. 

  1. Shared Memory. This is a type of parallel job that runs multiple threads or processes on a single multi-core machine. OpenMP programs are a type of shared memory parallel program.

    The OMP_NUM_THREADS variable is set to '1' by default. If your code can take advantage of threading, set OMP_NUM_THREADS equal to the number of job cores per node requested.

  2. Distributed Memory. This type of parallel job runs multiple processes over multiple processors with communication between them. This can be on a single machine but is typically thought of as going across multiple machines. There are several methods of achieving this via a message passing protocol but the most common, by far, is MPI (Message Passing Interface). There are several implementations of MPI but we have standardized on OpenMPI. This integrates very well with SGE by
    1. natively parsing the hosts file provided by SGE
    2. yielding process control to SGE
  3. Hybrid Shared/Distributed Memory. This type of parallel job uses distributed memory parallelism across compute nodes, and shared memory parallelism within each compute node. There are several methods for achieving this but the most common is OpenMP/MPI.

Shared memory

Shared memory jobs are fairly straightforward to set up with SGE. There is a shared memory parallel environment called smp. This PE is set up to ensure that all slots requested reside on a single node.

It is important, therefore, to know how many slots are available in the machines. For example, a node with 16 cores will have 16 slots available in the queue instances of that host. Note that it may be necessary to specify the number of processors and/or number of threads that your computation will use either in the input file or an environment variable. Check the documentation of your software package for instructions on that. Once the number of processors and/or threads desired is determined simply pass that information to SGE with the smp parallel environment.

qsub -pe smp 12 myscript.job
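
For a threaded (OpenMP) program, the slot count passed to the smp PE should match the number of threads the program will use. A minimal job script sketch (myprogram stands in for your executable) might look like

Shared memory job script
#!/bin/sh
#$ -pe smp 12
# match OMP_NUM_THREADS to the number of slots requested
export OMP_NUM_THREADS=12
myprogram ...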

Distributed memory

Distributed memory jobs are a little more involved to set up than shared memory jobs but they are well supported in SGE, particularly using MPI. The MPI implementations will get the hostfile, also called the machinefile, from SGE, as well as the number of slots/cores requested. As long as the number of processor cores requested is equal to the number of processor cores you want the job to run on, an MPI job script could be as simple as

Basic MPI script
#!/bin/sh
mpirun myprogram

The number of processors to use only needs to be specified as a parameter to the SGE parallel environment in the above case.
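
Equivalently, the PE request can be embedded in the job script itself with a #$ directive (this sketch assumes the 16cpn PE described below), so the script can be submitted with a plain qsub myscript.job

Basic MPI script with embedded PE request
#!/bin/sh
#$ -pe 16cpn 32
mpirun myprogram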

If you want to run your job with fewer processors than requested from SGE, or if you want control of how processes are distributed on the allocated nodes, then you will need to use a PE that fixes the number of cores per node. In the case of OpenMPI, there are also mpirun flags that can override the SGE allocation rules.

There are several SGE parallel environments that can be used for distributed memory parallel jobs; MPI is by far the most common way of running them. Note that parallel applications must make use of the host/machine file that SGE makes available. The parallel environments are:

  • orte (Open Run Time Environment)
    This will grab whatever slots are available anywhere on the cluster. 

    Make sure this is what you want as it is usually not the best option.

  • Xcpn, where X is the number of cores per node
  • various application specific parallel environments

The orte PE is a common parallel environment normally associated with OpenMPI. While common, it has a very limited set of use cases and is generally not the best choice. With it you can specify the number of slots that you need for your job and it will find available job slots anywhere on the cluster, which may reside on nodes where other jobs are filling the remaining slots. It is useful if you do not care whether your job processes run on nodes with other users' job processes but just need available slots that can be allocated as quickly as possible. However, it is entirely possible that the jobs you are sharing a node with could bring the node, and your job, down, so please consider this when requesting the orte PE. Since MPI jobs tend to be timing sensitive, it is usually desirable to have them run on nodes that are not shared with other job processes. This is the purpose of the Xcpn parallel environments. These parallel environments will only allocate nodes that have all slots free, which means the total number of slots requested must be a multiple of X for the Xcpn PEs. So, to request an MPI job that uses 32 processor cores on 16-core nodes:

qsub -pe 16cpn 32 myscript.job

This will request exactly 2 nodes.

If your job can only take advantage of 24 processes but you want to make sure that you are allocated entire nodes, you can submit as below.

Using fewer slots than requested
#!/bin/sh
#$ -pe 16cpn 32 
mpirun -n 24 -bynode myprogram

That would allocate two nodes with 32 cores total but would only use 12 on each node. The -n 24 flag to mpirun will limit the number of processes launched and the -bynode option will distribute 12 processes to the first node and 12 to the second node in round-robin fashion. With OpenMPI, there are additional mpirun parameters that can further control the distribution. As far as SGE is concerned there are 32 slots allocated on those nodes so they will not be used for anything else. Something like the above may also be needed if your job can not run on all of the processors of a node due to memory constraints.

The orte PE cannot be used for jobs that need to run on fewer processes than allocated, because the distribution of job slots across nodes is unknown a priori.

Hybrid Shared/Distributed memory

This type of parallel job is the most complex to set up. As OpenMP/MPI is the most common type of hybrid job, that will be used as an example. The first thing to determine is the appropriate mix of OpenMP threads to MPI processes. That will be very application dependent and may require some experimentation. Once that is determined, the number of OpenMP threads to use per process is specified with the OMP_NUM_THREADS environment variable, the number of MPI processes is specified as an option to the mpirun command, and the total number of threads and processes is specified as the total number of slots for the SGE parallel environment.

Since one cannot know a priori how many nodes will be allocated using the orte PE, it will not work for hybrid parallel applications.

If it is determined that a 32 core job will need to run with 16 OpenMP threads and 2 MPI processes, the job script would look like

myscript.job
#!/bin/sh
# Example hybrid parallel setup
#$ -pe 16cpn 32
export OMP_NUM_THREADS=16
mpirun -n 2 -bynode myprogram

Other parallel environments

In addition to the SGE parallel environments mentioned above, there are a few other special purpose PEs, mostly for commercial software.

  • fluent → for running Ansys Fluent. This has hooks in it to handle fluent startup/shutdown, as well as limits for license tokens.
  • gaussian-sm → This is for running Gaussian09 in shared memory mode.
  • gaussian-linda → This is for running Gaussian09 in distributed memory and hybrid shared/distributed memory modes.

Resource Requests

If you need specific amounts of certain resources, it is possible to request those with SGE and limit the available nodes to those that meet your criteria. Some common options are

  • memory, either total or available, real, virtual or swap.
  • hostname if you need to run on a certain machine.
  • a specific queue that relates to your research group.

These can all be requested via your job script.

Memory

Machines with specific amounts of memory can be requested with resource requests.

There are three levels of memory on the machines:

machine type       SGE request
standard memory    std_mem
mid memory         mid_mem
high memory        high_mem

There are times when it is necessary to manipulate memory usage via a request for more job slots (processor cores) than what your job will use. For example, a node that has 64G of memory and 16 processor cores has ~4G of memory per core. For serial jobs that need more than the amount provided per core on the respective systems, you will need to use the smp PE (parallel environment). For MPI jobs, you will have to use one of the PEs that request whole nodes. The orte PE should absolutely not be used in cases where you have to request more job slots than you will use.

On the topic of the orte PE, it seems that this PE is used more frequently than it should be. The orte PE will allocate slots from any node on the cluster and allows node sharing between jobs. Generally speaking, MPI jobs should not be on shared nodes. There are use cases for the orte PE as it can allocate slots more flexibly. However, due to the node sharing, your job may either affect or be affected by someone else's job on the shared node.

Jobs that do not request whole compute nodes in the UI and all.q queues will have per-process memory limits set by default. If a job exceeds the memory limit it will abort and will need to be resubmitted with a slot request sufficient to allocate the memory needed for the job. These limits correspond to the memory values per core given above. If your job is terminated it may be because it hit the memory limit and needs to be submitted with a request for more memory. To verify, run the following command

qacct -j $JOBID

and look for the maxvmem field. If it matches a limit mentioned above then you will need to modify your submission. To monitor memory usage during a job use the following command:

qstat -j $JOBID | grep usage

Look at the vmem and maxvmem entries to see if these are getting near a limit. For example

qstat -j 411298 | grep usage
usage    1:                 cpu=00:23:19, mem=246.85762 GBs, io=0.01576, vmem=433.180M, maxvmem=433.180M

Note the memory usage info is for the entire job. If your job is using multiple processes then adjust accordingly. If your job requested more job slots than are actually used, base the per-process memory usage only on the number of processes actually running. For example, if maxvmem shows 24G for a job running 6 processes, each process is using roughly 4G.

There are several ways of requesting memory resources. If you have a large memory job then you may want to make sure that there is enough free memory on a machine before attempting to schedule the job. Note that these are not limits and are not enforced but only refer to the amounts at the time of scheduling.

#$ -l mf=2G -- request machines with 2G of free memory.
#$ -l sf=2G -- request machines with 2G of free swap.
#$ -l mt=12G -- request machines with a total of 12G of memory.
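
As a sketch, one of these requests can be combined with a job script; for example, a serial job that should only start on a machine with at least 8G of memory currently free (the 8G figure is illustrative, and myprogram stands in for your executable) could look like

Serial job with a memory resource request
#!/bin/sh
# only schedule this job on a machine with at least 8G of free memory
#$ -l mf=8G
myprogram ...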

If the job is a serial (single processor) job and uses more memory than the amount available for a single core, then you will need to request more job slots (processor cores). This is done by specifying a parallel environment to get more cores and the memory that is loosely associated with them. Since a serial job only runs on a single machine, the smp PE is the one to use. The choice is whether to allow the node to be shared with other jobs. If node sharing is okay, then the following will work for a serial job that uses 6GB of memory on a 64G machine.

#$ -pe smp 2
myprogram ...

If everyone else running jobs on the node is requesting the appropriate number of processor cores as above then all jobs should fit in memory. For MPI jobs the same principle applies but note that the orte PE should not be used as the node allocation can not be controlled. So a 12 process MPI job that uses 8GB per process would have to use a request like the following on 64G nodes.

#$ -pe 16cpn 32
mpirun -n 12 -bynode program ...

The above would allocate (2) 16-core nodes and put 6 processes on each node for a total of 48G used per node. There can be no node sharing in this scenario.

Hosts

# to run on a specific machine
#$ -l h=compute-2-36

Other Resource requests

There are some additional resources to be aware of. The machines in the cluster may not all be identical. Some may have accelerator cards, for example. To request a node with special resources use the '-l' option of qsub. For instance, if your job can make use of a GPU accelerator, submit your job as

qsub -l kepler myscript.job

Note that the resource names may vary between clusters so check the cluster specific documentation.

Note that if your code is linked against the Intel Math Kernel Libraries (MKL) then you can make use of a Phi accelerator in offload mode by specifying MKL_MIC_ENABLE=1.
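
For example, a job script for an MKL-linked program might simply export that variable before running (myprogram stands in for your executable; any resource request needed to land on a Phi-equipped node would use the cluster-specific resource name).

MKL automatic offload job script
#!/bin/sh
# enable automatic MKL offload to the Phi accelerator
export MKL_MIC_ENABLE=1
myprogram ...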

Queues

Queue requests are governed by access lists. Please make sure you have access to a queue before requesting it or your job will fail to launch. To see which queues you have access to, run the following command.

whichq
# to request the ICTS investor queue.
#$ -q ICTS

If you have access to multiple queues then you can combine the resources by specifying multiple queues for the queue resource.

#$ -q CGRER,COE

One should not specify the all.q queue in a queue submission request along with an investor queue. In that scenario, the processes that land in the investor queue could immediately evict processes that land in the all.q queue, which will have the undesired effect of suspending the processes in the investor queue as well.

Also, the sandbox/development queues should only be used for testing and not used for production jobs.

Job priorities

If you have not been using the cluster much recently, you will likely have a higher priority than other users who are submitting lots of jobs, and your job should move towards the front of the queue. Conversely, if you have been submitting a lot of jobs, your job priorities will be lowered. Priority also depends on the number of job slots requested: parallel jobs get a higher priority in proportion to the number of requested slots. This is done in an effort to ensure that larger slot count jobs do not get starved while waiting for resources to free up.

Checkpoint/Restart

Many applications that run long-running computations will create checkpoint files. These checkpoint files are then used to restart a job at the point where the checkpoint was last written. This capability is particularly useful for jobs submitted to the all.q queue as those jobs may be evicted so that investor jobs can run. SGE provides a mechanism to take advantage of checkpoints if your application supports them. SGE sets an environment variable (RESTARTED) that can be checked if the user specified that a job is restartable. To take advantage of the application's checkpoint capability the job script should check the value of the $RESTARTED environment variable. A simple "if...then" block should be sufficient.

if [ "$RESTARTED" = "1" ]; then
    # commands to set up the computation for a restart,
    # e.g. use the latest checkpoint/output file as the input file
    :
fi

Additionally, the job needs to be submitted with the '-ckpt user' option to the qsub command. This is what tells SGE that it is possible to restart a job. Without that option, SGE will not look for the $RESTARTED environment variable. Even if your application does not perform checkpoint operations, the checkpoint/restart capability of SGE can still be used to restart your job. In this case, it would just restart from the beginning. This has the effect of migrating the job to available nodes (if there are any) after a job eviction or simply re-queueing the job. 
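
For example, a restartable job would be submitted as

qsub -ckpt user myscript.job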

Not all jobs can be restarted, and those that can need to be set up to do so. It is the user's responsibility to set up the commands in the job script file that will allow the computation to restart. In some cases, this is simply using the latest output file as an input file. In other cases an input file with different options is needed.

Job Dependencies

There are times when you want to make sure one job has completed before another one is started. This can be accomplished by specifying job dependencies with the -hold_jid flag of qsub. That flag takes a JOB_ID as a parameter, and that JOB_ID is the job that must complete before the current job submission can be launched. So, assuming that job A needs to complete before job B can begin

qsub test_A.job
Your job 3808141 ("test_A.job") has been submitted

That will return a JOB_ID in the output. Use that in the subsequent job submission

qsub -hold_jid 3808141 test_B.job

You will see something like the following in the qstat output

qstat -u gpjohnsn
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
3808141 0.50233 test_A.job gpjohnsn     r     01/15/2014 15:40:30 sandbox@compute-6-174.local        1        
3808142 0.00000 test_B.job gpjohnsn     hqw   01/15/2014 15:40:53                                    1        

The test_B job will not begin until the test_A job is complete. There is a handy way to capture the JOB_ID in an automated way using the -terse flag of qsub.

hold_jid=$(qsub -terse test_A.job)
qsub -hold_jid $hold_jid test_B.job

The first line will capture the output of the JOB_ID and put it in the hold_jid variable. You can use whatever legal variable name you want for that. The second line will make use of the previously stored variable containing the JOB_ID.

If the first job is an array job you will have to do a little more filtering on the output from the first job.

hold_jid=$(qsub -terse test_A.job | awk -F. '{print $1}')
qsub -hold_jid $hold_jid test_B.job

If both jobs are array jobs then you can specify array dependencies with the -hold_jid_ad flag to qsub. In that case the array jobs have to be the same size and the dependencies are on the tasks such that when test_A, task 1 is complete test_B, task 1 can start.
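
A sketch of that pattern, assuming both array jobs run 10 tasks (the task range is illustrative), would be

# submit the first array job and strip the task range from the returned ID
hold_jid=$(qsub -terse -t 1-10 test_A.job | awk -F. '{print $1}')
# each task of test_B starts when the corresponding task of test_A completes
qsub -t 1-10 -hold_jid_ad $hold_jid test_B.job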
