Sunday, January 11, 2015

qsub usage @XSEDE

After you create your batch script you submit it to PBS with the qsub command:
$ qsub myscript.job
Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.
You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the first sample script above and submit the script with the command:
$ qsub -l ncpus=16 -l walltime=5:00 -j oe -q batch myscript.job
Command-line directives override directives in your scripts.
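For reference, a batch script carrying the equivalent directives inline might look like the following sketch (the executable name mympi is a placeholder, and the layout follows the packing examples later in this document):

```shell
#PBS -l ncpus=16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory and run my (placeholder) executable
cd $SCRATCH
mpirun -np 16 ./mympi
```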

Flexible Walltime Requests

Two other qsub options are available for specifying your job's walltime request.
-l walltime_min=HH:MM:SS 
-l walltime_max=HH:MM:SS
You can use these two options instead of "-l walltime" to make your walltime request flexible or malleable. A flexible walltime request can improve your job's turnaround in several circumstances.
For example, to accommodate large jobs, the system actively drains blades to create dynamic reservations. The blades being drained for these reservations create backfill up to the reservation start time that may be used by other jobs. Using flexible walltime limits increases the opportunity for your job to run on backfill blades.
As an example, if your job requests 64 cores and a range of walltime between 2 and 4 hours and a 64-core slot is available for 3 hours, your job could run in this slot with a walltime request of 3 hours. If your job had asked for a fixed walltime request of 4 hours it would not be started.
Another situation in which specifying a flexible walltime could improve your turnaround is the period leading up to a full drain for system maintenance. The system will not start a job that will not finish before the system maintenance time begins. A job with a flexible walltime could start if the flexible walltime range overlaps the period when the maintenance time starts. A job with a fixed walltime that would not finish until after the maintenance period begins would not be started.
If the system starts one of your jobs with a flexible walltime request, the system selects a walltime within the two specified limits. This walltime will not change during your job's execution. You can determine the actual walltime your job was assigned by examining the "Resource_List.walltime" field of the output of the "qstat -f" command. The command:
$ qstat -f $PBS_JOBID
will give this output for the current job. You can capture this output to find the value of the Resource_List.walltime field.
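As a sketch of capturing that field (assuming the standard "qstat -f" layout, where the field appears as "Resource_List.walltime = HH:MM:SS"; the sample output below is fabricated for illustration):

```shell
# Sample "qstat -f" output; in a real job you would instead run:
#   qstat -f $PBS_JOBID > qstat.out
cat > qstat.out <<'EOF'
Job Id: 12345.tg-login1
    Job_Name = myscript.job
    Resource_List.ncpus = 64
    Resource_List.walltime = 03:00:00
EOF
# extract the assigned walltime (HH:MM:SS)
WALLTIME=$(grep 'Resource_List.walltime' qstat.out | awk '{print $3}')
echo $WALLTIME    # prints 03:00:00 for this sample
```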
You may need to provide this value to your program so that your program can make appropriate decisions about writing checkpoint files. In the above example, you would tell your program that it is running for 3 hours, so that it begins writing checkpoint files sufficiently in advance of the 3-hour limit for the file writing to complete before the limit is reached. The functions "MPI_Wtime" and "omp_get_wtime" can be used to track how long your program has been running, so that it knows when to write checkpoint files and your results are saved.
You may also want to save time at the end of your job to allow your job to transfer files after your program ends but before your job ends. You can use the timeout command to specify in seconds how long you want your program to run. Once your job determines what its actual walltime is you can, after subtracting the amount of time you want for file transfer at the end of your job, use this value in a timeout command. For example, assume your job is assigned a walltime of 1 hour and you want your program to stop 10 minutes before your job ends to allow your job to have adequate time for file transfer. To accomplish this you could use a command like the following:
timeout --timeout=$PROGRAM_TIME -- mpirun -np 32 ./mympi
The example assumes that your script has retrieved your job's walltime, converted it to seconds--values given to timeout must be in seconds--subtracted 600 from it and assigned the value of 3000 to the variable $PROGRAM_TIME. You will probably also want to provide this value to your program. Your program can then use this value to appropriately write out checkpoint files. When your program ends your job will have time to perform necessary file transfers before your job ends.
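The arithmetic above can be sketched in the shell. This assumes the walltime has already been captured into a variable in HH:MM:SS form; the 600-second file-transfer buffer is the example's choice, not a system requirement:

```shell
WALLTIME=01:00:00    # assumed captured earlier from "qstat -f $PBS_JOBID"
# split HH:MM:SS into fields
HH=${WALLTIME%%:*}; REST=${WALLTIME#*:}
MM=${REST%%:*};     SS=${REST#*:}
# convert to seconds (10# forces base 10 so leading zeros are safe)
TOTAL=$((10#$HH * 3600 + 10#$MM * 60 + 10#$SS))
# reserve 600 seconds at the end of the job for file transfer
PROGRAM_TIME=$((TOTAL - 600))
echo $PROGRAM_TIME    # prints 3000 for a 1-hour walltime
```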
For more information on the timeout command see the timeout man page. If you want assistance on the procedures needed to capture your job's actual walltime or to determine when your job should write checkpoint files send email to

How to improve your turnaround

We have several suggestions for how to improve your job turnaround. First, be as accurate as possible in estimating the walltime request for your job. Asking for more time than your job will actually need will almost certainly result in poorer turnaround. Thus, reflexively requesting the maximum allowed walltime for a job will almost always result in poorer turnaround.
Our second recommendation is that you always use flexible walltime requests if possible. This is especially helpful if your minimum walltime in your pair of walltime values is less than 8 hours.
Finally, due to system limitations, we must limit the number of concurrent 16-core jobs on Blacklight. Since the number of queued 16-core jobs usually is above this limit, if you are running 16-core jobs, it is to your advantage to pack multiple 16-core executions into a single job. How to pack jobs is discussed below.

Interactive access

A form of interactive access is available on Blacklight by using the "-I" option to qsub. For example, the command:
$ qsub -I -l ncpus=16 -l walltime=5:00
requests interactive access to 16 cores for 5 minutes. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI or hybrid program you must use the mpirun command just as you would in a batch script.
When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.

X-11 Connections In Interactive Use

In order to use any X-11 tool, you must also include "-X" on the qsub command line:
$ qsub -X -I -l ncpus=16 -l walltime=5:00
This assumes that the $DISPLAY variable is set. Two ways in which $DISPLAY is automatically set for you are:
  1. Connecting to Blacklight with ssh -X
  2. Enabling X-11 tunneling in your Windows ssh tool
Fluent and TAU are among the packages which require X-11 connections.

Other qsub Options

Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.
-m a|b|e|n
Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent; this is the default.
-M userlist
Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job. You should specify your full Internet email address when using the -M option.
-v variable_list
This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on Blacklight.
-r y|n
Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
-W group_list=charge_id
Indicates to which charge_id you want a job to be charged. If you only have one allocation on Blacklight you do not need to use this option; otherwise, you should charge each job to the appropriate allocation. You can see your valid charge_ids by typing `groups` at the Blacklight prompt. Typical output will look like
sy2be6n ec3l53p eb3267p jb3l60q
Your default charge_id is the first group in the list; in this example "sy2be6n". If you do not specify `-W group_list` for your job, this is the allocation that will be charged.
-W depend=dependency:jobid
Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
after: this job can be scheduled after job jobid begins execution.
afterok: this job can be scheduled after job jobid finishes successfully.
afternotok: this job can be scheduled after job jobid finishes unsuccessfully.
afterany: this job can be scheduled after job jobid finishes in any state.
before: this job must begin execution before job jobid can be scheduled.
beforeok: this job must finish successfully before job jobid begins.
beforenotok: this job must finish unsuccessfully before job jobid begins.
beforeany: this job must finish in any state before job jobid begins.
Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
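As a sketch of chaining jobs with an "after" dependency, you can capture the job id printed by qsub and feed it to the -W depend option of the next submission (the job id and script names here are stand-ins, not real submissions):

```shell
# In a real session: first=$(qsub first.job)
first=12345.tg-login1                 # stand-in for the captured job id
# The follow-on submission would then be:
#   qsub -W depend=afterok:$first second.job
depend="depend=afterok:${first}"      # the option value being constructed
echo $depend                          # prints depend=afterok:12345.tg-login1
```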

Packing jobs

Running many small jobs places a great burden on the scheduler and is probably inconvenient for you. An alternative is to pack many executions into a single job, which you then submit to PBS with a single qsub command. The basic method to use to pack jobs is to run each program execution in the background and place a wait command after all your executions. A sample job to pack serial executions is:
#PBS -l ncpus=128
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
#copy executables and input files to $SCRATCH
cp $HOME/myserial* .
cp $HOME/serial* .
#run my executables
dplace -c 0 ./myserial1  > serial1.out < serial1.dat &
dplace -c 32 ./myserial2 > serial2.out < serial2.dat &
dplace -c 64 ./myserial3 > serial3.out < serial3.dat &
dplace -c 96 ./myserial4 > serial4.out < serial4.dat &
#wait for all background executions to finish
wait
Each serial execution will run on 2 blades. The dplace command ensures that each execution will run on its own set of 2 blades. The executions will run concurrently.
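The per-execution offsets above follow a fixed stride of 32 cores. A sketch that generates them in a loop, printing the commands rather than running them so the pattern is visible:

```shell
STRIDE=32            # cores per execution: 2 blades of 16 cores each
for i in 1 2 3 4; do
    offset=$(( (i - 1) * STRIDE ))
    # in a real job, drop the echo to actually launch in the background
    echo "dplace -c $offset ./myserial$i > serial$i.out < serial$i.dat &"
done
# a real job would end with: wait
```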
To pack a job with executables that use threads such as OpenMP executables you should replace the dplace command with the omplace command. A sample job to pack OpenMP executables is:
#PBS -l ncpus=128
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
#copy executables and input files to $SCRATCH
cp $HOME/myopen* .
#run my executables
omplace -nt 32 -c 0 ./myopenmp1 > openmp1.out < openmp1.dat &
omplace -nt 32 -c 32 ./myopenmp2 > openmp2.out < openmp2.dat &
omplace -nt 32 -c 64 ./myopenmp3 > openmp3.out < openmp3.dat &
omplace -nt 32 -c 96 ./myopenmp4 > openmp4.out < openmp4.dat &
#wait for all background executions to finish
wait
A sample job to pack MPI executables is:
#PBS -l ncpus=64
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
#copy executables and input files to $SCRATCH
cp $HOME/mympi* .
#run my executable
mpirun -np 16 dplace -c 0-15 ./mympi > mpi1.out < mpi1.dat &
mpirun -np 16 dplace -c 16-31 ./mympi > mpi2.out < mpi2.dat &
mpirun -np 16 dplace -c 32-47 ./mympi > mpi3.out < mpi3.dat &
mpirun -np 16 dplace -c 48-63 ./mympi > mpi4.out < mpi4.dat &
#wait for all background executions to finish
wait
Packing jobs is especially useful to do if you are running 16-core jobs. Due to system limitations, we must limit the number of concurrent 16-core jobs on Blacklight. Since the number of queued 16-core jobs usually exceeds this limit, if you are running 16-core jobs, it is to your advantage to pack multiple 16-core executions into a single job.
For more information about dplace and omplace see the man pages for dplace and omplace. If you have questions about packing jobs send email to

Monitoring and Killing Jobs

The "qstat -a" command displays the status of the queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of processors requested; to get the actual number of cores a job is using, divide the displayed value by two. For running jobs it also shows the amount of walltime the job has already used.
The commands "qstat -s", "qstat -f" and "pbsnodes -a" can be used to give status information about the system and your jobs. The comments that these commands provide can be used to determine why your jobs have not started. The "qstat -f" command takes a jobid as an argument.
The qdel command is used to kill queued and running jobs. An example is the command:
$ qdel 54
The argument to qdel is the jobid of the job you want to kill, which is shown when you submit your job; you can also get it with the qstat command. If you cannot kill a job, send email to


Debugging strategy

Your first few runs should be on a small version of your problem. Your first run should not be for your largest problem size. It is easier to solve code problems if you are using fewer processors. This strategy should be followed even if you are porting a working code from another system.
Do not run a debugging run on any of Blacklight's front ends. You should always run a Blacklight program with qsub.


The idb and gdb debuggers are available on Blacklight. The gdb debugger has a man page. Information for idb is available online. This online documentation has links to more idb reference material. Send email to if you want another debugger to be installed.

Compiler options

Several compiler options can be useful to you when you are debugging your program. If you use the "-g" option to the Intel or GNU compilers, the error messages you receive when your program fails will probably be more informative. For example, you will probably be given the line number of the source code statement that caused the failure. Once you have a production version of your code you should not use the "-g" option or your program will run slower.
The "-check bounds" option to the ifort compiler will cause your program to tell you if it exceeds an array's bounds while running.
Variables on Blacklight are not automatically initialized. This can cause your program to fail if it relies on variables being initialized. The "-check uninit" and "-ftrapuv" options to the ifort compiler will catch certain cases of uninitialized variables, as will the "-Wall" and "-O" options to the GNU compilers.
There are more options to the Intel and GNU compilers that may assist you in your debugging. For more information see the appropriate man pages.

Core files

By default, core files cannot be written on Blacklight because the maximum allowable core file size is 0 bytes; to enable them, you must raise this limit. If you are using sh-type shells you do this by issuing the command:
$ ulimit -c unlimited
For csh-type shells you issue the command:
$ limit coredumpsize unlimited
Core files are created in directory ~/tmp. For more information about core files issue the command:
$ man 5 core

Little endian versus big endian

The data bytes in a binary floating point number or a binary integer can be stored in a different order on different machines. Blacklight is a little endian machine, which means that the low-order byte of a number is stored in the memory location with the lowest address for that number while the high-order byte is stored in the highest address for that number. The data bytes are stored in the reverse order on a big endian machine.
If your machine has Tcl installed you can tell whether the machine is little endian or big endian by issuing the command:
$ echo 'puts $tcl_platform(byteOrder)' | tclsh
You can read a big endian file on Blacklight if you are using the Intel ifort compiler. Before you run your program issue the command:
$ setenv FORT_CONVERTn big_endian
for each Fortran unit number from which you are reading a big endian file. For 'n' substitute the appropriate unit number.
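For example, to read big endian files on Fortran units 10 and 20 (shown here in sh-type syntax; the csh form matching the setenv command above is in the comment):

```shell
# csh-type shells would instead use:
#   setenv FORT_CONVERT10 big_endian
#   setenv FORT_CONVERT20 big_endian
export FORT_CONVERT10=big_endian
export FORT_CONVERT20=big_endian
echo $FORT_CONVERT10    # prints big_endian
```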

Improving Performance

Calculating Mflops

You can calculate your code's Mflops rate using the TAU utility. The TAU examples show how to determine timing data and floating point operation counts for your program, from which you can calculate your Mflops rate.

Cache Performance

Cache performance can have a significant impact on the performance of your program. Each Blacklight core has three levels of cache. The primary data and instruction caches are 32 Kbytes each. The L2 cache is 256 Kbytes. The L3 cache, which is shared by the 8 cores on a processor, is 24 Mbytes. When hyper-threading is enabled the two threads on a core share the L1 and L2 caches.
You can measure your program's cache miss rate for each of the available caches by setting the appropriate counters when using the TAU utility. If you need assistance in measuring or improving your cache performance send email to

Turbo Boost

Blacklight's Nehalem processors have a feature referred to as Turbo Boost. Under certain workload conditions its processor cores can automatically and dynamically run faster than their base clock rate of 2.27 GHz. Although the activation of the Turbo Boost feature is application dependent, we have found that it is most often activated when only a few cores per processor are being used, because its activation depends on the processor's power consumption and temperature.

Collecting timing data

Collecting timing data is essential for measuring and improving program performance. We recommend five approaches for collecting timing data. The ja and /usr/bin/time utilities can be used to collect data at the program level. They report results to the hundredths of seconds. The TAU utility and the omp_get_wtime and MPI_Wtime functions can be used to collect timing data at a finer grain. The default precision for TAU is microseconds, but the -linuxtimers or -papi option can be used to obtain nanosecond precision. The precision for omp_get_wtime is microseconds, while the precision for MPI_Wtime is nanoseconds.


The ja command turns on job accounting for your job. This allows you to obtain information on the elapsed time and memory and IO usage of your program, plus other data. You should place the ja command in your batch script after your PBS specifications and before your executable commands.
You must pair your initial ja command with another ja command at the end of your job. We recommend you use the command ja -chlst for this second command. The option "-t" to this second ja command turns off job accounting and writes your accounting data to stdout. The other options to the second ja command determine what output you will receive from ja. We recommend the -chls options because we think they will provide detailed but useful information about your job's processes. However, you can look at the man page for ja to see what reporting options you want to use.
There is no overhead to using ja. We strongly recommend you use ja when you want to understand the resource usage of your jobs. You can use this information when you submit future jobs. The output from ja can also be used for debugging and performance improvement purposes.
If your job terminates normally and you have included the "-t" option with your second ja command, your ja output is written to your job's stdout. If you have any questions about using ja or encounter any errors when running ja send email to

Memory files

Blacklight's operating system creates a file system out of its blade memory. Thus, your program can perform IO to blade memory rather than to disk. Memory IO is several orders of magnitude faster than disk IO. However, each Blacklight job can only perform memory IO to the blades associated with that job. A job cannot write to the memory of blades assigned to other jobs.
The environment variable $SCRATCH_RAMDISK is set to point to the memory associated with each job. Unlike $SCRATCH, this variable is given a new value for each job. Otherwise, this variable can be treated like $SCRATCH. From within your job, you can cd to it, you can copy files to and from it, and you can use it to open files.
Memory IO is faster than disk IO, but it has disadvantages. Each job's memory filespace is cleared when the job terminates, whether normally or abnormally; if you are using memory IO you must therefore copy your memory files back from $SCRATCH_RAMDISK before your job ends, and if your job terminates abnormally your files will be lost. Memory IO is also limited in size: each job can only use the memory associated with that job, and only the memory left over after your program's own allocation. In addition, a single memory file can be at most 256 Gbytes. The use of memory files is therefore best suited to IO-intensive jobs that perform IO to many small files.
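A sketch of the copy-back pattern follows. Here temporary directories stand in for $SCRATCH_RAMDISK and $SCRATCH so the sketch is self-contained; in a real job you would use those variables directly:

```shell
SCRATCH_RAMDISK=$(mktemp -d)     # stand-in for the job's memory filespace
RESULTS=$(mktemp -d)             # stand-in for $SCRATCH
cd $SCRATCH_RAMDISK
echo "checkpoint data" > ckpt.dat     # the program's IO goes to memory
# copy memory files back before the job ends, or they are lost
cp ckpt.dat $RESULTS/
ls $RESULTS                      # prints ckpt.dat
```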

IO optimization

File striping

If your program reads or writes large files you should use $SCRATCH. Your $HOME space is limited. In addition, the $SCRATCH file space is implemented using the Lustre parallel file system. A program that uses $SCRATCH can perform parallel IO and thus can significantly improve its performance. File striping can be used to tune your parallel IO performance and is particularly effective for files that are 1 Gbyte or larger.
A Lustre file system is created from an underlying set of file systems called Object Storage Targets (OSTs). Your program can read from and write to multiple OSTs concurrently. This is how you can use Lustre as a parallel file system.
A striped file is one that is spread across multiple OSTs. Thus, striping a file is one way for you to be able to use multiple OSTs concurrently. However, striping is not suitable for all files. Whether it is appropriate for a file depends on the IO structure of your program.
For example, if each of your cores writes to its own file you should not stripe these files. If each file is placed on its own OST then as each core writes to its own file you will achieve a concurrent use of the OSTs because of the IO structure of your program. File striping in this case could actually lead to an IO performance degradation because of the contention between the cores as they perform IO to the pieces of their files spread across the OSTs.
An application ideally suited to file striping would be one in which there is a large volume of IO but a single core performs all the IO. In this situation you will need to use striping to be able to use multiple OSTs concurrently.
However, striping has other disadvantages besides possible IO contention, and these must be considered when making your striping decisions. Many interactive file commands, such as "ls -l" or unlink, will take longer for striped files.
You use the lfs setstripe command to set the striping parameters for a file. You have to set the striping parameters for a file before you create it.
The format of the lfs setstripe command is:
$ lfs setstripe filename -c stripe-count
A value of -1 for the stripe count means the file should be spread across all the available OSTs.
For example, the command:
$ lfs setstripe bigfile.out -c -1
sets the stripe count for bigfile.out to be all available OSTs.
The command:
$ lfs setstripe manyfiles.out -c 1
has a stripe count of 1. Each file will be placed on its own OST. This is suitable for the situation where each core writes its own file and you do not want to stripe these files.
You can also specify a directory instead of a filename in the lfs setstripe command. The result will be that each file created in that directory will have the indicated striping. You can override this striping by issuing an lfs setstripe command for individual files within that directory.
The kind of striping that is best for your files is very application dependent. Your application will probably fall between the two extreme cases discussed above. You will therefore need to experiment with several approaches to see which is best for your application. A value of -1 for stripe count will probably give you the best performance if you are going to use file striping, but you should try several values. The maximum value you can give for stripe count on Blacklight is currently 8.
There is a man page for lfs on Blacklight. Online documentation for Lustre is also available. If you want assistance with what striping strategy to follow send email to

Third-party software

Third-party routines can often perform better than routines you code yourself. You should investigate whether there is a third-party routine available to replace any of the routines you have written yourself.
For example, we recommend the FFTW library for FFTs. For linear algebra routines we recommend the MKL library.

Performance monitoring tools

We have installed several performance monitoring tools on Blacklight. The TAU utility is a performance profiling and tracing tool. The PAPI utility can be used to access the hardware performance counters on Blacklight. We intend to install more performance tools on Blacklight. If you want assistance in using any of these tools or have a utility you would like us to install send email to

Performance improvement assistance: the Memory Advantage Program

Blacklight is a very large hardware-coherent shared memory machine. Blacklight is thus suitable for a range of memory-intensive computations that cannot readily be deployed on a distributed-memory machine.
PSC has established the Memory Advantage Program (MAP) to enable users to take advantage of Blacklight's unique capabilities. MAP includes consulting assistance from PSC, special queue handling if necessary and service unit discounts.
To participate in MAP you should send an email to with a description of your scientific problem and any information you have on how effectively your program is currently using Blacklight's shared memory. A PSC Scientific Specialist will then contact you to troubleshoot problems, provide advice on the use of debugging and performance analysis tools and procedures, and offer suggestions on fixes and optimizations. During this consultation process you will be able to make benchmarking, debugging and test runs at a 50% discount for a period of up to 4 weeks.
You can also send email to if you want optimization assistance in areas other than memory usage.

Software Packages

A list of software packages installed on Blacklight is available. If you would like us to install a package that is not in this list send email to

Running Gaussian on Blacklight

An account on Blacklight is not by itself sufficient to give you access to Gaussian. You must fill out our online PSC Gaussian User Agreement to get access to Gaussian at PSC.
If you have questions about access to Gaussian send email to

The Module Command

Before you can run many software packages, you must define paths and other environment variables. To use a different version of a package, these definitions often have to be modified. The module command makes this process easier. For use of the module command, including its use in batch jobs, see the module documentation.

Stay Informed

As a user of Blacklight, you should stay informed of changes to the machine's environment. Refer to this document frequently.
You will also periodically receive email from PSC with information about Blacklight. To ensure that you receive this email, make sure your email forwarding is set properly by following the instructions for setting your email forwarding.


PSC requests that a copy of any publication (preprint or reprint) resulting from research done on Blacklight be sent to the PSC Allocations Coordinator. In addition, if your research was funded by the NSF you should log your publications at the XSEDE Portal. We also request that you include an acknowledgement of PSC in your publication.


You have several options for reporting problems on Blacklight.
  • If you are an XSEDE user you can send email to, mentioning PSC in the subject line. You will get an acknowledgement from the XSEDE Operations Center, and then you will be contacted by PSC staff.
  • You can call the User Services Hotline at 412-268-6350 from 9:00 a.m. until 5:00 p.m., Eastern time, Monday through Friday.
