$ qsub myscript.job
Your batch output (your .o and .e files) is returned to the directory from which you issued the qsub command after your job finishes.
You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the first sample script above and submit the script with the command:
$ qsub -l ncpus=16 -l walltime=5:00 -j oe -q batch myscript.job
Command-line directives override directives in your scripts.
Two additional qsub options are available for specifying your job's walltime request:
-l walltime_min=HH:MM:SS -l walltime_max=HH:MM:SS
You can use these two options instead of "-l walltime" to make your walltime request flexible or malleable. A flexible walltime request can improve your job's turnaround in several circumstances.
For example, to accommodate large jobs, the system actively drains blades to create dynamic reservations. The blades being drained for these reservations create backfill up to the reservation start time that may be used by other jobs. Using flexible walltime limits increases the opportunity for your job to run on backfill blades.
As an example, if your job requests 64 cores and a range of walltime between 2 and 4 hours and a 64-core slot is available for 3 hours, your job could run in this slot with a walltime request of 3 hours. If your job had asked for a fixed walltime request of 4 hours it would not be started.
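The reasoning in this example can be sketched as a small shell function. This is only an illustration of the arithmetic, not the scheduler's actual algorithm; the function name pick_walltime is hypothetical, all values are whole hours, and the choice of the maximum walltime when the slot is longer than the request is an assumption.

```shell
# Hypothetical helper: decide what walltime a job could get in a free slot.
# $1 = walltime_min, $2 = walltime_max, $3 = length of the free slot (whole hours)
pick_walltime() {
    if [ "$3" -lt "$1" ]; then
        return 1        # slot is shorter than the minimum request: job cannot start
    elif [ "$3" -lt "$2" ]; then
        echo "$3"       # slot lies inside the requested range: use the slot length
    else
        echo "$2"       # assumption: a longer slot yields the maximum request
    fi
}
```

With a range of 2 to 4 hours and a 3-hour slot, "pick_walltime 2 4 3" prints 3. A fixed 4-hour request corresponds to "pick_walltime 4 4 3", which fails because the slot is too short.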
Another situation in which specifying a flexible walltime could improve your turnaround is the period leading up to a full drain for system maintenance. The system will not start a job that will not finish before the system maintenance time begins. A job with a flexible walltime could start if the flexible walltime range overlaps the period when the maintenance time starts. A job with a fixed walltime that would not finish until after the maintenance period begins would not be started.
If the system starts one of your jobs with a flexible walltime request, the system selects a walltime within the two specified limits. This walltime will not change during your job's execution. You can determine the actual walltime your job was assigned by examining the "Resource_List.walltime" field of the output of the "qstat -f" command. The command:
$ qstat -f $PBS_JOBID
will give this output for the current job. You can capture this output to find the value of the "Resource_List.walltime" field.
You may need to provide this value to your program so that your program can make appropriate decisions about writing checkpoint files. In the above example, you would tell your program that it is running for 3 hours and thus should begin writing checkpoint files sufficiently in advance of the 3-hour limit so that the file writing is completed when the limit is reached. The functions "MPI_Wtime" and "omp_get_wtime" can be used to track how long your program has been running, so that it can write its checkpoint files in time to save the results of your program's processing.
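One way to capture the assigned walltime is to filter the "qstat -f" output. The sketch below is written in sh syntax rather than the csh used in the sample scripts, the function name get_walltime is hypothetical, and it assumes that "qstat -f" prints the field on a line of the form "Resource_List.walltime = HH:MM:SS".

```shell
# Hypothetical helper: read "qstat -f" output on stdin and print the value
# of the Resource_List.walltime field (assumed format: "name = value").
get_walltime() {
    sed -n 's/^[[:space:]]*Resource_List\.walltime = //p'
}

# Intended use inside a running job (not executed here):
#   WALLTIME=$(qstat -f "$PBS_JOBID" | get_walltime)
```

The pattern requires " = " directly after "walltime", so lines for walltime_min and walltime_max are not matched.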
You may also want to save time at the end of your job to allow your job to transfer files after your program ends but before your job ends. You can use the timeout command to specify in seconds how long you want your program to run. Once your job determines what its actual walltime is you can, after subtracting the amount of time you want for file transfer at the end of your job, use this value in a timeout command. For example, assume your job is assigned a walltime of 1 hour and you want your program to stop 10 minutes before your job ends to allow your job to have adequate time for file transfer. To accomplish this you could use a command like the following:
timeout --timeout=$PROGRAM_TIME -- mpirun -np 32 ./mympi
The example assumes that your script has retrieved your job's walltime, converted it to seconds (values given to timeout must be in seconds), subtracted 600 from it and assigned the resulting value of 3000 to the variable $PROGRAM_TIME. You will probably also want to provide this value to your program. Your program can then use this value to appropriately write out checkpoint files. When your program ends your job will have time to perform necessary file transfers before your job ends.
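The conversion described here might look like the following sketch. It uses sh syntax rather than csh, the function names are hypothetical, and it assumes the walltime is given as HH:MM:SS, MM:SS, or plain seconds.

```shell
# Hypothetical helper: convert [[HH:]MM:]SS to a number of seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Hypothetical helper: seconds the program may run.
# $1 = assigned walltime, $2 = seconds to reserve for file transfer
program_time() {
    total=$(walltime_to_seconds "$1")
    echo $(( total - $2 ))
}
```

For a 1-hour walltime and a 10-minute reserve, "program_time 1:00:00 600" prints 3000, which is the value the example assigns to $PROGRAM_TIME.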
For more information on the timeout command see the timeout man page. If you want assistance on the procedures needed to capture your job's actual walltime or to determine when your job should write checkpoint files send email to firstname.lastname@example.org.
Our second recommendation is that you always use flexible walltime requests if possible. This is especially helpful if your minimum walltime in your pair of walltime values is less than 8 hours.
Finally, due to system limitations, we must limit the number of concurrent 16-core jobs on Blacklight. Since the number of queued 16-core jobs usually is above this limit, if you are running 16-core jobs, it is to your advantage to pack multiple 16-core executions into a single job. How to pack jobs is discussed below.
Interactive access is available through the "-I" option to qsub. For example, the command:
$ qsub -I -l ncpus=16 -l walltime=5:00
requests interactive access to 16 cores for 5 minutes. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI or hybrid program you must use the mpirun command just as you would in a batch script.
When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.
To use X-11 in an interactive job, specify "-X" on the qsub command line:
$ qsub -X -I -l ncpus=16 -l walltime=5:00
This assumes that the $DISPLAY variable is set. Two ways in which $DISPLAY is automatically set for you are:
- Connecting to Blacklight with ssh -X Blacklight.psc.teragrid.org
- Enabling X-11 tunneling in your Windows ssh client
There are several other options to qsub that may be useful. See man qsub for a complete list.
- -m a|b|e|n
- Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent. This is the default.
- -M userlist
- Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job. You should specify your full Internet email address when using the -M option.
- -v variable_list
- This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on Blacklight.
- -r y|n
- Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
- -W group_list=charge_id
- Indicates to which charge_id you want a job to be charged. If you only have one allocation on Blacklight you do not need to use this option; otherwise, you should charge each job to the appropriate allocation. You can see your valid charge_ids by typing "groups" at the Blacklight prompt. Typical output will look like
  sy2be6n ec3l53p eb3267p jb3l60q
  Your default charge_id is the first group in the list, in this example "sy2be6n". If you do not specify "-W group_list" for your job, this is the allocation that will be charged.
- -W depend=dependency:jobid
- Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
  - after: this job can be scheduled after job jobid begins execution.
  - afterok: this job can be scheduled after job jobid finishes successfully.
  - afternotok: this job can be scheduled after job jobid finishes unsuccessfully.
  - afterany: this job can be scheduled after job jobid finishes in any state.
  - before: this job must begin execution before job jobid can be scheduled.
  - beforeok: this job must finish successfully before job jobid begins.
  - beforenotok: this job must finish unsuccessfully before job jobid begins.
  - beforeany: this job must finish in any state before job jobid begins.
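As an illustration of the depend option, the sketch below submits a list of job scripts so that each one is scheduled only after the previous one finishes successfully. It is written in sh syntax, the function name submit_chain and the QSUB variable are hypothetical, and it assumes that qsub prints the new job's jobid on standard output.

```shell
# Command used to submit jobs; overridable so the logic can be tried without PBS.
QSUB=${QSUB:-qsub}

# Hypothetical helper: submit each script so that it runs only after the
# previous one finishes successfully. Prints the jobid of the last job.
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job: no dependency
            prev=$($QSUB "$script") || return 1
        else
            # Later jobs: wait for the previous jobid to finish successfully
            prev=$($QSUB -W depend=afterok:"$prev" "$script") || return 1
        fi
    done
    echo "$prev"
}
```

A call such as "submit_chain step1.job step2.job step3.job" would then submit three jobs (the script names are placeholders), with the second and third each using "-W depend=afterok:jobid" for the job submitted before it.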
Packing jobs means running multiple program executions within a single job that you submit with one qsub command. The basic method for packing jobs is to run each program execution in the background and place a wait command after all your executions. A sample job to pack serial executions is:
    #!/bin/csh
    #PBS -l ncpus=128
    #ncpus must be a multiple of 16
    #PBS -l walltime=5:00
    #PBS -j oe
    #PBS -q batch
    set echo
    #move to my $SCRATCH directory
    cd $SCRATCH
    #copy executables and input files to $SCRATCH
    cp $HOME/myserial* .
    cp $HOME/serial* .
    #run my executables
    dplace -c 0 ./myserial1 > serial1.out < serial1.dat &
    dplace -c 32 ./myserial2 > serial2.out < serial2.dat &
    dplace -c 64 ./myserial3 > serial3.out < serial3.dat &
    dplace -c 96 ./myserial4 > serial4.out < serial4.dat &
    wait
Each serial execution will run on 2 blades. The dplace command ensures that each execution will run on its own set of 2 blades. The executions will run concurrently.
To pack a job with executables that use threads, such as OpenMP executables, you should replace the dplace command with the omplace command. A sample job to pack OpenMP executables is:
    #!/bin/csh
    #PBS -l ncpus=128
    #ncpus must be a multiple of 16
    #PBS -l walltime=5:00
    #PBS -j oe
    #PBS -q batch
    set echo
    #move to my $SCRATCH directory
    cd $SCRATCH
    #copy executables and input files to $SCRATCH
    cp $HOME/myopen* .
    #run my executables
    omplace -nt 32 -c 0 ./myopenmp1 > openmp1.out < openmp1.dat &
    omplace -nt 32 -c 32 ./myopenmp2 > openmp2.out < openmp2.dat &
    omplace -nt 32 -c 64 ./myopenmp3 > openmp3.out < openmp3.dat &
    omplace -nt 32 -c 96 ./myopenmp4 > openmp4.out < openmp4.dat &
    wait
A sample job to pack MPI executables is:
    #!/bin/csh
    #PBS -l ncpus=64
    #ncpus must be a multiple of 16
    #PBS -l walltime=5:00
    #PBS -j oe
    #PBS -q batch
    set echo
    #move to my $SCRATCH directory
    cd $SCRATCH
    #copy executables and input files to $SCRATCH
    cp $HOME/mympi* .
    #run my executable
    mpirun -np 16 dplace -c 0-15 ./mympi > mpi1.out < mpi1.dat &
    mpirun -np 16 dplace -c 16-31 ./mympi > mpi2.out < mpi2.dat &
    mpirun -np 16 dplace -c 32-47 ./mympi > mpi3.out < mpi3.dat &
    mpirun -np 16 dplace -c 48-63 ./mympi > mpi4.out < mpi4.dat &
    wait
Packing jobs is especially useful if you are running 16-core jobs. Due to system limitations, we must limit the number of concurrent 16-core jobs on Blacklight. Since the number of queued 16-core jobs usually exceeds this limit, if you are running 16-core jobs, it is to your advantage to pack multiple 16-core executions into a single job.
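The core ranges given to dplace in a packed MPI job must not overlap, and it is easy to mistype one. A small helper can compute them instead. This is a sketch in sh syntax, not csh, and the function name core_range is hypothetical.

```shell
# Hypothetical helper: print the core range for one packed execution.
# $1 = index of the execution, counting from 0
# $2 = number of cores each execution uses
core_range() {
    start=$(( $1 * $2 ))
    end=$(( start + $2 - 1 ))
    echo "$start-$end"
}
```

For four 16-core executions, "core_range 0 16" through "core_range 3 16" print 0-15, 16-31, 32-47 and 48-63, which are the ranges that follow "dplace -c" in an sh-type job script.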
If you have questions about packing jobs send email to email@example.com. For more information about omplace see the omplace man page.
The "qstat -a" command displays the status of the queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of cores and processors requested. To get the actual number of cores a job is using you must divide the displayed value by two. For running jobs it shows the amount of walltime the job has already used.
The commands "qstat -s", "qstat -f" and "pbsnodes -a" can be used to give status information about the system and your jobs. The comments that these commands provide can be used to determine why your jobs have not started. The "qstat -f" command takes a jobid as an argument.
The qdel command is used to kill queued and running jobs. An example is the command:
$ qdel 54
The argument to qdel is the jobid of the job you want to kill. You are shown the jobid when you submit your job, or you can get it with the qstat command. If you cannot kill a job you want to kill send email to firstname.lastname@example.org.
Do not perform debugging runs on any of Blacklight's front ends. You should always run a Blacklight program with qsub.
The idb and gdb debuggers are available on Blacklight. The gdb debugger has a man page. Information for idb is available online. This online documentation has links to more idb reference material. Send email to email@example.com if you want another debugger to be installed.
If you use the "-g" option to the Intel or GNU compilers, the error messages you receive when your program fails will probably be more informative. For example, you will probably be given the line number of the source code statement that caused the failure. Once you have a production version of your code you should not use the "-g" option or your program will run slower.
The -check bounds option to the ifort compiler will cause your program to tell you if it exceeds an array bound while running.
Variables on Blacklight are not automatically initialized. This can cause your program to fail if it relies on variables being initialized. The -check uninit and -ftrapuv options to the ifort compiler will catch certain cases of uninitialized variables, as will the -Wall and -O options to the GNU compilers.
There are more options to the Intel and GNU compilers that may assist you in your debugging. For more information see the appropriate man pages.
To enable core files in sh-type shells you issue the command:
$ ulimit -c unlimited
For csh-type shells you issue the command:
$ limit coredumpsize unlimited
Core files are created in directory ~/tmp. For more information about core files issue the command:
$ man 5 core
If your machine has Tcl installed you can tell whether the machine is little endian or big endian by issuing the command:
$ echo 'puts $tcl_platform(byteOrder)' | tclsh
You can read a big endian file on Blacklight if you are using the Intel ifort compiler. Before you run your program issue the command:
$ setenv FORT_CONVERTn big_endian
for each Fortran unit number from which you are reading a big endian file. For 'n' substitute the appropriate unit number.
You can measure your program's cache miss rate for each of the available caches by setting the appropriate counters when using the TAU utility. If you need assistance in measuring or improving your cache performance send email to firstname.lastname@example.org.
The omp_get_wtime and MPI_Wtime functions can be used to collect timing data at a finer grain. The default precision for TAU is microseconds, but the -linuxtimers or -papi option can be used to obtain nanosecond precision. The precision for omp_get_wtime is microseconds.
You must pair your initial ja command with another ja command at the end of your job. We recommend you use the command ja -chlst for this second command. The option "-t" to this second ja command turns off job accounting and writes your accounting data to stdout. The other options to the second ja command determine what output you will receive from ja. We recommend the -chls options because we think they will provide detailed but useful information about your job's processes. However, you can look at the man page for ja to see what reporting options you want to use.
There is no overhead to using ja. We strongly recommend you use ja when you want to understand the resource usage of your jobs. You can use this information when you submit future jobs. The output from ja can also be used for debugging and performance improvement purposes.
If your job terminates normally and you have included the "-t" option with your second ja command, your ja output is written to your job's stdout. If you have any questions about using ja or encounter any errors when running ja send email to email@example.com.
The environment variable $SCRATCH_RAMDISK is set to point to the memory associated with each job. Unlike $SCRATCH, this variable is given a new value for each job. Otherwise, this variable can be treated like $SCRATCH. From within your job, you can cd to it, you can copy files to and from it, and you can use it to open files.
Memory IO is faster than disk IO, but it does have disadvantages. Each job's memory filespace is cleared whenever the job terminates, whether normally or abnormally. Thus, if you are using memory IO you must copy your memory files back from $SCRATCH_RAMDISK before your job ends or the files are lost. If your job terminates abnormally your files will be lost. Memory IO is also limited in size relative to disk IO. Each job can only use the memory associated with that job, and only the memory that remains after memory is allocated for your program. In addition, the largest size a single memory file can be is 256 Gbytes. Therefore, the use of memory files is best suited to IO-intensive jobs that perform IO to lots of small files.
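The copy-back step could be collected into a helper that your job calls after your program ends. The following is a minimal sketch in sh syntax; the function name save_ramdisk_files and the destination directory in the usage note are hypothetical.

```shell
# Hypothetical helper: copy the regular files in a memory directory to disk.
# $1 = source directory (for example $SCRATCH_RAMDISK)
# $2 = destination directory on disk, created if it does not exist
save_ramdisk_files() {
    mkdir -p "$2" || return 1
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            cp -p "$f" "$2"/ || return 1   # stop at the first failed copy
        fi
    done
}
```

Near the end of a job you might call it as "save_ramdisk_files $SCRATCH_RAMDISK $SCRATCH/results". Subdirectories are skipped in this sketch, and nothing is saved if the job terminates abnormally before the call is reached.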
Your $HOME space is limited. In addition, the $SCRATCH file space is implemented using the Lustre parallel file system. A program that uses $SCRATCH can perform parallel IO and thus can significantly improve its performance. File striping can be used to tune your parallel IO performance and is particularly effective for files that are 1 Gbyte or larger.
A Lustre file system is created from an underlying set of file systems called Object Storage Targets (OSTs). Your program can read from and write to multiple OSTs concurrently. This is how you can use Lustre as a parallel file system.
A striped file is one that is spread across multiple OSTs. Thus, striping a file is one way for you to be able to use multiple OSTs concurrently. However, striping is not suitable for all files. Whether it is appropriate for a file depends on the IO structure of your program.
For example, if each of your cores writes to its own file you should not stripe these files. If each file is placed on its own OST then as each core writes to its own file you will achieve a concurrent use of the OSTs because of the IO structure of your program. File striping in this case could actually lead to an IO performance degradation because of the contention between the cores as they perform IO to the pieces of their files spread across the OSTs.
An application ideally suited to file striping would be one in which there is a large volume of IO but a single core performs all the IO. In this situation you will need to use striping to be able to use multiple OSTs concurrently.
However, striping has other disadvantages besides possible IO contention, and these must be considered when making your striping decisions. Many interactive file commands such as "ls -l" or unlink will take longer for striped files.
You use the lfs setstripe command to set the striping parameters for a file. You have to set the striping parameters for a file before you create it.
The format of the lfs setstripe command is:
$ lfs setstripe filename -c stripe-count
A value of -1 for the stripe count means the file should be spread across all the available OSTs.
For example, the command:
$ lfs setstripe bigfile.out -c -1
sets the stripe count for bigfile.out to be all available OSTs.
The command:
$ lfs setstripe manyfiles.out -c 1
sets a stripe count of 1. Each file will be placed on its own OST. This is suitable for the situation where each core writes its own file and you do not want to stripe these files.
You can also specify a directory instead of a filename in the lfs setstripe command. The result will be that each file created in that directory will have the indicated striping. You can override this striping by issuing an lfs setstripe command for individual files within that directory.
The kind of striping that is best for your files is very application dependent. Your application will probably fall between the two extreme cases discussed above. You will therefore need to experiment with several approaches to see which is best for your application. A value of -1 for stripe count will probably give you the best performance if you are going to use file striping, but you should try several values. The maximum value you can give for stripe count on Blacklight is currently 8.
There is a man page for lfs on Blacklight. Online documentation for Lustre is also available. If you want assistance with what striping strategy to follow send email to firstname.lastname@example.org.
For example, we recommend the FFTW library for FFTs. For linear algebra routines we recommend the MKL library.
PSC has established the Memory Advantage Program (MAP) to enable users to take advantage of Blacklight's unique capabilities. MAP includes consulting assistance from PSC, special queue handling if necessary and service unit discounts.
To participate in MAP you should send an email to email@example.com with a description of your scientific problem and any information you have on how effectively your program is currently using Blacklight's shared memory. A PSC Scientific Specialist will then contact you to troubleshoot problems, provide advice on the use of debugging and performance analysis tools and procedures, and offer suggestions on fixes and optimizations. During this consultation process you will be able to make benchmarking, debugging and test runs at a 50% discount for a period of up to 4 weeks.
You can also send email to firstname.lastname@example.org if you want optimization assistance in areas other than memory usage.
If you have questions about access to Gaussian send email to email@example.com.
You will also periodically receive email from PSC with information about Blacklight. To ensure that you receive this email, you should make sure your email forwarding is set properly by following the instructions for setting your email forwarding.
- If you are an XSEDE user you can send email to firstname.lastname@example.org, mentioning PSC in the subject line. You will get an acknowledgement from the XSEDE Operations Center, and then you will be contacted by PSC staff.
- You can call the User Services Hotline at 412-268-6350 from 9:00 a.m. until 5:00 p.m., Eastern time, Monday through Friday.