Tuesday, January 13, 2015

PSC blacklight trial, 20150113


Instructions: 
"Once you login you will be in your $HOME directory (/usr/users/1/hqin2) which is backed up but has a quota of 5 Gbytes. You also have access to a $SCRATCH directory (/brashear/hqin2) which has essentially unlimited storage  and is not backed up. Files in $SCRATCH may be removed, oldest first, to make room when needed, though we try to keep  them for 2-weeks at least.

There is a file archiver, which you can access as the directory /arc/users/hqin2/ from the login node, where you can store whatever you need to keep long-term (while your allocation is active, of course). You can also connect to the archiver via sftp, at data.psc.edu. You can use Fugu or any other graphical user interface if you prefer. This is the simplest way to transfer files to PSC: you can see them in the /arc directory from the login node and copy them to/from the $HOME or $SCRATCH directory as needed.

When you run and write data, we prefer that you write to $SCRATCH, which is a distributed file system and can handle the load, and not to $HOME."
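
To make the storage advice concrete, here is a minimal sketch of the workflow as I read it from the instructions above: write job output under $SCRATCH, then copy whatever is worth keeping to the archiver. The directory trial_run, the file result.txt, and the ARCHIVE variable are placeholders of mine, not PSC conventions. The fallbacks to temporary directories are there only so the snippet also runs on a machine that has no $SCRATCH or /arc.

```shell
#!/bin/bash
# On Blacklight $SCRATCH is /brashear/hqin2; elsewhere fall back to a temp dir.
SCRATCH="${SCRATCH:-$(mktemp -d)}"
# ARCHIVE is a hypothetical variable; on Blacklight it would be /arc/users/hqin2.
ARCHIVE="${ARCHIVE:-$(mktemp -d)}"

run_dir="$SCRATCH/trial_run"            # hypothetical working directory
mkdir -p "$run_dir"

# Stand-in for a real job writing its (possibly large) output to scratch.
echo "placeholder result" > "$run_dir/result.txt"

# Scratch files may be purged, so copy anything worth keeping to the archiver.
cp "$run_dir/result.txt" "$ARCHIVE/"
ls -l "$ARCHIVE/result.txt"
```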


hqin2@tg-login1:~> echo $HOME
/usr/users/1/hqin2
hqin2@tg-login1:~> echo $SCRATCH
/brashear/hqin2
hqin2@tg-login1:~> du /arc/users/hqin2
2 /arc/users/hqin2
hqin2@tg-login1:~> df /arc/users/hqin2
Filesystem           1K-blocks      Used Available Use% Mounted on
/arc                 3656882477312 2021505932032 1635376545280  56% /arc
hqin2@tg-login1:~> df -h /arc/users/hqin2
Filesystem            Size  Used Avail Use% Mounted on
/arc                  3.4P  1.9P  1.5P  56% /arc
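
As a sanity check on the two df listings, the 1K-block counts can be converted to pebibytes by dividing by 1024^4. This is only a sketch with awk, and the numbers are copied from the output above. It gives about 3.33, 1.84, and 1.49 PiB; GNU df -h rounds up by default, which would explain the 3.4P, 1.9P, and 1.5P it prints.

```shell
#!/bin/bash
# 1K-block counts copied from the `df /arc/users/hqin2` output above.
total_kb=3656882477312
used_kb=2021505932032
avail_kb=1635376545280

# 1 PiB = 1024^4 KiB; awk does the floating-point division.
to_pib() { awk -v kb="$1" 'BEGIN { printf "%.2f", kb / (1024 ^ 4) }'; }

echo "size:  $(to_pib "$total_kb") PiB"
echo "used:  $(to_pib "$used_kb") PiB"
echo "avail: $(to_pib "$avail_kb") PiB"
```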


Instructions:
"Look at this webpage:
http://www.psc.edu/index.php/computing-resources/blacklight

It has examples of scripts for running batch jobs. In particular, I think you will want to run an 'interactive batch job' to check that your code works.

    qsub -I -l ncpus=16 -l walltime=0:30:00 -q debug

once you get a prompt, you are on the 'backend', or 'compute node', i.e. Blacklight proper, and everything runs there, not on the login node.

Let's say  I have a trivial R example:

y <- rnorm(10)
print(y)

this is saved in a file (example.R), and I want to run it. So I type the 'qsub ...' command above and, after I get an interactive prompt, enter the following:

source /usr/share/modules/init/bash
module load R

R --slave CMD BATCH ./example.R

and the output appears in 'example.Rout'.  OK, so I'm done. To get out of the 'compute node', I type 'exit' and press enter.

The first line (source ...) loads the definition of the 'module' command, the second uses the module command to put (a version of) R in my path, and the last executes the R script in batch mode.

Once I have figured out that everything is working, I can run the script in full batch mode (non-interactively) by putting this into a PBS script, i.e. a file, let's call it 'R.pbs':

#!/bin/bash
#PBS -q batch
#PBS -l ncpus=16
#PBS -l walltime=0:03:00

source /usr/share/modules/init/bash
module load R
cd $PBS_O_WORKDIR

ja
R --slave CMD BATCH ./example.R
ja -chlst

So you are just entering the commands you typed interactively, after a line that indicates what 'shell' you want to run under, and some options to the batch scheduler (the number of cores and the minutes, which you had entered on the command line before). What is new is the "cd $PBS_O_WORKDIR", which makes the script start in whatever directory you were in when you submitted the job. Also new are the lines "ja" and "ja -chlst" surrounding the call to R. They are not essential, but they collect useful information on the job (maximum amount of memory, time spent, cpu time used, etc.).

So you have this script called 'R.pbs',  and you can submit it to the scheduler with the command

    qsub R.pbs

The scheduler will reply with something like:
394363.tg-login1.blacklight.psc.teragrid.org

the number is the 'job ID' of your PBS job, which you can use to ask for more information from the scheduler.  You can always ask it 'what jobs do I have in the queue' like this:

    qstat -u hqin2

and it will list them all, together with their state (R means running, Q means it is still in the queue). If it lists nothing, it means all your jobs have completed. After the job completes, a couple of files should appear in the directory where you put the script. Since I didn't use any option to give the job a name, the files would be named {script name}.e#### and {script name}.o####; in the example that would be R.pbs.o#### and R.pbs.e####. The 'o' file has any output that the job would write to the standard output, the 'e' file anything that would normally go to the standard error. You can also redirect output from any command in the job script to a file. "
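
The last two remarks in the instructions (naming the job, and redirecting a command's output to a file) could look like the sketch below. I have not submitted this version. The job name rnorm_test and the log file example.log are placeholders of mine; -N is the standard PBS option for naming a job. The guards around $PBS_O_WORKDIR, the modules init file, and R are there only so the script also runs harmlessly outside of PBS.

```shell
#!/bin/bash
#PBS -q batch
#PBS -l ncpus=16
#PBS -l walltime=0:03:00
#PBS -N rnorm_test
# With -N, the scheduler files should be named rnorm_test.o#### and rnorm_test.e####.

# Start where the job was submitted; fall back to the current directory outside PBS.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Define the 'module' command and load R, if the modules init file is present.
if [ -r /usr/share/modules/init/bash ]; then
    source /usr/share/modules/init/bash
    module load R
fi

# Redirect this command's stdout and stderr to a file (placeholder name).
if command -v R >/dev/null 2>&1; then
    R --slave -f example.R > example.log 2>&1
else
    echo "R not found on PATH" > example.log
fi
```

It would be submitted the same way as before, with qsub followed by the script's file name.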

source /usr/share/modules/init/bash
module load R
R --slave CMD BATCH ./example.R
hqin2@tg-login1:~> ll example.R* #output is example.Rout
-rw-r--r-- 1 hqin2 mc48o9p  24 2015-01-13 20:47 example.R
-rw-r--r-- 1 hqin2 mc48o9p 942 2015-01-13 20:48 example.Rout
hqin2@tg-login1:~> nano -w R.pbs
hqin2@tg-login1:~> pwd
/usr/users/1/hqin2
hqin2@tg-login1:~> qsub R.pbs 
418673.tg-login1.blacklight.psc.teragrid.org

hqin2@tg-login1:~> qstat -u hqin2

tg-login1.blacklight.psc.teragrid.org: 
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
418673.tg-login1     hqin2    batch_r  R.pbs          --   --    16    --  00:03 Q   -- 
hqin2@tg-login1:~> 
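
For scripting, the numeric job ID can be cut out of the qsub reply, and from it the names of the files the scheduler will leave behind. A small sketch, using the reply shown above as a fixed string so that it runs without a scheduler; on the real system the reply would come from $(qsub R.pbs). The variable names are mine.

```shell
#!/bin/bash
# Reply copied from the transcript above; in practice: reply=$(qsub R.pbs)
reply="418673.tg-login1.blacklight.psc.teragrid.org"
script="R.pbs"

# Everything before the first dot is the numeric job ID.
jobid="${reply%%.*}"

# Without a job name, the files are {script name}.o#### and {script name}.e####.
out_file="${script}.o${jobid}"
err_file="${script}.e${jobid}"

echo "job ID: $jobid"
echo "stdout: $out_file"
echo "stderr: $err_file"
```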

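Nothing was in the output file, so I changed the run line in the script to "R -f example.R" (saved as R2.pbs). Note that R CMD BATCH writes its results to example.Rout rather than to standard output, whereas "R -f" prints to standard output, which PBS captures in the .o file.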

hqin2@tg-login1:~/test> ls
example.R  R2.pbs
hqin2@tg-login1:~/test> ll
total 8
-rw-r--r-- 1 hqin2 mc48o9p  24 2015-01-13 22:33 example.R
-rw-r--r-- 1 hqin2 mc48o9p 199 2015-01-13 22:33 R2.pbs
hqin2@tg-login1:~/test> qsub R2.pbs 
418692.tg-login1.blacklight.psc.teragrid.org
hqin2@tg-login1:~/test> qstat -u hqin2

tg-login1.blacklight.psc.teragrid.org: 
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
418692.tg-login1     hqin2    batch_r  R2.pbs         --   --    16    --  00:03 Q   -- 
hqin2@tg-login1:~/test> cat R2.pbs 
#!/bin/bash
#PBS -q batch
#PBS -l ncpus=16
#PBS -l walltime=0:03:00

source /usr/share/modules/init/bash
module load R
cd $PBS_O_WORKDIR

ja
#R --slave CMD BATCH ./example.R
R -f example.R

ja -chlst


hqin2@tg-login1:~/test> ll
total 16
-rw-r--r-- 1 hqin2 mc48o9p   24 2015-01-13 22:33 example.R
-rw-r--r-- 1 hqin2 mc48o9p  199 2015-01-13 22:33 R2.pbs
-rw------- 1 hqin2 mc48o9p    0 2015-01-13 23:13 R2.pbs.e418692
-rw------- 1 hqin2 mc48o9p 4905 2015-01-13 23:13 R2.pbs.o418692
hqin2@tg-login1:~/test> cat R2.pbs.o418692 

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> y = rnorm(10)
> print (y)
 [1] -0.46271891  0.34547494 -0.97556883 -0.64659599  0.01052027  0.06472313
 [7]  0.43858725  0.83961732 -0.74945123  0.15012829


Job Accounting - Command Report
===============================

    Command       Started    Elapsed    User CPU    Sys CPU       CPU      Block I/O    Swap In      CPU MEM        Characters           Logical I/O      CoreMem   VirtMem   Ex
     Name           At       Seconds    Seconds     Seconds    Delay Secs  Delay Secs  Delay Secs  Avg Mbytes     Read     Written     Read      Write    HiValue   HiValue   St   Ni  Fl   SBU's 
===============  ========  ==========  ==========  ==========  ==========  ==========  ==========  ==========  =========  =========  ========  ========  ========  ========  ===  ===  ==  =======
# CFG   ON(    1) (    7)  23:13:32 01/13/2015  System:  Linux bl0.psc.teragrid.org 2.6.32.49-0.3-default #1 SMP 2011-12-02 11:28:04 +0100 x86_64
ja               23:13:32        0.31        0.00        0.00        0.00        0.00        0.00        0.85      0.019      0.000        19         3      1064     23780    0    0         0.00
uname            23:13:32        0.00        0.00        0.00        0.00        0.00        0.00       12.64      0.004      0.000         8         1       664      5316    0    0         0.00
R                23:13:32        0.00        0.00        0.01        0.00        0.00        0.00        0.00      0.000      0.000         0         1       884     12616    0    0  F      0.00
sed              23:13:32        0.00        0.00        0.01        0.00        0.00        0.00        0.00      0.004      0.000        10         1       816      5396    0    0         0.00
R                23:13:32        0.00        0.00        0.01        0.00        0.00        0.00        0.00      0.000      0.000         0         1       888     12616    0    0  F      0.00
sed              23:13:32        0.00        0.00        0.01        0.00        0.00        0.00        0.00      0.004      0.000        10         1       812      5396    0    0         0.00
R                23:13:32        0.00        0.00        0.01        0.00        0.00        0.00        0.00      0.000      0.000         0         0       856     12612    0    0  F      0.00
rm               23:13:33        0.01        0.00        0.00        0.00        0.00        0.00        0.96      0.012      0.000        20         0       712      5336    0    0         0.00
R                23:13:33        0.35        0.22        0.08        0.00        0.00        0.00       70.16      4.166      0.001       190        25     32412     75240    0    0         0.00


Job CSA Accounting - Summary Report
====================================

Job Accounting File Name         : /dev/tmpfs/418692/.jacct65df3
Operating System                 : Linux bl0.psc.teragrid.org 2.6.32.49-0.3-default #1 SMP 2011-12-02 11:28:04 +0100 x86_64
User Name (ID)                   : hqin2 (51231)
Group Name (ID)                  : mc48o9p (15132)
Project Name (ID)                : ? (0)
Job ID                           : 0x65df3
Report Starts                    : 01/13/15 23:13:32
Report Ends                      : 01/13/15 23:13:33
Elapsed Time                     :            1      Seconds
User CPU Time                    :            0.2200 Seconds
System CPU Time                  :            0.1090 Seconds
CPU Time Core Memory Integral    :            5.2741 Mbyte-seconds
CPU Time Virtual Memory Integral :           15.2699 Mbyte-seconds
Maximum Core Memory Used         :           31.6523 Mbytes
Maximum Virtual Memory Used      :           73.4766 Mbytes
Characters Read                  :            4.2103 Mbytes
Characters Written               :            0.0012 Mbytes
Logical I/O Read Requests        :          257
Logical I/O Write Requests       :           33
CPU Delay                        :            0.0030 Seconds
Block I/O Delay                  :            0.0002 Seconds
Swap In Delay                    :            0.0000 Seconds
Number of Commands               :            9
System Billing Units             :            0.0000

hqin2@tg-login1:~/test> 
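
The summary report is regular enough to pull numbers out of it with awk. The sketch below uses four lines copied from the report above in a here-document; against a real job one would read the .o file instead (R2.pbs.o418692 in this case). The variable names are mine.

```shell
#!/bin/bash
# Four lines copied verbatim from the summary report above.
report=$(cat <<'EOF'
Elapsed Time                     :            1      Seconds
User CPU Time                    :            0.2200 Seconds
System CPU Time                  :            0.1090 Seconds
Maximum Core Memory Used         :           31.6523 Mbytes
EOF
)

# Field 2 (after the colon) starts with the number; adding 0 makes awk keep only that.
max_mem=$(printf '%s\n' "$report" |
    awk -F':' '/^Maximum Core Memory Used/ { print $2 + 0 }')

# Total CPU time = user + system.
cpu_total=$(printf '%s\n' "$report" |
    awk -F':' '/^(User|System) CPU Time/ { s += $2 } END { printf "%.4f", s }')

echo "max core memory: $max_mem Mbytes"
echo "total CPU time:  $cpu_total seconds"
```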


Note: I compared today's R.pbs with job1.sh from 20150112. The line "source /usr/share/modules/init/bash" seems to be critical: it makes sure that the "module" command is recognized.
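
A defensive version of that line, as a sketch: check whether 'module' is already defined, and source the init file only if it is not. The path is the one from the PSC instructions; on other systems the init file may live elsewhere, so the script just reports when it cannot find it. The variable names are mine.

```shell
#!/bin/bash
init_file=/usr/share/modules/init/bash   # path from the PSC instructions

if type module >/dev/null 2>&1; then
    status="module already defined"
elif [ -r "$init_file" ]; then
    # Sourcing the init file defines the 'module' shell function.
    source "$init_file"
    status="module defined by sourcing $init_file"
else
    status="no module command and no init file at $init_file"
fi
echo "$status"
```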
