Tuesday, June 23, 2015

20150623Tue, 0624Wed, 0625Thu blacklight ms02 network permutation runs

Plan: Edit files locally, push to github.
          At blacklight, pull from github.

On Byte:
Download zip file from  https://github.com/hongqin/mactower-network-failure-simulation

$ scp mactower-network-failure-simulation-master.zip hqin2@data.psc.xsede.org:./.
... ...
mactower-network-failure-simulation-master.zip                                   7%   56MB   3.5MB/s   03:25 ETA

On blacklight
hqin2@tg-login1:/brashear/hqin2> pwd
/brashear/hqin2
hqin2@tg-login1:/brashear/hqin2> which unzip
/usr/bin/unzip
hqin2@tg-login1:/brashear/hqin2>  unzip /arc/users/hqin2/mactower-network-failure-simulation-master.zip

Archive:  /arc/users/hqin2/mactower-network-failure-simulation-master.zip
... ...

1:55pm. The unzipped directory is not a git repository, so I tried to git clone from the command line on blacklight. See https://help.github.com/articles/importing-a-git-repository-using-the-command-line/

git clone --bare https://github.com/hongqin/mactower-network-failure-simulation.git 
/*this does not work on blacklight, even though it works on Byte*/
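
A side note on the command above, independent of whatever made it fail on blacklight: `git clone --bare` creates a repository with no checked-out working files, so even a successful bare clone would not provide the R scripts or a directory to run `git pull` in. The sketch below uses a throwaway local repository (the temporary paths, the function name demo_clone_kinds, and the file name demo.R are made up for illustration) to show the difference from a plain `git clone`.

```shell
# Illustration only: compares a bare clone with a normal clone of a
# throwaway local repository. All paths here are temporary and hypothetical.
demo_clone_kinds() {
    tmp=$(mktemp -d) || return 1
    git init -q "$tmp/origin" || return 1
    echo 'print("hello")' > "$tmp/origin/demo.R"
    git -C "$tmp/origin" add demo.R
    git -C "$tmp/origin" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "add demo.R"

    git clone -q --bare "$tmp/origin" "$tmp/bare.git"   # history only, no files
    git clone -q "$tmp/origin" "$tmp/work"              # history plus working tree

    # Report whether the checked-out file exists in each clone.
    for d in bare.git work; do
        if [ -f "$tmp/$d/demo.R" ]; then
            echo "$d: demo.R present"
        else
            echo "$d: demo.R absent"
        fi
    done
    rm -rf "$tmp"
}
```

Calling demo_clone_kinds reports demo.R as absent in bare.git and present in work.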

/*try directory for input file through qsub */
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> ll
-rw-r--r--   1 hqin2 mc48o9p    199 2015-06-23 14:20 R.pbs
-rw-r--r--   1 hqin2 mc48o9p   1193 2015-06-23 14:18 test1.R

hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> cat R.pbs 
#!/bin/bash
#PBS -q batch
#PBS -l ncpus=16
#PBS -l walltime=0:03:00

source /usr/share/modules/init/bash
module load R
cd $PBS_O_WORKDIR

echo hostname

ja
R --slave CMD BATCH test1.R /*not right?*/
ja -chlst

2:27pm
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> qsub R.pbs 
461387.tg-login1.blacklight.psc.teragrid.org

hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> qstat -u hqin2

tg-login1.blacklight.psc.teragrid.org: 
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
461387.tg-login1     hqin2    batch_r  R.pbs          --   --    16    --  00:03 Q   -- 


R2.pbs
#!/bin/bash
#PBS -q batch
#PBS -l ncpus=16
#PBS -l walltime=0:03:00

source /usr/share/modules/init/bash
module load R

cd $SCRATCH
ja
R --slave CMD BATCH ./test1.R
ja -chlst

/*It took about 28 minutes for the job to finish */
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> ll
total 496
-rw-r--r--   1 hqin2 mc48o9p    176 2015-06-23 14:45 R2.pbs
-rw-------   1 hqin2 mc48o9p      0 2015-06-23 15:02 R2.pbs.e461389
-rw-------   1 hqin2 mc48o9p   5002 2015-06-23 15:03 R2.pbs.o461389
-rw-r--r--   1 hqin2 mc48o9p    195 2015-06-23 14:24 R.pbs
-rw-------   1 hqin2 mc48o9p      0 2015-06-23 15:02 R.pbs.e461387
-rw-------   1 hqin2 mc48o9p   5206 2015-06-23 15:03 R.pbs.o461387
-rw-r--r--   1 hqin2 mc48o9p    160 2015-06-23 18:18 test1.R
-rw-------   1 hqin2 mc48o9p    986 2015-06-23 15:03 test1.Rout


I then used test1.R, test2.R, and test3.R to generate more ms02 network models.
I need to pass these parameters to R through the command line.
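
One way to do this, and the form the later job scripts in these notes use, is `R -f script.R --args ...`; on the R side, the values after `--args` can be read with `commandArgs(trailingOnly = TRUE)`. The shell sketch below only builds and prints such a command line so it can be checked before it goes into a PBS file. The function name r_cmd and the script name model.R are placeholders, not files from this project.

```shell
# Print the R command line for one run; nothing is executed.
# r_cmd and model.R are hypothetical names used for illustration.
r_cmd() {
    script=$1
    first=$2
    last=$3
    # Refuse anything that is not a plain non-negative integer.
    for n in "$first" "$last"; do
        case "$n" in
            ''|*[!0-9]*)
                echo "usage: r_cmd script first last" >&2
                return 1
                ;;
        esac
    done
    if [ "$first" -gt "$last" ]; then
        echo "first index must not exceed last index" >&2
        return 1
    fi
    echo "R -f $script --args $first $last"
}

r_cmd model.R 302 500    # prints: R -f model.R --args 302 500
```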

I forgot to change the wall time for the first two job submissions.
How do I delete a qsub job?

hqin2@tg-login1:/brashear/hqin2/blacklight> qstat -u hqin2

tg-login1.blacklight.psc.teragrid.org: 
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
461426.tg-login1     hqin2    batch_r  test1.R        --   --    16    --  00:10 Q   -- 
461427.tg-login1     hqin2    batch_r  test2.pbs      --   --    16    --  00:03 Q   -- 
461435.tg-login1     hqin2    batch_r  test3.pbs      --   --    16    --  01:00 Q   -- 
hqin2@tg-login1:/brashear/hqin2/blacklight> qdel 461426.tg-login1
qdel: illegally formed job identifier: 461426.tg-login1
hqin2@tg-login1:/brashear/hqin2/blacklight> qdel 461426
hqin2@tg-login1:/brashear/hqin2/blacklight> qstat -u hqin2

tg-login1.blacklight.psc.teragrid.org: 
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
461427.tg-login1     hqin2    batch_r  test2.pbs      --   --    16    --  00:03 Q   -- 
461435.tg-login1     hqin2    batch_r  test3.pbs      --   --    16    --  01:00 Q   -- 
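
The transcript above suggests that `qdel` on this system wanted the bare job number rather than the `461426.tg-login1` form that `qstat` prints in its table. A small helper can strip everything from the first dot onward. This is a sketch based only on the one failure and one success shown here, and the function name job_number is made up.

```shell
# Reduce a qstat-style job id such as 461426.tg-login1 to its numeric part.
# job_number is a hypothetical helper, not a PBS command.
job_number() {
    echo "${1%%.*}"
}

job_number 461426.tg-login1    # prints 461426
# Possible use, not run here: qdel "$(job_number 461426.tg-login1)"
```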


/* I then changed the wall time and resubmitted the first 2 jobs */
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> qstat -u hqin2

tg-login1.blacklight.psc.teragrid.org: 
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
461435.tg-login1     hqin2    batch_r  test3.pbs      --   --    16    --  01:00 Q   -- 
461438.tg-login1     hqin2    batch_r  test1.pbs      --   --    16    --  01:00 Q   -- 
461439.tg-login1     hqin2    batch_r  test2.pbs      --   --    16    --  01:00 Q   -- 

There were problems with write.csv(): the relative output path did not work under qsub.
00:40am, I added an explicit path for the output file in test2.R.
00:44am qsub test2.pbs

3am. The jobs were run, after 1.5 hours in the queue.
hqin2@tg-login1:/brashear/hqin2> ll
total 32
-rw-------  1 hqin2 mc48o9p   69 2015-06-23 23:15 test1.Rout
-rw-------  1 hqin2 mc48o9p   69 2015-06-24 03:35 test2.Rout
-rw-------  1 hqin2 mc48o9p   69 2015-06-24 03:35 test3.Rout
hqin2@tg-login1:/brashear/hqin2> cat test2.Rout 
Fatal error: cannot open file './test2.R': No such file or directory
/* So, my pbs job submission file has path problems */
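
The error is consistent with the job starting in a directory other than the one holding test2.R, so `./test2.R` resolved against the wrong place. A defensive pattern, sketched here with a made-up helper name enter_job_dir, is to change into an absolute directory and check that the script exists before calling R, so a wrong path fails immediately with a clear message. `PBS_O_WORKDIR` is the directory `qsub` was run from; the fallback to the current directory in the usage comment is only there so the lines also make sense outside PBS.

```shell
# cd into the job directory and verify the R script is there.
# enter_job_dir is a hypothetical helper, not part of PBS.
enter_job_dir() {
    dir=$1
    script=$2
    cd "$dir" || { echo "cannot cd to $dir" >&2; return 1; }
    if [ ! -f "$script" ]; then
        echo "missing $script in $(pwd)" >&2
        return 1
    fi
    echo "running in $(pwd)"
}

# In a job script this might read:
#   enter_job_dir "${PBS_O_WORKDIR:-$PWD}" test2.R || exit 1
#   R -f test2.R > test2.dump.txt
```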

4pm. Still no output files in my intended directory:

/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI

4:36pm
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> cat test1.pbs
#!/bin/bash
#PBS -q batch
#PBS -l ncpus=16
#PBS -l walltime=0:05:00

source /usr/share/modules/init/bash
module load R

echo hostname

pwd
cd $SCRATCH/mactower-network-failure-simulation-master/ms02GINPPI
pwd

ja
R -f  test1.R > test1.dump.txt
ja -chlst

4:38pm
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> qsub test1.pbs
461579.tg-login1.blacklight.psc.teragrid.org
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> qstat -u hqin2

tg-login1.blacklight.psc.teragrid.org: 
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
461579.tg-login1     hqin2    batch_r  test1.pbs      --   --    16    --  00:05 Q   --

This worked.
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> ll -ht
total 532K
drwxr-xr-x 102 hqin2 mc48o9p 4.0K 2015-06-24 17:47 dipgin.ms02.output
-rw-------   1 hqin2 mc48o9p 1.9K 2015-06-24 17:47 test1.dump.txt
-rw-------   1 hqin2 mc48o9p 4.8K 2015-06-24 17:47 test1.pbs.o461579
-rw-------   1 hqin2 mc48o9p   39 2015-06-24 17:47 test1.pbs.e461579
-rw-r--r--   1 hqin2 mc48o9p  255 2015-06-24 16:36 test1.pbs
-rw-r--r--   1 hqin2 mc48o9p  784 2015-06-24 16:31 test1.R
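
One detail in test1.pbs above: `echo hostname` writes the literal word hostname to the job log, not the name of the machine. If the intent is to record which node ran the job, the command has to be run rather than echoed. The sketch uses `uname -n`, the POSIX way to print the node name, so it should run on any system.

```shell
echo hostname                 # prints the word: hostname
echo "node: $(uname -n)"      # prints "node: " followed by this machine's name
```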

June 25, 2015
I wrote a new ms02 script that takes its parameters on the command line, and copied it to blacklight with scp.
-rw-r--r--   1 hqin2 mc48o9p 1.9K 2015-06-25 00:39 ms02-2015June24.R

hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> cat ms02.pbs 
#!/bin/bash
#PBS -q batch
#PBS -l ncpus=16
#PBS -l walltime=0:30:00

source /usr/share/modules/init/bash
module load R

echo hostname

pwd
cd $SCRATCH/mactower-network-failure-simulation-master/ms02GINPPI
pwd

ja
R -f ms02-2015June24.R --args 302 500

ja -chlst

00:49am 
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> qsub ms02.pbs 
461610.tg-login1.blacklight.psc.teragrid.org

hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> qstat -u hqin2

tg-login1.blacklight.psc.teragrid.org: 
                                                                    Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
461610.tg-login1     hqin2    batch_r  ms02.pbs    177131  --    16    --  00:30 R   -- 


Total cpus requested from running jobs: 16

I also created two more submission files, ms02b.pbs and ms02c.pbs.

hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> ll -th 
total 564K
-rw-------   1 hqin2 mc48o9p   94 2015-06-25 03:31 ms02c.pbs.e461613
-rw-------   1 hqin2 mc48o9p   94 2015-06-25 03:31 ms02b.pbs.e461612
drwxr-xr-x 210 hqin2 mc48o9p 4.0K 2015-06-25 03:30 dipgin.ms02.output
-rw-------   1 hqin2 mc48o9p   94 2015-06-25 01:20 ms02.pbs.e461610
-rw-------   1 hqin2 mc48o9p 2.7K 2015-06-25 01:00 ms02c.pbs.o461613
-rw-r--r--   1 hqin2 mc48o9p  263 2015-06-25 01:00 ms02c.pbs
-rw-------   1 hqin2 mc48o9p 2.7K 2015-06-25 01:00 ms02b.pbs.o461612
-rw-r--r--   1 hqin2 mc48o9p  262 2015-06-25 00:59 ms02b.pbs
-rw-------   1 hqin2 mc48o9p 2.7K 2015-06-25 00:49 ms02.pbs.o461610
-rw-r--r--   1 hqin2 mc48o9p  262 2015-06-25 00:47 ms02.pbs
-rw-r--r--   1 hqin2 mc48o9p 1.9K 2015-06-25 00:45 ms02-2015June24.R

It looks like my wall time is too short.

hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> cat ms02.pbs.e461610
[Previously saved workspace restored]


=>> PBS: job killed: walltime 1865 exceeded limit 1800
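
PBS reports the limit in seconds, while the job script requests it as hours:minutes:seconds: `walltime=0:30:00` is the 1800 in the message, and the job was killed at 1865 seconds. A quick conversion helper (the name walltime_seconds is made up) makes it easier to compare a request against such messages.

```shell
# Convert an H:MM:SS walltime string to seconds.
# awk is used so that fields like "08" are read as decimal numbers.
walltime_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

walltime_seconds 0:30:00    # prints 1800
walltime_seconds 4:00:00    # prints 14400
```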

My estimates:
30 minutes for 15 ms02 models.
150 minutes for 100 ms02 models.
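
As a rough cross-check of these figures, assuming run time grows linearly with the number of models (which these notes do not establish): 30 minutes for 15 models is 2 minutes per model, so 100 models would need about 200 minutes, somewhat more than the 150 above. Either figure fits in a walltime request of a few hours. The arithmetic, with a hypothetical helper name:

```shell
# Linear extrapolation of run time in whole minutes (rounded up).
# estimate_minutes is a hypothetical helper; linear scaling is an assumption.
estimate_minutes() {
    known_minutes=$1
    known_models=$2
    target_models=$3
    echo $(( (known_minutes * target_models + known_models - 1) / known_models ))
}

estimate_minutes 30 15 100    # prints 200
```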

Based on these estimates, I submitted 8 jobs, each requesting 4 hours of walltime.
hqin2@tg-login1:/brashear/hqin2/mactower-network-failure-simulation-master/ms02GINPPI> grep args *pbs
ms02b.pbs:R -f ms02-2015June24.R --args 310 400
ms02c.pbs:R -f ms02-2015June24.R --args 401 500
ms02d.pbs:R -f ms02-2015June24.R --args 501 600
ms02e.pbs:R -f ms02-2015June24.R --args 601 700
ms02f.pbs:R -f ms02-2015June24.R --args 799 800
ms02g.pbs:R -f ms02-2015June24.R --args 800 900
ms02h.pbs:R -f ms02-2015June24.R --args 900 1000
ms02.pbs:R -f ms02-2015June24.R --args 100 200
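
The eight submission files differ only in the two numbers after `--args`. If both numbers are inclusive bounds (an assumption; it depends on how ms02-2015June24.R uses them), adjacent ranges that share an endpoint, such as 800 900 and 900 1000, would build that model twice. A small generator for consecutive non-overlapping ranges, with the made-up name print_ranges, could feed a loop that writes the PBS files. The 301 to 1000 span below is only an example.

```shell
# Print "first last" pairs covering start..end in chunks of at most `step`,
# with no index shared between chunks. Bounds are treated as inclusive.
# print_ranges is a hypothetical helper.
print_ranges() {
    first=$1
    end=$2
    step=$3
    while [ "$first" -le "$end" ]; do
        last=$(( first + step - 1 ))
        if [ "$last" -gt "$end" ]; then
            last=$end
        fi
        echo "$first $last"
        first=$(( last + 1 ))
    done
}

# Show the R command each generated job would run; nothing is submitted.
print_ranges 301 1000 100 | while read -r first last; do
    echo "R -f ms02-2015June24.R --args $first $last"
done
```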

