Blacklight is a different development environment from what most of you are used to. This article should help ease the pain of learning how to use blacklight. Note that it is not a substitute for reading the information about blacklight.
Creating a blacklight account
If you haven't created a blacklight account, do so immediately: there is a latency of several business days between signing up for an account and gaining access. See this piazza post for more information on how to sign up.
Setting up your blacklight environment
Blacklight uses the module system to manage the development environment. To get everything set up for this course, you just need to run:
module load openmpi/1.6/gnu
module load python
module load mpt
You might want to add these lines to your ~/.bashrc or equivalent.
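If you do put the module commands in your ~/.bashrc, it is worth guarding them so the file still sources cleanly on machines without the module system. A sketch of such a fragment (the init-script path is the same one used by the job scripts below):

```shell
# Load the module system only if it exists on this machine,
# so this ~/.bashrc fragment is harmless everywhere else.
if [ -f /usr/share/modules/init/bash ]; then
    source /usr/share/modules/init/bash
    module load openmpi/1.6/gnu
    module load python
    module load mpt
fi
```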
Running on blacklight
Blacklight lets you connect via ssh to a head node, where you can do some light development and submit jobs to the job queue. Since anything you run on the head node can disrupt the jobs running in the queues, it is really impolite to let any compute-heavy process run there for any length of time. Instead, you can use a special debug queue with (relatively) quick turnaround times. Note that jobs in the debug queue are limited and can run with only 16 processors.
Here is an example job:
#!/bin/bash
# Request 128 cpus. ncpus must be a multiple of 16.
#PBS -l ncpus=128
# Limit to 20 minutes of walltime.
#PBS -l walltime=20:00
# Merge stdout and stderr into one output file.
#PBS -j oe
# Run in the batch queue.
#PBS -q batch
# Use the name wsp.job.
#PBS -N wsp.job
# Load omplace
source /usr/share/modules/init/bash
module load mpt
# Move to my $SCRATCH directory.
cd $SCRATCH
# Set this to your working directory.
execdir=$HOME/asst3/prog1_wsp_openmp
# Copy executable to $SCRATCH.
cp $execdir/wsp wsp
# Copy the input into $SCRATCH.
mkdir -p input
cp $execdir/input/dist17 .
# Run my executable with a thread per core, pinning each thread
# to a unique processor.
omplace -nt $PBS_NCPUS ./wsp -i input/dist17
Pay attention to a couple of things:

- The top of the file has a lot of comments of the form #PBS flag_to_qsub. These are used to control some parameters of your job. You can also pass these parameters directly as arguments to qsub:

  qsub -l ncpus=16 -q debug wsp.job

- You use the $SCRATCH directory to store your temporary input files, etc.
- On blacklight, we use omplace to control OpenMP. omplace wraps the program and pins each thread to a unique processor. Also remember that on blacklight you effectively have dedicated access to each CPU.
- All paths are relative to execdir. If you did not install your code in ~/asst3, you will need to modify this variable.
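Since the #PBS comments and the qsub command line are two routes to the same settings, it can help to check exactly which directives a job file embeds before submitting it. A quick way to list them (the job file contents here are abridged from the example above):

```shell
# Write a small job file (directives abridged from the example above).
cat > wsp.job <<'EOF'
#!/bin/bash
#PBS -l ncpus=128
#PBS -l walltime=20:00
#PBS -j oe
#PBS -q batch
#PBS -N wsp.job
omplace -nt $PBS_NCPUS ./wsp -i input/dist17
EOF

# List only the embedded qsub directives.
grep '^#PBS' wsp.job
```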
Here is a script to run timer.bash:
#!/bin/bash
# Request 128 cpus. ncpus must be a multiple of 16.
#PBS -l ncpus=128
# Limit to 20 minutes of walltime.
#PBS -l walltime=20:00
# Merge stdout and stderr into one output file.
#PBS -j oe
# Run in the batch queue.
#PBS -q batch
# Use the name timer.job.
#PBS -N timer.job
# Set this to your working directory.
execdir=$HOME/asst3/prog1_wsp_openmp
# Load omplace and python 2.7.
source /usr/share/modules/init/bash
module load mpt
module load python
# Move to my $SCRATCH directory.
cd $SCRATCH
# Copy executable and necessary files to $SCRATCH.
cp $execdir/wsp wsp
# Copy over the scoreboard and quit early if we fail to do so.
cp $execdir/scoreboard_token .
if [ $? -ne 0 ]
then
exit 1
fi
# The script depends on mkinput.py to generate input.
cp $execdir/mkinput.py .
# The script depends on the existence of an input directory.
mkdir -p input
# The script depends on the existence of a Makefile with a run target.
echo "run: " > Makefile
# Copy over the timer script, modifying a few variables:
# set num_threads to the actual number of CPUS used
# set num_iterations to 1, since we assume we have full access to the CPU
# use omplace to control openmp and pin jobs to CPUS
cat $execdir/timer.bash | sed 's/num_threads=.*/num_threads=$PBS_NCPUS/' \
| sed 's/num_iterations=.*/num_iterations=1/' \
| sed 's/OMP_NUM_THREADS=/omplace -nt /' \
> timer.bash
# Run the script.
bash timer.bash
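The sed pipeline above rewrites three lines of timer.bash. Here is a small demonstration of its effect on made-up timer.bash contents (the variable names match the ones the pipeline targets); note that the single quotes keep $PBS_NCPUS literal in the rewritten file, so it is only expanded later, when the job actually runs:

```shell
# Made-up stand-in for timer.bash, containing the three kinds
# of lines the pipeline rewrites.
cat > timer_orig.bash <<'EOF'
num_threads=8
num_iterations=5
OMP_NUM_THREADS=$num_threads ./wsp -i input/dist17
EOF

# The same transformation as in the job script above.
cat timer_orig.bash | sed 's/num_threads=.*/num_threads=$PBS_NCPUS/' \
    | sed 's/num_iterations=.*/num_iterations=1/' \
    | sed 's/OMP_NUM_THREADS=/omplace -nt /' \
    > timer.bash

# The result: threads come from PBS, one iteration, omplace wraps the run.
cat timer.bash
```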
In the assignment directory, there is a tarball of some useful job scripts.
/afs/cs/academic/class/15418-s13/assignments/jobs.tgz
Using the queue
There are a couple of useful commands for dealing with the job queue. You should probably man every one of these.
qsub: qsub is how you launch jobs. Feel free to learn more, but the basic usage is pretty simple:

$ qsub my_job_file.job
231778.tg-login1.blacklight.psc.teragrid.org

On a successful launch, it reports the job identifier.
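The numeric prefix of that identifier is what commands like qdel expect. A sketch of capturing it in a script (the identifier string here is just the example value from above):

```shell
# Job identifier as printed by qsub (example value from above).
jobid="231778.tg-login1.blacklight.psc.teragrid.org"

# Strip everything after the first dot to get the numeric id,
# suitable for a later "qdel $jobnum".
jobnum=${jobid%%.*}
echo "$jobnum"
```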
qstat: qstat is how you get information about the job queue. It is used in the following way:

$ qstat -a

tg-login1.blacklight.psc.teragrid.org:
                                                                Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
228983.tg-login1     blood    res_2    RMI_RNA_st 328959   --  192     -- 400:0 R 342:5
228988.tg-login1     blood    res_2    BAB_RNA_al 336758   --  128     -- 400:0 R 342:5
... <snip> ...
230995.tg-login1     gcommuni batch_l1 nz2            --   --   16 19000m 95:30 Q    --
230996.tg-login1     gcommuni batch_l1 nz3            --   --   16 19000m 95:30 Q    --

Total cpus requested from running jobs: 3536
Note that you can use this to see your approximate place in the queue:
qstat -a | awk '{print NR" "$0}' | grep $USER
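The awk stage of that one-liner simply prefixes each line of output with its line number, so grepping for your username shows how deep in the listing your jobs sit. A demonstration on fabricated qstat-style output (the usernames and job ids here are made up):

```shell
# Fabricated qstat -a listing: three queued jobs from two users.
qstat_out='230995.tg-login1 alice batch_l1 nz2 -- -- 16 19000m 95:30 Q --
230996.tg-login1 bob batch_l1 nz3 -- -- 16 19000m 95:30 Q --
230997.tg-login1 alice batch_l1 nz4 -- -- 16 19000m 95:30 Q --'

# Number every line, then keep only one user's jobs; the leading
# number is that job's position in the listing.
echo "$qstat_out" | awk '{print NR" "$0}' | grep alice
```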
qdel: qdel is used to cancel a job. If you accidentally launch a job you didn't mean to, you can always cancel it with:

$ qdel 231778