Blacklight is a different development environment from what most of you are used to. This article should help ease the pain of learning how to use blacklight. Note that it is not a substitute for reading the information about blacklight.
Creating a blacklight account
If you haven't created a blacklight account, do so immediately: there is a latency of several business days between signing up for an account and gaining access. See this piazza post for more information on how to sign up.
Setting up your blacklight environment
Blacklight uses the module system to manage the development environment. To get everything set up for this course, you just need to run:
module load openmpi/1.6/gnu
module load python
module load mpt
You might want to add these lines to your ~/.bashrc or equivalent.
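If you do put the module commands in your ~/.bashrc, it is worth guarding them so the file still sources cleanly on machines without the module system. A sketch of such a fragment (the init-script path is the same one used by the job scripts below):

```shell
# Load the module system only if it exists on this machine,
# so this ~/.bashrc fragment is harmless everywhere else.
if [ -f /usr/share/modules/init/bash ]; then
    source /usr/share/modules/init/bash
    module load openmpi/1.6/gnu
    module load python
    module load mpt
fi
```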
Running on blacklight
Blacklight lets you connect via ssh to a head node, where you can do some light development and submit jobs to the job queue. Since anything you run on the head node can disrupt the jobs running in the queues, it is really impolite to let any compute-heavy process run there for any length of time. Instead, you can use a special debug queue with (relatively) quick turnaround times. Note that jobs in the debug queue are limited and can run with only 16 processors.
Here is an example job:
#!/bin/bash
# Request 128 cpus. ncpus must be a multiple of 16.
#PBS -l ncpus=128
# Limit to 20 minutes of walltime.
#PBS -l walltime=20:00
# Merge stdout and stderr into one output file.
#PBS -j oe
# Run in the batch queue.
#PBS -q batch
# Use the name wsp.job.
#PBS -N wsp.job
# Load omplace
source /usr/share/modules/init/bash
module load mpt
# Move to my $SCRATCH directory.
cd $SCRATCH
# Set this to your working directory.
execdir=$HOME/asst3/prog1_wsp_openmp
# Copy executable to $SCRATCH.
cp $execdir/wsp wsp
# Copy the input into $SCRATCH.
mkdir -p input
cp $execdir/input/dist17 .
# Run my executable with a thread per core, pinning each thread
# to a unique processor.
omplace -nt $PBS_NCPUS ./wsp -i input/dist17
Pay attention to a couple of things:

- The top of the file has a lot of comments of the form #PBS flag_to_qsub. These are used to control some parameters of your job. You can also pass these parameters directly as arguments to qsub:

  qsub -l ncpus=16 -q debug wsp.job

- You use the $SCRATCH directory to store your temporary input files, etc.
- On blacklight, we use omplace to control OpenMP. omplace wraps the program and pins each thread to a unique processor. Also remember that on blacklight you effectively have dedicated access to each CPU.
- All paths are relative to execdir. If you did not install your code in ~/asst3, you will need to modify this variable.
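Since the #PBS comments and the qsub command line are two routes to the same settings, it can help to check exactly which directives a job file embeds before submitting it. A quick way to list them (the job file contents here are abridged from the example above):

```shell
# Write a small job file (directives abridged from the example above).
cat > wsp.job <<'EOF'
#!/bin/bash
#PBS -l ncpus=128
#PBS -l walltime=20:00
#PBS -j oe
#PBS -q batch
#PBS -N wsp.job
omplace -nt $PBS_NCPUS ./wsp -i input/dist17
EOF

# List only the embedded qsub directives.
grep '^#PBS' wsp.job
```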
Here is a script to run timer.bash:
#!/bin/bash
# Request 128 cpus. ncpus must be a multiple of 16.
#PBS -l ncpus=128
# Limit to 20 minutes of walltime.
#PBS -l walltime=20:00
# Merge stdout and stderr into one output file.
#PBS -j oe
# Run in the batch queue.
#PBS -q batch
# Use the name timer.job.
#PBS -N timer.job
# Set this to your working directory.
execdir=$HOME/asst3/prog1_wsp_openmp
# Load omplace and python 2.7.
source /usr/share/modules/init/bash
module load mpt
module load python
# Move to my $SCRATCH directory.
cd $SCRATCH
# Copy executable and necessary files to $SCRATCH.
cp $execdir/wsp wsp
# Copy over the scoreboard and quit early if we fail to do so.
cp $execdir/scoreboard_token .
if [ $? -ne 0 ]
then
exit 1
fi
# The script depends on mkinput.py to generate input.
cp $execdir/mkinput.py .
# The script depends on the existence of an input directory.
mkdir -p input
# The script depends on the existence of a Makefile with a run target.
echo "run: " > Makefile
# Copy over the timer script, modifying a few variables:
# set num_threads to the actual number of CPUS used
# set num_iterations to 1, since we assume we have full access to the CPU
# use omplace to control openmp and pin jobs to CPUS
cat $execdir/timer.bash | sed 's/num_threads=.*/num_threads=$PBS_NCPUS/' \
| sed 's/num_iterations=.*/num_iterations=1/' \
| sed 's/OMP_NUM_THREADS=/omplace -nt /' \
> timer.bash
# Run the script.
bash timer.bash
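The sed pipeline above rewrites three lines of timer.bash. Here is a small demonstration of its effect on made-up timer.bash contents (the variable names match the ones the pipeline targets); note that the single quotes keep $PBS_NCPUS literal in the rewritten file, so it is only expanded later, when the job actually runs:

```shell
# Made-up stand-in for timer.bash, containing the three kinds
# of lines the pipeline rewrites.
cat > timer_orig.bash <<'EOF'
num_threads=8
num_iterations=5
OMP_NUM_THREADS=$num_threads ./wsp -i input/dist17
EOF

# The same transformation as in the job script above.
cat timer_orig.bash | sed 's/num_threads=.*/num_threads=$PBS_NCPUS/' \
    | sed 's/num_iterations=.*/num_iterations=1/' \
    | sed 's/OMP_NUM_THREADS=/omplace -nt /' \
    > timer.bash

# The result: threads come from PBS, one iteration, omplace wraps the run.
cat timer.bash
```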
In the assignment directory, there is a tarball of some useful job scripts.
/afs/cs/academic/class/15418-s13/assignments/jobs.tgz
Using the queue
There are a couple of useful commands for dealing with the job queue. You should probably man every one of these.
qsub: qsub is how you launch jobs. Feel free to learn more, but the basic usage is pretty simple:

$ qsub my_job_file.job
231778.tg-login1.blacklight.psc.teragrid.org

On a successful launch, it reports the job identifier.
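The numeric prefix of that identifier is what commands like qdel expect. A sketch of capturing it in a script (the identifier string here is just the example value from above):

```shell
# Job identifier as printed by qsub (example value from above).
jobid="231778.tg-login1.blacklight.psc.teragrid.org"

# Strip everything after the first dot to get the numeric id,
# suitable for a later "qdel $jobnum".
jobnum=${jobid%%.*}
echo "$jobnum"
```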
qstat: qstat is how you get information about the job queue. It is used in the following way:

$ qstat -a

tg-login1.blacklight.psc.teragrid.org:
                                                                Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
228983.tg-login1     blood    res_2    RMI_RNA_st 328959   --  192     -- 400:0 R 342:5
228988.tg-login1     blood    res_2    BAB_RNA_al 336758   --  128     -- 400:0 R 342:5
... <snip> ...
230995.tg-login1     gcommuni batch_l1 nz2            --   --   16 19000m 95:30 Q    --
230996.tg-login1     gcommuni batch_l1 nz3            --   --   16 19000m 95:30 Q    --

Total cpus requested from running jobs: 3536
Note that you can use this to see your approximate place in the queue:
qstat -a | awk '{print NR" "$0}' | grep $USER
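The awk stage of that one-liner simply prefixes each line of output with its line number, so grepping for your username shows how deep in the listing your jobs sit. A demonstration on fabricated qstat-style output (the usernames and job ids here are made up):

```shell
# Fabricated qstat -a listing: three queued jobs from two users.
qstat_out='230995.tg-login1 alice batch_l1 nz2 -- -- 16 19000m 95:30 Q --
230996.tg-login1 bob batch_l1 nz3 -- -- 16 19000m 95:30 Q --
230997.tg-login1 alice batch_l1 nz4 -- -- 16 19000m 95:30 Q --'

# Number every line, then keep only one user's jobs; the leading
# number is that job's position in the listing.
echo "$qstat_out" | awk '{print NR" "$0}' | grep alice
```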
qdel: qdel is used to cancel a job. If you accidentally launch a job you didn't mean to, you can always cancel it with:

$ qdel 231778