How to Use Blacklight : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

Blacklight is a different development environment from what most of you are used to. This article should help ease the pain of learning how to use it. Note that it is not a substitute for reading the Blacklight documentation.

Creating a Blacklight account

If you haven't created a Blacklight account, do so immediately. It can take several business days between signing up for an account and getting access.

Setting up your Blacklight environment

Blacklight uses the module system to manage the development environment. To get everything set up, you just need to run:

module load openmpi/1.6/gnu
module load python
module load mpt

You might want to add these lines to your ~/.bashrc or equivalent.
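If the same ~/.bashrc is also read on machines without the module system, a guarded version may be safer. The following is a minimal sketch, not an official recommendation; the function name load_blacklight_modules is hypothetical, and the guard assumes that `module` is simply undefined wherever the module system is absent.

```shell
# Sketch of a ~/.bashrc addition; load_blacklight_modules is a hypothetical name.
load_blacklight_modules() {
    # Skip quietly on machines where the `module` command is not defined.
    if ! command -v module >/dev/null 2>&1; then
        return 0
    fi
    # Same three modules as listed above.
    module load openmpi/1.6/gnu
    module load python
    module load mpt
}
load_blacklight_modules
```

Wrapping the loads in a function keeps the guard in one place, so the file can be sourced anywhere without printing "command not found" errors.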

Running on Blacklight

Blacklight lets you connect via ssh to a head node, where you can do some small development and add jobs to the job queue. Because anything you run on the head node disrupts work done in the queues, it is really impolite to let any process run there for any significant length of time. Instead, you can use a special debug queue with (relatively) quick turnaround times. Note that jobs in the debug queue are limited to 16 processors.

Here is an example job:

#!/bin/bash
# Allocate only as many resources as you need.
# This helps reduce queueing time for your group and for other groups.
# ncpus must be a multiple of 16.
#PBS -l ncpus=16
# pmem can be up to 8gb.
# Use 8gb when you are working with large data sizes.
#PBS -l pmem=2gb
# walltime is [[HH:]MM:]SS, so this requests 5 minutes.
#PBS -l walltime=5:00

# Merge stdout and stderr into one output file
#PBS -j oe

#PBS -q batch

# use the name prog1.job
#PBS -N prog1.job

# Load mpi.
source /usr/share/modules/init/bash
module load openmpi/1.6/gnu
module load mpt

# Move to my $SCRATCH directory.
cd "$SCRATCH"

# Set this to the directory that contains your executable.
execdir=PROGDIR
exe=parallelSort

# Set the arguments for running your program.
args="-s 10000000 -d exp -p 5"

# Copy the executable to $SCRATCH.
cp "$execdir/$exe" "$exe"

# Run my executable. $args is left unquoted on purpose so it splits into separate arguments.
mpirun -np NCORES ./$exe $args

Pay attention to a few things:

The top of the file has a lot of comments of the form #PBS flag_to_qsub. These control the parameters of your job. You can also set these parameters directly via arguments to qsub:

qsub jobs/your.job

or

qsub -l ncpus=16 -q batch jobs/your.job

Use the $SCRATCH directory to store temporary input files and the like.
On Blacklight, we use omplace to control OpenMP thread placement. omplace wraps the program and pins each thread to a unique processor. Also remember that on Blacklight you effectively have dedicated access to each CPU.
All paths are relative to execdir. If you are not using make jobs to generate job files, you will need to set this variable yourself.
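If you do write job files by hand, a small generator can keep the #PBS header and the mpirun line in sync. The sketch below is an illustration, not the course's make jobs target: write_job_file and its argument order are hypothetical, and it assumes that the NCORES placeholder in the example job should equal ncpus.

```shell
#!/bin/bash
# write_job_file NCPUS EXECDIR EXE OUTFILE [ARGS]
# Hypothetical helper: writes a job file modeled on the example job above.
write_job_file() {
    local ncpus=$1 execdir=$2 exe=$3 out=$4 args=${5:-}
    # The example job notes that ncpus must be a multiple of 16.
    if (( ncpus <= 0 || ncpus % 16 != 0 )); then
        echo "ncpus must be a positive multiple of 16 (got $ncpus)" >&2
        return 1
    fi
    # Unquoted EOF: $ncpus, $exe, etc. are filled in now; \$SCRATCH is left
    # for the job itself to expand at run time.
    cat > "$out" <<EOF
#!/bin/bash
#PBS -l ncpus=$ncpus
#PBS -l pmem=2gb
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
#PBS -N $exe.job

source /usr/share/modules/init/bash
module load openmpi/1.6/gnu
module load mpt

cd \$SCRATCH
cp $execdir/$exe $exe
# Assumption: one MPI rank per requested CPU.
mpirun -np $ncpus ./$exe $args
EOF
}
```

For example, write_job_file 32 "$HOME/asst3" parallelSort sort32.job "-s 10000000 -d exp -p 5" would produce a 32-CPU job file (the directory name here is made up), which you would then submit with qsub as usual.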

Using the queue

There are a few useful commands for dealing with the job queue. It is worth reading the man page for each of them.

qsub: qsub is how you launch jobs. Feel free to learn more, but the basic usage is pretty simple:
```
$ qsub my_job_file.job
231778.tg-login1.blacklight.psc.teragrid.org
```
On a successful launch, it prints the job identifier.
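Since qdel in the example further down takes only the numeric part of the identifier, it can be handy to strip the host suffix when scripting. A minimal sketch, using the sample output above as a stand-in for a real qsub call:

```shell
# Stand-in for: full_id=$(qsub my_job_file.job)
full_id="231778.tg-login1.blacklight.psc.teragrid.org"

# Remove everything from the first dot onward, leaving the numeric job ID.
job_id=${full_id%%.*}
echo "$job_id"
```

The variable names full_id and job_id are only illustrative; the `${var%%.*}` expansion is plain bash and does not depend on PBS.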

qstat: qstat is how you get information about the job queue. It is used in the following way:

$ qstat -a

tg-login1.blacklight.psc.teragrid.org: 
                                                            Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK  Memory Time  S Time
-------------------- -------- -------- ---------- ------- ---- ---- ------ ----- - -----
228983.tg-login1     blood    res_2    RMI_RNA_st  328959  --   192    --  400:0 R 342:5
228988.tg-login1     blood    res_2    BAB_RNA_al  336758  --   128    --  400:0 R 342:5
... <snip> ...
230995.tg-login1     gcommuni batch_l1 nz2            --   --    16 19000m 95:30 Q   -- 
230996.tg-login1     gcommuni batch_l1 nz3            --   --    16 19000m 95:30 Q   -- 

Total cpus requested from running jobs: 3536

Note that you can use this to see your approximate place in the queue:

qstat -a | awk '{print NR" "$0}' | grep $USER
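The one-liner above numbers every line of output, including the headers. A slightly more careful variant might count only job rows and match the Username column exactly. This is a sketch under two assumptions: job rows start with a numeric job ID, and the Username column is truncated to 8 characters (as "gcommuni" in the sample suggests). The function name job_positions is hypothetical.

```shell
#!/bin/bash
# job_positions USER
# Reads `qstat -a` output on stdin and prints, for each of USER's jobs:
# its position among all listed jobs, its job ID, and its state (R, Q, ...).
job_positions() {
    awk -v user="$1" '
        BEGIN { short = substr(user, 1, 8) }  # assumption: column is 8 chars wide
        /^[0-9]+\./ {                         # job rows start with a numeric job ID
            n++
            if ($2 == short) print n, $1, $10
        }
    '
}
```

It would be used as qstat -a | job_positions "$USER". Like the original one-liner, the position is only approximate, and job names containing spaces would shift the columns.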

qdel: qdel is used to cancel a job. If you launch a job by accident, you can always cancel it with:
```
$ qdel 231778
```