Assignment 3: Two Algorithms, Many Cores : 15-418/618 Spring 2014

Due: Tues March 4th, 11:59PM EST

120 points total

Overview

In this two-part assignment you will write parallel implementations of two common algorithms (1) sorting a large array of numbers and (2) computing the single-source shortest path in an unweighted, directed graph using breadth-first search (BFS). To give you exposure to two new parallel programming environments, your sorting algorithm will be implemented in the message passing model (using MPI). Your BFS will be implemented as a shared-address space program (using OpenMP). For this assignment you will be using the GHC 3000 machines and Blacklight, a supercomputer hosted at the Pittsburgh Supercomputing Center (PSC). You will first get your code running well on the 6-core (12-virtual cores, due to Hyper-threading) machines in GHC 3000. Once your code scales well on these machines, you'll have the opportunity to test, evaluate, and improve its scalability on a supercomputer.

Parts 1 and 2 of this assignment are independent exercises.

Environment Setup

GHC 3000 Setup Instructions

The Gates 3000 machines are ghc27-50.ghc.andrew.cmu.edu. You can verify that you are on a correct machine by running 'less /proc/cpuinfo' and confirming that Linux reports 12 virtual cores. The six-core, 3.2 GHz Intel Xeon CPU in these machines is described in detail here.

On the GHC machines, add /usr/lib64/openmpi/bin to your PATH and /usr/lib64/openmpi/lib to your LD_LIBRARY_PATH environment variables. You should use the GHC machines for the initial debugging and testing of your code. You can develop and test your code on any machine, but performance should be reported on the GHC 3000 machines. Note: as of right now, MPI is not available on ghc46, ghc49, or ghc75. Computing services should have this fixed shortly.

Blacklight Setup Instructions

PLEASE, PLEASE FOLLOW THESE INSTRUCTIONS ASAP AS THERE IS A LAG TIME IN ACCOUNT CREATION.

You need to first create an account on XSEDE and then tell the course staff. Once you have created an account, fill out this form so we know your XSEDE username and can give you access to use the class quota on Blacklight. Before trying to run your code on Blacklight, please read the "Help on Running on Blacklight" section at the end of this document.

There are two ways to access Blacklight to run your code. The first is through the XSEDE web portal. You will need to login into your XSEDE account at http://portal.xsede.org. Once logged in, navigate to "MY XSEDE", then "SSH Terminal" and select "Blacklight" in the dropdown menu to initiate an SSH session. The second is via ssh to blacklight.psc.teragrid.org (we expect that the latter option will be far more convenient). To login via ssh you may first need to change your Blacklight password by following the instructions found here: http://www.psc.edu/index.php/computing-resources/blacklight#password.

Problem 1: Parallel Sorting Using MPI (40 points code + 20 points writeup)

In part 1 of the assignment you will implement a parallel sort using MPI. To begin, we'll explain the algorithm we wish for you to use. The algorithm is a variant of bucket sort which you've likely seen in a previous algorithms classes. However, in our implementation, more balanced buckets are obtained by randomly choosing pivots from the input array. The parallel sort can be divided into four steps, which are explained below. You may also want to take a look at these notes on parallel sort (created by the course TAs).

Note, in this explanation:

N is the length of array A to be sorted;
P is the number of processes used in the parallel program (we say processes and not threads in an MPI program, because logical threads of control are typically implemented as seperate processes, and can even be running on different machines). Remember that since you are working in the message passing model, each process resides in its own private address space (recall the example here). To communicate data between address spaces, the processes must send each other messages. You can think of each process as being 1-to-1 with a processor, but keep in mind that a Hyper-Threaded CPU can simultaneously run up to two processes per core. This is the case on the Gates 3000 machines where it is advantageous to run up to 12 processes on the CPU's six cores.
S is the number of samples used to choose pivots sorting process (as will be described below).
At the beginning of the algorithm, the unsorted array A is distributed across the processes. Specifically, each processes' address space contains an N/P size chunk of the array A.

In general, the basic idea of the sorting algorithm is to partition elements of the array A into P buckets, with one bucket belonging to each process (we say each process "owns" one bucket). Once the input array elements are partitioned into the correct bucket and have been communicated to the owning process, each bucket can be sorted independently using a fast sequential algorithm. Thus, the challenge of the parallel sort implementation is to efficiently communicate elements to the process responsible for each bucket.

Step 1: Choosing Pivots to Define Buckets

Basic Idea: The first step of the algorithm is to select P-1 pivots that define the P buckets. (Bucket i will contain elements between pivot[i-1] and pivot[i].) To do this, your code should randomly select S samples from the entire array A, and then choose P-1 pivots from the selection using the following process:

samples = array of S elements chosen at random with uniform probability from array A
samples_sorted = sort(samples)
pivots = [samples_sorted[S/P], samples_sorted[2S/P], samples_sorted[3S/P] ... ]

Note that the result pivots is an array of P-1 elements. You are free to select the value of S as you wish, but our reference solution uses S=12*P*lg(N), and you are not required to tweak this value as part of the optimization process.

Tips: We leave it up to you to figure out how to generate the S random indices and sort the corresponding elements. However, we suggest keeping it simple and sorting the pivots inside one process using a fast sequential sort. If the number of samples S is small compared to N, this operation and will constitute only a small fraction of execution time. We leave it up to you to determine whether the S samples should be determined in parallel.

Step 2: Bucketing Elements of the Input Array

Basic Idea: The second step is to bucket all elements of A into P buckets where element A[i] is placed in bucket j if pivot[j-1] <= A[i] < pivot[j]. (The 0'th bucket contains all elements less than pivot[0], the P-1'th bucket contains all elements greater than or equal to pivot[P-2]) The randomized choice of pivots ensures that in expectation, the number of elements in each bucket is well balanced. (This is important, because it will lead to good workload balance in Step 4!)

Tips: In the next step, each process i will send all the elements in its address space belonging to bucket j to process j. Therefore, the bucketing step should organize data for this upcoming communication phase. On average (assuming a random input array A), process i will communicate about N/(P^2) elements to process j, and we suggest that you consider bulk transfers to maximize communication performance.

Step 3: Redistributing Elements

Now that the bucket containing each array element is known, redistribute the data elements such that each process i holds all the elements in bucket i.

Step 4: Final Local Sort

Finally, each process uses a fast sequential sorting algorithm to sort each bucket. As a result, the distributed array is now sorted!

What You Need to Do

Your job is to develop a fast implementation of the parallel sort algorithm given above using the Open MPI message passing library. (There are many parallel sorting algorithms, but your implementation should follow the basic outline of the algorithm sketched above.)

Begin by downloading the starter code at /afs/cs.cmu.edu/academic/class/15418-s14/assignments/asst3_part1.tgz. Please keep your code modifications limited to src/parallelSort.cpp. The function parallelSort is the main entrypoint to your code.

How to Run ParallelSort on the GHC machines

To build the starter code:

tar xf asst3_part1.tgz
cd asst3_part1
make

To run parallelSort on a ghc machine:

make run

To run parallelSort on a ghc machine with different arguments, see the commandline help:

./parallelSort --help
Usage: ./parallelSort [options]
Sort a random or the input dataset 
Program Options:
-s  --size <N>           Size of dataset to sort
-d  --dist exp|norm|bad1 Distribution to generate dataset
-p  --par <pram>         Use <pram> as distribution parameter
-a  --almost <swap>      use <swap> comparisons to generate almost sorted dataset
-i  --input <file>       Use <file> instead of generated dataset
-?  --help               This message

You make wish to modify the Makefile on the following lines to run with other arguments:

39 run : parallelSort
40  $(MPIRUN) -np <processors> parallelSort <arguments>

How to Run ParallelSort on Blacklight

To run code on Blacklight, you must submit your jobs to the system's batch queue. The job will be enqueued (with all other jobs submitted by other users of the machine) and scheduled to run on the Supercomputer when it reaches the front of the queue. The Blacklight information page and this article by the course staff are good resources to understand how to use the batch system. We also recommend following the quick start tips and tricks below:

To build parallelSort with MPI, you need to first set all the proper envirnoment variables. Blacklight uses the module system to make this easy, so you just need to run:

    module load openmpi/1.6/gnu

You might want to add this to your .bashrc or equivalent.

In short, to run parallelSort on blacklight:

ssh blacklight.psc.xsede.org
module load openmpi/1.6/gnu
scp -r ghc<xx>.ghc.andrew.cmu.edu:</path/to/>asst3_part1 ./
cd asst3_part1
make jobs
qsub jobs/<username>_<np>.job

We have given you rules in Makefile that can be used to create jobs scripts for running on the Blacklight. The job scripts will be placed in the job/ directory with $andrewID\_numCores.job with numCores = 1,2,4,8,16,32,64,128. To create the job scripts:

make jobs

If you wish to modify arguments to parallelSort, modify asst3_part1/jobs/example.job:

25 args='your_arguments_to_parallelSort'

Then type:

make jobs
qsub jobs/<username>_<np>.job

To view your submitted jobs:

qstat -u <username>

To delete a job:

qdel <jobid>

It is very important to delete jobs you don't need from the batch queue, especially as we near the deadline and more jobs are in the queue. Blacklight is a supercomputer used by scientists across the country (machines of this scale are a scarce resource). We are fortunate we are given the opportunity to use Blacklight in 15-418/618. We consider it impolite to occupy the machine with jobs you do not need. It also hurts other 15-418 students that need access to the machine to finish their assignment.

Grading

We expect that, much like in Assignment 2, you will ultimately try a number of solutions. We expect your write-up will contain a detailed description of this optimization process and we will grade on the quality of this description. Remember, the point of 15-418 assignments is for you to learn how to reason about, write, debug, and profile a parallel program in order achieve the best speedup you can. If you acheive a good speedup and have a clear writeup that conveys the most important aspects of your experimentation process, you should expect to receive a good grade.

40 points will be awarded for code performance. Below are timings for the TA's reference solution of parallel sort on Blacklight for arrays of size 100M and 1B. (Execution times are given in seconds.) To obtain full performance points, we expect that your parallel sort solution achieves performance that is within 20% of the reference solution at high processor counts. It may be possible to do even better, and we encourage you to "beat the TAs". We will grade performance that does not meet the full performance bar on a curve based on class student results.
- Please include a table of your results for array sizes of 1 million, 100 million, and 1 billion. Performance will be assessed on the 100M and 1B sized datasets.
- In addition to these tables, each each problem size, please include a graph of speedup of your parallel implementation relative to the sequential sort time.
20 points will be awarded for the quality of the writeup. We expect that, much like in Assignment 2, you will ultimately try a number of solutions during the optimization process. We expect your write-up will contain a detailed description of this optimization process and we will grade on the quality of this description. In addition to your algorithm description, we are looking for you analyze the performance of your program. Specifically address the following questions:
- In the 1 million element array case, speedup for large P may be quite poor. Please describe why the shape of the speedup graph looks the wasy it does.
- What are reasons why speedup is not perfect for large P even when the array size is large. Is it workload imbalance? Is it communication costs? Make measurements of your program (time spent in each phase, time spent communicating, waiting for a barrier etc.) that allow you to support your answer with real data. (Much like we did in Assignment 1)

N = 1M

args='-s 1000000 -d exp -p 5'
 Sequential sort: 0.1254
 Parallel sort:  
   P = 2    0.0751  
   P = 4    0.0395
   P = 8    0.0248
   P = 16   0.0152
   P = 32   0.0076
   P = 64   0.0091 
   P = 128  0.0227

N = 100M

args='-s 100000000 -d exp -p 5'
 Sequential sort: 16.2609
 Parallel sort:  
   P = 2    8.2094
   P = 4    4.1012
   P = 8    2.1516
   P = 16   1.063
   P = 32   0.7151
   P = 64   0.4655
   P = 128  1.077

N = 1B

args='-s 1000000000 -d exp -p 5'
 Sequential sort: 166.5324
 Parallel sort:
   P = 2    87.7883
   P = 4    44.0403
   P = 8    21.9028
   P = 16   10.3575
   P = 32   7.0433
   P = 64   4.3028
   P = 128  5.5775

Although your grade is based on sorting performance on Blacklight, we figured it my be helpful when debugging/testing to know the reference's performance and GHC 3000 machines. It is given below.

N = 1M

 Sequential sort: 0.0808
 Parallel Sort:
   P = 2    0.0446
   P = 4    0.0249
   P = 6    0.0169 
   P = 8    0.0153
   P = 12   0.0116

N = 10M

 Sequential sort: 0.9244
 Parallel Sort:
   P = 2    0.5345
   P = 4    0.2885
   P = 6    0.1957
   P = 8    0.1688
   P = 12   0.1321

N = 100M

 Sequential sort: 10.3218
 Parallel Sort:
   P = 2    5.6164  (1.84x)
   P = 4    3.0374  (3.40x)
   P = 6    2.1998  (4.49x)
   P = 8    1.8858  (5.47x)
   P = 12   1.5016  (6.87x)

Learning MPI

As was the case with CUDA in Assignment 2, we expect you to learn how to use the Open MPI library on your own. Below is a brief discussion of MPI and some helpful links.

Open MPI is a message passing API used frequently in scientific computing. Here is a brief example:

int main(int argc, char** argv) {
  int size;  /* the number of processes */
  int rank;  /* the id of the current process */
  int value;

  MPI_Init(& argc, & argv);
  MPI_Comm_rank(MPI_COMM_WORLD, & rank);   /* rank is the id of the current process */
  MPI_Comm_size(MPI_COMM_WORLD, & size);   /* size is the total number of processes*/

  if (rank == 0) {
     for (int i = 1; i < size; i++) {
        value = i*i;
        printf("process 0 performing a blocking send of %d to process %d\n", num, i);
        MPI_Send(& value, 1, MPI_INT, i, MY_MSG_TAG, MPI_COMM_WORLD); 
     }
  } else {
     MPI_Recv(& value, 1, MPI_INT, 0, MY_MSG_TAG, MPI_COMM_WORLD, NULL);
     printf("Finished a blocking receive of %d from process 0\n", num);        
  }

  MPI_Finalize();
}

The problem above would be launched with the command mpirun on both the gates machines and on blacklight. The -np parameter controls the number of processes launched in the MPI program:

mpirun -np 4 ./my_mpi_program

Helpful links:

This article by the course staff may be useful.
A tutorial with some example code: http://www.lam-mpi.org/tutorials/nd/.
An explanation and reference https://computing.llnl.gov/tutorials/mpi/.
Some official reference material http://www.open-mpi.org/doc/v1.6/.

Problem 2: Parallel Breadth-First Search Using OpenMP (30 points code + 30 points writeup)

Breadth-first search (BFS) is a common algorithm that you've almost certainly seen in a prior algorithms class. In this problem, you'll explore the challenges of parallelizing BFS on modern parallel machines.

To begin, grab the starter code at: /afs/cs.cmu.edu/academic/class/15418-s14/assignments/asst3_part1.tgz

To can run the code on the graphs located in /afs/cs/academic/class/15418-s14/assignments/asst3_graphs. For example:

make
./bfs /afs/cs.cmu.edu/academic/class/15418-s14/assignments/asst3_graphs/random_1m.graph

Note that some of these graphs are quite large. For example, the largest graph contains over 50 million vertices and nearly 500 million edges and is a 2GB binary file on disk. You will not want to read these files from AFS frequently, so we suggest copying them to a local directory into one of the GHC 3000 machines such as: /tmp/15418. (Anyone can write to /tmp on the Gates machines.) If you do this, you can read the graph data off local disk. Further, if everyone agrees on putting data in the same location, you can share the local dataset your classmates have already placed on the machine without making many copies.

Please familiarize yourself with the function bfs_top_down() in bfs.cpp, which contains a sequential implementation of BFS. The code uses BFS to compute the distance to vertex 0 for all vertices in the graph. You may which to familiarize yourself with the graph structure defined in graph.h as well as the simple array data structure vertex_set which is an array of vertices used to represent the current frontier of BFS.

Step 1

Please parallelize the top-down BFS. You'll need to focus on identifying parallelism, as well as inserting the appropriate synchronization to ensure correctness. We'll point out up front that you should not expect to achieve near-perfect speedups on this problem (we'll leave it to you to think about why!). A good solution in step 1 will perform approximately three times faster than the sequential starter code when using 12 threads on the six-core (but hyper-threaded) GHC 3000 machines when run on the larger random graphs.

Tips/Hints:

Always start by considering what work can be done in parallel.
Some part of the computation will need to be synchronized, for example, but wrapping the appropriate code in a critical region using #pragma omp critical. However in this problem you can get by with a single atomic operation called "compare and swap". You can read about GCC's implementation of compare and swap, which is exposed to C code as the function __sync_bool_compare_and_swap. If you can figure out how to use compare and swap for this problem, you will achieve higher performance than using a critical region.
Are there conditions where is it possible to avoid using compare_and_swap at all?
In class we talked about potential of work assignment introducing overheads in a parallel program, and we discussed ways to reduce it.
The starter code contains printfs at each step of the BFS process. These are there to help you understand the algorithm. You may wish to remove these when timing performance.

Step 2

If you take a look at the step-by-step results of the top-down BFS, you'll notice that the frontier grows quite large in some steps (sometimes growing to contain a large fraction of the entire graph). This is particularly true in the rmat graphs. Think about why might this behavior cause a performance problem in the top-down BFS implementation from step 1. It suggests an alternative implementation of a breadth first search step that may be more efficient in these situations. Instead of iterating over all vertices in the frontier and marking all vertices adjacent to the frontier, it is possible to implement BFS by having each vertex check whether it should be added to the frontier! Basic pseudocode for the algorithm is as follows:

for each vertex v in graph:
    if v has not been visited AND 
       v shares an incoming edge with a vertex on the frontier:
          add vertex v to frontier;

This algorithm is referred to as a "bottom up" implementation of BFS, since each vertex looks up the BFS tree to find its ancestor. (As opposed to being found by its ancestor in a "top down" fashion). Please implement a bottom-up BFS to compute the shortest path to all vertices in the graph from the root. Start by implementing a simple sequential version. Then parallelize your implementation and compare the performance of the two implementations. You should expect to achieve about a 2.5 times speedup in your solution as compared to a well-written sequential implementation.

Tips/Hints:

It may be useful to consider how to represent the set of unvisited nodes.
How do the synchronization requirements of the bottom-up BFS change?

Step 3

Now that you have implemented both top-down and bottom-up parallel BFS implementations, combine the techniques to form a new hybrid algorithm that performs a fast parallel BFS.

Grading

30 points will be awarded for code performance on large random graphs such as random_50m.graph. We want to see your performance results from 1-12 threads on GHC3K, and for up to 32 threads on Blacklight. Below is table detailing the TA's reference solution for the three BFS implementations (part 1: "top down", part 2: "bottom up", and part 3: "hybrid"). We have provided timings on the GHC 3000 machines as well as on Blacklight. Times are given in seconds (Note that speedup is compared to a sequential implementation of each algorithm that uses NO OMP constructs. e.g., in the top-down case, it is the top-down starter code provided in the assignment starter code.) To obtain full performance points, we expect that your top-down, bottom up, and hybrid solutions to achieve performance that is within 30% of the reference solution at high thread counts on the GHC machines and for up to 16 cores on Blacklight. It may be possible to do even better, and we encourage you to try to "beat the TAs". We will grade performance that does not meet the full performance bar on a curve based on class student results.

Gates 3000:

Num threads      Top Down         Bottom Up          Hybrid      
      1          13.817 (1x)      43.725 (1x)        7.792 (1x)
      2          10.000 (1.38x)   24.308 (1.80x)     5.006 (1.55x)
      4           7.540 (1.83x)   20.284 (2.16x)     3.811 (2.04x)
      6           6.334 (2.18x)   17.025 (2.57x)     3.327 (2.34x)
      8           5.335 (2.59x)   18.980 (2.30x)     3.321 (2.35x)
     12           4.924 (2.81x)   18.616 (2.35x)     3.209 (2.43x)

Blacklight:

Num threads      Top Down         Bottom Up          Hybrid      
      1          43.939 (1x)      68.785 (1x)        14.481 (1x)
      2          36.170 (1.21x)   39.152 (1.76x)     10.265 (1.41x)
      4          23.676 (1.86x)   18.905 (3.64x)     5.7480 (2.51x)
      8          17.376 (2.53x)   10.269 (6.70x)     3.5660 (4.06x)
     16          20.939 (2.10x)    7.669 (8.97x)     4.2000 (3.45x)   
     32          52.345 (0.84x)   39.059 (1.76x)     12.547 (1.15x)  
     64          51.847 (0.85x)   39.967 (1.72x)     14.031 (1.03x)   64 threads NOT REQUIRED
    128          84.176 (0.52x)   41.985 (1.64x)     20.692 (0.70x)   128 threads NOT REQUIRED

Gates 3000:

For debugging, here is the performance of reference solution on one-million nodes graph.

Num threads   Top Down    Bottom up    Hybrid
    1         0.174       0.23         0.078
    2         0.114       0.121        0.046
    3         0.104       0.1          0.04
    4         0.106       0.074        0.031
    5         0.078       0.068        0.027
    6         0.076       0.067        0.026
   12         0.056       0.064        0.022

30 points will be awarded for the quality of the writeup. As in part 1, we expect your write-up will contain a detailed description this optimization process and we will grade on the quality of this description. Make sure you tell us how you implemented the hybrid approach and what data structures you used to represent the frontier in the bottom up approach. In addition to your algorithm description, we are looking for you to analyze the performance of your program. Specifically address the following questions:
- Where is the synchronization in each your solutions? Do you do anything to limit the overhead of synchronization?
- Why do you think your code (and the TA's code) is unable to achieve perfect speedup? (Is it workload imabalance? communication/synchronization? data movement?)
- When you run your code on Blacklight with more than 16 threads, do you see a noticable drop off in performance? Why do you think this might be the case?

Learning OpenMP

OpenMP is an API and set of C-language extensions that provides compiler support for parallelism. It is well documented online, but here is a brief example:

/* Serial code here ... */

#pragma omp parallel
{
   /* This block is potentially executed in parallel on multiple cores */
}
/* Wait until all processes finish before continuing */

/* Serial code again ... */

You can also use OpenMP to tell the compiler to parallelize iterations of for loops, and to manage mutual exclusion.

/* The iterations of the for loop may be parallelized */      
#pragma omp parallel for                                                        
for (int i = 0; i < 100; i++) {                                                      
  #pragma omp critical                                                          
  {
    /* This block will be executed by at most one thread at a time. */
    printf("Thread %d got iteration %lu\n", omp_get_thread_num(), i);           
  }                                                                             
}

Here is an example for an atomic counter update.

int my_counter = 0;
#pragma omp parallel for                                                        
for (int i = 0; i < 100; i++) {                                                      
    if ( ... some condition ...) {
       #pragma omp atomic
       my_counter++;
    }
}

As a participant in a 400-level course, we expect you to be able to read OpenMP documentation on your own (Google will be very helpful), but here are some useful links to get started:

The OpenMP 3.0 specification: http://www.openmp.org/mp-documents/spec30.pdf.
An OpenMP cheat sheet http://openmp.org/mp-documents/OpenMP3.1-CCard.pdf.
OpenMP has support for reductions on shared variables, and for declaring thread-local copies of variables.

How to Run OpenMP on Blacklight

We have provided a submission script under asst3_part2/jobs/submit.py. The script will automatically create the relavent job scripts and qsub them into the Blacklight running queue. To run bfs on Blacklight:

ssh blacklight.psc.xsede.org
module load gcc/4.8.0
scp -r ghc<xx>.ghc.andrew.cmu.edu:</path/to/>asst3_part2 ./
mkdir graphs
scp -r ghc<xx>.ghc.andrew.cmu.edu:</path/to/graphs> ./graphs
cd asst3_part2
make
cd jobs
python submit.py </path/to/input/graph>

Hand-in Instructions

Please submit your code in the handin directory: /afs/cs/academic/class/15418-s14-users/<ANDREW ID>/asst3

Remember: You many need to execute aklog cs.cmu.edu to avoid permission issues.

Please submit your writeup as pdf file named writeup_asst3.pdf. Your writeup should include a description of your final algorithms used and a detailed description of experimentation that indicates your thought process in parallelizing your code. Note that the writeup is 50% of the grade on this assignment.
Please submit your MPI sorting code as parallelSort.cpp.
Please submit your BFS code as bfs.cpp