Assignment 3: Two Algorithms, Two Programming Models

Due: Thurs Feb 26th, 11:59PM EST

120 points total

## Overview

In this two-part assignment you will write parallel implementations of two common algorithms (1) sorting a large array of numbers and (2) computing the single-source shortest path in an unweighted, directed graph using breadth-first search (BFS). To give you exposure to two new parallel programming environments, your sorting algorithm will be implemented in the message passing model (using MPI). Your BFS will be implemented as a shared-address space program (using OpenMP). For this assignment you will be using GHC machines and Blacklight, a supercomputer hosted at the Pittsburgh Supercomputing Center (PSC). You will first get your code running well on the 6-core (12-virtual cores, due to hyper-threading) machines in GHC. Once your code scales well on these machines, you'll have the opportunity to test, evaluate, and improve its scalability on a supercomputer.

Note to students: Parts 1 and 2 of this assignment are independent exercises. Some teams may wish to partition the work in two, but we prefer you work on both parts together.

## Environment Setup

#### GHC Setup Instructions

In this assignment, you will need to use ghc27-46.ghc.andrew.cmu.edu to run your code. The six-core, 3.2 GHz Intel Xeon CPU in these machines is described in detail here. You can verify that you are on a correct machine by running 'less /proc/cpuinfo' and confirming that Linux reports 12 virtual cores.

On the GHC machines, add /usr/lib64/openmpi/bin to your PATH and /usr/lib64/openmpi/lib to your LD_LIBRARY_PATH environment variables. You can develop and test your code on any machine, but performance should be reported only on the hosts listed above.

When running part 2 (BFS), add the <working-directory/lib> to your LD_LIBRARY_PATH environment variable. This will allow your BFS executable to use the reference library provided in the starter code (it is needed to run bfs).

#### Blacklight Setup Instructions

STOP WHAT YOU ARE DOING AND PLEASE FOLLOW THESE INSTRUCTIONS ASAP. THERE CAN BE 24-48 HOURS BETWEEN THE TIME WE ADD YOUR BLACKLIGHT ACCOUNT AND WHEN YOU GAIN ACCESS TO BLACKLIGHT, SO YOU WANT TO INITIATE THIS PROCESS AS SOON AS POSSIBLE.

You need to first create an account on XSEDE. Once you have created an account, fill out this form so we know your XSEDE username and can give you access to use the class quota on Blacklight.

There are two ways to access Blacklight to run your code. The first is through the XSEDE web portal. You will need to login into your XSEDE account at http://portal.xsede.org. Once logged in, navigate to "MY XSEDE", then "SSH Terminal" and select "Blacklight" in the dropdown menu to initiate an SSH session. The second is via ssh to blacklight.psc.teragrid.org (we expect that the latter option will be far more convenient). To login via ssh you may first need to change your Blacklight password by following the instructions found here: http://www.psc.edu/index.php/computing-resources/blacklight#password.

## Problem 1: Parallel Sorting Using MPI (40 points code + 20 points writeup)

In part 1 of the assignment you will implement a parallel sort using MPI, an API for message passing. Unlike Assignment 2, we will give you the parallel sorting algorithm we wish for you to use. The algorithm is a variant of bucket sort which you've likely seen in a previous algorithms classes. However, in our implementation, more balanced buckets are obtained by randomly choosing pivots from the input array. The parallel sort can be divided into four steps, which are explained below.

Note, in this explanation:

• N is the length of array A to be sorted.
• P is the number of processes used in the parallel program (we say processes and not threads in an MPI program, because logical threads of control are typically implemented as seperate processes, and can even be running on different machines). Remember that since you are working in the message passing model, each process resides in its own private address space (recall the example here). To communicate data between address spaces, the processes must send each other messages. You can think of each process as being 1-to-1 with a processor, but keep in mind that a Hyper-Threaded CPU can simultaneously run up to two processes per core. This is the case on the GHC machines where it is advantageous to run up to 12 processes on the CPU's six cores.
• S is the number of samples used to choose pivots sorting process (as will be described below).
• Assume that at the beginning of the algorithm, the unsorted array A is distributed across the processes. Specifically, each processes' address space contains an N/P size chunk of the array A.

In general, the basic idea of the sorting algorithm is to partition elements of the array A into P buckets, with one bucket belonging to each process (we say each process "owns" one bucket). Once the input array elements are partitioned into the correct bucket and have been communicated to the owning process, each bucket can be sorted independently using a fast sequential algorithm. Thus, the challenge of the parallel sort implementation is to efficiently communicate elements to the process responsible for each bucket.

We'll now explain the algorithm in more detail. The process is also illustrated in these notes (created by the course TAs of yore).

### Step 1: Choosing Pivots to Define Buckets

Basic Idea: The first step of the algorithm is to select P-1 pivots that define the P buckets. (Bucket i will contain elements between pivot[i-1] and pivot[i].) To do this, your code should randomly select S samples from the entire array A, and then choose P-1 pivots from the selection using the following process:

samples = array of S elements chosen at random with uniform probability from array A
samples_sorted = sort(samples)
pivots = [samples_sorted[S/P], samples_sorted[2S/P], samples_sorted[3S/P] ... ]

Note that the result pivots is an array of P-1 elements. You are free to select the value of S as you wish, but our reference solution uses S=12*P*lg(N), and you are not required to tweak this value as part of the optimization process.

Tips: We leave it up to you to figure out how to generate the S random indices and sort the corresponding elements. However, we suggest keeping it simple and sorting the pivots inside one process using a fast sequential sort. If the number of samples S is small compared to N, this operation and will constitute only a small fraction of execution time. We leave it up to you to determine whether the S samples should be determined in parallel.

### Step 2: Bucketing Elements of the Input Array

Basic Idea: The second step is to bucket all elements of A into P buckets where element A[i] is placed in bucket j if pivot[j-1] <= A[i] < pivot[j]. (The 0'th bucket contains all elements less than pivot[0], the P-1'th bucket contains all elements greater than or equal to pivot[P-2]) The randomized choice of pivots ensures that in expectation, the number of elements in each bucket is well balanced. (This is important, because it will lead to good workload balance in Step 4!)

Tips: In the next step, each process i will send all the elements in its address space belonging to bucket j to process j. Therefore, the bucketing step should organize data for this upcoming communication phase. On average (assuming a random input array A), process i will communicate about N/(P^2) elements to process j, and we suggest that you consider bulk transfers to maximize communication performance.

### Step 3: Redistributing Elements

Now that the bucket containing each array element is known, redistribute the data elements such that each process i holds all the elements in bucket i.

### Step 4: Final Local Sort

Finally, each process uses a fast sequential sorting algorithm to sort each bucket. As a result, the distributed array is now sorted!

### What You Need to Do

Your job is to develop a fast implementation of the parallel sort algorithm given above using the Open MPI message passing library. (There are many parallel sorting algorithms, but your implementation should follow the basic outline of the algorithm sketched above.)

This assigment we are distributing the start code via git. Begin by cloning the git repo at:

Please keep your code modifications limited to src/parallelSort.cpp. The function parallelSort is the main entrypoint to your code.

### How to Run ParallelSort on the GHC machines

To build the starter code:

cd asst3/part1 && make

To run parallelSort on a ghc machine:

make run

To run parallelSort on a ghc machine with different arguments, see the commandline help:

./parallelSort --help
Usage: ./parallelSort [options]
Sort a random or the input dataset
Program Options:
-s  --size <N>           Size of dataset to sort
-d  --dist exp|norm|bad1 Distribution to generate dataset
-p  --par <param>        Use <param> as distribution parameter
-a  --almost <swap>      use <swap> comparisons to generate almost sorted dataset
-i  --input <file>       Use <file> instead of generated dataset
-?  --help               This message

If you want to run parallelSort without using make run, then you need to add /path/to/asst3/part1/lib to your LD_LIBRARY_PATH.

You make wish to modify the Makefile on the following lines to run with other arguments:

39 run : parallelSort
40  LD_LIBRARY_PATH=... $(MPIRUN) -np <processors> parallelSort <arguments> ### How to Run ParallelSort on Blacklight To run code on Blacklight, you must submit your jobs to the system's batch queue. The job will be enqueued (with all other jobs submitted by other users of the machine) and scheduled to run on the Supercomputer when it reaches the front of the queue. The Blacklight information page and the writeup on How to Use Blacklight by the course staff are good resources to understand how to use the batch system. We also recommend following the quick start tips and tricks below: To build parallelSort with MPI, you need to first set all the proper envirnoment variables. Blacklight uses the module system to make this easy, so you just need to run: module load openmpi/1.6/gnu You might want to add this to your .bashrc or equivalent. In short, to run parallelSort on blacklight: ssh blacklight.psc.xsede.org module load openmpi/1.6/gnu scp -r ghc<xx>.ghc.andrew.cmu.edu:</path/to/>asst3 ./ cd asst3/part1 make jobs qsub jobs/<username>_<np>.job We have given you rules in Makefile that can be used to create jobs scripts for running on the Blacklight. The job scripts will be placed in the job/ directory with$andrewID\_numCores.job with numCores = 1,2,4,8,16,32,64,128. To create the job scripts:

make jobs

If you wish to modify arguments to parallelSort, modify asst3/part1/jobs/example.job:

25 args='your_arguments_to_parallelSort'

Then type:

make jobs

To delete a job:

qdel <jobid>

Notice to students: It is very important to delete jobs you don't need from the Blacklight batch queue, especially as we near the deadline and more jobs are in the queue. Blacklight is a supercomputer used by scientists across the country (machines of this scale are a scarce resource). We are fortunate to be given the opportunity to use Blacklight in 15-418/618. It is considered impolite to occupy the machine with jobs you do not need. In addition to researchers at large, it also hurts other 15-418/618 students that need access to the machine to finish their assignment.

We expect that, much like in Assignment 2, you will ultimately try a number of solutions. We expect your write-up will contain a detailed description of this optimization process and we will grade on the quality of this description. Remember, the point of 15-418 assignments is for you to learn how to reason about, write, debug, and profile a parallel program in order achieve the best speedup you can. If you acheive a good speedup and have a clear writeup that conveys the most important aspects of your experimentation process, you should expect to receive a good grade.

• 40 points will be awarded for code performance. Below are timings for the TA's reference solution of parallel sort on Blacklight for arrays of size 100M and 1B. (Execution times are given in seconds.) To obtain full performance points, we expect that your parallel sort solution achieves performance that is within 25% of the reference solution at high processor counts. It may be possible to do even better, and we encourage you to "beat the TAs". We will grade performance that does not meet the full performance bar on a curve based on class student results.
• Please include a table of your results for array sizes of 1 million, 100 million, and 1 billion. Performance will be assessed on the 100M and 1B sized datasets.
• In addition to these tables, for each problem size, please include a graph of speedup of your parallel implementation relative to the sequential sort time.
• 20 points will be awarded for the quality of the writeup. We expect that, much like in Assignment 2, you will ultimately try a number of solutions during the optimization process. We expect your write-up will contain a detailed description of this optimization process and we will grade on the quality of this description. In addition to your algorithm description, we are looking for you analyze the performance of your program. Specifically address the following questions:
• In the 1 million element array case, speedup for large P may be quite poor. Please describe why the shape of the speedup graph looks the way it does.
• What are reasons why speedup is not perfect for large P even when the array size is large. Is it workload imbalance? Is it communication costs? Make measurements of your program (time spent in each phase, time spent communicating, waiting for a barrier etc.) that allow you to support your answer with real data. (Much like we did in Assignment 1)

Reference Solution Blacklight Performance:

N = 1M

args='-s 1000000 -d exp -p 5'
Sequential sort: 0.1254
Parallel sort:
P = 2    0.0751
P = 4    0.0395
P = 8    0.0248
P = 16   0.0152
P = 32   0.0076
P = 64   0.0091
P = 128  0.0227

N = 100M

args='-s 100000000 -d exp -p 5'
Sequential sort: 16.2609
Parallel sort:
P = 2    8.2094
P = 4    4.1012
P = 8    2.1516
P = 16   1.063
P = 32   0.7151
P = 64   0.4655
P = 128  1.077

N = 1B

args='-s 1000000000 -d exp -p 5'
Sequential sort: 166.5324
Parallel sort:
P = 2    87.7883
P = 4    44.0403
P = 8    21.9028
P = 16   10.3575
P = 32   7.0433
P = 64   4.3028
P = 128  5.5775

Although your grade is based on sorting performance on Blacklight, we figured it my be helpful when debugging/testing to know the reference's performance on the GHC machines. It is given below.

N = 1M

Sequential sort: 0.0838
Parallel Sort:
P = 2    0.0446
P = 4    0.0249
P = 6    0.0169
P = 8    0.0153
P = 12   0.0116

N = 10M

Sequential sort: 0.9675
Parallel Sort:
P = 2    0.5345
P = 4    0.2885
P = 6    0.1957
P = 8    0.1688
P = 12   0.1321

N = 100M

Sequential sort: 10.3218
Parallel Sort:
P = 2    5.6164  (1.84x)
P = 4    3.0374  (3.40x)
P = 6    2.1998  (4.49x)
P = 8    1.8858  (5.47x)
P = 12   1.3549  (7.62x)

### Learning MPI

As was the case with CUDA in Assignment 2, we expect you to learn how to use the Open MPI library on your own. Below is a brief discussion of MPI and some helpful links.

Open MPI is a message passing API used frequently in scientific computing. Here is a brief example:

int main(int argc, char** argv) {
int size;  /* the number of processes */
int rank;  /* the id of the current process */
int value;

MPI_Init(& argc, & argv);
MPI_Comm_rank(MPI_COMM_WORLD, & rank);   /* rank is the id of the current process */
MPI_Comm_size(MPI_COMM_WORLD, & size);   /* size is the total number of processes*/

if (rank == 0) {
for (int i = 1; i < size; i++) {
value = i*i;
printf("process 0 performing a blocking send of %d to process %d\n", value, i);
MPI_Send(& value, 1, MPI_INT, i, MY_MSG_TAG, MPI_COMM_WORLD);
}
} else {
MPI_Recv(& value, 1, MPI_INT, 0, MY_MSG_TAG, MPI_COMM_WORLD, NULL);
printf("Finished a blocking receive of %d from process 0\n", value);
}

MPI_Finalize();
}

The problem above would be launched with the command mpirun on both the gates machines and on blacklight. The -np parameter controls the number of processes launched in the MPI program:

mpirun -np 4 ./my_mpi_program

In particular, you might find a subset of the following functions useful:

And for all of these functions, you might want the "v" variant (e.g. MPI_Alltoallv) which gives you greater control over how much data is transferred where.

## Problem 2: Parallel Breadth-First Search Using OpenMP (30 points code + 30 points writeup)

Breadth-first search (BFS) is a common algorithm that you've almost certainly seen in a prior algorithms class. In this problem, you'll explore the challenges of parallelizing BFS on modern parallel machines.

Please familiarize yourself with the function bfs_top_down() in bfs.cpp, which contains a sequential implementation of BFS. The code uses BFS to compute the distance to vertex 0 for all vertices in the graph. You may wish to familiarize yourself with the graph structure defined in graph.h as well as the simple array data structure vertex_set which is an array of vertices used to represent the current frontier of BFS.

### How to Run BFS on the GHC machines

To begin, grab the starter code via:

You will run the code on the graphs located in:

(Yes, the -s14 url is correct. The graphs are huge so we did not copy them from last year.)

For example:

Note that some of these graphs are quite large. For example, the largest graph contains over 50 million vertices and nearly 500 million edges and is a 2GB binary file on disk. You will not want to read these files from AFS frequently, so we suggest copying them to a local directory into one of the GHC 3000 machines such as: /tmp/15418. (Anyone can write to /tmp on the Gates machines.) If you do this, you can read the graph data off local disk. Further, if everyone agrees on putting data in the same location, you can share the local dataset your classmates have already placed on the machine without making many copies.

Note there are two ways of calling the bfs executable:

make GRAPH="path/to/graph" run

will run the check script against multiple threads and provide a summarized results of the speedups. However, no speedup results will be reported if the implementation is incorrect.

will run the check script with n threads and report the speedup, but there will be no correctness checking. This is the script run on Blacklight as well, so please make sure that your implementation is correct on GHC before submitting any jobs to Blacklight.

### Step 1

Please parallelize the top-down BFS. You'll need to focus on identifying parallelism, as well as inserting the appropriate synchronization to ensure correctness. We'll point out up front that you should not expect to achieve near-perfect speedups on this problem (we'll leave it to you to think about why!). A good solution in step 1 will perform approximately three times faster than the sequential starter code when using 12 threads on the six-core (but hyper-threaded) GHC machines when run on the larger random graphs.

Tips/Hints:

• Always start by considering what work can be done in parallel.
• Some part of the computation may need to be synchronized, for example, but wrapping the appropriate code in a critical region using #pragma omp critical. However in this problem you can get by with a single atomic operation called "compare and swap". You can read about GCC's implementation of compare and swap, which is exposed to C code as the function __sync_bool_compare_and_swap. If you can figure out how to use compare-and-swap for this problem, you will achieve higher performance than using a critical region. (We will talk about compare and swap in detail in the second half of the course, but in this problem it's up to you to figure out how to use it.
• Are there conditions where is it possible to avoid using compare_and_swap? That is when you know in advance the compare will fail?
• In class we talked about potential of work assignment introducing overheads in a parallel program, and we discussed ways to reduce it.
• The starter code contains printfs at each step of the BFS process. These are there to help you understand the algorithm. You may wish to remove these when timing performance.

### Step 2

If you take a look at the step-by-step results of the top-down BFS, you'll notice that the frontier grows quite large in some steps (sometimes growing to contain a large fraction of the entire graph). This is particularly true in the rmat graphs. Think about why this behavior might cause a performance problem in the top-down BFS implementation from step 1. It suggests an alternative implementation of a breadth-first search step that may be more efficient in these situations. Instead of iterating over all vertices in the frontier and marking all vertices adjacent to the frontier, it is possible to implement BFS by having each vertex check whether it should be added to the frontier! Basic pseudocode for the algorithm is as follows:

for each vertex v in graph:
if v has not been visited AND
v shares an incoming edge with a vertex on the frontier:

This algorithm is referred to as a "bottom up" implementation of BFS, since each vertex looks up the BFS tree to find its ancestor. (As opposed to being found by its ancestor in a "top down" fashion). Please implement a bottom-up BFS to compute the shortest path to all vertices in the graph from the root. Start by implementing a simple sequential version. Then parallelize your implementation and compare the performance of the two implementations. You should expect to achieve about a 2.5 times speedup in your solution as compared to a well-written sequential implementation.

Tips/Hints:

• It may be useful to consider how to represent the set of unvisited nodes.
• How do the synchronization requirements of the bottom-up BFS change?

### Step 3

Now that you have implemented both top-down and bottom-up parallel BFS implementations, combine the techniques to form a new hybrid algorithm that performs a faster parallel BFS than either of the two techniques alone. This hybrid approach is called a "Direction Optimization BFS".

Note: There are two preprocessor macros defined in bfs.h, USE_HYBRID_FUNCTION and DEBUG. Defining the hybrid macro will make the checker run correctness and performance testing on your hybrid implementation. You can use the debug flag as it is used in the bfs_top_down function in order to allow you jump between printing runtime logs and not printing rapidly. We recommend you are not printing from inside the BFS functions when you are trying to measure performance.

• 30 points will be awarded for code performance on large random graphs such as random_50m.graph. and rmat_32m.graph. Once the BFS algorithm you write is correct, calling ./bfs on a graph will run your solution against a reference library so that you can compare the performance of your code and the performance that we expect you to achieve on a system to system basis.
• We will be grading performance results from the 1-12 thread solutions on ghc27-46.ghc.andrew.cmu.edu, and for up to 32 threads on Blacklight. Other machines that might be interesting to test results on are the 4-core GHC machines (ghc49+.ghc.andrew.cmu.edu) as well as the 40-core andrew Unix Machines (unix.andrew.cmu.edu)
• Please include a table of your results from Blacklight with core counts from 1 to 32. Performance will be graded based on your performance from Blacklight.
• To obtain full performance points, we expect that your top-down, bottom up, and hybrid solutions to achieve performance that is within 25% of the reference solution at high thread counts on the GHC machines and for up to 32 cores on Blacklight. It may be possible to do even better, and we encourage you to try to "beat the TAs". As in part 1, we will grade performance that does not meet the full performance bar on a curve based on class student results.
• Important Grading is solely based on absolute performance times not speedup. speedup is given to give you some information about the speedup profile but your solution must be close to the reference in time
• Please include both your time as well as the reference solution for Blacklight on your writeup

• 30 points will be awarded for the quality of the writeup. As in part 1, we expect your write-up will contain a detailed description this optimization process and we will grade on the quality of this description. Make sure you tell us how you implemented the hybrid approach and what data structures you used to represent the frontier in the bottom up approach. In addition to your algorithm description, we are looking for you to analyze the performance of your program. Specifically address the following questions:

• Where is the synchronization in each your solutions? Do you do anything to limit the overhead of synchronization?
• Why do you think your code (and the TA's code) is unable to achieve perfect speedup? (Is it workload imabalance? communication/synchronization? data movement?)
• When you run your code on Blacklight with more than 16 threads, do you see a noticable drop off in performance? Why do you think this might be the case?

### Learning OpenMP

OpenMP is an API and set of C-language extensions that provides compiler support for parallelism. It is well documented online, but here is a brief example:

/* Serial code here ... */

#pragma omp parallel
{
/* This block is potentially executed in parallel on multiple cores */
}
/* Wait until all processes finish before continuing */

/* Serial code again ... */

You can also use OpenMP to tell the compiler to parallelize iterations of for loops, and to manage mutual exclusion.

/* The iterations of the for loop may be parallelized */
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
#pragma omp critical
{
/* This block will be executed by at most one thread at a time. */
}
}

Please see OpenMP documentation for the syntax for how to tell OpenMP to used different forms of static or dynamic scheduling (e.g., omp parallel for schedule(dynamic)).

Here is an example for an atomic counter update.

int my_counter = 0;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
if ( ... some condition ...) {
#pragma omp atomic
my_counter++;
}
}

As a participant in a 400-level course, we expect you to be able to read OpenMP documentation on your own (Google will be very helpful), but here are some useful links to get started:

## How to Run OpenMP on Blacklight

We have provided a submission script under asst3_part2/jobs/submit.py. The script will automatically create the relevent job scripts and qsub them into the Blacklight running queue. To run bfs on Blacklight:

ssh blacklight.psc.xsede.org
scp -r ghc<xx>.ghc.andrew.cmu.edu:</path/to/>asst3_part2 ./
mkdir graphs
scp -r ghc<xx>.ghc.andrew.cmu.edu:</path/to/graphs> ./graphs
cd asst3_part2
make
cd jobs
python submit.py </path/to/input/graph>

By default, the program is run on core counts up to 32 cores. If you want to change how many cores the program is run on, you can modify the python file submit.py and change the core_counts array to reflect whatever core counts you want. Then just rerun the python submit command

## Hand-in Instructions

Remember: You many need to execute aklog cs.cmu.edu to avoid permission issues.

We expect you to hand in four files:

1. Please submit your writeup as a pdf file named writeup_asst3.pdf. Your writeup should include a description of your final algorithms used and a detailed description of experimentation that indicates your thought process in parallelizing your code. Note that the writeup is 50% of the grade on this assignment.