Assignment 3: Two Parallel Graph Algorithms

Due: Wed July 26, 23:59

100 points total

Overview

In this assignment you will implement two parallel graph processing algorithms (page rank and breadth-first search) using a parallel programming framework called OpenMP. A good implementation of this assignment will be able to run algorithms like breadth-first search on graphs containing hundreds of millions of edges in seconds.

Environment Setup

You will need to run code on the cluster machines for this assignment. The cluster consists of 12 machines; the IP address of the head node is 166.111.227.244. Each machine in the cluster contains 24 physical cores: two 12-core 2.5 GHz Intel Xeon E5-2680 v3 processors (although dynamic frequency scaling can take them to 3.3 GHz when the chip decides it is useful to do so). For the curious, a complete specification for this CPU can be found at http://ark.intel.com/products/81908/Intel-Xeon-Processor-E5-2680-v3-30M-Cache-2_50-GHz. Note that hyperthreading is turned OFF on these machines.

Getting Started on the lab cluster

There is no public Internet access from the lab cluster machines, so we host a local git repository. Please be sure to follow the detailed instructions for the lab cluster.

  1. Use ssh to log in to 166.111.227.244
  2. Clone the Assignment 3 starter code from internal git repository into your home directory on the cluster:

    git clone ssh://166.111.227.244/data/home/HPC2017_Summer/class/asst3

  3. Follow the detailed instructions to run your compiled binaries on the cluster.

Part 1: Parallel Graph Algorithms on a Multi-Core CPU

In this part of the assignment, you will implement two graph processing algorithms: breadth-first search (BFS) and a simple implementation of page rank.

Background: Learning OpenMP

In this assignment we'd like you to use OpenMP for multi-core parallelization. OpenMP is an API and set of C-language extensions that provide compiler support for parallelism. It is well documented online. As a brief example, you can use OpenMP to tell the compiler to parallelize iterations of for loops and to manage mutual exclusion:

/* The iterations of this for loop *may be* parallelized */
#pragma omp parallel for
for (int i = 0; i < 100; i++) {

  /* Different iterations of the loop body may be
     run in parallel on different cores */

  #pragma omp critical
  {
    /* This block is marked as a critical section, so it will be executed by at
       most one thread at a time. */
    printf("Thread %d got iteration %d\n", omp_get_thread_num(), i);
  }
}

In this assignment to get good performance, you might need to tweak how OpenMP performs its scheduling of independent loop iterations to threads, so please see the OpenMP documentation for the syntax for how to tell OpenMP to use different forms of static or dynamic scheduling (e.g., omp parallel for schedule(dynamic)) and what the granularity of scheduling should be (e.g., omp parallel for schedule(dynamic, 16)).

Here is an example of an atomic counter update using OpenMP's #pragma omp atomic construct, which is a faster alternative to creating a critical section when only a simple read-modify-write operation needs to be made atomic.

int my_counter = 0;
#pragma omp parallel for                                                        
for (int i = 0; i < 100; i++) {                                                      
    if ( ... some condition ...) {
       #pragma omp atomic
       my_counter++;
    }
}

Finally, OpenMP has useful support for reductions across all loop iterations.

int my_counter = 0;
#pragma omp parallel for reduction(+:my_counter)
for (int i = 0; i < 100; i++)
    my_counter++;

We expect you to be able to read OpenMP documentation on your own (Google will be very helpful); the official OpenMP specification and tutorial pages are good places to start.

Background: Representing Graphs Using Arrays

The starter code operates on directed graphs, whose implementation you can find in graph.h and graph_internal.h. A graph is represented by an array of edges (both outgoing_edges and incoming_edges), where each edge is represented by an integer describing the id of the destination vertex. Edges are stored in the graph sorted by their source vertex, so the source vertex is implicit in the representation. This makes for a compact representation of the graph, and also allows it to be stored contiguously in memory. For example, to iterate over the outgoing edges for all nodes in the graph, you'd use the following code which makes use of convenient helper functions defined in graph.h (and implemented in graph_internal.h):

for (int i = 0; i < num_nodes(g); i++) {
  // Vertex is typedef'ed to an int. Vertex* points into g.outgoing_edges[]
  const Vertex* start = outgoing_begin(g, i);
  const Vertex* end = outgoing_end(g, i);
  for (const Vertex* v = start; v != end; v++)
    printf("Edge %d %d\n", i, *v);
}

In the /tools directory of the starter code you'll find a useful program called graphTools that prints statistics about graphs. We have a directory of graphs placed here for you:

/data/home/HPC2017_Summer/class/graph

Students sometimes want to make their own graphs for debugging. You can write down a graph definition in a text file and use the graphTools app to convert it to a binary file that can be used with Parts 1 and 2 of the assignment below. See the command help for: ./graphTools text2bin graphtextfilename graphbinfilename.

Part 1: Warm up: Implementing Page Rank (20 points)

As a simple warm up exercise to get comfortable using the graph data structures, and to get acquainted with a few OpenMP basics, we'd like you to begin by implementing a basic version of the well-known page rank algorithm.

Please take a look at the pseudocode provided to you in the function pageRank(), in the file pagerank/page_rank.cpp. You should implement the function, parallelizing the code with OpenMP. As with any other algorithm, first identify the independent work and any necessary synchronization.

You can run your code, checking correctness and performance against the staff reference solution using:

./pr /data/home/HPC2017_Summer/class/graph/com-orkut_117m.graph 

You'll find a number of graphs in the course directory. Note some of these graphs are quite large (more than a GB in size). Some interesting graphs include:

  • com-orkut_117m.graph
  • oc-pokec_30m.graph
  • rmat_200m.graph
  • soc-livejournal1_68m.graph

By default, the pr program runs your page rank algorithm with an increasing number of threads (so you can assess speedup trends). However, since runtimes at low core counts can be long, you can explicitly specify the number of threads as a second command-line argument to run your code under a single configuration:

./pr <graph_filename> 24

Your code should handle cases where a vertex has no outgoing edges by distributing the probability mass on such vertices evenly among all the vertices in the graph. That is, your code should work as if there were edges from such a node to every node in the graph (including itself). You'll see this in the comments in the starter code.

Testing Your Performance

You can run the grading script via: ./pr_grader /data/home/HPC2017_Summer/class/graph

Part 2.1: Parallel Breadth-First Search ("Top Down")

Breadth-first search (BFS) is a common algorithm that you've almost certainly seen in a prior algorithms class. Please familiarize yourself with the function bfs_top_down() in bfs/bfs.cpp, which contains a sequential implementation of BFS. The code uses BFS to compute the distance to vertex 0 for all vertices in the graph. You may wish to familiarize yourself with the graph structure defined in common/graph.h as well as the simple array data structure vertex_set (bfs/bfs.h), which is an array of vertices used to represent the current frontier of BFS.

You can run bfs using:

./bfs /data/home/HPC2017_Summer/class/graph/rmat_200m.graph

(as with page rank, bfs's first argument is a graph file, and an optional second argument is the number of threads.)

When you run bfs, you'll see execution time and the frontier size printed for each step in the algorithm. Correctness will pass for the top-down version (we've given you a correct sequential implementation), but it will be slow. (Note that bfs will report failures for the "bottom up" and "hybrid" versions of the algorithm, which you will implement later in this assignment.)

In this part of the assignment your job is to parallelize top-down BFS. As with page rank, you'll need to focus on identifying parallelism, as well as inserting the appropriate synchronization to ensure correctness. We wish to remind you that you should not expect to achieve near-perfect speedups on this problem (we'll leave it to you to think about why!).

Testing Your Performance

You can run the grading script via: ./bfs_grader /data/home/HPC2017_Summer/class/graph

Tips/Hints:

  • Always start by considering what work can be done in parallel.
  • Some part of the computation may need to be synchronized, for example, by wrapping the appropriate code within a critical region using #pragma omp critical. However, in this problem you can get by with a single atomic operation called compare and swap. You can read about GCC's implementation of compare and swap, which is exposed to C code as the function __sync_bool_compare_and_swap. If you can figure out how to use compare-and-swap for this problem, you will achieve much higher performance than using a critical region. (We will talk about compare and swap in detail in the second half of the course, but in this problem it's up to you to figure out how to use it.)
  • Are there conditions under which you can avoid using compare-and-swap entirely? In other words, are there cases where you know in advance that the compare would fail?
  • There is a preprocessor macro VERBOSE that makes it easy to disable the useful per-step timing printouts in your solution (see the top of bfs/bfs.cpp). In general, these printfs occur infrequently enough (only once per BFS step) that they do not notably impact performance, but if you want to disable them while timing, you can use this #define as a convenience.

Part 2.2: "Bottom Up" BFS

Think about what behavior might cause a performance problem in the BFS implementation from Part 2.1. An alternative implementation of a breadth-first search step may be more efficient in these situations. Instead of iterating over all vertices in the frontier and marking all vertices adjacent to the frontier, it is possible to implement BFS by having each vertex check whether it should be added to the frontier! Basic pseudocode for the algorithm is as follows:

for each vertex v in graph:
    if v has not been visited AND 
       v has an incoming edge from a vertex u on the frontier:
          add vertex v to frontier;

This algorithm is sometimes referred to as a "bottom up" implementation of BFS, since each vertex looks "up the BFS tree" to find its ancestor. (As opposed to being found by its ancestor in a "top down" fashion, as was done in Part 2.1)

Please implement a bottom-up BFS to compute the shortest path to all the vertices in the graph from the root (see bfs_bottom_up() in bfs/bfs.cpp). Start by implementing a simple sequential version, then parallelize your implementation.

Tips/Hints:

  • It may be useful to think about how you represent the set of unvisited nodes. Do the top-down and bottom-up versions of the code lend themselves to different implementations?
  • How do the synchronization requirements of the bottom-up BFS change?

Part 2.3: Hybrid BFS (60 points)

Notice that in some steps of the BFS, the "bottom up" BFS is significantly faster than the top-down version. In other steps, the top-down version is significantly faster. This suggests that a major performance improvement is possible if you could dynamically choose between your "top down" and "bottom up" formulations based on the size of the frontier or other properties of the graph! If you want a solution competitive with the reference one, your implementation will likely have to implement this dynamic optimization. Please provide your solution in bfs_hybrid() in bfs/bfs.cpp.

Tips/Hints:

  • If you used different representations of the frontier in Parts 2.1 and 2.2, you may have to convert between these representations in the hybrid solution. How might you efficiently convert between them? Is there an overhead in doing so?

Writeup and Handin (20 pts)

Along with your code, we would like you to hand in a clear, high-level description of how your implementation works, as well as a brief description of how you arrived at your solutions. Specifically address the approaches you tried along the way and how you went about optimizing your code (for example, what measurements did you perform to guide your optimization efforts?).

Aspects of your work that you should mention in the write-up include:

  1. Include the page rank and hybrid BFS score tables generated for your solution.
  2. For bfs, describe the process of optimizing your code:
    • Where is the synchronization in each of your solutions? Do you do anything to limit the overhead of synchronization?
    • Did you decide to dynamically switch between the top-down and bottom-up BFS implementations? How did you decide which implementation to use?
    • Why do you think your code (and the staff reference) is unable to achieve perfect speedup? (Is it workload imbalance? Communication/synchronization? Data movement?)