Assignment 3: Parallel VLSI Wire Routing via OpenMP

Due: Friday, October 13th, 2017, 11:59 PM

Final Deadline: Monday, October 16th, 11:59 PM

Maximum Late Days: 3

As you may have realized with Assignment 2, the earlier you start this the better.

## Overview

The purpose of this assignment is to introduce you to parallel programming using OpenMP and to illustrate how the realities of parallel machines affect performance. Although the sequential version of the task that you are asked to parallelize is relatively straightforward, there are a number of subtle issues involved in achieving high performance with your parallel code.

## Environment Setup

You will collect your numbers on the Latedays cluster. Please read the README file to learn how to submit a job to Latedays. The code will run on the Xeon Phi, and we'll be using offload mode in this assignment.

http://15418.courses.cs.cmu.edu/fall2016/article/7

You will ssh with the following:

 ssh ANDREW_ID@latedays.andrew.cmu.edu

You'll notice that when you log in, you won't be in AFS. This is fine; untar the starter code in the directory you land in when you log in, or in a subdirectory that you create yourself.

Everything you need to know about how to run your code, how to use OpenMP, and what is provided in the starter code is in the README! Download the starter code from Autolab.

You will need to add the following lines to your ~/.bashrc file:

module load gcc-4.9.2
 $ java WireGrapher [input]

More specifics on what we want in your writeup are in the Performance Analysis section.

The starter code also contains a validate script that will check that your costs file and outputs file match up correctly. Details on how to run this are in the README. When you submit your results to Autolab, they must validate with this script. More on this in the Hand In section.

## Measuring Performance

Execution time: To evaluate the performance of the parallel program, measure the following times using gettimeofday():

1. Initialization Time: the time required to do all the sundry initialization, read the command line arguments, and create the separate processes. Start timing when the program starts, and end just before the main computation starts.
2. Computation Time: this is strictly the time to compute the result. (It does not include the time necessary to print it out.) Start timing when the main computation starts (after all the processes have been created), and finish when all of the results have been calculated.

Note that Total Time = Initialization Time + Computation Time.

Speedup is calculated as T1/Tp, where T1 is the time for one processor and Tp is the time for P processors. Computation Speedup uses only the Computation Time, and Total Speedup uses the Total Time.

Cache misses: In this assignment, we will be using hardware counters to report certain performance metrics. In particular, we will measure the number of cache misses of your parallel programs by using "perf", a performance analysis tool for Linux, to report performance counters at the program level. You can use the following command to view the statistics:

 $ perf stat $PROGRAM

You can use the following command to measure the cache misses:

 $ perf stat -e cache-misses $PROGRAM

"-e" is the option to specify the events we want to report; you can see the list of events using:

 $ perf list
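As a rough illustration of the timing scheme above, the two intervals can be measured with gettimeofday() along these lines (the structure and names here are illustrative, not taken from the starter code):

```cpp
#include <sys/time.h>
#include <cstdio>

// Wall-clock seconds since the Epoch, as a double.
static double now_seconds() {
    struct timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

void run() {
    double init_start = now_seconds();
    // ... initialization: parse arguments, read the input, set up threads ...
    double comp_start = now_seconds();
    // ... main computation: route all the wires ...
    double comp_end = now_seconds();

    printf("Initialization Time: %.6f s\n", comp_start - init_start);
    printf("Computation Time:    %.6f s\n", comp_end - comp_start);
    printf("Total Time:          %.6f s\n", comp_end - init_start);
}
```

Note that Total Time here is measured as one interval rather than summed, so the identity Total = Initialization + Computation holds exactly.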


When you're running on the Xeon Phi, the job output will have the cache-miss count at the bottom after your print statements for you already. This will make more sense after reading the README in the starter code.

Here is a tutorial on perf if you're curious: https://perf.wiki.kernel.org/index.php/Tutorial

## Performance Analysis

The goal of this assignment is for you to think carefully about how real-world effects in the machine are limiting your speedup, and how you can improve your program to get better performance. If your performance is disappointing, then it is likely that you can restructure your code to make things better. We are especially interested in hearing about the thought process that went into designing your program, and how it evolved over time based on your experiments.

Your report should include the following items:

1. A detailed discussion of the design and rationale behind your approach to parallelizing the algorithm. Specifically try to address the following questions:
• What approaches have you taken to parallelize the algorithm?
• Where is the synchronization in your solution? Did you do anything to limit the overhead of synchronization?
• Why do you think your code is unable to achieve perfect speedup? (Is it workload imbalance? communication/synchronization? data movement?)
• At high thread counts, do you observe a drop-off in performance? If so, (and you may not) why do you think this might be the case?
2. The output of your program (shown graphically) for the different input circuits. You can generate this using the WireGrapher.java program explained in the Implementation Details section.
3. A plot of the Total Speedup and Computation Speedup vs. Number of Processors (Nprocs). Use Nprocs = 1, 4, 16, 64, 128, and 240. Please run your experiments on the Latedays cluster; see ASSTDIR/README for more information about how to submit jobs to Latedays.
4. A plot of the total number of cache misses for the entire program vs. Number of Processors (Nprocs).
5. A plot of the arithmetic mean of per-thread cache misses (from perf stat -e cache-misses $PROGRAM) vs. Number of Processors (Nprocs).
6. Discuss the results that you expected for all the plots in Question 2-5 and explain the reasons for any non-ideal behavior that you observe.
7. A plot of the Total Speedup and Computation Speedup on 240 threads with respect to 1 thread, where the value of P (i.e. the probability of forcing a wire to be rerouted, à la simulated annealing) is varied among 0.01, 0.1, and 0.5. (If running with 1 thread is too slow, you are free to change the baseline to 4 threads or even 16 threads.)
8. Discuss the impact of varying P on performance, explaining any effects that you see.
9. A plot of the Total Speedup and Computation Speedup on 240 threads where the input problem size is varied. There are different ways to vary the problem size, for example, the grid size, the number of wires, the average wire length, or even the layout of the wires. Here, please explore different grid sizes and numbers of wires. Please report the results for the inputs in ASSTDIR/code/inputs/problemsize. (Again, if running with 1 thread is too slow, you are free to change the baseline to 4 threads or even 16 threads.)
10. Discuss the impact of problem size (both grid size and number of wires) on performance.
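As a point of reference for the synchronization questions above, here is one hypothetical way to keep the critical region small when several threads update a shared cost grid concurrently: a per-cell atomic increment instead of a single global lock. (The function and grid layout are illustrative, not taken from the starter code; without OpenMP enabled the pragma is simply ignored and the code runs serially.)

```cpp
// Mark one horizontal wire segment on a shared occupancy grid.
// Each cell update is an independent atomic increment, so threads
// routing different wires only contend when they touch the same cell.
void mark_horizontal_segment(int* grid, int dim_x, int y, int x0, int x1) {
    for (int x = x0; x <= x1; x++) {
        #pragma omp atomic
        grid[y * dim_x + x]++;
    }
}
```

Whether fine-grained atomics, coarser locks, or private per-thread grids merged at the end work best is exactly the kind of trade-off your writeup should explore.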

We will grade your report based on the items in the Performance Analysis section, and your performance will be graded using the inputs in ASSTDIR/code/inputs/timeinput, running with 64, 128, and 240 threads. Here are the computation times (in seconds) for the reference solution:

UPDATE: Cost benchmarks are on Piazza

These times are merely a benchmark for you to shoot for. We'll calibrate grading more carefully as we see all the student submissions and times in Autolab, but we don't expect your solution to be much slower than this.

60% of your grade will be based on your performance and timings, while 40% of your grade will be based on your writeup. We really care about your thought process on parallelizing the algorithm and your analysis of the results. Do NOT make the mistake of focusing solely on the parallelization and ignoring the writeup. Even if you reach the benchmark, you won't end up with a good score if you can't explain why your final solution is good and how you got there.

If you're struggling to reach the benchmark, it's in your best interest to spend time creating a quality writeup.

Yes, we will be checking for style. We'll be relatively lenient, but you will be docked if your style is particularly bad. If you write all of your code in the main function, you will lose points.

We will also check that your job outputs print the max cost and aggregate cost values of the resulting cost arrays. We will check that these values aren't unreasonably high; you want them as low as possible.

## Hand In

This is important, read this carefully

Electronic submission through Autolab: Your submission should be a .tar.gz file (named andrewid1_andrewid2.tar.gz) that contains a single directory (named andrewid1_andrewid2) consisting of the following files and directories:

• Your entire code directory. Please run make clean before tarring and submitting. If we copy the code directory you submit to our machine, it should still compile and run without a hiccup.
• In the file_outputs_submit directory within the code directory, put in your resulting output and costs files for the benchmark tests in the Grading section. There should be 18 files submitted here, 2 per benchmark/thread count combination. These must pass the validation script.
• In the job_outputs_submit directory within the code directory, put in your resulting job outputs for the benchmark tests. There should be 9 files submitted here, 1 per benchmark/thread count combination. The starter code by default prints computation and total times. You also need to add print statements that give us your max cost and aggregate cost of your resulting cost array.
• Read the previous bullet again; I suspect a lot of people are going to miss this detail.
• writeup.pdf, your writeup in .pdf format. Your writeup should include the items listed in the Performance Analysis section.
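The max cost and aggregate cost print statements mentioned above might look something like the following sketch. (The function name, grid layout, and the choice of a plain sum for "aggregate cost" are assumptions for illustration; use whatever definition the handout and starter code specify.)

```cpp
#include <cstdio>

// Hypothetical: costs is the flattened cost grid of size dim_x * dim_y.
// Max cost is the largest cell value; aggregate cost is taken here to
// be the sum of all cell values -- substitute the required definition.
void print_cost_summary(const int* costs, int dim_x, int dim_y) {
    long max_cost = 0;
    long aggregate_cost = 0;
    for (int i = 0; i < dim_x * dim_y; i++) {
        if (costs[i] > max_cost) max_cost = costs[i];
        aggregate_cost += costs[i];
    }
    printf("Max cost: %ld\n", max_cost);
    printf("Aggregate cost: %ld\n", aggregate_cost);
}
```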

To create this tar file, run this command:

 > tar czf andrewid1_andrewid2.tar.gz andrewid1_andrewid2