snshah

Just to recap, I believe the two proposed solutions to this problem (counting the total number of CS courses all the students in the room are enrolled in) were: 1. Parallelize by pairing up students, summing the number of courses between them, then pairing up the sums, summing those, etc., which would take log(n) time. 2. Parallelize by summing each row, which is limited by either the longest row or the number of rows.
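For concreteness, here is a rough sketch of the two strategies (the data and function names are made up, and the simulation is sequential; in reality each round of additions would happen in parallel):

```python
# Rough sketch of the two reduction strategies, assuming each "student"
# is just an integer count of CS courses.

def tree_reduce(counts):
    """Pairwise reduction: ~log2(n) rounds of (notionally parallel) additions."""
    level = list(counts)
    rounds = 0
    while len(level) > 1:
        # Each adjacent pair is summed "in parallel" in one round.
        level = [level[i] + level[i + 1] if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

def row_by_row(rows):
    """Each row sums itself (one notionally parallel step per row),
    then a single pass combines the row sums."""
    row_sums = [sum(row) for row in rows]
    return sum(row_sums)

counts = [3, 1, 2, 4, 0, 2, 5, 1, 3]
total, rounds = tree_reduce(counts)
print(total, rounds)   # 21 courses, in ceil(log2(9)) = 4 rounds
rows = [counts[i:i + 3] for i in range(0, len(counts), 3)]
print(row_by_row(rows))  # same total, but structured by row
```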

The context of the problem is what made the second solution more efficient in the classroom, due to communication overhead (trying to pair people up across the room could take a long time).

Does anyone have examples of real-world contexts of parallelism where the theoretically optimal solution is less effective than the one that works best in practice? I'm having trouble coming up with something very different from what we did in class.

aaguilet

@snshah I'm not sure if this applies, but one thing that came to mind is the case when you're dealing with specific tools or frameworks, like Hadoop. You need to finely tune the amount of data per map and the number of maps, among other things, for the optimized solution to be efficient. For example, processing small blocks of data is inefficient (Hadoop is optimized for large ones), and having too many maps can cause scheduling overhead.
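To make the tradeoff concrete, here is a back-of-envelope cost model; the constants are invented for illustration, not measured Hadoop numbers, and only the shape of the curve matters:

```python
import math

TOTAL_MB = 10_000          # total input size (made up)
PER_MAP_OVERHEAD_S = 2.0   # scheduling + startup cost per map task (assumed)
THROUGHPUT_MB_S = 50.0     # processing rate of one map task (assumed)
SLOTS = 20                 # map tasks that can run concurrently (assumed)

def job_time(split_mb):
    n_maps = math.ceil(TOTAL_MB / split_mb)
    per_map = PER_MAP_OVERHEAD_S + split_mb / THROUGHPUT_MB_S
    waves = math.ceil(n_maps / SLOTS)   # how many "waves" of map tasks run
    return waves * per_map

for split in (8, 64, 256, 1024):
    print(f"{split:>5} MB splits -> {job_time(split):7.1f} s")
# Tiny splits pay the fixed per-map overhead over and over; very large
# splits leave slots idle in the last wave. The sweet spot is in between.
```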

I found this post which talks about it in more detail.

shabnam

To add to @aaguilet's comment, I have seen this happen myself. Theoretically, Hadoop was supposed to perform better on my dataset with a big cluster, but it performed much worse. The reason, we later found out, was that each mapper had an I/O output buffer that wasn't large enough. This led to what is known (in Hadoop terms) as "spilling" after every map task, and the extra I/O cost wiped out any speedup gained from the parallelization Hadoop provided.

black

Communication really needs to be taken into consideration when we implement a parallel program. MapReduce creates the reducers at the very beginning, and they fetch data as soon as a mapper finishes its task. The data transfer time can be reduced significantly because of this overlap!
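A toy sketch of that overlap, using plain threads and a queue rather than Hadoop's actual shuffle machinery: the "reducer" is started up front and consumes partial results as soon as each "mapper" produces one, instead of waiting for the whole map phase to finish.

```python
import threading, queue, time, random

results = queue.Queue()

def mapper(task_id):
    time.sleep(random.uniform(0.1, 0.5))   # pretend to process a split
    results.put(task_id * task_id)         # emit a partial result

def reducer(n_tasks):
    total = 0
    for _ in range(n_tasks):
        total += results.get()             # fetch as soon as any mapper is done
    print("reduced total:", total)

n = 8
mappers = [threading.Thread(target=mapper, args=(i,)) for i in range(n)]
red = threading.Thread(target=reducer, args=(n,))
red.start()                                # reducer exists before any map finishes
for t in mappers:
    t.start()
for t in mappers:
    t.join()
red.join()
```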

kleino

Even looking at how MapReduce is implemented, you notice it doesn't exactly use the tree structure mentioned above, but rather works on several big chunks of memory and generally processes the data within each chunk sequentially. This lets you take advantage of both parallelism and computers' ability to work extremely quickly on sequential tasks, and it demonstrates the tradeoff between the costs and benefits of parallelism. And of course, you can make this more efficient using latency-reducing techniques like the one black describes above.
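A minimal sketch of that structure for a simple sum (this is not how any real MapReduce framework is implemented, just the shape of the idea): parallelism across a few big chunks, a plain sequential loop within each chunk, and a small final combine instead of a deep reduction tree.

```python
from multiprocessing import Pool

def chunk_sum(chunk):
    total = 0
    for x in chunk:          # sequential inner loop: fast, cache-friendly, no sync
        total += x
    return total

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 250_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(processes=4) as pool:
        partials = pool.map(chunk_sum, chunks)   # one worker per big chunk
    print(sum(partials))     # tiny final combine, not a log-depth tree
```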

jon

I feel like the analogy between the demo and a real parallel program breaks down in that there was no "algorithm". If we had been told beforehand to pass our partial sums to the left and then to the front of the classroom, we would have taken much less time.

However, even if we had meticulously planned out the O(log(n)) span "reduce" function from 150/210, I believe that we would find it to be slower than the O(sqrt(n)) span "row by row" solution. The main reason is that in the "row by row" solution, all communications occur between neighbors, while the "reduce" algorithm requires long distance communication. I believe the same would be true on supercomputers, which are commonly laid out as grids/tori (http://en.wikipedia.org/wiki/Torus_interconnect).
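A crude hop-count model of that argument, assuming n people in a rows x cols grid, one time unit per grid hop, and each round costing as much as its longest single message (these modeling choices are mine, not from the lecture):

```python
def manhattan(i, j, cols):
    return abs(i // cols - j // cols) + abs(i % cols - j % cols)

def tree_reduce_time(rows, cols):
    """Binomial-tree reduce over row-major indices: at each round,
    index i sends its partial sum to index i - stride."""
    n = rows * cols
    total, stride = 0, 1
    while stride < n:
        dists = [manhattan(i, i - stride, cols)
                 for i in range(stride, n, 2 * stride)]
        total += max(dists)   # the round waits for its longest message
        stride *= 2
    return total

def row_by_row_time(rows, cols):
    """Pass along each row in parallel (cols-1 neighbor hops),
    then pass the row sums down one column (rows-1 neighbor hops)."""
    return (cols - 1) + (rows - 1)

for side in (4, 8, 16, 32):
    n = side * side
    print(f"n={n:4d}  tree={tree_reduce_time(side, side):3d}  "
          f"rows={row_by_row_time(side, side):3d}")
# Once every message is charged its grid distance, the tree's later rounds
# each cross a large fraction of the grid, and its total hop-time lands at
# roughly 2*sqrt(n), the same order as the neighbor-only scheme, so the
# theoretical log(n) advantage disappears. This toy model ignores congestion
# and relaying on the long links, which would only hurt the tree further.
```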