Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

anonymous

What this study showed is that in order to allow for high parallelization, a lot of overhead had to be added. Even though the code scales really well, and has high performance with many cores, running a single thread it is twice as slow as other implementations.

365sleeping

I guess a lot of overhead is incurred by the communications. Spark is very bad at handling communications. BTW, @anonymous twice as slow?

MaxFlowMinCut

@anonymous, I think you meant twice as fast? And I agree that this study shows that it's wise to test something locally before you expend the (enormous) resources to scale out. Though Occam's Razor is pervasive throughout all of computer science, it seems like it should be heeded even more strongly in the context of parallel computing; where highly tuned optimizations can very quickly add a ton of complexity to your code (and apparently it can also actually hurt throughput).

Also, though the above distributed examples work (they're just slow), I think the message in this picture is worth keeping in mind throughout our careers.

kayvonf

Paper reference: Scalability! But at what COST?

kayvonf

Question: Class... what did we learn here?

kayvonf

@365sleeping: What do you mean by "Spark is very bad at handling communications?"

jhibshma

@kayvonf, we learned a great case of your principle: Do the simplest thing first.

Also, don't compare your performance to your own code or system running at a smaller scale. The results can deceive you. Sure, scaling may improve performance relative to your own code, but what about relative to a different, simpler version?

BensonQiu

It's interesting that most of the single-threaded implementations in the paper were intentionally chosen to be the simplest and most direct implementations the authors could think of. If the single-threaded case was optimized, single-threaded performance might be even faster than what is presented in the lecture slide.

BensonQiu

To get an accurate speedup measurement, we shouldn't necessarily compare the parallel case of an algorithm to the sequential case of the same algorithm. Instead, we should compare the parallel algorithm to the best sequential implementation that we can find.

For example, the paper authors found a 10x speedup (153s to 15s) for the single-threaded case when they switched their algorithm from label propagation to union-find. Comparing parallel label propagation to sequential union-find (rather than sequential label propagation) gives us a more accurate idea of how much faster the single-threaded case can be.

aeu

I made a reference to an article I read on the difference in performance (e.g. wall clock time from start to finish) between a Hadoop cluster and simple Unix utilities piped together. It's a fun read: Command-line tools can be 235x faster than your Hadoop cluster

The article explicitly outlines the size of the data being small is an important factor.

cmusam

@jhibshma I totally agree. Before starting to implement anything, we should always make our goals and priorities clear. Energy efficiency? Performance?