kayvonf
Abandon

Although the performance improvement that frameworks like Spark bring is dwarfed by the gains from simply using better storage, it is unfair if we don't consider data size and cost. Here, the dataset is only 5.76 GB. What if there were a 10 TB dataset? How could we fit it into RAM? Maybe you could get a machine with 10 TB of RAM, but I don't think that would be cost-efficient. So, my point is that big-data processing at that scale is exactly what these frameworks are designed for.
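
As a rough illustration, here is a minimal PySpark sketch (the HDFS path and the word-count job are just placeholders I made up) of what I mean: the framework splits the input into partitions and each executor only ever pulls its own partitions into memory, so the full dataset never has to fit on any single machine.

```python
# Minimal PySpark sketch -- hypothetical path and job, just to show the idea
# that a dataset far larger than one machine's RAM is processed partition by
# partition across the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigAggregation").getOrCreate()

# Each executor loads only its own partitions, never the whole 10 TB.
lines = spark.sparkContext.textFile("hdfs:///data/huge-dataset/*.txt")

# Classic word count: map each word to (word, 1), then reduce by key.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# Bring back only a small sample of results to the driver.
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```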

eourcs

If I'm reading the figures correctly, single-threaded implementations loading from SSD are still faster than, or at least competitive with, the big-data frameworks. I think the point of the article is that it's very easy to conflate scalability with performance; we like to pretend that these frameworks are zero-cost abstractions, but they absolutely are not.
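
For contrast, here is a single-threaded sketch of the same kind of job (again, the path and the word-count task are placeholders): it just streams the file from SSD, so it doesn't need the whole dataset in RAM either, and it carries none of the framework's coordination and serialization overhead.

```python
# Single-threaded sketch -- hypothetical path, just to show that streaming
# from SSD counts words without ever holding the whole dataset in RAM.
from collections import Counter

counts = Counter()
with open("/data/huge-dataset.txt") as f:
    for line in f:                      # the file is read incrementally
        counts.update(line.split())

# Report the ten most common words.
for word, count in counts.most_common(10):
    print(word, count)
```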

And yes, there is some size of data where the scale of the frameworks will shine, but we should be careful to consider if we really need these frameworks. I think if the question is "how do I make this run faster," your first thought should be "write better code," not "buy a bunch of new machines and run Spark." At the end of the day, we care about scalability because of its relation with performance; if it is not achieving this goal, how can we say we've made progress?