yikesaiting

The goal of Spark shows us that there is no guarantee that Spark will be faster than Hadoop+GFS in every case. In fact, in my opinion, with a good distributed file system like GFS, some tasks running on an incredibly large amount of data will take less time on Hadoop.

ote

@yikesaiting I don't really see how you can make the guarantee that some tasks running on an incredibly large scale of data will take less time for Hadoop than Spark. Could you elaborate? Is it like a specific kind of task?

temmie

@yikesaiting I'm also curious about how Hadoop+GFS could ever be faster, since surely keeping data in memory will always be faster than writes? Is there some cost for ensuring fault tolerance in Spark that could outweigh those benefits? Or does Hadoop take a similar approach to Spark that saves us the cost of writes?

fleventyfive

For the curious, SpliceMachine is a company that sells a hybrid, in-memory database that uses both Hadoop and Spark! It's an interesting case where a database uses an in-memory architecture for fast transactions and also has a Hadoop backend that can process complex analytical queries.

randomized

Can anyone clarify "writing of intermediates"? I guess in a map-reduce job, the output of the mapper is the intermediate entity. So is it referring to the case when multiple map-reduce jobs are executed sequentially?

BensonQiu

@randomized: Yes, I think the mapper's output counts as an intermediate result. It's also possible to have MapReduce jobs with multiple iterative map/reduce phases. In that case, you could potentially have several intermediate map and reduce outputs.
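
To make the "writing of intermediates" point concrete, here is a minimal PySpark sketch (the input file "pages.txt" and the local setup are hypothetical, and this is only an illustration, not the course's reference code). If the same two passes were run as chained Hadoop MapReduce jobs, the first job's output would be written to the distributed file system and read back by the second; in Spark the intermediate RDD can simply stay cached in memory.

```python
# Minimal PySpark sketch; assumes a local Spark installation and a
# hypothetical input file "pages.txt".
from pyspark import SparkContext

sc = SparkContext("local[*]", "intermediates-demo")

# Pass 1: a map/reduce-style word count.
counts = (sc.textFile("pages.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

# Keep the intermediate result in memory instead of materializing it
# to the distributed file system between jobs.
counts.cache()

# Pass 2: reuse the cached intermediate -- the ten most frequent words --
# with no disk round trip in between.
top10 = counts.map(lambda kv: (kv[1], kv[0])).sortByKey(False).take(10)
print(top10)

sc.stop()
```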

teamG

@temmie

I think even though Spark is a rising trend and has a clear advantage in utilizing memory, MapReduce is still better in some use cases.

For example, in this paper: http://www.vldb.org/pvldb/vol8/p2110-shi.pdf

Even though MapReduce got outperformed by Spark in many cases like Word Count (the classic example), PageRank, etc., it actually did better on a sorting benchmark called TeraSort. Under the same settings, MapReduce was actually about 2x faster than Spark.

One explanation the paper gives is that MapReduce is more efficient at "shuffling data". I guess this is relevant to the overhead from the communication dependencies in Spark. In MapReduce, a lot of the overhead of moving data around can "overlap with the map stage".
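
To see what "shuffling data" means here, a minimal PySpark sketch of a TeraSort-style job is below (the input path, output path, and record layout are hypothetical). Sorting by key repartitions every record across the cluster, so the entire dataset crosses the network in the shuffle; that is exactly the phase where the paper found MapReduce's ability to overlap shuffle work with the map stage to pay off.

```python
# Hypothetical TeraSort-style sketch in PySpark; input/output paths and the
# (10-byte key, payload) split are assumptions for illustration only.
from pyspark import SparkContext

sc = SparkContext("local[*]", "terasort-sketch")

# Parse each line into a (key, payload) pair.
records = (sc.textFile("terasort_input.txt")
             .map(lambda line: (line[:10], line[10:])))

# The global sort: this single call triggers a full shuffle of all records.
sorted_records = records.sortByKey()

sorted_records.saveAsTextFile("terasort_output")
sc.stop()
```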