cgjdemo

One of the intentions behind developing the MapReduce framework is to process huge datasets using a large number of nodes. For huge datasets, we cannot assume that all the intermediate results fit in memory. So if the intermediate results are forced to be written to disk, can Spark still outperform Hadoop?

ericwang

@cgjdemo, I think programmers need to choose the data to keep in memory very carefully. Here are some FAQs from https://spark.apache.org/faq.html. It seems Spark can support such cases, but performance would be affected.

What happens if my dataset does not fit in memory?

Often each partition of data is small and does fit in memory, and these partitions are processed a few at a time. For very large partitions that do not fit in memory, Spark's built-in operators perform external operations on datasets.

What happens when a cached dataset does not fit in memory?

Spark can either spill it to disk or recompute the partitions that don't fit in RAM each time they are requested. By default, it uses recomputation, but you can set a dataset's storage level to MEMORY_AND_DISK to avoid this.
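For example, here is a minimal PySpark sketch of that second case (the dataset, predicate, and app name are made up), where the storage level is set to MEMORY_AND_DISK so that cached partitions spill to disk instead of being recomputed:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="storage-level-demo")

# Hypothetical large dataset; in practice this would come from HDFS etc.
data = sc.parallelize(range(100000000))

# Partitions that do not fit in RAM are written to local disk rather than
# being recomputed from the lineage on each access.
evens = data.filter(lambda v: v % 2 == 0).persist(StorageLevel.MEMORY_AND_DISK)

evens.count()  # first action materializes and caches the partitions
evens.count()  # second action reads them back from memory or disk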

lament

Semi-relevant: In lecture, it was mentioned that Spark only guarantees that memory is reserved for the end value that one requests to be computed. For example, if one had:

A = x.filter(-------)

B = A.filter(-------)

C = B.filter(----------)

[force compute of C]

Then memory is reserved for C, and possibly for B or A, though the latter two are not guaranteed. (See the sketch below.)
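A quick PySpark sketch of that example (the input data and predicates are made up), where only C is explicitly asked to be kept in memory:

from pyspark import SparkContext

sc = SparkContext(appName="lineage-demo")
x = sc.parallelize(range(1000000))  # hypothetical input RDD

A = x.filter(lambda v: v % 2 == 0)          # lazy: just records lineage
B = A.filter(lambda v: v % 3 == 0)          # lazy
C = B.filter(lambda v: v < 500000).cache()  # request that C be kept in memory

C.count()  # action forces evaluation of the whole chain and caches C;
           # A and B may only ever exist as streamed intermediate values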