Slide 28 of 43
williamx

In this case, we are dealing with narrow dependencies (each element of an RDD depends only on the corresponding element produced by the previous operation). Hence, multiple RDD operations can be implemented by executing all of the instructions on one element at a time. Doing so removes the cost of storing the intermediate collections in memory.
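
To make the contrast concrete, here is a minimal Scala sketch (a hypothetical lowercase-then-filter pipeline, not the slide's actual program) showing the two strategies: materializing every intermediate collection versus fusing the operations into a single per-element loop.

```scala
// Hypothetical pipeline (not the slide's exact program): lowercase every
// line, then keep only the lines containing "error".

// Version 1: each operation runs to completion and materializes a full
// intermediate collection before the next one starts.
def processMaterialized(lines: Seq[String]): Seq[String] = {
  val lowered  = lines.map(_.toLowerCase)            // intermediate collection
  val matching = lowered.filter(_.contains("error")) // final result
  matching
}

// Version 2: the same logic "fused" into one per-element loop, so no
// intermediate collection is ever stored.
def processFused(lines: Seq[String]): Seq[String] = {
  val out = scala.collection.mutable.ArrayBuffer[String]()
  for (line <- lines) {
    val lowered = line.toLowerCase     // the map's work, on one element
    if (lowered.contains("error"))     // the filter's work, on the same element
      out += lowered
  }
  out.toSeq
}
```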

life

Spark is a runtime where, even though you write your program in the above (operation-by-operation) version, it automatically runs your code as the below (fused) version for you.

machine6

Spark reduces inefficiencies among operations on collections by reusing intermediate data.

koala

By executing all the instructions per element, we reduce the amount of memory needed for intermediate results; only the final result has to be stored. By "fusing" the loops that would otherwise run stage by stage over complete intermediate collections from the previous stage, we can make the program work even when memory is limited.

o_o

The Spark code above is an abstraction of what the programmer wants to happen. The programmer wants to map and filter over all the lines in the file, and Spark carries this out by implementing the code in a "loop fusion" manner. As a result, Spark doesn't need to store all of the intermediate data in memory, so we won't run out of memory.
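
As a rough model of how a runtime can get this behavior, here is a sketch using Scala's lazy Iterator: chaining map and filter on an iterator does its work one line at a time as results are demanded, so no intermediate collection is ever built. Spark's actual execution machinery is more involved, but the per-element idea is the same; the file path and the "error" predicate are made up for illustration.

```scala
import scala.io.Source

// Lazy, per-element evaluation: getLines() yields an Iterator, and map/filter
// on an Iterator process one element at a time on demand, so no intermediate
// collection of lines is materialized.
def errorLines(path: String): Iterator[String] =
  Source.fromFile(path)
    .getLines()
    .map(_.toLowerCase)             // per-element map
    .filter(_.contains("error"))    // per-element filter

// Storage is only allocated when the caller forces the final result:
// val result = errorLines("app.log").toList
```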