Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

In-Memory Distributed Computing using Spark

Slide 21 of 43

Cake

This syntax reminds me a lot of Microsoft's Entity Framework for C#.

https://msdn.microsoft.com/en-us/library/gg696172(v=vs.103).aspx

bochet

Two types of operations:

Transformations: RDD -> RDD
Actions: RDD -> non-RDD (a value returned to the driver, or output written to storage)
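
To make the distinction concrete, here is a minimal pure-Python sketch. `ToyRDD` and its methods are hypothetical stand-ins, not Spark's actual API. They only mimic the shape of the real operations: transformations return another `ToyRDD` without doing any work, while actions run the recorded steps and return an ordinary Python value.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # underlying data (a re-readable iterable)
        self._steps = steps    # recorded transformations, not yet executed

    # Transformations: ToyRDD -> ToyRDD (only record the step)
    def map(self, f):
        return ToyRDD(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    # Actions: ToyRDD -> plain value (run the recorded steps)
    def collect(self):
        out = []
        for x in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    x = fn(x)
                elif not fn(x):  # a filter step rejected this element
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

    def count(self):
        return len(self.collect())


calls = []  # records every time the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

nums = ToyRDD([1, 2, 3, 4])
evens = nums.map(square).filter(lambda v: v % 2 == 0)  # still a ToyRDD
calls_before_action = len(calls)  # nothing has run yet
result = evens.collect()          # action: returns a plain list
```

Here `evens` is just a description of work to do; `square` is not called until `collect()` runs. In this toy, every action replays the recorded steps from the source.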

jedi

An important reason to highlight this distinction between transformations and actions is that transformations allow Spark to build the lineage (a DAG) of computation, which is lazily evaluated.

Using the DAG, Spark's scheduler can look for more opportunities for parallelism, pipeline consecutive transformations, skip computations whose results are never needed, and find other sources of speedup. Along with keeping intermediate state in memory instead of writing it to disk between steps, this seems to be one of the key reasons why Spark runs faster than MapReduce for the same computation.
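
One way to picture the pipelining part: in a chain of element-wise transformations, each element can flow through every step before the next element is touched, so no intermediate collection is ever built. The sketch below imitates that with plain Python generators. It is only an analogy for how such steps can be fused, not how Spark is implemented, and the names (`traced`, `events`, `double`, `is_big`) are made up for the illustration.

```python
events = []  # order in which steps touch elements

def traced(name, fn):
    """Wrap fn so each call is logged as (step name, input)."""
    def wrapper(x):
        events.append((name, x))
        return fn(x)
    return wrapper

double = traced("double", lambda x: x * 2)
is_big = traced("is_big", lambda x: x > 2)

data = [1, 2, 3]

# Lazy chain: building it runs neither double nor is_big.
doubled = (double(x) for x in data)
big = (y for y in doubled if is_big(y))
events_before = list(events)  # snapshot taken before anything is consumed

# Consuming the chain plays the role of an action.
pipelined_result = list(big)
```

After the final line, `events` alternates between the two steps (double 1, is_big 2, double 2, is_big 4, ...) rather than listing all the doubling first, which is the point: the full list of doubled values never exists as a separate intermediate result.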

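Actions, on the other hand, trigger the computation, since they must return a result to the driver or write output. Spark developers often insert them into code (usually together with persisting or checkpointing the RDD) to materialize intermediate results and keep the lineage from growing excessively long.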
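
A rough sketch of why long lineages matter, again with a hypothetical `ToyRDD` rather than Spark's API: every `map` adds a node to the lineage, and an action has to replay the whole chain back to the nearest concrete data. The made-up `materialize()` method stands in for the general idea of forcing evaluation and restarting from the result. In real Spark the closest counterpart is, as far as I understand, `checkpoint()` followed by an action, since caching alone keeps the lineage around.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # concrete data, only set on a materialized node
        self._parent = parent  # upstream ToyRDD this one was derived from
        self._fn = fn          # per-element function applied to the parent

    def map(self, fn):
        # Transformation: adds one node to the lineage, computes nothing.
        return ToyRDD(parent=self, fn=fn)

    def lineage_depth(self):
        # Number of pending transformations back to materialized data.
        depth, node = 0, self
        while node._parent is not None:
            depth, node = depth + 1, node._parent
        return depth

    def collect(self):
        # Action: replays the lineage from the nearest materialized ancestor.
        if self._parent is None:
            return list(self._data)
        return [self._fn(x) for x in self._parent.collect()]

    def materialize(self):
        # Runs an action and starts a fresh lineage from its result.
        return ToyRDD(data=self.collect())


rdd = ToyRDD(data=[0, 0, 0])
for step in range(1, 111):
    rdd = rdd.map(lambda x: x + 1)
    if step % 25 == 0:
        rdd = rdd.materialize()  # cut the lineage every 25 iterations

depth_at_end = rdd.lineage_depth()  # only the maps since the last cut
values = rdd.collect()
```

Without the periodic `materialize()` call the final lineage would be 110 nodes deep; with it, only the 10 maps after the last cut remain pending.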