Previous | Next --- Slide 21 of 43
Back to Lecture Thumbnails
Cake

This syntax reminds me a lot of Microsoft's C# EntityFramework..

https://msdn.microsoft.com/en-us/library/gg696172(v=vs.103).aspx

bochet

Two types of operations:

  • Transformations: RDD -> RDD
  • Actions: RDD -> NonRDD
jedi

An important reason to highlight this distinction between transformations and actions is that transformations allow Spark to build the lineage (a DAG) of computation, which is lazily evaluated.

Using the DAG, the Spark optimizer can look for more opportunities for parallelism, short-circuiting computations and other sources of speedup. Apart from the fact that intermediate state is held in-memory instead of spilling to disk, this seems to be one of the key reasons why Spark runs faster than MapReduce for the same computation.

However, actions trigger the computation to begin, since they have side effects. They are often injected by Spark developers into code to keep the lineage from growing excessively long.