The `persist` API has several options in Spark. The most common ones I use are `MEMORY_ONLY` and `MEMORY_AND_DISK`.

`MEMORY_ONLY` tries to keep the specified RDD in memory, but this is not guaranteed: if memory cannot hold all persisted RDDs, Spark evicts old partitions on an LRU basis. If evicted partitions are needed later, they are recomputed from the lineage.

`MEMORY_AND_DISK` also tries memory first, but partitions evicted under memory pressure are spilled to disk instead of being dropped, so later accesses can always read the RDD back without recomputation.

`MEMORY_AND_DISK` is not always the better choice, though. It is common for recomputing a few steps from RDDs still in memory to be faster than reading the same partitions back from disk.
sandeep6189:
Partitioning the data is also crucial for scaling the application.