The `persist` API has several options in Spark. The most common ones I use are `MEMORY_ONLY` and `MEMORY_AND_DISK`.

`MEMORY_ONLY` tries to keep the specified RDD in memory, but this is not guaranteed: if memory cannot hold all persisted RDDs, Spark evicts old partitions on an LRU basis. If evicted partitions are needed later, they are recomputed from the lineage.

`MEMORY_AND_DISK` also tries memory first, but partitions evicted under memory pressure are spilled to disk instead of being dropped, so later accesses can always read the RDD back without recomputation.

`MEMORY_AND_DISK` is not always the better choice, though. It is common for recomputing a few steps from RDDs still in memory to be faster than reading the same partitions back from disk.
sandeep6189:
Partitioning the data is also crucial for scaling the application.