In this case, we are dealing with a narrow dependency: each element of the resulting RDD depends only on the corresponding element of its parent RDD. Hence, it is possible to implement a chain of RDD operations by executing all of the instructions on one element at a time. Doing so removes the cost of storing the intermediate collections in memory.
Spark is a runtime: even though you write your program in the above (chained) version, it automatically runs your code in the below (fused) version for you.
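A minimal sketch of the two versions, in plain Python rather than Spark, with hypothetical data standing in for the lines of a file. The staged version materializes an intermediate collection after each operation; the fused version runs both operations on one element at a time and stores no intermediates:

```python
# Hypothetical data standing in for the lines of a file.
lines = ["error: disk full", "ok", "error: timeout", "ok"]

# "Above version": each operation materializes a full intermediate list.
lengths = [len(line) for line in lines]      # intermediate collection in memory
staged = [n for n in lengths if n > 5]       # second pass over that collection

# "Below version": the map and filter fused into one per-element loop,
# so no intermediate collection is ever stored.
fused = []
for line in lines:
    n = len(line)         # map step
    if n > 5:             # filter step
        fused.append(n)

assert staged == fused    # same result, without the intermediate list
```

Both versions compute the same final result; the difference is only in how much intermediate data lives in memory along the way.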
Spark reduces the inefficiency of chained operations on collections by avoiding the materialization of intermediate data.
By executing all of the instructions per element, we reduce the amount of memory needed to store intermediate results; only the final result matters. "Fusing" the loops that would otherwise run stage by stage over complete intermediate collections lets us operate within a low memory budget.
The above Spark code is an abstraction of what the programmer wants to happen: map and filter over all the lines in the file. Spark implements it in this "loop fusion" manner, so the intermediate data never needs to be stored in memory all at once, and we avoid running out of memory.
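The same idea can be sketched with lazy generators, which pipeline elements one at a time the way fused execution does; nothing here is Spark's actual machinery, just an analogy with a made-up line source. Each element flows through the map and filter steps individually, so memory use stays constant regardless of how many lines there are:

```python
def read_lines():
    # Stand-in for reading a large file line by line (hypothetical data).
    for i in range(1_000_000):
        yield f"line {i}"

# Generator expressions are lazy: like fused execution, each element passes
# through the map and the filter before the next element is read, and no
# intermediate collection is ever built.
mapped = (len(s) for s in read_lines())
filtered = (n for n in mapped if n % 2 == 0)

# Only the final reduction forces evaluation; memory stays constant.
total = sum(filtered)
```

If we had instead written `mapped = [len(s) for s in read_lines()]`, the full million-element intermediate list would be materialized before the filter ever ran, which is exactly the cost that fusion avoids.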