Is Spark smart enough to periodically merge contiguous lines in the log file together, so as to create a single log entry and save space? For instance, would Spark be able to sense that the 4 operations could be condensed into a single line, and hence a single log entry, and do so?
pavelkang
Does this contradict the fact that Spark keeps intermediate datasets in memory, since loop fusion avoids generating intermediate data?
kayvonf
There seems to be a bit of confusion about the point of this slide. I want to clarify that the top half of the slide is a sequence of RDD operations. There are many possible implementations of this program. For example, one implementation might implement RDDs as arrays in memory.
The bottom of the slide shows one possible implementation that is memory efficient and also maximizes arithmetic intensity (it immediately consumes elements as soon as they are produced). A good Spark implementation would strive to implement the program at the top in a manner that has execution characteristics very similar to the C++ code on the bottom.
Haskell actually has support for stream fusion built in.
http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-beats-C.pdf