Is Spark smart enough to periodically merge contiguous lines in the log file together, so as to create a single log entry and save space? For instance, would Spark be able to sense that the 4 operations could be condensed into a single line, and hence a single log entry, and do so?
pavelkang
Does this contradict the fact that Spark keeps intermediate datasets in memory, since loop fusion avoids generating intermediate data?
kayvonf
There seems to be a bit of confusion about the point of this slide. I want to clarify that the top half of the slide is a sequence of RDD operations. There are many possible implementations of this program. For example, one implementation might implement RDDs as arrays in memory.
The bottom of the slide shows one possible implementation that is memory efficient and also maximizes arithmetic intensity (it immediately consumes elements as soon as they are produced). A good Spark implementation would strive to implement the program at the top in a manner that has execution characteristics very similar to the C++ code on the bottom.
Haskell actually has support for stream fusion built in.
http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-beats-C.pdf