Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

Previous | Next --- Slide 14 of 31

cacay

Here is a random comment about optimizations.

There is an overhead of creating intermediate results when you compose functions (e.g. var ip = ... does filter followed by map followed by collect). A naive implementation would create entire intermediate lists which is very inefficient. Using iterators or laziness, you can generate one cell at a time and collect it immediately. This still generates O(n) cells, but only one exists at a time.

The proper way to deal with this is deforestation. Check out foldr/build, and stream fusion if you are interested.

scedarbaum

I was wondering if there was a good way to tell Spark that certain RDDs shouldn't persist in memory. This would be useful for when you only wanted 1 or 2 RDDs to persist (would likely decrease memory swapping and cache traffic). Turns out there is an aptly named "unpersist" function which does exactly that.