Here is a random comment about optimizations.
There is overhead in creating intermediate results when you compose functions (e.g. var ip = ... performs a filter, then a map, then a collect). A naive implementation would create an entire intermediate list at each stage, which is very inefficient. Using iterators or laziness, you can instead generate one cell at a time and consume it immediately. This still generates O(n) cells in total, but only one exists at any moment.
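As a sketch of this idea (in Python, with a hypothetical filter/map pipeline, since the original code is elided), generator expressions yield one element at a time instead of materializing full intermediate lists:

```python
# Eager version: each stage materializes a complete intermediate list.
def pipeline_eager(xs):
    evens = [x for x in xs if x % 2 == 0]   # intermediate list #1
    squares = [x * x for x in evens]        # intermediate list #2
    return squares                          # final "collect"

# Lazy version: generators pull one element at a time, so only one
# cell of each intermediate stage exists at any moment.
def pipeline_lazy(xs):
    evens = (x for x in xs if x % 2 == 0)   # no list built here
    squares = (x * x for x in evens)        # nor here
    return list(squares)                    # collect only at the end

xs = range(10)
assert pipeline_eager(xs) == pipeline_lazy(xs) == [0, 4, 16, 36, 64]
```

Both versions still do O(n) work and allocate O(n) cells over the whole run; the difference is peak memory, since the lazy pipeline never holds a full intermediate list.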
The proper way to deal with this is deforestation, which eliminates the intermediate structures entirely rather than just streaming through them. Check out foldr/build and stream fusion if you are interested.
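To make the idea concrete, here is a hand-written Python sketch of what fusion frameworks like foldr/build do automatically at compile time: the filter and map stages collapse into a single loop that allocates only the final result.

```python
# Separate stages: two intermediate lists get allocated.
def unfused(xs):
    evens = [x for x in xs if x % 2 == 0]
    return [x * x for x in evens]

# Fused ("deforested") by hand: one loop, no intermediate lists;
# only the final output list is ever built.
def fused(xs):
    out = []
    for x in xs:
        if x % 2 == 0:
            out.append(x * x)
    return out

assert unfused(range(10)) == fused(range(10)) == [0, 4, 16, 36, 64]
```

The point of foldr/build and stream fusion is that the compiler performs this rewrite for you, so you can keep writing the composed, modular version.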
I was wondering if there was a good way to tell Spark that certain RDDs shouldn't persist in memory. This would be useful when you only want one or two RDDs to persist (and would likely decrease memory swapping and cache traffic). It turns out there is an aptly named unpersist method (rdd.unpersist()) which does exactly that: it marks the RDD as non-persistent and removes its blocks from memory and disk.