After we obtain mobileViews, how do we use them to get howMany? Do each node compute howMany and merge? Or each node merge mobileViews and compute howMany?
I didn't show it on the slide, but you should assume that the implementation of the count() action "consumes" a partition of mobileViews, and computes the size of the partition. (After the partition of mobileViews's size is computed it can be freed if desired by the Spark scheduler.) Then, there's a basic sum reduction of the partition counts across the cluster, with the final result returned to the application as the return value of mobileViews.count().
Do the orange boxes mean "in-memory"? Then won't we have locality problems by splitting one dataset across multiple nodes?
@pavelkang. Yellow boxes correspond to partitions of RDDs. Gray boxes are files on disk. Note that an RDD partition could certainly be implemented as a file on disk (materialized on disk) or as a buffer in memory (materialized in memory).