Previous | Next --- Slide 27 of 44
Back to Lecture Thumbnails

After we obtain mobileViews, how do we use them to get howMany? Do each node compute howMany and merge? Or each node merge mobileViews and compute howMany?


I didn't show it on the slide, but you should assume that the implementation of the count() action "consumes" a partition of mobileViews, and computes the size of the partition. (After the partition of mobileViews's size is computed it can be freed if desired by the Spark scheduler.) Then, there's a basic sum reduction of the partition counts across the cluster, with the final result returned to the application as the return value of mobileViews.count().


Do the orange boxes mean "in-memory"? Then won't we have locality problems by splitting one dataset across multiple nodes?


@pavelkang. Yellow boxes correspond to partitions of RDDs. Gray boxes are files on disk. Note that an RDD partition could certainly be implemented as a file on disk (materialized on disk) or as a buffer in memory (materialized in memory).