Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

In-Memory Distributed Computing using Spark

Previous | Next --- Slide 27 of 44

haboric

After we obtain mobileViews, how do we use them to get howMany? Do each node compute howMany and merge? Or each node merge mobileViews and compute howMany?

kayvonf

I didn't show it on the slide, but you should assume that the implementation of the count() action "consumes" a partition of mobileViews, and computes the size of the partition. (After the partition of mobileViews's size is computed it can be freed if desired by the Spark scheduler.) Then, there's a basic sum reduction of the partition counts across the cluster, with the final result returned to the application as the return value of mobileViews.count().

pavelkang

Do the orange boxes mean "in-memory"? Then won't we have locality problems by splitting one dataset across multiple nodes?

kayvonf

@pavelkang. Yellow boxes correspond to partitions of RDDs. Gray boxes are files on disk. Note that an RDD partition could certainly be implemented as a file on disk (materialized on disk) or as a buffer in memory (materialized in memory).