Slide 9 of 31
kayvonf

Question: What does this slide mean when it says "move computation to the data"? (Can you think of another example of moving computation to the data -- it can be unrelated to this lecture.)

ChandlerBing

I think the phrase "move computation to the data" means we schedule computation so that it requires minimal data transfer: all (or most) of the data needed for a computation is already present on the node that runs it. That way we exploit data locality and perform the computation with low communication overhead. One example this reminds me of is the bucket sort in assignment 3: we bucketed the elements that were already present on a given node (similar to running a mapper on the parts of the input file that are already stored on that node).
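
To make the idea concrete, here is a minimal sketch (in Python, with made-up names; this is not Hadoop's actual scheduler) of locality-aware task assignment: the scheduler prefers to run each map task on a node that already stores that task's input split, so the split never has to cross the network.

```python
# Sketch of "move computation to the data": assign map tasks to nodes that
# already hold the corresponding input split locally.

def assign_map_tasks(splits_by_node, idle_nodes):
    """splits_by_node: dict node -> set of input split ids stored on that node.
    idle_nodes: nodes with a free map slot.
    Returns a list of (node, split) data-local assignments."""
    assignments = []
    assigned = set()
    for node in idle_nodes:
        for split in splits_by_node.get(node, set()):
            if split not in assigned:
                # Data-local assignment: the computation moves to the data.
                assignments.append((node, split))
                assigned.add(split)
                break
    return assignments

# Example: node "n1" stores splits {0, 2}, node "n2" stores {1, 2}.
print(assign_map_tasks({"n1": {0, 2}, "n2": {1, 2}}, ["n1", "n2"]))
# e.g. [('n1', 0), ('n2', 1)]  (set iteration order may vary)
```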

rbcarlso

So the first point says to run the code where the data is, and the third point raises the possibility of duplicating jobs. Does that mean duplicating the data too, and can that cause problems with inconsistent results?

cube

Moving computation to the data can also be seen (as discussed in class) in things like web search. You send a query string to Google, which uses its index to compute the search results and sends them back. It would be far worse to move the index to your local machine and compute the results there.

sgbowen

@rbcarlso: I believe it usually does involve duplicating the data. MapReduce systems are often run on top of distributed file systems that are optimized for the given framework (Hadoop runs on HDFS, Google's MapReduce runs on GFS, etc.), and distributed file systems almost always keep several replicas of every piece of data in case of node failure. So when scheduling a job, it's usually possible to find more than one node that already has the data in question.

Also, keep in mind that in map-reduce jobs, the original input data is often read-only, and the map operation produces new, completely different data (which should be a deterministic result for a given chunk of input). So there shouldn't be a problem with inconsistent results (unless a node is somehow faulty).
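
A toy example may help illustrate the second point. The mapper below (a hypothetical word-count map function, not from any particular framework) reads an immutable input chunk and emits brand-new (key, value) pairs; because it is a pure, deterministic function of its input, running it twice, for example on two replicas of the same chunk, yields identical output, so duplicate execution cannot produce inconsistent results.

```python
# A deterministic mapper: reads a read-only chunk, emits new (key, value) pairs.

def word_count_map(chunk: str):
    """Emit (word, 1) for every word in an input chunk; never mutates the input."""
    for word in chunk.split():
        yield (word, 1)

# Running the mapper twice on the same chunk gives exactly the same pairs.
chunk = "the quick brown fox the"
assert list(word_count_map(chunk)) == list(word_count_map(chunk))
```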

bwf

In class, Kayvon mentioned that duplicating work across machines doesn't truly double the work done, since once one copy of a job returns, the outstanding duplicates can be killed. Is there anything to stop the same job from being started multiple times, with the extra copies doing repeated work before one of them finishes? Or would the granularity of the jobs be too small for this to matter?

yuel1

@bwf: in implementations such as Hadoop, a duplicate (speculative) copy of a task is only launched if that task's progress falls a certain percentage below the mean progress of all the other tasks.
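
Roughly, the policy looks something like the sketch below. The threshold value and function names here are illustrative assumptions, not Hadoop's real implementation: a backup copy is launched only for tasks lagging well behind their peers, and once any copy of a task finishes, the remaining copies are killed.

```python
# Sketch of Hadoop-style speculative execution: pick tasks whose progress
# lags a fixed fraction below the mean progress of all tasks.

def pick_speculative_tasks(progress, threshold=0.2):
    """progress: dict task_id -> fraction complete in [0, 1].
    Returns task ids slow enough to deserve a backup (speculative) copy."""
    if not progress:
        return []
    mean = sum(progress.values()) / len(progress)
    return [t for t, p in progress.items() if p < mean - threshold]

# Example: only the straggler "t3" is far enough behind to get a backup copy.
print(pick_speculative_tasks({"t1": 0.9, "t2": 0.85, "t3": 0.3}))
# e.g. ['t3']
```

This addresses @bwf's concern: duplicates are not launched eagerly for every task, only for stragglers, so the amount of repeated work stays small.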