black

I think that is why the MapReduce framework pre-processes the data uploaded by the user: it divides the data into several parts of roughly equal size, where the number of parts corresponds to the number of compute nodes available at the time. However, this pre-processing can't guarantee that the distribution of work will be perfect; there will always be some parts that are slow, so MapReduce generates new tasks to help finish these extremely slow parts and smooth out the worst case.
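A minimal sketch of that splitting step, just to make the "parts sized by worker count" idea concrete. The `split_input` helper and the 64 MB minimum part size are illustrative assumptions, not MapReduce's real API:

```python
# Illustrative sketch only: split an input file into roughly equal-sized
# byte ranges, about one per available worker. split_input and the 64 MB
# minimum are assumptions for illustration, not MapReduce's real API.
import os

def split_input(path, num_workers, min_part_size=64 * 1024 * 1024):
    total = os.path.getsize(path)
    # Aim for about one part per worker, but never smaller than min_part_size.
    part_size = max(total // max(num_workers, 1), min_part_size)
    parts = []
    offset = 0
    while offset < total:
        parts.append((offset, min(part_size, total - offset)))
        offset += part_size
    return parts  # list of (start_offset, length) byte ranges

# Example: split_input("input.txt", num_workers=10) on a 1 GB file gives
# ~10 parts of ~100 MB each (a real framework would also align splits to
# record boundaries so no record straddles two parts).
```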

mpile13

This came up during Assignment 1: parts of the same size don't always take the same amount of time to compute. In that case, how does MapReduce determine whether one task is running too slowly and decide to add more tasks?

jinsikl

@mpile13 I'm not sure MapReduce tries to solve that problem. Its main concern is providing reliable distributed computation that scales. As far as I know, MapReduce spawns a map task for each block and never has a block being processed by multiple tasks.
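To make the "one map task per block" model concrete, here is a toy word-count sketch using Python's `multiprocessing` as a stand-in for distributed workers. The block list and the map/reduce functions are illustrative assumptions, not the framework's actual interface:

```python
# Toy sketch of "one map task per block": each block is handed to exactly
# one map task, and partial results are combined in a reduce step.
from collections import Counter
from multiprocessing import Pool

def map_task(block):
    # One map task processes exactly one block of input text.
    return Counter(block.split())

def reduce_counts(partials):
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    blocks = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    with Pool(processes=len(blocks)) as pool:
        partial_counts = pool.map(map_task, blocks)  # one task per block
    print(reduce_counts(partial_counts))
```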

nrchu

@mpile13 I believe jinsikl is correct; I don't think MapReduce adds more tasks based on whether one process is too slow. Remember that MapReduce is used for very large-scale computations. If a single node's input data is abnormal and causes some kind of error, or simply takes too long to compute, we can alert the user, who can either manually examine the case or simply discard that worker. In a very large cluster, the contribution of a single worker is not very significant (think of word count over a million pages, for example), and you will get a very close approximate answer regardless.

I am not sure what black means about MapReduce generating new tasks when some are extremely slow. Pure speculation: if we expect a large variance in computation time but have no way of predicting it, then when a certain worker node takes too long to complete, the master could duplicate the work assigned to that node and split it into N parts across N idle workers. I think this would work very well if we expect either a very short or a very long computation time, because there will be more and more idle workers as the entire algorithm becomes bottlenecked by the reducers. However, I have no idea if any MapReduce framework handles anything like this.
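For what it's worth, the original MapReduce paper does describe a simpler version of this idea, called "backup tasks": near the end of a job, the master schedules duplicate executions of the remaining in-progress tasks on idle workers and accepts whichever copy finishes first, rather than splitting the slow task into N parts. Below is a toy sketch of that duplicate-the-whole-task variant; the deadline, the simulated task, and the `concurrent.futures` machinery are illustrative assumptions, not the real framework:

```python
# Toy sketch of speculative (backup) execution: if a task runs past a
# deadline, launch a duplicate on another worker and keep whichever
# copy finishes first. Thresholds and task behavior are illustrative.
import concurrent.futures as cf
import random
import time

def map_task(block_id):
    # Simulate a task whose runtime varies a lot (the "straggler" case).
    time.sleep(random.choice([0.1, 0.1, 0.1, 5.0]))
    return f"result-for-block-{block_id}"

def run_with_backup(executor, block_id, deadline=1.0):
    primary = executor.submit(map_task, block_id)
    done, _ = cf.wait([primary], timeout=deadline)
    if done:
        return primary.result()
    # Primary looks like a straggler: launch a backup copy of the same task
    # and take whichever copy finishes first.
    backup = executor.submit(map_task, block_id)
    done, _ = cf.wait([primary, backup], return_when=cf.FIRST_COMPLETED)
    return done.pop().result()

if __name__ == "__main__":
    with cf.ThreadPoolExecutor(max_workers=8) as pool:
        results = [run_with_backup(pool, i) for i in range(4)]
    print(results)
```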