kayvonf

It is not written down on the slide, but in class we discussed the semantics of the system entrypoint mapReduceJob. (You needed to know the semantics in order to propose a valid implementation of those semantics.) Can someone try to describe the semantics again for us here?

landuo

I think the semantics of runMapReduceJob are: run the mapper function on every line of the input, gather the intermediate results for the reducer, run the reducer function on each key of the data generated by the mapper, and store the result in the output directory.

sanchuah

The mapper function is applied to each line of the input and emits intermediate key-value pairs for the reducer. The reducer function gets the intermediate values as (key, list_of_values) pairs, does something with them, and then stores the processed values.
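To make that concrete, here is a minimal sequential sketch of the contract as I understand it (Python, and all the names are placeholders rather than the real assignment API):

```python
from collections import defaultdict

def run_map_reduce_job(mapper, reducer, input_lines, output):
    # Map phase: apply the mapper to every input line; each call may emit
    # any number of (key, value) pairs.
    intermediate = defaultdict(list)
    for line in input_lines:
        for key, value in mapper(line):
            intermediate[key].append(value)

    # Reduce phase: only after ALL map output has been collected, run the
    # reducer once per key with the full list of values for that key.
    for key, values in intermediate.items():
        output[key] = reducer(key, values)

# Example usage: word count over a tiny in-memory "input file".
def word_count_mapper(line):
    for word in line.split():
        yield (word, 1)

def word_count_reducer(key, values):
    return sum(values)

out = {}
run_map_reduce_job(word_count_mapper, word_count_reducer,
                   ["the cat sat", "the dog sat"], out)
# out == {"the": 2, "cat": 1, "sat": 2, "dog": 1}
```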

marjorie

I agree with landuo and sanchuah. I'd just add one thing that's NOT in the semantics, which is ordering. runMapReduceJob doesn't necessarily have to do all of the mapping and THEN all the reducing; it's a valid implementation to process part of the input completely, output the result, and then move on to the next chunk of input.

kayvonf

@marjorie. I'm glad you made that comment. Actually, that would not be a correct implementation, because the system needs all the tuples for a given key when it runs the reduce function for that key. If it hasn't run all the map tasks, the system can't be guaranteed that all the tuples for that key are available, and thus cannot guarantee to provide them all to the reducer.
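To make that concrete with a tiny made-up example: if the same key is emitted by map tasks running on two different input chunks, reducing it after only the first chunk finishes gives the wrong answer.

```python
# Made-up data: ("ads.html", 1) is emitted by map tasks on two different
# input chunks, so reducing that key after only chunk 1 is incorrect.
chunk1_pairs = [("ads.html", 1), ("index.html", 1)]
chunk2_pairs = [("ads.html", 1)]

early_count = sum(v for k, v in chunk1_pairs if k == "ads.html")                 # 1 (wrong)
full_count = sum(v for k, v in chunk1_pairs + chunk2_pairs if k == "ads.html")   # 2 (correct)
```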

toutou

When I checked the logs in AWS after some jobs failed, I found that some mapper and reducer processes had completed while others were only partially done. So I'm confused: when is a reducer supposed to start in the MapReduce framework? Does it begin once some intermediate data has been written to disk?

parallelfifths

@toutou, I'm assuming that, as with most processing of continuous data streams, this has to be done in batches of some sort, because the click log isn't going to have any finite end in the physical sense. Based on Kayvon's response, I understand why @marjorie's interpretation of the semantics is incorrect for a given batch, but it seems like we've just "passed the buck" on the data stream problem, because we still have to choose how to segment the stream such that we have a collection of keys and values that is useful. I suppose these problems are outside of the scope of our concerns in this class, though. :)