kayvonf

Question: How might a MapReduce scheduler leverage program structure to avoid having to restart the entire program from the last checkpoint when one node fails?

kayvonf

Another question: How would you implement a checkpoint of an arbitrary C program?

ekr

Maybe the program could maintain a DAG of node dependencies? Then if a node fails, only the nodes affected by it need to be restarted, while the independent parts of the program continue normally.
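
A minimal sketch of that idea, assuming each node records which downstream nodes consume its output (the `Node` struct and `mark_for_restart` are made-up names, not part of any real framework): when a node fails, only it and its transitive dependents are flagged for re-execution, and independent parts of the DAG keep running.

```c
#include <stdbool.h>

#define MAX_DEPS 16

typedef struct Node {
    bool needs_restart;
    int num_dependents;
    struct Node *dependents[MAX_DEPS];   /* nodes that consume this node's output */
} Node;

/* Mark the failed node and everything downstream of it for re-execution;
   nodes not reachable from the failed node are left untouched. */
void mark_for_restart(Node *failed) {
    if (failed->needs_restart)
        return;                          /* already visited */
    failed->needs_restart = true;
    for (int i = 0; i < failed->num_dependents; i++)
        mark_for_restart(failed->dependents[i]);
}
```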

meatie

I think it might also be helpful to study the causes of memory failure, so that we can design corresponding strategies.

For example, if memory failure is more likely when the temperature is high, then we could start keeping logs of updates only when the temperature exceeds some threshold. In this case, we do not need to keep logs the whole time, which reduces the overhead.
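
A tiny sketch of that strategy, with `read_temperature()` and `log_update()` as hypothetical placeholders for a hardware sensor read and the logging mechanism (the threshold value is also just an assumption): the update log is only maintained while the measured temperature is above the threshold.

```c
#include <stddef.h>

#define TEMP_THRESHOLD_C 70.0   /* assumed "high failure risk" temperature */

extern double read_temperature(void);                     /* hypothetical sensor read */
extern void log_update(const void *addr, size_t nbytes);  /* hypothetical log append */

/* Called before each tracked memory update (or every Nth update,
   to amortize the cost of reading the sensor). */
void maybe_log_update(const void *addr, size_t nbytes) {
    if (read_temperature() > TEMP_THRESHOLD_C)
        log_update(addr, nbytes);   /* pay the logging overhead only while risk is high */
}
```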

Kapteyn

To implement checkpoints for an arbitrary C program, we could periodically log the execution context (register file, stack, instruction pointer). I think we would actually need to store at least the dirty lines in the cache (and the lines in the write buffer) as well, because otherwise write-backs that have yet to happen would be lost when we restart the program from the checkpoint.

kayvonf

@Kapteyn: You'd have to checkpoint all of the process's memory as well. (And that's the real cost.)
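
One cheap way to get "execution context plus all of memory" on Linux is to let the OS do it: `fork()` gives the child a copy-on-write snapshot of the parent's entire address space and register state. Below is a minimal, Linux-specific sketch of that idea (it ignores open-file offsets, sockets, and other kernel state, and `PR_SET_PDEATHSIG` is Linux-only); it is an illustration, not how production checkpointing tools actually work.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Take a checkpoint: the child process *is* the checkpoint.
   It sleeps until the parent dies, then resumes execution from
   this point with the memory image and registers as they were
   when the checkpoint was taken. */
void take_checkpoint(void) {
    pid_t child = fork();
    if (child > 0) {
        return;                           /* parent: keep computing */
    }
    if (child == 0) {
        prctl(PR_SET_PDEATHSIG, SIGCONT); /* wake the child when the parent exits */
        raise(SIGSTOP);                   /* sleep until that happens */
        fprintf(stderr, "parent gone; resuming from checkpoint\n");
        return;                           /* continue from the checkpointed state */
    }
    perror("fork");                       /* checkpoint failed; parent keeps running */
}
```

The copy-on-write pages make taking the snapshot cheap, but the process still logically retains a full copy of its memory at the checkpoint, which is exactly the cost kayvonf is pointing at.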

BigFish

For the first question, it depends on what type of failure occurs.

  1. If a map task fails, it is unnecessary to rerun the whole program, since the outputs of the successful map tasks have already been written to local disk. Simply retrying the failed map task, possibly on another machine, is sufficient. However, there should be a maximum number of retries (see the sketch after this list).

  2. If a worker node fails during the map or shuffle phase, you need to rerun all map tasks that ran on that node, since their output is stored on that node's local disk rather than in HDFS and is therefore inaccessible to the other nodes.

  3. If a reduce task fails, it is much more complicated. Imagine that the failure is caused by the node running that reduce task going down: all the data it fetched from the mappers is lost, and it is hard to figure out where that data came from and which map tasks need to be rerun. It seems necessary to redo the shuffle phase to fetch the data required by this reduce task.
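
A rough sketch of the master-side bookkeeping for cases 1 and 2 above, using a made-up task table (none of these names come from Hadoop or the MapReduce paper): a failed map task is marked incomplete and retried up to a cap, while a dead worker causes every map task that ran on it to be marked incomplete, even the ones that had finished, because their output lived only on that worker's local disk.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_ATTEMPTS 4        /* cap on per-task retries (case 1) */
#define MAX_TASKS    1024

typedef enum { MAP, REDUCE } TaskKind;

typedef struct {
    TaskKind kind;
    int attempts;             /* how many times this task has failed */
    bool completed;
    int worker;               /* worker that ran (or is running) the task */
} Task;

static Task tasks[MAX_TASKS];
static int num_tasks;

/* Case 1: a single map task failed -- just retry it, up to a limit. */
bool handle_task_failure(Task *t) {
    if (++t->attempts >= MAX_ATTEMPTS) {
        fprintf(stderr, "task failed %d times, aborting job\n", t->attempts);
        return false;
    }
    t->completed = false;     /* scheduler will reassign it, possibly to another worker */
    return true;
}

/* Case 2: a worker died before reducers fetched its map output.
   Its map outputs existed only on its local disk, so even completed
   map tasks on that worker must be rerun elsewhere. */
void handle_worker_failure(int dead_worker) {
    for (int i = 0; i < num_tasks; i++) {
        Task *t = &tasks[i];
        if (t->kind == MAP && t->worker == dead_worker) {
            t->completed = false;
            t->worker = -1;   /* unassigned; rerun on another node */
        }
    }
}
```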