Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

In-Memory Distributed Computing using Spark

Previous | Next --- Slide 20 of 44

haboric

Upon worker node failure, completed map tasks are re-executed because their output is stored on the local disk(s) of the failed machine, which is inaccessible by other nodes. On the contrary, there is no need to re-execute the completed reduce tasks because their output is stored in global file system, which can be accessed by other nodes. More information on map-reduce fault tolerance can be found in section 3.3 of mapreduce paper: http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf.

Lawliet

From what I understand, any number of worker nodes can fail and it would still be able to work correctly (after restarting tasks as necessary). Or is there some critical point at which the computation fails because too many worker nodes failed?