paracon

Using multiple machines improves the overall throughput of the system by reducing the delay caused by stragglers (slow machines).

blah329

Does the scheduler also duplicate jobs on multiple machines as a way of implementing fault tolerance? This would reduce the latency incurred by node failures, since another machine is still running the job and can finish it sooner than restarting the job from scratch.

kayvonf

That is a possible implementation, yes.

kapalani

At such large scales, is it possible that the scheduler becomes the bottleneck for the system as it tries to find an optimal distribution of jobs to nodes while also handling nodes that have gone down? I imagine that with hundreds of machines in a cluster, it's pretty common for a machine to go down because of a disk failure (especially given how much Hadoop programs write to disk) or some other reason.

firebb

The way the Hadoop/MapReduce system handles stragglers is called speculative execution. The Hadoop JobTracker detects slow workers through profiling and launches a redundant copy of the same task on an idle worker. It is not easy to tell whether a worker is actually slow or whether the task simply takes longer than expected. Since the MapReduce model assumes that map and reduce operations are deterministic, whichever copy finishes first can be used and the other copy can simply be killed.
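
To make the duplicate-and-kill-the-loser idea concrete, here is a minimal sketch using Java's `ExecutorService`. This is illustrative, not Hadoop's actual JobTracker code (the class and method names here are made up), and it simplifies by launching both copies at once, whereas a real system would launch the backup only after profiling flags the first attempt as slow:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical sketch of speculative execution: run two copies of a
// deterministic task and keep whichever finishes first, cancelling the other.
public class SpeculativeRunner {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // invokeAny returns the result of the first copy to complete
    // successfully and cancels the remaining copy (the "kill").
    public <T> T runWithBackup(Callable<T> task) throws Exception {
        List<Callable<T>> copies = new ArrayList<>();
        copies.add(task); // primary attempt
        copies.add(task); // speculative (backup) attempt
        return pool.invokeAny(copies);
    }

    public static void main(String[] args) throws Exception {
        SpeculativeRunner runner = new SpeculativeRunner();
        // Toy "map task": each copy may randomly behave like a straggler.
        Callable<Integer> mapTask = () -> {
            long delay = ThreadLocalRandom.current().nextLong(100, 2000);
            Thread.sleep(delay); // simulate a slow or fast worker
            return 42;           // deterministic result, so either copy is valid
        };
        System.out.println("result = " + runner.runWithBackup(mapTask));
        runner.pool.shutdownNow();
    }
}
```

Determinism is what makes this safe: because both copies must produce the same result, killing the loser cannot change the job's output, only its latency.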