Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

Master

Reasoning for this graph:

Transformations to RDD_A and RDD_F are both the end tasks of a stage, because RDD_B has a wide dependency on RDD_A and RDD_G has a wide dependency on RDD_B and RDD_F.
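
A rough sketch of the stage-cutting idea, in plain Python rather than Spark: RDDs joined by narrow edges share a stage, and every wide edge is a stage boundary. The edge list and the helper `split_into_stages` are assumptions made for illustration, not something read off the slide. In particular, the RDD_B to RDD_G edge is taken as narrow, which is how later comments in this thread treat it; change it to "wide" if the slide says otherwise. Real Spark scheduling is more involved than this grouping.

```python
# Hypothetical lineage: (parent, child, kind). Edge kinds are assumptions.
EDGES = [
    ("RDD_A", "RDD_B", "wide"),
    ("RDD_C", "RDD_D", "narrow"),
    ("RDD_D", "RDD_F", "narrow"),
    ("RDD_E", "RDD_F", "narrow"),
    ("RDD_B", "RDD_G", "narrow"),  # assumed narrow, see note above
    ("RDD_F", "RDD_G", "wide"),
]


def split_into_stages(edges):
    """Group RDDs so that only narrow edges stay inside a stage."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst, kind in edges:
        root_src, root_dst = find(src), find(dst)
        # Only narrow edges merge two RDDs into the same stage.
        if kind == "narrow" and root_src != root_dst:
            parent[root_src] = root_dst

    stages = {}
    for rdd in list(parent):
        stages.setdefault(find(rdd), set()).add(rdd)
    return sorted(sorted(members) for members in stages.values())


stages = split_into_stages(EDGES)
for members in stages:
    print(members)
```

Under these assumed edges, RDD_A ends up alone in a stage and RDD_F is the last RDD of the stage containing RDD_C, RDD_D, and RDD_E, because each has a wide edge leaving it.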

apk

@Master, for the second part, did you mean to say RDD_G has a wide dependency on RDD_F?

Master

@apk Corrected.

BestBunny

This graph also demonstrates how narrow dependencies make recomputation after a node failure much more efficient, as mentioned a few slides later:

Consider that the node containing 'RDD_D part 1' failed. Since this task has a narrow dependency, the lost data can be recomputed by simply recalculating 'RDD_C part 1' followed by 'RDD_D part 1' (just the one upstream task plus the task that generated the lost data).

Now consider that the node containing 'RDD_G part 2' failed. Since this task has a wide dependency, all of 'stage 2', all of 'RDD_A', and 'RDD_B part 2' have to be recomputed in addition to 'RDD_G part 2' in order to restore the lost data.
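
The two failure cases can be sketched as a walk back through the lineage, in plain Python rather than Spark. Everything named here is an assumption for illustration: the `LINEAGE` table, two partitions per RDD, and the helper `partitions_to_recompute`. The sketch also treats every narrow dependency as one-to-one by partition index and assumes nothing upstream survived the failure.

```python
NUM_PARTS = 2  # assumed number of partitions per RDD

# child -> list of (parent, dependency kind); hypothetical graph
LINEAGE = {
    "RDD_B": [("RDD_A", "wide")],
    "RDD_D": [("RDD_C", "narrow")],
    "RDD_F": [("RDD_D", "narrow"), ("RDD_E", "narrow")],
    "RDD_G": [("RDD_B", "narrow"), ("RDD_F", "wide")],
}


def partitions_to_recompute(rdd, part):
    """Return every (rdd, partition) that must be rebuilt to restore one lost partition."""
    needed = set()
    stack = [(rdd, part)]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        name, idx = node
        for parent, kind in LINEAGE.get(name, []):
            # Narrow: only the matching parent partition. Wide: all of them.
            parts = [idx] if kind == "narrow" else range(1, NUM_PARTS + 1)
            stack.extend((parent, p) for p in parts)
    return needed


narrow_case = partitions_to_recompute("RDD_D", 1)
wide_case = partitions_to_recompute("RDD_G", 2)
print(sorted(narrow_case))
print(len(wide_case), "partitions:", sorted(wide_case))
```

With these assumptions, losing 'RDD_D part 1' touches two partitions, while losing 'RDD_G part 2' touches every partition of RDD_A, RDD_C, RDD_D, RDD_E, and RDD_F, plus 'RDD_B part 2' and 'RDD_G part 2' itself.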

jocelynh

I recall that Prof. Kayvon mentioned that RDD_B and RDD_F could be swapped such that we would need to materialize RDD_B instead of RDD_F.

My current understanding is that "swapped" means more than leaving the dependency graph's edges alone and materializing RDD_B instead of RDD_F: we would also need to swap the edges so that RDD_G no longer has a wide dependency on RDD_F and instead has a wide dependency on RDD_B. Is this correct, or did he mean something else entirely?

hzxa21

@BestBunny I think that in your second example, when RDD_G part 2 fails, stage 2 (RDD_F) does not necessarily need to be recomputed, because it has been materialized. I think materialized RDDs are actually cached in memory once an action is triggered, but I am somewhat confused about what "materialized" means. Is it the same as calling .cache() or .persist() explicitly? Will RDDs be materialized for sure when an action occurs?
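
A minimal sketch of the first point, in plain Python rather than Spark: if a materialized copy of an RDD is still available, the walk back through the lineage can stop there. The `PARENTS` table and the helper `rdds_to_recompute` are hypothetical, and the sketch assumes the materialized copy itself survived the failure. It does not settle the questions about .cache() or .persist().

```python
# child -> parents; hypothetical lineage at whole-RDD granularity
PARENTS = {
    "RDD_B": ["RDD_A"],
    "RDD_D": ["RDD_C"],
    "RDD_F": ["RDD_D", "RDD_E"],
    "RDD_G": ["RDD_B", "RDD_F"],
}


def rdds_to_recompute(lost, materialized=frozenset()):
    """Walk the lineage upward from `lost`, stopping at materialized RDDs."""
    needed, stack = set(), [lost]
    while stack:
        rdd = stack.pop()
        if rdd in needed or rdd in materialized:
            continue  # already visited, or a stored copy can be read instead
        needed.add(rdd)
        stack.extend(PARENTS.get(rdd, []))
    return needed


without_copy = rdds_to_recompute("RDD_G")
with_copy = rdds_to_recompute("RDD_G", materialized={"RDD_F"})
print(sorted(without_copy))
print(sorted(with_copy))
```

In this sketch, having RDD_F available removes RDD_C, RDD_D, RDD_E, and RDD_F from the set that has to be rebuilt.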

locked

My understanding is that we need to materialize an RDD if other RDDs have a wide dependency on it. For example, RDD_G has a wide dependency on RDD_F, and RDD_B has a wide dependency on RDD_A.

themj

Materializing an RDD means that all the data from that RDD needs to be stored in memory before the next RDD can proceed with its work.
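
One way to see why a wide dependency forces this, sketched in plain Python rather than Spark: in a group-by-key style shuffle, a single output partition can receive records from every parent partition, so no output partition is complete until all parent partitions have been produced. The helper `shuffle_by_key`, the integer keys, and the sample data are assumptions made for illustration.

```python
def shuffle_by_key(parent_partitions, num_out):
    """Route (key, value) records from every parent partition to an output partition."""
    out = [dict() for _ in range(num_out)]
    for partition in parent_partitions:  # every parent partition is read
        for key, value in partition:
            # Integer keys keep the routing deterministic in this sketch.
            out[key % num_out].setdefault(key, []).append(value)
    return out


parent = [
    [(1, "a"), (2, "b")],  # parent partition 1
    [(1, "c"), (3, "d")],  # parent partition 2
]
out = shuffle_by_key(parent, num_out=2)
print(out)
```

Here key 1 appears in both parent partitions, so the output partition holding key 1 depends on both of them.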