Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

In-Memory Distributed Computing using Spark

Previous | Next --- Slide 42 of 43

kapalani

When we say hadoop stored the intermediates in the file system, do we mean the local file system on that node or in HDFS for the whole cluster? If it is the former isn't the only way we access the data on a node in the cluster is by having that node read from disk since only it has access to it and this could be problematic if that node goes down. But if its the latter it seems really expensive since now multiple copies of the same file have to be stored on different nodes in the cluster

vadtani

@kapalani I believe the HDFS since it is reliable and now as we have replicated copies so multiple nodes can perform different operations using the replicated data in parallel on different nodes. For example, we are creating different RDDs (using different operations) from the same data then this can be done in parallel by accessing the same data from 2 different replicas.

Please correct me if I am wrong.