Slide 39 of 60
holard

MPI was covered in 15-440 (distributed systems) as well. At what point do we consider a large parallel machine a distributed system? Or is this sort of the intersection between the two?

kavyon

@holard The phrase "large parallel machine" suggests that it is, in fact, one machine, or one entity. Distributed systems are generally several machines that communicate over a network. It's probably safe to say that the line between a large parallel machine and a distributed system is crossed when the processors no longer share a memory space and can only access data in each other's memory by asking the other processor via message passing. Even though the dual-socket machines described on Slide 33 have separate memory banks, they need not send explicit messages to each other to get data.
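The distinction above can be made concrete with a runnable sketch. MPI programs are typically written in C or Fortran, but as a minimal analogy, Python's `multiprocessing` module gives us processes with separate address spaces that can exchange data only through explicit messages (here, over a `Pipe`). The `owner` and `fetch_remote` names are hypothetical, chosen just for illustration:

```python
from multiprocessing import Process, Pipe

def owner(conn):
    # This process "owns" the data. The other process has a separate
    # address space and cannot read this dictionary directly.
    data = {"x": 42}
    request = conn.recv()       # wait for a request message
    conn.send(data[request])    # reply with the requested value
    conn.close()

def fetch_remote(key):
    # To get the value, we must ask the owning process via a message,
    # just as an MPI rank would use send/recv instead of a load.
    parent_conn, child_conn = Pipe()
    p = Process(target=owner, args=(child_conn,))
    p.start()
    parent_conn.send(key)       # request: "please send me 'x'"
    value = parent_conn.recv()  # reply arrives as a message
    p.join()
    return value

if __name__ == "__main__":
    print(fetch_remote("x"))
```

On a shared-memory machine the same read would be a plain load instruction; here it must be a request/reply pair, which is exactly the line kavyon draws above.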

kapalani

An advantage of using a message passing interface layer can be seen in something like a large distributed filesystem, where the message passing layer could reorder and/or batch requests before sending them out, for lower overall latency.
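A minimal sketch of that batching idea, assuming a hypothetical `BatchingLayer` that buffers small requests and hands them to a transport in groups (the class name, `submit`/`flush` interface, and fixed batch size are all illustrative, not from any real filesystem):

```python
class BatchingLayer:
    """Buffers requests and sends them in batches, amortizing
    per-message overhead on the wire."""

    def __init__(self, transport, batch_size=4):
        self.transport = transport    # callable that "sends" a list of requests
        self.batch_size = batch_size
        self.pending = []

    def submit(self, request):
        self.pending.append(request)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            # A real layer might also reorder self.pending here,
            # e.g. sorting reads by disk offset before dispatch.
            self.transport(self.pending)
            self.pending = []

# Usage: record each "network send" so we can see the batching.
sent = []
layer = BatchingLayer(sent.append, batch_size=3)
for req in ["read A", "read B", "read C", "read D"]:
    layer.submit(req)
layer.flush()   # push out the final partial batch
```

Four submitted requests go out as two sends (one full batch of three, then the leftover), instead of four separate round trips.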

kayvonf

@holard. Good question. One big distinction between the topics in a distributed systems course and this course is that, in general, this course assumes the machine is reliable and secure: topics like fault tolerance, redundancy, and security, which are fundamental to distributed computing, are less of a focus in 418/618. However, at the supercomputing scale, robustness to errors is in fact a very big deal. Although we don't talk much about resiliency in 418/618, perhaps we should!

But the performance-related ideas of 418/618 apply to systems of all scales, from a system-on-a-chip powering your iPhone to a huge datacenter.

rjvani

When using a message passing system, is it always necessary to have an explicit library like MPI? Or is there some way this could be abstracted?