kayvonf

Some interesting stats about the NUMA performance of Blacklight:

Latency to DRAM on the same blade as a processor: ~200 clocks

Latency to DRAM located on another blade: ~1500 clocks

Source: slide 6 of http://staff.psc.edu/blood/XSEDE/Blacklight_PSC.pptx
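
As a rough back-of-the-envelope illustration (my numbers, not from the slides): if even 10% of a thread's memory accesses land on a remote blade, the average access latency is about 0.9 × 200 + 0.1 × 1500 ≈ 330 clocks, roughly 1.65x the all-local case. Placement matters even when only a small fraction of accesses are remote.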

mofarrel

If having a shared address space is not as desirable in a supercomputer, why does Blacklight implement one?

kayvonf

Good question, and I should definitely clarify my comments. In general, a shared address space is a far more convenient programming abstraction than message passing. The programmer just reads and writes memory, and doesn't need to worry about the details of the memory hierarchy in order to arrive at a correct program. The problem comes when the programmer begins to think about performance.

To illustrate, consider an extreme example of adopting a shared memory model for writing applications that span your smartphone and the cloud. Say your program allocates memory that holds an image displayed in the phone's UI, and the system happens to make the allocation in memory on Google's servers. (Remember, the system doesn't know what your call to malloc is for; it just picks a part of the address space and places the allocation there.) Since as a programmer you don't know or care where that allocation lives, you are left wondering why your program runs so slowly, when the reason is that every frame, the UI code accesses the image data, requiring the bytes to be transferred from Google's servers!

The problem is that the abstraction of a unified global address space hides all the details of the machine. It tells the programmer, "an address is an address; don't worry about where it's stored, and don't worry about the costs associated with accessing it." But the performance differences in this case are so severe that, as a programmer, you have to know a bit about the implementation to write a useful program. A better abstraction would therefore give the programmer mechanisms to at least say, "I want this allocation in the fast memory on the phone."
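
(One concrete example of such a mechanism, as a minimal sketch rather than anything Blacklight-specific: on Linux, libnuma lets a program ask for memory on a particular NUMA node instead of letting the default policy pick. The node number and allocation size below are made up for illustration.)

```c
#include <numa.h>      /* libnuma; link with -lnuma */
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    /* Ask for 64 MB placed on NUMA node 0 (illustrative: the node "near"
       the threads that will touch this data), rather than wherever the
       default allocation policy happens to put it. */
    size_t bytes = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(bytes, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* ... use buf from threads pinned near node 0 ... */

    numa_free(buf, bytes);
    return 0;
}
```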

In the case of Blacklight, the reason these supercomputers cost millions of dollars is that they provide a very high-performance implementation of global shared memory across racks of processors (as compared to, say, nodes connected by Ethernet). That implementation tries to deliver performance high enough that it is acceptable to think of all memory as the same, which enlarges the space of programs that scale well without major programmer effort. So really, in buying such a computer we are paying for the high-performance interconnect, and for the programming convenience (a shared address space) that a high-performance interconnect makes practical.

But to squeeze all the performance out of the supercomputer, you will still have to think about data placement. Many programmers conclude, "hey, message passing forces me into a model where I think about communication from the start, so even though it's more work to get my first correct program running (since I have to write all these sends and receives), that program tends to be written efficiently." In contrast, the shared address space programming model's convenience is also its drawback: it lets you get a program up and running quickly, but that program may not have been written with communication in mind, and thus it might not be very fast.
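
To make the contrast concrete, here's a minimal sketch (my own example, not from the lecture): a toy "get my left neighbor's value" exchange. In a shared address space it would just be a load like `left = values[left_rank];`, and nothing in that line says whether it costs ~200 or ~1500 clocks. Written with MPI, the communication is explicit:

```c
/* Toy halo exchange: each rank owns one value and needs its left
   neighbor's value. Compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double mine = (double)rank;                 /* the value this rank owns */
    double left = 0.0;                          /* neighbor's value         */
    int left_rank  = (rank - 1 + nprocs) % nprocs;
    int right_rank = (rank + 1) % nprocs;

    /* Send my value to the right neighbor, receive the left neighbor's
       value. The data movement (and its cost) is impossible to miss. */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right_rank, 0,
                 &left, 1, MPI_DOUBLE, left_rank,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d got %f from rank %d\n", rank, left, left_rank);

    MPI_Finalize();
    return 0;
}
```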

bwasti

You mention that performance is the primary downfall of a naively written shared address space program. Doesn't message passing have an implicit overhead associated with constructing the messages, one that a shared memory model can avoid? (Wikipedia says "In message passing, each of the arguments has to copy the existing argument into a portion of the new message," which is what I'm basing this claim on.) Would a "perfectly" written shared address space program stand to have higher performance than a "perfectly" written message passing program (without breaking through the abstraction layer of messages)?

kayvonf

@bwasti. Tough question to answer with any level of precision. I guess my best answer is that it depends on what machine you are running on. If I'm an architect and don't want to support fast global shared memory, perhaps I put those resources into hardware that accelerates larger messages, or simply into more processing cores.

I suppose, given a machine that's highly optimized for shared memory programs, like the one in the picture above, your thinking is reasonable: given two programs with exactly the same communication pattern, one implemented in terms of loads and stores and the other implemented in terms of a messaging API, the former would likely be faster because it avoids the overhead of the messaging API. Then again, if you wrote both programs to have the same communication pattern, the two programs are essentially the same, so you can't make much of an argument that one model was more convenient than the other. (Which was the initial point: writing shared memory programs is more convenient because you don't have to fit your code into a more rigid message passing structure, and can instead get by with a more natural pattern of issuing loads and stores whenever necessary.)

benchoi

The situation in which I can imagine shared memory being faster is when the latency of access to the memory is extremely low and a lot of data has to be shared. I'm imagining a case where multiple cores share a very large cache to which they have fast access, and the threads share data that fills most of the cache, touching various parts of it often.

It seems to me that message passing is faster in most situations since it forces the programmer to consider latency, which makes a significant difference in high-latency situations (e.g. over a network). Knowing that the message would have to be copied many times, the programmer would avoid sending messages containing tons of data unnecessarily (unlike a shared-memory programmer who might blithely do that).

dfarrow

I've been lucky enough to be able to run a simulation engine on Blacklight, and it's remarkable how well the shared memory architecture is abstracted from the programmer. I was able to allocate over 100 GB of RAM during one simulation, and until now I never even knew that RAM latency depended on the physical location of the memory. For programmers who have a project that runs on commodity hardware and need to run a few very large instances of their program without changing the code, Blacklight is amazing.

RICEric22

A fat tree is similar to a tree, but unlike the standard trees you may be used to, a fat tree has 'thicker' (higher-bandwidth) links between nodes as you get closer to the root. In terms of bandwidth, this allows the links near the root to carry the combined traffic of all the leaves below them, whereas each leaf link only carries that leaf's own data.
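
For example (with made-up numbers): in a binary fat tree with 16 leaves where each leaf link has bandwidth B, the links one level up would provide 2B, the next level 4B, and each link into the root 8B, so traffic from all 16 leaves can cross the tree at once. In an ordinary tree, the single B-wide link at the root would be the bottleneck for any traffic crossing between the two halves of the machine.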

Q_Q

@bwasti I was poking around the SGI UV1000 sales materials, and apparently each blade has an MPI co-processor, so the work of creating/receiving messages is offloaded from the processor! (I was really impressed)