Previous | Next --- Slide 40 of 47
Back to Lecture Thumbnails
kayvonf

The above-left program computes the absolute value of a collection of numbers, but elements of the input collection are defined in terms of another collection as given below:

tmp_input[i] = input[ indices[i] ]

The i'th element of the input stream tmp_input to the kernel is the indices[i]'th value of the source stream input.

We call this operation a "gather" because the operation gathers contiguous elements from a source stream (here: input) to formulate a new stream (here: tmp_input) to pass to a kernel.

On the right, I show a similar situation that uses a "scatter". The elements of the output stream are scattered to locations of the output stream output.

Amanda

In this slide, we are introduced to two key concepts in the course: Gather and Scatter.

Gather, which assumes that the requested elements are spread across memory, is an operation that assembles and stores elements contiguously; or, in Kayvon's words, "a parallel load where the indices are not contiguous". As an instruction, this works by taking a vector register of indices (R0), an empty resultant vector register (R1) and a base pointer (mem_base), looking up the elements given by the indices R0 relative to mem_base, and storing these elements contiguously in R1. Currently, Gather does not exist on any CPU on the market, but (as is mentioned in the next slide) will be included in the new AVX instruction set.

Scatter does the opposite, taking a contiguous result and'scattering' its elements throughout memory, presumably in a manner converse to what is seen in Gather.

It's important to note that both of these operations are very memory-intensive. However, these concepts are key in implementing important parallel algorithms such as quicksort and sparse matrix transpose (Source)

Question: is there some kind of interesting tie-in between this stuff and distributed systems? Seems like there ought to be.

pebbled

Note: "the new AVX instruction set" is AVX2. AVX is already on Sandy Bridge, bulldozer, and later procs. And just to make this explicit: Scatter is not part of AVX2.

Question: Is scatter more difficult to implement than gather in hardware? Is gather more useful to programmers than scatter?

http://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2

mmp

In general, I think gather is more useful than scatter. I can't dig up numbers to substantiate this, but my understanding is that more loads are generally issued than stores for regular memory access; assuming that's true, I'd assume that gather would see more use than scatter when running SIMD/SPMD workloads.

I don't know of specific implementation issues for scatter vs. gather, but I know that both are quite hard--there's a reason it's taken so long to get even gather on CPUs. In general, having an instruction that can read from 8 completely different locations in memory is quite a change from before, when a single instruction could only access a single memory location. This in turn has all kinds of impact on the entire microarchitecture.

One example is fault handling: what if you are in the middle of a gather, have done a bunch of loads, and then you hit a page fault? You need to be able to resume the gather instruction partway through, after the fault is serviced. This is a bit of trouble.

Another example is the TLB: in the worst case, each of the gather addresses could be unaligned, so that it spans two pages; this means that up to 16 virtual-to-physical address translations have to be done. This presents the TLB with a workload quite different than it had before--now it has to get a bunch of translations done quickly.

In general, you can probably pick any part of the microarchitecture and come up with ways that gather and scatter impact its implementation / violate earlier assumptions in the design. Caches are probably another good place to think about this.