byeongcp

I remember in class Kayvon mentioned the bottom code is better, and he wanted reasoning beyond just "having higher arithmetic intensity." Could someone explain this again?

Also, Kayvon mentioned that some part of modern compiler research focuses on looking at instructions and turning them into something like the bottom code. Could someone explain what that was about?

kayvonf

@byeongcp: I'll turn this question back around on you. Given what you've learned on assignment 1, why do you think the second version of the code might run faster than the first version? (This is equivalent to me asking: why is high arithmetic intensity a desirable program characteristic?) To help you think about the answer, remember that a goal of performance optimization is to design programs that keep all our available execution units doing useful work all the time.

kayvonf

Also, after reading this slide, it might be useful to go back and read the slides about stream programming from the programming models lecture. Can you now see how a program written in a streaming model might afford the compiler the opportunity to perform "kernel fusion" operations such as the one manually performed in the C-code at the bottom here?

jazzbass

@byeongcp The second implementation has higher arithmetic intensity because it performs the same arithmetic while moving less data to and from main memory. As a result, the processor spends less time stalled waiting on memory and more time making progress on the arithmetic operations. We achieve this by structuring the code to make better use of the cache.

Here is a more verbose explanation, I hope this helps:

In the first example, we first compute (1) temp = A + B. When we start with temp[0] = A[0] + B[0], we load the first K elements of A and B into the cache (the value of K depends on the cache block size). The code has good spatial locality, so the next K-1 reads and writes won't cause any cache misses.

Afterwards, we compute (2) D = temp * C in mul. For D[0] = C[0] * temp[0], we must load the first K elements of C and temp into the cache (at this point, the first K elements of temp that we wrote earlier in add have presumably been evicted, since the arrays are large). This also implies that all the elements of temp have already been written back to main memory.

In the other version of the code, we don't have to do the extra load and write of the elements of temp from main memory. For D[0], we load the first K elements from A, B, C into the cache when we calculate D[0] = (A[0] + B[0]) * C[0]. The next K-1 iterations will be cache hits. We keep on doing this for each batch of K elements.

So, in the first implementation, for every batch of K elements we have to go to main memory 6 times ((1) read A, read B, write temp, and (2) read C, read temp, write D). In the second implementation, we only have to go to main memory 4 times (read A, B, C, write D). Reads and writes to main memory are very slow, so we can expect a significant speedup from the second implementation.
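To make this concrete, here is a rough sketch in C of what the two versions might look like (the exact code on the slide may differ; the array and function names here are just for illustration):

```c
// Version 1: two separate element-wise kernels. The intermediate
// array temp is written to memory by the first loop and read back
// by the second loop.
void add_then_mul(int N, const float* A, const float* B,
                  const float* C, float* D, float* temp) {
    for (int i = 0; i < N; i++)   // (1) temp = A + B
        temp[i] = A[i] + B[i];
    for (int i = 0; i < N; i++)   // (2) D = temp * C
        D[i] = temp[i] * C[i];
}

// Version 2: the two kernels fused into one loop. The intermediate
// sum lives in a register, so temp never touches memory at all.
void fused(int N, const float* A, const float* B,
           const float* C, float* D) {
    for (int i = 0; i < N; i++)
        D[i] = (A[i] + B[i]) * C[i];
}
```

Counting traffic per element as above: version 1 moves six values (read A, B, write temp, read temp, C, write D) for two arithmetic operations, while version 2 moves only four values for the same two operations, so the fused version has higher arithmetic intensity.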

grose

@kayvonf: in response to your second question -- yeah, because it exposes data parallelism: a map guarantees that each output element depends only on the corresponding input element, whereas a kernel written in the modular style above could still do something like a convolution. So any two or more maps performed on a stream can easily be fused into one map. To go SML: map g (map f A) = map (g o f) A
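Spelling that identity out in C (the map helper and function-pointer types here are hypothetical, just to illustrate the fusion a streaming compiler could perform automatically):

```c
#include <stddef.h>

typedef float (*unary_op)(float);

// Apply f element-wise: out[i] = f(in[i]).
void map(unary_op f, const float* in, float* out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i]);
}

// Unfused: map g (map f A). The intermediate array tmp makes a
// round trip through memory between the two passes.
void map_then_map(unary_op f, unary_op g, const float* A,
                  float* tmp, float* out, size_t n) {
    map(f, A, tmp, n);
    map(g, tmp, out, n);
}

// Fused: map (g o f) A. One pass over the data; the intermediate
// value stays in a register.
void map_fused(unary_op f, unary_op g, const float* A,
               float* out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = g(f(A[i]));
}
```

Because a map promises the per-element independence above, the compiler can always rewrite the first form into the second without changing the result.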

rohitban

So increasing arithmetic intensity reduces the bandwidth requirements of our programs, and may increase performance if the program is bandwidth bound.