Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

Performance Optimization II: Locality, Communication, & Contention

Previous | Next --- Slide 38 of 72

yeq

The arithmetic intensity is calculated as the amount of computation (proportional to the grid area) divided by the amount of communication (proportional to the grid perimeter). Our goal is to maximize this value. Thus, the intuition behind above strategy is that square has the largest area for all rectangles given a perimeter value.

holard

I'm a little confused about why arithmetic intensity is a useful metric here. If we fix the number of processors, the elements computed will be the same regardless of our decomposition (unless we distribute unevenly). Then, increasing arithmetic intensity is equivalent to reducing # of elements communicated. Why is maximizing arithmetic intensity preferred over minimizing elements communicated?

Vincent

@holard, I think your second last sentence actually answers your question. Since the number of elements needed to be compute are usually fixed, we need to minimized the number of elements needed to communicate. And as a result, the arithmetic intensity will be increased. If arithmetic intensity is good, we can save the bandwidth for other communications.

khans

To decompose this problem, we look at how the work is proportional to N. The work depends on the perimeter. Notice that we can decrease the perimeter by splitting up the grid.

pk267

But isn't this increase in arithmetic intensity not very useful because it comes at a cost of higher cache misses for a single processor? So, the computation per processor would actually take longer.

paracon

@pk267, I don't think the cache locality is necessarily bad. Since it would depend on the access pattern of the algorithm, and cache properties. Also in your last statement, are you referring to the time taken by the processor when each calculated a slab of data?

blah329

The derivation of the metrics described above are as followed: each processor divides the NxN grid into the square root of the number of processors number of pieces along the width and length, giving us P blocks, each of length and width N/sqrt(N). This means that the total number of elements computed is N^2/P. In the worst case, the amount of data located is all of the edges of the block, and each edge is N/sqrt(P) long. Thus, if you divide the number of computations by the number of elements computers, then an arithmetic intensity of N/sqrt(p) is produced.