Slide 28 of 66
jezimmer

Assuming the same math and specs hold as given on slide 61, would we not still be waiting on memory most of the time when running this program too? I think the only difference is that there are only two memory loads/writes (assuming no thrashing), so it would be

$$\frac{672 \times 10^9 \times 8}{1024 \times 1024 \times 1024} = 5007 \text{ GB/s}$$

Doesn't that imply that this is also really slow? Since there are few circumstances where a program needs fewer than two memory accesses per element, it seems that we can't hide the latency anywhere.

Elias

Question: this is Kayvon's fictitious data-parallel language. Throughout the course, I gather we're going to look at several different options for real languages for this purpose. That said, does anybody know what the major (current) parallel languages are? What differentiates the top contenders? What are the industry standards?

jezimmer

@Elias I don't know how "industry standard" it is, but Cilk Plus (which we used in 15-210) has a cilk_for statement, which declares to the compiler that all iterations of the loop body can be run in parallel. Its syntax is basically the same as a normal for statement, and it does something similar to what the fictitious forall statement above hints at.
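
For reference, here's a minimal sketch of what that looks like (the scale function and its arguments are made up for illustration, not taken from the slides):

```c
#include <cilk/cilk.h>

// cilk_for uses the same syntax as a plain for loop, but tells the
// compiler/runtime that the iterations may execute in parallel.
void scale(int N, float a, float* x, float* result) {
    cilk_for (int i = 0; i < N; i++) {
        result[i] = a * x[i];   // each iteration is independent
    }
}
```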

BigFish

@jezimmer On slide 61, we only perform one multiply operation, whereas here we have 14 operations in total. Assuming that an add and a multiply each take one clock, the memory bandwidth needed to feed the data should be roughly 5007/14 GB/s. The exact calculation may be off, but I think the general idea is correct.
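
To make that concrete (taking the count of 14 single-cycle ops per iteration at face value), the bandwidth needed to keep the ALUs busy would be roughly

$$\frac{5007}{14} \approx 358 \text{ GB/s}$$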

kayvonf

The arithmetic intensity depends on what the value of terms is. Each iteration of the outermost i loop loads x[i] and stores to result[i] (8 bytes must be transferred). The number of arithmetic instructions performed per eight bytes of traffic is a function of the value of terms.

This has an interesting implication: in theory (and I say in theory because, just like in the simple thought experiment on slide 61, we'll assume that memory access latency is perfectly hidden), as long as the computation remained bandwidth limited, you could increase the precision of the sin(x) calculation by summing more terms in the Taylor expansion without increasing the runtime of your program. However, at some point the program would become compute limited, and its runtime would then increase linearly with the value of terms.
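
To make the traffic-vs-math tradeoff concrete, here is a plain C sketch of a Taylor-series sin(x) loop in the spirit of the slide (the exact code on the slide may differ; this is just illustrative). Each i iteration transfers 8 bytes (a 4-byte load of x[i] plus a 4-byte store to result[i]), while the inner loop performs a handful of arithmetic operations per term, so arithmetic intensity grows with terms:

```c
// Illustrative sketch: per-element work scales with 'terms',
// but per-element memory traffic stays at 8 bytes.
void sinx(int N, int terms, float* x, float* result) {
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        float denom = 6.0f;   // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;     // add next Taylor term
            numer *= x[i] * x[i];              // x^(2j+3)
            denom *= (2*j + 2) * (2*j + 3);    // (2j+3)!
            sign *= -1;
        }
        result[i] = value;
    }
}
```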