Slide 31 of 48
ericwang

So if we split the grid into 4 sub-squares instead, each processor would need even less communication (roughly N elements per step, versus 2*N for a block of rows with neighbors on both sides). Is my understanding correct?
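A quick way to check those numbers: the sketch below counts, for a stencil that reads the four nearest neighbors on an N x N grid, how many grid elements the busiest processor must receive from its neighbors per sweep. The function names are made up for illustration, and it assumes P is a perfect square whose root divides N evenly.

```python
import math

def blocked_rows_halo(n, p):
    """Elements the busiest processor receives per sweep when an
    n x n grid is split into p contiguous blocks of rows."""
    if p == 1:
        return 0
    # With more than 2 blocks, an interior block has a neighbor above and below.
    neighbor_rows = 2 if p > 2 else 1
    return neighbor_rows * n

def square_tiles_halo(n, p):
    """Same count when the grid is split into sqrt(p) x sqrt(p) square tiles
    (assumption: p is a perfect square and sqrt(p) divides n)."""
    q = math.isqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("this sketch needs p to be a perfect square dividing n")
    if q == 1:
        return 0
    side = n // q
    # Along each axis a tile has 1 neighbor (q == 2) or up to 2 (q > 2).
    neighbors_per_axis = 2 if q > 2 else 1
    return 2 * neighbors_per_axis * side
```

For N = 1024 and P = 4 this gives 2048 elements for blocked rows and 1024 for sub-squares, which matches the 2*N versus N estimate.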

But I think we still need to consider other aspects like data locality. For example, in blocked assignment, the data processed by each processor sits in one contiguous region of memory. That's not true if we assign grid nodes to sub-squares.
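To put a number on the locality point, here is a small sketch that counts how many separate contiguous address runs a rectangular region occupies. It assumes the grid is stored as a single row-major N x N array; `contiguous_runs` is a hypothetical helper, not something from the lecture code.

```python
def contiguous_runs(n, rows, cols):
    """Number of separate contiguous runs in memory covered by the region
    rows=(r0, r1), cols=(c0, c1) (half-open) of a row-major n x n array."""
    num_rows = rows[1] - rows[0]
    num_cols = cols[1] - cols[0]
    if num_rows <= 0 or num_cols <= 0:
        return 0
    if num_cols == n:
        # Full rows sit back to back, so the whole block is one run.
        return 1
    # Otherwise every row of the region is its own run.
    return num_rows
```

With N = 1024 and 4 processors, a block of 256 full rows is a single run, while a 512 x 512 sub-square is 512 runs of 512 elements each.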

regi

It seems like much of this choice relates to orchestration as well as assignment. Picking which areas to assign to which threads affects how the threads will communicate and synchronize. But is orchestration more about this design decision, about the actual implementation of how threads communicate, or both?

kayvonf

@regi. Absolutely. Great comment.

byeongcp

I'm curious whether the interleaved version actually gives much of a speedup. I thought about the "arithmetic intensity" mentioned in lecture 2, and the actual math we do per element seems very simple (adding 5 elements and one multiply). I think the communication overhead of the interleaved assignment would cancel out the speedup we get from parallelizing (although I'm not too sure about this, so if someone could tell me why this might not be the case, that would be great).
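One way to make the communication cost concrete is to count, for each processor, how many rows it reads but does not own. The sketch below does that by brute force for a row-wise assignment; the function names are hypothetical, and it assumes each row only depends on the rows directly above and below it and that P divides N evenly.

```python
def rows_to_fetch(n, p, owner):
    """For each of p processors, count the distinct rows it must read
    from other processors, given owner(row) -> processor id."""
    needed = [set() for _ in range(p)]
    for r in range(n):
        for nb in (r - 1, r + 1):
            if 0 <= nb < n and owner(nb) != owner(r):
                needed[owner(r)].add(nb)
    return [len(s) for s in needed]

def blocked_owner(n, p):
    """Contiguous blocks of n // p rows per processor."""
    block = n // p
    return lambda r: r // block

def interleaved_owner(n, p):
    """Row r goes to processor r mod p."""
    return lambda r: r % p
```

For 12 rows on 4 processors, the blocked assignment needs at most 2 foreign rows per processor, while the interleaved one needs 5 or 6, about two foreign rows for every row a processor computes.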

uncreative

@byeongcp As I understand it, we add five elements and do one multiply for every element in a row, and we interleave rows. With a very large number of elements per row, each row could take a long time to compute, and the parallelism could be very useful. Unless the array is sufficiently large, though, it does seem that communication overhead would be a problem.

We will also probably run into issues if a large part of the array does not fit in our cache. As you mention, the arithmetic intensity is not very large, so if we end up thrashing the cache we will likely end up bandwidth limited.
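A rough back-of-the-envelope version of that argument, as a sketch: it assumes 4-byte floats and counts "adding 5 elements and one multiply" as 5 floating-point operations (4 adds plus 1 multiply). The best case assumes each value is read from memory once and written once; the worst case assumes all 5 inputs miss in cache. The helper name is made up.

```python
def arithmetic_intensity(flops, loads, stores, bytes_per_value=4):
    """Floating-point operations per byte of memory traffic for one element."""
    return flops / ((loads + stores) * bytes_per_value)

# Best case: neighbors are reused from cache, so 1 load + 1 store per element.
best = arithmetic_intensity(flops=5, loads=1, stores=1)

# Worst case: the cache is thrashing, so all 5 inputs come from memory.
worst = arithmetic_intensity(flops=5, loads=5, stores=1)
```

Under these assumptions the kernel does about 0.6 flops per byte in the best case and about 0.2 in the worst case, which is low enough that memory bandwidth is a plausible bottleneck.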

admintio42

This was before we started thinking about all of the cache traffic that arises in data-parallel programs. An assignment that takes the cache line size into account would help the program.
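As a minimal sketch of what a cache-line-aware assignment might look like, the helper below splits a 1D range of elements among workers so that every boundary falls on a cache-line multiple. It assumes 64-byte lines and 4-byte floats (16 elements per line) and that the array itself starts on a line boundary; the function name and parameters are hypothetical.

```python
def cache_line_aligned_chunks(total, workers, elems_per_line=16):
    """Split range(total) into `workers` half-open (start, end) chunks whose
    boundaries are multiples of elems_per_line, except the final end."""
    lines = -(-total // elems_per_line)  # ceiling division
    base, extra = divmod(lines, workers)
    chunks = []
    start_line = 0
    for w in range(workers):
        # Spread any leftover lines over the first `extra` workers.
        n_lines = base + (1 if w < extra else 0)
        start = min(start_line * elems_per_line, total)
        end = min((start_line + n_lines) * elems_per_line, total)
        chunks.append((start, end))
        start_line += n_lines
    return chunks
```

With this split, no two workers write to the same cache line, which avoids one source of the extra cache traffic mentioned above.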