Slide View : 15-418/618 Spring 2014

Performance Optimization II: Locality, Communication, & Contention

Slide 14 of 42

pingshaz

If we assume that communication between threads is costly, one possible improvement to this scheme is to exchange two rows above and below each block instead of one. For example, P2 would work on rows 2 through 8 (one-indexed) and P3 would work on rows 5 through 11. With two extra rows on each side of a block (so adjacent blocks share four rows), every thread can run two iterations before it has to refresh the overlapping rows. Each exchange then sends and receives twice as many rows as before, but it may still take roughly the same time: if the transfer is small enough that per-message latency, not bandwidth, is the bottleneck, doubling the payload adds little.
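
As a rough way to see when this trade pays off, here is a simple latency/bandwidth cost sketch for one neighbor over two iterations. The symbols are assumptions introduced for illustration, not from the slide: \alpha is the per-message latency, \beta is the time to transfer one row, and c is the time to update one row. The two-row scheme has to redundantly update at least one extra row per side so that the inner halo row is valid for the second iteration.

```latex
% Cost per neighbor over two iterations (assumed alpha-beta model)
T_{\text{one-row}} = 2(\alpha + \beta), \qquad
T_{\text{two-row}} = \alpha + 2\beta + c
% The two-row scheme is cheaper when
T_{\text{two-row}} < T_{\text{one-row}} \iff c < \alpha
```

Under this model the bytes moved are the same either way; what is saved is one message latency, paid for with one redundant row update.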

nrchu

I don't think I understand your solution, @pingshaz. If the rows overlap, then P2 must still communicate row 2's new state to P1. If you are alternating which block the boundary rows belong to on every iteration, that doesn't help, because each row still depends on the rows above and below it as updated in the previous iteration. I don't see how you can run two iterations and still update the boundary rows correctly.

pingshaz

Well, in general, if we only want the results for 3 rows after one iteration, we need the row above and the row below as dependencies. If we want their results after two iterations, we need two rows above and two rows below. So my idea was to do two iterations and communicate the four dependency rows in one go. I wasn't alternating the boundary assignments. For example, in my scheme, rows 5 through 8 (one-indexed) belong to both P2 and P3 all the time. So there is actually a fair bit of repeated computation, but in some cases it's worth the hassle because communication is expensive.
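
The dependency argument above can be checked with a small sketch. Everything here is an assumption for illustration: the slide's actual stencil isn't shown in this thread, so the code uses a 5-point averaging update with fixed grid edges, and the 3-row blocks (P2 owning rows 4 through 6, P3 owning rows 7 through 9, one-indexed) are inferred from the row numbers quoted earlier. The names jacobi_step, run_block, and make_grid are hypothetical helpers.

```python
def jacobi_step(grid):
    """One sweep of an assumed 5-point averaging stencil; edge cells stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]) / 5.0
    return new


def run_block(grid, lo, hi, halo, iters):
    """Update owned rows [lo, hi) (zero-indexed) for `iters` sweeps using only
    a private copy of those rows plus `halo` ghost rows on each side."""
    top = max(0, lo - halo)
    bottom = min(len(grid), hi + halo)
    local = [row[:] for row in grid[top:bottom]]  # the one-time "exchange"
    for _ in range(iters):
        # The outermost local rows go stale by one more row per sweep,
        # which is why `halo` must be at least `iters`.
        local = jacobi_step(local)
    return local[lo - top:hi - top]


def make_grid(rows, cols):
    """Arbitrary deterministic test data."""
    return [[float((3 * i * i + 7 * j) % 11) for j in range(cols)]
            for i in range(rows)]


grid = make_grid(12, 6)

# Reference: two global sweeps with every row up to date at every step.
reference = jacobi_step(jacobi_step(grid))

# P2 owns rows 3..5 and P3 owns rows 6..8 (zero-indexed). With halo=2 their
# private copies cover one-indexed rows 2-8 and 5-11 respectively.
p2_rows = run_block(grid, 3, 6, halo=2, iters=2)
p3_rows = run_block(grid, 6, 9, halo=2, iters=2)

print(p2_rows == reference[3:6], p3_rows == reference[6:9])
```

Each block works from a private copy and never talks to its neighbor between the two sweeps, yet its owned rows match the reference. For simplicity the sketch updates every interior row of the private copy in both sweeps, which is more redundant work than strictly required.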