A program becomes more and more likely to be bound by communication as the number of processors increases, since the per-processor arithmetic intensity decreases. That means for a fixed amount of work, adding more processors stops helping once we hit a threshold, because at that point communication dominates computation.
I think another example that illustrates this is the saxpy program from Assignment 1. We were performing a single multiply-add for every ~16 bytes of memory traffic, which is very low arithmetic intensity. Because the program was bandwidth bound, performance didn't improve with multiple cores and SIMD (in fact it got worse, due to the associated communication overhead) compared to running on a single ALU of a single core.
I'm wondering about this: the work assignment strategy seems to require the number of processors to be a perfect square. So how would 32 processors work under this assignment?
@mallocanswer In this case, I would probably divide the work into 64, 256, or 1024 smaller squares, then assign 2, 4, or 8 mini-squares to each processor.
The reason I chose those numbers is that we want to keep all processors fully utilized, while also giving each processor small chunks of work for better load balance.