To comment on the trade off between work balance and locality in this example: Load balance is good, even with the static (randomized) assignment, because there's such a large number of bodies relative to processors, and also because there is little relationship between the locations of the bodies in the array and in physical space. However, for the static partition, the cost of data access is high due to the amount of communication required by the lack of contiguity of the bodies in physical space. The cost-zone assignment reduces this data access overhead without comprising load balance. (Parallel Computer Architecture)
To see this tradeoff in the diagrams, we can look at the overall height of the bars as well as the height of individual sections. The randomized static assignment actually spends less time doing useful work, meaning it requires less computation to compute the positions of the bodies in the next step of the simulation under the static assignment. However, the dynamic assignment requires less overall time because the data and synch costs are significantly lower. The increased busy time in the dynamic assignment is due to the processors having to compute their position in the tree independently, adding that work on top of the actual simulation update.
So the increased busy time in cost-zone assignment comes from every processor traversing the entire tree? (besides the negligible work estimate) It's increased by more than 5s, 30% of busy time.
Can we parallelize the process? So processors cooperatively produce the assignment.