To reiterate what was discussed in class: a particular instance (programIndex = 0 through 3 here) writes the output at a given index if index % programCount == programIndex.
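A minimal sketch of that mapping, written as a plain Python simulation rather than ISPC. Only the gang size of 4 comes from the discussion; the array length of 8 and the helper name `indices_for_instance` are assumptions for illustration.

```python
PROGRAM_COUNT = 4  # gang size from the discussion; instances are numbered 0 through 3


def indices_for_instance(program_index, n, program_count=PROGRAM_COUNT):
    """Return the output indices one instance handles under interleaved assignment."""
    # An instance owns index i exactly when i % program_count == program_index.
    return [i for i in range(n) if i % program_count == program_index]


if __name__ == "__main__":
    n = 8  # hypothetical array length, chosen only for this example
    for program_index in range(PROGRAM_COUNT):
        print(program_index, indices_for_instance(program_index, n))
```

With these assumed values, instance 0 handles indices 0 and 4, instance 1 handles 1 and 5, and so on, so every index is written by exactly one instance.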
Does that mean that different instances execute the program at the same time (in the same space) so that they can take advantage of spatial locality to increase the cache hit rate? If so, why can these instances run at the same time? Is there an upper bound on the number of instances?
@xiaoguaz, I think these instances are able to run at the same time because the gang is executed with SIMD instructions. So the upper bound would be based on how many ALUs (hardware lanes) are available to execute SSE or AVX instructions.
The reason the instances are interleaved, instead of each being assigned a contiguous block of the array, is spatial cache locality. For example, in this scenario, the instances of the gang will operate on the first four elements (in some arbitrary order) before moving on. Because those four elements are adjacent in memory, there will be a relatively high number of cache hits.
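To make the contrast concrete, here is a small Python simulation (not ISPC) that lists which indices the whole gang touches at each step under the two assignments. The function names and the array length of 8 are assumptions for illustration; the gang size of 4 matches the scenario above.

```python
def interleaved_steps(n, program_count):
    """Indices the gang touches per step when instance p handles p, p + program_count, ..."""
    steps = []
    for base in range(0, n, program_count):
        steps.append([base + p for p in range(program_count) if base + p < n])
    return steps


def blocked_steps(n, program_count):
    """Indices the gang touches per step when each instance handles one contiguous block."""
    block = (n + program_count - 1) // program_count  # block length, rounded up
    steps = []
    for offset in range(block):
        steps.append([p * block + offset
                      for p in range(program_count)
                      if p * block + offset < n])
    return steps


if __name__ == "__main__":
    n, program_count = 8, 4  # hypothetical array length; gang size from the scenario
    print("interleaved:", interleaved_steps(n, program_count))
    print("blocked:    ", blocked_steps(n, program_count))
```

With these assumed values, the interleaved gang touches [0, 1, 2, 3] and then [4, 5, 6, 7], while the blocked gang touches [0, 2, 4, 6] and then [1, 3, 5, 7]. In the interleaved case each step covers one contiguous run of memory, which is the locality argument made above.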