A big assumption for our statement that the performance of program can increase proportional to the arithmetic intensity is that the program is bandwidth bounded.
Assume the bandwidth is unlimited (a memory instruction is as fast as an arithmetic instruction now), even if we further increase the arithmetic intensity, the performance of the program may not increase because the CPU can only execute each instruction at that rate.
In the case that we are bandwidth bound rather than bounded by throughput of instruction execution, we will get a speedup proportional to the arithmetic intensity. The reasons for this include the fact that we don't have a allocate temporary structures tmp1 and tmp2, and we take advantage of temporal locality. The speedup would be (3/5) / (1/3) which is about 2.
Question: Why would the transformation from Program 1 to Program 2 not yield a notable speedup if the original program was compute bound?
The major difference between Program 1 and Program 2 is that Program 2 significantly reduces the memory cost and increases locality, while the computation cost stays the same. If the original program is compute bound, then the reduction in memory cost wouldn't matter because it doesn't contribute too much to the performance(time to run the program). The time to wait for memory operations could be ignored since the running time is dominated by the computation.
The greater arithmetic intensity of program 2 would not help if the original program was compute-bound because being compute-bound would imply that the main reason for the slow runtime of program 1 is not memory bandwidth, so decreasing the number of loads/stores relative to arithmetic ops would not significantly improve performance.
@kayvonf If the program is compute bound, then the performance is limited by the computing power of the computing resource. Even if we increase the data for the computing resource to process, as the computing resource is already busy with some data, the increased data will not be processed immediately. Therefore, even if we increase the arithmetic intensity, as the computing resource can only process certain amount of data for a unit time period, the increased arithmetic intensity will not help.
What if I change program 1 to do something like this: C[i] = A[i] + A[i] + B[i] + B[i] - A[i] - B[i]
Then there are 5 math ops, 2 loads and 1 store, so arithmetic intensity is now 5/3. Does this ensure that new program 1 is not compute bound?
C[i] = A[i] + A[i] + B[i] + B[i] - A[i] - B[i]
Doesn't compute bound mean that it's spending most of its time doing the calculations? So if you add more calculations to it and increase its arithmetic intensity, it wouldn't do anything to help it being compute bound. @arcticx also explains this in the latter part of their post.