Slide 49 of 72
crow

It seems that fusing the loops could also hurt performance if the arrays are aligned so that loads from the different arrays map to the same cache set and cause conflict misses. However, as long as the associativity of the cache is high enough to hold a line from every array at once, it shouldn't pose too much of an issue.
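To make the conflict-miss concern concrete, here is a rough sketch of how an address maps to a set in a set-associative cache. The geometry used in the note below (64-byte lines, 64 sets) and the helper names `cache_set_index` and `all_same_set` are assumptions for illustration, not something taken from the lecture.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the set an address maps to, for an assumed geometry
// of `line_bytes` bytes per line and `num_sets` sets.
constexpr std::size_t cache_set_index(std::uintptr_t addr,
                                      std::size_t line_bytes,
                                      std::size_t num_sets) {
    return (addr / line_bytes) % num_sets;
}

// True if every base address lands in the same set as the first one, meaning
// the array streams compete for the same few ways on each iteration.
bool all_same_set(const std::uintptr_t* bases, std::size_t count,
                  std::size_t line_bytes, std::size_t num_sets) {
    for (std::size_t k = 1; k < count; ++k) {
        if (cache_set_index(bases[k], line_bytes, num_sets) !=
            cache_set_index(bases[0], line_bytes, num_sets)) {
            return false;
        }
    }
    return true;
}
```

Under this assumed geometry, array bases that sit a multiple of 4 KiB apart (64 sets x 64 bytes) all fall into the same set. Five such streams (four loaded arrays plus the output) would fit in an 8-way cache but would keep evicting each other in a 4-way one.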

bschmuck

The fused function could also take advantage of the fused multiply-add (FMA) instruction that some processors provide, which performs the multiply and the following add as a single operation.
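As a small illustration of the multiply-add point, the per-element work can be written with `std::fma` from `<cmath>`. This sketch assumes the slide computes E = D + (A + B) * C, which the thread does not spell out. The function name `fused_element` is made up here.

```cpp
#include <cmath>

// One element of the assumed computation E = D + (A + B) * C.
// std::fma(x, y, z) computes x * y + z with a single rounding. Whether it
// compiles to one hardware instruction depends on the target and the flags.
inline float fused_element(float a, float b, float c, float d) {
    return std::fma(a + b, c, d);
}
```

In the unfused version the multiply and the final add live in separate loops, so there is no adjacent multiply and add for the compiler to combine.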

vasua

I wonder if this is actually a downside of using a low-level language such as C++? We are forced to be incredibly explicit about what we want to do, which often results in code like the top segment when writing for modularity. By contrast, in a higher-level language (e.g. NumPy in Python), we can say more generally what we want to happen to the data (again, the declarative vs. imperative idea we've seen a few times) and let the compiler figure out the most efficient implementation. In theory it should be possible for the compiler to convert the top segment into the bottom one, but why make it more difficult for the compiler?
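For readers without the slide in front of them, here is a sketch of what the two segments presumably look like. The function names (`add`, `mul`, `unfused`, `fused`), the temporary arrays, and the computation E = D + (A + B) * C are assumptions pieced together from this thread, not a copy of the slide.

```cpp
#include <vector>

// Modular building blocks: each call streams its inputs from memory and
// writes a full output array back to memory.
void add(int n, const float* A, const float* B, float* C) {
    for (int i = 0; i < n; ++i) C[i] = A[i] + B[i];
}

void mul(int n, const float* A, const float* B, float* C) {
    for (int i = 0; i < n; ++i) C[i] = A[i] * B[i];
}

// "Top segment" style: three passes over memory and two temporary arrays.
void unfused(int n, const float* A, const float* B, const float* C,
             const float* D, float* E) {
    std::vector<float> tmp1(n), tmp2(n);
    add(n, A, B, tmp1.data());            // tmp1 = A + B
    mul(n, tmp1.data(), C, tmp2.data());  // tmp2 = tmp1 * C
    add(n, tmp2.data(), D, E);            // E = tmp2 + D
}

// "Bottom segment" style: one pass, with the intermediate values held in
// registers rather than written out to temporary arrays.
void fused(int n, const float* A, const float* B, const float* C,
           const float* D, float* E) {
    for (int i = 0; i < n; ++i) E[i] = D[i] + (A[i] + B[i]) * C[i];
}
```

For a compiler to turn the first form into the second on its own, it generally has to inline the calls and prove that the arrays do not alias, which is part of why this is hard to automate.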

rlaporte

I feel like we're making somewhat of an assumption here as to how this code is being compiled...

When we say "four loads, one store per 3 math ops" we're supposing that there are enough free registers to keep the intermediate results of our math operations on the processor. If those two registers were not available, the compiler would have to spill the intermediate results to the stack, which would ruin our arithmetic intensity?

In other words, I am a bit confused as to how we can guarantee that intermediate arithmetic results are stored on-chip regardless of the state of execution. Is there a set of registers dedicated to storing the results of arithmetic operations?

paramecinm

@rlaporte When the program enters a function, it saves the values already in the registers to the stack, so the function has registers to work with. The function fuse needs at most 7 registers to hold i, n, A, B, C, D, and E (actually fewer, because the compiler will reuse registers), so there will be enough registers left for the intermediate results.

lilli

@vasua The benefit of loop fusion depends on how the system's memory hierarchy is structured. I looked this up, and some compilers, such as this one, do perform loop fusion: https://docs.oracle.com/cd/E19205-01/819-5264/afapp/index.html Others don't.

hpark914

The code on the bottom performs better because it does more math operations per memory load/store, i.e. it has higher arithmetic intensity.
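A quick way to put numbers on this is to count math ops per memory access. The fused counts below (3 math ops, 4 loads, 1 store per element) come from the slide as quoted earlier in the thread. The unfused counts assume three separate passes that each do 1 op, 2 loads, and 1 store, which is a guess at the top segment rather than something stated here. The names are made up for this sketch.

```cpp
// Arithmetic intensity, taken here as math ops per memory access
// (loads plus stores).
constexpr double arithmetic_intensity(int math_ops, int loads, int stores) {
    return static_cast<double>(math_ops) / (loads + stores);
}

// Fused loop, per element: 3 math ops, 4 loads, 1 store.
constexpr double kFusedIntensity = arithmetic_intensity(3, 4, 1);    // 3/5

// Assumed unfused version, per element: three passes, each with
// 1 math op, 2 loads, and 1 store.
constexpr double kUnfusedIntensity = arithmetic_intensity(3, 6, 3);  // 1/3
```

Under these assumptions the fused loop does the same three math ops with 5 memory accesses instead of 9.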