Slide 5 of 47

One good example is matrix multiplication. Another is the grid solver, where the data was stored as 2-D blocks rather than row by row, so each processor's working set stays contiguous and cache-resident.
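A minimal sketch of the matrix multiplication case (illustrative code, not from the lecture): the naive loop streams through all of `b` for every output element, while the blocked version reuses each tile of `a` and `b` many times while it is still in cache. The sizes `N` and `B` here are arbitrary choices for illustration.

```c
#include <string.h>

#define N 64
#define B 16  /* tile size; pick B so three B x B tiles fit in cache */

/* Naive triple loop: the inner loop walks down a column of b, touching
   a new cache line every iteration, so b's lines are evicted before
   they can be reused. */
void matmul_naive(const float *a, const float *b, float *c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}

/* Blocked version: same arithmetic, reordered so each B x B tile of
   a and b is reused B times before moving on. */
void matmul_blocked(const float *a, const float *b, float *c) {
    memset(c, 0, N * N * sizeof(float));
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        float sum = c[i * N + j];
                        for (int k = kk; k < kk + B; k++)
                            sum += a[i * N + k] * b[k * N + j];
                        c[i * N + j] = sum;
                    }
}
```

Both functions compute the same product; only the traversal order (and therefore the cache behavior) differs.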


In Assignment 2, the CUDA circle renderer, one optimization we used was to stage data from global memory into shared memory. The distinction is not heap versus stack: global memory lives in off-chip DRAM and is slow to access, while shared memory is fast on-chip storage shared by all threads in a block, so data that many threads reuse should be loaded into it once.
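A sketch of that staging pattern (illustrative only; the names and the per-circle math are placeholders, not the actual Assignment 2 code). Each tile of the circle data is loaded from global memory cooperatively, once per block, and then read many times from shared memory:

```cuda
#define THREADS_PER_BLOCK 256

__global__ void shade_pixels(const float *circles, int num_circle_floats,
                             float *image, int num_pixels) {
    __shared__ float s_circles[THREADS_PER_BLOCK];  /* on-chip staging buffer */
    int tid = threadIdx.x;
    int pixel = blockIdx.x * blockDim.x + tid;
    float accum = 0.0f;

    /* Process the circle list in tiles that fit in shared memory. */
    for (int base = 0; base < num_circle_floats; base += THREADS_PER_BLOCK) {
        /* Cooperative load: each thread fetches one element of the tile. */
        if (base + tid < num_circle_floats)
            s_circles[tid] = circles[base + tid];
        __syncthreads();  /* tile fully loaded before anyone reads it */

        int tile = min(THREADS_PER_BLOCK, num_circle_floats - base);
        for (int i = 0; i < tile; i++)
            accum += s_circles[i];  /* repeated reads hit on-chip memory */
        __syncthreads();  /* everyone done before the next tile overwrites */
    }
    if (pixel < num_pixels)
        image[pixel] = accum;
}
```

Without the staging buffer, every thread would issue its own global-memory load for every circle; with it, each circle value crosses the DRAM interface once per block.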


The sizing of tmp_buf in the convolution example is another case where we considered cache locality as we designed the code: instead of materializing the full intermediate array, the computation is chunked so tmp_buf is small enough to stay resident in cache while the next stage consumes it.
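A sketch of that idea under stated assumptions: a two-stage 1-D pipeline where each stage is a 3-point average (the chunk size and stage details are illustrative, not the lecture's exact code). tmp_buf holds only one chunk of intermediate values plus a 2-element halo, so stage 2 reads them while they are still in cache:

```c
#define CHUNK 64

/* Computes out[i] = avg3(avg3(in))[i] for i in [0, n_out), where n_out
   must equal (input length) - 4. A naive version would store all n - 2
   intermediate values; here tmp_buf holds at most CHUNK + 2 of them. */
void conv_fused(const float *in, float *out, int n_out) {
    float tmp_buf[CHUNK + 2];  /* small, cache-resident scratch */

    for (int base = 0; base < n_out; base += CHUNK) {
        int count = (n_out - base < CHUNK) ? n_out - base : CHUNK;

        /* Stage 1: produce this chunk's intermediates (+2 halo values
           so stage 2's window has everything it needs). */
        for (int i = 0; i < count + 2; i++)
            tmp_buf[i] = (in[base + i] + in[base + i + 1]
                          + in[base + i + 2]) / 3.0f;

        /* Stage 2: consume the intermediates immediately, while hot. */
        for (int i = 0; i < count; i++)
            out[base + i] = (tmp_buf[i] + tmp_buf[i + 1]
                             + tmp_buf[i + 2]) / 3.0f;
    }
}
```

The trade-off is that the halo values at chunk boundaries are recomputed, a small amount of extra arithmetic paid in exchange for far fewer cache misses.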