Previous | Next --- Slide 47 of 54
Back to Lecture Thumbnails
solovoy

Why only selecting up to four runnable warps from up to 64 resident on core, when there are 6 * 32 ALUs available in a core? Is this specific to the convolve example, because there are 128 threads/block = 4 warp * 32 ALUs? If it is, why not make the number of threads per block 192, i.e. the number of threads per block equals to the number of ALUs in a core, so that 6 warps can run together?

TypicalChazz

We have only 4 warp selectors above, and each one can select up to 2 instructions by exploiting ILP. This means we can potentially have 8 instructions to run at once. Each instruction is run across the appropriate 32-wide SIMD lane. In the architecture above, this means we need 6 'regular' ALU instructions(corresponding to the 6 * 32 ALUs), 1 'special' instruction(corresponding to the 32 special SIMD function units) and 1 load/store instruction(correspoding to the 32 SIMD load/store units) to have maximal parallelism.

We can't select more than 4 runnable warps due to the hardware specifications above. If we had more warp selectors, we could potentially fetch more instructions, but we wouldn't have enough SIMD functional units to fully parallelize the execution of all instructions. Basically, some of the fetched instructions might have to stall and wait for the other instructions to finish running on the existing SIMD functional units.