Slide View : 15-418/618 Spring 2014

Programming for Performance: Work Distribution

Previous | Next --- Slide 4 of 46

smklein

As a quick recap, an NVIDIA "warp" is comparable to an Intel "hardware-thread", or an ISPC "gang", where an instruction stream is shared.

This comment was marked helpful 0 times.

tchitten

Further similarity between the ISPC "gangs" and NVIDIA "warps" is that they are strictly SIMD. Each warp can only be executing a single instruction (ignoring ILP) and if a branch occurs, the warp is split into active and inactive threads (active/inactive lanes in ISPC). An important difference between "gangs" and "warps" however is that while the user must manage which lanes are active in SSE/AVX in software, the hardware manages thread activity on NVIDIA GPUs thus this property is only relevent to efficiency and not correctness.

This comment was marked helpful 0 times.

kayvonf

@tchitten: "an important difference" --> "an important implementation difference."

This comment was marked helpful 0 times.

pradeep

Hi, I am a little confused as to why we can choose only upto 4 runnable warps at any time. This is because I see 32*6 ALU units/6 warps that can run at any time. So shouldn't we able to choose 6 out of the 64 ?

This comment was marked helpful 0 times.

kayvonf

@pradeep. Good question. The answer is simple. That's what this particular NVIDIA chip does! It has been designed to have the capability to choose up to four runnable warps, and up to two independent instructions per each of those warps. NVIDIA architects must have concluded that this was a reasonable amount of decode/dispatch capability for keeping the execution resources busy for the workloads they cared about most.

This comment was marked helpful 0 times.

pradeep

Yeah that mostly clears it up, but I am still a little hazy as to why the NVIDIA designers would put an extra 32 * 2 ALU's (since we are utilizing only 4 * 32 out of the 6*32 in this chip). Is it there for fault tolerance or may be like for work distribution so that you don't go to the same set of 4 warps again and again and choose another set of 4.

This comment was marked helpful 0 times.

kayvonf

@pradeep. To clarify. The chip has the ability to run 6 instructions (4 arithmetic, 2 load/store), and the capability to find up to 8. So in this sense, it's the opposite of what you mentioned. In an ideal scenario, the chip can decode more than it can run.

This comment was marked helpful 1 times.

kkz

@pradeep Keep in mind that there are also two groups of special purpose units and that a warp selector can run two instructions at a time.

This comment was marked helpful 1 times.

shabnam

@kayvonf I am not sure if I understand what you mean by "two independent instructions per each of those warp"?

This comment was marked helpful 0 times.

mchoquet

@shanbam in lecture 1 we talked about the idea of instruction-level-parallelism, which is that if the processor sees that two nearby instructions (maybe addl x 5 and addl x 3) that are independent, it can execute them in the same clock cycle. That's why the 4 warp selectors on the chip above each have 2 instruction fetchers; they can run 2 instructions at once if those instructions are independent. My interpetation of the above picture was that if among the 8 instructions selected to be run in a single clock cycle, 1 is `special math', 1 is load/store, and 6 are normal math, then all 8 batches of 32-wide SIMD units can be used at once.

This comment was marked helpful 3 times.

kayvonf

@mchoquet. Very nicely said.

This comment was marked helpful 0 times.

taegyunk

For GTX680, is it correct that a warp consists of 32 number of threads? and each thread is basically mapped to each one of 32 shared units?

This comment was marked helpful 0 times.

LilWaynesFather

The max warp size is dependent on the the SIMD lane size. So for GTX680 the warp size is 32. And yes, each cuda thread is mapped to a SIMD unit in the SIMD "block" assigned to the warp.

This comment was marked helpful 0 times.

paraU

All the threads in one block will be done in the same core. This core has the shared memory and that's why shared memory can make the program so much faster. Another thing we found in doing the homework is that we don't need to manually free space in the shared memory. When the block has finished, CUDA will automatically free those spaces.

This comment was marked helpful 0 times.