The implementation of a CUDA thread is more like an ISPC task. A few workers are created and keep fetching unfinished jobs, so there won't be 8K instances of shared variables alive at the same time; otherwise it could even exceed the hardware's resource limits.
In ISPC, each task can run SIMD instructions, but here in CUDA even the individual program instances of a SIMD instruction are counted as "threads". Is there any advantage to going down to that granularity when it comes to scheduling?
@Fantasy, this is not quite right. The main similarity between ISPC and CUDA programming is that they both use the SPMD model. However, there are important differences between ISPC tasks and CUDA threads. CUDA threads are actually most analogous to ISPC program instances: both get mapped to one lane (ALU) of a vector unit. On an NVIDIA GPU the vectors are just much wider, i.e., 32 wide instead of the 4 or 8 typical with ISPC.
There is a very big difference between ISPC tasks and CUDA threads. All CUDA threads within a thread block must run on the same core, whereas different ISPC tasks can run on different cores. CUDA threads execute together in groups of 32 called warps, which is analogous to program instances executing together in an ISPC gang. Remember that each ISPC task maps to one ISPC gang, so if you really wanted to make an analogy between ISPC tasks and something that exists in CUDA, I would say a task is most like a warp's worth of work.
But I wouldn't suggest making this mental connection because, again, all CUDA threads in a thread block must run on the same core. This means all warps assigned the threads of one thread block will run on the same core, and no pthreads are created to handle the execution of a warp's worth of work. In contrast, ISPC uses pthreads to launch tasks onto different cores.
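To make the mapping concrete, here is a minimal CUDA sketch of the warp/gang analogy (the kernel and variable names are just for illustration, not from the discussion above). Each thread can compute its lane within its warp, which plays the same role as `programIndex` within an ISPC gang:

```cuda
// Sketch: a CUDA thread's position in the warp/gang analogy.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
    int lane = threadIdx.x % warpSize;  // lane within the warp (0..31),
                                        // analogous to ISPC's programIndex
    int warp = threadIdx.x / warpSize;  // warp index within this block;
                                        // a warp is roughly a gang's worth of work
    (void)lane; (void)warp;             // shown only to illustrate the mapping
    if (tid < n)
        y[tid] = a * x[tid] + y[tid];
}
```

Note that the programmer never schedules warps explicitly; the hardware partitions a block's threads into warps of 32 consecutive threads.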
Is it fine to say a CUDA thread block is more like a task? They can both run on different cores. One task maps to one ISPC gang of program instances, and one thread block contains many CUDA threads. And a CUDA thread is analogous to a program instance.
@Fantasy I think that's a better analogy to make :) CUDA has a thread block scheduler that schedules thread blocks onto cores as they become available.
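A self-contained sketch of that analogy (hypothetical kernel and function names): thread blocks are the task-like unit. The hardware block scheduler assigns each block to whichever SM (core) has capacity, much as the ISPC runtime hands tasks to worker threads:

```cuda
// Sketch: thread blocks scheduled onto SMs like ISPC tasks onto cores.
__global__ void scale(int n, float a, float *y) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        y[tid] *= a;  // all threads of this block run on one SM
}

// Host side: launch many more blocks than there are SMs. Blocks are
// independent and may run in any order, possibly concurrently on
// different SMs -- which is why blocks must not assume an execution order.
void launch_scale(int n, float a, float *d_y) {
    int threadsPerBlock = 256;  // 8 warps of 32 threads each
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(n, a, d_y);
    cudaDeviceSynchronize();
}
```

Like ISPC tasks, the number of blocks can far exceed the number of cores; the scheduler keeps handing out blocks until the grid is finished.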