As Kayvon explained during lecture, it's really important that all the threads in a CUDA block are guaranteed to execute concurrently. A number of things could go horribly wrong if this weren't the case. For example, suppose you synced those threads at some point in your code with __syncthreads() (analogous to a barrier in ISPC), but only some of the threads in the block were actually running. The running threads could all reach the __syncthreads() call... only to wait on other threads that haven't even begun executing yet! So this CUDA constraint helps avoid nasty deadlocks like these.
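To make the barrier scenario concrete, here's a minimal sketch of a kernel that relies on __syncthreads(). It isn't from the lecture: the kernel name reverse_in_block, the tile array, and the THREADS_PER_BLOCK constant are all made up for illustration. Each thread reads a value that a different thread in the same block wrote, so it only works if every thread in the block reaches the barrier.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block size chosen for this sketch.
constexpr int THREADS_PER_BLOCK = 128;

// Illustrative kernel: reverses the elements within each block's chunk.
// Assumes the kernel is launched with blockDim.x == THREADS_PER_BLOCK.
__global__ void reverse_in_block(const int* in, int* out) {
    __shared__ int tile[THREADS_PER_BLOCK];

    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    // Every thread writes exactly one slot of shared memory.
    tile[t] = in[base + t];

    // Barrier: no thread continues until all threads in the block arrive.
    // If some threads of the block had not been scheduled yet, the threads
    // waiting here would have nobody to release them.
    __syncthreads();

    // Read a slot that was written by a different thread.
    out[base + t] = tile[blockDim.x - 1 - t];
}

int main() {
    constexpr int num_blocks = 4;
    constexpr int n = num_blocks * THREADS_PER_BLOCK;

    int h_in[n], h_out[n];
    for (int i = 0; i < n; i++) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    reverse_in_block<<<num_blocks, THREADS_PER_BLOCK>>>(d_in, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Check on the host that each block's chunk came back reversed.
    int mismatches = 0;
    for (int b = 0; b < num_blocks; b++) {
        for (int t = 0; t < THREADS_PER_BLOCK; t++) {
            int expected = h_in[b * THREADS_PER_BLOCK + (THREADS_PER_BLOCK - 1 - t)];
            if (h_out[b * THREADS_PER_BLOCK + t] != expected) mismatches++;
        }
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

Note that the barrier only covers threads within one block. Different blocks don't sync with each other here, which is why the concurrency guarantee matters at block granularity.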