Isn't scheduling 2D grids for threads and blocks, and manually managing them too much of a burden on the programmer with little gain? Does CUDA provide higher order constructs that hide this overhead? Are there languages that are higher level than CUDA or OpenCL?
CUDA is meant to be a language with relatively low abstraction distance from the capabilities of the machine. (By this I mean a 418 student reading CUDA code should have a pretty good handle on how it maps to a modern GPU.) Higher-level programming systems offer the potential for convenience and productivity, but delegate the task of mapping computations to the GPU to a compiler or runtime system. The cost of that convenience is performance whenever the compiler/runtime can't do as good a job as a 418 student.
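To make the question concrete, here is the kind of manual 2D grid/block management being asked about — a minimal sketch (the kernel name and sizes are illustrative, not from any particular assignment). The programmer explicitly chooses the block shape, computes the grid dimensions, and maps thread indices to data:

```cuda
#include <cuda_runtime.h>

// Each thread computes one element of an n x n matrix sum.
__global__ void matrixAdd(const float* a, const float* b, float* c, int n) {
    // The thread derives its (row, col) from block and thread indices by hand.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
        c[row * n + col] = a[row * n + col] + b[row * n + col];
}

int main() {
    const int n = 1024;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);

    // The programmer, not the system, picks the block shape and grid shape,
    // and rounds the grid up to cover the whole problem.
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matrixAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```

The gain for this burden is exactly the low abstraction distance described above: nothing is hidden, so a reader can predict how the launch maps onto SMs, warps, and memory accesses.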
For example, graphics shader programming uses a much more functional approach to parallelism than CUDA does.
CUDA programs actually have two levels of abstraction: the language itself and a virtual GPU architecture. As long as the language compiles down to the same virtual GPU architecture, NVIDIA can change the CUDA language as much as they'd like without being concerned that programs written in a new CUDA version will have improper semantics on an older GPU. If the CUDA language needs hardware support for a new feature, then it is easy for NVIDIA to specify a new virtual GPU architecture and allow that feature only in programs compiled to the new virtual architecture.
These two abstractions then bottom out to a single implementation of the virtual GPU architecture: a real GPU architecture. This real GPU architecture implements the virtual GPU architecture, allowing different GPU chips to have a lot of flexibility in the way the hardware is laid out without being concerned with violating some assumption made at the language level.
Additionally, compilers with knowledge of the specific real GPU the program is going to run on (for example, JIT compilers) can specialize the virtual GPU code to take advantage of the real GPU hardware characteristics in the final compiled binary.
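You can see this virtual/real split directly in how nvcc is invoked — a sketch, with the file name `kernel.cu` chosen for illustration (the CUDA Toolkit's GPU Compilation docs are the authoritative flag reference). `compute_XX` names a virtual architecture (compiled to PTX); `sm_XX` names a real architecture (compiled to machine code):

```shell
# Produce real-architecture machine code for sm_70 GPUs, and also embed
# PTX for the compute_70 virtual architecture so the driver's JIT compiler
# can specialize it at load time for GPUs that didn't exist at build time.
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     kernel.cu -o kernel
```

The second `-gencode` line is what makes the JIT specialization in the previous paragraph possible: the binary carries virtual-architecture code that bottoms out to a real architecture only when the program runs.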
You can find more info in the CUDA Toolkit documentation section on GPU Compilation.