Just want to confirm: is the GPU supposed to have 4 L1 caches instead of 2, since we're processing 4 instruction streams simultaneously?


On each clock, the core selects up to four runnable warps from its on-core execution contexts and runs them. The GPU/CUDA lecture also notes that instruction-level parallelism is available.
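
To make the per-clock selection concrete, here is a toy sketch in Python. It is only an illustration: the function name `select_warps`, the list-of-tuples layout, and the "first runnable in context order" policy are all assumptions of mine, not something the lecture or the hardware documentation specifies. It also ignores instruction-level parallelism entirely.

```python
# Toy model of one clock of warp selection.
# Assumption: the scheduler takes the first runnable warps in context order;
# real hardware may use a different policy.
MAX_WARPS_PER_CLOCK = 4  # "up to four runnable warps" per clock


def select_warps(contexts, limit=MAX_WARPS_PER_CLOCK):
    """Return the ids of up to `limit` runnable warps, in context order.

    `contexts` is a list of (warp_id, runnable) pairs, one per on-core
    execution context (a hypothetical representation).
    """
    selected = []
    for warp_id, runnable in contexts:
        if runnable:
            selected.append(warp_id)
            if len(selected) == limit:
                break
    return selected


# Hypothetical snapshot: eight warp contexts, three of them stalled.
contexts = [
    (0, True), (1, False), (2, True), (3, True),
    (4, False), (5, True), (6, True), (7, True),
]
print(select_warps(contexts))  # prints [0, 2, 3, 5]
```

Note that fewer than four warps are selected when fewer than four are runnable, which is why the wording is "up to four".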