Slide 52 of 66
jiajunbl

It seems that handling multiple logical threads on a CPU could be done by managing the cache in the CPU, which makes me think that the number of threads per core could be changed dynamically at runtime - which would make things really interesting. However, I also recall that the context-switching logic here relies on hardware circuits.

What I understood/misunderstood from this lecture is that the number of threads per core is a hardware determined value.

However, it seems that with some operating-system overhead and a bit of additional logic on the chip, this could be a dynamic value that is changed at runtime.

kayvonf

@jiajunbl: I see some very understandable confusion here and in student comments on the last slide. Here's the way I'd like you to think about it.

On CPUs, the number of execution contexts per core is a hardware-determined value. For example, an Intel CPU is designed to have storage for the execution contexts of two hardware threads. (This is more state than would be required if the chip supported execution of only one thread.)

However, other processors, namely most modern GPUs, have a shared pool of on-chip storage (technically, this storage is not the L1 cache) that can be dynamically configured to be split into execution contexts for different numbers of threads. What I'm illustrating in the sequence from slide 50 to slide 52 is that if a thread requires an execution context with many registers, then the chip will only be able to store execution contexts for a small number of threads, and thus will provide minimal latency hiding. If threads require very few registers, then the chip can partition this shared storage into many execution contexts, and thus provide a lot more latency hiding.
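The partitioning arithmetic can be sketched in a few lines. Note the register-file size below is a hypothetical figure chosen for illustration, not a number from any particular GPU:

```python
# Sketch: splitting a fixed pool of on-chip storage into per-thread
# execution contexts. The pool size is a hypothetical value.

REGISTER_FILE_BYTES = 256 * 1024   # assumed shared storage per core
BYTES_PER_REGISTER = 4             # 32-bit registers

def max_contexts(registers_per_thread: int) -> int:
    """How many thread execution contexts fit in the shared pool."""
    bytes_per_context = registers_per_thread * BYTES_PER_REGISTER
    return REGISTER_FILE_BYTES // bytes_per_context

# A register-hungry thread leaves room for few contexts (little latency
# hiding); a lean thread leaves room for many (lots of latency hiding).
for regs in (256, 64, 16):
    print(f"{regs:3d} registers/thread -> {max_contexts(regs):5d} contexts")
```

With these assumed numbers, cutting per-thread register use from 256 to 16 grows the number of resident contexts from 256 to 4096.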

On GPUs, the number of registers required per thread is determined at compile time and stored as metadata in the program binary. When the program is run on a GPU, the chip is configured accordingly. A common tip in GPU optimization guides is to try to make your program use fewer registers, since that allows the GPU to run more threads at once, providing the application better latency hiding.
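Here is a rough sketch of why that optimization tip works: a memory stall is hidden only if the other resident threads have enough work to cover it. All numbers below (register-file size, stall length, per-thread work) are assumptions for illustration:

```python
# Sketch: fewer registers per thread -> more resident threads -> more
# independent work available to cover a memory stall. All constants
# are hypothetical.

REGISTER_FILE_REGS = 65536  # assumed 32-bit registers of shared storage

def resident_threads(regs_per_thread: int) -> int:
    """Threads whose contexts fit in the shared register file."""
    return REGISTER_FILE_REGS // regs_per_thread

def stall_is_hidden(regs_per_thread: int, stall_cycles: int,
                    work_cycles_per_thread: int) -> bool:
    """While one thread waits on memory, can the others fill the gap?"""
    others = resident_threads(regs_per_thread) - 1
    return others * work_cycles_per_thread >= stall_cycles

# With a 400-cycle stall and 10 cycles of independent work per thread,
# a leaner kernel hides the stall while a register-hungry one cannot.
print(stall_is_hidden(1024, stall_cycles=400, work_cycles_per_thread=3))
print(stall_is_hidden(64,   stall_cycles=400, work_cycles_per_thread=3))
```

This is the same tradeoff the slides illustrate: shrinking the per-thread context buys more concurrent threads, and with them more latency-hiding capacity.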