Slide 42 of 59
Split_Personality_Computer

It seems like data storage here is all-or-nothing: either you use a local variable or you update something in memory. However, we know that when there are more threads than cores, multiple threads can run on a single core, and that this is actually a good thing because it can hide the latency of memory accesses. My question is: is there a way to exploit this idea when writing parallel code? Is there a way for threads running on the same core to communicate with each other that is faster than going all the way to memory?

It seems to be good practice to sort of 'climb' the memory pyramid when you need a lot of workers to contribute to a single variable. For instance, to solve something computationally expensive on NVIDIA's GPUs you can launch many CTAs (cooperative thread arrays), each of which has multiple threads running in it, so that 1) each thread computes its answer, 2) each thread adds its answer to shared memory visible only to the threads in its CTA (like a small on-chip scratchpad per CTA), and 3) each CTA updates global memory. This way you have three 'stepping stones' of memory. Is there a CPU example where multiple threads on one core can communicate with one another this efficiently?
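To make the three-step pattern above concrete, here is a minimal CUDA sketch of it. The kernel name, the sum-of-squares computation, and the use of `atomicAdd` at both levels are my own illustrative choices, not something from the slide:

```cuda
// Each CTA (thread block) accumulates its threads' partial results in
// fast on-chip __shared__ memory, then one thread per CTA adds the
// block's total to the result in global memory.
__global__ void sum_squares(const float *in, float *global_sum, int n) {
    __shared__ float block_sum;   // per-CTA "stepping stone"
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0) block_sum = 0.0f;
    __syncthreads();

    // 1) each thread computes its answer
    float my_val = (i < n) ? in[i] * in[i] : 0.0f;

    // 2) each thread adds its answer to the CTA's shared memory
    atomicAdd(&block_sum, my_val);
    __syncthreads();

    // 3) one thread per CTA updates global memory
    if (threadIdx.x == 0) atomicAdd(global_sum, block_sum);
}
```

In practice, step 2 is often done with a tree reduction over shared memory rather than a shared-memory atomic, but either way the point is the same: most of the traffic stays within a CTA's fast memory, and only one update per CTA ever touches global memory.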