oulgen

Are there any side effects of CUDA barriers since CUDA threads are using shared memory? I mean, if two threads are going to touch the same part of memory, it is important to know which one gets there first. Could barriers be used in such a way as to enable priority or masking?

aznshodan

What's the difference btwn sharing an address space and sharing memory?

kayvonf

Most of the time when someone says "sharing memory" they mean sharing the same address space. Memory is presented as a uniform set of addresses (the address space). Sharing the address space means that when thread 1 asks for the data at address X, it gets the same piece of data that thread 2 gets when it asks for the data at address X. In other words, the threads share the same logical memory. This is exactly what someone means when they describe a "shared memory" system, and it is also what someone means when referring to CUDA "shared memory".
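
As a concrete (and contrived) sketch of what "same address X, same data" means in CUDA, consider a kernel like the one below (the kernel name is made up): both threads dereference the same pointer value and observe the same underlying data, because all the threads see one global address space.

#include <cstdio>

__global__ void same_address_demo(int* data) {
    if (threadIdx.x == 0) {
        data[0] = 42;            // thread 0 stores to the address &data[0]
    }
    __syncthreads();             // writes before the barrier are visible after it
    if (threadIdx.x == 1) {
        printf("thread 1 sees %d at the same address\n", data[0]);   // prints 42
    }
}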

However, if we want to talk about implementation details, one might say "shared memory" to mean that two processors can both access the same physical SRAM on chip, or the same DRAM.

Specifically in the case of CUDA programming, which we've discussed a lot, the CUDA abstraction of shared memory is implemented by (aka "backed by") an SRAM in the SMX core. A GPU core can load and store from the global memory address space (the same address space is shared by all cores), or it can load and store from a private local SRAM (only accessible to that core, but potentially shared among the threads running on that core). If you want, you can think of a GPU as having two types of load instructions. One, given an address, loads from the global memory address space (backed by DRAM, and potentially cached). The other, given an address, loads from that address in the core's local SRAM. CUDA global memory loads would use the former instruction, and shared memory loads would use the latter:

ld_global r0 0x1234   // load r0 from address 0x1234 in the global (DRAM-backed) address space
ld_local  r1 0x1234   // load r1 from address 0x1234 in the core's local SRAM (CUDA shared memory)

In practice, loads from the two address spaces may not be implemented with different instructions (there are other ways to implement this functionality). To be honest I'm not sure what the ISA looks like these days. I'd be interested if someone could dig into the description of PTX to figure it out! http://docs.nvidia.com/cuda/parallel-thread-execution/#axzz3R5rawOqT
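
For what it's worth, here is a minimal CUDA C sketch of the same idea (the kernel name and sizes are made up). The compiler knows statically which space each access targets: accesses to g and out go to the global address space, while accesses to tile go to the block's shared memory. At the PTX level, at least, these do compile to distinct instructions (ld.global.* vs. ld.shared.*), though PTX is a virtual ISA, so the hardware instructions may look different.

// Assumes blockDim.x == 256 and that the grid exactly covers the arrays.
__global__ void two_spaces(const float* g, float* out) {
    __shared__ float tile[256];                  // per-block storage in the core's SRAM

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = g[i];                    // global load, shared store
    __syncthreads();

    out[i] = tile[threadIdx.x];                  // shared load, global store
}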

Architects often call a core-local addressable SRAM a "scratchpad", since it is not a cache! A cache is invisible to software. It is a hardware implementation detail that serves to accelerate access to data in DRAM. Software loads data from an address and, if the data stored at that address happens to be in the cache, then the access is serviced more quickly.

A scratchpad is different. Like a cache, it is an on-chip, high-speed local storage unit (and in fact, the storage in both a scratchpad and a cache is just SRAM). However, a scratchpad is abstracted as its own address space, and if data is to be moved into the scratchpad, it is moved there explicitly by software. For example, a CUDA program will manually copy data into shared memory, which under the hood amounts to executing instructions that copy data into the GPU core's local scratchpad memory.
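
Here is a minimal sketch of that pattern (the kernel name, sizes, and launch parameters are hypothetical): each thread in a block explicitly copies one element of the input into a shared-memory tile, the block synchronizes, and then every thread reads its neighbors out of the tile instead of going back to global memory.

#define BLOCK 128

// Assumes the kernel is launched with blockDim.x == BLOCK.
__global__ void blur3(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2];            // the tile lives in the core's scratchpad SRAM

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Explicit copy into the scratchpad, with a one-element halo on each side.
    tile[threadIdx.x + 1] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0) {
        tile[0]         = (i > 0)         ? in[i - 1]     : 0.0f;
        tile[BLOCK + 1] = (i + BLOCK < n) ? in[i + BLOCK] : 0.0f;
    }
    __syncthreads();                             // everyone waits until the tile is populated

    // Each input element loaded above is read by up to three threads,
    // but only ever fetched from global memory once.
    if (i < n)
        out[i] = (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]) / 3.0f;
}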

Why use a scratchpad and not a cache? First, a scratchpad doesn't incur the overhead (honestly, it's not all that much in chip area) of all the cache management logic that a traditional cache must have. (A traditional cache has to track and manage what data is stored in it.) Second, and more importantly, in high-performance computing applications, when performance is paramount, programmers sometimes like complete control over what data is and is not stored on chip. Contrast this with cache optimization, where the programmer is not able to directly control what data is moved into the cache, but instead tries to write code that works well with what they know the cache will do.