Atomic operations cause an overhead, I'm guessing, because the variable needs to be surrounded by some sort of lock. How would a programmer know when this is worth it? Doesn't that require some pretty specific understanding of the chips implementation of different operations?
This comment was marked helpful 0 times.
Atomic operations are an abstraction -- instead needing to know when or where to use a lock, the programmer simply declares that an operation needs to happen atomically. This is assumed to cause a little overhead, simply because of whatever the implementation needs to do to ensure atomicity, but I don't see why the programmer needs to understand the implementation.
If a programmer believes that an algorithm with modification of shared memory is the most appropriate, then they will probably need to use atomic operations. If any extra overhead in that respect is unacceptable, then they will need to rewrite their algorithm to avoid needing atomic operations.
I am wondering why this mechanism is useful? At the last row, it would launch most number of blocks, but which block is working depends on which block first reach the atomic operation. Later blocks would not run if they come to the atomic operation late, and N is not big enough.
@yanzhan2 The benefit is to save block scheduling/switch overhead as well as balancing workload. Notice the while loop in the convolve kernel. The idea is to launch just enough thread blocks to keep all gpu cores busy and each block pulls work from a shared work queue (workCounter). This idea is similar to what ispc task is doing as opposed to launching lots of pthreads.
Expanding on the balancing workload point, I believe this technique is mostly used as a part of Dynamic Assignment, which, as that lecture slide describes, is mostly useful for when "the execution time of tasks, or the total number of tasks, is unpredictable".
Additionally, about @yanzhan2's last point, usually this technique would be used when the amount of work is much greater than the number of total threads so that each thread would do at least a couple of operations before it exits.
Another way to say things is that atomic operations, by definition, require serialization. Only one thread can be executing an atomic operation on a specific address at the same time. If many threads wish to do perform an atomic operation on the same address simultaneously, all of these operations must be serialized.
Since atomic operations are supported by hardware, instead of insanely add a mutex lock to each atomic operation, can I assume that this overhead is actually not very prominent? When having algorithms that requires serialization, is the atomic operation would be best to consider first? Is there any better way in general that is better than using atomic operations?
Atomic operations are very costly. They still need to ensure mutual exclusion, therefore in a multi processor system this is going to incur similar costs to that of a lock. However these costs are reduced slightly because atomic operations are generally highly optimized in the hardware.
Suppose we were to have N threads running on separate cores all trying to add atomically to a shared variable. If atomic operations are supported in hardware and don't in fact involve an additional lock, and these cores all happen to share their highest level cache, would it be reasonable to assume that this N-threaded program would be no slower than a sequential program adding N times? Or is there an additional penalty for using atomic operations beyond their inherently sequential nature and the risk of false sharing?
For GPU, a thread block runs on a single core, and thread blocks are independent with each other, so I guess there won't be situations in which different CUDA threads on different GPU cores trying to update a shared variable.