Slide 15 of 30
BryceToTheCore

Main Idea

I think the main idea of this optimization is that the lock variable is kept mostly in the shared state, where it can be read by many processors at once, whereas without the extra test every test-and-set pulls the line into the exclusive state and generates lots of cache coherence traffic.
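
For concreteness, here is a minimal sketch of a test-and-test-and-set lock, assuming C11 <stdatomic.h> (the function names are just for illustration; the slide itself may use different primitives):

#include <stdatomic.h>

// Spin on an ordinary load first, so all waiting cores can hold the
// lock's cache line in the shared state; the atomic RMW is attempted
// only once the lock looks free.
void tatas_lock(atomic_int *l) {
    for (;;) {
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;  // read-only spin: the line stays shared, no invalidations
        if (atomic_exchange(l, 1) == 0)  // single test-and-set attempt
            return;
    }
}

void tatas_unlock(atomic_int *l) {
    atomic_store(l, 0);
}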

Side Note

In general, I would imagine that it is preferable to call atomic operations as few times as possible, because they are expensive.

ChandlerBing

I agree. This is similar to one of the optimizations done in Assignment 3. Here, the atomic test_and_set operation is expensive in terms of the interconnect traffic it generates and hence should be called as rarely as possible.

kayvonf

Challenge: In a later class I was asked about the cost of an atomic RMW operation compared to just a write operation in a cache coherent system (the discussion was held in the context of a compare-and-swap operation). I mentioned that someone should write some code to try it. I'd like to see the answer posted here!

uncreative

I wrote some simple code which might give some insight.

I used fetch-and-add instead of compare-and-swap, since it was easier to come up with something my compiler couldn't optimize away.

I performed the addition 32,000,000 times with regular instructions and 32,000,000 times with atomic instructions.

The main loop looked like:

for (i = 0; i < 32000000; i++) {
    counter += i % 2;  // a little bit hard to optimize out
}
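
A self-contained sketch of that kind of microbenchmark, assuming GCC's __atomic builtins and clock_gettime for timing (a reconstruction with illustrative variable names, not the code that produced the numbers below):

#include <stdio.h>
#include <time.h>

#define N 32000000

int plain_counter;
volatile int volatile_counter;
int atomic_counter;   // updated only through __atomic builtins

static long elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000L;
}

int main(void) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        plain_counter += i % 2;                  // ordinary load-add-store
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("regular int:  %ld us\n", elapsed_us(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        volatile_counter += i % 2;               // compiler must keep every access
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("volatile int: %ld us\n", elapsed_us(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        __atomic_fetch_add(&atomic_counter, i % 2, __ATOMIC_SEQ_CST);   // locked RMW
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("atomic int:   %ld us\n", elapsed_us(t0, t1));

    return 0;
}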

The results show that the atomic additions were about 2-3x slower than additions to a volatile int, and that a regular int is quite fast.

With -O0, the results look like this:

  • Writing to a regular int took 87285
  • Writing to a volatile int took 86367
  • Atomically writing to a regular int took 198348
  • Atomically writing to a volatile int took 192543

And with -O3, they look like this:

  • Writing to a regular int took 12852
  • Writing to a volatile int took 74813
  • Atomically writing to a regular int took 192978
  • Atomically writing to a volatile int took 191950

I looked a little at the assembly code, and gcc emits a lock-prefixed instruction for the atomic operations:

lock add %eax,0x20092e(%rip)

and the regular load-add-store (at least on -O0) for the non-atomic ops:

mov    0x200a04(%rip),%eax
add    %edx,%eax
mov    %eax,0x2009fc(%rip)

In the volatile non-atomic case using -O3, gcc inserts a few instructions between the read and write:

4007a0: 8b 0d ba 08 20 00     mov    0x2008ba(%rip),%ecx
4007a6: 89 d6                 mov    %edx,%esi
4007a8: 83 c2 01              add    $0x1,%edx
4007ab: 83 e6 01              and    $0x1,%esi
4007ae: 01 f1                 add    %esi,%ecx
4007b0: 39 da                 cmp    %ebx,%edx
4007b2: 89 0d a8 08 20 00     mov    %ecx,0x2008a8(%rip)

For the non-volatile variable at -O3, gcc turned my 6-line function into 350 lines of assembly. I think the counter is kept in a register, and the loop seems to have been unrolled. I don't know what else happened. I'm pretty sure gcc didn't figure out that all my code does is divide by 2 and add 1.

EDIT: In fact, I went back and timed dividing by 2 and adding 1, and gcc must be looping, because dividing is much faster.

If anyone is curious I can post the code I used. I was worried it would take up too much space.

Elias

tl;dr: writing an integer is as much as 15x slower when using atomics (at -O3). Wow!

lament

@uncreative Bravo.

lament

@BryceToTheCore How is that different from the original test-and-set? There, we only wrote when the lock was available; the rest of the time we read. Could someone explain the difference carefully?

EDIT: Part of this is answered on the next slide. Also, as detailed on a previous slide, in the original test-and-set the system is apparently smart enough to "see the intent" of several lines of code. Namely, the cache traffic involved only repeated BusRdX transactions, not a BusRd followed by a write as I had pictured in my head. The former "sees" what the overall code does, while the latter would follow the assembly more closely.

BryceToTheCore

@lament,

Regarding the original test-and-set:

While it is true that we only write when the lock is available, the atomic operation itself still forces the line into the exclusive state. Therefore it is impossible for multiple processors to hold the variable in the shared state while they spin.
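
By contrast with the sketch above, a plain test-and-set lock (again a sketch using C11 atomics, with illustrative names) issues the atomic RMW on every spin iteration, so each waiting core keeps pulling the line into the exclusive state and invalidating the other waiters:

#include <stdatomic.h>

void ts_lock(atomic_flag *l) {
    // Every iteration is an atomic read-modify-write, even while the lock
    // is held: each attempt requests the line in exclusive (modified) state
    // and invalidates the copies held by every other spinning core.
    while (atomic_flag_test_and_set(l))
        ;
}

void ts_unlock(atomic_flag *l) {
    atomic_flag_clear(l);
}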