Slide View : 15-418/618 Spring 2014

Performance Optimization II: Locality, Communication, & Contention

Previous | Next --- Slide 27 of 42

smklein

If two processors both tried accessing each other's cache lines, problems could occur.

P1 would ask for the cache line, load it from memory (slowly), and finally have access. Then, possibly before P1 could finish its work, P2 would ask for the cache line, invalidating P1's cache line, and making it move all the way through L3 cache to get over to the different processor.

This "ping pong" effect of the cache line would result in a lot of invalidation and waiting, rather than progress.

This comment was marked helpful 0 times.

yanzhan2

I agree with @smklein's explanation, but whether making cache line move all the way through L3 cache depends on the design. If cache to cache transfer is provided, then the data could be directly transfer to the other cache line using whether bus based or directory based structure and protocol.

This comment was marked helpful 0 times.

arjunh

This example from the wikipedia page helped me a lot in understanding how false sharing is a real thing:

In the code below, we observe that the fields in the instance of struct foo will be on the same cache line (assuming a cache line size of at least 8 bytes). We then launch two functions on two different processors, both using the same instance of struct foo. One method (sum_a) only reads from f.x. The other method (inc_b) writes to f.y the incremented value.

Every time the inc_b function increments f.y, the processor running the inc_b function sends out a message over the interconnect, informing the other processors that they should invalidate the cache line, due to modification. Unfortunately, those modifications are disjoint from the reads that the other processor (running sum_a) makes.

Therefore, it would be ideal that no invalidation should occur, but since the smallest level of granularity that the system works in is in terms of the size of a single cache line (artifactual communication), there will be a significant amount of 'thrashing' in this code.

Python:

struct foo {
  int x;
  int y; 
};

static struct foo f;

/* The two following functions are running concurrently: */

int sum_a(void)
{
  int s = 0;
  int i;
  for (i = 0; i < 1000000; ++i)
    s += f.x;
  return s;
}

void inc_b(void)
{
  int i;
  for (i = 0; i < 1000000; ++i)
    ++f.y;
}

One solution to this problem would be to add padding between the x and y fields, so that they are on different cache lines. For example, if the cache line was 8 bytes in size, we could use the following struct definition to avoid false sharing:

Python:

struct foo { 
  int x; 
  char c[4]; 
  int y; 
}

Now, the writes to f.y will not interfere with the reads from f.x. This struct definition is more cache-friendly, but of course, it incurs a substantial memory footprint.

This comment was marked helpful 1 times.

Q_Q

Thinking back to assignment 2, I guess one of the reasons why using a thread for each pixel was so slow is that the pixels in the image are laid out in row order, but if the thread blocks worked on squares in the image, there would be a lot of cache-ping-pong between the L1 cache for each SMXes on the GPU.

This comment was marked helpful 0 times.

jinsikl

I think one important thing that came up in lecture was that these artifactual communication isn't inherent in the code, but simply the result of hardware implementations. Thus it is hard to know exactly how these artifactual communications are affecting the performance of your program.

This comment was marked helpful 0 times.

mofarrel

@yanzhan2 In the intel x86 architecture, it is most likely that it passes across processors at the level 2 cache. This is because it needs to maintain inclusion throughout the cache hierarchy. Copying a value from processor 1's L1 to processor 2's L1 violates inclusion unless it updates processor 2's L2. This process of updating processor 2 L1 and L2 would be to large an overhead for an L1 cache, as it is intended to be extremely quick.

x86 seems to copy across from one processors L2 to another processors L2. This is suggested by the fact that reading from another processors L2 is faster than reading from L3 (slide 18).

This comment was marked helpful 0 times.

tianyih

Is false sharing and artifactual communication the same thing?

This comment was marked helpful 0 times.

drayson

They're closely related -- false sharing is the pattern of accesses that causes the artifactual communication to occur

This comment was marked helpful 0 times.