kayvonf

Question: In class someone asked whether, in the case of un-padded counter variables, six threads running on three hyper-threaded cores (achieved by pegging threads to hardware contexts via pthread_setaffinity_np; a minimal sketch of the affinity call appears after the questions below) would outperform six threads each running on their own core. I love the proposed experiment, and someone should give it a shot and let us know.

In the un-padded counter scenario:

  • Why might six threads on six cores run slower than six threads on three cores?
  • Why might six threads on six cores run faster than six threads on three cores?

In the padded counter scenario:

  • Do you expect to obtain higher performance with six threads on six cores or six threads on three cores?
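
For reference, here is a minimal sketch (not from the original thread; Linux-specific) of what the affinity call might look like. The pin_to_core helper name is illustrative, and which logical CPU ids are hyper-thread siblings of the same physical core is machine-specific (check lscpu or /proc/cpuinfo):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Restrict the calling thread to run only on the given logical CPU.
    static int pin_to_core(int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    }

Pinning two threads to each of three physical cores gives the six-threads-on-three-cores case; pinning one thread per physical core gives the six-on-six case.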
Kapteyn

I wrote up some code to run 6 threads with unpadded data on 6 and 3 cores. Just like in the code presented in class, each thread is tasked with incrementing a counter (stored inside an array, indexed by threadId) 2 billion times.
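
The benchmark code itself isn't posted, so here is a plausible reconstruction of the unpadded version (the names and the volatile qualifier are assumptions; an affinity call like the one sketched above would select the 3-core or 6-core mapping):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_THREADS     6
    #define NUM_INCREMENTS  2000000000L   // 2 billion increments per thread

    // All six 8-byte counters sit adjacent in memory, so they share cache
    // line(s). volatile keeps the compiler from holding the counter in a
    // register, which would eliminate the sharing being measured.
    volatile int64_t counters[NUM_THREADS];

    static void* worker(void* arg) {
        long id = (long)arg;
        for (long i = 0; i < NUM_INCREMENTS; i++)
            counters[id]++;   // every write contends for the shared line
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void*)t);
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);
        printf("counters[0] = %lld\n", (long long)counters[0]);
        return 0;
    }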

Before running the tests, I hypothesized that running 6 threads on 3 cores would be slightly faster than running 6 threads on 6 cores.

My reasoning for why the 6-threads-on-6-cores case would perform worse than the 6-threads-on-3-cores case is contention. Since all threads are attempting to write to the same cache line, and the only work each one does is update a single value in that line, essentially all of their work is serialized: at all times they are all contending for the same cache line.

In this situation, since only one thread can be updating a value in the line at a time, it would in fact be worse to have more cores fighting over the line, because that creates more coherence traffic between caches to orchestrate who gets to modify it.

When we double the number of cores, we double the probability that a BusRdX will be issued on the bus at any given moment, forcing the cache that currently holds the line exclusively to drop it. The probability doubles when we double the number of cores because BusRdX messages are issued by caches, not threads. So if we put two threads on each core such that only 3 cores are used, only three caches can potentially fight over the same cache line; with six cores, six caches can.

This might also increase our chance of pathological situations in the cache coherence protocol such as livelock. Livelock can occur when two processors are contending for the same cache line: processor 1 issues a BusRdX and causes processor 2 to invalidate its line, but before processor 1 can actually complete its write, processor 2 issues a BusRdX and causes processor 1 to invalidate its line, and this pattern can repeat indefinitely with neither write completing.

I think a good analogy is a group of people trying to get through a narrow door that fits only one person at a time. The more people who fight to get through the door first, the longer it takes any given person to get through, whereas with fewer people, the crowd flows through more easily.

Another reason why using 3 cores might be slightly faster than 6 is that in the 3-core case, two threads share each core. When thread 1 on a given core gets exclusive access to the line, and thread 2 on the same core is lucky enough to execute its write immediately after thread 1 finishes (before a thread on another core claims exclusive access), we avoid a cache miss: thread 1 and thread 2 share the same L1 cache, and the line is already there in the modified state.

Here are the results from averaging 15 test runs of each scenario:

6 threads on 3 cores: 16.904964 sec

6 threads on 6 cores: 17.1839864 sec

So the 6 threads on 6 cores case is indeed slightly slower, but only by about 0.3 seconds on average.

Also as a comparison, the sequential version ran in 24.781217 seconds.


Padded Data

I also wrote another test to run 6 threads with padded data on 6 and 3 cores. Unlike the previous experiment, before running this one I was fairly certain that six threads on six cores would be about twice as fast as six threads on three cores. With padded arrays the writes are no longer serialized: with six cores we can update all six counters completely in parallel, while with three cores only three counters can be updated simultaneously. Sure, there is some benefit from multithreading on a single core, but it's far from the speedup gained by parallelizing across six cores.
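
The padded variant isn't shown either; one common scheme, assuming 64-byte cache lines, is to give each counter its own line so a write by one thread no longer invalidates the lines holding the other threads' counters:

    #include <stdint.h>

    #define CACHE_LINE_SIZE 64

    typedef struct {
        volatile int64_t value;
        char pad[CACHE_LINE_SIZE - sizeof(int64_t)];  // fill out the line
    } padded_counter_t;

    // Align the array so counters[0] starts on a line boundary (GCC/Clang).
    padded_counter_t counters[6] __attribute__((aligned(CACHE_LINE_SIZE)));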

Here are the results confirming the above hypothesis:

6 threads on 3 cores with padded data: 7.766555 sec

6 threads on 6 cores with padded data: 3.8747856 sec

I'd also like to reconcile my observations from this experiment with the experiment we ran in class, where the slowdown from padded to unpadded data with 12 threads on six cores was noticeably worse than the slowdown from padded to unpadded data with 6 threads on six cores.

In that situation, I believe the overhead of managing twice as many threads caused the additional slowdown. Typically, multithreading is beneficial because it lets us hide memory latency. But given the specific nature of our code, there is very little for each thread to do besides write to a location in memory, so the ability to switch between threads on a given processor does not improve our runtime to any significant degree. On top of that, note that we are running pthreads, not (for example) CUDA threads, whose context switching is handled by hardware and is much less expensive than the OS-managed context switching that pthreads incur.