Kapteyn

When programming in ISPC, we often launch multiple tasks, where each task executes on a different core of the CPU and each ISPC program instance within a task performs some operation on an element of an array whose values are stored contiguously in memory.

Say our SIMD vector width is 8-wide. Then it is likely that multiple cores will simultaneously want to write to the same cache line. For example, the first task might be operating on ints 0 through 7 of the array while the second task operates on ints 8 through 15, and all sixteen of those ints fit on the same cache line (16 x 4 bytes = 64 bytes). Assuming that ISPC is not padding our arrays behind the scenes, it would seem that ISPC runs into the performance problems caused by cache coherence discussed on this slide.

Since the whole point in launching multiple tasks is to achieve parallelism across cores, it seems like the implementation of ISPC somehow gets around the problem illustrated in this slide. How does it do this?
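To make the scenario concrete, here is a minimal C/pthreads sketch (not ISPC, and with made-up names such as worker and CHUNK) of the write pattern described above, assuming 4-byte ints and 64-byte cache lines:

```c
#include <pthread.h>
#include <stdio.h>

#define CHUNK 8              /* one 8-wide gang's worth of ints         */

static int data[2 * CHUNK];  /* 16 ints = 64 bytes: typically one line  */

/* Each "task" writes only its own contiguous chunk of the array. The two
 * chunks are disjoint, but (alignment permitting) they sit on the same
 * cache line, so the writes make that line bounce between the cores. */
static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (int iter = 0; iter < 1000000; iter++)
        for (int i = 0; i < CHUNK; i++)
            data[id * CHUNK + i] += 1;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)(long)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    printf("data[0] = %d, data[%d] = %d\n", data[0], CHUNK, data[CHUNK]);
    return 0;
}
```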

ak47

On an unrelated note, this was discussed as a hideous piece of code...am I missing something? What is so hideous?

Olorin

@ak47: Instead of using NUM_THREADS * sizeof(int) bytes of memory, we now use NUM_THREADS * CACHE_LINE_SIZE bytes. In class, Kayvon said that a cache line is 64 bytes (if I recall correctly), so this uses 64/4 = 16 times as much memory. (Unless ints are 8 bytes, in which case it's still 8 times as much.)

This extra memory isn't being used for any purpose related to the task we want to complete, so it's very wasteful.
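For reference, the kind of padded per-thread counter being discussed might look something like the sketch below (assuming 64-byte cache lines; NUM_THREADS and the names are illustrative, not the slide's exact code):

```c
#include <stdio.h>

#define CACHE_LINE_SIZE 64   /* bytes, per the discussion above           */
#define NUM_THREADS      4   /* example value, not taken from the slide   */

/* Pad each per-thread counter out to a full cache line so no two threads'
 * counters ever share a line -- at the cost of 16x the memory for 4-byte
 * ints, which is the waste being pointed out here. */
typedef struct {
    int  myCounter;
    char pad[CACHE_LINE_SIZE - sizeof(int)];
} padded_counter_t;

static padded_counter_t myCounter[NUM_THREADS];

int main(void) {
    myCounter[0].myCounter += 1;   /* thread i touches only its own line */
    printf("sizeof(padded_counter_t) = %zu bytes\n", sizeof(padded_counter_t));
    return 0;
}
```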

jezimmer

Hmm, I guess when he mentioned that the code was hideous I was thinking stylistically. Now, instead of simply indexing into a plain array, every counter access has to look like

```c
myCounter[i].myCounter += 1;
```

I guess I subconsciously justified the "grossness" of the excess memory usage by thinking that it would drastically reduce cache-line thrashing in the program, so for a little extra (likely a "constant" amount more) memory you get a pretty good performance improvement.

Olorin

You can always abstract syntactic ugliness like that away, e.g. #define ACCESS(myCounter, i) myCounter[i].myCounter or similar.

I wonder if a cleaner optimization might be to note that we have a small-ish number of CPUs, and keep multiple integers (one for each virtual core?) in the same cache line. This still wastes a lot of space, but a significant amount less, since each array element would have multiple integers instead of just one.
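One way to read that suggestion: counters for hardware threads that share a physical core (and therefore an L1 cache) could share a cache line, since invalidations only hurt when a line bounces between different cores' caches. A hypothetical sketch, assuming two hardware threads per core (the constants and names are made up):

```c
#include <stdio.h>

#define CACHE_LINE_SIZE  64
#define THREADS_PER_CORE  2   /* e.g. two hyperthreads sharing one L1     */
#define NUM_CORES         4   /* example value                            */

/* One cache line per physical core. The hardware threads on that core share
 * the line, but they also share the core's L1, so the line does not bounce
 * between different cores. Memory per counter drops from 64 to 32 bytes. */
typedef struct {
    int  counters[THREADS_PER_CORE];
    char pad[CACHE_LINE_SIZE - THREADS_PER_CORE * sizeof(int)];
} core_counters_t;

static core_counters_t perCore[NUM_CORES];

int main(void) {
    int c = 0, t = 1;
    perCore[c].counters[t] += 1;   /* hardware thread t on core c */
    printf("sizeof(core_counters_t) = %zu bytes\n", sizeof(core_counters_t));
    return 0;
}
```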

uncreative

It's interesting that all of the parallel programming languages we have used seem to have semantics for avoiding code like this. In OpenMPI, you have to send messages, so such an array would be impossible, and in OpenMP, they have the reduce operation for summing counters like this.
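For comparison, here is a generic sketch (not code from the course) of the OpenMP reduction being referred to: each thread accumulates into a private copy of the counter, and OpenMP combines the copies at the end, so there is no shared counter array to lay out carefully in the first place. Compile with -fopenmp (or equivalent); without it the pragma is ignored and the loop simply runs serially.

```c
#include <stdio.h>

int main(void) {
    const int N = 1000000;
    long counter = 0;

    /* Each thread gets a private copy of 'counter'; the private copies are
     * summed into the shared variable when the parallel loop finishes. */
    #pragma omp parallel for reduction(+ : counter)
    for (int i = 0; i < N; i++)
        counter += 1;

    printf("counter = %ld\n", counter);
    return 0;
}
```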