Slide 18 of 66
rofer

This reminds me of one of my favorite technical terms.

This kind of problem is called embarrassingly parallel: a problem that is trivial to separate into parallel tasks (in this case because there are no data dependencies between the per-element computations).

arjunh

Question: a quick one (already answered in lecture), but important to keep in mind: should you necessarily expect a considerable speedup from an embarrassingly parallel program? If not, explain why such a program may fail to parallelize well, and what you could do to improve its performance.

pmassey

I could be wrong, but it seems as though you should expect an embarrassingly parallel program to see a considerable speedup. As in the example above, SIMD execution lets a single fetched and decoded instruction stream operate on many data elements at once. However, as shown in class, it may still be inefficient in practice. The main reason to expect a speedup is that all of the data elements are independent and the same code runs on every one of them.

arjunh

@pmassey Memory access speed can definitely be a problem, as you pointed out. That being said, it would be worth revisiting this problem in a couple of lectures, when we start talking more about caching, and in particular about one notorious problem known as false sharing.

Also, think about divergent execution when using vectorized instructions. You'll explore this more in assignment 1.

landuo

An embarrassingly parallel program does not guarantee a considerable speedup. Divergent execution can significantly hurt vectorized code by making all the other lanes wait for the slowest lane. In addition, false sharing occurs when multiple processors read and write different data that happen to fall on the same cache line; each write invalidates that line in the other processors' caches, and the resulting coherence traffic causes a performance drop.

skywalker

I believe the divergent execution arises because, even though we get parallel SIMD execution across the elements, the inner for loop runs a different number of iterations for different elements. Hence we're not guaranteed an even distribution of work across the lanes.