Now that the quiz is turned in, it's interesting to note that while this was presented as helpful for the homework, it also came in handy for Q2 on the quiz. I'm not sure if this was intentional. It occurs to me that maybe this was done so that we'd be more familiar with the concept, to be better able to use it in the homework.
kayvonf
Question: What are the benefits and drawbacks of this solution compared to the previous implementations of the same problem? Consider all the metrics we think about when optimizing code for modern parallel computers:
Amount of parallelism
Amount of synchronization
Data locality and arithmetic intensity
Amount of work performed (in particular, compared to a sequential implementation)
ghawk
Amount of parallelism: The parallelism in this solution is high, i.e. every thread launched works on a different particle in parallel. The parallelism is the same as in Solutions 2 and 3, and higher than in Solution 1.
Synchronization: The need for locks is eliminated here, because there is no contention when updating grid_index, cell_starts, or cell_ends. Hence, it's a better solution in terms of synchronization than Solutions 2, 3, or 4.
Data locality and arithmetic intensity: Each thread in Step 1 computes grid_index for a single particle, so there is a large communication overhead (and a high bandwidth requirement). I think the data locality is low, and since each particle's computation runs in a separate thread (with that particle's data transferred to the thread), the arithmetic intensity would be low as well.
Amount of work performed: The amount of work is high compared to a sequential implementation, because we sort the particles' grid indices after computing them in order to populate cell_starts and cell_ends. But at the same time, this solution is massively parallel and would still outperform a sequential solution in terms of runtime.
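To make the three steps under discussion concrete, here is a hypothetical sketch (not the course's actual code) of the sort-based grid construction, written sequentially in Python. On a GPU each step would be a data-parallel kernel with one thread per particle; the grid dimensions and cell size below are assumptions for illustration.

```python
GRID_DIM = 4            # assumed: a GRID_DIM x GRID_DIM uniform grid over [0,1)^2
CELL = 1.0 / GRID_DIM   # assumed cell width

def build_grid(positions):
    n = len(positions)

    # Step 1: each "thread" computes the grid cell index of one particle.
    grid_index = [int(x // CELL) + GRID_DIM * int(y // CELL)
                  for (x, y) in positions]

    # Step 2: sort particle ids by grid index (on a GPU, a parallel sort).
    order = sorted(range(n), key=lambda i: grid_index[i])
    sorted_index = [grid_index[i] for i in order]

    # Step 3: each "thread" i compares its cell to that of particles i-1
    # and i+1 and, at a boundary, writes the start/end of its cell.
    # No locks are needed: every thread writes distinct entries.
    cell_starts = [0] * (GRID_DIM * GRID_DIM)
    cell_ends = [0] * (GRID_DIM * GRID_DIM)
    for i in range(n):
        c = sorted_index[i]
        if i == 0 or sorted_index[i - 1] != c:
            cell_starts[c] = i
        if i == n - 1 or sorted_index[i + 1] != c:
            cell_ends[c] = i + 1   # exclusive end
    return order, cell_starts, cell_ends
```

The extra work relative to a sequential bucketing pass is the sort in Step 2, which is exactly the trade noted above: more total work in exchange for eliminating fine-grained synchronization.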
kayvonf
Great answer @ghawk. I do claim there is still synchronization in this example, but not fine-grained synchronization. Where is it? Who wants to chip in?
ghawk
@kayvonf, are you referring to the synchronization required at the end of every step? I guess we need to wait for all threads to finish their computation in Step 1 before proceeding to Step 2, and we encounter the same situation before proceeding to Step 3. This is because grid_index (the output of Step 1) is also the input of Step 2, which requires all of its entries to be populated with the correct grid indices before the sort can take place.
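This coarse-grained synchronization can be illustrated with a small sketch (an assumption of mine, not from the lecture): a barrier after each bulk step. On a GPU the barrier is implicit between dependent kernel launches; here it is made explicit with `threading.Barrier`.

```python
import threading

N_WORKERS = 4
barrier = threading.Barrier(N_WORKERS)
log = []                    # records (step, worker) events in order
log_lock = threading.Lock()

def worker(tid):
    with log_lock:
        log.append(("step1", tid))   # e.g. compute grid_index entries
    barrier.wait()                   # all of Step 1 must finish...
    with log_lock:
        log.append(("step2", tid))   # ...before the sort (Step 2) begins

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every Step-1 event precedes every Step-2 event in the log.
assert all(s == "step1" for s, _ in log[:N_WORKERS])
```

This is the key distinction: there is no per-element locking anywhere, but there are a handful of barriers, one between each pair of dependent bulk steps.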