Now that the quiz is turned in, it's interesting to note that while this was presented as helpful for the homework, it also came in handy for Q2 on the quiz. I'm not sure if this was intentional. It occurs to me that maybe this was done so that we'd be more familiar with the concept, to be better able to use it in the homework.
kayvonf
Question: What are the benefits and drawbacks of this solution compared to the previous implementations of the same problem? Consider all the metrics we think about when optimizing code for modern parallel computers:
Amount of parallelism
Amount of synchronization
Data locality and arithmetic intensity
Amount of work performed (in particular, compared to a sequential implementation)
ghawk
Amount of parallelism: The parallelism in this solution is high, i.e. every thread launched works on a different particle in parallel. The parallelism is the same as in Solutions 2 and 3, and higher than in Solution 1.
Synchronization: The need for locks is eliminated here, because there is no contention when updating grid_index, cell_starts, or cell_ends. Hence, it's a better solution in terms of synchronization than Solutions 2, 3, or 4.
Data locality and arithmetic intensity: Each thread in Step 1 computes grid_index for a single particle, so there is a large communication overhead (and a high bandwidth requirement). I think the data locality is low, and since each particle's computation runs in a separate thread (with that particle's data transferred to the thread), the arithmetic intensity would be low as well.
Amount of work performed: The amount of work is high compared to a sequential implementation, because we sort the particles' grid indices after computing them in order to populate cell_starts and cell_ends. But at the same time, this solution is massively parallel and would still outperform a sequential solution in terms of runtime.
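To make the three steps under discussion concrete, here is a hypothetical sketch (not the course's actual code) of the sort-based grid construction, written sequentially in Python. On a GPU each step would be a data-parallel kernel with one thread per particle; the grid dimensions and cell size below are assumptions for illustration.

```python
GRID_DIM = 4            # assumed: a GRID_DIM x GRID_DIM uniform grid over [0,1)^2
CELL = 1.0 / GRID_DIM   # assumed cell width

def build_grid(positions):
    n = len(positions)

    # Step 1: each "thread" computes the grid cell index of one particle.
    grid_index = [int(x // CELL) + GRID_DIM * int(y // CELL)
                  for (x, y) in positions]

    # Step 2: sort particle ids by grid index (on a GPU, a parallel sort).
    order = sorted(range(n), key=lambda i: grid_index[i])
    sorted_index = [grid_index[i] for i in order]

    # Step 3: each "thread" i compares its cell to that of particles i-1
    # and i+1 and, at a boundary, writes the start/end of its cell.
    # No locks are needed: every thread writes distinct entries.
    cell_starts = [0] * (GRID_DIM * GRID_DIM)
    cell_ends = [0] * (GRID_DIM * GRID_DIM)
    for i in range(n):
        c = sorted_index[i]
        if i == 0 or sorted_index[i - 1] != c:
            cell_starts[c] = i
        if i == n - 1 or sorted_index[i + 1] != c:
            cell_ends[c] = i + 1   # exclusive end
    return order, cell_starts, cell_ends
```

The extra work relative to a sequential bucketing pass is the sort in Step 2, which is exactly the trade noted above: more total work in exchange for eliminating fine-grained synchronization.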
kayvonf
Great answer @ghawk. I do claim there is still synchronization in this example, but not fine-grained synchronization. Where is it? Who wants to chip in?
ghawk
@kayvonf, are you referring to the synchronization required at the end of every step? I guess we need to wait for all threads to finish their computation in Step 1 before proceeding to Step 2, and we encounter the same situation before proceeding to Step 3. This is because grid_index (the output of Step 1) is also the input of Step 2, which requires all of its entries to be populated with the correct grid indices before the sort can take place.
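This coarse-grained synchronization can be illustrated with a small sketch (an assumption of mine, not from the lecture): a barrier after each bulk step. On a GPU the barrier is implicit between dependent kernel launches; here it is made explicit with `threading.Barrier`.

```python
import threading

N_WORKERS = 4
barrier = threading.Barrier(N_WORKERS)
log = []                    # records (step, worker) events in order
log_lock = threading.Lock()

def worker(tid):
    with log_lock:
        log.append(("step1", tid))   # e.g. compute grid_index entries
    barrier.wait()                   # all of Step 1 must finish...
    with log_lock:
        log.append(("step2", tid))   # ...before the sort (Step 2) begins

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every Step-1 event precedes every Step-2 event in the log.
assert all(s == "step1" for s, _ in log[:N_WORKERS])
```

This is the key distinction: there is no per-element locking anywhere, but there are a handful of barriers, one between each pair of dependent bulk steps.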