Slide 5 of 52
sbly

Do all instances in a gang finish execution at the same time (because they're all running the same instructions)? Or do they finish at different times, but we have to wait until they're all done for control to return?

kayvonf

@sbly: I'll turn the question back on you. Given what you know about SIMD execution after assignment 1, what do you think?

shabnam

Unless ISPC figures out how to evenly allocate work (which I believe is really hard), I don't think it's possible. In my opinion, all of them do not end at the same time.

moon

@sbly I highly doubt all instances complete simultaneously. Just like our simulated "programs" in lecture 1, we had the case where one person finished his job early and waited, while another person worked until the very end. Everyone was running the "same instructions" (adding numbers together), but some people had more numbers to add than others.

Similarly, in this case, all threads are running the same instructions, but some instances have to calculate more, and take longer, than others. Thus, any threads which have completed their job wait until all of the other threads have also completed, upon which they return the answer.

kkz

As I understand it, each thread is running identical logic on 8-wide vector data, which is manipulated as a single entity. Each ALU then carries out identical operations during each clock cycle, so the threads should finish on the same clock cycle. If I'm completely wrong on this, can someone please correct me? :(

edit: assuming an appropriate SIMD implementation on suitable hardware.

uhkiv

I think it depends on the code. If some instances are given way more work than other instances, they will not finish at the same time; but if the workload is distributed approximately evenly among the instances, they will finish at approximately the same time. Hence, I think the key is distributing the work evenly, so that one instance (i.e., one index in the loop) does not have much more work to do than the others.

Please correct me if I'm wrong, since I'm also slightly confused about the notion.

mchoquet

@kkz is right about this one. SIMD stands for single instruction, multiple data; all the instances in the gang are running the same instruction stream (not just duplicate streams but actually the same one), so they are always executing the same instructions at the same time. For this case, they'd load a batch of numbers, compute sin on all of them in lockstep, and write those results back out again. In the case where the numbers take different amounts of work to compute, the instructions are still sent to all the ALUs, but the outputs are thrown away for those ALUs that didn't need to do that work. A diagram is here.

squidrice

After finishing the extra credit of the sqrt SSE version, I think a gang of instances on a single core actually maps onto the vector lanes of that core, with each instance occupying one lane. Every lane runs exactly the same operation each cycle. At this level, "busy" and "idle" are distinguished by whether a lane is computing useful data (physically, they are all busy). The invalid results are discarded by the mask once all lanes finish their jobs.

analysiser

I think whether execution finishes simultaneously depends not only on running the same instructions, but also on the data: some inputs cause instructions to take more time to execute. This could be observed very clearly in problem 4 of assignment 1. If the input data is not evenly distributed, some program instances take longer to finish.

aew

As @kkz and @mchoquet mentioned, all instances complete at the same time because they are executing the same instructions. When different inputs require different amounts of work, instances whose input requires less work remain idle. As @analysiser said, this was very apparent in assignment 1. In the first question, uneven distribution of the workload also limited the speedup achievable through parallelization.