Question: Consider the sinx application used throughout the lecture. Now imagine running a parallel version (parallelized over outer loop iterations) of that program on this processor on an input of 10,240 elements (4 x 2560). In your answer you can assume that we compute the sinx approximation out to many terms, so that the computation is highly compute bound. Do you think this workload will be able to run on this processor at the processor's peak rate? Explain why or why not.

asd

I believe that we can run the sinx workload at the processor's peak rate.
This is because, if each core can execute 32x4 = 128 elements at one time, we would need 2560 / (32x4) = 20 cores for 2560 elements. This means that the first 2560 elements would be scheduled across all 20 cores and computed simultaneously.
10,240 is 4 times that number.
The remaining elements can be interleaved on the same cores, since we have execution contexts available. Interleaving hides memory latency, and because each element is computationally intensive, the cores stay well utilized.
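
The arithmetic in this answer can be checked with a short sketch. The figures below (32-wide SIMD, 4 threads executing at once per core) are read off the answer itself rather than from a processor specification, and the variable names are hypothetical.

```python
# Assumed figures, taken from the answer above (not from a datasheet).
SIMD_WIDTH = 32        # elements processed by one vector instruction
THREADS_AT_ONCE = 4    # threads a single core executes simultaneously
FIRST_BATCH = 2_560    # elements in one "round" of work
NUM_ELEMENTS = 10_240  # total input size

# Elements a single core can work on at one time.
elements_per_core = SIMD_WIDTH * THREADS_AT_ONCE

# Cores needed to cover one batch of 2560 elements at once.
cores_needed = FIRST_BATCH // elements_per_core

# Number of such batches in the full input.
batches = NUM_ELEMENTS // FIRST_BATCH

print(elements_per_core, cores_needed, batches)
```

Under these assumptions, one batch exactly fills every core, and the full input is a whole number of batches, so no core is left partially loaded.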

preritrodney

Yes, I think the processor will run at its peak rate. This is because the work will be evenly distributed across the cores, so each core will run 10240/(20 x 32) = 16 threads, of which 4 will run simultaneously. The remaining 12 threads on each core will be interleaved for latency hiding. This will work because the computation is highly compute bound, so memory latency by itself will not be a big problem.
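
As a rough check of the thread counts in this answer, here is a minimal sketch. It assumes 20 cores, 32 elements per thread (one SIMD-width chunk), and 4 threads running at once per core, all taken from the answer above; the names are hypothetical.

```python
# Assumed figures, taken from the answer above (not from a datasheet).
NUM_CORES = 20
SIMD_WIDTH = 32        # elements handled by one thread per vector instruction
THREADS_AT_ONCE = 4    # threads a core executes simultaneously
NUM_ELEMENTS = 10_240

# One thread per SIMD-width chunk of the input.
total_threads = NUM_ELEMENTS // SIMD_WIDTH

# Threads assigned to each core when work is split evenly.
threads_per_core = total_threads // NUM_CORES

# Threads per core that are not running at a given instant and are
# therefore available for interleaving.
waiting_threads = threads_per_core - THREADS_AT_ONCE

# True when the work splits with no remainder, so no core sits idle.
evenly_divided = NUM_ELEMENTS % (NUM_CORES * SIMD_WIDTH) == 0

print(total_threads, threads_per_core, waiting_threads, evenly_divided)
```

Whether a core can actually hold all 16 threads at once depends on how many execution contexts it has, which this sketch does not model.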
