In this example, the wrapper function parallel_sinx() breaks up the work and parallelizes some of it by creating a new thread that runs sinx() on the first $N/2$ numbers, while the main thread runs sinx() on the last $N/2$ numbers. Thus the potential speedup is at most 2x, which is not very good considering this problem could be decomposed further into $N$ independent sinx() calculations running in parallel.
DanceWithDragon
It depends on how many cores you have. With only two cores, you get the best speedup when the work is distributed equally between them; in that case, adding more threads would only introduce overhead.
yingchal
But later in this lecture, the professor also mentions that sometimes we need multiple threads per core to hide stalls.
kayvonf
@yingchal: You are absolutely correct. However, you should also consider the above workload. It reads one value from memory (x[i]), then performs a lot of math, then writes one value out to memory (result[i]). So you can assume the arithmetic intensity of the program is pretty high. In fact, in this example the parameter terms determines the arithmetic intensity of the program. If terms is reasonably large, performance is going to be largely determined by the rate at which the processor can do math. The potential stall due to the load of x[i] is only going to happen once per cache line, and it's not going to affect the program much.
(Advanced comment: this is such an easy access pattern that any modern CPU prefetcher will likely have brought the data into the cache by the time it is accessed.)