Throughout the lecture we examined several design strategies that work particularly well for certain workloads (for example, SIMD for workloads with high instruction stream coherence, and multi-threading to reduce stalling). Here and in the previous slide, we combine all of these ideas in a single many-core processor. Is it the case that these design decisions are rarely in conflict, or is balancing them to achieve good performance a more delicate issue than it first appears?
@illuminated There are two main reasons why design decisions can come into conflict: 1) cost constraints and 2) space constraints.
If building a chip with SIMD capability costs X, adding more fetch/decode units and execution contexts for multithreading also costs X, and you only have X dollars remaining in your budget, you have to choose one of the two to implement in your chip.
Furthermore, as you can see from this slide, hardware support for multithreading and SIMD takes up more space on the chip, and that space is limited. Especially for small devices like smartphones, you often have to make design decisions based on space constraints.
If we use ispc on a processor with the above architecture, with more than 8 tasks and SIMD, would 8 tasks be scheduled at a time, such that 4 do vector (SIMD) operations and 4 do scalar operations on the single execution unit?
Thus our maximum possible speedup would be (4*8 + 4)X in that case?
Is this analysis correct?
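To make the arithmetic in the question explicit, here is a minimal sketch of the estimate. It takes the question's reading of the slide at face value: 4 tasks on 8-wide SIMD units and 4 tasks on scalar units, all busy every clock. The function name and parameters are hypothetical, and the result is only a peak figure relative to one scalar ALU, not a measured speedup.

```python
def peak_speedup(vector_tasks, simd_width, scalar_tasks):
    """Peak operations per clock relative to a single scalar ALU.

    Assumes (as the question does) that every SIMD lane does useful work
    and every scalar unit has an independent operation each clock.
    """
    return vector_tasks * simd_width + scalar_tasks


# The question's numbers: 4 vector tasks, 8 lanes each, plus 4 scalar tasks.
print(peak_speedup(4, 8, 4))  # 36
```

Any divergence between lanes, or any stall that is not hidden by another ready task, would pull the real number below this bound.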
If we only need to use a single instruction stream, is there any difference between using 2 cores with 1 fetch/decode, 1 execute, 1 execution context each vs. 1 core with 2 fetch/decode, 2 execute, 2 execution contexts?
@huehue Yes. A single core with 2 fetch/decode units can exploit ILP within that single instruction stream (assuming these 2 fetch/decode units are capable of finding independent instructions within one instruction stream). If we had a 2-core machine, we would have to pick one core to run the single instruction stream on. Within that core, we can only run one instruction per clock, because it has only 1 fetch/decode unit and 1 execution unit.
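As a rough illustration of that difference, here is a toy model that counts clocks for one instruction stream at issue width 1 versus 2. Everything in it is an assumption made for the sketch: instructions issue in program order, each takes one clock, and `cycles_to_run`, `deps`, and the example streams are hypothetical names, not anything from the lecture.

```python
def cycles_to_run(deps, issue_width):
    """Count clocks to run one instruction stream in program order.

    deps[i] is the set of earlier instruction indices whose results
    instruction i needs. Every instruction takes one clock (assumption).
    """
    done_at = {}      # instruction index -> clock in which it executed
    clock = 0
    next_instr = 0
    while next_instr < len(deps):
        clock += 1
        issued = 0
        # Issue in order until the width is used up or a dependency blocks.
        while (issued < issue_width and next_instr < len(deps)
               and all(done_at[d] < clock for d in deps[next_instr])):
            done_at[next_instr] = clock
            next_instr += 1
            issued += 1
    return clock


# a = x*x; b = y*y; c = a+b; d = z*z; e = c+d; f = w*w
some_ilp = [set(), set(), {0, 1}, set(), {2, 3}, set()]
# Each instruction depends on the one before it: no ILP to find.
chain = [set(), {0}, {1}, {2}]

print(cycles_to_run(some_ilp, 1), cycles_to_run(some_ilp, 2))  # 6 3
print(cycles_to_run(chain, 1), cycles_to_run(chain, 2))        # 4 4
```

The second stream shows the caveat in the answer: with no independent instructions, the extra fetch/decode and execution unit buy nothing.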