Question: What sort of workloads do you think would have coherent/divergent execution? What might you do to improve the level of instruction stream coherence?
Even if you don't have access to vectorized instructions, could such measures to improve coherence still lead to performance gains? If so, what aspect of the computer architecture are you taking advantage of?
@arjunh I'm not sure what you mean by "workload", but to improve the level of instruction stream coherence you would want to avoid a lot of casing, such as ifs and switches. If you absolutely have to use them, then make sure your data will be split evenly among the cases.
@top By 'workload', I mean the nature of the data set that is fed to the program and the task to be performed. The data set could be an image, a database file, or something else. The task could be an image-processing technique (such as brightening the image) or a database operation.
Can instructions be interpreted as coherent if they can be reduced to each other? For instance, what if an instruction to add the floating-point numbers 1.000 + 2.000 needs to be executed in the proximity of an integer add instruction 1 + 2? Can the floating-point instruction be interpreted as an integer instruction to make use of the same type of arithmetic logic units? In general, if a chip has special execution units that are idling, can the processor transform instructions to other types if they can be proven to execute identically (i.e., floating-point instructions that would compute erroneous answers would not be transformed, because that would change the result of the floating-point arithmetic)?
@BryceToTheCore I'd say no; the units used for integer arithmetic (arithmetic logic units, or ALUs) are very different from those used for floating-point arithmetic (floating-point units, or FPUs). In general, floating-point arithmetic is substantially more expensive and difficult to implement properly. In fact, there was a very (in)famous bug in the Intel P5 Pentium FPU back in the 90s; see this for more details. The recall of the chip cost nearly $500 million.
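To make the ALU/FPU difference concrete, here is a minimal sketch (not from the thread) that feeds the IEEE 754 single-precision bit patterns of 1.0 and 2.0 through a plain integer add, which is roughly what an integer ALU would do with them. The helper names `float_bits` and `bits_to_float` are made up for this illustration.

```python
import struct

def float_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits: int) -> float:
    """Reinterpret the low 32 bits of an integer as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

one = float_bits(1.0)    # 0x3f800000
two = float_bits(2.0)    # 0x40000000
three = float_bits(3.0)  # 0x40400000

# Integer addition of the raw bit patterns, then reinterpret as a float.
integer_sum = bits_to_float(one + two)

print(hex(one), hex(two), hex(three))
print(integer_sum)  # inf: 0x7f800000 is the encoding of +infinity, not 3.0
```

The integer add lands on the bit pattern for +infinity rather than 3.0, because a floating-point add has to align exponents and renormalize the mantissa instead of just adding the raw bits.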
@arjunh Thanks for the comment! I suppose it is a nontrivial task to convert floating-point numbers to integers at the hardware level within a clock cycle, and it would likely need a lot of extra hardware to determine whether such a conversion would be safe or not. It is probably for the best that the logic units stall instead of performing these checks every cycle.
@top I agree that you want to avoid casing in order to achieve greater coherence. However, I'm not sure that splitting data evenly among cases is the best way to do this. If the data is split among the cases evenly, we are much more likely to have ALUs that are not operating at their maximum capacity, since it's much more likely that they'll have to enter various parts of each case as opposed to entering only the if or only the else (if it's a simple if/else statement). I would think that it is more important to try to group your data so that each group enters the same case, so you can use as many of your ALUs at once as possible.
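Here is a toy model of the grouping argument above, as a rough sketch rather than a measurement. It assumes 8-wide lanes, that a group whose lanes disagree must run both branch bodies with some lanes masked off, and that both bodies cost the same; `passes_needed` is a hypothetical helper name.

```python
def passes_needed(takes_if, width=8):
    """Count branch-body passes for a list of booleans, one per data element.

    Toy model: elements are processed in groups of `width` lanes. A group
    whose lanes all agree runs one branch body; a mixed group runs both
    bodies, each with part of the lanes masked off.
    """
    total = 0
    for start in range(0, len(takes_if), width):
        group = takes_if[start:start + width]
        total += len(set(group))  # 1 if coherent, 2 if divergent
    return total

# Both inputs have an exactly even split between the two cases.
interleaved = [i % 2 == 0 for i in range(64)]  # cases alternate element by element
grouped = sorted(interleaved)                  # same elements, grouped by case

print(passes_needed(interleaved))  # 16: every group of 8 is divergent
print(passes_needed(grouped))      # 8: every group of 8 is coherent
```

In this model the even split itself is not the problem; what matters is whether elements that take the same case sit next to each other.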
Here's something that you might find interesting: it turns out that many operations (such as checking which values are within a certain range) perform significantly better when given a sorted input instead of an unsorted one. This thread gives a particularly good explanation of why this happens (and further proof of why it's vital to know about the architecture of the machines you will write code for).
For your second question, I guess we could still see performance gains. We would be taking advantage of the branch predictor, because we have fewer branches or more uniform (and therefore more predictable) branches?
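The branch-predictor effect can be sketched with a toy simulation. This assumes a single 2-bit saturating counter per branch, which is a textbook simplification and not how any particular real CPU predicts; the threshold of 128 and the name `mispredictions` are made up for the example.

```python
import random

def mispredictions(outcomes):
    """Count mispredictions of a single 2-bit saturating counter.

    States 0 and 1 predict not-taken; states 2 and 3 predict taken.
    The counter moves one step toward the actual outcome after each branch.
    """
    state = 0
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

random.seed(0)
data = [random.randrange(256) for _ in range(10_000)]

# Same branch condition and same data, only the order differs.
unsorted_misses = mispredictions([x >= 128 for x in data])
sorted_misses = mispredictions([x >= 128 for x in sorted(data)])

print(unsorted_misses, sorted_misses)
```

With sorted input the branch outcome flips only once, so this simple predictor misses just a couple of times while it retrains, whereas on shuffled input it is wrong a large fraction of the time. That gain needs no vector instructions at all.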