During lecture, you mentioned that programs written for ARM need to set up explicit barriers to enforce memory ordering. I assume the need for barriers on ARM is much higher than on x86. Since a barrier is an expensive operation, how does ARM deliver performance that can match some x86 processors? Is it by trading off I/O latency with barrier latency?
Also, with such a relaxed memory model, does ARM perform better (than x86) on computation-heavy parallel tasks with some I/O loads sprinkled on top?
@aeu: My understanding is that the explicit barriers are actually helping ARM in this case. x86 essentially needs to assume the existence of a barrier around every memory operation that might violate its ordering guarantees. In fact, I wouldn't be surprised if the CPU created uop barriers in certain cases. Of course, the CPU might have silicon dedicated to eliding these, so in practice the only real difference might be that the ARM approach requires less energy and silicon complexity to achieve the same performance.
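To make the barrier discussion concrete, here's a minimal sketch of the classic message-passing pattern where a relaxed model like ARM's needs explicit ordering. This uses Java's `VarHandle` fence API (Java 9+); the class and field names are illustrative, not anything from the lecture. On ARM, these fences compile down to real barrier instructions (e.g. `dmb`), while on x86 they are close to free because the hardware already orders ordinary loads and stores this way:

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicBoolean;

public class FenceDemo {
    static int data;                                       // the "payload"
    static final AtomicBoolean ready = new AtomicBoolean(false);

    static int run() throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;                 // plain write of the payload
            VarHandle.releaseFence();  // order payload write before the flag write
            ready.setOpaque(true);     // publish the flag
        });
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready.getOpaque()) { } // spin until the flag becomes visible
            VarHandle.acquireFence();      // order flag read before payload read
            seen[0] = data;                // now guaranteed to observe 42
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(FenceDemo.run());
    }
}
```

Without the two fences, a relaxed architecture would be allowed to reorder the payload write past the flag write (or the payload read before the flag read), and the reader could see a stale `data`.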
It's interesting that ARM has such a relaxed consistency model. In my experience programming on ARM processors (e.g., Java on Android), I haven't had to take this into account. Do higher-level programming languages make heavy use of memory fences in their ARM compilers and interpreters, in order to make the system appear sequential?
@jrgallag, higher-level programming languages may define explicit memory models of their own, such as Java's (https://en.wikipedia.org/wiki/Java_memory_model). This means you, the programmer, don't have to account for the hardware's model when writing your code. And so, yes, the compilers and interpreters are responsible for ensuring that the program observes the language's memory model even when the architecture provides a weaker one.
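As a sketch of what that looks like at the language level (class and field names are illustrative): under the Java Memory Model, a `volatile` write happens-before any subsequent `volatile` read of the same field, so marking only the flag `volatile` is enough to publish the plain payload. On ARM, the JIT inserts the necessary barrier instructions around the volatile accesses; the source stays architecture-neutral:

```java
public class VolatileDemo {
    static int payload;                 // plain field, no fences in source
    static volatile boolean ready;      // volatile flag: JIT emits barriers on ARM

    static int run() throws InterruptedException {
        Thread writer = new Thread(() -> {
            payload = 7;    // plain write
            ready = true;   // volatile write publishes it (happens-before)
        });
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) { }  // volatile read: guaranteed to eventually see true
            seen[0] = payload;  // guaranteed to see 7, per the JMM
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(VolatileDemo.run());
    }
}
```

Note that if `ready` were a plain field, the JMM would permit the reader to spin forever or to see `payload == 0`, which is exactly the class of bug the language-level model shields you from.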