Slide 33 of 57

Sometimes I wonder whether it is worth adding so much complex logic just to guarantee 100% correctness of a program in rare situations. To even come up with a case that breaks correctness, we sometimes have to be extremely adversarial. I feel that in some domains it is okay for the program to be correct "most of the time," and then we could save a lot of additional hardware logic.


The issue with reasoning about situations being rare is that at large scale, "rare" things become commonplace.

Let's say there's a hardware concurrency bug that results in corrupted state in one out of every trillion instructions, on average.

Each Google search probably results in at least 1,000 instructions being executed, and Google gets around 3.5 billion searches per day. So, in this hypothetical, we'd expect the bug to show up at least 3 times per day, on Google search alone. Sometimes the effects of the bug might not be too terrible, but other times they might cause security vulnerabilities, crashes, or state that gets corrupted in important, hard-to-debug ways.
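As a sanity check of those numbers, here is the back-of-the-envelope arithmetic spelled out (all inputs are the hypothetical estimates from the comment above, not measurements):

```python
# Hypothetical estimates from the comment above, not measured values.
searches_per_day = 3.5e9          # rough global Google search volume
instructions_per_search = 1_000   # conservative lower bound per search
bug_rate = 1 / 1e12               # one corruption per trillion instructions

instructions_per_day = searches_per_day * instructions_per_search
expected_bugs_per_day = instructions_per_day * bug_rate
print(expected_bugs_per_day)  # 3.5 expected corruptions per day
```

So a one-in-a-trillion event, at this scale, happens a few times every single day.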

(For a non-concurrency example, a hardware defect called Rowhammer was recently found to allow a very nasty privilege escalation exploit: Google's Project Zero blog has all of the technical details, and Ars Technica has a shorter, slightly less technical overview.)


There are certainly many domains in which approximation or high-probability correctness is sufficient, but given the high complexity of many parallel programs, guaranteed correctness is pretty important: you wouldn't want to run a huge job for hours only to find its output is wrong. That said, we can be a lot more flexible on the programming side (which is usually much easier to change than hardware), which motivates loosening consistency.


Do you mean the hardware should only be correct usually, or the software? If there's a hardware bug, then any application that requires utmost accuracy is running on unreliable hardware. There are many software applications that can return approximate answers very quickly, but all of them must still run on reliable hardware. Imagine a programmer not being able to tell whether their issue is software- or hardware-related!


I remember reading a paper about relaxing correctness guarantees for hardware. The authors claim that adding a small probability of error can make hardware 30 times more power efficient and a couple of times faster. It then all comes down to managing the likelihood of error. In certain applications, small errors are perfectly fine (e.g., approximation algorithms). Other times, you can repeat the computation enough times to be confident your answer is right. Quantum algorithms, for example, almost always give probabilistic results; insisting that all hardware be flawless and always give the correct answer would rule out quantum computers immediately.
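The "repeat the computation" idea can be made quantitative with majority voting: if each independent run errs with probability p, the chance that a majority of n runs are wrong shrinks rapidly as n grows. A minimal sketch (not from any particular paper; the 10% error rate is an illustrative assumption):

```python
from math import comb

def majority_error(p: float, n: int) -> float:
    """Probability that a majority of n independent runs are wrong,
    given each run errs independently with probability p (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Illustrative: with a 10% per-run error rate, repetition drives the
# combined error rate down quickly.
print(majority_error(0.10, 1))   # single run: still 10%
print(majority_error(0.10, 11))  # 11 runs with majority vote: well under 0.1%
```

This is exactly the amplification trick used for probabilistic (e.g., quantum or randomized) algorithms: a modest constant-factor slowdown buys an exponentially smaller failure probability.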


What is the power and hardware overhead to correctly supporting parallelism and coherence? The whole point of creating systems which process independent instruction streams is to avoid the power wall associated with improving single instruction stream performance. It seems that we are able to utilize transistors more effectively at lower power by utilizing parallelism but, as we've seen in the complexity required to support this parallelism, it's not completely free. Is it possible that we have simply pushed off the issue of power by adopting a solution with a lower growth rate? That is, is there a point at which the mechanisms to support parallelism start stressing our power limits? If not, what is the fundamental reason and what does it suggest about future systems?