I feel like maintaining cache coherence might be more trouble than it's worth. Wouldn't we save a lot of die space and clock cycles by using a large shared cache? Is there a point where coherence overhead outweighs what we gain by having private caches?
We would save die space if we removed the private caches (assuming the one large cache takes up less space than the sum of the private caches).
Faster caches are typically more expensive to manufacture, which is perhaps one of the reasons we need a hierarchy of smaller and smaller (but faster and faster) caches. Ideally, if we had infinite money, we could store all of memory in one large shared set of registers or something, but alas, if we increased the size of the cache, we would probably only be able to afford a slower option.
I believe it is possible to construct degenerate programs that interact terribly with cache coherence. The question is whether these programs are useful, and whether they can be rewritten in a more coherence-friendly form.
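One well-known example of such a program is false sharing, where two cores keep writing to different variables that happen to sit on the same cache line. Here is a rough toy model of that, not a real protocol simulator: it assumes a 64-byte line, a simple write-invalidate scheme, and cores that take turns strictly round-robin. The names `LINE_SIZE` and `count_invalidations` are made up for illustration.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; a common size)


def count_invalidations(addresses_by_core, writes_per_core):
    """Count invalidations under a toy write-invalidate protocol.

    Core i repeatedly writes addresses_by_core[i], and the cores take turns
    round-robin. Only one core may hold a line writable at a time, so a write
    by a core other than the current owner invalidates the owner's copy.
    """
    owner = {}  # cache line index -> core currently holding it writable
    invalidations = 0
    for _ in range(writes_per_core):
        for core, addr in enumerate(addresses_by_core):
            line = addr // LINE_SIZE
            if line in owner and owner[line] != core:
                invalidations += 1  # previous owner loses its copy
            owner[line] = core
    return invalidations


# Two per-core counters packed 8 bytes apart land on the same line.
packed = count_invalidations([0, 8], writes_per_core=1000)
# The same counters padded to one line each never touch each other's line.
padded = count_invalidations([0, 64], writes_per_core=1000)
print(packed, padded)
```

In this model the packed layout bounces the line on nearly every write (1999 invalidations for 2000 writes), while the padded layout causes none. So at least this kind of degenerate program can be rewritten just by changing the data layout.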
The benefit of cache coherency versus one large shared cache for processors with a relatively small number of cores is also not so clear to me.
Historically, the benefit has always been better latency. Faster memory is more expensive to make, and it's inherently harder to provide more fast memory, since memory takes up space, and more space means longer circuits (that's my n00b understanding at least - I'm not ECE, not sure how relevant this really is in chip design).
Cache latency specs aren't the easiest thing to find online, but you may want to try running a benchmark on your system to get some idea. I've seen L3 latency stats for different chips (i7s, ARMs, NVIDIA Tegra) ranging from 25-50 cycles, which is quite a lot compared to pipelined ALU ops. A shared L2 cache can reduce this latency for commonly accessed data without implementing coherence, but it still isn't fast enough to keep a core's ALUs busy at full speed. (This would be like running CUDA code where every operation needs to fetch from global memory - very slow!) A 32KB L1 cache may seem small, and comes with all the baggage of coherence, but it will dramatically improve performance for common cases like iterating over an array.
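To put rough numbers on the latency argument, here is a back-of-the-envelope average memory access time (AMAT) calculation. Only the 40-cycle shared-cache latency comes from the 25-50 cycle range above. The 4-cycle L1 latency, the 200-cycle memory latency, and all the hit rates are assumptions picked for illustration, and the model ignores coherence traffic entirely. The helper name `amat` is made up.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (hit_latency, hit_rate) pairs, closest cache first.
    Each hit_rate is local: the fraction of accesses reaching that level
    that hit there. Misses at the last level go to memory.
    """
    total = 0.0
    reach = 1.0  # fraction of all accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * latency  # everything reaching a level pays its latency
        reach *= 1.0 - hit_rate   # only the misses continue outward
    return total + reach * memory_latency


MEMORY = 200  # cycles to DRAM (assumed)

# Private L1 (4 cycles, 95% hits, both assumed) in front of a 40-cycle shared cache.
with_l1 = amat([(4, 0.95), (40, 0.80)], MEMORY)

# One big shared cache only, given the same overall 99% hit rate.
shared_only = amat([(40, 0.99)], MEMORY)

print(with_l1, shared_only)
```

Under these made-up numbers the hierarchy with a private L1 averages about 8 cycles per access, versus about 42 for the shared-only design. That leaves a fair amount of room for coherence overhead before the private cache stops paying for itself.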