The difference between atomics and cache coherence is that atomics are explicit and visible in the code you write, whereas cache coherence introduces invisible performance bugs.
Nah, they're the same. An atomic call may be cheap or extremely expensive... you can't tell by looking at the code. Same with a memory operation in a coherent cache, except that case is slightly less bad than an atomic.
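To make that concrete (a minimal, hypothetical C++ sketch; the thread counts and names are made up): both loops execute the exact same fetch_add, but the contended one can be an order of magnitude slower, and nothing in the source tells you which case you're in.

```cpp
// Sketch: the same fetch_add is cheap or expensive depending on
// contention, and the code looks identical either way.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 8;
constexpr int kIters   = 1'000'000;

std::atomic<long> shared_counter{0};

int main() {
    // Contended: every thread increments the same counter, so each
    // fetch_add must first win exclusive ownership of that cache line.
    {
        std::vector<std::thread> ts;
        for (int t = 0; t < kThreads; ++t)
            ts.emplace_back([] {
                for (int i = 0; i < kIters; ++i)
                    shared_counter.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& th : ts) th.join();
    }

    // Uncontended: identical instruction, but each thread owns its own
    // cache-line-padded counter, so the atomic is nearly free.
    struct alignas(64) Padded { std::atomic<long> v{0}; };
    std::vector<Padded> local(kThreads);
    {
        std::vector<std::thread> ts;
        for (int t = 0; t < kThreads; ++t)
            ts.emplace_back([&local, t] {
                for (int i = 0; i < kIters; ++i)
                    local[t].v.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& th : ts) th.join();
    }

    long sum = 0;
    for (auto& c : local) sum += c.v.load();
    std::printf("contended total = %ld, partitioned total = %ld\n",
                shared_counter.load(), sum);
}
```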
We need architectures and programming models that push people in the direction of scalable parallel code. Cache coherence is wrong because it facilitates bad code.
So do atomics. So does shared memory. So does basically everything that deviates from the happy world of completely independent stream programming. Sadly, while these things sacrifice performance and don't encourage programmers to write "scalable code", they're also useful, as the past 5 years have shown.
Too many programmers parallelize their code and forget to parallelize their data structures because of cache coherence. That's why I consider it a bug, not a feature.
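Roughly what I mean (a hypothetical C++ sketch, not anyone's actual code): the first version "parallelizes the code" but still funnels every update through one shared table, so coherence traffic quietly eats the speedup; partitioning the bins per thread and merging at the end is the step people skip.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kBins = 256;

// Parallelized code, unparallelized data: every thread scatters into the
// same shared bins, and coherence shuttles those lines between cores.
void histogram_shared(const std::vector<unsigned char>& data,
                      std::vector<std::atomic<unsigned>>& bins,
                      std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        bins[data[i]].fetch_add(1, std::memory_order_relaxed);
}

// Parallelized data: each thread fills a private histogram, and the
// partial results are reduced once at the end.
std::vector<unsigned> histogram_partitioned(const std::vector<unsigned char>& data,
                                            unsigned num_threads) {
    std::vector<std::vector<unsigned>> partial(num_threads,
                                               std::vector<unsigned>(kBins, 0));
    std::vector<std::thread> ts;
    std::size_t chunk = data.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        ts.emplace_back([&, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                ++partial[t][data[i]];
        });
    }
    for (auto& th : ts) th.join();

    std::vector<unsigned> bins(kBins, 0);
    for (auto& p : partial)
        for (std::size_t b = 0; b < kBins; ++b)
            bins[b] += p[b];
    return bins;
}
```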
Far more don't even parallelize their code until they have to... it's the same deal. People will optimize stuff when they need to, and it's far better to have code always *work* and then optimize for parallelism and caches than it is to have the correctness of the code effectively depend on those optimizations (which, I will note, are only necessary for a subset of the code you're writing).
From a software developer's point of view, the opposite is certainly true: once you have partitioned your data structures for parallel execution on a non-coherent processor, it's easy to get parallel scalability on a processor with coherent caches.
Right, but we're talking about code that cannot be expressed efficiently without coherent caches. Same deal with shared memory, same deal with atomics, etc. Vertex scatter histograms and parallel reductions written in DX9 all run on current hardware, but not as well as versions that use atomics and shared memory.
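For context, this is the kind of thing being talked about (a hypothetical C++ stand-in for the GPU versions, not the DX9 code itself): a reduction where each worker folds its chunk locally and then publishes it with one atomic add, instead of the multi-pass ping-pong you'd write without atomics.

```cpp
#include <atomic>
#include <vector>

// Each worker folds its own chunk into a private sum, then publishes it
// with a single atomic add; the combine step needs no extra passes over
// memory and no render-target tricks.
void reduce_chunk(const std::vector<int>& chunk, std::atomic<long long>& total) {
    long long local = 0;
    for (int x : chunk)
        local += x;
    total.fetch_add(local, std::memory_order_relaxed);
}
```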
My experience programming with non-coherent caches is that they are practical, useful, and perform pretty well.
Sure, but I'd say the same thing for coherent caches.
Again, thinking about just the non-coherent cache cases predisposes you to not think about the problems that can't be expressed efficiently in that model. For instance, sparse, many/wide-bin histograms with local, data-dependent coherence are hard to express efficiently on current GPUs, Fermi included, but work well with coherent caches. You basically can't make use of shared memory/caches at all in this case unless they are coherent. You really want the hardware to efficiently move bins/cache lines around, which is precisely what coherent caches do, and this case is not uncommon.
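Something like this, roughly (a hypothetical C++/CPU sketch of the shape of the problem, not actual GPU code):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Worker for one thread's slice of the input. `bins` is one shared table
// with millions of entries, so privatizing a copy per thread (the usual
// non-coherent recipe) would blow past any on-chip memory, and which bins
// are hot is data-dependent and only known at run time.
void sparse_histogram_worker(const std::vector<std::size_t>& keys,
                             std::size_t begin, std::size_t end,
                             std::vector<std::atomic<unsigned>>& bins) {
    for (std::size_t i = begin; i < end; ++i)
        // With coherent caches, a bin line that one thread keeps hitting
        // stays resident at that core and migrates only when the data
        // sends it elsewhere; without coherence, software has to route
        // every one of these updates by hand.
        bins[keys[i]].fetch_add(1, std::memory_order_relaxed);
}
```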