armchair_architect
Newcomer
In the terminology you use here, are the manual shared-memory management and coalescing requirements of CUDA part of the programming model, or just of NV's current implementation?
Coalescing would be implementation. You could relax those rules significantly and (a) existing code would still run without change, and (b) apps that do uncoalesced accesses would automatically go faster. Whether global memory is cached or not is also in this category.
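A rough sketch of the point being made, assuming the strict coalescing rules of early CUDA hardware (the kernel names and shapes are illustrative, not from the post). The same source compiles unchanged whether or not the hardware enforces strict coalescing; only the number of memory transactions, and hence the speed, differs:

```cuda
// Two hypothetical copy kernels. On strict-coalescing hardware, the first
// turns each warp's loads into one transaction; the second generates many.
// Relaxed or cached implementations run both unchanged, just faster on the
// second one -- which is why coalescing is an implementation property.

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];              // adjacent threads read adjacent words
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n]; // scattered reads: uncoalesced on
                                       // strict hardware, fine on relaxed
}
```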
The on-chip shared memory is part of the programming model. You can implement it with normal cache and good cache eviction policy controls (like Larrabee apparently has). But your code has to be written to deal with it explicitly or you get no benefit, and it affects your algorithms and data structures. So I'd call it part of the programming model.
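To make the contrast concrete, here is a minimal sketch (mine, not from the post) of what "written to deal with it explicitly" means: the block cooperatively stages data into `__shared__` memory and synchronizes before reusing it, structure that a transparent cache would never require of the source code:

```cuda
// Illustrative kernel: each block reverses a tile of the input. The staging
// into shared memory and the barrier are explicit parts of the algorithm --
// this is what makes on-chip shared memory a programming-model feature.

#define TILE 256

__global__ void reverse_tiles(const float *in, float *out)
{
    __shared__ float tile[TILE];       // explicitly managed on-chip storage

    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = in[i];         // cooperative load: one word per thread
    __syncthreads();                   // every thread waits until the tile is full

    // Each thread now reads a word that a *different* thread loaded;
    // without the explicit staging and barrier, this would be incorrect.
    out[i] = tile[TILE - 1 - threadIdx.x];
}
```

With a cache plus good eviction-policy controls (the Larrabee direction mentioned above), the staging could become implicit, but the data structures would still need to be blocked to fit the fast on-chip store.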
I wouldn't be upset if shared memory evolved in the Larrabee direction. The important thing is to be able to guarantee very low latency and very high bandwidth to a chunk of data used by a group of cooperating threads. Generic caching isn't good enough, IMHO.