Yes, the CPU<->GPU coherence is not yet perfect. However, we haven't seen a problem with that, since we sidestepped it completely by moving the whole graphics engine to the GPU. We don't need tight communication between the rendering and the game logic (physics, AI, etc). Obviously, for general purpose computing it would be a great improvement to get a shared CPU+GPU L3 cache and good cache coherence between the units (without needing to flush caches frequently or use a slower-bandwidth bus). With "units" I mean both the CPU and the various parts of the GPU (such as the front end, vector units, scalar unit, etc). More coherence = fewer flushes needed = fewer stalls.

The cache hierarchy could stand to be improved.
The 2011 implementation GCN still carries is a step above the incoherent read-only pipeline it had before, but its behavior is still too primitive to mesh with the CPU (Onion+ skips it), and its method of operation moves a lot of bits over significant distances.
The increase in channel count, and in the number of requestors hitting the L2, means more data moving over longer distances, which costs power. The way GCN enforces coherence, by forcing accesses to miss all the way to the L2 or to memory, costs power as well.
Changes like more tightly linking parts of the GPU to the more compact HBM channels and going writeback between the last-level cache and the CUs could reduce this, but not without redesigning the cache.
On the GPU side, atomics go directly to the L2. That could be improved to make some cases faster. However, in practice we use LDS atomics locally and then perform one global atomic to synchronize. This greatly reduces the number of global atomics you need. GCN LDS is highly optimized for atomics. GCN also has super fast cross-lane operations, allowing the developer to sidestep the need for atomics (and LDS accesses) in many common cases. Unfortunately, only OpenCL 2.0 exposes this feature on PC (subgroup operations, see here: http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/). I am still sad that this feature didn't get included in DirectX 12.
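To make that pattern concrete, here is a minimal sketch written in CUDA terms (shared memory standing in for LDS, warp shuffles standing in for GCN cross-lane operations; the kernel and its parameters are hypothetical illustrations, not from the original post):

#include <cuda_runtime.h>

// Sketch of "local atomics first, one global atomic last".
// Counts how many elements exceed a threshold.
__global__ void countMatches(const int* data, int n, int threshold, unsigned int* globalCount)
{
    __shared__ unsigned int blockCount;   // per-workgroup counter lives in shared memory (LDS)
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int match = (i < n && data[i] > threshold) ? 1u : 0u;

    // Cross-lane reduction inside the warp: no atomics, no shared memory traffic.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        match += __shfl_down_sync(0xffffffffu, match, offset);

    // One shared-memory atomic per warp instead of one per thread.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(&blockCount, match);
    __syncthreads();

    // One global atomic per workgroup instead of one per thread.
    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount);
}

Instead of every thread issuing a global atomic that serializes at the L2, each warp does one cross-lane reduction, each workgroup does a handful of local atomics, and only a single global atomic per group touches memory.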
I would also love to see ordered atomics in DirectX some day.
Heh, I was going to say that I want the full float instruction set and the full LDS read/write instruction set... but that I don't need memory write support.

There was a paper on promoting the scalar unit to support scalar variants of VALU instructions, which had some benefits.
Tonga's GCN variant did promote the scalar memory pipeline to support writes, at least.
But I see some important use cases for scalar unit memory writes, especially if the scalar unit cache is not going to be coherent with the SIMD caches.
I am not a hardware engineer, so I don't know that much about the hardware-level power saving mechanisms. Those are often fully transparent to the programmer. I just try to get maximum performance out of the hardware. This is why I suggest things that allow the programmer to write new algorithms that perform faster. All kinds of hardware optimizations that allow shutting down unneeded transistors are obviously very much welcome.

Other questions are whether GCN can evolve to express dependency information to the scheduling hardware. It currently has 10 hardware threads per SIMD waking up and being evaluated per cycle. Perhaps some of that work could be skipped if the hardware knew that certain wavefronts could steam ahead without waking up all the arbitration hardware.
This is another thing it could borrow from Nvidia, or again from the VLIW architectures GCN replaced.
Register caching sounds like a good idea. The majority of the registers are just holding data that is not needed right now, since GPUs don't have register renaming or register spilling to the stack. A register cache should allow a considerably larger register file (a little bit farther away), while keeping performance and power consumption similar. This would definitely help GCN.
16-bit fields are also sufficient for unit vector processing (big parts of the lighting formulas) and for color calculations (almost all post processing). No need to waste double the number of bits if you don't need them.

16-bit fields are sufficient for machine learning.
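As a rough illustration of the kind of packed 16-bit math meant here, a CUDA sketch using the half2 type (this assumes a GPU with native FP16 arithmetic, sm_53+ in CUDA terms; the kernel itself is a hypothetical example):

#include <cuda_fp16.h>

// Packed 16-bit color scaling, the kind of math that dominates post processing.
// Each __half2 holds two 16-bit channels, so one packed multiply does the work
// that would otherwise take two 32-bit operations.
__global__ void scaleColors(const __half2* in, __half2* out, int n, __half2 gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // 16 bits of precision is plenty for typical color/post-process math.
        out[i] = __hmul2(in[i], gain);
    }
}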