Yeah. To me, GCN was as much as five times faster than Kepler in compute. Nobody talked about it, though--not even AMD themselves, it seemed. When did you ever see a 5x lead over the competition? Never. And today all we hear is how far 'behind' AMD is.
To me GCN is the best GPU architecture ever made, and the power it draws translates into performance. I think AMD makes big changes less often, but when they do, there's a good chance they take the lead for some time.
It's been some time since then, so I've probably forgotten many things, but which benchmarks or metrics showed a 5x lead? There were some specific cases I can remember, like double precision, although that would understandably be of little concern outside of compute markets like HPC--where AMD's lack of a software foundation negated even leads like that.
This came up in the pre-E3 thread.
https://forum.beyond3d.com/posts/2067755/
I speculated on a few elements of the patent here:
https://forum.beyond3d.com/posts/2069676/
One embodiment is a CU with most of the SIMD resources stripped from the diagram, and other elements like the LDS and export bus removed.
Coming from GCN, that's a loss of 3/4 of the SIMD schedulers; coming from Navi, a loss of 1/2. SIMD width isn't touched on much, although one passage discusses a 32-thread scenario.
Beyond these changes, the CU is physically organized differently, and its workload is handled differently.
In one embodiment the SIMD is arranged like a dual-issue unit, and there is a tiered register file: a larger file with 1 read and 1 write port, and a smaller, multi-ported fast register file. There is a register-access unit that can load different rows from each register bank, and a crossbar that can rearrange values coming from the register file or from the VALU outputs. Possibly, removing the LDS didn't remove the hardware that handled more arbitrary accesses to its banked structure, and that hardware was repurposed and expanded for this. Efficient matrix transpose operations were noted as one use case for these two rather significant additions to the access hardware.
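For a sense of why a crossbar on the register file matters for transposes: on current GPUs the usual way to rearrange data across lanes is to stage it through LDS/shared memory. A minimal sketch of that standard pattern (CUDA purely because it's the familiar example; nothing here comes from the patent, and the tile size and launch shape are my own assumptions):

[code]
// Classic transpose by staging through LDS/shared memory: the cross-lane
// rearrangement happens in the shared tile, not in the register file.
// Assumes a 32x32 thread block per tile.
#define TILE 32

__global__ void transpose_tile(const float* in, float* out, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Write the tile back transposed; the block indices swap roles.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
[/code]

As I read it, the register-access unit plus crossbar would let that sort of lane rearrangement come straight out of the register banks or VALU outputs, without the round trip above--presumably why transposes get called out, especially with the LDS gone in this embodiment.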
The workload handling is also notably changed. The scalar path is heavily leveraged to run a persistent thread, which, unlike current kernels, is expected to run continuously between uses. The persistent kernel monitors a message queue for commands, which it then matches in a lookup table with whatever sequence of instructions needs to be run for a task.
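For what it's worth, the closest existing software analogue is the "persistent kernel" pattern people already hand-roll, where a long-lived kernel spins on a queue and branches into task code instead of being relaunched. A toy sketch of that pattern (CUDA again, with the message format, task IDs, and ring size all invented; the patent's version would run on the scalar path and look nothing like this at the ISA level):

[code]
#include <cstdint>

// Invented message format and task IDs, purely for illustration.
struct Message { uint32_t task; uint32_t arg; };
enum : uint32_t { TASK_FIR = 1, TASK_XPOSE = 2, TASK_EXIT = 3 };

__device__ void run_fir(uint32_t arg)   { /* task body */ }
__device__ void run_xpose(uint32_t arg) { /* task body */ }

// Launched once; afterwards, work arrives as messages rather than launches.
__global__ void persistent_kernel(volatile Message* queue,
                                  volatile uint32_t* head,
                                  volatile uint32_t* tail)
{
    uint32_t rd = 0;                       // local read cursor into the ring
    for (;;) {
        while (rd == *tail) { }            // spin until a command is published
        uint32_t task = queue[rd].task;
        uint32_t arg  = queue[rd].arg;
        rd = (rd + 1) & 255;               // 256-entry ring buffer assumed
        __syncthreads();                   // every lane has the command now

        switch (task) {                    // stand-in for the lookup table
            case TASK_FIR:   run_fir(arg);   break;
            case TASK_XPOSE: run_xpose(arg); break;
            case TASK_EXIT:  return;         // only stops when told to
        }

        __syncthreads();
        if (threadIdx.x == 0) *head = rd;  // report completion back
    }
}
[/code]

The switch is just a stand-in for the table the patent describes, which maps an incoming command to whatever instruction sequence the task needs.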
The standard path on a current GPU would involve command packets going to a command processor, which hands off to the dispatch pipeline, which then needs to arbitrate for resources on a CU, which in turn needs to be initialized before the kernel can start. Completion and a return signal are handled indirectly, partly involving the export/message path and possibly a message/interrupt engine. Subsequent kernels or system requests would need to go through this process each time.
The new path still has an initial startup, but only for the persistent thread. Once it is running, messages written to its queue can skip past all the hand-offs and go straight into the initial instructions of the task. Its generation of messages might also be more direct than the current way CUs communicate with the rest of the system.
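Continuing the toy example from above (host side this time, and again entirely my own construction): once the persistent kernel is resident, submitting work is a write into a host-mapped ring plus a tail bump--no further launches and no dispatch pipeline involved, which is roughly the shortcut the patent seems to be after.

[code]
#include <cuda_runtime.h>
#include <atomic>

int main()
{
    // Host-mapped allocations so the resident kernel can poll them directly.
    Message*  queue; uint32_t *head, *tail;
    cudaHostAlloc((void**)&queue, 256 * sizeof(Message), cudaHostAllocMapped);
    cudaHostAlloc((void**)&head,  sizeof(uint32_t),      cudaHostAllocMapped);
    cudaHostAlloc((void**)&tail,  sizeof(uint32_t),      cudaHostAllocMapped);
    *head = 0; *tail = 0;

    persistent_kernel<<<1, 64>>>(queue, head, tail);     // started once, stays resident

    queue[*tail] = { TASK_FIR, 0 };                       // write the command first...
    std::atomic_thread_fence(std::memory_order_seq_cst);  // ...make it visible...
    *tail = (*tail + 1) & 255;                            // ...then publish it

    queue[*tail] = { TASK_EXIT, 0 };                      // tell the kernel to wind down
    std::atomic_thread_fence(std::memory_order_seq_cst);
    *tail = (*tail + 1) & 255;

    cudaDeviceSynchronize();                              // returns once TASK_EXIT runs
    return 0;
}
[/code]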
Going back to the patent: this overall kernel has full access to all the VGPRs, so it's at least partly in charge of keeping the individual task contexts separate, and it needs to handle more of the startup and cleanup that might be handled automatically by current hardware. There's some concurrency between tasks, but from the looks of things it's not going to have as many tasks in flight as a full SIMD would. The scalar path may also see more of its cycles taken up by the persistent kernel rather than by direct computation.
There was one possible area of overlap with GFX10 when the patent mentioned shared VGPRs, but this was before the announcements of sub-wave execution, which has a different sort of shared VGPR.
Other than a brief mention that it could be narrower than GCN, it's substantially different from both GCN and RDNA.
Use cases include packet processing, image recognition, cryptography, and audio. These are cited as workloads that are more latency-sensitive and whose compute and kernel state doesn't change that much.
Sony engineering has commented in the past that audio processing on the PS4 GPU was very limited due to its long latency, and AMD has developed multiple projects to handle this better--TrueAudio, high-priority queues, TrueAudio Next, and the priority tunneling for Navi. This method might be more responsive.
Perhaps something like cryptography might make sense for the new consoles and their much faster storage subsystems, whose data I would presume is compressed and encrypted to a significant degree. I'm not sure GPU hardware would beat dedicated silicon for just that one task, though.
Other elements, like image recognition and packet processing, might come up in specific client use cases, but I would wonder if this could be useful in HPC as well.
The fast transpose capability is something that might benefit one idea put forward for ray tracing on an AMD-like GPU architecture (packing/unpacking ray contexts to better work around divergence), although in this instance it would be less integrated than AMD's TMU-based ray tracing or even Nvidia's RT cores, since this new kind of CU would be much more separate and may lack portions of standard CU capability. It's not clear whether such a CU or its task programs would be exposed the same way, as there are various points where an API or microcode could be used rather than direct coding.