BTW, if you guys have any questions, I'm glad to answer what I can. I've had a full arch briefing; AMD just didn't give us time to write much.
Just a bunch of things that occurred to me in the order I skimmed the slides (no worries if not covered):
Is there a clear count on the number of shader engines? The diagram seems to have the GPU divided into two, but could it be four going by the way the CU arrays are arranged in blocks of 4 with their own rasterizer and primitive blocks?
What specifically is in the purview of the primitive unit, rasterizer, and geometry processor? What's been "centralized" in Navi versus GCN, where each shader engine had its own geometry processor?
The GCN CU had a branch/message block that the RDNA diagram doesn't include. Any mention of that in the briefing, or is it just an artistic oversight?
Was there more detail on what changed with the LDS now that it apparently spans two CUs?
Perhaps too esoteric, but was there any discussion of GCN instructions that might have been removed or changed due to the SIMD changes? For example, DPP has various 16-wide strides, and LLVM mentions possibly discarding some instructions for handling branches or skipping instructions (VSKIP, the fork and join instructions).
Did they mention register bank conflicts or a register reuse cache in the briefing?
The slide on the RDNA SIMD Unit states "up to 20 wave controllers". Does this mean up to 20 wavefronts could be assigned to a single SIMD?
Details on the L0 and L1 caches? Is the L0 just the old L1 with a new name? How many accesses or how many WGPs can the L1 service per clock?
Is this a write-through hierarchy from L0 to L1 to L2? Did AMD outline possible changes in how it handles cache updates or memory ordering?
The centralized geometry processor has 4 prim units, but those are distributed across shader engines?
What does it mean by uniformly handling vertex reuse, primitive assembly, index reset, etc.? Does that mean the single geometry processor block does all of that itself, or is it responsible for farming out surface/vertex shaders across the die?
It uniformly distributes pre- and post-tessellation work, so it controls where the hull and domain shaders go? Where is the tessellation stage itself located?
Async Compute Tunneling: what makes it able to drive down the amount of lower-priority work so completely? Is it able to context-switch existing waves out, or if not, what made GCN less effective in draining wavefronts?
Process gains are now harder to come by and even harder to realize, particularly on larger dies.
Even though the node gains are usually more modest than marketing would purport, AMD's slide attributes roughly 15% of the overall ~50% perf/W improvement to the node, which translates to a single-digit percentage gain from the jump off GF's 14nm. It seems AMD ate up most of the possible improvement by staying well past the point of diminishing returns for circuit performance versus power consumption.
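To put rough numbers on that breakdown (my reading of the slide, not anything AMD spelled out):

    # Back-of-the-envelope version of the slide's perf/W breakdown (approximate).
    perf_per_watt_gain = 0.50   # the headline ~50% improvement
    node_share         = 0.15   # ~15% of it credited to the 7nm node
    print(perf_per_watt_gain * node_share)  # 0.075 -> single-digit % from the node alone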
So, in the end, we have our next-gen super-SIMD architecture? RDNA 2, the so-called next gen, is then this plus ray tracing? It makes sense to evolve the rasterizing architecture first and add RT support afterward.
There may be some elements similar to the claims, though some, like a register destination cache, are mentioned in the LLVM commits and not in the slide deck. I'm not sure how much the RDNA SIMD layout matches. For one, the patent describes the ALU arrays as being narrower than SIMD16, with SIMD8 mentioned as an example. The patent's suggestion of a superscalar or long-instruction-word encoding to help capture unused register cycles isn't mentioned, nor is there a clear instance of an instruction being split across one type of SIMD block at the same time as another--the SFU in the slide deck appears to operate in parallel with and independently of the main ALUs. There appears to be a mode that promotes a SIMD32 issue into a two-cycle form, not one that allows two instructions to issue simultaneously on the same SIMD.
The register file and register destination cache figure more heavily in a separate register file patent that is less dependent on the SIMD organization, and there is brief mention of elements from it like the register cache and register bank conflicts. The super SIMD patent didn't necessarily require that the banking was visible to software, and it seemed to regard the registers as being addressed as rows stretching across all four banks, rather than each bank's row being a different register ID.
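For what it's worth, here's a toy sketch of the two addressing schemes described above (Python; the four-bank count and the reg_id % 4 mapping are my own assumptions, not anything documented):

    NUM_BANKS = 4

    # Patent-style addressing, as I read it: a register ID selects a row that
    # stretches across all four banks.
    def banks_for_row_addressing(reg_id):
        return set(range(NUM_BANKS))

    # The alternative, where each bank's row is a different register ID, so the
    # register number picks a single bank (hypothetical reg_id % 4 mapping).
    def bank_for_id_addressing(reg_id):
        return reg_id % NUM_BANKS

    # Under the second scheme, sources landing in the same bank would need
    # serialized reads -- the kind of conflict the LLVM changes appear to model.
    srcs = [0, 4, 8]  # e.g. v0, v4, v8
    print(len({bank_for_id_addressing(r) for r in srcs}))  # 1 bank -> potential conflict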
The 64-wide waves are described as hiding inter-instruction latency better than 32-wide ones (which I interpret as referring to things like complex register file accesses such as indexed reads, lane interchange, and LDS operations).
The LLVM code changes include latency figures that indicate that the overall process of reading operands from the register file and issuing an instruction is 1 cycle longer than it used to be, regardless of the addressing mode.
AMD's slides don't mention bank conflicts based on the register ID, though the LLVM changes do. AMD's instruction issue slide doesn't appear to be using conflicting register IDs, however.
Interestingly, you can see from this example that Navi cannot issue two instructions consecutively when the second has to consume the first's result: v0 is written in cycle 2 but cannot be consumed until cycle 7, a four-cycle gap. GCN doesn't have to switch to another hardware thread in this case, while Navi does.
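A minimal sketch of how I'm reading that issue/latency relationship (the single-cycle issue and 5-cycle result latency on Navi are my interpretation of the LLVM numbers, not AMD documentation):

    def stall_cycles(issue_interval, result_latency):
        # A dependent instruction can't issue before its input is ready; any gap
        # beyond the natural issue interval shows up as a bubble (or a switch to
        # another wave).
        return max(0, result_latency - issue_interval)

    print(stall_cycles(issue_interval=4, result_latency=4))  # GCN: 0 -> no bubble
    print(stall_cycles(issue_interval=1, result_latency=5))  # Navi: 4 -> the v0 gap above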
The traditional GCN microarchitecture matched issue latency to execution and forwarding latency, whereas Navi's implementation has removed that issue limitation. GCN's long-held promise that near-zero thought needs to be given to ALU latency and forwarding has been abandoned. This, plus the extra vector register read cycle, may point to streamlining some of the internal pipelining needed to keep everything flowing in the 4-cycle cadence, and perhaps to sacrificing circuit depth per stage to get better clock speed out of a 5-cycle latency pipeline.
The most immediate upshot seems to be that the scalar unit's much quicker execution and forwarding latency are no longer tied to the vector path's cadence. When not crossing domains, that might allow stretches of setup code to run chains of serial scalar instructions without that stall.
The LLVM speed model is interesting in that it has mostly the same latency numbers as prior generations, just multiplied in cycle count by 4--aside from the vector ALU path having an extra register file cycle. In that regard, there's an updated instruction issue and sequencing element to the uarch, but it's plausible that many of the pipeline paths haven't diverged from the 4-cycle execution and forwarding pattern.
So perhaps, if you have a workgroup of 128 or 256 running in Workgroup Processor mode, you get twice the LDS capacity and four times the cache bandwidth of a single compute unit.
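As a toy illustration of that claim (the baseline figures are placeholders, not AMD numbers; only the 2x/4x multipliers come from my reading of the slides):

    lds_per_cu_bytes   = 64 * 1024   # hypothetical single-CU LDS allocation
    l0_bw_per_cu_bytes = 128         # hypothetical per-clock cache bandwidth baseline

    wgp_lds_bytes = 2 * lds_per_cu_bytes     # the workgroup sees both LDS halves
    wgp_cache_bw  = 4 * l0_bw_per_cu_bytes   # and roughly 4x the cache bandwidth
    print(wgp_lds_bytes, wgp_cache_bw)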
It does seem like a big factor is the ability for a single workgroup to leverage 4x the bandwidth that would have been possible in the past with single-CU workgroup allocation. It might help versus the competition, since Volta and Turing did upgrade their cache bandwidth. If the LDS is shared, I wonder if this also means the workgroup barriers that used to be per-CU are shared as well. The LLVM changes do mention there are subtleties to accessing two L0 caches, given the memory hierarchy's weak consistency.
The slide with the caches and LDS seems a little ambiguous as to how much the LDS has been upgraded, and whether it might have tweaks to let a workgroup shuffle data a little more efficiently between its two sides.
Without putting much thought into it, I wonder if some of the more complex merged shader stages like primitive shaders might benefit from this. One half of a workgroup could start the culling process and use the LDS to hand non-culled vertices to the vertex processing half of the same shader, rather than having to switch back and forth between shader phases.
The cache slide shows just how much bandwidth there is internal to the cache hierarchy, which I hope people pay heed to when planning a chiplet-based future. The Infinity Fabric is still at the far end, mostly handing data from one channel to the nearest L2 slice. The L2/L1 fabric carries more than twice that bandwidth, and a single dual-CU set of data paths could generate more than half of the bandwidth of the whole GDDR6 subsystem--and each L1 has 5 of them.
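Plugging placeholder numbers in just to make the shape of that claim concrete (the per-WGP data-path width and clock are assumptions on my part, not disclosed specs; only the 448 GB/s figure is the RX 5700 XT's published memory bandwidth):

    wgp_bytes_per_clk = 128     # assumed width of one dual-CU set of data paths
    clock_hz          = 1.9e9   # assumed ~1.9 GHz clock
    gddr6_bw          = 448e9   # RX 5700 XT GDDR6 subsystem, 448 GB/s

    wgp_bw = wgp_bytes_per_clk * clock_hz
    print(wgp_bw / gddr6_bw)    # ~0.54 -> more than half the GDDR6 bandwidth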
Trying to take any of the internal groupings out of the die is even more expensive than it was with GCN.
Some of my earlier questions about how heavily loaded the L2/L1 fabric was appear to be answered by the addition of the L1, with clients like the RBEs and rasterizer hanging off the L1 rather than the L2.
It doesn't seem like Navi has entirely dispensed with the old way of distributing work to the shader engines and rasterizers, since the geometry block has arrows to them that do not go through a cache.
With respect to the lower-latency hierarchy:
Per a GDC 2018 presentation on engine optimizations (https://gpuopen.com/gdc-2018-presentations/, the one by Timothy Lottes), a GCN vector L1 hit has a latency of ~114 cycles, so I guess RDNA dropping its cache hit latency down to 90 cycles is better than nothing. Going by the reverse-engineering of Volta (https://arxiv.org/pdf/1804.06826.pdf), AMD is about three times slower at the L1 rather than about four times slower. GPU L1s aren't speed demons, obviously, and Volta appears to have latencies close to GCN's at the L2 and beyond.
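A quick ratio check, treating the Volta number as the roughly 28-cycle L1 hit latency I recall from that paper (approximate):

    gcn_l1_hit, rdna_l0_hit, volta_l1_hit = 114, 90, 28
    print(gcn_l1_hit / volta_l1_hit)   # ~4.1x
    print(rdna_l0_hit / volta_l1_hit)  # ~3.2x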