AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Each Shader Engine contains 10 Workgroup Processors, which in turn each contain 2 CUs. The CUs inside of a WGP can be grouped up to cooperate on workloads, if the compiler deems it beneficial.

[Slide: Mike_Mantor-Next_Horizon_Gaming-Graphics_Architecture_06092019_20.jpg]

So each Compute Unit has 64 threads (SPs) but can be split in two for two issues of 32 SPs each, and then each workgroup has some "local data share" block that does... something with cache/IO/whatever. Alright then. So the higher level does look like those odd one-off mobile Vegas with 20 CUs for MacBooks; it's just that the naming conventions threw me off. Thanks!
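
As a quick sanity check on how those counts multiply out, here's a minimal Python sketch; the per-level counts are just the ones described above (and confirmed later in the thread), nothing beyond that.

# Rough Navi 10 shader hierarchy arithmetic, per the slides discussed above.
shader_engines = 2      # confirmed later in the thread
wgps_per_se = 10        # Workgroup Processors per Shader Engine
cus_per_wgp = 2         # CUs per WGP
simds_per_cu = 2        # SIMD32 units per CU
lanes_per_simd = 32     # "SPs" per SIMD

total_cus = shader_engines * wgps_per_se * cus_per_wgp       # 40
total_lanes = total_cus * simds_per_cu * lanes_per_simd      # 2560 stream processors
print(total_cus, total_lanes)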
 
BTW, if you guys have any questions, I'm glad to answer what I can. I've had a full arch briefing; AMD just didn't give us time to write much.
Just a bunch of things that occurred to me in the order I skimmed the slides (no worries if not covered):
Is there a clear count on the number of shader engines? The diagram seems to have the GPU divided into two, but could it be four going by the way the CU arrays are arranged in blocks of 4 with their own rasterizer and primitive blocks?
What specifically is in the purview of the primitive unit, rasterizer, and geometry processor? What's been "centralized" in Navi versus how shader engines each had a geometry processor?
The GCN CU had a branch/message block that the RDNA diagram didn't include. Any mention of that in the briefing, or just artistic oversight?
Was there more detail on what changed with the LDS now that it's apparently spanning two CUs?
Perhaps too esoteric, but was there any discussion about GCN instructions that might have been removed or changed due to the SIMD changes? For example, DPP has various 16-wide strides, and LLVM does mention possibly discarding some instructions for handling branches or skipping instructions (VSKIP, fork and join instructions).
Did they mention register bank conflicts or a register reuse cache in the briefing?
The slide on the RDNA SIMD Unit states "up to 20 wave controllers". Does this mean up to 20 wavefronts could be assigned to a single SIMD?
Details on the L0 and L1 caches? Is the L0 just the old L1 with a new name? How many accesses or how many WGPs can the L1 service per clock?
Is this a write-through hierarchy from L0 to L1 to L2? Did AMD outline possible changes in how it handles cache updates or memory ordering?
The centralized geometry processor has 4 prim units, but those are distributed across shader engines?
What does it mean by uniformly handling vertex reuse, primitive assembly, index reset, etc.? Does that mean the single geometry processor block does all of that, or is it responsible for farming out surface/vertex shaders across the die?
It uniformly distributes pre- and post-tessellation work, so it controls where the hull and domain shaders go? Where is the tessellation stage itself located?
Async Compute Tunneling: what makes it able to drive down the amount of lower-priority work so completely? Is it able to context-switch existing waves out, or if not, what made GCN less effective in draining wavefronts?

Process gains are now harder to come by and even harder to realize, particularly on larger dies.
Even though node gains are usually more modest than marketing would purport, AMD's slide credits the node with only ~15% of the ~50% perf/W improvement, which works out to a single-digit percentage gain attributed to the node jump from GF's 14nm. It seems AMD ate up most of the possible improvement by staying well past the point of diminishing returns for circuit performance versus power consumption.
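
Making the arithmetic explicit (the percentages are an approximate reading of AMD's perf/W breakdown slide as described above, so treat this as a sketch):

# Node's share of the overall perf/W gain, per the slide as read above.
total_perf_per_watt_gain = 0.50   # ~50% overall improvement claimed
node_share = 0.15                 # ~15% of that credited to the 7nm node
node_contribution = total_perf_per_watt_gain * node_share
print(f"~{node_contribution:.1%} attributed to the node jump")   # ~7.5%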

So, in the end, do we have our next-gen super-SIMD architecture? And RDNA 2, the so-called next gen, is then this plus ray tracing? It makes sense to evolve the rasterization architecture first and add the RT support afterwards.
There may be some elements similar to the claims, though some like a register destination cache are mentioned in the LLVM commits and not in the slide deck. I'm not sure how much the RDNA SIMD layout matches. For one, the ALU arrays are described in terms of being narrower than SIMD16, with SIMD8 being mentioned as an example. The patent's suggestion of a superscalar or long instruction word encoding to help capture unused register cycles isn't mentioned, nor is there a clear instance of an instruction being split across one type of SIMD block at the same time as another--the SFU in the slide deck appears to operate in parallel and independently of the main ALUs. There appears to be a mode that promotes a SIMD32 issue into a two-cycle form, not one that allows for two instructions to issue simultaneously on the same SIMD.
The register file and register destination cache figure more heavily in a separate register file patent that is less dependent on the SIMD organization, and there is brief mention of elements from it like the register cache and register bank conflicts. The super SIMD patent didn't necessarily require that the banking was visible to software, and it seemed to regard the registers as being addressed as rows stretching across all four banks, rather than each bank's row being a different register ID.

The 64-wide wavefronts are described as hiding inter-instruction latency (which I interpret as referring to complex register file banking (e.g. indexed reads), lane interchange, and LDS operations) better than 32-wide ones.
The LLVM code changes include latency figures that indicate that the overall process of reading operands from the register file and issuing an instruction is 1 cycle longer than it used to be, regardless of the addressing mode.
AMD's slides don't mention bank conflicts based on the register ID, though the LLVM changes do. AMD's instruction issue slide doesn't appear to be using conflicting register IDs, however.
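
For what it's worth, here is a hypothetical sketch of what a register-bank-conflict check might look like. The bank count (4) and the low-bits bank selection rule are my assumptions for illustration, loosely modeled on how the LLVM changes read to me; AMD's slides don't state any of this.

# Hypothetical VGPR bank model: 4 banks, bank chosen by the low bits of the
# register number. All of this is assumed for illustration.
NUM_VGPR_BANKS = 4

def vgpr_bank(reg_id):
    return reg_id % NUM_VGPR_BANKS

def has_bank_conflict(src_regs):
    # A conflict would mean two *different* source registers of one
    # instruction land in the same bank and fight over its read port.
    seen = {}
    for reg in set(src_regs):
        bank = vgpr_bank(reg)
        if bank in seen:
            return True
        seen[bank] = reg
    return False

print(has_bank_conflict([1, 5, 9]))   # True: v1, v5, v9 all map to bank 1 in this model
print(has_bank_conflict([1, 2, 3]))   # False: three different banks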

Interestingly, you can see from this example that Navi cannot issue two dependent instructions consecutively when there's a write followed by a read of that register: v0 is written in cycle 2 but cannot be consumed until cycle 7, four cycles of latency. GCN doesn't have to switch to another hardware thread in this case, while Navi does.
The traditional GCN microarchitecture matched issue latency to execution and forwarding latency, whereas Navi's implementation has removed that coupling on the issue side. GCN's long-held promise that near-zero thought has to be dedicated to ALU and forwarding latency has been abandoned. This, plus the extra vector register read cycle, may point to streamlining some of the internal pipelining needed to get everything flowing in the 4-cycle cadence, and perhaps a sacrifice of circuit depth per stage to get better clock speed out of a 5-cycle-latency pipeline.
The most immediate upshot seems to be that the scalar unit's much quicker execution and forwarding latency is no longer tied to the vector path's cadence. If not crossing domains, it might allow stretches of serial scalar setup code to run without that stall.
The LLVM speed model is interesting in that it has mostly the same latency numbers as prior generations, just multiplied in cycle count by 4--aside from the vector ALU path having an extra register file cycle. In that regard, there's an updated instruction issue and sequencing element to the uarch, but it's plausible that many of the pipeline paths haven't diverged from the 4-cycle execution and forwarding pattern.
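
To make the single-wave consequence of that concrete, here's a toy Python model of a dependent instruction chain. The numbers come from the discussion above (GCN's 4-cycle cadence with matched forwarding; the write-in-cycle-2, consume-in-cycle-7 example for Navi); the model itself is just a sketch of the argument, not the real pipeline.

# Toy model: cycles to finish a chain of N instructions where each one
# consumes the previous result, for a single wave with no other waves to
# cover the gaps.
def dependent_chain_cycles(n, issue_interval, result_latency):
    cycle = 0    # next free issue slot
    ready = 0    # cycle at which the previous result becomes available
    for _ in range(n):
        issue = max(cycle, ready)
        ready = issue + result_latency
        cycle = issue + issue_interval
    return ready

# GCN: a wave64 instruction issues every 4 cycles and the result is forwarded
# in time for the same wave's next issue slot -> no bubbles.
print(dependent_chain_cycles(4, issue_interval=4, result_latency=4))   # 16

# Navi, per the example above: back-to-back issue is possible, but a dependent
# read can't start until ~5 cycles after the write (cycle 2 -> cycle 7), so a
# lone wave stalls unless another wave fills the gap.
print(dependent_chain_cycles(4, issue_interval=1, result_latency=5))   # 20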

So perhaps if you have a workgroup of 128 or 256 running in Workgroup Processor mode, then you get twice as much LDS capacity and 4x the cache bandwidth compared to just a single compute unit.
It does seem like a big factor is the ability for a single workgroup to leverage 4x the bandwidth that would have been possible in the past with single-CU workgroup allocation. It seems like it might help versus the competition, since Volta and Turing did upgrade their cache bandwidth. If the LDS is shared, I wonder if this also means the workgroup barriers that used to be per-CU are also shared. The LLVM changes do mention there are subtleties to accessing two L0 caches, given the memory hierarchy's weak consistency.
The slide with the caches and LDS seems a little ambiguous as to how much the LDS has been upgraded, and whether it might have tweaks to let a workgroup shuffle data a little more efficiently between its two sides.
Without putting much thought into it, I wonder if some of the more complex merged shader stages like primitive shaders might benefit from this. One half of a workgroup could start the culling process and use the LDS to hand non-culled vertices to the vertex-processing half of the same shader, rather than having to switch back and forth between shader phases.
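
Spelling out the arithmetic of that reading (the 64KB-per-CU LDS figure and the bandwidth multiplier are just the claims above restated, not confirmed spec):

# What a 256-thread workgroup would see in WGP mode vs. a single-CU
# allocation, under the reading above. Numbers are assumptions.
workgroup_size = 256
lds_per_cu_bytes = 64 * 1024

cu_mode_lds_per_thread = lds_per_cu_bytes / workgroup_size          # 256 bytes
wgp_mode_lds_per_thread = 2 * lds_per_cu_bytes / workgroup_size     # 512 bytes
cache_bw_multiplier = 4                                             # per the "4x" claim above

print(cu_mode_lds_per_thread, wgp_mode_lds_per_thread, cache_bw_multiplier)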

The cache slide shows just how much bandwidth there is internal to the cache hierarchy, which I hope people pay heed to when planning a chiplet-based future. The infinity fabric is still at the far end, handing data from one channel to the nearest L2 slice, mostly. The L2/L1 fabric carries more than twice the bandwidth, and a single dual-CU set of data paths could generate more than half of the bandwidth of the whole GDDR6 subsystem--and each L1 has 5 of them.
Trying to take any of the internal groupings out of the die is even more expensive than it was with GCN.
Some of my earlier questions about how heavily loaded the L2/L1 fabric was appear to have been answered by the addition of the L1, and clients like the RBEs and rasterizer hang off the L1 rather than the L2.
It doesn't seem like Navi has entirely dispensed with the old way of distributing work to the shader engines and rasterizers, since the geometry block has arrows to them that do not go through a cache.
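
A rough scale check of the kind of numbers involved. The GDDR6 math is standard for a 14 Gbps, 256-bit Navi 10 board; the per-dual-CU bytes/clock and the clock are assumptions I've picked to be consistent with the "more than half of GDDR6" reading above, not published figures.

# Back-of-the-envelope bandwidth comparison.
gddr6_bytes_per_s = 14e9 * 256 / 8          # ~448 GB/s of external memory bandwidth
gpu_clock_hz = 1.9e9                        # assumed clock, roughly game-clock territory
bytes_per_clk_per_dual_cu = 128             # assumed load-path width for one dual CU

dual_cu_bytes_per_s = bytes_per_clk_per_dual_cu * gpu_clock_hz       # ~243 GB/s
print(dual_cu_bytes_per_s / gddr6_bytes_per_s)       # ~0.54: over half of GDDR6
print(5 * dual_cu_bytes_per_s / gddr6_bytes_per_s)   # ~2.7x for the 5 dual CUs behind one L1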

With respect to the lower-latency hierarchy:
Per a GDC 2018 presentation on engine optimizations (https://gpuopen.com/gdc-2018-presentations/, one by Timothy Lottes), a GCN vector L1 hit has a latency of ~114 cycles. So I guess RDNA dropping its cache hit latency down to 90 cycles is better than nothing. Going by the reverse-engineering of Volta (https://arxiv.org/pdf/1804.06826.pdf), AMD is about three times slower rather than about four times slower at the L1. GPU L1s aren't speed demons, obviously, and Volta appears to have latencies close to GCN at the L2 and beyond.
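
Putting the quoted numbers side by side (the Volta L1 figure is my recollection of the cited paper, roughly 28 cycles, so treat it as approximate):

# Ratio check for the cache-hit latencies discussed above.
gcn_vector_l1_hit = 114   # cycles, per the GDC 2018 presentation
rdna_l0_hit = 90          # cycles, per the RDNA material
volta_l1_hit = 28         # cycles, approximate recollection of the Volta paper

print(gcn_vector_l1_hit / volta_l1_hit)   # ~4.1x
print(rdna_l0_hit / volta_l1_hit)         # ~3.2x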
 
So each Compute Unit has 64 threads (SPs) but can be split in two for two issues of 32 SPs each, and then each workgroup has some "local data share" block that does... something with cache/IO/whatever. Alright then. So the higher level does look like those odd one-off mobile Vegas with 20 CUs for MacBooks; it's just that the naming conventions threw me off. Thanks!

Sounds kind of like Bulldozer to me :D

[Image: bulldozer_031.jpg]
 
Sounds kind of like Bulldozer to me :D
For the most part, if you look at prior GCN, it would have most of those features. GCN shared its front end and scalar caches between up to 4 CUs.
The LDS spanning two CUs is new, since it's storage with a software-visible address range that the hardware no longer isolates to a single CU like it once did.

One thing I'm curious about is how the GCN CU went from an apparently single scheduler block to there being schedulers per SIMD.
I suspect that while GCN did have a fair chunk of scheduling hardware that rotated between SIMDs, the various SIMDs and other blocks may have had some minor sequencing to get them through to the next issue cycle.

What scheduling or issue capability still physically connects the SIMDs in RDNA, since the more they autonomously decide on their execution path and scheduling, the more they appear like a core?
 
SIMD & Wave execution:
GCN: CU has 4 x SIMD16; Wave64 executes on a SIMD16 over 4 cycles.
RDNA: CU has 2 x SIMD32; Wave32 executes on a SIMD32 in 1 cycle.

LDS:
GCN: 10 Wave64 on each SIMD16, 2560 threads per CU; 2560 threads (1 CU) share 64KB LDS.
RDNA: 20 Wave32 on each SIMD32, 1280 threads per CU; 2560 threads (2 CUs) share 64KB LDS.

Shared cache:
GCN: 4 schedulers & 4 scalar units (4CU) share I$, K$
RDNA: 4 schedulers & 4 scalar units (2CU) share I$, K$
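
The occupancy arithmetic behind that comparison, as a quick Python sketch (figures are the ones listed above):

# Threads in flight per CU, per the comparison above.
gcn_threads_per_cu = 4 * 10 * 64        # 4 x SIMD16, 10 wave64 each -> 2560
rdna_threads_per_cu = 2 * 20 * 32       # 2 x SIMD32, 20 wave32 each -> 1280
rdna_threads_per_wgp = 2 * rdna_threads_per_cu   # 2560 across the dual CU sharing one LDS
print(gcn_threads_per_cu, rdna_threads_per_cu, rdna_threads_per_wgp)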
 
I'm waiting for reviews. But the best-case-scenario benchmarks they showed were "meh", and the same goes for the prices. Looks like a good competitor against the 1080 non-Ti... They're late as usual. But reviews will tell; I hope I'm wrong.

"Gaming hz", love this one..
 
Let's leave wishes, desires, and talk of other products out of this thread, and get back to AMD's Navi...
 
Just a bunch of things that occurred to me in the order I skimmed the slides (no worries if not covered):
Most of what you're asking is beyond what I was briefed on and is beyond my own expertise. But I'll answer what I can.
Is there a clear count on the number of shader engines? The diagram seems to have the GPU divided into two, but could it be four going by the way the CU arrays are arranged in blocks of 4 with their own rasterizer and primitive blocks?
Yes. There are 2 shader engines.
What specifically is in the purview of the primitive unit, rasterizer, and geometry processor? What's been "centralized" in Navi versus how shader engines each had a geometry processor?
I believe a lot of this is stylistic, but a lot of work has gone into improving their work (re)distribution. It's something I need to look more into.

Async Compute Tunneling: what makes it able to drive down the amount of lower-priority work so completely? Is it able to context-switch existing waves out, or if not, what made GCN less effective in draining wavefronts?
A new feature called Priority Tunneling has been added. Notably, this is not context switching. But it does allow the AWS to go to the top of the execution pipeline and block any new work being issued, so that it can be drained and a compute workload started immediately thereafter.
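
Purely as a conceptual sketch of the behavior described (the queue/drain model here is invented for illustration and says nothing about how the hardware actually does it):

from collections import deque

# Conceptual model: when high-priority compute arrives, stop launching new
# low-priority workgroups, let the in-flight ones drain (no context switch),
# then run the high-priority task.
def schedule(low, high, in_flight):
    timeline = []
    while low or high or in_flight:
        if high:
            while in_flight:                          # drain rather than preempt
                timeline.append(("drain", in_flight.pop(0)))
            timeline.append(("run_high", high.popleft()))
        elif low:
            wg = low.popleft()
            in_flight.append(wg)
            timeline.append(("launch_low", wg))
        else:
            timeline.append(("complete", in_flight.pop(0)))
    return timeline

# g0/g1 are already running when the async compute job shows up; g2/g3 wait.
print(schedule(deque(["g2", "g3"]), deque(["async_compute"]), ["g0", "g1"]))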
 
Most of what you're asking is beyond what I was briefed on and is beyond my own expertise. But I'll answer what I can.
Yes. There are 2 shader engines.
I believe a lot of this is stylistic, but a lot of work has gone into improving their work (re)distribution. It's something I need to look more into.

A new feature called Priority Tunneling has been added. Notably, this is not context switching. But it does allow the AWS to go to the top of the execution pipeline and block any new work being issued, so that it can be drained and a compute workload started immediately thereafter.
The Radeon 5700 has 2 SEs with 10 workgroup processors each, 40 CUs in total, with 4 CUs deactivated. How do they deactivate 4 CUs in the 5700 if the CUs are grouped in pairs with shared cache? Can they deactivate only one CU in a Workgroup Processor? If so, what happens to the shared cache?
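
For scale, the salvage arithmetic implied by those numbers (my own working, not something from the briefing):

# RX 5700 salvage configuration, per the counts above.
full_cus = 40
disabled_cus = 4
active_cus = full_cus - disabled_cus            # 36
active_lanes = active_cus * 2 * 32              # 2304 stream processors
print(active_cus, active_lanes)
# If CUs can only be fused off per WGP (in pairs), 4 disabled CUs would mean
# 2 whole WGPs turned off; disabling one CU of a pair would leave the shared
# front end and LDS serving a single CU, which is exactly the question.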
 
I don't understand why there are "dual compute units". They appear to share instruction and scalar/constant caches, which seems like a weak gain. I don't fully understand the slide that refers to a "Workgroup Processor"; it seems to be saying that because the LDS and cache are "shared", huge gains in performance from issuing large workgroups are possible. So perhaps if you have a workgroup of 128 or 256 running in Workgroup Processor mode, then you get twice as much LDS capacity and 4x the cache bandwidth compared to just a single compute unit.
It's not a "dual compute unit" to start with. It's 4 SIMD units with a native wave size of 32 merged into one compute unit. And while the slides don't state it, it appears reasonable to assume that everything right of the LDS isn't actually bound to a specific SIMD unit / pair, but shared for the whole CU.

Looks like an artistic choice, as the 2x32 slice was apparently easier to compare to the previous 4x16 configuration than the full 4x32 configuration would have been. Take everything on slide 13 times two and you have the real numbers for RDNA. This factor is then represented in slide 20.

I must assume that the "2x registers" and "2x ALU" are limited to the VGPRs / vector units, and refer to each subgroup of 32 / 64 threads being strictly local to a SIMD unit. Scalar registers and scalar instructions may not have been kept to a single scalar unit, but rather fully replicated across up to all 4 SIMD groups.
 
Just a bunch of things that occurred to me in the order I skimmed the slides (no worries if not covered):
Is there a clear count on the number of shader engines? The diagram seems to have the GPU divided into two, but could it be four going by the way the CU arrays are arranged in blocks of 4 with their own rasterizer and primitive blocks?

And I thought mine were a nightmare to respond to... :p
Anyway, the whole thing seems to be extremely hierarchical now, perhaps for cost reasons, though maybe they're hoping for chiplets at some point. Compute units issue two 32-thread wavefronts (one per SIMD32). The data share is now shared between two of these compute units; five of those pairs sit in each upper block, which shares an L1 cache, rasterizer, etc., and two of those blocks make up a "shader engine", which is separate from the geometry processor etc. What exactly constitutes a "Shader Engine" then isn't clear to me either, unless the L2 cache and memory controllers are accessible by the whole "Shader Engine" instead of by half. But the diagrams do appear accurate, so what you see is indeed what you seem to get.

Even though node gains are usually more modest than marketing would purport, AMD's slide credits the node with only ~15% of the ~50% perf/W improvement, which works out to a single-digit percentage gain attributed to the node jump from GF's 14nm. It seems AMD ate up most of the possible improvement by staying well past the point of diminishing returns for circuit performance versus power consumption.

Perfectly true; power efficiency in current finFET nodes always advances far more than any available frequency increase. I'm not exactly sure of the physics here, but some tipping point in the finFET gate structure makes power draw go exponential at around the same frequency regardless of feature size. I'd expect the upcoming Zen 2 mobile / Navi 20 CU cards to be far, far more power efficient than even mobile Vega was. So Intel's "we can match AMD in mobile GPU!" claim isn't going to last long at all.

The cache slide shows just how much bandwidth there is internal to the cache hierarchy, which I hope people pay heed to when planning a chiplet-based future. The infinity fabric is still at the far end, handing data from one channel to the nearest L2 slice, mostly. The L2/L1 fabric carries more than twice the bandwidth, and a single dual-CU set of data paths could generate more than half of the bandwidth of the whole GDDR6 subsystem--and each L1 has 5 of them.
Trying to take any of the internal groupings out of the die is even more expensive than it was with GCN.
Some of my earlier questions about how heavily loaded the L2/L1 fabric was appear to have been answered by the addition of the L1, and clients like the RBEs and rasterizer hang off the L1 rather than the L2.
It doesn't seem like Navi has entirely dispensed with the old way of distributing work to the shader engines and rasterizers, since the geometry block has arrows to them that do not go through a cache.

This is why I'd guess at no GPU chiplets anytime soon. The CPU chiplets AMD uses with Zen 2 have zero direct interconnects; it's all Infinity Fabric, and that bandwidth just isn't enough for a GPU.

In fact, cache problems in modern architectures remind me of the rocket equation. Specifically, cache could be equated to fuel in a rocket: cache grows exponentially while logic grows linearly, so eventually cache is just going to dominate die space altogether versus logic. Then at some point not long after, you can't make improvements at all, as for every N logic improvements you make you might need 2N, or N^2, or whatever bigger cache just to feed new instructions to the logic. Other vendors aren't immune from this either; Nvidia has an ever-growing cache-versus-logic problem as well, and from what I recall a huge reason Apple CPUs are so fast is all the work on their memory/cache systems.

Some big, very different-looking change is going to need to happen with regard to accessing memory if computers are going to keep getting faster. I can hardly imagine what you'd need if graphene or some other 2D material replaces silicon and suddenly you can clock to over 100 GHz.
 
A new feature called Priority Tunneling has been added. Notably, this is not context switching. But it does allow the AWS to go to the top of the execution pipeline and block any new work being issued, so that it can be drained and a compute workload started immediately thereafter.
So it would seem that in prior GPUs the high-priority queue could arbitrate for most--but not all--of the workgroup launch slots that would become available during the lifetime of a high-priority task? Then the AWS can more completely monopolize the shader engines.

It's not a "dual compute unit" to start with. It's 4 SIMD units with a native wave size of 32 merged into one compute unit. And while the slides don't state it, it appears reasonable to assume that everything right of the LDS isn't actually bound to a specific SIMD unit / pair, but shared for the whole CU.
If by right of the LDS you mean the texture blocks and L0, I think there is some evidence that those are not shared.
The LLVM changes specifically point out that in workgroup processor mode the two halves of a WGP will not see a consistent view of memory, because the L0 is per-CU and one half cannot see possibly newer versions of data sitting in the other half's L0. Cache invalidation or some other kind of synchronization is necessary to get correct behavior out of vector memory accesses in that mode.
There may also be some other subtleties to the hardware IDs and resource management that recognize the CUs separately. Everything to the left of the LDS seems to be more independent already, so a wavefront running independently on a SIMD may look much the same whether it's independent of the SIMD next to it or the SIMD in the next CU. Whether there are operations, side effects, or hardware settings that have a more immediate effect within a CU isn't clear at this point.

Perfectly true; power efficiency in current finFET nodes always advances far more than any available frequency increase. I'm not exactly sure of the physics here, but some tipping point in the finFET gate structure makes power draw go exponential at around the same frequency regardless of feature size. I'd expect the upcoming Zen 2 mobile / Navi 20 CU cards to be far, far more power efficient than even mobile Vega was. So Intel's "we can match AMD in mobile GPU!" claim isn't going to last long at all.
All nodes have an inflection point where power consumption goes super-linear, though finFETs may have a more pronounced rise past it.
AMD and others have warned that certain circuit parameters are not improving much, such as wire resistance and capacitance.
The wire component is not governed by the gate type, but finFETs do complicate the capacitance side.
Zen 2's designers commented that even the modest clock gains for the new core did take special effort to achieve in the face of poorer scaling of some facets of circuit performance. GCN's clock speeds are still far from the realm of those CPUs, but it's likely working with transistors sized for higher density and deals with a pipeline design that has more layers of logic and more distance for signals to travel versus CPU cores that tune things more narrowly.

Specifically, cache could be equated to fuel in a rocket: cache grows exponentially while logic grows linearly, so eventually cache is just going to dominate die space altogether versus logic.
The challenge is data movement, both in moving enough of it and moving it as little a distance as practical. In that regard, logic can easily scale demand without concerning itself with the question of how it can be fed efficiently.
GPUs don't quite hit the cache levels of CPUs because they focus on a particularly mathematically dense set of workloads, and also have higher demands in terms of raw bandwidth versus CPU caches whose hit rates are driven more by latency.

Anyone have any insight as to why the transistor count nearly doubled over Polaris?
More L2, additional L1s, more scalar hardware, more features in the CU, new memory type, more command processor and geometry hardware, more ROPs.
Higher clock targets can mean more transistors, as most of Vega's transistor gains over Fury were credited to buffers and wire-delay improvements rather than extra features.
The new node may have favored new implementations of logic blocks that added to the transistor count versus 14nm.
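
For scale, the jump being asked about (die figures from memory, so approximate rather than something stated in this thread):

# Approximate transistor counts, from memory.
polaris10_transistors = 5.7e9    # ~232 mm^2 on 14nm
navi10_transistors = 10.3e9      # ~251 mm^2 on 7nm
print(navi10_transistors / polaris10_transistors)   # ~1.8x: "nearly doubled"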
 
Do we know anything about DX12.1 feature levels? Is it still all Tier levels up, like Vega, or is there any regression? We know VRS won't be supported; anything else?
 