AMD: RDNA 3 Speculation, Rumours and Discussion

I’m no trinibwoy, but in comparison to Navi 21 on 7nm, the number of CUs has shrunk by 20%, the memory I/O is halved, and the cache may or may not be halved too (I hope to God not, given the memory interface). Since TSMC additionally claims that the 6nm tweak offers 18% higher logic density, the potential for a smaller die than 440mm² seems to be there. Of course, since we don’t know exactly what RDNA3 adds that may require additional gates, or to what extent, it really is anybody's guess at this point.
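
To make the hand-waving a bit more concrete, here is a back-of-the-envelope sketch of that scaling in Python. Every input here is an assumption for illustration, not a measurement: the split of Navi 21's die into CU array, GDDR PHY, Infinity Cache and "everything else" is a rough guess from annotated die shots, and I'm assuming the analog PHY barely shrinks while logic gets the claimed 18% density gain.

```python
# Back-of-the-envelope sketch of the argument above; all inputs are guesses.
NAVI21_DIE_MM2 = 520.0

# Assumed area breakdown of Navi 21 (rough guesses that sum to the die size).
cu_area    = 160.0   # 80 CUs at ~2 mm^2 each
phy_area   =  60.0   # 256-bit GDDR6 PHY + memory controllers
cache_area = 100.0   # 128 MB Infinity Cache
other_area = NAVI21_DIE_MM2 - cu_area - phy_area - cache_area

# Rumoured Navi 33 changes relative to Navi 21.
cu_scale     = 0.8       # 20% fewer CUs
phy_scale    = 0.5       # 128-bit bus: memory I/O halved
cache_scale  = 0.5       # if the cache really is halved
logic_shrink = 1 / 1.18  # TSMC's claimed 18% higher logic density on 6nm

# Analog PHY barely shrinks; logic (and, optimistically, SRAM) gets the 6nm gain.
navi33 = (cu_area * cu_scale * logic_shrink
          + phy_area * phy_scale
          + cache_area * cache_scale
          + other_area * logic_shrink)
print(f"rough Navi 33 estimate: {navi33:.0f} mm^2 "
      f"(vs Navi 21 at {NAVI21_DIE_MM2:.0f} mm^2 on 7nm)")
```

With these made-up inputs it lands around 360mm², comfortably below 440mm², but again, the unknowns are whatever RDNA3 adds on top.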

The smallest dies in a family always have a die size disproportionately larger than you'd expect from the ratio of CUs/WGPs/Shader engines compared to the larger ones, because the smallest ones have a larger percentage of the die taken up by items that take up a fixed area cost.

Think display engines and video encode/decode blocks in particular. The video encode/decode blocks were big enough that, in TU117, Nvidia went to all the effort of smushing Volta's encode block in there because it was smaller.

https://www.anandtech.com/show/14270/the-nvidia-geforce-gtx-1650-review-feat-zotac/2
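
To put a toy number on the fixed-cost argument: hold the non-scaling blocks constant and shrink only the shader array. All figures below are made up purely for illustration.

```python
# Toy illustration of why small dies look disproportionately large:
# fixed-area blocks dominate as the shader array shrinks.
FIXED_MM2 = 60.0      # display, video encode/decode, misc glue: roughly constant per SKU
PER_WGP_MM2 = 4.0     # shader array area per WGP (illustrative)

for wgps in (40, 20, 10):
    shader = wgps * PER_WGP_MM2
    die = FIXED_MM2 + shader
    print(f"{wgps:2d} WGPs: {die:5.0f} mm^2, fixed blocks = {FIXED_MM2/die:5.1%} of the die")
```

Same fixed blocks, but they go from roughly a quarter of the die to more than half of it as the WGP count drops.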
 
Uhm, why refer to the Nvidia GTX 1650 when we are comparing Navi 21 with the (rumored) Navi 33?
Here are two links to annotated die shots of Navi 21 (reddit, TechPowerUp) and, as you can see, for these products those fixed blocks are a very minor part of the die. So yes in general, but in this case, no.
The rumors surrounding Navi 33 describe a very cost-controlled product: 128-bit memory bus, 8GB of RAM, a very mature process node (TSMC 7nm has been in volume production for four years now). The production costs should be in the Navi 22/RX 6700 ballpark, possibly a bit lower.
 
https://gitlab.freedesktop.org/mesa...t_id=1a01566685c3d2651bbfc72738de6a1e38ba8251

Notes for GFX11 changes above:

CMASK/FMASK removed - RIP VK_AMD_shader_fragment_mask and MSAA in general

CB_RESOLVE removed - MSAA resolve is now done in software with the compute pipeline

NGG pipeline is always enabled - No fallback legacy geometry pipeline

Image descriptors are now just 32 bytes in size - Older HW generations used to have an option as high as 64 bytes to store FMASK information

Biggest change seems to be DCC functionality:
DCC for storage images is always enabled
Arbitrary DCC format reinterpretation applies to all formats
Does this mean DCC decompression never happens anymore on GFX11?
An obvious byproduct is that all D3D resources / VK image views can respectively be kept typeless / mutable for API usage convenience, without a performance impact.

Does this mean we can finally ditch D3D resource states and VK image layouts so that explicit APIs become simpler to use?
 
Why do you want a ROP-less GPU?
Because it's die space that's doing nothing a lot of the time.

It's similar to how GPUs changed from having dedicated vertex shader ALUs and dedicated pixel shader ALUs, in favour of a unified design. This was done because the old design always left ALUs sitting idle due to an imbalance between the vertex compute workload and the pixel compute workload. Either the vertex ALUs were fully occupied and the pixel shaders were wasting time, or vice versa.

So ROPs are die space that's spending a lot of time, per frame, doing nothing. Sure, there are bursts of "full utilisation", like shadow buffer fill, but those bursts don't bottleneck the entire duration of frame rendering.
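
A toy model of that utilisation argument, by analogy with the vertex/pixel split (all numbers invented): a fixed partition of hardware stalls on whichever pool the frame leans on, while a unified pool just chews through the total work.

```python
# Toy model of the unified-shader argument (all numbers illustrative).
VERTEX_ALUS, PIXEL_ALUS = 8, 24          # dedicated design
UNIFIED_ALUS = VERTEX_ALUS + PIXEL_ALUS  # same total hardware, unified

def frame_time(vertex_work, pixel_work):
    # Dedicated design finishes when the slower pool finishes; unified just
    # divides the total work across all ALUs.
    dedicated = max(vertex_work / VERTEX_ALUS, pixel_work / PIXEL_ALUS)
    unified = (vertex_work + pixel_work) / UNIFIED_ALUS
    return dedicated, unified

# Balanced frame, vertex-heavy frame, pixel-heavy frame.
for v, p in ((100, 300), (300, 100), (50, 350)):
    d, u = frame_time(v, p)
    print(f"vertex={v:3d} pixel={p:3d}: dedicated {d:5.1f} units, unified {u:5.1f} units")
```

The dedicated split only matches the unified design when the workload mix happens to line up with the hardware split; every other frame leaves silicon idle, which is the same complaint being made about ROPs.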
 
Because it's die space that's doing nothing a lot of the time.
Is it a lot of die space though? We're not exactly die space constrained these days and it is generally preferable to have a dedicated h/w implementation if that saves you a lot of power (i.e. cycles on the main math pipeline).
 
Is it a lot of die space though?

[annotated die shot]
Looks like quite a lot of die space to me in Navi 21.
 
Looks like quite a lot of die space to me in Navi 21.
So that's like 8 additional "CUs"? +10% of math throughput, likely lowered by the additional power draw's hit on actual clocks to something like +5% - basically unnoticeable in gaming, while the need to perform all raster operations on the main math h/w would likely be very noticeable.

I'd expect that to be a net loss in performance with very minor gains in some limited scenarios.
 
N21 is 40 WGPs / 80 CUs, no?
Sorry, I made a mistake. That's a picture of Navi 10, not Navi 21.

So, in that picture, the RBEs take up about 20% of the die space used by WGPs, or an area equivalent to about 8 CUs. Navi 21 has twice the CUs and twice the RBEs, so in Navi 21 it's also 20%.
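
For what it's worth, here's that arithmetic spelled out, using the rough ~2mm² per-CU figure as an assumption:

```python
# Arithmetic behind the "about 8 CUs" and "also 20%" figures above.
cu_area_mm2 = 2.0                              # rough per-CU area (assumption)

for name, cus in (("Navi 10", 40), ("Navi 21", 80)):
    wgp_array = cus * cu_area_mm2              # area of the CU/WGP array
    rbe_area = 0.20 * wgp_array                # RBEs ~20% of that array area
    print(f"{name}: CU array ~{wgp_array:.0f} mm^2, "
          f"RBEs ~{rbe_area:.0f} mm^2 ~= {rbe_area / cu_area_mm2:.0f} CUs")
```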
 
Wow this seems epic. My dream of a ROP-less GPU gets closer. If it truly is ROP-less then that'll make my year.
Depth and stencil can probably get away with buffing the unordered memory atomics. Color ROPs seem difficult, though. ROV is the only existing instrument that can emulate a color ROP, but not even Intel, who introduced it (as PixelSync), went down a fully programmable path with their latest Arc architecture...
 
When you say difficult, do you mean algorithm or quantity of work? Is your concern about ordering? Or volume of cache/buffer data required while fragments are in flight?

With some kind of tiled rasterisation, triangle ordering and blending mode is clear before the first fragment shader instruction is issued. That information could live for the lifetime of the fragment, until it is written to the render target. Yes, it's an overhead, but the buffering required to perform tiled rasterisation is already a substantial overhead, what with vertex attributes having an indeterminate lifetime.
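
A minimal sketch of that idea in Python: bin fragments to screen tiles, then resolve blending inside each tile in primitive-submission order, so no separate ROP ordering hardware is needed. The data layout and the blend mode here are invented purely for illustration.

```python
# Sketch: tiled binning followed by per-tile, in-order blend resolve.
from collections import defaultdict

TILE = 8  # 8x8 pixel tiles

def bin_fragments(fragments):
    """fragments: iterable of (prim_id, x, y, rgba). Returns tile -> fragment list."""
    tiles = defaultdict(list)
    for prim_id, x, y, rgba in fragments:
        tiles[(x // TILE, y // TILE)].append((prim_id, x, y, rgba))
    return tiles

def resolve_tile(frags, background=(0.0, 0.0, 0.0, 1.0)):
    """Alpha-blend fragments per pixel in primitive order (what a ROP would guarantee)."""
    pixels = {}
    for prim_id, x, y, (r, g, b, a) in sorted(frags):  # sort restores submission order
        dr, dg, db, _ = pixels.get((x, y), background)
        pixels[(x, y)] = (r * a + dr * (1 - a),
                          g * a + dg * (1 - a),
                          b * a + db * (1 - a),
                          1.0)
    return pixels

# Fragments listed out of submission order; the per-tile resolve fixes that.
frags = [(1, 3, 4, (1, 0, 0, 0.5)), (0, 3, 4, (0, 0, 1, 1.0))]
for tile, fl in bin_fragments(frags).items():
    print(tile, resolve_tile(fl))
```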

I think delta colour compression might be the most fiddlesome aspect of deleted ROPs.

In the end, I'm working on the theory that RDNA 3 GPUs have a fat cache hierarchy (starting at L1, where RDNA currently seems weak) that will support tiled rasterisation and perhaps ray sorting/grouping, so this might extend naturally to ROP-less hardware. It seems to me one of the key mistakes AMD has been making is to keep L1 and LDS separate from each other and effectively to lock them in size - NVidia's floating boundary seems much smarter to me and it seems crucial to tiled rasterisation in those GPUs.

Thinking about it, I'm kind of surprised NVidia hasn't already done a ROP-less consumer GPU (i.e. ignoring data-centre GPUs which might have rasterisers), since robust tiling has been around for so long. Well, as the quantity of compute in a GPU crosses a threshold, then it seems to make no sense to retain ROPs, so Ada/RDNA 3 might be where we see the ROP-less revolution. Fingers-crossed.

Meanwhile, I'm going to assume Arc is just a market-beta GPU, to get Intel past the curse of Larrabee.
 
When you say difficult, do you mean algorithm or quantity of work? Is your concern about ordering? Or volume of cache/buffer data required while fragments are in flight?

IIRC, ROP guarantees deterministic blending results by scoreboarding the results and blending them by the API submission order as they come through. This allows overlapping fragment shader invocations to be running in parallel and completing out of order.

ROV strives to provide a programmable in-shader solution for them, but evidently ROV on AMD GPUs has not been as performant as Nvidia or Intel implementations, and it has remained unimplemented in Vulkan. So I don't see a path to dropping ROPs completely, unless people move away from fixed-function blending with its inherently deterministic order (and even then, a huge bank of existing software depends on it), or somehow a hardware technique that drastically improves ROV is discovered.
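
For anyone wondering why that ordering guarantee matters at all: standard "over" alpha blending is not commutative, so if fragments were allowed to retire in shader-completion order the image would be non-deterministic. A tiny illustration, with invented numbers:

```python
# Why blending order matters: "over" compositing is not commutative.
def over(dst, src):
    sr, sg, sb, sa = src
    dr, dg, db, _ = dst
    return (sr * sa + dr * (1 - sa),
            sg * sa + dg * (1 - sa),
            sb * sa + db * (1 - sa), 1.0)

background = (0.0, 0.0, 0.0, 1.0)
prim0 = (0.0, 0.0, 1.0, 1.0)   # opaque blue, submitted first
prim1 = (1.0, 0.0, 0.0, 0.5)   # translucent red, submitted second

api_order = over(over(background, prim0), prim1)         # what the app expects
completion_order = over(over(background, prim1), prim0)  # out-of-order retire
print(api_order)          # (0.5, 0.0, 0.5, 1.0)
print(completion_order)   # (0.0, 0.0, 1.0, 1.0) - a different image
```

The ROP's scoreboarding (or a correct ROV implementation) exists precisely to force the first result, regardless of which fragment shader invocation finishes first.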
 
Underutilized silicon is fine for the most part because heat and power density in modern chips have become so high. Making every transistor do work every clock would create an uncoolable chip guzzling a massive amount of power. Obviously you don't want significant underutilization either, but dark silicon is part of modern chip design and there really isn't much of a way to optimize it away.
 
IIRC, ROP guarantees deterministic blending results by scoreboarding the results and blending them by the API submission order as they come through. This allows overlapping fragment shader invocations to be running in parallel and completing out of order.

ROV strives to provide a programmable in-shader solution for them, but evidently ROV on AMD GPUs has not been as performant as Nvidia or Intel implementations, and it has remained unimplemented in Vulkan.
AMD clearly supports it on D3D. "Performance is bad" is not an excuse for never making performance better. See tessellation. See, hopefully, ray tracing!

So I don't see a path to dropping ROP completely, unless we either say people will move away fixed-function blending with inherently a deterministic order (even so, a huge bank of existing software depends on it), or somehow a hardware technique to improve ROV drastically is discovered.
Tiling,...

Remember hardware isn't magic. There is no class of algorithms that hardware gets access to that software is entirely blocked from. Even if the algorithm is dependent upon a piece of hardware or is dependent upon a memory layout.

This slide deck is fun:

Implementing old-school graphics chips with Vulkan compute (themaister.net)

He covers the options really nicely and dives into obscure topics such as subgroups and quad_perm on AMD.

AMD GCN Assembly: Cross-Lane Operations - GPUOpen

You would expect AMD to make use of these low-level intrinsics in a ROP-less implementation.
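
For anyone unfamiliar with them, here's a toy Python model of what a quad_perm-style cross-lane operation does - just the data movement, with the usual caveats that the real thing is a DPP modifier on a VALU instruction and that the pixel-quad lane layout below is an assumption made for the example.

```python
# Toy simulation of a quad_perm cross-lane operation: within each group of
# 4 lanes, values are rearranged by a 4-entry selector, no trip through memory.
def quad_perm(wave, sel):
    """wave: per-lane values; sel: e.g. (1, 0, 3, 2) swaps horizontal lane pairs."""
    out = []
    for base in range(0, len(wave), 4):
        quad = wave[base:base + 4]
        out.extend(quad[s] for s in sel)   # lane i reads from quad lane sel[i]
    return out

lanes = list(range(8))                     # two quads of a wavefront
print(quad_perm(lanes, (1, 0, 3, 2)))      # [1, 0, 3, 2, 5, 4, 7, 6]

# Example use: a horizontal difference within each pixel quad, assuming lanes
# 0/1 and 2/3 of a quad are horizontal neighbours.
values = [0.0, 1.0, 2.0, 4.0, 10.0, 11.0, 12.0, 14.0]
swapped = quad_perm(values, (1, 0, 3, 2))
print([abs(a - b) for a, b in zip(values, swapped)])
```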

NGG culling replaces hardware culling in RDNA 3. NGG has taken its time but it got there eventually (according to rumours) ...

Making every transistor do work every clock would create an uncoolable chip guzzling a massive amount of power.
Yet, somehow, GPUs keep gaining ALU lanes as process nodes improve and the number of transistors per unit die area increases. And clock frequencies keep increasing too.
 
We know that a CU in Navi 21 is about 2mm². So 80 of them take about 160mm² (less than 1/3 of the die).

At TSMC, 5nm is claimed to offer 1.84× the logic density of 7nm, "worst case". So those 80 Navi 21 CUs would shrink to about 90mm².

So the simple baseline for the 12,288 ALU lanes that are the hot topic of current rumours (2.4× Navi 21's 5,120 lanes) would amount to 216mm². Add 30% for shader engine stuff, such as fine-grained rasterisation, RBEs and L2 cache, to get to 281mm².

Splitting that into two GCDs we get around, say, 150mm² per GCD.

Then I suppose we'd be looking at around 125mm² for each of 4 cache chiplets (assuming they each have about 20mm² of GDDR PHY) and maybe another 150mm² for an IO chiplet (which also has global work scheduling responsibilities), all on 6nm.

So that's about 900 to 950mm² of GPU chiplets, with about 300mm² at 5nm, assuming that 7 chiplets is still the rumour.
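
Spelling the same arithmetic out, with every input being a rumour or an assumption from the post above rather than a known fact (the per-GCD and total figures here round a little lower than the rounded-up numbers quoted):

```python
# The chiplet area budget above, spelled out. All inputs are rumours/assumptions.
navi21_cu_area = 80 * 2.0              # ~2 mm^2 per CU on 7nm -> 160 mm^2
n5_density = 1.84                      # TSMC's claimed worst-case density gain
cu_area_5nm = navi21_cu_area / n5_density          # ~90 mm^2 for 5,120 lanes

alu_lanes = 12288                                  # rumoured RDNA 3 total
alu_area = cu_area_5nm * alu_lanes / (80 * 64)     # scale by lane count -> ~216 mm^2
gcd_total = alu_area * 1.30                        # +30% for SE logic, RBEs, L2
per_gcd = gcd_total / 2                            # two GCDs

cache_chiplets = 4 * 125                           # four 6nm cache dies incl. GDDR PHY
io_die = 150                                       # 6nm IO / scheduling die
total = gcd_total + cache_chiplets + io_die
print(f"two GCDs of ~{per_gcd:.0f} mm^2 each at 5nm ({gcd_total:.0f} mm^2 total), "
      f"~{total:.0f} mm^2 across all 7 chiplets")
```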
 
So that's about 900 to 950mm² of GPU chiplets, with about 300mm² at 5nm, assuming that 7 chiplets is still the rumour.
I'm not holding my breath for a 7-chiplet solution. I think 2 G(raphics)C(ompute)D(ies) plus one IOD with all the cache (which could possibly be a 4th die 3D-stacked on top of the IOD) is more likely for a first-generation solution.
 