AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    55
    Likes Received:
    115
    The smallest dies in a family always have a die size disproportionately larger than you'd expect from the ratio of CUs/WGPs/shader engines versus the larger ones, because the small dies have a larger percentage of their area taken up by blocks with a fixed area cost.

    Think display engines and video encode/decode blocks in particular. The video encode/decode blocks were big enough that in TU117 Nvidia went to all the effort to smush Volta's encode block in there because it was smaller.

    https://www.anandtech.com/show/14270/the-nvidia-geforce-gtx-1650-review-feat-zotac/2
     
    Man from Atlantis likes this.
  2. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,360
    Likes Received:
    1,377
    Uhm, why refer to the Nvidia GTX1650 when we are comparing Navi 21 with (rumored) Navi 33?
    Here are two links to annotated die shots of Navi 21 (reddit, TechPowerUp) and as you see, for these products, those fixed blocks are a very minor part of the die. So yes in general, but in this case - no.
    The rumors surrounding Navi 33 describe a very cost controlled product, 128-bit memory bus, 8GB of RAM, very mature process node (TSMC 7nm has been in volume production for four years now) - the production costs should be in the Navi 22/RX6700 ballpark, possibly a bit lower.
     
  3. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    https://gitlab.freedesktop.org/mesa...t_id=1a01566685c3d2651bbfc72738de6a1e38ba8251

    Notes for GFX11 changes above:

    CMASK/FMASK removed - RIP VK_AMD_shader_fragment_mask and MSAA in general

    CB_RESOLVE removed - MSAA resolve is now done in software with the compute pipeline

    NGG pipeline is always enabled - No fallback legacy geometry pipeline

    Image descriptors are now just 32 bytes in size - Older HW generations used to have an option as high as 64 bytes to store FMASK information

    Biggest change seems to be DCC functionality:
    DCC for storage images is always enabled
    Arbitrary DCC format reinterpretation applies to all formats
    Does this mean DCC decompression never happens anymore on GFX11?
    Obvious byproduct is that all D3D resources/VK image views can be respectively kept typeless/mutable for API usage convenience without performance impact

    Does this mean we can finally ditch D3D resource states and VK image layouts so that explicit APIs become simpler to use?
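    For anyone unfamiliar with what "MSAA resolve in software" amounts to, here's a minimal sketch (plain Python standing in for a compute shader, with made-up data - not Mesa's actual code) of the box-filter average a compute-based resolve performs per pixel:

```python
# Hypothetical sketch: the box-filter MSAA resolve a compute shader would
# perform per pixel once the fixed-function CB_RESOLVE path is gone.

def resolve_msaa(samples, width, height, num_samples):
    """samples[y][x] is a list of num_samples colour tuples (r, g, b)."""
    resolved = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Equivalent of one compute invocation per pixel:
            # load all samples, average them, store one resolved texel.
            acc = [0.0, 0.0, 0.0]
            for s in samples[y][x]:
                for c in range(3):
                    acc[c] += s[c]
            resolved[y][x] = tuple(c / num_samples for c in acc)
    return resolved

# 1x1 image, 4x MSAA: two black and two white samples resolve to mid-grey.
img = [[[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]]
print(resolve_msaa(img, 1, 1, 4))  # [[(0.5, 0.5, 0.5)]]
```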
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Wow this seems epic. My dream of a ROP-less GPU gets closer. If it truly is ROP-less then that'll make my year.
     
    Lightman, SpeedyGonzales and Krteq like this.
  5. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,401
    Likes Received:
    1,845
    Location:
    France

    Why do you want a ROP-less GPU?
     
    DegustatoR and PSman1700 like this.
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Because it's die space that's doing nothing a lot of the time.

    It's similar to how GPUs changed from having dedicated vertex shader ALUs and dedicated pixel shader ALUs to a unified design. The old split always left ALUs sitting idle because of the imbalance between vertex and pixel compute workloads: either the vertex ALUs were fully occupied while the pixel ALUs wasted time, or vice versa.

    So ROPs are die space that's spending a lot of time, per frame, doing nothing. Sure, there are bursts of "full utilisation", like shadow buffer fill, but those bursts don't bottleneck the entire duration of frame rendering.
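    The utilisation argument is easy to put numbers on. A toy model (my own illustrative figures, nothing from AMD):

```python
# Toy model: why a unified ALU pool beats fixed vertex/pixel pools when
# the workload mix varies per frame. Numbers are arbitrary illustrations.

def frame_time_split(vertex_work, pixel_work, vertex_alus, pixel_alus):
    # Each pool finishes its own work; the frame waits on the slower pool.
    return max(vertex_work / vertex_alus, pixel_work / pixel_alus)

def frame_time_unified(vertex_work, pixel_work, total_alus):
    # A unified pool chews through the combined workload with no idle ALUs.
    return (vertex_work + pixel_work) / total_alus

# 10 units of vertex work, 90 of pixel work, 100 ALUs in total:
print(frame_time_split(10, 90, vertex_alus=50, pixel_alus=50))  # 1.8 (pixel pool bottlenecks)
print(frame_time_unified(10, 90, total_alus=100))               # 1.0
```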
     
  7. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,402
    Is it a lot of die space though? We're not exactly die space constrained these days and it is generally preferable to have a dedicated h/w implementation if that saves you a lot of power (i.e. cycles on the main math pipeline).
     
    PSman1700 likes this.
  8. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,401
    Likes Received:
    1,845
    Location:
    France
    Wait, aren't the ROPs responsible for all the texturing stuff, and don't they determine the fillrate of a card?
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    [image: annotated die shot]

    Looks like quite a lot of die space to me in Navi 21.
     
  10. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,402
    So that's like 8 additional "CUs"? +10% of math throughput, likely cut to something like +5% on actual clocks by the additional power draw - basically unnoticeable in gaming. Meanwhile the cost of performing all raster operations on the main math h/w will likely be very noticeable.

    I'd expect that to be a net loss in performance with very minor gains in some limited scenarios.
     
    PSman1700 likes this.
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Yeah.

    +20%
     
  12. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,402
    N21 is 40 WGPs / 80 CUs, no?
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Sorry, I made a mistake. That's a picture of Navi 10, not Navi 21.

    So, in that picture, the RBEs take up about 20% of the die space used by WGPs, or, an area equivalent to about 8 CUs. Navi 21 has twice the CUs and twice the RBEs, so in Navi 21 it's also 20%.
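    The arithmetic, spelled out (using the rounded figures from the post):

```python
# Back-of-envelope check of the figures above (rounded Navi 10 numbers):
# if the RBEs occupy ~20% of the area the WGPs use, that's ~8 CU-equivalents
# on a 40-CU die, and doubling both (Navi 21) leaves the ratio at 20%.

navi10_cus = 40
rbe_share_of_wgp_area = 0.20
rbe_in_cu_equivalents = navi10_cus * rbe_share_of_wgp_area
print(rbe_in_cu_equivalents)  # 8.0

navi21_cus = 80  # twice the CUs, twice the RBEs
navi21_rbe_cu_equiv = navi21_cus * rbe_share_of_wgp_area
print(navi21_rbe_cu_equiv / navi21_cus)  # still 0.2
```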
     
  14. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
    Depth and stencil can probably get away with buffing up unordered memory atomics. Color ROPs seem difficult, though. ROV is the only existing instrument that can emulate a color ROP, but not even Intel, who introduced the concept (as PixelSync), went down a fully programmable path with their latest Arc architecture...
     
    Lightman likes this.
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    When you say difficult, do you mean algorithm or quantity of work? Is your concern about ordering? Or volume of cache/buffer data required while fragments are in flight?

    With some kind of tiled rasterisation, triangle ordering and blending mode is clear before the first fragment shader instruction is issued. That information could live for the lifetime of the fragment, until it is written to the render target. Yes, it's an overhead, but the buffering required to perform tiled rasterisation is already a substantial overhead, what with vertex attributes having an indeterminate lifetime.

    I think delta colour compression might be the most fiddlesome aspect of deleted ROPs.

    In the end, I'm working on the theory that RDNA 3 GPUs have a fat cache hierarchy (starting at L1, where RDNA currently seems weak) that will support tiled rasterisation and perhaps ray sorting/grouping, so this might extend naturally to ROP-less hardware. It seems to me one of the key mistakes AMD has been making is to keep L1 and LDS separate from each other and effectively to lock them in size - NVidia's floating boundary seems much smarter to me and it seems crucial to tiled rasterisation in those GPUs.

    Thinking about it, I'm kind of surprised NVidia hasn't already done a ROP-less consumer GPU (i.e. ignoring data-centre GPUs which might have rasterisers), since robust tiling has been around for so long. Well, as the quantity of compute in a GPU crosses a threshold, then it seems to make no sense to retain ROPs, so Ada/RDNA 3 might be where we see the ROP-less revolution. Fingers-crossed.

    Meanwhile, I'm going to assume Arc is just a market-beta GPU, to get Intel past the curse of Larrabee.
     
    Lightman likes this.
  16. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
    IIRC, ROP guarantees deterministic blending results by scoreboarding the results and blending them by the API submission order as they come through. This allows overlapping fragment shader invocations to be running in parallel and completing out of order.

    ROV strives to provide a programmable in-shader solution for this, but evidently ROV on AMD GPUs has not been as performant as Nvidia's or Intel's implementations, and it has remained unimplemented in Vulkan. So I don't see a path to dropping ROPs completely, unless we either say people will move away from fixed-function blending, with its inherently deterministic order (and even so, a huge bank of existing software depends on it), or somehow a hardware technique that drastically improves ROV is discovered.
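    A little sketch of the scoreboarding idea (illustrative Python, not real hardware): fragment results for a pixel may arrive out of order, but blends are applied strictly in submission order, which keeps the result deterministic even for non-commutative blend modes:

```python
# Illustrative sketch of ROP scoreboarding: fragment shaders for one pixel
# finish out of order, but the "ROP" buffers early arrivals and applies the
# blend strictly in API submission order.

def ordered_blend(completions, blend):
    """completions: list of (submission_index, colour) in *completion* order.
    Blends into dest by submission_index, buffering results that arrive early."""
    dest = 0.0
    pending = {}
    next_to_blend = 0
    for idx, colour in completions:
        pending[idx] = colour
        # Drain every result that is now ready, in submission order.
        while next_to_blend in pending:
            dest = blend(dest, pending.pop(next_to_blend))
            next_to_blend += 1
    return dest

# An order-dependent blend mode: dest = dest * 0.5 + src.
blend = lambda d, s: d * 0.5 + s

# Fragments 0..2 complete in the order 2, 0, 1 - same answer either way.
out_of_order = [(2, 1.0), (0, 2.0), (1, 4.0)]
in_order     = [(0, 2.0), (1, 4.0), (2, 1.0)]
print(ordered_blend(out_of_order, blend) == ordered_blend(in_order, blend))  # True
```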
     
    PSman1700 likes this.
  17. Esrever

    Regular

    Joined:
    Feb 6, 2013
    Messages:
    846
    Likes Received:
    647
    Underutilized silicon is fine for the most part, because heat and power density in modern chips have become so high. Making every transistor do work every clock would create an uncoolable chip guzzling massive amounts of power. Obviously you don't want significant underutilization either, but dark silicon is part of modern chip design and there really isn't much of a way to optimize it away.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    AMD clearly supports it on D3D. "Performance is bad" is not an excuse for never making performance better. See tessellation. See, hopefully, ray tracing!

    Tiling,...

    Remember hardware isn't magic. There is no class of algorithms that hardware gets access to that software is entirely blocked from. Even if the algorithm is dependent upon a piece of hardware or is dependent upon a memory layout.

    This slide deck is fun:

    Implementing old-school graphics chips with Vulkan compute (themaister.net)

    He covers the options really nicely and dives into obscure topics such as subgroups and quad_perm on AMD.

    AMD GCN Assembly: Cross-Lane Operations - GPUOpen

    You would expect AMD to make use of these low-level intrinsics in a ROP-less implementation.
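    For the curious, the pattern those intrinsics enable looks something like this (a toy Python emulation of a butterfly-style cross-lane reduction; real GCN/RDNA code would use ds_swizzle/DPP as the GPUOpen article describes, and the 8-lane "wave" width here is just for illustration):

```python
# Toy emulation (plain Python, not GCN assembly) of a cross-lane butterfly
# reduction: log2(n) shuffle+add steps sum across a "wave" with no memory
# traffic, the kind of trick a ROP-less blend path could lean on.

def wave_reduce_sum(lanes):
    n = len(lanes)  # must be a power of two, like a real wave
    vals = list(lanes)
    offset = 1
    while offset < n:
        # Each lane reads its XOR-partner's value (a cross-lane "shuffle")...
        shuffled = [vals[i ^ offset] for i in range(n)]
        # ...and accumulates it, doubling the covered distance each step.
        vals = [v + s for v, s in zip(vals, shuffled)]
        offset *= 2
    return vals[0]  # every lane now holds the full sum

print(wave_reduce_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```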

    NGG culling replaces hardware culling in RDNA 3. NGG has taken its time but it got there eventually (according to rumours) ...

    Yet, somehow, GPUs keep gaining ALU lanes as process nodes improve and the amount of transistors per unit die area increases. And, clock frequencies keep increasing too.
     
    Lightman, Rootax, Krteq and 2 others like this.
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    We know that a CU in Navi 21 is about 2mm². So 80 of them take about 160mm² (less than 1/3 of the die).

    At TSMC, 5nm has a claimed logic density scaling of 1.84x, "worst case", versus 7nm. So Navi 21's CUs would shrink to about 90mm².

    So the simple baseline for 12,288 ALU lanes that is the hot topic of current rumours would amount to 216mm². Add 30% for shader engine stuff, such as fine-grained rasterisation, RBEs and L2 cache, to get to 281mm².

    Splitting that into two GCDs we get around, say, 150mm² per GCD.

    Then I suppose we'd be looking at around 125mm² for each of 4 cache chiplets (assuming they each have about 20mm² of GDDR PHY) and maybe another 150mm² for an IO chiplet (which also has global work scheduling responsibilities), all on 6nm.

    So that's about 900 to 950mm² of GPU chiplets, with about 300mm² at 5nm, assuming that 7 chiplets is still the rumour.
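    Putting the same estimate in one place (all inputs are the rumoured/assumed figures from this post; the post rounds up at each step, so the exact totals here land slightly lower):

```python
# Reproducing the chiplet area estimate above. Every input is a rumoured or
# assumed figure from the post, not a confirmed spec.

lanes = 12288
navi21_cu_mm2 = 2.0                       # ~2 mm^2 per CU on N7
n5_density = 1.84                         # TSMC's "worst case" density claim vs N7

cu_area_5nm = 80 * navi21_cu_mm2 / n5_density   # ~87 mm^2 for 5120 lanes on 5nm
shader_area = cu_area_5nm * (lanes / 5120)      # scale up to 12288 lanes
shader_area *= 1.30                             # +30% shader engine overhead
per_gcd = shader_area / 2                       # split across two GCDs

cache_chiplets = 4 * 125                        # 6nm cache dies, incl. ~20 mm^2 PHY each
io_die = 150                                    # 6nm IO / scheduling die
total = 2 * per_gcd + cache_chiplets + io_die
print(round(per_gcd), round(total))             # 136 921
```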
     
    #1239 Jawed, May 6, 2022
    Last edited: May 6, 2022
    Lightman likes this.
  20. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    I'm not holding my breath for a 7-chiplet solution. I think two GCDs (graphics compute dies) plus one IOD carrying all the cache (the cache could possibly be a fourth die 3D-stacked on top of the IOD) is more likely for a first-generation solution.
     