AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    So, we're back to 4 SIMDs per CU (or whatever arbitrary name you will call it). Now all we need is a 4-cycle round robin cadence to be back at GCN with 4x throughput. ;)
     
    Lightman likes this.
  2. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    That's the Work Group Processor, where two CUs are tightly integrated (sharing constant/instruction caches and the LDS). I think this is analogous to Nvidia's TPC, where two SMs are chained together (three in GT200).
     
    PSman1700, Krteq and DegustatoR like this.
  3. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    No, that's per 'CU'.
    gfx11 WGP is 8 SIMD32 or so.
    TPC is just a physical layout thingy; you can't just double the shmem available to a single SM with a toggle.
    Similar enough I guess?
     
    Lightman likes this.
  4. tsa1

    Newcomer

    Joined:
    Oct 8, 2020
    Messages:
    89
    Likes Received:
    97
    I'd say some techtubers already did that with a smattering of auto-translated Chinese forum posts (along with good ol' FUD).
     
  5. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    Triple confirmed: RDNA 3 has 128-wide wavefronts and yet again zero-cycle ALU instruction latency. :mrgreen:
     
    Lightman and CarstenS like this.
  6. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    Not sure whether the programming model reflects your speculation:
    * Memory access within a workgroup is coherent only for requests skipping L0$, or atomic operations. (except when you specify the CU mode at dispatch time, presumably).
    * Barriers are monitored at the WGP level anyway, since it is a workgroup primitive.
    * LDS are equally reachable by all SIMDs in WGP.

    Though the ISA documentation does say explicitly that CU mode can lead to higher effective LDS bandwidth. But that sounds more like a specific matter of certain memory access patterns disfavoring(?) the WGP mode of the LDS, rather than something to do with some unspecified tight integration within a WGP half.
     
    #506 pTmdfx, Jul 26, 2021
    Last edited: Jul 26, 2021
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I was referring to CU mode, not WGP mode, though I admit I didn't reinforce that constraint at that point.

    In CU mode barriers are CU level.

    Only in WGP mode.

    There are two LDS arrays inside the LDS, and each array is localised to its parent CU. So in CU mode you get doubled bandwidth.

    I didn't see the note in the ISA about reduced bandwidth in WGP mode, but that merely backs up what I was saying before.
     
  8. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    The nuance here is that these are all currently execution mode differences at runtime, which can be emulated just fine by actual architectural hardware features at the WGP level. You are trying to extrapolate that there are more benefits from this mode, which is possible. But the reason can also be as simple as a compatibility mode for kernels previously written with certain programming model assumptions that have been around for a decade (e.g. wavefront size = 64), which isn't something a shader compiler alone can fix/patch transparently.

    The whitepaper says: "Each compute unit has access to double the LDS capacity and bandwidth". It does not seem like CU mode or not should matter for peak throughput. After all, in WGP mode, you are meant to be able to address all of the LDS, which means both arrays collectively have to be capable of delivering 2x 128B/cycle to either of the SIMD pairs anyway.

    If anything could hamper bandwidth, it is perhaps the documented fact that a SIMD pair shares the request & return buses, in which case effective bandwidth can indeed be halved in non-ideal scenarios where all active work-items in a workgroup are biased towards either of the two SIMD pairs. But even then, it does not fit the "localized LDS = CU mode better" theory.
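To make the bus-sharing argument above concrete, here is a toy Python model. It is only a sketch under stated assumptions: the 2x 128B/cycle figure comes from the post, but the per-pair bus width (`BUS_BYTES_PER_CYCLE`) and the function name are hypothetical and are not taken from the ISA documentation or the whitepaper.

```python
# Toy model of LDS bandwidth in a WGP. Not a description of real hardware.
LDS_ARRAYS = 2
BYTES_PER_CYCLE_PER_ARRAY = 128   # "2x 128B/cycle" as quoted in the post
SIMD_PAIRS = 2
BUS_BYTES_PER_CYCLE = 128         # ASSUMPTION: each SIMD pair's shared bus carries 128 B/cycle


def effective_lds_bandwidth(active_simd_pairs: int) -> int:
    """Bytes per cycle reachable when only some SIMD pairs issue LDS requests."""
    if not 0 <= active_simd_pairs <= SIMD_PAIRS:
        raise ValueError("active_simd_pairs must be between 0 and SIMD_PAIRS")
    peak = LDS_ARRAYS * BYTES_PER_CYCLE_PER_ARRAY
    bus_limit = active_simd_pairs * BUS_BYTES_PER_CYCLE
    # The slower of the two (array output vs. shared buses) sets the rate.
    return min(peak, bus_limit)


balanced = effective_lds_bandwidth(2)   # work spread over both SIMD pairs
skewed = effective_lds_bandwidth(1)     # all active work-items on one pair
print(balanced, skewed)
```

Under these assumptions the skewed case gets half the balanced rate, which is the "halved" scenario described above. The limit comes from the shared bus, not from which LDS array holds the data.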
     
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Yep, I meant WGP cache so L0.

    How did you get more WGPs in Navi 31? Navi 21 has 40, latest rumours say Navi 31 has only 30.

    I'm not following your thought process. Of course wavefronts within a workgroup can be allocated to different SIMDs within an SM/CU - this has been the case forever. What is it that we need to infer exactly?

    I'm referring to your use of the term workgroup in the post that I quoted. You seem to be treating them as groups of threads that must be launched and retired together and that reside on the same SIMD - this is a wavefront not a workgroup. AMD explicitly says that 64-wide wavefronts run on the same 32-wide SIMD and just take multiple cycles to retire each instruction. There is no suggestion that multiple SIMDs will co-operate in the execution of such a wavefront. A workgroup per the OpenCL spec is the equivalent of a CUDA block and consists of multiple wavefronts that can be distributed across the SIMDs in an SM/WGP. So it's a bit confusing as to what you're referring to when you describe groups of threads. Do you mean wavefronts (bound to a SIMD) or workgroups (bound to a WGP)?
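The wavefront/workgroup distinction above can be put into numbers with a short Python sketch. The 32-wide SIMD and the 64-wide wavefront come from the post; the 256 work-item workgroup and the function names are illustrative assumptions, and "passes" is a simplification of "takes multiple cycles to retire each instruction".

```python
import math

SIMD_WIDTH = 32  # lanes per SIMD, as stated in the post


def wavefronts_per_workgroup(workgroup_size: int, wave_size: int) -> int:
    """A workgroup is split into wavefronts; each wavefront is bound to one SIMD."""
    return math.ceil(workgroup_size / wave_size)


def passes_per_instruction(wave_size: int, simd_width: int = SIMD_WIDTH) -> int:
    """How many trips through the SIMD one wavefront needs per instruction."""
    return math.ceil(wave_size / simd_width)


# Hypothetical 256 work-item workgroup, launched as wave64 and as wave32.
for wave_size in (64, 32):
    waves = wavefronts_per_workgroup(256, wave_size)
    passes = passes_per_instruction(wave_size)
    print(f"wave{wave_size}: {waves} wavefronts, {passes} pass(es) per instruction")
```

In both cases the wavefronts can be spread across the SIMDs of the WGP, but no single wavefront is ever split across SIMDs.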
     
  10. Digidi

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    428
    Likes Received:
    239
    If AMD puts more shaders in each WGP, how does this affect their ray tracing approach? I thought they needed more WGPs for ray tracing, not fewer?

    My second question: the front end is also a black box to me. In the driver you always find the hint that there are 4 rasterizers but 8 scan converters. The scan converter is the main part that transforms polygons into pixels. So when 1 polygon comes from the rasterizer but you have 2 scan converters, does 1 scan converter run empty?
     
  11. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    It's one RA per CU as it is. Or two RAs per WGP. They are closely coupled to, if not an integral part of, the TMU. If they don't change the ALU:TEX ratio, there's no reason to believe RT performance won't scale with ALU performance. Bigger ∞$ would only help RT perf, especially if they can make the BVH stick in that $.
     
    Digidi likes this.
  12. Leoneazzurro5

    Regular

    Joined:
    Aug 18, 2020
    Messages:
    335
    Likes Received:
    348
    30 per die, but Navi31 seems to have two dies, so the total is 60.
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Oh yes that’s right. Though that’s not 3x Navi 21 if latest rumors are accurate.
     
  14. It's 60 WGPs but each WGP now has 256 ALUs, whereas one WGP in Navi 21 has 128 ALUs.

    5120 ALUs on Navi 21 vs. 15360 ALUs on dual-chiplet Navi 31. Furthermore, Navi 31 has up to 512MB Infinity Cache, 4x the 128MB in Navi 21.

    We should also expect N31's clocks to reach higher frequencies, since it's made on N5P instead of Navi 21's N7P.

    It should be >3x Navi 21, even if power consumption jumps to RTX 3090 levels.
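The arithmetic behind those totals can be checked with a few lines of Python. All inputs are the rumoured figures quoted in this thread (40 WGPs for Navi 21, 30 WGPs per die and two dies for Navi 31), not confirmed specifications. The variable names are only for illustration.

```python
# Rumoured figures from the thread; none of these are confirmed specs.
NAVI21_WGPS = 40
NAVI21_ALUS_PER_WGP = 128
NAVI31_WGPS = 30 * 2          # 30 WGPs per die, two dies (rumour)
NAVI31_ALUS_PER_WGP = 256     # rumour

NAVI21_CACHE_MB = 128
NAVI31_CACHE_MB = 512         # "up to", per the post


def total_alus(wgps: int, alus_per_wgp: int) -> int:
    return wgps * alus_per_wgp


navi21_alus = total_alus(NAVI21_WGPS, NAVI21_ALUS_PER_WGP)
navi31_alus = total_alus(NAVI31_WGPS, NAVI31_ALUS_PER_WGP)
alu_ratio = navi31_alus / navi21_alus
cache_ratio = NAVI31_CACHE_MB / NAVI21_CACHE_MB

print(navi21_alus, navi31_alus, alu_ratio, cache_ratio)
```

The ALU count alone gives exactly 3x, so the ">3x" expectation rests on the additional clock and cache assumptions in the post.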
     
  15. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Scaling hard!
    Correct.
    Think the entire lineup gets 4x LLC bumps as one last SRAM huzzah.
     
    Lightman likes this.
  16. 4*96MB = 384MB LLC on Navi 32?
    128MB LLC on Navi 33?

    :runaway:
     
  17. Leoneazzurro5

    Regular

    Joined:
    Aug 18, 2020
    Messages:
    335
    Likes Received:
    348
    Your stance is correct if the ratio between RA and WGP is the same. I think Jawed was supposing that the ratio between RA and ALU stays the same, which could also be the case. Or it may be that the ratio between RA and WGP is indeed the same, but the RA capabilities are increased... There are too few details atm for a definitive answer. I would find it very strange, however, if AMD increased the base shading power almost threefold (which is not the limiting factor of RDNA2) while giving the ray tracing hardware (which is the weakest point of RDNA2) only a moderate increase.
     
  18. Leoneazzurro5

    Regular

    Joined:
    Aug 18, 2020
    Messages:
    335
    Likes Received:
    348
    I don't think AMD will clock N31 much higher than N21, the reason being power consumption. It is more likely to be clocked the same (enjoying the power reduction) or very slightly higher. Then people could maybe enjoy some overclocking, if the board design allows it.
     
  19. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    But these lower tier chips will presumably be single chiplet, and it's just double the LLC per chiplet from what I've read? Not that that isn't still awesome, obviously.

    My expectation at this stage is that the single chiplet N31 will be the direct replacement for the 6900XT with 50% more ALUs, double the LLC, maybe faster clocks and improved IPC etc... for a more traditional performance uplift. But then we get a 6900XT X2 equivalent GPU at the very top end, which harks back to the old Crossfire-on-a-card days, with stupid prices for what are clearly halo products. The difference here is that scaling will hopefully be much less like 2 GPUs in Crossfire and more like 1 big GPU, with hopefully no compatibility issues. So outside of that one crazy expensive halo "X2" card, the rest of the stack will be a more traditional performance uplift of around 50% (a bit more if we're very lucky).
     
    PSman1700 likes this.
  20. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    GCDs have no LLC.
    At all.
    No such thing.
    Below N31 goes N32 and after that a single die N33.
    It is one big GPU.
    A very expensive one to boot.
    No, it's the biggest gen on gen uplift in eons across the stack.
     