AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Intel isn’t playing that game and is counting EUs, not SIMD lanes. It’s debatable, though, whether the number of EUs/CUs/SMs is more accurate or more helpful than the number of SIMD lanes when it comes to graphics and compute performance.

    This is especially true when the width of each of these units is vastly different. An EU is 8-wide and a CU is 64-wide. It doesn’t make sense to compare them directly. Also an EU isn’t functionally equivalent to an SM or CU. It’s closer to just one of the SIMDs (or partitions in Nvidia’s case) within those units.
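To make the comparison concrete, here is a quick sketch that normalises vendor unit counts to SIMD lanes using the widths given above (EU = 8-wide, CU = 64-wide). The part counts in the example calls are made up for illustration, not real product specs.

```python
# Normalise marketing "unit" counts to SIMD lanes, using the widths
# discussed above: an Intel EU is 8 lanes wide, a GCN CU is 64 wide.
UNIT_WIDTH = {"EU": 8, "CU": 64}

def simd_lanes(unit_count, unit_type):
    """Convert a count of vendor units into a count of SIMD lanes."""
    return unit_count * UNIT_WIDTH[unit_type]

# Illustrative (hypothetical) part configurations:
print(simd_lanes(96, "EU"))  # 96 EUs -> 768 lanes
print(simd_lanes(40, "CU"))  # 40 CUs -> 2560 lanes
```

The point of the sketch: directly comparing 96 EUs to 40 CUs is meaningless until both are reduced to lanes.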
     
    DegustatoR likes this.
  2. Bondrewd

    Veteran

    Joined:
    Sep 16, 2017
    Messages:
    1,682
    Likes Received:
    846
    Yeah Intel should stop fucking around and start counting subslices.
     
  3. Leoneazzurro5

    Regular

    Joined:
    Aug 18, 2020
    Messages:
    335
    Likes Received:
    348
    Well, there is no simple answer to that, as the structures differ among vendors, and so do the capabilities of the units, whether "CUs" or "CUDA cores". For some things it's a no-brainer to count per WGP or SM or EU, because those contain the shared structures (flow control, registers, cache, and so on), so logically that is the "block" you use for building your GPU. But if you are interested in peak FP throughput, then it's the FP ALU count to look at - even if bottlenecks here and there can reduce actual FP ALU utilization by a lot (in software as well as in hardware).

    But as said, many people look only at who has the "bigger number". It was so during the Megahertz race (even after Pentium 4 came out), then with the amount of VRAM ("I have a 6 GByte card, so it's more powerful than your 4 GByte one", regardless of the actual GPU die and memory bus), and now we have the number of ALUs or the TeraFLOPs.
     
  4. Dangerman

    Newcomer

    Joined:
    Apr 1, 2014
    Messages:
    43
    Likes Received:
    8
    Hmmmmm:


    That adds up to 7680 cores. I wonder if a single RDNA3 GCD has 7680 cores. So Navi 31 would have 7680 x 2 = 15360.
     
    Lightman likes this.
  5. Bondrewd

    yea
    (it's 32 * 8 * 10 * 3 * 2)
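The factorisation above, spelled out as a quick check (this is the rumoured configuration discussed in the thread, not a confirmed spec):

```python
# Rumoured Navi 31 ALU count from the factorisation above:
# 32 lanes/SIMD x 8 SIMDs/WGP x 10 WGPs/SE x 3 SEs/GCD x 2 GCDs.
lanes_per_simd = 32
simds_per_wgp = 8
wgps_per_se = 10
ses_per_gcd = 3
gcds = 2

per_gcd = lanes_per_simd * simds_per_wgp * wgps_per_se * ses_per_gcd
total = per_gcd * gcds
print(per_gcd, total)  # 7680 15360
```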
     
  6. trinibwoy

    Navi 21: 4 shader engines, 10 WGPs each, 4 32-wide SIMDs per WGP.
    Navi 31: 6 shader engines, 10 WGPs each, 4 32-wide SIMDs per WGP.
     
    Lightman likes this.
  7. Leoneazzurro5

    So, any guesses about the GCD size? At 5nm, I think we could see something around a 350 mm^2 die. The interposer+cache is trickier, I think: I don't think AMD would use a 5nm process for that, so on 7nm we would see something near or exceeding 400 mm^2, counting only the SRAM (512 MBytes)...
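A back-of-the-envelope sketch of the SRAM area claim. The bitcell size and overhead factor are assumptions, not TSMC data: ~0.027 um^2 is the widely reported N7 high-density bitcell, and real macros spend very roughly 2x the raw bitcell area on periphery and routing, so the result is sensitive to that factor (a larger overhead, plus tags and interconnect, would push the figure toward the ~400 mm^2 quoted above).

```python
# Rough SRAM area estimate for 512 MBytes of cache on N7.
# ASSUMPTIONS: bitcell area and overhead factor are illustrative only.
BITS = 512 * 8 * 2**20   # 512 MBytes expressed in bits
BITCELL_UM2 = 0.027      # assumed N7 high-density bitcell area (um^2)
OVERHEAD = 2.0           # assumed macro/periphery/routing overhead

area_mm2 = BITS * BITCELL_UM2 * OVERHEAD / 1e6
print(round(area_mm2))   # ~232 mm^2 under these assumptions
```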
     
  8. Bondrewd

    8.
    8 SIMDs per WGP.
    Less.
    The what.
    The thing is SoIC, no passive slabs or anything.
    Think different.
    Think the smallest possible reuse unit.
     
  9. So 3 shader engines, 10 WGPs each, 8x32 wide SIMDs each WGP.
    Per GCD.

    RDNA3 looks like the largest departure from GCN yet, at least from a high level perspective.
     
    Lightman likes this.
  10. Bondrewd

    I very much welcome our many-GPU ULTRA HALO overlords back.
    Hopefully PC does another 4 slot Red Devil 13 cuz why not.

    Yep.
    True dat.
    Ideologically the fat SM approach feels closer to previous NV gens or IMG A/B-series.

    There's a ton of uArch changes for gfx11 (both variants) but we gotta talk them at a later date, if ever.
     
    Lightman likes this.
  11. Leoneazzurro5

    Oh, well, I was being conservative; my raw calculation came to a little less than 330 mm^2 or so, but it depends on actual scaling rather than on gross estimates made by TSMC.

    Oh well, that was my mistake on the cache part: I thought it was stacked, but on second thought, yes, it makes no sense to use passive parts when your inter-die communication is done through the cache die.

    In effect TSMC was developing 7nm stacked on 7nm and 5nm stacked on 5nm, but there was no hint about hybrid stacking. It will be interesting to see what the real product looks like.
     
  12. Bondrewd

    Depends on how AMD implements stuff too!
    CDNA1 to CDNA2 PPA will be funny given iso node.
    No, it's just generic hybrid bonding.
    All the nodes define is minimum pitch.
    C'mon the usual Taiwan IP vendor (what's its name?) already announced 3D d2d solution for 7 on 5.
     
  13. trinibwoy

    It does make sense in terms of scaling fixed function hardware. Double FP throughput for the same number of TMUs. Wonder what will happen with ray accelerators. It’s not clear that they’re a bottleneck on RDNA2.

    The WGP memory subsystem should be interesting. That’s 8 waves potentially hitting L1 and LDS each clock.
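The worst-case demand implied by that observation can be sketched quickly. The 4-bytes-per-lane access size is an assumption for illustration (a typical 32-bit load); the point is the aggregate per-clock pressure on the WGP's L0/LDS if all 8 SIMDs issue at once.

```python
# Illustrative worst-case load/store demand per WGP per clock if all
# 8 SIMDs issue a full 32-lane, 4-byte-per-lane access simultaneously.
simds_per_wgp = 8
lanes = 32
bytes_per_lane = 4  # assumed 32-bit accesses

bytes_per_clock = simds_per_wgp * lanes * bytes_per_lane
print(bytes_per_clock)  # 1024 bytes/clock of demand on L0/LDS
```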
     
  14. Leoneazzurro5

    Oh, I missed that. I will look around for the news, thanks.
     
  15. Leoneazzurro5

    It is quite possible the ray accelerators and TMUs are "beefier" now, with even wider data paths, possibly enabling concurrent utilization of TMU and Ray accelerator(s).
     
  16. trinibwoy

    These 2 statements appear to contradict each other. AMD explicitly says that 64-item wavefronts are run on a single SIMD over multiple cycles. Why do we need to infer anything about running across SIMDs?

    Also I think you’re using workgroup where you should be using wavefront.
     
    #496 trinibwoy, Jul 26, 2021
    Last edited: Jul 26, 2021
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    L0$ per WGP, L1$ per shader array. Well, that's how RDNA (2) is configured.

    8 SIMDs sharing a single TMU is what I've been suggesting, because texturing rates with more TMUs would be disproportionately high. But Carsten's point about bandwidth into each SIMD from L0$ still stands. A multi-ported L0$ doesn't seem like a good idea (increased latency and LDS-like banking rules making latency/bandwidth more unpredictable).

    Instead, an L0$ per SIMD. But the problem with that is the quantity of L0$ taken up by data that's also present in nearby L0$s (i.e. within the same WGP) - duplication is going to waste a lot of these 8 L0$s.

    So I am really struggling to justify, one way or another, how L0$ (and to a similar extent the TMU) works when there are 8 SIMDs. Maybe a 32KB L0$ is enough for 8 SIMDs, and maybe the same goes for a single TMU.

    An individual SIMD has a bursty relationship with L0$ and TMU, because of latency hiding. Additionally, clause-based operation of L0$ and TMU commands intensifies this burstiness, since groups of multiple commands will be issued (effectively by the compiler), rather than being spaced out. These groups minimise the number of context switches seen by a single hardware thread.

    So perhaps 8 SIMDs all doing their bursty thing can be seen as safely keeping out of each others' way in the general case.

    Ray accelerators, on the other hand, seem to be an even more troublesome question. Since AMD is SIMD-traversing a BVH, one can argue that the intersection test rate is fine with one RA per WGP (8 SIMDs). We could expect the compiler to make bursty BVH queries (several queries at a time per work item, e.g. multiple rays and/or multiple child nodes per work item).

    The ALU:RA ratio is pretty high in RDNA 2 already, so making it much higher may well be harmless given that the entire BVH, in the worst case, can't fit into infinity cache (i.e. lots of latency). More WGPs in Navi 31 still means a theoretical increase in intersection throughput (e.g. 3x Navi 21).

    Across all GPUs and all non-proprietary APIs running on them, the maximum size of a workgroup is at least 128 work items. In the good old days a workgroup was up to 1024 work items at maximum. The different APIs impose varying restrictions on workgroup size, which further muddies the waters. And then there are the consoles, which appear to have the loosest restrictions (for a given GPU architecture).

    RDNA (2) supports 1024 work items in a workgroup.

    We can infer that 2 SIMDs can cooperate in a non-pixel-shader workgroup because it reduces latency (wall clock latency for all the work items in the workgroup). Pixel shading is a special case, where 64 work items sharing a SIMD as hi and lo halves, in general, benefits directly from attribute interpolation sharing (LDS locality) and texel locality.

    For non-pixel-shading kernels (and specifically those that make low or no use of LDS) there is less reason not to spread a workgroup across all SIMDs. This becomes "essential" when a workgroup is high in work items and also high in register allocation: the register file in a given SIMD is literally too small. Additionally, if only 2 workgroups can fit into a CU, then you want both SIMDs to be time-sliced by the work of one or the other of the workgroups. By making them both time-slice, you maximise concurrent utilisation.
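The "register file is literally too small" case can be sketched with some arithmetic. The figures are assumptions for illustration (wave32 hardware threads and a register file holding 1024 wave32 VGPR entries per SIMD, roughly RDNA-like); `waves_fit` is a hypothetical helper, not a real API.

```python
# Does a workgroup's register demand fit when packed onto few SIMDs?
# ASSUMPTIONS: wave32 threads; 1024 wave32 VGPR entries per SIMD RF.
RF_VGPRS_PER_SIMD = 1024

def waves_fit(workgroup_items, vgprs_per_wave, simds):
    """True if the workgroup's VGPR demand fits per-SIMD."""
    waves = workgroup_items // 32            # wave32 hardware threads
    waves_per_simd = -(-waves // simds)      # ceiling division
    return waves_per_simd * vgprs_per_wave <= RF_VGPRS_PER_SIMD

# A 1024-item workgroup at 64 VGPRs/wave is 32 waves:
print(waves_fit(1024, 64, 1))  # one SIMD: 32*64 = 2048 > 1024 -> False
print(waves_fit(1024, 64, 4))  # four SIMDs: 8*64 = 512 <= 1024 -> True
```

So a large, register-hungry workgroup simply cannot be resident on a single SIMD; spreading it is forced, not optional.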

    You can argue that wall-clock latency is worse when workgroups fight over a single SIMD: one hardware thread from workgroup 1 wants to run because it's received its data from L0$ and a hardware thread on the same SIMD for workgroup 2 wants to run because it's received its LDS data. I think you'd agree that's in the category of "low probability".

    Of course there are scenarios where WGP mode is preferred for compute (when an algorithm requires large: VGPR and/or LDS and/or work item allocations).

    This comes back to my question: why does RDNA (2) even have a compute unit concept? Is it merely for L0$, TMU and RA (scheduling/throughput)? Or is it to make corner-case GCN-focussed shaders happy instead of suffering performance that falls off a cliff? Or...?

    I think you're going to have to be very specific in pointing out a problem with what I've written. I'm not saying I haven't made a mistake, but while you're hiding the problem you've identified I can't read your mind and I'm not motivated to find the problem in text that I've already spent well over 6 hours writing (I posted version 3, in case you're wondering).

    I deliberately use "workgroup" and "hardware thread" because in discussing multiple architectures, and architectures over time from the same IHV, you get contradictory models of the hardware if you use "wavefront" and "block" and "warp". Remember, G80 had two warp sizes, for example: 16 and 32. Graphics and compute APIs can easily hide hardware threading models, which fucks-up discussions of the hardware.

    For example I have a theory that RDNA (2) only has one hardware thread size: 32. The idea that pixel shaders are "wave64" (implying that they are a hardware thread of 64 work items) contradicts this, but can easily be viewed as an abstraction to maximise equivalency with the operation of GCN. There's even a gotcha in the operation of "wave64 mode":

    "Subvector looping imposes a rule that the “body code” cannot let the working half of the exec mask go to zero. If it might go to zero, it must be saved at the start of the loop and be restored before the end since the S_SUBVECTOR_LOOP_* instructions determine which pass they’re in by looking at which half of EXEC is zero."

    from RDNA_Shader_ISA.pdf (amd.com)

    It looks like a fossil of GCN, where only VCC and EXEC hardware registers are 64-bit (both required to support wave64). Perhaps RDNA 3 will entirely abandon wave64. One motivation could be that 8 SIMDs per WGP need filling, so LDS locality for attribute interpolation and L0$ locality for texturing (burstiness) is less of a win when there's so many SIMDs to keep occupied.

    (Amusingly, ISA also says this about S_SUBVECTOR_LOOP_BEGIN and S_SUBVECTOR_LOOP_END: "This opcode has well-defined semantics in wave32 mode but the author of this document is not aware of any practical wave32 programming scenario where it would make sense to use this opcode.")
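A toy model of the pass-detection rule quoted above: the subvector-loop instructions decide which pass they are in by checking which half of the 64-bit EXEC mask is zero. This is purely conceptual Python, not ISA semantics, and the function name is made up; it also shows why the body code must not let the working half go to zero, since an all-zero mask would be ambiguous.

```python
# Conceptual model of S_SUBVECTOR_LOOP_* pass detection: the hardware
# infers the current pass from which 32-bit half of EXEC is zero.
def current_pass(exec_mask):
    lo = exec_mask & 0xFFFFFFFF
    hi = exec_mask >> 32
    if hi == 0 and lo != 0:
        return "first pass (lo half active)"
    if lo == 0 and hi != 0:
        return "second pass (hi half active)"
    return "ambiguous / not inside a subvector loop"

print(current_pass(0x00000000FFFFFFFF))  # first pass
print(current_pass(0xFFFFFFFF00000000))  # second pass
print(current_pass(0))                   # ambiguous: the rule breaks down
```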

    Also, "SGPRs are no longer allocated: every wave gets a fixed number of SGPRs" - means that per-SIMD hardware-thread ID directly indexes SGPRs, which would also mesh easily with there being only one hardware thread size of 32.

    Wave64 and the necessity of optional sub-vector looping smells like a hack... Sub-vector looping helps with VGPR allocation, as it happens. But AMD also increased the size of the register files in RDNA...
     
    #497 Jawed, Jul 26, 2021
    Last edited: Jul 26, 2021
    Lightman likes this.
  18. [attached image]
    N5 on N6 in 2022
     
    Leoneazzurro5 likes this.
  19. Jawed

  20. Bondrewd

    Lightman likes this.
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.