AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. Bondrewd

    Bondrewd Veteran

    You're talking to the dude writing them.
    Still excellent PPA.
     
    Wesker and Deleted member 13524 like this.
  2. Jawed

    Jawed Legend

    The whitepaper is quite explicit about CU mode:

    "The RDNA architecture has two modes of operation for the LDS, compute-unit and work-group processor mode, which are controlled by the compiler. The former is designed to match the behavior of the GCN architecture and statically divides the LDS capacity into equal portions between the two pairs of SIMDs. By matching the capacity of the GCN architecture, this mode ensures that existing shaders will run efficiently. However, the work-group processor mode allows using larger allocations of the LDS to boost performance for a single work-group."

    Agreed, the peak is the same. The peak count of DWORDs read or written per LDS bank is constant whether in CU mode or WGP mode.

    An atomic instruction like BUFFER_ATOMIC_SWAP means the peak is capped. In WGP mode all 64 banks of LDS are addressable but an atomic that returns data, by definition, can only touch a single bank per cycle. This is the scenario I was referring to originally when I was talking about a CU-less design.
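    The serialisation argument can be sketched as a toy model. This is my own simplification, not documented hardware behaviour: it just contrasts a conflict-free access (cost = worst bank collision) with a returning atomic modelled as retiring one lane per cycle.

    ```python
    # Toy model (an assumption, not documented hardware behaviour) of why a
    # returning atomic caps peak LDS throughput: each lane's swap must
    # read-modify-write a single bank, so the lanes serialise.
    from collections import Counter

    NUM_BANKS = 64  # all 64 LDS banks are addressable in WGP mode

    def cycles_for_access(addresses, serialise_all=False):
        """Estimate LDS cycles for one wave's access.

        addresses: one DWORD address per lane.
        serialise_all: crude stand-in for a returning atomic that can
        only complete one bank operation per cycle.
        """
        if serialise_all:
            return len(addresses)  # one lane retired per cycle
        # Otherwise cost is the worst-case number of lanes hitting one bank.
        conflicts = Counter(addr % NUM_BANKS for addr in addresses)
        return max(conflicts.values())

    wave32 = list(range(32))                  # 32 lanes, all distinct banks
    print(cycles_for_access(wave32))          # conflict-free load: 1 cycle
    print(cycles_for_access(wave32, True))    # serialised atomic: 32 cycles
    ```

    Under this model a second wave's atomic can only overlap if the LDS can work on a different bank in the same cycle, which is exactly the guarantee being questioned below.
    
    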

    I was referring specifically to bandwidth and there are some LDS operations where it appears that bandwidth is higher in CU mode because a CU can only touch one array in LDS.

    I can't think of a mechanism in WGP mode that can guarantee that two BUFFER_ATOMIC_SWAP instructions will always run concurrently. If two SIMDs (e.g. 0 and 2) are both doing BUFFER_ATOMIC_SWAP but are in distinct workgroups, then it should be possible for LDS to run BUFFER_ATOMIC_SWAP for both concurrently. Or BUFFER_ATOMIC_SWAP for one workgroup and some other "less serial" LDS instruction for the other.

    Yes, I'm looking for corner cases. I'm trying to identify the problems with a CU-less WGP.

    And that's all I was inferring.

    I was contrasting a workgroup of 64 pixels (work items), which is allocated to a single SIMD, with a non-pixel-shader workgroup of 64 work items that will be allocated across both SIMDs. I was pointing out that the pixel shader allocation to SIMDs is unusual. And LDS appears to these two SIMDs in a CU the same as it does in GCN (where it is shared by four SIMDs). A half of LDS (one array out of the two) is dedicated to a CU in CU mode.

    TMUs, RAs and L0$ bandwidth are dedicated to a CU all of the time, whether in CU or WGP mode.

    Though we also know that RDNA (2) has increased throughputs and sizes versus GCN at the CU level (e.g. L0$ cache lines are twice the size of L1$ in GCN).

    What you've written is true for a pixel shader for 64 pixels. It's described as a wave64 and corresponds with a workgroup of size 64. The unusual aspect here is that it's scheduled on a single SIMD with 2 work items sharing a SIMD lane, whereas in general workgroups are spread across SIMDs width first.

    But note that a workgroup of 128 work items bound to an RDNA CU necessarily results in 2 work items sharing a SIMD lane, with increasing multiples for larger workgroups up to size 1024. So the pixel shader configured as wave64 appears to be a special case of workgroup: one designed explicitly to gain from VGPR, LDS and TMU locality for pixels, whose quad work-item layout and use of quad-based derivatives is a special case of locality.
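    The lane-sharing arithmetic is easy to sketch. This is my own back-of-envelope model, assuming an RDNA CU with two 32-wide SIMDs and width-first spreading of workgroups:

    ```python
    # Hypothetical back-of-envelope: how many work items end up sharing a
    # SIMD lane when a workgroup is bound to one RDNA CU (2 SIMDs, 32 wide).
    from math import ceil

    SIMDS_PER_CU = 2
    SIMD_WIDTH = 32

    def items_per_lane(workgroup_size):
        # Width-first allocation: fill all lanes of both SIMDs, then wrap.
        return ceil(workgroup_size / (SIMDS_PER_CU * SIMD_WIDTH))

    for size in (64, 128, 256, 1024):
        print(size, items_per_lane(size))
    # A 64-item compute workgroup spreads width-first across both SIMDs
    # (1 item per lane); a wave64 pixel shader instead packs all 64 onto a
    # single SIMD, i.e. 2 per lane - which is why it looks like a special case.
    ```
    
    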

    This then leads me to believe that "wave64" for pixel shading is a hack that ties together two hardware threads on a SIMD. Not only is there a clue in the gotcha that I described earlier, but one mode of operation (alternating halves per instruction) is just like two hardware threads that issue on alternating cycles - which is a feature of RDNA.

    In the end pixel shaders are merely a subset of compute shaders, where some of the work item state (values in registers and constant buffers) is driven by rasterisation.

    You can see the problem AMD has introduced with RDNA: two different wavefront sizes. Are they both hardware threads? Or is wave64 an emulated hardware thread, simulated by running two hardware threads?

    Running a wave64 on a single SIMD appears to be enough to define it as a hardware thread. I believe that's where you're coming from. But a SIMD can support multiple hardware threads from the same workgroup, too. So that definition looks shaky.

    Also, you can say (and I believe this is really your point) that running all work items from a workgroup on a SIMD means it's a hardware thread. So wave64 is a hardware thread. Whether it is or isn't a hardware thread, it's still a workgroup.

    I'm sure that leaves you unhappy with what I've said. Ah well. Maybe RDNA 3 won't have wave64. Three distinct execution models for pixel shaders is beginning to take the piss. Hoping that the compiler will choose the right one is how you drive customers away. (RDNA took a year+ for drivers to settle at adequate performance?...)

    What's interesting about LDS usage by a pixel shader for attribute interpolation is that the bandwidth of LDS is multiplied by sharing data across 64 work items (e.g. a single triangle has 64 or more pixels). Whether wave64 is a hardware thread or not, this is an example of shared data and reduced wall clock latency for pixel shading.

    Of course that falls apart when the triangles are small: you really want wave32 (32 work item workgroups) then.

    All this came up because I'm trying to work out how RDNA 3 with no CUs inside a WGP configures TMUs, RAs and LDS. Bearing in mind that a "WGP" seems as if it would actually be a "compute unit", a compute unit generally has a single TMU and a single LDS. So my question was whether a CU with 8 SIMDs can have adequate performance with a single TMU and a single LDS.

    Need to be careful to distinguish the two things that RAs do:
    1. box intersections (4 tests per ray query) - intrinsic performance here seems completely fine
    2. triangle intersections (1 test per ray query - I am not sure if this is a confirmed fact)
    and separate them from the idea of "hardware acceleration accepts a ray and produces the triangle that was hit" (black box).

    Because BVH traversal is a shader program, increased work items and/or rays in flight may actually be the key to improved performance. If AMD can do work item reorganisation for improved coherence during traversal, then that's another win. Indeed work item reorganisation may be the only way that AMD can meaningfully improve performance.

    The alternative could be narrower SIMDs.

    "RA as being like texturing" is a useful metaphor. Both operations depend upon global memory accesses and so their latency needs to be hidden.

    I have not found a way to determine the precise extent of RDNA 2's bottlenecks. I've been working on it for a couple of weeks, but haven't got very far. Of course bottlenecks are super-slippery in real time graphics, so there's a limit to what can be gleaned.
     
  3. trinibwoy

    trinibwoy Meh Legend

    Over-engineered maybe but could still be a win from a yield and performance perspective. The hard part will be moving data between the dies without blowing the power budget.

    Why would a 160 SM die be necessary? An RDNA3 WGP is 2x as wide as an Ampere SM. So if we're just talking flops, you would need 80 Ampere SMs to match 40 RDNA3 WGPs. That is well within the reach of a monolithic 5nm die.
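    The bookkeeping behind that claim, assuming clock parity (my assumption) and the rumoured 8 SIMD32s per RDNA3 WGP versus 128 FP32 lanes per Ampere SM:

    ```python
    # Raw FP32 lane accounting (clock parity assumed; WGP width is rumoured).
    AMPERE_SM_FP32 = 128      # FP32 lanes per Ampere SM
    RDNA3_WGP_FP32 = 8 * 32   # 8 SIMD32s per rumoured RDNA3 WGP = 256 lanes

    wgps = 40
    equivalent_sms = wgps * RDNA3_WGP_FP32 // AMPERE_SM_FP32
    print(equivalent_sms)  # 80 SMs match 40 WGPs on lane count alone
    ```
    
    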
     
    DegustatoR likes this.
  4. Leoneazzurro5

    Leoneazzurro5 Regular

    That's only talking FLOPS, but in games it's a little different: if the capabilities of those SMs are similar to Ampere's, then in rasterization there is not much difference between an Ampere SM and an RDNA2 CU (with a slight advantage for Ampere so far), looking at what a 3080 and a 6800XT can do with 68 SMs vs 72 CUs (in ray tracing it's different, but next gen is unknown, so that's a big question mark). So going by ALU count and what those ALUs do for actual performance, you'd need 144 Ampere SMs to match or slightly beat the equivalent of 160 CUs / 10240 FP32 units (that probably being N32, while N31 seems to be going for 60 WGPs / 15360 shader units). Incidentally, the rumours about Lovelace seem to point at exactly that count on 5nm.

    But then there is the issue of feeding all those ALUs - that is, bandwidth. AMD is throwing tons of cache at the problem. It remains to be seen whether Nvidia will do the same for Lovelace. They can go again for a chip near the reticle limit, but if AMD delivers what N31 seems to be, I have difficulty seeing it compete with those sheer numbers. Of course there will be architectural improvements, but that goes both ways.
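    A quick sanity check on the unit counts in that post, taking 64 FP32 lanes per RDNA2 CU and a rumoured 256 per RDNA3 WGP as working assumptions:

    ```python
    # Verify the ALU totals quoted above (lane-per-unit figures are my
    # working assumptions, not confirmed specs).
    FP32_PER_RDNA2_CU = 64
    FP32_PER_RDNA3_WGP = 256  # rumoured: 8 SIMD32s

    print(160 * FP32_PER_RDNA2_CU)   # 10240 FP32 units for a 160 CU part
    print(60 * FP32_PER_RDNA3_WGP)   # 15360 shader units for a 60 WGP N31
    ```
    
    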
     
  5. DegustatoR

    DegustatoR Veteran

    An 8 SIMD WGP will have a different performance profile compared to a 4 SIMD WGP (even disregarding all the other changes for now) so I'm not so sure that you can just extrapolate N31 performance from RDNA2 results, either by unit numbers or FLOPs.
     
  6. Leoneazzurro5

    Leoneazzurro5 Regular

    Nor can you extrapolate that performance will be lower rather than the same or even higher, especially when historically there has basically never been a performance regression in iterative steps of the same architecture family - so let's see.
     
  7. Bondrewd

    Bondrewd Veteran

    Ehhhhh gfx11 is a pretty clean break.
     
  8. DegustatoR

    DegustatoR Veteran

    You can't, of course, but if we're appealing to "historically" then let me remind you about Vega's scaling. Or Ampere compared to Turing, for that matter.
     
  9. Leoneazzurro5

    Leoneazzurro5 Regular

    It may be, and probably will be, but I simply don't think AMD would decrease "per ALU" performance - the task with RDNA and its successive steps was clearly to increase ALU utilization, which was quite lacking in GCN. I may be wrong, but I don't think AMD will lose this focus, especially looking at their recent history.
     
  10. Bondrewd

    Bondrewd Veteran

    RDNA2 is already that stuff but on speed, and it scales effortlessly.
    That one added a second FP FMA lane per 4 r/w ports rather than outright doubling the partitions.
    Yep, their PPA considerations are pretty anal.
     
  11. Frenetic Pony

    Frenetic Pony Regular

    Calculated the approximate power cost for Infinity Fabric v1 a while back, and even with that the power budget for two dies was tolerable for today's high-end GPUs, let alone whatever improved version they have now. Not trivial, or even cheap, mind, but within budget for a <400 watt card.

    The real power question for these upcoming GPUs versus "the rumors" is how exactly either AMD or Nvidia plans to get the power usage of compute itself down. Double the performance in two years or less? Last I looked, there was no secret silicon miracle in either company's back pocket just waiting to be pulled out. So when was the last time any major GPU architecture doubled its power efficiency in that short a time? It almost feels rhetorical.

    For AMD I can see it, with the big qualifier that it only applies to the hardware RT titles it currently performs worst in; sure, maybe the performance there could double, especially with the extra headroom they have for cranking up power. But Nvidia's cards are already at 300+ watts - doubling performance, really?
     
  12. Bondrewd

    Bondrewd Veteran

    What they're using now is a gazillion times lower pJ/b.
    Like orders and orders of magnitude lower.
    No.
    It's 2.7x or so in raster gen over gen.
     
    Frenetic Pony likes this.
  13. Frenetic Pony

    Frenetic Pony Regular

    :shocked:
     
  14. What is even going to make use of this level of performance?
    Isn't that like 7 to 8x more performance than the current generation consoles?
     
  15. trinibwoy

    trinibwoy Meh Legend

    I’ll take super resolution 8k downscaled to 4k please.
     
    Kej, CeeGee, BRiT and 2 others like this.
  16. Jawed

    Jawed Legend

    NVidia with a monster cache could go 256-bit too? NVidia in its COPA paper:

    2104.02188.pdf (arxiv.org)

    has been contemplating extremely large L3 cache (see section E). Silly to assume that NVidia is not going to do crazy things. Returning big time to TSMC might be a reflection that TSMC's tech is crucial to NVidia's plans, not just for data centre but everything from 2022. No, I'm not suggesting NVidia will put 1GB of L3 in Lovelace, but 256MB?

    Well AMD needs to sort out ray tracing performance. Less than 2x gain over Navi 21 will be seen as a major fail and consoles are not a useful benchmark for ray tracing performance. Also, leet console dev ray tracing voodoo is going to make Navi 31 look worse anyway, so the pressure is on.

    Yep, super-sampled 4K is nice. Especially when sat a metre from a 48" screen, where them pixels are not what you'd call "retina".

    It would be amusing if LG brought 8K OLED to small TVs such as 48" in a couple of years.
     
    Lightman likes this.
  17. techuse

    techuse Veteran

    2.7x a 6900xt across a general performance summary? Very hard to believe in 2022 or even 2023.
     
  18. Bondrewd

    Bondrewd Veteran

    I dunno.
    4k HRR?
    Less.
    That's the whole gimmick!
    No one built above-reticle GPUs before because it was outright impossible to do so.
     
  19. JoeJ

    JoeJ Veteran

    The first thing I come up with is full-scene fluid sim (fog, smoke) and volumetric lighting (glowing projectiles with awesome glow). That's the kind of stuff that needs brute-force power, because there aren't many options to fake it well or to reduce the work with clever algorithms.
    But that's still complex to implement. And on/off makes a big difference and can't scale down well, so it's not that practical or attractive for the current cross-platform games landscape.
    Well, maybe if next gen is the only target it would scale down well enough.

    Other than that, high resolution and fps as usual. :/
    If they want better RT perf, HW traversal is the obvious improvement to make.

    Maybe arcade machines would make sense again? Competing with cinemas? If there were no covid?

    Well, I really hope chiplets will also help with smaller, more practical GPUs after that nice proof of concept. I'm impressed, but also disappointed about another high-end craze.
     
    Kej, BRiT and Deleted member 13524 like this.
  20. Bondrewd

    Bondrewd Veteran

    Don't worry, smaller and practical also get a chungus bump in RDNA3.
    Everyone will be very happy and very broke!
     
    Lightman and JoeJ like this.