AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,752
    Likes Received:
    7,766

    I thought the MCDs were being packaged underneath the GCDs. That way there would be two MCDs of 256MB each.

    Though perhaps it would make more sense if the LLC is indeed centralized and serves both GCDs, for coherency.
     
  2. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    151
    Likes Received:
    377
    Patents indicate 1 MCD, and Kopite is also hinting at 1 MCD.

     
    Lightman and Man from Atlantis like this.
  3. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,564
    Likes Received:
    757
    Wut.
    N5P offers better logic scaling than N5.
    Don't mistake it for N5HPC, which is a memetastic offering AMD would never touch.
    Yea, it's a config designed around being unreachable by anything or anyone.
    Patents aren't really, if at all, representative of the real thing.
     
    Lightman and BRiT like this.
  4. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    151
    Likes Received:
    377
    Ah yes, you are right, I had it in my head as HP.
    But the scaling from N5 to N5P is minimal anyway, low single digits, mainly a power efficiency gain.

    Yes, I don't imagine they will touch that, the leakage is too high.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,484
    Likes Received:
    1,844
    Location:
    London
    If WGPs are the smallest unit of compute (i.e. the CU is disappearing) then we'd better be getting conditional routing. 480x SIMD-32s? 960x SIMD-16s? 240x SIMD-64s?

    AMD has already set the precedent that graphics kernels can use either 32 or 64 work item workgroups, so is there any reason not to continue down this path and increase the count of workgroup sizes that the hardware directly supports? Once that's done, adding conditional routing on top and letting the hardware vary workgroup sizes on the fly (splitting and merging workgroups) would be cool.

    I've never bothered to understand what Intel has done with its variable width workgroups, but I'm going to guess that something like that is what AMD is planning. But allowing kernels that feature dynamic control flow to be able to be conditionally-routed directly by the hardware would be a huge step up from that.

    It would be amusing if NVidia is also about to do the same thing in Hopper/Lovelace.
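
    Here's a toy sketch (Python, purely illustrative; no real hardware or driver interface is being described) of what conditional routing is meant to buy: if work items can be regrouped by branch target, instead of each wave masking lanes off and serialising every path it contains, far fewer wave-issues are needed for a divergent kernel. The branch-target distribution and the cost model below are made-up assumptions.

```python
import random
from collections import defaultdict

SIMD_WIDTH = 32

def naive_issue_slots(targets, width=SIMD_WIDTH):
    """Execution-mask model: each wave serialises every distinct branch target it contains."""
    slots = 0
    for i in range(0, len(targets), width):
        wave = targets[i:i + width]
        slots += len(set(wave))            # one pass over the whole wave per distinct target
    return slots

def routed_issue_slots(targets, width=SIMD_WIDTH):
    """Conditional-routing model: items are regrouped so every wave is uniform."""
    buckets = defaultdict(int)
    for t in targets:
        buckets[t] += 1
    # ceil-divide each branch-target bucket into full or partial waves
    return sum((n + width - 1) // width for n in buckets.values())

random.seed(1)
targets = [random.choice("ABCD") for _ in range(1024)]   # a 4-way divergent kernel
print("naive :", naive_issue_slots(targets), "wave-issues")
print("routed:", routed_issue_slots(targets), "wave-issues")
```

    Splitting and merging workgroups on the fly is then the question of doing this regrouping in hardware, mid-kernel, without the data movement eating the gain.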
     
    Lightman and SpeedyGonzales like this.
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,683
    Likes Received:
    2,601
    Location:
    New York
    Smallest unit of compute doesn't normally indicate the width of each SIMD. CUs have multiple independent fixed-width SIMDs today so future WGPs can presumably follow the same model.

    Workgroups (blocks) or wavefronts (warps)?
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,484
    Likes Received:
    1,844
    Location:
    London
    I'm actually speculating that AMD goes with a sea of conditionally routed SIMD-16s per WGP. When there's no dynamic branching the compiler can choose to emit workgroups of 32 or 64 (2- or 4-cycle instruction-repeats).

    I use OpenCL's terminology and I'm referring to graphics kernels (not compute kernels) since with graphics kernels (commonly vertex and pixel shaders) developers are not setting the size of the workgroup explicitly. So the driver (or hardware) can do what it wants.

    For compute, with dynamic branching, being able to go down to a workgroup size of 16 is advantageous. If the hardware is using a base of SIMD-16 but can merge workgroups to run as a virtual workgroup size of 32 or 64 (or larger) for clauses of kernels where there is no dynamic branching then the hardware can adapt more flexibly to the work it's given. There are plenty of times that compute prefers larger workgroup sizes and sometimes the developer knows that too...

    AMD introduced SIMD-32 with workgroups of 32 for graphics kernels because graphics benefits from the shorter latencies of that configuration (individual work items spend less time parked and are therefore more likely to get repeat hits in texture cache, rather than having the cache thrashed by other work items).

    You can do conditional routing with SIMD-64, but it seems likely that narrower SIMDs result in easier wiring/timing (since the register file is a bottleneck and then, erm, there's LDS). SIMD-16 with conditional routing could have performance more like SIMD-8 when there's dynamic branching. Pixel shaders may well benefit from being issued in workgroups of 16 work items too.
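
    As a rough toy model of why narrower SIMDs lose less to divergence in the first place (Python; a single two-way branch whose outcome is, by assumption, coherent over runs of 8 adjacent work items, and all numbers are invented):

```python
import random

RUN = 8          # assumed branch coherence: whole runs of 8 adjacent items take the same path
P_TAKEN = 0.3

def utilisation(simd_width, n_items=1 << 16, seed=0):
    """Useful-lane fraction when a divergent wave must issue both paths under an exec mask."""
    rng = random.Random(seed)
    taken = []
    for _ in range(n_items // RUN):
        taken += [rng.random() < P_TAKEN] * RUN
    useful = issued = 0
    for i in range(0, n_items, simd_width):
        wave = taken[i:i + simd_width]
        issued += len(set(wave)) * len(wave)   # 1 pass if uniform, 2 passes if divergent
        useful += len(wave)
    return useful / issued

for w in (64, 32, 16):
    print(f"SIMD-{w:2}: {utilisation(w):.2f} of issued lane-slots doing useful work")
```

    Under this (very crude) model the SIMD-16 machine keeps noticeably more of its issue slots busy than SIMD-32 or SIMD-64; conditional routing would be what claws back the rest.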

    If the death of the CU is true, then it seems to me that the WGP is a single level of control for all use of TMUs, RAs, LDS, SIMDs and perhaps conditional routing. The WGP is now directly controlling more SIMDs. Finally, if the WGP is being redesigned without a CU sub-level then while introducing conditional routing it would be cool to use narrower SIMDs to really make the most of the complexity that conditional routing requires.

    Oh and maybe there'll only be one TMU per WGP? Texel rates have reached silliness and I suspect do not need to go higher than currently seen. That wouldn't directly affect the RA count in my opinion, e.g. there could be 4 RAs per WGP.

    In the end I'm saying all of this simply because "no CUs" implies a major re-working of the control of all the modules inside a WGP. And I'm throwing conditional routing in there because fucking hell, it's about time. And ray traversal shaders could do with some help there.
     
  8. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    I agree with the rest, but this is not true. That might be the stated reason, but the actual reason is that too much code is specifically optimized for nV cards, which have SIMD-32 workgroups.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,484
    Likes Received:
    1,844
    Location:
    London
    Seems reasonable, but in pixel shading there's not much that a developer can do to specify how the pixel shader maps to the hardware's base workgroup size.

    I think that a g-buffer fill pass in deferred rendering will benefit from RDNA's SIMD-32s (reduced latency). Here the pixel shader is low in texturing, which reduces the need for texture latency hiding. That should then mean that render target writes tend towards being in localised clumps within the render target, rather than tending to be scattered by the longer-latency execution pattern seen in GCN.

    GCN prefers to have lots of fragments in flight per SIMD because switching hardware threads costs relatively more latency; i.e. latency hiding (switching hardware threads) in GCN adds more to the overall kernel duration than it does in RDNA. GCN has a double-whammy of latency, as it were.

    Localised clumps of render target writes help with render target caching, with or without delta colour compression. Maxwell onwards gained a huge bump in g-buffer fill pixel shading because of the tiled rasterisation architecture, which further intensifies locality in render target writes.

    So my guess is that as deferred rendering has become dominant, with g-buffer fill becoming so painful, RDNA has clawed back ground against NVidia simply through reduced pixel shading latency, which improves the coherency of render target writes.

    The cost for RDNA is the extra scheduling logic and the fact that that logic is running more intensively. If RDNA 3 has conditional routing and SIMD-16s then there would be yet more scheduling cost. And register file buffering. It seems like this would end up needing complex scoreboarding for hardware threads and register reads.

    RDNA has already paid many costs in scheduling overhead versus GCN, yet it gained efficiency. It can be argued that RDNA achieved the efficiency gain despite increased overheads specifically because of its wider SIMDs, changing SIMD-16 to SIMD-32.

    My proposal for single-cycle SIMD-16 issue would be backwards in those terms, due to another increase in scheduling overhead. My guess is that conditional routing itself adds so much scheduling overhead that the sweet spot changes again, towards SIMD-16 from SIMD-32.

    I suppose I should read that patent document that was linked earlier in the thread:

    COMPUTE UNIT SORTING FOR REDUCED DIVERGENCE - Advanced Micro Devices, Inc. (freepatentsonline.com)

    There's nothing to say that conditional routing needs SIMD-16, I'm just wondering whether the simplification of removed CU-level scheduling might be traded for conditional routing complexity and narrower SIMDs.
     
    Lightman and DavidGraham like this.
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,683
    Likes Received:
    2,601
    Location:
    New York
    Yeah I'm surprised TMU ratios haven't already been cut back. Maybe there are bursty texture heavy workloads that still benefit.

    I don't see why it implies a major reworking. In RDNA each SIMD is already independent with its own instruction scheduler, wavefront controller, scalar ALU and SFU. In practical terms "no CUs" just means that it's always operating in "workgroup-processor-mode" and each workgroup always has access to the full LDS within the WGP. I said this at RDNA launch - AMD should drop the unnecessarily confusing WGP terminology and just stick with CU. The CU is just bigger and badder now.
     
  11. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,658
    Likes Received:
    3,662
    Location:
    Germany
    Regarding TMUs: You need the fast datapaths to the CU's caches for compute and RT as well. Might keep the little filters around for a while longer.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,484
    Likes Received:
    1,844
    Location:
    London
    I agree that WGP terminology in a "no CUs" configuration would be confusing, and the naming could simply revert to being CU-centric.

    I'm struggling to understand what the CUs are for in RDNA (2) apart from localising TMUs/L0$, making them "more similar to the operation of GCN".

    From the perspective of a "no CUs" configuration RDNA (2) can be seen as a half step.

    I'll return to your point about "major reworking" when I post about the compute unit sorting patent document.

    This is very true.

    AMD did go to town by doubling cache line widths for RDNA and increasing texturing throughput per unit. I dare say it seems unlikely that cache line widths would double again. The only meaningful per-unit throughput increase available would be for filtering of 128-bit textures (e.g. 4x fp32), since fp16 is already full rate (isn't it? I can't find benchmarks).

    AMD has demonstrated a trend towards increased cache density and texturing throughput per work item, both of which contradict my opinion about a singular, larger, L0$ and reduced texturing unit count per WGP.

    It's worth noting that L0$ cache density per work item hasn't increased with RDNA (in GCN this was the L1$; there was no L0$); only I$ and K$ density has. So perhaps I'm not entirely pissing into the wind.

    I found this slide deck, which I've not seen before:

    AMD PowerPoint- White Template (gpuopen.com)

    The first bulleted list seems to exaggerate when it says "Resources of two Compute Units available to a single workgroup": yes, the LDS, which is nominally a per-CU resource, can be fully utilised by a single workgroup running across all 4 SIMDs (any SIMD can access any address in LDS). But a SIMD does not have access to all of the WGP's L0$ or all of the WGP's texturing throughput.

    That can be seen as deliberate: texture/data fetch needs to be optimised for bandwidth.

    If optimising for latency you would expect that each SIMD would have its own L0$ and texturing. Yet a primary directive in the design of RDNA was reduced latency.

    So there's a conflict between bandwidth/throughput versus localisation/latency.

    Perhaps a larger WGP-level L0$ but with multiple (two?) TMUs.
     
  13. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,658
    Likes Received:
    3,662
    Location:
    Germany
    Yes. As well as FP32 can be, as long as it's only using two channels.

    Edit: it also says so on p. 36 of that arch presentation.
     
    #473 CarstenS, Jul 24, 2021
    Last edited: Jul 24, 2021
    Jawed likes this.
  14. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    392
    Likes Received:
    355
    You could in theory take the SVE variable vector path to the extreme, having a single-lane ALU pipeline (i.e., non-SIMD), and executing any arbitrarily sized wavefront by looping the same instructions with contiguous blocks of registers as src/dst, while skipping items disabled in the exec mask. That would effectively turn an RDNA WGP into a 128 CPU-ish core cluster... :razz:
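
    A minimal sketch of that idea, assuming nothing about any real ISA (Python standing in for the instruction sequencer; register counts, names and the op are arbitrary): the single-lane pipeline re-issues one instruction per enabled work item, with each item's operands living in its own contiguous block of registers, skipping items the exec mask has disabled.

```python
NUM_VGPRS = 4    # registers allocated per work item (arbitrary for this toy)
WAVE = 32        # logical wavefront width

def issue(op, dst, src0, src1, regfile, exec_mask, wave_size):
    """Single-lane pipeline: re-issue the same instruction once per enabled work item,
    each item's operands sitting in its own contiguous block of registers."""
    for item in range(wave_size):
        if not (exec_mask >> item) & 1:        # skip items disabled by control flow
            continue
        base = item * NUM_VGPRS                # contiguous register block for this item
        regfile[base + dst] = op(regfile[base + src0], regfile[base + src1])

regs = [0.0] * (WAVE * NUM_VGPRS)
for i in range(WAVE):                           # seed v0 = item index, v1 = 2.0
    regs[i * NUM_VGPRS + 0] = float(i)
    regs[i * NUM_VGPRS + 1] = 2.0

# v2 = v0 * v1 for a 32-wide wavefront with all items enabled
issue(lambda a, b: a * b, dst=2, src0=0, src1=1,
      regfile=regs, exec_mask=(1 << WAVE) - 1, wave_size=WAVE)
print([regs[i * NUM_VGPRS + 2] for i in range(4)])   # [0.0, 2.0, 4.0, 6.0]
```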

    Though of course, this means you would be putting in 32x the control logic too in instruction sequencing/pipelining, and 32x the execution latency for 32-wide wavefronts. Very close to the Nvidia Echelon model IIRC.

    As far as I know, Intel's approach is similar to this/SVE, except that the hardware is SIMD4/SIMD8, paired with a complex RF design.
     
    #474 pTmdfx, Jul 24, 2021
    Last edited: Jul 24, 2021
  15. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,928
    Likes Received:
    3,671
    Location:
    Guess...
    So about 7 PS5s at the low, low price of 5 PS5s. Sounds like a bargain.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,484
    Likes Received:
    1,844
    Location:
    London
    So

    COMPUTE UNIT SORTING FOR REDUCED DIVERGENCE - Advanced Micro Devices, Inc. (freepatentsonline.com)

    is extremely underwhelming. All the tricky questions were completely ignored:
    • how thresholds for profit in performing reorganisation are assessed
    • degradation due to the count of branch targets
    • the quantity of work item state that is reorganised
    • nested divergent control flow (e.g. the count of branch targets keeps increasing, which is even worse if it happens at a high rate relative to instructions executed)
    • the impact on execution (stalls) of having to wait for enough hardware threads to "settle" as a prerequisite for sorting and reorganisation
    Actual sorting techniques were not provided, the capacities of the hardware were not acknowledged as determining factors for sorting, and no hint of mitigations for the difficulties of moving data around a SIMD, CU or WGP was given.

    RDNA does have cross-lane data reads which could be a major component of "intra-wavefront" (intra-hardware thread) reorganisation. The bandwidth is low though, so there's a high cost to moving VGPRs on demand and then returning them after work items reconverge to their original scheduling. E.g. there might be 5 VGPRs required for the different sections of code targeted by control flow, but each work item has 20 VGPRs allocated in total.
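
    To put a picture on that cost (Python; the lane count, register values and "origin" permutation are invented, and this only mimics a pull-style cross-lane read, not any specific instruction): one cross-lane operation moves one VGPR across the wave, so every live VGPR of every reorganised work item needs its own move out and, after reconvergence, back again.

```python
def cross_lane_read(values, src_lane):
    """Pull-style cross-lane gather: lane i receives values[src_lane[i]]."""
    return [values[s] for s in src_lane]

# After sorting, lane i should execute the work item that originally lived in lane origin[i].
origin = [3, 1, 7, 5, 0, 2, 6, 4]             # hypothetical sort result for a toy 8-lane wave
v5     = [10, 11, 12, 13, 14, 15, 16, 17]     # one live VGPR, per lane, before reorganisation

print(cross_lane_read(v5, origin))            # [13, 11, 17, 15, 10, 12, 16, 14]
# ...and every other live VGPR needs the same treatment, which is where the bandwidth cost bites.
```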

    The description of the "inter-wavefront" technique avoids tackling the question of whether the two hardware threads in question are running on the same SIMD. The closest the document comes to acknowledging this question is "Different wavefronts of a single workgroup do not execute in the simultaneous SIMD manner described herein, although such wavefronts can execute concurrently on different SIMD units 138 of a single compute unit 132."

    So, in all, the document barely acknowledges the logistical problems of the algorithms it presents. I'm not convinced there's anything novel (patentable) in the algorithms presented.

    It's nice that the document refers expressly to ray tracing, and shows the problems associated with "uber ray shading" that arise when "closest hit", "miss" and "any hit" shaders are composed into a single uber shader (on top of the actual BVH traversal shader).

    The general case of divergent control flow while shading ray results is unavoidable, so anything that mitigates SIMD-wastage is welcome.

    The idea of multiple execution items per work item is presented.

    We're already familiar with this, in a sense. When a 64-item workgroup runs on an RDNA (2) SIMD, the compiler generates "hi" and "lo" 32-work-item halves, which are tackled either by alternating each instruction between the halves, or by scheduling one half to run to completion followed by the other.

    From the perspective of a hardware thread this is sort of as if each work item is mapped to two. For instance, in a pixel shader that is generally run as a 64-work-item workgroup, pixels 0 and 32 can be thought of as two execution items, both sharing work item 0 in the hardware thread. The register file can be viewed the same way: r0 in execution item 0 is matched by r32 in execution item 32.
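
    As a toy picture of that framing (Python; this only illustrates the "two execution items share one physical lane" view being described here, not the actual RDNA register-file layout):

```python
SIMD = 32   # physical lanes

def run_wave64(op, dst, srcs, vgprs):
    """One instruction for a 64-item wave on a 32-wide SIMD: two passes ('lo' items
    0-31, then 'hi' items 32-63), with two execution items sharing each physical lane."""
    for half in (0, 1):
        for lane in range(SIMD):
            item = half * SIMD + lane          # items 0 and 32 both map to lane 0
            vgprs[dst][item] = op(*(vgprs[s][item] for s in srcs))

# v0 = item index, v1 = 10.0; compute v2 = v0 + v1 across all 64 items
vgprs = {0: [float(i) for i in range(64)], 1: [10.0] * 64, 2: [0.0] * 64}
run_wave64(lambda a, b: a + b, dst=2, srcs=(0, 1), vgprs=vgprs)
print(vgprs[2][0], vgprs[2][32])               # both results were produced on physical lane 0
```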

    The document presents this as a way to run ray tracing: 2 or more rays (each being an execution item) share a work item. Then intra-wavefront reorganisation is used to "move" rays so that, after sorting, divergence is minimised across all of the execution items.

    The document also presents the idea of using criteria other than the branch target as input to the sorting algorithm (though it doesn't actually put this into the claims), e.g. the direction of rays. Say each ray is assigned to one of two hemispheres based on its direction; the rays can then be sorted into two groups, with the expectation of increased coherence as they traverse (or at least for their next intersection test).
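
    A sketch of that kind of key-based sort (Python; the hemisphere key, wave width and ray distribution are illustrative choices, not anything the document specifies):

```python
import random

WAVE = 32

def hemisphere_key(d):
    """Coarse sort key: which hemisphere about +Z the ray direction points into."""
    return 0 if d[2] >= 0.0 else 1

def pack_into_waves(rays, key=None):
    if key is not None:
        rays = sorted(rays, key=key)           # reorder the execution items before packing
    return [rays[i:i + WAVE] for i in range(0, len(rays), WAVE)]

def mixed_waves(waves):
    return sum(1 for w in waves if len({hemisphere_key(r) for r in w}) > 1)

rng = random.Random(0)
rays = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(4096)]

print("mixed waves, unsorted:", mixed_waves(pack_into_waves(rays)))
print("mixed waves, sorted:  ", mixed_waves(pack_into_waves(rays, key=hemisphere_key)))
```

    Nearly every unsorted wave mixes both hemispheres, while after sorting only the wave straddling the boundary does; whether real traversal gains enough coherence to pay for the reorganisation is exactly the kind of question the document skips.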

    Throughout, there's a de-emphasis of hardware functionality; instead the descriptions focus on shader code inserted by the compiler to perform the sorting. We can liken this to how vertex attribute interpolation is inserted by the compiler into pixel shaders. So the result of running such code (sorting by arbitrary conditions in order to "minimise divergence") could, ideally, be fed into hardware. Or it might just be turned into a stream of cross-lane or LDS-mediated data moves.

    For example, in the same way that an execution mask is a 32- or 64-wide register (per hardware thread) that SIMDs refer to when running instructions, a lane-mapping mask would instruct the SIMD and the operand collector how to run reorganised work items. There is no attempt to elucidate this, as the document is merely about sorting work items for reduced divergence and reorganising them within the workgroup (as if by magic).

    Currently we are left to infer that a workgroup of more than 32 work-items will run concurrently (though not necessarily in lock-step) across both SIMDs in an RDNA (2) CU, unless it's a pixel shader (which is usually 64 work items assigned to a single SIMD).

    What appears as a logically single LDS is effectively two LDSs, each with its own crossbar and queue, in order to serve its parent CU.

    So LDS and texturing (addressing, fetching, filtering) are notionally controlled by CU-level scheduling hardware.

    Only if workgroup processing mode is activated will the hardware threads have the option to occupy all four SIMDs and gain access to all physical addresses in LDS (e.g. a single 256-work item workgroup uses all 128KB of LDS). In this scenario the crossbars and queues are merged into a single functional unit in the worst case, to deal with atomics (e.g. read after increment).

    So now the CUs can no longer own their use of LDS; the WGP has to take on that responsibility. Some logic is therefore required to support dual responsibility for LDS usage.

    Without CUs, WGP has full control in theory. Can that be simplified so that LDS is always singular? Or does LDS remain as two arrays? With 2 arrays in CU mode, a WGP has "doubled" bandwidth, because the 2 arrays are truly independent. In WGP mode is that doubled bandwidth still available?

    In a CU-less WGP would that doubled bandwidth be available? Or would it still be constrained to 32 lanes at a time?

    Without CUs, what's the allocation policy for hardware threads across SIMDs inside the WGP, when those hardware threads are from the same workgroup? Always in (0,1) (2,3) pairs? Or greedy, finding the SIMD groups with the least work?

    What if there's 8 SIMDs? Do you really want to use anything other than greedy allocation?

    What's the effect of allocation policy on LDS bandwidth and latency?

    Blimey I've spent over 6 hours on this today and it isn't even bedtime. Result!
     
    T2098, Lightman, ethernity and 3 others like this.
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,484
    Likes Received:
    1,844
    Location:
    London
    I don't think you got it wrong.

    I didn't consider the effect on coherence versus LDS due to reorganisation, as I was too wrapped up in working out how to maintain efficient use of each work item's VGPR state (and effectively SGPR state too). Well, in theory you can add indirection to all the LDS accesses, which is natively supported. It's worth noting that as long as the reorganisation remains within the workgroup, the LDS base address is unaffected by it.
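
    A tiny sketch of that indirection (Python; the stride, the addressing and the "origin" table are invented for illustration): each reorganised lane addresses LDS through the work item index it originally held, so nothing in the workgroup's LDS allocation ever has to move.

```python
LDS_STRIDE = 16          # bytes of LDS per work item in this toy kernel

lds = bytearray(64 * LDS_STRIDE)               # one workgroup's slice of LDS

def lds_addr(original_item, offset):
    """Address by the item's *original* index, not by the lane it now occupies."""
    return original_item * LDS_STRIDE + offset

# After sorting, lane i is running the work item that was originally item origin[i].
origin = [12, 3, 40, 7]                        # hypothetical lanes 0..3 after reorganisation
for lane, item in enumerate(origin):
    addr = lds_addr(item, offset=0)
    lds[addr:addr + 4] = (item * 2).to_bytes(4, "little")    # store something per item

# Readback still lands on the right data regardless of which lane did the store.
print(int.from_bytes(lds[12 * LDS_STRIDE: 12 * LDS_STRIDE + 4], "little"))   # 24
```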

    Overall, yet another reason why the patent document seems so stunningly glib.

    A crazy person could implement sorting, data moves and LDS indirection. Or we could just wait a few years to see if AMD can find a crazy person to finally get this working automagically in the driver.

    Maybe, right now some leet console dev is finalising their implementation. Looks entirely doable in software on RDNA 2. Maybe an uber ray tracing shader is the only way this can be a win?
     
  18. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    875
    Likes Received:
    205
    Location:
    'Zona
    In a future design... Would there be any major drawbacks to a "graphics" base die/s with cache in the middle and a shader/compute die/s on top?

    In my head the base die would likely have to be ~80-150mm2 for the minimum sized interface to reach the required bandwidth levels, 64/96/128bit.
     
  19. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,564
    Likes Received:
    757
    3-Hi on logic sounds toasty.
    So probably not soon.
    Base dies are memes when you can just chain stuff together.
     
  20. Leoneazzurro5

    Regular Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    305
    Likes Received:
    326
    There is a valid reason for not going the "workgroup count" route, and that is marketing. Look at Nvidia and their "CUDA cores" marketing, which counts each FP32-capable ALU as a separate "core" even though the logic is shared at the SM level. It is bad to market "SMs" when the 3080 and the 2080 Ti have the same count, but you can market "8704" vs "4352" CUDA cores. Yes, peak FP32 capability is double per SM (though it is also true you cannot count on that peak rate in every workload). Marketing loves numbers; the bigger the number, the better. Now, if AMD were to go for a "workgroup" count, they would be advertising a 33% regression per die even if the FP resources were 50% greater per die. Would you market the card as "30 workgroups per die" vs "40 workgroups per die", or as "7680 ALUs per die" vs "5120 ALUs per die"? For people who understand the tech terms it would be the same, but for most customers 30 is less than 40.
     