AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
It should be on MCD.
I imagine AMD would take the best of both worlds. N5P GCD for absolute logic density and performance and N6 MCD with HD/SRAM optimized libraries for lower cost per MB IC.
N5P SRAM density gain over N7/N6 is very mediocre.
512MB of SRAM on N6 with optimized libraries would only be 280-300mm2 (figures estimated from WikiChip data, behind paywall). On N5 it's hardly any better, around 250+mm2, but much costlier.
But all those logic blocks can scale very high almost 1.48x with N5P (assuming AMD goes with N5P for GPUs else 1.85x on plain N5)
I suppose 2x GCD + 1x MCD would be closing in around 1000mm2 or maybe even more. Will cost a pretty penny.

I don't know if @Bondrewd can give a hint if N5 or N5P


I thought the MCDs were being packaged underneath the GCDs. That way there would be two MCDs of 256MB each.

Though perhaps it would make more sense if the LLC is indeed centralized and serves both GCDs, for coherency.
 
But all those logic blocks can scale very high almost 1.48x with N5P (assuming AMD goes with N5P for GPUs else 1.85x on plain N5)
Wut.
N5p is better logic scaling over N5.
Don't mistake it for N5HPC which is a memetastic offering AMD would never touch.
I suppose 2x GCD + 1x MCD would be closing in around 1000mm2 or maybe even more. Will cost a pretty penny.
Yea it's a config designed around being unreachable by anything or anyone.
Patents indicate 1 MCD and also Kopite is hinting 1 MCD
Patents aren't really if at all representative of the real thing.
 
N5p is better logic scaling over N5.
Don't mistake it for N5HPC which is a memetastic offering AMD would never touch.
Ah yes, you are right, I had it in my head as HP.
But the scaling is anyway minimal from N5 to N5P, low single digit, mainly a power efficiency gain.

Yes, I don't imagine they will touch that, too high leakage.
 
If WGPs are the smallest unit of compute (i.e. CU is disappearing) then we better be getting conditional routing. 480x SIMD-32s? 960x SIMD-16s? 240x SIMD-64s?

AMD has already set the precedent that graphics kernels can use either 32 or 64 work item workgroups, so is there any reason not to continue down this path and increase the count of workgroup sizes that the hardware directly supports? Once that's done, adding conditional routing on top and letting the hardware vary workgroup sizes on the fly (splitting and merging workgroups) would be cool.

I've never bothered to understand what Intel has done with its variable width workgroups, but I'm going to guess that something like that is what AMD is planning. But allowing kernels that feature dynamic control flow to be able to be conditionally-routed directly by the hardware would be a huge step up from that.

It would be amusing if NVidia is also about to do the same thing in Hopper/Lovelace.
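As a toy illustration of what hardware-level conditional routing might buy (entirely my own sketch; `route_lanes` and the packing policy are made up, nothing AMD has described): regroup lanes by branch target before issue, so that each fixed-width group executes branch-coherently.

```python
# Hypothetical sketch of conditional routing: given a 64-lane wavefront
# where each lane takes one of several branch targets, regroup lanes so
# each fixed-width hardware group is branch-coherent.

def route_lanes(branch_targets, group_width):
    """Group lane indices by branch target, then pack them into
    fixed-width groups. Returns a list of (target, lanes) groups."""
    by_target = {}
    for lane, target in enumerate(branch_targets):
        by_target.setdefault(target, []).append(lane)
    groups = []
    for target, lanes in sorted(by_target.items()):
        for i in range(0, len(lanes), group_width):
            groups.append((target, lanes[i:i + group_width]))
    return groups

def wasted_slots(groups, group_width):
    """SIMD slots left idle because a group is not full."""
    return sum(group_width - len(lanes) for _, lanes in groups)

# 64 lanes, alternating between two branch targets: the pathological
# case for a plain exec-masked SIMD.
targets = [lane % 2 for lane in range(64)]
routed = route_lanes(targets, 32)
# Without routing, every 32-wide issue runs both sides of the branch;
# with routing, each target gets one full, coherent 32-wide group.
```

The hard part the toy skips is exactly what the thread worries about: actually moving each lane's register state to its new slot.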
 
If WGPs are the smallest unit of compute (i.e. CU is disappearing) then we better be getting conditional routing. 480x SIMD-32s? 960x SIMD-16s? 240x SIMD-64s?

Smallest unit of compute doesn't normally indicate the width of each SIMD. CUs have multiple independent fixed-width SIMDs today so future WGPs can presumably follow the same model.

AMD has already set the precedent that graphics kernels can use either 32 or 64 work item workgroups

Workgroups (blocks) or wavefronts (warps)?
 
Smallest unit of compute doesn't normally indicate the width of each SIMD. CUs have multiple independent fixed-width SIMDs today so future WGPs can presumably follow the same model.
I'm actually speculating that AMD goes with a sea of conditionally routed SIMD-16s per WGP. When there's no dynamic branching the compiler can choose to emit workgroups of 32 or 64 (2- or 4-cycle instruction-repeats).

Workgroups (blocks) or wavefronts (warps)?
I use OpenCL's terminology and I'm referring to graphics kernels (not compute kernels) since with graphics kernels (commonly vertex and pixel shaders) developers are not setting the size of the workgroup explicitly. So the driver (or hardware) can do what it wants.

For compute, with dynamic branching, being able to go down to a workgroup size of 16 is advantageous. If the hardware is using a base of SIMD-16 but can merge workgroups to run as a virtual workgroup size of 32 or 64 (or larger) for clauses of kernels where there is no dynamic branching then the hardware can adapt more flexibly to the work it's given. There are plenty of times that compute prefers larger workgroup sizes and sometimes the developer knows that too...

AMD introduced SIMD-32 with workgroups of 32 for graphics kernels because graphics benefits from the shorter latencies of that configuration (individual work items spend less time parked and are therefore more likely to get repeat hits in the texture cache, rather than having it thrashed by other work items).

You can do conditional routing with SIMD-64, but it seems likely that narrower SIMDs result in easier wiring/timing (since the register file is a bottleneck and then, erm, there's LDS). SIMD-16 with conditional routing could have performance more like SIMD-8 when there's dynamic branching. Pixel shaders may well benefit from being issued in workgroups of 16 work items too.
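A back-of-envelope model of why narrower SIMDs help under dynamic branching (my own toy maths, not measured data): if each work item independently takes a branch with probability p, a diverged wave has to issue both paths under an exec mask, halving useful throughput for that wave, and narrower waves diverge less often.

```python
# Toy divergence model: each work item takes branch A with probability p.
# A w-wide wave is "coherent" only if all lanes agree; otherwise both
# paths are issued under an exec mask, so useful throughput is halved.

def expected_utilization(p, width):
    coherent = p ** width + (1 - p) ** width  # all lanes agree
    return coherent * 1.0 + (1 - coherent) * 0.5

# With a 10% branch probability, utilization falls as the SIMD widens:
for w in (8, 16, 32, 64):
    print(w, round(expected_utilization(0.1, w), 3))
```

Under these assumptions SIMD-16 sits noticeably above SIMD-32 for modestly divergent code, which is the sweet-spot argument in words above; real hardware adds scheduling overheads the model ignores.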

If the death of the CU is true, then it seems to me that the WGP is a single level of control for all use of TMUs, RAs, LDS, SIMDs and perhaps conditional routing. The WGP is now directly controlling more SIMDs. Finally, if the WGP is being redesigned without a CU sub-level then while introducing conditional routing it would be cool to use narrower SIMDs to really make the most of the complexity that conditional routing requires.

Oh and maybe there'll only be one TMU per WGP? Texel rates have reached silliness and I suspect do not need to go higher than currently seen. That wouldn't directly affect the RA count in my opinion, e.g. there could be 4 RAs per WGP.

In the end I'm saying all of this simply because "no CUs" implies a major re-working of the control of all the modules inside a WGP. And I'm throwing conditional routing in there because fucking hell, it's about time. And ray traversal shaders could do with some help there.
 
AMD introduced SIMD-32 with workgroups of 32 for graphics kernels because graphics benefits from the shorter latencies of that configuration

I agree with the rest, but this is not true. That might be the stated reason, but the actual reason is that too much code is specifically optimized for nV cards, which have SIMD-32 workgroups.
 
I agree with the rest, but this is not true. That might be the stated reason, but the actual reason is that too much code is specifically optimized for nV cards, which have SIMD-32 workgroups.
Seems reasonable, but in pixel shading there's not much that a developer can do to specify how the pixel shader maps to the hardware's base workgroup size.

I think that a g-buffer fill pass in deferred rendering will benefit from RDNA's SIMD-32s (reduced latency). Here the pixel shader is low in texturing, which reduces the need for texture latency hiding. That should then mean that render target writes tend towards being in localised clumps within the render target, rather than tending to be scattered by the longer-latency execution pattern seen in GCN.

GCN prefers to have lots of fragments in flight per SIMD because switching hardware threads costs relatively more latency - i.e. latency hiding (switching hardware threads) in GCN adds more to the overall kernel duration than it does in RDNA. GCN has a double-whammy of latency, as it were.

Localised clumps of render target writes help with render target caching, with or without delta colour compression. Maxwell, onwards, gained a huge bump in g-buffer fill pixel shading because of the tiled forward architecture, which further intensifies locality in render target writes.

So my guess is that as deferred rendering has become dominant, with g-buffer fill becoming so painful, RDNA has clawed back against NVidia just because of reduced pixel shading latency, because it improves the coherency of render target writes.

The cost for RDNA is the extra scheduling logic and the fact that that logic is running more intensively. If RDNA 3 has conditional routing and SIMD-16s then there would be yet more scheduling cost. And register file buffering. It seems like this would end up needing complex scoreboarding for hardware threads and register reads.

RDNA has already paid many costs in scheduling overhead versus GCN, yet it gained efficiency. It can be argued that RDNA achieved the efficiency gain despite increased overheads specifically because of its wider SIMDs, changing SIMD-16 to SIMD-32.

My proposal for single-cycle SIMD-16 issue would be backwards in those terms, due to another increase in scheduling overhead. My guess is that conditional routing itself adds so much scheduling overhead that the sweet spot changes again, towards SIMD-16 from SIMD-32.

I suppose I should read that patent document that was linked earlier in the thread:

COMPUTE UNIT SORTING FOR REDUCED DIVERGENCE - Advanced Micro Devices, Inc. (freepatentsonline.com)

There's nothing to say that conditional routing needs SIMD-16, I'm just wondering whether the simplification of removed CU-level scheduling might be traded for conditional routing complexity and narrower SIMDs.
 
Oh and maybe there'll only be one TMU per WGP? Texel rates have reached silliness and I suspect do not need to go higher than currently seen. That wouldn't directly affect the RA count in my opinion, e.g. there could be 4 RAs per WGP.

Yeah I'm surprised TMU ratios haven't already been cut back. Maybe there are bursty texture heavy workloads that still benefit.

In the end I'm saying all of this simply because "no CUs" implies a major re-working of the control of all the modules inside a WGP. And I'm throwing conditional routing in there because fucking hell, it's about time. And ray traversal shaders could do with some help there.

I don't see why it implies a major reworking. In RDNA each SIMD is already independent with its own instruction scheduler, wavefront controller, scalar ALU and SFU. In practical terms "no CUs" just means that it's always operating in "workgroup-processor-mode" and each workgroup always has access to the full LDS within the WGP. I said this at RDNA launch - AMD should drop the unnecessarily confusing WGP terminology and just stick with CU. The CU is just bigger and badder now.
 
Regarding TMUs: You need the fast datapaths to the CU's caches for compute and RT as well. Might keep the little filters around for a while longer.
 
I don't see why it implies a major reworking. In RDNA each SIMD is already independent with its own instruction scheduler, wavefront controller, scalar ALU and SFU. In practical terms "no CUs" just means that it's always operating in "workgroup-processor-mode" and each workgroup always has access to the full LDS within the WGP. I said this at RDNA launch - AMD should drop the unnecessarily confusing WGP terminology and just stick with CU. The CU is just bigger and badder now.
I agree that WGP terminology in a "no CUs" configuration would be confusing and could revert to simply CU-centric.

I'm struggling to understand what the CUs are for in RDNA (2) apart from localising TMUs/L0$, making them "more similar to the operation of GCN".

From the perspective of a "no CUs" configuration RDNA (2) can be seen as a half step.

I'll return to your point about "major reworking" when I post about the compute unit sorting patent document.

Regarding TMUs: You need the fast datapaths to the CU's caches for compute and RT as well. Might keep the little filters around for a while longer.
This is very true.

AMD did go to town by doubling cache line widths for RDNA and increasing texturing throughput per unit. I dare say it seems unlikely cache line widths would double again. The only meaningful per-unit throughput increase available would be for filtering of 128-bit textures (e.g. 4x fp32), since fp16 is already full rate (isn't it? - I can't find benchmarks).

AMD has demonstrated a trend towards increased cache density and texturing throughput per work item, both of which contradict my opinion about a singular, larger, L0$ and reduced texturing unit count per WGP.

It's worth noting that L0$ cache density per work item hasn't increased with RDNA (L1$ in GCN, there was no L0$), merely for I$ and K$. So perhaps I'm not entirely pissing into the wind.

I found this slide deck, which I've not seen before:

AMD PowerPoint- White Template (gpuopen.com)

The first bulleted list seems to exaggerate when it says "Resources of two Compute Units available to a single workgroup": yes the LDS which is a resource nominally per-CU can be fully utilised by a single workgroup running across all 4 SIMDs (any SIMD can access any address in LDS). But a SIMD does not have access to all of the WGP's L0$ and all of the WGP's texturing throughput.

That can be seen as deliberate: texture/data fetch needs to be optimised for bandwidth.

If optimising for latency you would expect that each SIMD would have its own L0$ and texturing. Yet a primary directive in the design of RDNA was reduced latency.

So there's a conflict between bandwidth/throughput versus localisation/latency.

Perhaps a larger WGP-level L0$ but with multiple (two?) TMUs.
 
The only meaningful per-unit throughput increase available would be for filtering of 128-bit textures (e.g. 4x fp32), since fp16 is already full rate (isn't it? - I can't find benchmarks).
Yes. As well as FP32 can be, as long as it's only using two channels.

edit: it also says so on p 36 of that arch presentation
 
I've never bothered to understand what Intel has done with its variable width workgroups, but I'm going to guess that something like that is what AMD is planning. But allowing kernels that feature dynamic control flow to be able to be conditionally-routed directly by the hardware would be a huge step up from that.
You could in theory take the SVE variable vector path to the extreme, having a single-lane ALU pipeline (i.e., non-SIMD), and executing any arbitrary sized wavefront by looping the same instructions with contiguous blocks of registers as src/dst, while skipping items disabled in the exec mask. That would effectively turn an RDNA WGP into a 128 CPU-ish core cluster... :p

Though of course, this means you would be putting in 32x the control logic too in instruction sequencing/pipelining, and 32x the execution latency for 32-wide wavefronts. Very close to the Nvidia Echelon model IIRC.

As far as I know, Intel's approach is similar to this/SVE, except that the hardware is SIMD4/SIMD8, paired with a complex RF design.
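The "loop a scalar pipe over the wavefront" idea above can be sketched like this (purely illustrative model; the function names are mine):

```python
# Toy model of executing an arbitrary-width wavefront on a single-lane
# ALU: iterate over lanes, skipping those disabled in the exec mask,
# as in the SVE-to-the-extreme idea described above.

def execute_wave(op, srcs, dst, exec_mask):
    """op: scalar function; srcs: list of per-lane source registers
    (lists); dst: per-lane destination register (list, mutated in
    place); exec_mask: per-lane enable bits."""
    for lane, enabled in enumerate(exec_mask):
        if not enabled:
            continue  # masked-off item: skipped, no cycle spent on it
        dst[lane] = op(*(s[lane] for s in srcs))

v0 = [1, 2, 3, 4]
v1 = [10, 20, 30, 40]
v2 = [0, 0, 0, 0]
# A 4-wide "wavefront" with lane 1 disabled: v_add with an exec mask.
execute_wave(lambda a, b: a + b, [v0, v1], v2, [1, 0, 1, 1])
# v2 == [11, 0, 33, 44]; lane 1 untouched
```

The control-logic and latency costs mentioned above fall straight out of the model: the loop is as long as the wavefront is wide.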
 
So

COMPUTE UNIT SORTING FOR REDUCED DIVERGENCE - Advanced Micro Devices, Inc. (freepatentsonline.com)

is extremely underwhelming. All the tricky questions:
  • how are thresholds for profit in performing reorganisation assessed
  • degradation due to count of branch targets
  • quantity of work item state that is reorganised
  • nested divergent control flow (e.g. count of branch targets keeps increasing and even worse if it does so at a high rate relative to instructions executed)
  • impact on execution (stalls) caused by having to wait for enough hardware threads to "settle" as a prerequisite for sorting and reorganisation
were completely ignored. Actual sorting techniques were not provided, the capacities of the hardware were not acknowledged as determining factors for sorting, and no hint of mitigations for the difficulties of moving data around a SIMD, CU or WGP was provided.

RDNA does have cross-lane data reads, which could be a major component of "intra-wavefront" (intra-hardware-thread) reorganisation. The bandwidth is low though, so there's a high cost to moving VGPRs on demand and then returning them after work items reconverge to their original scheduling - e.g. there might be 5 VGPRs required for the different sections of code targeted by control flow, but each work item has 20 VGPRs allocated in total.
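To make that cost concrete, here's an illustrative model (my own, not AMD's mechanism) of permuting only the live VGPRs via cross-lane moves and then restoring them at reconvergence:

```python
# Toy model of intra-wavefront reorganisation via cross-lane moves:
# only the registers live in the divergent region (say 5 of 20) are
# permuted, and each permuted register costs a cross-lane pass both
# on the way in and on the way back at reconvergence.

def permute_lanes(reg, perm):
    """Cross-lane move: lane i reads the value held by lane perm[i]."""
    return [reg[perm[i]] for i in range(len(reg))]

def reorganise(live_regs, perm):
    """Permute only the live registers; also return the inverse
    permutation needed to restore the original lane assignment."""
    inverse = [0] * len(perm)
    for i, p in enumerate(perm):
        inverse[p] = i
    return [permute_lanes(r, perm) for r in live_regs], inverse

r0 = [5, 6, 7, 8]               # one live VGPR across a 4-lane toy wave
perm = [2, 0, 3, 1]             # e.g. lanes sorted by branch target
(shuffled,), inv = reorganise([r0], perm)
restored = permute_lanes(shuffled, inv)   # back to original scheduling
```

With 5 live registers out of 20 allocated, the model makes the trade-off visible: 10 cross-lane passes round-trip, versus leaving 15 registers untouched.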

The description of the "inter-wavefront" technique avoids tackling the question of whether the two hardware threads in question are running on the same SIMD. The closest the document comes to acknowledging this question is "Different wavefronts of a single workgroup do not execute in the simultaneous SIMD manner described herein, although such wavefronts can execute concurrently on different SIMD units 138 of a single compute unit 132."

So in all, the document barely acknowledges the logistical problems of the algorithms it presents. I'm not convinced there's anything novel (patentable) in the algorithms presented.

It's nice that the document referred expressly to ray tracing, and shows the problems associated with "uber ray shading" that comes when "closest hit", "miss" and "any hit" shaders are composed into a single uber shader (on top of the actual BVH traversal shader).

The general case of divergent control flow while shading ray results is unavoidable, so anything that mitigates SIMD-wastage is welcome.

The idea of multiple execution items per work item is also presented.

We're already familiar with this, in a sense. When a 64-item workgroup runs on an RDNA (2) SIMD, the compiler generates "hi" and "lo" 32-work-item halves, which are tackled either by alternating each instruction between the halves, or by scheduling one half to run to completion followed by the other.

From the perspective of a hardware thread this is sort of as if each work item is mapped to two. For instance in a pixel shader that is generally run as a 64 work item workgroup, pixels 0 and 32 can be thought of as two execution items, both sharing work item 0 in the hardware thread. The register file can be viewed the same way: r0 in execution item 0 is matched by r32 in execution item 32.

The document presents this as a way to run ray tracing: 2 or more rays (each being an execution item) share a work item. Then intra-wavefront reorganisation is used to "move" rays so that, after sorting, divergence is minimised across all of the execution items.

The document also presents the idea of using criteria other than the branch target as input to the sorting algorithm (though it doesn't actually put this into the claims), e.g. the direction of rays. Say rays are split into one of two hemispheres of a sphere, then the rays can be sorted into two groups, with the expectation of increased coherence as they traverse (or at least for their next intersection test).
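The hemisphere sort can be sketched as follows (a toy; `hemisphere_key` and the two-way split are my invention, the patent doesn't specify the sorting key):

```python
# Toy version of sorting rays by a coarse direction key, so that after
# reorganisation rays grouped into the same SIMD lanes tend to walk
# similar parts of the BVH.

def hemisphere_key(direction):
    """0 if the ray points into the +z hemisphere, 1 otherwise."""
    return 0 if direction[2] >= 0.0 else 1

def sort_rays(rays):
    # Stable sort: original order is preserved within each hemisphere.
    return sorted(rays, key=hemisphere_key)

rays = [(0.0, 0.0, 1.0), (0.1, 0.0, -1.0),
        (0.0, 0.2, 0.5), (0.0, 0.0, -0.3)]
grouped = sort_rays(rays)
# grouped: the two +z rays first, then the two -z rays
```

A real implementation would presumably use a finer key (octants, or quantised direction bits), but the coherence argument is the same.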

Throughout there's a de-emphasis of hardware functionality, instead descriptions focus on shader code inserted by the compiler, to perform the sorting. We can liken this to how vertex attribute interpolation is inserted by the compiler into pixel shaders. So the result of running such code (sorting by arbitrary conditions in order to "minimise divergence") could, ideally, be fed into hardware. Or it might just be turned into a stream of cross-lane or LDS-mediated data moves.

For example, in the same way that an execution mask is a 32- or 64-wide register (per hardware thread) that SIMDs refer to when running instructions, a lane-mapping mask would instruct the SIMD and the operand collector how to run reorganised work items. There is no attempt to elucidate this, as the document is merely about sorting work items for reduced divergence and reorganising them within the workgroup (as if by magic).

I don't see why it implies a major reworking. In RDNA each SIMD is already independent with its own instruction scheduler, wavefront controller, scalar ALU and SFU.
Currently we are left to infer that a workgroup of more than 32 work-items will run concurrently (though not necessarily in lock-step) across both SIMDs in an RDNA (2) CU, unless it's a pixel shader (which is usually 64 work items assigned to a single SIMD).

What appears as a logically single LDS is effectively two LDSs, each with its own crossbar and queue, in order to serve its parent CU.

So LDS and texturing (addressing, fetching, filtering) are notionally controlled by CU-level scheduling hardware.

Only if workgroup processing mode is activated will the hardware threads have the option to occupy all four SIMDs and gain access to all physical addresses in LDS (e.g. a single 256-work item workgroup uses all 128KB of LDS). In this scenario the crossbars and queues are merged into a single functional unit in the worst case, to deal with atomics (e.g. read after increment).

So now the CUs can no longer own their use of LDS, the WGP has to take on that responsibility. So some logic is required to support dual-responsibility for LDS usage.

Without CUs, the WGP has full control in theory. Can that be simplified so that LDS is always singular? Or does LDS remain as two arrays? With 2 arrays in CU mode, a WGP has "doubled" bandwidth, because the 2 arrays are truly independent. In WGP mode is that doubled bandwidth still available?

In a CU-less WGP would that doubled bandwidth be available? Or will it still be 32-lanes at-a-time constrained?

Without CUs, what's the allocation policy for hardware threads across SIMDs inside the WGP, when those hardware threads are from the same workgroup? Always in (0,1) (2,3) pairs? Or greedy, finding the SIMD groups with the least work?

What if there's 8 SIMDs? Do you really want to use anything other than greedy allocation?

What's the effect of allocation policy on LDS bandwidth and latency?
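For what greedy allocation might look like versus fixed pairing, a minimal sketch (my own model; the real scheduler's policy and cost metric are unknown):

```python
# Toy model of greedy hardware-thread allocation across a WGP's SIMDs:
# each new hardware thread goes to the SIMD with the least resident
# work, rather than always landing in fixed (0,1)/(2,3) pairs.
import heapq

def greedy_allocate(num_simds, thread_costs):
    """Place each thread on the currently least-loaded SIMD.
    Returns the per-SIMD loads, sorted."""
    heap = [(0, s) for s in range(num_simds)]
    heapq.heapify(heap)
    for cost in thread_costs:
        load, s = heapq.heappop(heap)   # least-loaded SIMD
        heapq.heappush(heap, (load + cost, s))
    return sorted(load for load, _ in heap)

# Six hardware threads of uneven cost across 4 SIMDs:
loads = greedy_allocate(4, [4, 3, 3, 2, 2, 1])
# Greedy keeps the maximum load close to the average (15/4).
```

The open question in the post still stands: greedy balancing is clearly better for occupancy, but it scatters a workgroup's threads, which is exactly what complicates LDS bandwidth and latency.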

Blimey I've spent over 6 hours on this today and it isn't even bedtime. Result!
 
I could imagine the compiler simply looks for branches in code, adds 'sorting barriers', workgroups go idle on that and SIMDs continue with another one. Another unit performs the sorting / reordering on the idle tasks concurrently?
If so, ray tracing may have been listed just as an example in the patent.

Edit: probably I'm getting it wrong, as reordering threads this way would destroy their link to LDS memory. But 'single threaded' shader stages like pixel or ray shaders might work.
I don't think you got it wrong.

I didn't consider the effect on coherence versus LDS due to reorganisation, as I was too wrapped-up in working out how to maintain efficient use of each work item's VGPR state (and effectively SGPR state too). Well in theory you can add indirection to all the LDS accesses, which is natively supported. Worth noting that as long as the reorganisation remains within the workgroup, the LDS base address is unaffected by reorganisation.
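The LDS-indirection idea can be sketched like so (illustrative only; `LDS_BYTES_PER_ITEM` and the layout are invented):

```python
# Toy model of indirected LDS access after reorganisation: each lane
# carries its ORIGINAL lane id, and LDS accesses are based off that
# id's slice, so a reorganised work item still sees its own LDS data.

LDS_BYTES_PER_ITEM = 16  # invented allocation granularity

def lds_read(lds, orig_lane, offset):
    return lds[orig_lane * LDS_BYTES_PER_ITEM + offset]

# Pretend each original lane wrote its own id at offset 0 of its slice.
lds = [0] * (4 * LDS_BYTES_PER_ITEM)
for lane in range(4):
    lds[lane * LDS_BYTES_PER_ITEM] = lane

# Lanes were permuted; physical lane i now runs the work item that was
# originally at orig[i].
orig = [2, 0, 3, 1]
values = [lds_read(lds, orig[i], 0) for i in range(4)]
# Each reorganised lane still reads its own work item's data.
```

As noted above, the base address is unaffected as long as reorganisation stays within the workgroup; the cost is that every LDS access becomes indirect.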

Overall, yet another reason why the patent document seems so stunningly glib.

A crazy person could implement sorting, data moves and LDS indirection. Or we could just wait a few years to see if AMD can find a crazy person to finally get this working automagically in the driver.

Maybe, right now some leet console dev is finalising their implementation. Looks entirely doable in software on RDNA 2. Maybe an uber ray tracing shader is the only way this can be a win?
 
In a future design... Would there be any major drawbacks to a "graphics" base die/s with cache in the middle and a shader/compute die/s on top?

In my head the base die would likely have to be ~80-150mm2 for the minimum sized interface to reach the required bandwidth levels, 64/96/128bit.
 
I agree that WGP terminology in a "no CUs" configuration would be confusing and could revert to simply CU-centric.

I'm struggling to understand what the CUs are for in RDNA (2) apart from localising TMUs/L0$, making them "more similar to the operation of GCN".

There is a valid reason for not going the "workgroup count" route, and that is marketing. Look at Nvidia and their "CUDA cores" marketing, which counts the FP32-capable ALUs as separate "cores" while in reality the logic is shared at SM level. It is bad to market "SMs" when the 3080 and the 2080 Ti have the same count, so instead you market "8704" vs "4352" CUDA cores. Yes, peak FP32 capability is double per SM (though it is also true that you cannot count on that peak rate in every workload). Marketing loves numbers: the bigger the number, the better. Now, if AMD went for a "workgroup" count they would be announcing a 33% regression for each die, even if the FP resources were 50% more per die. Would you market the card as "30 workgroups per die" vs "40 workgroups per die", or "7680 CUs per die" vs "5120 CUs per die"? For people who understand the tech terms it would be the same, but for most customers 30 is less than 40.
 