AMD: RDNA 3 Speculation, Rumours and Discussion

Would a ring with 16 stops (8x SIMDs/RFs + 8x TMUs/RAs) be practical?
To maintain the current performance profile of the CU mode (64KB), the SIMD pair needs uniform access to the 64KB array. I don't see that changing.

My speculation is that far-region requests and responses are carried between the two 64KB arrays through a 2-stop ring bus. So in an 8-SIMD setting, we might be looking at a 4-stop ring bus that moves maybe 128B/clk in both directions. It is also likely an implementation detail hidden behind the LDS request scheduler, since the gather/scatter model means any lane could load from any region (on top of bank conflicts), and the LDS scheduler has to break down the requests (currently 32 rounds for the worst bank-conflict scenario :p) and also make sure the results are packed back into a vector-32 before sending them on their way to the RF.
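To put rough numbers on that worst case, here is a minimal sketch of the serialization, assuming 32 banks of 32-bit words and the usual bank = (addr / 4) mod 32 mapping (the broadcast case for identical addresses is ignored):

# Rough sketch of LDS bank-conflict serialization for one wave32 of 4-byte loads.
# Assumes 32 banks of 32-bit words, bank = (byte_addr // 4) % 32, and that lanes
# hitting the same bank with different addresses are serialized.
from collections import Counter

NUM_BANKS = 32

def lds_rounds(byte_addrs):
    """Number of access rounds needed for one wave32 of 4-byte loads."""
    banks_hit = Counter((addr // 4) % NUM_BANKS for addr in byte_addrs)
    # The worst bank decides the round count (broadcast of identical addresses ignored).
    return max(banks_hit.values())

# Stride-1 access: every lane lands on a different bank -> 1 round.
print(lds_rounds([4 * lane for lane in range(32)]))    # 1
# Stride-128B access: every lane lands on bank 0 -> 32 rounds, the worst case.
print(lds_rounds([128 * lane for lane in range(32)]))  # 32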
 
I wouldn't hinge too much on these ratios. Empirically, the architecture has already been dealing with asymmetry in wavefront width,
RDNA actually emulates a hardware thread size of 64 by joining two hardware threads together. 32 is the hardware thread size now.

execution unit width and non-pipelined stages (texture filtering being the staple example).
RA throughput (latency) seems to be the real problem:


I think that's what it's showing. What do you think?

To maintain the current performance profile of the CU mode (64KB), the SIMD pair needs uniform access to the 64KB array. I don't see that changing.
My understanding is that LDS has been 32-wide per clock for a long time, so 64 work items require at least two clocks. And with "CU mode" disappearing (according to rumours), the entire WGP seems like it will be based upon variable latency LDS with 32 as the baseline.

I didn't know about the "near" versus "far" thing, so it seems like that would fit into a design where LDS is segmented into a "near" per SIMD and then "far" for the other SIMDs. Near would suit things like vertex attributes consumed by a fragment shader.

Then there'd be 32KB LDS arrays, one per SIMD? Or maybe double that, because ray tracing seems to put a lot of stress on latency-hiding and it hits LDS hard.

My speculation is that far-region requests and responses are carried between the two 64KB arrays through a 2-stop ring bus. So in an 8-SIMD setting, we might be looking at a 4-stop ring bus that moves maybe 128B/clk in both directions. It is also likely an implementation detail hidden behind the LDS request scheduler, since the gather/scatter model means any lane could load from any region (on top of bank conflicts), and the LDS scheduler has to break down the requests (currently 32 rounds for the worst bank-conflict scenario :p) and also make sure the results are packed back into a vector-32 before sending them on their way to the RF.
One of the problems with my theory about LDS acting as a "crossbar" for TMU/RA queries and responses is that the TMUs/RAs would then also need to engage in LDS requests, so the WGP needs to be the master of those requests whether they come from ALUs/RFs or from TMUs/RAs.

So a ring bus would solve the problem of queries/responses having to make a stopover within LDS.
 
RDNA actually emulates a hardware thread size of 64 by joining two hardware threads together. 32 is the hardware thread size now.
I do not mean that these things are asymmetric on their own, but rather the asymmetry across the execution pipelines.

For example, the TMU can filter only up to 4 FP16 texels per clock (data twice the width of a 32-bit lane, working on only 1/8 of a wave at a time). The results have to be coalesced (and expanded if requested) back into 2 to 4 vec32 registers before being written back to the RF. Meanwhile, instructions are routed to the TA/TMU as a pack of 32 parameters, hence an asymmetry between the wavefront width and what the TMU can actually do per clock (which can also vary based on the input parameters).
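Taking the figures above at face value (they are from memory, not from documentation), the mismatch per wave works out like this:

# Back-of-envelope look at the wavefront-width vs. TMU-throughput asymmetry
# described above; figures are taken from the post, not from documentation.
WAVE_WIDTH = 32            # lanes issuing the sample instruction
FP16_FILTERED_PER_CLK = 4  # filtered FP16 texels per TMU per clock (per the post)

clocks_per_wave = WAVE_WIDTH / FP16_FILTERED_PER_CLK
print(clocks_per_wave)     # 8.0 -> the TMU chews through 1/8 of the wave per clock,
                           # so results trickle back and must be re-packed into vec32
                           # registers before the writeback to the RF.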

There are clues across the ISA manuals and patents indicating that these architectures have been dealing with such asymmetries using schedulers and result-return queues.

My understanding is that LDS has been 32-wide per clock for a long time, so 64 work items require at least two clocks. And with "CU mode" disappearing (according to rumours), the entire WGP seems like it will be based upon variable latency LDS with 32 as the baseline.
That is indeed a possibility. But if one wants to keep the bank interleaving while scaling beyond 128B/clk (1 vec32/clk), it does not seem an easy goal at all.

I didn't know about the "near" versus "far" thing, so it seems like that would fit into a design where LDS is segmented into a "near" per SIMD and then "far" for the other SIMDs. Near would suit things like vertex attributes consumed by a fragment shader.

Then there'd be 32KB LDS arrays, one per SIMD? Or maybe double that, because ray tracing seems to put a lot of stress on latency-hiding and it hits LDS hard.
There are currently two 64KB LDS arrays, one for each pair of SIMDs ("CU"). Both SIMDs in a pair have full access to their LDS array (sharing the 128B/clk data paths, of course). The near/far thing applies only when LDS WGP mode is enabled, where all SIMDs can access both 64KB LDS arrays (resulting in a shared 128KB).
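Purely as a toy model of the near/far split I am speculating about (the real address interleaving in WGP mode is not something I know), it would look roughly like this:

# Toy model of the near/far split in LDS WGP mode; the real address interleaving
# is not public, this mapping is made up purely for illustration.
LDS_ARRAY_SIZE = 64 * 1024              # two 64KB arrays per WGP

def owning_array(byte_addr):
    return byte_addr // LDS_ARRAY_SIZE  # 0 = first 64KB array, 1 = second

def is_far(simd_pair, byte_addr):
    # "near" = the array physically next to the issuing SIMD pair ("CU"),
    # "far"  = the other pair's array, reached over the speculated ring/bus hop.
    return owning_array(byte_addr) != simd_pair

print(is_far(0, 0x0FF00))   # False: stays in the near 64KB array
print(is_far(0, 0x10040))   # True:  crosses over to the far array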


One of the problems with my theory about LDS acting as a "crossbar" for TMU/RA queries and responses is that the TMUs/RAs would then also need to engage in LDS requests, so the WGP needs to be the master of those requests whether they come from ALUs/RFs or from TMUs/RAs.

So a ring bus would solve the problem of queries/responses having to make a stopover within LDS.
I don't think there will ever be a unified hierarchy like this, since they work in drastically different ways. Gather/scatter in LDS works by breaking requests down into 32-bit chunks and routing them to 32-bit banks through a full crossbar; VMEM instead (very likely) iteratively loads a minimal set of 128B cache lines and gathers the results it needs. That is, unless they went the Nvidia way: a giant scratchpad memory that does all the duties of LDS and L0 (eh, unlikely).
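To make the contrast concrete, a rough sketch of the two cost models for one wave32 of 4-byte loads (the VMEM coalescing policy here is my assumption, not documented behaviour):

# Sketch of the two access models contrasted above, for one wave32 of 4-byte loads.
# LDS: full 32-bank crossbar, cost ~ worst bank collision (see the earlier sketch).
# VMEM: cost ~ number of distinct 128B cache lines touched (assumed policy).
CACHE_LINE = 128

def vmem_lines_touched(byte_addrs):
    return len({addr // CACHE_LINE for addr in byte_addrs})

# Contiguous wave32 load: 32 lanes * 4B = 128B -> a single L0 line.
print(vmem_lines_touched([4 * lane for lane in range(32)]))    # 1
# Scattered load, one lane per 256B: every lane pulls its own line.
print(vmem_lines_touched([256 * lane for lane in range(32)]))  # 32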

VMEM and LDS do currently share the request and response buses. But this sounds more like higher-level sharing: at the level of control flow, moving VReg inputs (addresses, offsets, T#, etc.), and writing back already-packed VReg results.
https://www.amd.com/system/files/documents/rdna-whitepaper.pdf

The new cache hierarchy starts with the same SIMD request and response buses used by the LDS; since the buses are optimized for 32-wide data flow, the throughput is twice as high as GCN. Moreover, each dual compute unit contains two buses and each bus connects a pair of SIMDs to an L0 vector cache and texture filtering logic, delivering an aggregate bandwidth that is 4X higher than the prior generation
It is worth noting that bandwidth amplification at all levels was empirically a key design goal of RDNA. Any increased level of sharing seems unlikely... especially this close to the SIMD.
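For what the quoted figures are worth, my reading of the arithmetic (the 64B/clk GCN baseline per CU is my assumption of what "prior generation" refers to):

# Reading of the whitepaper figures quoted above; the 64B/clk GCN baseline per CU
# is my assumption of what "prior generation" refers to.
GCN_CU_L1_BYTES_PER_CLK = 64        # assumed GCN baseline
RDNA_BUS_BYTES_PER_CLK  = 32 * 4    # 32-wide * 32-bit = 128B/clk per bus
BUSES_PER_DUAL_CU       = 2

aggregate = RDNA_BUS_BYTES_PER_CLK * BUSES_PER_DUAL_CU
print(RDNA_BUS_BYTES_PER_CLK / GCN_CU_L1_BYTES_PER_CLK)  # 2.0x per bus
print(aggregate / GCN_CU_L1_BYTES_PER_CLK)               # 4.0x aggregate per dual CU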
 
I don't think there will ever be a unified hierarchy like this, since they work in drastically different ways. Gather/scatter in LDS works by breaking requests down into 32-bit chunks and routing them to 32-bit banks through a full crossbar; VMEM instead (very likely) iteratively loads a minimal set of 128B cache lines and gathers the results it needs. That is, unless they went the Nvidia way: a giant scratchpad memory that does all the duties of LDS and L0 (eh, unlikely).
The NVidia arrangement has been at the back of my mind for months now, thinking about the WGP architecture.

For what it's worth, it's intriguing that this:

Register saving for function calling - Advanced Micro Devices, Inc. (freepatentsonline.com)

refers to LDS as a possible target for register saving. Of course the target could be L0 cache...
 
The NVidia arrangement has been at the back of my mind for months now, thinking about the WGP architecture.

For what it's worth, it's intriguing that this:

Register saving for function calling - Advanced Micro Devices, Inc. (freepatentsonline.com)

refers to LDS as a possible target for register saving. Of course the target could be L0 cache...
I think this is a software patent. Register spilling has always been arranged at compile time into a preallocated scratch space of predetermined size in VRAM. This patent sounds like a technique to use spare LDS capacity (also known at compile time) for register spilling, before resorting to the scratch space.
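A minimal sketch of the compile-time decision that description implies; the sizes and the helper name are made up for illustration:

# Minimal sketch of the compile-time spill-target choice described above.
# Sizes and the workgroup layout are made up for illustration.
LDS_PER_WORKGROUP = 64 * 1024

def pick_spill_target(lds_declared_bytes, waves_per_workgroup, spill_bytes_per_wave):
    """Prefer spare LDS for register saves, fall back to scratch (VRAM)."""
    spare = LDS_PER_WORKGROUP - lds_declared_bytes
    needed = waves_per_workgroup * spill_bytes_per_wave
    return "LDS" if needed <= spare else "scratch"

# Light LDS usage: plenty of spare space, so saves can stay on-chip.
print(pick_spill_target(8 * 1024, waves_per_workgroup=16, spill_bytes_per_wave=512))   # LDS
# Heavy LDS usage (e.g. ray traversal): back to the usual scratch space.
print(pick_spill_target(60 * 1024, waves_per_workgroup=16, spill_bytes_per_wave=512))  # scratch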
 
I'd guess... 4 MCDs and 2 GCDs.

MCD: Each gets a 128MB slice of LLC (and I/F links of course). Made on an older process (N6?), since N7/6->N5 SRAM scaling is tapering.
GCD: Each gets 120 CUs and 128-bit GDDR6 I/O. Probably a die around 350-450 mm^2, the estimate being: (i) stripping Navi 21 of the entire LLC and 50% of the DRAM I/O likely gives a well-over-50% die shrink on 5nm; (ii) then throw in 150% more compute units, aside from CU/WGP improvements.

Adds up to 15360 SIMD lanes, 512 MB of LLC and 32 GB of memory.
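Quick sanity check of the tally (64 lanes per CU assumed; the memory device choice is a guess):

# Quick arithmetic check of the speculated Navi 31 configuration above.
GCDS, CUS_PER_GCD, LANES_PER_CU = 2, 120, 64  # 64 = 2 SIMD32 per CU
MCDS, LLC_PER_MCD_MB = 4, 128
GDDR6_BITS_PER_GCD = 128

print(GCDS * CUS_PER_GCD * LANES_PER_CU)      # 15360 SIMD lanes
print(MCDS * LLC_PER_MCD_MB)                  # 512 MB of LLC
# 2 x 128-bit = 256-bit GDDR6; 32 GB would mean e.g. 16Gb devices in clamshell
# (or 32Gb devices once available) -- the device choice is my assumption.
print(GCDS * GDDR6_BITS_PER_GCD)              # 256-bit aggregate bus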

So... it sounds feasible to me, assuming you are also on the multi-die SoIC/CoW Navi 31 choo choo train, started by a resident rumor mill...
:p

Edit: This patent looks very... hmm... familiar.
 
Thinking about the pin-outs of the GCDs and MCDs:

The MCDs already, in theory, have to be designed to support 180-degree rotational symmetry, assuming that both GCDs in a SKU have the same layout and that there are more than two MCDs required to bridge the GCDs. This would be:
  • GCD 1 zone A connecting to GCD 2 zone D
  • GCD 1 zone B to GCD 2 zone C
  • GCD 1 zone C to GCD 2 zone B
  • GCD 1 zone D to GCD 2 zone A
One reason to make a GPU from chiplets is that you can reduce the count of chiplets in some SKUs.

So instead of a SKU that uses 2x GCDs + 4x MCDs, you'd like to be able to make a SKU from one GCD. But the MCDs that are designed to fit on top of a pair of GCDs need to be "supported" by some kind of "blank die" that isn't a GCD.

An alternative is to design the MCDs so that they can be rotated by 90 degrees and still be pin-compatible with the GCD. So now this would be a SKU with 1x GCD and 2x MCDs, with each MCD connecting to two zones on the GCD.

So instead of a single MCD connecting one GCD cache zone (A) on chiplet 1 to a cache zone (D) on GCD chiplet 2, the MCD would connect cache zones A and B on a single GCD. Then a second MCD would connect cache zone D on chiplet 1 to cache zone C on the same GCD.
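Writing the two placements out as zone pairings (labels as in the list above) makes the pin-out clash easier to see:

# The two placements above, written out as zone pairings (labels as used in this post).
# 2x GCD + 4x MCD: MCDs bridge the two dies with 180-degree rotational symmetry.
ROT180_LINKS = [
    (("GCD1", "A"), ("GCD2", "D")),
    (("GCD1", "B"), ("GCD2", "C")),
    (("GCD1", "C"), ("GCD2", "B")),
    (("GCD1", "D"), ("GCD2", "A")),
]
# 1x GCD + 2x MCD: each MCD rotated 90 degrees, spanning two zones on the same die.
ROT90_LINKS = [
    (("GCD1", "A"), ("GCD1", "B")),
    (("GCD1", "D"), ("GCD1", "C")),
]
# A given GCD zone faces "across" to the other die in the first layout but faces its
# neighbour on the same die in the second, hence the two pin sets (or multi-functional
# pins) discussed next.
for a, b in ROT180_LINKS + ROT90_LINKS:
    print("%s.%s <-> %s.%s" % (a + b))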

As far as I can see, though, it's not possible to make MCDs that are compatible with both 180-degree rotation and 90-degree rotation, unless there are two sets of connecting pins in each MCD connection zone on the GCDs: one set for 180-degree configurations and the other for 90-degree configurations.

You could argue that the connecting pins could be made multi-functional to solve this problem, but I would expect that data/addressing pins can't be mixed with power/ground pins. Maybe it's possible to come up with a layout which makes data/addressing pins multi-functional within their set and power/ground pins multi-functional within theirs. The latter set should consist of fewer pins, anyway...

With these multi-functional sub-zones, 180 degree and 90 degree rotations could be possible, while using all pins and not suffering from wasted pin capacity.

As the number of MCDs increases beyond 4, I suppose this gets more complex. But maybe sub-zone multi-functionality would still work...
 
I think this is a software patent. Register spilling has always been arranged at compile time into a preallocated scratch space of predetermined size in VRAM. This patent sounds like a technique to use spare LDS capacity (also known at compile time) for register spilling, before resorting to the scratch space.
Agreed, it's purely a compiler technique.

Specifically with regard to ray tracing, using spare LDS capacity for function calls appears problematic, simply because apparently LDS gets heavily used during ray traversal. There shouldn't be much "spare" space in LDS.

So, either LDS is getting much bigger in RDNA at some point (3?), or AMD is trimming down the LDS usage.

It may be that AMD is already doing this. I haven't looked at the ISA code in any detail to find out. It's technically possible right now, I'd say!
 
So instead of a SKU that uses 2x GCDs + 4x MCDs, you'd like to be able to make a SKU from one GCD. But the MCDs that are designed to fit on top of a pair of GCDs need to be "supported" by some kind of "blank die" that isn't a GCD.
It is a design-goal question whether it is worth making everything a reusable part in one go. They could instead make smaller GCDs while maintaining a dual-GCD setup that reuses MCDs. While that does not let them cut the number of mask sets for the compute dies, it does allow them to break the reticle limit while presumably achieving better yields with individually smaller dies. Sounds like a decent step already.
 
It looks like AMD is bringing the hammer with Navi 31. They had better beat AD102 easily; otherwise it will be a shame...
The only way it will get positive reviews is if ray tracing performance is on par.

I'm fairly doubtful AMD will improve ray tracing performance enough. It'll probably be a repeat of the tessellation "pain" which took most of a decade to get sorted.
 
The only way it will get positive reviews is if ray tracing performance is on par.

I'm fairly doubtful AMD will improve ray tracing performance enough. It'll probably be a repeat of the tessellation "pain" which took most of a decade to get sorted.
Totally agree. At this performance level and this market positioning, nobody will care whether the halo SKU does 140 or 170 fps in pure raster at 4K in a modern AAA game (probably CPU-limited anyway). But if the gap with RT on is 50 fps, in other words 50 vs 100 fps, then it will be a deal breaker for the loser...
 
The only way it will get positive reviews is if ray tracing performance is on par.
I'd say it should be faster by as much as it is in rasterization. Otherwise you're looking at a 7-chiplet, likely very expensive design being merely "on par" with a single-chip AD102 solution, which is not a very good outcome for the former.
 
I'd say it should be faster by as much as it is in rasterization. Otherwise you're looking at a 7-chiplet, likely very expensive design being merely "on par" with a single-chip AD102 solution, which is not a very good outcome for the former.
While packaging chiplets is more expensive than a monolith, the monolith suffers from exponentially worse yields.
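A rough Poisson-yield illustration of that trade-off (the defect density and die areas are made-up numbers, purely to show the shape of the curve):

# Rough Poisson yield model illustrating the monolith vs. chiplet trade-off.
# Defect density and die areas are made-up numbers, purely for illustration.
import math

DEFECTS_PER_CM2 = 0.1

def die_yield(area_mm2, d0=DEFECTS_PER_CM2):
    return math.exp(-d0 * area_mm2 / 100.0)  # Poisson model: Y = exp(-D0 * A)

monolith = die_yield(600)                    # one big ~600 mm^2 die
chiplet  = die_yield(200)                    # one ~200 mm^2 GCD
print(f"{monolith:.2f}")                     # ~0.55
print(f"{chiplet:.2f}")                      # ~0.82 -> per-die yield falls off
                                             # exponentially with area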
 
I'd say it should be faster by as much as it is in rasterization. Otherwise you're looking at a 7-chiplet, likely very expensive design being merely "on par" with a single-chip AD102 solution, which is not a very good outcome for the former.
*cough* 47 tiles PVC vs monolithic H100 *cough*
 