AMD: RDNA 3 Speculation, Rumours and Discussion

Really? You may have seen these in another thread. Or are you thinking these (first two) are already in Ampere?
https://www.freepatentsonline.com/11295508.html
https://www.freepatentsonline.com/11282261.html
https://www.freepatentsonline.com/y2022/0027194.html

Oh right, we did talk about the multi-box patent. That one is interesting because it implies Nvidia is encoding 12 BVH nodes into a single cache line, and that the RT core can do 12 box tests per cycle, which is pretty impressive.
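Quick sanity check on the 12-wide idea: raw fp32 AABBs don't fit in a cache line, so the patent presumably relies on quantized child bounds. A toy byte budget, with field sizes that are my guesses rather than anything from the patent:

```python
# Byte budget for packing 12 BVH child nodes into one 128 B cache line.
# All field sizes below are illustrative assumptions, not from the patent.

CACHE_LINE = 128  # bytes

# Uncompressed: 6 fp32 values (min/max xyz) per child AABB.
uncompressed = 12 * 6 * 4
print(uncompressed)  # 288 B -> can't fit, so compression is implied

# Quantized scheme (hypothetical): shared fp32 origin + per-axis scale
# exponents, child bounds stored as 8-bit offsets relative to the parent box.
header = 3 * 4 + 3 + 1   # fp32 origin, 3x 8-bit scale exponents, child count
bounds = 12 * 6 * 1      # 6 quantized bytes per child AABB
children = 12 * 3        # 3-byte child index/offset per child (assumed)
total = header + bounds + children
print(total, total <= CACHE_LINE)  # 124 True -> a plausible fit
```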

The sharding patent is reminiscent of Volta's independent thread scheduling. Did that not already make it over to Turing and Ampere?
 
So the RA isn't just re-using the TMU L1 memory pipeline. It's also sharing silicon between the box and triangle intersection logic. Very elegant and area efficient but one triangle per clock probably isn't going to cut it for RDNA 3.
This part of the pipeline has variable latency, and results are written back asynchronously. Not saying RDNA 3 will or will not do it, but intersection performance can still be scaled vertically as long as the memory hierarchy can sustain it. Evidently the TMU in RDNA 1 doubled the FP16 texture filtering rate (4x GCN 1, if my memory serves me right) while all macro ratios (lane:TMU) stayed the same.
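For a sense of scale, AMD quotes RDNA 2's RA at up to 4 ray-box or 1 ray-triangle test per clock, so scaling "vertically" is just multiplying that per-RA rate (game clock here is an illustrative assumption):

```python
# Aggregate intersection rates for a Navi 21-class part, assuming AMD's
# stated 4 ray-box or 1 ray-triangle test per RA per clock.
cus, clock_ghz = 80, 2.25        # one RA per CU; ~game clock (assumed)
box_rate = cus * 4 * clock_ghz   # G box tests/s
tri_rate = cus * 1 * clock_ghz   # G triangle tests/s
print(box_rate, tri_rate)        # 720.0 180.0

# Doubling the per-RA triangle rate (a hypothetical RDNA 3 change) only
# helps if the cache hierarchy can also feed twice the node fetches.
```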
 
Four SIMDs sharing a single TMU/RA implies AMD thinks ray acceleration is fast enough and ray tracing performance is now down solely to traversal and hit/miss shader throughput.

Well, that assumes RA hasn't been changed and that traversal is still in software...
One can also argue that if they are happy to throw xtors at 2x scalar+vector throughput, they might also be inclined to throw xtors at hardware traversal. There is no apparent causation, so extrapolation works in both directions. :p
 
Yes, independent thread scheduling is the warp execution model for Volta and its successor generations.
Subwarp Interleaving (3rd link) is the advanced version of it. I recommend that you read the research paper.

Seems unlikely that this has made it into hardware. The paper concludes that today's workloads won't benefit much even when running RT. Reducing latency (bigger caches) or increasing parallelism (bigger register files) may be a simpler solution to the latency problem.
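The register-file point is basically Little's law: the concurrency needed to hide a latency is latency times issue rate. A toy calculation with assumed numbers:

```python
# Little's law view of latency hiding on a SIMD: concurrency needed =
# latency * throughput. Illustrative numbers, not measured values.
mem_latency = 400     # cycles for a miss to far cache / memory (assumed)
issue_interval = 4    # cycles between dependent issues per wave (assumed)
waves_needed = mem_latency / issue_interval
print(waves_needed)   # 100.0 -> far more waves than a SIMD can hold, hence
# either cut latency (bigger caches) or hold more waves (bigger reg files).
```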
 
One can also argue that if they are happy to throw xtors at 2x scalar+vector throughput, they might also be inclined to throw xtors at hardware traversal. There is no apparent causation, so extrapolation works in both directions. :p
Putting four SIMDs into a single CU with two CUs in a WGP possibly stretches the crossbars (or ring bus, as you've suggested) and theoretically requires a beefier LDS, along with more L0 cache. It all multiplies.

A pay-back for more SIMDs inside a CU (or WGP) is that more extensive scheduling hardware is amortised across more compute.
 
Putting four SIMDs into a single CU with two CUs in a WGP possibly stretches the crossbars (or ring bus, as you've suggested) and theoretically requires a beefier LDS, along with more L0 cache. It all multiplies.

A pay-back for more SIMDs inside a CU (or WGP) is that more extensive scheduling hardware is amortised across more compute.
They could also double the LDS size without doubling the bandwidth (so 4 SIMDs sharing the existing 2 128B/cycle datapaths). This can at least enable more concurrent workgroups on a CU/WGP.
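Rough illustration of that trade-off, using the RDNA whitepaper's bank geometry and a hypothetical doubling of bank depth (the 32 KB/workgroup figure is just an example allocation):

```python
# Effect of doubling LDS capacity (deepening banks from 512 to 1024
# entries) without adding datapaths. Bank geometry per the RDNA
# whitepaper; the doubling itself is the hypothetical being discussed.
arrays, banks, width_B = 2, 32, 4

for entries in (512, 1024):
    lds_kb = arrays * banks * entries * width_B // 1024
    bw = arrays * banks * width_B          # B/clk, unchanged: 2 x 128 B
    groups = lds_kb * 1024 // (32 * 1024)  # workgroups at 32 KB LDS each
    print(lds_kb, "KB,", bw, "B/clk,", groups, "concurrent workgroups")
# 128 KB, 256 B/clk, 4 ... -> 256 KB, 256 B/clk, 8 ...
```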
 
Expecting both N31 and AD102 to be less than 2x the performance of the current top-end cards. Hype is getting beyond ridiculous at this point.
I only see >2x being possible with absurd power draw or some type of multi-GPU situation akin to CrossFire.
Navi 31 just got another upgrade: a 3 GHz clock for a mind-boggling 92 TF.

https://videocardz.com/newz/amd-nav...-fp32-performance-four-times-more-than-navi21
There is just no way it will be anywhere near 90 TFLOPS in the same sense that a 6900 XT is 23.
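For reference, both numbers fall out of the same naive formula (SPs x 2 FLOPs per FMA x clock), using the rumored 15360 SPs at 3 GHz:

```python
# FP32 TFLOPS = stream processors * 2 (FMA) * clock. Same formula for
# both parts; whether the new SPs sustain anything like this per-clock
# rate in real shaders is the open question.
def tflops(sps, ghz):
    return sps * 2 * ghz / 1000

print(tflops(5120, 2.25))  # 23.04 -> 6900 XT at ~game clock
print(tflops(15360, 3.0))  # 92.16 -> rumored Navi 31 at 3 GHz
```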
 
They could also double the LDS size without doubling the bandwidth (so 4 SIMDs sharing the existing 2 128B/cycle datapaths). This can at least enable more concurrent workgroups on a CU/WGP.
Yes, I was referring to a beefier LDS. But it sounds like you might be describing a pair of LDSs, each the size of an RDNA 2 LDS, one LDS being private to each CU.
 
Yes, I was referring to a beefier LDS. But it sounds like you might be describing a pair of LDSs, each the size of an RDNA 2 LDS, one LDS being private to each CU.
Not private, or else the WGP will stop being a WGP. I meant the possibility of keeping the current datapath arrangement (2 arrays of 32 32-bit banks; one shared request/response bus per CU; i.e. 4 SIMDs sharing, up from 2 today) while increasing the depth of individual banks (1024 entries, up from 512).

I might be wrong about my “near-far” read though. My second thought is that RDNA (2) could be a two-level setup, where each CU gets its own LDS “front end”, independent of the two LDS bank arrays.

The “front end” handles request sequencing and result buffering, along with a 32-lane crossbar. Each request would be broken down by array into one or more conflict-free sub-requests (each bound to a specific array), with addresses and data sorted in bank order before being sent out to the array. This way an actual LDS bank array can have very slim control and datapaths around the banks, while only a simple(r) 2x2 crossbar is needed for both CUs to have uniform access to both arrays.

So if it were to be scaled up to 8 SIMDs, it could be extended to a 4x4 crossbar (moving 32x4B = 128B lines), with 4 such “front ends” and 4 bank arrays.

This still might not resemble the actual thing, but it is probably closer to the truth than a naive 64x64 (or 128x128 for 8 SIMDs) monolithic bank-level xbar. :p
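A toy software model of what that front end would do, with an assumed address-to-array interleaving (the real mapping is unknown):

```python
# Toy model of the hypothesised LDS "front end": split one 32-lane request
# into conflict-free sub-requests, each bound to one bank array and sorted
# in bank order, so the array itself needs no conflict handling.
BANKS, ARRAYS = 32, 2

def split_request(addrs):  # addrs: one byte address per lane
    passes = []
    pending = list(enumerate(addrs))
    while pending:
        taken, leftover = {}, []
        for lane, a in pending:
            dword = a // 4
            key = ((dword // BANKS) % ARRAYS, dword % BANKS)  # (array, bank), assumed mapping
            if key in taken:
                leftover.append((lane, a))  # bank conflict: defer to a later pass
            else:
                taken[key] = (lane, a)
        for arr in range(ARRAYS):  # emit per-array sub-requests, lanes in bank order
            sub = sorted((bank, v) for (a2, bank), v in taken.items() if a2 == arr)
            if sub:
                passes.append((arr, sub))
        pending = leftover
    return passes

print(len(split_request([lane * 4 for lane in range(32)])))    # 1: unit stride, conflict-free
print(len(split_request([lane * 256 for lane in range(32)])))  # 32: all lanes on one bank, fully serialised
```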
 
Not private, or else the WGP will stop being a WGP.
Which is why I shake my head at the continued existence of the WGP + CU combination... Sharding texture L0, TMU and RA within a WGP seems problematic.

I meant the possibility of keeping the current datapath arrangement (2 arrays of 32 32-bit banks; one shared request/response bus per CU; i.e. 4 SIMDs sharing, up from 2 today) while increasing the depth of individual banks (1024 entries, up from 512).

I might be wrong about my “near-far” read though. My second thought is that RDNA (2) could be a two-level setup, where each CU gets its own LDS “front end”, independent of the two LDS bank arrays.

The “front end” handles request sequencing and result buffering, along with a 32-lane crossbar. Each request would be broken down by array into one or more conflict-free sub-requests (each bound to a specific array), with addresses and data sorted in bank order before being sent out to the array. This way an actual LDS bank array can have very slim control and datapaths around the banks, while only a simple(r) 2x2 crossbar is needed for both CUs to have uniform access to both arrays.

So if it were to be scaled up to 8 SIMDs, it could be extended to a 4x4 crossbar (moving 32x4B = 128B lines), with 4 such “front ends” and 4 bank arrays.
Certainly as the count of client SIMDs for LDS increases, this concept of "bank-alignment coalescing" becomes more attractive. Once the GPU is past simple 1:1 mappings for LDS banks, the variable latencies involved make this coalescing more productive, so more clients = more win.
 
https://videocardz.com/newz/amd-rdn...ns-navi-31-with-up-to-12288-stream-processors

The Navi 31, with 6 Shader Engines, 12 Shader Arrays and 48 Work Group Processors, would ship with up to 12288 Stream Processors; that's a reduction of 20% in core count compared to the previously rumored 15360 cores. The same applies to Navi 32, which instead of 10240 cores would ship with 8192 Stream Processors. For Navi 33 this means 4096 cores instead of 5120.

Do people really believe the flagship will have 3 GCDs in the first attempt? It’s much more likely to be 2. AMD likes powers of 2.

Also, there’s no way Navi 33 with 4096 processors is 440 mm².
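For what it's worth, the rumored hierarchy only multiplies out to 12288 if you assume 128 SPs per CU, i.e. the 4x SIMD32 CU discussed earlier in the thread:

```python
# The rumored Navi 31 hierarchy multiplied out. The 128 SPs/CU figure
# (i.e. 4 SIMD32s per CU) is the assumption that makes it land on 12288.
shader_engines = 6
shader_arrays = 2 * shader_engines  # 2 per SE
wgps = 48                           # 4 per shader array
cus = wgps * 2
sps = cus * 128                     # assumes 4x SIMD32 per CU
print(shader_arrays, cus, sps)      # 12 96 12288
```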
 
Explain your train of thought. Navi 21 with 5120 units is 520 mm², and N6 is only a tiny shrink over N7.
I’m no trinibwoy, but in comparison to Navi 21 on 7nm, the number of CUs has shrunk by 20%, the memory I/O is halved, and the cache may or may not be halved too (I hope to God not, given the memory interface). Since TSMC additionally claims that the 6nm tweak offers 18% higher logic density, the potential for a smaller die than 440 mm² seems to be there. Of course, since we don’t know exactly what RDNA 3 adds that may require additional gates, or to what extent, it really is anybody’s guess at this point.
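Putting made-up numbers on that: with guessed area fractions for Navi 21 (not from any die-shot analysis), the cuts do pencil out well under 440 mm²:

```python
# Very rough Navi 33 area sketch from Navi 21's ~520 mm^2. The
# shader/IO/cache/other split is invented for illustration, not measured.
n21 = 520.0
shader, io, cache, other = 0.45 * n21, 0.15 * n21, 0.25 * n21, 0.15 * n21

shader *= 0.8    # 20% fewer CUs
shader /= 1.18   # TSMC's claimed N6 logic density gain (logic only)
io *= 0.5        # halved memory interface
cache *= 0.5     # if Infinity Cache is halved too (open question)
print(round(shader + io + cache + other))  # ~341 mm^2, before whatever RDNA 3 adds
```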
 
Navi 33 isn't 5nm? I read the 440 mm² in the videocardz table as the size of the chip, not the package.
It's on N6, pretty cheap/reasonable for that die size.
I’m no trinibwoy, but in comparison to Navi 21 on 7nm, the number of CUs has shrunk by 20%, the memory I/O is halved, and the cache may or may not be halved too (I hope to God not, given the memory interface). Since TSMC additionally claims that the 6nm tweak offers 18% higher logic density, the potential for a smaller die than 440 mm² seems to be there. Of course, since we don’t know exactly what RDNA 3 adds that may require additional gates, or to what extent, it really is anybody’s guess at this point.
Well yes, all those cuts, and then you add back in the extra RDNA 3 increases. It sounds very reasonable to me.
 