AMD: RDNA 3 Speculation, Rumours and Discussion

Also, this might be contested, but they could save a noticeable amount of die space by not including tensor cores in their hardware design, since there's only one end-user-oriented application (DLSS) for this specialized hardware.

Nah. That'd be very, very dumb.

First, DLSS is extremely good and loved by many. So the tensor cores are already justified by that feature alone, as DLSS boosts performance far more than additional shader cores could.

Second, the future is all about ML. We will see more and more games using neural networks for all kinds of stuff, I am very certain of this, and then the tensor cores will come in handy, as they can run these ML models much faster and more efficiently than shader cores.

Third, there are many use cases aside from gaming like OptiX denoising in Blender or certain accelerated AI effects when using video editors.

So yeah, matrix accelerators are going to stay and they will be even more important in next gen games. I assume RDNA3 or RDNA4 will incorporate matrix accelerators as well.
 
I think it's a very safe bet that RDNA 3 will still be substantially slower in RT than Lovelace.



AMD could add tensor cores to their GPU without much issue. The real work is on the software side. Even if RDNA 2 had the cores there would be no DLSS equivalent.

On RT it may well be, but we don't know for sure; that was my point. On DLSS I did not comment because I agree with you.
 
Only if you cherry-pick the view instancing level. In other DirectX 12 Ultimate features RDNA 2 is more advanced than Ampere (RDNA 2 supports 44-bit virtual addressing, Ampere 40-bit; RDNA 2 supports Sampler Feedback tier 1.0, Ampere just 0.9; Ampere doesn't support stencil reference value from the pixel shader at all, RDNA 2 does).

Higher support level for view instancing doesn't add any new functionality ...

Nah. That'd be very, very dumb.

First, DLSS is extremely good and loved by many. So the tensor cores are already justified by that feature alone, as DLSS boosts performance far more than additional shader cores could.

Second, the future is all about ML. We will see more and more games using neural networks for all kinds of stuff, I am very certain of this, and then the tensor cores will come in handy, as they can run these ML models much faster and more efficiently than shader cores.

Well, we're arguably living in that future now, and the only end-user application for tensor cores is DLSS. Chances are that with the introduction of new hardware, the only end-user application will still be DLSS, while additional features bloat the hardware design even further ...

Third, there are many use cases aside from gaming like OptiX denoising in Blender or certain accelerated AI effects when using video editors.

Pretty sure you can do these things offline without specialized hardware. Blender? Video editing? I don't think most users intend to run those applications in real time. Exactly how much time do you expect professionals to save with special hardware that will still take minutes or even hours to run with these applications?
 
That doesn't change the fact that Nvidia used a 256-bit bus for the same target performance as AMD's 512-bit.
Currently AMD uses a 256-bit bus for the same performance target as Nvidia's 320-384-bit, so the difference is way smaller.
It's obviously not the same performance target, as was discussed above already. AMD should have used a 384-bit bus on N21.
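Just to put rough numbers on that, here's a back-of-the-envelope comparison; the per-pin rates are the commonly quoted retail configurations, so treat the results as illustrative rather than exact:

```python
# Rough raw-bandwidth comparison; per-pin rates are the commonly quoted retail specs.
def bandwidth_gb_s(bus_width_bits: int, per_pin_gbps: float) -> float:
    """Raw DRAM bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_width_bits * per_pin_gbps / 8  # bits per transfer -> bytes per second

print(bandwidth_gb_s(256, 16.0))   # Navi 21, GDDR6 @ 16 Gbps   -> 512.0 GB/s
print(bandwidth_gb_s(320, 19.0))   # GA102 (RTX 3080), GDDR6X   -> 760.0 GB/s
print(bandwidth_gb_s(384, 19.5))   # GA102 (RTX 3090), GDDR6X   -> 936.0 GB/s
```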
 
AMD could add tensor cores to their GPU without much issue. The real work is on the software side. Even if RDNA 2 had the cores there would be no DLSS equivalent.

Agreed. AMD could and probably will add hardware acceleration for ML/AI, and probably also ray tracing hardware acceleration. About the software, I wouldn't be so sure that AMD couldn't pull it off; they probably have something in the works for RDNA3+. At least I hope so. Competition is good.

Only if you cherry-pick the view instancing level. In other DirectX 12 Ultimate features RDNA 2 is more advanced than Ampere (RDNA 2 supports 44-bit virtual addressing, Ampere 40-bit; RDNA 2 supports Sampler Feedback tier 1.0, Ampere just 0.9; Ampere doesn't support stencil reference value from the pixel shader at all, RDNA 2 does).

That's not the whole picture though. From what I gathered, Ampere has the more advanced feature set overall (besides RT and TC acceleration hardware).

It doesn't make the product more bandwidth bottlenecked, but less bandwidth bottlenecked.

That's not what I meant. When you push high resolutions, the Infinity Cache might become the bottleneck, since RDNA 2 without the IC isn't a bandwidth monster on its own. NV's Ampere lineup goes all the way up to ~900 GB/s. So I meant higher resolutions, where the 128 MB IC isn't enough.
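A crude way to see why the hit rate matters at higher resolutions is to blend cache and DRAM bandwidth by the hit rate. Everything below is an illustrative assumption (the hit rates are only roughly in line with AMD's published 128 MB Infinity Cache figures), not a measurement:

```python
# Sketch: effective bandwidth as the Infinity Cache hit rate drops with resolution.
# Cache bandwidth, DRAM bandwidth and hit rates are illustrative assumptions.
def effective_bw(hit_rate: float, cache_bw: float, dram_bw: float) -> float:
    """Hit-rate-weighted blend of cache and DRAM bandwidth, in GB/s."""
    return hit_rate * cache_bw + (1.0 - hit_rate) * dram_bw

CACHE_BW = 1660.0  # assumed Infinity Cache bandwidth, GB/s
DRAM_BW = 512.0    # 256-bit GDDR6 @ 16 Gbps, GB/s

for res, hit in [("1080p", 0.80), ("1440p", 0.70), ("2160p", 0.58)]:
    print(f"{res}: ~{effective_bw(hit, CACHE_BW, DRAM_BW):.0f} GB/s effective")
```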
 
Anyway, when Nvidia increased LLC size in Maxwell by up to eightfold (from 256 kB of GK107 to 2048 kB of GM107), it was praised by all the reviews as a great bandwidth- and power-saving feature. Now I see all of the reviews were wrong and the big cache in fact limited Maxwell greatly!

Back when AMD had more raw FPU and bandwidth resources than comparable Nvidia cards in each segment, it was actually pointed out that AMD's cards would likely scale better as resolution moved up and games became more compute/shader heavy. That's the same narrative as now, just flipped around. In hindsight this did occur to some extent (albeit we can debate the full set of reasons).

Reviewers likely highlighted GM107's cache because it was part of Nvidia's marketing and reviewer guides, similarly to how AMD is heavily marketing and branding cache in both its CPU and GPU products. It's basically technical marketing. I remember that at the time of Maxwell and Tonga, both Nvidia and AMD were also pushing things such as texture compression and the concept of effective bandwidth to combat the perception of the drop in raw bandwidth.

Interestingly, GM2xx products actually had lower cache ratios than GM107 did; I believe GM204 had the same 2 MB of L2 as GM107 despite otherwise being much bigger. In hindsight I wonder if part of the reason was that Nvidia wanted to keep the other improvements they put into Maxwell more secret. Remember, it came out after the fact, I believe through third parties, that they implemented some novel changes at the time that they never publicized as part of Maxwell's marketing, e.g. https://forum.beyond3d.com/threads/tile-based-rasterization-in-nvidia-gpus.58296/
 
Maybe they don't look too much at power consumption numbers, but, imho, they care about noise and heat. And about not blowing up their PSU :eek:
I've experienced a situation where my powersupply was insufficient for the video card, and it was never a "blowing up" situation. Rather, it was a strange instability that I couldn't pin down for probably a week until it suddenly clicked in my head. I'd run the MSI Kombustor fuzzy donut thing to test my GPU overclocks, and then would run the modern version of Intel Burn Test to stress the CPU overclocks, and after two weeks of finding my perfect stable point I began playing games. And then on rare occasions, the machine would just crash to desktop or even just reboot.

Turns out my 1 kW power supply, which was a quality unit seven years ago when I bought it, just wasn't up to the task any more. No sparks, no overt noises, just strange system stability issues.
 
I've experienced a situation where my powersupply was insufficient for the video card, and it was never a "blowing up" situation. Rather, it was a strange instability that I couldn't pin down for probably a week until it suddenly clicked in my head. I'd run the MSI Kombustor fuzzy donut thing to test my GPU overclocks, and then would run the modern version of Intel Burn Test to stress the CPU overclocks, and after two weeks of finding my perfect stable point I began playing games. And then on rare occasions, the machine would just crash to desktop or even just reboot.

Turns out my 1 kW power supply, which was a quality unit seven years ago when I bought it, just wasn't up to the task any more. No sparks, no overt noises, just strange system stability issues.

Always, always invest in a quality PSU. I've been running Seasonics for the last 15 or so years, and FSP Group units before that. My current one seems to last forever; it doesn't want to give up.
 
So, I'm going to hypothesise that the SIMDs in each WGP are no longer operating in pairs, with each pair sharing L1, LDS, texturing and ray acceleration hardware as seen in RDNA 2.

So, per SIMD, the theoretical ray-triangle test rate would be doubled. Coupled with a 3x increase in SIMDs across the entire GPU, this would imply a limit of 6x faster ray-triangle testing than seen in Navi 21. Is that enough to be competitive?
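A trivial sketch of that upper-bound arithmetic; both factors come from the rumour plus my assumption above, and it ignores clocks and real-world utilisation:

```python
# Upper-bound scaling of peak ray-triangle test rate vs. Navi 21 (hypothetical factors).
per_simd_gain = 2    # assumption: RA hardware no longer shared between two SIMDs
simd_count_gain = 3  # rumoured ~3x SIMD count over Navi 21
print(per_simd_gain * simd_count_gain, "x peak ray-triangle rate vs. Navi 21")  # -> 6 x
```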
 
So, I'm going to hypothesise that the SIMDs in each WGP are no longer operating in pairs, with each pair sharing L1, LDS, texturing and ray acceleration hardware as seen in RDNA 2.

So, per SIMD, the theoretical ray-triangle test rate would be doubled. Coupled with a 3x increase in SIMDs across the entire GPU, this would imply a limit of 6x faster ray-triangle testing than seen in Navi 21. Is that enough to be competitive?

That would be a nice improvement but what would be the benefit of the WGP structure in that case? Might as well stick to CUs with dedicated resources like RDNA1.
 
So, I'm going to hypothesise that the SIMDs in each WGP are no longer operating in pairs, with each pair sharing L1, LDS, texturing and ray acceleration hardware as seen in RDNA 2.

So, per SIMD, the theoretical ray-triangle test rate would be doubled. Coupled with a 3x increase in SIMDs across the entire GPU, this would imply a limit of 6x faster ray-triangle testing than seen in Navi 21. Is that enough to be competitive?

I was under the impression the bottleneck was more traversing the BVH using shaders (i.e. a highly branching workload on SIMD) rather than the intersection rate being insufficient.
 
That would be a nice improvement but what would be the benefit of the WGP structure in that case? Might as well stick to CUs with dedicated resources like RDNA1.
A workgroup is the unit of work at the dispatcher level. The point of the WGP appears to be making more parallelism (thread-level specifically) available to a single workgroup, which RDNA doubled over GCN (alongside the ILP improvements) when LDS WGP mode is used.

But even if fewer resources were shared between CU pairs, LDS would continue to be shared, as it is workgroup-scoped. There is also hardware to manage workgroup barriers.
 
What I'm trying to think about (out loud :p) is how AMD would achieve more finely-grained scheduling in the seemingly new WGP which does not contain CUs and at the same time make ray acceleration work better.

That would be a nice improvement but what would be the benefit of the WGP structure in that case? Might as well stick to CUs with dedicated resources like RDNA1.
With AMD seemingly moving away from 64-work-item hardware threads to 32, the gains in granularity while scheduling dynamic branching during ray tracing (traversal and hit shaders) will not be achieved if RA throughput is too low.

So my interpretation/guess regarding the rumours is specifically that there is no contention across VALU SIMDs for the TMU/RA. Currently the TMU/RA hardware is shared by two SIMDs. That introduces a level of contention which can't be handled exclusively at the WGP level and can't be handled exclusively at the SIMD level.

In truth, there's always going to be a similar intermediate contention level, if LDS is common to all SIMDs within a WGP.

TMUs and RAs are both "compute-limited", they each feature a pipeline that implements instructions of some type to achieve their results.

LDS is wiring/banking limited, where memory banks and read/write ports have to support multiple paths to registers/VALUs. There's varying amounts of latency which SIMD-level scheduling already has to account for.

So it would seem that while it makes sense to grow LDS and make it dual-mode (shared or private per SIMD), to achieve greater ray tracing throughput AMD has no choice but to implement more RAs. It appears that at least some of the wiring if not the logic (e.g. for addressing in local cache for fetches) is common to both TMUs and RAs, so when AMD implements more RAs, more TMUs come along too.

Rumours seem to suggest that we're looking at 30 WGPs per graphics chiplet, with eight SIMD-32s per WGP. That's a lot of sharing/wiring/banking/variable-latencies for LDS. That makes me feel queasy about the practicalities. Also, with a limit of 128 work items in D3D on PC, supporting 256 across 8 SIMDs seems pointless, adding to my queasiness.
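For reference, the lane counts implied by those rumoured figures (all of them unconfirmed):

```python
# Lane counts implied by the rumour above (unconfirmed figures).
WGPS_PER_CHIPLET = 30
SIMDS_PER_WGP = 8
LANES_PER_SIMD = 32

print(SIMDS_PER_WGP * LANES_PER_SIMD)                     # 256 lanes per WGP
print(WGPS_PER_CHIPLET * SIMDS_PER_WGP * LANES_PER_SIMD)  # 7680 lanes per graphics chiplet
# A 128-work-item workgroup would occupy at most half of one WGP's lanes.
```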

To reduce my queasiness, while RA throughput is being boosted within a WGP, it could all be shared by all of the SIMDs, similarly to how LDS is shared. This is a different theory, where I'm contemplating the use of LDS during ray traversal. Essentially the shaders that perform ray traversal make heavy use of LDS to track the state of each ray, since some of this state is shared by all rays. So, instead of making TMUs and RAs private per SIMD, it might be preferable to make LDS a "porthole" for the operation of the TMU/RA hardware.

With LDS made a porthole like this (and despite it seeming to be a bottleneck) it would provide SIMDs with a more responsive amount of RA throughput. In simple terms if one SIMD issues a ray query, and the other 7 SIMDs are doing other work (traversal or hit evaluation), the SIMD gains 8x the RA throughput it would otherwise have had if it was using a private RA.

So in this theory LDS becomes more central to the scheduling of work inside the WGP. Its latencies will probably be worse than seen in RDNA 2, and it is currently a bottleneck due to its size in RDNA 2 ray tracing ("barely holds the data required"). So RDNA 3 LDS would need to be much bigger to take on more functionality during ray acceleration scheduling and also cope with more work items in flight.

So, I'm torn. Fine-grained private-per-SIMD TMU/RA is sort of the trend we've seen with AMD (currently private per CU). But a group of TMUs/RAs shared by all of a WGP is attractive because it can soak up bursts of work better.

NVidia seems to be doing broad sharing of TMUs and ray traversal, so it would seem more likely that AMD would go the latter way, with everything shared by all SIMDs in a WGP. At which point, maybe the count of TMUs/RAs per SIMD does not need to increase, as their utilisation would be higher...

I was under the impression the bottleneck was more traversing the BVH using shaders (i.e. a highly branching workload on SIMD) rather than the intersection rate being insufficient.
The highly-branching workload associated with ray tracing is not merely during ray traversal, it's also in hit shading. (Shadows are the exception, since there's not really any shading to perform with shadows). So to disentangle BVH traversal, ray-triangle intersection and hit shading is quite hard.

In the end, NVidia doubled intersection-testing throughput in Ampere versus Turing, so it would appear it's an element of performance that AMD might want to attack. I'm not saying that by doubling RA throughput per SIMD in RDNA 3, AMD will entirely solve the performance deficit it has. I'm simply thinking about the internal structure of the WGP and whether the rumours hint at a change in RA throughput per SIMD.

It may be that average RA throughput in RDNA 2 really isn't the bottleneck. Instead, performance problems associated with the RAs could be a problem of burstiness, where queue lengths grow and stalls propagate. In that situation, sharing RAs across more SIMDs would be the solution.
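To illustrate that last point, here's a toy discrete-time queueing sketch of private versus pooled RAs within a WGP. The burst probability, burst size and service model are purely my assumptions for illustration, not a model of real RDNA hardware:

```python
# Toy discrete-time simulation of private vs. pooled ray accelerators (RAs) in a WGP.
# All parameters are illustrative assumptions, not RDNA hardware figures.
import random

N_SIMDS = 8      # SIMD-32s per WGP (per the rumour discussed above)
RA_RATE = 1      # ray queries one RA can retire per cycle (assumption)
CYCLES = 10_000
BURST_P = 0.02   # probability a SIMD emits a burst in a given cycle (assumption)
BURST_LEN = 32   # ray queries per burst, e.g. one wave32 issuing queries (assumption)

def simulate(shared: bool, seed: int = 0) -> float:
    random.seed(seed)
    # One common queue when pooled; one queue per SIMD when private.
    queues = [0] * (1 if shared else N_SIMDS)
    backlog = 0
    for _ in range(CYCLES):
        # Arrivals: each SIMD independently emits a burst of ray queries.
        for s in range(N_SIMDS):
            if random.random() < BURST_P:
                queues[0 if shared else s] += BURST_LEN
        # Service: N_SIMDS RAs in total, either pooled or pinned one-per-SIMD.
        if shared:
            queues[0] = max(0, queues[0] - N_SIMDS * RA_RATE)
        else:
            queues = [max(0, q - RA_RATE) for q in queues]
        backlog += sum(queues)
    return backlog / CYCLES  # average outstanding ray queries

print("private RAs, mean backlog:", round(simulate(shared=False), 1))
print("pooled RAs, mean backlog:", round(simulate(shared=True), 1))
```

With the same total RA count, the pooled configuration keeps a much smaller backlog, which is exactly the "soaking up bursts" argument.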
 
So it would seem that while it makes sense to grow LDS and make it dual-mode (shared or private per SIMD), to achieve greater ray tracing throughput AMD has no choice but to implement more RAs.

The ISA documentation seemed to suggest that the LDS WGP mode uses a near-far arrangement, rather than full-fledged bank interleaving across both arrays. So it is possible that this sharing is enabled by a simpler cost effective option like... a ring bus, which is expansion friendly.
 
The ISA documentation seemed to suggest that the LDS WGP mode uses a near-far arrangement, rather than full-fledged bank interleaving across both arrays. So it is possible that this sharing is enabled by a simpler cost effective option like... a ring bus, which is expansion friendly.
Hmm, interesting.

Would a ring with 16 stops (8x SIMDs/RFs + 8x TMUs/RAs) be practical?

"Zen 3" Chiplet Uses a Ringbus, AMD May Need to Transition to Mesh for Core-Count Growth | TechPowerUp

Maybe two rings? One for SIMDs/RFs and the other for TMUs/RAs?
 
So, I'm torn. Fine-grained private-per-SIMD TMU/RA is sort of the trend we've seen with AMD (currently private per CU). But a group of TMUs/RAs shared by all of a WGP is attractive because it can soak up bursts of work better
I wouldn't hinge on these ratios too much. Empirically, the architecture has already been dealing with asymmetry in wavefront width, execution unit width and non-pipelined stages, texture filtering being the staple example. So there is plenty of room for possibilities. For example, they could introduce a ray traversal state machine as a new unpipelined execution unit, but it would still use the ray intersection facilities in the texture cache hierarchy. They could also raise the number of intersection units within a CU, since the current one processes at most one lane per clock (4 box tests or 1 triangle test for that lane).
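For context, here's the ballpark peak test rate implied by the current arrangement on Navi 21; the 4-box/1-triangle-per-RA-per-clock figure is AMD's, but the clock is an assumed game clock, so treat the results as rough:

```python
# Ballpark peak intersection-test rates for Navi 21 (assumed ~2.25 GHz game clock).
CUS = 80           # Navi 21 has one RA per CU
BOX_PER_CLK = 4    # ray-box tests per RA per clock
TRI_PER_CLK = 1    # ray-triangle tests per RA per clock
CLOCK_GHZ = 2.25   # assumption

print(CUS * BOX_PER_CLK * CLOCK_GHZ, "G box tests/s")  # 720.0
print(CUS * TRI_PER_CLK * CLOCK_GHZ, "G tri tests/s")  # 180.0
```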
 