AMD: RDNA 3 Speculation, Rumours and Discussion

Well it worked out well for AMD in the CPU space. I don't think Ryzen processors necessarily took the halo/performance crown from Intel, but they delivered similar performance at significantly lower prices, which is really what vaulted AMD forward.
Ryzen had the advantage of being different, offering many more cores instead of peak single-thread performance, while GPUs are ultimately measured mainly in frames.
 
The 75 TFLOPS figure is based on doubling the ALUs per WGP, which won't equate to doubled performance. We just saw this with Turing->Ampere, where the doubled FP32 throughput translated to only ~30% more real-world performance. I suspect RDNA 3 will be similar.
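A back-of-the-envelope way to read that ~30% figure (the 2x and 1.3x numbers are from the comparison above; treating everything else on the chip as fixed is my simplifying assumption):

```
// Illustrative host-side arithmetic only: what ~30% more performance from
// 2x the ALUs implies about utilisation of the doubled units.
#include <cstdio>

int main() {
    const double peak_scaling = 2.0;  // doubled FP32 ALUs per SM/WGP
    const double observed     = 1.3;  // ~30% real-world gain (Turing->Ampere)
    // Implied utilisation of the new peak relative to the old one:
    printf("implied utilisation of the doubled ALUs: %.0f%%\n",
           100.0 * observed / peak_scaling);  // ~65%
    return 0;
}
```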

I’m not making any claims about RDNA3 efficiency. What I said is that winning the paper flops battle isn’t required to win the performance crown.
 
At the highest level the "doubled compute" of Ampere and (interpretations/predictions of) RDNA 3 can be compared thus:
  • Ampere's co-issue requires one of:
    • two independent floating-point instructions from a single hardware thread being available
    • a floating-point and an integer instruction from a single hardware thread being available
  • RDNA 3 requires one of:
    • dual-issue from two ready-to-execute hardware threads - either two wave32s, or a single wave64, which is composed of two linked hardware threads
    • co-issue of two instructions from the same hardware thread, encoded as VOPD
Note I'm referring to the issue rate of the highest-throughput SIMDs (ignoring double precision and transcendentals) and deliberately avoiding the tangle of theoretical FLOPS, since there are so many caveats there. The sketch below makes the two feeding models concrete.
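A minimal CUDA sketch of the distinction, assuming the interpretations above are right (the kernel and its names are illustrative, not vendor code; CUDA is used only because it's the handiest way to write a GPU kernel):

```
// Two independent FMAs in one thread: the ILP that Ampere-style co-issue
// has to find. RDNA 3-style dual-issue would not need this ILP at all --
// it wants two ready waves instead, which from the shader author's view
// is occupancy rather than instruction-level parallelism.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void feed_models(float* out, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = fmaf(a[i], b[i], 1.0f);  // independent op #1
    float y = fmaf(a[i], 2.0f, b[i]);  // independent op #2 (no dep on x)
    out[i] = x + y;                    // dependent on both
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    feed_models<<<(n + 255) / 256, 256>>>(out, a, b, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // fma(1,2,1) + fma(1,2,2) = 3 + 4 = 7
    return 0;
}
```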

RDNA 3 appears to suffer from complicated operand-bandwidth gotchas, which means the dual-issue scenario sometimes fails, so a fall-back to single issue from one of the hardware threads is all that can be achieved.

Obviously I'm speculating about RDNA 3 based on code published by AMD and patent documents.

By the way, I see no reason why NVidia wouldn't improve scheduling and operand handling so that dual-issue is supported in addition to co-issue. I'm not saying it's easy, but it appears to be an opportunity ripe for exploitation. Frankly, I'm expecting NVidia to achieve a significant utilisation gain.
 
Why single h/w thread though?

I was wondering the same thing. Given each of the 2 SIMDs is only 16 wide, the assumption is that Turing and Ampere issue to each SIMD on alternate clocks from the warp dispatcher. Those instructions by definition can come from different warps. Dual-issue is an obvious fit for this setup.
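The cadence implied by that assumption, spelled out (the 32-thread warp and 16-wide SIMD figures are public; the alternating-clock framing is the assumption above):

```
// A 32-thread warp on a 16-wide SIMD occupies it for 2 clocks, so a
// dispatcher that issues one instruction per clock can alternate between
// its two SIMDs and keep both busy -- potentially from different warps.
#include <cstdio>

int main() {
    const int warp_size  = 32;
    const int simd_width = 16;
    printf("each instruction occupies a SIMD for %d clocks,\n"
           "freeing every other dispatch slot for the second SIMD\n",
           warp_size / simd_width);  // 2
    return 0;
}
```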

What's the evidence for Ampere relying on co-issue, and how would that work?
 
Until games move to universal raytracing, or someone finds a cheaper alternative to interposers, chiplets for scaling don't make a lot of sense. Interposers could provide the bandwidth/latency to stitch two dies together for traditional APIs, but that's such a tiny niche, and so expensive, that only Apple can sell it profitably (and they don't even really need it for their GPUs; tilers don't really benefit).

With universal raytracing you mostly get rid of the vertex-to-pixel-shader pipelining, which makes a mess of parallelism.

PS. 3D stacking is hell on cooling.
 
I was wondering the same thing. Given each of the 2 SIMDs is only 16 wide, the assumption is that Turing and Ampere issue to each SIMD on alternate clocks from the warp dispatcher. Those instructions by definition can come from different warps. Dual-issue is an obvious fit for this setup.

What's the evidence for Ampere relying on co-issue, and how would that work?
The "evidence" is the disappointment at Ampere scaling versus Turing. There are corner cases of expected performance...

It may well be capable of dual-issue, with the shortfall merely a symptom of operand bandwidth.

If that's the case then RDNA 3 might be very similar in disappointment factor.

It might be nothing more than a matter of having enough hardware threads in flight. RDNA 3's rumoured substantial increase in register file size might be an attempt to ameliorate that problem. AMD has been too stingy with the register file for a very long time. With so much work apparently going into operands, RDNA 3 might finally be freed from disastrous corner cases, of which there have been far too many over the years.

Again, NVidia is likely to solve this kind of problem. We don't know the root cause.

Otherwise, the disappointment with Ampere may simply be a gross misunderstanding of its mechanics - and so maybe Ada and RDNA 3 will both be disappointing with their "doubled-up compute".
 
The "evidence" is the disappointment at Ampere scaling versus Turing.
Whose disappointment?
The issue with assessing said scaling is constantly the same - people forget that Turing already had a secondary SIMD, but since it was INT-only, its presence wasn't reflected in FLOPS figures.
According to NV's own numbers, Turing was running some 25-33% of gaming math on its secondary SIMD - something Ampere also does, despite that SIMD now being capable of running FP math too.
If you add such percentages to Turing's FP figures and then compare them to Ampere's, the scaling becomes a lot less "disappointing".
And outside of gaming, when running pure FP math Ampere scales mightily fine.
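That accounting, worked through (the 25-33% INT share is from the post above; 64 FP32 lanes per Turing SM and 128 per Ampere SM are the published configurations):

```
// If ~30% of Turing's gaming math ran on the INT32 SIMD, its effective
// per-SM issue rate was already above the quoted FP32 peak, so Ampere's
// doubled FP32 lanes buy less than 2x in games.
#include <cstdio>

int main() {
    const double turing_fp32  = 64.0;   // FP32 lanes per Turing SM
    const double int_fraction = 0.30;   // ~25-33% of gaming math was INT
    const double turing_effective = turing_fp32 / (1.0 - int_fraction); // ~91
    const double ampere_fp32 = 128.0;   // both Ampere SIMDs can run FP32
    printf("expected gaming scaling: %.2fx, not 2x\n",
           ampere_fp32 / turing_effective);  // ~1.4x
    return 0;
}
```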

Again, NVidia is likely to solve this kind of problem. We don't know the root cause.
I doubt they will, because I somewhat doubt that there is even a problem to solve. Lovelace should scale better, of course, but for different reasons.
 
What's the evidence for Ampere relying on co-issue, and how would that work?
There is none, because it doesn't do any co-issue. FMA co-issue was dropped in Maxwell, and FMA + LD/ST co-issue was dropped in Volta.
My guess is that if co-issuing of instructions were efficient it would not have been dropped, but even if it's perf/area/watt-neutral, there are reasons why a company might want to opt for it (even just that it looks good on paper).
 
That’s not evidence. Which apps are you referring to that should benefit from dual-issue FMA but don’t?
I'm referring to game performance, which was disappointingly far from double Turing's, whether expectations were defined in terms of FLOPS, TEX, bandwidth or power, or some combination. 3090Ti versus 2080Ti is pretty damning...

Pure compute applications, typically not games, saw a doubling or more in performance. I'm not aware of any analysis that identified the reasons - crucially, whether co-issue or dual-issue is the source of the performance gain.

FMA isn't the right way to think about primary ALU performance. Apart from anything else, FMA isn't the only floating-point instruction. That's why I talked about instruction throughput.

Ampere may well do dual-issue, and the disappointing game-performance uplift it saw may also apply to RDNA 3, which looks highly likely to be a dual-issue design.

For what it's worth, dual-issue of FMA in RDNA 3 looks like it will be impossible in a subset of operand-availability situations, since the register file can only provide four of the six required operands. In theory one or two operands can come from the destination operand cache, and one operand can be supplied as a literal. So there are some situations where dual-issue will work, but plenty where it won't.
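The operand tally behind that, spelled out (the four register-file reads per clock come from the speculation above, not from a published spec):

```
// Two dual-issued FMAs need 2 x 3 = 6 source operands per clock; with only
// 4 register-file reads assumed available, 2 operands must come from the
// destination-operand cache or a literal, or dual-issue falls back to
// single issue.
#include <cstdio>

int main() {
    const int fma_srcs  = 3;  // d = a * b + c
    const int per_clock = 2;  // two FMAs dual-issued
    const int rf_reads  = 4;  // assumed register-file read ports
    printf("%d operands/clock must bypass the register file\n",
           per_clock * fma_srcs - rf_reads);  // 2
    return 0;
}
```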

So RDNA 3 is likely to look worse than Ampere on dense FMA code with tons of instruction-level parallelism. Whether that's detectable is another question.
 
I'm referring to game performance, which was disappointingly far from double Turing's...

So RDNA 3 is likely to look worse than Ampere on dense FMA code with tons of instruction-level parallelism. Whether that's detectable is another question.

Right. Rendering a frame is an intricate dance where the bottleneck shifts every few milliseconds. Even if dual-issue isn't possible, that on its own wouldn't explain the lack of scaling in games. Most passes are bandwidth- or occupancy-limited anyway.
 
3090Ti versus 2080Ti is pretty damning...
In fairness, the 2080Ti should be compared against a 3090, since both are cut-down dies. The 3090Ti is a full die that should be compared to the Titan RTX.

Anyway, there are cases where the 3090 is a good 75% or more faster than the 2080Ti, mainly involving extreme RT at 4K. I'm going to list benchmarks comparing the 3080Ti vs the 2080Ti, since that alleviates any potential VRAM scaling issues; the 3090 should be 5% to 10% above the 3080Ti (the extrapolation is spelled out after the list).

Crysis 3 Remastered, 3080Ti is 72% faster, the 3090 should be at least 77% faster

Guardians of the Galaxy, 3080Ti is 90% faster, the 3090 should be at least 95% faster

Hitman 3, 3080Ti is 70% faster, the 3090 should be at least 75% faster

Cyberpunk, 3080Ti is 72% faster, the 3090 should be at least 77% faster

Dying Light 2, 3080Ti is 80% faster, the 3090 should be at least 85% faster
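The extrapolation applied above, made explicit (the figures are the post's own; the flat +5-point margin is the conservative end of its 5-10% assumption):

```
// Lower-bound 3090-over-2080Ti estimates from the quoted 3080Ti gains
// plus a flat +5-point 3090-over-3080Ti margin.
#include <cstdio>

int main() {
    const char* games[] = {"Crysis 3 Remastered", "Guardians of the Galaxy",
                           "Hitman 3", "Cyberpunk", "Dying Light 2"};
    const int gain_3080ti[] = {72, 90, 70, 72, 80};
    for (int i = 0; i < 5; ++i)
        printf("%-24s 3090 >= +%d%%\n", games[i], gain_3080ti[i] + 5);
    return 0;
}
```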
 
You haven't mentioned that.
All you've said is that it's hell on cooling, yet MI300 works just fine.
By using a huge, relatively low-utilisation cache die on the bottom. The cache will have poor cooling. If you tried to stack GPUs directly, without that expensive huge cache die, it would be a problem.

It's an even more expensive solution than interposers; I don't see them building consumer devices with multiple GPU dies on top of cache.
 