AMD: RDNA 3 Speculation, Rumours and Discussion

Well it worked out well for AMD in the CPU space. I don't think Ryzen processors necessarily took the halo/performance crown from Intel, but they delivered similar performance at significantly lower prices, which is really what vaulted AMD forward.
Ryzen had the advantage of being different, offering many more cores instead of peak single-thread performance, while GPUs are ultimately measured mainly in frames.
 
The 75 TFLOPS figure is based on doubling the ALUs per WGP, which won't equate to doubled performance. We just saw this with Turing->Ampere, where the doubled FP32 throughput translated to only ~30% more real-world performance. I suspect RDNA 3 will be similar.
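A back-of-the-envelope way to read that ~30% figure (the 2x and 1.3x numbers are from the comparison above; treating everything else on the chip as fixed is my simplifying assumption):

```
// Illustrative host-side arithmetic only: what ~30% more performance from
// 2x the ALUs implies about utilisation of the doubled units.
#include <cstdio>

int main() {
    const double peak_scaling = 2.0;  // doubled FP32 ALUs per SM/WGP
    const double observed     = 1.3;  // ~30% real-world gain (Turing->Ampere)
    // Implied utilisation of the new peak relative to the old one:
    printf("implied utilisation of the doubled ALUs: %.0f%%\n",
           100.0 * observed / peak_scaling);  // ~65%
    return 0;
}
```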

I’m not making any claims about RDNA3 efficiency. What I said is that winning the paper flops battle isn’t required to win the performance crown.
 
At the highest level the "doubled compute" of Ampere and (interpretations/predictions of) RDNA 3 can be compared thus:
  • Ampere's co-issue requires one of:
    • two independent floating-point instructions from a single hardware thread being available
    • a floating-point and an integer instruction from a single hardware thread being available
  • RDNA 3 requires one of:
    • dual-issue from two ready-to-execute hardware threads - either two wave32s, or a single wave64, which is composed of two linked hardware threads
    • co-issue of two instructions from the same hardware thread, encoded as VOPD
Note I'm referring to the issue rate of the highest-throughput SIMDs (ignoring double precision and transcendentals) and deliberately avoiding the tangle of theoretical FLOPS, since there are so many caveats there. The sketch below makes the two feeding models concrete.
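A minimal CUDA sketch of the distinction, assuming the interpretations above are right (the kernel and its names are illustrative, not vendor code; CUDA is used only because it's the handiest way to write a GPU kernel):

```
// Two independent FMAs in one thread: the ILP that Ampere-style co-issue
// has to find. RDNA 3-style dual-issue would not need this ILP at all --
// it wants two ready waves instead, which from the shader author's view
// is occupancy rather than instruction-level parallelism.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void feed_models(float* out, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = fmaf(a[i], b[i], 1.0f);  // independent op #1
    float y = fmaf(a[i], 2.0f, b[i]);  // independent op #2 (no dep on x)
    out[i] = x + y;                    // dependent on both
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }
    feed_models<<<(n + 255) / 256, 256>>>(out, a, b, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // fma(1,2,1) + fma(1,2,2) = 3 + 4 = 7
    return 0;
}
```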

RDNA 3 appears to suffer from complicated operand-bandwidth gotchas, which means the dual-issue scenario sometimes fails, so a fall-back to single issue from one of the hardware threads is all that can be achieved.

Obviously I'm speculating about RDNA 3 based on code published by AMD and patent documents.

By the way, I see no reason why NVidia wouldn't improve scheduling and operand handling so that dual-issue is supported in addition to co-issue. I'm not saying it's easy, but it appears to be an opportunity ripe for exploitation. Frankly, I'm expecting NVidia to achieve a significant utilisation gain.
 
Why single h/w thread though?

I was wondering the same thing. Given each of the 2 SIMDs is only 16 wide, the assumption is that Turing and Ampere issue to each SIMD on alternate clocks from the warp dispatcher. Those instructions by definition can come from different warps. Dual-issue is an obvious fit for this setup.
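The cadence implied by that assumption, spelled out (the 32-thread warp and 16-wide SIMD figures are public; the alternating-clock framing is the assumption above):

```
// A 32-thread warp on a 16-wide SIMD occupies it for 2 clocks, so a
// dispatcher that issues one instruction per clock can alternate between
// its two SIMDs and keep both busy -- potentially from different warps.
#include <cstdio>

int main() {
    const int warp_size  = 32;
    const int simd_width = 16;
    printf("each instruction occupies a SIMD for %d clocks,\n"
           "freeing every other dispatch slot for the second SIMD\n",
           warp_size / simd_width);  // 2
    return 0;
}
```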

What's the evidence for Ampere relying on co-issue, and how would that work?
 
Until games move to universal raytracing, or someone finds a cheaper alternative to interposers, chiplets for scaling don't make a lot of sense. Interposers could provide the bandwidth/latency to stitch two dies together for traditional APIs, but that's such a tiny niche, and so expensive, that only Apple can sell it profitably (and they don't even really need it for their GPUs; tilers don't really benefit).

With universal raytracing you mostly get rid of the vertex-to-pixel-shader pipelining, which makes a mess of parallelism.

PS. 3D stacking is hell on cooling.
 
I was wondering the same thing. Given each of the 2 SIMDs is only 16 wide, the assumption is that Turing and Ampere issue to each SIMD on alternate clocks from the warp dispatcher. Those instructions by definition can come from different warps. Dual-issue is an obvious fit for this setup.

What's the evidence for Ampere relying on co-issue, and how would that work?
The "evidence" is the disappointment at Ampere scaling versus Turing. There are corner cases of expected performance...

It may well be capable of dual-issue, with the shortfall merely a symptom of operand bandwidth.

If that's the case then RDNA 3 might be very similar in disappointment factor.

It might be nothing more than a matter of having enough hardware threads in flight. RDNA 3's rumoured substantial increase in register file size might be an attempt to ameliorate that problem. AMD has been too stingy with the register file for a very long time. With so much work apparently going into operands, RDNA 3 might finally be freed from disastrous corner cases, of which there have been far too many over the years.

Again, NVidia is likely to solve this kind of problem. We don't know the root cause.

Otherwise, the disappointment with Ampere may simply be a gross misunderstanding of its mechanics - and so maybe Ada and RDNA 3 will both be disappointing with their "doubled-up compute".
 
The "evidence" is the disappointment at Ampere scaling versus Turing.
Whose disappointment?
The issue with assessing said scaling is constantly the same - people forget that Turing already had a secondary SIMD, but since it was INT-only, its presence wasn't reflected in FLOPS figures.
According to NV's own numbers, Turing was running some 25-33% of gaming math on its secondary SIMD - something Ampere also does, despite that SIMD now being capable of running FP math too.
If you add such percentages to Turing's FP figures and then compare them to Ampere's, the scaling becomes a lot less "disappointing".
And outside of gaming, when running pure FP math Ampere scales mightily fine.
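That accounting, worked through (the 25-33% INT share is from the post above; 64 FP32 lanes per Turing SM and 128 per Ampere SM are the published configurations):

```
// If ~30% of Turing's gaming math ran on the INT32 SIMD, its effective
// per-SM issue rate was already above the quoted FP32 peak, so Ampere's
// doubled FP32 lanes buy less than 2x in games.
#include <cstdio>

int main() {
    const double turing_fp32  = 64.0;   // FP32 lanes per Turing SM
    const double int_fraction = 0.30;   // ~25-33% of gaming math was INT
    const double turing_effective = turing_fp32 / (1.0 - int_fraction); // ~91
    const double ampere_fp32 = 128.0;   // both Ampere SIMDs can run FP32
    printf("expected gaming scaling: %.2fx, not 2x\n",
           ampere_fp32 / turing_effective);  // ~1.4x
    return 0;
}
```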

Again, NVidia is likely to solve this kind of problem. We don't know the root cause.
I doubt they will, because I somewhat doubt that there is even a problem to solve. Lovelace should scale better, of course, but for different reasons.
 
What's the evidence for Ampere relying on co-issue, and how would that work?
There is none, because it doesn't do any co-issue. FMA co-issue was dropped in Maxwell, and FMA + LD/ST co-issue was dropped in Volta.
My guess is that if co-issuing of instructions were efficient it would not have been dropped, but even if it's perf/area/watt-neutral, there are reasons why a company might want to opt for it (even just that it looks good on paper).
 
That’s not evidence. Which apps are you referring to that should benefit from dual-issue FMA but don’t?
I'm referring to game performance, which was disappointingly far from double Turing's, whether expectations were defined in terms of FLOPS, TEX, bandwidth or power, or some combination. 3090Ti versus 2080Ti is pretty damning...

Pure compute applications, typically not games, saw a doubling or more in performance. I'm not aware of any analysis that identified the reasons - crucially, whether co-issue or dual-issue is the source of the performance gain.

FMA isn't the right way to think about primary ALU performance. Apart from anything else, FMA isn't the only floating-point instruction. That's why I talked about instruction throughput.

Ampere may well do dual-issue, and the disappointing game-performance uplift it saw may also apply to RDNA 3, which looks highly likely to be a dual-issue design.

For what it's worth, dual-issue of FMA in RDNA 3 looks like it will be impossible in a subset of operand-availability situations, since the register file can only provide four of the six required operands. In theory one or two operands can come from the destination operand cache, and one operand can be supplied as a literal. So there are some situations where dual-issue will work, but plenty where it won't.
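The operand tally behind that, spelled out (the four register-file reads per clock come from the speculation above, not from a published spec):

```
// Two dual-issued FMAs need 2 x 3 = 6 source operands per clock; with only
// 4 register-file reads assumed available, 2 operands must come from the
// destination-operand cache or a literal, or dual-issue falls back to
// single issue.
#include <cstdio>

int main() {
    const int fma_srcs  = 3;  // d = a * b + c
    const int per_clock = 2;  // two FMAs dual-issued
    const int rf_reads  = 4;  // assumed register-file read ports
    printf("%d operands/clock must bypass the register file\n",
           per_clock * fma_srcs - rf_reads);  // 2
    return 0;
}
```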

So RDNA 3 is likely to look worse than Ampere on dense FMA code with tons of instruction-level parallelism. Whether that's detectable is another question.
 
I'm referring to game performance, which was disappointingly far from double Turing's...

So RDNA 3 is likely to look worse than Ampere on dense FMA code with tons of instruction-level parallelism. Whether that's detectable is another question.

Right. Rendering a frame is an intricate dance where the bottleneck shifts every few milliseconds. Even if dual-issue isn't possible, that on its own wouldn't explain the lack of scaling in games. Most passes are bandwidth- or occupancy-limited anyway.
 
3090Ti versus 2080Ti is pretty damning...
In fairness, the 2080Ti should be compared against a 3090, since both are cut-down dies. The 3090Ti is a full die that should be compared to the Titan RTX.

Anyway, there are cases where the 3090 is a good 75% or more faster than the 2080Ti, mainly involving extreme RT at 4K. I'm going to list benchmarks comparing the 3080Ti vs the 2080Ti, since that alleviates any potential VRAM scaling issues; the 3090 should be 5% to 10% above the 3080Ti (the extrapolation is spelled out after the list).

Crysis 3 Remastered, 3080Ti is 72% faster, the 3090 should be at least 77% faster

Guardians of the Galaxy, 3080Ti is 90% faster, the 3090 should be at least 95% faster

Hitman 3, 3080Ti is 70% faster, the 3090 should be at least 75% faster

Cyberpunk, 3080Ti is 72% faster, the 3090 should be at least 77% faster

Dying Light 2, 3080Ti is 80% faster, the 3090 should be at least 85% faster
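The extrapolation applied above, made explicit (the figures are the post's own; the flat +5-point margin is the conservative end of its 5-10% assumption):

```
// Lower-bound 3090-over-2080Ti estimates from the quoted 3080Ti gains
// plus a flat +5-point 3090-over-3080Ti margin.
#include <cstdio>

int main() {
    const char* games[] = {"Crysis 3 Remastered", "Guardians of the Galaxy",
                           "Hitman 3", "Cyberpunk", "Dying Light 2"};
    const int gain_3080ti[] = {72, 90, 70, 72, 80};
    for (int i = 0; i < 5; ++i)
        printf("%-24s 3090 >= +%d%%\n", games[i], gain_3080ti[i] + 5);
    return 0;
}
```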
 
You haven't mentioned that.
All you've said is that it's hell on cooling, yet MI300 works just fine.
By using a huge, relatively low-utilisation cache die on the bottom. The cache will have poor cooling. If you tried to stack GPUs directly, without that expensive huge cache die, it would be a problem.

It's an even more expensive solution than interposers; I don't see them building consumer devices with multiple GPU dies on top of cache.
 