AMD Execution Thread [2023]

LDS size did not change though, it's the same 128 KB per WGP (or 64 KB per CU) as in RDNA1
Thanks for the correction. I probably got confused by WGP vs. CU again at some point.
I haven't done that much compute work, but interestingly the given LDS size was always (at least barely) enough and seemed about right, so I wondered why they would double it.

Curious about dual issue in practice. Personally I'm always optimistic about features like this or double-rate FP16. It's surely not easy to figure out how to utilize them, but there must be a way... : )
 
AMD cites a "17.4% architectural improvement clock-for-clock", so together with an 8% frequency increase that should have resulted in roughly 27% higher performance, but so far such improvements mostly manifest in compute-limited tasks, like heavy levels of raytracing.
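(Quick sanity check of that arithmetic - the two gains multiply rather than add; toy Python with the numbers taken straight from the claim above:)

```python
# Compounded uplift from AMD's claimed gains (toy arithmetic check)
ipc_gain = 1.174    # "17.4% architectural improvement clock-for-clock"
clock_gain = 1.08   # ~8% frequency increase
expected = ipc_gain * clock_gain - 1
print(f"Expected combined uplift: {expected * 100:.1f}%")   # ~26.8%
```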

More evidence (or even proof) that the thing 'wrong' with RDNA3 is not necessarily some inability to hit expected clockspeeds. There is something else fundamentally wrong with the architecture that's preventing it from performing as AMD had initially tested and expected it would.

Such a weird situation.
 
Not only is Starfield missing from the list of games getting future FSR 3 support, but Bethesda is also not listed among the partners committing to FSR 3. Maybe it's contract related, and implementation might require a revised (new) contract as opposed to just adding FSR 3 at a later date?
I would find it very strange for a contract to stop you from using newer tech, especially when the tech is open source.
Take into account that they would also want to market FSR3 anyway.

My opinion:
They know they will have their hands full supporting what they already have, without starting to publicly commit to adding anything else.

The idea that the number of bugs is indicative of the level of crunch is incorrect.
Also, even if there was no crunch, that doesn't mean there was spare time.
 
For Starfield, it's likely FSR(2?) is the only thing that'll ever be supported, at least until a remaster comes out. And it's likely only there because they need it on console to hit decent framerates. So, I doubt it'll support FSR3, at least anytime before a remaster or possibly a GOTY edition.

Not because of a contract. Not because of money.

Regards,
SB
 
For Starfield, it's likely FSR(2?) is the only thing that'll ever be supported, at least until a remaster comes out. And it's likely only there because they need it on console to hit decent framerates. So, I doubt it'll support FSR3, at least anytime before a remaster or possibly a GOTY edition.

Not because of a contract. Not because of money.

Regards,
SB
Well, it's only 30fps on console, and I'd bet that the reason is CPU-related, so I'm not sure FSR2 is some requirement for performance. Can't imagine 30fps should be too hard to hit for XSX's GPU.

But yea, who knows. We'll find out in the coming days.
 
For Starfield, it's likely FSR(2?) is the only thing that'll ever be supported, at least until a remaster comes out. And it's likely only there because they need it on console to hit decent framerates. So, I doubt it'll support FSR3, at least anytime before a remaster or possibly a GOTY edition.

Not because of a contract. Not because of money.

Regards,
SB
There is a chance it gets new upscaling with the DLC content coming out. Moving to FSR3, at least on consoles, might give them image and performance boosts and could be good press for them.
 
Moving to FSR3, at least on consoles, might give them image and performance boosts and could be good press for them.
It's unlikely to happen because FSR 3 recommends a framerate of at least 60 before applying frame generation.
Also, none of the current-gen consoles support Anti-Lag+.
 
More evidence (or even proof) that the thing 'wrong' with RDNA3 is not necessarily some inability to hit expected clockspeeds. There is something else fundamentally wrong with the architecture that's preventing it from performing as AMD had initially tested and expected it would.

Such a weird situation.
Yes, with the official reveal of the Navi 32 specifications, the "clockspeeds bad => Navi 32 fixed" excuses finally became void.

The story of RDNA3 is worth an analysis. The hype surrounding the 5nm frequency jump, with "4GHz are coming, bro" and "so huge it requires chiplets" all over the place. AMD slides presented nice perf/W and IPC gains. Then Navi 31 launched with a tiny die, lowish frequencies, power issues, and very low IPC gains. Then the "frequency bug" excuses emerged - "a Navi 31 respin saves the day", "Navi 31/33 sux, wait for Navi 32, which comes late and is fixed!". The drivers were terrible and got fixed, but no "unlock the dual-issue gravy, bruh" update ever arrived.

RDNA3 is just another Vega. Slides full of cool tech but underwhelming real-world results.

The R&D cost vs income ratio must be terrible. Chiplets, packaging, new shader architecture, advanced clocking, etc.

RDNA4 being cancelled seems like a wise decision.
 
Yes, with the official reveal of the Navi 32 specifications, the "clockspeeds bad => Navi 32 fixed" excuses finally became void.

The story of RDNA3 is worth an analysis. The hype surrounding the 5nm frequency jump, with "4GHz are coming, bro" and "so huge it requires chiplets" all over the place. AMD slides presented nice perf/W and IPC gains. Then Navi 31 launched with a tiny die, lowish frequencies, power issues, and very low IPC gains. Then the "frequency bug" excuses emerged - "a Navi 31 respin saves the day", "Navi 31/33 sux, wait for Navi 32, which comes late and is fixed!". The drivers were terrible and got fixed, but no "unlock the dual-issue gravy, bruh" update ever arrived.

RDNA3 is just another Vega. Slides full of cool tech but underwhelming real-world results.

The R&D cost vs income ratio must be terrible. Chiplets, packaging, new shader architecture, advanced clocking, etc.

RDNA4 being cancelled seems like a wise decision.
Sad, but true ...

From Hot Chips 2023:

https://www.servethehome.com/amd-versal-premium-vp1902-next-gen-chiplet-fpga-at-hot-chips-2023/

[Slide: AMD XCVP1902 next-gen chiplet FPGA, Hot Chips 2023 (HC35)]
 
RDNA4 being cancelled seems like a wise decision.
I don't know, RDNA3 being lousy seems to be even more reason to really push hard with RDNA4 if they could, to move on and keep up. But if RDNA4 is perhaps plagued by whatever's hobbling RDNA3, then maybe they don't want to throw too much at it and instead hope to correct things with RDNA5. Nvidia perhaps pushing next gen to 2025 could be giving AMD the room to do this.
 
I don't know, RDNA3 being lousy seems to be even more reason to really push hard with RDNA4 if they could, to move on and keep up. But if RDNA4 is perhaps plagued by whatever's hobbling RDNA3, then maybe they don't want to throw too much at it and instead hope to correct things with RDNA5. Nvidia perhaps pushing next gen to 2025 could be giving AMD the room to do this.
What is the difference between RDNA3 and RDNA3.5?
 
What is the difference between RDNA3 and RDNA3.5?
IIRC, beefy scalar units.

Cancelling a gen would usually leave room for a deep rethinking of the plans and a redo of the org structure. This happened in 2012 after the Bulldozer fiasco - they skipped 2-3 gens almost completely while making room for the Zen project. Additionally, with the current ~10% desktop market share, the absolute amount of income lost by sliding to, let's say, 5% would not hurt them.
 
I would hope it's more than beefy scalar units to justify iterating on the core architecture though...

From personal experience, there is a large cost to any organisation for making multiple products with slightly different architectures rather than fully identical ones just scaled up/down, so you really need a good reason to do so. Then again, executives tend to underestimate that kind of engineering cost and the impact it has on other projects, so...
 
I don't know, RDNA3 being lousy seems to be even more reason to really push hard with RDNA4 if they could, to move on and keep up. But if RDNA4 is perhaps plagued by whatever's hobbling RDNA3, then maybe they don't want to throw too much at it and instead hope to correct things with RDNA5. Nvidia perhaps pushing next gen to 2025 could be giving AMD the room to do this.
And Nvidia would be very happy to have no competition for the next two years
 
And Nvidia would be very happy to have no competition for the next two years
I have zero insider information on this, but I feel like the other elephant in the room is RDNA vs CDNA and the growing importance of AI acceleration. The two architectures have been diverging more and more, and while AMD claims that CDNA is optimised for its class of workloads, I am extremely skeptical that CDNA's cache hierarchy is actually better than RDNA's for example. Heck, I'm really not sure what changes in RDNA are actually supposed to be *worse* for any workload... I can see a lot of them not mattering, but being worse? I could be wrong but it just feels like they diverged this much because it's more efficient to treat them as fully separate engineering projects rather than being interdependent.

If high-end RDNA4 is actually cancelled, the engineers that would be working on it aren't going to be idle, and it feels like CDNA is the obvious thing for them to focus on...

And I think NVIDIA must be a lot more focused on AI competition than consumer GPU competition at this point. I hope that roadmap showing Ada-Next as being 2025 is either wrong or it's Q1 2025 because having no high-end dGPU refresh from either NVIDIA or AMD for 3+ years (except for a boring fully enabled RTX 4090 Ti) would be very disappointing.
 
It's unlikely to happen because FSR 3 recommends a framerate of at least 60 before applying frame generation.
Also, none of the current-gen consoles support Anti-Lag+.
60fps min is a 'strong recommendation', not an actual requirement.

I don't believe Anti-Lag is a requirement, just an added facility that AMD will use to help latency.
Remember FSR3 will work on non-AMD GPUs and will also work on consoles.
On console, probably not in the way many people think, as in getting from 30 to 60fps.
Mostly it will be used to get to 120fps.
 
Then of course there are the superscalar dual-issue ALUs capable of processing FP32 and FP32/Int8 register/immediate operands in one clock, using either V_DUAL instructions in VOPD encoding for Wave32 threads, or Wave64 threads, which would transparently dual-issue all VOPD instructions as shown by the Chips&Cheese benchmarks below (but bizarrely Wave64 mode is not supported in HIP/ROCm due to hardware limitations, and there are further limits such as source VGPRs having to come from different banks and cache ports, even/odd register numbers, etc.).

In practice Wave32 is the prevalent mode, so dual-issue relies on compilers/optimizers, which do a very poor job of automatically scheduling parallel multiple-operand instructions (Intel and HP will testify to the spectacular failure of the much-touted Explicitly Parallel Instruction Computing (EPIC) architecture of the Itanium (IA64) processors, which relied on exactly the same kind of compiler optimizations to extract 6-way parallelism from its VLIW instruction bundles).
Having spent a lot of time optimising Imagination's Rogue architecture with Dual FMA and Vec4 FP16 (I personally spent a few months optimising the compiler for them), I don't think the problem is Dual-FMA per se; it's the fact that they have an insane number of interdependent restrictions that the compiler needs to optimise for at the same time. One fundamental problem with VLIW is conditional execution and short basic blocks resulting in partially filled instructions, because there just isn't enough work to do with the same execution predicate, and that is probably a bigger problem for modern graphics/compute algorithms today than it was 10 years ago (and it's potentially really bad on CPUs, which is one of the many reasons why Itanium was doomed from the start), but I don't think that's the main problem here.

It's surprisingly easy for a good compiler to solve one very hard problem, and surprisingly hard for it to solve many "not-so-hard-but-interdependent" problems. The fact that dual-issue is so strongly dependent on register allocation due to heavy bank/port restrictions is horrific to me; it's so much cleaner if you can do register allocation after scheduling without any interdependence. If that part of the design were over-specced so that the compiler could assume it would probably be OK for the initial pass, that'd be one thing, but 1 read port per SRC per bank is not over-specced at all... I know what the area/power cost of 2R+1W latch/flop arrays is versus 1R+1W, and I'd be shocked if this was worth the HW savings.
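Just to illustrate how entangled this gets (a toy sketch only - the bank count and the rules below are simplified stand-ins, not the exact RDNA3 encoding restrictions):

```python
# Toy model of a VOPD-style pairing check. The constraints below are
# simplified placeholders, not the exact RDNA3 rules.

NUM_BANKS = 4  # assumption: VGPR file split into 4 banks, 1 read port each

def bank(vgpr: int) -> int:
    return vgpr % NUM_BANKS

def can_dual_issue(op_x, op_y) -> bool:
    """op_x / op_y are (dst, [srcs]) for the two candidate FP32 ops."""
    dst_x, srcs_x = op_x
    dst_y, srcs_y = op_y
    # 1) data independence: neither op may read the other's destination
    if dst_x in srcs_y or dst_y in srcs_x:
        return False
    # 2) destination parity: one op must write an even VGPR, the other an odd one
    if dst_x % 2 == dst_y % 2:
        return False
    # 3) one read port per bank: corresponding sources of the two ops
    #    must not land in the same bank
    return all(bank(sx) != bank(sy) for sx, sy in zip(srcs_x, srcs_y))

# v1 = fma(v4, v5, v1) paired with v2 = fma(v6, v7, v2)?
print(can_dual_issue((1, [4, 5, 1]), (2, [6, 7, 2])))  # True
# Same code, different register allocation -> the pair is no longer legal:
print(can_dual_issue((1, [4, 5, 1]), (2, [8, 5, 2])))  # False: source bank clash
```

The point being that whether two ops can pair at all depends on which VGPRs the allocator happened to pick, so you can't cleanly do scheduling first and register allocation afterwards.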

Also, the fact they can only have 4 unique operands in Wave32 mode feels like it might be because they are restricting the VOPD instructions to still only be 64-bit (or 96-bit with a 32-bit immediate) which is just insane when NVIDIA's single FMA instructions are all 128-bit. You'd have more than enough space for 6 unique operands with a 96-bit instruction + 32-bit immediate which still ends up at only the same 128-bit as NVIDIA's instructions for 2x as many FMAs (so 2x density). AMD's instruction set has plenty of very long instructions but not for the main VOP/FMA pipeline, so it sounds like they just reuse the same fetch/decode unit without any major changes instead of increasing maximum length from 96-bit to 128-bit... Arguably you really need both changes together to get good utilisation in Wave32 mode.
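Rough back-of-the-envelope on that encoding argument (the field widths below are my own guesses purely to show the orders of magnitude, not the real VOPD layout):

```python
# Back-of-the-envelope instruction-size estimate; field widths are assumptions.
OPCODE_BITS_PER_OP = 8   # assumed opcode/format bits per half of the dual-issue pair
OPERAND_FIELD_BITS = 9   # assumed: enough to address 256 VGPRs plus a few specials

def vopd_bits(unique_operands: int, imm_bits: int = 0) -> int:
    return 2 * OPCODE_BITS_PER_OP + unique_operands * OPERAND_FIELD_BITS + imm_bits

print(vopd_bits(4))        # 52 bits of payload -> fits in a 64-bit word
print(vopd_bits(6))        # 70 bits -> would need a 96-bit encoding
print(vopd_bits(6, 32))    # 102 bits -> 128-bit, same size as one NVIDIA FMA
```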

If you had dual-issue with 2R1W-per-source-per-bank (or 3R1W-per-bank) and with 128-bit instructions, I suspect you'd get much higher VOPD utilisation. There would be a small area/power penalty for the extra read ports, but I don't think it'd be that bad at all (especially if it sometimes allowed you to save an extra RAM read). Alternatively, you could have a more complicated register caching scheme that isn't (fully?) bank-based and/or allow cases with more reads to opportunistically go at full speed if they use the forwarding path etc...

In my opinion, the only way this architecture makes sense ***for Wave32*** is if AMD hacked the whole VOPD thing together in months rather than years as a panic reaction to NVIDIA's GA102 dual-FMA design (which is also not going to hit anywhere near 100% of peak for other reasons, although I suspect it's better than this) without enough modelling or time to do compiler/hardware co-design. Or if they intentionally decided to focus on Wave64 (and they ended up not being able to use it in as many cases as expected).

It's an awful lot better for Wave64 without manual dual-issue though and if you can apply Wave64 to everything then these design decisions don't really matter anymore. I have never managed to get good data on that, but I suspect running two parts of the same wavefront/warp on the same ALU/FMA unit back-to-back will result in noticeably lower power consumption because the input/output data is much more similar than for a random uncorrelated FMA from another wavefront/warp, so you get less switching (lower hamming distance) which is the main component of dynamic power consumption. So there might be an argument in favour of running Wave64 anyway for other reasons for ALU-heavy workloads, while keeping Wave32 for branch/memory-heavy workloads only.
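For what it's worth, here's a toy illustration of that switching-activity argument (the operand streams are made up, nothing measured on real hardware):

```python
import random

def toggles(a: int, b: int) -> int:
    """Bits that flip between two consecutive 32-bit operands (hamming distance)."""
    return bin(a ^ b).count("1")

def avg_toggles(stream):
    return sum(toggles(a, b) for a, b in zip(stream, stream[1:])) / (len(stream) - 1)

random.seed(0)
# Back-to-back halves of the same wavefront: neighbouring lanes tend to hold similar values
same_wave = [0x40490FDB + i for i in range(64)]
# Interleaving unrelated wavefronts: essentially uncorrelated 32-bit values
mixed_waves = [random.getrandbits(32) for _ in range(64)]

print(f"same-wave, back-to-back : ~{avg_toggles(same_wave):.1f} bit toggles per operand")
print(f"unrelated, interleaved  : ~{avg_toggles(mixed_waves):.1f} bit toggles per operand")
```

Fewer toggles on the operand buses and datapath means less dynamic power, which is exactly the argument for preferring Wave64 on ALU-heavy work.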

As you point out, Wave64 isn't supported for HIP/ROCm which is not great... the permute limitation is unfortunate but should be easy-ish to fix in future generations. I suspect they can use Wave64 most of the time for most graphics workloads though, so even though it's not ideal, it might not have much to do with their efficiency problems in graphics workloads which is the main thing RDNA3 is being sold for (while CDNA is focused on HIP and doesn't have these problems). And finally, even if it wasn't optimal in graphics, if the main bottleneck with RDNA3 isn't the number of FMAs anyway (*cough* raytracing and power efficiency *cough*) then it doesn't really matter and it's as good a way as any to double the number of theoretical flops at little area/power and most importantly engineering cost...! So in the end, maybe it wasn't worth it for you to read this analysis at all, but too late now ;)

P.S.: If you're curious about the insanity that is Imagination/PowerVR's Rogue ISA, it is publicly available here (I couldn't find another public link that didn't require downloading the SDK): https://docplayer.net/99326391-Powervr-instruction-set-reference.html - the Vec4 FP16 instruction is documented(-ish) on Page 22 Section 6.1.13... I find it absolutely hilarious that the example in that section doesn't actually manage to generate a Vec4 SOPMAD (just 2 Vec2 SOPs), probably because that document was created in 2014 before I worked on those compiler improvements...! The one nice thing about that instruction is the muxing was extremely flexible, so you could read any 6 32-bit operands (coming from 8 main banks + other register types) and mux it in any way you want to the 12 16-bit sources. That represents 12x 12:1 muxes which needless to say were extremely expensive in both area/power and in terms of instruction length - I think the largest instructions could get to slightly less than 512-bit *per instruction*!
 