RDNA4

Bondrewd · Jan 30, 2024

Kaotik said:
AMD already explained why shader engine chiplets are a bad (expensive) idea when they first brought chiplets to GPUs

Yet they're doing that anyway.
Frankly not the most crackpot of their ideas, they have GPU register renaming patented somewhere.

WIPO - Search International and National Patent Collections

Yea. here it is.

DegustatoR · Feb 5, 2024

AMD RDNA 3.5’s LLVM Changes

Integrated graphics have been a key part of AMD’s strategy ever since they bought ATI. Bringing CPU and GPU blocks together in the same chip has given AMD substantial wins, including in Micro…

chipsandcheese.com

Frenetic Pony · Feb 19, 2024

Random stab:

40cu/20wgp, 12gb
20cu/10wgp, 9gb?
3ghz, N4P

AMD trying to concentrate on their biggest market (mobile) until RDNA5 which is scheduled for "next year when it's ready".

Bondrewd · Feb 22, 2024

Frenetic Pony said:
40cu/20wgp, 12gb
20cu/10wgp, 9gb?
3ghz, N4P

They're not that tiny and they clock higher than that.

Frenetic Pony said:
AMD trying to concentrate on their biggest market (mobile)

That's all APUs.

Bondrewd · Feb 24, 2024

Arun said:
Are you saying you know 10 WGP discrete doesn't exist from insider information

Yes, that segment is dead for both vendors.

Arun said:
For what it's worth, your original statement read to me as "There is definitively no 20 WGP discrete GPU in the RDNA4 generation based on what I've heard from insiders" as well.

No, it's "both dies are bigger".
Which they are.

Arun said:
but I don't see why 20 WGP would be completely out of the question.

It kinda overlaps Strix-halo in some ways.

DegustatoR said:

even hawk1 clocks better than previous RDNA3 thingies (especially at low power).

Deleted member 2197 · Mar 2, 2024

Rumor:

AMD RDNA4 graphics cards may only receive a minor bump in ray tracing performance

Sources have told popular YouTuber RedGamingTech that AMD's upcoming RDNA 4 graphics cards will only see a roughly 25 percent ray tracing performance uplift over the Radeon...

www.techspot.com

BTW, is AMD on an annual cadence regarding changing architecture? Or every two - three years?

Newguy · Mar 2, 2024

pharma said:
BTW, is AMD on an annual cadence regarding changing architecture? Or every two - three years?

To my knowledge the latest thing we have is AMD's roadmap from mid 2022:

https://images.anandtech.com/doci/17443/2022-06-09%2013_39_26.jpg

AMD’s 2022-2024 Client GPU Roadmap: RDNA 3 This Year, RDNA 4 Lands in 2024

www.anandtech.com

Releases are generally around 18-24 months apart. Vega was mid 2017, RDNA 1 mid 2019, RDNA 2 end 2020, RDNA 3 end 2022. RDNA 4 rumours point to mid-ish 2024 and I think RDNA 5 rumours point to end 2025, so potentially both arrive around 18 months after their predecessors.
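As a rough check of those intervals, here is a small sketch that turns the dates in this post into month gaps. Treating "mid" as June and "end" as December is an assumption made only to get numbers out, and the RDNA 4 / RDNA 5 entries are the rumoured dates, not confirmed ones.

```python
# Approximate release dates taken from the post above.
# Assumption: "mid" -> month 6 (June), "end" -> month 12 (December).
releases = [
    ("Vega", 2017, 6),
    ("RDNA 1", 2019, 6),
    ("RDNA 2", 2020, 12),
    ("RDNA 3", 2022, 12),
    ("RDNA 4 (rumoured)", 2024, 6),
    ("RDNA 5 (rumoured)", 2025, 12),
]


def months_between(a, b):
    """Whole months from release a to release b."""
    return (b[1] - a[1]) * 12 + (b[2] - a[2])


# Gap between each release and the one that followed it.
gaps = [months_between(a, b) for a, b in zip(releases, releases[1:])]

for (prev, nxt), gap in zip(zip(releases, releases[1:]), gaps):
    print(f"{prev[0]} -> {nxt[0]}: {gap} months ({gap / 3:.0f} quarters)")
```

Under those assumptions every gap comes out at either 18 or 24 months, i.e. 6 to 8 quarters.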

Bondrewd · Mar 2, 2024

pharma said:
BTW, is AMD on an annual cadence regarding changing architecture?

6-8Q.

eastmen · Mar 2, 2024

pharma said:
Rumor:

AMD RDNA4 graphics cards may only receive a minor bump in ray tracing performance

Sources have told popular YouTuber RedGamingTech that AMD's upcoming RDNA 4 graphics cards will only see a roughly 25 percent ray tracing performance uplift over the Radeon...

www.techspot.com

BTW, is AMD on an annual cadence regarding changing architecture? Or every two - three years?

That would be very disappointing

Frenetic Pony · Mar 2, 2024

eastmen said:
That would be very disappointing

I'd kinda expect the highest end one to be competitive with like, 4070ti Super in say, Frontiers of Pandora (1440p 60 on Ultimate settings). Honestly, the price for performance matters more than relative performance between RT and non RT. If it's $649 with a pack in Star Wars Outlaws, that'd probably sound damn good to potential buyers versus Nvidia, even if they drop the 4070tiS by $100.

If say, Horizon Forbidden West, runs (relatively) 25% faster that's just a bonus. That being said, AMD would have to bloody learn how to PR to sell that appealing message.

Arun · Mar 3, 2024

Deleted a couple of posts as they go against the guidelines of this subforum (and the topic in general would have to be handled more carefully elsewhere as well...)

Disallowed thread topics: Anything related to business performance, sales, marketing, etc (Graphics and Semiconductor Industry for those)

Please resume your previously scheduled RDNA4 speculation

Seanspeed · Mar 4, 2024

eastmen said:
That would be very disappointing

Thankfully it's a RGT rumor, so entirely safe to discard its credibility.

eastmen · Mar 4, 2024

Frenetic Pony said:
I'd kinda expect the highest end one to be competitive with like, 4070ti Super in say, Frontiers of Pandora (1440p 60 on Ultimate settings). Honestly, the price for performance matters more than relative performance between RT and non RT. If it's $649 with a pack in Star Wars Outlaws, that'd probably sound damn good to potential buyers versus Nvidia, even if they drop the 4070tiS by $100.

If say, Horizon Forbidden West, runs (relatively) 25% faster that's just a bonus. That being said, AMD would have to bloody learn how to PR to sell that appealing message.

Yea, but will the 4070 be its competition, or a 5060/5070? It's hard to compare it to something that is on the market while expecting the market not to change over what is likely another 6-8 months before the release of the product.

You also have to factor in power drain and heat production. So the nvidia competition could trump them on everything except price.

Albuquerque · Mar 8, 2024

MOD MODE AGAIN: I've cleaned house. If it isn't about RDNA4, then it doesn't belong in this thread. If you want to discuss efficiency of prior gens, you're absolutely welcome (and very strongly encouraged) to do it in another thread.

Ext3h · Mar 9, 2024

Seanspeed said:
Thankfully it's a RGT rumor, so entirely safe to discard its credibility.

So far it's plausible though. Unless we see new instructions which hint at a fully offloaded, fixed function traversal approach, it's still going to be the same limitations as with RDNA3 - too much register pressure, too much synchronous alternation between hit shaders and triangle soup filtering in the shader code, with a ton of reloads from L2 too as the BVH traversal is purely software based as well.

And you can't just continue the RDNA3 route of increasing register counts even further, just so you can keep more rays in flight, just so you can better utilize the units in the shader array aside from triangle filter, while simultaneously getting hit even worse by cache spilling.

That article had summarized it quite well: https://chipsandcheese.com/2023/03/22/raytracing-on-amds-rdna-2-3-and-nvidias-turing-and-pascal/

How a fixed function traversal unit, which can asynchronously pre-fetch upcoming hits (buffered into LDS), enables the use of ultra-wide (thereby usually infeasibly expensive...) tree nodes / triangle groups (don't assume they wouldn't have an internal subdivision though - just that it's ISA-specific while the outer, observable layout is shader defined) and in return reduces the memory latency penalty of a deeper structure instead.
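To make the "purely software based" traversal point concrete, here is a minimal sketch of a stack-based BVH walk. It is not AMD's shader code or any vendor's implementation: `Node`, `ray_hits_box` and `traverse` are hypothetical names, and the scene is a toy. What it illustrates is that the loop, the stack and the current node all count as per-ray state, which in a shader-driven traversal has to live in registers or LDS for every ray in flight.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                                      # AABB min corner (x, y, z)
    hi: tuple                                      # AABB max corner (x, y, z)
    children: list = field(default_factory=list)   # inner node: child Nodes
    prims: list = field(default_factory=list)      # leaf node: primitive ids


def ray_hits_box(origin, direction, lo, hi):
    """Slab test: does the ray (t >= 0) overlap the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far


def traverse(root, origin, direction):
    """Return (candidate primitive ids, peak stack size) for one ray."""
    stack = [root]          # per-ray traversal state
    max_depth = 1
    candidates = []
    while stack:
        node = stack.pop()
        if not ray_hits_box(origin, direction, node.lo, node.hi):
            continue
        if node.prims:
            candidates.extend(node.prims)   # triangle tests would follow here
        else:
            stack.extend(node.children)     # wider nodes -> more state per ray
            max_depth = max(max_depth, len(stack))
    return candidates, max_depth


# Toy scene: one root box split into two leaves along x.
left = Node((0, 0, 0), (2, 1, 1), prims=[0])
right = Node((2, 0, 0), (4, 1, 1), prims=[1])
root = Node((0, 0, 0), (4, 1, 1), children=[left, right])

hits, depth = traverse(root, (-1.0, 0.5, 0.5), (1.0, 0.0, 0.0))
print(sorted(hits), depth)
```

A fixed function traversal unit would, in this picture, own the `while` loop and the stack itself, so the shader only sees the final candidates.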

fellix · Mar 9, 2024

Ext3h said:
And you can't just continue the RDNA3 route of increasing register counts even further, just so you can keep more rays in flight, just so you can better utilize the units in the shader array aside from triangle filter, while simultaneously getting hit even worse by cache spilling.

I wonder if AMD was kind of cornered by its market position to "brute force" the RDNA3 design -- just more of everything (caches, VGPRs and constipated dual-issue) and very little specific/targeted improvements. And what's up with WMMA support? FSR should have already been forked to implement inference on RDNA3 for higher quality output to gain some points against DLSS.

Bondrewd · Mar 9, 2024

fellix said:
I wonder if AMD was kind of cornered by its market position to "brute force" the RDNA3 design -- just more of everything (caches, VGPRs and constipated dual-issue)

It has less MALL, mainstream WGP has the same vRF and the thing is made for clocks anyway.
Area is down iso WGP count iso node so idk your point.

fellix said:
FSR should have already been forked to implement inference on RDNA3 for higher quality output to gain some points against DLSS.

That's not gonna fly on APUs or consoles.
Useless.
RDNA5 has actual matrix cores so maybe then.

trinibwoy · Mar 9, 2024

Ext3h said:
…a ton of reloads from L2 too as the BVH traversal is purely software based as well.

How would fixed function hardware help with L2 thrashing though? Nvidia’s RT patents talk about a local BVH cache inside the RT unit but AMD may not go that route and just work off the existing L0/L1/LDS in the WGP.

Also what’s stopping the traversal shader from prefetching wide nodes into LDS?

Ext3h · Mar 9, 2024

trinibwoy said:
Nvidia’s RT patents talk about a local BVH cache inside the RT unit

That's probably something different. If you have a cached path all the way to the most recently used triangle soup (or a non-exposed sub-group inside such), and you end up finding a hit there for an (at least somewhat) coherent ray, it's an instant, "free" hit without having to repeat any part of the pointer chase for the traversal.

Guess why their guide tells you not to have overlapping bottom level acceleration structures. It's because that results in a reduced cache hit rate or even RT unit internal cache spilling, even if a sibling or nested ray was coherent as too many structures alias spatially.

trinibwoy said:
Also what’s stopping the traversal shader from prefetching wide nodes into LDS?

There's not really a point in prefetching wide nodes in whole. Too much stuff you are never going to need / hit. Most of it is better streamed and then discarded right away. Either you get a coherent hit to the exact same path (or a prefix of it!), or you are better off re-filtering starting at the best cached approximation.

Even though you do want to "keep" parts of the tree in cache which represent siblings which are also already known to hit.

E.g. when filtering for matching BLAS, you already found a 2nd matching one and you write that straight to the cache as well as an additional entry point for further traversal so you get it "instantly" if the traversal was to resume. While streaming the parent structure, filtering for more than one potential hit was "for free" after all as you already had the memory fetch pipelined...

Dang it, such a cache is actually a pretty smart construct, as you get spatially coherent hits first "by design" (as cached subtrees are explored first), which further provides a massive boost to efficiency of the actual traversal and hit shaders...

PS: No, I did not read the patent. You just said "there is a cache", and the rest is just an extrapolation based on some extremely basic understanding of cache architectures and the implications a loss of coherency would have for traversal performance...

trinibwoy · Mar 9, 2024

Ext3h said:
That's probably something different. If you have a cached path all the way to the most recently used triangle soup (or a non-exposed sub-group inside such), and you end up finding a hit there for an (at least somewhat) coherent ray, it's an instant, "free" hit without having to repeat any part of the pointer chase for the traversal.

No idea if the patent is implemented in any shipping products. But yes, it gives free hits against recently touched geometry. Aside from caching recently used data, the L0 cache also directs the scheduling of ray intersection work, i.e. rays that need data already present in the cache get scheduled first. Replicating these scheduling tricks on the general purpose SIMDs would be challenging given the wavefront level granularity.

"TTU 700 has its own internal small but efficient streaming cache 750 here called an “L0 cache” (“L zero cache” or “level zero cache”). In the non-limiting example shown in FIG. 9, the L0 cache is within the TTU 700 itself. This TTU L0 cache 750 is backed by a larger, more powerful memory system including an additional L1 cache (“level one cache”) and possibly other cache levels such as a level 2 cache etc. ultimately providing access to main memory 140 (see FIG. 1). In the example non-limiting embodiment, L0 cache 750 is used only by and is dedicated to TTU 700. This L0 cache 750 pulls in data for use by the TTU 700 and also schedules use of that data against the rays that want to test against it. The cache 750 performs its scheduling function implicitly through the order in which it streams data down the data path to the other parts of TTU 700."

"To provide high efficiency, the example non-limiting embodiment L0 cache 750 provides ray execution scheduling via the data path into the cache itself. In example non-limiting embodiments, the cache 750 performs its ray execution scheduling based on the order in which it fulfills data requests. In particular, the cache 750 keeps track of which rays are waiting for the same data to be returned from the memory system and then—once it retrieves and stores the data in a cache line—satisfies at about the same time the requests of all of those rays that are waiting for that same data."

Ext3h said:
There's not really a point in prefetching wide nodes in whole. Too much stuff you are never going to need / hit. Most of it is better streamed and then discarded right away. Either you get a coherent hit to the exact same path (or a prefix of it!), or you are better off re-filtering starting at the best cached approximation.

Discarding immediately sounds wasteful, particularly if you're doing any sort of coherency sorting. Multiple rays/wavefronts will likely need the same data.
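The "satisfies at about the same time the requests of all of those rays that are waiting for that same data" behaviour from the patent excerpt quoted above can be sketched roughly like this. It is a toy model, not the patented design: `StreamingL0`, `request` and `fill` are hypothetical names, and the LRU eviction is an assumption, since the quoted text does not describe a replacement policy.

```python
from collections import OrderedDict, defaultdict


class StreamingL0:
    """Toy cache that parks rays on a miss and releases them together on fill."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # address -> data, oldest first
        self.waiting = defaultdict(list)    # address -> ray ids awaiting a fill

    def request(self, ray_id, address):
        """Return (data, fetch_needed).

        On a hit the data is returned at once. On a miss the ray is parked,
        data is None, and fetch_needed is True only for the first ray that
        asks for this address (later rays piggyback on the same fetch).
        """
        if address in self.lines:
            self.lines.move_to_end(address)
            return self.lines[address], False
        first = address not in self.waiting
        self.waiting[address].append(ray_id)
        return None, first

    def fill(self, address, data):
        """Memory returned a line: store it and release every waiting ray."""
        self.lines[address] = data
        self.lines.move_to_end(address)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # assumed LRU eviction
        return self.waiting.pop(address, [])


cache = StreamingL0(capacity=2)
r0 = cache.request(0, 0x100)          # miss, triggers a fetch
r1 = cache.request(1, 0x100)          # miss, piggybacks on that fetch
r2 = cache.request(2, 0x200)          # miss on a different line
released = cache.fill(0x100, "node A")
r3 = cache.request(3, 0x100)          # hit against recently touched data
print(released, r3)
```

The order in which `fill` releases rays is what ends up scheduling the intersection work, which is the part that seems hard to replicate at wavefront granularity on general purpose SIMDs.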
