AMD RDNA4 Architecture Speculation

This post’s accuracy is not substantiated. Exercise discretion and wait for verified sources before accepting its claims.
Random stab:

40 CU / 20 WGP, 12 GB
20 CU / 10 WGP, 9 GB?
3 GHz, N4P

AMD trying to concentrate on their biggest market (mobile) until RDNA5 which is scheduled for "next year when it's ready".
 
This post’s accuracy is not substantiated. Exercise discretion and wait for verified sources before accepting its claims.
Are you saying you know 10 WGP discrete doesn't exist from insider information?
Yes, that segment is dead for both vendors.
For what it's worth, your original statement read to me as "There is definitively no 20 WGP discrete GPU in the RDNA4 generation based on what I've heard from insiders" as well.
No, it's "both dies are bigger".
Which they are.
but I don't see why 20 WGP would be completely out of the question.
It kinda overlaps Strix-halo in some ways.
even hawk1 clocks better than previous RDNA3 thingies (especially at low power).
 
BTW, is AMD on an annual cadence regarding changing architecture? Or every two to three years?
To my knowledge the latest thing we have is AMD's roadmap from mid 2022:



Releases are generally around 18-24 months apart. Vega was mid 2017, RDNA 1 mid 2019, RDNA 2 end 2020, RDNA 3 end 2022. RDNA 4 rumours point to roughly mid 2024 and I think RDNA 5 rumours are end 2025, so potentially both around 18 months.
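To make the cadence arithmetic explicit, here is a rough sketch in Python. It assumes "mid" means June and "end" means December, which is only an approximation for illustration, and the RDNA 4 / RDNA 5 entries are the rumoured dates from this post, not confirmed launches.

```python
# Approximate launch dates as (name, year, month).
# ASSUMPTION: "mid" = June, "end" = December; rumoured entries are unconfirmed.
RELEASES = [
    ("Vega", 2017, 6),
    ("RDNA 1", 2019, 6),
    ("RDNA 2", 2020, 12),
    ("RDNA 3", 2022, 12),
    ("RDNA 4 (rumoured)", 2024, 6),
    ("RDNA 5 (rumoured)", 2025, 12),
]


def gaps_in_months(releases):
    """Return (from, to, months) for each consecutive pair of releases."""
    gaps = []
    for (prev_name, py, pm), (name, y, m) in zip(releases, releases[1:]):
        gaps.append((prev_name, name, (y - py) * 12 + (m - pm)))
    return gaps


if __name__ == "__main__":
    for prev_name, name, months in gaps_in_months(RELEASES):
        print(f"{prev_name} -> {name}: ~{months} months")
```

With those month assumptions the gaps come out as 24, 18, 24, 18 and 18 months, which lines up with the 18-24 month range.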
 
That would be very disappointing

I'd kinda expect the highest end one to be competitive with like, 4070 Ti Super in say, Frontiers of Pandora (1440p 60 on Ultimate settings). Honestly, the price for performance matters more than relative performance between RT and non RT. If it's $649 with a pack-in of Star Wars Outlaws, that'd probably sound damn good to potential buyers versus Nvidia, even if they drop the 4070 Ti Super by $100.

If, say, Horizon Forbidden West runs (relatively) 25% faster, that's just a bonus. That being said, AMD would have to bloody learn how to PR to sell that appealing message.
 
I'd kinda expect the highest end one to be competitive with like, 4070 Ti Super in say, Frontiers of Pandora (1440p 60 on Ultimate settings). Honestly, the price for performance matters more than relative performance between RT and non RT. If it's $649 with a pack-in of Star Wars Outlaws, that'd probably sound damn good to potential buyers versus Nvidia, even if they drop the 4070 Ti Super by $100.

If, say, Horizon Forbidden West runs (relatively) 25% faster, that's just a bonus. That being said, AMD would have to bloody learn how to PR to sell that appealing message.

Yeah, but will the 4070 be its competition, or a 5060/5070? It's hard to compare it to something that is on the market now and expect the market not to change over what is likely another 6-8 months before the product releases.

You also have to factor in power draw and heat output. So the Nvidia competition could trump them on everything except price.
 
MOD MODE AGAIN: I've cleaned house. If it isn't about RDNA4, then it doesn't belong in this thread. If you want to discuss efficiency of prior gens, you're absolutely welcome (and very strongly encouraged) to do it in another thread.
 
Thankfully it's an RGT rumor, so entirely safe to discard its credibility.
So far it's plausible though. Unless we see new instructions which hint at a fully offloaded, fixed function traversal approach, it's still going to be the same limitations as with RDNA3 - too much register pressure, too much synchronous alternation between hit shaders and triangle soup filtering in the shader code, with a ton of reloads from L2 too as the BVH traversal is purely software based as well.

And you can't just continue the RDNA3 route of increasing register counts even further, just so you can keep more rays in flight, just so you can better utilize the units in the shader array aside from triangle filter, while simultaneously getting hit even worse by cache spilling.

That article had summarized it quite well: https://chipsandcheese.com/2023/03/22/raytracing-on-amds-rdna-2-3-and-nvidias-turing-and-pascal/

Specifically, how a fixed function traversal unit, which can asynchronously pre-fetch upcoming hits (buffered into LDS), enables the use of ultra-wide (and thereby usually unfeasibly expensive...) tree nodes / triangle groups (don't assume they wouldn't have an internal subdivision though - just that it's ISA-specific while the outer, observable layout is shader defined) and in return reduces the memory latency penalty that a deeper structure would cost instead.
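The register pressure point can be put as back-of-envelope arithmetic. The sketch below is only illustrative: the per-lane register budgets (1024 and 1536) and the cap of 16 resident waves are assumed placeholder values, not confirmed figures for any specific chip, and `waves_in_flight` is a hypothetical helper that ignores allocation granularity.

```python
def waves_in_flight(vgpr_budget_per_lane, vgprs_per_wave, max_waves=16):
    """Waves one SIMD can keep resident, limited by register-file capacity.

    ASSUMPTION: a simple floor division; real hardware allocates registers
    in blocks, so actual occupancy can be lower.
    """
    if vgprs_per_wave <= 0:
        raise ValueError("vgprs_per_wave must be positive")
    return min(max_waves, vgpr_budget_per_lane // vgprs_per_wave)


if __name__ == "__main__":
    # A register-hungry traversal shader (128 VGPRs is a made-up example).
    for budget in (1024, 1536):
        print(budget, "->", waves_in_flight(budget, 128), "waves")
```

Under these assumptions a 128-VGPR shader goes from 8 to 12 resident waves when the budget grows from 1024 to 1536, i.e. more rays in flight, but every extra wave also brings its own working set to the caches, which is the spilling problem described above.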
 
And you can't just continue the RDNA3 route of increasing register counts even further, just so you can keep more rays in flight, just so you can better utilize the units in the shader array aside from triangle filter, while simultaneously getting hit even worse by cache spilling.
I wonder if AMD was kind of cornered by its market position to "brute force" the RDNA3 design -- just more of everything (caches, VGPRs and constipated dual-issue) and very few specific/targeted improvements. And what's up with WMMA support? FSR should have already been forked to implement inference on RDNA3 for higher quality output to gain some points against DLSS.
 
This post’s accuracy is not substantiated. Exercise discretion and wait for verified sources before accepting its claims.
I wonder if AMD was kind of cornered by its market position to "brute force" the RDNA3 design -- just more of everything (caches, VGPRs and constipated dual-issue)
It has less MALL, mainstream WGP has the same vRF and the thing is made for clocks anyway.
Area is down iso WGP count iso node so idk your point.
FSR should have already been forked to implement inference on RDNA3 for higher quality output to gain some points against DLSS.
That's not gonna fly on APUs or consoles.
Useless.
RDNA5 has actual matrix cores so maybe then.
 
…a ton of reloads from L2 too as the BVH traversal is purely software based as well.

How would fixed function hardware help with L2 thrashing though? Nvidia’s RT patents talk about a local BVH cache inside the RT unit but AMD may not go that route and just work off the existing L0/L1/LDS in the WGP.

Also what’s stopping the traversal shader from prefetching wide nodes into LDS?
 
Nvidia’s RT patents talk about a local BVH cache inside the RT unit
That's probably something different. If you have a cached path all the way to the most recently used triangle soup (or a non-exposed sub-group inside such), and you end up finding a hit there for an (at least somewhat) coherent ray, it's an instant, "free" hit without having to repeat any part of the pointer chase for the traversal.

Guess why their guide tells you not to have overlapping bottom level acceleration structures. It's because that results in a reduced cache hit rate or even RT unit internal cache spilling, even if a sibling or nested ray was coherent as too many structures alias spatially.

Also what’s stopping the traversal shader from prefetching wide nodes into LDS?
There's not really a point in prefetching wide nodes in whole. Too much stuff you are never going to need / hit. Most of it is better streamed and then discarded right away. Either you get a coherent hit to the exact same path (or a prefix of it!), or you are better off re-filtering starting at the best cached approximation.

Even though you do want to "keep" parts of the tree in cache which represent siblings which are also already known to hit.

E.g. when filtering for matching BLAS, you already found a 2nd matching one and you write that straight to the cache as well as an additional entry point for further traversal so you get it "instantly" if the traversal was to resume. While streaming the parent structure, filtering for more than one potential hit was "for free" after all as you already had the memory fetch pipelined...

Dang it, such a cache is actually a pretty smart construct, as you get spatially coherent hits first "by design" (as cached subtrees are explored first), which further provides a massive boost to efficiency of the actual traversal and hit shaders...

PS: No, I did not read the patent. You just said "there is a cache", and the rest is just an extrapolation based on some extremely basic understanding of cache architectures and the implications a loss of coherency would have for traversal performance...
 
That's probably something different. If you have a cached path all the way to the most recently used triangle soup (or a non-exposed sub-group inside such), and you end up finding a hit there for an (at least somewhat) coherent ray, it's an instant, "free" hit without having to repeat any part of the pointer chase for the traversal.

No idea if the patent is implemented in any shipping products. But yes it gives free hits against recently touched geometry. Aside from caching recently used data the L0 cache also directs the scheduling of ray intersection work. i.e. rays that need data already present in the cache get scheduled first. Replicating these scheduling tricks on the general purpose SIMDs would be challenging given the wavefront level granularity.

"TTU 700 has its own internal small but efficient streaming cache 750 here called an “L0 cache” (“L zero cache” or “level zero cache”). In the non-limiting example shown in FIG. 9, the L0 cache is within the TTU 700 itself. This TTU L0 cache 750 is backed by a larger, more powerful memory system including an additional L1 cache (“level one cache”) and possibly other cache levels such as a level 2 cache etc. ultimately providing access to main memory 140 (see FIG. 1). In the example non-limiting embodiment, L0 cache 750 is used only by and is dedicated to TTU 700. This L0 cache 750 pulls in data for use by the TTU 700 and also schedules use of that data against the rays that want to test against it. The cache 750 performs its scheduling function implicitly through the order in which it streams data down the data path to the other parts of TTU 700."

"To provide high efficiency, the example non-limiting embodiment L0 cache 750 provides ray execution scheduling via the data path into the cache itself. In example non-limiting embodiments, the cache 750 performs its ray execution scheduling based on the order in which it fulfills data requests. In particular, the cache 750 keeps track of which rays are waiting for the same data to be returned from the memory system and then—once it retrieves and stores the data in a cache line—satisfies at about the same time the requests of all of those rays that are waiting for that same data."

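The scheduling behaviour in the quoted passages (rays waiting on the same data are satisfied together once the line arrives) can be sketched as a toy model. This is not how the TTU is actually built; `StreamingCacheSketch` and its methods are hypothetical names made up for illustration, and the model ignores capacity, eviction and latency entirely.

```python
class StreamingCacheSketch:
    """Toy model: rays waiting on the same address are released together."""

    def __init__(self):
        # address -> ray ids waiting for that data, in first-request order
        self.pending = {}

    def request(self, ray_id, address):
        """Queue a ray; return True if this triggers a new memory fetch."""
        is_new_fetch = address not in self.pending
        self.pending.setdefault(address, []).append(ray_id)
        return is_new_fetch

    def fill(self, address):
        """Data for `address` arrived: release every ray waiting on it."""
        return self.pending.pop(address, [])


if __name__ == "__main__":
    cache = StreamingCacheSketch()
    for ray_id, address in [(0, 0x100), (1, 0x200), (2, 0x100), (3, 0x100)]:
        cache.request(ray_id, address)
    print(cache.fill(0x100))  # rays 0, 2 and 3 are satisfied by one fetch
```

The point of the sketch is that the order in which data comes back decides which rays run next, so the scheduling falls out of the data path rather than from a separate scheduler.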

There's not really a point in prefetching wide nodes in whole. Too much stuff you are never going to need / hit. Most of it is better streamed and then discarded right away. Either you get a coherent hit to the exact same path (or a prefix of it!), or you are better off re-filtering starting at the best cached approximation.

Discarding immediately sounds wasteful particularly if you're doing any sort of coherency sorting. Multiple rays/wavefronts will likely need the same data.
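As a minimal illustration of what "coherency sorting" could mean here, the sketch below bins rays by the sign octant of their direction, so rays likely to walk similar parts of the tree get processed together. Octant binning is just one simple heuristic picked for the example, not a claim about what any vendor's hardware or driver does, and the function names are hypothetical.

```python
from collections import defaultdict


def direction_octant(direction):
    """Map a 3D direction to 0..7 using the sign of each component."""
    dx, dy, dz = direction
    return int(dx < 0) | (int(dy < 0) << 1) | (int(dz < 0) << 2)


def bin_rays_by_octant(rays):
    """Group (ray_id, direction) pairs so similar rays are handled together."""
    bins = defaultdict(list)
    for ray_id, direction in rays:
        bins[direction_octant(direction)].append(ray_id)
    return dict(bins)


if __name__ == "__main__":
    rays = [
        (0, (1.0, 1.0, 1.0)),
        (1, (-1.0, 1.0, 1.0)),
        (2, (0.5, 0.2, 0.9)),
        (3, (-0.3, 2.0, 1.0)),
    ]
    print(bin_rays_by_octant(rays))
```

Rays 0 and 2 land in one bin and rays 1 and 3 in another, so any node data fetched for the first ray in a bin has a decent chance of being reused by the next one if it is still cached.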
 