AMD RDNA4 Architecture Speculation

I don't know if you know this but SER is just a poor man's version of callable shaders/function calls ...

On Intel HW, there is no need to expose an explicit API for SER, since all of the RT shader stages (ray generation/intersection/any hit/closest hit/callable) in the RT pipeline are implemented purely as callable shaders. That makes it trivial for their driver/HW to decide whether or not to spill this state, since their reordering mechanism can run after every function call ...

Exposing a hobbled alternative (SER) that's less powerful/general than current functionality (callable shaders) doesn't seem all that attractive in the eyes of graphics programmers, and I'm not sure the industry is clamouring for more fixed function/"special state" HW. An explicit API for SER might be more interesting if it could be applied to graphics or compute shaders/pipelines, or HW vendors could just straight up expose callable shaders/function calls for all other shader stages ...

Faster & more flexible implementations will be great. I’m not aware of any sorting functionality in RDNA 3 today. Nvidia claims its approach gives the application more control over the sort key and theoretically more opportunity for app specific optimizations. Could be true, could be marketing. I expect RDNA 4 will need some form of ray sorting to be competitive. My guess is that AMD will follow Intel and not require/allow an explicit sort key.
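For anyone unsure what a "sort key" buys you here, a toy sketch (not any vendor's actual mechanism): invocations are regrouped by a key so that each wave runs fewer different hit shaders. The wave width, the shader count, and the names `distinct_shaders_per_wave` and `reorder_by_key` are all made up for illustration.

```python
import random

WAVE_SIZE = 32  # hypothetical wave width, for illustration only


def distinct_shaders_per_wave(shader_ids, wave_size=WAVE_SIZE):
    """Average number of different hit shaders each wave has to run."""
    waves = [shader_ids[i:i + wave_size]
             for i in range(0, len(shader_ids), wave_size)]
    return sum(len(set(w)) for w in waves) / len(waves)


def reorder_by_key(shader_ids, key=lambda s: s):
    """Regroup invocations by a sort key (here simply the shader id)."""
    return sorted(shader_ids, key=key)


rng = random.Random(0)
hits = [rng.randrange(8) for _ in range(1024)]  # 8 made-up hit shaders

before = distinct_shaders_per_wave(hits)
after = distinct_shaders_per_wave(reorder_by_key(hits))
print(f"distinct shaders per wave: unsorted {before:.2f}, sorted {after:.2f}")
```

An app-supplied key would just be a different `key` function (material, ray type, and so on); the sketch ignores the cost of moving the live state around, which is the whole argument in this thread.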
 
They can already do this 'sorting' via a function call, but in the vast majority of cases it's just not worth publicizing for them, since it often results in spilling large amounts of state to memory (due to the 'uber' compute shader implementation), and the faster caches in the memory hierarchy are too small to spill/reload this state ...
 

Well it’s software so you can technically do anything. In practice though sorting isn’t practical without efficient state mgmt. So room for improvement there on RDNA 4 to provide an actually usable (and marketable) implementation.
 
Hard to make performant spilling compatible with a megakernel RT PSO implementation when all your RT shaders are inlined as a single compute shader ...
 

Is that due to the sheer amount of live state required by the ubershader or some other concern? If RDNA 4 moves to HW accelerated traversal as rumored there should be less state for the shader to track.

I’m sure AMD will figure out a coherency sorting option even if it means pivoting away from ubershaders. Assuming they’re serious about RT that is.
 
Is that due to the sheer amount of live state required by the ubershader or some other concern? If RDNA 4 moves to HW accelerated traversal as rumored there should be less state for the shader to track.
Implementing special HW traversal state can potentially free up some of the LDS memory that was used for the traversal state, but it doesn't really help with the shader state itself: you still have to do worst-case register allocation for the most expensive branch, along with allocating resources for the unique variables used in the other branches of the ubershader ...
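A back-of-the-envelope sketch of the worst-case allocation point. The register counts and the per-lane budget below are invented, not measured on any GPU, and taking the maximum is only a lower bound, since variables unique to other branches can push the real number higher.

```python
# Hypothetical per-shader register needs (VGPRs); not measured on any GPU.
HIT_SHADER_VGPRS = {"diffuse": 40, "glass": 72, "skin": 128, "foliage": 56}
REGISTER_BUDGET = 512  # made-up register budget per SIMD lane


def waves_in_flight(vgprs_per_invocation, budget=REGISTER_BUDGET):
    """How many waves fit if each one reserves this many registers."""
    return budget // vgprs_per_invocation


def ubershader_vgprs(shaders):
    # All branches live in one kernel, so the allocation is sized
    # for the most expensive branch (a lower bound on the real cost).
    return max(shaders.values())


uber = ubershader_vgprs(HIT_SHADER_VGPRS)
print("ubershader:", uber, "VGPRs ->", waves_in_flight(uber), "waves")
for name, regs in HIT_SHADER_VGPRS.items():
    print(f"{name}: {regs} VGPRs -> {waves_in_flight(regs)} waves if compiled alone")
```

Every invocation pays for the most expensive branch even when it only runs the cheap one, which is why freeing LDS alone doesn't change the picture.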
I’m sure AMD will figure out a coherency sorting option even if it means pivoting away from ubershaders. Assuming they’re serious about RT that is.
You would think so, but pathological cases like RTX Remix show us that the pure inline RT path, with no option to do any reordering, is actually faster on Intel HW than their mixed inline RT/PSO path!

That strongly implies that the overhead from spilling to reorder the shader invocations outweighs the inlining optimizations gained with inline RT on Intel HW ...

For NV, I theorize that the fact that you can only do SER in the ray generation shader stage (as opposed to Intel, where it works for ALL RT shader stages since all stages get lowered to callable shaders) means they have some sort of special shader state that provides an optimal HW path for function calls/spilling specifically in the ray generation shader stage. (Does this mean that they have an actual HW shader stage for ray generation shaders, given this special state compared to other RT shader stages?) In RTX Remix, they have a hybrid inline RT/PSO approach on their latest architectures where they do inline RT inside their ray generation shaders, presumably so that they can have access to both SER and inlining optimizations and control the balance between them ...

With AMD and Intel, they represent the two different extremes between inlining vs spilling respectively. On NV, you have a zoo of special HW states, which admittedly makes for a fast HW RT implementation, but their programming model is highly restrictive (no flexible traversal, specific design patterns needed to use the HW optimally) and discourages all sorts of experimentation on a programming level. Some graphics programmers ideally want function calls (Intel) outside of ray tracing (AMD) ...

I think Andrew characterized it best in his presentation years ago: they want both flexibility (AMD/Intel) and speed (NV) ...
 
So, RDNA5 could be the first RT "native" GPU from AMD? Given how minimalist the RT implementation is in their current architecture iteration, the rumors of skipping large-die RDNA4 do have merit if something more fundamental is being cooked up in the meantime.
It reminds me of the time before the Vega launch: so much hype, cool features like Primitive Shaders, HBCC, Draw Stream Binning Rasterizer, NGG, and then .... a cold shower :no:

 
Implementing special HW traversal state can potentially free up some of the LDS memory that was used for the traversal state, but it doesn't really help with the shader state itself: you still have to do worst-case register allocation for the most expensive branch, along with allocating resources for the unique variables used in the other branches of the ubershader ...

That makes sense and is another compelling reason for AMD to rethink the ubershader approach going forward. How does this work in practice in an engine like UE5 that supports tons of materials and therefore lots of hit shader permutations? In a multi-bounce PT kernel on a modern game it could get nasty.

You would think so, but pathological cases like RTX Remix show us that the pure inline RT path, with no option to do any reordering, is actually faster on Intel HW than their mixed inline RT/PSO path!

That strongly implies that the overhead from spilling to reorder the shader invocations outweighs the inlining optimizations gained with inline RT on Intel HW ...

Interesting. That’s contrary to Intel’s own advice. Maybe remixes of old games don’t use a lot of different materials or shader permutations and divergence isn’t as much of a problem? BVH complexity is also a lot lower.

For NV, I theorize that the fact that you can only do SER in the ray generation shader stage (as opposed to Intel, where it works for ALL RT shader stages since all stages get lowered to callable shaders) means they have some sort of special shader state that provides an optimal HW path for function calls/spilling specifically in the ray generation shader stage. (Does this mean that they have an actual HW shader stage for ray generation shaders, given this special state compared to other RT shader stages?) In RTX Remix, they have a hybrid inline RT/PSO approach on their latest architectures where they do inline RT inside their ray generation shaders, presumably so that they can have access to both SER and inlining optimizations and control the balance between them ...

Even if there was some custom HW for ray gen they would still need to handle the regular register & LDS state. So yeah it’s curious why Nvidia would limit sorting to ray gen only. Intel has taken a considerably different approach where sorting is implicit at dispatch whereas Nvidia allows you to sort arbitrarily at any point in the shader. Not sure I would call that less flexible.

With AMD and Intel, they represent the two different extremes between inlining vs spilling respectively.

Yeah and I don’t see how AMD continues down the inlining route while introducing support for coherency sorting.
 
It reminds me of the time before the Vega launch: so much hype, cool features like Primitive Shaders, HBCC, Draw Stream Binning Rasterizer, NGG, and then .... a cold shower :no:
We honestly just have to wait for details and reviews before judging what it's doing with RT. So many rumors about RDNA features and whatnot end up being quite wrong. And there's been all sorts of conflicting information on this, especially in regards to ray tracing, so we shouldn't just jump to believing the most recent rumor as the most credible one.
 
That makes sense and is another compelling reason for AMD to rethink the ubershader approach going forward. How does this work in practice in an engine like UE5 that supports tons of materials and therefore lots of hit shader permutations? In a multi-bounce PT kernel on a modern game it could get nasty.
On AMD, it's purely by design to avoid spilling at nearly all costs, since they don't have enough cache memory to hold all the shader state needed to reshuffle the shader invocations ...
Interesting. That’s contrary to Intel’s own advice. Maybe remixes of old games don’t use a lot of different materials or shader permutations and divergence isn’t as much of a problem? BVH complexity is also a lot lower.
The artist can introduce "custom materials" (it might be using generative AI for this) since its renderer can do custom textures ...

Divergence is naturally less of a problem on Intel graphics regardless, since they ban SIMD32 mode for ray tracing altogether, so it's possible to envision some wins from inlining optimizations in SIMD8/16 modes ...
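To put a rough number on the SIMD width point, a small idealized estimate. It assumes every lane picks one of `num_shaders` hit shaders uniformly and independently, which real scenes won't do; the function name and the shader count are illustrative only.

```python
def expected_distinct_shaders(num_shaders, wave_width):
    """Expected number of different shaders inside one wave, assuming each
    lane picks one of num_shaders uniformly and independently."""
    p_unused = (1.0 - 1.0 / num_shaders) ** wave_width
    return num_shaders * (1.0 - p_unused)


for width in (8, 16, 32):
    d = expected_distinct_shaders(16, width)
    # If the wave runs each distinct shader one after another,
    # on average only 1/d of its lanes do useful work per pass.
    print(f"SIMD{width}: ~{d:.1f} distinct shaders, ~{100 / d:.0f}% lane utilization")
```

Narrower waves see fewer distinct shaders each, so they lose less to divergence before any reordering happens.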
Even if there was some custom HW for ray gen they would still need to handle the regular register & LDS state. So yeah it’s curious why Nvidia would limit sorting to ray gen only. Intel has taken a considerably different approach where sorting is implicit at dispatch whereas Nvidia allows you to sort arbitrarily at any point in the shader. Not sure I would call that less flexible.
On Intel, the spilling happens at "function call boundaries" and they can do this for any arbitrary RT shader stage ...
Yeah and I don’t see how AMD continues down the inlining route while introducing support for coherency sorting.
Inlining optimizations can still happen with an RT PSO implementation, depending on the vendor's driver compiler, as can be seen on AMD, but RT pipelines really tie their hands, since it becomes really easy to end up spilling in this model ...

Besides, if function calls/spilling were truly somehow the 'future', then we would ditch the PSO model altogether and do separate shader objects and callable shaders EVERYWHERE, but even I don't think Intel wants to commit purely to an RT SSO model where spilling truly becomes the literal ONLY OPTION over there, despite having the most advanced implementation for function calls on their hardware outside of Apple ... (I don't think Nvidia wants spilling only either, since they use inline RT with RT PSOs)

 
That makes sense and is another compelling reason for AMD to rethink the ubershader approach going forward. How does this work in practice in an engine like UE5 that supports tons of materials and therefore lots of hit shader permutations? In a multi-bounce PT kernel on a modern game it could get nasty.

For that, I'm surprised no one else has figured out what it seems Crytek has been doing for quite a while now, which is to build the materials into a unique texture atlas as you build the BVH.

I.E. Just procedurally megatexture the whole thing. It's RT, it can be super low res lambertian only clipmaps of the high res versions, people won't notice. Heck since it's all timesliced you can easily compress it all into BC7 then shade visibility buffer style, saving even more memory and bandwidth.

This is kinda what UE half does with its cards and radiance cache, but with newer research like hashgrids or half-edge texture mapping you can do all objects seamlessly as you go along. You'll need to pay the per-object ubershader cost as you build, but with the right timeslicing/work splitting it should be fairly efficient.
 
Does N3P use TSMC High-NA EUV? Or was that for the 2nm and below performance nodes?
Not even those, they're planning to do even A14 (or whatever they'll end up calling it) with normal EUV and expect it might actually still be cheaper despite multipatterning requirements. Some reports said they might not do High NA before 2030
 
Damn, foundries are pretty greedy. I would have used low-NA EUV at the 8nm node and High-NA at 3nm, but I guess them's the breaks.
 
The first production High-NA tool from ASML was shipped to Intel in Dec '23.
They have to install it, tune it and run quals, then finish developing the process before they start wafers with customers' designs.

First production EUV tool was shipped in 2013. EUV obviously had other complications (yield and mask) that didn't really get worked out until ~2016 and mass production didn't start until 2018.
 