AMD: RDNA 3 Speculation, Rumours and Discussion

Well that's the point - we don't really know how much effort it would take. In many areas this effort likely won't need to be huge and in some cases tools may be just as good already.
Who have we seen so far who dropped their own tech in favor of UE5? CDPR? AFAIU they still plan to develop their own RED Engine; it's just that building a new team for the new Witcher games is easier when using UE.
Crystal Dynamics? It's unclear so far whether it's a "switch" or they are doing something similar to CDPR and using UE5 to build up their new Austin studio.
Anyone else?

Those are obviously the newest announcements.

XL Games is switching from CryEngine to UE5 for ArcheAge 2.

Arkane Studios is switching to UE5 for their next game, Redfall. Their last game, Deathloop, used their in-house Void Engine.

GSC Game World dropped their in-house X-Ray Engine in favor of using UE5 for Stalker 2.

Those are just the ones off the top of my head, without looking anything up. There are also others that I'm not at liberty to talk about because nothing has been announced.

Regards,
SB
 
I obviously don't have any data, but my feeling is that this past generation the usage of UE4 in AAA projects has actually been smaller than it was with UE3 in the PS3/360 era.
So if UE5 wins some AAA projects back from in-house or other engines, it will be more of a return to how the AAA scene was back in the PS3/360 days than a complete substitution of proprietary tech or engines like Unity.

One possible shift along those lines might be EA/BioWare going from UE to Frostbite and now back to UE.
 
Those are obviously the newest announcements.

XL Games is switching from CryEngine to UE5 for ArcheAge 2.

Arkane Studios is switching to UE5 for their next game, Redfall. Their last game, Deathloop, used their in-house Void Engine.

GSC Game World dropped their in-house X-Ray Engine in favor of using UE5 for Stalker 2.

Those are just the ones off the top of my head, without looking anything up. There are also others that I'm not at liberty to talk about because nothing has been announced.

Regards,
SB
Out of these only Arkane is AAA I'd say, and the Redfall studio in Austin used Unreal for Dishonored 1 and then CryEngine for Prey, so it's not really a big win for UE5 IMO, more like a choice of 3rd party tech for the next project by a studio which always uses 3rd party tech.
 
The same doom and gloom threats for other engines are thrown around every time there's a major update to Unreal Engine. So far none of them have come true.

History repeats itself ;) I think these kinds of things are what drive people/forums; we need doom & gloom to keep things interesting.
 
Out of these only Arkane is AAA I'd say, and the Redfall studio in Austin used Unreal for Dishonored 1 and then CryEngine for Prey, so it's not really a big win for UE5 IMO, more like a choice of 3rd party tech for the next project by a studio which always uses 3rd party tech.

In the Asian gaming sphere (SEA, China) and extending into Russia, XL Games is a AAA developer with development budgets similar to those of Western and especially Japanese AAA devs.

Regards,
SB
 
There is a better API design available on AMD HW for the ray tracing pipeline, but it would involve developers resorting to driver extensions like AGS, where you can replace the TraceRay() intrinsic with a TraceRayAMD() intrinsic ...

All ray tracing shaders get compiled as compute shaders on AMD HW, so the only possible advantage of specialized ray tracing shaders is lower register usage, but the risk of introducing more distinct shaders, as opposed to using an "uber-material shader", is that function calls stop getting inlined, which causes spilling in the process ...
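
To make that tradeoff concrete, here is a minimal C++-style sketch (my own illustration with made-up types; not HLSL and not actual driver output) of the "uber-material shader" pattern being described: one entry point, one big switch, fully inlinable, but register-allocated for the worst-case branch.

```cpp
// Conceptual sketch of an "uber-material" hit shader compiled as one compute-
// style kernel. Everything can be inlined (no call overhead, no argument
// spilling), but the register budget is set by the most expensive branch,
// even for waves that only ever take the cheap one.
#include <cstdint>

struct HitInfo  { std::uint32_t materialId; float t, u, v; };
struct Radiance { float r, g, b; };

// Cheap branch: few live values.
static Radiance shadeDiffuse(const HitInfo&) { return {0.5f, 0.5f, 0.5f}; }

// Expensive branch: stand-in for a complex layered material with many live values.
static Radiance shadeLayered(const HitInfo& h)
{
    float f = h.u * h.v + h.t;
    return {f, f * 0.5f, f * 0.25f};
}

// The "uber" shader: a single entry point with one big switch over materials.
Radiance shadeUber(const HitInfo& hit)
{
    switch (hit.materialId) {
    case 0:  return shadeDiffuse(hit);
    default: return shadeLayered(hit);   // worst case drives register allocation
    }
}
```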

Ubershaders can work for simple visibility queries for RT AO or shadows but aren’t a viable general purpose RT solution. Once you get into multi bounce use cases or path tracing don’t you have to spill state anyway? The only difference is whether you manage it yourself or get help from the hardware.

The Basemark RT benchmark isn’t pretty but it does some interesting things like reflections of reflections. I’m guessing they’re doing separate passes for each “layer” of reflections.
 
The Basemark RT benchmark isn’t pretty but it does some interesting things like reflections of reflections.
Like here?:

[embedded video: AMD RDNA 2 hardware ray tracing demo]
Now, whether AMD ping-pongs via memory for each reflection pass, who knows. But it seems unlikely in this demo.

Microsoft® DirectX® Raytracing (DXR) adds a new level of graphics realism to video games previously only available in movies, and AMD has collaborated with Microsoft on the design of DXR 1.1, an update to DXR that can deliver better efficiency and performance in many raytracing effects. This video is to give you a taste of the photorealistic realism DXR 1.1 will enable when using hardware-accelerated raytracing on our upcoming AMD RDNA 2 gaming architecture.
 
Like here?:

[embedded video: AMD RDNA 2 hardware ray tracing demo]
Now, whether AMD ping-pongs via memory for each reflection pass, who knows. But it seems unlikely in this demo.

I probably need to look at it on a bigger screen but I’m not seeing reflections of reflections in that demo.

It’s easier to see in Basemark.

 
Once you get into multi bounce use cases or path tracing don’t you have to spill state anyway?

I don't think you understood what I meant ...

[image attachment]


The reason why shader tables are potentially suboptimal from AMD's perspective is that function arguments could spill into the LDS depending on the number of shaders in the table ...

Also, if you're going to use the argument of performance to discredit AMD's implementation of RT, then you might as well do the same for every other IHV in regard to legit multi-bounce lighting solutions, because virtually no real-time application does multi-bounce lighting without hacks like temporal accumulation or caching ...
 
The reason why shader tables are potentially suboptimal from AMD's perspective is that function arguments could spill into the LDS depending on the number of shaders in the table ...

Got it. So is the issue that there’s no real support for callable shaders and everything is inlined? That would explain AMD’s stance.

Also, if you're going to use the argument of performance to discredit AMD's implementation of RT, then you might as well do the same for every other IHV in regard to legit multi-bounce lighting solutions, because virtually no real-time application does multi-bounce lighting without hacks like temporal accumulation or caching ...

Discredit? What I said was that ubershaders aren’t viable as a general purpose RT solution. I suppose you could loop over bounces and try to sort rays within the workgroup on each iteration but that would require shuffling a lot of data around. Also how would it work for transparencies? Ultimately you’ll need some form of recursion either by explicitly writing out state after each bounce or having the hardware manage it for you. Do you disagree with that?

Yes multi-bounce is slow everywhere. Nvidia recommends a maximum of 2 bounces. First bounce for reflection. Second for shadowing the reflected object. Intel on the other hand seems to be very proud of their coherency sorting hardware for handling multiple bounces more efficiently.

RDNA 3 needs to tackle 3 RT problems and hopefully does all 3.

1. SIMD traversal is no good for incoherent ray packets, e.g. random sampling for GI. (Nvidia and Intel are MIMD)
2. Developers must manage coherency of ray casting. Hardware can’t help. (Nvidia is trying)
3. Developers must manage coherency of shading. Hardware can’t help. (Intel is trying; a sketch of what this means in software follows below)
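
For point 3, a minimal sketch (my own illustration, not anything any IHV actually ships; the types and names are made up) of what managing shading coherency in software tends to look like: bin hit records by material ID before shading, so each wave mostly runs one material branch.

```cpp
// Illustration only: software-side shading coherency. Instead of shading hits
// in ray order (random materials per wave), bucket them by material ID first,
// then shade each bucket together so SIMD lanes take the same branch.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Hit { std::uint32_t materialId; std::uint32_t rayIndex; };

// Returns the hits regrouped so that all hits with the same material are contiguous.
std::vector<Hit> binByMaterial(const std::vector<Hit>& hits)
{
    std::unordered_map<std::uint32_t, std::vector<Hit>> buckets;
    for (const Hit& h : hits)
        buckets[h.materialId].push_back(h);

    std::vector<Hit> grouped;
    grouped.reserve(hits.size());
    for (auto& [material, bucket] : buckets)          // one "dispatch" per material
        grouped.insert(grouped.end(), bucket.begin(), bucket.end());
    return grouped;
}
```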
 
Got it. So is the issue that there’s no real support for callable shaders and everything is inlined? That would explain AMD’s stance.

What does "real support" mean exactly ? "Callable shaders" is just another one of the many API abstractions out there ...

Intel has some limited form of dynamic dispatch (bindless dispatch/shaders) that's exclusive to it's ray tracing pipeline but there's no way to exploit this ability in other pipelines (i.e. the regular graphics or compute pipeline). With AMD, callable shaders or the shader binding table in general is just a compute shaders with big switching statements. I'll admit that it's indeed ideal to have the shaders inlined over there, yes ...
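
As a rough mental model of that "big switch" lowering (purely illustrative; the record layout and indexing here are made up, this is not AMD's actual compiler output), imagine the shader binding table being collapsed into one compute kernel that switches on the record's shader index:

```cpp
// Rough mental model of callable/hit shaders from a shader binding table being
// lowered into a single kernel with a switch on the SBT record index.
// Record layout and indexing are invented for illustration.
#include <cstdint>
#include <vector>

struct ShaderRecord { std::uint32_t shaderIndex; std::uint32_t rootConstant; };

static void callableA(std::uint32_t) { /* cheap shader body */ }
static void callableB(std::uint32_t) { /* expensive shader body */ }

// What "calling" a shader from the table conceptually compiles down to:
void callShader(const std::vector<ShaderRecord>& sbt, std::uint32_t recordIndex)
{
    const ShaderRecord& rec = sbt[recordIndex];
    switch (rec.shaderIndex) {   // more table entries -> bigger switch, more live state
    case 0:  callableA(rec.rootConstant); break;
    case 1:  callableB(rec.rootConstant); break;
    default: break;
    }
}
```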

Discredit? What I said was that ubershaders aren’t viable as a general purpose RT solution. I suppose you could loop over bounces and try to sort rays within the workgroup on each iteration but that would require shuffling a lot of data around. Also how would it work for transparencies? Ultimately you’ll need some form of recursion either by explicitly writing out state after each bounce or having the hardware manage it for you. Do you disagree with that?

That's entirely dependent on factors like hardware and material complexity. If we take Quake II RTX as an example of one of the few games with multi-path lighting effects, it's not implemented with either recursion or loops but with a unique PSO per path (it may not use an ubershader?) or with inline RT ...

Recursion in general is not a hard requirement for implementing multi-path lighting effects. We're very far away from applying "general purpose" RT solutions as is, with modern AAA games using roughness cutoffs, simpler material shading, temporal accumulation, avoiding transparent/scattering materials altogether, etc., since it's way too early to be insistent on anything ...
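
For what it's worth, the usual way to show recursion isn't required is to carry a throughput term through a flat loop. A minimal CPU-side sketch (my own, with stand-in types; not Quake II RTX's actual code) of the idea:

```cpp
// Minimal illustration that multi-bounce lighting does not need recursion:
// carry the path throughput through a flat loop instead of recursive calls.
// All types here are stand-ins, not any real RT API.
struct Vec3 { float x = 0, y = 0, z = 0; };
struct Ray  { Vec3 origin, dir; };
struct Hit  { bool valid = false; Vec3 emitted, reflectance; Ray next; };

// Stub standing in for the actual trace: real code would query a BVH here.
static Hit traceClosestHit(const Ray&) { return {}; }

Vec3 shadePath(Ray ray, int maxBounces)
{
    Vec3 radiance;
    Vec3 throughput{1.0f, 1.0f, 1.0f};
    for (int bounce = 0; bounce < maxBounces; ++bounce) {
        Hit hit = traceClosestHit(ray);
        if (!hit.valid) break;                         // ray escaped the scene
        radiance.x += throughput.x * hit.emitted.x;    // accumulate emission
        radiance.y += throughput.y * hit.emitted.y;
        radiance.z += throughput.z * hit.emitted.z;
        throughput.x *= hit.reflectance.x;             // attenuate along the path
        throughput.y *= hit.reflectance.y;
        throughput.z *= hit.reflectance.z;
        ray = hit.next;                                // next bounce, no recursion
    }
    return radiance;
}
```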
 
What does "real support" mean exactly ? "Callable shaders" is just another one of the many API abstractions out there ...

Meaning it’s more than just API syntax and is actually scheduled and executed as a composable function ala Intel and also per Microsoft’s intent.

From the DXR spec: Implementations are expected to schedule callable shaders for execution separately from the calling shader, as opposed to the code being optimally inlined with the caller.

Intel has some limited form of dynamic dispatch (bindless dispatch/shaders) that's exclusive to its ray tracing pipeline, but there's no way to exploit this ability in other pipelines (i.e. the regular graphics or compute pipeline). With AMD, callable shaders, or the shader binding table in general, are just compute shaders with big switch statements. I'll admit that it's indeed ideal to have the shaders inlined over there, yes ...

Right so isn’t it AMD’s choice of implementation that’s imposing additional limitations on the usage of callable shaders?

That's entirely dependent on factors like hardware and material complexity. If we take Quake II RTX as an example of one of the few games with multi-path lighting effects, it's not implemented with either recursion or loops but with a unique PSO per path (it may not use an ubershader?) or with inline RT ...

Interesting, is that documented somewhere? So there’s no shader table and every nth bounce uses the same nth shader for every ray?

Recursion in general is not a hard requirement for implementing multi-path lighting effects. We're very far away from applying "general purpose" RT solutions as is, with modern AAA games using roughness cutoffs, simpler material shading, temporal accumulation, avoiding transparent/scattering materials altogether, etc., since it's way too early to be insistent on anything ...

That’s a bit of a chicken and egg problem. Current games are limited by current hardware. It shouldn’t mean that it’s ok for future hardware to double down on those limitations. I like the way Intel is tackling the problem head on. Of course they have to prove their hardware works but at least they’re being proactive about it.
 
It's an interesting idea that there's one "MCD" and it, perhaps, has all of the GDDR connections. And PCI Express and maybe even some high level graphics command functionality.

Then each GCD has all of the shader engines (including ROPs and L2s) as well as the cache chiplets stacked on top of them.

But, infinity cache is an L3 concept, which is localised to GDDR, not shader engine L2. So maybe it doesn't make sense to move cache chiplets away from GDDR? In which case cache chiplets would be on top of the MCD. I suppose that would help with power/heat, since we can expect a large portion of MCD, taken up with GDDR PHYs, is not high in power density.

In reality, we're all kinda playing chiplet bingo.

Digging around in twitter threads, this comes up (the first 28 minutes):

HIR Chiplet Workshop: Architectures and Business Aspects for Heterogeneous Integration | IEEETV

But I don't think we can take much that's concrete from that.
 
But, infinity cache is an L3 concept, which is localised to GDDR, not shader engine L2. So maybe it doesn't make sense to move cache chiplets away from GDDR?
In a routing-based NoC, they can be designed not to co-locate. The memory-side cache and the coherence controller it is tied to can live elsewhere, and misses can be fed back into the NoC to get routed to the right DRAM controller at the edge.

Epyc Rome/Milan’s single-socket NUMA configurability is empirical evidence of that.
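
A toy sketch of that routing idea (my own illustration; the hash and channel count are made up and are not AMD's actual interleaving scheme): the DRAM controller that "owns" a cache line is a pure function of the physical address, so a miss from any cache slice can be forwarded across the NoC to the right controller at the edge.

```cpp
// Toy illustration of NoC-style address interleaving: the home DRAM controller
// for a cache line is computed from the physical address alone, so any router
// (or cache slice) can forward a miss to the right edge controller.
// Constants and hash are invented for the example.
#include <cstdint>

constexpr std::uint64_t kLineBytes   = 64;   // cache line granularity
constexpr std::uint32_t kNumChannels = 6;    // e.g. six GDDR channels/controllers

std::uint32_t homeChannel(std::uint64_t physAddr)
{
    std::uint64_t line = physAddr / kLineBytes;
    // Simple XOR-fold so consecutive lines spread across channels.
    std::uint64_t h = line ^ (line >> 7) ^ (line >> 13);
    return static_cast<std::uint32_t>(h % kNumChannels);
}

// A miss packet then just carries the address; any node can compute the
// destination as homeChannel(missAddr) and route it there.
```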

In any case, I think cache-in-bridge-chiplets still has the highest score given its presence in patents. The reuse-the-V-Cache-die angle is not convincing because of the significant differences in design targets (Zen core clock vs fabric clock, banking, cache line size, etc.).
 
But, infinity cache is an L3 concept, which is localised to GDDR, not shader engine L2. So maybe it doesn't make sense to move cache chiplets away from GDDR? In which case cache chiplets would be on top of the MCD. I suppose that would help with power/heat, since we can expect a large portion of MCD, taken up with GDDR PHYs, is not high in power density.

I was envisioning them stacked upon the MCD for the exact same reasons; the MCD needs to be fairly large physically just for all the external I/O balls anyway, and is relatively low power density compared to the GCDs. Once they've done RDNA3 as a pipe-cleaner/proof of concept on the high-end, in my mind at least this opens them up to really leveraging some of their older nodes for the I/O.

For an RDNA4 midrange product, I was thinking a 12nm (or a very mature, by that point, Samsung 8nm or TSMC 7nm) MCD, exploiting the lower cost per transistor of the older nodes and turning the larger die size into somewhat of a positive feature. All that silicon area gives you lots of room to tile up standardized L3 chiplets (or even dummy spacer silicon for product segmentation, not necessarily having to fill every L3 'pad') as desired. It all comes down to how costly and well-yielding the advanced packaging processes are, I suppose.
 
Does tape-out cover the fully assembled package or the individual chiplets?
Tape-out is purely for chips (chiplets).

With the rumours suggesting that Navi 31 consists of three distinct chiplet designs, I would say that it doesn't tell us much. If two of the three chiplet designs are shared between Navi 31 and 32, we can only ask "why not do the smaller chiplet-based GPU first?"

Code numbers were originally supposed to be about sequencing, i.e. that 31 is designed before 32, which is designed before 33. If 33 is taken to be the simplest iteration of RDNA 3 versus RDNA 2, then you could say it's reasonable to leave 33 tape-out until last, because it will proceed with the least risk.

We've already seen that AMD is quite happy to wait months/years to release low-end designs, so we should expect 33 to be later. Why 32 is the latest might be a risk-reward trade-off based solely on Navi 31 progress. The competitive performance of Navi 21 may have affected the ordering, such that AMD brought Navi 31 much further forward, taking more risks, leaving a gap that 33 could fill.

I don't believe AMD will deliver 31 and 32 "on time", for what it's worth. It's clear that the 5800X3D took much longer to get to market than V-Cache should have taken (Zen 2 was built for V-Cache), and Navi 31/32 are way more complex, with tougher thermals, tougher packaging and tougher drivers than Navi 33.
 