AMD RDNA4 Architecture Speculation

Okay, so the patent exactly matches what you'd naively assume in 1 man-hour of brainstorming that a cache in that spot would do. That being patented is extremely bad news then, as working around that patent appears to be quite hard, and that guiding effect is a basic necessity to make it efficient... Patents simply suck.
Discarding immediately sounds wasteful particularly if you're doing any sort of coherency sorting. Multiple rays/wavefronts will likely need the same data.
Yes and no. They will need the same hits, which satisfies all the any-hit use cases. They do not need to re-scan each group in that case. But you might of course use your generic L2 data cache to cater for the use cases where you need to re-sweep (e.g. because you have a no-hit scenario with coherent rays as the worst case).

EDIT: No, you don't need to cache the (full) AS for the no-hit scenario either. At least on AS level, you only need to ensure that the empty space is filled with referable nodes too, so the non-hit is effectively cacheable too, but with a bias in the guidance so that any-hit has priority over no-hit. Only at the bottom-most triangle soup level do you actually need to re-sweep, and you would therefore benefit significantly from "preferred" caching of the entire section.

EDIT2: What the patent didn't mention, is the feedback channel from accepted nearest-hit back to the cache, so that previous near-hits are also tried first for coherent rays in order to massively speed up ray-truncation. It's misleading with the implementation hint about fulfillment order. The real order of evaluation is: Cached nearest hit > cached any-hit > stalled (potential) any-hit (by order of fulfilled request) > cached no-hit
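A minimal sketch of that evaluation order, just to make the priority explicit. Everything here is hypothetical naming (`Kind`, `Candidate`, `fulfilled_at`); nothing of this is taken from the patent or from any real hardware interface, and a real unit would of course not sort a Python list.

```python
from dataclasses import dataclass
from enum import IntEnum


class Kind(IntEnum):
    # Lower value = tried earlier, following the order given in the post.
    CACHED_NEAREST_HIT = 0
    CACHED_ANY_HIT = 1
    STALLED_ANY_HIT = 2   # waiting on a memory request
    CACHED_NO_HIT = 3


@dataclass(frozen=True)
class Candidate:
    node: str              # hypothetical node identifier
    kind: Kind
    fulfilled_at: int = 0  # only meaningful for STALLED_ANY_HIT


def evaluation_order(candidates):
    """Order candidates by class; stalled ones by request fulfillment order."""
    def key(c):
        tie = c.fulfilled_at if c.kind is Kind.STALLED_ANY_HIT else 0
        return (int(c.kind), tie)
    return sorted(candidates, key=key)


cands = [
    Candidate("empty", Kind.CACHED_NO_HIT),
    Candidate("late", Kind.STALLED_ANY_HIT, fulfilled_at=7),
    Candidate("any", Kind.CACHED_ANY_HIT),
    Candidate("early", Kind.STALLED_ANY_HIT, fulfilled_at=2),
    Candidate("near", Kind.CACHED_NEAREST_HIT),
]
order = [c.node for c in evaluation_order(cands)]
```

With these inputs the previously accepted nearest hit is tried first and the cached empty node last.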

EDIT3: At least that fast-truncation approach of re-using the cached hierarchy of a previous nearest-hit is possible to do in software on AMD hardware via LDS too!

EDIT4: Nothing in that patent prevents you from providing a cache which serves only to record nearest hits and yields the estimated new nearest hit (or at least the best possible entry point into the hierarchy), thereby at least solving most of the any-hit / close-miss scenarios trivially. It only restricts the use of streaming caches within a fixed function traversal unit for AS data, but not when the cache is explicitly populated from shader code!

Manually enter nearest-hits, as well as all known empty nodes, into the cache for no-hits (in both cases with absolute bounding boxes), run rays only against that cache for the first round, and you can mostly decide with zero data fetches / indirection whether the ray will be truncated (both on the near and far plane!) or not. In the worst case, it will still give you a very good hint about the closest point where it is missing prefetched data (near plane truncation!); in the best case, that cache alone will truncate both near and far plane exactly, giving you the nearest hit and nothing else. On average, it will still get you deep into the AS with a pre-computed global bounding box, so you can still skip several rounds of pointer chasing.
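A rough software sketch of that first round, assuming the cache holds plain axis-aligned boxes in world space. The names (`slab_interval`, `skip_known_empty`, `nearest_cached_hit`) and the tuple layout are made up for illustration. Note that a cached hit box only yields a candidate interval: the box being intersected does not prove the geometry inside is hit, so the far bound stays a hint until confirmed.

```python
def slab_interval(origin, direction, lo, hi, tmin, tmax):
    """Standard slab test: clip [tmin, tmax] against the box, or None on a miss."""
    t0, t1 = tmin, tmax
    for o, d, a, b in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray parallel to this slab: either always inside or never.
            if o < a or o > b:
                return None
            continue
        near, far = (a - o) / d, (b - o) / d
        if near > far:
            near, far = far, near
        t0, t1 = max(t0, near), min(t1, far)
        if t0 > t1:
            return None
    return t0, t1


def skip_known_empty(origin, direction, tmin, tmax, empty_boxes):
    """Near-plane truncation: advance tmin through boxes known to be empty."""
    advanced = True
    while advanced:
        advanced = False
        for lo, hi in empty_boxes:
            iv = slab_interval(origin, direction, lo, hi, tmin, tmax)
            # Only skip if the ray currently starts inside the empty box.
            if iv is not None and iv[0] <= tmin < iv[1]:
                tmin = iv[1]
                advanced = True
    return tmin


def nearest_cached_hit(origin, direction, tmin, tmax, hit_boxes):
    """Return (user_ptr, t_enter, t_exit) of the closest cached hit box, or None."""
    best = None
    for user_ptr, lo, hi in hit_boxes:
        iv = slab_interval(origin, direction, lo, hi, tmin, tmax)
        if iv is not None and (best is None or iv[0] < best[1]):
            best = (user_ptr, iv[0], iv[1])
    return best


o, d = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
empty = [((-1.0, -1.0, -1.0), (4.0, 1.0, 1.0)), ((4.0, -1.0, -1.0), (9.0, 1.0, 1.0))]
hits = [(42, (20.0, -1.0, -1.0), (22.0, 1.0, 1.0)), (7, (12.0, -1.0, -1.0), (13.0, 1.0, 1.0))]
new_tmin = skip_known_empty(o, d, 0.0, 100.0, empty)
hint = nearest_cached_hit(o, d, new_tmin, 100.0, hits)
```

Here the two empty boxes push the near plane out without touching the AS at all, and the returned `user_ptr` is the suggested re-entry point into the hierarchy.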

Only the coherent memory access on a non-cache-hit / any-hit is still blocked by that patent.
 
Okay, so the patent exactly matches what you'd naively assume in 1 man-hour of brainstorming that a cache in that spot would do. That being patented is extremely bad news then, as working around that patent appears to be quite hard, and that guiding effect is a basic necessity to make it efficient... Patents simply suck.

Yeah I doubt something as mundane as a cache patent is enforceable. AMD is surely free to add a dedicated "RT cache" as part of their fixed function traversal implementation if they want to.

Yes and no. They will need the same hits, which satisfies all the any-hit use cases. They do not need to re-scan each group in that case. But you might of course use your generic L2 data cache to cater for the use cases where you need to re-sweep (e.g. because you have a no-hit scenario with coherent rays as the worst case).

EDIT: No, you don't need to cache the (full) AS for the no-hit scenario either. At least on AS level, you only need to ensure that the empty space is filled with referable nodes too, so the non-hit is effectively cacheable too, but with a bias in the guidance so that any-hit has priority over no-hit. Only at the bottom-most triangle soup level do you actually need to re-sweep, and you would therefore benefit significantly from "preferred" caching of the entire section.

EDIT2: What the patent didn't mention, is the feedback channel from accepted nearest-hit back to the cache, so that previous near-hits are also tried first for coherent rays in order to massively speed up ray-truncation. It's misleading with the implementation hint about fulfillment order. The real order of evaluation is: Cached nearest hit > cached any-hit > stalled (potential) any-hit (by order of fulfilled request) > cached no-hit

No, there's no mention of feedback between individual rays to influence the order of intersection operations. That would require explicit grouping of rays that would be allowed to share feedback. There's no such explicit grouping mentioned in the patent, and each ray is treated as its own thread within the RT unit. The cache optimization is relatively straightforward and just schedules rays together that need the same data. This grouping is local to the RT unit and is independent of how those rays were issued from the shader. The grouping is fluid, and groups are formed/broken on the fly as rays diverge.
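To illustrate the "schedule rays together that need the same data" idea, here is a toy model. It is not the patent's mechanism; the `pending` map and the largest-group-first policy are assumptions made purely for the example.

```python
from collections import defaultdict


def group_by_next_node(pending):
    """pending maps ray_id -> address of the node that ray needs next."""
    groups = defaultdict(list)
    for ray_id, node in pending.items():
        groups[node].append(ray_id)
    return dict(groups)


def schedule(pending):
    """Assumed policy: serve the largest group first, so one fetch feeds most rays."""
    groups = group_by_next_node(pending)
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Step 1: five rays, mostly coherent.
step1 = schedule({0: "A", 1: "B", 2: "A", 3: "A", 4: "B"})

# Step 2: the rays have advanced and diverged, so the groups are re-formed from
# scratch. Nothing ties a ray to the group it was in before.
step2 = schedule({0: "C", 2: "D", 3: "C", 1: "E", 4: "E"})
```

The groups only exist for the duration of one scheduling step, which is what makes them "fluid": ray 2 leaves its old group as soon as it needs a different node.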
 
That would require explicit grouping of rays that would be allowed to share feedback.
It doesn't :)

A cached nearest-hit is just a (yet to be confirmed) any-hit (or no-match due to rejection of the entire BLAS / flag based rejection) for another ray. So it's safe to share within a TLAS. Only when accepted, it's very likely to be not just a random any-hit but actually a good candidate for the nearest hit.

Cached "empty" placeholder nodes are likewise uncritical. (But potentially trashing the cache.)

And as long as the entire hierarchy enters the cache as individual cache entries (not just the leaf), it still gives you a high hit rate in the cache as long as there is remotely any coherency in the initial set of rays scheduled on the CU. You actually get not just one qualified "hint" out of such a cache, but already an ordered list of varying selectivity.
 
Ok but how do you guarantee the explicitly cached nodes won't be flushed by other rays/wavefronts before they're reused? You would need a pretty big cache. I'm not sure if you're referring to LDS or something bigger down the chain. The cache I'm referring to would be local to the CU.
 
Likewise. CU local, but actually dedicated to this functionality, as it does need to evaluate some logic for the actual cache lookup. Just a rough guesstimate, but about 100 entries (assuming an up to 10 level deep BVH tree down to the final triangle) should be able to achieve a somewhat steady level 8-10 cache hit for up to 500 concurrent rays if originally coherent in screen space (and no Monte Carlo sampling!), as long as there is a bias in eviction in favor of keeping nodes closer to the root and keeping once-hit nodes for longer (thus implicitly favoring large nodes).

But that's already too large for a fully associative cache when shared across a full CU.

So it makes more sense to go smaller. 30 entries or so, and when in doubt have more than one per CU, one per 4-8 concurrent rays. Just barely enough to consistently cache the number of rays (or rather their common occluder!) typically launched from a single screen space fragment for basic GI.

Early rejection of TLAS nodes (not enough savings, as most of the traversal wouldn't be avoided after all) and micro-geometry from entering the cache should help with avoiding accidental flushes too.
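One way to picture the eviction bias described above (keep nodes closer to the root, keep nodes that were hit again) is the small model below. The concrete ordering (depth first, then hit count, then recency) is only one possible reading of that bias, and `BiasedCache` / `Entry` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    node: str
    depth: int         # 0 = root of the hierarchy
    hits: int = 0      # number of re-uses after insertion
    last_use: int = 0  # logical timestamp of the last touch


class BiasedCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}
        self.clock = 0

    def _victim(self):
        # Evict the deepest node first; among equals the least re-used one,
        # then the least recently used one.
        return max(self.entries.values(),
                   key=lambda e: (e.depth, -e.hits, -e.last_use))

    def touch(self, node, depth):
        """Record a use of `node`, inserting it (and evicting) if needed."""
        self.clock += 1
        entry = self.entries.get(node)
        if entry is None:
            if len(self.entries) >= self.capacity:
                del self.entries[self._victim().node]
            entry = self.entries[node] = Entry(node, depth)
        else:
            entry.hits += 1
        entry.last_use = self.clock
        return entry


cache = BiasedCache(capacity=4)
cache.touch("root", 0)
cache.touch("a", 1)
cache.touch("leaf1", 5)
cache.touch("leaf1", 5)   # re-used once, so it should outlive leaf2
cache.touch("leaf2", 5)
cache.touch("leaf3", 5)   # cache is full: leaf2 is the victim
resident = sorted(cache.entries)
```

Under this policy the upper levels stay resident even when a burst of leaf nodes passes through, which is the property the estimate above relies on.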
 
I wonder if AMD was kind of cornered by its market position to "brute force" the RDNA3 design -- just more of everything (caches, VGPRs and constipated dual-issue) and very little specific/targeted improvements. And what's up with WMMA support? FSR should have already been forked to implement inference on RDNA3 for higher quality output to gain some points against DLSS.
It's probable that AMD doesn't see a future in how hardware ray tracing can succeed especially as we go deeper into the current generation. Virtualized geometry could potentially catch on in other applications beyond just UE5 based games and most hardware vendors haven't yet figured out a reasonable way for their acceleration structures to cope with that amount of data nor if they ever will. We also have people that want efficient function calls to implement megakernels for ray tracing due to the limited ray payload budget but that idea also becomes untenable with more complex materials like UE5's substrate system since it'll lead to the compiler spilling ...

WMMA instructions don't have any improved computational throughput over using dual-issue variants of vector dot product instructions on current AMD graphics architecture. Based on the LLVM patches, their next graphics architecture doesn't appear to have implemented any special hardware path for it either. They've added sparsity support for WMMA which can possibly aid in skipping some work but the performance gains in practice are relatively peanuts in the AI/ML domain ...
 
Giant BVH trees are branchy as all get out by nature cause, well it's in the name, they're trees. So parallelization is always going to be hard by nature. But they're required for performance reasons because triangle hits in RT are sloooooow, no matter the arch right now they're super slow and you need super tight BVHs to get performance. The interesting thing is, something like an SDF is fast for tracing, even in software. Splatting is even faster.

Now how do you animate splatting? That's a million dollar question, but if it's solved then there's a solution to realtime RT that skips a lot of BVH building (which is hard on animation anyway) and a lot of BVH traversal; splatting is fast because it's just plain simple. The "neural" part of all this neural rendering is nigh useless, it's all going towards rasterizing splats anyway, because you don't need to "learn" anything for the rendering equation itself, it's dead simple. But! But there's still a lot of cool progress there on splatting, because game programmers have skipped splatting because triangles are just more familiar.

But! But a relatively simple acceleration structure is probably a good idea even if splatting is the future. Fixed function hardware that does what software is supposed to be doing has been a terrible idea for GPUs for over a decade. CPUs don't have specialized units doing anything as such, why would they? All CPUs have is different execution units for different instruction types. So why on earth do GPUs have such units, why can't GPUs be parallel versions of CPUs? I can see maybe a special CU type for going through branches as fast as possible, with its own memory cache dedicated to it, but it can't be some black box; programmers should still be able to define how it functions.

As for Matrix, yeah I'm kinda surprised there's not some double rate matrix execution. For upscaling/denoising by nature is just statistical guessing, a solid choice then for machine learning. XESS and DLSS have the advantage of producing better image quality for less developer work than FSR, and executing such faster would be a good idea for AMD. AMD has CDNA, it's right there, maybe that's coming in RDNA5 though.
 
The interesting thing is, something like an SDF is fast for tracing, even in software. Splatting is even faster.
SDF is fast because people use low res approximations.
Splatting is eventually fast because it's rasterization from a single view, so comparing it to RT is apples vs. oranges.

The "neural" part of all this neural rendering is nigh useless
It's used to optimize / reduce the number of splats needed to represent the scene well enough.
If we remove this, animating the splats is no more problem, but we need 100 times more splats and eventually splatting is no longer 'fast'. (They still need 16ms just to sort and render a static and prelit scene in the SG paper, which isn't fast at all.)

It's probable that AMD doesn't see a future in how hardware ray tracing can succeed especially as we go deeper into the current generation. Virtualized geometry could potentially catch on in other applications beyond just UE5 based games and most hardware vendors haven't yet figured out a reasonable way for their acceleration structures to cope with that amount of data nor if they ever will.
But Epic already manages BVH for Nanite which they use for lod selection and culling. Why should AMD think the same thing can't be done in the same way for RT?

I'm with you, guys. But those speculations feel somewhat baseless to me right now.
I'm also surprised about recent assumptions RDNA 4 again won't get HW traversal. It looked the other way around some months ago.
 
Giant BVH trees are branchy as all get out by nature cause, well it's in the name, they're trees. So parallelization is always going to be hard by nature. But they're required for performance reasons because triangle hits in RT are sloooooow, no matter the arch right now they're super slow and you need super tight BVHs to get performance.

Seems a bit early to give up on triangle tracing. There’s still opportunity to throw transistor budget and software optimizations at the problem. In the future when big caches move to a separate die there will be even more room to dedicate to hardware RT on compute dies. Bouncing rays off triangles is too simple, flexible and elegant of a solution to not be a first class citizen going forward.

We think RT is slow because we’re evaluating it in an arena designed around raster limitations (few shadow casting lights, static or missing GI, terrible reflections). RT is more viable when aiming higher and hopefully we continue to do just that.
 
SDF is fast because people use low res approximations.
Splatting is eventually fast because it's rasterization from a single view, so comparing it to RT is apples vs. oranges.

SDF is faster even at the same res, everyone uses low poly approximations of models today, you can go as high as you want with SDF and it's still faster. Same with splatting, you can splat individual "rays" all you want, they're always faster than triangles. A 3090 would get crushed trying to trace triangles on these scenes, for an SDF it does them at 60fps: https://jcgt.org/published/0011/03/06/paper-lowres.pdf
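For reference, this is roughly why tracing an SDF is so simple in software: the field value at a point is a safe step size, so there is no hierarchy to walk. Below is a bare-bones sphere tracing sketch against a single analytic sphere. It is not the method of the linked paper, and the parameters (`eps`, `max_steps`, `tmax`) are arbitrary illustration values.

```python
import math


def sphere_sdf(p, center, radius):
    """Signed distance from point p to a sphere surface."""
    return math.dist(p, center) - radius


def sphere_trace(origin, direction, sdf, tmax=100.0, eps=1e-4, max_steps=128):
    """March along a unit-length direction; return hit distance or None."""
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t          # close enough to the surface
        t += dist             # the SDF value is a safe step size
        if t > tmax:
            break
    return None


scene = lambda p: sphere_sdf(p, (0.0, 0.0, 5.0), 1.0)
hit_t = sphere_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene)
miss_t = sphere_trace((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), scene)
```

The cost moves into evaluating (or sampling) the field itself, which is where the resolution argument in this exchange comes in.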

It's used to optimize / reduce the number of splats needed to represent the scene well enough.
If we remove this, animating the splats is no more problem, but we need 100 times more splats and eventually splatting is no longer 'fast'. (They still need 16ms just to sort and render a static and prelit scene in the SG paper, which isn't fast at all.)

The papers you're talking about are old. Very old. Here's a newer paper on splatting 3D Gaussians that gives a more up-to-date view: https://arxiv.org/pdf/2401.06003.pdf

Here's how to set them up for normal rendering instead of inverse rendering: https://nju-3dv.github.io/projects/Relightable3DGaussian/

You can run 3D Gaussians on your phone at 60fps today: https://webgl-gaussian-splatting.vercel.app/ The sorting problem isn't fundamentally a neural problem; it doesn't need an AI to learn each scene, that would be a ridiculous waste. It's just generic sorting and can be solved as such.
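The "generic sorting" point can be shown in a few lines: order the splats by depth along the view direction, nothing learned involved. This sketch assumes back-to-front order for ordinary "over" blending; actual viewers may sort front-to-back or use a GPU radix sort, and the function name is made up.

```python
def sort_back_to_front(centers, cam_pos, cam_forward):
    """Return splat indices ordered from farthest to nearest along cam_forward."""
    def depth(c):
        # View-space depth: projection of (center - camera) onto the view direction.
        return sum((ci - pi) * fi for ci, pi, fi in zip(c, cam_pos, cam_forward))
    return sorted(range(len(centers)), key=lambda i: depth(centers[i]), reverse=True)


centers = [(0.0, 0.0, 1.0), (0.0, 0.0, 5.0), (0.0, 0.0, 3.0)]
draw_order = sort_back_to_front(centers, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

The order has to be recomputed whenever the camera moves, which is the per-frame cost the previous post is pointing at.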

The question isn't "is something like 3D Gaussians faster", we know they are; the question is how to efficiently animate them skeletally without cracks appearing. So far there's been stuff like tracking the animation with new Gaussians over time, but that's obviously not workable for interactive stuff. Still, it seems a tractable problem.

Seems a bit early to give up on triangle tracing. There’s still opportunity to throw transistor budget and software optimizations at the problem. In the future when big caches move to a separate die there will be even more room to dedicate to hardware RT on compute dies. Bouncing rays off triangles is too simple, flexible and elegant of a solution to not be a first class citizen going forward.

We think RT is slow because we’re evaluating it in an arena designed around raster limitations (few shadow casting lights, static or missing GI, terrible reflections). RT is more viable when aiming higher and hopefully we continue to do just that.

The question for RDNA4, and other architectures, isn't "how do we solve software's problem better than software", but "how do we let software solve the problems they have as fast as possible." If people want to continue on triangles that's fine, but the hardware shouldn't be limited to that, or limited to BVHs. BVHs are really hard to rebuild in realtime; maybe there are other acceleration structures that can handle something like dynamic foliage in realtime far better than a BVH. But right now we have ray/box test units and such that lock us into the idea of building a BVH. Tessellation, geometry shaders, etc. etc. Hardware has tried solving problems for software for years and it doesn't work and it's time to stop.
 
But Epic already manages BVH for Nanite which they use for lod selection and culling. Why should AMD think the same thing can't be done in the same way for RT?

I'm with you, guys. But those speculations feel somewhat baseless to me right now.
I'm also surprised about recent assumptions RDNA 4 again won't get HW traversal. It looked the other way around some months ago.
Nanite only uses the BVH for quick LoD rejection. For LoD selection, our offline acceleration structure building process involves merging and splitting of clusters, which makes it a DAG data structure, and it helps that we don't traverse it at runtime either ...

There's benefits to keeping traversal logic flexible since it can allow us to do less work in certain cases. We could possibly use traversal shaders to do a lazy update of the BLAS, multi-level instancing, or stochastic LoD selection but if some vendors don't want to progress in this direction because of 'performance' (or a loss thereof in relative advantage) then developers should stop using ray tracing APIs that continues to have unresolved issues if it doesn't meet their requirements (virtual geometry ?) anymore ...
 
The question for RDNA4, and other architectures, isn't "how do we solve software's problem better than software", but "how do we let software solve the problems they have as fast as possible." If people want to continue on triangles that's fine, but the hardware shouldn't be limited to that, or limited to BVHs. BVHs are really hard to rebuild in realtime; maybe there are other acceleration structures that can handle something like dynamic foliage in realtime far better than a BVH. But right now we have ray/box test units and such that lock us into the idea of building a BVH. Tessellation, geometry shaders, etc. etc. Hardware has tried solving problems for software for years and it doesn't work and it's time to stop.

Fair question. But haven’t we had decades of R&D in software renderers on both CPUs and GPUs and haven’t they mostly converged on tracing rays into a triangle BVH? Blender, Arnold etc.
 
I see raytracing as the next evolution of PBR. What is a more accurate way to depict the visual experience of real life than simulating photons bouncing off and through physical objects in the world? Seems like the "real" physically-based rendering to me.

Obviously there are more iterations to be done, however ray tracing seems the natural progression of things. That's just me though, feel free to refute.
 
Nanite only uses the BVH for quick LoD rejection. For LoD selection, our offline acceleration structure building process involve merging and splitting of clusters which makes it a DAG data structure and it helps that we don't traverse it at runtime either ...
They also use the BVH to cull occluded branches, at least that's my impression from skimming shader code. Probably they can do this similar to LoD selection, by caching the cut and going up or down the hierarchy next frame to avoid / minimize the need for traversal. But my point is that the BVH could be used for traversal and thus RT as well, which proves it fits into memory for such high detail.
There's benefits to keeping traversal logic flexible since it can allow us to do less work in certain cases. We could possibly use traversal shaders to do a lazy update of the BLAS, multi-level instancing, or stochastic LoD selection but if some vendors don't want to progress in this direction because of 'performance' (or a loss thereof in relative advantage) then developers should stop using ray tracing APIs that continues to have unresolved issues if it doesn't meet their requirements (virtual geometry ?) anymore ...
Exactly. : )
Which is why I won't be too disappointed if HW traversal is still missing in RDNA4, since RT feels too expensive to be worth it for what I could do with a blackboxed BVH.
Microsoft's leaked slide about the next Xbox gives me some hope. They say 'next gen RT', which I take as a promise they'll fix the limitations on DX12 API level.
Idk what else they could do to call it 'next gen'. But i'm ready to be disappointed again. Maybe it's just about better HW acceleration.
 
SDF is faster even at the same res, everyone uses low poly approximations of models today, you can go as high as you want with SDF and it's still faster. Same with splatting, you can splat individual "rays" all you want, they're always faster than triangles.
Well, to approximate triangles at the 'same res', you would need infinite resolution. And the hit point does not tell you anything about material, so you need another data structure, requiring some form of point query (traversal) just to find that.
But that's just me maybe. I think volume data is extremely costly, and alternatives are always better, this includes classical RT.

To trace splats, you need again some acceleration structure to find them, so the only difference to triangle tracing is the final intersection function, which is always cheap compared to the traversal. So imo splatting competes and compares to rasterization but not ray tracing. RT just remains the same concept no matter what's the final primitive defining the surface.

But thanks for the papers. Will read.

The question isn't "is something like 3D Gaussians faster", we know they are; the question is how to efficiently animate them skeletally without cracks appearing. So far there's been stuff like tracking the animation with new Gaussians over time, but that's obviously not workable for interactive stuff. Still, it seems a tractable problem.
Animation never seems a big problem to me. You could just deform the splats with the skin, so it would not introduce new cracks if the deformation is smooth enough. Acceleration structures can be deformed in the same way without requiring to change their structure.
SG would support such deformation, so it's not much different than deforming triangle meshes. That's nice, and image quality is awesome.
Due to transparency support, it could also blend multiple LODs, turning a hard and almost impossible problem into something as simple as mip mapping. That's where my primary interest comes from.
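As a sketch of "just deform the splats with the skin": the snippet below moves a splat center with linear blend skinning, i.e. a weighted sum of the bone-transformed positions. Linear blend skinning is my choice for the illustration, the helper names are made up, and a real implementation would also have to update each splat's orientation/covariance, which is left out here.

```python
def skin_point(p, bones, weights):
    """Linear blend skinning: weighted sum of the bone-transformed positions."""
    out = [0.0, 0.0, 0.0]
    for bone, w in zip(bones, weights):
        q = bone(p)
        for k in range(3):
            out[k] += w * q[k]
    return tuple(out)


def translate(dx, dy, dz):
    """Bone transform that only translates."""
    return lambda p: (p[0] + dx, p[1] + dy, p[2] + dz)


def rotate_z_90(p):
    """Bone transform: rotate 90 degrees around the z axis."""
    return (-p[1], p[0], p[2])


center = (1.0, 0.0, 0.0)
deformed = skin_point(center, [translate(0.0, 0.0, 2.0), rotate_z_90], [0.5, 0.5])
```

Neighbouring splats with similar weights end up with similar displacements, which is the "smooth enough" condition mentioned above for not opening cracks.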
 
For LoD selection, our offline acceleration structure building process involve merging and splitting of clusters which makes it a DAG data structure and it helps that we don't traverse it at runtime either ...
Which is perfectly fine for rendering in screen-space, I suppose? You are still traversing after all, you did amortize your costs on the cluster level and had that capped to a constant number of clusters.

However, once you leave screen-space or any form of coherent perspective projection and go GI, you lose the option to merely advance on the rasterization approach, as you lose coherence entirely.

I do understand why you want to make the actual traversal software defined. But I still do expect that an extension of the following form could prove valuable:
Code:
// Inject a bounding box with a defined priority and a user-defined ptr into a bounded-size LRU cache for use by the same kernel launch on this CU.
// Preconditions: priority_class != 0 and user_ptr != 0.
// Use on-hit when you assume further coherence, or use to pre-warm the cache.
spatial_cache_insert_bb uint64_t _in_ user_ptr, vec3 _in_ bb_tlf, vec3 _in_ bb_brb, uint64_t _in_ priority_class

// Yield user_ptr for the cached entry with the highest (priority_class & priority_mask), where (priority_class & priority_mask) != 0, which intersects the ray segment [tmin, tmax] from origin along direction_norm.
// On a hit, user_ptr != 0 and the ray is truncated to the matching ray segment. On a miss, user_ptr == 0.
// Ray-box tests are kind of expensive, so this might end up with variable latency...
spatial_cache_query_ray uint64_t _out_ user_ptr, float _inout_ tmin, float _inout_ tmax, vec3 _in_ origin, vec3 _in_ direction_norm, uint64_t _in_ priority_mask

// Yield user_ptr for the cached entry with the highest (priority_class & priority_mask), where (priority_class & priority_mask) != 0, which has any intersection with the probe bounding box.
// On a hit, user_ptr != 0 and the query bounding box is reduced to the intersection. On a miss, user_ptr == 0.
// This is easily doable at full throughput...
spatial_cache_query_bb uint64_t _out_ user_ptr, vec3 _inout_ bb_tlf, vec3 _inout_ bb_brb, uint64_t _in_ priority_mask

// Flush all bounding boxes with (priority_class & priority_mask) != 0 from the cache.
spatial_cache_flush uint64_t _in_ priority_mask
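To pin down the intended semantics, here is a small software model of the insert / box-query / flush part of that proposal. It is only an emulation for discussion: it assumes the mask is ANDed with the priority class, treats the two corners as min/max corners (`lo`, `hi`), and uses a plain LRU list where real hardware would match all entries in parallel. The ray query is omitted.

```python
from collections import OrderedDict


class SpatialCache:
    """Software model of the proposed per-CU spatial cache (hypothetical)."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = OrderedDict()  # user_ptr -> (lo, hi, priority_class)

    def insert_bb(self, user_ptr, lo, hi, priority_class):
        assert user_ptr != 0 and priority_class != 0
        self.entries.pop(user_ptr, None)
        self.entries[user_ptr] = (lo, hi, priority_class)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used

    def query_bb(self, lo, hi, priority_mask):
        """Return (user_ptr, lo, hi); user_ptr == 0 and the box unchanged on a miss."""
        best = None
        for ptr, (elo, ehi, cls) in self.entries.items():
            masked = cls & priority_mask
            if masked == 0:
                continue
            ilo = tuple(max(a, b) for a, b in zip(lo, elo))
            ihi = tuple(min(a, b) for a, b in zip(hi, ehi))
            if any(a > b for a, b in zip(ilo, ihi)):
                continue  # boxes do not overlap
            if best is None or masked > best[0]:
                best = (masked, ptr, ilo, ihi)
        if best is None:
            return 0, lo, hi
        self.entries.move_to_end(best[1])  # a hit refreshes the entry
        return best[1], best[2], best[3]

    def flush(self, priority_mask):
        doomed = [p for p, e in self.entries.items() if e[2] & priority_mask]
        for ptr in doomed:
            del self.entries[ptr]


cache = SpatialCache(capacity=2)
cache.insert_bb(0x10, (0, 0, 0), (2, 2, 2), 1)
cache.insert_bb(0x20, (1, 1, 1), (3, 3, 3), 2)
both = cache.query_bb((1.5, 1.5, 1.5), (5, 5, 5), 3)
low_only = cache.query_bb((1.5, 1.5, 1.5), (5, 5, 5), 1)
miss = cache.query_bb((10, 10, 10), (11, 11, 11), 3)
```

With both classes enabled the higher-priority entry wins and the probe box shrinks to the overlap; masking it out falls back to the other entry.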

The point being that in terms of transistor budget, it's cheap to match a "huge" list of recently used nodes against some query in hardware, much cheaper than you could do in software for comparable cache sizes. The hardware can give you a cache hit or miss in constant time, and without having to reload anything for that purpose, so all that at a fraction of the power budget you would burn for a single test.

That wouldn't tie you to a BVH tree as the linkage structure, but it still gives you that amortized cost for the cases where you have e.g. a non-trivial inner hull (non-precomputable due to varying viewpoints) which most samples will hit for GI, but where the actual hit points end up randomized enough that a deterministic lookup strategy over the full hierarchy depth would end up with infeasible traversal costs. You'd also retain that much-needed choice of which parts of your acceleration structure to use as re-entry points, and which not, based on the educated guess you have already made.
 
If it's 64CU and 3.2ghz(ish) and decent RT improvement, that should put it ≥ 4070ti S in terms of raytraced games. So $699, maybe with Assassin's Creed Katana (or whatever they name the Red/Japan one).

Then call it, 56CU 3ghz for a cut down one, around 4070 - 4070s? $599 16gb with AC Red, the selling point of extra ram and a pack in game might tempt them for otherwise performance/price parity here. Especially since this would lessen the temptation for Nvidia to drop the 4070s price to compete.

32CU 3.3ghz, 16gb, $399. Call it a 4060ti competitor, looking just as fast or faster *generally.
28CU 3ghz 8gb, $249, gotta compete with the 4060 somehow.

Now are they launching with GDDR6 towards the middle of the year, or GDDR7 towards the end, or maybe a split? I could see GDDR7 being helpful at 4K for the highest end one.
 

possible NAVI 48/44 leak

however NAVI48 die size doesn't fit

The MCD interface on NAVI32 is about 146mm2, and the reason the MCDs are made on a different manufacturing process is that such interfaces/cache show absolutely minimal scaling with newer nodes. I.e. you need to add at least 130mm2 of controller and IF cache area to a so-called scaled-down "NAVI32 GCD" made on 4nm, and that doesn't fit into 240mm2, because the difference between 5nm and 4nm is absolutely marginal.
 

possible NAVI 48/44 leak

NAVI44 is pretty small, however NAVI48 die size doesn't fit
IMO neither really "fit"... both are crazy small. I wonder if they did something with the cache, chiplet for stacked or SxS.
N48 is basically the same size of N23 but doubles everything... near perfect logic scaling(?)
N33 is 204mm2 on N6, perfect scaling based on marketing numbers of N4P would be ~105mm2. Somehow they got ~80% of marketing logic density benefits? I was expecting something around 150mm2.
With something closer to 180mm2, I would question if it had 192bit or 36/40CUs to get closer to 7700XT but then I realized that cutting down N48 would be a better option to keep N44 as optimized as possible, I'm expecting ~100w TDPs.
I thought full N44 would need 19.5/20Gbps GDDR6 with that faster Infinity Cache to break +20% over 7600XT but that rumor says they are still using 18Gbps GDDR6.

N44 2SE 32CU 64ROP @ ~3.4ghz performance +25% 7600XT (~3070)

N48, I was expecting a die size around ~280-320mm2. And was bouncing around between 3SE or 4SE.
3SE 60CU 96ROP @ ~3.33ghz if stuck with 20Gbps GDDR6

Though this rumor appears to be pointing towards a
4SE 64CU 128ROP @ ~+3ghz with 21.65Gbps GDDR6

Wonder if they did something with the frontend and both N44 and N48 are 2SE designs but N48's is buffed up a bit.
Funny to think N48 *might* basically be a doubled up Vega64, 7 years later.
 