Speculation and Rumors: AMD RDNA4 ...

So the chances of ShaderEngine1 requesting BVH data that is stored in some MCD not directly connected to it are almost zero? And if it should happen, would there be almost no perf loss in the end? I'm surprised then.
So I deduce that for the foreseeable future (say the next architecture or two) GPUs will never update the BVH data on their own, and it will always be done CPU-side?
Video memory is conceptually one big linear space. The hardware then maps this conceptually linear space onto Navi 31's 24 memory channels by "interleaving" it (each MCD owns a group of 4 channels): it divides the whole space into blocks of 256* bytes of raw data and shuffles the blocks across all 24 memory channels. This way you maximize bandwidth usage regardless of how and where your resources are placed.

Then your BVH tree is stored as a resource in this one big linear video memory space.

So your BVH node lookups can land anywhere, just like your texture/sampler accesses and whatnot.


* assuming the granularity hasn't changed versus its predecessors (VLIW/GCN). Could be 128 bytes now.
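
To make the interleaving concrete, here is a minimal C++ sketch of how a linear VRAM address could map to a channel and its owning MCD. The 256-byte stride, the plain modulo mapping, and the MapAddress helper are all assumptions for illustration; the real Navi 31 swizzle is more involved (and the granularity may be 128 bytes, per the footnote). It just shows that any contiguous resource, a BVH included, gets spread across every MCD within a few kilobytes.

```cpp
// Minimal sketch of channel interleaving, assuming a 256-byte stride and
// Navi 31's 24 channels (6 MCDs x 4 channels each). The real address
// swizzle is more complex; this only illustrates why consecutive blocks
// of any resource end up spread across every MCD.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kBlockBytes = 256;   // assumed interleave granularity
constexpr uint32_t kChannels   = 24;    // Navi 31: 6 MCDs * 4 channels
constexpr uint32_t kChanPerMcd = 4;

struct ChannelMapping {
    uint32_t channel;  // which of the 24 memory channels
    uint32_t mcd;      // which MCD owns that channel
};

// Hypothetical mapping: block index modulo channel count.
ChannelMapping MapAddress(uint64_t vramAddress) {
    const uint64_t block   = vramAddress / kBlockBytes;
    const uint32_t channel = static_cast<uint32_t>(block % kChannels);
    return { channel, channel / kChanPerMcd };
}

int main() {
    // Walk a 4 KiB stretch of a BVH node array: the blocks land on
    // different channels and therefore different MCDs as we go.
    for (uint64_t addr = 0; addr < 4096; addr += kBlockBytes) {
        ChannelMapping m = MapAddress(addr);
        std::printf("addr %5llu -> channel %2u (MCD %u)\n",
                    static_cast<unsigned long long>(addr), m.channel, m.mcd);
    }
    return 0;
}
```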
 
So the chances of ShaderEngine1 requesting BVH data that is stored in some MCD not directly connected to it are almost zero? And if it should happen, would there be almost no perf loss in the end? I'm surprised then.
So I deduce that for the foreseeable future (say the next architecture or two) GPUs will never update the BVH data on their own, and it will always be done CPU-side?


I *think* Imagination / PowerVR have, on paper, a Level 5 RT solution (so with full BVH generation on the GPU; maybe even Level 4 already covers that, I'm not techy enough to be sure. On the latest Photon material I read LV4 on some slide...). The levels are explained here: https://gfxspeak.com/featured/the-levels-tracing/

Sorry for the OT, let's get back to RDNA4. Well, if RDNA4 is the generation where they go for more RT offloading like Nvidia and Intel, maybe they can go Level 5 from the start; that would be nice.
 
If you think about the math: if a ray hits a pixel on GCD 1 and is reflected towards a pixel on GCD 2, this could lead to high bandwidth between the GCDs.
But rays (usually) don't care about pixels, nor do they jump from one thread group to another (unsure about the related SER details). They intersect the BVH, which lives in VRAM.
If we generate reflection rays from the frame buffer, we will use some tiling anyway to increase coherence. So rays on a compute chiplet will originate from the same tile and will access similar portions of the BVH, with caching working as usual (sketched below).

So the concerns you have would not apply to HW RT, but more likely to SSR or SSGI, if compute chiplets have a tile of the frame buffer on chip.
Image post processing with wide kernels may take a hit from chiplet design.
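
To illustrate the tiling point above, here is a minimal C++ sketch of generating reflection rays tile by tile instead of scanline by scanline. The 8x8 tile size and the MakeReflectionRay / GenerateTiledRays helpers are made up for illustration, not any actual driver or engine code.

```cpp
// Minimal sketch of the tiling idea, assuming an 8x8 screen-space tile.
// Rays emitted from the same tile start at nearby surface points, so their
// BVH traversals touch similar nodes and a single compute chiplet's caches
// absorb most of the reuse, regardless of which MCD each node lands on.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Ray { float origin[3]; float dir[3]; };

constexpr uint32_t kTile = 8;  // assumed tile edge in pixels

// Placeholder: a real version would fetch position/normal from the
// G-buffer and reflect the view vector about the normal.
Ray MakeReflectionRay(uint32_t x, uint32_t y) {
    return Ray{ { float(x) + 0.5f, float(y) + 0.5f, 0.0f },
                { 0.0f, 0.0f, 1.0f } };
}

// Emit rays tile by tile instead of scanline by scanline, so rays that are
// dispatched together originate from the same small screen region.
std::vector<Ray> GenerateTiledRays(uint32_t width, uint32_t height) {
    std::vector<Ray> rays;
    rays.reserve(static_cast<size_t>(width) * height);
    for (uint32_t ty = 0; ty < height; ty += kTile)
        for (uint32_t tx = 0; tx < width; tx += kTile)
            for (uint32_t y = ty; y < ty + kTile && y < height; ++y)
                for (uint32_t x = tx; x < tx + kTile && x < width; ++x)
                    rays.push_back(MakeReflectionRay(x, y));
    return rays;
}

int main() {
    std::vector<Ray> rays = GenerateTiledRays(1920, 1080);
    std::printf("generated %zu reflection rays in %ux%u tiles\n",
                rays.size(), kTile, kTile);
    return 0;
}
```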
 