You've got neural net-itis! There are many ways to create textures procedurally. Perhaps the most straightforward is to execute the artists' authoring steps at runtime, i.e. compile and execute a Substance material in real time rather than baking it. See the classic .kkrieger, an FPS in 96 KB.
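Roughly, the idea looks like this (a toy sketch, not Substance's or kkrieger's actual op set; the op names and structure are made up for illustration): a tiny "recipe" of ops expands into a full texture when executed.

```cpp
// Toy sketch of procedural texture generation: a small list of ops is executed
// at load time to produce pixels, instead of shipping the pixels baked.
#include <cstdint>
#include <vector>

struct Op { enum Kind { Checker, Noise, Blend } kind; float param; };

static float hashNoise(int x, int y) {
    // Cheap integer hash mapped to [0,1); stands in for a real noise generator.
    uint32_t h = uint32_t(x) * 374761393u + uint32_t(y) * 668265263u;
    h = (h ^ (h >> 13)) * 1274126177u;
    return float(h & 0xFFFF) / 65536.0f;
}

std::vector<float> executeRecipe(const std::vector<Op>& ops, int w, int h) {
    std::vector<float> tex(size_t(w) * h, 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float v = 0.0f;
            for (const Op& op : ops) {
                switch (op.kind) {
                    case Op::Checker: {
                        int cell = op.param >= 1.0f ? int(op.param) : 1;  // cell size in pixels
                        v = float(((x / cell) ^ (y / cell)) & 1);
                        break;
                    }
                    case Op::Noise:
                        v = hashNoise(x, y);
                        break;
                    case Op::Blend:
                        v = v * (1.0f - op.param) + hashNoise(x / 4, y / 4) * op.param;
                        break;
                }
            }
            tex[size_t(y) * w + x] = v;
        }
    return tex; // a few hundred bytes of "recipe" can expand into megabytes of pixels
}
```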
My recollection of kkrieger's process was that it ran through the creation steps for the assets during the game's load time. The constructed game then occupied far more than 96 KB in RAM, which allowed it to reuse those results in successive frames. An algorithm fetching instructions, inputs, and looping intermediate buffers is going to generate more accesses than it would take to simply read those results back later, and the load time seemed long enough that the largely serial component of running through a compressed list of steps could not plausibly be hidden if done on the fly. It would be a net loss if those results were discarded almost immediately and regenerated the next frame.
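In pattern form, something like this (hypothetical names, not kkrieger's actual code): the expensive creation steps run once during load, and each frame only reads the finished results.

```cpp
#include <vector>

struct Texture { std::vector<float> pixels; };

// Stand-ins for the real generator and renderer (assumed, for illustration).
Texture runCreationSteps() { return Texture{std::vector<float>(1024 * 1024, 0.5f)}; }
void drawWithTexture(const Texture&) {}

static Texture g_texture;  // results stay resident in RAM after loading

void loadTime() {
    g_texture = runCreationSteps();  // paid once, hidden behind the load screen
}

void renderFrame() {
    drawWithTexture(g_texture);      // only reads the finished results
    // The net-loss alternative: calling runCreationSteps() here would redo all
    // the instruction fetches and intermediate-buffer traffic every frame.
}

int main() {
    loadTime();
    for (int i = 0; i < 3; ++i) renderFrame();
}
```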
That's what I imagined, but could a smaller subset of the top of the tree be stored for faster sorting, with trips to RAM only needed once you reach areas in the lower levels? Potentially a permanent top-level map plus a cache of a smaller lower-level region loaded in for the necessary spaces? I suppose that only works with convergent rays, i.e. reflections; scattered light traces absolutely anywhere.
The effectiveness of a top-level cache would depend on how many accesses hit the top-level versus the acceleration structure in the bottom-level. It seems like the majority of accesses in decently complex objects would be in the bottom level.
TLBs and page walker buffers tend to store a limited set of most recently used entries. The higher levels tend to change less frequently than the lower ones, and the buffers can leverage temporal locality to save misses to cache or memory.
A large table of top-level instances may still be too big for the storage available in the L1 or local buffers of an RT core, but if there's some level of spatial or temporal locality, a buffer holding the current object and some of the most recently traversed BVH nodes could be applied to multiple rays.
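As a rough sketch of what that kind of small buffer could look like (the entry count and round-robin replacement are just assumptions for illustration), same idea as the most-recently-used entries in a TLB:

```cpp
// Tiny most-recently-used buffer of top-level instance entries: coherent rays
// can reuse recently touched entries without re-fetching from cache/memory.
#include <array>
#include <cstdint>
#include <optional>

struct InstanceEntry { uint32_t instanceId; /* transform, BLAS pointer, ... */ };

class InstanceMRU {
    static constexpr int kEntries = 8;        // tiny, like a TLB's handful of slots
    std::array<InstanceEntry, kEntries> slots_{};
    std::array<bool, kEntries> valid_{};
    int nextVictim_ = 0;                      // simple round-robin replacement

public:
    std::optional<InstanceEntry> lookup(uint32_t id) const {
        for (int i = 0; i < kEntries; ++i)
            if (valid_[i] && slots_[i].instanceId == id)
                return slots_[i];             // hit: no trip to cache or memory
        return std::nullopt;                  // miss: caller fetches and inserts
    }
    void insert(const InstanceEntry& e) {
        slots_[nextVictim_] = e;
        valid_[nextVictim_] = true;
        nextVictim_ = (nextVictim_ + 1) % kEntries;
    }
};
```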
So for next-gen RT, BW is going to be at a premium?
RT does increase shading load and adds a compute burden from BVH construction/update as well. Bandwidth use can increase, though it's apparently early days in finding out how games in general behave with it. There may be future optimizations beyond just conserving raw bandwidth, such as finding better ways of controlling the divergence of accesses. Disjoint accesses could potentially lead to stalls in the RT hardware or memory subsystem, which would look deceptively low in raw bandwidth terms.
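One frequently discussed software-side example of controlling that divergence (a sketch under assumed types, not a claim about any shipping hardware or engine) is binning rays so that the ones traversed together tend to touch the same parts of the BVH:

```cpp
// Bin rays by direction octant before traversal; rays in the same bin tend to
// walk similar BVH nodes, so their memory accesses stay more coherent.
#include <array>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };

static uint32_t directionOctant(const Ray& r) {
    return (r.dx < 0 ? 1u : 0u) | (r.dy < 0 ? 2u : 0u) | (r.dz < 0 ? 4u : 0u);
}

std::array<std::vector<Ray>, 8> binRaysByOctant(const std::vector<Ray>& rays) {
    std::array<std::vector<Ray>, 8> bins;
    for (const Ray& r : rays)
        bins[directionOctant(r)].push_back(r);
    return bins; // traverse one bin at a time; each bin's accesses overlap more
}
```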
Enter STT-MRAM. It's how we'll get more density than SRAM with near-SRAM-level performance.
Perhaps at some point, but the most recent products and announcements still had endurance falling short of SRAM and DRAM, which would make it less viable for on-die caches operating in the GHz range. Another trade-off for removing SRAM's standby current, besides endurance, is write energy, which has historically been significantly higher.
Thanks. I ended up adding this to my first revision because I thought it was relevant enough to consider, especially since HBM was considered for the X1X but decided against, with access granularity being one of the drawbacks mentioned.
What access granularity problem would there be? HBM has 8 independent 128-bit channels with a burst length of 2, so 256 bits per burst. GDDR5 has a 32-bit channel and burst length of 8, so 256 bits as well.
GDDR5X was the one that doubled prefetch on a 32-bit channel and got bursts of 512 bits.
One of the reasons cited for GDDR6's transition to two channels was to stop the increase in the width of the internal array accesses, so GDDR6 drops back down to 256 bits per access.
The other kind of granularity is the page width of the DRAM, which is usually around 2KB. GDDR5X in some configurations also doubles this, whereas HBM's pseudo-channel mode can actually halve the page width.
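To put the arithmetic from the last few posts in one place (the pseudo-channel parameters and the GDDR6 page width are my assumptions beyond what was stated above): minimum access granularity is channel width times burst length, and the row/page width is a separate, much larger granularity.

```cpp
#include <cstdio>

struct DramConfig { const char* name; int channelBits; int burstLength; int pageBytes; };

int main() {
    const DramConfig configs[] = {
        // Burst figures follow the posts above; entries marked "assumed" are my additions.
        {"HBM (legacy mode)",      128,  2, 2048},
        {"HBM (pseudo-channel)",    64,  4, 1024},  // 64-bit/BL4 assumed; page halved per the post above
        {"GDDR5",                   32,  8, 2048},
        {"GDDR5X (some configs)",   32, 16, 4096},  // doubled prefetch, doubled page
        {"GDDR6 (per channel)",     16, 16, 2048},  // page width assumed typical, not stated above
    };
    for (const DramConfig& c : configs) {
        int burstBits = c.channelBits * c.burstLength;
        std::printf("%-24s %4d bits (%2d bytes) per burst, ~%d B page\n",
                    c.name, burstBits, burstBits / 8, c.pageBytes);
    }
}
```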