How much can raytracing be separated from shading?

Shifty Geezer

A rumour in the console prediction discussion has a theoretical 7TF Navi console with 800 GB/s bandwidth and an unspecified amount of raytracing hardware.

DXR incorporates raytracing into shaders, with surfaces computed on shader cores and the accelerating RTX hardware invoked only to perform ray-intersect tests, so RT performance is bound to compute. With such a model, 7TFs of compute seems ineffective and doesn't tally with the very high bandwidth.

In the case this console is Sony's PS5, it would not be bound to DXR, and Sony would be free to pursue other avenues. Ergo, what are the options for RT solutions given such a bandwidth-heavy design? Could you trace geometry in bespoke hardware and populate a geometry/surface buffer with surface IDs, perhaps, and then shade? The traditional raytracing algorithm is very simple, tracing surfaces as it goes, iteration by iteration, but given the success of deferred rendering, is there an opportunity for some form of deferred raytracing, or something else that handles geometry and shading differently, wherein a 7TF console would only need those shaders for surface shading?
 
Sounds reasonable to me. Separating tracing and shading, and doing both separately, is what i would have expected for realtime HW.
So, it could look like this (a rough code sketch follows the list):

1. Trace for one intersection, append all hitpoints to a huge list.
2. Bin / sort hitpoints by material / textures to get cache-efficient shading like we have with rasterization.
3. Determine the next ray direction of the path (which depends on the material if we want proper MIS).
4. Optional: Reorder rays so they are spatially coherent, and we get cache-efficient tracing.
5. Continue with step 1 for the next ray of the path (submitting a large buffer of rays to RT cores).
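
A minimal CPU-side sketch of that loop, assuming hypothetical placeholder types and trace/shade functions (nothing here is a real API):

#include <algorithm>
#include <utility>
#include <vector>

struct Ray { float org[3]; float dir[3]; int pixel; };
struct Hit { int material; float t; int pixel; };

// Placeholders standing in for the RT cores and the shading kernels:
bool traceClosest(const Ray&, Hit& h)  { h = Hit{}; return false; }
bool shadeAndSpawn(const Hit&, Ray& r) { r = Ray{}; return false; }
void sortRaysByOrigin(std::vector<Ray>&) { /* e.g. Morton-code sort */ }

void pathTraceWavefront(std::vector<Ray> rays, int maxBounces) {
    for (int bounce = 0; bounce < maxBounces && !rays.empty(); ++bounce) {
        // 1. Trace one intersection per ray, append hitpoints to a big list.
        std::vector<Hit> hits;
        for (const Ray& r : rays) {
            Hit h;
            if (traceClosest(r, h))          // would run on RT cores
                hits.push_back(h);
        }
        // 2. Bin / sort hitpoints by material for cache-efficient shading.
        std::sort(hits.begin(), hits.end(),
                  [](const Hit& a, const Hit& b) { return a.material < b.material; });
        // 3. Shade and pick the next ray direction (material-dependent for MIS).
        std::vector<Ray> next;
        for (const Hit& h : hits) {
            Ray r;
            if (shadeAndSpawn(h, r))         // would run on compute cores
                next.push_back(r);
        }
        // 4. Optional: reorder rays spatially for cache-efficient tracing.
        sortRaysByOrigin(next);
        // 5. Submit the large buffer back to the tracer and repeat.
        rays = std::move(next);
    }
}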

This is how powerful HW raytracing might / should look, and initially i assumed RTX would do all this already.
Likely, tracing and shading can run in parallel: shading on compute cores while RT cores trace independently. Both put heavy demands on bandwidth, so doubling it would make sense.

Downside: Reduced general purpose compute power, because the necessary RT die area would be large this time?
Solution: Remove rasterization hardware. At this point there's no more need for hybrid. (IMO)

And yes, i know we talk about rumors here, just for fun :)
 
I suspect there is little point in deferred raytracing right now. Tracing intersections will be heavily memory bandwidth bound, not shader computation bound.

With raytracing, we start from a pixel in screen coordinates and shoot rays into our scene at different angles - since each ray needs to traverse a significant part of the BVH tree, especially when it has to be rebuilt for each frame, memory access will be very incoherent.

For each source pixel, there could be multiple hits along each ray, which would invoke anyhit shaders for all transparent surfaces and the closest-hit shader for the final non-transparent surface, each involving texture access and/or additional rays spawned towards light sources.
That could take enormous amounts of memory bandwidth, depending on the properties of the BVH partitioning algorithm.
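
A rough CPU-side sketch of that per-ray control flow (this is not the DXR API; the hit list and shader functions are illustrative placeholders):

#include <cfloat>
#include <vector>

struct Ray     { float org[3]; float dir[3]; };
struct HitInfo { int surface; float t; bool transparent; };

// Placeholders: intersectAll stands in for the (bandwidth-hungry) BVH walk;
// the two shade functions would do the texture fetches and spawn light rays.
std::vector<HitInfo> intersectAll(const Ray&) { return {}; }
void anyHitTransparent(const HitInfo&)        {}
void closestHitOpaque(const HitInfo&)         {}

void shadePrimaryRay(const Ray& ray) {
    HitInfo closest{ -1, FLT_MAX, false };
    for (const HitInfo& h : intersectAll(ray)) {
        if (h.transparent)
            anyHitTransparent(h);   // invoked for every transparent surface crossed
        else if (h.t < closest.t)
            closest = h;            // keep the nearest non-transparent surface
    }
    if (closest.surface >= 0)
        closestHitOpaque(closest);  // final shading; may spawn shadow rays
}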


I'd guess initial hardware implementation will still be based on the current DXR paradigm, where raytracing constitutes a different processing pipeline as opposed to the 'traditional' rasterization pipeline (or the new task/mesh rasterization pipeline).
These pipelines will be scheduled to run in parallel, with each one having its own command queues.

This way, the 'traditional' rasterization pipeline, and especially the meshlet-based rasterizer, would be compute-shader intensive to allow highly parallel processing of complex geometry, while the raytracing pipeline would be memory access intensive - and to maintain reasonable performance tradeoff, it would operate with simplified scene geometry and reduced texture resolution (and maybe reduced front-buffer resolution as well).
 
For each source pixel, there could be multiple hits along each ray, which would invoke anyhit shaders for all transparent surfaces and the closest-hit shader for the final non-transparent surface,
Instead of using anyhit shaders and recursive control flow, my proposal would only do closest hits, requiring multiple passes to deal with transparency.
But ignoring transparency and using hacks has worked well for games so far, and even with anyhit shaders we will continue with those hacks in any case.

I suspect there is little point in deferred raytracing right now.
Agree, but they could aim for a leap and keep quiet about it. Technically it might be possible - see the more advanced ImgTech hardware, which has reordering, and maybe material sorting too (can't remember). Both those tasks could utilize the same FF HW.

But it's not that i believe this will happen so quickly. We could make similar speculations for Ampere if we want.
Reordering is only worth it for GI, not for shadow rays or sharp reflections.
Material sorting might only be worth it for sharp reflections.
Photorealistic Path Tracing seems still out of reach and is also a very inefficient way to get there.

So this really seems like far-fetched speculation. But it's the only explanation for the rumored 7TF at 800GB/s we came up with. (...it would give the opposite impression of what we saw from AMD's TMU patent. They could have licensed ImgTech... who knows.)
 
But that's talking 7TF with RT - what about double that in 2020 GPUs, with possibly (or most likely) improved RT hardware?
 
it's the only explanation for the rumored 7TF at 800GB/s we came up with. (...it would give the opposite impression of what we saw from AMD's TMU patent

The increased bandwidth could be required for the ray intersection engine to operate. No matter which block performs the actual intersection processing and flow control, the BVH nodes still need to be walked through one by one, which involves traversing the tree multiple times in a very irregular memory access pattern.

https://devblogs.nvidia.com/thinking-parallel-part-iii-tree-construction-gpu/
https://devblogs.nvidia.com/thinking-parallel-part-ii-tree-traversal-gpu/
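
For reference, the "walked through one by one" part is essentially an iterative stack walk like the one in those articles. A simplified sketch (node layout and names are illustrative; the intersection tests are stubbed out):

#include <cstdint>

struct Ray     { float org[3]; float invDir[3]; float tMax; };
struct BVHNode { float boundsMin[3]; float boundsMax[3]; int32_t left; int32_t right; };

// Stubs for the actual tests (slab test / triangle test would go here):
bool rayBoxHit(const Ray&, const BVHNode&) { return false; }
bool rayTriangleHit(const Ray&, int32_t)   { return false; }

// Negative child indices encode leaf primitives. Every pop fetches another
// ~64-byte node from a data-dependent address, which is what makes the
// access pattern so irregular and cache-hostile.
bool traverse(const BVHNode* nodes, const Ray& ray) {
    int32_t stack[64];
    int top = 0;
    stack[top++] = 0;                        // start at the root
    while (top > 0) {
        int32_t idx = stack[--top];
        if (idx < 0) {                       // leaf: test the triangle
            if (rayTriangleHit(ray, ~idx)) return true;
            continue;
        }
        const BVHNode& n = nodes[idx];
        if (!rayBoxHit(ray, n)) continue;    // prune this subtree
        stack[top++] = n.left;               // children live elsewhere in memory,
        stack[top++] = n.right;              // so the next loads are incoherent
    }
    return false;
}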


The shader units in the AMD raytracing patent are mostly used for control flow, while the actual BVH traversal and processing is performed by the fixed function ray intersection engine, which has its own dedicated ALUs for ray intersection testing, sharing the texture data paths with the TMU to fetch BVH nodes.

US 2019/0197761 A1
Jun. 27, 2019
TEXTURE PROCESSOR BASED RAY TRACING ACCELERATION METHOD AND SYSTEM
Advanced Micro Devices, Inc.
[0021] ... the ray and box and the ray and triangle intersections are the main primitive operations during the traversal phase. These operations are very memory bandwidth intensive and have high occurrences of random accesses. For example, each ray may fetch over 24 different 64-byte nodes. These operations are also very arithmetic logic unit (ALU) and/or compute unit intensive. These ray traces suffer from very high divergence due to different traversal lengths, (where average wave utilization is 30%), are vector general purpose register (VGPR) use intensive, and waves waterfall frequently due to high probability of containing both triangle and box nodes.

[0022] ... A fixed-function BVH intersection testing and traversal (a common and expensive operation in ray tracers) logic is implemented on texture processors. This enables the performance and power efficiency of the ray tracing to be substantially improved without expanding high area and effort costs. High bandwidth paths within the texture processor and shader units that are used for texture processing are reused for BVH intersection testing and traversal. In general, a texture processor receives an instruction from the shader unit that includes ray data and BVH pointer information. The texture processor fetches the BVH node data from memory using, for example, 16 double word (DW) block loads. The texture processor performs four ray-box intersections and children sorting for box nodes and 1 ray-triangle intersection for triangle nodes. The intersection results are returned to the shader unit.

[0023] In particular, a fixed function intersection engine is added in parallel to a texture filter pipeline in a texture processor. This enables the shader unit to issue a texture instruction which contains the ray data (ray origin and ray direction) and a pointer to the BVH node in the BVH tree. The texture processor can fetch the BVH node data from memory and supply both the data from the BVH node and the ray data to the fixed function ray intersection engine. The ray intersection engine looks at the data for the BVH node and determines whether it needs to do ray-box intersection or ray-triangle intersection testing. The ray intersection engine configures its ALUs or compute units accordingly and passes ray data and BVH node data through the configured internal ALUs or compute units to calculate the intersection results. Based on the results of the intersection testing, a state machine determines how the shader unit should advance its internal stack (traversal stack) and traverse the BVH tree. The state machine can be fixed function or programmable. The intersection testing results and/or a list of node pointers which need to be traversed next (in the order they need to be traversed) are returned to the shader unit using the texture data return path. The shader unit reviews the results of the intersection and the indications received to decide how to traverse to the next node in the BVH tree.

[0024] The hybrid approach (doing fixed function acceleration for a single node of the BVH tree) and using a shader unit to schedule the processing addresses the issues with solely hardware based and/or solely software based solutions. Flexibility is preserved since the shader unit can still control the overall calculation and can bypass the fixed-function hardware where needed and still get the performance advantage of the fixed function hardware. In addition, by utilizing the texture processor infrastructure, large buffers for ray storage and BVH caching are eliminated that are typically required in a hardware raytracing solution as the existing VGPRs and texture cache can be used in its place, which substantially saves area and complexity of the hardware solution.
...
[0046] As shown and illustrated with respect to certain implementations, the intersection testing is fused with data fetch operations. The intersection testing is performed asynchronously from the shader unit similar to texture filtering operations. The intersection processing eliminates inactive lanes. Consequently, operation takes less cycles to complete on a non-fully occupied wave and effectively removes wave divergence costs from intersection testing. The texture processor does not control traversal as the traversal is controlled by the shader unit. This enables user flexibility of traversal algorithm and allows the user to install custom node types into the BVH.

[0048] It should be understood that many variations are possible based on the disclosure therein. Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements.
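
Reading paragraphs [0022] and [0023], the division of labour could be sketched roughly like this: the shader owns the traversal stack and control flow, while the texture processor resolves one node per request. A speculative C++-flavoured sketch; all names are mine, not the patent's:

struct Ray { float org[3]; float dir[3]; };

struct NodeResult {        // per [0022]: 4 ray-box tests + child sorting for
    bool  isTriangle;      // box nodes, 1 ray-triangle test for leaf nodes
    float triangleT;
    int   children[4];
    int   childCount;
};

// Stub for the fixed-function path: fetch the node over the texture data
// path and run it through the ray intersection engine.
NodeResult textureProcessorIntersect(const Ray&, int nodePointer) { return NodeResult{}; }

// The traversal loop itself stays in shader code: the shader keeps the stack
// and advances it from the returned, sorted child list ([0023]).
float traceClosestHit(const Ray& ray, int rootNode) {
    int   stack[64];
    int   top  = 0;
    float best = 1e30f;
    stack[top++] = rootNode;
    while (top > 0) {
        NodeResult r = textureProcessorIntersect(ray, stack[--top]);
        if (r.isTriangle) {
            if (r.triangleT < best) best = r.triangleT;
        } else {
            for (int i = r.childCount - 1; i >= 0; --i)
                stack[top++] = r.children[i];   // nearest child is popped first
        }
    }
    return best;
}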
 
The increased bandwidth could be required for the ray intersection engine to operate. No matter which block performs the actual intersection processing and flow control, the BVH nodes still need to be walked through one by one, which involves traversing the tree multiple times in a very irregular memory access pattern.
The shader units in the AMD raytracing patent are mostly used for control flow, while the actual BVH traversal and processing is performed by the fixed function ray intersection engine, which has its own dedicated ALUs for intersection testing, sharing texture data paths with the TMU.

Sure, but why would AMD need twice the bandwidth of NV? Because they are bad engineers? No no - if those numbers are true, secret sauce is much more likely.
Maybe the patent (which also mentions optional FF for control flow), together with "selective lighting effects", is meant just as a distraction, and they mean RT when they say "Nvidia Killer" :D

reordering, and maybe material sorting too (can't remember). Both those tasks could utilize the same FF HW.

Although offtopic, i forgot to mention HW binning would have many other applications:
Acceleration structure building for geometry, photons, rays, SPH fluids, maybe stuff like SAT for denoising, etc.
 
I found an explanation that makes sense: assuming the patent and the rumor both apply, AMD could eventually double the TMUs, so while tracing, the CU can switch to a shading wavefront and access textures using the second TMU.
That would explain the demand on BW, and even if CUs control the outer traversal loop, they should be available for shading most of the time.
So no big surprise other than higher RT performance than expected. (I did expect 2060 RT perf., but not more.)

Yay, some LOD! :D
 
They don't need to increase the TMU count for that. TMUs are dedicated fixed-function ALUs that perform texture loads and trilinear/anisotropic filtering.

If they are adding REs (Raytracing Engines), that is fixed-function ALUs for BVH loads and ray intersection tests, they would need to increase the width of the memory bus and the dedicated caches, so the bandwidth could be effectively shared between TMUs and REs working in parallel.

But the number of REs is not necessarily bound to the number of TMUs one-to-one - in fact, I would expect them to add quite a significant amount of REs, so the scheduler could optimize for rays coming in approximately the same direction, and BVH nodes could be shared and re-used by different rays to improve cache locality.
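
As an illustration of the kind of sorting such a scheduler could do, here is a trivial sketch that bins rays by direction octant so each batch traverses roughly the same part of the BVH (purely hypothetical, not a claim about any real scheduler):

#include <array>
#include <vector>

struct Ray { float org[3]; float dir[3]; };

// The three sign bits of the direction select one of 8 octants.
int directionOctant(const Ray& r) {
    return (r.dir[0] < 0.f ? 1 : 0) | (r.dir[1] < 0.f ? 2 : 0) | (r.dir[2] < 0.f ? 4 : 0);
}

// Bucket rays by octant; each bin is then traced as one coherent batch, so
// rays heading the same way tend to touch (and share) the same BVH nodes.
std::array<std::vector<Ray>, 8> binByDirection(const std::vector<Ray>& rays) {
    std::array<std::vector<Ray>, 8> bins;
    for (const Ray& r : rays)
        bins[directionOctant(r)].push_back(r);
    return bins;
}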
 
They don't need to increase the TMU count for that. TMUs are dedicated fixed-function ALUs that perform texture loads and trilinear/anisotropic filtering.

If they are adding REs (Raytracing Engines), that is fixed-function ALUs for BVH loads and ray intersection tests, they would need to increase the width of the memory bus and the dedicated caches, so the bandwidth could be effectively shared between TMUs and REs working in parallel.

But the number of REs is not necessarily bound to the number of TMUs - in fact, I would expect them to add quite a significant amount of REs, so the scheduler could optimize for rays coming in approximately the same direction, and BVH nodes could be shared and re-used by different rays to improve cache locality.

dude, the patent...
 
I've just remembered Sony were rumoured to have patented photon-mapping tech. That's a tech I know little about. How does photon-mapping shift the balance of processing versus memory access? Well, like anything, you can use different approaches to favour one or the other resource, but is there scope with photon-mapping to get proportionally better acceleration through memory consumption rather than compute, especially if you focus more on lighting than reflections, without needing per-surface calculations for secondary rays?
 
They don't need to increase the TMU count for that. TMUs are dedicated fixed-function ALUs that perform texture loads and trilinear/anisotropic filtering.

If they are adding REs (Raytracing Engines), that is fixed-function ALUs for BVH loads and ray intersection tests, they would need to increase the width of the memory bus and the dedicated caches, so the bandwidth could be effectively shared between TMUs and REs working in parallel.

But the number of REs is not necessarily bound to the number of TMUs - in fact, I would expect them to add quite a significant amount of REs, so the scheduler could optimize for rays coming in approximately the same direction, and BVH nodes could be shared and re-used by different rays to improve cache locality.

Makes more sense, but then what is the connection to TMUs at all, other than the memory bus? (This question has bothered me since reading the patent - i'm neither good with patents nor hardware :) )

Scheduler being able to sort rays would be neat. I wonder if RTX does this already.
 
I've just remembered Sony were rumoured to have patented photon-mapping tech. That's a tech I know little about. How does photon-mapping shift the balance of processing versus memory access? Well, like anything, you can use different approaches to favour one or the other resource, but is there scope with photon-mapping to get proportionally better acceleration through memory consumption rather than compute, especially if you focus more on lighting than reflections, without needing per-surface calculations for secondary rays?
Don't know much either, but photons need some spatial structure to access them, and unlike an RT BVH the data changes completely every frame, so they can not refit bottom levels but may need a full rebuild each frame. Likely very BW intense.
How much this could save on compute may be a question of additional FF HW for photon gathering and those rebuilds.
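
To make the rebuild-every-frame cost concrete, here is a toy sketch of a photon map held in a hash grid; the grid type, cell size and hash constants are my own illustrative choices, nothing from Sony's patent:

#include <cmath>
#include <unordered_map>
#include <vector>

struct Photon { float pos[3]; float power[3]; };

struct HashGrid {
    float cellSize = 0.25f;   // assumed tuning parameter
    std::unordered_map<long long, std::vector<Photon>> cells;

    long long key(const float p[3]) const {
        long long x = (long long)std::floor(p[0] / cellSize);
        long long y = (long long)std::floor(p[1] / cellSize);
        long long z = (long long)std::floor(p[2] / cellSize);
        return (x * 73856093LL) ^ (y * 19349663LL) ^ (z * 83492791LL);
    }

    // Full rebuild each frame: every photon gets rewritten to memory.
    void rebuild(const std::vector<Photon>& photons) {
        cells.clear();
        for (const Photon& ph : photons)
            cells[key(ph.pos)].push_back(ph);
    }

    // Gather at a shading point: each lookup touches a scattered cell.
    const std::vector<Photon>* cellAt(const float p[3]) const {
        auto it = cells.find(key(p));
        return it == cells.end() ? nullptr : &it->second;
    }
};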
 
dude, the patent...
The patent does not claim any particular issue ratio between texture filtering and ray intersection units in the texture processor.

what is the connection to TMUs at all, other than the memory bus?
The essential claims of the patent are that the texture addressing unit and texture caches are reused for loading BVH nodes and communicating with the shader unit wavefronts.

There could be various implementations of the ray intersection engine: 1) the TMU is designed as one single block, so the texture filtering unit and the ray intersection engine are switched functions tied to a single texture addressing block, or 2) texture filtering and ray intersection run in parallel, with their texture addressing blocks being a kind of scheduled, out-of-order wavefront processor with concurrent access to the memory controller.
 
Where the texture is, there must be a polygon? or something

...along those...

...lines?

 
Actually, no. There needs to be a surface, or rather instructions to shade a pixel (or sample) a certain way. You could use a texture to colour a 3D fluid-simulated volume without any triangles. You could use a texture to shade pixels based on ray depth, or anything. All that's needed, somehow, is to create an arrangement of colouring instructions in relation to the scene geometry. So, as a random example: trace 64x64 pixel tiles, populate each pixel with surface data, then send that tile off to the shaders to shade while the next tile is worked on. The problem then, I think, becomes one of batching shaders to work on coherent pixels.
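
A toy sketch of that tile pipeline; all names and types are hypothetical, and traceTile / shadeTileAsync just mark where the tracing hardware and the shader cores would sit:

#include <vector>

constexpr int TILE = 64;

struct SurfaceSample { int surfaceId; float t, u, v; };
using Tile = std::vector<SurfaceSample>;   // TILE * TILE samples

// Placeholders: tracing would run on dedicated RT hardware, shading would be
// batched onto the shader cores, overlapping with the trace of the next tile.
Tile traceTile(int tx, int ty)   { return Tile(TILE * TILE); }
void shadeTileAsync(const Tile&) {}

void renderFrame(int tilesX, int tilesY) {
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx) {
            Tile t = traceTile(tx, ty);    // populate each pixel with surface data
            shadeTileAsync(t);             // shade this tile while the next one traces
        }
}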
 
Just compared Vega64 vs. Radeon VII on techpowerup. The latter has 122% relative performance. Both are around 13 TF, with a difference of about one, while the VII has a bit more than 2x the bandwidth. Just to show there is likely not much benefit from 800GB/s for rasterization and shading in games.

I would rule out fancy stuff like reordering because it's not worth it yet. The same applies to material sorting. (If schedulers can do some in-place sorting, it will take even longer until such things become a win in practice.)
I would also rule out photon mapping. I think it is nice for offline, but for realtime i see much better options like texture space shading or other surface caching methods (although with some hurdles in content creation and engine complexity - stuff the movie industry would not want).
And i don't believe in 3rd party extra hardware or big differences between Sony / MS.

So in the end i'm back to the start, assuming very similar tech to RTX / DXR, but perf closer to the highest-end NV models, not the lower-end ones. To get there, the 800GB/s might just be more effective than more TF / a larger GPU.
I would not be surprised by HW BVH build, but anything else sounds too far-fetched.
 