Next gen lighting technologies - voxelised, traced, and everything else *spawn*

Question regarding the number of rays required in games. Imagination Tech stated that for a fully ray traced game you would need to budget between 3 and 3.5 GRays/s at 1080p, 60 fps (counter 27:43). Based on some statements in this thread it sounds like that would be too few, and I wondered why their lower limit might not be an accurate approximation for use in games.
RT discussion begins at counter 18:00
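
For rough context (my own back-of-the-envelope math, not from the talk): 1920 × 1080 × 60 ≈ 124 M pixels per second, so a 3 GRays/s budget works out to roughly 24 rays per pixel per frame, which then has to cover primary visibility, shadows, reflections and any GI bounces.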

Very interesting PowerVR hardware raytracing insight.
Especially the explanation of their 'coherency engine'.
Nvidia seems to lack anything like that, and this likely explains why actual raytracing performance is such a small fraction of the claimed theoretical GRays/s.
The reflection slide shows it nicely: rays are sorted according to where they hit.
Next, batches of rays with coherent hits are shaded, which results in efficient SIMD shader usage and cache locality for texture fetches.
I guess with the next generation of GPUs Nvidia will have that too, resulting in much better actual raytracing performance.
 
We've been asking how PVR handled ray tracing without any word. If this talk goes into detail, it'd be well worth watching.
 
Nvidia seems to lack anything like that, and this likely explains why actual raytracing performance is such a small fraction of the claimed theoretical GRays/s.
The reflection slide shows it nicely: rays are sorted according to where they hit.
Next, batches of rays with coherent hits are shaded, which results in efficient SIMD shader usage and cache locality for texture fetches.
I guess with the next generation of GPUs Nvidia will have that too, resulting in much better actual raytracing performance.

Yeah, question of the year :)
I always assumed they would batch similar rays for tracing and similar hits for shading, likely with CUDA cores under the hood. And they do this already on Volta, which has 'extended compute scheduling options' or something like that. This would explain why Volta is much faster at RT than the 1080 Ti.
I've already thought through such a batching algorithm for compute RT. It's less complicated than I initially thought... can't wait to finish my geometry processing work and try it out. I'm optimistic about rivalling GTX performance now, but my reflections will still end up more blurry...
(I need to hurry, their next gen might destroy me, haha)
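
Just to make the 'batch similar rays for tracing' part concrete, here's a toy CPU-side sketch of the kind of binning I have in mind (all names are mine, nothing to do with NV's actual scheme): rays get keyed by direction octant plus a coarse origin cell, so each bucket tends to walk similar BVH branches.

```cpp
#include <cstdint>
#include <cmath>
#include <vector>
#include <array>

struct Ray { float ox, oy, oz, dx, dy, dz; };

// Bin key: 3 bits for the direction octant, 2 bits per axis for a coarse origin cell.
// Rays sharing a key tend to traverse similar BVH branches, so tracing them as one
// batch improves node reuse and cache locality compared to a random ray order.
static uint32_t binKey(const Ray& r, float sceneSize)
{
    uint32_t octant = (r.dx < 0.f ? 1u : 0u) | (r.dy < 0.f ? 2u : 0u) | (r.dz < 0.f ? 4u : 0u);
    auto cell = [&](float p) { return uint32_t(std::fmin(3.f, std::fmax(0.f, p / sceneSize * 4.f))); };
    return (octant << 6) | (cell(r.ox) << 4) | (cell(r.oy) << 2) | cell(r.oz);
}

// Scatter rays into 512 buckets; each bucket is then traced as one coherent batch.
std::array<std::vector<Ray>, 512> binRays(const std::vector<Ray>& rays, float sceneSize)
{
    std::array<std::vector<Ray>, 512> buckets;
    for (const Ray& r : rays)
        buckets[binKey(r, sceneSize)].push_back(r);
    return buckets;
}
```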
 
Very interesting PowerVR hardware raytracing insight.
Especially the explanation of their 'coherency engine'.
The reflection slide shows it nicely: rays are sorted according to where they hit.
Next, batches of rays with coherent hits are shaded, which results in efficient SIMD shader usage and cache locality for texture fetches.

sounds like a thing that can still be implemented in software on DXR.
 
Most mobile GPUs have no LDS memory, so you can't do anything useful fast. Maybe that's a reason they need FF HW for batching and NV might not.
 
sounds like a thing that can still be implemented in software on DXR.
I'm not too sure how this could be done in software and how efficient that would be.
Since the ray coherency sorting is based on where rays hit, it would need to be a stage between BVH traversal and shading.
 
I'm not too sure how this could be done in software and how efficient that would be.
Since the ray coherency sorting is based on where rays hit, it would need to be a stage between BVH traversal and shading.
Yes, and I assume there are multiple stages going on. The API allows this: rays are independent, so single threaded, and there is no access to LDS.
So they can terminate the program at ray emit, sort stuff for efficiency, trace, shade, and then 'return' to the program, which can shuffle rays even to other cores; they only need to restore registers and the program counter. So 'recursion' is not really recursion in the sense we think of on a CPU.
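
To make that concrete, a toy sketch of the control flow I imagine (all types and stubs invented by me, not NVidia's actual scheme): the 'registers and program counter' become a small per-ray struct, and each recursion level is just another pass over a buffer of suspended rays.

```cpp
#include <cstdint>
#include <vector>

// Everything a suspended ray needs to be resumed later, possibly on a different
// core: the "registers and program counter" boil down to a small struct.
struct RayState {
    uint32_t pixel;       // where the final result goes
    uint32_t bounce;      // current path depth
    float    throughput;  // simplified to one channel instead of RGB
    uint32_t hitKey;      // filled in by the trace stage (instance/material id)
};

// Stub stages, standing in for the real traversal and shading work.
void trace(std::vector<RayState>& batch)        // fills hitKey for every ray
{
    for (RayState& s : batch) s.hitKey = s.pixel % 8;
}
bool shade(const RayState& s, RayState& next)   // may emit one follow-up ray
{
    next = s;
    next.throughput = s.throughput * 0.5f;
    return next.throughput > 0.01f;
}

// Instead of recursing per ray, each "recursion level" is just another pass over
// a buffer of suspended rays. Between trace() and the shading loop, a sort by
// hitKey (or any other coherence key) could be slotted in without the rays caring.
void wavefront(std::vector<RayState> rays, uint32_t maxBounces)
{
    while (!rays.empty()) {
        trace(rays);
        std::vector<RayState> next;
        for (const RayState& s : rays) {
            RayState secondary{};
            if (shade(s, secondary) && s.bounce + 1 < maxBounces) {
                secondary.bounce = s.bounce + 1;
                next.push_back(secondary);
            }
        }
        rays.swap(next);
    }
}
```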

Do you think this would be unrealistic for some reason?

The alternative is terrible: execute ray generation on a CUDA core, submit to the RT core and wait. The RT core traverses the BVH for each ray incoherently (inefficient even with FF, I guess). Wake up the sleeping CUDA core, shade each hit point by material (already impossible to have a unique shader per thread), return to the ray shader, etc... horribly inefficient.

We just don't know which compromise they use, but some kind of batching must be there, and likely not just on the shading part. Their 10 years of RT experience surely isn't for nothing?
 
The alternative is terrible: execute ray generation on a CUDA core, submit to the RT core and wait. The RT core traverses the BVH for each ray incoherently (inefficient even with FF, I guess). Wake up the sleeping CUDA core, shade each hit point by material (already impossible to have a unique shader per thread), return to the ray shader, etc... horribly inefficient.
AFAIK, BVH traversal is always incoherent for the lower layers, but luckily there's also pretty good cache locality for the upper layers of the tree. And the cost of swapping out the stack mid-traversal and sorting for a better-fit candidate would easily exceed the gains, if there are any at all: even if you were to bin ray segments by area, that would only give you 1-2 layers of coherency in BVH access, and beyond that it's once again incoherent access unless the rays were almost identical.

Anyway, BVH traversal isn't where the cost is exploding yet, it's only on actual hit. The FF hardware is also doing a good job there, I suppose.

Sorting for coherency happens after a hit, before shading begins. For PowerVR it happens in hardware; for DXR it's currently hacked in on the application side.
If anything, you could think about starting to bin on a near hit already (prior to testing the ray against the triangle soup), in the expectation that you might gain some locality there in case the BVH depth is actually insufficient.
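
As a rough idea of what that application-side hack could look like (nothing DXR-specific here, all names made up): hits coming back from traversal get keyed by material, sorted, and only then shaded, so each batch runs one shader and hits the same textures.

```cpp
#include <cstdint>
#include <vector>
#include <algorithm>

struct HitRecord {
    uint32_t materialId;  // sort key for shading coherence
    uint32_t primitiveId;
    float    t;           // hit distance along the ray
    uint32_t pixel;
};

// Group hit records by material before shading. Each run of equal materialId can
// then be shaded as one dispatch, keeping SIMD lanes on the same shader and the
// texture cache on the same pages.
void shadeCoherently(std::vector<HitRecord>& hits)
{
    std::sort(hits.begin(), hits.end(),
              [](const HitRecord& a, const HitRecord& b) { return a.materialId < b.materialId; });

    size_t begin = 0;
    while (begin < hits.size()) {
        size_t end = begin;
        while (end < hits.size() && hits[end].materialId == hits[begin].materialId)
            ++end;
        // shadeBatch(hits[begin..end)) would go here: one material, one batch.
        begin = end;
    }
}
```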

Even though it's really a failure on NVidia's part that they didn't already do that in the driver.
I suppose it's because that would incur a static overhead on every ray cast, which would ruin their nice marketing statement about peak performance. And it's probably also futile if the scene complexity or number of rays exceeds certain limits, as at some point you would have to literally enqueue millions of hits to achieve any form of locality, at which point sorting is hardly worth it any more. The same goes for under-sampling too much, where there isn't any chance of actually utilizing a full cache line either.

So 'recursion' is not really recursion in the sense we think of on a CPU.
It is ordinary recursion, but the BVH has a fixed depth by design in the current implementation, so you do have recursion within a fixed stack size. Shuffling to another core would also mean shuffling the stack, easily a few hundred bytes (coordinates plus a pointer to the BVH-tree node for each stack frame). You can only discard the stack once you have a hit.
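
To put a rough number on 'a few hundred bytes' (my own estimate, assuming a 32-level tree and a minimal frame layout):

```cpp
#include <cstdio>
#include <cstdint>

// One entry of a per-ray traversal stack: the node still to be visited plus the
// ray interval that can still produce a closer hit below it.
struct TraversalFrame {
    uint32_t nodeIndex;  // "pointer" to the BVH node, as an index into the node array
    float    tMin;       // entry distance into this subtree's bounds
    float    tMax;       // exit distance
};

int main()
{
    constexpr int maxDepth = 32;  // assumed fixed tree depth
    std::printf("per frame: %zu bytes, per ray: %zu bytes\n",
                sizeof(TraversalFrame), maxDepth * sizeof(TraversalFrame));
    // 12 bytes * 32 levels = 384 bytes of live state that would have to travel
    // with the ray if it migrated to another core mid-traversal.
}
```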
 
Shuffling to another core would also mean shuffling the stack, easily a few hundred bytes (coordinates plus a pointer to the BVH-tree node for each stack frame). You can only discard the stack once you have a hit.
Yeah, makes sense - I knew I forgot about something :)
 
I thought the coherency was already loosely hacked in by the applications, using a coarse-grained BVH model before using the actual detailed BVH models.
 
It would be possible to pick only one light per frame randomly and accumulate over time.
I blindly assumed this is the case in the Q2 demo, but wanted to be sure, so I checked the file above. But I can't figure it out quickly. If they iterate over all lights in the list, there would be many more rays per pixel than 4(!)

If you read the link for the Quake project they explain how the lighting works. For each frame they select four lights with the largest impact based on various criteria.
 
I thought the coherency was already loosely hacked in by the applications, using a coarse-grained BVH model before using the actual detailed BVH models.
The main reason for this is to avoid building the BVH each frame from scratch. So the low-level trees can be built for a character or a vehicle only once, and only the high-level tree over all the low-level ones needs to be rebuilt per frame; the low-level trees only require refitting to the animation.
It could make sense to batch rays per low-level tree because they likely have similar materials. But for the static scene this would not help much, because that requires no fine-grained low-level trees.
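
In DXR terms that's the split between bottom-level and top-level acceleration structures; a rough sketch of the per-frame pattern (the function names are placeholders, not the real API):

```cpp
#include <vector>

struct Mesh     {};                                    // one object's triangles
struct Blas     {};                                    // low-level tree over one mesh
struct Instance { const Blas* blas; float transform[12]; };
struct Tlas     {};                                    // high-level tree over all instances

Blas buildBlas(const Mesh&)                  { return {}; } // expensive: done once per mesh
void refitBlas(Blas&, const Mesh&)           {}             // cheap: bounds follow the animated pose
Tlas buildTlas(const std::vector<Instance>&) { return {}; } // cheap enough to redo every frame

struct Scene {
    std::vector<Mesh>     meshes;
    std::vector<Blas>     blases;     // filled once at load time via buildBlas()
    std::vector<Instance> instances;  // one per object placed in the world
};

// At load: build each low-level tree exactly once.
void loadAccelerationStructures(Scene& s)
{
    for (const Mesh& m : s.meshes)
        s.blases.push_back(buildBlas(m));
}

// Per frame: only refit the low-level trees to the animation, and rebuild the
// high-level tree over all instances from scratch.
Tlas updateAccelerationStructures(Scene& s)
{
    for (size_t i = 0; i < s.blases.size(); ++i)
        refitBlas(s.blases[i], s.meshes[i]);
    return buildTlas(s.instances);
}
```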

If you read the link for the Quake project they explain how the lighting works. For each frame they select four lights with the largest impact based on various criteria.
I can't find this. Are you sure? Number 4 only appears at the bottom where they say 'at least 4 rays per pixel'. (Which would mean 1 bounce, but the define in the source says 2 bounces.)
I'd need to look closer to figure out how many lights they choose...
 
The main reason for this is to avoid building the BVH each frame from scratch. So the low-level trees can be built for a character or a vehicle only once, and only the high-level tree over all the low-level ones needs to be rebuilt per frame; the low-level trees only require refitting to the animation.
It could make sense to batch rays per low-level tree because they likely have similar materials. But for the static scene this would not help much, because that requires no fine-grained low-level trees.


I can't find this. Are you sure? Number 4 only appears at the bottom where they say 'at least 4 rays per pixel'. (Which would mean 1 bounce, but the define in the source says 2 bounces.)
I'd need to look closer to figure out how many lights they choose...

I misremembered what it said.

Since Quake II originally shipped with a Potentially-Visible-Set for culling occluded parts of the scene, we instead resorted to extracting lists of potentially visible lights from these sets for every part of the scene. In the current version, we implemented a semi-accurate estimation of the influence of every light in the list, by randomly picking a representative subset of these lists in every frame. Thus, the renderer quickly hits all light sources over time, and we can also do all influence estimation in a controlled, constant time budget.
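
Not their actual code, but a toy sketch of that scheme as I read it: each cluster of the map keeps its list of potentially visible lights, and every frame only a small random subset of that list gets evaluated.

```cpp
#include <cstdint>
#include <vector>
#include <random>
#include <algorithm>

// One PVS cluster of the map, with the indices of all lights that can possibly
// reach it (extracted once from the precomputed visibility data).
struct Cluster { std::vector<uint32_t> visibleLights; };

// Every frame, evaluate only a small random subset of the cluster's light list.
// The per-frame cost stays constant, and because the subset changes each frame,
// every light in the list gets sampled over time; the temporal filter averages it out.
std::vector<uint32_t> pickLightsForFrame(const Cluster& c, size_t budget, std::mt19937& rng)
{
    std::vector<uint32_t> subset = c.visibleLights;
    std::shuffle(subset.begin(), subset.end(), rng);
    if (subset.size() > budget)
        subset.resize(budget);
    return subset;
}
```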
 
I'm not sure the code using the NUM_BOUNCES define mentioned earlier is only active if RTX is off (likely using OptiX then).
There is an RTX section of code that builds a path of 10(!) bounces, but it does not light the final hit - makes no sense... I don't know if the #define RTX is on - more likely it is inactive and just misleading. There are many files to check to make sure... maybe I'll find it. Confused...

...ok, no evidence for the RTX define being on, and the other code makes perfect sense to me.
It seems they pick one random light and there is one bounce active, as expected. Pretty sure, but take it with a grain of salt.
In the code they treat the first hit as the first bounce, thus the NUM_BOUNCES = 2 define is misleading and in fact it is just one bounce.
This also fits the description on the webpage of 4 rays per pixel (presumably primary ray + shadow ray at the primary hit + bounce ray + shadow ray at the bounce hit).

If anyone is interested, I could try to compile the shader with more bounces. I could upload the SPIR-V file, and replacing it would allow testing. (SPIR-V is bytecode, so no risk of catching my viruses.)
But that's some work and I'm not sure I'd succeed...
 
Notice there is no surface shader used here. They have a constant material for everything and just fetch the texture at each hit.
This could be done with any game, at least for 90% of the scene. A much better test case than BFV.
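
If I read it right, per-hit shading then boils down to something like this toy snippet (my own illustration, not their shader):

```cpp
#include <algorithm>

struct Texture { const float* texels; int width, height; };  // single channel for brevity

// Shading at a hit point with one constant material for everything: fetch the
// texel via the hit's UVs and apply the same diffuse response for every surface.
// No per-material shader selection anywhere.
float shadeHit(const Texture& tex, float u, float v,
               float nx, float ny, float nz,   // surface normal at the hit
               float lx, float ly, float lz)   // direction towards the light
{
    int x = std::clamp(int(u * tex.width),  0, tex.width  - 1);
    int y = std::clamp(int(v * tex.height), 0, tex.height - 1);
    float albedo = tex.texels[y * tex.width + x];
    float nDotL  = std::max(0.f, nx * lx + ny * ly + nz * lz);
    return albedo * nDotL;
}
```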
 
Quake 2 Path Tracing demo benchmarks:
Titan V @1080p: 34 fps
2080Ti @1080p: 100 fps

I'm a little bit more impressed by the HW now, 3x faster is better than 2.5x :)

This is interesting:

"
There are quite a few cvars leftover from Q2PRO that are ignored by Q2VKPT. In particular it ignores the anisotropic filtering cvar. It seems I accidentally disabled it at some point though :( (see src/refresh/vkpt/textures.c (579))

It might be interesting to also benchmark the "rtx on" mode, which turns all surfaces into a mirror (note that it does _not_ control the usage of the RTCores, these will always be used). Q2VKPT will trace around 10 rays per pixel in "rtx on"-mode and performs only trivial shading. This should give a better evaluation of the actual raycasting performance. Full path tracing requires lots of additional expensive computation that cannot run on the RTCore.

You can use vkpt_profiler 1 to show execution times measured on the GPU. The cpu should be close to idle.

If you are curious about some inner workings you can use vkpt_reconstruction 0/1/2 to display: actual filter input / denoised / temporal filtering weight
"

So the RTX define that puzzled me in the source makes 10 mirror bounces but no lighting; it's likely meant as a perf test.
We could conclude that running with 4-5 bounces instead would have only a minor performance impact (on the Titan V), and this would be enough for 'realistic' indoors! (Not sure if it would have a noticeable visual impact.)
 
^^ I think the RTX Mode ON (10 rays per pixel, crazy bouncing of mirrors with a simple surface) shows how trivial the ray/triangle intersection tests are on the RT core (it runs really well!) vs. how much more expensive shading of the results is with "RTX off" and fewer rays per pixel. The performance difference in context aligns with what we have been seeing across BFV and the Northlight presentations.

edit: just so people are not confused about what I mean, the "RTX off" cvar still uses RTX and the RT core, as the game is still path traced, but it is not making every surface a simple mirror. Rather, it's just compositing it based upon the material and texture it usually has. AKA, the game looks normal.
 