Next gen lighting technologies - voxelised, traced, and everything else spawn

troyan · Feb 10, 2019

JoeJ said:
Never seen a 10 times factor in practice, and the specific workload is not always what we need for games.
The goal is not to 'beat' 2060 with a cheaper chip, the goal would have been to achieve similar quality and performance using algorithms tailored to the specific need games require.

nVidia and their employees have published numbers.
So how are you doing the same kind of BF5 reflections without DXR?

What we want is to find the most effective combination of tech, now including RTX. RTX alone does not bring photorealims to games - we would have seen this already. It can help a lot, but we need to find out it's strength, which is accuracy, not performance IMO.

Battlefield 5 doesnt look better without DXR reflections. In fact without the use it is lifeless and full of artefacts.

JoeJ · Feb 10, 2019

Shifty Geezer said:
1) Get fixed function HW and realtime RT hybrid games out there now, and lose access to the more efficient Future Methods for 10 years, resulting in slower RT effects in games for 10 years
2) Have a slower introduction of RT technologies that are more flexible and get weaker gains now, but through more flexible solutions gain more significant gains through Future Methods, resulting in significantly faster solutions 5 years from now and going forwards

3) Just expose dynamic dispatch, which most hardware can do for years. Do this 3 years ago, and you would see BFV kind of reflections now most likely. It worked on Volta, by emulating a restrictive API and in a rush. By bypassing this API and doing it as efficient as possible it would be much faster of course.

troyan · Feb 10, 2019

Maybe somebody can combine my both postings, thx.

Shifty Geezer said:
The argument is that the short-term gains come with notable long-term losses in reduced R&D and dead-end/restricted algorithm developments.

1) Get fixed function HW and realtime RT hybrid games out there now, and lose access to the more efficient Future Methods for 10 years, resulting in slower RT effects in games for 10 years
2) Have a slower introduction of RT technologies that are more flexible and get weaker gains now, but through more flexible solutions gain more significant gains through Future Methods, resulting in significantly faster solutions 5 years from now and going forwards

You know that RT Cores are the result of a research and development process in this field? They dont exist as a byproduct of some freak experiment. Nobody has stopped developers to put Raytracing into their games. But there must be a reason why Dice was the first developer who has done it with DXR.

BTW:The same happened with TensorCores in Googles TPU and Volta.

JoeJ · Feb 10, 2019

troyan said:
You know that RT Cores are the result of a research and development process in this field? They dont exist as a byproduct of some freak experiment. Nobody has stopped developers to put Raytracing into their games. But there must be a reason why Dice was the first developer who has done it with DXR.

Sounds pretty arrogant and sheep, to be honest. You ignore all the realtime RT work there was before out of NV, you ignore the alternatives and any argument discussed here, and you quote their marketing phrases. That's not that interesting and has been said often enough.

troyan · Feb 10, 2019

This isnt about nVidia. ImgTech did the same with their RT acceleration two years ago. And their approach never got the same hostile respone.

Shifty Geezer · Feb 10, 2019

troyan said:
You know that RT Cores are the result of a research and development process in this field?

Yes. The RTX units are invaluable to offline raytracing where nVidia has been working on GPU acceleration, and well worth including in GPUs designed for professionals.

They dont exist as a byproduct of some freak experiment. Nobody has stopped developers to put Raytracing into their games.

The inclusion of RTX encourages devs to use an accelerated ray-tracing solution using nVidia's BVH structure rather than explore alternatives like cone-tracing. If the RTX acceleration units were as effective for cone-tracing and other methods, including the exploration of new, as of yet unthought of, methods (doesn't matter how big your research team is, it won't manage the R&D of tens of thousands of GPU coders exploring), the development of games would be more diverse and explorative.

But there must be a reason why Dice was the first developer who has done it with DXR.

nVidia partnered with them to integrate RTX.

BTW:The same happened with TensorCores in Googles TPU and Volta.

Tensor cores are maths accelerators. They don't limit any ML algorithms and were included to solve the limitations of ML, not solve a specific problem. The BVH units in RTX are designed to solve a particular problem - traversing a particular memory structure - as opposed to being versatile accelerators. They are akin to the inclusion of video decode blocks. These video decode blocks were included after the need for video decoding was ascertained as pretty vital to any computer and the codec defined, after years and years of software decoding on the CPU gravitated towards an 'ideal' solution worth baking into hardware. The inclusion of BVH blocks has happened before the need for BVH has been ascertained as vital for games and before the format of the best memory structures for realtime lighting, shadowing, and reflections/refractions/transparencies have been identified.

Shifty Geezer · Feb 10, 2019

troyan said:
This isnt about nVidia. ImgTech did the same with their RT acceleration two years ago. And their approach never got the same hostile respone.

Do you have details on how ImgTec accelerated raytracing? Did they use a fixed vendor-specific BVH structure? Or something else?

JoeJ · Feb 10, 2019

Response to RTX isn't hostile here. It's hostile if you look at dump comments in gaming sites, but not here. This is a discussion forum, and discussion does not work by presenting marketing facts that everyone already knows at page 51.
So if you rule out any alternative to RTX, you have to present arguments against the alternatives, or how RTX does better. You could be sure i would agree to many of them.

troyan · Feb 10, 2019

Shifty Geezer said:
Do you have details on how ImgTec accelerated raytracing? Did they use a fixed vendor-specific BVH structure? Or something else?

As far as i understand it, they used are similiar way like Turing:
https://www.imgtec.com/blog/powervr-hardware-accelerated-ray-tracing-api/
http://cdn.imgtec.com/sdk-presentat...l_Rasterization_Graphics_with_Ray_Tracing.pdf

Shifty Geezer · Feb 10, 2019

Does the scene hierarchy generator have a fixed structure? If so, the same criticisms would be levied at PVR. The only reason they weren't is because no-one was talking about PVR as no-one was using the hardware.

If ImgTec released a PC GPU with RT and fixed function units in 2016, you might have the same discussion as here. It may also have avoided the same discussion if investigation into other lighting solutions wasn't quite as progressed on compute as it is now. JoeJ will be able to say whether he was exploring realtime lighting solutions in 2016 and whether he'd have found fixed function accelerators a novel idea or a limiting one back then.

However, that's irrelevant. The discussion is about the technology, not the ISP. Anyone thinking the debate is shaded by the IHV should back out of this thread, leaving those who want to talk about the technologies to debate without platform prejudice.

Ext3h · Feb 10, 2019

JoeJ said:
3) Just expose dynamic dispatch, which most hardware can do for years. Do this 3 years ago, and you would see BFV kind of reflections now most likely. It worked on Volta, by emulating a restrictive API and in a rush. By bypassing this API and doing it as efficient as possible it would be much faster of course.

You mean just streaming dispatch?
So, effectively indirect dispatch, but generalized to be streaming based on a shader writable, counting semaphore instead (plus second semaphore for exit condition)?
Should be possible to map on all existing hardware out there (can be serialized to indirect dispatch), and if batch size is constant, also quite efficient on hardware with full support...
At least it stays trivial as long as the API guarantees that dispatch may be deferred until the exit condition has been signaled.

It does get tricky when you require self-dispatch, at that time emulating by indirect dispatch does require looping, which involves either CPU round trip or mandatory hardware/firmware support. CUDA style dynamic parallelism with nested dispatch is yet another beast.

And the actual problem is, you still have a ping-pong like control flow in there (alternating between BVH traversal, filtering potential candidates, hit shader), which doesn't map to a simple, strictly forward streaming dispatch, but implies feedback loop.

If you have this sort of ping-pong pattern, what you would actually need to do, is to do it all in a single kernel which is toggling between the different possible operations per thread block. And then do your own sub-command-queues / work-queues in software.
Using different thread blocks of a single dispatch for different, divergent code paths feels odd, I know. But cooperative execution of divergent control flows of a single kernel is a surprisingly effective method.

JoeJ · Feb 10, 2019

Shifty Geezer said:
If ImgTec released a PC GPU with RT and fixed function units in 2016, you might have the same discussion as here.

My personal opinion about mobile RT is the exact opposite: Compute is no alternative and only FF can do it at all, and second: I do not understand the need for RT on mobile, while on PC / console i do.

The former comes from my assumption mobile compute is too slow. Mostly because LDS is often emulated by main memory backed by cache. So you have little benefit from parallel algorithms, and restrictions like in pixel shaders apply in practice.
Btw, most doubts efficient RT in compute can be done at all likely come from two arguments: Building acceleration structure takes too much time (solution: refit instead full rebuild), and traversal per thread is too slow (solution: don't do it this way). Both of those arguments are outdated.
But i'm not up to date with mobile and might be wrong. Turing shares LDS with cache now too i think.
Anyways i would not complain against FF on mobile.

The second is maybe explained by AR. We have discussed ImgTec here some time ago, and there was a video (including details about hardware if you missed this, i remember frame accumulator and a unit to handle ray batching).
In the video they talked about necessary performance for games, and i wondered why they would be willing to spend most performance on DOF.
Now i guess they aim for AR with eye tracking and gaze detection. Rasterization can't do DOF properly, but it may be very important for user feedback here. (Gaze also allows to reduce detail, so foveated rendering etc. - helps with perf.)
So that's likely the mobile industrys interest here, not really just games.

The Wizard GPU... if it would have become a thing (?). It would have been no success for games i guess, similar to Ageias physics accelerator. Everybody would have said: 'We don't need RT. No second GPU just for that!'... We would not argue

Recently i have read a blog from a developer, and he made this interesting speculation: NV simply can not talk about how their RT works - they would have to expect legal issues from ImgTec patents.

Maybe this goes so far that they can not implement ray batching at all for this reason, and RTX is indeed a naive BVH per thread traversal, similar to the Radeon Rays code i have criticized earlier. But i don't think so. Likely you just have to use a different algorithm and there are many.
Anyways patents are bad. E.g. the algorithm shown in Danger Planet video is patented too. Maybe that's the reason it was never used in a finished game. (Bunnell worked on two games - don't know what's gone wrong.) The algorithm has a big problem, but for Open World it would beat RTX GI with ease.

JoeJ · Feb 10, 2019

Ext3h said:
You mean just streaming dispatch?
So, effectively indirect dispatch, but generalized to be streaming based on a shader writable, counting semaphore instead (plus second semaphore for exit condition)?
Should be possible to map on all existing hardware out there (can be serialized to indirect dispatch), and if batch size is constant, also quite efficient on hardware with full support...

It does get tricky when you require self-dispatch, at that time emulating by indirect dispatch does require looping, which involves either CPU round trip or mandatory hardware/firmware support.

Problem is, you still have a ping-pong like control flow in there (alternating between BVH traversal, filtering potential candidates, hit shader), which doesn't map to a simple, strictly forward streaming dispatch, but implies feedback loop.

If you have this sort of ping-pong pattern, what you would actually need to do, is to do it all in a single kernel which is toggling between the different possible operations per thread block. And then do your own sub-command-queues in software.
Using different thread blocks of a single dispatch for different, divergent code paths feels odd, I know. But cooperative execution of divergent control flows of a single kernel is a surprisingly effective method.

That's a bit over my head... not sure about the terminology. I know a bit about the OpenCL 2.0 possibilities, but i have never used it and so... i have to admit i do not know exactly what i want, and i also don't know what exactly the hardware could do.
In any case it depends on the latter, i'll just use what's possible. But the problem i face the most is this:

I have a tree with 16-20 levels, and many of my shaders process one level, memory barrier, process next level depending on results. (Like making Mip Maps)
Mostly only 3 levels of the tree have a need to do any work at all.
But i still have to record all indirect dispatches to the command buffer, including all the useless barriers. This causes many bubbles. (I use a static command buffer which i upload only once at startup but execute each frame.)

So i would be happy with one of the following options:
Build Command buffers on GPU from compute. (discussed this with @pixeljetstream here, who has implemented NVs Vulkan Extension for GPU generated CB, but lacks the option to insert barriers)
Or have the option to skip over recorded commands from GPU.
The former would be better of course.

Actually i can fight the bubbles only with async compute, but at the moment i have no independent task yet for this. After some tests it seems this works pretty well, but the limitation should be finally addressed in Game APIs.
Pre-recorded command buffers with indirect dispatches already gave me a speed up of two when moving to VK, i expect another a big win here.
To saturate GPU, it needs to be able to operate independent from CPU, not just a client behind slow comm to server. I have no chance to saturate a big GPU actually, it's mostly bored. (will chance in practice, but still)

The other option, to launch compute shaders directly from compute shaders is of course super interesting, but i can't say quickly how i could utilize this.
I can't say if it is worth to support functions with call and return, or even recursion like seen in DXR.
On the long run we surely want this, but i'm not one of those requesting OOP or such things just for comfort. It depends on what hardware can do efficiently.
Andrew Lauritzen tried to bundle requests that arise from other applications such as rendering: https://www.ea.com/seed/news/seed-siggraph2017-compute-for-graphics
I'm no rendering expert, and do not understand everything here, but he also lists my problem described above.

Ping pong control flow is something i'm just used to. I still talk about larger workloads. Much smaller than something like a SSAO pass, but still few thousands of wavefronts of work. I use a work control shader that fills all the indirect dispatch data, problem is it just fills it with mostly zeroes.

So i do net request for totally fine grained flexibility like on CPU.

But it would be wonderful if a GPU driven command buffer could also address async compute, including the synchronization across queues. (assuming the queue concept has hardware backing at all.)
Actually it's hard to utilize async, because you need to divide your static command buffer into multiple fragments to distribute over multiple queues. This alone kills performance for small workloads. Add some sync and the benefit is lost.
It only works if you have independent large workloads, which is not guaranteed.

So what i want is command buffer recording and queue submission on GPU itself. There might be better options i'm unaware of.

Deleted member 2197 · Feb 10, 2019

JoeJ said:
Recently i have read a blog from a developer, and he made this interesting speculation: NV simply can not talk about how their RT works - they would have to expect legal issues from ImgTec patents.

Isn't this NV's patent for real time ray tracing? They do mention contributing patents and provide some detail.

JoeJ · Feb 10, 2019

I almost fell asleep while reading, but the only thing that could refer to batching is the term 'ray tree', which is not further explained. Otherwise nothing you would not read in a book 'RT for toddlers'.
I'm no lawyer, but i think there are 2 options: Make public and patent, or keep secret but do it still.

Shifty Geezer · Feb 10, 2019

That patent looks very applicable. They mention the creation of bounding volume hierarchies well suited to ray-tracing, which is where the problem lies if you want to accelerate something that isn't ray-tracing. They don't go into specifics though, and simply protect the idea of creating BVH's in hardware...

[0014]

Another aspect of the present invention enables high performance 3D-tree construction via optimizations in splitting plane selection, minimum storage construction, and tree pruning via approximate left-balancing.

[0015]
Another aspect of the invention involves the use of high-performance bounding volume hierarchies wherein, instead of explicitly representing axis-aligned bounding boxes, the system implicitly represents axis-aligned bounding boxes by a hierarchy of intervals. In one implementation, given a list of objects and an axis-aligned bounding box, the system determines L- and R-planes and partitions the set of objects accordingly. Then the system processes the left and right objects recursively until some termination criterion is met. Since the number of inner nodes is bounded, it is safe to rely on termination when there is only one object remaining.

Funny Claim 1 though limits the whole patent to rendering to an image plane. The patent is thus mitigated by rendering to a curved surface* and transforming the result into a plane.

Curved surface rendering is one of the benefits of RT

* Actually the ideal in real world imaging and the only reason to use planes is ease of manufacture and processing. We should eventually use curved imaging sensors and save truckloads on lens requirements, although the curve would have to match the lens, suggesting a future camera lens will consist of simple glass and complex image sensor. Zooms will struggle - the curvature of the sensor would need to be adjustable.

JoeJ · Feb 10, 2019

It's from 2000, though...

vipa899 · Feb 10, 2019

Shifty Geezer said:
1) Get fixed function HW and realtime RT hybrid games out there now, and lose access to the more efficient Future Methods for 10 years, resulting in slower RT effects in games for 10 years
2) Have a slower introduction of RT technologies that are more flexible and get weaker gains now, but through more flexible solutions gain more significant gains through Future Methods, resulting in significantly faster solutions 5 years from now and going forwards

We cant have both right? Option 2 might be that they are working on a better solution in the meantime and rtx as its now wont be supported as well as its now, not optimal either and shortens turings lifespan but atleast they are out with it now.

DavidGraham · Feb 10, 2019

Shifty Geezer said:
1) Get fixed function HW and realtime RT hybrid games out there now, and lose access to the more efficient Future Methods for 10 years, resulting in slower RT effects in games for 10 years
2) Have a slower introduction of RT technologies that are more flexible and get weaker gains now, but through more flexible solutions gain more significant gains through Future Methods, resulting in significantly faster solutions 5 years from now and going forwards

Definitely option 1, option 2 is full of too much unkonwn variables, romance and wishful thinking, while option 1 is focused, stands on the firm ground of what we have now and can be iterated upon to substantially improve.

I also don't accept that we are only limited to this duality, we can safely have both options at the same time, we can accelerate RT through fixed function units and through compute as well, and Turing isn't lacking in compute at all.

JoeJ · Feb 10, 2019

DavidGraham said:
Turing isn't lacking in compute at all.

I'm a bit worried after seeing this benchmarks: https://www.anandtech.com/show/13923/the-amd-radeon-vii-review/15
Prev gen is overall faster in compute it seems, except Geekbench. A sign of compute stagnation?

I do not take these benches very serious - they do not reflect what i see myself, but i remember the extreme drop of compute performance with Kepler (in favor of less power consumption for Tegra spin off, IIRC).
That's why i'm always worried about what NV is doing. Kepler is the only GPU i have which is too slow for me - GCN of same age is 5 times faster.
First thing i'll do on Turing is not playing around with RTX...

Next gen lighting technologies - voxelised, traced, and everything else spawn

troyan

JoeJ

troyan

JoeJ

troyan

Shifty Geezer

uber-Troll!

Shifty Geezer

uber-Troll!

JoeJ

troyan

Shifty Geezer

uber-Troll!

Ext3h

JoeJ

JoeJ

Deleted member 2197

Guest

JoeJ

Shifty Geezer

uber-Troll!

JoeJ

vipa899

DavidGraham

JoeJ

Similar threads

Next gen lighting technologies - voxelised, traced, and everything else *spawn*

uber-Troll!

uber-Troll!

uber-Troll!

Deleted member 2197

Guest

uber-Troll!

Similar threads

Next gen lighting technologies - voxelised, traced, and everything else spawn