Next gen lighting technologies - voxelised, traced, and everything else *spawn*

Yes, I assumed it is cost free because upscaling can be done cheaply. I also assumed the BVH building cost is related to pixel count, which is wrong too.
It's not just upscaling; there is a complex multilayer CNN which takes the previous frame, the current frame (probably both sparse -- jittered) and motion vectors as input and outputs a higher resolution 4K frame - msalvi-dl-future-of-rendering

Based on the GamersNexus results above, DLSS processing time is ~6 ms. That's a lot; I expected it to be at least 2x faster.
 
I'll remove RTX from my todo list for now
Oh, please don't :eek: (just a joke)

I've just discovered the game Strange Brigade by Rebellion has limited raytraced reflections on PS4
Are you sure these are RT reflections? They might as well be a simple dynamic cubemap (or even a static one + dynamic SSR on top); the square texels on the second screenshot certainly remind me of static cube maps :smile2:. Though these are so local that it doesn't matter - it's cool that so many devs have already started exploring RT.
 
Are you sure these are RT reflections? They might as well be a simple dynamic cubemap (or even a static one + dynamic SSR on top); the square texels on the second screenshot certainly remind me of static cube maps :smile2:. Though these are so local that it doesn't matter - it's cool that so many devs have already started exploring RT.
I'll discuss Strange Brigade in that thread.
 
It's not, unfortunately. The latest testing on Final Fantasy 15 reveals it has a hit of about 35% compared to native 1440p; however, it gives 25% more performance than native 4K with TAA.

35% is a whole lot, but irrelevant for me. I want to invest this time into something more useful than upscaling (notice that with texture space shading, upscaling is no longer necessary).
I was following AI-based tech. I remember papers about SSAO, AA and more screen-space tech done with AI. It took twice the time and resulted in half the image quality.
Spending 35% of the power just for upscaling is a joke. But how can this happen at all? I would assume the Tensor cores process the old frame async while the rest of the GPU works on the next. Another disappointment.
I really can't get excited about any of this. This is like Apple selling one-button solutions to the masses as a status symbol, just like 'It's hip, easy and it works - show you can afford it'.

That matches quite well with the 29 FPS I calculated (± rounding error).

Sorry if I did not follow your thoughts exactly; the same goes for David's point about the 400%+ performance with Turing. Sure there is improvement, but in the end it does not contribute enough.
I need to make one more correction - this time not to the math or the numbers, just the unit names:

2070: 19.8 fps ≈ 50 ms per frame; (1.4 x 2.5 res) / 50 ms = 0.07 RT work per millisecond
1080 Ti: 10.1 fps ≈ 100 ms per frame; (3.8 x 2.1 res) / 100 ms = 0.079 RT work per millisecond

My math here is much easier to grasp than yours: GTX 1080 traces twice the pixels and gets half the framerate, thus both equally fast.
Period. And there is no need to look at what affects those sums in detail and how; only the sums matter.

I was talking about 2560x1440 resolution (I mentioned this several times), while 80 FPS with RTX Ultra can be achieved on a Titan V only at 1080p and below; that's 1.78x fewer rays to trace.

No, according to the poster he uses all his many NV GPUs at the same resolution. His intention is not to downplay NV at all. He is a fan just enjoying how everything works.

Sure there is constant frame cost like transforms and BVH update. I ignore that intentionally. Within my own GI work the cost of those things is totally negligible. With RTX it is surely higher, but never high enough to explain the near-equal end results.
No matter what other pipeline details you bring up to discuss - they can never contribute enough to justify this.

So all this comes down to a simple thing - how heavy the ray-triangle intersection part is.

I'm not aware of a single proper raytracing algorithm where ray-triangle checks are the bottleneck. Instead the ray-box checks already cost more, and cache-missing traversal costs the most, if we assume a simple implementation.
A complex implementation (MBVH, sorting many rays to larger tree branches, sorting ray hits by material, etc.) has additional costs, but these are key to good performance.
We do not know what NV does here at all, or how the work is distributed between RT cores and compute under the hood. I guess the RT cores can do box and triangle checks in parallel, and compute does the batching stuff. Pure speculation - so is yours.
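For reference, this is roughly what a single box check amounts to - a minimal CPU-side sketch of the standard ray-AABB slab test (function name and argument layout are my own, purely for illustration). The arithmetic is a handful of multiplies and min/max per axis; in a real traversal the expensive part is usually fetching the node data, not this test.

```python
# Minimal ray-AABB "slab" test, CPU-side illustration only (names are hypothetical).
# The point: a box check is a few multiplies and min/max per axis -- in a real
# traversal the cost is dominated by fetching the node, not by this arithmetic.

def ray_aabb_hit(origin, inv_dir, box_min, box_max, t_max):
    """Return True if the ray segment [0, t_max] overlaps the box."""
    t_near, t_far = 0.0, t_max
    for axis in range(3):
        t0 = (box_min[axis] - origin[axis]) * inv_dir[axis]
        t1 = (box_max[axis] - origin[axis]) * inv_dir[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False
    return True

# Example: a ray along +x against a unit box two units away
# (1e30 stands in for 1/0 on the degenerate axes).
print(ray_aabb_hit((0, 0, 0), (1.0, 1e30, 1e30), (2, -1, -1), (3, 1, 1), 100.0))  # True
```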

The guy mentions, however, that Titan performance drops at some spots, if I get him right - not sure. But such problems can be solved. RT cores are useful but not justified, IMHO. Looking at those results, I could even say they are just cooling pads.

I also agree geometric complexity will show the RT cores beating the older GPUs. But the proper solution here is a data structure that combines LOD, BVH and geometry. RTX does not allow for this. If compute is equally fast, I prefer just that, to solve problems directly instead of hiding their symptoms with brute force.


I hope AMD and Intel draw the proper conclusions here: DXR support yes, improved compute scheduling yes, RT cores no - better to have more flexible CUs or lower power draw.
And I hope next gen consoles will have 'old school' GCN, and that devs find time to work on RT based on that. Personally I'll do my best here...
 
These numbers pretty much match up with Remedy's: 2-5x improvements, which is very significant.

I am happy with those numbers - I'm always happy with what given hardware can do.
But a 2-5x speedup does not justify fixed function HW; 10x would be the minimum. Those cores take chip area that could be used to compensate, and that could be utilized for everything else too during the whole frame.
They claim 10x, but in practice it does not work out. It is broken and wrong IMHO.
Probably driver and game improvements will help over time, but we will lose the options for proper comparison pretty quickly, I guess. The only hope is consoles keeping up.
 
I am happy with those numbers - I'm always happy with what given hardware can do.
But a 2-5x speedup does not justify fixed function HW; 10x would be the minimum. Those cores take chip area that could be used to compensate, and that could be utilized for everything else too during the whole frame.
They claim 10x, but in practice it does not work out. It is broken and wrong IMHO.
Probably driver and game improvements will help over time, but we will lose the options for proper comparison pretty quickly, I guess. The only hope is consoles keeping up.

Could the area they take compensate? I'm legit asking. I think I saw someone say 10% of the SM space is taken up by RT cores. Not sure if that number is correct. So you sacrifice 10% of general ALU for a 200-500% improvement in ray casting and intersection tests. Obviously the former is better if you're not tracing triangles, or if you're not ray tracing at all. The latter is better if you assume most games are still going to be rasterized and devs will be interested in implementing some ray tracing of triangles.

I don't know what the correct call is, but if you want your game to run at 60+ fps, spending 3-5ms just on ray intersections is very costly. Once you're down to 1ms it's a lot more manageable. Just by using that 10% of space for more general ALU, I don't think you could really compensate for the loss of the RT cores if you want to trace triangles.
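A rough back-of-the-envelope version of that trade-off, using the 10% area figure and the 2-5x speedup range from this thread; the 16.6 ms frame split and the assumption that the reclaimed area speeds up the rest of the frame perfectly are made up for illustration.

```python
# Back-of-the-envelope on the area trade-off discussed above.
# Assumptions (hypothetical): a 16.6 ms frame where intersection/traversal takes
# 1 ms with RT cores; dropping the RT cores frees ~10% more general ALU, but the
# intersection work then runs 2-5x slower on it.

frame_with_rt = 16.6        # ms, target 60 fps frame
rt_work_with_cores = 1.0    # ms spent on intersection/traversal with RT cores
other_work = frame_with_rt - rt_work_with_cores

for slowdown in (2.0, 3.0, 5.0):
    # Without RT cores: intersection is 'slowdown'x slower, the rest ~10% faster
    # thanks to the reclaimed ALU area (optimistically assuming perfect scaling).
    no_rt_frame = other_work / 1.10 + rt_work_with_cores * slowdown
    print(f"{slowdown:.0f}x slower rays -> {no_rt_frame:.1f} ms vs {frame_with_rt} ms")
```

With this split the 2x case roughly breaks even and the 3-5x cases do not, which is really the point: the answer hinges on how many milliseconds of intersection work the frame actually contains.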
 
I want to invest this time into something more useful than upscaling
DLSS also does temporally stable AA with sharp edges and without ghosting and other TAA artifacts; if it were just upscaling, it would not cost anywhere near 6 ms on a 2080 Ti.

My math here is much easier to grasp than yours:
No, it's hard to tell what you are calculating here at all.

GTX 1080 traces twice the pixels and gets half the framerate, thus both equally fast.
Can you explain what you are doing here? Where did you get the FPS numbers and the other values in each step?
If these are SW Reflections demo numbers, then they are simply wrong: the 2070 (4K DLSS) is 2.7x faster than the GTX 1080 Ti (1440p) when you subtract the DLSS cost and add the TAA cost instead to get a 1440p result for the 2070. That's almost 3x more work done with 50% fewer flops and half the ROPs and geometry pipeline engines.

Instead the ray-box checks already cost more
I don't separate these; by ray-triangle intersection tests I meant the necessary ray-box checks as well. All the giga-rays RT core numbers also count only effective rays AFAIK (those which hit a final triangle in the BVH).

Sure there is constant frame cost like transforms and BVH update. I ignore that intentionally.
These are heavy with RT

I could even say they are just cooling pads looking at those results.
To be honest, this is ridiculous

I hope AMD and Intel draw the proper conclusions here: DXR support yes, improved compute scheduling yes, RT cores no - better to have more flexible CUs or lower power draw.
AMD/ATI has always sacrificed scheduling in order to gain more flops (with the exception of async compute scheduling, which is pretty easy to amortize anyway), so it would be surprising to see them turning in exactly the opposite direction. RT obviously requires this change, though (I know there are certain patents on hybrid narrow-SIMD / wide-SIMD CUs from AMD, but these are just patents), otherwise a 64-wide wavefront architecture would suffer badly from such a divergent workload as RT. Once they abandon 64-wide wavefronts, they will need many more wavefront schedulers per CU (Volta and Turing already have 4x more schedulers per SM), and these will have to be more complex than currently in GCN, since back-to-back instruction execution will no longer be possible (unless they go for very narrow SIMDs). Also, they will have to overhaul the caches and grow them significantly in size (done in Volta/Turing). So yes, if they go for a complete software solution, this will be entertaining.
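To put a rough number on the divergence point, here is a tiny Monte Carlo sketch. The inputs are entirely hypothetical: each ray takes a geometrically distributed number of traversal steps, and lanes that finish simply idle until the slowest lane in the wave is done (no repacking).

```python
# Rough illustration of why wide waves hurt for divergent traversal.
# Assumption (made up): per-ray traversal length is geometric (mean ~5 steps),
# and finished lanes idle until the whole wave retires.

import random

random.seed(1)

def avg_utilization(wave_width, rays=1 << 16, p_stop=0.2):
    steps = []
    for _ in range(rays):
        n = 1
        while random.random() > p_stop:
            n += 1
        steps.append(n)
    useful = executed = 0
    for base in range(0, rays, wave_width):
        wave = steps[base:base + wave_width]
        useful += sum(wave)                 # lane-steps doing real work
        executed += max(wave) * len(wave)   # lane-steps the wave occupies
    return useful / executed

for width in (8, 16, 32, 64):
    print(f"wave{width:>2}: ~{avg_utilization(width):.0%} lanes busy on average")
```

With these made-up numbers the wave64 case keeps noticeably fewer lanes busy than wave8/16, which is the intuition behind wanting narrower SIMDs (or ray repacking) for RT.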

And I hope next gen consoles will have 'old school' GCN, and that devs find time to work on RT based on that
There is no way they can go for RT with 'old school' GCN
 
So you sacrifice 10% of general ALU for a 200-500% improvement in ray casting and intersection tests. Obviously the former is better if you're not tracing triangles, or if you're not ray tracing at all. The latter is better if you assume most games are still going to be rasterized and devs will be interested in implementing some ray tracing of triangles.
Yes, it then depends on how much raytracing you do versus anything else (including shading, physics, mining...).
But you can see yourself that there is a potential conflict of interest, and the hardware vendor almost dictates what you have to do to utilize the HW the most.

And there is another, much worse and more concrete problem here: the RT cores do both the intersection tests and the BVH traversal. The former is a simple atomic operation that deserves fixed function with little doubt (although the box test is really simple - faster than loading the data from LDS, I'm sure).
But the latter, tree traversal, is a fundamental tool of almost any algorithm that is not brute force, and traditionally GPUs are bad at traversing trees. Making it fixed function makes the data structure static, so it cannot be adapted for general purposes. (It is also black-boxed currently anyway.)
Instead it would make so much more sense to make GPUs powerful at programmable and flexible tree traversal. This could be achieved with device-side enqueue, persistent threads and other techniques available in other APIs but not in the gfx APIs. Plus Turing (and the Titan V already) has very fine-grained new stuff here ('work generation shaders', whatever that is).
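To make the 'programmable tree traversal' point concrete, here is a minimal CPU-side sketch (node layout, field names and the traverse() helper are hypothetical, not any real API): a flat node array, an explicit stack and an application-supplied visit test. The point is only that this loop is ordinary code a developer could specialize for LOD, physics queries or different ray types - exactly what a black-boxed fixed function traversal takes away. A GPU version would map the same loop onto persistent threads or work-generation stages.

```python
# Minimal illustration of programmable BVH traversal: flat node array, explicit
# stack, application-controlled visit test. Everything here is hypothetical.

from dataclasses import dataclass

@dataclass
class Node:
    box_min: tuple
    box_max: tuple
    left: int = -1      # child indices; -1 means leaf
    right: int = -1
    prim: int = -1      # primitive index for leaves

def traverse(nodes, hit_box, report_leaf):
    stack = [0]                      # start at the root
    while stack:
        node = nodes[stack.pop()]
        if not hit_box(node.box_min, node.box_max):
            continue
        if node.left < 0:            # leaf: hand the primitive to the caller
            report_leaf(node.prim)
        else:                        # inner node: push both children
            stack.append(node.left)
            stack.append(node.right)

# Tiny two-leaf tree and a "ray" that hits everything, just to run the loop.
nodes = [Node((0, 0, 0), (4, 4, 4), left=1, right=2),
         Node((0, 0, 0), (2, 4, 4), prim=0),
         Node((2, 0, 0), (4, 4, 4), prim=1)]
traverse(nodes, hit_box=lambda lo, hi: True, report_leaf=lambda p: print("leaf", p))
```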

If you add all this to compute, you can make the RT you need. This means completely different algorithms for primary / secondary / coherent and incoherent rays in practice.
It would be much more work, not something you can add in a month, but the result would be faster. NV may implement this under the hood or not, deciding based on driver assumptions or given API hints. They may extend to this in a hidden form as soon as other vendors catch up with the next gen. But the developer has neither control nor knowledge, nor options to improve and adapt. This is frustrating and hinders progress. (Remember: the developer might want to do something completely different, like physics simulation, which also requires trees and rays.)

I've said this multiple times, but with those results here it is no longer an assumption but a sad truth.
 
Can you explain what you are doing here? Where did you get the FPS numbers and the other values in each step?

I got the numbers from the screenshot of the PCGH site I posted earlier. David Graham posted more with similar numbers. I assume mine are worst case, but because they are doubles of each other, the math is as simple as possible with them.

If these are SW Reflections demo numbers, then they are simply wrong: the 2070 (4K DLSS) is 2.7x faster than the GTX 1080 Ti (1440p) when you subtract the DLSS cost and add the TAA cost instead to get a 1440p result for the 2070. That's almost 3x more work done with 50% fewer flops and half the ROPs and geometry pipeline engines.

My numbers come from this test setup: all GPUs output at 4K, but RTX renders only at 1440p - not the other way around! (In the screenshot they say 'native resolution'.)
But in my initial post I accidentally reversed the resolution numbers - my mistake! Sorry again for all the confusion!

So once more my final result with correct numbers and math:
2070: 19.8 fps ≈ 50 ms per frame; (1.4 x 2.5 res) / 50 ms = 0.07 RT work per millisecond
1080 Ti: 10.1 fps ≈ 100 ms per frame; (3.8 x 2.1 res) / 100 ms = 0.079 RT work per millisecond

I hope it makes sense now.



To be honest, this is ridiculous

Exactly what I think :) :D
 
The RTX 2070 is only 7.5 Tflops. The GTX 1080ti is something like 11 Tflops. I don't really understand how this comparison works. The SW Reflections demo is also going to be heavily testing shading performance.

We know from actual developers at EA SEED and Remedy that ray costs were 2-5x lower (independently of shading) before launch, comparing RT core performance on a 2080 Ti vs a card without RT cores (Titan V).
 
But @DavidGraham found this and added links in his post above :) : https://pclab.pl/art78828-20.html
I know, I did my calculations based on these
If these are SW Reflections demo numbers, then they are simply wrong: the 2070 (4K DLSS) is 2.7x faster than the GTX 1080 Ti (1440p) when you subtract the DLSS cost and add the TAA cost instead to get a 1440p result for the 2070. That's almost 3x more work done with 50% fewer flops and half the ROPs and geometry pipeline engines.
2070 shows 21.2 FPS here at 4K with DLSS - https://forum.beyond3d.com/posts/2053525/
1080 Ti shows 9.7 FPS at 1440p - https://forum.beyond3d.com/posts/2053525/
As we know, 4K DLSS is 1440p reconstructed to 4K, 4K DLSS cost is 6ms on 2080 Ti - https://forum.beyond3d.com/posts/2053548/
For 2070, DLSS cost should be around 10 ms (based on FLOPS numbers)
In order to get a 1440p result, we need to subtract the DLSS time and add TAA: (1000 / 21.2) total avg frame time - 10 ms DLSS + 1.6 ms (TAA time in FFXV) = 38.77 ms, or 25.79 FPS at 1440p
25.79 FPS (RTX 2070) / 9.7 FPS (GTX 1080 Ti) = 2.66
That's a huge difference considering that the 2070 has almost 50% fewer flops and half the ROPs / geometry pipeline throughput
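For convenience, the same arithmetic as one runnable snippet (the inputs are the figures quoted above; the 10 ms DLSS estimate for the 2070 and the 1.6 ms TAA figure are the assumptions stated in this post, not measurements):

```python
# Reproducing the arithmetic above, using the numbers quoted in the thread.

fps_2070_4k_dlss = 21.2     # RTX 2070, 4K DLSS (1440p internal)
fps_1080ti_1440p = 9.7      # GTX 1080 Ti, native 1440p
dlss_ms_2070 = 10.0         # estimate scaled from the ~6 ms measured on a 2080 Ti
taa_ms = 1.6                # TAA cost quoted from FFXV

frame_2070_1440p = 1000.0 / fps_2070_4k_dlss - dlss_ms_2070 + taa_ms
fps_2070_1440p = 1000.0 / frame_2070_1440p

print(f"estimated 2070 @ 1440p: {frame_2070_1440p:.2f} ms = {fps_2070_1440p:.2f} fps")
print(f"ratio vs 1080 Ti: {fps_2070_1440p / fps_1080ti_1440p:.2f}x")
```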
 
Sure, TCs provide much more flops at the same power due to higher data reuse, and RT cores are much more area- and power-efficient at ray-triangle intersection tests since they are specialized processors (it would have been insane to make them as wide as standard SMs).
Tensor cores and RT cores both presumably free up register file accesses, through reuse and dedicated data paths, and possibly some amount of independent sequencing that doesn't go back to the general-purpose register file for each step.
A general-purpose shader cannot skip the power cost or bandwidth lost to other workloads, although some of the current excess barrier insertions appear to be losing some of that benefit.
Even so, if BVH traversal and intersection testing is able to run through multiple memory accesses, address calculations, and internal functions that do not need to contend for the instruction caches, that's a lot of general-purpose paths that are often far too wide for the workload.

For GCN, parts of this seem better suited to a separate domain like the scalar unit and register file, but the existing scalar is already heavily used and may be too constrained in the operations it supports.

Could the area they take compensate? I'm legit asking. I think I saw someone say 10% of the SM space is taken up by RT cores. Not sure if that number is correct.
I'm curious what that statement comes from, or what data is being used to infer it. So far, I have only seen marketing diagrams that I am reluctant to trust to provide good proportions.
There may also be area increase in the SM for elements like the improved load/store pipeline and caches, which the RT hardware likely needs but can be beneficial for other workloads as well.

So you sacrifice 10% of general ALU for a 200-500% improvement in ray casting and intersection tests.
Unless you run into a different limit, like register file power consumption or area lost to the infrastructure to support those ALUs.
The register files are a common refrain for both Nvidia and AMD. We can point to things like Nvidia's operand reuse cache, AMD's reuse slots in its VLIW days, and Vega's marketing talking about having input from the Zen team to optimize the register file. Various AMD patents, like the "Super-SIMD" patent there was some buzz about a little while ago, cite the register file as a limiting factor for GPUs. That one concerned itself with finding ways to utilize register file access cycles lost because the architecture is sized for 3-operand instructions despite half the instructions not needing 3 operands, and with an output cache to reuse outputs and avoid some accesses to the maxed-out register file. While the patents may not provide a complete picture, they seem to indicate there's a rough ceiling on how many vector register files can be deployed, and general-purpose hardware is tied to them.
 
...

I'm curious what that statement comes from, or what data is being used to infer it. So far, I have only seen marketing diagrams that I am reluctant to trust to provide good proportions.
There may also be area increase in the SM for elements like the improved load/store pipeline and caches, which the RT hardware likely needs but can be beneficial for other workloads as well.

...

I think it was based on a picture in the Turing whitepaper, but I may be remembering totally wrong.
 
25.79 FPS (RTX 2070) / 9.7 FPS (GTX 1080 Ti) = 2.66
That's a huge difference considering that the 2070 has almost 50% fewer flops and half the ROPs / geometry pipeline throughput

I have checked all this:

DLSS cost:

4K+TAA = 46.5 fps 21.5 ms
4K-TAA = 50.5 fps 19.8 ms

TAA cost at 4K = 1.7 ms -> / 2.25 = 0.75 ms for 1440p

1440p+TAA = 77 fps 13 ms
1440p-TAA = 12.25 ms

upscaled to 4K = 57 fps 17.5 ms

DLSS = 4.5 ms
-----------------------------------------

Comparison:

PCGH 1080Ti @ 4K: 10.1 fps
pclab 1080Ti @ 4K: 4.3 fps - wtf ???


continue with pclab numbers:

1080 Ti @ 2560x1440: 9.7 fps = 103 ms
2070 upscaled to 3840x2160: 21.2 fps = 47 ms; minus 4.5 ms DLSS = 42.5 ms

103 / 42.5 = 2.42, so the 2070 is 2.42 times faster than the 1080 Ti
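The same chain in one snippet, using the rounded millisecond figures as listed above (the 2.25 divisor is the 4K-to-1440p pixel ratio, and I read the 4.5 ms as the 4K-DLSS frame minus the 1440p+TAA frame; where the post rounds, I keep its rounded values rather than recomputing from fps):

```python
# DLSS cost estimate from the FFXV numbers above:
taa_cost_4k = 21.5 - 19.8            # 4K+TAA minus 4K-TAA frame time -> 1.7 ms
taa_cost_1440 = taa_cost_4k / 2.25   # scale by the 4K:1440p pixel ratio -> ~0.75 ms
frame_1440_no_taa = 13.0 - taa_cost_1440   # ~12.25 ms, as listed above
dlss_cost = 17.5 - 13.0              # 4K-DLSS frame minus 1440p+TAA frame -> 4.5 ms

# Comparison using the pclab SW Reflections numbers:
ms_1080ti_1440p = 103.0              # 9.7 fps at native 1440p
ms_2070_no_dlss = 47.0 - dlss_cost   # 21.2 fps at 4K DLSS, minus the DLSS pass
print(f"TAA at 1440p ~{taa_cost_1440:.2f} ms, 1440p w/o TAA ~{frame_1440_no_taa:.2f} ms, "
      f"DLSS ~{dlss_cost:.1f} ms")
print(f"2070 is {ms_1080ti_1440p / ms_2070_no_dlss:.2f}x faster than the 1080 Ti")
```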


Looking closer at the pclab numbers, they vary by a factor of >2! But I agree pclab must be correct. (I got a smaller cost for DLSS, which makes more sense too, but let's ignore this.)
Those numbers look pretty accurate, because looking at DXR numbers here:
we can see the Titan V is more than 2 times faster than the 1080 Ti with RT, because it already has the 'advanced scheduling' stuff (not sure how we should name this properly?)
It all makes sense now, thanks for clearing this up!

So I take back what I said - RT should be enabled for BFV, although at 30 fps or 720p they would still do.
Apologies to NV - it's partly PCGH's fault.

Whether a speedup of 2.42 justifies fixed function restrictions is a matter of opinion. At the moment yes, but at the cost of a restricted future.
All this seems to confirm the match of TitanV vs. 2080Ti performance in BFV as well. So IMO RT cores are still not worth it - totally not.

But personal opinion aside, I guess we agree on the numbers now?
 
I think it was based on a picture in the Turing whitepaper, but I may be remembering totally wrong.
I saw a grainy die shot of TU102 in the whitepaper, and various block diagrams of the chip or an SM. It doesn't seem precise enough to be confident that those can be used to get that figure. Nvidia's diagrams don't necessarily respect the layout or the margins of chip regions.

Although if the pictures from Fritzchens Fritz's flickr account of other die shots are any indication, we may have a better view of TU102 very soon. Inferring which blocks belong to which parts of the silicon may still be a challenge, and even though the die shot may be better, the pictures may be zoomed out enough to show the whole chip rather than digging down to the SM level.

 
Hopefully this doesn't sound arrogant, but @JoeJ, the abstraction chosen for DXR intentionally hides details about what the RT cores can do. Given that Nvidia has been actively working on GPU raytracing for a decade and has amassed tons of experts on raytracing in general (SIGGRAPH papers etc.), you can assume that a lot of experience from many people has gone into the design. While that is no guarantee of success, and every design iteration is a compromise, some of your statements make it sound like the people who worked on this have not thought it through.

Furthermore, things like DXR are fluid APIs which will change over time based on feedback. Likewise, under NDA developers know about more details, and HW vendors typically visit developers and research studios and exchange ideas, plans and possible solutions for future architectures in advance. So these new features don't happen in a vacuum of a few people.

The research on doing RT without dedicated HW won't stop and will continue to be important in influencing SW/HW design. Looking at fabrication processes, the free lunch is over; IMO we will see greater efficiency gains from smart fixed function extensions rather than from brute force general purpose units.
 