DXR performance - CPU cost

Scott_Arm

I'm really curious about DXR performance. The only game I have to test right now is Control. In certain rooms my fps takes a huge hit, I'm assuming because there are more RT effects in action in those scenes. You can see that as I switch from Native to DLSS Quality to DLSS Balanced and then to DLSS Performance, my frame rate does not go up but my GPU utilization goes down. I thought that by dropping the internal resolution I'd be casting fewer rays and would remove an RT core bottleneck. It looks like the real hit to performance is a fixed cost elsewhere.

I think I'm CPU limited, but I'm not sure. When I check actual thread utilization, it doesn't look like that's the case. Is this a memory latency issue? Or something happening at the driver level that bottlenecks each frame but doesn't last long enough to show up in the CPU monitoring? I'm very curious. Nsight doesn't work with this game.

I have an RTX 3080 with a Ryzen 3600X. I'm going to replace the CPU whenever Ryzen 6000 comes out. I knew I'd be CPU limited in a whole bunch of games, but the behaviour here just looked weird to me.

Approximately 75 fps across the board.

[Screenshots: Native, DLSS Quality, DLSS Balanced, DLSS Performance]

For good measure, here's one at 640x480 with DLSS Performance mode. I gain about 7 fps.

[Screenshot: 640x480, DLSS Performance]
 
I’m not sure about DXR costs, but DLSS costs are the same regardless of resolution. So everything else can go faster, but you’re going to be limited by the neural network.
 
So I need to upgrade my brain to get better performance in Control? Damn, that's pretty hardcore!
 
It has to be BVH build or refit at the driver level. It’s the only thing that would remain constant regardless of the internal render resolution. If it’s single-threaded or maybe cache-unfriendly, it could really bottleneck.

Maybe expensive context switches between user and kernel mode as it’s built or refitted? Maybe some issues with copying data from RAM to VRAM? It would be interesting if Resizable BAR helped here. Some CPU scaling tests with the game set at 640x480 with Performance mode DLSS would go a long way. Not sure what the easiest way to lower my CPU clock would be.
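For reference, this is roughly what a per-frame BLAS refit looks like at the API level. It's a generic DXR sketch, not anything specific to Control or NVIDIA's driver, and the buffer handles and function name are placeholders; resource management and barriers are omitted:

```cpp
// Sketch only: a per-frame BLAS refit in DXR. Assumes the BLAS was originally
// built with ALLOW_UPDATE; scratch/buffer management and barriers are omitted.
#include <d3d12.h>

void RefitBlas(ID3D12GraphicsCommandList4* cmdList,
               D3D12_GPU_VIRTUAL_ADDRESS blas,       // existing BLAS (placeholder)
               D3D12_GPU_VIRTUAL_ADDRESS scratch,    // update scratch buffer (placeholder)
               const D3D12_RAYTRACING_GEOMETRY_DESC* geometry,
               UINT geometryCount)
{
    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS inputs = {};
    inputs.Type           = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    inputs.DescsLayout    = D3D12_ELEMENTS_LAYOUT_ARRAY;
    inputs.NumDescs       = geometryCount;
    inputs.pGeometryDescs = geometry;
    // PERFORM_UPDATE = refit in place instead of a full rebuild; only valid
    // if the original build used ALLOW_UPDATE.
    inputs.Flags = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_UPDATE |
                   D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PERFORM_UPDATE;

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC desc = {};
    desc.Inputs = inputs;
    desc.SourceAccelerationStructureData  = blas;   // refit reads the old BLAS...
    desc.DestAccelerationStructureData    = blas;   // ...and writes it in place
    desc.ScratchAccelerationStructureData = scratch;

    // The build/refit itself runs on the GPU; the CPU only records this call.
    cmdList->BuildRaytracingAccelerationStructure(&desc, 0, nullptr);
}
```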
 
BVH building is on the GPU, although there's considerable book-keeping of data on the CPU side. It essentially needs to maintain two distinct scene "submissions" simultaneously.
 

Well, if this fixed cost of enabling ray tracing is on the GPU side, it should get even worse if I lower my GPU clock. Maybe I’ll play with that tonight.
 

It should be a regular compute shader. Someone raised the idea that there might be hardware support for BVH construction, but to me that sounds very exotic and inflexible, and not really necessary. It can be asynchronous and low priority, and as such a bit difficult to measure. I guess Nsight can tell you how that is set up and works.
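For what it's worth, the API at least lets an engine push those builds onto a separate compute queue so they can overlap with graphics work; how the driver actually schedules and prioritizes them internally is its own business. A rough sketch with hypothetical names, not a claim about how Control does it:

```cpp
// Sketch: submitting acceleration-structure builds on a separate compute queue
// so they can overlap with graphics work. Whether they actually run
// concurrently, and at what priority, is up to the driver and hardware.
#include <d3d12.h>

ID3D12CommandQueue* CreateAsyncComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    // D3D12 only exposes NORMAL/HIGH/GLOBAL_REALTIME queue priorities,
    // so "low prio" scheduling, if any, happens below the API.
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    ID3D12CommandQueue* queue = nullptr;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// Per frame (very roughly): the BuildRaytracingAccelerationStructure calls are
// recorded into a compute command list, kicked off alongside the graphics
// queue, and the graphics queue waits on a fence before tracing any rays.
void SubmitAsBuilds(ID3D12CommandQueue* computeQueue,
                    ID3D12CommandQueue* graphicsQueue,
                    ID3D12CommandList*  asBuildCmdList,
                    ID3D12Fence*        fence,
                    UINT64              fenceValue)
{
    computeQueue->ExecuteCommandLists(1, &asBuildCmdList);
    computeQueue->Signal(fence, fenceValue);   // "builds done" marker
    graphicsQueue->Wait(fence, fenceValue);    // don't DispatchRays before the BVH is ready
}
```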
 

I tried to use Nsight on Control but it says it's incompatible because the game is using a D3D11On12 API layer. My thought is I can go back to that scene where I took the screen caps and set my resolution to 640x480 with Performance DLSS, so the internal resolution is something like 320x240. Turn off ray tracing and see what I hit in terms of fps. I'll be CPU limited, but then I can turn on ray tracing and see what the frame time difference is. Then I can play with my GPU clock and see if that changes anything. Playing with my CPU clock seems more annoying. Maybe I can do that easily with Ryzen Master.

Edit: I should also check whether it matters how many ray tracing effects I turn on. I have a feeling it won't matter, but maybe I'm wrong.

Edit2: I also know this will vary scene to scene. There are a couple of rooms I'm aware of where I seem to hit this kind of frame cap, but in other places I can easily hit the fps cap I've set for G-Sync (138) with everything maxed out.
 
Why not just remove DLSS from the picture entirely for this benchmark? I've been following along, and I get the part where you find it weird that changing resolution with RT on isn't impacting performance. But enabling DLSS makes the issue harder to diagnose.

The challenge is that DLSS will never run any faster; it's a fixed cost, and the only way it's going to run faster is to increase the GPU clock speed to complete the network faster. Quality, Balanced, and Performance are different networks, but each one of those has a fixed cost regardless of how low you bring the resolution. The speed of processing that network could be your frame rate limiter once you get to a high enough fps. If everything else is already under 1 ms, and your neural networks take a hypothetical 4, 5, and 6 ms (p, n, q) to complete, you're sort of stuck on obtaining more speed unless you get rid of it.
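Just to put numbers on that, using the hypothetical costs above rather than anything measured:

```cpp
// Back-of-envelope: with a fixed post-process cost, fps has a hard ceiling no
// matter how cheap the rest of the frame gets. The numbers are the
// hypothetical ones from the post above, not measured values.
#include <cstdio>

int main()
{
    const double otherWorkMs  = 1.0;                 // "everything else" per frame
    const double dlssCostMs[] = { 4.0, 5.0, 6.0 };   // hypothetical fixed network costs

    for (double dlss : dlssCostMs)
    {
        double frameMs = otherWorkMs + dlss;
        std::printf("fixed cost %.1f ms -> ceiling ~%.0f fps\n", dlss, 1000.0 / frameMs);
    }
    // e.g. 1 + 4 ms -> ~200 fps, 1 + 6 ms -> ~143 fps, no matter the resolution
    return 0;
}
```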
 
why not just remove DLSS from the picture entirely for this benchmark?

The lower I can push the resolution down, the more confident I am that I'll be CPU limited. I'll play around with it though. There could be some frame rate where the tensor cores can no longer keep up. Not sure. Not a bad experiment in itself. I'll see how low the game will let me set the resolution, and then see if DLSS starts to have a negative impact at extremely high frame rates.

Edit: I may also grab Ghostrunner or something so I have a second game to experiment with.
 
I think if your goal is to find the breaking point of being CPU limited, I would definitely consider removing DLSS, as it will bottleneck performance eventually. It can't run any faster regardless of the resolution you set it at. The only things that affect DLSS computation time are (a) the computational power, i.e. tensor cores, clock rate, bandwidth, and (b) the size of the network.
Since you can't change (b) with resolution, you're left with only (a).
 
Extra CPU cost could be somewhat tied to deciding what data to use for BVH building. Depending on how the engine is built, the data might be readily available, or it might have to be mined from traditional data structures if the engine wasn't designed with RT in mind. Some assets for RT might also be different from what is used for rasterization. Collecting this data and feeding it to BVH building could consume significant CPU time even if the BVH build itself is done with compute.
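Something like the sketch below, roughly: this is the part that runs on the CPU every frame before any GPU build is even issued. The engine types and flags here are hypothetical, not anything from Control's engine:

```cpp
// Sketch of the CPU-side "collect and feed" work described above: walking the
// engine's object list each frame and translating it into the geometry descs
// the BVH build consumes. RtMesh is a hypothetical engine-side type.
#include <d3d12.h>
#include <vector>

struct RtMesh
{
    D3D12_GPU_VIRTUAL_ADDRESS vertexBuffer;
    D3D12_GPU_VIRTUAL_ADDRESS indexBuffer;
    UINT vertexCount;
    UINT indexCount;
    UINT vertexStride;
    bool castsRtEffects;   // engine-side filtering: not everything goes in the BVH
};

// Runs on the CPU whenever the RT scene changes. If the source data lives in
// rasterization-oriented structures, this walk is where cache misses and
// per-object overhead pile up.
std::vector<D3D12_RAYTRACING_GEOMETRY_DESC> GatherRtGeometry(const std::vector<RtMesh>& meshes)
{
    std::vector<D3D12_RAYTRACING_GEOMETRY_DESC> descs;
    descs.reserve(meshes.size());

    for (const RtMesh& mesh : meshes)
    {
        if (!mesh.castsRtEffects)
            continue;

        D3D12_RAYTRACING_GEOMETRY_DESC desc = {};
        desc.Type  = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
        desc.Flags = D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE;
        desc.Triangles.VertexBuffer.StartAddress  = mesh.vertexBuffer;
        desc.Triangles.VertexBuffer.StrideInBytes = mesh.vertexStride;
        desc.Triangles.VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT;
        desc.Triangles.VertexCount  = mesh.vertexCount;
        desc.Triangles.IndexBuffer  = mesh.indexBuffer;
        desc.Triangles.IndexFormat  = DXGI_FORMAT_R32_UINT;
        desc.Triangles.IndexCount   = mesh.indexCount;
        descs.push_back(desc);
    }
    return descs;
}
```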
 