Nvidia's Next-Generation RTX GPU [3060, 3070, 3080, 3090]

Discussion in 'Architecture and Products' started by Shortbread, Sep 1, 2020.

  1. techuse

    Regular Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    743
    Likes Received:
    440
    Perhaps Nvidia's multithreading outweighs any efficiency advantage. Without specific dev effort, AMD's DX11 CPU rendering stream is still fully single-threaded, no?
     
  2. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    2,061
    Likes Received:
    1,493
    Location:
    France
    If it's an L3 cache problem, the 109xx X CPUs on X299 should be impacted too, with their mesh architecture?
     
  3. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,218
    Likes Received:
    1,626
    Location:
    msk.ru/spb.ru
    DX11 maybe but what about Vulkan?

    Also, from the reports of asset loading issues on NV h/w when running CPU limited, I do wonder if the initial idea is the culprit here: the DX12 (FL12_0 at least?) resource binding model not being a good fit for NV h/w, with NV's DX12 driver doing some pre-processing on such titles to win more performance in GPU-limited scenarios. It would be interesting to see what happens in these games on a really old driver, although it can be hard to run an old GPU like a 980 Ti CPU limited even at 720p, I guess.
     
    PSman1700 likes this.
  4. troyan

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    331
    Likes Received:
    636
    As far as I see it, in most games* DX12 has an overhead within the engine which doesn't exist with DX11 on nVidia hardware. So this is a fixed time which can't be reduced or eliminated with more cores, higher clocks or better IPC. When a game isn't hammering the DX11 driver with workload (Hitman 2 at the beginning of the first main mission), DX11 is more efficient from an overhead perspective in CPU-limited scenarios. And without proper multi-threading, as with Control and WoW, DX11 delivers much more performance on processors with >6 cores.

    * A positive example is "Pumpkin Jack". The developer switched to nVidia's UE4 branch for raytracing, and DX12 is 15% faster than DX11 at high framerates.
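    To put rough numbers on that fixed-overhead argument (the 2 ms figure below is purely illustrative, not a measurement from any of these games): a constant per-frame CPU cost barely registers at 30 fps but eats a large slice of the frame budget once you are CPU limited at high framerates.

    ```cpp
    // Illustrative only: shows how a fixed per-frame CPU cost (hypothetical 2 ms)
    // matters little at low framerates but dominates in CPU-limited, high-fps cases.
    #include <cstdio>

    int main() {
        const double overhead_ms = 2.0;                   // assumed fixed API/engine overhead per frame
        const double base_frame_ms[] = {33.3, 16.7, 6.9}; // ~30, ~60, ~144 fps without the overhead

        for (double base : base_frame_ms) {
            double with_overhead = base + overhead_ms;
            double fps_before = 1000.0 / base;
            double fps_after  = 1000.0 / with_overhead;
            std::printf("%.1f ms/frame -> %.1f fps becomes %.1f fps (%.0f%% slower)\n",
                        base, fps_before, fps_after,
                        100.0 * (fps_before - fps_after) / fps_before);
        }
        return 0;
    }
    ```

    With these assumed numbers the same 2 ms costs about 6% at 30 fps but over 20% once the base frame time drops to ~7 ms, which is why the overhead only shows up clearly in CPU-limited testing.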
     
    #804 troyan, Mar 17, 2021
    Last edited: Mar 17, 2021
    PSman1700, DavidGraham and iroboto like this.
  5. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,761
    Likes Received:
    6,896
    The DX11 and DX12 drivers aren't the same; they have different APIs. It could be that the way AMD's DX12 driver is written gives it a slightly more cache-friendly access pattern.
     
  6. techuse

    Regular Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    743
    Likes Received:
    440
    Our best bet is that enough testing happens and Nvidia issues a response.

    GameGPU weighs in.
     
    #806 techuse, Mar 17, 2021
    Last edited: Mar 18, 2021
  7. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,761
    Likes Received:
    6,896
    @techuse Yah, I just read the gamegpu article. The translation I read was pretty poor, but their testing with a Ryzen 3600 showed pretty much the same thing, with a Vega 64 beating the RTX 3090 by quite a bit when lowering settings to be CPU limited. From HBU's testing the 5600X didn't really seem to have any problems, with the 3090 keeping close to the 6900 XT. I really think it's more than just clock speed and probably has to do with memory latency.
     
    PSman1700, Lightman and DavidGraham like this.
  8. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,511
    Likes Received:
    4,129
    Yeah here it is:
    https://gamegpu.com/блоги/sravnenie-protsessorozavisimosti-geforce-i-radeon-v-dx12

    This is a monumental WTF moment right there; a Vega 64 should under no circumstances be faster than a 3090, no matter what, but it happens in those DX12 games with the Ryzen 3600X. The question now becomes: does an old Core i7/i5 exhibit the same problems?
     
    dskneo, PSman1700, Lightman and 2 others like this.
  9. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,761
    Likes Received:
    6,896
    It does with an i3 10100 (64K L1, 256K L2, 6MB L3) and a Ryzen 3600 (64K L1, 512K L2, 16MB L3 with access latency penalties for half), but not really with a Ryzen 5600 (64K L1, 512K L2, 32MB L3 with no penalties), which is why I think it's cache related. (edit: also that anecdotal but very similar BFV user video with an old i7)

    Edit: Look at the gamegpu aida scores
    L1 1.3ns
    L2 3.9ns
    L3 14.5ns (will be worse if you read across CCX boundary)
    RAM 85.4ns

    All it would take is for the Nvidia driver to hit higher levels of cache more often, or RAM more often, and you can get a 10-20% performance difference.

    Modern games probably hit the caches hard, which will cause more misses for other threads. Open-world games like Watch Dogs would probably be the worst. You might not see the issues if you're GPU limited, because maybe the memory system is able to keep up while the CPU is waiting on the GPU. If you become CPU limited, suddenly the CPU threads start going as fast as they can, and maybe these smaller or higher-latency caches cause the Nvidia driver a little more pain.
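    A back-of-the-envelope sketch of that argument, using the AIDA latencies quoted above. The hit rates are invented purely for illustration; how much of this shows up as frame time depends on how memory-bound the driver thread actually is.

    ```cpp
    // Average memory access latency for two hypothetical cache-hit profiles,
    // using the AIDA latencies from the post above. The hit-rate splits are
    // made up: they only illustrate how a modest shift toward L3/RAM adds up
    // for a CPU-limited thread.
    #include <cstdio>

    struct Profile { double l1, l2, l3, ram; };  // fraction of accesses served by each level

    double avg_latency_ns(const Profile& p) {
        return p.l1 * 1.3 + p.l2 * 3.9 + p.l3 * 14.5 + p.ram * 85.4;
    }

    int main() {
        Profile friendly   { 0.94, 0.04, 0.015, 0.005 };  // mostly L1/L2 hits
        Profile unfriendly { 0.92, 0.05, 0.020, 0.010 };  // slightly more L3/RAM traffic

        double a = avg_latency_ns(friendly);
        double b = avg_latency_ns(unfriendly);
        std::printf("friendly:   %.2f ns/access\n", a);
        std::printf("unfriendly: %.2f ns/access (%.0f%% slower per access)\n",
                    b, 100.0 * (b - a) / a);
        return 0;
    }
    ```

    With these made-up splits the "unfriendly" profile is roughly a quarter slower per access; diluted by the non-memory-bound parts of the frame, that is the kind of thing that could plausibly land in the 10-20% range seen in the benchmarks.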
     
    #809 Scott_Arm, Mar 18, 2021
    Last edited: Mar 19, 2021
    BRiT likes this.
  10. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,044
    Likes Received:
    15,796
    Location:
    The North
    wait.. so the reason Nvidia is doing worse in CPU-limited scenarios all comes down to the CPU's cache?

    So the drivers are basically not cache-hit friendly? Weird. I would have figured that would have been the lowest-hanging fruit for them.
     
  11. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    682
    Likes Received:
    363
    Rockstar makes like a billion dollars a year on GTAV Online and never bothered to parse a single JSON file in a timely manner and so had ungodly load times for customers and the devs alike. At this point I just kind of assume the obvious can be missed even for huge, really successful companies.
     
    T2098, Kej, PSman1700 and 3 others like this.
  12. techuse

    Regular Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    743
    Likes Received:
    440
    I think the most likely scenario is they just don't care enough to fix it. Given how benchmarks are conducted why would they?
     
    Cuthalu likes this.
  13. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,761
    Likes Received:
    6,896
    Just a guess, but from the 3600 to the 5600 the biggest advancements for gaming were improved cache and cache latency. Maybe AMD is just a little more efficient in terms of cache alignment and data access patterns because of all the time they spent working with the dog shit Jaguar CPUs in the consoles lol.
     
    Lightman likes this.
  14. Putas

    Regular Newcomer

    Joined:
    Nov 7, 2004
    Messages:
    533
    Likes Received:
    176
    Ryzen 3300X should be interesting for latencies.
     
    Lightman and CarstenS like this.
  15. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    309
    Likes Received:
    350
    FWIW I don't think any of the discussion about multithreading or the software vs hardware scheduler crap in the background is related to the reasons why NV sees higher overhead on D3D12 ...

    Root-level views exist in D3D12 to cover the use cases of the binding model that would otherwise be bad on their hardware, but nearly no developers use them because they don't have bounds checking, so they hate using the feature for the most part! This ties in with the last sentence: instead of using SetGraphicsRootConstantBufferView, some games will spam CreateConstantBufferView just before every draw, which adds even more overhead. It all starts adding up when developers are abusing all these defects behind D3D12's binding model.

    Bindless on NV (unlike AMD) has idiosyncratic interactions: they can't use constant memory with bindless CBVs, so they load the CBVs from global memory, which is a performance killer (none of this matters on AMD) ...
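    As an illustration of the two per-draw CBV patterns described here, a rough D3D12 sketch (not from the thread; it assumes the device, command list, descriptor heap, per-draw constant-buffer addresses and matching root signatures already exist, that SetDescriptorHeaps has been called, and it omits all error handling). Pattern A binds a root CBV per draw; pattern B creates a fresh descriptor before every draw and binds it through a table, which is the extra per-draw CPU work being complained about.

    ```cpp
    // Sketch only: contrasting root-level CBVs with per-draw descriptor creation.
    #include <d3d12.h>
    #include <vector>

    struct DrawItem {
        D3D12_GPU_VIRTUAL_ADDRESS cbAddress;  // per-draw constants, 256-byte aligned
        UINT                      indexCount;
    };

    // Pattern A: root-level CBV. One cheap call per draw, no descriptor written.
    void RecordWithRootCBV(ID3D12GraphicsCommandList* cmd, const std::vector<DrawItem>& draws)
    {
        for (const DrawItem& d : draws) {
            cmd->SetGraphicsRootConstantBufferView(/*RootParameterIndex=*/0, d.cbAddress);
            cmd->DrawIndexedInstanced(d.indexCount, 1, 0, 0, 0);
        }
    }

    // Pattern B: create a CBV descriptor before every draw and bind it through a
    // descriptor table - the "spam CreateConstantBufferView" pattern from the post.
    void RecordWithPerDrawCBV(ID3D12Device* device, ID3D12GraphicsCommandList* cmd,
                              ID3D12DescriptorHeap* cbvHeap, const std::vector<DrawItem>& draws)
    {
        const UINT stride =
            device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
        D3D12_CPU_DESCRIPTOR_HANDLE cpu = cbvHeap->GetCPUDescriptorHandleForHeapStart();
        D3D12_GPU_DESCRIPTOR_HANDLE gpu = cbvHeap->GetGPUDescriptorHandleForHeapStart();

        for (const DrawItem& d : draws) {
            D3D12_CONSTANT_BUFFER_VIEW_DESC desc = {};
            desc.BufferLocation = d.cbAddress;
            desc.SizeInBytes    = 256;                     // CBV sizes must be 256-byte multiples
            device->CreateConstantBufferView(&desc, cpu);  // extra CPU work on every single draw

            cmd->SetGraphicsRootDescriptorTable(/*RootParameterIndex=*/0, gpu);
            cmd->DrawIndexedInstanced(d.indexCount, 1, 0, 0, 0);

            cpu.ptr += stride;                             // step to the next heap slot
            gpu.ptr += stride;
        }
    }
    ```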
     
    T2098, Kej, PSman1700 and 9 others like this.
  16. techuse

    Regular Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    743
    Likes Received:
    440
    Why do you consider the DX12 binding model defective?
     
  17. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    309
    Likes Received:
    350
    I don't in theory, but it's 'how' developers are 'using' it that makes it defective, since in practice it means the D3D12 binding model isn't all that different from Mantle's binding model, which was pretty much only designed to run on AMD HW. That makes it annoying for other HW vendors trying to emulate this behaviour to be consistent with their competitor's HW ...

    Microsoft revised the binding model with shader model 6.6, but I don't know if that was in response to what they saw of its potential to be misused in games ...
     
    Kej, PSman1700, iroboto and 5 others like this.
  18. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,761
    Likes Received:
    6,896
    @Lurkmass It does seem like this "issue" can pop up on dx11 games.



    This is an anecdotal account, and I'd like to see more testing of DX11, but this user has a huge performance regression after upgrading from an R9 390 to a 1660 Ti on his old i7, playing Battlefield V Firestorm in DX11. There's a follow-up video where he says he "fixed" his issue by giving his CPU a 2% overclock, overclocking his memory substantially and tightening the timings, and disabling the fullscreen optimizations setting for Battlefield V. He gets a massive improvement in performance, much greater than the sum of those changes would suggest.
     
  19. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,761
    Likes Received:
    6,896
    @Lurkmass Oh, global memory is heap allocated, so it's cache-unfriendly if you're spamming allocations.

    If anyone wants to know why cache hit rates matter, and why non-linear heap allocations matter, start here
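    For a concrete feel of that effect, here's a toy C++ microbenchmark (my own sketch, not from the linked video): it sums the same data once through a contiguous array and once by chasing pointers through individually heap-allocated nodes linked in random order, so the second traversal misses cache on almost every step.

    ```cpp
    // Toy microbenchmark: contiguous traversal vs pointer-chasing through
    // scattered heap allocations. The random link order defeats the prefetcher,
    // so the second loop pays cache/RAM latency on nearly every element.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    struct Node { long value; Node* next; };

    int main() {
        const size_t n = 1 << 22;  // ~4M elements

        std::vector<long> linear(n, 1);

        // Individually allocated nodes, linked in shuffled order.
        std::vector<Node*> nodes;
        nodes.reserve(n);
        for (size_t i = 0; i < n; ++i) nodes.push_back(new Node{1, nullptr});
        std::vector<size_t> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::mt19937 rng(42);
        std::shuffle(order.begin(), order.end(), rng);
        for (size_t i = 0; i + 1 < n; ++i) nodes[order[i]]->next = nodes[order[i + 1]];
        Node* head = nodes[order[0]];

        auto time_ms = [](auto&& fn) {
            auto t0 = std::chrono::steady_clock::now();
            long s = fn();
            auto t1 = std::chrono::steady_clock::now();
            std::printf("sum=%ld  ", s);
            return std::chrono::duration<double, std::milli>(t1 - t0).count();
        };

        double a = time_ms([&] { long s = 0; for (long v : linear) s += v; return s; });
        std::printf("contiguous:    %.1f ms\n", a);

        double b = time_ms([&] { long s = 0; for (Node* p = head; p; p = p->next) s += p->value; return s; });
        std::printf("pointer-chase: %.1f ms\n", b);

        for (Node* p : nodes) delete p;
        return 0;
    }
    ```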

     
    #819 Scott_Arm, Mar 19, 2021
    Last edited: Mar 19, 2021
    iroboto, Lightman, sonen and 2 others like this.
  20. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    48
    Likes Received:
    100
    This one makes sense to me. Nvidia's herculean software engineering effort to extract parallelism/multi-threading in DX11 is amazing, but it can't possibly come for free.

    Especially if it's not hard-coded in the driver and it's analyzing everything on the fly at runtime, chopping things up into bits that it can spread across multiple CPU cores, you're basically running a small compiler at the same time as the game, and that compiler has its own memory and cache footprint.

    If you've got CPU cores, cache size, and/or CPU memory bandwidth to burn, then this is probably an excellent tradeoff. If any of those 3 things are in short supply, you now have 2 things running concurrently competing for those same resources.

    On a dual- or quad-core CPU, where the multi-threading built into the game engine plus driver was probably enough to fully load the CPU anyway, NV's multithreading magic is probably going to hurt performance a fair bit, burning resources that the game could have fully used under a more naive driver (Intel/AMD) that just lets the DX11 code run as is.

    On a powerhouse system with fast RAM, large caches, and lots of CPU cores (but still a hard ceiling on single core performance) then you want to let NV's DX11 driver run wild and try to spread the load across as many cores as possible, even if by doing so it consumes an entire CPU core worth of overhead.
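    Purely as a conceptual sketch of that trade-off (this is not Nvidia's actual driver architecture, just a generic producer/consumer illustration): a worker thread that drains and translates API calls recorded by the game thread takes work off the critical path, but it is one more thread with its own queue, memory and cache footprint competing with the game.

    ```cpp
    // Conceptual illustration only: a "driver" worker thread consumes API calls
    // recorded by the game thread. Translation cost moves off the game thread,
    // at the price of an extra thread and its own working set.
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct ApiCall { int opcode; std::vector<unsigned char> args; };  // stand-in for a captured call

    class DeferredDriver {
    public:
        void Submit(ApiCall call) {                    // called from the game thread
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(call)); }
            cv_.notify_one();
        }
        void Run() {                                   // worker: translate to "hardware" commands
            for (;;) {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return !q_.empty() || done_; });
                if (q_.empty() && done_) return;
                ApiCall c = std::move(q_.front()); q_.pop();
                lk.unlock();
                Translate(c);                          // the extra CPU and cache footprint lives here
            }
        }
        void Finish() { { std::lock_guard<std::mutex> lk(m_); done_ = true; } cv_.notify_one(); }
    private:
        void Translate(const ApiCall&) { /* build GPU command packets */ }
        std::queue<ApiCall> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
    };

    int main() {
        DeferredDriver drv;
        std::thread worker(&DeferredDriver::Run, &drv);
        for (int i = 0; i < 1000; ++i) drv.Submit({i, {}});  // game thread issuing draws
        drv.Finish();
        worker.join();
        return 0;
    }
    ```

    On a CPU with spare cores and cache this parallelism is nearly free; on a small or cache-starved CPU the worker simply steals the resources the game needed, which is the scenario the post above describes.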
     
    xpea, Lightman, Malo and 1 other person like this.