Nvidia's 3000 Series RTX GPU [3090s with different memory capacity]

Yeah, that was a big "woops" on their part. Not sure how that even happened.
People make mistakes: lack of communication, or poor oversight of critical code forks for drivers already in development. Shit happens and gamers suffer for it.

Hopefully they won't oops again when they release the next round of GPUs lol
 
It was visible from the first games: DX12 on nVidia hardware doesn't work right. For example, Hitman 1 with DX12 is slower than DX11 on my 4C/8T + 2060 notebook in a CPU-limited scenario. There is a software (API) overhead involved in getting nVidia GPUs running, and without proper multi-threading the nVidia DX11 driver is simply superior to DX12...
I do recall this as well.
I was never entirely sure at the time whether that was just a result of architectural differences, etc. I know AMD has been working really hard on their command processors for the consoles, which may have carried over to the PC space, so the reliance on the CPU may be lower (just thinking about how customizations that either Sony or MS made could have been brought back). At least that's what I assumed. I was under the assumption that over time things would equalize or get better for both as DX12 matured.
 
Would have been nice if they specified which API each of the games was using. I have no idea what API Planet Zoo uses, for example.
Part of it is the translation required, I'm sure, but the way the data is presented could be clearer. In some of the graphs they even switch the colours between Nvidia and AMD!
 
It was visible from the first games: DX12 on nVidia hardware doesn't work right. For example, Hitman 1 with DX12 is slower than DX11 on my 4C/8T + 2060 notebook in a CPU-limited scenario. There is a software (API) overhead involved in getting nVidia GPUs running, and without proper multi-threading the nVidia DX11 driver is simply superior to DX12...
There's little point in presenting anecdotal one-off impressions as anything more than that; I think the thread has established that clearly by now.

I could just as easily use an example from one of the earliest games with a DX12 implementation, Rise of the Tomb Raider. Obtaining a locked 60fps at 1080p on my GTX 1660/9400F system is not possible under DX11. Under DX12, the scenes that are CPU-bottlenecked now run at 60+ fps. So what does that mean? In isolation, not much.
 
In the 3DMark API overhead test, DX12 is 5x+ faster than DX11 MT on an nVidia GPU. Yet here we are talking about games which are clearly CPU-limited under DX12 and are only 2x faster (best case, Shadow of the Tomb Raider) or not one frame faster.
The fact is that most DX12 implementations are still CPU-limited and are nothing more than brute-force implementations to get the nVidia GPU running.

Here is an example from Shadow of the Tomb Raider:



This game shows what is limiting the performance. With DX12 it's clearly not the DX12 driver when the "GPU renders" a 720p image in ~3.7 ms...

And another from Hitman 2:



DX11 is faster with less than 50% GPU usage...
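Rough back-of-the-envelope on what that ~3.7 ms figure implies, as a sketch; the observed fps below is an illustrative assumption, not a number taken from the screenshots:

Code:
# If the GPU finishes a 720p frame in ~3.7 ms, the GPU-only ceiling is ~270 fps.
# Any observed fps well below that points at the CPU/driver side as the limiter.
# "observed_fps" is an assumed value for illustration.

gpu_frame_ms = 3.7                      # GPU render time per frame (from the overlay)
observed_fps = 150                      # assumed fps in the CPU-limited scene

gpu_ceiling_fps = 1000.0 / gpu_frame_ms
frame_ms = 1000.0 / observed_fps
# CPU and GPU work overlap in a pipelined renderer, so this gap is only a rough
# indication of how much time is lost to CPU/driver work, not an exact split.
gap_ms = max(frame_ms - gpu_frame_ms, 0.0)

print(f"GPU-only ceiling: {gpu_ceiling_fps:.0f} fps")
print(f"At {observed_fps} fps the frame takes {frame_ms:.1f} ms, "
      f"leaving roughly {gap_ms:.1f} ms unaccounted for by the GPU")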
 
@troyan Missing your point a little. SotTR is much faster in DX12 but hits a CPU limit, while DX11 is slower and is GPU-bound only 37% of the time. That's on a 9900K with 8 cores. I don't know what to take from that other than that DX12 is much better than DX11 in that game. Hitman 2 looks to run similarly across both APIs, assuming that's still on the 9900K?
 
One thing I find interesting about the driver comparison: from Hardware Unboxed's testing, there's a huge regression for an RTX 3070 on AMD's Zen and Zen+. There's even a regression with Zen 2, though not as much. With the AMD card, both Zen 2 and Zen 3 hit the GPU limit, whereas on Nvidia, Zen 2 is about 20% behind Zen 3.

[chart: RTX 3070 performance across Zen / Zen+ / Zen 2 / Zen 3]

Then they show a similar regression from AMD to Nvidia on an i3-10100, which has only 4 cores (8 threads) but, importantly, only 6MB of L3 cache. The drop from the 5600X to the 10100 is much bigger on Nvidia. If you are stressing the CPU, what is likely happening to the cache? It's probably getting thrashed (cache misses).

[chart: the same comparison on the i3-10100]

Zen, Zen+ and Zen 2 all had gaming performance issues because of latency between CCXs and cache latency in general. The 5600X only has one CCX, right? Maybe it's not so much the CPU usage or the number of threads that causes the regression, but cache performance. The 5600X has 32MB of L3 and one CCX, essentially solving the issues of both the i3-10100 and the Zen, Zen+ and Zen 2 chips. Maybe you can't reproduce this issue by taking a 5800X and disabling cores or lowering the clock to simulate an older CPU, or a true low-end CPU, because the 5800X is one CCX with improved cache latency and a large L3.

Edit: When you get a cache miss, the CPU stalls and actual throughput drops. I believe, though I'm not 100% sure, that those stalls would still show up as CPU usage: the core is busy waiting on memory, so it counts as "usage."

That old i7-6700K has 8MB of L3 (4x2MB) and smaller caches in general. It would also fit the pattern of this being something to do with the way the Nvidia driver uses the cache.

Edit 2: An interesting test might be taking an 8-core Intel CPU, disabling half the cores and lowering the clock, to see if it regresses to the performance of the 10100. You can't really take a Zen 3 and disable cores to simulate a Zen 2 or older because of the changes to the CCX layout.
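If someone wants to poke at the cache angle directly, here's a crude pointer-chasing sketch. It's pure Python, so interpreter overhead blunts the effect compared to a C version, and the sizes are just assumptions chosen to go from "fits in cache" to "well past a 32MB L3":

Code:
# Dependent (pointer-chasing) accesses over working sets of different sizes.
# Once the working set spills past the L3, the average time per access rises.
import random
import time

def chase(n_slots, steps=2_000_000):
    # Build one big permutation cycle so every access depends on the previous one.
    order = list(range(n_slots))
    random.shuffle(order)
    nxt = [0] * n_slots
    for i in range(n_slots):
        nxt[order[i]] = order[(i + 1) % n_slots]
    idx = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = nxt[idx]
    return (time.perf_counter() - t0) / steps * 1e9  # ns per dependent access

# Working sets (including Python object overhead) of roughly a few hundred KB,
# ~20 MB, and ~200 MB -- i.e. inside, around, and far beyond a 32MB L3.
for slots in (4_000, 400_000, 4_000_000):
    print(f"{slots:>9,} slots: {chase(slots):6.1f} ns per dependent access")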
 
In the 3DMark API overhead test, DX12 is 5x+ faster than DX11 MT on an nVidia GPU. Yet here we are talking about games which are clearly CPU-limited under DX12 and are only 2x faster (best case, Shadow of the Tomb Raider) or not one frame faster.
The fact is that most DX12 implementations are still CPU-limited and are nothing more than brute-force implementations to get the nVidia GPU running.

Here is an example from Shadow of the Tomb Raider:



This game shows what is limiting the performance. With DX12 it's clearly not the DX12 driver when the "GPU renders" a 720p image in ~3.7 ms...

And another from Hitman 2:



DX11 is faster with less than 50% GPU usage...
Hmm. I may not have followed this argument correctly, so bear with me if I've read your post wrong.

In my mind, I'm not sure this particular comparison works. When DX11 is the bottleneck, it doesn't actually use more CPU, it uses less, and that is still a CPU bottleneck. The bottleneck is a failure to saturate the CPU to get through the API: DX11 oversaturates the main thread and can't divide the remaining work across the rest of the cores, hence CPU usage stays low.

So in the case of Hitman, it looks like rendering is very much a single-threaded thing there, whether DX12 or DX11, and that's why you're seeing better performance on DX11.

When you look at SotTR, the frame-rate difference is monumental. And that's because DX12 will use all the available cores, and even though the CPU usage is the same, the output in frame rate is significantly higher. Technically speaking, with respect to the thread topic, DX12 is working correctly for Nvidia here. If DX12 is used correctly, and assuming there is no GPU bottleneck (infinite GPU resources), then DX12 should in theory get CPU usage up to 100%, something that's impossible for DX11.
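A toy model of that argument, with invented per-frame costs just to show the shape of it (none of these numbers come from the games being discussed):

Code:
# Crude model of CPU-side frame cost. "DX11-style": nearly all submission work
# funnels through one thread. "DX12-style": most recording can be spread across
# worker threads. All costs are made up for illustration.

def cpu_limited_fps(serial_ms, parallel_ms, threads):
    # CPU frame time = unavoidable serial part + parallel part split over threads.
    frame_ms = serial_ms + parallel_ms / threads
    return 1000.0 / frame_ms

submission_ms = 12.0   # per-frame API/driver submission work (assumed)
game_logic_ms = 4.0    # serial engine work that can't be spread (assumed)

dx11_style = cpu_limited_fps(game_logic_ms + submission_ms, 0.0, 1)  # one thread does it all
dx12_style = cpu_limited_fps(game_logic_ms, submission_ms, 6)        # 6 cores record command lists

print(f"Single-thread submission ceiling: {dx11_style:.0f} fps")
print(f"Spread over 6 threads:            {dx12_style:.0f} fps")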
 
To follow up on my post above, I'd probably design a test around the following image, which I think is Watch Dogs Legion at 1080p medium:

[chart: Watch Dogs Legion, 1080p medium, 5600X vs i3-10100, AMD vs Nvidia]



AMD
5600X: 128 fps
10100: 110 fps
10700 (4 cores disabled): 110 fps + x%

Nvidia
5600X: 140 fps (+10%)
10100: 94 fps (-15%)
10700 (4 cores disabled): ** if cache behaviour is the issue, this should be close to the AMD 10700 score; if it's close to the Nvidia 10100 score, it's something else **
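Spelling the decision rule out as a sketch; the 5600X/10100 numbers are from the chart above, while the 10700-with-4-cores-disabled results and the 5% tolerance are placeholders to be filled in by actual testing:

Code:
# Proposed test matrix and how I'd read the result. The None entries are the
# measurements that would still need to be taken; the tolerance is an assumption.

results = {
    ("AMD",    "5600X"): 128,
    ("AMD",    "10100"): 110,
    ("NVIDIA", "5600X"): 140,
    ("NVIDIA", "10100"):  94,
    ("AMD",    "10700, 4 cores disabled"): None,   # to be measured
    ("NVIDIA", "10700, 4 cores disabled"): None,   # to be measured
}

def interpret(nv_10700_4c, amd_10700_4c, tolerance=0.05):
    # Close to its AMD counterpart -> the 10100's drop was likely about its 6MB L3.
    # Close to the Nvidia 10100 score -> cache isn't the explanation.
    if abs(nv_10700_4c - amd_10700_4c) / amd_10700_4c <= tolerance:
        return "consistent with cache size/behaviour being the issue"
    if abs(nv_10700_4c - results[("NVIDIA", "10100")]) / results[("NVIDIA", "10100")] <= tolerance:
        return "points at something other than cache (core count, scheduling, ...)"
    return "inconclusive, needs more data"

# Example call with made-up measurements:
print(interpret(nv_10700_4c=125, amd_10700_4c=126))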
 
A further in-depth analysis of Intel CPUs is required; I don't remember old Intel CPUs showing the same behavior in DX12 as these AMD Zen CPUs.

On Intel CPUs, as far as I know, all cores have access to the L3 with the same latency. Zen, Zen+ and Zen 2 cores have access to the full L3, but if they access the half of the L3 that's attached to the other CCX, there's a latency penalty. So if a thread gets moved from one CCX to another on earlier Zen CPUs, it pays a latency cost when reading from L3 across the CCX boundary. Zen 3 I think has the same penalty across CCXs, but a CCX now has up to 8 cores, so anything 5800X and below is a single CCX. The Ryzen 1600, 2600 and 3600 are all two CCXs, each with 3 active cores.

I'm not sure whether graphics drivers run in kernel mode (I'm assuming yes?), and if so that would mean some context switches and cache evictions take place. A newer game like Battlefield V or Watch Dogs Legion might hit all cores harder, putting more burden on the cache hierarchy. If Nvidia drivers are a little more active in terms of memory accesses, they could see relatively worse performance on Intel and AMD CPUs with small caches, or on older Zens that have higher cache latency.
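Rough average-latency arithmetic for that idea; all the latencies and hit/miss splits below are ballpark assumptions, not measured values:

Code:
# Average memory access time if some share of L3 traffic has to cross the CCX
# boundary (or miss to DRAM). Numbers are rough assumptions for illustration.

L3_LOCAL_NS  = 10   # local L3 hit (assumed)
CROSS_CCX_NS = 80   # access crossing the CCX/Infinity Fabric boundary (assumed)
DRAM_NS      = 90   # miss all the way to DRAM (assumed)

def avg_access_ns(local_hit, cross_hit, miss):
    # local_hit + cross_hit + miss should sum to 1.0
    return local_hit * L3_LOCAL_NS + cross_hit * CROSS_CCX_NS + miss * DRAM_NS

# A driver thread whose data stays on its own CCX vs one whose accesses keep
# landing on the other CCX's slice of the L3:
print(f"mostly local:    {avg_access_ns(0.90, 0.02, 0.08):.1f} ns per access")
print(f"bouncing across: {avg_access_ns(0.60, 0.32, 0.08):.1f} ns per access")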
 
A further in-depth analysis of Intel CPUs is required; I don't remember old Intel CPUs showing the same behavior in DX12 as these AMD Zen CPUs.
Maybe because the slightly older Intel CPUs could still hit 5GHz, whereas Zen/Zen+ had terrible clock speeds.
 
In the 3DMark API overhead test, DX12 is 5x+ faster than DX11 MT on an nVidia GPU. Yet here we are talking about games which are clearly CPU-limited under DX12 and are only 2x faster (best case, Shadow of the Tomb Raider) or not one frame faster.
The fact is that most DX12 implementations are still CPU-limited and are nothing more than brute-force implementations to get the nVidia GPU running.

Here is an example from Shadow of the Tomb Raider:



This game shows what is limiting the performance. With DX12 it's clearly not the DX12 driver when the "GPU renders" a 720p image in ~3.7 ms...

And another from Hitman 2:



DX11 is faster with less than 50% GPU usage...
Years and several architectures later, and developers still face issues. At what point does the blame begin to fall on Nvidia for their hardware design? Devs talking about the experience of working with low-level APIs on each vendor's GPUs would be great.
 
Maybe because the slightly older Intel CPUs could still hit 5GHz, whereas Zen/Zen+ had terrible clock speeds.

I know that in a very CPU-heavy game like COD Warzone, Intel is still king if you're overclocking (not sure about out of the box with XMP profiles for 3200 RAM etc.). I don't think it's so much the clock speed as the latency of the cache hierarchy and the ease of getting stable RAM overclocks and tight memory timings. I think the 5800X and 5900X are getting close, but they're way more of a pain in the ass to overclock and tweak because there's a whole bunch of different clocks and ratios you have to worry about.

I'm imagining taking the same system and swapping GPUs between AMD and Nvidia. Say the Nvidia driver makes memory requests at an 11:10 ratio compared to the AMD driver. If the cache is already getting hit hard and that one extra request in ten sometimes turns into an L3 miss, or a read across CCXs, suddenly you have this extra "overhead" that wouldn't be visible when you're playing something with lots of headroom on the CPU.
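Putting made-up numbers on that 11:10 idea, just to show how slightly more requests plus a slightly worse hit rate could add up to a visible fps gap (every value here is an assumption):

Code:
# Per-frame stall budget if one driver issues ~10% more memory requests and the
# larger working set nudges the L3 miss rate up. All numbers are invented.

BASE_CPU_WORK_MS = 5.0        # compute portion of the CPU frame (assumed)
REQUESTS_BASE    = 1_000_000  # memory requests per frame with driver A (assumed)
MISS_PENALTY_NS  = 80         # cost of an L3 miss / cross-CCX read (assumed)

def frame_ms(requests, miss_rate):
    stall_ms = requests * miss_rate * MISS_PENALTY_NS / 1e6
    return BASE_CPU_WORK_MS + stall_ms

driver_a_ms = frame_ms(REQUESTS_BASE, 0.02)             # smaller footprint, 2% misses
driver_b_ms = frame_ms(int(REQUESTS_BASE * 1.1), 0.03)  # 11:10 requests, 3% misses

print(f"driver A: {driver_a_ms:.2f} ms/frame -> {1000 / driver_a_ms:.0f} fps")
print(f"driver B: {driver_b_ms:.2f} ms/frame -> {1000 / driver_b_ms:.0f} fps")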
 
It would be nice to try with a 6800 XT (impossible to find, though), but in CPU-bound cases I've seen a massive improvement going from the 3800X to the 5800X in SotTR D3D12:

[screenshot: sottr_720p.jpg, SotTR D3D12 benchmark at 720p on the 5800X]

^ Don't think I could break 160-170 in the same test using the 3800X. Wish I had a 6800XT to compare :p
 
I know that in a very CPU-heavy game like COD Warzone, Intel is still king if you're overclocking (not sure about out of the box with XMP profiles for 3200 RAM etc.). I don't think it's so much the clock speed as the latency of the cache hierarchy and the ease of getting stable RAM overclocks and tight memory timings. I think the 5800X and 5900X are getting close, but they're way more of a pain in the ass to overclock and tweak because there's a whole bunch of different clocks and ratios you have to worry about.

I'm imagining taking the same system and swapping GPUs between AMD and Nvidia. Say the Nvidia driver makes memory requests at an 11:10 ratio compared to the AMD driver. If the cache is already getting hit hard and that one extra request in ten sometimes turns into an L3 miss, or a read across CCXs, suddenly you have this extra "overhead" that wouldn't be visible when you're playing something with lots of headroom on the CPU.
If you're saying that AMD drivers are better optimized for the Zen 1/2 CCX structure, then I don't see why this would be limited only to DX12.
 