Vulkan is a GCN low level construct?

We see the same in our render application (DX11): for the same performance, our process (obviously including the driver threads) uses more CPU time with Nvidia drivers. (I don't think I measured how much it actually affected performance once the CPU became saturated.)
And that is of course an entirely different question from which is more CPU-limited in the usual setting with a non-saturated CPU - it could very well be, as sebbi suggested, that Nvidia is running some low-priority analysis/optimization threads.
 
It's customary in distributed computing to count aggregate MHz, so an 8-core 4 GHz CPU is really 32 GHz. If you look at aggregate utilization of a game running on Nvidia vs. AMD, you see very clearly that the Nvidia driver achieves what it achieves with a ton more energy/clocks.
If you feel like hacking a few games to see how the drivers perform without their internal threading, pass
D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS on the hooked CreateDevice call.
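A minimal sketch of what that hooked creation might look like (the helper name and setup here are illustrative, not from any particular game):

```cpp
// Create the D3D11 device with the driver's internal threading optimizations
// disabled, so the driver's CPU cost shows up on the calling thread instead
// of on its own worker threads.
#include <d3d11.h>

HRESULT CreateDeviceWithoutDriverThreads(ID3D11Device** outDevice,
                                         ID3D11DeviceContext** outContext)
{
    UINT flags = D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS;
    D3D_FEATURE_LEVEL requested = D3D_FEATURE_LEVEL_11_0;
    D3D_FEATURE_LEVEL obtained;

    return D3D11CreateDevice(
        nullptr,                   // default adapter
        D3D_DRIVER_TYPE_HARDWARE,
        nullptr,                   // no software rasterizer module
        flags,
        &requested, 1,
        D3D11_SDK_VERSION,
        outDevice,
        &obtained,
        outContext);
}
```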
Sometimes it can be very annoying that the driver steals capacity from your own algorithms; sometimes it's annoying that you did something very effectively multi-threaded, but it's slower because you interfere with the driver threads. So far the observation has been that AMD's driver only uses a single thread (+ the calling thread), which is okayish.

Luckily we don't have to be bothered by that anymore, because DX12 gave devs ownership over threading.
 
NV does inter-warp scheduling driver-side, so there's that. NV should have higher overhead, but in practice (due to an effectively multi-threaded driver) it ends up less CPU-bound than its AMD counterparts. In DX12 and Vulkan the situation should favor AMD in theory, but DOOM says otherwise, and AotS also seems to get hit a little less hard by lowered CPU clocks on NV than on AMD.

NVM I'm basically reposting what was said on last page lol
 
For pure scheduling AMD should be more efficient; they do have hardware to do it, after all. There's more to a driver than just scheduling, though.
 
Yeah, it seems that GPU manufacturers have taken full advantage of the current situation (*). Cross-platform games (and engines) have been designed to run properly on ~1.6 GHz Jaguar CPUs. This leaves lots of idle CPU cycles on high-end PCs. In the current state, wasting a huge amount of CPU cycles in the driver is a good proposition, as long as it gives at least a tiny bit of savings on the GPU side. This is obviously a really bad thing for gaming on laptops or any shared-TDP configs such as integrated GPUs. Extra CPU work eats into the total TDP budget of the system, leaving less TDP for the GPU.

As you said, current wasteful GPU drivers are also a problem for using the PC CPU to do performance intensive multithreaded (gameplay) number crunching. You design your game logic according to the minimum required CPU. In extreme cases (lots of draw calls), the graphics driver could consume up to half of an older quad core system's clock cycles. This obviously means that the gameplay needs to be scaled down in order to run (similarly) on the minimum hardware configuration.

(*) Intel has specifically said that they don't like big bloated drivers. They don't track all bad application behavior (like setting duplicate state). This saves them CPU time, but costs them GPU time in badly written applications. Intel of course is in a different situation than AMD/Nvidia, as all their GPUs are integrated and share the TDP with the CPU.
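To illustrate what that kind of duplicate-state tracking amounts to, here is a hypothetical application-side sketch (not any vendor's actual driver code) that filters redundant state sets before they reach the API:

```cpp
// Hypothetical wrapper that skips redundant pixel shader binds - the same
// sort of bookkeeping a "bloated" driver would otherwise do on the app's behalf.
#include <d3d11.h>

class StateCache
{
public:
    explicit StateCache(ID3D11DeviceContext* ctx) : m_ctx(ctx) {}

    void SetPixelShader(ID3D11PixelShader* ps)
    {
        if (ps == m_currentPS)          // duplicate state: skip the API call
            return;
        m_currentPS = ps;
        m_ctx->PSSetShader(ps, nullptr, 0);
    }

private:
    ID3D11DeviceContext* m_ctx = nullptr;
    ID3D11PixelShader*   m_currentPS = nullptr;
};
```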
Most of the driver overhead goes to resource, state and residency tracking and translation/validation of commands. Ivan's (2014) presentation about the topic gives a good overview: http://www.slideshare.net/DevCentralAMD/introduction-to-dx12-by-ivan-nevraev

Nvidia's software scheduling doesn't seem to cost much CPU time, as DX12 and Vulkan show (almost) similar improvements as AMD gets. Do we know exactly what part of the scheduling Nvidia does in software?
 
There's this switch in the Nvidia drivers where you can turn off MT optimization. Is that not working correctly? I thought it was.

WRT the driver using the CPU: I think the Nvidia driver handles this quite intelligently. I've been doing some comparisons where I let our CPU tests (obviously rather CPU-heavy scenes, 720p, no AA/AF, and minimal post-processing as allowed per game) run on AMD and Nvidia GPUs, not only on top hardware but also on lower-end CPUs (i3-6100/FX-6300) of the sort a gamer would buy nowadays. No great surprises there, except that AMD seems to need that ONE STRONG CPU thread even more than Nvidia's approach does, because in some games like Anno or Ass. Creed Syndicate the gap between the GPU parts actually widened when going to the lower-end CPUs. Maybe I can post the results later today.

And then there's this:
Direct3D 11
Feature Level: D3D_FEATURE_LEVEL_11_0
Driver Concurrent Creates: Yes
Driver Command Lists: Yes
Double-precision Shaders: Yes
DirectCompute CS 4.x: Yes
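Those two driver threading caps map to D3D11_FEATURE_DATA_THREADING, which an application can query itself; a minimal sketch (device creation omitted):

```cpp
// Ask the driver whether it supports concurrent resource creation and
// driver-side command lists - the two caps quoted above.
#include <d3d11.h>
#include <cstdio>

void PrintThreadingCaps(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_THREADING caps = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                              &caps, sizeof(caps))))
    {
        std::printf("Driver Concurrent Creates: %s\n",
                    caps.DriverConcurrentCreates ? "Yes" : "No");
        std::printf("Driver Command Lists:      %s\n",
                    caps.DriverCommandLists ? "Yes" : "No");
    }
}
```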
 
If Nvidia is largely just block-scheduling the entire device, I wouldn't expect it to take much processing power. Async shading and high-priority compute, I would think, are what make things interesting for obtaining peak performance and proper execution times.

The resource/state/residency portion would largely go away with the low level APIs and bundles. Or anything approaching a 100GB/s link to system memory.
 
Thanks, I wanted to understand where DX11 differs, i.e. from the console approach. The presentation is about DX12, but I find it quite compelling for my question/case.

...wait: slide 39 reports a lot of KMD usage - which was the point I raised before (I think with another poster, OpenGLguy?).
IF all the work is moved to user-mode command queues, which I suppose are available on all GPUs, what does the KMD still have to do there?
 
It doesn't go away, it's now on your plate! You can be smarter about it, or not. That's why low-level is a double-edged sword.
Bundles are much easier to implement on the application side, because the application knows which draw calls and data are tied together (spatially and temporally). For example, you could simply build bundles based on your octree leaf nodes; the same nodes are usually visible across consecutive frames. Bundles greatly reduce the driver translation and validation work, as the driver validates the bundle at creation time only (it can then be reused at much smaller cost).
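A rough D3D12 sketch of that idea (assuming an already-created device, root signature, PSO and vertex buffer; the per-leaf policy and helper name are illustrative, error handling omitted):

```cpp
// Record a reusable bundle once (e.g. per octree leaf node), then replay it
// cheaply from the direct command list on every frame the node is visible.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

struct LeafBundle
{
    ComPtr<ID3D12CommandAllocator>    allocator; // must outlive the bundle
    ComPtr<ID3D12GraphicsCommandList> bundle;
};

LeafBundle RecordLeafBundle(ID3D12Device* device,
                            ID3D12RootSignature* rootSig,
                            ID3D12PipelineState* pso,
                            const D3D12_VERTEX_BUFFER_VIEW& vbView,
                            UINT vertexCount)
{
    LeafBundle out;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_BUNDLE,
                                   IID_PPV_ARGS(&out.allocator));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_BUNDLE,
                              out.allocator.Get(), pso,
                              IID_PPV_ARGS(&out.bundle));

    // Translation/validation cost is paid here, once, at record time.
    out.bundle->SetGraphicsRootSignature(rootSig);
    out.bundle->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    out.bundle->IASetVertexBuffers(0, 1, &vbView);
    out.bundle->DrawInstanced(vertexCount, 1, 0, 0);
    out.bundle->Close();
    return out;
}

// Replaying it each frame is then a single call on the direct command list:
//   directCommandList->ExecuteBundle(leaf.bundle.Get());
```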
 
They do mention later in the article that there is currently ongoing work with Vulkan ...
In parallel, we are also working with the respective Khronos working groups in order to find the best way to bring cross-vendor shader intrinsics already standardized in OpenGL over to Vulkan and SPIR-V. We are furthermore working on Vulkan and SPIR-V extensions to expose our native intrinsics, but prioritized the cross-vendor functionality higher, especially since there is notable overlap in functionality.
 
Yep, agreed - 'ongoing' and 'working on' being the operative words, which can be seen in more recent tests that have shown performance improvements in Vulkan for Doom with Nvidia cards; but it's worth remembering how mature Nvidia's development is with OpenGL and its extensions/driver optimisation.
From that context there is still more Nvidia can do with Vulkan (and the associated extensions), and they should continue to see performance gains IMO.

Cheers
 
Some gains, but I have suspicions they already got the big ones with cross-lane functionality. That would be a game changer for optimizing many compute-based post-process effects and for saving bandwidth, as there are no compression technologies for compute. Doom performance looks pretty damn good from both IHVs now, which seems a testament to the power of Vulkan.

It wouldn't be surprising if Nvidia simply didn't have intrinsics publicly exposed for Vulkan when Doom first released. Seeing as how OGL worked well for Nvidia and not for AMD, releasing when they did makes sense. If they had waited on Nvidia, they could simply have been stonewalled for a competitive advantage against AMD. Ultimately their choice didn't hurt anyone beyond the appearance that Nvidia may not be as good with Vulkan. Maybe there was some shader replacement involved because of a lack of optimization by Id, but it seems more likely it was just a matter of finalizing some driver work.

It could be interesting if a site ran some Vulkan benchmarks on other titles to see if there are any differences. Talos perhaps, which I think is starting to get more optimizations completed.
 
Yeah, agreed - which is what I was implying some weeks ago in this thread.
Good point on Talos; it would be great if that was also benchmarked again given the recent updates to the game and Nvidia's drivers.

Cheers
 