Vulkan is a GCN low level construct?

Discussion in 'Rendering Technology and APIs' started by DavidGraham, Jul 18, 2016.

  1. Psycho

    Regular

    Joined:
    Jun 7, 2008
    Messages:
    746
    Likes Received:
    41
    Location:
    Copenhagen
    We see the same in our render application (DX11): for the same performance, our process (obviously including the driver threads) uses more CPU time with Nvidia drivers. (I don't think I measured how much it actually affected performance once the CPU became saturated.)
    And that is of course an entirely different question from which is more CPU-limited in the usual setting with a non-saturated CPU. It could very well be, as sebbbi suggested, that Nvidia is running some low-priority analysis/optimization threads.
     
    CarstenS likes this.
  2. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    It's customary in distributed computing to count aggregate MHz, so an 8-core 4 GHz CPU is really 32 GHz. If you chart the aggregate utilization of a game running on Nvidia vs. AMD, you see very clearly that the Nvidia driver achieves what it achieves with a ton more energy/clock cycles.
    If you feel like hacking a few games to see how the drivers perform flat (without their internal threading), pass
    D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS to the hooked CreateDevice call.
    Sometimes it can be very annoying that the driver steals capacity from your own algorithms; sometimes it's annoying that you made something super effectively multi-threaded, but it's slower because you interfere with the driver threads. So far the observation has been that AMD's driver only uses a single thread (+ the calling thread), which is okay-ish.
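    For reference, the flag is passed at device creation. A minimal sketch of such a hooked/modified create call might look like this (Windows-only; the wrapper name is made up, and error handling is elided):

```cpp
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

// Create the device with the driver's internal worker threads disabled,
// so CPU-time measurements reflect the driver running "flat".
HRESULT CreateDeviceWithoutDriverThreads(ID3D11Device** device,
                                         ID3D11DeviceContext** context) {
    UINT flags = D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS;
    return D3D11CreateDevice(
        nullptr,                  // default adapter
        D3D_DRIVER_TYPE_HARDWARE,
        nullptr,                  // no software rasterizer module
        flags,
        nullptr, 0,               // default feature levels
        D3D11_SDK_VERSION,
        device,
        nullptr,                  // achieved feature level not needed
        context);
}
```

    Note the flag only suppresses the driver's optional worker threads; it doesn't change what work the driver does, only where the cycles show up.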

    Luckily we don't have to be bothered by that anymore, because DX12 gives devs ownership of threading.
     
    pMax and Lightman like this.
  3. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    NV does inter-warp scheduling on the driver side, so there's that. NV should have higher overhead, but in practice (due to an effectively multi-threaded driver) it ends up less CPU-bound than its AMD counterparts. In DX12 and Vulkan the situation should favor AMD in theory, but DOOM says otherwise, and AotS also seems to take a slightly smaller hit from lowered CPU clocks on NV than on AMD.

    NVM I'm basically reposting what was said on the last page lol
     
  4. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    For pure scheduling AMD should be more efficient; they do have hardware for it, after all. There's more to a driver than just scheduling, though.
     
    ieldra likes this.
  5. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Yeah, it seems that GPU manufacturers have taken full advantage of the current situation (*). Cross-platform games (and engines) have been designed to run properly on ~1.6 GHz Jaguar CPUs. This leaves lots of idle CPU cycles on high-end PCs. In the current state, wasting a huge amount of CPU cycles in the driver is a good proposition, as long as it gives at least a tiny bit of savings on the GPU side. This is obviously a really bad thing for gaming on laptops or any shared-TDP configs such as integrated GPUs. Extra CPU work consumes the total TDP budget of the system, leaving less TDP for the GPU.

    As you said, current wasteful GPU drivers are also a problem for using the PC CPU to do performance intensive multithreaded (gameplay) number crunching. You design your game logic according to the minimum required CPU. In extreme cases (lots of draw calls), the graphics driver could consume up to half of an older quad core system's clock cycles. This obviously means that the gameplay needs to be scaled down in order to run (similarly) on the minimum hardware configuration.

    (*) Intel has specifically said that they don't like big bloated drivers. They don't track all bad application behavior (like setting duplicate state). This saves them CPU time, but costs them GPU time in badly written applications. Intel of course is in a different situation from AMD/Nvidia, as all their GPUs are integrated and share the TDP with the CPU.
    Most of the driver overhead goes to resource, state, and residency tracking and to translation/validation of commands. Ivan Nevraev's (2014) presentation on the topic gives a good overview: http://www.slideshare.net/DevCentralAMD/introduction-to-dx12-by-ivan-nevraev

    Nvidia's software scheduling doesn't seem to cost much CPU time, as DX12 and Vulkan show (almost) the same improvements for Nvidia as AMD gets. Do we know exactly what part of the scheduling Nvidia does in software?
     
    #85 sebbbi, Sep 5, 2016
    Last edited: Sep 5, 2016
  6. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    There's this switch in the Nvidia drivers where you can turn off MT optimization. Is that not working correctly? I thought it was.

    WRT the driver using the CPU: I think the Nvidia driver handles this quite intelligently. I've been doing some comparisons where I let our CPU tests (obviously rather CPU-heavy scenes, 720p, no AA/AF and minimal post-processing as allowed per game) run on AMD and Nvidia GPUs, not only on top hardware but also on lower-end CPUs (i3-6100/FX-6300) such as a gamer would buy nowadays. No great surprises there, except that AMD seems to need that ONE STRONG CPU thread even more than Nvidia's approach does, because in some games like Anno or Assassin's Creed Syndicate, the gap between the GPUs actually widened when going to the lower-end CPUs. Maybe I can post the results later today.

    And then there's this:
     
    #86 CarstenS, Sep 5, 2016
    Last edited: Sep 5, 2016
    CSI PC and DavidGraham like this.
  7. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    If they are largely just block-scheduling the entire device, I wouldn't expect it to take much processing power. Async shading and high-priority compute, I would think, make it interesting to obtain peak performance and proper execution timing.

    The resource/state/residency portion would largely go away with the low-level APIs and bundles. Or with anything approaching a 100 GB/s link to system memory.
     
  8. pMax

    Regular

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
    Thanks, I wanted to understand where DX11 differs from, e.g., the console approach. The presentation is about DX12, but I find it quite relevant to my question/case.

    ...wait: slide 39 reports a lot of KMD usage, which was the point I raised before (I think with another poster, OpenGLguy?).
    If all the work is moved to user-mode command queues, which I suppose all GPUs provide, what does the KMD have left to do there?
     
    #88 pMax, Sep 5, 2016
    Last edited: Sep 5, 2016
  9. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    It doesn't go away; it's now on your plate! You can be smarter about it, or not. That's why low-level is a double-edged sword.
     
    ieldra and CarstenS like this.
  10. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Bundles are much easier to implement on the application side, because the application knows which draw calls and data are tied together (spatially and temporally). For example, you could simply build bundles from your octree leaf nodes. The same nodes are usually visible in consecutive frames. Bundles greatly reduce the driver's translation and validation work, as the driver validates a bundle only at creation time (it can then be reused at much smaller cost).
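    The pattern sebbbi describes can be sketched as a cache of pre-validated bundles keyed by octree leaf (all names and the Bundle struct here are hypothetical stand-ins, not any real API): the expensive validation happens once at creation, and consecutive frames reuse the cached bundle.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a recorded, driver-validated command bundle.
struct Bundle {
    std::vector<int> draw_calls; // draw-call IDs recorded into the bundle
};

class BundleCache {
public:
    // Returns the bundle for an octree leaf, building (and "validating")
    // it only on first use; later frames reuse it at much smaller cost.
    const Bundle& get(int leaf_id, const std::vector<int>& draw_calls) {
        auto it = cache_.find(leaf_id);
        if (it == cache_.end()) {
            ++validations_; // the expensive work happens only here
            it = cache_.emplace(leaf_id, Bundle{draw_calls}).first;
        }
        return it->second;
    }
    int validations() const { return validations_; }
private:
    std::unordered_map<int, Bundle> cache_;
    int validations_ = 0;
};
```

    Over consecutive frames the same visible leaves hit the cache, so the validation count tracks the number of distinct leaves seen, not the number of draw submissions.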
     
    Heinrich04 likes this.
  11. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Thanks.
    Although the point was more focused on Vulkan and Doom; it is pretty clear that the extensions used in Doom's OpenGL path are currently better than those in its Vulkan path (or were, anyway, back then), but I do appreciate this is muddied a bit by driver/API-optimised performance.
    Cheers
     
    #91 CSI PC, Sep 6, 2016
    Last edited: Sep 6, 2016
  12. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,889
    Likes Received:
    4,536
    They do mention later in the article that there is currently ongoing work with Vulkan ...
     
    DavidGraham likes this.
  13. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yep, agreed, 'ongoing' and 'working on' being the operative words, as more recent tests have shown performance improvements in Vulkan for Doom with Nvidia cards. But it's worth remembering how mature Nvidia's development is with OpenGL and extension/driver optimisation.
    In that context there is still more Nvidia can do with Vulkan (and associated extensions) and should continue to see performance gains IMO.

    Cheers
     
  14. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Some gains, but I suspect they already got the big ones with cross-lane functionality. That would be a game changer for optimizing many compute-based post-process effects and for saving bandwidth, as there are no compression technologies for compute output. Doom performance looks pretty damn good from both IHVs now, which seems a testament to the power of Vulkan.

    It wouldn't be surprising if Nvidia simply didn't have intrinsics publicly exposed for Vulkan when Doom first released. Seeing as OpenGL worked well for Nvidia and not for AMD, releasing when they did makes sense. If they had waited on Nvidia, they could simply have been stonewalled for a competitive advantage against AMD. Ultimately their choice didn't hurt anyone beyond the appearance that Nvidia may not be as good with Vulkan. Maybe there was some shader replacement involved because of a lack of optimization by id, but it seems more likely it was just a matter of finalizing some driver work.

    It would be interesting if a site ran some Vulkan benchmarks on other titles to see if there are any differences. Talos perhaps, which I think is starting to get more of its optimizations completed.
     
  15. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah, agreed,
    which is what I was suggesting some weeks ago in this thread.
    Good point on Talos; it would be great if that were benchmarked again, given the recent updates to the game and to Nvidia's drivers.

    Cheers
     