DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Sorry to ask but did you restart your computer after the change?
Seems we have a divergence from the results of PadyEos, who also has a 980 Ti
    Cheers
     

  2. Interesting.

In your case, there doesn't seem to be any relevant CPU activity... but there's no "Async" happening at all either.
Your "Async" results are almost a carbon copy of the "pure compute" results if you just add the 16.15ms render time on top of them:

[image]



EDIT: results with TDR disabled are practically the same. As I said, disabling TDR simply prevents the graphics driver from interrupting the program due to a timeout in the third test:

[image]
     
    #422 Deleted member 13524, Sep 3, 2015
    Last edited by a moderator: Sep 3, 2015
  3. Rurouni

    Veteran

    Joined:
    Sep 30, 2008
    Messages:
    1,101
    Likes Received:
    432
    What Nvidia driver version did he use?
     
  4. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    Yes, but to be perfectly honest I didn't completely disable TDR. I just increased the delay to 10 seconds so the test wouldn't crash.
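
    For reference, the TDR knobs being discussed live in the registry under the documented GraphicsDrivers key; a sketch of the change described above (TdrDelay is the GPU timeout in seconds, TdrLevel 0 turns detection off entirely; run elevated, and a reboot is needed afterwards):

    ```bat
    :: Sketch of the TDR tweak described above (Windows, elevated prompt,
    :: reboot required). TdrDelay raises the GPU timeout in seconds;
    :: setting TdrLevel to 0 disables TDR detection entirely.
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 10 /f
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 0 /f
    ```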
     
  5. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    355.82
     
So with NVIDIA Maxwell 2, specifically the GM200-based GeForce 980 Ti, we have two different results so far:

    1 - Async exists somehow but it causes a huge load on the CPU -> PadyEos' result using 355.82

    2 - Async doesn't exist at all -> Devnant's result using 355.82 too...


    I wonder what's causing this "CPU-assisted Async" to kick-in.
If it were pure CPU performance, then Devnant's machine should be the one to activate Async, since his CPU has twice the cores/threads.
    Perhaps CPU frequency?
     
  7. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
Too early to determine that's even the case. Going down the CPU-frequency road is probably not it.
     
  8. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
OK, that possibly explains the divergence :)
Can you run the test with TDR actually disabled, like PadyEos did?
    Cheers
     
  9. trandoanhung1991

    Joined:
    Sep 2, 2015
    Messages:
    6
    Likes Received:
    6
970 on 355.82, OCed. Didn't get a TDR. Also attached is the Afterburner log, if anyone is interested.
     

    Attached Files:

Well, I can't spend the whole afternoon drawing graphs in Excel, but looking at your results at 5 different points (0, 128, 256, 384, 512), async time = compute time + render time, so it doesn't look like your test is doing any async compute either.
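
    The arithmetic being applied to those data points can be written down as a quick check; the helper name and the example numbers below are illustrative placeholders, not anyone's measured results:

    ```python
    # Quick sanity check for the three timings the benchmark reports:
    # a pure render pass, a pure compute pass, and the combined
    # ("async") graphics+compute pass. If the combined time equals the
    # sum of the two individual times, the workloads ran back to back;
    # any time saved below that sum indicates real overlap.
    # All numbers here are illustrative placeholders.

    def overlap_saving_ms(render_ms, compute_ms, combined_ms):
        """Milliseconds saved versus purely serial execution."""
        return (render_ms + compute_ms) - combined_ms

    # Serial case: combined == render + compute -> no async benefit.
    assert overlap_saving_ms(16.0, 10.0, 26.0) == 0.0

    # Overlapping case: combined < render + compute -> async benefit.
    assert overlap_saving_ms(16.0, 10.0, 20.0) == 6.0
    ```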
     
  11. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    Just did, but that didn't change a thing. Same results. Seems like I don't get any async benefits, maybe because we are using different CPUs? I just don't know.
     
  12. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Thanks Devnant,
    at least you were both using the same drivers.
    Cheers
     
  13. ka_rf

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    12
    Likes Received:
    19
More Maxwell 2 results. It seems the 980 Ti that showed some async compute happening with weird CPU spikes is the odd one out.

[image]
[image]
     
  14. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Are you basing that on PadyEos' results from page 12?
He did the change and had a different behaviour afterwards.
Regarding Async behaviour, why do threads around 200 to 260 consistently show Graphics+compute running faster than running them separately, when looking at his results after the change?
    Cheers
     
  15. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    ToTTenTranz,
nothing is truly consistent so far, as one has to remove certain results to reach a conclusion one way or the other (yes/no support); and unfortunately, as I mentioned, this is compounded by none of it being done on a controlled-environment PC.
    Cheers
     
    Razor1 and pharma like this.
Again: the TDR setting won't change any outcome; it will just prevent the third test (which prevents Async from kicking in, so it wouldn't tell us whether Async is working either way) from crashing.
    Here's PadyEos' latest result anyway.

[image]


    CSI PC, no one is removing results. In fact I've been asking people to bring more results into the equation.
As far as consistency goes, except for PadyEos' results (which we're still trying to find a way to replicate), almost all Maxwell 2 results are pretty damn consistent: Async Compute isn't working in this test at all.
     
  17. comprodigy

    Joined:
    Sep 1, 2015
    Messages:
    3
    Likes Received:
    0
Question: wouldn't Nvidia technically be performing async computation/shaders exactly as the MS documentation (and even AMD's own slides) describe, given the results we are seeing? I could have missed something more solid on the topic in the API reference, but this is what I was able to find.

    Command queue overview

    Direct3D 12 command queues replace hidden runtime and driver synchronization of immediate mode work submission with APIs for explicitly managing concurrency, parallelism and synchronization. Command queues provide the following improvements for developers:

    • Allows developers to avoid accidental inefficiencies caused by unexpected synchronization.
    • Allows developers to introduce synchronization at a higher level where the required synchronization can be determined more efficiently and accurately. This means the runtime and the graphics driver will spend less time reactively engineering parallelism.
    • Makes expensive operations more explicit.
    These improvements enable or enhance the following scenarios:

    • Increased parallelism - Applications can use deeper queues for background workloads, such as video decoding, when they have separate queues for foreground work.
    • Asynchronous and low-priority GPU work - The command queue model enables concurrent execution of low-priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
    • High-priority compute work - This design enables scenarios that require interrupting 3D rendering to do a small amount of high-priority compute work so that the result can be obtained early for additional processing on the CPU.
That's the command queue; from the results we are seeing, Nvidia runs 31 in parallel, so it seems Nvidia is meeting that requirement (or at least that description).

    And then there is the command list

    Executing command Lists
    After you have recorded a command list and either retrieved the default command queue or created a new one, you execute command lists by calling ID3D12CommandQueue::ExecuteCommandLists.

    Applications can submit command lists to any command queue from multiple threads. The runtime will perform the work of serializing these requests in the order of submission.

    The runtime will validate the submitted command list and will drop the call to ExecuteCommandLists if any of the restrictions are violated. Calls will be dropped for the following reasons:

Nvidia is executing command lists serially as well (graphics + compute) per command queue. Unless I'm missing something totally obvious, it seems like Nvidia is handling this exactly how they should. I haven't seen anywhere where it says that command lists on the same queue are supposed to be executed asynchronously. Please, if I am missing something here, let me know.
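
    As a toy illustration of the submission rules quoted above (a hypothetical Python model, not D3D12 code; queue/list names and durations are made up): command lists on the same queue execute in submission order, while separate queues have independent timelines and are therefore allowed to overlap.

    ```python
    # Toy model of D3D12-style submission: each queue serializes its own
    # command lists in submission order; different queues have independent
    # timelines, so their work may overlap. Names/durations are invented.

    def schedule(queues):
        """Return {list_name: (start_ms, end_ms)} under the model above."""
        timeline = {}
        for lists in queues.values():
            t = 0.0
            for name, duration_ms in lists:
                timeline[name] = (t, t + duration_ms)
                t += duration_ms
        return timeline

    queues = {
        "direct":  [("render_0", 16.0), ("render_1", 16.0)],   # graphics queue
        "compute": [("compute_0", 5.0), ("compute_1", 5.0)],   # compute queue
    }
    times = schedule(queues)

    # Same queue: strictly serial (render_1 starts when render_0 ends).
    assert times["render_1"][0] == times["render_0"][1]

    # Separate queues: both start at t=0, so overlap is permitted.
    assert times["compute_0"][0] == times["render_0"][0] == 0.0
    ```

    Whether the hardware actually overlaps the two queues in time is exactly what the benchmark is trying to measure; the spec text quoted above only requires serialization within a queue.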
     
  18. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    ToTTenTranz,
look at around thread 200 to, say, 265 (it happens in other places too).
You agree it shows Compute+Render time is consistently less than the individual runs of Compute and Render?
So how is this possible serially?
As you said earlier, if it is not async-compute capable, Compute+Render would need to take around the same time as both of those combined; but in places it is pretty clear the improvements are more than marginal.
So this is what I mean about having to ignore certain variables to reach the conclusions that are being stated.
One cannot say for sure what is happening, apart from that something is not right; maybe it is the driver/scheduler or indeed something more architectural, but nothing can be said conclusively because it needs further testing. In fact, I see no one has even raised the question of what changed between Star Swarm and Ashes, both from the Nitrous engine perspective and from NVIDIA's drivers.
The caveat there being: are the results from Star Swarm comparable to Ashes on low settings? So much has not been clarified and tested, including running all of these on a clean, controlled PC environment with no dual drivers/"boost" software/etc.
And yes, I agree something is strange when looking at the various results; I do think a couple of other measurement tests posted showed that sections of threads had consistent improvements for Compute+Render.
But the performance behaviour is far from consistent in terms of what the trend should be for NVIDIA either supporting Async or not.

    Cheers
     
If you're referring to PadyEos' results, those are the same ones that show about 50% usage of an 8-thread CPU when the "Async" test starts.
     
    digitalwanderer likes this.
  20. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
Again, trying to draw a parallel from CPU usage to async happening is hard to do unless you know exactly what the drivers, GPU, and CPU are doing at that point; the purpose of the CPU usage has not been quantified to any degree.

If we don't know and still draw that parallel, it might be wrong, and that is no good because it changes the way we think about the situation; in essence it prejudices us and in turn forces us to make incorrect assumptions.
     
    #440 Razor1, Sep 3, 2015
    Last edited: Sep 3, 2015
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.