DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. Kobata

    Joined:
    May 18, 2015
    Messages:
    3
    Likes Received:
    0
    Yes, they (except the 7950) are.

    I put my general thoughts in an earlier post, but if you look at the numbers for say, the Fury X, you'd see that single commandlist starts as equal to graphics-only+compute-only, and then increases in increments of compute-only every 65 invocations added (1, 66, 131, ...), while the 'async' version starts equal to max(graphics-only, compute-only), then increments every 30 invocations (1, 31, 61, ...). The 7950 data on the other hand keeps the 65-invocation scaling in both modes, so async ends up always being better by a constant factor.
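    The step-wise scaling described above can be sketched as a toy timing model. This is purely illustrative (not MDolenc's actual test code), and the 25 ms graphics / 10 ms compute figures below are made-up example values, not measurements from the thread:

```python
# Toy model of the Fury X scaling described above: the single-commandlist
# run starts at graphics+compute and grows by one compute-only slice every
# 65 invocations (1, 66, 131, ...), while the async run starts at
# max(graphics, compute) and steps every 30 invocations (1, 31, 61, ...).

def single_list_time(n, t_gfx, t_cmp, step=65):
    """Modeled time for n compute invocations recorded in one command list."""
    return t_gfx + t_cmp * (1 + (n - 1) // step)

def async_time(n, t_gfx, t_cmp, step=30):
    """Modeled time for n compute invocations submitted on a separate queue."""
    return max(t_gfx, t_cmp) + t_cmp * ((n - 1) // step)

# Hypothetical per-pass times: 25 ms of graphics, 10 ms of compute.
print(single_list_time(1, 25.0, 10.0))   # 35.0
print(single_list_time(66, 25.0, 10.0))  # 45.0
print(async_time(1, 25.0, 10.0))         # 25.0
```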
     
  2. Nobu

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    21
    Likes Received:
    1
    Could the AoS bench/game be benefitting from this hardware (or, I guess, firmware/driver) optimization somehow? Seems like for a naive implementation of async that would give a fair advantage in the general case. Of course, I'm sort of talking out of my ass right now--I have next to zero understanding of either nvidia's or amd's async architecture.
     
  3. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    With two GPUs in SLI, each GPU can execute its own queue independently of the other, and since there are no dependencies between the two, that's an ideal outcome anyway.
     
  4. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1
    Interesting. So what's your interpretation of the results so far?
     
  5. Luore

    Joined:
    Sep 1, 2015
    Messages:
    1
    Likes Received:
    0
    It's not the Titan SLI; the selected (highlighted) result is the 290X.
     
  6. Nub

    Nub
    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    10
    Likes Received:
    18
    This is, indeed, correct. It was my mistake for not making the selected item stand out more. The position of the tooltip is also misleading if you're not actually using the tool; the data is shown for whichever bar you have your mouse cursor on.
    I usually make tools for myself, so I tend to neglect how misleading they can be for other users.

    I've fixed that, and I've also added a label at the very top of the page showing what value the y-axis represents.
    Again, here's the link, for the convenience of not having to open previous pages to look for it.
     
  7. Kobata

    Joined:
    May 18, 2015
    Messages:
    3
    Likes Received:
    0
    I doubt it has an amazing effect on most practical D3D12 programs, if only because nVidia's driver/hw stack isn't doing anything special, so anything relying on that optimization would run horribly there. I would guess that it's probably a 'common' low-level optimization that has a better effect on D3D11/OpenGL/etc.
     
  8. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1
    Again, what is the expected result?

    ie.
    25ms Graphics
    25ms Compute

    With Async Compute enabled, the combined Graphics + Compute task should be completed in... 25ms?

    With Async Compute disabled, the combined Graphics + Compute task should be completed in... 50ms?
     
  9. Godfavor

    Joined:
    Aug 31, 2015
    Messages:
    2
    Likes Received:
    0
    8970M (GCN 1.0, probably)
     

    Attached Files:

  10. SHLee

    Joined:
    Aug 11, 2015
    Messages:
    1
    Likes Received:
    0
    Hello, this is my first post in this forum.
    The attached file is a log of the Async Compute test written by MDolenc, run on my Fury X with Catalyst 15.8b.

    Thanks to you people for making a great investigation of this subject.

    (My English is poor, so I don't post in English-language forums such as this one, but I really wanted to say thank you many times.)
     

    Attached Files:

  11. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1
    This is the most logical interpretation. An async task would not exhibit a time to completion that's near the sum of the two tasks run separately. That result indicates no async mode is functional and it's defaulting to serial operation.

    Since I'm good at car analogies, let's do that.

    2 cars are on the road; let's call them Car 1 (Compute) and Car 2 (Graphics). Both cars are trying to go from A -> B.

    The time it takes for Car 1 to travel the journey is 1 hour. The time it takes for Car 2 to travel the journey is 2 hours.

    The question is, how long does it take for both Cars to reach destination B?

    1. Both Cars can travel on the road together, simultaneously, starting at the same time: 2 hours.
    2. Only ONE Car can be on the road at once, so Car 1 goes first (order doesn't matter), finishes, then Car 2 starts. Thus, both Cars reach their destination in: 3 hours.

    Minor variations aside, that should be the expected behavior, correct? #1 would therefore be Async Mode, and #2 is not.
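    The analogy reduces to a one-line model: with working async the two workloads overlap, so the combined time is the max of the two; without it they serialize, so the time is the sum. A minimal sketch using the car-analogy numbers:

```python
def combined_time(t_graphics, t_compute, async_works):
    # Perfect overlap: total = max of the two. Full serialization: total = sum.
    return max(t_graphics, t_compute) if async_works else t_graphics + t_compute

# Car 1 (compute) takes 1 hour, Car 2 (graphics) takes 2 hours.
print(combined_time(2.0, 1.0, True))   # 2.0 -> async mode (#1)
print(combined_time(2.0, 1.0, False))  # 3.0 -> serial mode (#2)
```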
     
  12. InsidiousBoot

    Joined:
    Sep 1, 2015
    Messages:
    1
    Likes Received:
    0
    First post here, I was curious on this matter so I ran both on my spare and main card.

    Just sharing results if it might prove useful.

    AsyncCompute
    written by MDolenc

    7950 Catalyst 15.8b

    Graphics only: 57.37ms (29.24G pixels/s)
    Graphics + compute: 238.70ms (7.03G pixels/s)
    Graphics, compute single commandlist: 295.77ms (5.67G pixels/s)

    980 Forceware 355.82

    Graphics only: 23.23ms (72.21G pixels/s)
    Graphics + compute: 103.58ms (16.20G pixels/s)
    Graphics, compute single commandlist: 2433.35ms (0.69G pixels/s)
     

    Attached Files:

  13. Khipu

    Joined:
    Sep 1, 2015
    Messages:
    2
    Likes Received:
    0
    Can we try running bandwidth heavy compute in parallel with computation heavy compute?
     
  14. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1

    Thanks. We need the compute-only times to analyze. From your data:

    1st Kernel

    980:

    Compute only: 5.33ms
    Graphics only: 23.23ms
    Graphics + compute: 27.68ms

    7950:

    Compute only: 29.91ms
    Graphics only: 57.37ms
    Graphics + compute: 57.41ms

    64th Kernel:

    980:

    Compute only: 14.33ms
    Graphics only: 23.23ms
    Graphics + compute: 37.32ms

    7950:

    Compute only: 29.91ms
    Graphics only: 57.37ms
    Graphics + compute: 57.35ms


    128th Kernel

    980:

    Compute only: 28.70ms
    Graphics only: 23.23ms
    Graphics + compute: 46.62ms (The 129th Kernel jumps to 51.58ms)


    7950:

    Compute only: 59.72ms
    Graphics only: 57.37ms
    Graphics + compute: 60.01ms
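    One way to read numbers like these is as an overlap ratio: 1.0 means the combined run hid the cheaper workload entirely (the ideal async case), 0.0 means the two ran back-to-back. A rough sketch, using the 1st-kernel figures quoted above:

```python
def overlap_ratio(t_gfx, t_cmp, t_both):
    """0.0 = fully serial (t_both == sum), 1.0 = fully overlapped (t_both == max)."""
    serial = t_gfx + t_cmp
    ideal = max(t_gfx, t_cmp)
    return (serial - t_both) / (serial - ideal)

# 1st-kernel numbers from the post above:
print(round(overlap_ratio(23.23, 5.33, 27.68), 2))   # 0.17 -> 980, mostly serial
print(round(overlap_ratio(57.37, 29.91, 57.41), 2))  # 1.0  -> 7950, fully overlapped
```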
     

  15. That would be the case for the results we're getting with MDolenc's tests, yes. (BTW, that's a "highway lanes" analogy, not a cars analogy ;) )

    Basically, if (graphics+compute) time = (graphics time) + (compute time), then at least with this code the hardware isn't running Async Compute.
    And that's what we're seeing with both Kepler+Maxwell 1 (which do not support Async Compute by nVidia's own spec) and Maxwell 2.

    As far as I can see, there are 3 very odd things with the results so far:

    1 - Maxwell 2 isn't doing Async Compute in this test. Pretty much all results are showing that.
    Razor1 pointed to someone with two Titan Xs seemingly able to do async, but it seems the driver is just cleverly sending the render to one card and the compute to the other (which for PhysX is actually something you could toggle in the driver since G80, so the capability has been there for many years). Of course, if you're using two Maxwell cards in SLI in the typical Alternate Frame Rendering mode, this "feature" will be useless because both cards are rendering. The same thing will happen in a VR implementation where each card renders one eye.

    2 - Forcing "no Async" in the test (single command queue) makes nVidia chips serialize everything. This means the last test, with rendering + 512 kernels, will take Render-time + 512x(Compute-time of 1 kernel). That's why the test times end up ballooning, which eventually crashes the display driver.


    3 - Forcing "no Async" makes GCN 1.1 chips do some very weird stuff (perhaps the driver is recognizing a pattern and skipping some calculations, as suggested before?). GCN 1.0 chips like Tahiti in the 7950 behave like they "should": (compute[n] + render) time = compute[n] time + render time.
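    Point 2 predicts a linear blow-up: if a single command queue forces full serialization, the total time is the render time plus kernel-count times per-kernel compute time. A sketch with hypothetical numbers (23 ms render, 5 ms per kernel; these are not the measured figures from the thread):

```python
def serialized_total(t_render, t_kernel, n_kernels):
    # Fully serial single-commandlist model: nothing overlaps, so the
    # compute cost scales linearly with the number of kernels.
    return t_render + n_kernels * t_kernel

print(serialized_total(23.0, 5.0, 512))  # 2583.0 ms -> why the times balloon
```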




    That's not what your performance log shows... That's the time for 512 kernels in pure compute mode.
     
    RedditUserB likes this.
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    Regardless of the actual times themselves, NVIDIA is failing in that it can't hide anything with async compute, while AMD can hide most, and often even all, of the graphics latency with async compute.
    From one of the earlier results:
    980 Ti: Compute ~10ms, Graphics ~18ms, Compute + Graphics ~28ms
    Fury X (or 390X, not sure since I'm copy-pasting): Compute ~50ms, Graphics ~25ms, Compute + Graphics ~50-60ms
    See the difference?
     
  17. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    It doesn't matter what the performance difference is; if there is even a small amount of async code being executed, it is functional. It's not about the end performance of the different IHVs, it's about whether it is capable or not, and it is capable. The serial path should always take the same time or longer than doing it asynchronously, if the variables are the same and the processor is being tasked enough.
     
    pharma likes this.
  18. Serg

    Joined:
    Sep 2, 2015
    Messages:
    1
    Likes Received:
    0
    Radeon HD 7850 :)

     
  19. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1
    It's not functional. I looked through the results graphed by Nub for Kepler and Maxwell across all the kernels; there are specific instances of async mode being slightly faster or slower, but it evens out to almost the same as the sum of compute and graphics run individually.

    If anyone is trying to compare the times (ms) for compute/graphics etc. and draw a performance comparison between GCN and Maxwell or whatever, please note that there are already plenty of DX11 games that use compute. If GCN really had a 50ms latency for compute (as indicated by the program), games could not exceed 20fps. So this particular program is not to be used as a performance comparison. But I believe this point was raised earlier by others.

    Now, the conclusion from this program is either that Async Compute is not functional on Maxwell 2 (no simultaneous execution of graphics + compute, which would give a big performance gain), or that the program is wrong and Maxwell 2 can indeed do parallel graphics + compute, i.e. functional AC.
     
  20. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Why don't you explain, then, why the difference is there on all Maxwell 2 cards in favor of async compute vs serial? Why is it always faster by a few microseconds? I have looked at the majority of the reports too. If you've gone through the data, compile it and link it as a spreadsheet.

    I don't think this program was made for performance testing, so take that out of the equation right now.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.