DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Is it really clear cut that Maxwell 2 (just focusing on that for now) does not support Async?
To me it is muddier than that conclusion because of the behaviour at certain points highlighted by kar_rf; rather than focusing only on where it fails, look at where it works (the beginning, and around 240-330 threads).
So is it possibly a driver/scheduler issue?
    Cheers

PS. Sorry, the previous post should have said "evolved since then for Ashes?" and not "involved".
     
  2. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
A single-lane compute workload needs the scheduler to spawn a single wave on all GPUs. On AMD the wave is 64 wide, meaning that the architecture is designed to run/manage fewer waves (as each does more work). If you spawn single-lane work, you will more likely end up under-utilizing GCN compared to other GPUs. Work group sizes can also expose (academic) bottlenecks, since resource (GPR, LDS) acquisition/release is done at work group granularity.
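To make the wave-width point concrete, here is a toy Python sketch (my own illustration; only the 64- and 32-wide wave widths come from the discussion, everything else is made up): a wave is allocated whole, so a single-lane dispatch wastes 63 of 64 lanes on GCN.

```python
import math

def lane_utilization(lanes: int, wave_width: int) -> float:
    """Toy model: fraction of allocated lanes doing real work when
    `lanes` work items run on `wave_width`-wide waves. A wave is
    spawned whole, so unused lanes in the last wave are wasted."""
    waves = math.ceil(lanes / wave_width)   # waves the scheduler must spawn
    return lanes / (waves * wave_width)

# A single-lane kernel wastes 63/64 lanes on a 64-wide wave,
# but "only" 31/32 on a 32-wide wave.
print(lane_utilization(1, 64))    # 0.015625
print(lane_utilization(1, 32))    # 0.03125
print(lane_utilization(64, 64))   # 1.0
```

This is only about lane occupancy within a wave; real GCN utilization also depends on GPR/LDS allocation per work group, which the model ignores.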

I am just saying that using a workload that is not realistic can cause various bottlenecks that will not matter in a realistic scenario. Some people seem to be drawing all kinds of conclusions based on the results of this thread, even though the results might not mean anything for most applications (especially games, the main purpose of the DX12 API).
    This is a downside of the GCN architecture, but it almost never matters in real software.
    PC DirectX 12 abstracts the barriers quite a bit, so we don't know whether we are timing end-of-pipe or something else. This matters quite a bit, since the GPUs have super long pipelines.
A single CU can run multiple kernels, and since the test is using async compute, we can assume that multiple queues produce work for the same CU. This means that it can interleave SALU instructions from one kernel with VALU instructions from the other, letting both progress at up to twice the rate.
Of course there are kernels that are fastest with thread blocks of 64 or 1024 threads. But these are fast because of some other bottleneck; you are most likely trading some GPU utilization for other improvements (like reduced memory traffic). Also, if the GCN OpenCL compiler is very smart, it could compile thread groups of 64 threads differently (the scalar unit could be exploited more).
It measures many things, not just async compute. This makes the results hard to interpret, and people are drawing wrong conclusions.
All modern GPUs are capable of running multiple graphics tasks and multiple compute tasks in parallel, when these tasks originate from the same queue. This has been true for a long time already. However, the DirectX 11 API and DX11 drivers are quite defensive in their resource tracking, meaning that concurrent execution of compute doesn't happen often. Concurrent execution of multiple graphics draw calls, however, happens regularly (unless the graphics shaders use UAVs). How many graphics draws are executed simultaneously depends on fixed function resource limitations (only a limited number of global state combinations can execute concurrently).
Yes, a single command list by definition is not async. Shaders can still run concurrently even from a single command list (but not asynchronously). DirectX 12 exposes resource barriers to the programmer, giving the programmer more control over concurrent execution from a single command queue. Manual resource barriers allow greater and more controlled parallelism from a single queue, and this is also supported by more GPU vendors. If you don't strictly need async, this is a good way to maximize GPU utilization.
    Exactly!

This is not a performance (maximum throughput) benchmark. However, it seems that less technically inclined people believe it is, because this thread is called "DX12 performance thread". This benchmark doesn't in any way imply that "asynchronous compute is broken in Maxwell 2", or that "Fiji (Fury X) is super slow compared to NVIDIA in DX12 compute". This benchmark is not directly relevant for DirectX 12 games. As some wise guy said at SIGGRAPH: graphics rendering is the killer app for compute shaders. DX12 async compute will be mainly used by graphics rendering, and for this use case the CPU->GPU->CPU latency has zero relevance. All that matters is the total throughput with realistic shaders. Like hyperthreading, async compute throughput gains are highly dependent on the shaders you use. Test shaders that are not ALU / TMU / BW bound are not a good way to measure performance (yes, I know, this is not even supposed to be a performance benchmark, but it seems that some people think it is).
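As a toy illustration of the hyperthreading analogy (my own sketch with invented numbers, not a model of any real GPU): async gains depend on whether the overlapped workloads fight over the same bottleneck.

```python
def frame_time(gfx_ms, gfx_alu, cmp_ms, cmp_alu, async_enabled):
    """Toy model: a graphics pass and a compute pass, each described by
    its duration and the fraction of a shared resource (say, ALU) it
    saturates. If combined demand fits, the passes overlap fully;
    otherwise the overflow serializes."""
    if not async_enabled:
        return gfx_ms + cmp_ms                      # back-to-back on one queue
    if gfx_alu + cmp_alu <= 1.0:                    # bottlenecks don't collide
        return max(gfx_ms, cmp_ms)                  # full overlap
    # Partial overlap: the shorter pass stretches the frame by its overflow.
    return max(gfx_ms, cmp_ms) + min(gfx_ms, cmp_ms) * (gfx_alu + cmp_alu - 1.0)

# Raster-bound shadow pass (low ALU) overlapped with ALU-heavy lighting:
print(frame_time(4.0, 0.2, 3.0, 0.6, async_enabled=False))  # 7.0
print(frame_time(4.0, 0.2, 3.0, 0.6, async_enabled=True))   # 4.0
# Two ALU-heavy passes: little left to gain from overlapping.
print(frame_time(4.0, 0.9, 3.0, 0.9, async_enabled=True))
```

The point mirrors sebbbi's: shaders that don't saturate ALU/TMU/BW leave room for big async wins, while shaders that already saturate a unit gain little, so a synthetic test says nothing about real-game throughput.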

This benchmark has relevance for tightly interleaved mixed CPU<->GPU workloads. However, it is important to realize that the current benchmark does not just measure async compute; it measures the whole GPU pipeline latency. GPUs are good at hiding this latency internally, but are not designed to hide it from external observers (such as the CPU).
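A rough sketch of why an external observer mostly measures pipeline latency (again my own toy model, with invented constants):

```python
def observed_ms(n_kernels, pipeline_latency_ms=0.5, kernels_per_ms=100):
    """Toy model: wall time the CPU sees for n kernels is a fixed
    pipeline round-trip latency plus the kernels' execution time."""
    return pipeline_latency_ms + n_kernels / kernels_per_ms

for n in (1, 10, 100, 1000):
    total = observed_ms(n)
    latency_share = 0.5 / total   # fraction of the measurement that is pure latency
    print(f"{n:5d} kernels: {total:7.2f} ms, {latency_share:.0%} fixed latency")
```

With small kernel counts the constant term dominates the measurement, so differences between GPUs reflect pipeline round-trip latency far more than async compute throughput.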
     
    #402 sebbbi, Sep 3, 2015
    Last edited: Sep 3, 2015
    Kej, drSeehas, pharma and 8 others like this.
  3. RedditUserB

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    24
    Likes Received:
    1
    @sebbbi
    Maybe you should write a simple app to test Async Compute functionality if you don't believe this current one is valid? :)
     
  4. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    This benchmark is valid for testing async compute latency. This is important for some GPGPU applications.

    It is important to notice that this benchmark doesn't even try to measure async compute performance (GPU throughput). This is the most important thing for improved graphics rendering performance in games. As this thread is called "DX12 performance thread", I just wanted to point this out.
     
  5. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,894
    Likes Received:
    4,549
    #405 pharma, Sep 3, 2015
    Last edited: Sep 3, 2015
  6. pMax

    Regular

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
I apologize, but... didn't we have people using async compute on PS4 showing massive benefits from it? Your game won't be eternally ALU/BW bound, and all the extra 'free time' you get can be used by async compute. So the fact that you can reduce rendering time by using async compute does have relevance and matters a lot, especially when XB/PS games are moved to the PC arena using DX12 features.
     
I only found one user, PadyEos, who ran the test with a 980 Ti while also logging his CPU usage.
    Here's kar_rf's plot of his results:



We see that somewhere near 256 kernels the "Async" results do seem to approach the Sync results, suggesting there could be some async compute happening after all.

However, following RedditUserB's suggestion, I compared the GPU usage to the CPU usage logs that PadyEos took, and here's what I found (crudely glued together in Paint):




By the time the Async test starts (which should be around the middle of the run), the CPU usage jumps towards ~50% on all cores and threads.
I imagine PadyEos is probably using a Core i7 of some kind. If a test as simple as this causes that kind of CPU usage, imagine the impact in a game. Imagine it on a Core i5 or a mobile CPU.

This seems to be exactly the issue the Oxide developer was mentioning: nVidia is trying to emulate the missing hardware async compute on the CPU, and that comes at the cost of a huge performance penalty. No wonder they had to put a vendor-specific code path in the game so that nVidia cards won't use async.



    This is still just one test and it could be wrong, though. Perhaps more people could re-run the test on a Maxwell 2 card and check their CPU usage when the Async Compute starts?
     
    Jackalito and RedditUserB like this.
  8. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,894
    Likes Received:
    4,549
Correct me if I'm wrong, but doesn't that CPU usage graph come from the run with TDR enabled, which caused Maxwell GPUs to crash? I think he tested later without TDR and was able to finish the test, but did not include the CPU usage graph.
     
  9. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
ToTTenTranz,
didn't one tester show that their NVIDIA 9xx card went from toggling GPU utilisation to a steady 100% after changing the TDR setting?
    Has anyone else looked at that with their GPU?
Another problem is that these are not controlled environments: many people are using third-party boosting/analysis software, sometimes several at once.
I know of a recent case that went viral on various forums claiming NVIDIA was downgrading the graphics in BF4, based on an amateur review comparison; in reality it came down to the user having issues when switching from AMD to NVIDIA. Not quite the same thing, but the point is that a separate entity affected the behaviour of the NVIDIA driver from an aliasing perspective.
    Cheers
     
The reason the Maxwell cards were crashing is that the "forced no-Async" mode was making them simply add the time needed for one more kernel on each iteration, leading to computation times in excess of 3 seconds. The driver was crashing due to a time-out, and disabling TDR simply removed that time-out.
The crash only happened during the last "no-Async" test. The results I'm showing refer to the first two tests.


    But as I said, we should try to get more results from other people with Maxwell 2 and CPU logging.
     
  11. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,894
    Likes Received:
    4,549
So what exactly do you intend to show? As sebbbi and others already mentioned, "This benchmark is valid for testing async compute latency." Below is a link to a response (@DmitryKo) you might find interesting:

    http://forums.guru3d.com/showpost.php?p=5152048&postcount=62
     
  12. madyasiwi

    Newcomer

    Joined:
    Oct 7, 2008
    Messages:
    194
    Likes Received:
    32
Um, no, those CPU usage jumps should only have started well past the beginning of the forced async sequences. Probably somewhere after the 300th.
     
  13. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    Hi, I've created an account just because ToTTenTranz asked for more Maxwell results with CPU usage.

Here's my perf log on a 980 Ti. I got a TDR crash on the single command list test, but it shouldn't matter.

Also, here is the overall CPU usage over the duration of the test on an i7 5960X:

    CPU1.png CPU2.png CPU3.png
     

    Attached Files:

    Deleted member 13524 and Razor1 like this.
  14. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
He isn't talking about that. With this program, end performance isn't what is being tabulated, and it can't be: it will differ between programs and between architectures. Think of it like this: if you are trying to measure, say, how much weight I'm going to lose by running 10 miles today, I can get a good idea of how many kJ I will use, but I can't say definitively how much fat I will burn.

Guys, let's get back on topic.
     
    #414 Razor1, Sep 3, 2015
    Last edited: Sep 3, 2015
    Lalaland and pharma like this.
  15. pMax

    Regular

    Joined:
    May 14, 2013
    Messages:
    327
    Likes Received:
    22
    Location:
    out of the games
I don't get the difference, sorry: why is comparing a core without HT to one that features it (with the added knowledge that the first logical core is almost never resource-bound in a given time frame) not a valid comparison for end performance? Because with async compute and a set time of reference (say 33 ms?), that is exactly what you get: how much workload you can fit inside the given time frame. It is as if you said that IPC for a single core vs. two logical cores doesn't matter for end performance: I'd agree across different nodes, but really, don't tell me it doesn't matter when comparing a GPU to a GPU.

What you are measuring there is your average IPC over a given time reference. Async compute allows you to do more in a given time, so you can achieve a lot of benefits once you start using it.
Those charts show exactly that: if your software starts pushing work into the compute queue and the 3D queue, you get more done in the same time frame.

Isn't that what you call 'performance'? For 'end performance', take two cards which output similar results without async compute, then factor it in and see: more done in the same time frame means more end performance. Simple as that.

OT: I wonder if this limit has anything to do with AMD's hardware scheduler versus NVIDIA's software one?
     
  16. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
We can't say anything on the performance side for this app; it's just not possible, because:

A) It's not a realistic game scenario. It's like my running example: you just can't do it, because there are many other factors to take into consideration that can affect the end result, such as the amount of glycogen in my muscles, how toned my muscles are, how much food I ate that day and when, and how much water I'm losing due to the ambient temperature.
B) Different games and different apps will use different instruction mixes and different thread counts.
C) Because of this, the schedulers will react differently. This is why the single command list test was put in: to see how the hardware would react to being forced sync, pretty much taking the driver and its scheduling out of the loop.

About the hardware scheduler vs. a software one: I think there is no evidence of that yet. The CPU tests don't seem to show it; the line is pretty much flat. Yes, we have another result that shows otherwise, and the plot would not look like what we have seen either, but we need more tests to see if it plays out.
     
    #416 Razor1, Sep 3, 2015
    Last edited: Sep 3, 2015
  17. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
That is true, but the guy also mentioned it changed the behaviour with TDR disabled:
Regarding the results, it seems those show a consistent async-type behaviour from around thread 200 onwards for a while; I'm not sure if the chart shown a few pages back is associated with this.

It may just be an inconsistent anomaly, but how many others with an NVIDIA Maxwell 2 card have done this and also looked at the % utilisation?
Can anyone test with the beta drivers that were associated with Star Swarm?
    Cheers
     
  18. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,894
    Likes Received:
    4,549
    I don't think those will run on Win 10.
     
  19. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
Here it is with TDR disabled, no crashes. 980 Ti, driver 355.82.
     

    Attached Files:

  20. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,535
    Likes Received:
    144
    Guys, I am disappointed that this needs to be pointed out, but apparently it needs to be pointed out: this place is NOT Reddit. So the whole "I flaming love IHV X so I AM GOING TO POST ANGRY THINGS AT IHV Y WHO IS THE DEVIL" thing is not acceptable. And before you read into it that this is either about ATI people or NVIDIA people, it's about both, as both are equally culpable. Please try to stay on topic, or this'll get locked.
     
    virpz, firstminion, smw and 10 others like this.