DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. Deadhand

    Joined:
    Sep 1, 2015
    Messages:
    8
    Likes Received:
    1
    Has anyone tried putting this through AMD's GPU PerfStudio? I'd give it a shot myself, but I'm currently on Windows 7.
     
  2. Nobu

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    21
    Likes Received:
    1
    Doesn't work for me: it sees it and tries to connect, but times out. I allowed it through the firewall, so I don't know why...
     
  3. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    329
    Likes Received:
    286
    Yes, GCN requires at least 4 wavefronts per CU to become fully utilized; up to 10 are possible, and each wavefront executes in lockstep.

    But what I'm actually wondering about is how they got scheduled in the first place. With only 2 single-queue ACEs, and the workload consisting of single threads only, I wouldn't have expected them to form a full wavefront of 64 threads at all.

    And then again:
    If it was capable of forming a full wavefront, why did it stop at a single one? It should at least have launched another one per CU, even if the shader program exceeded the register limit per CU (which I don't think is what happened). Given the specs and the fact that it obviously DID assemble full wavefronts (otherwise it wouldn't have been able to execute 64 threads in parallel), even Cape Verde shouldn't have topped out at less than 2560-6400 threads per batch - and not at precisely 64.
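
    For reference, a minimal sketch of the arithmetic behind that 2560-6400 range, using the per-CU wavefront figures quoted above (which pTmdfx corrects to per-SIMD limits below):

    Code:
    #include <cstdio>

    int main() {
        const int cus            = 10;  // Cape Verde compute units
        const int lanes_per_wave = 64;  // GCN wavefront width
        const int min_waves_cu   = 4;   // quoted minimum for full utilization
        const int max_waves_cu   = 10;  // quoted maximum (per CU, as stated above)

        printf("min threads in flight: %d\n", cus * min_waves_cu * lanes_per_wave); // 2560
        printf("max threads in flight: %d\n", cus * max_waves_cu * lanes_per_wave); // 6400
        return 0;
    }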
     
    Razor1 and drSeehas like this.
  4. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    246
    Likes Received:
    109
    FYI, it is in fact up to 32/40 wavefronts per CU, depending on the register allocation the kernel requires.

    Queues take kernel dispatches (each a single command) describing a multi-dimensional grid, submitted by the CPU or other agents. So generally the GPU hardware scheduler divides a large dispatch into workgroups, and then divides the workgroups into wavefronts of 64 lanes.

    Say you have an OpenCL filter working on a 1080p image: you submit a kernel dispatch with a dimension of 1920*1080 to the GPU queue, and then signal the ACE about the enqueue. The ACE will then take the packet and transform it into wavefronts internally; in this case that would be 32,400 wavefronts. Since the GPU obviously has no such concurrency available, the wavefront scheduler generates a new wavefront whenever a slot is available, or otherwise waits for one.
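
    To make that concrete, a minimal host-side sketch of such a dispatch (assuming an already-built OpenCL context, queue and kernel; the names here are illustrative, not from the thread):

    Code:
    #include <CL/cl.h>
    #include <cstdio>

    void dispatch_filter(cl_command_queue queue, cl_kernel kernel) {
        // One work-item per pixel of a 1080p image.
        size_t global[2] = {1920, 1080};

        // The hardware scheduler splits this grid into workgroups and then into
        // 64-lane wavefronts: 1920 * 1080 / 64 = 32,400 wavefronts on GCN.
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global,
                               NULL /* let the runtime pick the workgroup size */,
                               0, NULL, NULL);
        clFinish(queue);
        printf("wavefronts: %zu\n", (global[0] * global[1]) / 64);
    }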
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    I wasn't trying to suggest that 64 kernel launches are actually assembled into a single work-group to run as a single hardware thread (though I did theorise that this is possible).

    I think you guys misunderstand the nature of the ACEs. Each of them manages a queue. I would expect an ACE to be able to launch multiple kernels in parallel, which appears to be what we're seeing in each line where the timings in square brackets feature many results which are identical, e.g.:

    Code:
    64. 5.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39]
    65. 10.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37]
    66. 10.98ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37 10.78]
    
    There can be only 10 kernels per SIMD, or 40 per CU, so multiple CUs must be running the set of kernels simultaneously in order to deliver these timings.

    At 1GHz the theoretical time to run this kernel is 4.33ms (128 instructions per loop, at 4 cycles each = 512 cycles; + 16 cycles to branch = 528 cycles; * 8192 iterations of the unrolled loop = 4,325,376 cycles; at 1GHz = 4.33ms). The timings observed are slightly slower, e.g. 4.33ms on a 1025MHz HD7950 (theoretical: 4.22ms), or 4ms on a 1100MHz Fury X, which is 98.3% of the theoretical speed.
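
    The cycle arithmetic above, spelled out (a sketch; the 4-cycles-per-instruction figure is the GCN SIMD issue rate for a 64-lane wavefront):

    Code:
    #include <cstdio>

    // mhz * 1e6 cycles per second = mhz * 1e3 cycles per millisecond.
    static double ms_at(long long cycles, double mhz) {
        return cycles / (mhz * 1000.0);
    }

    int main() {
        const long long instr_per_loop   = 128;
        const long long cycles_per_instr = 4;     // GCN SIMD: one wavefront instruction per 4 cycles
        const long long branch_cycles    = 16;
        const long long iterations       = 8192;  // iterations of the unrolled loop

        const long long cycles = (instr_per_loop * cycles_per_instr + branch_cycles) * iterations;
        printf("total cycles: %lld\n", cycles);             // 4,325,376
        printf("1000 MHz: %.2f ms\n", ms_at(cycles, 1000)); // 4.33 (theoretical)
        printf("1025 MHz: %.2f ms\n", ms_at(cycles, 1025)); // 4.22 (HD7950 clock)
        printf("1100 MHz: %.2f ms\n", ms_at(cycles, 1100)); // 3.93 (Fury X clock)
        return 0;
    }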

    So there is some overhead in assembling and launching kernels, and it seems as if the overhead is shared by either approximately 64 (compute or async) or ~128 (single command list on newer GCN) hardware threads.

    I guess that launching the same number of hardware threads (e.g. 64) from a single kernel, in the normal style of enqueuing work to a GPU, would incur lower overhead.
     
    Razor1 and BRiT like this.
  6. pTmdfx

    Newcomer

    Joined:
    May 27, 2014
    Messages:
    246
    Likes Received:
    109
    Oops. I missed the earlier discussions. Sorry about that.

    IMO there is a maximum number of kernels that each compute pipeline scheduler can handle. On top of that, it seems each shader engine (a group of CUs) gets its own wavefront scheduler, while the ACEs can dispatch workgroups to any of them. That means there could be interlocks and queues in between that cause increasing latencies at high loads.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    Your results are like the other AMD version 6 GPUs (original GCN), with fairly erratic timings. I expect your card will behave like Benny-ua's card.

    I haven't done an upload, but Firefox allows me to press the Upload a File button when I'm in the "More Options" view for writing a reply.
     
  8. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    329
    Likes Received:
    286
    Thanks, got the per CU / per SIMD limits mixed up.

    If that were the case, we should see different levels of parallelism depending on the number of CUs per card, but we actually don't. We don't even see any difference between GCN 1.0 and 1.2 for a true async workload, and that's really odd.

    Also, an ACE shouldn't be able to schedule more than the top of each queue at a time. But perhaps it is capable of joining identical kernels into a single wavefront?

    Is it possible that, even on GCN 1.1 and 1.2 cards, only a SINGLE ACE and a SINGLE queue are being used so far? That would certainly explain why it looks as if only a single wavefront were active.

    That would mean this benchmark isn't even utilizing 1/64th of the scheduling capabilities of GCN 1.2....
     
  9. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    329
    Likes Received:
    286
    We need a version with more queues. 64 queues should be perfectly fine.

    I'm not sure what that would do to Maxwell v2 either. It is possible that the Nvidia driver already pushed each async task into a different hardware queue and achieved parallelism that way (with only a single thread per wavefront). But it's also possible that it did the same as GCN and actually dispatched a full wavefront from only a single queue, which would mean the platform is also underutilized by a factor of 32x.
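
    For what a "version with more queues" could look like, a hypothetical D3D12 sketch that creates several independent compute queues on one device (the queue count and helper name are assumptions, not the benchmark's actual code):

    Code:
    #include <d3d12.h>
    #include <wrl/client.h>
    #include <vector>

    using Microsoft::WRL::ComPtr;

    // Create `count` compute-only queues. How the driver maps these software
    // queues onto hardware queues (e.g. the ACEs on GCN) is exactly the open question.
    std::vector<ComPtr<ID3D12CommandQueue>> MakeComputeQueues(ID3D12Device* device,
                                                              unsigned count) {
        std::vector<ComPtr<ID3D12CommandQueue>> queues(count);
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        for (auto& q : queues)
            device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q));
        return queues;
    }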
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    I'm dubious that the intra-SMM issue/scheduling of hardware threads is relevant here, as I reckon that's handling work that has already been distributed to the SMM internally by the hardware.
     
  11. ivanp3000

    Joined:
    Aug 31, 2015
    Messages:
    1
    Likes Received:
    0
    I find this thread interesting, so I tested my HD 7970 (15.8 driver).
     

    Attached Files:

  12. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Well, we can throw preemption and context switching out the window; both are used together, and they actually increase latency rather than reduce it. So when AMD's Hallock mentioned context switching, what he alluded to is nV using this method, which isn't right. We can see that with GPUView and the data from the small program; it would be easy to spot. The latency doesn't have enough spikes for that, and it wouldn't be a step-like plot. It would be more erratic, almost like an EKG added to the step-like plot.
     
  13. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    329
    Likes Received:
    286
    Uhm, I think we saw that context switching actually IS involved, just not in the way we expected. The context switch didn't happen when switching between async compute and graphics shaders of the same program, as originally expected, but it did happen when switching between a pure compute context and a graphics context, as we could observe with the starving DWM process. It actually looked like the platform wasn't even capable of preemption in that case.

    We could also observe that Nvidia apparently isn't capable (yet?) of executing shaders in sequential order in any efficient manner, while AMD apparently could execute them in parallel to some extent while ensuring an ordered memory view - whether in hardware or in software, we don't know yet.

    There is still a lot we don't know, though. For instance, we don't know yet whether the Nvidia platform groups async shaders into full wavefronts, how software queues relate to hardware queues, how deep the queues can be in hardware, whether any grouping into wavefronts is performed in hardware or in software, or why there is a limit to grouping.

    The only thing we know for sure so far is that on GCN 1.0, shaders are definitely being grouped, both in "async" and "sequential" execution mode.


    The next step would be to test again with multiple queues, to see for sure whether the async shaders were actually being put into the same or different hardware queues.

    My guess is that so far it was only a single hardware queue for both vendors, and that we actually have a queue depth of 32 tasks for Nvidia and up to ~128 for GCN 1.2. I'm still unsure why only a single wavefront was dispatched on GCN 1.2 in actual "async" mode.

    If my assumptions are correct, this still means that GCN is a lot more powerful: where Nvidia could schedule at most 992 (31*32) "draw calls" (well, "async compute tasks" actually), AMD would go up to 8192 (64*128) draw calls being executed in parallel.
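
    In numbers (a sketch; these queue counts and depths are the assumptions above, not confirmed hardware limits):

    Code:
    #include <cstdio>

    int main() {
        const int nv_queues  = 31, nv_depth  = 32;   // guess: 31 usable queues, 32 deep
        const int amd_queues = 64, amd_depth = 128;  // guess: 64 queues, ~128 deep (GCN 1.2)

        printf("Nvidia max in-flight tasks: %d\n", nv_queues * nv_depth);    // 992
        printf("AMD max in-flight tasks:    %d\n", amd_queues * amd_depth);  // 8192
        return 0;
    }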
     
  14. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    16,814
    Likes Received:
    1,427
    Location:
    Winfield, IN USA
    Could I have someone give me an ELI5 ("explain like I'm 5") summary of this thread so far? I've really gotten confused.
     
  15. hesido

    Regular

    Joined:
    Mar 28, 2004
    Messages:
    553
    Likes Received:
    85
    I'm hoping this silence is the precursor to the ELI5s being made ready as I type this.
     
    digitalwanderer likes this.
  16. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    The summary is that no one really knows yet lol, but it seems like Maxwell 2 can do async; we don't really know the limitations right now. If and when the drivers are ready, that will give us a better idea.
     
    digitalwanderer and pharma like this.
  17. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    329
    Likes Received:
    286
    Original topic:
    What performance gains are to be expected from the use of async shaders and other DX12 features? Async shaders are basically additional "tasks" you can launch on your GPU to increase utilization, and thereby efficiency as well as performance, even further. Unlike in the classic model, these tasks are not performed one after another, but truly in parallel.

    It happened:
    One prominent developer studio, currently working on a DX12 title, tried to make use of async compute shaders for the first time. It went horribly wrong on Nvidia GPUs, while it achieved astonishing gains on AMD GPUs.

    The question: What went wrong?
    One guy tried to construct a minimal test suite, trying to replicate the precise circumstances under which the Nvidia GPUs failed. Many assumptions were made, but only a few held true.

    There were claims originally that Nvidia GPUs wouldn't even be able to execute async compute shaders in an async fashion at all; this myth was quickly debunked.

    What became clear, however, is that Nvidia GPUs preferred a much lighter load than AMD cards. At small loads, Nvidia GPUs would run circles around AMD cards. At high loads, quite the opposite, up to the point where Nvidia GPUs took so long to process the workload that they triggered safeguards in Windows, which then pulled the trigger and killed the driver, assuming it had gotten stuck.

    Final result (for now): AMD GPUs are capable of handling a much higher load, about 10x what Nvidia GPUs can handle. But they also need about 4x the pressure applied before they get to play out their capabilities.
     
  18. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    7,917
    Likes Received:
    1,636
    Location:
    Finland
    What about NVIDIA apparently using preemption rather than truly parallel execution, and having to wait for a draw call to finish before switching contexts? At least that's the case for "asynchronous timewarp", which is just another case of asynchronous shaders / compute (it's in their VR documents, linked here too, I'm sure).
     
  19. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Preemption is being used in VR, but that's because VR needs preemption. I don't think async needs this at all, as preemption is for completely switching kernels; with async, kernels should be running concurrently, right? I'm thinking you shouldn't need to shut down one kernel and go to another... not sure though.
     
  20. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    843
    I thought there were anomalies with Fiji GPUs as well?
    They seem to have a subtly different architecture to the other models; I'd need to check, but I thought the 290X outperforms Fiji in Ashes as well.
    Cheers
     