DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    By a stretched definition even Maxwell supports async. Interleave compute and graphics and it's asynchronous. Results are ugly, but it's async.

So you're saying a single SM can execute compute and graphics concurrently? Because all the evidence we've seen so far suggests otherwise. Schedule a single long-running compute job to each SM and what happens when you attempt to run graphics? The current design segments the GPU into two sections, compute and graphics, and pushes tasks to their respective areas. When those ratios change, things get interesting. It's far more useful to know how a feature works and its limitations than to just assume it's a checkbox feature. As implemented, it should work well for a feature like ATW, where it mostly occurs once a frame. Get a physics engine dispatching jobs at a different framerate, asynchronously, and you'll likely have problems.
     
  2. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    And yet I still haven't been able to see these ugly results, or how to actually get to them.

That's been done and it works fine (concurrently). It also works fine if you schedule more than one long-running compute job to each SM.
    And even GCN is not that all powerful. It's actually quite easy to kick it in a way that you won't see any performance benefit of concurrent execution.
     
  3. renderstate

    Newcomer

    Joined:
    Apr 24, 2016
    Messages:
    54
    Likes Received:
    51
    I am not saying that because we have no evidence.

I don't know how they schedule work, but I do know you don't need preemption, and you don't need an SM capable of running compute and graphics tasks at the same time, in order to co-schedule graphics and compute work on a many-SM GPU.
     
  4. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    A quick question to MDolenc about compute queue latency on PC. People are mostly talking about concurrency (to increase GPU utilization) but high priority compute queues also have other use cases.

    I am interested in this use case:
    - The main rendering queue is filled with rendering tasks (buffering up to 2 frames). Might be either vsync locked or not (depending on settings). Basic game rendering (PS/VS) + some compute. Nothing special.
    - There's a separate high priority compute queue (D3D12_COMMAND_QUEUE_PRIORITY_HIGH) for game logic tasks. It is driven from a different CPU thread than the rendering, and there is no frame lock between logic and rendering.
    - Each game logic compute task is cheap (less than 1 millisecond).

    I am interested to know whether this use case is practical on various PC hardware (NV/AMD/Intel). What kind of latency should I expect to see (dispatch + readback data to CPU) based on the results you have seen? Do the high priority compute queues actually work (in reducing CPU->GPU->CPU latency when the main render queue is filled with work)?
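To put rough numbers on why this matters, here is a toy Python model (every constant invented for illustration, nothing measured) contrasting the two extremes: a GPU that lets the high-priority queue in at the next draw-call boundary versus one that drains the whole buffered render queue first.

```python
# Toy latency model for the use case above. All timings are invented;
# this only illustrates why queue priority matters, it measures nothing.

DRAW_MS = 2.0          # hypothetical cost of one draw call
BUFFERED_DRAWS = 50    # roughly two frames of buffered rendering work
KERNEL_MS = 1.0        # the game-logic compute task (<1 ms per the post)
READBACK_MS = 0.2      # hypothetical GPU->CPU copy + completion signal

def latency_preempt_at_draw_boundary(arrival_offset_ms):
    """High-priority job starts as soon as the current draw finishes."""
    remaining_draw = DRAW_MS - (arrival_offset_ms % DRAW_MS)
    return remaining_draw + KERNEL_MS + READBACK_MS

def latency_no_priority():
    """Priority ignored: compute runs after all buffered graphics work."""
    return BUFFERED_DRAWS * DRAW_MS + KERNEL_MS + READBACK_MS

print(latency_preempt_at_draw_boundary(0.5))  # 2.7 ms
print(latency_no_priority())                  # 101.2 ms
```

Under these made-up numbers, a working high-priority queue is the difference between roughly 3 ms and roughly 100 ms of CPU->GPU->CPU round-trip, i.e. between usable and unusable for per-frame game logic.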
     
    Razor1 and Anarchist4000 like this.
  5. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
Concurrently on an intra-SM level? I'm really curious to actually know, because my understanding was they were varying the ratio of graphics:compute SM assignments each frame. An SM still needs to be reconfigured, to my understanding, which would mean it has to flush and possibly affect other SMs when the switch occurs?

This wouldn't be preemption itself, but it's based on the same ability that enables it. Co-scheduling on a many-SM device isn't the issue here; it's the hardware being able to be reconfigured when the workload changes.

sebbbi's use case is the one I'm interested in as well. But I should add, some of the compute tasks will be arriving in irregular, unpredictable bursts. Say, evaluating bullet collisions when the action starts, coupled with an async timewarp. There are quite a few other possibilities as well. So how quickly could all of the SMs evaluating graphics switch to compute and return results?
     
  6. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    120
    Likes Received:
    181
"Intra-SM" level scheduling doesn't make sense on nVidia. With Maxwell, every SM is occupied by either graphics or compute workload. Every SM is bound to a geometry pipeline and four pixel pipelines. A full GM200 chip has 24 SMs, 96 ROPs and 6 GPCs, which results in 96 pixels per clock. The chip isn't able to schedule additional workload without reconfiguring the partition.

nVidia has changed the SM-per-GPC ratio with Pascal. Instead of four SMs, Pascal has five SMs per GPC. While a GPC can still output 16 pixels per clock, only four of the five SMs are used for graphics operations.
That leaves up to 25% of extra compute headroom on Pascal for async compute. Pascal has an uneven ratio between graphics and compute, and with dynamic load balancing the chip is able to redirect compute workload from four SMs to all SMs after the graphics workload has finished.
     
  7. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
It's far more limited than that. Maxwell (and earlier generations) switches between graphics and compute loads on an even coarser whole-device basis, not on an SM or even GPC basis. This is evidenced by the watchdog timeout kernel killer for CUDA. Start a one-block CUDA kernel that takes multiple seconds to run, and your display will freeze during execution, despite the fact that all but one SM have no allocated task. And after 3-5 seconds the kernel will be killed to allow display updates. If there is no display attached to the GPU, there is no timeout killer.
     
  8. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    I checked this today real quick. Added a new case to the sample, so scenario is a bit different:
    - main queue renders to offscreen target (no buffering of frames, trivial VS/PS) - 128 draws.
    - there's a high priority queue that executes a compute kernel after 10ms delay.
    Seems to work on GCN only. That is, on a 380X, graphics finishes in 70 ms and compute in 1.5 ms. Reaction time (from issue to completion signal on the high priority queue, minus average kernel runtime) seems to be in the 0.2-0.5 ms range. Checked on Maxwell and HD 4600, and in both cases the high priority queue only kicked in after the graphics queue was done.
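For anyone reproducing this, the "reaction time" metric as described can be computed like so (the timestamps below are hypothetical examples, not the raw data from the test):

```python
# Reaction time = (issue -> completion-signal interval on the high-priority
# queue) minus the kernel's average standalone runtime, which isolates the
# scheduling delay from the kernel's own cost. Example numbers are made up.

def reaction_time_ms(issue_ms, signal_ms, avg_kernel_ms):
    """Scheduling delay attributable to the GPU, not to the kernel itself."""
    return (signal_ms - issue_ms) - avg_kernel_ms

# e.g. issued at t=10.0 ms, completion signaled at t=11.8 ms, kernel alone
# takes 1.5 ms -> roughly 0.3 ms of reaction time, inside the reported
# 0.2-0.5 ms range.
print(reaction_time_ms(10.0, 11.8, 1.5))
```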
     
    vLaDv, Alexko, Alessio1989 and 7 others like this.
  9. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    120
    Likes Received:
    181
    Can you upload it? I could give it a try with Pascal.
     
    Razor1 and sebbbi like this.
  10. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Big thanks for doing the test! Great results for GCN (as expected from a console dev). I am currently evaluating my options about implementing some gameplay GPGPU code on PC. I hoped that DX12 / Vulkan high priority compute queues would make this practical.

    I hoped that the Nvidia GPUs would stop the rendering when (either): A) the command processor fetches the next draw call (or packet of draws), B) the next "wait for idle" event occurs (for example switching from RT/UAV write to read). There shouldn't be a good reason NOT to switch to high priority compute after "wait for idle" (as the GPU is already fully idle)... except that Fermi/Kepler (Maxwell?) incur an additional penalty for switching the GPU mode between graphics<->compute. I guess it needs to flush all the caches in addition to the "wait for idle", as it re-purposes parts of the cache as scratchpad for graphics stuff. As far as I understood, Intel also re-purposes parts of their L3 cache in graphics/compute modes (LDS is reserved from L3 cache).
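The two interruption points (A and B) imply very different expected waits before a high-priority job can start; a back-of-the-envelope sketch with invented workload numbers:

```python
# Expected wait before a high-priority job can start, under the two
# hypothesized interruption points. All numbers invented for illustration.

DRAW_MS = 0.5        # hypothetical cost of one draw (or packet of draws)
DRAWS_PER_WFI = 20   # hypothetical draws between "wait for idle" events

# A request lands at a uniformly random point within the interval, so the
# expected wait is half the interval length in each case.
wait_next_draw_ms = DRAW_MS / 2                  # case A: 0.25 ms
wait_next_wfi_ms = DRAW_MS * DRAWS_PER_WFI / 2   # case B: 5.0 ms

print(wait_next_draw_ms, wait_next_wfi_ms)
```

So even without intra-SM concurrency, interrupting at draw-call granularity (case A) would already give latencies in the sub-millisecond range, while waiting for the next idle point (case B) scales with how rarely those events occur.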

    Could you retest with the graphics queue workload consisting solely of compute kernels? Does Kepler perform the high priority compute queue task immediately if the render queue also happens to be running a compute kernel at the same time? My rendering code is mostly compute shaders. If the high priority queue works in this case, then it is good enough for my purposes.

    Also, I would be interested to know whether the concurrent execution of compute queues is working on Nvidia (Kepler, Maxwell) and Intel when the graphics queue is only running compute shaders. As far as I know, there shouldn't be any technical issues blocking this use case. But the drivers might not take advantage of this yet (as it is a rather uncommon use case). Most developers are still rendering triangles after all...
     
    #1430 sebbbi, Jun 23, 2016
    Last edited: Jun 23, 2016
    vLaDv, Anarchist4000, Razor1 and 2 others like this.
  11. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    I'll have to revise my statement about Maxwell. It does interrupt the graphics queue with a high priority queue, but it's a bit more jumpy. First, there's the need to finish the current draw call (and in my test they are quite long). Second, there's some fighting with DWM, I guess. Simply having Firefox with YouTube open completely messed up my test (though I imagine full-screen exclusive will make this a non-issue).

    With regard to compute: I have checked on Kepler with two normal queues and compute-only kernels. They can freely overtake one another, so I don't think there would be any weirdness with high priority queues.

    Off to DX12 Performance Discussion And Analysis Thread. :)
     
  12. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    Updated AsyncCompute test. Changes:
    - shorter compute shader
    - thanks to Jawed (loop within loop) compute performance should be more comparable between GeForce and Radeon
    - latency test where high priority compute kernel jumps in the middle of graphics queue

    P.S.: Removed CUDA dependency.
     

    Attached Files:

    #1432 MDolenc, Jun 23, 2016
    Last edited: Jun 24, 2016
    vLaDv, Alexko, Kej and 4 others like this.
  13. Sinistar

    Sinistar I LIVE
    Regular Subscriber

    Joined:
    Aug 11, 2004
    Messages:
    648
    Likes Received:
    61
    Location:
    Indiana
    Is this DX12, or CUDA?

    I get "cudart64_75.dll is missing"
     
  14. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    120
    Likes Received:
    181
    Me, too. But Google helped fix it.

    However, the program crashes in the compute test at kernel 115 on Pascal...

    /edit: After ten tries or so I was able to finish the run with a GTX 1080 @ 1924 MHz.
     

    Attached Files:

    #1434 troyan, Jun 23, 2016
    Last edited: Jun 23, 2016
  15. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    DX12, but one version had a CUDA compute experiment in it as well. Dependency removed. Thanks.
     
  16. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Good to know. So graphics + high prio compute queue is working fine on Maxwell (as long as you use the compute queue to reduce GPGPU latency). Someone needs to test this on Intel (Haswell, Broadwell, Skylake). If Kepler is the only problem case (Fermi will likely never get DX12 drivers), it might be OK to just accept an extra frame of latency for those players.

    MODS: Please move our posts to DX12 performance analysis thread. This discussion is mostly OT in this thread.
     
  17. Kaarlisk

    Regular Newcomer Subscriber

    Joined:
    Mar 22, 2010
    Messages:
    293
    Likes Received:
    49
    In case anybody finds this of interest, I ran it on a GT2 Haswell (4210U).
     

    Attached Files:

  18. HKS

    HKS
    Newcomer

    Joined:
    Apr 26, 2007
    Messages:
    31
    Likes Received:
    14
    Location:
    Norway
    Here are the results for a Titan X card (Maxwell Gen2), if anyone is interested...
    Driver 368.39
     

    Attached Files:

  19. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    980Ti @ 1400MHz, 368.39 driver.

     

    Attached Files:

  20. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    Remember: browsers closed, if you don't want to mess up the latency results on Maxwell.
     

    Attached Files:

    Razor1 and pharma like this.