DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    323
    Likes Received:
    269
    It causes the steps of 32 in "pure compute" (not actually a compute context, but compute slots in the graphics context queue!), as well as the steep increase in serialized mode.
    In an actual compute context (used for OpenCL and CUDA), this appears to work much better, but according to the specs, the GPU actually has an entirely different hardware queue for that mode.

    It also explains the reported CPU usage when the DX12 API is hammered with async compute shaders: since the queue is rather short, containing only 32 entries, it means a lot of roundtrips.

    On top of that, there appears to be some type of driver "bug".
    Remember how I mentioned that the AMD driver appears to merge shader programs? It looks like Nvidia isn't doing such a thing for async queues, only for the primary graphics queue. That could probably be fixed with some optimization in the driver, by enforcing concatenated execution of independent shaders whenever the software queue becomes overfull.

    Well, at least I assume that Nvidia is actually merging draw calls in the primary queue; otherwise I couldn't explain the discrepancy between the latency measured here in sequential mode and the throughput measured in other synthetic benchmarks. Even so, it is true that, even now, Maxwell v2 has a shorter latency than GCN does when it comes to refilling the ACEs.

    Good catch! That could also explain why the hardware needs to flush the entire hardware queue before switching between "real" compute and graphics mode. It could also explain the issues we saw earlier in this thread, with Windows killing the driver because a true compute shader program was left waiting for the compute context to become active. Either that, or some rather similar double use of the same ASIC.
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,976
    Likes Received:
    2,418
    Location:
    Well within 3d
    The software-visible destination of a command is decided by the program. It's like multi-threading in that the program has to explicitly state that two things are independent, and if they aren't, that's the program's problem.

    What happens below the abstraction given by the API is not really its problem either, which is why descriptions of the functionality add a caveat that what happens at the API level is not a mandate on what happens under the hood.
    How the GPU, the driver, or a GPU-viewing tool categorizes things is not something the API needs to care about, as long as the behaviors it defines are met.
    If a compute command list is run as if it is a compute command list, whose business is it that the hardware queue can do graphics as well in other situations?
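    To make that concrete, here's a rough C++ sketch (assuming an existing ID3D12Device*; error handling omitted, so this is illustrative rather than production code) of how little the API actually pins down: the application only declares the type of queue it wants, and which hardware engine the driver maps it to is invisible as long as the commands behave as specified.

    Code:
    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    // The API-level request: "give me a compute queue".
    ComPtr<ID3D12CommandQueue> CreateComputeQueue(ID3D12Device* device)
    {
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type  = D3D12_COMMAND_LIST_TYPE_COMPUTE; // compute, as seen by the API
        desc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;

        ComPtr<ID3D12CommandQueue> queue;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
        // Whether this lands on a dedicated compute engine or on a hardware queue
        // that could also run graphics in other situations is not observable here;
        // the API only guarantees that compute command lists submitted to it
        // behave like compute command lists.
        return queue;
    }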
     
    Razor1 and pharma like this.
  3. serversurfer

    Newcomer

    Joined:
    Nov 13, 2008
    Messages:
    5
    Likes Received:
    1
    Nice! Thank you very much.


    Maybe that's normal-ish? Take a look at this table from Anandtech.

    Sounds like Max2 has the same 32-queue scheduler Max1 had, but now it can be used in a "mixed" mode where one of the 32 is reserved for render jobs, allowing you to run them "simultaneously." Maybe Max2 always operates in Mixed mode on DX12, even when doing nothing but compute work, and their software scheduler pulls jobs from a 1 + 31 setup, and queues it all up together on the hardware rendering queue? That would explain why the compute queue isn't being used, but I don't know if it explains why it doesn't seem to show any fine-grain/overlapping execution. For whatever reason, it seems like they can't actually process jobs from both queue types at the same time.

    Or maybe it does explain it? Sorry, are you guys saying there's something that would require the system to handle the render jobs separately from the compute jobs, as opposed to blending them together? I'm not really sure I followed what you were saying.
     
  4. Darius

    Newcomer

    Joined:
    Sep 27, 2013
    Messages:
    37
    Likes Received:
    30
    I'm unsure how the specific app is programmed, and whether it has to define the queues/mixed mode from the start in a way that sets the stage for what happens next, but the program begins with a purely compute workload, long before graphics come into the mix. And when running the OpenCL benchmark LuxMark, it uses the compute queue exclusively, never touching graphics.

    You may be right about that, but should that really have such major performance ramifications? Just because it's a software scheduler shouldn't preclude it from good performance; aren't CPUs scheduled by the OS in software?
     
  5. Forceman

    Newcomer

    Joined:
    Dec 23, 2010
    Messages:
    11
    Likes Received:
    10
    Update from Kollock, the Oxide developer posting over at OCN:

    So sounds like a driver issue.
     
    drSeehas, fellix, Jackalito and 3 others like this.
  6. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    323
    Likes Received:
    269
    No, it is blending them together just fine. One queue is filled with commands from the render pipeline. The other 31 queues are either filled with commands from up to 31 dedicated (software) queues or with async tasks.

    That's almost working as it is supposed to. There appear to be several distinct problems, though:
    • If a task is flagged as async, it is never batched with other tasks, so the corresponding queue underruns as soon as the assigned task finishes.
    • If a queue is flagged as async AND serial, then apart from the lack of batching, none of the other 31 queues is filled either, so the GPU runs out of jobs after executing just a single task each.
    • As soon as dependencies become involved, the entire scheduling is performed on the CPU side, as opposed to offloading at least parts of it to the GPU.
    • Mixed mode (31+1) seems to have different hardware capabilities than pure compute mode (32). Via DX12, only mixed mode is accessible. (Not sure whether this is true or whether the driver simply isn't making use of the hardware's capabilities.)
    • The graphics part of the benchmark in this thread appears to keep the GPU completely busy on Nvidia hardware, so none of the compute tasks is actually running in parallel.
    • "Real" compute tasks require the GPU to be completely idle before switching context. This became prominent because Windows dispatches at least one pure compute task on a regular basis; that task starved, and Windows killed the driver.
    Well, at least the first two are apparently driver bugs:
    • Nvidia COULD just merge multiple async tasks into batches in order to avoid underruns in the individual queues. This would reduce the number of roundtrips to the CPU enormously (see the sketch below for what batched submission looks like).
    • The same goes for sequential tasks; they can be merged into batches as well. AMD already does that, and even performs additional optimizations on the merged code (notice how the performance went up in sequential mode?).
    So AMD's driver is simply far more mature, while Nvidia is currently using a rather naive approach.
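    Here's roughly what that kind of batching looks like from the submission side, as a C++ sketch (the device, compute queue, allocator, root signature, and compute PSO are assumed to exist and to have been created for the compute command list type; the names are made up for illustration):

    Code:
    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    // Recording many small, independent dispatches into ONE command list and
    // submitting them in ONE call amortizes the per-submission CPU roundtrip,
    // instead of paying it once per tiny task.
    void SubmitBatched(ID3D12Device* device,
                       ID3D12CommandQueue* computeQueue,
                       ID3D12CommandAllocator* computeAllocator,
                       ID3D12PipelineState* computePso,
                       ID3D12RootSignature* computeRootSig,
                       UINT taskCount)
    {
        ComPtr<ID3D12GraphicsCommandList> list;
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE,
                                  computeAllocator, computePso, IID_PPV_ARGS(&list));
        list->SetComputeRootSignature(computeRootSig);

        // All independent tasks go into a single command list...
        for (UINT i = 0; i < taskCount; ++i)
            list->Dispatch(1, 1, 1); // deliberately tiny dummy workloads

        list->Close();

        // ...and a single submission, instead of taskCount submissions.
        ID3D12CommandList* lists[] = { list.Get() };
        computeQueue->ExecuteCommandLists(1, lists);
    }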

    Well, and there's the confirmation.

    I wouldn't expect the Nvidia cards to perform THAT badly in the future, given that there are still possible gains to be made in the driver. I wouldn't exactly overestimate them either, though. AMD simply has a far more scalable hardware design in this domain, and the necessity of switching between compute and graphics context, in combination with the starvation issue, will continue to haunt Nvidia, as that isn't a software but a hardware design fault.
     
    chris1515, Jackalito, RedVi and 5 others like this.
  7. serversurfer

    Newcomer

    Joined:
    Nov 13, 2008
    Messages:
    5
    Likes Received:
    1
    Right, but maybe the compute queue is just ignored in DX12, and the 32-queue scheduler just dumps everything on the "primary" (render) queue, and gives priority to any render jobs that come through.

    Might not the behavior be different under DX12? Keep in mind, I'm a bit of a noob here. lol

    Well, that's the part I still couldn't explain. lol It seems to me that if the scheduler is managing the same 32 queues as always, it should be able to schedule a compute job alongside a rendering job. Then Dilettante said something I didn't understand, and Ext3h seemed to say, "Aha! No wonder they need to switch contexts!" so I thought maybe they'd figured that bit out. :p

    Here's the noob theory I just came up with, based on my limited understanding of how this stuff works. Every cycle, you have a certain amount of resources available on the GPU, like seats on a plane. Render jobs are considered high-priority, places-to-be passengers, with reserved seating. Additionally, we've got a bunch of compute jobs flying standby, hoping to nab any unclaimed seats. So in theory, once the full-fare passengers have claimed their seats, the scheduler should be able to fill any remaining seats with standby passengers from the 31 lanes they're standing in, and then release the plane.

    But again, those full-fare passengers have places to be. Nvidia are doing software scheduling, right? Might it be that they simply can't wait to see what rendering needs to be done on this cycle and still have time to board all of the standby passengers before the plane is scheduled to leave? If so, they may just let the full-fare passengers have this plane to themselves and sit wherever they want while the scheduler uses that time to figure out how to most efficiently cram all of the standby passengers on the next flight, which will carry nothing but. Maybe they don't have the ability to peek ahead in the rendering queue, so each rendering job is effectively a surprise. No pre-checkin for full-fare means no ability to plan ahead for standby needs/usage.

    But like I said, I'm not even sure if that's how this stuff is being handled. lol
     
  8. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,628
    Likes Received:
    1,369
    Would all 4 warp schedulers be used for performing these tasks?
     
  9. serversurfer

    Newcomer

    Joined:
    Nov 13, 2008
    Messages:
    5
    Likes Received:
    1
    Thanks for trying, but unfortunately, I don't really know enough to have been able to follow most of that. lol If you don't mind holding my hand a bit more…

    If I understand what you mean by batching, that's what I meant by blending. To give an example that also illustrates my skill level here, let's say the math units in the GPU know how to add, multiply, subtract, or divide. If the rendering queue only needs use of the add and multiply units on this particular cycle, then we should be able to hang compute jobs on the subtraction and/or division units, right? And for whatever reason, that's not happening on the Nvidia cards, right? Is that even what we're talking about here? :p

    At the end, you say the need for context switching will continue to haunt them. Why? Is that the same inability to batch/blend jobs, or are you talking about that DWM.exe task that Windows was hanging on the "real" compute queue, which subsequently went ignored when the rendering queue got too busy?

    Sorry, I really am trying!
     
  10. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,628
    Likes Received:
    1,369
    Oxide developer statement regarding D3D12 (originally posted at http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/2130#post_24379702 ):
    Wow, lots more posts here; there are just too many things to respond to, so I'll try to answer what I can.

    /inconvenient things I'm required to ask or they won't let me post anymore
    Regarding screenshots and other info from our game, we appreciate your support but please refrain from disclosing these until after we hit early access. It won't be long now.
    /end

    Regarding batches, we use the term batches just because we are counting both draw calls and dispatch calls. Dispatch calls are compute shaders, draw calls are normal graphics shaders. Though sometimes everyone calls dispatches draw calls, they are different, so we thought we'd avoid the confusion by not calling everything a draw call.
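    [Not part of the quoted statement: to illustrate the distinction, here is a tiny C++ sketch of the two kinds of calls being counted. cmdList is assumed to be an open ID3D12GraphicsCommandList, and the pipeline-state changes between the two calls are omitted.]

    Code:
    #include <d3d12.h>

    void RecordExampleBatches(ID3D12GraphicsCommandList* cmdList)
    {
        // Draw call: runs the bound graphics (vertex/pixel) shaders.
        cmdList->DrawIndexedInstanced(/*IndexCountPerInstance*/ 36,
                                      /*InstanceCount*/ 1,
                                      /*StartIndexLocation*/ 0,
                                      /*BaseVertexLocation*/ 0,
                                      /*StartInstanceLocation*/ 0);

        // Dispatch call: runs the bound compute shader on a grid of thread groups.
        // (A compute pipeline state would need to be set before this call.)
        cmdList->Dispatch(/*ThreadGroupCountX*/ 64,
                          /*ThreadGroupCountY*/ 1,
                          /*ThreadGroupCountZ*/ 1);
    }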

    Regarding CPU load balancing on D3D12, that's entirely the application's responsibility. So if you see a case where it's not load balancing, it's probably the application, not the driver/API. We've done some additional tuning to the engine even in the last month and can clearly see usage cases where we can load 8 cores at maybe 90-95% load. Getting to 90% on an 8-core machine makes us really happy. Keeping our application tuned to scale like this is definitely an ongoing effort.
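    [Not part of the quoted statement: a rough C++ sketch, under assumed names, of the kind of CPU-side load balancing D3D12 leaves to the application. Each worker thread records into its own command allocator and command list, and everything is submitted together afterwards; the per-thread RecordWork callback is hypothetical.]

    Code:
    #include <d3d12.h>
    #include <wrl/client.h>
    #include <thread>
    #include <vector>
    using Microsoft::WRL::ComPtr;

    void RecordInParallel(ID3D12Device* device, ID3D12CommandQueue* directQueue,
                          unsigned workerCount)
    {
        std::vector<ComPtr<ID3D12CommandAllocator>>    allocators(workerCount);
        std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(workerCount);
        std::vector<std::thread> workers;

        for (unsigned i = 0; i < workerCount; ++i)
        {
            // One allocator and one command list per worker; allocators must not
            // be shared between threads that record concurrently.
            device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                           IID_PPV_ARGS(&allocators[i]));
            device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                      allocators[i].Get(), nullptr,
                                      IID_PPV_ARGS(&lists[i]));

            workers.emplace_back([&lists, i] {
                // RecordWork(lists[i].Get(), i); // hypothetical per-thread recording
                lists[i]->Close();
            });
        }
        for (auto& w : workers) w.join();

        // The application decides how work is split across cores; the API only
        // sees the final submission.
        std::vector<ID3D12CommandList*> raw;
        for (auto& l : lists) raw.push_back(l.Get());
        directQueue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());
    }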

    Additionally, hitches and stalls are largely the application's responsibility under D3D12. In D3D12, essentially everything that could cause a stall has been removed from the API. For example, the pipeline objects are designed such that the dreaded shader recompiles won't ever have to happen. We also have precise control over how long a graphics command is queued up. This is pretty important for VR applications.
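    [Not part of the quoted statement: a minimal C++ sketch of the pipeline-object idea, using a compute PSO as the simplest case; the root signature and precompiled shader bytecode are assumed to exist. All the state that could trigger a recompile in older APIs is baked into one immutable object up front.]

    Code:
    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    ComPtr<ID3D12PipelineState> CreateComputePso(ID3D12Device* device,
                                                 ID3D12RootSignature* rootSig,
                                                 D3D12_SHADER_BYTECODE cs)
    {
        D3D12_COMPUTE_PIPELINE_STATE_DESC desc = {};
        desc.pRootSignature = rootSig; // resource-binding layout, fixed up front
        desc.CS             = cs;      // precompiled compute shader bytecode

        ComPtr<ID3D12PipelineState> pso;
        // Any expensive compilation happens here, at creation time, not at
        // dispatch time, so no recompile-induced stall can hit a frame later.
        device->CreateComputePipelineState(&desc, IID_PPV_ARGS(&pso));
        return pso;
    }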

    Also keep in mind that the memory model for D3D12 is completely different from D3D11's, at an OS level. I'm not sure if you can honestly compare things like memory load against each other. In D3D12 we have more control over residency, and we may, for example, intentionally keep something unused resident so that there is no chance of a micro-stutter if that resource is needed. There is no reliable way to do this in D3D11. Thus, comparing memory residency between the two APIs may not be meaningful, at least not until everyone's had a chance to really tune things for the new paradigm.
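    [Not part of the quoted statement: a small C++ sketch of the explicit residency control being described, assuming an existing ID3D12Heap. Keeping a heap resident means touching it later cannot trigger a paging hitch, at the cost of memory pressure.]

    Code:
    #include <d3d12.h>

    void PinHeap(ID3D12Device* device, ID3D12Heap* heap)
    {
        ID3D12Pageable* objects[] = { heap };
        device->MakeResident(1, objects); // keep it in GPU-accessible memory
    }

    void ReleaseHeap(ID3D12Device* device, ID3D12Heap* heap)
    {
        ID3D12Pageable* objects[] = { heap };
        device->Evict(1, objects);        // allow the OS to page it out
    }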

    Regarding SLI and CrossFire situations, yes, support is coming. However, those options in the ini file probably do not do what you think they do, just FYI. Some posters here have been remarkably perceptive on different multi-GPU modes that are coming, and let me just say that we are looking beyond just the standard CrossFire and SLI configurations of today. We think that multi-GPU situations are an area where D3D12 will really shine (once we get all the kinks ironed out, of course). I can't promise when this support will be unveiled, but we are committed to doing it right.

    Regarding async compute, a couple of points on this. First, though we are the first D3D12 title, I wouldn't hold us up as the prime example of this feature. There are probably better demonstrations of it. This is a pretty complex topic, and fully understanding it will require significant understanding of the particular GPU in question that only an IHV can provide. I certainly wouldn't hold Ashes up as the premier example of this feature.

    We actually just chatted with Nvidia about async compute; indeed, the driver hasn't fully implemented it yet, though it appeared as if it had. We are working closely with them as they fully implement async compute. We'll keep everyone posted as we learn more.

    Also, we are pleased that D3D12 support on Ashes should be functional on Intel hardware relatively soon (actually, it's functional now; it's just a matter of getting the right driver out to the public).

    Thanks!
     
    #550 pharma, Sep 5, 2015
    Last edited by a moderator: Sep 5, 2015
    Lightman, BRiT, mosen and 4 others like this.
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,853
    Likes Received:
    722
    Location:
    London
    After years of working on D3D12, NVidia has finally realised it needs to do async compute.
     
    vLaDv, Malo, Devnant and 1 other person like this.
  12. Nobu

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    21
    Likes Received:
    1
    Don't know if you are all still interested in what gpuview shows for older GCN cards (or, really, if you were in the first place), specifically my r9-270x, but I guess I'm doing something wrong because it's not showing any info on the GPU itself--only process activity...
     
  13. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    The way I see it, it just depends on their sprint cycles for development.
     
    pharma likes this.
  14. trandoanhung1991

    Joined:
    Sep 2, 2015
    Messages:
    6
    Likes Received:
    6
    I don't think anyone expected a closed-alpha benchmark to make any noise before it even hits Early Access, and for such a niche market too.

    I think this is just a misstep in marketing for NVIDIA.
     
  15. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    647
    Location:
    O Canada!
  16. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    16,784
    Likes Received:
    1,411
    Location:
    Winfield, IN USA
    Been a loooong time since I've seen nVidia PR screw the pooch. I might hate their PR dept, but I respect their evil abilities. It doesn't seem like their style to make a mistake like this.
     
    pharma and Razor1 like this.
  17. Forceman

    Newcomer

    Joined:
    Dec 23, 2010
    Messages:
    11
    Likes Received:
    10
    He means RTS, I'm guessing.
     
  18. CasellasAbdala

    Newcomer

    Joined:
    Aug 28, 2015
    Messages:
    11
    Likes Received:
    5
    So, does this mean that Maxwell 2 really does have async shaders?

    Now for the real question.

    I read a while ago that Maxwell 2's async shaders are software-based rather than hardware-based (this was stated by "Mahigan", after analyzing what the Oxide dev said).

    Is this true? How does it work, then?
     
    Razor1 likes this.
  19. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    No, it seems the front end of Maxwell 2 will have issues with heavy compute workloads combined with graphics. This is because it doesn't have the register space or cache to keep filling the queues. Now, whether we will actually see that in its lifetime is up in the air; we need more DX12 games to test for that. (This is all assuming that what we have seen so far is entirely the driver's fault.)
     
  20. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Yeah, the RTS market isn't that small, I don't think. StarCraft and Warcraft sold a lot of copies.
     
    #560 Razor1, Sep 5, 2015
    Last edited: Sep 5, 2015
    digitalwanderer likes this.