DX12 Performance Discussion And Analysis Thread

It's possible that there's more driver-level management and construction of the queue.
Possibly the 32 "queues" are software-defined slots for independent calls that the driver has determined the GPU can issue in parallel, perhaps through a single command front end.
When running purely in compute, the timings seem to stair-step as one would expect.
That would explain the steps of 32 in "pure compute" (not actually a compute context, but compute slots in the graphics-context queue!), as well as the steep increase in serialized mode.
In an actual compute context (the one used for OpenCL and CUDA), this appears to work much better, but according to the specs the GPU has an entirely different hardware queue for that mode.
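As a sanity check, here's a toy model of that stair-stepping. The 32-slot count comes from the discussion above; the per-batch time is an arbitrary placeholder, not a measurement.

Code:
// Toy model: with only 32 independent slots in flight, total time should
// grow in steps of 32 kernels. Slot count is the assumption discussed above;
// the per-batch time is made up purely for illustration.
#include <cmath>
#include <cstdio>

int main() {
    const int kSlots = 32;        // assumed number of parallel slots
    const double kBatchMs = 5.0;  // hypothetical time per batch of kernels

    for (int kernels = 1; kernels <= 128; ++kernels) {
        double expectedMs = std::ceil(kernels / double(kSlots)) * kBatchMs;
        // Print only around the step boundaries to show the jumps.
        if (kernels % kSlots == 0 || kernels % kSlots == 1)
            std::printf("%3d kernels -> ~%.1f ms\n", kernels, expectedMs);
    }
    return 0;
}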

It also explains the reported CPU usage when the DX12 API is hammered with async compute shaders: since the queue is rather short, holding only 32 entries, it means a lot of round trips.

On top of that, it appears to be some type of driver "bug".
Remember how I mentioned that the AMD driver appears to merge shader programs? It looks like Nvidia isn't doing such a thing for the async queues, only for the primary graphics queue. That could probably use some optimization in the driver, by enforcing concatenated execution of independent shaders when the software queue becomes overfull.

Well, at least I assume that Nvidia is actually merging draw calls in the primary queue; otherwise I couldn't explain the discrepancy between the latency measured here in sequential mode and the throughput measured in other synthetic benchmarks. It is true, though, that even now Maxwell v2 has a shorter latency than GCN when it comes to refilling the ACEs.

AMD's separate compute paths may provide a form of primitive tracking separate from the primitive tracking in the geometry pipeline.
It does seem like the separate command list cases can pipeline well enough. Perhaps there is a unified tracking system that does not readily handle geometric primitive and compute primitive ordering within the same context?
Good catch! That could also explain why the hardware needs to flush the entire hardware queue before switching between "real" compute and graphics mode, as well as the issues we saw in this thread with Windows killing the driver because a true compute shader program was left waiting for the compute context to become active. Either this, or some rather similar double use of some ASIC block.
 
So if I understand you correctly, the graphics was sent to a predefined "graphics queue" in DX12, and the compute was sent to a predefined "compute queue" in DX12. And then Maxwell internally redirected the compute to the graphics? Presumably because it determined it couldn't run that code in a compute context?

The software-visible destination of a command is decided by the program. It's like multi-threading: the program has to explicitly state that two things are independent, and if they aren't, that's the program's problem.

What happens below the abstraction given by the API is not really its problem either, which is why descriptions of the functionality add the caveat that what happens at the API level is not a mandate on what happens under the hood.
What the GPU, the driver, or a GPU-viewing program categorizes things as is not something the API needs to care about, as long as the behaviors it defines are met.
If a compute command list is run as if it is a compute command list, whose business is it that the hardware queue can do graphics as well in other situations?
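For reference, this is roughly what "the program decides the destination" looks like in D3D12: the application only picks the queue/list type, and everything below that is the driver's business. (A minimal sketch; device creation and error handling omitted, and the names are just placeholders.)

Code:
// Sketch: the application chooses the software-visible queue types.
// What the driver/hardware maps them to underneath is not visible here.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    // DIRECT queue: accepts graphics, compute and copy commands.
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));

    // COMPUTE queue: accepts only compute and copy commands.
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}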
 
That doesn't seem to be the case at all.
Would you mind linking which chart(s) in particular you're looking at?

For Maxwell cards (here or here or here) the orange dots are mostly at 0, with a few dots managing to jump up, seemingly at fixed intervals. Never anywhere close to blue or red.

For GCN cards (here or here or here) the orange dots are mostly on either blue or red (whichever is lower).
They're on red at the beginning, where red is lower than blue, and stay on blue afterwards, once red has increased above blue.

Indeed, Fury X (here or here) is looking very weird. The orange dots either stay at zero or jump all the way up to blue (or red, whichever is lower), but never anywhere in between.
Another run using the older version of the test shows the orange dots being either 0, 25%, 50%, 75%, or 100% of the blue dots.
Nice! Thank you very much.


That can't be it, because when the compute workload is done in isolation it's still considered graphics by Maxwell.
Maybe that's normal-ish? Take a look at this table from Anandtech.

[image: table from Anandtech]

Sounds like Max2 has the same 32-queue scheduler Max1 had, but now it can be used in a "mixed" mode where one of the 32 is reserved for render jobs, allowing you to run them "simultaneously." Maybe Max2 always operates in Mixed mode on DX12, even when doing nothing but compute work, and their software scheduler pulls jobs from a 1 + 31 setup, and queues it all up together on the hardware rendering queue? That would explain why the compute queue isn't being used, but I don't know if it explains why it doesn't seem to show any fine-grain/overlapping execution. For whatever reason, it seems like they can't actually process jobs from both queue types at the same time.

AMD's separate compute paths may provide a form of primitive tracking separate from the primitive tracking in the geometry pipeline.
It does seem like the separate command list cases can pipeline well enough. Perhaps there is a unified tracking system that does not readily handle geometric primitive and compute primitive ordering within the same context?
Good catch! That could also explain why the hardware needs to flush the entire hardware queue before switching between "real" compute and graphics mode, as well as the issues we saw in this thread with Windows killing the driver because a true compute shader program was left waiting for the compute context to become active. Either this, or some rather similar double use of some ASIC block.
Or maybe it does explain it? Sorry, are you guys saying there's something that would require the system to handle the render jobs separately from the compute jobs, as opposed to blending them together? I'm not really sure I followed what you were saying.
 
Maybe that's normal-ish? Take a look at this table from Anandtech.

I'm unsure of how the specific app is programmed, whether it has to define the queues/mixed mode from the start and that sets the stage for what happens next, but the program begins with a purely compute workload, long before graphics come into the mix. And when running the OpenCL benchmark LuxMark, it uses the compute queue exclusively, never touching graphics.

Sounds like Max2 has the same 32-queue scheduler Max1 had, but now it can be used in a "mixed" mode where one of the 32 is reserved for render jobs, allowing you to run them "simultaneously." Maybe Max2 always operates in Mixed mode on DX12, even when doing nothing but compute work, and their software scheduler pulls jobs from a 1 + 31 setup, and queues it all up together on the hardware rendering queue? That would explain why the compute queue isn't being used, but I don't know if it explains why it doesn't seem to show any fine-grain/overlapping execution. For whatever reason, it seems like they can't actually process jobs from both queue types at the same time.

You may be right about that, but should that really have such major performance ramifications? Just because it's a software scheduler shouldn't preclude it from good performance; aren't CPUs scheduled by the OS in software?
 
Update from Kollock, the Oxide developer posting over at OCN:

We actually just chatted with Nvidia about Async Compute, indeed the driver hasn't fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute. We'll keep everyone posted as we learn more.

So sounds like a driver issue.
 
Or maybe it does explain it? Sorry, are you guys saying there's something that would require the system to handle the render jobs separately from the compute jobs, as opposed to blending them together? I'm not really sure I followed what you were saying.
No, it is blending them together just fine. One queue is filled with commands from the render pipeline. The other 31 queues are either filled with commands from up to 31 dedicated (software) queues or with async tasks.

That's almost working as it is supposed to. There appear to be several distinct problems, though:
  • If a task is flagged as async, it is never batched with other tasks - so the corresponding queue underruns as soon as the assigned task finishes.
  • If a queue is flagged as async AND serial, then apart from the lack of batching, none of the other 31 queues are filled either, so the GPU runs out of jobs after executing just a single task each.
  • As soon as dependencies become involved, the entire scheduling is performed on the CPU side, as opposed to offloading at least parts of it to the GPU.
  • Mixed mode (31+1) seems to have different hardware capabilities than pure compute mode (32). Via DX12, only mixed mode is accessible. (Not sure whether this is true, or whether the driver just isn't making use of the hardware features.)
  • The graphics part of the benchmark in this thread appears to keep the GPU completely busy on Nvidia hardware, so none of the compute tasks actually runs in "parallel".
  • "Real" compute tasks require the GPU to be completely idle before switching context. This became apparent because Windows dispatches at least one pure compute task on a regular basis, which starved and caused Windows to kill the driver.
Well, at least the first two are apparently driver bugs:
  • Nvidia COULD just merge multiple async tasks into batches in order to avoid underruns in the individual queues. This would reduce the number of round trips to the CPU enormously (see the sketch below).
  • The same goes for sequential tasks; they can be merged into batches as well. AMD already does that, and even performs additional optimizations on the merged code (notice how the performance went up in sequential mode?).
So AMD's driver is simply far more mature, while Nvidia is currently using a rather naive approach.
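To make the batching point concrete from the application side (the driver-side merging described above obviously isn't something we can write ourselves, so treat this purely as an analogy, with made-up names): handing the queue many small recorded lists in a single ExecuteCommandLists call, instead of one call per task, is the same trade of round trips for batch size.

Code:
// Analogy for the batching idea at the application level: one submission
// covering many small compute command lists means far fewer round trips
// than submitting each list individually. The lists are assumed to be
// recorded and closed already; names are placeholders.
#include <d3d12.h>
#include <vector>

void SubmitBatched(ID3D12CommandQueue* computeQueue,
                   const std::vector<ID3D12CommandList*>& recordedLists)
{
    computeQueue->ExecuteCommandLists(
        static_cast<UINT>(recordedLists.size()),
        recordedLists.data());
}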

We actually just chatted with Nvidia about Async Compute, indeed the driver hasn't fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute. We'll keep everyone posted as we learn more.

Well, and there's the confirmation.

I wouldn't expect the Nvidia cards to perform THAT badly in the future, given that there are still possible gains to be made in the driver. I wouldn't exactly overestimate them either, though. AMD simply has a far more scalable hardware design in this domain, and the necessity of switching between compute and graphics context, in combination with the starvation issue, will continue to haunt Nvidia, as that isn't a software fault but a hardware design fault.
 
I'm unsure of how the specific app is programmed, whether it has to define the queues/mixed mode from the start and that sets the stage for what happens next, but the program begins with a purely compute workload, long before graphics come into the mix.
Right, but maybe the compute queue is just ignored in DX12, and the 32-queue scheduler just dumps everything on the "primary" (render) queue, and gives priority to any render jobs that come through.

And when running the OpenCL benchmark LuxMark, it uses the compute queue exclusively, never touching graphics.
Might not the behavior be different under DX12? Keep in mind, I'm a bit of a noob here. lol

You may be right about that, but should that really have such major performance ramifications? Just because it's a software scheduler shouldn't preclude it from good performance; aren't CPUs scheduled by the OS in software?
Well, that's the part I still couldn't explain. lol It seems to me that if the scheduler is managing the same 32 queues as always, it should be able to schedule a compute job alongside a rendering job. Then Dilettante said something I didn't understand, and Ext3h seemed to say, "Aha! No wonder they need to switch contexts!" so I thought maybe they'd figured that bit out. :p

Here's the noob theory I just came up with, based on my limited understanding of how this stuff works. Every cycle, you have a certain amount of resources available on the GPU, like seats on a plane. Render jobs are considered high-priority, places-to-be passengers, with reserved seating. Additionally, we've got a bunch of compute jobs flying standby, hoping to nab any unclaimed seats. So in theory, once the full-fare passengers have claimed their seats, the scheduler should be able to fill any remaining seats with standby passengers from the 31 lanes they're standing in, and then release the plane.

But again, those full-fare passengers have places to be. Nvidia are doing software scheduling, right? Might it be that they simply can't wait to see what rendering needs to be done on this cycle and still have time to board all of the standby passengers before the plane is scheduled to leave? If so, they may just let the full-fare passengers have this plane to themselves and sit wherever they want while the scheduler uses that time to figure out how to most efficiently cram all of the standby passengers on the next flight, which will carry nothing but. Maybe they don't have the ability to peek ahead in the rendering queue, so each rendering job is effectively a surprise. No pre-checkin for full-fare means no ability to plan ahead for standby needs/usage.

But like I said, I'm not even sure if that's how this stuff is being handled. lol
 
Would all 4 warp schedulers be used for performing these tasks?
 
No, it is blending them together just fine. One queue is filled with commands from the render pipeline. The other 31 queues are either filled with commands from up to 31 dedicated (software) queues or with async tasks.

That's almost working as it is supposed to. There appear to be several distinct problems, though:
  • If a task is flagged as async, it is never batched with other tasks - so the corresponding queue underruns as soon as the assigned task finishes.
  • If a queue is flagged as async AND serial, then apart from the lack of batching, none of the other 31 queues are filled either, so the GPU runs out of jobs after executing just a single task each.
  • As soon as dependencies become involved, the entire scheduling is performed on the CPU side, as opposed to offloading at least parts of it to the GPU.
  • Mixed mode (31+1) seems to have different hardware capabilities than pure compute mode (32). Via DX12, only mixed mode is accessible. (Not sure whether this is true, or whether the driver just isn't making use of the hardware features.)
  • The graphics part of the benchmark in this thread appears to keep the GPU completely busy on Nvidia hardware, so none of the compute tasks actually runs in "parallel".
  • "Real" compute tasks require the GPU to be completely idle before switching context. This became apparent because Windows dispatches at least one pure compute task on a regular basis, which starved and caused Windows to kill the driver.
Well, at least the first two are apparently driver bugs:
  • Nvidia COULD just merge multiple async tasks into batches in order to avoid underruns in the individual queues. This would reduce the number of round trips to the CPU enormously.
  • The same goes for sequential tasks; they can be merged into batches as well. AMD already does that, and even performs additional optimizations on the merged code (notice how the performance went up in sequential mode?).
So AMD's driver is simply far more mature, while Nvidia is currently using a rather naive approach.



Well, and there's the confirmation.

I wouldn't expect the Nvidia cards to perform THAT badly in the future, given that there are still possible gains to be made in the driver. I wouldn't exactly overestimate them either, though. AMD simply has a far more scalable hardware design in this domain, and the necessity of switching between compute and graphics context, in combination with the starvation issue, will continue to haunt Nvidia, as that isn't a software fault but a hardware design fault.
Thanks for trying, but unfortunately, I don't really know enough to have been able to follow most of that. lol If you don't mind holding my hand a bit more…

If I understand what you mean by batching, that's what I meant by blending. To give an example that also illustrates my skill level here, let's say the math units in the GPU know how to add, multiply, subtract, or divide. If the rendering queue only needs use of the add and multiply units on this particular cycle, then we should be able to hang compute jobs on the subtraction and/or division units, right? And for whatever reason, that's not happening on the Nvidia cards, right? Is that even what we're talking about here? :p

At the end, you say the need for context switching will continue to haunt them. Why? Is that the same inability to batch/blend jobs, or are you talking about that DWM.exe that Windows was hanging on the "real" compute queue which subsequently went ignored when the rendering queue got too busy computing?

Sorry, I really am trying!
 
Oxide developer statement regarding D3D12 (originally posted at http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/2130#post_24379702 ):
Wow, lots more posts here; there are just too many things to respond to, so I'll try to answer what I can.

/inconvenient things I'm required to ask or they won't let me post anymore
Regarding screenshots and other info from our game, we appreciate your support but please refrain from disclosing these until after we hit early access. It won't be long now.
/end

Regarding batches, we use the term batches just because we are counting both draw calls and dispatch calls. Dispatch calls are compute shaders, draw calls are normal graphics shaders. Though sometimes everyone calls dispatches draw calls, they are different, so we thought we'd avoid the confusion by calling everything a draw call.

Regarding CPU load balancing on D3D12, that's entirely the application's responsibility. So if you see a case where it's not load balancing, it's probably the application, not the driver/API. We've done some additional tuning to the engine even in the last month and can clearly see usage cases where we can load 8 cores at maybe 90-95% load. Getting to 90% on an 8-core machine makes us really happy. Keeping our application tuned to scale like this is definitely an ongoing effort.

Additionally, hitches and stalls are largely the application's responsibility under D3D12. In D3D12, essentially everything that could cause a stall has been removed from the API. For example, the pipeline objects are designed such that the dreaded shader recompiles won't ever have to happen. We also have precise control over how long a graphics command is queued up. This is pretty important for VR applications.

Also keep in mind that the memory model for D3D12 is completely different from D3D11's, at an OS level. I'm not sure if you can honestly compare things like memory load against each other. In D3D12 we have more control over residency, and we may, for example, intentionally keep something unused resident so that there is no chance of a micro-stutter if that resource is needed. There is no reliable way to do this in D3D11. Thus, comparing memory residency between the two APIs may not be meaningful, at least not until everyone's had a chance to really tune things for the new paradigm.

Regarding SLI and Crossfire situations, yes, support is coming. However, those options in the ini file probably do not do what you think they do, just FYI. Some posters here have been remarkably perceptive on different multi-GPU modes that are coming, and let me just say that we are looking beyond just the standard Crossfire and SLI configurations of today. We think that multi-GPU situations are an area where D3D12 will really shine (once we get all the kinks ironed out, of course). I can't promise when this support will be unveiled, but we are committed to doing it right.

Regarding Async Compute, a couple of points on this. First, though we are the first D3D12 title, I wouldn't hold us up as the prime example of this feature. There are probably better demonstrations of it. This is a pretty complex topic, and fully understanding it requires significant knowledge of the particular GPU in question that only an IHV can provide. I certainly wouldn't hold Ashes up as the premier example of this feature.

We actually just chatted with Nvidia about Async Compute, indeed the driver hasn't fully implemented it yet, but it appeared like it was. We are working closely with them as they fully implement Async Compute. We'll keep everyone posted as we learn more.

Also, we are pleased that D3D12 support on Ashes should be functional on Intel hardware relatively soon (actually, it's functional now; it's just a matter of getting the right driver out to the public).

Thanks!
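For anyone unfamiliar with the draw/dispatch distinction he brings up, this is what the two kinds of calls look like at the D3D12 API level. (Just an illustration, not Oxide's code; pipeline state and resource bindings are assumed to be set up already, and the names are placeholders.)

Code:
// Illustration of the two call types being counted as "batches" above.
// 'cmdList' is assumed to be an open ID3D12GraphicsCommandList with the
// appropriate pipeline state and root bindings already in place.
#include <d3d12.h>

void RecordBothKinds(ID3D12GraphicsCommandList* cmdList)
{
    // "Draw call": runs the graphics pipeline (vertex/pixel shaders, etc.).
    cmdList->DrawInstanced(/*VertexCountPerInstance*/ 3,
                           /*InstanceCount*/ 1,
                           /*StartVertexLocation*/ 0,
                           /*StartInstanceLocation*/ 0);

    // "Dispatch call": runs a compute shader over a grid of thread groups.
    cmdList->Dispatch(/*ThreadGroupCountX*/ 64,
                      /*ThreadGroupCountY*/ 1,
                      /*ThreadGroupCountZ*/ 1);
}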
 
Don't know if you are all still interested in what gpuview shows for older GCN cards (or, really, if you were in the first place), specifically my r9-270x, but I guess I'm doing something wrong because it's not showing any info on the GPU itself--only process activity...
gpuview_r9-270x.jpg
gpuview_r9-270x_ZOOMx1.jpg
gpuview_r9-270x_ZOOMx2.jpg
 
After years of working on D3D12, NVidia has finally realised it needs to do async compute.

I don't think anyone expected a closed-alpha benchmark to make any noise before it even hits Early Access, and for such a niche market too.

I think this is just a misstep in marketing for NVIDIA.
 
I think it's just that, the way I see it, it depends on their sprint cycles for development.
So, does this mean that Maxwell 2 does have Async Shaders after all?

Now for the real question.

I read a while ago that Maxwell 2's Async Shaders are software-based rather than hardware-based (this was stated by "Mahigan" after analyzing what the Oxide dev said).

Is this true? How does it work, then?
 
No, it seems the front end of Maxwell 2 will have issues with heavy compute situations combined with graphics. This is because it doesn't have the register space or cache to keep filling the queues. Now, whether we will see that in its lifetime is up in the air; we need more DX12 games to test for that. (This is all assuming that what we have seen so far is entirely the driver's fault.)
 