What would this measure, since the threads are not independent at a lane level, but must execute on SIMD hardware?
The capabilities of hardware and software to actually handle interleaved execution efficiently, rather than just activating formerly unused SIMD units. The goal isn't just to issue at least one wavefront to each SIMD - it is to achieve full utilization even in the presence of pipeline stalls and other limiting factors.
With the exception of dependencies between queued commands visible to the API, what do these other elements measure as far as how freely the queues can issue past each other?
Individually: Nothing, at least nothing you couldn't read from the specs.
When issued concurrently: The actual possible gains from async scheduling.
The entire idea behind async compute is to increase utilization by interleaving different workloads, which lets you fully utilize the GPU even without having optimized each individual shader for the pipeline of every GPU architecture.
However, there are many limitations which can prevent actual concurrent execution at a per-SIMD level, let alone efficient execution: register file usage, L1 and L2 cache misses, shared command paths and so on. So two workloads can't be assumed suitable for concurrent execution by default; each pair actually needs to be tested individually.
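To make the register-file limit concrete, here is a rough sketch of the well-known GCN occupancy calculation (the 256-VGPR-per-lane register file and the 10-wavefront cap are GCN figures; the function name is ours). A register-hungry kernel leaves no room on a SIMD for wavefronts from a second workload, no matter what the queues do:

```python
def waves_per_simd(vgprs_per_thread, vgpr_budget=256, max_waves=10):
    """Rough GCN occupancy sketch: each SIMD's register file holds
    256 VGPRs per lane, and at most 10 wavefronts can be resident.
    A kernel using N VGPRs per thread allows floor(256/N) resident
    wavefronts, capped at 10."""
    return min(max_waves, vgpr_budget // vgprs_per_thread)

# A lean kernel (24 VGPRs) leaves headroom for interleaving a
# second workload; a register-hungry one (128 VGPRs) barely fits
# two wavefronts of its own.
print(waves_per_simd(24))   # 10
print(waves_per_simd(128))  # 2
```

This is only one of the limits mentioned above - cache pressure and shared command paths don't show up in a static calculation like this at all, which is exactly why pairs of workloads have to be measured.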
The multi-queue version of the test did not reveal a major difference in the behavior of GCN, although I didn't see more than a few batches posted.
There's no guarantee of uniform queue loading, and AMD has admitted the number of queues is currently overkill, so restricting an ACE's ability to track resources to a small fraction of the GPU's throughput might hamstring it.
I actually don't know for sure yet, either. That's why I suggested including these parameters as variables in the testbench.
The lower limit for Nvidia is 64? I thought it was half that.
You are somewhat right; I oversimplified that statement. The warp size is 32 for Nvidia, so 32 is also the optimal workgroup size. However, with 2048 shaders and only 32 concurrent kernels, you can only achieve full shader saturation with at least 64 threads per kernel on average. You can still launch a kernel with a workgroup size of 32, but there shouldn't be fewer than 64 threads per kernel on average.
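The arithmetic behind that figure is simple; the numbers below (2048 shaders, 32 concurrent kernel slots) are the ones from the post above:

```python
shader_units = 2048        # CUDA cores on the GPU in question
concurrent_kernels = 32    # hardware limit on simultaneously resident kernels
warp_size = 32             # Nvidia warp size

# With every kernel slot occupied, each kernel must contribute this
# many threads on average to cover all shader units:
min_avg_threads = shader_units // concurrent_kernels
print(min_avg_threads)                # 64
print(min_avg_threads // warp_size)   # 2 warps per kernel on average
```

So a workgroup size of 32 is fine, but a kernel averaging a single warp can't saturate the chip once the concurrent-kernel limit is the bottleneck.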
Either way, since scheduling and launch are measured at wavefront or warp granularity, what insight does a fully utilized SIMD give towards measuring how freely queued launches can move past each other?
Well, out-of-order launches, i.e. scheduling full wavefronts, are only one half of async compute. The actual concurrent execution of multiple wavefronts per SIMD is the other half.
You can construct scenarios where only one or the other aspect will limit you, but the goal is to find the point where you are limited by neither, only by the raw performance of the card. That is the sweet spot you want to hit.
In order to stay on the original topic:
Effective latency is also a function of both. To keep it down, you need your kernel scheduled ASAP, and you need efficient concurrent execution on SIMD units which are possibly already under full load. You can never count on finding completely idle SIMD units, nor can you guarantee that scheduling is even possible at any given moment, due to various constraints.
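The way both halves feed into effective latency can be sketched with a toy model (entirely illustrative - the function name and the contention factor are made up, not measured quantities):

```python
def effective_latency(queue_wait, exec_time, simd_contention):
    """Toy model: effective latency is the wait for a free scheduling
    slot plus the execution time stretched by contention on the SIMDs
    the wavefronts land on (contention >= 1.0; 1.0 means idle SIMDs)."""
    return queue_wait + exec_time * simd_contention

# Even with instant scheduling, busy SIMDs stretch the kernel...
print(effective_latency(0.0, 1.0, 2.0))  # 2.0
# ...and idle SIMDs don't help if scheduling is blocked.
print(effective_latency(2.0, 1.0, 1.0))  # 3.0
```

Neither term can be driven to zero in practice, which is the point: a latency-sensitive kernel has to tolerate both a scheduling delay and sharing the SIMDs it lands on.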