Just for clarification: by "the whole thing" do you mean the start of the batch of n dispatches, or the start of that particular test mode?

P.S.: Numbers in [] are GPU timestamps from the beginning of the whole thing to the end of the n-th dispatch, converted to ms. The fillrate in {} is calculated from the GPU timestamps taken before the clear and after all the draws.
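To make the arithmetic behind those numbers concrete, here is a small host-side sketch. The variable names and the exact tick-to-ms conversion are my assumptions, not taken from the benchmark; the point is simply that fillrate is pixels written divided by the elapsed time between the two timestamps.

#include <cstdint>

// Hypothetical helpers mirroring the description above; not the benchmark's code.
double ticks_to_ms(uint64_t ticks, uint64_t timestamp_hz)
{
    // Raw GPU timestamp ticks -> milliseconds, given the timestamp frequency.
    return 1000.0 * static_cast<double>(ticks) / static_cast<double>(timestamp_hz);
}

double fillrate_gpixels_per_s(uint64_t t_before_clear, uint64_t t_after_draws,
                              uint64_t timestamp_hz, uint64_t pixels_written)
{
    // Elapsed GPU time between the pre-clear and post-draw timestamps.
    double seconds = static_cast<double>(t_after_draws - t_before_clear)
                   / static_cast<double>(timestamp_hz);
    return pixels_written / seconds / 1e9;  // gigapixels per second
}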
https://forum.beyond3d.com/posts/1869076/
I wonder why the fadd+fmul sequences were not converted to FMA, and why the loop was not unrolled.
Here is the same code in PTX - http://forum.ixbt.com/topic.cgi?id=10:61506-91#3150
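For readers unfamiliar with the transformation being asked about, here is a minimal CUDA sketch (illustrative only, not the shader from the thread) of a multiply-add that the compiler may or may not contract into a single fused multiply-add:

__global__ void axpy_like(const float* x, float* y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Written as a separate fmul followed by an fadd. With nvcc's default
        // --fmad=true this is typically contracted into one FMA in the emitted
        // PTX/SASS; compiling with --fmad=false keeps the two operations separate.
        y[i] = a * x[i] + y[i];

        // The fusion can also be requested explicitly:
        // y[i] = __fmaf_rn(a, x[i], y[i]);
    }
}

Whether a GCN shader compiler performs the same contraction on this kind of sequence is exactly the question raised above.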
Code generation for GCN has in other cases appeared to be more conservative.
Perhaps the heuristics favor a smaller code footprint for better occupancy and cache presence, since switching to another wavefront to cover periodic stalls is what the architecture leans on.
The branch overhead is modest for GCN, and the upshot of the 4-cycle execution loop is that from the software point of view there is generally no forwarding latency.
Other architectures like Maxwell would have different trade-offs, since their time window for resolving their instruction latencies and overheads is significantly shorter, and they have not cut down single-threaded performance quite as much.
Naively, it's utilizing GCN twice as badly as Maxwell, at 1/64 versus 1/32. That seems minor enough for the test in question.

Single-lane compute work needs the scheduler to spawn a single wave on all GPUs. On AMD the wave is 64 wide, meaning the architecture is designed to run and manage fewer waves (as each does more work). If you spawn single-lane work, you are more likely to end up under-utilizing GCN compared to other GPUs.
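A minimal CUDA sketch of that situation (illustrative only; the thread's benchmark uses a different API, and the 1/64 figure applies to GCN-style wave64 hardware rather than to CUDA devices): a one-thread launch still occupies a full SIMD group.

__global__ void single_lane_kernel(int* out)
{
    // The only thread in the only block does the work, but the hardware still
    // schedules a full warp (32 lanes on NVIDIA) or wave (64 lanes on GCN),
    // so 31 of 32, or 63 of 64, lanes sit idle.
    *out = 42;
}

// Host side: the smallest possible dispatch.
// single_lane_kernel<<<1, 1>>>(device_out_ptr);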
If there are other bottlenecks besides it that are being hit, I would like to find them.
I would want to find more cases where things break. The vendors describe things with varying degrees of opacity, and sometimes it's more illuminating to break the things they say work just great when you're not looking.

Work group sizes can also expose (academic) bottlenecks, since resource (GPR, LDS) acquisition and release is done at work group granularity.
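A sketch of why that granularity matters, in CUDA terms (shared memory standing in for LDS; the specific sizes are assumptions for illustration): the per-group allocation is reserved when the group launches and released only when the whole group retires, so a large allocation caps how many groups can be resident at once.

__global__ void lds_hungry_kernel(float* out)
{
    // 8192 floats = 32 KiB of shared memory reserved for the whole block.
    // With, say, 64 KiB available per SM/CU, at most two such groups can be
    // resident at the same time, regardless of how busy their lanes are.
    __shared__ float scratch[8192];

    int tid = threadIdx.x;
    scratch[tid] = static_cast<float>(tid);
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = scratch[tid];
}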
Unfortunately, we could have a taste test between GPUs and the data would be twisted into a debate on which vendor is more salty.

I am just saying that using a workload that is not realistic can cause various bottlenecks that will not matter in a realistic scenario. Some people seem to be drawing all kinds of conclusions based on the results of this thread.
In games, sure.

This is a downside of the GCN architecture, but it almost never matters in real software.
There's other software that GCN had ambitions for once upon a time.
It may also have implications for the hardware and what directions it can take in the future, but the importance of that is a matter of personal preference.
That makes GPUs more honest than most marketing departments.

This benchmark has relevance for mixed, tightly interleaved CPU<->GPU workloads. However, it is important to realize that the current benchmark does not just measure async compute; it measures the whole GPU pipeline latency. GPUs are good at hiding this latency internally, but are not designed to hide it from external observers (such as the CPU).
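To illustrate what "latency visible to an external observer" means in practice, here is a small, self-contained CUDA sketch (the thread's benchmark uses a different API; this just expresses the same idea): the CPU wall-clock time around a trivial kernel includes the entire submit, schedule, execute and signal path, not just the kernel's execution time.

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main()
{
    // Warm-up launch so one-time driver overheads don't dominate the measurement.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();      // CPU blocks until the GPU signals completion
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("CPU-observed round trip for an empty kernel: %.1f us\n", us);
    return 0;
}

The measured number is dominated by the pipeline latency the GPU hides from its own execution units but cannot hide from the CPU, which is the distinction being made above.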