DX12 Performance Discussion And Analysis Thread

Forceman, both of your results conform to the rest of the Maxwell results we've seen. Can anyone with Kepler run the new test?

xQgbquH.png

wygNux7.png
 
Please note the huge difference in GPU utilization in almost all parts of the test between TDR off and on. The most striking is the constant switching between 0% and 100% with TDR on in the Graphics, compute single commandlist part after about batch 200.

I would love for someone else with a 900 series card to try it and see if they get the same GPU utilization results with TDR on and off. I already saw a 970 result that had similar GPU utilization with TDR on.

I have no idea if this impacts compute, but it certainly impacts the results from the tool.

I've posted my results. I'm also using a 980 Ti, and even with TDR ON, GPU usage was oscillating between 0% and 100%. The only difference with TDR OFF was that the benchmark didn't crash.

CPU usage was low with TDR ON and OFF throughout the test.

I also disabled TDR by setting the TdrLevel flag to 0 in the registry HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers.

I'm using the same drivers as you, but my CPU is an i7 5960X OC @ 4.7 GHz.
 
Yes, but to be perfectly honest I didn't completely disable TDR. I just increased the delay to 10 seconds so the test wouldn't crash.

I've posted my results. I'm also using a 980 Ti, and even with TDR ON, GPU usage was oscillating between 0% and 100%. The only difference with TDR OFF was that the benchmark didn't crash.

CPU usage was low with TDR ON and OFF throughout the test.

I also disabled TDR by setting the TdrLevel flag to 0 in the registry HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers.

I'm using the same drivers as you, but my CPU is an i7 5960X OC @ 4.7 GHz.

So did you set the TdrLevel flag to 0, or just increase the delay? Just increasing the delay would still cause the preemption time-out to trigger, and you would never reach the 10 seconds required for preemption to take place.
That would explain why you still had the 0-100-0 spikes and still managed to actually finish the test.
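
In case anyone wants to script this instead of going through regedit, here's a minimal Win32 sketch of my own (not part of the tool) that writes the two documented DWORD values under GraphicsDrivers. It has to be run elevated, and a reboot is still required before either change takes effect:

Code:
// My own sketch, not the benchmark's code: sets TdrLevel (0 = timeout
// detection disabled, i.e. "TDR off") and TdrDelay (watchdog delay in
// seconds, default 2). Requires administrator rights; reboot afterwards.
#include <windows.h>
#include <cstdio>

int main()
{
    HKEY key = nullptr;
    LONG rc = RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                            "SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                            0, KEY_SET_VALUE, &key);
    if (rc != ERROR_SUCCESS) { std::printf("open failed: %ld\n", rc); return 1; }

    DWORD tdrLevel = 0;   // 0 = TDR off
    DWORD tdrDelay = 10;  // only matters if detection is left enabled

    RegSetValueExA(key, "TdrLevel", 0, REG_DWORD,
                   reinterpret_cast<const BYTE*>(&tdrLevel), sizeof(tdrLevel));
    RegSetValueExA(key, "TdrDelay", 0, REG_DWORD,
                   reinterpret_cast<const BYTE*>(&tdrDelay), sizeof(tdrDelay));

    RegCloseKey(key);
    return 0;
}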
 
PadyEos, here is your result with TDR On; it is in line with all other results (these results shouldn't be affected by TDR, as they happen before the issue occurs in the benchmark). I also included a result from your TDR Off log. For some reason, on your machine turning TDR Off seems to cause weird behavior.

xo1rSoY.png

NuqAPO0.png
 
So did you set the TdrLevel flag to 0, or just increase the delay? Just increasing the delay would still cause the preemption time-out to trigger, and you would never reach the 10 seconds required for preemption to take place.
That would explain why you still had the 0-100-0 spikes and still managed to actually finish the test.

I did both. There was no difference.
 
Oh my gosh, why is there no edit ability?

Anyway, I see basically the same thing with TDR and a 960. Here is TDR on with 353.62 drivers, and off with 355.82 (sorry, I didn't get both on the same drivers, but I doubt it matters). It's still spiky with TDR off, but not nearly as much, and that may be because the GPU is so much slower.

PadyEos, here is your result with TDR On; it is in line with all other results (these results shouldn't be affected by TDR, as they happen before the issue occurs in the benchmark). I also included a result from your TDR Off log. For some reason, on your machine turning TDR Off seems to cause weird behavior.

Well, now there are two of us (980 Ti and 960) with the same behavior on 355.82 when setting the TdrLevel flag to 0, restarting the computer, and then running the test.
 
You're right, my mistake, grabbed it from the wrong file.

The difference between Forceman's results and yours, though, is that either on or off, his show async results consistent with others' (at least in the two logs he posted), whereas yours show weird behavior halfway through. I'm not saying your results aren't valid based upon the test, just that it seems something specific to your system may be causing different results.
 
Comparing overclocked VS stock; mine starts off scaling nicely.
Heavily overclocked (1654MHz GPU / 4150MHz RAM) VS stock MSI (1351MHz / 3505MHz RAM): the stock clocks are ~18% lower, and the overclocked completion times are correspondingly ~18% shorter.
Compute only @ 512 finishes up around 66ms VS 81ms
Graphics only: 19.70ms (85.15G pixels/s) VS 23.45ms (71.55G pixels/s)
Graphics + compute: 85.05ms (19.73G pixels/s) {89.47 G pixels/s} VS 104.33ms (16.08G pixels/s) {74.11 G pixels/s}

Graphics, compute single commandlist:
Overclocked, it finishes all 512 every time: 2759.19ms (0.61G pixels/s)
Stock refuses to finish. At 487 it reports 0.24ms (7076.43G pixels/s) VS overclocked at 487: 2667.79ms (0.63G pixels/s)

Can someone explain that to me (or did I miss that in the thread)?

I repeated it several times. Watching Afterburner, CPU load idles throughout the task. The main GTX 980 clocks up; the second GTX 980 remains at its reference clock bin (1190MHz) with the "normal" trivial amount of GPU load. (Sometimes it has 5-15% load, which is common for the second card in SLI.)


i7-5930K @ 4.5ghz (27x166.67)
32gb G.Skill DDR4-3001 T1, 13-14-14-34-278
2x MSI GTX-980 4G Frozr
Samsung XP941 256gb M.2 (OS/App drive, 800mb/s read 600mb/s write)
2x RAID-0 Mushkin Skorpeon Delux 512gb (4.26gb/s read 3.86gb/s write)
 

Attachments

  • gtx-980_1351mhz_3505mhz_perf.txt
    916.8 KB · Views: 2
  • gtx-980_1654mhz_4150mhz_perf.txt
    975 KB · Views: 3
Pfff... running out of ideas now... The last one I have: did you create a DWORD or a QWORD value? I used a DWORD.
Other than that, I see nothing else that could be different.


You're right, my mistake, grabbed it from the wrong file.

The difference between Forceman's results and yours, though, is that either on or off, his show async results consistent with others' (at least in the two logs he posted), whereas yours show weird behavior halfway through. I'm not saying your results aren't valid based upon the test, just that it seems something specific to your system may be causing different results.

I have no explanation for that. My numbers don't seem that far off. The sum of both series runs is almost always equal to the async time. Similar stuff with Forceman's 960. Maybe knowing the formulas behind the graph would help.
 
Toysrme, the single commandlist causes a TDR (timeout) on Maxwell because of how long each iteration starts to take. It seems that heavily OC'd it gets done just fast enough to avoid the timeout, I guess.
 
Your results with TDR on are fine; it's with TDR off that they get a little screwy (compared to other Maxwell cards). These are your results with TDR off from earlier in the thread; I didn't plot your new results, but they look similar unless I picked the wrong folder again. Series execution is graphics + compute added. Asynchronous is the result of the asynchronous test. These values are normalized to whichever of graphics or compute is slower, so that if the orange line is at 1, it means it is running the two loads (graphics and compute) perfectly asynchronously and in parallel. Dips below 1 shouldn't happen, but are probably just an artifact of the timestamps or something.

vevF50L.png
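
For anyone puzzling over the normalization, this is just my reading of the description above (not the actual plotting script). The numbers are illustrative; the async time here is hypothetical:

Code:
// Both curves are divided by whichever single-queue pass is slower, so 1.0 on
// the async line means the combined run took no longer than the slower of the
// two loads on its own. Numbers below are examples only.
#include <algorithm>
#include <cstdio>

int main()
{
    double graphicsMs = 19.70;                   // graphics-only time
    double computeMs  = 66.0;                    // compute-only time
    double seriesMs   = graphicsMs + computeMs;  // both loads run back to back
    double asyncMs    = 70.0;                    // hypothetical async time

    double slower = std::max(graphicsMs, computeMs);
    std::printf("series normalized: %.2f\n", seriesMs / slower); // ~1.30
    std::printf("async  normalized: %.2f\n", asyncMs  / slower); // ~1.06; 1.00 = perfect overlap
    return 0;
}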
 
So with TDR on my graphs are consistent with the others with TDR on.

But with TDR off they aren't consistent with the others? May I ask everyone who contributed graphs with TDR off: did you still get the 0-100-0-100 GPU load switching with it off? If yes, I suspect your TDR wasn't actually off and you were still experiencing the 2-second TDR time-out, even if the test finished successfully without the driver crashing.

Can I get specific confirmation that Forceman's TDR-off graph (he was the only one able to reproduce TDR off without the 0-100-0 GPU load switching) is also not matching mine?
 
AMD Fury async results look disappointing compared to other GCN cards. Could it be related to Fiji's power limit? Async compute -> more work done in parallel -> more watts -> Fiji reducing shader frequencies, or stalling some shaders, or whatever. A good way to find out would be to run the tests again with an increased power limit.
 
Should I explain how ILP, or the lack of ILP due to instruction dependencies, affects performance?
[...]
Nope; moreover, I've not seen the code. The whole thing was obvious starting from MDolenc's "single lane" wording, though instruction dependencies, arithmetic instruction latencies and instruction pairing could affect performance too, but such things really require code analysis.
I thought you might have seen how it compiled.

I now see that you have linked: http://forum.ixbt.com/topic.cgi?id=10:61506-91#3150

Code:
BB0_1:
add.f32  %f9, %f57, %f56;
mul.f32  %f10, %f9, 0f3F000011;
add.f32  %f11, %f10, %f57;
fma.rn.f32  %f12, %f11, 0f3F000011, %f10;
mul.f32  %f13, %f12, 0f3F000011;
fma.rn.f32  %f14, %f11, 0f3F000011, %f13;
fma.rn.f32  %f15, %f14, 0f3F000011, %f13;
mul.f32  %f16, %f15, 0f3F000011;
fma.rn.f32  %f17, %f14, 0f3F000011, %f16;
fma.rn.f32  %f18, %f17, 0f3F000011, %f16;
mul.f32  %f19, %f18, 0f3F000011;
fma.rn.f32  %f20, %f17, 0f3F000011, %f19;
fma.rn.f32  %f21, %f20, 0f3F000011, %f19;
mul.f32  %f22, %f21, 0f3F000011;
fma.rn.f32  %f23, %f20, 0f3F000011, %f22;
fma.rn.f32  %f24, %f23, 0f3F000011, %f22;
mul.f32  %f25, %f24, 0f3F000011;
fma.rn.f32  %f26, %f23, 0f3F000011, %f25;
fma.rn.f32  %f27, %f26, 0f3F000011, %f25;
mul.f32  %f28, %f27, 0f3F000011;
fma.rn.f32  %f29, %f26, 0f3F000011, %f28;
fma.rn.f32  %f30, %f29, 0f3F000011, %f28;
mul.f32  %f31, %f30, 0f3F000011;
fma.rn.f32  %f32, %f29, 0f3F000011, %f31;
fma.rn.f32  %f33, %f32, 0f3F000011, %f31;
mul.f32  %f34, %f33, 0f3F000011;
fma.rn.f32  %f35, %f32, 0f3F000011, %f34;
fma.rn.f32  %f36, %f35, 0f3F000011, %f34;
mul.f32  %f37, %f36, 0f3F000011;
fma.rn.f32  %f38, %f35, 0f3F000011, %f37;
fma.rn.f32  %f39, %f38, 0f3F000011, %f37;
mul.f32  %f40, %f39, 0f3F000011;
fma.rn.f32  %f41, %f38, 0f3F000011, %f40;
fma.rn.f32  %f42, %f41, 0f3F000011, %f40;
mul.f32  %f43, %f42, 0f3F000011;
fma.rn.f32  %f44, %f41, 0f3F000011, %f43;
fma.rn.f32  %f45, %f44, 0f3F000011, %f43;
mul.f32  %f46, %f45, 0f3F000011;
fma.rn.f32  %f47, %f44, 0f3F000011, %f46;
fma.rn.f32  %f48, %f47, 0f3F000011, %f46;
mul.f32  %f49, %f48, 0f3F000011;
fma.rn.f32  %f50, %f47, 0f3F000011, %f49;
fma.rn.f32  %f51, %f50, 0f3F000011, %f49;
mul.f32  %f52, %f51, 0f3F000011;
fma.rn.f32  %f53, %f50, 0f3F000011, %f52;
fma.rn.f32  %f54, %f53, 0f3F000011, %f52;
mul.f32  %f56, %f54, 0f3F000011;
fma.rn.f32  %f55, %f53, 0f3F000011, %f56;
mul.f32  %f57, %f55, 0f3F000011;
add.s32  %r6, %r6, 32;
setp.ne.s32  %p1, %r6, 0;
@%p1 bra  BB0_1;

That is a partially unrolled loop (not fully unrolled), which will be vastly faster than AMD's naive compilation. It is not as advanced as what I wrote about back in post 167, but even this needs the compiler to spot the opportunity.

There is no co-issue here.

https://forum.beyond3d.com/posts/1869076/
I wonder why fadd+fmul sequences were not converted to FMA and why the loop was not unrolled
A combination of the compiler being stupid and, I think, FMA not being strictly the same.
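
To make the "partially unrolled" point concrete: the rolled loop carries one value through a dependent mul-add per iteration, while the unrolled form does several iterations per trip (the PTX above does 32), saving loop-counter and branch work even though the chain itself stays serial. And on FMA: fusing the mul and add rounds once instead of twice, so the result can differ in the last bit, which is why a strictly conforming compiler won't contract it on its own. Here's a hand-written C++ illustration of the unrolling idea (illustrative only, not the benchmark's actual kernel):

Code:
// Illustration only, not the benchmark's kernel.
#include <cstdio>

float rolled(float v, int n)
{
    for (int i = 0; i < n; ++i)
        v = (v + v) * 0.5000001f;      // one dependent mul-add per iteration
    return v;
}

float unrolledBy4(float v, int n)      // assumes n is a multiple of 4
{
    for (int i = 0; i < n; i += 4) {   // four iterations per loop trip:
        v = (v + v) * 0.5000001f;      // fewer branches and counter updates,
        v = (v + v) * 0.5000001f;      // but the dependency chain stays serial
        v = (v + v) * 0.5000001f;
        v = (v + v) * 0.5000001f;
    }
    return v;
}

int main()
{
    std::printf("%g %g\n", rolled(1.0f, 32), unrolledBy4(1.0f, 32));
    return 0;
}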
 
A single CU can run multiple kernels, and since the test is using async compute, we can assume that multiple queues produce work for the same CU. This means that it can interleave SALU from one kernel and VALU from the other, letting both proceed at twice the rate.
You asserted: "If the compiler is good, [...] emitting both scalar + vector in an interleaved "dual issue" way (the CU can issue both at the same cycle, doubling the throughput)." That is clearly not the same as the core scheduling an SALU op from one kernel as well as a VALU op from another. The compiler hasn't emitted "dual issue" in this case.
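
For anyone not following the queue talk: "async compute" in D3D12 means submitting to a second, compute-type command queue alongside the direct (graphics) queue. A minimal sketch of just that setup, assuming nothing about how the benchmark actually structures its submission:

Code:
// Assumed setup, not the benchmark's code: one direct (graphics) queue and one
// compute queue on the same device. Work on the compute queue may overlap with
// the direct queue if the hardware/driver allow it.
#include <d3d12.h>
#include <wrl/client.h>
#pragma comment(lib, "d3d12.lib")
using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&device))))
        return 1;  // no capable adapter

    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;       // graphics + compute + copy

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only

    ComPtr<ID3D12CommandQueue> gfxQueue, computeQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
    return 0;
}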

Of course there are kernels that are fastest with thread blocks of 64 or 1024 threads. But these are fast because of some other bottleneck; you are most likely trading some GPU utilization for other improvements (like reduced memory traffic). Also, if the GCN OpenCL compiler is very smart, it could compile thread groups of 64 threads differently (the scalar unit could be exploited more).
The important case is that all work-group synchronisations cost 0 cycles with 64 work items; barriers are elided. For example, it makes heavy-duty LDS usage that bit faster.
 
The graphs made by ka_rf are very nice.
I only found one user, PadyEos, who ran the test with a 980 Ti while also logging his CPU usage.
Here's ka_rf's plot of his results:

3NrhGRo.png



We see that somewhere near 256 kernels the "Async" results do seem to approach the Sync results, suggesting there could be some async compute happening after all.

The graph actually shows that the compute-only line (blue) diverges from the overall trend, slowing down in that region. The orange line hasn't shifted downwards from its trend, which is what would indicate async compute occurring; the blue line has just wobbled into slowness. The blue line returns to its trend near 360.

There's no async here.

However, after looking at RedditUserB's suggestion, I compared the GPU usage to the CPU usage logs that PadyEos took, and here's what I found (poorly glued together in Paint):

aN7d4lC.png


By the time the Async test starts (which should be around the middle of the test), CPU usage jumps towards ~50% on all cores and threads.
Could that be the start of the single command list run?
 