Please note the huge difference in GPU utilization in almost all parts of the test between TDR off and on. The most striking is the constant switching between 0% and 100% with TDR on in the "Graphics, compute single commandlist" part after about batch 200.
I would love for someone else with a 900 series card to try it and see if they get the same GPU utilization results with TDR on and off. I already saw a 970 result that had similar GPU utilization with TDR on.
I have no idea if this impacts compute, but it certainly impacts the results from the tool.
"For this noob here, can I delete or amend my posts?"
You can after posting a certain number (not sure of the exact number), so you are well on your way.
Yes, but to be perfectly honest I didn't completely disable TDR. I just increased the delay to 10 seconds so the test wouldn't crash.
I've posted my results. I'm also using a 980 Ti, and even with TDR ON, GPU usage was oscillating between 0% and 100%. The only difference with TDR OFF is that the benchmark didn't crash.
CPU usage was low with TDR ON and OFF throughout the test.
I also disabled TDR by setting the TdrLevel value to 0 under the registry key HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers.
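For reference, a minimal .reg sketch of the two TDR-related values discussed in this thread (both are DWORD values under the key above, and a reboot is needed for them to take effect). TdrLevel = 0 turns timeout detection off entirely, while TdrDelay only lengthens the timeout (10 seconds here, matching the delay mentioned earlier):

Windows Registry Editor Version 5.00

; Option 1: disable TDR completely (TdrLevel = 0 means detection disabled)
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers]
"TdrLevel"=dword:00000000

; Option 2: keep TDR enabled but raise the timeout to 10 seconds
;[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers]
;"TdrDelay"=dword:0000000a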
Using the same drivers as you, but my CPU is an i7 5960X OC'd @ 4.7 GHz.
So did you set TdrLevel to 0, or just increase the delay? Because just increasing the delay would still cause the preemption time-out to trigger, and you would never reach the 10 seconds required for preemption to take place.
That would explain why you still had the 0-100-0 spikes and managed to actually finish the test.
I did both. There was no difference.
Oh my gosh, why is there no edit ability?
Anyway, I see basically the same thing with TDR and a 960. Here is TDR on with the 353.62 drivers, and off with 355.82 (sorry, I didn't get both on the same drivers, but I doubt it matters). It's still spiky with TDR off, but not nearly as much, and that may be because the GPU is so much slower.
PadyEos, here is your result with TDR on; it is in line with all the other results (these results shouldn't be affected by TDR, as they happen before the issue occurs in the benchmark). I also included a result from your TDR-off log. For some reason, on your machine turning TDR off seems to cause weird behavior.
Did you also restart the computer after setting TdrLevel flag to 0?
"Yes."
Pfff... running out of ideas now... The last one I have: did you create a DWORD or a QWORD value? I used DWORD.
You're right, my mistake, grabbed it from the wrong file.
The difference between Forceman's results and yours, though, is that whether TDR is on or off, his show async results consistent with everyone else's (at least in the two logs he posted), whereas yours show weird behavior halfway through. I'm not saying your results aren't valid based on the test, just that something specific to your system may be causing the different results.
I thought you might have seen how it compiled. Should I explain how ILP, or the lack of ILP due to instruction dependencies, affects performance?
[...]
Nope; moreover, I haven't seen the code. The whole thing was obvious starting from MDolenc's "single lane" wording, though instruction dependencies, arithmetic instruction latencies, and instruction pairing could affect performance too; such things really require code analysis.
BB0_1:
add.f32 %f9, %f57, %f56;
mul.f32 %f10, %f9, 0f3F000011;
add.f32 %f11, %f10, %f57;
fma.rn.f32 %f12, %f11, 0f3F000011, %f10;
mul.f32 %f13, %f12, 0f3F000011;
fma.rn.f32 %f14, %f11, 0f3F000011, %f13;
fma.rn.f32 %f15, %f14, 0f3F000011, %f13;
mul.f32 %f16, %f15, 0f3F000011;
fma.rn.f32 %f17, %f14, 0f3F000011, %f16;
fma.rn.f32 %f18, %f17, 0f3F000011, %f16;
mul.f32 %f19, %f18, 0f3F000011;
fma.rn.f32 %f20, %f17, 0f3F000011, %f19;
fma.rn.f32 %f21, %f20, 0f3F000011, %f19;
mul.f32 %f22, %f21, 0f3F000011;
fma.rn.f32 %f23, %f20, 0f3F000011, %f22;
fma.rn.f32 %f24, %f23, 0f3F000011, %f22;
mul.f32 %f25, %f24, 0f3F000011;
fma.rn.f32 %f26, %f23, 0f3F000011, %f25;
fma.rn.f32 %f27, %f26, 0f3F000011, %f25;
mul.f32 %f28, %f27, 0f3F000011;
fma.rn.f32 %f29, %f26, 0f3F000011, %f28;
fma.rn.f32 %f30, %f29, 0f3F000011, %f28;
mul.f32 %f31, %f30, 0f3F000011;
fma.rn.f32 %f32, %f29, 0f3F000011, %f31;
fma.rn.f32 %f33, %f32, 0f3F000011, %f31;
mul.f32 %f34, %f33, 0f3F000011;
fma.rn.f32 %f35, %f32, 0f3F000011, %f34;
fma.rn.f32 %f36, %f35, 0f3F000011, %f34;
mul.f32 %f37, %f36, 0f3F000011;
fma.rn.f32 %f38, %f35, 0f3F000011, %f37;
fma.rn.f32 %f39, %f38, 0f3F000011, %f37;
mul.f32 %f40, %f39, 0f3F000011;
fma.rn.f32 %f41, %f38, 0f3F000011, %f40;
fma.rn.f32 %f42, %f41, 0f3F000011, %f40;
mul.f32 %f43, %f42, 0f3F000011;
fma.rn.f32 %f44, %f41, 0f3F000011, %f43;
fma.rn.f32 %f45, %f44, 0f3F000011, %f43;
mul.f32 %f46, %f45, 0f3F000011;
fma.rn.f32 %f47, %f44, 0f3F000011, %f46;
fma.rn.f32 %f48, %f47, 0f3F000011, %f46;
mul.f32 %f49, %f48, 0f3F000011;
fma.rn.f32 %f50, %f47, 0f3F000011, %f49;
fma.rn.f32 %f51, %f50, 0f3F000011, %f49;
mul.f32 %f52, %f51, 0f3F000011;
fma.rn.f32 %f53, %f50, 0f3F000011, %f52;
fma.rn.f32 %f54, %f53, 0f3F000011, %f52;
mul.f32 %f56, %f54, 0f3F000011;
fma.rn.f32 %f55, %f53, 0f3F000011, %f56;
mul.f32 %f57, %f55, 0f3F000011;
add.s32 %r6, %r6, 32;
setp.ne.s32 %p1, %r6, 0;
@%p1 bra BB0_1;
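To tie this back to the ILP discussion above: every mul/fma in that loop consumes the result of the instruction right before it, so within a single thread there is almost nothing the scheduler can overlap. A rough, hypothetical C sketch of that kind of serially dependent recurrence (not the benchmark's actual source; the function name and the ~0.5 constant are stand-ins):

/* Hypothetical stand-in for the dependent chain in the PTX above: each
   statement needs the previous result, so one thread exposes essentially no
   instruction-level parallelism and throughput is bound by FMA/MUL latency
   rather than by issue width. */
static float dependent_chain(float x, float y, int iters)
{
    const float k = 0.5f;        /* stands in for the 0f3F000011 constant (~0.5) */
    for (int i = 0; i < iters; ++i) {
        float a = (x + y) * k;   /* depends on the previous x and y */
        float b = (a + x) * k;   /* depends on a */
        x = a;
        y = b;
    }
    return x + y;
}

Latency in such a chain can only be hidden by having many threads (wavefronts/warps) in flight at once, which is why the "single lane" observation matters for interpreting the timings.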
A combination of the compiler being stupid and, I think, FMA not being strictly the same. https://forum.beyond3d.com/posts/1869076/
I wonder why the fadd+fmul sequences were not converted to FMA and why the loop was not unrolled.
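On the "FMA isn't strictly the same" point: a fused multiply-add rounds the exact a*b + c once, while a separate multiply and add round twice, so contracting fadd+fmul pairs into FMAs can change results; that is one reason a compiler may be conservative about doing it automatically. A small illustrative C sketch (the values are picked purely so the difference is visible):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 1.0f + 0x1p-12f;   /* chosen so that a*b is not exactly representable */
    float b = a;
    float c = -1.0f;

    volatile float prod = a * b;    /* rounded to float here (volatile also keeps the
                                       compiler from contracting this into an FMA)   */
    float mul_add = prod + c;       /* ...and rounded again here: two roundings      */
    float fused   = fmaf(a, b, c);  /* exact a*b + c, rounded only once              */

    printf("mul+add: %.10g\n", (double)mul_add);  /* 0.00048828125   (2^-11)         */
    printf("fma    : %.10g\n", (double)fused);    /* 0.0004883408546 (2^-11 + 2^-24) */
    return 0;
}

This single-rounding difference is why compilers expose a switch for contraction (e.g. -ffp-contract in GCC/Clang).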
You asserted: "If the compiler is good, [...] emitting both scalar + vector in an interleaved "dual issue" way (the CU can issue both at the same cycle, doubling the throughput)." That is clearly not the same as the core scheduling an SALU op from one wavefront as well as a VALU op from another. The compiler hasn't emitted "dual issue" in this case.

A single CU can run multiple kernels, and since the test is using async compute, we can assume that multiple queues feed work to the same CU. This means it can interleave SALU ops from one kernel with VALU ops from the other, letting both progress at up to twice the rate.
The important point is that all work-group synchronisations cost 0 cycles with 64 work items: the barriers are elided. For example, it makes heavy-duty LDS usage that bit faster.

Of course there are kernels that are fastest with thread blocks of 64 or 1024 threads, but those are fast because of some other bottleneck; you are most likely trading some GPU utilization for other improvements (like reduced memory traffic). Also, if the GCN OpenCL compiler is very smart, it could compile thread groups of 64 threads differently (the scalar unit could be exploited more).
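To make the barrier-elision point concrete, here is a minimal illustrative OpenCL C kernel (an assumed example, not code from this thread). Launched with a work-group size of 64 on GCN, the whole group is a single wavefront executing in lockstep, so the compiler can drop the barrier() calls; with larger groups they are real synchronisation points:

/* Assumed example: a simple LDS reduction. With a local size of 64 on GCN the
   work-group is one wavefront, so barrier() costs nothing; with 128 or more
   work items each barrier really has to synchronise wavefronts. */
__kernel void reduce_local(__global const float *in,
                           __global float *out,
                           __local  float *scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];

    for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        barrier(CLK_LOCAL_MEM_FENCE);   /* elided when the group is one wavefront */
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}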
I only found one user, PadyEos, who ran the test with a 980 Ti while also logging his CPU usage.
Here's kar_rf's plot of his results:
We see that somewhere near 256 kernels the "Async" results do seem to approach the Sync results, suggesting there could be some async compute happening after all.
Could that be the start of the single command list run? However, after looking at RedditUserB's suggestion, I compared the GPU usage to the CPU usage logs that PadyEos took, and here's what I found (crudely glued together in Paint):
By the time the Async test starts (which should be around the middle of the test), CPU usage jumps to ~50% on all cores and threads.