DX12 Performance Discussion And Analysis Thread

Missed this one. Why is Fiji getting similar results to Maxwell 2, then?

It's not.

You'll probably resort to a lot of mental gymnastics to claim it is, though.
Good luck with that, I guess. ;-)
 
Wow, you guys can make simple things complicated. I thought it was obvious from the beginning.
Shame you didn't flesh out your thoughts.

I've been pondering your comment about dual-issue and I still don't understand how that's relevant. Maybe you can explain that. Do you know how this code compiles on Maxwell 2?
 
Apart from that I can't help wondering if the [numthreads(1, 1, 1)] attribute of the kernel is making AMD do something additionally strange.
If the compiler is good, it could either skip the vector units completely by emitting pure scalar-unit code (saving power) or emit both scalar + vector in an interleaved "dual issue" way (the CU can issue both in the same cycle, doubling the throughput).

Benchmarking thread groups of under 256 threads on GCN is not going to lead to any meaningful results, as you would (almost) never use smaller thread groups in real (optimized) applications. I would suspect a performance bug if a kernel's thread group size isn't one of {256, 384, 512}. Single-lane thread groups result in less than 1% of meaningful work on GCN. Why would you run code like this on a GPU (instead of using the CPU)? Not a good test case at all. No GPU is optimized for this case.
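To put the single-lane point in rough numbers, here's a back-of-envelope sketch. The 64-wide wavefront is real GCN; the helper function and framing are mine, and this only counts vector-lane utilization, so the whole-CU figure is even lower:

```python
# Back-of-envelope lane utilization for small thread groups on GCN.
# The 64-wide wavefront is real GCN; everything else is my own framing.
import math

WAVEFRONT_WIDTH = 64  # GCN executes threads in 64-wide wavefronts

def lane_utilization(threads_per_group: int) -> float:
    """Fraction of vector lanes doing useful work for one thread group."""
    waves = math.ceil(threads_per_group / WAVEFRONT_WIDTH)
    return threads_per_group / (waves * WAVEFRONT_WIDTH)

print(lane_utilization(1))    # 0.015625 -> ~1.6% of lanes are active
print(lane_utilization(256))  # 1.0 -> fully packed wavefronts
```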

Also, I question the need to run test cases with tens (or hundreds) of compute queues. The biggest gains can be had with one or two additional queues (running work that hits different bottlenecks in each). More queues will just cause problems (cache thrashing, etc.).
 
My bad, I missed that.

I do not understand the result of this new test. How can I get a similar result to a 290 with my 7790? The pixels pushed per second is lower, but the execution time is similar.
I think you have the best AMD results so far for the compute portion of this test.

I did have a theory that the single shader engine might make the Cape Verde chip better, but that was based on my belief that the compute should finish much faster. Perhaps there is something there, but it's not blatantly obvious to me right now...

I suspect your card is clocked at more than 1GHz since my 7770 is 1GHz.
 
Seems like ~50% GPU usage on Fury X.

AS.jpg
 

Attachments

  • Perf.zip (133.6 KB)
I think you have the best AMD results so far for the compute portion of this test.

I did have a theory that the single shader engine might make the Cape Verde chip better, but that was based on my belief that the compute should finish much faster. Perhaps there is something there, but it's not blatantly obvious to me right now...

I suspect your card is clocked at more than 1GHz since my 7770 is 1GHz.

Indeed, my card is clocked at 1.1 GHz (the R7 260X reference clock). So it seems the result on the compute part of the second test for GCN depends only on the clock and not at all on the number of compute units you have.
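That clock-only scaling is what you'd expect from a serially dependent, single-thread-group kernel: it can occupy only one CU, so the rest of the chip idles. A toy model (the instruction count and one-dependent-op-per-cycle assumption are made up for illustration):

```python
# Toy latency model for a serially-dependent, single-thread-group kernel:
# only one CU can work on it, so adding CUs changes nothing and only the
# clock matters. Instruction count and issue rate are made-up numbers.

def kernel_time_ms(instructions: int, clock_ghz: float, compute_units: int) -> float:
    # A single thread group is resident on exactly one CU; the other
    # (compute_units - 1) CUs sit idle, so the parameter is unused.
    cycles = instructions  # assume one dependent instruction per cycle
    return cycles / (clock_ghz * 1e9) * 1e3

# Same kernel, different GPUs: a 14-CU part at 1.1 GHz vs a 40-CU part at 1.0 GHz
print(kernel_time_ms(1_000_000, 1.1, 14))  # ~0.91 ms
print(kernel_time_ms(1_000_000, 1.0, 40))  # 1.0 ms (more CUs, slower clock)
```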
 
Benchmarking thread groups of under 256 threads on GCN is not going to lead to any meaningful results, as you would (almost) never use smaller thread groups in real (optimized) applications. I would suspect a performance bug if a kernel's thread group size isn't one of {256, 384, 512}. Single-lane thread groups result in less than 1% of meaningful work on GCN. Why would you run code like this on a GPU (instead of using the CPU)? Not a good test case at all. No GPU is optimized for this case.
The test case is whether graphics processing time can overlap with compute; is there an expectation that a 256-thread group would change the verdict for the GPUs running it?
A single lane seems like a base case that is generally equivalent across GPUs with differing SIMD widths when it comes to testing whether the two types of threads can overlap their lifespans without customizing the code.
There was puzzlement about why the latency was so disparate for GCN in the lowest cases, which is largely explained by having initially overlooked the 4-cycle wavefront.
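For reference, the 4-cycle wavefront falls straight out of GCN's SIMD width (these are real GCN numbers; the framing is mine):

```python
# Why tiny dispatches still pay a fixed cost on GCN: each SIMD is 16
# lanes wide, so a 64-lane wavefront issues every instruction over
# 64 / 16 = 4 cycles, even when only one lane is actually active.

SIMD_WIDTH = 16       # GCN SIMDs are 16 lanes wide
WAVEFRONT_WIDTH = 64  # a wavefront is 64 threads

issue_cycles_per_instruction = WAVEFRONT_WIDTH // SIMD_WIDTH
print(issue_cycles_per_instruction)  # 4
```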


I am curious about the exact placement of the inflection points for the timings, since they don't necessarily line up with some of the most obvious resource limits.

Also, I question the need to run test cases with tens (or hundreds) of compute queues. The biggest gains can be had with one or two additional queues (running work that hits different bottlenecks in each). More queues will just cause problems (cache thrashing, etc.).
The current testing does not do this, although a reason to test it would be similar to why people climb mountains: because it's there.
 
If the compiler is good, it could either skip the vector units completely by emitting pure scalar-unit code (saving power) or emit both scalar + vector in an interleaved "dual issue" way (the CU can issue both in the same cycle, doubling the throughput).
I believe SALU and VALU ops have to come from different hardware threads, so this specific kernel couldn't be sped up that way.
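If that co-issue rule is right, a toy issue model shows why a single resident wave can't benefit. The function and op counts here are illustrative, not GCN-exact:

```python
# Sketch of the claimed GCN issue rule: a CU can co-issue one vector op
# and one scalar op in the same cycle, but only from *different*
# wavefronts. With a single resident wave there is no pairing.

def cycles_to_drain(waves: int, valu_ops: int, salu_ops: int) -> int:
    """Toy model: at most one VALU and one SALU issue per cycle, and
    they can fire together only when >= 2 waves are resident."""
    v, s, cycles = valu_ops, salu_ops, 0
    while v > 0 or s > 0:
        cycles += 1
        if waves >= 2:        # co-issue from two different waves
            if v > 0: v -= 1
            if s > 0: s -= 1
        else:                 # one wave: one op per cycle, no pairing
            if v > 0: v -= 1
            elif s > 0: s -= 1
    return cycles

print(cycles_to_drain(waves=1, valu_ops=8, salu_ops=8))  # 16 cycles
print(cycles_to_drain(waves=2, valu_ops=8, salu_ops=8))  # 8 cycles
```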

Benchmarking thread groups of under 256 threads on GCN is not going to lead to any meaningful results, as you would (almost) never use smaller thread groups in real (optimized) applications.
I strongly disagree as I have some code that runs fastest with 64, but I come from an OpenCL perspective (can't get more than 256 work items into a work-group, apart from anything else :p)...

I would suspect a performance bug if a kernel's thread group size isn't one of {256, 384, 512}. Single-lane thread groups result in less than 1% of meaningful work on GCN. Why would you run code like this on a GPU (instead of using the CPU)? Not a good test case at all. No GPU is optimized for this case.
Fillrate tests aren't meaningful work either. This test does reveal serial versus async behaviours, so it's a success on those terms.

Also, I question the need to run test cases with tens (or hundreds) of compute queues. The biggest gains can be had with one or two additional queues (running work that hits different bottlenecks in each). More queues will just cause problems (cache thrashing, etc.).
The results presented demonstrate async compute, when it's available. The test does so with one or two queues as far as I can tell. It appears that even a single queue, the command list test, results in async compute on GCN. EDIT: erm, actually I don't think that last sentence is correct.

What I'm curious to see is whether NVidia hardware will gain async behaviour once NVidia has spent some time on this. Maybe it truly won't work on the current chips, but I'm not convinced either way as yet.
 
Shame you didn't flesh out your thoughts.
Well, I am not a native speaker, so it's a PITA to flesh out thoughts the right way.

I've been pondering your comment about dual-issue and I still don't understand how that's relevant
Should I explain how ILP, or the lack of it due to instruction dependencies, affects performance?

Do you know how this code compiles on Maxwell 2?
Nope. Moreover, I haven't seen the code; the whole thing was obvious from MDolenc's "single lane" wording. Instruction dependencies, arithmetic instruction latencies, and instruction pairing could affect performance too, but such things really require code analysis.
 
Why does the Async test on GCN do nothing to CPU usage or temps, while on Maxwell CPU usage/temps spike up very high and GPU usage drops to 0%?

Does this not indicate that there is software emulation of Async Compute "support" going on? This is what Oxide has referred to directly: high CPU usage due to driver emulation. I'd imagine in a real gaming workload the compute task would be a lot more complex; if it's offloaded to the CPU, it will hammer the CPU into a stall.
 
It's not.

You'll probably resort to a lot of mental gymnastics to claim it is, though.
Good luck with that, I guess. ;-)

http://nubleh.github.io/async/#18

http://nubleh.github.io/async/#38

No mental gymnastics needed, just looking up the two Fiji tests, though that might be too much for you. In any case, can we say Fiji doesn't have hardware-supported async because of this test? The results are poor compared to the older GCN architecture, although better than nV's hardware, but they mirror the up-and-down cycles of lower and higher latency at certain points and the step-by-step increase based on load.
 
http://nubleh.github.io/async/#18

http://nubleh.github.io/async/#38

No mental gymnastics needed, just looking up the two Fiji tests, though that might be too much for you. In any case, can we say Fiji doesn't have hardware-supported async because of this test? The results are poor compared to the older GCN architecture, although better than nV's hardware, but they mirror the up-and-down cycles of lower and higher latency at certain points and the step-by-step increase based on load.

You should read what Sebbi has to say a few posts earlier: the tool is not fit to analyze performance on GCN because it vastly under-utilizes the architecture.

Also, in your example, look at Forced Async: a massive improvement for GCN, and a massive degradation for Maxwell.
 
What forced async? There is no forced async; the single command list is not forced async.

I read what Sebbi stated, but with Fiji having the same count of 8 ACEs as Hawaii, we shouldn't see a latency plot similar to Maxwell 2's. I would expect something similar to Hawaii or Tonga; you can see spikes, yes, but that is about it.

This is latency, not end performance.

And the theory about offloading async to the CPU doesn't make sense: the latency should increase drastically if that were happening, and the plot wouldn't follow the step-by-step pattern anymore; it should look like what the single-command-list plot looks like.
 
AMD-Radeon-R9-Nano-Fiji-GPU-Block-Hot-Chips.jpg


Do you guys notice anything different in Fiji/Fury?

Look at the ACEs. There are normally 8 in other GCN parts; Fiji has 4 ACEs and 2 HWS.....

Some radically different way to handle queues?
 
http://forums.anandtech.com/showpost.php?p=37656793&postcount=204

zlatan @ AT said:
All newer GCN 1.2 cards have this configuration. There are 4 core ACEs. The two HWS units can do the same work as 4 ACEs, which is why AMD refers to 8 ACEs in some presentations. The HWS units are just smarter and can support more interesting workloads, but AMD doesn't talk about these right now. I think it has something to do with the HSA QoS feature. Essentially the GCN 1.2 design is not just an efficient multitasking system, but also good for multi-user environments.

Most GPUs are not designed to run more than one program, because these systems are not optimized for latency. They can execute multiple GPGPU programs, but executing a game while a GPGPU program is running won't give you good results. This is why HSA has a graphics preemption feature. These GCN 1.2 GPUs can prioritize all graphics tasks to provide low-latency output. QoS is just one level further: it can run two games, or a game and a GPGPU app, simultaneously for two different users, and the performance/experience will be really good with these HWS units.

Interesting. Does Tonga have the same behavior as Fiji under this test?
 
I just ran the new AsyncCompute benchmark; the results are kinda different from the last one. Could anyone kindly explain my results? If the ms figures are going up, does this mean it's not doing async?

R9 280 OC
 

Attachments

  • perf - R9 280 OC.zip (65.6 KB)