DX12 Performance Discussion And Analysis Thread

B-b-b-but... my asynchronous shaders! Nooo!

Sorry! There are more things in life to enjoy, no?

Seriously, there are like 5 people stating the results show it works on nVidia...
There's really just one..

B) he already got consent to say some things and not others
You mean if he asked his higher-ups permission to disclose a bunch of development details, he was authorized to disclose some things and not authorized to disclose other things?!
Shock!

Any easy explanation for a high-level programmer? *Who doesn't work with graphics programming at the moment, but would be interested in doing so in the future, btw.

Chances are if you're a high-level programmer you'll never have to deal with this. If anything, you'll be using a middleware engine (Unreal, CryEngine, Unity, etc.) which takes care of all the low-level optimizations for you.
As for an easy explanation... you have Anandtech's which is really good.

If you want an even easier (more layman-er) explanation.. let's see..


Imagine I have a wall-painting workforce of 2048 people.
My 2048 people are versatile enough. They can all do one of two tasks: 1) wash the wall and 2) paint.
Each of my 2048 workers can be assigned only one task (wash or paint) each day.
For telling my 2048 workers what they should do each day, I have two additional employees:
A) Wash-Teamer who organizes teams for washing the wall
B) Paint-Teamer who organizes teams for painting the wall

Now for older generations of GPUs with Compute capabilities (Kepler, Maxwell 1, pre-GCN, etc.), Wash-Teamer really disliked Paint-Teamer. They couldn't even be in the same room without throwing insults towards each other's mother.
I couldn't have that. I want a safe and friendly environment in my company.. so what happens is that Wash-Teamer never comes in on the same days as Paint-Teamer.
My company is still efficient enough. I can predict my paint jobs rather well so I know when I'll be needing more people to wash and tell Wash-Teamer to come, or more people to paint and tell Paint-Teamer to come instead.

But this of course isn't the most efficient way to distribute my 2048 workers. For example, some days I only have painting space for 1536 painters, so it would be great if the remaining 512 who are sitting there playing cards and drinking beer (I'm a cool boss. I allow drinking beer during work hours. Deal with it.) could move on to the next block and start washing another wall.
Even worse: lots of times we have a very strict deadline to paint a wall, but there's only one little bit missing! I can only put something like 64 workers on that wall, which means that during that whole day, I have no less than 1984 workers sitting around, doing nothing. Drinking beer. Almost 2000 guys! Those are some troublesome days...

Come GCN (and supposedly Maxwell 2..), and Wash-Teamer has finally made peace with Paint-Teamer!
Now they both come to work every day, they compliment each other's shoes first thing in the morning and proceed to distribute my 2048 workers in the best possible way.
What happens now is that I sometimes have 1664 workers (shaders) doing the painting (Rendering) and 384 workers doing the washing (Compute). Other days I need more Rendering, so I get 1920 workers doing Rendering, but I can still spare some 128 shaders doing the Compute.

And thanks to this, by the end of the month my company is painting walls a lot faster.
Man, I wish I had listened to my friend Mark Cerny back in 2013. He insisted that Wash-Teamer and Paint-Teamer needed to be friends over two years ago!

And to summarize: GCN results are showing that Wash-Teamer and Paint-Teamer are indeed showing up to work at once.
On nVidia GPUs, they're supposed to be friends with each other, yet they refuse to show up at the same time: one only shows up after the other has left the precinct, while pretending they were there together. Perhaps they just told the press they were friends and would work together, but still can't stand one another.
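
If you want to see what the two "Teamers" map to in actual API terms, here's a minimal D3D12 sketch (assuming an already-created ID3D12Device* called device, with command lists recorded elsewhere) of creating the graphics queue and the compute queue that work gets submitted to independently:

#include <windows.h>
#include <d3d12.h>

// The "Paint-Teamer": a direct queue, which accepts graphics, compute and copy work.
D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
ID3D12CommandQueue* gfxQueue = nullptr;
device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

// The "Wash-Teamer": a compute queue, which only accepts compute and copy work.
D3D12_COMMAND_QUEUE_DESC cmpDesc = {};
cmpDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
ID3D12CommandQueue* cmpQueue = nullptr;
device->CreateCommandQueue(&cmpDesc, IID_PPV_ARGS(&cmpQueue));

// Work submitted to both queues back to back *may* be overlapped by the GPU.
// Whether it actually overlaps is up to the hardware/driver, which is
// exactly what this thread is arguing about.
// gfxQueue->ExecuteCommandLists(1, &gfxLists);
// cmpQueue->ExecuteCommandLists(1, &cmpLists);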
 
Could driver bugs be behind the Fiji GPU issues?

Maybe it's more about the task the test asks for.. but well, I don't know. (Could CodeXL or a GPU perf tool debug this?)

Maybe the async compute demo from AMD would be better suited, but I don't think they have made it available yet.
 
It doesn't matter what the performance difference is: if there is even a small amount of async code running, it is functional. It's not about the end performance across the different IHVs, it's about whether it is capable or not, and it is capable. The serial path should always take the same time or longer than doing it asynchronously, if the variables are the same and the processor is being given enough work.
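
To put that in code form, here's a rough sketch (my own framing, not anything from the actual tool) of how one might read the three timings it prints, assuming t_gfx, t_cmp and t_both are the "graphics only", "compute only" and "graphics + compute" times in milliseconds:

#include <algorithm>
#include <string>

// Crude heuristic with a 10% tolerance on either side.
std::string ClassifyAsync(double t_gfx, double t_cmp, double t_both)
{
    // Good overlap: the combined pass costs roughly as much as the longer
    // of the two individual passes.
    if (t_both < std::max(t_gfx, t_cmp) * 1.10)
        return "looks asynchronous";

    // Serialized: the combined pass costs roughly the sum of the two.
    if (t_both > (t_gfx + t_cmp) * 0.90)
        return "looks serialized (or worse)";

    return "partial overlap";
}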

Yes, provided that the perf info we have is correct and not some weird bench artifact. Which then raises another question: why is there almost no benefit in running async on the Maxwell 2 uarch? And in some cases a performance loss, as per Oxide.

If everything is working as it should, there is 30% or more of potential perf gain Nvidia won't be able to tap into. I'm taking the 30% from what console devs are saying.
 
Sorry! There are more things in life to enjoy, no?



You mean if he asked his higher-ups permission to disclose a bunch of development details, he was authorized to disclose some things and not authorized to disclose other things?!
Shock!


No one is asking for that; that's why I said pseudo code + shader profiler info. If you don't understand what that means, you will come up with the statement you just made.
 
Yes, provided that the perf info we have is correct and not some weird bench artifact. Which then raises another question: why is there almost no benefit in running async on the Maxwell 2 uarch? And in some cases a performance loss, as per Oxide.

If everything is working as it should, there is 30% or more of potential perf gain Nvidia won't be able to tap into. I'm taking the 30% from what console devs are saying.


That will require more tests. More benchmarks etc.
 
Now, all in all, how does this affect a GTX 980 Ti? (Objectively speaking.) Trying to decide between a Fury X and this for longevity...
We have some evidence that 980Ti can't do async compute. Why not wait until there's a better variety of tests? Games should benefit substantially from D3D12 even without async compute.

There's something really wrong with GCN altogether in this test. Compute times are just horrible, and GPU usage is way too low (max 10% under compute). Well, granted, it's not a benchmark made for pure performance.
I discovered a mistake I made earlier.

In this post:

DX12 performance thread

I said the loop is 8 cycles. This is radically wrong. It's actually 40 cycles. The new version of CodeXL makes this clear (though there's a whopper of a bug) because it indicates the timings of instructions and points at something I totally forgot: a single work item runs each SIMD at 1/4 throughput over time. Whereas on NVidia a single work item should run at full throughput over time, because the SIMD width matches the work-group width.

For a loop of 1,048,576 iterations, that's 40ms. It's amusing because it means that in the earlier test AMD couldn't drop below 40ms.

In the second test the loop iterates 524,288 times. That's 20ms. So now we get to some truth about this kernel: it runs vastly slower on AMD than on NVidia. OK, there's still 6 ms that I can't explain (which is as much time as GM200 spends), but I think we've almost cracked one of the mysteries behind the bizarre slowness on AMD.
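
Spelling the arithmetic out (assuming a shader clock of roughly 1.05 GHz, which is in the right ballpark for these cards):

$t \approx \dfrac{1{,}048{,}576 \times 40\ \text{cycles}}{1.05\ \text{GHz}} \approx 40\ \text{ms}, \qquad t \approx \dfrac{524{,}288 \times 40\ \text{cycles}}{1.05\ \text{GHz}} \approx 20\ \text{ms}$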

Apart from that I can't help wondering if the [numthreads(1, 1, 1)] attribute of the kernel is making AMD do something additionally strange.
 
I don't know what DMA copy you're referring to.

If I understand well, referring to the Anandtech article (but maybe it's me who misunderstood the thing), in reality for async compute on GCN you have 3 types of queues: the graphics one, the compute one, and the copy queue (which handles data streaming, dynamic data updates, defrag, whatever)..
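
(For reference, those three do map directly onto D3D12's command list types; here's a rough sketch of the copy case, with the same assumptions as the queue sketch earlier in the thread:)

// Graphics, compute and copy map onto D3D12's three command list types:
//   D3D12_COMMAND_LIST_TYPE_DIRECT  -> graphics queue (also accepts compute/copy work)
//   D3D12_COMMAND_LIST_TYPE_COMPUTE -> compute queue
//   D3D12_COMMAND_LIST_TYPE_COPY    -> copy queue (data streaming / DMA-style transfers)
D3D12_COMMAND_QUEUE_DESC copyDesc = {};
copyDesc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
ID3D12CommandQueue* copyQueue = nullptr;
device->CreateCommandQueue(&copyDesc, IID_PPV_ARGS(&copyQueue));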

I get the feeling that, from what we've seen, nothing is done with this kernel to allow hiding the latency of the thread scheduling.

Not my domain, so I'm anything but clear on it.
 
There's no data movement in this test as far as I'm aware. It renders pixels to the screen and it writes a few bytes to a separate buffer (I'm not sure if this is a single buffer shared by all kernel instances?). The buffer isn't used in any other way.
 
I said the loop is 8 cycles. This is radically wrong. It's actually 40 cycles. The new version of CodeXL makes this clear (though there's a whopper of a bug) because it indicates the timings of instructions and points at something I totally forgot: a single work item runs each SIMD at 1/4 throughput over time. Whereas on NVidia a single work item should run at full throughput over time, because the SIMD width matches the work-group width.
I forgot about the 4x multiplier as well. The numbers didn't add up to something that felt intuitively right without that in place.

That explains the magnitude of the time increment, mostly. Overhead and maybe clock variance might explain part of the remainder.

Intra-batch timings should have more space to dispatch wavefronts within 20ms, unless dispatch is that high overhead.
I was running with a mental model that the batches were isolated in time, but the GPU could be cycling amongst them for fairness purposes.
 
If possible, it would be interesting to see a version of the test coded with GCN's own intricacies taken into account, to get latency numbers down to more realistic values. One version tailored to Maxwell/v2 (as it seems to be right now) and another for GCN, both doing the same total workload (should be comparable in numbers, if desired).

Still, this is a test of async compute capability, not performance or a benchmark, so it's nothing that would validate or invalidate the results we have at hand so far.
 
MDolenc seems to own a GTX 680, which is a Kepler GPU.
Perhaps it's natural that something debugged on a Kepler card would work better on Maxwell than on GCN, even though it wasn't made for Maxwell specifically.

Regardless, I'll leave the effort of trying to evaluate performance out of this tool to people here who are much more literate than me.
Some results like Tahiti's seem completely bonkers to me. The only thing I can discern is that no nVidia card has proven capable of async compute in this test so far.
 
The only thing I can discern is that no nVidia card has proven capable of async compute in this test so far.

Which is the point of the test, to check for async compute capability. It's not a benchmark. I'll also leave the performance side of the tool for the more literate. I was just saying that maybe there could be some tweaks to how the test does its work that then would fit GCN (Why not Maxwell too, if coded for Kepler?) better, to speed it up. Again, not needed since it's not a benchmark, but could be worthwhile.

Tahiti's results relative to Hawaii and Fiji, on the other hand, are insane. GCN the first, the best? :D
 
This was the point of this tool: checking capability, not "benchmarking" compute, graphics or async performance anyway.

Still, I can't really draw a black and white line with it; I think I could be missing something. More tests will be needed, or more information from Nvidia.

The problem is, nVidia has been really closed about divulging anything on how their architecture works deep down; we have barely any information outside the basic diagrams and what people have found out..

Contrary to AMD, where nearly every aspect of GCN is really exposed and can be discussed (even more so since console devs have started working with it and sharing their findings, the possibilities, optimizations, requirements, etc.).

I can't even find, in what we know of the architecture, what difference between Maxwell 1 and Maxwell 2 would make the first not support async compute while the second could.
 
Radeon HD 7790, GCN 1.1 (with 260X BIOS), driver 15.20.1062.1004

Compute only:
1. 50.52ms
2. 50.48ms
3. 50.46ms
4. 50.47ms
5. 50.47ms
6. 50.49ms
7. 50.52ms
8. 50.52ms
9. 50.46ms
10. 50.45ms
11. 50.47ms
12. 50.52ms
13. 50.48ms
14. 50.47ms
15. 50.51ms
16. 50.49ms
17. 50.48ms
18. 50.50ms
19. 50.45ms
20. 50.50ms
21. 50.48ms
22. 50.51ms
23. 50.52ms
24. 50.50ms
25. 50.48ms
26. 50.45ms
27. 50.46ms
28. 50.47ms
29. 50.46ms
30. 50.45ms
31. 50.44ms
32. 50.45ms
33. 50.45ms
34. 50.46ms
35. 50.46ms
36. 50.47ms
37. 50.46ms
38. 50.45ms
39. 50.45ms
40. 50.45ms
41. 50.45ms
42. 50.48ms
43. 50.47ms
44. 50.45ms
45. 50.45ms
46. 50.64ms
47. 50.45ms
48. 50.45ms
49. 50.46ms
50. 50.47ms
51. 50.46ms
52. 50.44ms
53. 50.44ms
54. 50.45ms
55. 50.46ms
56. 50.45ms
57. 50.49ms
58. 50.49ms
59. 54.18ms
60. 50.49ms
61. 50.49ms
62. 50.49ms
63. 50.50ms
64. 50.49ms
65. 50.51ms
66. 50.50ms
67. 50.57ms
68. 50.52ms
69. 50.52ms
70. 50.51ms
71. 50.50ms
72. 50.52ms
73. 50.54ms
74. 50.57ms
75. 50.60ms
76. 58.27ms
77. 58.26ms
78. 62.79ms
79. 50.49ms
80. 50.51ms
81. 50.51ms
82. 50.52ms
83. 50.56ms
84. 58.28ms
85. 50.51ms
86. 50.54ms
87. 50.57ms
88. 50.56ms
89. 58.28ms
90. 78.48ms
91. 50.53ms
92. 50.54ms
93. 50.51ms
94. 50.56ms
95. 50.51ms
96. 50.51ms
97. 50.58ms
98. 50.51ms
99. 50.49ms
100. 50.56ms
101. 63.14ms
102. 50.52ms
103. 50.49ms
104. 50.52ms
105. 50.51ms
106. 50.58ms
107. 50.56ms
108. 50.56ms
109. 50.52ms
110. 50.60ms
111. 50.53ms
112. 50.52ms
113. 64.66ms
114. 64.61ms
115. 64.63ms
116. 73.23ms
117. 64.77ms
118. 64.73ms
119. 64.67ms
120. 64.67ms
121. 64.61ms
122. 64.66ms
123. 64.79ms
124. 64.71ms
125. 64.76ms
126. 64.66ms
127. 64.64ms
128. 76.80ms
Graphics only: 97.56ms (17.20G pixels/s)
Graphics + compute:
1. 97.80ms (17.16G pixels/s)
2. 98.13ms (17.10G pixels/s)
3. 98.35ms (17.06G pixels/s)
4. 98.13ms (17.10G pixels/s)
5. 98.32ms (17.06G pixels/s)
6. 98.52ms (17.03G pixels/s)
7. 98.49ms (17.03G pixels/s)
8. 97.75ms (17.16G pixels/s)
9. 97.65ms (17.18G pixels/s)
10. 98.30ms (17.07G pixels/s)
11. 97.96ms (17.13G pixels/s)
12. 97.95ms (17.13G pixels/s)
13. 97.22ms (17.26G pixels/s)
14. 98.51ms (17.03G pixels/s)
15. 97.94ms (17.13G pixels/s)
16. 97.94ms (17.13G pixels/s)
17. 98.85ms (16.97G pixels/s)
18. 98.80ms (16.98G pixels/s)
19. 97.98ms (17.12G pixels/s)
20. 98.14ms (17.09G pixels/s)
21. 97.45ms (17.22G pixels/s)
22. 98.18ms (17.09G pixels/s)
23. 97.83ms (17.15G pixels/s)
24. 98.23ms (17.08G pixels/s)
25. 98.42ms (17.05G pixels/s)
26. 97.87ms (17.14G pixels/s)
27. 97.21ms (17.26G pixels/s)
28. 98.50ms (17.03G pixels/s)
29. 97.62ms (17.19G pixels/s)
30. 98.03ms (17.11G pixels/s)
31. 98.28ms (17.07G pixels/s)
32. 97.92ms (17.13G pixels/s)
33. 97.74ms (17.17G pixels/s)
34. 97.86ms (17.14G pixels/s)
35. 98.51ms (17.03G pixels/s)
36. 98.27ms (17.07G pixels/s)
37. 97.71ms (17.17G pixels/s)
38. 97.98ms (17.12G pixels/s)
39. 97.49ms (17.21G pixels/s)
40. 98.02ms (17.12G pixels/s)
41. 97.74ms (17.17G pixels/s)
42. 98.32ms (17.06G pixels/s)
43. 98.32ms (17.06G pixels/s)
44. 98.10ms (17.10G pixels/s)
45. 98.31ms (17.07G pixels/s)
46. 97.98ms (17.12G pixels/s)
47. 98.52ms (17.03G pixels/s)
48. 98.55ms (17.02G pixels/s)
49. 98.40ms (17.05G pixels/s)
50. 97.84ms (17.15G pixels/s)
51. 98.95ms (16.95G pixels/s)
52. 98.62ms (17.01G pixels/s)
53. 100.74ms (16.65G pixels/s)
54. 98.61ms (17.01G pixels/s)
55. 98.62ms (17.01G pixels/s)
56. 97.09ms (17.28G pixels/s)
57. 97.89ms (17.14G pixels/s)
58. 100.84ms (16.64G pixels/s)
59. 99.07ms (16.93G pixels/s)
60. 98.68ms (17.00G pixels/s)
61. 99.43ms (16.87G pixels/s)
62. 100.41ms (16.71G pixels/s)
63. 97.77ms (17.16G pixels/s)
64. 98.78ms (16.98G pixels/s)
65. 97.77ms (17.16G pixels/s)
66. 98.46ms (17.04G pixels/s)
67. 100.62ms (16.67G pixels/s)
68. 104.77ms (16.01G pixels/s)
69. 98.51ms (17.03G pixels/s)
70. 98.48ms (17.04G pixels/s)
71. 100.85ms (16.64G pixels/s)
72. 102.28ms (16.40G pixels/s)
73. 99.50ms (16.86G pixels/s)
74. 100.57ms (16.68G pixels/s)
75. 101.84ms (16.47G pixels/s)
76. 99.07ms (16.93G pixels/s)
77. 98.80ms (16.98G pixels/s)
78. 98.86ms (16.97G pixels/s)
79. 98.01ms (17.12G pixels/s)
80. 100.67ms (16.67G pixels/s)
81. 100.69ms (16.66G pixels/s)
82. 99.60ms (16.84G pixels/s)
83. 101.68ms (16.50G pixels/s)
84. 111.46ms (15.05G pixels/s)
85. 107.74ms (15.57G pixels/s)
86. 105.30ms (15.93G pixels/s)
87. 112.01ms (14.98G pixels/s)
88. 119.58ms (14.03G pixels/s)
89. 102.47ms (16.37G pixels/s)
90. 131.52ms (12.76G pixels/s)
91. 131.55ms (12.75G pixels/s)
92. 112.19ms (14.95G pixels/s)
93. 104.49ms (16.06G pixels/s)
94. 126.18ms (13.30G pixels/s)
95. 123.98ms (13.53G pixels/s)
96. 114.98ms (14.59G pixels/s)
97. 125.19ms (13.40G pixels/s)
98. 111.20ms (15.09G pixels/s)
99. 116.89ms (14.35G pixels/s)
100. 113.22ms (14.82G pixels/s)
101. 110.97ms (15.12G pixels/s)
102. 106.94ms (15.69G pixels/s)
103. 97.66ms (17.18G pixels/s)
104. 98.25ms (17.08G pixels/s)
105. 98.09ms (17.10G pixels/s)
106. 120.23ms (13.95G pixels/s)
107. 121.07ms (13.86G pixels/s)
108. 114.33ms (14.67G pixels/s)
109. 109.20ms (15.36G pixels/s)
110. 118.82ms (14.12G pixels/s)
111. 109.68ms (15.30G pixels/s)
112. 119.59ms (14.03G pixels/s)
113. 105.64ms (15.88G pixels/s)
114. 106.72ms (15.72G pixels/s)
115. 111.05ms (15.11G pixels/s)
116. 116.57ms (14.39G pixels/s)
117. 109.87ms (15.27G pixels/s)
118. 135.25ms (12.40G pixels/s)
119. 98.30ms (17.07G pixels/s)
120. 98.26ms (17.07G pixels/s)
121. 110.16ms (15.23G pixels/s)
122. 114.49ms (14.65G pixels/s)
123. 115.29ms (14.55G pixels/s)
124. 123.68ms (13.56G pixels/s)
125. 122.32ms (13.72G pixels/s)
126. 111.39ms (15.06G pixels/s)
127. 105.41ms (15.92G pixels/s)
128. 110.22ms (15.22G pixels/s)

I think these results are interesting. My card is one of the weakest GCN parts, so we can see that the compute part of your bench is able to saturate it, and GCN starts to "jump" like Maxwell:
compute
112. 50.52ms
113. 64.66ms
...
127. 64.64ms
128. 76.80ms

With graphics + compute it starts to jump earlier:
83. 101.68ms (16.50G pixels/s)
84. 111.46ms (15.05G pixels/s)
...
128. 110.22ms (15.22G pixels/s)

So my idea to explain the fully flat results of the stronger GCN cards is that this bench is not pushing enough work at them.
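
If anyone wants to spot where their card starts "jumping" without scrolling through 128 lines by eye, here's a quick throwaway sketch (my own, not part of the tool; it assumes the output was saved to a text file with lines like "113. 64.66ms"):

#include <cstdio>
#include <fstream>
#include <regex>
#include <string>

// Flags every step whose time is more than 10% above the first step of its
// section (each section of the tool's output restarts numbering at "1.").
int main(int argc, char** argv)
{
    if (argc < 2) { std::printf("usage: jumps <results.txt>\n"); return 1; }

    std::ifstream in(argv[1]);
    std::regex line_re(R"(^\s*(\d+)\.\s+([0-9.]+)ms)");
    std::string line;
    double baseline = 0.0;

    while (std::getline(in, line)) {
        std::smatch m;
        if (!std::regex_search(line, m, line_re)) continue;
        int step = std::stoi(m[1].str());
        double t = std::stod(m[2].str());
        if (step == 1) baseline = t;                 // new section, new baseline
        if (baseline > 0.0 && t > baseline * 1.10)   // >10% above baseline
            std::printf("jump at step %d: %.2fms\n", step, t);
    }
    return 0;
}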
 