DX12 Performance Discussion And Analysis Thread

We also have not tested a multi-queue state, and we do not have ready visibility on what queues are actually being exercised relative to the API-visible queues.
We have. Initial version, 128 queues, GCN crashed, Kepler/Maxwell basically serialized all of them. Which I'd say is what happens with graphics + compute queue anyway.

I think NV drivers are just not there yet. That said, there's one more thing we can try: D3D12 for graphics + CUDA instead of D3D12 compute shaders. So this version is now NV-only.
 

Attachments

  • AsyncCompute.zip (19.1 KB)
MDolenc, could you make a variant that does the unroll in the compute shader explicitly, either like the one linked by Benny-ua (which I quoted) or the nested loop unroll that I wrote? I'm curious whether it's possible to make the compute portion run faster, as the PTX that was posted looked quite wonky to me.
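For reference, here is roughly what I mean by the explicit nested-loop unroll, sketched as a CUDA kernel since the actual shader source isn't posted; the kernel body is a made-up placeholder, not MDolenc's code:
Code:
// Hypothetical stand-in for the test kernel: the same idea written with an
// explicit nested-loop unroll, so the compiler flattens the inner iterations.
__global__ void shade_unrolled(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float acc = in[idx];
    #pragma unroll 8                        // unroll the outer loop fully
    for (int i = 0; i < 8; ++i)
    {
        #pragma unroll 16                   // inner unroll: 128 iterations flattened
        for (int j = 0; j < 16; ++j)
            acc = acc * 0.9999f + 0.0001f;  // placeholder ALU work
    }
    out[idx] = acc;
}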
 
Compute is not useless by any means. Many DirectX 11 games used compute shaders; async is just a bonus. There are lots of games out there that use compute shaders for lighting (Battlefield 3 being the first big AAA title). Compute shader based lighting performs very well even on AMD TeraScale VLIW GPUs (such as the Radeon 5000 and 6000 series) and NVIDIA Fermi (GeForce GTX 400 series). Compute shaders are used in rendering because they allow writing less brute-force algorithms that save ALU and bandwidth compared to their pixel shader equivalents.

Full screen (~2 million threads) compute shader passes do not need any concurrent graphics tasks to fill the whole GPU. It is perfectly fine to first run graphics to rasterize the G-buffer and shadow maps, and then run compute shaders for lighting and post processing. Games have been doing it like this since DirectX 11 launched. Everybody has been happy. I don't understand the recent fuss.
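For scale, a minimal sketch of the thread math behind that "~2 million threads" figure (hypothetical helper; assumes a recorded ID3D12GraphicsCommandList* with the lighting PSO and root signature already bound):
Code:
#include <d3d12.h>

// Full-screen compute pass with 8x8 = 64 threads per group.
void DispatchFullScreen(ID3D12GraphicsCommandList* cl, UINT width, UINT height)
{
    const UINT groupsX = (width  + 7) / 8;
    const UINT groupsY = (height + 7) / 8;
    // At 1920x1080: 240 * 135 groups * 64 threads = 2,073,600 threads --
    // enough on its own to fill every CU/SM without concurrent graphics work.
    cl->Dispatch(groupsX, groupsY, 1);
}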

So, all this fuss about async shaders is a little overrated? Also, about context switching and stuff: you guys just said that it's not about compute but about switching between compute and graphics... now Kollock (Oxide dev) said that engines could migrate to 100% compute (or at least that's what I read at Overclock.net, quoted by this guy mahigan)... Any thoughts on that?

As a more or less final question: could someone explain to me, in really, really simple words, what the actual outlook is for Fiji and Maxwell 2, future-proof-wise? (Will they be as future proof as my HD 5870, or like whatever high-end card of these past years?)

Thanks!

PS: Hello, Razor.
 
We have. Initial version, 128 queues, GCN crashed, Kepler/Maxwell basically serialized all of them. Which I'd say is what happens with graphics + compute queue anyway.

I think NV drivers are just not there yet. That said, there's one more thing we can try: D3D12 for graphics + CUDA instead of D3D12 compute shaders. So this version is now NV-only.
Would you mind supplying the source code for the individual versions of that project?

Are you even sure that GCN crashed? Or did you simply forget to check the return value of ID3D12Device::CreateCommandQueue?
You know, that call can fail for many reasons, and if it does, the ppCommandQueue parameter is tainted. Only an explicit S_OK return value indicates success. ThrowIfFailed is actually unnecessary; the driver may still be able to recover.

Also, with 128 queues you haven't hit the sweet spot for either platform. 1, 8, 32 and 64 would be the numbers to compare. At least 1 and 8 should differ greatly on GCN 1.2.
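To illustrate both points, a minimal sketch (assuming a valid ID3D12Device*; the helper name is mine) that checks every return value explicitly and can be swept over 1/8/32/64 queues:
Code:
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>
#include <vector>
using Microsoft::WRL::ComPtr;

// Create `count` dedicated compute queues, treating anything but S_OK as failure.
std::vector<ComPtr<ID3D12CommandQueue>> CreateComputeQueues(ID3D12Device* device, UINT count)
{
    std::vector<ComPtr<ID3D12CommandQueue>> queues;
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    for (UINT i = 0; i < count; ++i)
    {
        ComPtr<ID3D12CommandQueue> q;
        HRESULT hr = device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q));
        if (hr != S_OK)  // only an explicit S_OK indicates success
        {
            std::printf("CreateCommandQueue #%u failed: 0x%08lx\n",
                        i, static_cast<unsigned long>(hr));
            break;       // report how far we got instead of crashing later
        }
        queues.push_back(q);
    }
    return queues;       // sweep count = 1, 8, 32, 64 and compare timings
}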

Oh, and D3D12 + CUDA is now something entirely different again. CUDA runs in a compute context, not a graphics context. There is definitely a context switch involved now, even though you are suddenly profiting from all the optimizations in the driver stack for CUDA, which yields entirely incomparable results.
 
We have. Initial version, 128 queues, GCN crashed, Kepler/Maxwell basically serialized all of them. Which I'd say is what happens with graphics + compute queue anyway.
Apologies, I was thinking of the test outputs that provided end times for individual dispatches.
The data sets in this thread were more sparse prior to the latest revision.
 
Would you mind supplying the source code for the individual versions of that project?

I'll +1 that. If you make it a git repo we could even contribute and develop other test cases (Beyond3D async test suite incoming? :p). No pressure though, obviously, but I'm sure there are others like me who are interested in helping!
 
Has anyone posted this? I'm too lazy to read through 27 pages.

Well, here is some good news for NVIDIA users. It appears that NVIDIA will fully implement async compute via an upcoming driver:
http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/2130#post_24379702

It came up on page 28.
It may come down to wait-and-see for the NVIDIA implementation. Possibly, whatever is incomplete may explain some of the oddities in which queues appear to be utilized, and the difference in how the test runs generate timing information between vendors.
 
http://nubleh.github.io/async/#38
It's doing pretty well compared to a 980 Ti, but there are still large parts that aren't in the blue area. Whether that's a driver issue or the way the test was written, I don't know, but for now it's in at least a similar position to the 980 Ti (except in the "forced sequential" case, which I hope is not a normal use-case for compute or graphics in general. D: )

And interestingly, it was worse on a more recent Fury X run with MSI Afterburner active (maybe a poor correlation on my part, but it's why I asked Pady earlier whether he also had MSI Afterburner running on his sort-of-working 980 Ti).
http://nubleh.github.io/async/#47
Cheers
 
Tested using the D3D12 + CUDA version (post 661) on a GTX 970 (355.82)
CUDA only:
1. 5.24ms
2. 5.18ms
3. 5.40ms
4. 5.20ms
5. 5.25ms
6. 5.30ms
7. 5.29ms
8. 5.24ms
9. 5.13ms
10. 5.55ms
11. 5.32ms
12. 5.26ms
13. 5.38ms
14. 5.50ms
15. 5.40ms
16. 5.58ms
17. 5.45ms
18. 6.07ms
19. 5.43ms
20. 5.56ms
21. 5.59ms
22. 5.51ms
23. 5.73ms
24. 5.63ms
25. 5.34ms
26. 5.35ms
27. 5.22ms
28. 5.28ms
29. 5.55ms
30. 5.20ms
31. 6.12ms
32. 5.38ms
33. 10.03ms
34. 9.68ms
35. 9.34ms
36. 9.49ms
37. 9.25ms
38. 9.51ms
39. 9.45ms
40. 9.78ms
41. 9.67ms
42. 9.03ms
43. 9.36ms
44. 9.22ms
45. 9.39ms
46. 9.39ms
47. 9.37ms
48. 9.42ms
49. 9.57ms
50. 10.13ms
51. 10.57ms
52. 10.04ms
53. 10.39ms
54. 10.93ms
55. 10.39ms
56. 10.08ms
57. 10.09ms
58. 10.24ms
59. 10.05ms
60. 10.41ms
61. 10.10ms
62. 10.32ms
63. 10.18ms
64. 11.55ms
65. 14.31ms
66. 14.22ms
67. 14.82ms
68. 15.59ms
69. 14.95ms
70. 15.71ms
71. 16.00ms
72. 14.95ms
73. 15.01ms
74. 14.79ms
75. 15.44ms
76. 15.39ms
77. 14.77ms
78. 15.38ms
79. 14.92ms
80. 15.03ms
81. 14.90ms
82. 15.09ms
83. 18.81ms
84. 15.50ms
85. 16.57ms
86. 17.34ms
87. 15.77ms
88. 15.93ms
89. 16.22ms
90. 16.43ms
91. 16.50ms
92. 16.56ms
93. 16.89ms
94. 16.37ms
95. 16.43ms
96. 16.61ms
97. 20.07ms
98. 19.52ms
99. 19.74ms
100. 19.46ms
101. 20.02ms
102. 20.95ms
103. 21.05ms
104. 21.21ms
105. 21.03ms
106. 21.18ms
107. 21.19ms
108. 21.52ms
109. 21.72ms
110. 21.61ms
111. 21.60ms
112. 21.15ms
113. 21.40ms
114. 21.26ms
115. 20.94ms
116. 21.16ms
117. 20.57ms
118. 21.33ms
119. 21.78ms
120. 22.78ms
121. 22.33ms
122. 22.61ms
123. 22.78ms
124. 22.69ms
125. 22.49ms
126. 23.23ms
127. 22.89ms
128. 22.74ms
Graphics only: 33.01ms (50.82G pixels/s)
Graphics + CUDA:
1. 37.04ms (45.30G pixels/s) {51.77 G pixels/s}
2. 36.76ms (45.64G pixels/s) {51.79 G pixels/s}
3. 36.69ms (45.72G pixels/s) {51.82 G pixels/s}
4. 36.83ms (45.56G pixels/s) {51.79 G pixels/s}
5. 36.70ms (45.72G pixels/s) {51.82 G pixels/s}
6. 36.71ms (45.71G pixels/s) {51.81 G pixels/s}
7. 36.81ms (45.58G pixels/s) {51.83 G pixels/s}
8. 36.72ms (45.69G pixels/s) {51.81 G pixels/s}
9. 36.63ms (45.80G pixels/s) {51.83 G pixels/s}
10. 36.68ms (45.74G pixels/s) {51.82 G pixels/s}
11. 36.74ms (45.67G pixels/s) {51.83 G pixels/s}
12. 41.07ms (40.85G pixels/s) {51.79 G pixels/s}
13. 36.78ms (45.62G pixels/s) {51.80 G pixels/s}
14. 36.73ms (45.68G pixels/s) {51.82 G pixels/s}
15. 36.68ms (45.74G pixels/s) {51.82 G pixels/s}
16. 36.93ms (45.43G pixels/s) {51.74 G pixels/s}
17. 36.96ms (45.39G pixels/s) {51.79 G pixels/s}
18. 37.07ms (45.26G pixels/s) {51.80 G pixels/s}
19. 36.84ms (45.54G pixels/s) {51.81 G pixels/s}
20. 37.00ms (45.34G pixels/s) {51.80 G pixels/s}
21. 36.85ms (45.53G pixels/s) {51.78 G pixels/s}
22. 36.88ms (45.50G pixels/s) {51.80 G pixels/s}
23. 36.70ms (45.71G pixels/s) {51.81 G pixels/s}
24. 36.72ms (45.69G pixels/s) {51.81 G pixels/s}
25. 36.83ms (45.55G pixels/s) {51.81 G pixels/s}
26. 36.70ms (45.72G pixels/s) {51.81 G pixels/s}
27. 36.76ms (45.64G pixels/s) {51.84 G pixels/s}
28. 36.79ms (45.60G pixels/s) {51.81 G pixels/s}
29. 41.17ms (40.75G pixels/s) {51.75 G pixels/s}
30. 37.14ms (45.17G pixels/s) {51.79 G pixels/s}
31. 36.82ms (45.57G pixels/s) {51.76 G pixels/s}
32. 36.76ms (45.64G pixels/s) {51.80 G pixels/s}
33. 41.23ms (40.69G pixels/s) {51.79 G pixels/s}
34. 41.23ms (40.69G pixels/s) {51.81 G pixels/s}
35. 41.14ms (40.78G pixels/s) {51.84 G pixels/s}
36. 41.04ms (40.88G pixels/s) {51.83 G pixels/s}
37. 44.92ms (37.35G pixels/s) {51.80 G pixels/s}
38. 41.21ms (40.71G pixels/s) {51.80 G pixels/s}
39. 40.90ms (41.02G pixels/s) {51.83 G pixels/s}
40. 41.29ms (40.63G pixels/s) {51.77 G pixels/s}
41. 40.95ms (40.97G pixels/s) {51.79 G pixels/s}
42. 41.16ms (40.76G pixels/s) {51.81 G pixels/s}
43. 41.36ms (40.56G pixels/s) {51.81 G pixels/s}
44. 41.26ms (40.66G pixels/s) {51.81 G pixels/s}
45. 45.10ms (37.20G pixels/s) {51.80 G pixels/s}
46. 41.33ms (40.60G pixels/s) {51.81 G pixels/s}
47. 41.43ms (40.49G pixels/s) {51.79 G pixels/s}
48. 41.20ms (40.72G pixels/s) {51.78 G pixels/s}
49. 41.60ms (40.33G pixels/s) {51.76 G pixels/s}
50. 41.79ms (40.15G pixels/s) {51.83 G pixels/s}
51. 41.79ms (40.14G pixels/s) {51.83 G pixels/s}
52. 41.72ms (40.21G pixels/s) {51.80 G pixels/s}
53. 41.78ms (40.15G pixels/s) {51.80 G pixels/s}
54. 41.65ms (40.28G pixels/s) {51.82 G pixels/s}
55. 42.06ms (39.89G pixels/s) {51.80 G pixels/s}
56. 42.48ms (39.50G pixels/s) {51.82 G pixels/s}
57. 41.82ms (40.11G pixels/s) {51.79 G pixels/s}
58. 41.92ms (40.03G pixels/s) {51.81 G pixels/s}
59. 41.87ms (40.07G pixels/s) {51.78 G pixels/s}
60. 41.64ms (40.29G pixels/s) {51.81 G pixels/s}
61. 41.57ms (40.36G pixels/s) {51.81 G pixels/s}
62. 41.81ms (40.12G pixels/s) {51.81 G pixels/s}
63. 41.89ms (40.05G pixels/s) {51.82 G pixels/s}
64. 42.03ms (39.91G pixels/s) {51.82 G pixels/s}
65. 45.65ms (36.75G pixels/s) {51.79 G pixels/s}
66. 45.46ms (36.90G pixels/s) {51.82 G pixels/s}
67. 45.97ms (36.50G pixels/s) {51.82 G pixels/s}
68. 46.22ms (36.30G pixels/s) {51.81 G pixels/s}
69. 47.07ms (35.64G pixels/s) {51.80 G pixels/s}
70. 46.01ms (36.46G pixels/s) {51.83 G pixels/s}
71. 47.63ms (35.23G pixels/s) {51.80 G pixels/s}
72. 46.18ms (36.33G pixels/s) {51.83 G pixels/s}
73. 46.15ms (36.35G pixels/s) {51.84 G pixels/s}
74. 46.42ms (36.14G pixels/s) {51.80 G pixels/s}
75. 46.65ms (35.97G pixels/s) {51.82 G pixels/s}
76. 46.02ms (36.45G pixels/s) {51.81 G pixels/s}
77. 46.01ms (36.47G pixels/s) {51.72 G pixels/s}
78. 46.09ms (36.40G pixels/s) {51.80 G pixels/s}
79. 46.05ms (36.43G pixels/s) {51.71 G pixels/s}
80. 46.05ms (36.44G pixels/s) {51.78 G pixels/s}
81. 46.04ms (36.44G pixels/s) {51.79 G pixels/s}
82. 46.18ms (36.33G pixels/s) {51.80 G pixels/s}
83. 46.10ms (36.40G pixels/s) {51.83 G pixels/s}
84. 47.10ms (35.62G pixels/s) {51.80 G pixels/s}
85. 46.86ms (35.80G pixels/s) {51.80 G pixels/s}
86. 47.15ms (35.58G pixels/s) {51.72 G pixels/s}
87. 46.79ms (35.86G pixels/s) {51.78 G pixels/s}
88. 46.69ms (35.93G pixels/s) {51.80 G pixels/s}
89. 47.02ms (35.68G pixels/s) {51.81 G pixels/s}
90. 46.60ms (36.00G pixels/s) {51.79 G pixels/s}
91. 47.01ms (35.69G pixels/s) {51.82 G pixels/s}
92. 47.34ms (35.44G pixels/s) {51.80 G pixels/s}
93. 47.06ms (35.65G pixels/s) {51.82 G pixels/s}
94. 47.24ms (35.52G pixels/s) {51.82 G pixels/s}
95. 46.72ms (35.91G pixels/s) {51.81 G pixels/s}
96. 47.01ms (35.69G pixels/s) {51.82 G pixels/s}
97. 50.46ms (33.25G pixels/s) {51.81 G pixels/s}
98. 50.28ms (33.37G pixels/s) {51.81 G pixels/s}
99. 50.37ms (33.31G pixels/s) {51.84 G pixels/s}
100. 50.19ms (33.43G pixels/s) {51.81 G pixels/s}
101. 51.28ms (32.72G pixels/s) {51.82 G pixels/s}
102. 51.70ms (32.45G pixels/s) {51.81 G pixels/s}
103. 51.66ms (32.48G pixels/s) {51.79 G pixels/s}
104. 51.29ms (32.71G pixels/s) {51.79 G pixels/s}
105. 51.18ms (32.78G pixels/s) {51.77 G pixels/s}
106. 51.11ms (32.82G pixels/s) {51.78 G pixels/s}
107. 51.43ms (32.62G pixels/s) {51.79 G pixels/s}
108. 51.38ms (32.66G pixels/s) {51.79 G pixels/s}
109. 51.71ms (32.44G pixels/s) {51.81 G pixels/s}
110. 51.56ms (32.54G pixels/s) {51.82 G pixels/s}
111. 51.60ms (32.52G pixels/s) {51.82 G pixels/s}
112. 51.42ms (32.63G pixels/s) {51.80 G pixels/s}
113. 51.34ms (32.68G pixels/s) {51.81 G pixels/s}
114. 52.16ms (32.17G pixels/s) {51.78 G pixels/s}
115. 51.32ms (32.69G pixels/s) {51.80 G pixels/s}
116. 51.29ms (32.71G pixels/s) {51.80 G pixels/s}
117. 51.09ms (32.84G pixels/s) {51.83 G pixels/s}
118. 51.45ms (32.61G pixels/s) {51.80 G pixels/s}
119. 51.68ms (32.47G pixels/s) {51.80 G pixels/s}
120. 52.54ms (31.93G pixels/s) {51.81 G pixels/s}
121. 51.76ms (32.41G pixels/s) {51.81 G pixels/s}
122. 55.55ms (30.20G pixels/s) {51.81 G pixels/s}
123. 52.12ms (32.19G pixels/s) {51.83 G pixels/s}
124. 52.53ms (31.94G pixels/s) {51.82 G pixels/s}
125. 52.11ms (32.20G pixels/s) {51.82 G pixels/s}
126. 52.70ms (31.84G pixels/s) {51.80 G pixels/s}
127. 52.58ms (31.91G pixels/s) {51.80 G pixels/s}
128. 52.37ms (32.04G pixels/s) {51.82 G pixels/s}
 
That's a mere 3 milliseconds saved between executing the CUDA and graphics kernels individually vs. "simultaneously".

I think it couldn't be any clearer that, when using CUDA, there is no concurrent execution at all. Not much of a surprise, though, since we already knew that a context switch is required in this setup.

It's also once again precisely the same 32 threads. Which is odd...
CUDA should definitely have made use of all available hardware queues.

Perhaps it actually did? That would mean that Maxwell v2 can only ever process up to 32 calls in parallel. That's a rather low limit. I had hoped it would be able to pull 32 items from each queue, but it looks more like it tops out at 32 items in total.



Even though we still need a multi-queue test for the Fury, I still suspect that the GCN architecture/driver is actually mapping hardware queues 1:1 onto software queues, so we still haven't gotten everything out of it yet.
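If someone wants to poke at the 32-items-in-total theory from the CUDA side, here's a minimal sketch (the spin kernel is a placeholder) that launches one tiny kernel per stream and times the whole batch; if the device really tops out at 32 concurrent items, the total time should jump once you go past 32 streams:
Code:
#include <cuda_runtime.h>
#include <cstdio>

// Busy-wait for a fixed number of GPU clock cycles.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

int main()
{
    const int kStreams = 64;                 // try 1, 8, 32, 64
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i)
        cudaStreamCreate(&streams[i]);

    cudaEvent_t begin, end;
    cudaEventCreate(&begin);
    cudaEventCreate(&end);

    cudaEventRecord(begin);                  // legacy default stream: acts as a barrier
    for (int i = 0; i < kStreams; ++i)
        spin<<<1, 32, 0, streams[i]>>>(1000000);  // one tiny block per stream
    cudaEventRecord(end);
    cudaEventSynchronize(end);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, begin, end);
    // Fully concurrent: total stays near the single-kernel time.
    // Serialized past N items: total grows in steps once kStreams > N.
    std::printf("%d streams: %.2f ms\n", kStreams, ms);

    for (int i = 0; i < kStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}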
 
Has anyone seen this?

This is a game being developed for the PS4 which uses async shaders extensively, with much of the graphics pipeline implemented in compute shaders.
http://fumufumu.q-games.com/archives/TheTechnologyOfTomorrowsChildrenFinal.pdf (Starting at page 163)

http://cdn3.dualshockers.com/wp-content/uploads/2014/09/TomorrowChildren33.jpg
Unfortunately the profiler is apparently specific to the PS4, as it seems to give quite a low-level look at what's going on at the CU and SIMD level.
 
So, all this fuss about async shaders is a little overrated? Also, about context switching and stuff: you guys just said that it's not about compute but about switching between compute and graphics... now Kollock (Oxide dev) said that engines could migrate to 100% compute (or at least that's what I read at Overclock.net, quoted by this guy mahigan)... Any thoughts on that?

As a more or less final question: could someone explain to me, in really, really simple words, what the actual outlook is for Fiji and Maxwell 2, future-proof-wise? (Will they be as future proof as my HD 5870, or like whatever high-end card of these past years?)

Thanks!

PS: Hello, Razor.


I wouldn't say it's overrated; premature, yes, but in this case it makes NV get off their butts and fix it. Engines won't migrate to 100% compute; they will do more compute in the future, but to what degree is up to the dev. In the near future I'd say they will pick the low-hanging fruit first, so I wouldn't expect more than a 50% increase in compute use, and they always have to look at the lowest common denominator, which in this case seems to be Maxwell 2; with NV holding the majority of the market share, they can't ignore that. If you are going to keep a card for three years, I would wait for the next-gen cards. A typical graphics card generation is now about 1.5 years, and since there was no node drop this generation, whatever NV and AMD could do was done through architectural design, and they had to make compromises, hence the drop of DP.
 
I didn't say it was irrelevant; I reread my post and I don't see it, do you? And a 50% increase from what they are using now, which is very little, still isn't going to make compute dominant in games. And how long has Maxwell 2 been out already? It's getting close to a year. AMD wasn't able to get their cards out fast enough, so they were effectively delayed from a market perspective. So I do expect both next-gen cards to come out between the end of Q1 and the end of Q2 next year.

And the drop of DP was just one of the compromises; I was using it as an example. I'm sure they would have liked to do more if they could, and would have if 20nm was viable; these GPUs weren't what was being planned on 20nm, where they would have had quite a bit more silicon and transistors available to them.

And let's discuss this in another thread if you like.
 
We have. Initial version, 128 queues, GCN crashed, Kepler/Maxwell basically serialized all of them. Which I'd say is what happens with graphics + compute queue anyway.
So, I ran this version, just to see what it'd look like. It's more or less the same, except a bit more erratic on the async test, and there are a few lines with numbers that don't add up (buffer underrun?):
Code:
190. 87.00ms (19.28G pixels/s) [26.26 26.26 26.26 26.26 26.26 26.26 26.26 26.26 26.26 26.26 26.26 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.27 27.28 27.28 27.28 27.28 27.28 27.28 32.96 32.96 32.96 32.96 32.96 32.96 32.96 32.96 32.96 34.35 34.35 34.35 34.35 34.35 34.35 34.35 34.35 34.35 41.23 41.23 52.63 52.63 53.51 53.51 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.66 53.68 53.68 53.68 53.68 53.68 59.08 59.15 59.15 59.36 59.36 59.36 59.36 59.36 59.36 60.53 60.53 60.75 60.75 60.75 60.75 60.75 60.75 60.75 67.17 67.63 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00 683212728696832.00] {24.77 G pixels/s}
 
That's a mere 3 milliseconds saved between executing the CUDA and graphics kernels individually vs. "simultaneously".

I think it couldn't be any clearer that, when using CUDA, there is no concurrent execution at all. Not much of a surprise, though, since we already knew that a context switch is required in this setup.

It's also once again precisely the same 32 threads. Which is odd...
CUDA should definitely have made use of all available hardware queues.

Perhaps it actually did? That would mean that Maxwell v2 can only ever process up to 32 calls in parallel. That's a rather low limit. I had hoped it would be able to pull 32 items from each queue, but it looks more like it tops out at 32 items in total.
Ext3h,
is it possible to reach this conclusion when one can see some strange behaviour on one of the sort-of-working DX12 980 Ti tests, where it's all over the place? As an example, the following test run only seems to perform correctly in places, and not at the beginning (check around 190), while behaving differently when the settings are subtly changed:
http://nubleh.github.io/async/#52
And same machine, TDR ON: http://nubleh.github.io/async/#53
PadyEOS could consistently see this trend re-running the tests, which is not how the majority of Maxwell 2 users' cards behave (even on the same driver).
Point being: could it be driver-related with the new DX12 CUDA + graphics test, and so too early to arrive at any conclusion?
Cheers
 
Ext3h,
is it possible to reach this conclusion when one can see some strange behaviour on one of the sort-of-working DX12 980 Ti tests, where it's all over the place? As an example, the following test run only seems to perform correctly in places, and not at the beginning (check around 190), while behaving differently when the settings are subtly changed:
http://nubleh.github.io/async/#52
And same machine, TDR ON: http://nubleh.github.io/async/#53
The discontinuity only affects the sequential run, doesn't it?
And no, turning TDR on/off isn't a subtle change at all. It means turning off a watchdog that actively tries to terminate excessively long-running calls. That's also a driver issue, not necessarily a hardware one.

Or do you mean the concurrent execution being partially slightly slower than the separated one? Not much of a surprise either; given the aggressive power management of recent GPU generations, you would need to force them to a fixed clock to get completely stable results.

And still, we could see a speedup of up to 10ms from executing graphics and compute kernels in parallel using pure DX12. With DX12 + CUDA, that's down to a mere 3ms, which sounds more like the speedup is only achieved by hiding some latencies in the driver, not by increasing hardware utilization.
 
GTX 960 355.82
CUDA only:
1. 4.46ms
2. 4.29ms
3. 4.30ms
4. 4.33ms
5. 4.69ms
6. 4.34ms
7. 4.37ms
8. 4.56ms
9. 4.41ms
10. 4.44ms
11. 4.51ms
12. 4.60ms
13. 4.50ms
14. 4.61ms
15. 4.62ms
16. 4.50ms
17. 4.51ms
18. 4.72ms
19. 4.54ms
20. 4.75ms
21. 4.80ms
22. 4.60ms
23. 4.76ms
24. 4.82ms
25. 4.65ms
26. 4.80ms
27. 4.86ms
28. 4.69ms
29. 4.85ms
30. 4.91ms
31. 4.73ms
32. 4.90ms
33. 9.03ms
34. 8.61ms
35. 8.40ms
36. 8.26ms
37. 8.42ms
38. 8.21ms
39. 8.22ms
40. 8.22ms
41. 8.26ms
42. 8.26ms
43. 8.30ms
44. 8.28ms
45. 8.29ms
46. 8.33ms
47. 8.39ms
48. 8.35ms
49. 8.37ms
50. 8.67ms
51. 9.09ms
52. 8.86ms
53. 8.86ms
54. 8.05ms
55. 11.43ms
56. 8.36ms
57. 8.23ms
58. 8.13ms
59. 8.13ms
60. 8.18ms
61. 8.20ms
62. 8.18ms
63. 8.21ms
64. 8.20ms
65. 11.38ms
66. 11.60ms
67. 11.77ms
68. 12.29ms
69. 12.27ms
70. 12.34ms
71. 11.76ms
72. 11.72ms
73. 11.93ms
74. 12.03ms
75. 11.92ms
76. 11.90ms
77. 11.92ms
78. 11.87ms
79. 11.75ms
80. 11.97ms
81. 11.88ms
82. 11.98ms
83. 11.89ms
84. 12.35ms
85. 12.64ms
86. 12.86ms
87. 12.76ms
88. 12.52ms
89. 12.51ms
90. 12.69ms
91. 12.71ms
92. 12.74ms
93. 12.73ms
94. 12.74ms
95. 12.92ms
96. 12.58ms
97. 15.59ms
98. 15.54ms
99. 15.65ms
100. 15.44ms
101. 15.92ms
102. 16.49ms
103. 16.69ms
104. 16.73ms
105. 16.41ms
106. 16.50ms
107. 16.65ms
108. 16.57ms
109. 16.76ms
110. 16.81ms
111. 16.62ms
112. 16.59ms
113. 16.55ms
114. 16.74ms
115. 16.74ms
116. 16.64ms
117. 16.56ms
118. 16.72ms
119. 17.16ms
120. 17.60ms
121. 17.52ms
122. 17.31ms
123. 17.18ms
124. 17.34ms
125. 17.49ms
126. 17.62ms
127. 17.43ms
128. 17.56ms
Graphics only: 39.78ms (42.18G pixels/s)
Graphics + CUDA:
1. 42.81ms (39.19G pixels/s) {42.70 G pixels/s}
2. 42.91ms (39.10G pixels/s) {42.70 G pixels/s}
3. 42.81ms (39.19G pixels/s) {42.70 G pixels/s}
4. 42.89ms (39.12G pixels/s) {42.70 G pixels/s}
5. 42.84ms (39.16G pixels/s) {42.70 G pixels/s}
6. 42.81ms (39.19G pixels/s) {42.70 G pixels/s}
7. 42.81ms (39.19G pixels/s) {42.70 G pixels/s}
8. 42.89ms (39.12G pixels/s) {42.70 G pixels/s}
9. 42.82ms (39.18G pixels/s) {42.71 G pixels/s}
10. 42.81ms (39.19G pixels/s) {42.70 G pixels/s}
11. 42.82ms (39.18G pixels/s) {42.70 G pixels/s}
12. 42.82ms (39.18G pixels/s) {42.69 G pixels/s}
13. 42.84ms (39.16G pixels/s) {42.70 G pixels/s}
14. 42.86ms (39.15G pixels/s) {42.70 G pixels/s}
15. 42.84ms (39.16G pixels/s) {42.70 G pixels/s}
16. 42.83ms (39.17G pixels/s) {42.70 G pixels/s}
17. 42.83ms (39.17G pixels/s) {42.70 G pixels/s}
18. 42.82ms (39.18G pixels/s) {42.70 G pixels/s}
19. 43.05ms (38.97G pixels/s) {42.70 G pixels/s}
20. 43.04ms (38.98G pixels/s) {42.70 G pixels/s}
21. 42.83ms (39.17G pixels/s) {42.70 G pixels/s}
22. 42.83ms (39.17G pixels/s) {42.70 G pixels/s}
23. 42.91ms (39.10G pixels/s) {42.70 G pixels/s}
24. 42.92ms (39.09G pixels/s) {42.70 G pixels/s}
25. 42.85ms (39.16G pixels/s) {42.70 G pixels/s}
26. 42.87ms (39.13G pixels/s) {42.70 G pixels/s}
27. 42.86ms (39.15G pixels/s) {42.71 G pixels/s}
28. 42.85ms (39.15G pixels/s) {42.70 G pixels/s}
29. 42.92ms (39.09G pixels/s) {42.70 G pixels/s}
30. 42.92ms (39.08G pixels/s) {42.70 G pixels/s}
31. 42.84ms (39.17G pixels/s) {42.70 G pixels/s}
32. 46.25ms (36.27G pixels/s) {42.70 G pixels/s}
33. 46.35ms (36.20G pixels/s) {42.70 G pixels/s}
34. 46.26ms (36.27G pixels/s) {42.70 G pixels/s}
35. 46.30ms (36.23G pixels/s) {42.70 G pixels/s}
36. 46.31ms (36.23G pixels/s) {42.69 G pixels/s}
37. 46.38ms (36.17G pixels/s) {42.70 G pixels/s}
38. 46.32ms (36.22G pixels/s) {42.70 G pixels/s}
39. 46.39ms (36.17G pixels/s) {42.65 G pixels/s}
40. 46.35ms (36.20G pixels/s) {42.70 G pixels/s}
41. 46.38ms (36.18G pixels/s) {42.70 G pixels/s}
42. 46.38ms (36.17G pixels/s) {42.70 G pixels/s}
43. 46.39ms (36.17G pixels/s) {42.70 G pixels/s}
44. 46.41ms (36.15G pixels/s) {42.70 G pixels/s}
45. 46.42ms (36.14G pixels/s) {42.70 G pixels/s}
46. 46.44ms (36.13G pixels/s) {42.70 G pixels/s}
47. 46.50ms (36.08G pixels/s) {42.70 G pixels/s}
48. 46.47ms (36.10G pixels/s) {42.70 G pixels/s}
49. 46.48ms (36.09G pixels/s) {42.69 G pixels/s}
50. 46.74ms (35.89G pixels/s) {42.70 G pixels/s}
51. 50.09ms (33.49G pixels/s) {42.67 G pixels/s}
52. 46.79ms (35.85G pixels/s) {42.68 G pixels/s}
53. 46.72ms (35.91G pixels/s) {42.67 G pixels/s}
54. 46.69ms (35.93G pixels/s) {42.70 G pixels/s}
55. 46.80ms (35.85G pixels/s) {42.70 G pixels/s}
56. 46.87ms (35.80G pixels/s) {42.68 G pixels/s}
57. 46.84ms (35.82G pixels/s) {42.68 G pixels/s}
58. 46.82ms (35.84G pixels/s) {42.70 G pixels/s}
59. 46.79ms (35.86G pixels/s) {42.70 G pixels/s}
60. 46.81ms (35.84G pixels/s) {42.70 G pixels/s}
61. 50.17ms (33.44G pixels/s) {42.70 G pixels/s}
62. 46.82ms (35.83G pixels/s) {42.70 G pixels/s}
63. 46.85ms (35.81G pixels/s) {42.69 G pixels/s}
64. 46.84ms (35.81G pixels/s) {42.70 G pixels/s}
65. 49.95ms (33.59G pixels/s) {42.70 G pixels/s}
66. 49.97ms (33.57G pixels/s) {42.68 G pixels/s}
67. 50.34ms (33.33G pixels/s) {42.70 G pixels/s}
68. 50.47ms (33.24G pixels/s) {42.67 G pixels/s}
69. 50.44ms (33.26G pixels/s) {42.66 G pixels/s}
70. 50.53ms (33.20G pixels/s) {42.67 G pixels/s}
71. 50.29ms (33.36G pixels/s) {42.68 G pixels/s}
72. 50.32ms (33.34G pixels/s) {42.70 G pixels/s}
73. 50.47ms (33.24G pixels/s) {42.68 G pixels/s}
74. 50.39ms (33.29G pixels/s) {42.67 G pixels/s}
75. 50.39ms (33.30G pixels/s) {42.68 G pixels/s}
76. 50.53ms (33.20G pixels/s) {42.68 G pixels/s}
77. 50.56ms (33.18G pixels/s) {42.69 G pixels/s}
78. 50.35ms (33.32G pixels/s) {42.68 G pixels/s}
79. 50.32ms (33.34G pixels/s) {42.70 G pixels/s}
80. 50.38ms (33.30G pixels/s) {42.68 G pixels/s}
81. 50.47ms (33.24G pixels/s) {42.70 G pixels/s}
82. 50.38ms (33.30G pixels/s) {42.68 G pixels/s}
83. 50.45ms (33.25G pixels/s) {42.70 G pixels/s}
84. 50.73ms (33.07G pixels/s) {42.68 G pixels/s}
85. 50.83ms (33.00G pixels/s) {42.67 G pixels/s}
86. 50.92ms (32.95G pixels/s) {42.66 G pixels/s}
87. 50.88ms (32.97G pixels/s) {42.66 G pixels/s}
88. 50.87ms (32.98G pixels/s) {42.67 G pixels/s}
89. 50.82ms (33.01G pixels/s) {42.68 G pixels/s}
90. 50.92ms (32.95G pixels/s) {42.66 G pixels/s}
91. 50.90ms (32.96G pixels/s) {42.67 G pixels/s}
92. 50.97ms (32.91G pixels/s) {42.66 G pixels/s}
93. 51.03ms (32.88G pixels/s) {42.67 G pixels/s}
94. 50.89ms (32.96G pixels/s) {42.67 G pixels/s}
95. 50.95ms (32.93G pixels/s) {42.66 G pixels/s}
96. 50.93ms (32.94G pixels/s) {42.69 G pixels/s}
97. 53.92ms (31.12G pixels/s) {42.69 G pixels/s}
98. 53.98ms (31.08G pixels/s) {42.70 G pixels/s}
99. 53.94ms (31.10G pixels/s) {42.68 G pixels/s}
100. 53.86ms (31.15G pixels/s) {42.70 G pixels/s}
101. 54.30ms (30.90G pixels/s) {42.70 G pixels/s}
102. 54.42ms (30.83G pixels/s) {42.66 G pixels/s}
103. 54.45ms (30.81G pixels/s) {42.64 G pixels/s}
104. 54.37ms (30.86G pixels/s) {42.64 G pixels/s}
105. 54.29ms (30.91G pixels/s) {42.65 G pixels/s}
106. 54.44ms (30.82G pixels/s) {42.67 G pixels/s}
107. 54.41ms (30.83G pixels/s) {42.65 G pixels/s}
108. 54.41ms (30.83G pixels/s) {42.66 G pixels/s}
109. 54.59ms (30.74G pixels/s) {42.66 G pixels/s}
110. 54.44ms (30.82G pixels/s) {42.65 G pixels/s}
111. 54.46ms (30.80G pixels/s) {42.66 G pixels/s}
112. 54.41ms (30.84G pixels/s) {42.62 G pixels/s}
113. 54.52ms (30.77G pixels/s) {42.67 G pixels/s}
114. 54.60ms (30.73G pixels/s) {42.66 G pixels/s}
115. 54.56ms (30.75G pixels/s) {42.65 G pixels/s}
116. 54.58ms (30.74G pixels/s) {42.67 G pixels/s}
117. 54.56ms (30.75G pixels/s) {42.67 G pixels/s}
118. 54.68ms (30.68G pixels/s) {42.68 G pixels/s}
119. 58.16ms (28.85G pixels/s) {42.67 G pixels/s}
120. 54.94ms (30.54G pixels/s) {42.63 G pixels/s}
121. 54.87ms (30.58G pixels/s) {42.63 G pixels/s}
122. 54.79ms (30.62G pixels/s) {42.64 G pixels/s}
123. 54.89ms (30.57G pixels/s) {42.66 G pixels/s}
124. 54.99ms (30.51G pixels/s) {42.65 G pixels/s}
125. 54.98ms (30.52G pixels/s) {42.65 G pixels/s}
126. 54.98ms (30.52G pixels/s) {42.63 G pixels/s}
127. 58.28ms (28.79G pixels/s) {42.64 G pixels/s}
128. 55.10ms (30.45G pixels/s) {42.64 G pixels/s}
 