DX12 Performance Discussion And Analysis Thread

Maybe I'm reading the graphs wrong...are the GCN cards doing better with a single commandlist?
Yes, they (except the 7950) are.

I put my general thoughts in an earlier post, but if you look at the numbers for, say, the Fury X, you'd see that the single-commandlist time starts equal to graphics-only + compute-only and then increases by one compute-only increment every 65 invocations added (1, 66, 131, ...), while the 'async' version starts equal to max(graphics-only, compute-only) and then increments every 30 invocations (1, 31, 61, ...). The 7950 data, on the other hand, keeps the 65-invocation scaling in both modes, so async always ends up better by a constant factor.
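To make that arithmetic explicit, here's a tiny model of the pattern described above. Everything in it is read off the posted charts or simply assumed (the placeholder G/C times and, in particular, the size of each async step), so treat it as a sketch of the stepping behaviour, not of the hardware:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Placeholder graphics-only and compute-only times (ms); the real values
// come from each card's log.
const double G = 25.0;
const double C = 30.0;

// Single command list: starts at G + C, gains one compute-only increment
// every 65 dispatches (1, 66, 131, ...), as described for the Fury X.
double singleList(int n)
{
    return G + C * std::ceil(n / 65.0);
}

// Separate DIRECT + COMPUTE queues: starts at max(G, C), then steps every
// 30 dispatches (1, 31, 61, ...). The size of each step is assumed to be
// another compute-only pass here, which the posted data doesn't pin down.
double asyncQueues(int n)
{
    return std::max(G, C) + C * (std::ceil(n / 30.0) - 1.0);
}

int main()
{
    for (int n : {1, 31, 66, 131})
        std::printf("n = %3d  single list = %6.1f ms  async = %6.1f ms\n",
                    n, singleList(n), asyncQueues(n));
    return 0;
}
```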
 
Could the AoS bench/game be benefitting from this hardware (or, I guess, firmware/driver) optimization somehow? Seems like for a naive implementation of async that would give a fair advantage in the general case. Of course, I'm sort of talking out of my ass right now--I have next to zero understanding of either nvidia's or amd's async architecture.
 
So Titan X SLI can do compute + graphics async faster than compute + graphics serial, from this app. But single GPU (both 980Ti results the same thus far) cannot.
With two GPUs in SLI, each GPU can execute its own queue independently of the other, and since there are no dependencies between the two, that's an ideal outcome anyway.
 
It's not the Titan SLI; the selected/highlighted result is the 290X.
This is, indeed, correct. It was my mistake for not making the selected item stand out more. The position of the tooltip is also misleading if you're not actually using the tool; the data is shown for whichever bar you have your mouse cursor on.
I usually make things to work for myself, so I tend to neglect how misleading they can be for other users.

I've fixed that, and I've also added a label at the very top of the page showing what value the y-axis represents.
Again, here's the link, for the convenience of not having to open previous pages to look for it.
 
Could the AoS bench/game be benefitting from this hardware (or, I guess, firmware/driver) optimization somehow? Seems like for a naive implementation of async that would give a fair advantage in the general case. Of course, I'm sort of talking out of my ass right now--I have next to zero understanding of either nvidia's or amd's async architecture.
I doubt it has an amazing effect on most practical D3D12 programs, if only because nVidia's driver/hw stack isn't doing anything special, so anything relying on that optimization would run horribly there. I would guess that it's probably a 'common' low-level optimization that has a better effect on D3D11/OpenGL/etc.
 
Again, what is the expected result?

ie.
25ms Graphics
25ms Compute

With Async Compute enabled, the combined Graphics + Compute task should be completed in.... ? 25ms?

With Async Compute disabled, the combined Graphics + Compute task should be completed in....? 50ms?
 
Hello, this is my first post in this forum.
The attached file is a log of the Async Compute test written by MDolenc, run on my Fury X with Catalyst 15.8b.

Thanks to you people for making a great investigation into this subject.

(My English is poor, so I don't usually post in English-language forums such as this one, but I really wanted to say thank you many times.)
 

Attachments

  • FuryX(15.8b).zip
    107.9 KB · Views: 19
Ok so I'm looking at the several results being posted from MDolenc's tool and here's what I'm seeing:

1 - All GCN chips seem to present an almost flat compute time of 50-60ms, regardless of the number of compute kernels and of whether the rendering task is enabled or not.

2 - a) All nVidia chips seem to present a time that increases with the increase of compute kernels. Maxwell chips show lower compute times than Kepler chips.
b) If rendering task time is X for an nVidia chip and compute time for a given number n of kernels is Y(n), then the "async compute" time for all nVidia chips seems to be very close to Y(n)+X.




If nVidia's chips need to add the rendering time to a compute task with even one active kernel, doesn't this mean that "Async Compute" is not actually working and nVidia's hardware, at least in this test, does not seem to support Async Compute? Even if the driver does allow Async Compute tasks to be done, the hardware just seems to be doing rendering+compute in a serial fashion and not parallel at all.

This is the most logical interpretation. An async task would not exhibit a time to completion that's near the sum of the two tasks run separately. That indicates NO async mode is functional and it's defaulting to serial operation.

Since I'm good at car analogies, let's do that.

2 Cars are on the road, let's call them Car 1 (Compute) and Car 2 (Graphics). Both cars are trying to go from A -> B.

The time it takes for Car 1 to travel the journey is 1 hour. The time it takes for Car 2 to travel the journey is 2 hours.

The question is, how long does it take for both Cars to reach destination B?

1. Both Cars can travel on the road together, simultaneously, starting at the same time: 2 hours.
2. Only ONE Car can be on the road at once, so Car 1 goes first (order doesn't matter), finishes, then Car 2 starts. Thus, both Cars reach their destination in: 3 hours.

Minor variations aside, that should be the expected behavior, correct? #1 would therefore be Async Mode, and #2 is not.
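Putting the same expectation into numbers (a trivial sketch, using nothing but the 25 ms figures from the question above):

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    // Example numbers from the question: 25 ms of graphics, 25 ms of compute.
    const double graphicsMs = 25.0;
    const double computeMs  = 25.0;

    // With overlap (async) the pair should finish in roughly the longer of
    // the two; without it, in the sum of the two.
    std::printf("async (overlapped): ~%.0f ms\n", std::max(graphicsMs, computeMs)); // ~25 ms
    std::printf("serial:             ~%.0f ms\n", graphicsMs + computeMs);          // ~50 ms
    return 0;
}
```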
 
First post here, I was curious on this matter so I ran both on my spare and main card.

Just sharing results if it might prove useful.

AsyncCompute
written by MDolenc

7950 Catalyst 15.8b

Graphics only: 57.37ms (29.24G pixels/s)
Graphics + compute: 238.70ms (7.03G pixels/s)
Graphics, compute single commandlist: 295.77ms (5.67G pixels/s)

980 Forceware 355.82

Graphics only: 23.23ms (72.21G pixels/s)
Graphics + compute: 103.58ms (16.20G pixels/s)
Graphics, compute single commandlist: 2433.35ms (0.69G pixels/s)
 

Attachments

  • 7950_980_results.zip
    130.6 KB · Views: 9
Ok here's the updated version. This one ought to be freaking interesting based on what I get on Kepler (not sharing any spoilers though :)).
I've made the compute shader half the length (but it now goes up to 512 dispatches) and I've added GPU timestamps (though oddly, on Kepler they only work on the DIRECT queue and not on the COMPUTE one). And there's a new mode...
Can we try running bandwidth heavy compute in parallel with computation heavy compute?
 
First post here, I was curious on this matter so I ran both on my spare and main card.

Just sharing results if it might prove useful.

AsyncCompute
written by MDolenc

7950 Catalyst 15.8b

Graphics only: 57.37ms (29.24G pixels/s)
Graphics + compute: 238.70ms (7.03G pixels/s)
Graphics, compute single commandlist: 295.77ms (5.67G pixels/s)

980 Forceware 355.82

Graphics only: 23.23ms (72.21G pixels/s)
Graphics + compute: 103.58ms (16.20G pixels/s)
Graphics, compute single commandlist: 2433.35ms (0.69G pixels/s)


Thanks. To analyze this we need the compute-only times. From your data:

1st Kernel

980:

Compute only: 5.33ms
Graphics only: 23.23ms
Graphics + compute: 27.68ms

7950:

Compute only: 29.91ms
Graphics only: 57.37ms
Graphics + compute: 57.41ms

64th Kernel:

980:

Compute only: 14.33ms
Graphics only: 23.23ms
Graphics + compute: 37.32ms

7950:

Compute only: 29.91ms
Graphics only: 57.37ms
Graphics + compute: 57.35ms


128th Kernel

980:

Compute only: 28.70ms
Graphics only: 23.23ms
Graphics + compute: 46.62ms (The 129th Kernel jumps to 51.58ms)


7950:

Compute only: 59.72ms
Graphics only: 57.37ms
Graphics + compute: 60.01ms
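One hedged way to condense those figures into a single number: estimate how much of the shorter task is hidden when the two run together (1.0 = fully hidden, ~0 = purely serial). The function and the metric are my own, not part of MDolenc's tool; the inputs are the 64-kernel entries quoted above:

```cpp
#include <algorithm>
#include <cstdio>

// Fraction of the shorter task's time that disappears when graphics and
// compute run together. This is an ad-hoc metric, not something the test
// itself reports.
double overlap(double computeMs, double graphicsMs, double combinedMs)
{
    const double hidden = computeMs + graphicsMs - combinedMs;
    return hidden / std::min(computeMs, graphicsMs);
}

int main()
{
    std::printf("980  @ 64 kernels: %.2f\n", overlap(14.33, 23.23, 37.32)); // ~0.02 -> effectively serial
    std::printf("7950 @ 64 kernels: %.2f\n", overlap(29.91, 57.37, 57.35)); // ~1.00 -> compute fully hidden
    return 0;
}
```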
 
This is the most logical interpretation. An async task would not exhibit a time to completion that's near the sum of the two tasks run separately. That indicates NO async mode is functional and it's defaulting to serial operation.

Since I'm good at car analogies, let's do that.

2 Cars are on the road, let's call them Car 1 (Compute) and Car 2 (Graphics). Both cars are trying to go from A -> B.

The time it takes for Car 1 to travel the journey is 1 hour. The time it takes for Car 2 to travel the journey is 2 hours.

The question is, how long does it take for both Cars to reach destination B?

1. Both Cars can travel on the road together, simultaneously, starting at the same time: 2 hours.
2. Only ONE Car can be on the road at once, so Car 1 goes first (order doesn't matter), finishes, then Car 2 starts. Thus, both Cars reach their destination in: 3 hours.

Minor variations aside, that should be the expected behavior, correct? #1 would therefore be Async Mode, and #2 is not.


That would be the case for the results we're getting with MDolenc's tests, yes. (BTW, that's a "highway lanes" analogy, not a cars analogy ;) )

Basically, if (graphics+compute) time = (graphics time) + (compute time), then at least with this code the hardware isn't running Async Compute.
And that's what we're seeing with both Kepler+Maxwell 1 (which do not support Async Compute by nVidia's own spec) and Maxwell 2.

As far as I can see, there are 3 very odd things with the results so far:

1 - Maxwell 2 isn't doing Async Compute in this test. Pretty much all results are showing that.
Razor1 pointed to someone with two Titan Xs seemingly being able to do async, but it seems the driver is just cleverly sending the render to one card and the compute to the other (which for PhysX is actually something you could toggle in the driver since G80, so the capability has been there for many years). Of course, if you're using two Maxwell cards in SLI in the typical Alternate Frame Rendering mode, this "feature" will be useless because both cards are rendering. The same thing will happen for a VR implementation where each card renders one eye.

2 - Forcing "no Async" in the test (single command queue) makes nVidia chips serialize everything. This means that the last test, with rendering + 512 kernels, will take the render time + 512 x (compute time of 1 kernel). That's why the test times end up ballooning, which eventually crashes the display driver (see the sketch after this list).


3 - Forcing "no Async" makes GCN 1.1 chips do some very weird stuff (perhaps the driver is recognizing a pattern and skipping some calculations, as suggested before?). GCN 1.0, like Tahiti in the 7950, is behaving like it "should": (compute[n] + render) time = compute[n] time + render time.
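For reference, here's roughly what the two submission shapes look like in D3D12. This is only a sketch of the pattern being compared, not MDolenc's actual code; the queues and pre-recorded command lists are assumed to already exist:

```cpp
#include <d3d12.h>

// (a) "Async" path: graphics goes to the DIRECT queue, compute to a separate
// COMPUTE queue, leaving the GPU free to overlap the two workloads.
void SubmitAsync(ID3D12CommandQueue* directQueue, ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList* graphicsList, ID3D12CommandList* computeList)
{
    directQueue->ExecuteCommandLists(1, &graphicsList);
    computeQueue->ExecuteCommandLists(1, &computeList);
}

// (b) Single-command-list path: the draw and all the dispatches are recorded
// back to back in one list and submitted to the DIRECT queue alone. Nothing
// here forces overlap, so an implementation that serializes pays
// graphics time + N x dispatch time, which is the ballooning noted in point 2.
void SubmitSingleList(ID3D12CommandQueue* directQueue, ID3D12CommandList* combinedList)
{
    directQueue->ExecuteCommandLists(1, &combinedList);
}
```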




Graphics + compute: 238.70ms (7.03G pixels/s)
That's not what your performance log shows... That's the time for 512 kernels in pure compute mode.
 
I'm struggling to see how NVidia is failing by any sensible metric when graphics + compute completes in 92ms on the GTX 980 Ti and 444ms on the Fury X. Or compute only, which is 76ms versus 468ms. AMD, whatever it's doing, is just broken.

Or maybe Fiji is just spending 25.9ms sleeping, then waking up momentarily to execute a kernel that should take about 8 microseconds.

At least we're seeing some steps on AMD.

3dilettante: wouldn't it be interesting if active TDR is slowing down these tests...
Regardless of the actual time itself, NVIDIA is failing in the sense that they can't hide anything with async compute, while AMD can hide most, and often even all, of the graphics latency with async compute.
From one of the earlier results:
980 Ti: Compute ~10ms, Graphics ~18ms, Compute + Graphics ~28ms
Fury X (or 390X, not sure since I'm copy-pasting): Compute ~50ms, Graphics ~25ms, Compute + Graphics ~50-60ms
See the difference?
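In other words, using the quoted figures: on the 980 Ti the serial sum is 10 + 18 = 28 ms and the combined run measures ~28 ms, so nothing is being hidden; on the Fury X, max(50, 25) = 50 ms and the combined run measures ~50-60 ms, so most or all of the 25 ms of graphics work is hidden behind the compute.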
 
It doesn't matter what the performance difference is: if there is even a small amount of work running asynchronously, it is functional. It's not about the end performance of the different IHVs, it's about whether it is capable or not, and it is capable. The serial path should always take the same time or longer than the asynchronous one if the variables are the same and the processor is being given enough work.
 
Radeon hd 7850 ))

Compute only:
1. 64.14ms
2. 64.14ms
3. 64.15ms
4. 64.15ms
5. 64.15ms
6. 64.15ms
7. 64.14ms
8. 64.14ms
9. 64.14ms
10. 64.17ms
11. 64.14ms
12. 64.16ms
13. 64.15ms
14. 64.14ms
15. 64.14ms
16. 64.15ms
17. 64.15ms
18. 64.15ms
19. 64.14ms
20. 64.15ms
21. 64.15ms
22. 64.15ms
23. 64.14ms
24. 64.14ms
25. 64.14ms
26. 64.14ms
27. 64.15ms
28. 64.14ms
29. 64.14ms
30. 64.15ms
31. 64.14ms
32. 64.16ms
33. 64.16ms
34. 64.15ms
35. 64.14ms
36. 64.14ms
37. 64.16ms
38. 64.14ms
39. 64.15ms
40. 64.15ms
41. 64.17ms
42. 64.15ms
43. 64.15ms
44. 64.14ms
45. 64.15ms
46. 64.14ms
47. 64.14ms
48. 64.14ms
49. 64.15ms
50. 64.14ms
51. 64.15ms
52. 64.14ms
53. 64.14ms
54. 64.14ms
55. 64.14ms
56. 64.14ms
57. 64.13ms
58. 64.15ms
59. 64.14ms
60. 64.15ms
61. 64.14ms
62. 64.15ms
63. 64.15ms
64. 64.16ms
65. 64.15ms
66. 64.16ms
67. 64.15ms
68. 64.16ms
69. 64.14ms
70. 64.15ms
71. 64.16ms
72. 64.15ms
73. 64.15ms
74. 64.16ms
75. 64.14ms
76. 64.14ms
77. 64.14ms
78. 64.15ms
79. 64.14ms
80. 64.14ms
81. 64.14ms
82. 64.14ms
83. 64.15ms
84. 64.14ms
85. 64.14ms
86. 64.14ms
87. 64.15ms
88. 64.15ms
89. 64.14ms
90. 64.14ms
91. 64.15ms
92. 64.15ms
93. 64.14ms
94. 64.15ms
95. 64.15ms
96. 64.15ms
97. 64.15ms
98. 64.16ms
99. 64.15ms
100. 64.14ms
101. 64.16ms
102. 64.15ms
103. 64.15ms
104. 64.14ms
105. 64.16ms
106. 64.23ms
107. 64.20ms
108. 64.15ms
109. 64.15ms
110. 64.21ms
111. 83.74ms
112. 64.14ms
113. 64.14ms
114. 64.15ms
115. 64.14ms
116. 64.14ms
117. 64.16ms
118. 64.16ms
119. 64.18ms
120. 64.19ms
121. 64.15ms
122. 64.24ms
123. 74.36ms
124. 64.16ms
125. 64.15ms
126. 64.15ms
127. 64.14ms
128. 64.15ms
Graphics only: 61.90ms (27.10G pixels/s)
Graphics + compute:
1. 64.48ms (26.02G pixels/s)
2. 64.53ms (26.00G pixels/s)
3. 64.53ms (26.00G pixels/s)
4. 64.55ms (25.99G pixels/s)
5. 64.56ms (25.99G pixels/s)
6. 64.57ms (25.98G pixels/s)
7. 64.57ms (25.98G pixels/s)
8. 64.63ms (25.96G pixels/s)
9. 64.59ms (25.98G pixels/s)
10. 64.52ms (26.00G pixels/s)
11. 64.62ms (25.96G pixels/s)
12. 64.59ms (25.98G pixels/s)
13. 64.57ms (25.98G pixels/s)
14. 64.54ms (25.99G pixels/s)
15. 64.53ms (26.00G pixels/s)
16. 64.57ms (25.98G pixels/s)
17. 64.55ms (25.99G pixels/s)
18. 64.52ms (26.00G pixels/s)
19. 64.51ms (26.01G pixels/s)
20. 64.54ms (26.00G pixels/s)
21. 64.53ms (26.00G pixels/s)
22. 64.56ms (25.99G pixels/s)
23. 64.64ms (25.96G pixels/s)
24. 64.53ms (26.00G pixels/s)
25. 64.64ms (25.96G pixels/s)
26. 64.65ms (25.95G pixels/s)
27. 64.61ms (25.96G pixels/s)
28. 64.58ms (25.98G pixels/s)
29. 64.58ms (25.98G pixels/s)
30. 68.83ms (24.38G pixels/s)
31. 64.59ms (25.97G pixels/s)
32. 64.62ms (25.96G pixels/s)
33. 64.62ms (25.96G pixels/s)
34. 64.59ms (25.97G pixels/s)
35. 64.59ms (25.97G pixels/s)
36. 64.58ms (25.98G pixels/s)
37. 64.59ms (25.98G pixels/s)
38. 64.60ms (25.97G pixels/s)
39. 64.59ms (25.98G pixels/s)
40. 64.64ms (25.96G pixels/s)
41. 64.57ms (25.98G pixels/s)
42. 64.62ms (25.96G pixels/s)
43. 64.67ms (25.94G pixels/s)
44. 64.60ms (25.97G pixels/s)
45. 64.64ms (25.96G pixels/s)
46. 64.62ms (25.96G pixels/s)
47. 64.62ms (25.96G pixels/s)
48. 64.65ms (25.95G pixels/s)
49. 64.61ms (25.97G pixels/s)
50. 64.61ms (25.97G pixels/s)
51. 64.64ms (25.95G pixels/s)
52. 64.61ms (25.97G pixels/s)
53. 64.57ms (25.98G pixels/s)
54. 64.63ms (25.96G pixels/s)
55. 64.61ms (25.97G pixels/s)
56. 64.66ms (25.95G pixels/s)
57. 64.59ms (25.98G pixels/s)
58. 64.65ms (25.95G pixels/s)
59. 97.11ms (17.28G pixels/s)
60. 64.60ms (25.97G pixels/s)
61. 68.71ms (24.42G pixels/s)
62. 64.59ms (25.97G pixels/s)
63. 64.63ms (25.96G pixels/s)
64. 64.61ms (25.97G pixels/s)
65. 64.84ms (25.88G pixels/s)
66. 64.92ms (25.84G pixels/s)
67. 64.95ms (25.83G pixels/s)
68. 64.97ms (25.82G pixels/s)
69. 65.01ms (25.81G pixels/s)
70. 64.92ms (25.84G pixels/s)
71. 64.96ms (25.83G pixels/s)
72. 64.94ms (25.84G pixels/s)
73. 64.93ms (25.84G pixels/s)
74. 65.07ms (25.78G pixels/s)
75. 64.91ms (25.85G pixels/s)
76. 64.95ms (25.83G pixels/s)
77. 64.93ms (25.84G pixels/s)
78. 64.94ms (25.84G pixels/s)
79. 65.00ms (25.81G pixels/s)
80. 65.02ms (25.80G pixels/s)
81. 65.01ms (25.81G pixels/s)
82. 65.01ms (25.81G pixels/s)
83. 64.99ms (25.82G pixels/s)
84. 65.14ms (25.76G pixels/s)
85. 65.04ms (25.80G pixels/s)
86. 65.08ms (25.78G pixels/s)
87. 65.08ms (25.78G pixels/s)
88. 64.99ms (25.82G pixels/s)
89. 64.97ms (25.82G pixels/s)
90. 64.97ms (25.82G pixels/s)
91. 64.94ms (25.84G pixels/s)
92. 64.94ms (25.83G pixels/s)
93. 64.96ms (25.83G pixels/s)
94. 64.98ms (25.82G pixels/s)
95. 65.07ms (25.78G pixels/s)
96. 65.07ms (25.78G pixels/s)
97. 64.99ms (25.81G pixels/s)
98. 65.01ms (25.81G pixels/s)
99. 65.01ms (25.81G pixels/s)
100. 64.96ms (25.83G pixels/s)
101. 65.07ms (25.78G pixels/s)
102. 65.19ms (25.74G pixels/s)
103. 65.24ms (25.72G pixels/s)
104. 65.21ms (25.73G pixels/s)
105. 65.21ms (25.73G pixels/s)
106. 65.22ms (25.72G pixels/s)
107. 65.25ms (25.71G pixels/s)
108. 85.70ms (19.58G pixels/s)
109. 65.13ms (25.76G pixels/s)
110. 65.14ms (25.76G pixels/s)
111. 65.23ms (25.72G pixels/s)
112. 65.10ms (25.77G pixels/s)
113. 65.30ms (25.69G pixels/s)
114. 65.31ms (25.69G pixels/s)
115. 65.05ms (25.79G pixels/s)
116. 65.22ms (25.72G pixels/s)
117. 65.14ms (25.76G pixels/s)
118. 65.13ms (25.76G pixels/s)
119. 65.07ms (25.78G pixels/s)
120. 65.29ms (25.70G pixels/s)
121. 65.17ms (25.74G pixels/s)
122. 65.19ms (25.74G pixels/s)
123. 65.06ms (25.79G pixels/s)
124. 65.13ms (25.76G pixels/s)
125. 65.10ms (25.77G pixels/s)
126. 64.98ms (25.82G pixels/s)
127. 65.13ms (25.76G pixels/s)
128. 65.02ms (25.80G pixels/s)
 
It doesn't matter what the performance difference is: if there is even a small amount of work running asynchronously, it is functional. It's not about the end performance of the different IHVs, it's about whether it is capable or not, and it is capable. The serial path should always take the same time or longer than the asynchronous one if the variables are the same and the processor is being given enough work.

It's not functional. I looked through the results graphed up by Nub for Kepler & Maxwell over all the kernels; there ARE specific instances where async mode is slightly faster or slower, but it evens out to be almost the SAME as the sum of compute & graphics run individually.

If anyone is trying to compare the times (ms) for compute/graphics, etc., and link a performance comparison to GCN vs Maxwell or whatever, please note that there are already plenty of DX11 games that use compute. If GCN really had a 50ms latency for compute (as indicated by the program), it would mean those games could not exceed 20fps. Thus, this particular program is not to be used as a performance comparison. But I believe this point was raised earlier already by others.

Now, the conclusion from this program is that either Async Compute is not functional, i.e. there is no simultaneous execution of graphics + compute (and none of the big performance gains that come with it), or the program is wrong and Maxwell 2 can indeed do parallel graphics + compute, i.e. functional AC.
 
Why don't you explain, then, why the difference is there on all Maxwell 2 cards in favor of async compute vs. serial, and why it is always faster by a few microseconds? I have looked at the majority of the reports too. If you have as well, compile the data and link it as a spreadsheet.

I don't think this program was made for performance comparisons, so take that out of the equation right now.
 