DX12 Performance Discussion And Analysis Thread

But isn't Async Compute supposed to be a feature where the GPU does not need to juggle between two distinct modes (graphics/compute)?
My interpretation is that there is no software obligation that the asynchronous compute queue's commands be subject to the sequential ordering of the graphics/synchronous queue, barring explicit barriers. That's not the same as the GPU being obligated to actually process them physically in a concurrent manner.

That's part of why I was curious about yielding or stalls: those are cases where it might still be possible to get some asynchronous behavior.

Sort of like this:
AC queue: A B C D E F G
Graphics: a b c d *maybe stall* e f g

The actual sequence might be abcdABCDEFGefg, abcdefgABCDEFG, or if the compute portion has some level of preemption abcdABC*preempt*efgEFG.
Even if they cannot process concurrently, the compute queue's command processing relative to the graphics domain is not fixed.
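
To make that concrete, here's a minimal D3D12 sketch of the submission side (illustrative only, not MDolenc's tool; the device, graphics queue, pre-recorded command lists, fence and event are all assumed to exist already). Nothing in it forces the compute submission to run before, after, or alongside the graphics one until the final fence wait:

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical helper, not MDolenc's code. device, graphicsQueue, the two
// pre-recorded command lists, the fence and the event are assumed to exist.
void SubmitBothQueues(ID3D12Device* device,
                      ID3D12CommandQueue* graphicsQueue,
                      ID3D12GraphicsCommandList* graphicsList,
                      ID3D12GraphicsCommandList* computeList,
                      ID3D12Fence* fence, UINT64& fenceValue, HANDLE fenceEvent)
{
    // A dedicated COMPUTE queue, separate from the DIRECT (graphics) queue.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    // Submit "a b c d e f g" to graphics and "A B C D E F G" to compute.
    // The API only orders commands within each queue; it says nothing about
    // how the GPU interleaves the two queues relative to each other.
    ID3D12CommandList* gfx[] = { graphicsList };
    ID3D12CommandList* cmp[] = { computeList };
    graphicsQueue->ExecuteCommandLists(1, gfx);
    computeQueue->ExecuteCommandLists(1, cmp);

    // The only explicit ordering constraint here: the CPU waits until the
    // compute queue has drained. Concurrent, interleaved, or strictly serial
    // execution are all legal ways for the GPU to satisfy this.
    computeQueue->Signal(fence, ++fenceValue);
    fence->SetEventOnCompletion(fenceValue, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
}
```

Whether that turns into abcdABCDEFGefg, abcdefgABCDEFG, or genuine overlap is then a hardware/driver decision, which is exactly what the benchmark is trying to observe.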


So now we have a second test pointing to what both the Oxide employee and AMD_Robert claimed?
I'd still prefer more data.
The context switch requirement prior to Maxwell 2 is known.
http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading
What exactly Maxwell 2 has for a policy is not yet clear.
It may be that it has a more limited subset of options relative to GCN for some kind of switching based on specific events. But if it architecturally has not split compute as heavily from graphics, a solidly running graphics context might still have priority, or a lack of preemption might make the benchmark's graphics portion a solid block that the GPU cannot interleave work with.
 
Well... That's interesting... Found a brand new behaviour on my GTX 680... Will post a new version a bit later; I still want to implement GPU timestamps, which could better indicate what's going on on GCN.
 
GTX 980TI 355.82
Compute only:
1. 11.39ms
2. 11.40ms
3. 10.41ms
4. 10.23ms
5. 10.26ms
6. 10.26ms
7. 9.79ms
8. 9.65ms
9. 9.66ms
10. 9.69ms
11. 9.66ms
12. 9.66ms
13. 9.66ms
14. 9.66ms
15. 9.66ms
16. 10.25ms
17. 9.72ms
18. 9.68ms
19. 9.67ms
20. 9.68ms
21. 9.68ms
22. 9.67ms
23. 9.67ms
24. 9.67ms
25. 9.69ms
26. 9.74ms
27. 9.69ms
28. 9.71ms
29. 9.71ms
30. 9.73ms
31. 9.71ms
32. 19.21ms
33. 21.62ms
34. 24.04ms
35. 24.09ms
36. 24.14ms
37. 21.65ms
38. 24.05ms
39. 21.65ms
40. 19.25ms
41. 24.10ms
42. 21.67ms
43. 19.28ms
44. 21.67ms
45. 21.67ms
46. 21.71ms
47. 24.37ms
48. 21.69ms
49. 21.67ms
50. 19.27ms
51. 21.68ms
52. 21.71ms
53. 19.33ms
54. 19.29ms
55. 21.69ms
56. 19.30ms
57. 21.69ms
58. 24.15ms
59. 21.76ms
60. 19.31ms
61. 21.69ms
62. 21.72ms
63. 21.73ms
64. 31.27ms
65. 28.85ms
66. 28.86ms
67. 28.90ms
68. 38.61ms
69. 28.91ms
70. 31.30ms
71. 28.92ms
72. 33.71ms
73. 28.91ms
74. 28.91ms
75. 28.90ms
76. 43.42ms
77. 33.70ms
78. 31.29ms
79. 36.15ms
80. 28.90ms
81. 28.90ms
82. 28.90ms
83. 28.98ms
84. 41.14ms
85. 28.92ms
86. 28.92ms
87. 33.79ms
88. 31.29ms
89. 33.70ms
90. 28.92ms
91. 31.44ms
92. 28.99ms
93. 28.91ms
94. 28.91ms
95. 31.40ms
96. 38.43ms
97. 43.23ms
98. 43.32ms
99. 40.93ms
100. 40.85ms
101. 38.56ms
102. 45.68ms
103. 40.86ms
104. 48.15ms
105. 45.70ms
106. 38.49ms
107. 45.72ms
108. 40.87ms
109. 40.87ms
110. 43.48ms
111. 38.49ms
112. 40.88ms
113. 43.41ms
114. 43.29ms
115. 43.30ms
116. 43.62ms
117. 43.32ms
118. 40.90ms
119. 48.14ms
120. 40.89ms
121. 38.49ms
122. 43.40ms
123. 43.29ms
124. 45.74ms
125. 40.94ms
126. 40.89ms
127. 45.72ms
128. 52.88ms
Graphics only: 17.26ms (97.19G pixels/s)
Graphics + compute:
1. 26.69ms (62.86G pixels/s)
2. 26.83ms (62.53G pixels/s)
3. 26.76ms (62.69G pixels/s)
4. 26.79ms (62.62G pixels/s)
5. 26.74ms (62.75G pixels/s)
6. 26.80ms (62.60G pixels/s)
7. 26.88ms (62.41G pixels/s)
8. 26.83ms (62.54G pixels/s)
9. 26.80ms (62.61G pixels/s)
10. 26.76ms (62.69G pixels/s)
11. 26.84ms (62.50G pixels/s)
12. 26.75ms (62.72G pixels/s)
13. 26.82ms (62.56G pixels/s)
14. 26.83ms (62.54G pixels/s)
15. 26.76ms (62.69G pixels/s)
16. 26.94ms (62.27G pixels/s)
17. 26.84ms (62.50G pixels/s)
18. 26.76ms (62.69G pixels/s)
19. 26.79ms (62.62G pixels/s)
20. 26.83ms (62.54G pixels/s)
21. 26.82ms (62.56G pixels/s)
22. 26.81ms (62.57G pixels/s)
23. 26.76ms (62.69G pixels/s)
24. 26.83ms (62.54G pixels/s)
25. 26.86ms (62.46G pixels/s)
26. 26.77ms (62.68G pixels/s)
27. 26.77ms (62.67G pixels/s)
28. 26.84ms (62.51G pixels/s)
29. 26.82ms (62.54G pixels/s)
30. 26.78ms (62.65G pixels/s)
31. 26.78ms (62.65G pixels/s)
32. 36.32ms (46.19G pixels/s)
33. 36.44ms (46.05G pixels/s)
34. 38.86ms (43.17G pixels/s)
35. 36.36ms (46.14G pixels/s)
36. 36.39ms (46.10G pixels/s)
37. 38.83ms (43.21G pixels/s)
38. 36.44ms (46.04G pixels/s)
39. 36.40ms (46.09G pixels/s)
40. 43.95ms (38.18G pixels/s)
41. 36.43ms (46.06G pixels/s)
42. 36.41ms (46.08G pixels/s)
43. 38.89ms (43.14G pixels/s)
44. 36.43ms (46.05G pixels/s)
45. 36.44ms (46.04G pixels/s)
46. 36.43ms (46.05G pixels/s)
47. 39.08ms (42.93G pixels/s)
48. 36.42ms (46.07G pixels/s)
49. 36.40ms (46.09G pixels/s)
50. 38.85ms (43.19G pixels/s)
51. 36.41ms (46.08G pixels/s)
52. 36.40ms (46.09G pixels/s)
53. 36.59ms (45.85G pixels/s)
54. 38.85ms (43.18G pixels/s)
55. 38.80ms (43.25G pixels/s)
56. 36.46ms (46.01G pixels/s)
57. 38.83ms (43.21G pixels/s)
58. 36.45ms (46.03G pixels/s)
59. 36.41ms (46.07G pixels/s)
60. 41.56ms (40.37G pixels/s)
61. 36.40ms (46.09G pixels/s)
62. 36.47ms (46.01G pixels/s)
63. 39.01ms (43.01G pixels/s)
64. 45.95ms (36.51G pixels/s)
65. 45.96ms (36.51G pixels/s)
66. 51.03ms (32.88G pixels/s)
67. 46.03ms (36.45G pixels/s)
68. 53.26ms (31.50G pixels/s)
69. 45.95ms (36.51G pixels/s)
70. 46.05ms (36.44G pixels/s)
71. 53.38ms (31.43G pixels/s)
72. 46.00ms (36.47G pixels/s)
73. 53.27ms (31.49G pixels/s)
74. 46.02ms (36.45G pixels/s)
75. 46.01ms (36.47G pixels/s)
76. 48.81ms (34.38G pixels/s)
77. 46.00ms (36.48G pixels/s)
78. 46.05ms (36.43G pixels/s)
79. 46.05ms (36.43G pixels/s)
80. 48.47ms (34.61G pixels/s)
81. 46.17ms (36.34G pixels/s)
82. 46.02ms (36.46G pixels/s)
83. 46.07ms (36.41G pixels/s)
84. 50.86ms (32.99G pixels/s)
85. 46.05ms (36.43G pixels/s)
86. 46.17ms (36.34G pixels/s)
87. 48.47ms (34.61G pixels/s)
88. 48.40ms (34.66G pixels/s)
89. 50.92ms (32.95G pixels/s)
90. 46.02ms (36.46G pixels/s)
91. 50.93ms (32.94G pixels/s)
92. 48.46ms (34.62G pixels/s)
93. 46.07ms (36.42G pixels/s)
94. 50.88ms (32.97G pixels/s)
95. 46.06ms (36.43G pixels/s)
96. 55.64ms (30.15G pixels/s)
97. 58.14ms (28.86G pixels/s)
98. 57.98ms (28.94G pixels/s)
99. 55.59ms (30.18G pixels/s)
100. 55.63ms (30.16G pixels/s)
101. 65.32ms (25.69G pixels/s)
102. 55.59ms (30.18G pixels/s)
103. 55.66ms (30.14G pixels/s)
104. 55.63ms (30.16G pixels/s)
105. 58.17ms (28.84G pixels/s)
106. 55.62ms (30.17G pixels/s)
107. 58.08ms (28.88G pixels/s)
108. 55.67ms (30.14G pixels/s)
109. 55.80ms (30.07G pixels/s)
110. 58.04ms (28.91G pixels/s)
111. 55.66ms (30.14G pixels/s)
112. 55.61ms (30.17G pixels/s)
113. 55.61ms (30.17G pixels/s)
114. 63.15ms (26.57G pixels/s)
115. 55.63ms (30.16G pixels/s)
116. 55.70ms (30.12G pixels/s)
117. 55.63ms (30.16G pixels/s)
118. 55.77ms (30.08G pixels/s)
119. 55.65ms (30.15G pixels/s)
120. 60.53ms (27.72G pixels/s)
121. 58.07ms (28.89G pixels/s)
122. 55.75ms (30.09G pixels/s)
123. 58.06ms (28.89G pixels/s)
124. 58.09ms (28.88G pixels/s)
125. 58.07ms (28.89G pixels/s)
126. 55.65ms (30.15G pixels/s)
127. 55.92ms (30.00G pixels/s)
128. 65.12ms (25.76G pixels/s)

BTW, what if we monitor CPU usage during MDolenc's test to see if the "heavy CPU costs" claimed by Kollock are there for Kepler/Maxwell, and then compare with GCN GPUs?

I was monitoring all I could with MSI Afterburner while running the tool. Couldn't find any indication of "heavy CPU costs". Almost none to be fair, overall the CPU is under 5% load.
Maybe someone more knowledgeable can have a look at the Afterburner log. I manually extracted the relevant log part; it's an HTML file, so it can be opened directly with Afterburner.
 

Attachments

  • HardwareMonitoring2.zip
    1.8 KB · Views: 17
What exactly Maxwell 2 has for a policy is not yet clear.
My bad, I thought Maxwell 1 (which is only GM107, right?) also supported Async Compute.


It may be that it has a more limited subset of options relative to GCN for some kind of switching based on specific events. But if it architecturally has not split compute as heavily from graphics, a solidly running graphics context might still have priority, or a lack of preemption might make the benchmark's graphics portion a solid block that the GPU cannot interleave work with.

Or.. Kollock and AMD_Robert are right and the feature is simply being emulated through software, like what we're seeing with Kepler and Maxwell 1 cards?
You don't consider this a possibility at all?


I was monitoring all I could with MSI Afterburner while running the tool. Couldn't find any indication of "heavy CPU costs". Almost none to be fair, overall the CPU is under 5% load.
Same here with my GK107, but my laptop GPU is so horribly slow in the compute tests that I thought it would be the main reason for not taxing the CPU at all.



Maybe someone more knowledgeable can have a look at the Afterburner log

Yep, the CPU doesn't seem to show anything (the log only takes a screenshot of the graphs).
What I did notice is that the GPU clock is at 595MHz.
Are you using a frame rate limiter or something?
 
GTX 980TI 355.82
I was monitoring all I could with MSI Afterburner while running the tool. Couldn't find any indication of "heavy CPU costs". Almost none to be fair, overall the CPU is under 5% load.
This test is not trying to stress the CPU portion, however, and the level of complexity is not on the order of what a full engine would be doing.
Heavy CPU cost might arise with more complicated dependence tracking and workload balancing; the test so far is a very straightforward graphics context plus independent, simple, and identical compute kernels.

Or.. Kollock and AMD_Robert are right and the feature is simply being emulated through software, like what we're seeing with Kepler and Maxwell 1 cards?
You don't consider this a possibility at all?
It's possible, but it was stated to be different with that generation of Maxwell, and additional work on the testing methods is ongoing.
Likening asynchronous compute to CPU multithreading may be appropriate here as well, as there is more than one way to thread a CPU, with varying levels of complexity and ability to run instructions concurrently.
The most straightforward is switch-on-event, which only runs commands from one thread at a time, but the CPU can make the switch on its own based on detecting specific events.

The other thing is that this test mixes testing for concurrency with testing for asynchronous compute. That may usually be fine, but there are cases that can show very low concurrency (execution time does not change much) while still not being synchronous (actual dispatch times/ordering do change).

Perhaps with a little more data, we can verify if there isn't a corner case here.
 
Yep, the CPU doesn't seem to show anything (the log only takes a screenshot of the graphs).
What I did notice is that the GPU clock is at 595MHz.
Are you using a frame rate limiter or something?

No, I wasn't. I also noticed that; I ran it again several times and the GPU boosts fine to 1400+MHz, so it was just a one-time problem. The compute results are the same when the log shows it boosting to 1441MHz.

I am more intrigued by the frame buffer usage that gradually builds and empties during the async compute part, more noticeable if the log is viewed as txt.
During the compute-only part it is empty.
And during the graphics part it is constantly filled to 12%.
Honestly, as someone with limited knowledge, I have no idea what that means or whether it is relevant.
 
Maxwell 1 should support async compute (i.e. no driver serialization), but as far as I know it should not see any noticeable advantage...
 
Is the EDRAM micromanaged? Honest question. I thought DX11.x handled that automatically, like Intel's IGPs using L3 and L4.
I don't know how EDRAM is managed in current games and I suspect you're right it's more likely to be like the Intel case. But I also expect this won't be the case with D3D12 on XBox One.

As for the HSA-centric code, perhaps they can point many of those tasks towards the CPU anyways? I imagine most gaming PCs will have CPUs that are >3x faster than the 8x 1.6-1.75GHz Jaguars.
HSA-centric meaning that there is a single main memory shared by both types of cores, with the ability to bounce work around the processor at minuscule latencies and vast bandwidths compared with a PC configured with discrete graphics. Essentially, algorithms can use both types of cores in a fine-grained fashion for a single algorithm and not have to worry about the PCI Express bus getting in the way.
 
And the Oxide dev, or whoever he is, is not being straightforward by using vagaries in his comments about performance differences between the two architectures; I would expect him to know exactly what is going on.
Maybe we can persuade them to provide a user-toggle for async shaders. That way we could compare on and off on both AMD and NVidia.
 
Hey guys, first time poster.

I've been following this topic, it's very interesting!
However, I noticed that the charts that have been shared here so far do not actually illustrate the impact (or lack thereof) of async shading on the execution time of the test tool.

So I made this data visualizer from data scraped from this thread.
(I will add more as more people post their results here.)

Each bar in the chart shows the time it took for the async compute to finish.
The red block that floats to the top is the time it would take for the compute, by itself, to finish.
The blue block at the bottom is the time it would take for the graphics, by itself, to finish.

What we want here is for the red and blue to overlap: that signifies the async compute run finishing faster than if you were to run the compute and graphics separately.
Sometimes we see a white gap between the two colors: that signifies the async compute run being slower than it would have been if the two were run separately.
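
As a sanity check, the overlap the chart shows can be computed directly from the posted numbers. A small illustrative sketch (my own, using batch 31 of the 980 Ti results above: 9.71 ms compute-only, 17.26 ms graphics-only, 26.78 ms combined, which works out to roughly 0.19 ms of overlap, i.e. essentially serialized):

```cpp
#include <cstdio>

// Overlap = time saved versus running the two workloads back to back.
// Positive         -> compute and graphics genuinely overlapped (red/blue overlap).
// ~Zero or negative -> serialized, or the async path even added overhead
//                      (the white gap in the chart).
double OverlapMs(double computeOnlyMs, double graphicsOnlyMs, double combinedMs)
{
    return computeOnlyMs + graphicsOnlyMs - combinedMs;
}

int main()
{
    // Batch 31 from the GTX 980 Ti results posted above.
    std::printf("overlap: %.2f ms\n", OverlapMs(9.71, 17.26, 26.78)); // ~0.19 ms
    return 0;
}
```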

[Chart image: H5aBbkf.png]
 
Nice page Nub. A marked vertical axis would improve this. Also a second vertical axis for a line that shows your "faster by" metric would be sweet. Though that does mean you'd have to allow negatives on that axis. But you can offset the 0 on that axis above the base of the page.

You could really go for gold by allowing your page to show two cards at the same time (two side-by-side graphs rather than interleaved).

To be fair, this code is still changing rapidly, so you might want to revisit your page design if MDolenc releases a new version.
 
Maybe we can persuade them to provide a user-toggle for async shaders. That way we could compare on and off on both AMD and NVidia.


That would be nice, but so far I don't think they will do that, at least not the way they have been talking. And this would be easy to do if it's just a vendor ID.
 
Ok, here's the updated version. This one ought to be freaking interesting based on what I get on Kepler (not sharing any spoilers though :)).
I've made the compute shader 1/2 the length (but it goes up to 512 dispatches) and I added GPU timestamps (though oddly, on Kepler they only work on the DIRECT queue and not on the COMPUTE one). And there's a new mode...
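
For anyone curious, GPU timestamps in D3D12 look roughly like the sketch below (my own illustrative code, not MDolenc's; the device, queue, command list and readback buffer names are assumptions). The same query calls are valid on a COMPUTE command list, which is why it's odd they only return data on the DIRECT queue on Kepler:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical sketch of the D3D12 timestamp query path, not MDolenc's code.
// Assumes device, queue, cmdList and a readback buffer already exist.
void BracketWithTimestamps(ID3D12Device* device,
                           ID3D12CommandQueue* queue,
                           ID3D12GraphicsCommandList* cmdList,
                           ID3D12Resource* readbackBuffer)
{
    // A heap holding two timestamps: one before and one after the dispatches.
    // (In real code the heap must stay alive until the command list has executed.)
    D3D12_QUERY_HEAP_DESC qhDesc = {};
    qhDesc.Type  = D3D12_QUERY_HEAP_TYPE_TIMESTAMP;
    qhDesc.Count = 2;
    ComPtr<ID3D12QueryHeap> queryHeap;
    device->CreateQueryHeap(&qhDesc, IID_PPV_ARGS(&queryHeap));

    cmdList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 0);
    // ... Dispatch() calls go here ...
    cmdList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 1);

    // Copy both 64-bit tick values into the readback buffer for the CPU.
    cmdList->ResolveQueryData(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP,
                              0, 2, readbackBuffer, 0);

    // Ticks are converted to milliseconds with the queue's own frequency:
    // elapsedMs = double(end - begin) * 1000.0 / double(frequency).
    UINT64 frequency = 0;
    queue->GetTimestampFrequency(&frequency);
}
```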
 

Attachments

  • AsyncCompute.zip
    17.8 KB · Views: 210
Ok, here's the updated version. This one ought to be freaking interesting based on what I get on Kepler (not sharing any spoilers though :)).
I've made the compute shader 1/2 the length (but it goes up to 512 dispatches) and I added GPU timestamps (though oddly, on Kepler they only work on the DIRECT queue and not on the COMPUTE one). And there's a new mode...

Used the new one on my 980TI. It took 14 minutes to run and the Nvidia 355.82 driver crashed at async compute batch 455 (when it got near 3000ms/batch).
Also added the Afterburner log from the run, if it interests anyone. From a quick glance it uses more CPU, and the GPU switched from 100% to 0% usage when going from one async compute batch to another.
 

Attachments

  • perfGTX980TIPadyEos.zip
    241.2 KB · Views: 43
  • HardwareMonitoring.zip
    29.5 KB · Views: 17