DX12 Performance Discussion And Analysis Thread

That's not visible to users or even developers (ACE microcode). Unless you're writing your own Linux drivers...
Even in the Linux drivers, the microcode was shipped as signed binaries last I checked.

The big difference for the ACEs is that GCN 1.0 doesn't have microcode, as far as I know. It should still work, just a bit less intuitively. It was the later programmable ACE/HWS designs, GCN 1.2 and newer, that got the newer features backported.
 
The Radeon 7970 was quite popular, and I still use one frequently for testing. The 7970 GE is very close to the RX 470 in compute performance. Geometry performance is of course much worse, but our game doesn't render triangles. I wonder whether Vulkan async compute still works on GCN 1.0. AMD themselves recommend using just one compute queue (on PC), so GCN 1.0's queue count isn't the limiting factor here. IIRC somebody on the B3D forums said that GCN 1.0 can't run the same ACE microcode because there isn't enough room. Maybe there are load-balancing issues or other bugs, and the new code simply doesn't fit on GCN 1.0. This is very unfortunate if true, since the 7970 is still widely used. Async compute would have extended its lifetime a bit, especially in compute-heavy games like ours.

The capacity constraint for microcode I'm thinking of came up in the context of allowing the microcode engines to support the standard command packet types, HWS, and the AQL packets for HSA at the same time.
At least the standard compute path used by the synthetic benchmark in this thread shouldn't deliberately involve those extra sets of microcode, since that is beyond what the program can control, and the benchmark hasn't changed.
Losing the exact same functionality now could mean that something changed the driver's threshold for serialization, that there was a deliberate choice to back off on async compute for GCN 1.0, or that it's a bug.

There are features AMD has introduced that may not filter back to anything older than Sea Islands.
The front-end hardware for GCN 1.0 doesn't seem to share the foundation of the later revisions, in terms of updateability and the hardware features for the microcode engine's interaction with the GPU back end. The GCN 1.0 front end might have had more of a shared basis with Northern Islands and its introduction of compute.

From posts in the following thread, it seems that going forward AMD's overall compute platform is based on GCN 1.1 and higher.
https://www.phoronix.com/forums/for...te-1-3-platform-brings-polaris-other-features

I wouldn't think that would necessitate scrapping GCN 1.0 support for basic async compute, although perhaps there are details of how the stack communicates with the hardware that make this more difficult. Or moving on to bigger and better things can increase the chance of corner cases coming up and forcing a fallback, with no additional hotfix forthcoming. Applications that try to use more recent features could prompt a drop back to standard execution, rather than the driver trying to infer at runtime how they could be massaged onto GCN 1.0.
 
Just tried the latest drivers with my 280 and got a 4% performance gain with async compute on the nBody sample that AMD modified some time ago to add naive "async compute" support: https://github.com/GPUOpen-LibrariesAndSDKs/nBodyD3D12/tree/master/Samples/D3D12nBodyGravity

But I suspect there is a bug in the current driver branch for GCN1 devices, since I got a lot less performance than with my other card, a 380, both with and without "async compute" enabled. The bottleneck looks like the compute shader both with async on and off.
 
Also, "async compute on" on the GCN 1 card produces a stuttering pattern that is completely absent when running the demo on the 380. Moreover, the 380's performance is roughly double, even with async compute off, while the two cards should be more or less in the same performance range, +10/−15% for the 280 vs the 380 depending on the context. Even in a worst case for the 280, like a synthetic benchmark, the 380 should perform only about 21% better in single precision (and about 68% worse in double precision :p)... All of this smells like a nasty bug. Please note that I am running those two cards in a CPU-limited scenario, which should increase the performance benefit when the two queues (default and compute) are not serialized by the driver.
 
I own AOTS: Escalation and ran the benchmark in both D3D11 and D3D12 modes, on my notebook with a 980M and my desktop with a 1070. The 980M is slightly slower in D3D12, whereas the 1070 is slightly faster. Same drivers. No idea whether "async compute" is enabled or disabled by default these days.
 
Same issues and same terrible performance on GCN1 with compute shaders under D3D12: still stuttering, and only a small performance improvement with async compute.
 
I spent all night installing every driver between 16.3.1 and 16.9.2. The breaking point is driver 16.4.2: after that driver, no more async compute on GCN 1.0. So Nixxes was aware that async compute was not active; they released the async compute patch for Rise of the Tomb Raider in July, specifying that only GCN 1.1 and later can take advantage of it.
 
R9 280 (GCN1 Tahiti)
Code:
Compute only:
1. 54.14ms
2. 54.13ms
3. 54.13ms
4. 54.13ms
5. 54.13ms
6. 54.13ms
7. 54.13ms
8. 54.13ms
9. 54.13ms
10. 54.13ms
11. 54.13ms
12. 54.13ms
13. 54.13ms
14. 54.13ms
15. 54.13ms
16. 54.13ms
17. 54.13ms
18. 54.13ms
19. 54.13ms
20. 54.13ms
21. 54.13ms
22. 54.13ms
23. 54.13ms
24. 54.13ms
25. 54.13ms
26. 54.13ms
27. 54.13ms
28. 54.13ms
29. 54.13ms
30. 54.13ms
31. 54.13ms
32. 54.13ms
33. 54.13ms
34. 54.13ms
35. 54.13ms
36. 54.14ms
37. 54.13ms
38. 54.13ms
39. 54.13ms
40. 54.14ms
41. 54.13ms
42. 54.15ms
43. 54.13ms
44. 54.13ms
45. 54.13ms
46. 54.13ms
47. 54.13ms
48. 54.14ms
49. 54.13ms
50. 54.13ms
51. 54.13ms
52. 54.13ms
53. 54.13ms
54. 54.14ms
55. 54.13ms
56. 54.14ms
57. 54.13ms
58. 54.13ms
59. 54.14ms
60. 54.13ms
61. 54.14ms
62. 54.13ms
63. 54.14ms
64. 54.13ms
65. 54.13ms
66. 54.13ms
67. 54.13ms
68. 54.13ms
69. 54.13ms
70. 54.14ms
71. 54.13ms
72. 54.13ms
73. 54.13ms
74. 54.13ms
75. 54.13ms
76. 54.13ms
77. 54.13ms
78. 54.13ms
79. 54.13ms
80. 54.13ms
81. 54.13ms
82. 54.13ms
83. 54.13ms
84. 54.13ms
85. 54.13ms
86. 54.14ms
87. 54.13ms
88. 54.13ms
89. 54.13ms
90. 54.13ms
91. 54.13ms
92. 54.13ms
93. 54.13ms
94. 54.13ms
95. 54.13ms
96. 54.13ms
97. 54.13ms
98. 54.13ms
99. 54.13ms
100. 54.13ms
101. 54.13ms
102. 54.13ms
103. 54.13ms
104. 54.13ms
105. 54.13ms
106. 54.13ms
107. 54.13ms
108. 54.13ms
109. 54.13ms
110. 54.13ms
111. 54.13ms
112. 54.13ms
113. 54.13ms
114. 54.13ms
115. 54.13ms
116. 54.14ms
117. 54.13ms
118. 54.13ms
119. 54.13ms
120. 54.13ms
121. 54.13ms
122. 54.13ms
123. 54.13ms
124. 54.13ms
125. 54.14ms
126. 54.13ms
127. 54.13ms
128. 54.13ms
Graphics only: 56.28ms (29.81G pixels/s)
Graphics + compute:
1. 110.47ms (15.19G pixels/s)
2. 110.53ms (15.18G pixels/s)
3. 110.53ms (15.18G pixels/s)
4. 110.53ms (15.18G pixels/s)
5. 110.53ms (15.18G pixels/s)
6. 110.52ms (15.18G pixels/s)
7. 110.52ms (15.18G pixels/s)
8. 110.53ms (15.18G pixels/s)
9. 110.53ms (15.18G pixels/s)
10. 110.53ms (15.18G pixels/s)
11. 110.53ms (15.18G pixels/s)
12. 110.53ms (15.18G pixels/s)
13. 110.54ms (15.18G pixels/s)
14. 110.53ms (15.18G pixels/s)
15. 110.53ms (15.18G pixels/s)
16. 110.53ms (15.18G pixels/s)
17. 110.53ms (15.18G pixels/s)
18. 110.53ms (15.18G pixels/s)
19. 110.54ms (15.18G pixels/s)
20. 110.53ms (15.18G pixels/s)
21. 110.53ms (15.18G pixels/s)
22. 110.53ms (15.18G pixels/s)
23. 110.53ms (15.18G pixels/s)
24. 110.53ms (15.18G pixels/s)
25. 110.56ms (15.18G pixels/s)
26. 110.53ms (15.18G pixels/s)
27. 110.53ms (15.18G pixels/s)
28. 110.53ms (15.18G pixels/s)
29. 110.53ms (15.18G pixels/s)
30. 110.53ms (15.18G pixels/s)
31. 110.54ms (15.18G pixels/s)
32. 110.53ms (15.18G pixels/s)
33. 110.54ms (15.18G pixels/s)
34. 110.53ms (15.18G pixels/s)
35. 110.53ms (15.18G pixels/s)
36. 110.52ms (15.18G pixels/s)
37. 110.53ms (15.18G pixels/s)
38. 110.53ms (15.18G pixels/s)
39. 110.52ms (15.18G pixels/s)
40. 110.53ms (15.18G pixels/s)
41. 110.53ms (15.18G pixels/s)
42. 110.53ms (15.18G pixels/s)
43. 110.53ms (15.18G pixels/s)
44. 110.53ms (15.18G pixels/s)
45. 110.53ms (15.18G pixels/s)
46. 110.53ms (15.18G pixels/s)
47. 110.53ms (15.18G pixels/s)
48. 110.53ms (15.18G pixels/s)
49. 110.53ms (15.18G pixels/s)
50. 110.53ms (15.18G pixels/s)
51. 110.54ms (15.18G pixels/s)
52. 110.53ms (15.18G pixels/s)
53. 110.55ms (15.18G pixels/s)
54. 110.54ms (15.18G pixels/s)
55. 110.52ms (15.18G pixels/s)
56. 110.53ms (15.18G pixels/s)
57. 110.54ms (15.18G pixels/s)
58. 110.54ms (15.18G pixels/s)
59. 110.54ms (15.18G pixels/s)
60. 110.53ms (15.18G pixels/s)
61. 110.53ms (15.18G pixels/s)
62. 110.53ms (15.18G pixels/s)
63. 110.55ms (15.18G pixels/s)
64. 110.53ms (15.18G pixels/s)
65. 110.54ms (15.18G pixels/s)
66. 110.53ms (15.18G pixels/s)
67. 110.54ms (15.18G pixels/s)
68. 110.55ms (15.18G pixels/s)
69. 110.55ms (15.18G pixels/s)
70. 110.54ms (15.18G pixels/s)
71. 110.54ms (15.18G pixels/s)
72. 110.55ms (15.18G pixels/s)
73. 110.55ms (15.18G pixels/s)
74. 110.55ms (15.18G pixels/s)
75. 110.55ms (15.18G pixels/s)
76. 110.56ms (15.17G pixels/s)
77. 110.54ms (15.18G pixels/s)
78. 110.54ms (15.18G pixels/s)
79. 110.54ms (15.18G pixels/s)
80. 110.54ms (15.18G pixels/s)
81. 110.55ms (15.18G pixels/s)
82. 110.54ms (15.18G pixels/s)
83. 110.55ms (15.18G pixels/s)
84. 110.55ms (15.18G pixels/s)
85. 110.57ms (15.17G pixels/s)
86. 110.55ms (15.18G pixels/s)
87. 110.57ms (15.17G pixels/s)
88. 110.55ms (15.18G pixels/s)
89. 110.55ms (15.18G pixels/s)
90. 110.56ms (15.17G pixels/s)
91. 110.55ms (15.18G pixels/s)
92. 110.55ms (15.18G pixels/s)
93. 110.55ms (15.18G pixels/s)
94. 110.57ms (15.17G pixels/s)
95. 110.56ms (15.17G pixels/s)
96. 110.55ms (15.18G pixels/s)
97. 110.57ms (15.17G pixels/s)
98. 110.55ms (15.18G pixels/s)
99. 110.56ms (15.18G pixels/s)
100. 110.55ms (15.18G pixels/s)
101. 110.54ms (15.18G pixels/s)
102. 110.54ms (15.18G pixels/s)
103. 110.55ms (15.18G pixels/s)
104. 110.55ms (15.18G pixels/s)
105. 110.55ms (15.18G pixels/s)
106. 110.54ms (15.18G pixels/s)
107. 110.54ms (15.18G pixels/s)
108. 110.55ms (15.18G pixels/s)
109. 110.55ms (15.18G pixels/s)
110. 110.54ms (15.18G pixels/s)
111. 110.54ms (15.18G pixels/s)
112. 110.56ms (15.17G pixels/s)
113. 110.55ms (15.18G pixels/s)
114. 110.55ms (15.18G pixels/s)
115. 110.55ms (15.18G pixels/s)
116. 110.55ms (15.18G pixels/s)
117. 110.57ms (15.17G pixels/s)
118. 110.55ms (15.18G pixels/s)
119. 110.55ms (15.18G pixels/s)
120. 110.55ms (15.18G pixels/s)
121. 110.55ms (15.18G pixels/s)
122. 110.55ms (15.18G pixels/s)
123. 110.55ms (15.18G pixels/s)
124. 110.55ms (15.18G pixels/s)
125. 110.55ms (15.18G pixels/s)
126. 110.55ms (15.18G pixels/s)
127. 110.55ms (15.18G pixels/s)
128. 110.55ms (15.18G pixels/s)

-.-
 
16.11.x, but with 16.12 I got the same results. I simply rolled back to 16.11.x since 16.12 has some power and fan control issues.
 
Here is the D3D12nBodyGravity sample that AMD forked from Microsoft to add concurrent execution of a compute queue alongside the default/graphics queue.

As you can see, there is no compute queue at all when running the sample on the R9 280:

[Attachment: R9_380.png]

[Attachment: R9_280.png]
 