DX12 Performance Discussion And Analysis Thread

R9 390X driver version 15.20.1062.1004
Compute only:
1. 52.28ms
2. 52.43ms
3. 52.24ms
4. 52.25ms
5. 52.28ms
6. 52.23ms
7. 52.26ms
8. 52.34ms
9. 52.27ms
10. 52.23ms
11. 52.24ms
12. 52.24ms
13. 52.24ms
14. 52.29ms
15. 52.24ms
16. 52.28ms
17. 52.25ms
18. 52.24ms
19. 52.23ms
20. 52.30ms
21. 52.21ms
22. 52.27ms
23. 52.22ms
24. 52.25ms
25. 52.19ms
26. 52.24ms
27. 52.28ms
28. 52.27ms
29. 52.26ms
30. 52.33ms
31. 52.32ms
32. 52.18ms
33. 52.31ms
34. 52.21ms
35. 52.47ms
36. 52.24ms
37. 52.23ms
38. 52.28ms
39. 52.25ms
40. 52.27ms
41. 52.17ms
42. 52.22ms
43. 52.25ms
44. 52.24ms
45. 52.25ms
46. 52.25ms
47. 52.26ms
48. 52.17ms
49. 52.22ms
50. 52.22ms
51. 52.24ms
52. 52.28ms
53. 52.25ms
54. 52.35ms
55. 52.20ms
56. 52.35ms
57. 52.26ms
58. 52.20ms
59. 52.23ms
60. 52.27ms
61. 52.26ms
62. 52.60ms
63. 52.56ms
64. 52.24ms
65. 52.27ms
66. 52.25ms
67. 52.28ms
68. 52.31ms
69. 52.21ms
70. 52.21ms
71. 52.24ms
72. 52.36ms
73. 52.25ms
74. 52.34ms
75. 52.33ms
76. 52.18ms
77. 52.21ms
78. 52.40ms
79. 52.21ms
80. 52.53ms
81. 52.20ms
82. 52.22ms
83. 52.20ms
84. 52.35ms
85. 52.24ms
86. 52.33ms
87. 52.32ms
88. 52.33ms
89. 52.69ms
90. 52.23ms
91. 52.26ms
92. 52.21ms
93. 52.24ms
94. 52.28ms
95. 52.28ms
96. 52.24ms
97. 52.21ms
98. 52.28ms
99. 52.26ms
100. 52.26ms
101. 52.23ms
102. 52.33ms
103. 52.26ms
104. 52.33ms
105. 52.33ms
106. 52.30ms
107. 52.29ms
108. 52.25ms
109. 52.24ms
110. 52.22ms
111. 52.28ms
112. 52.26ms
113. 52.13ms
114. 52.27ms
115. 52.26ms
116. 52.61ms
117. 52.25ms
118. 52.24ms
119. 52.26ms
120. 52.53ms
121. 52.23ms
122. 52.20ms
123. 52.34ms
124. 52.33ms
125. 52.33ms
126. 52.35ms
127. 52.36ms
128. 52.26ms
Graphics only: 27.55ms (60.89G pixels/s)
Graphics + compute:
1. 53.07ms (31.62G pixels/s)
2. 52.47ms (31.97G pixels/s)
3. 52.57ms (31.91G pixels/s)
4. 52.57ms (31.91G pixels/s)
5. 52.85ms (31.74G pixels/s)
6. 52.81ms (31.77G pixels/s)
7. 52.53ms (31.94G pixels/s)
8. 52.69ms (31.84G pixels/s)
9. 52.98ms (31.67G pixels/s)
10. 52.74ms (31.81G pixels/s)
11. 52.74ms (31.81G pixels/s)
12. 52.97ms (31.67G pixels/s)
13. 53.56ms (31.33G pixels/s)
14. 53.20ms (31.54G pixels/s)
15. 53.52ms (31.35G pixels/s)
16. 52.66ms (31.86G pixels/s)
17. 52.92ms (31.70G pixels/s)
18. 52.58ms (31.91G pixels/s)
19. 54.12ms (31.00G pixels/s)
20. 53.10ms (31.60G pixels/s)
21. 53.10ms (31.60G pixels/s)
22. 52.54ms (31.93G pixels/s)
23. 53.03ms (31.64G pixels/s)
24. 53.28ms (31.49G pixels/s)
25. 53.01ms (31.65G pixels/s)
26. 54.17ms (30.97G pixels/s)
27. 52.78ms (31.79G pixels/s)
28. 54.23ms (30.94G pixels/s)
29. 53.23ms (31.52G pixels/s)
30. 53.43ms (31.40G pixels/s)
31. 53.30ms (31.47G pixels/s)
32. 53.42ms (31.40G pixels/s)
33. 54.75ms (30.65G pixels/s)
34. 52.56ms (31.92G pixels/s)
35. 53.64ms (31.28G pixels/s)
36. 53.16ms (31.56G pixels/s)
37. 56.06ms (29.93G pixels/s)
38. 53.72ms (31.23G pixels/s)
39. 53.36ms (31.44G pixels/s)
40. 53.40ms (31.42G pixels/s)
41. 53.46ms (31.38G pixels/s)
42. 53.89ms (31.13G pixels/s)
43. 52.63ms (31.88G pixels/s)
44. 54.40ms (30.84G pixels/s)
45. 52.55ms (31.93G pixels/s)
46. 55.17ms (30.41G pixels/s)
47. 53.35ms (31.45G pixels/s)
48. 53.36ms (31.44G pixels/s)
49. 52.58ms (31.91G pixels/s)
50. 53.41ms (31.41G pixels/s)
51. 54.21ms (30.95G pixels/s)
52. 52.57ms (31.91G pixels/s)
53. 55.68ms (30.13G pixels/s)
54. 54.22ms (30.94G pixels/s)
55. 54.40ms (30.84G pixels/s)
56. 54.30ms (30.90G pixels/s)
57. 53.94ms (31.10G pixels/s)
58. 56.26ms (29.82G pixels/s)
59. 54.38ms (30.85G pixels/s)
60. 54.93ms (30.54G pixels/s)
61. 54.52ms (30.77G pixels/s)
62. 56.51ms (29.69G pixels/s)
63. 52.55ms (31.92G pixels/s)
64. 53.81ms (31.18G pixels/s)
65. 53.58ms (31.31G pixels/s)
66. 54.19ms (30.96G pixels/s)
67. 54.53ms (30.76G pixels/s)
68. 56.66ms (29.61G pixels/s)
69. 54.79ms (30.62G pixels/s)
70. 54.37ms (30.86G pixels/s)
71. 56.02ms (29.95G pixels/s)
72. 52.53ms (31.94G pixels/s)
73. 52.65ms (31.87G pixels/s)
74. 52.74ms (31.81G pixels/s)
75. 58.51ms (28.67G pixels/s)
76. 52.61ms (31.89G pixels/s)
77. 56.75ms (29.56G pixels/s)
78. 52.76ms (31.80G pixels/s)
79. 52.55ms (31.93G pixels/s)
80. 57.43ms (29.21G pixels/s)
81. 53.99ms (31.07G pixels/s)
82. 57.87ms (28.99G pixels/s)
83. 55.15ms (30.42G pixels/s)
84. 58.63ms (28.62G pixels/s)
85. 53.88ms (31.14G pixels/s)
86. 58.06ms (28.90G pixels/s)
87. 52.59ms (31.90G pixels/s)
88. 55.23ms (30.38G pixels/s)
89. 55.30ms (30.34G pixels/s)
90. 55.16ms (30.42G pixels/s)
91. 55.45ms (30.26G pixels/s)
92. 54.03ms (31.05G pixels/s)
93. 57.21ms (29.33G pixels/s)
94. 55.55ms (30.20G pixels/s)
95. 54.34ms (30.87G pixels/s)
96. 52.55ms (31.93G pixels/s)
97. 56.54ms (29.67G pixels/s)
98. 55.46ms (30.25G pixels/s)
99. 58.49ms (28.68G pixels/s)
100. 52.78ms (31.79G pixels/s)
101. 54.59ms (30.73G pixels/s)
102. 56.09ms (29.91G pixels/s)
103. 52.75ms (31.81G pixels/s)
104. 57.93ms (28.96G pixels/s)
105. 52.56ms (31.92G pixels/s)
106. 57.76ms (29.05G pixels/s)
107. 55.86ms (30.03G pixels/s)
108. 58.50ms (28.68G pixels/s)
109. 52.76ms (31.80G pixels/s)
110. 54.43ms (30.82G pixels/s)
111. 54.52ms (30.77G pixels/s)
112. 55.31ms (30.33G pixels/s)
113. 59.58ms (28.16G pixels/s)
114. 55.44ms (30.26G pixels/s)
115. 55.07ms (30.47G pixels/s)
116. 56.05ms (29.94G pixels/s)
117. 54.61ms (30.72G pixels/s)
118. 56.32ms (29.79G pixels/s)
119. 58.34ms (28.76G pixels/s)
120. 52.76ms (31.80G pixels/s)
121. 52.59ms (31.90G pixels/s)
122. 54.54ms (30.76G pixels/s)
123. 54.55ms (30.75G pixels/s)
124. 52.73ms (31.82G pixels/s)
125. 54.64ms (30.71G pixels/s)
126. 58.86ms (28.50G pixels/s)
127. 52.52ms (31.94G pixels/s)
128. 52.63ms (31.88G pixels/s)
 
What kind of workloads are you running on each thread? It is super easy to construct workloads with zero potential gains.

Use cases to avoid:
- Both queues run the same shader. Obviously this has zero advantage as bottlenecks are identical.
- Both queues are bottlenecked by the same resource (bandwidth, ALU, sampler cycles).

Good use cases:
- One queue is bandwidth heavy and other is ALU heavy.
- Graphics queue is fixed function heavy and you run compute in the other queue. For example high poly shadow map rendering (primitive setup + rop, no pixel shader).
- One queue is running long lasting computation and the other is running small single lane tasks with data dependencies. The long running background task keeps the GPU occupied while the other queue waits for synchronization.
It's running from 1 to 128 single lane compute kernels, which are quite long and require a minimal amount of bandwidth. The graphics queue is basically just pushing fillrate with triangles occupying a 4k x 4k offscreen render target.
So basically the best possible case.
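For readers trying to picture the setup, a minimal D3D12 sketch of such a two-queue submission might look roughly like the following. All names are illustrative and this is not the actual benchmark source; the point is only that the graphics and compute command lists go to separate queues with no dependency between them, each with its own fence.

Code:
// Minimal sketch of a DIRECT + COMPUTE queue pair with independent fences.
// Illustrative names only; this is not the actual benchmark source.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void SubmitBothQueues(ID3D12Device* device,
                      ID3D12GraphicsCommandList* gfxList,      // pre-recorded fillrate pass
                      ID3D12GraphicsCommandList* computeList)  // pre-recorded 1..128 dispatches
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    ComPtr<ID3D12CommandQueue> gfxQueue, computeQueue;
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> gfxFence, computeFence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&gfxFence));
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&computeFence));

    // No cross-queue dependency between the two submissions, so the
    // driver/hardware is free to overlap them if it can.
    ID3D12CommandList* g[] = { gfxList };
    ID3D12CommandList* c[] = { computeList };
    gfxQueue->ExecuteCommandLists(1, g);
    gfxQueue->Signal(gfxFence.Get(), 1);
    computeQueue->ExecuteCommandLists(1, c);
    computeQueue->Signal(computeFence.Get(), 1);

    // Wait for both fences; tracking each one separately allows per-queue timing.
    HANDLE done[2] = { CreateEvent(nullptr, TRUE, FALSE, nullptr),
                       CreateEvent(nullptr, TRUE, FALSE, nullptr) };
    gfxFence->SetEventOnCompletion(1, done[0]);
    computeFence->SetEventOnCompletion(1, done[1]);
    WaitForMultipleObjects(2, done, TRUE, INFINITE);
    CloseHandle(done[0]);
    CloseHandle(done[1]);
}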
 
Fury X - Fiji GCN 1.2
15.20.1062.1004

Compute only:
1. 49.65ms
2. 49.66ms
3. 49.66ms
4. 49.66ms
5. 49.66ms
6. 49.65ms
7. 49.66ms
8. 49.66ms
9. 49.65ms
10. 49.66ms
11. 49.66ms
12. 49.65ms
13. 49.66ms
14. 49.64ms
15. 49.66ms
16. 49.66ms
17. 49.66ms
18. 49.65ms
19. 49.65ms
20. 49.67ms
21. 49.66ms
22. 49.66ms
23. 49.66ms
24. 49.66ms
25. 49.67ms
26. 49.66ms
27. 49.66ms
28. 49.66ms
29. 49.66ms
30. 49.66ms
31. 49.66ms
32. 49.66ms
33. 49.66ms
34. 49.66ms
35. 49.67ms
36. 49.65ms
37. 49.65ms
38. 49.66ms
39. 49.66ms
40. 49.67ms
41. 49.66ms
42. 49.66ms
43. 49.66ms
44. 49.66ms
45. 49.66ms
46. 49.66ms
47. 49.65ms
48. 49.66ms
49. 49.66ms
50. 49.66ms
51. 49.66ms
52. 49.66ms
53. 49.66ms
54. 49.66ms
55. 49.66ms
56. 49.66ms
57. 49.66ms
58. 49.66ms
59. 49.66ms
60. 49.65ms
61. 49.66ms
62. 49.66ms
63. 49.66ms
64. 49.66ms
65. 49.68ms
66. 49.66ms
67. 49.68ms
68. 49.67ms
69. 49.66ms
70. 49.67ms
71. 49.66ms
72. 49.66ms
73. 49.65ms
74. 49.67ms
75. 49.67ms
76. 49.66ms
77. 49.66ms
78. 49.66ms
79. 49.66ms
80. 49.67ms
81. 49.66ms
82. 49.65ms
83. 49.66ms
84. 49.66ms
85. 49.67ms
86. 49.65ms
87. 49.67ms
88. 49.66ms
89. 49.66ms
90. 49.66ms
91. 49.65ms
92. 49.66ms
93. 49.67ms
94. 49.67ms
95. 49.67ms
96. 49.67ms
97. 49.65ms
98. 49.66ms
99. 49.66ms
100. 49.66ms
101. 49.66ms
102. 49.67ms
103. 49.67ms
104. 49.66ms
105. 49.67ms
106. 49.66ms
107. 49.67ms
108. 49.67ms
109. 49.66ms
110. 49.68ms
111. 49.67ms
112. 49.67ms
113. 49.67ms
114. 49.67ms
115. 49.68ms
116. 49.67ms
117. 49.67ms
118. 49.67ms
119. 49.67ms
120. 49.68ms
121. 49.66ms
122. 49.69ms
123. 49.67ms
124. 49.67ms
125. 49.66ms
126. 49.66ms
127. 49.65ms
128. 49.67ms
Graphics only: 25.18ms (66.62G pixels/s)
Graphics + compute:
1. 55.93ms (30.00G pixels/s)
2. 56.01ms (29.95G pixels/s)
3. 49.76ms (33.72G pixels/s)
4. 49.76ms (33.72G pixels/s)
5. 49.75ms (33.72G pixels/s)
6. 49.82ms (33.68G pixels/s)
7. 56.03ms (29.94G pixels/s)
8. 56.05ms (29.93G pixels/s)
9. 49.85ms (33.66G pixels/s)
10. 49.79ms (33.69G pixels/s)
11. 49.77ms (33.71G pixels/s)
12. 49.80ms (33.69G pixels/s)
13. 56.06ms (29.93G pixels/s)
14. 62.31ms (26.92G pixels/s)
15. 49.78ms (33.70G pixels/s)
16. 62.34ms (26.91G pixels/s)
17. 49.80ms (33.69G pixels/s)
18. 62.40ms (26.89G pixels/s)
19. 56.00ms (29.96G pixels/s)
20. 62.35ms (26.91G pixels/s)
21. 56.13ms (29.89G pixels/s)
22. 56.01ms (29.95G pixels/s)
23. 62.33ms (26.92G pixels/s)
24. 49.82ms (33.68G pixels/s)
25. 62.27ms (26.94G pixels/s)
26. 49.76ms (33.72G pixels/s)
27. 56.00ms (29.96G pixels/s)
28. 56.07ms (29.92G pixels/s)
29. 62.31ms (26.93G pixels/s)
30. 56.12ms (29.90G pixels/s)
31. 68.61ms (24.45G pixels/s)
32. 49.77ms (33.71G pixels/s)
33. 56.01ms (29.95G pixels/s)
34. 62.27ms (26.94G pixels/s)
35. 62.35ms (26.91G pixels/s)
36. 68.59ms (24.46G pixels/s)
37. 55.99ms (29.96G pixels/s)
38. 75.62ms (22.19G pixels/s)
39. 49.79ms (33.70G pixels/s)
40. 49.80ms (33.69G pixels/s)
41. 49.79ms (33.69G pixels/s)
42. 49.77ms (33.71G pixels/s)
43. 49.76ms (33.72G pixels/s)
44. 74.78ms (22.44G pixels/s)
45. 56.01ms (29.95G pixels/s)
46. 49.79ms (33.69G pixels/s)
47. 49.81ms (33.68G pixels/s)
48. 56.15ms (29.88G pixels/s)
49. 56.07ms (29.92G pixels/s)
50. 49.81ms (33.68G pixels/s)
51. 62.28ms (26.94G pixels/s)
52. 49.78ms (33.70G pixels/s)
53. 62.30ms (26.93G pixels/s)
54. 49.82ms (33.68G pixels/s)
55. 62.34ms (26.91G pixels/s)
56. 56.04ms (29.94G pixels/s)
57. 56.05ms (29.93G pixels/s)
58. 56.03ms (29.94G pixels/s)
59. 49.77ms (33.71G pixels/s)
60. 62.38ms (26.90G pixels/s)
61. 49.82ms (33.68G pixels/s)
62. 56.02ms (29.95G pixels/s)
63. 56.05ms (29.93G pixels/s)
64. 56.07ms (29.92G pixels/s)
65. 56.05ms (29.93G pixels/s)
66. 56.03ms (29.94G pixels/s)
67. 56.09ms (29.91G pixels/s)
68. 49.79ms (33.70G pixels/s)
69. 62.24ms (26.95G pixels/s)
70. 49.79ms (33.70G pixels/s)
71. 62.32ms (26.92G pixels/s)
72. 49.81ms (33.68G pixels/s)
73. 55.98ms (29.97G pixels/s)
74. 49.81ms (33.68G pixels/s)
75. 49.79ms (33.69G pixels/s)
76. 49.76ms (33.71G pixels/s)
77. 55.98ms (29.97G pixels/s)
78. 56.10ms (29.90G pixels/s)
79. 49.82ms (33.67G pixels/s)
80. 62.48ms (26.85G pixels/s)
81. 49.77ms (33.71G pixels/s)
82. 49.79ms (33.69G pixels/s)
83. 55.96ms (29.98G pixels/s)
84. 49.78ms (33.70G pixels/s)
85. 49.78ms (33.70G pixels/s)
86. 49.78ms (33.70G pixels/s)
87. 62.28ms (26.94G pixels/s)
88. 68.57ms (24.47G pixels/s)
89. 62.28ms (26.94G pixels/s)
90. 56.00ms (29.96G pixels/s)
91. 62.43ms (26.87G pixels/s)
92. 68.55ms (24.47G pixels/s)
93. 68.58ms (24.46G pixels/s)
94. 49.77ms (33.71G pixels/s)
95. 62.44ms (26.87G pixels/s)
96. 49.80ms (33.69G pixels/s)
97. 56.02ms (29.95G pixels/s)
98. 56.06ms (29.93G pixels/s)
99. 56.03ms (29.94G pixels/s)
100. 56.03ms (29.95G pixels/s)
101. 55.98ms (29.97G pixels/s)
102. 56.02ms (29.95G pixels/s)
103. 74.82ms (22.42G pixels/s)
104. 62.31ms (26.92G pixels/s)
105. 56.13ms (29.89G pixels/s)
106. 62.26ms (26.95G pixels/s)
107. 49.79ms (33.69G pixels/s)
108. 56.07ms (29.92G pixels/s)
109. 49.78ms (33.71G pixels/s)
110. 49.78ms (33.70G pixels/s)
111. 56.05ms (29.93G pixels/s)
112. 56.05ms (29.94G pixels/s)
113. 56.12ms (29.90G pixels/s)
114. 74.67ms (22.47G pixels/s)
115. 62.26ms (26.95G pixels/s)
116. 56.01ms (29.96G pixels/s)
117. 49.83ms (33.67G pixels/s)
118. 49.78ms (33.70G pixels/s)
119. 49.78ms (33.70G pixels/s)
120. 49.78ms (33.70G pixels/s)
121. 49.81ms (33.69G pixels/s)
122. 55.96ms (29.98G pixels/s)
123. 62.45ms (26.87G pixels/s)
124. 75.07ms (22.35G pixels/s)
125. 56.07ms (29.92G pixels/s)
126. 62.38ms (26.90G pixels/s)
127. 56.08ms (29.92G pixels/s)
128. 62.24ms (26.96G pixels/s)
 
It looks like GCN hits a flatline in compute-only loads, and performance doesn't benefit from a lower batch count the way the Nvidia parts do. :???:
 
Laptop 8970M, i7-4700MQ
Compute only:
1. 61.52ms
2. 61.52ms
3. 61.52ms
4. 61.58ms
5. 61.47ms
6. 61.51ms
7. 61.51ms
8. 61.52ms
9. 61.50ms
10. 61.51ms
11. 61.52ms
12. 61.52ms
13. 61.52ms
14. 61.52ms
15. 61.53ms
16. 61.51ms
17. 61.51ms
18. 61.52ms
19. 61.52ms
20. 61.50ms
21. 61.52ms
22. 61.52ms
23. 61.52ms
24. 61.51ms
25. 61.53ms
26. 61.50ms
27. 61.53ms
28. 61.51ms
29. 61.53ms
30. 61.52ms
31. 61.52ms
32. 61.51ms
33. 61.52ms
34. 61.52ms
35. 61.49ms
36. 61.49ms
37. 61.50ms
38. 61.51ms
39. 61.52ms
40. 61.52ms
41. 61.52ms
42. 61.51ms
43. 61.54ms
44. 61.52ms
45. 61.52ms
46. 61.51ms
47. 61.52ms
48. 61.51ms
49. 61.52ms
50. 61.54ms
51. 61.52ms
52. 61.52ms
53. 61.51ms
54. 61.53ms
55. 61.51ms
56. 61.50ms
57. 61.52ms
58. 61.52ms
59. 61.51ms
60. 61.52ms
61. 61.51ms
62. 61.50ms
63. 61.50ms
64. 61.53ms
65. 61.50ms
66. 61.39ms
67. 61.36ms
68. 61.39ms
69. 61.43ms
70. 61.51ms
71. 61.54ms
72. 61.51ms
73. 61.52ms
74. 61.51ms
75. 61.50ms
76. 61.52ms
77. 61.54ms
78. 61.51ms
79. 61.53ms
80. 61.52ms
81. 61.50ms
82. 61.51ms
83. 61.51ms
84. 61.52ms
85. 61.48ms
86. 61.53ms
87. 61.52ms
88. 61.48ms
89. 61.50ms
90. 61.52ms
91. 61.50ms
92. 61.51ms
93. 61.51ms
94. 61.52ms
95. 61.51ms
96. 61.50ms
97. 61.52ms
98. 61.52ms
99. 61.52ms
100. 61.52ms
101. 61.52ms
102. 61.52ms
103. 61.52ms
104. 61.50ms
105. 61.51ms
106. 61.51ms
107. 61.52ms
108. 61.53ms
109. 61.52ms
110. 61.50ms
111. 61.50ms
112. 61.51ms
113. 61.52ms
114. 61.52ms
115. 61.50ms
116. 61.51ms
117. 61.52ms
118. 61.50ms
119. 61.51ms
120. 61.52ms
121. 61.54ms
122. 61.49ms
123. 61.52ms
124. 61.52ms
125. 61.52ms
126. 61.51ms
127. 61.50ms
128. 61.51ms
Graphics only: 59.03ms (28.42G pixels/s)
Graphics + compute:
1. 62.97ms (26.64G pixels/s)
2. 63.01ms (26.63G pixels/s)
3. 63.00ms (26.63G pixels/s)
4. 63.01ms (26.63G pixels/s)
5. 62.99ms (26.64G pixels/s)
6. 63.01ms (26.63G pixels/s)
7. 63.00ms (26.63G pixels/s)
8. 63.01ms (26.63G pixels/s)
9. 63.00ms (26.63G pixels/s)
10. 62.99ms (26.64G pixels/s)
11. 63.02ms (26.62G pixels/s)
12. 63.00ms (26.63G pixels/s)
13. 63.00ms (26.63G pixels/s)
14. 62.98ms (26.64G pixels/s)
15. 63.01ms (26.63G pixels/s)
16. 62.94ms (26.65G pixels/s)
17. 63.01ms (26.63G pixels/s)
18. 62.99ms (26.63G pixels/s)
19. 62.99ms (26.64G pixels/s)
20. 63.00ms (26.63G pixels/s)
21. 63.00ms (26.63G pixels/s)
22. 63.00ms (26.63G pixels/s)
23. 63.00ms (26.63G pixels/s)
24. 63.01ms (26.62G pixels/s)
25. 63.00ms (26.63G pixels/s)
26. 63.01ms (26.62G pixels/s)
27. 62.96ms (26.65G pixels/s)
28. 63.02ms (26.62G pixels/s)
29. 62.99ms (26.63G pixels/s)
30. 63.01ms (26.63G pixels/s)
31. 63.01ms (26.63G pixels/s)
32. 63.01ms (26.63G pixels/s)
33. 62.99ms (26.63G pixels/s)
34. 63.01ms (26.63G pixels/s)
35. 63.00ms (26.63G pixels/s)
36. 63.02ms (26.62G pixels/s)
37. 62.99ms (26.63G pixels/s)
38. 63.00ms (26.63G pixels/s)
39. 63.01ms (26.63G pixels/s)
40. 63.01ms (26.62G pixels/s)
41. 63.00ms (26.63G pixels/s)
42. 63.00ms (26.63G pixels/s)
43. 63.01ms (26.62G pixels/s)
44. 62.96ms (26.65G pixels/s)
45. 63.01ms (26.63G pixels/s)
46. 63.00ms (26.63G pixels/s)
47. 63.01ms (26.63G pixels/s)
48. 62.99ms (26.63G pixels/s)
49. 63.01ms (26.63G pixels/s)
50. 62.97ms (26.64G pixels/s)
51. 63.01ms (26.63G pixels/s)
52. 62.97ms (26.64G pixels/s)
53. 63.01ms (26.63G pixels/s)
54. 63.00ms (26.63G pixels/s)
55. 63.00ms (26.63G pixels/s)
56. 63.01ms (26.63G pixels/s)
57. 63.01ms (26.62G pixels/s)
58. 63.00ms (26.63G pixels/s)
59. 62.99ms (26.63G pixels/s)
60. 63.00ms (26.63G pixels/s)
61. 63.01ms (26.62G pixels/s)
62. 63.01ms (26.62G pixels/s)
63. 62.99ms (26.63G pixels/s)
64. 63.00ms (26.63G pixels/s)
65. 62.98ms (26.64G pixels/s)
66. 63.02ms (26.62G pixels/s)
67. 62.98ms (26.64G pixels/s)
68. 63.01ms (26.62G pixels/s)
69. 63.00ms (26.63G pixels/s)
70. 63.01ms (26.62G pixels/s)
71. 63.01ms (26.63G pixels/s)
72. 63.03ms (26.62G pixels/s)
73. 63.02ms (26.62G pixels/s)
74. 63.01ms (26.63G pixels/s)
75. 63.01ms (26.63G pixels/s)
76. 62.99ms (26.63G pixels/s)
77. 63.02ms (26.62G pixels/s)
78. 63.00ms (26.63G pixels/s)
79. 63.00ms (26.63G pixels/s)
80. 62.98ms (26.64G pixels/s)
81. 63.02ms (26.62G pixels/s)
82. 63.01ms (26.63G pixels/s)
83. 63.00ms (26.63G pixels/s)
84. 63.00ms (26.63G pixels/s)
85. 63.01ms (26.63G pixels/s)
86. 63.00ms (26.63G pixels/s)
87. 63.01ms (26.63G pixels/s)
88. 63.01ms (26.63G pixels/s)
89. 63.01ms (26.62G pixels/s)
90. 63.00ms (26.63G pixels/s)
91. 63.00ms (26.63G pixels/s)
92. 62.99ms (26.63G pixels/s)
93. 63.01ms (26.63G pixels/s)
94. 63.00ms (26.63G pixels/s)
95. 62.99ms (26.63G pixels/s)
96. 63.02ms (26.62G pixels/s)
97. 63.01ms (26.62G pixels/s)
98. 63.01ms (26.63G pixels/s)
99. 62.97ms (26.64G pixels/s)
100. 63.01ms (26.63G pixels/s)
101. 63.01ms (26.62G pixels/s)
102. 62.99ms (26.64G pixels/s)
103. 63.00ms (26.63G pixels/s)
104. 63.00ms (26.63G pixels/s)
105. 63.00ms (26.63G pixels/s)
106. 62.99ms (26.63G pixels/s)
107. 63.00ms (26.63G pixels/s)
108. 62.98ms (26.64G pixels/s)
109. 63.01ms (26.63G pixels/s)
110. 62.99ms (26.64G pixels/s)
111. 63.01ms (26.62G pixels/s)
112. 63.01ms (26.63G pixels/s)
113. 63.01ms (26.63G pixels/s)
114. 63.00ms (26.63G pixels/s)
115. 63.00ms (26.63G pixels/s)
116. 62.99ms (26.64G pixels/s)
117. 63.01ms (26.62G pixels/s)
118. 63.00ms (26.63G pixels/s)
119. 63.00ms (26.63G pixels/s)
120. 63.01ms (26.63G pixels/s)
121. 63.02ms (26.62G pixels/s)
122. 62.99ms (26.63G pixels/s)
123. 62.97ms (26.64G pixels/s)
124. 63.02ms (26.62G pixels/s)
125. 62.98ms (26.64G pixels/s)
126. 62.98ms (26.64G pixels/s)
127. 62.99ms (26.64G pixels/s)
128. 63.01ms (26.63G pixels/s)
 
It looks like GCN hits a flatline in compute-only loads, and performance doesn't benefit from a lower batch count the way the Nvidia parts do. :???:
Looks like GCN is single-thread and latency limited since the test is single lane. I wonder whether the Maxwells benefit here from the narrower, 32-thread-optimised pipeline and FMA + load/store dual issue.
 
Looks like GCN is single-thread and latency limited since the test is single lane. I wonder whether the Maxwells benefit here from the narrower, 32-thread-optimised pipeline and FMA + load/store dual issue.
That's a good point; the compute kernel might need to be normalised against a version that computes a set of 100,000 work items, or a variant that ramps up the count of work items, allowing work groups to be filled...

I dare say the results being posted are fascinating even for such a simple test. Fillrate on GM200 (980Ti) starts at a relatively high percentage of native fillrate, 61/94 = 65%, and falls to 29/94 = 31%. Fiji starts worse, but barely changes, holding at roughly a constant 50%.

---

Having looked at the shader code and seen the results on GM200, I think two things need to be said:

1. performance steps in increments of 32, which is the work group size on GM200. This looks to me as if NVidia's driver has observed the static configuration here and discerned that 32 work groups can be packed into a single work group. There are no barriers (which are a work-group wide synchronisation) and there's no use of memory fences, so the kernel is trivial to pack into 32-wide work-groups instead of 1-wide.

2. the kernel contains a loop iterator with hard-coded bounds. This is bait for any compiler to pre-compute the result and stick it in a constant. If the compiler spots this, then the kernel runs in near-0 ALU cycles.
 
AMD pours more drama into the AotS case:
Oxide effectively summarized my thoughts on the matter. NVIDIA claims "full support" for DX12, but conveniently ignores that Maxwell is utterly incapable of performing asynchronous compute without heavy reliance on slow context switching.

GCN has supported async shading since its inception, and it did so because we hoped and expected that gaming would lean into these workloads heavily. Mantle, Vulkan and DX12 all do. The consoles do (with gusto). PC games are chock full of compute-driven effects.

If memory serves, GCN has higher FLOPS/mm2 than any other architecture, and GCN is once again showing its prowess when utilized with common-sense workloads that are appropriate for the design of the architecture.
Reddit
 
So here's the updated version of the benchmark. It now runs compute only, graphics only, and compute + graphics, and this time uses just two command queues.
GTX 680:

So multiple dispatches, yes, and graphics + compute is, as expected, just graphics plus compute (not async).
Anyone willing to give it a try on Maxwell 2 or GCN?

Very good app. But I have some questions: you are rendering graphics to a 4k x 4k surface, which means about 16M pixel shader invocations are in flight. As far as I can see, on NVidia HW compute is serialized after the graphics task - so maybe the graphics task has some kind of higher priority and can't be preempted by compute tasks. Maybe you could refine your app so there are fewer pixel shaders in flight? But I have no idea how to do it ;) A small triangle of about 500px will just be very fast, and many small triangles will execute in parallel and also take all the resources...
 
A picture speaks louder: here are the results from Fellix and Dygaza visualized, to scale. Lower is better.

ac_980ti_vs_fury_x.png
 
It's running from 1 to 128 single lane compute kernels


Can you raise this number to 256 or 512, for example?
None of the GCN chips have their performance even budging between 1 and 128 kernels, and with async there's hardly any difference at all. Perhaps it'd be interesting to see how many more are needed until we see the latency stepping up like the Kepler and Maxwell chips.
 
I dare say the results being posted are fascinating even for such a simple test. Fillrate on GM200 (980Ti) starts at a relatively high percentage of native fillrate, 61/94 = 65%, and falls to 29/94 = 31%. Fiji starts worse, but barely changes, holding at roughly a constant 50%.
That's more a question of which part finishes first. The graphics load is static, the compute load changes. So the 980 Ti starts at a high percentage because compute is that much quicker to complete. It doesn't run in parallel with graphics, though.

1. performance steps in increments of 32, which is the work group size on GM200. This looks to me as if NVidia's driver has observed the static configuration here and discerned that 32 work groups can be packed into a single work group. There are no barriers (which are a work-group wide synchronisation) and there's no use of memory fences, so the kernel is trivial to pack into 32-wide work-groups instead of 1-wide.
It steps in groups of 32 on GM200, groups of 16 on GTX 750 and groups of 8 on GTX 680. I think you'll also find it interesting that the jump is actually at 31 not at 32.

2. the kernel contains a loop iterator with hard-coded bounds. This is bait for any compiler to pre-compute the result and stick it in a constant. If the compiler spots this, then the kernel runs in near-0 ALU cycles.
Well, the compiler would also have to assume that the thread id is actually constant. It could do that, I guess, given it's also there in the code, but the numbers we're seeing don't suggest anywhere near 0 ALU cycles. I just wanted something simple that would keep an SMM/SMX/CU busy for a long, easily configurable time, so Fibonacci with a small change seemed like the best idea.
 
That's more a question of which part finishes first. The graphics load is static, the compute load changes. So the 980 Ti starts at a high percentage because compute is that much quicker to complete. It doesn't run in parallel with graphics, though.
Are you saying that you are computing fillrate based on the total execution time, even if the triangle was completed much earlier?

It steps in groups of 32 on GM200, groups of 16 on GTX 750 and groups of 8 on GTX 680. I think you'll also find it interesting that the jump is actually at 31 not at 32.
The boundary is 31, 64, 96 and 128. So the first boundary is the outlier in this case, though it appears that the first boundary always behaves this way on the 3 NVidia architectures documented so far...

Also 16 and 8, do those sizes correspond with the native width of the SIMD in each case?

Wasn't it stated before that Maxwell 2 supports 32 queues for compute? I now realise that this could be the reason for the magic number and so the match with the SIMD width is pure coincidence.

I'm now wondering if the compute latencies being reported are system-wide latencies. ~50ms for a single kernel on Fiji is mostly system latency, since the theoretical execution time at 1.05 GHz is 7.6ms. So the kernel itself doesn't run for long enough to hide system latencies.

Of course, system latencies are a thing to quantify.

On Fiji, though, you'd need about 17,000 kernels before you'd fill the GPU with enough work to use up 50ms of execution time: 256 SIMDs with 10 work-groups each is 2,560 concurrent kernels, and 50/7.6 ≈ 6.6 batches of those gives roughly 17,000. Arguably fewer kernel launches than that, since they won't all be launched concurrently.

Well, the compiler would also have to assume that the thread id is actually constant. It could do that, I guess, given it's also there in the code, but the numbers we're seeing don't suggest anywhere near 0 ALU cycles. I just wanted something simple that would keep an SMM/SMX/CU busy for a long, easily configurable time, so Fibonacci with a small change seemed like the best idea.
I checked and on GCN it compiles like this:

Code:
shader csMain
  asic(SI)
  type(CS)

  v_add_i32     v0, vcc, s8, v0                             // 00000000: 4A000008
  v_add_i32     v1, vcc, s9, v1                             // 00000004: 4A020209
  v_cvt_f32_u32  v0, v0                                     // 00000008: 7E000D00
  v_cvt_f32_u32  v1, v1                                     // 0000000C: 7E020D01
  v_add_f32     v0, 1.0, v0                                 // 00000010: 060000F2
  v_add_f32     v1, 1.0, v1                                 // 00000014: 060202F2
  v_mov_b32     v2, 0                                       // 00000018: 7E040280
  s_movk_i32    s0, 0x0000                                  // 0000001C: B0000000
label_0008:
  s_cmp_ge_i32  s0, 0x00100000                              // 00000020: BF03FF00 00100000
  s_cbranch_scc1  label_0012                                // 00000028: BF850007
  v_add_f32     v0, v0, v1                                  // 0000002C: 06000300
  v_mul_f32     v2, 0x3f000011, v0                          // 00000030: 100400FF 3F000011
  s_add_u32     s0, s0, 1                                   // 00000038: 80008100
  v_mov_b32     v0, v1                                      // 0000003C: 7E000301
  v_mov_b32     v1, v2                                      // 00000040: 7E020302
  s_branch      label_0008                                  // 00000044: BF82FFF6
label_0012:
  v_lshl_b64    v[0:1], 0, 0                                // 00000048: D2C20000 00010080
  buffer_store_dword  v2, v[0:1], s[4:7], 0 offen idxen     // 00000050: E0703000 80010200
  s_endpgm                                                  // 00000058: BF810000
end
Which results in an 8 cycle inner loop and theoretical 7.6ms execution time.
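For readers who don't read GCN ISA, my reading of that listing in C-like terms is roughly the following (a reconstruction from the disassembly, not the author's HLSL source; tid_x/tid_y stand in for whatever the shader derives from s8/s9 plus the local thread id, and 0x3F000011 is ≈0.500001):

Code:
// Rough C equivalent of the GCN disassembly above (reader's reconstruction,
// not the benchmark's HLSL source).
float fib_like(unsigned tid_x, unsigned tid_y)
{
    float a = 1.0f + (float)tid_x;        // v_add_i32 + v_cvt_f32_u32 + v_add_f32 -> v0
    float b = 1.0f + (float)tid_y;        // same pattern -> v1
    float r = 0.0f;                       // v2
    for (int i = 0; i < 0x00100000; ++i)  // s0 counter, 1,048,576 iterations
    {
        float t = a + b;                  // v_add_f32 v0, v0, v1
        r = 0.500001f * t;                // v_mul_f32 v2, 0x3F000011, v0
        a = b;                            // v_mov_b32 v0, v1
        b = r;                            // v_mov_b32 v1, v2
    }
    return r;                             // buffer_store_dword v2
}

The per-iteration multiply by ~0.5 keeps the Fibonacci-style sum from blowing up, and the dependence on the thread id is presumably what prevents the compiler from folding the whole loop into a constant, as discussed above.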

Someone should be able to get the compiled code for GM200. ~10ms being reported seems like a reasonable indication, since it should be ~25% slower than Fiji in the worst-case (9.9ms).

All the same, it would be nice to eliminate system-wide latencies on both platforms. But I think NVidia has historically had radically lower kernel launch latencies in compute, so this pattern isn't surprising.

Fiji's 40+ms overhead is looking pretty useless.
 
Are you saying that you are computing fillrate based on the total execution time, even if the triangle was completed much earlier?
Correct. Though now that you mention it... WaitForMultipleObjects does return which event finished first, so this would be nice to see, yes.
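A hedged sketch of what that check could look like, since WaitForMultipleObjects with bWaitAll = FALSE returns WAIT_OBJECT_0 plus the index of whichever event fired first (illustrative names only; assumes each fence is signalled with value 1 on its own queue):

Code:
#include <windows.h>
#include <d3d12.h>

// Returns 0 if the graphics fence signalled first, 1 if the compute fence did.
// Sketch only, with illustrative names; not the benchmark's actual code.
int WhichFinishedFirst(ID3D12Fence* gfxFence, ID3D12Fence* computeFence)
{
    HANDLE events[2] = { CreateEvent(nullptr, TRUE, FALSE, nullptr),   // manual-reset
                         CreateEvent(nullptr, TRUE, FALSE, nullptr) };
    gfxFence->SetEventOnCompletion(1, events[0]);
    computeFence->SetEventOnCompletion(1, events[1]);

    // bWaitAll = FALSE: returns as soon as either event is signalled.
    DWORD first = WaitForMultipleObjects(2, events, FALSE, INFINITE);
    int winner = (int)(first - WAIT_OBJECT_0);

    // Then wait for the remaining one before tearing anything down.
    WaitForMultipleObjects(2, events, TRUE, INFINITE);
    CloseHandle(events[0]);
    CloseHandle(events[1]);
    return winner;
}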

Also 16 and 8, do those sizes correspond with the native width of the SIMD in each case?

Wasn't it stated before that Maxwell 2 supports 32 queues for compute? I now realise that this could be the reason for the magic number and so the match with the SIMD width is pure coincidence.
The warp is 32 wide on all NV platforms, though I don't think that plays a role here. It was stated as 31 compute queues + 1 graphics/compute queue.

Which results in an 8 cycle inner loop and theoretical 7.6ms execution time.

Someone should be able to get the compiled code for GM200. ~10ms being reported seems like a reasonable indication, since it should be ~25% slower than Fiji in the worst-case (9.9ms).

All the same, it would be nice to eliminate system-wide latencies on both platforms. But I think NVidia has historically had radically lower kernel launch latencies in compute, so this pattern isn't surprising.

Fiji's 40+ms overhead is looking pretty useless.
40+ms launch overhead would indeed make it completely useless. That would cap any game making a dispatch call to 25 fps. Can you try shortening the loop to 1024 and see how it reacts? You have a GCN board, right?

I'm still a bit bothered by Intel, which I'd use as a kind of control. But I still can't get the Haswell GPU to do multiple dispatches in parallel, and Andrew says it can :). So it really makes you wonder what is actually enabled in current D3D12 drivers.
 
I've delayed W10 installation until I'm reasonably sure it's not going to piss me off. Can I be bothered to knock-up an OpenCL version of the test...

As for the launch overhead, I think that's partially related to the test which fills then drains the device queues. Normally games never let device queues run dry. It might be an idea to enqueue a hundred iterations of the 128-kernel scenario, for example, to see the effect.
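One hedged way to do that, assuming a set of pre-recorded command lists (one per iteration of the 128-kernel scenario, so nothing is re-submitted while it might still be in flight) and a single fence signal at the end:

Code:
#include <windows.h>
#include <d3d12.h>
#include <vector>

// Sketch only, illustrative names: submit everything up front so the compute
// queue never runs dry, then wait on one fence signalled after the whole backlog.
void EnqueueMany(ID3D12CommandQueue* computeQueue,
                 ID3D12Fence* fence,
                 const std::vector<ID3D12GraphicsCommandList*>& lists)
{
    for (ID3D12GraphicsCommandList* list : lists)
    {
        ID3D12CommandList* batch[] = { list };
        computeQueue->ExecuteCommandLists(1, batch);
    }
    computeQueue->Signal(fence, 1);   // one signal after the whole backlog

    HANDLE done = CreateEvent(nullptr, TRUE, FALSE, nullptr);
    fence->SetEventOnCompletion(1, done);
    WaitForSingleObject(done, INFINITE);
    CloseHandle(done);
}

Timing from the first ExecuteCommandLists to that final fence would amortise any per-launch overhead across all iterations.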

Still there's something fishy. If a shadow-buffer pass takes <5ms, then on AMD is that finished before the compute kernel that's supposed to run in parallel even starts? Does D3D allow the developer to synchronise kernel launches?
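On the last question: as far as I understand it, D3D12 does let you order work across queues explicitly with fences, where ID3D12CommandQueue::Wait stalls the queue on the GPU rather than blocking the CPU. A minimal sketch with illustrative names:

Code:
#include <d3d12.h>

// Sketch: the compute queue only starts its work after the graphics queue has
// finished the shadow pass, via a shared fence. Illustrative names only.
void SyncComputeAfterGraphics(ID3D12CommandQueue* gfxQueue,
                              ID3D12CommandQueue* computeQueue,
                              ID3D12Fence* sharedFence,
                              ID3D12CommandList* shadowPass,
                              ID3D12CommandList* computeWork)
{
    gfxQueue->ExecuteCommandLists(1, &shadowPass);
    gfxQueue->Signal(sharedFence, 1);       // GPU-side signal when the pass completes

    computeQueue->Wait(sharedFence, 1);     // queue waits on the GPU, CPU is not blocked
    computeQueue->ExecuteCommandLists(1, &computeWork);
}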

Is your test multi-threaded? Should the compute and graphics tasks be running on distinct CPU threads?

I know very little about the D3D execution model...
 
40+ms launch overhead would indeed make it completely useless.

I measured ~0.01 ms on an AMD 295 and ~1.0 ms on an Nvidia 780 Ti once in a DX11 engine path, using passthrough compute shaders directly after a graphics task. Compute dispatch was abysmally slower on Nvidia than on AMD. I don't remember how it was when compute followed compute. This 40ms shouldn't be dispatch latency.
 
Personally, I think one could just as easily make the claim that we were biased toward Nvidia as the only 'vendor' specific code is for Nvidia where we had to shutdown async compute. By vendor specific, I mean a case where we look at the Vendor ID and make changes to our rendering path. Curiously, their driver reported this feature was functional but attempting to use it was an unmitigated disaster in terms of performance and conformance so we shut it down on their hardware. As far as I know, Maxwell doesn't really have Async Compute so I don't know why their driver was trying to expose that. The only other thing that is different between them is that Nvidia does fall into Tier 2 class binding hardware instead of Tier 3 like AMD which requires a little bit more CPU overhead in D3D12, but I don't think it ended up being very significant. This isn't a vendor specific path, as it's responding to capabilities the driver reports.
Does anyone here know how to get that result? Did NVIDIA add such a query to the NDA version of NVAPI? -_-
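As far as I know there is no public cap bit for async compute itself, so that part must be inferred from behaviour (or from vendor-specific APIs). The resource binding tier mentioned in the quote, though, is queryable through the standard feature-support API. A small sketch:

Code:
#include <windows.h>
#include <d3d12.h>

// Sketch: query the resource binding tier the driver reports (Tier 2 vs Tier 3
// as mentioned in the quote). To my knowledge there is no equivalent public
// query for async compute support.
D3D12_RESOURCE_BINDING_TIER GetBindingTier(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                              &options, sizeof(options))))
    {
        return options.ResourceBindingTier;   // D3D12_RESOURCE_BINDING_TIER_1/2/3
    }
    return D3D12_RESOURCE_BINDING_TIER_1;
}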
 
An AMD official weighs in on the Oxide Games employee's post:

AMD_Robert said:
Oxide effectively summarized my thoughts on the matter. NVIDIA claims "full support" for DX12, but conveniently ignores that Maxwell is utterly incapable of performing asynchronous compute without heavy reliance on slow context switching.

GCN has supported async shading since its inception, and it did so because we hoped and expected that gaming would lean into these workloads heavily. Mantle, Vulkan and DX12 all do. The consoles do (with gusto). PC games are chock full of compute-driven effects.

If memory serves, GCN has higher FLOPS/mm2 than any other architecture, and GCN is once again showing its prowess when utilized with common-sense workloads that are appropriate for the design of the architecture.
 
Unfortunately, with the current market share, Nvidia not being capable doesn't hurt Nvidia; it hurts AMD, which is supposed to gain from the use of async compute, since it somewhat reduces the incentive to use these techniques. Sorry for bringing the business side of things into this technical thread.
 