And why does the 750Ti show the same 32 command pattern as the 9 series? Shouldn't it be different since it doesn't support even the 31+1 of the 9 series?
It doesn't. It jumps after every 16.
Slide 22 of this presentation (http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf) lists many GCN-specific things that would be hard to port to PC DirectX. And if people start to do crazy stuff such as writing to the GPU command queue by a compute shader (to spawn tasks by the GPU), then porting to PC becomes almost impossible.

I think the EDRAM plus the HSA-like architecture of the consoles makes a load of console-specific, performance-centric design decisions moot in the PC space.
Also, if publishers hand over console games to some fly-by-night studio whose only job is to get the game working on PC, with just the art assets and the console gaming experience as a guide, then you get something like Batman: Arkham Knight.
I think gamers are learning an important lesson: there's no such thing as "full support" for DX12 on the market today.
There have been many attempts to distract people from this truth through campaigns that deliberately conflate feature levels, individual untiered features and the definition of "support." This has been confusing, and it has caused a great deal of unnecessary heartache and rumor-mongering.
Here is the unvarnished truth: Every graphics architecture has unique features, and no one architecture has them all. Some of those unique features are more powerful than others.
Yes, we're extremely pleased that people are finally beginning to see the game of chess we've been playing with the interrelationship of GCN, Mantle, DX12, Vulkan and LiquidVR.
Rasterizer Ordered Views and Conservative Raster. Thankfully, the techniques these enable (like global illumination) can already be done in other ways at high frame rates (see: DiRT Showdown).
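For anyone wondering how an engine actually finds out what a given card exposes: you query the D3D12 runtime per feature rather than trusting a blanket "supports DX12" label. Here is a minimal sketch, assuming an already-created ID3D12Device (PrintOptionalFeatures is just an illustrative helper name, not anything from the posts above):

#include <windows.h>
#include <d3d12.h>
#include <cstdio>

// Print the optional-feature caps the driver reports for this GPU.
void PrintOptionalFeatures(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS opts = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                              &opts, sizeof(opts))))
    {
        // Rasterizer Ordered Views are a plain yes/no capability bit.
        std::printf("ROVs supported:           %d\n", opts.ROVsSupported);

        // Conservative rasterization and resource binding are reported as
        // tiers, which is exactly why blanket "full support" claims are slippery.
        std::printf("Conservative raster tier: %d\n", opts.ConservativeRasterizationTier);
        std::printf("Resource binding tier:    %d\n", opts.ResourceBindingTier);
    }
}

Different architectures fill in different combinations of these fields, which is the point being made above: support is per feature and per tier, not all-or-nothing.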
Tier 2 vs Tier 3 binding is a completely separate issue from Async Compute. It has to do with the number of root-level descriptors we can pass. In Tier 3, it turns out we basically never have to update a descriptor during a frame, but in Tier 2 we sometimes have to build a few. I don't think it's a significant performance issue though, just a technical detail.
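For context, "root level descriptors" are parameters placed directly in the D3D12 root signature, as opposed to descriptor tables that point into a descriptor heap; roughly speaking, the resource binding tiers differ in how much can be bound each way and how the tables must be kept populated. A minimal sketch of the two kinds of parameter, assuming a valid ID3D12Device and skipping error handling (MakeRootSignature is just an illustrative name, not anything from Nitrous):

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Build a root signature with one root-level CBV and one descriptor table.
ComPtr<ID3D12RootSignature> MakeRootSignature(ID3D12Device* device)
{
    // A range of shader resource views reached through a descriptor table.
    D3D12_DESCRIPTOR_RANGE srvRange = {};
    srvRange.RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
    srvRange.NumDescriptors = 8;          // t0..t7
    srvRange.BaseShaderRegister = 0;

    D3D12_ROOT_PARAMETER params[2] = {};

    // Root-level descriptor: a CBV passed directly in the root signature.
    params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_CBV;
    params[0].Descriptor.ShaderRegister = 0;   // b0
    params[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

    // Descriptor table: an extra indirection through a descriptor heap.
    params[1].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
    params[1].DescriptorTable.NumDescriptorRanges = 1;
    params[1].DescriptorTable.pDescriptorRanges = &srvRange;
    params[1].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

    D3D12_ROOT_SIGNATURE_DESC desc = {};
    desc.NumParameters = 2;
    desc.pParameters = params;

    ComPtr<ID3DBlob> blob, errors;
    D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1, &blob, &errors);

    ComPtr<ID3D12RootSignature> rootSig;
    device->CreateRootSignature(0, blob->GetBufferPointer(), blob->GetBufferSize(),
                                IID_PPV_ARGS(&rootSig));
    return rootSig;
}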
Regarding the purpose of Async Compute, there are really two main reasons for it (a minimal D3D12 sketch of the queue setup follows this list):
1) It allows jobs to be cycled into the GPU during dormant phases. It can vaguely be thought of as the GPU equivalent of hyper-threading. Like hyper-threading, how important this is really depends on the workload and GPU architecture. In this case, it is used for performance. I can't divulge too many details, but GCN can cycle in work from an ACE incredibly efficiently. Maxwell's scheduler has no analog, just as a non-hyper-threaded CPU has no analog feature to a hyper-threaded one.
2) It allows jobs to be cycled in completely out of band with the rendering loop. This is potentially the more interesting case, since it can allow gameplay to offload work onto the GPU, as the latency of that work is greatly reduced. I'm not sure of the background of Async Compute, but it's quite possible that it is intended for use on a console as a sort of replacement for the Cell processors on a PS3. In a console environment, you really can use them in a very similar way. This could mean that jobs could even span frames, which is useful for longer, optional computational tasks.
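At the API level, "cycling work in out of band" just means feeding the GPU from a second, COMPUTE-type queue next to the usual DIRECT (graphics) queue, and only fencing where the two actually depend on each other. A minimal sketch in plain D3D12 (a generic illustration, not Oxide's code; the compute pipeline state and Dispatch() recording are omitted to keep it short):

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // The normal rendering loop lives on a DIRECT (graphics) queue.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    // A second, COMPUTE-type queue: work submitted here can be scheduled by
    // the GPU alongside (or in the gaps of) the graphics work.
    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // Record a compute command list. The root signature, pipeline state and
    // Dispatch() calls would go here; they are omitted in this sketch.
    ComPtr<ID3D12CommandAllocator> alloc;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE, IID_PPV_ARGS(&alloc));
    ComPtr<ID3D12GraphicsCommandList> cl;
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE, alloc.Get(),
                              nullptr, IID_PPV_ARGS(&cl));
    cl->Close();

    // Submit on the compute queue and signal a fence when the GPU finishes.
    ID3D12CommandList* lists[] = { cl.Get() };
    computeQueue->ExecuteCommandLists(1, lists);
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    computeQueue->Signal(fence.Get(), 1);

    // The graphics queue only stalls here if it actually consumes the compute
    // results; otherwise both queues keep the GPU fed independently.
    gfxQueue->Wait(fence.Get(), 1);
    return 0;
}

Whether the GPU genuinely overlaps the two queues, or quietly serializes them, is exactly the hardware difference being argued about in this thread.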
It didn't look to me like there was a hardware defect on Maxwell, just some unfortunate, complex interaction with the software scheduling trying to emulate it, which appeared to incur some heavy CPU costs. Since we were trying to use it for #1, not #2, it made little sense to bother. I don't believe there is any specific requirement that Async Compute be supported for D3D12, but perhaps I misread the spec.
Regarding trying to figure out bottlenecks on GPUs, it's important to note that GPUs do not scale simply by adding more cores, especially for graphics tasks, which have a lot of serial points. My $.02 is that GCN is a bit triangle-limited, which is why you see greater performance at 4K, where the average triangle size is 4x the triangle size at 1080p (4K has four times the pixels of 1080p, so with the same geometry each triangle covers roughly four times as many pixels).
I think you're also being a bit short-sighted on the possible use of compute for general graphics. It is not limited to post-processing. Right now, I estimate about 20% of our graphics pipeline occurs in compute shaders, and we are projecting this to be more than 50% in the next iteration of our engine. In fact, it is even conceivable to build a rendering pipeline entirely in compute shaders. For example, there are alternative rendering primitives to triangles which are actually quite feasible in compute. There was a great talk at SIGGRAPH this year on this subject. If someone gave us a card with only a compute pipeline, I'd bet we could build an engine around it which would be plenty fast. In fact, this was the main motivating factor behind the Larrabee project. The main problem with Larrabee wasn't that it wasn't fast; it was that they failed to map DX9 games to it well enough to be a viable product. I'm not saying that the graphics pipeline will disappear anytime soon (or ever), but it's by no means certain that it's necessary. It's quite possible that in 5 years' time Nitrous's rendering pipeline is 100% implemented via compute shaders.
https://docs.unrealengine.com/lates...ing/ShaderDevelopment/AsyncCompute/index.html
AsyncCompute should be used with caution as it can cause more unpredictable performance and requires more coding effort for synchronization.
The Rendering Hardware Interface (RHI) now supports asynchronous compute (AsyncCompute) for Xbox One. This is a good way to utilize unused GPU resources (Compute Units (CUs), registers and bandwidth), by running dispatch() calls asynchronously with the rendering.
(...)
This feature was implemented by Lionhead Studios.
We integrated it and intend to make use of it as a tool to optimize the XboxOne rendering. As more APIs expose the hardware feature, we would like to make the system more cross-platform.
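The "more coding effort for synchronization" part is the cross-queue fencing: nothing orders the async compute work against the graphics work unless you insert fences yourself. UE4's RHI wraps this, but underneath, the D3D12-level pattern looks roughly like the following sketch (SubmitFrameWithAsyncCompute is a hypothetical helper; all objects are assumed to be created elsewhere, and this is not the UE RHI API itself):

#include <windows.h>
#include <d3d12.h>

// Hypothetical helper: run one async compute job between two pieces of
// graphics work, ordered by a single shared fence.
void SubmitFrameWithAsyncCompute(ID3D12CommandQueue* gfxQueue,
                                 ID3D12CommandQueue* computeQueue,
                                 ID3D12CommandList*  gfxProduceInput,
                                 ID3D12CommandList*  computeJob,
                                 ID3D12CommandList*  gfxConsumeResults,
                                 ID3D12Fence*        fence,
                                 UINT64              base)  // last value signaled on the fence
{
    // 1. Graphics renders whatever the compute job will read.
    gfxQueue->ExecuteCommandLists(1, &gfxProduceInput);
    gfxQueue->Signal(fence, base + 1);

    // 2. The compute queue must not start until that input is ready...
    computeQueue->Wait(fence, base + 1);
    computeQueue->ExecuteCommandLists(1, &computeJob);
    computeQueue->Signal(fence, base + 2);

    // 3. ...and graphics must not consume the results until compute is done.
    gfxQueue->Wait(fence, base + 2);
    gfxQueue->ExecuteCommandLists(1, &gfxConsumeResults);

    // Get either fence value wrong and you either read stale data or
    // serialize the two queues completely -- hence the note of caution above.
}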
Although it backfired really hard because of its even-worse-than-expected performance on Kepler cards, Arkham Knight is nVidia's dream come true.
If only gamers would suck it up, buy the game and not complain like the entitled little whining brats that they are...
Your selective quoting abilities are great. Let me try it too.
Same link: https://docs.unrealengine.com/lates...ing/ShaderDevelopment/AsyncCompute/index.html
Mark Cerny said: And there's a lot of features in the GPU to support asynchronous fine-grain computing. 'Asynchronous' is just saying it's not directly related to graphics, 'fine grain' is just saying it's a whole bunch of these running simultaneously on the GPU. So I think we're going to see the benefits of that architecture around 2016 or so.
With the PlayStation 4, it's even such things as the shader cores have a beautiful instruction set and can be programmed in assembly. If you were willing to invest the time to do that, you could do some incredibly efficient processing on the GPU for graphics or for anything else. But the timeframe for that kind of work would not be now. I don't even think it would be three years from now.
I just find it curious that you chose to quote the only sentence in the whole damn page that seems to demean the use of Async compute, that's all.

Yeah, so that's why I linked the entire page? Your point? Outside of useless banter?
A test run on a GK110 board would give us a bit more clarity.
I just find it curious that you chose to quote the only sentence in the whole damn page that seems to demean the use of Async compute, that's all.
Very curious.
One idea I had: if this is an internal processor, or potentially a SIMD running a firmware routine, it's a 32-slot structure.

The boundary is 31, 64, 96 and 128. So the first boundary is the outlier in this case, though it appears that the first boundary always behaves this way on the 3 NVidia architectures documented so far...
If the GPU is juggling two distinct modes internally, it might be that it cannot readily run both at the same time, hence the discussion of an expensive context switch.

If nVidia's chips need to add the rendering time to a compute task with even one active kernel, doesn't this mean that "Async Compute" is not actually working, and that nVidia's hardware, at least in this test, does not support Async Compute? Even if the driver does allow Async Compute tasks to be submitted, the hardware just seems to be doing rendering+compute in a serial fashion, not in parallel at all.
AMD_Robert said: Oxide effectively summarized my thoughts on the matter. NVIDIA claims "full support" for DX12, but conveniently ignores that Maxwell is utterly incapable of performing asynchronous compute without heavy reliance on slow context switching.
Kollok said: Curiously, their driver reported this feature was functional, but attempting to use it was an unmitigated disaster in terms of performance and conformance, so we shut it down on their hardware. As far as I know, Maxwell doesn't really have Async Compute, so I don't know why their driver was trying to expose that.
(...)
AFAIK, Maxwell doesn't support Async Compute, at least not natively. We disabled it at the request of Nvidia, as it was much slower to try to use it than not to.
(...)
It didn't look to me like there was a hardware defect on Maxwell, just some unfortunate, complex interaction with the software scheduling trying to emulate it, which appeared to incur some heavy CPU costs.