DX12 Performance Discussion And Analysis Thread

cduchesne · Sep 5, 2015

And so you have it for comparison, here is the unmodified run:

Compute only: 1. 27.83ms 512. 441.10ms
Graphics only: 27.65ms (60.68G pixels/s)
Graphics + compute: 1. 28.35ms (59.18G pixels/s) 512. 441.16ms (3.80G pixels/s)
Graphics, compute single commandlist: 1. 54.41ms (30.83G pixels/s) 512. 259.96ms (6.45G pixels/s)

SimBy · Sep 5, 2015

290 OC does close to the above. GPU usage and clocks are quite erratic.

Compute only: 1. 4.42ms 512. 68.57ms
Graphics only: 27.42ms (61.19G pixels/s)
Graphics + compute: 1. 25.90ms (64.77G pixels/s) 512. 81.70ms (20.54G pixels/s)
Graphics, compute single commandlist: 1. 31.59ms (53.11G pixels/s) 512. 74.99ms (22.37G pixels/s)

Jawed · Sep 5, 2015

cduchesne said:
Here's my results for the unrolled version on a 290x (1000Mhz / 1250 clock) with 15.8 beta driver.
I noticed that GPU clock rarely rose above 900 Mhz during the run and bounced between 30-100% utilization.

Utilisation isn't a useful metric, at least with AMD GPUs, as there are lots of subsystems whose utilisation are summed to make the overall GPU utilisation, I believe.

Your results are also pretty erratic, like HD 7950 - which was a bit of a surprise. I can't see any steps. Async shaders is still working really well though.

Now, what's puzzling me is why NVidia is apparently much slower despite the fact that AMD can only issue 1 instruction every 4 clocks on a SIMD, whereas NVidia should be issuing 1 instruction every clock.

Going back to the PTX that was found by OlegSH, we see this pattern repeating:

Code:

mul.f32  %f13, %f12, 0f3F000011;
fma.rn.f32  %f14, %f11, 0f3F000011, %f13;
fma.rn.f32  %f15, %f14, 0f3F000011, %f13;
mul.f32  %f16, %f15, 0f3F000011;
fma.rn.f32  %f17, %f14, 0f3F000011, %f16;
fma.rn.f32  %f18, %f17, 0f3F000011, %f16;

on AMD the pattern is different:

Code:

V_MUL_F32
V_MAC_F32

Does anyone have any ideas what's going on with the unrolled loop on NVidia? I think it should be much faster with less instructions.

Benny-ua · Sep 5, 2015

Jawed said:
I meant the version of MDolenc's test without unroll (which also uses up to 512 kernels).

Ok, results are below (original and yours version):

Gedeon · Sep 5, 2015

Hi everybody.
I'm really interested in this thread so I run the test in my both computers. It's curious how my 4 years old 7750 was already able to accept async tasks.

Compute only:
1. 65.94ms
2. 65.90ms
3. 65.90ms
4. 65.92ms
5. 65.95ms
6. 65.95ms
7. 65.90ms
8. 65.94ms
9. 65.95ms
10. 66.46ms
11. 66.05ms
12. 65.95ms
13. 65.92ms
14. 65.92ms
15. 65.93ms
16. 65.93ms
17. 65.90ms
18. 65.96ms
19. 65.92ms
20. 65.92ms
21. 65.93ms
22. 65.90ms
23. 65.90ms
24. 65.94ms
25. 65.96ms
26. 66.27ms
27. 65.93ms
28. 65.91ms
29. 65.93ms
30. 65.90ms
31. 65.89ms
32. 65.91ms
33. 65.91ms
34. 65.95ms
35. 65.91ms
36. 67.26ms
37. 65.92ms
38. 65.93ms
39. 65.94ms
40. 65.96ms
41. 67.67ms
42. 68.29ms
43. 65.99ms
44. 66.00ms
45. 65.98ms
46. 65.97ms
47. 65.97ms
48. 66.05ms
49. 65.93ms
50. 65.95ms
51. 65.94ms
52. 65.94ms
53. 66.29ms
54. 65.93ms
55. 65.95ms
56. 65.91ms
57. 66.02ms
58. 65.92ms
59. 65.92ms
60. 66.32ms
61. 65.91ms
62. 65.93ms
63. 66.10ms
64. 66.31ms
65. 87.05ms
66. 90.89ms
67. 96.57ms
68. 96.57ms
69. 97.03ms
70. 92.54ms
71. 94.71ms
72. 94.40ms
73. 98.09ms
74. 96.60ms
75. 98.75ms
76. 94.21ms
77. 94.99ms
78. 98.52ms
79. 96.75ms
80. 98.18ms
81. 97.92ms
82. 98.81ms
83. 97.74ms
84. 97.87ms
85. 97.75ms
86. 97.52ms
87. 98.74ms
88. 98.01ms
89. 97.40ms
90. 95.21ms
91. 98.97ms
92. 102.13ms
93. 97.73ms
94. 97.74ms
95. 97.51ms
96. 97.58ms
97. 105.50ms
98. 109.27ms
99. 106.84ms
100. 106.91ms
101. 107.21ms
102. 106.80ms
103. 107.09ms
104. 106.86ms
105. 106.88ms
106. 106.86ms
107. 106.86ms
108. 107.71ms
109. 106.85ms
110. 106.87ms
111. 106.88ms
112. 106.89ms
113. 106.87ms
114. 106.84ms
115. 108.17ms
116. 106.92ms
117. 106.87ms
118. 106.85ms
119. 106.80ms
120. 108.39ms
121. 106.88ms
122. 106.86ms
123. 106.85ms
124. 110.73ms
125. 106.86ms
126. 106.90ms
127. 106.91ms
128. 108.76ms
Graphics only: 126.41ms (13.27G pixels/s)
Graphics + compute:
1. 126.41ms (13.27G pixels/s)
2. 127.79ms (13.13G pixels/s)
3. 126.37ms (13.28G pixels/s)
4. 126.51ms (13.26G pixels/s)
5. 126.47ms (13.27G pixels/s)
6. 126.45ms (13.27G pixels/s)
7. 126.37ms (13.28G pixels/s)
8. 126.46ms (13.27G pixels/s)
9. 126.45ms (13.27G pixels/s)
10. 126.49ms (13.26G pixels/s)
11. 126.35ms (13.28G pixels/s)
12. 126.51ms (13.26G pixels/s)
13. 126.62ms (13.25G pixels/s)
14. 126.48ms (13.27G pixels/s)
15. 126.43ms (13.27G pixels/s)
16. 126.49ms (13.26G pixels/s)
17. 126.42ms (13.27G pixels/s)
18. 126.47ms (13.27G pixels/s)
19. 126.41ms (13.27G pixels/s)
20. 127.63ms (13.15G pixels/s)
21. 126.43ms (13.27G pixels/s)
22. 126.58ms (13.25G pixels/s)
23. 126.82ms (13.23G pixels/s)
24. 126.68ms (13.24G pixels/s)
25. 126.68ms (13.24G pixels/s)
26. 126.53ms (13.26G pixels/s)
27. 127.36ms (13.17G pixels/s)
28. 144.04ms (11.65G pixels/s)
29. 126.49ms (13.26G pixels/s)
30. 126.43ms (13.27G pixels/s)
31. 127.09ms (13.20G pixels/s)
32. 126.40ms (13.27G pixels/s)
33. 126.48ms (13.27G pixels/s)
34. 126.33ms (13.28G pixels/s)
35. 143.84ms (11.66G pixels/s)
36. 126.45ms (13.27G pixels/s)
37. 132.30ms (12.68G pixels/s)
38. 126.45ms (13.27G pixels/s)
39. 126.54ms (13.26G pixels/s)
40. 126.45ms (13.27G pixels/s)
41. 126.80ms (13.23G pixels/s)
42. 126.37ms (13.28G pixels/s)
43. 144.14ms (11.64G pixels/s)
44. 126.44ms (13.27G pixels/s)
45. 134.53ms (12.47G pixels/s)
46. 126.41ms (13.27G pixels/s)
47. 126.62ms (13.25G pixels/s)
48. 130.30ms (12.88G pixels/s)
49. 126.59ms (13.25G pixels/s)
50. 137.29ms (12.22G pixels/s)
51. 126.60ms (13.25G pixels/s)
52. 126.43ms (13.27G pixels/s)
53. 126.63ms (13.25G pixels/s)
54. 126.45ms (13.27G pixels/s)
55. 126.54ms (13.26G pixels/s)
56. 126.44ms (13.27G pixels/s)
57. 126.32ms (13.28G pixels/s)
58. 126.70ms (13.24G pixels/s)
59. 130.25ms (12.88G pixels/s)
60. 126.84ms (13.23G pixels/s)
61. 126.35ms (13.28G pixels/s)
62. 147.47ms (11.38G pixels/s)
63. 126.47ms (13.27G pixels/s)
64. 126.50ms (13.26G pixels/s)
65. 126.48ms (13.26G pixels/s)
66. 126.55ms (13.26G pixels/s)
67. 143.09ms (11.73G pixels/s)
68. 126.53ms (13.26G pixels/s)
69. 126.47ms (13.27G pixels/s)
70. 126.50ms (13.26G pixels/s)
71. 126.38ms (13.27G pixels/s)
72. 143.23ms (11.71G pixels/s)
73. 126.45ms (13.27G pixels/s)
74. 126.53ms (13.26G pixels/s)
75. 127.86ms (13.12G pixels/s)
76. 126.35ms (13.28G pixels/s)
77. 126.47ms (13.27G pixels/s)
78. 126.45ms (13.27G pixels/s)
79. 126.42ms (13.27G pixels/s)
80. 126.43ms (13.27G pixels/s)
81. 126.54ms (13.26G pixels/s)
82. 126.56ms (13.26G pixels/s)
83. 126.47ms (13.27G pixels/s)
84. 126.48ms (13.26G pixels/s)
85. 126.45ms (13.27G pixels/s)
86. 126.44ms (13.27G pixels/s)
87. 181.03ms (9.27G pixels/s)
88. 126.48ms (13.27G pixels/s)
89. 126.46ms (13.27G pixels/s)
90. 126.42ms (13.27G pixels/s)
91. 126.58ms (13.25G pixels/s)
92. 144.60ms (11.60G pixels/s)
93. 126.43ms (13.27G pixels/s)
94. 126.38ms (13.28G pixels/s)
95. 126.44ms (13.27G pixels/s)
96. 126.46ms (13.27G pixels/s)
97. 127.06ms (13.20G pixels/s)
98. 189.19ms (8.87G pixels/s)
99. 186.11ms (9.01G pixels/s)
100. 167.51ms (10.02G pixels/s)
101. 161.58ms (10.38G pixels/s)
102. 194.75ms (8.61G pixels/s)
103. 193.54ms (8.67G pixels/s)
104. 196.45ms (8.54G pixels/s)
105. 196.60ms (8.53G pixels/s)
106. 196.75ms (8.53G pixels/s)
107. 196.21ms (8.55G pixels/s)
108. 205.80ms (8.15G pixels/s)
109. 196.38ms (8.54G pixels/s)
110. 196.04ms (8.56G pixels/s)
111. 196.96ms (8.52G pixels/s)
112. 195.15ms (8.60G pixels/s)
113. 196.57ms (8.54G pixels/s)
114. 199.74ms (8.40G pixels/s)
115. 196.73ms (8.53G pixels/s)
116. 196.78ms (8.53G pixels/s)
117. 197.06ms (8.51G pixels/s)
118. 196.66ms (8.53G pixels/s)
119. 196.74ms (8.53G pixels/s)
120. 196.55ms (8.54G pixels/s)
121. 196.94ms (8.52G pixels/s)
122. 209.00ms (8.03G pixels/s)
123. 196.85ms (8.52G pixels/s)
124. 197.21ms (8.51G pixels/s)
125. 197.02ms (8.52G pixels/s)
126. 196.98ms (8.52G pixels/s)
127. 196.36ms (8.54G pixels/s)
128. 197.21ms (8.51G pixels/s)

Nobu · Sep 5, 2015

r9-270x before and after unrolled. Mine is the only 270x so far I think, so if you don't feel like graphing the one before unrolling that's fine. I did both of these today, though, so there shouldn't be anything different other than the hlsl source.

Still can't upload attachments--does it require flash, or is it a firefox thing?

fellix · Sep 5, 2015

As Overclock’s Mahigan explained:

“The Asynchronous Warp Schedulers are in the hardware. Each SMM (which is a shader engine in GCN terms) holds four AWSs. Unlike GCN, the scheduling aspect is handled in software for Maxwell 2. In the driver there’s a Grid Management Queue which holds pending tasks and assigns the pending tasks to another piece of software which is the work distributor. The work distributor then assigns the tasks to available Asynchronous Warp Schedulers. It’s quite a few different “parts” working together. A software and a hardware component if you will.

With GCN the developer sends work to a particular queue (Graphic/Compute/Copy) and the driver just sends it to the Asynchronous Compute Engine (for Async compute) or Graphic Command Processor (Graphic tasks but can also handle compute), DMA Engines (Copy). The queues, for pending Async work, are held within the ACEs (8 deep each)… and ACEs handle assigning Async tasks to available compute units.

Simplified…

Maxwell 2: Queues in Software, work distributor in software (context switching), Asynchronous Warps in hardware, DMA Engines in hardware, CUDA cores in hardware.
GCN: Queues/Work distributor/Asynchronous Compute engines (ACEs/Graphic Command Processor) in hardware, Copy (DMA Engines) in hardware, CUs in hardware.”

Looks like Nvidia pretty much distributed the asynchronous dispatching at multi-processor level in similar manner as tessellation. :???:

BRiT · Sep 5, 2015

For everyone who hasn't been keeping pace with things, please stop posting material that has already been fully posted in this thread.

We already know about the Oxide developer statements including Nvidia's response starting in the following original B3D post:

pharma said:
Oxide developer statement regarding D3D12:

The original source of the post is here:
http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/2130#post_24379702

Jawed · Sep 5, 2015

Benny-ua said:
Ok, results are below (original and yours version):
View attachment 909 View attachment 910

This is very interesting, as it seems to imply that the erratic results are due to the unrolled loop, perhaps it's "too long".

If you would like to experiment you can try to use:

i < 1024 x 8 and j < 64

or

i < 1024 x 16 and j < 32

Judging by the results you got with Vlev's code, it seems that the loop shouldn't be unrolled more than 32.

Jawed · Sep 5, 2015

Gedeon said:
Hi everybody.
I'm really interested in this thread so I run the test in my both computers. It's curious how my 4 years old 7750 was already able to accept async tasks.

Yay, Cape Verde. Unfortunately, the newer version in this post: DX12 Async Compute Latency thread is what you should be using now as it's more informative.

Nobu · Sep 5, 2015

Just thought I should mention it, in case nobody noticed...in the single command queue test, the numbers in the square brackets ("[]") go up more uniformly than in the regular graphics+compute test. In that test, the numbers increase/decrease at random intervals (in multiples of ~4), whereas in the single command queue test they increase after a set number (66, 131, etc.).

Jawed · Sep 5, 2015

Nobu said:
Just thought I should mention it, in case nobody noticed...in the single command queue test, the numbers in the square brackets ("[]") go up more uniformly than in the regular graphics+compute test. In that test, the numbers increase/decrease at random intervals (in multiples of ~4), whereas in the single command queue test they increase after a set number (66, 131, etc.).

Yes I commented on that very briefly earlier. I wonder if consecutive runs are overlapping with each other, especially on Fiji.

Gedeon · Sep 5, 2015

Jawed said:
Yay, Cape Verde. Unfortunately, the newer version in this post: DX12 Async Compute Latency thread is what you should be using now as it's more informative.

Thanks! I'll rerun the correct one when I can.

Benny-ua · Sep 5, 2015

Jawed said:
to experiment you can try to use:
i < 1024 x 8 and j < 64

Code:

// Generated by Microsoft (R) HLSL Shader Compiler 10.0.10011.16384
cs_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u0, 4
dcl_input vThreadID.xy
dcl_temps 3
dcl_thread_group 1, 1, 1
utof r0.xy, vThreadID.xyxx
add r0.xy, r0.xyxx, l(1.000000, 1.000000, 0.000000, 0.000000)
mov r0.zw, r0.xxxy
mov r1.xz, l(0,0,0,0)
loop 
  ige r1.w, r1.z, l(8192)
  breakc_nz r1.w
  add r1.w, r0.w, r0.z
  mad r2.x, r1.w, l(0.500001), r0.w
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r1.y, r1.w, l(0.500001)
  mad r1.w, r2.x, l(0.500001), r1.y
  mul r1.x, r1.w, l(0.500001)
  iadd r1.z, r1.z, l(1)
  mov r0.zw, r1.yyyx
endloop 
store_structured u0.x, l(0), l(0), r1.x
ret 
// Approximately 139 instruction slots used

Jawed said:
or i < 1024 x 16 and j < 32

Code:

// Generated by Microsoft (R) HLSL Shader Compiler 10.0.10011.16384
cs_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u0, 4
dcl_input vThreadID.xy
dcl_temps 3
dcl_thread_group 1, 1, 1
utof r0.xy, vThreadID.xyxx
add r0.xy, r0.xyxx, l(1.000000, 1.000000, 0.000000, 0.000000)
mov r0.zw, r0.xxxy
mov r1.xz, l(0,0,0,0)
loop 
  ige r1.w, r1.z, l(0x00004000)
  breakc_nz r1.w
  add r1.w, r0.w, r0.z
  mad r2.x, r1.w, l(0.500001), r0.w
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r2.y, r1.w, l(0.500001)
  mad r2.x, r2.x, l(0.500001), r2.y
  mul r2.y, r2.x, l(0.500001)
  mad r1.w, r1.w, l(0.500001), r2.y
  mul r1.y, r1.w, l(0.500001)
  mad r1.w, r2.x, l(0.500001), r1.y
  mul r1.x, r1.w, l(0.500001)
  iadd r1.z, r1.z, l(1)
  mov r0.zw, r1.yyyx
endloop 
store_structured u0.x, l(0), l(0), r1.x
ret 
// Approximately 75 instruction slots used

Results:

Jawed · Sep 5, 2015

Thanks Benny-ua, it seems 64 is actually the most consistent, the sweet spot, at least for original GCN. Maybe I shouldn't be surprised that the hardware thread size is the sweet spot. I wonder if this has something to do with the CP and the ACEs having different algorithms for spreading work to cores, since some AMD GPUs have a step size of 128 when the single command list is used.

Also, I wonder if there's a sweet spot like this for Maxwell 2.

Anyone else might want to see if there's a sweet spot effect like this on the GPU you test. I suppose Fiji is the AMD GPU that looks the most different for async shaders...

Ext3h · Sep 6, 2015

Jawed said:
I wonder if this has something to do with the CP and the ACEs having different algorithms for spreading work to cores, since some AMD GPUs have a step size of 128 when the single command list is used.

I wouldn't trust in the ACEs being responsible for that step size of 128. It's IMHO more likely that two programs each got batched prior to sending to the GPU, so it's still actually wave fronts of 64. Doesn't look like the driver has been packing them in parallel though, or the speedup would have been more significant.

What I AM surprised about though, is the steps of 64 on the Cap Verde chip - shouldn't that thing only have 2 ACEs with only a single queue each result in a wavefront size of 2? And it's still keeping up latency wise, so it really can't have a lower level of parallelism than any other GCN 1.2 chip, despite only having two queues.

Why? What is the driver doing on GCN 1.0? And why hasn't it done the same thing on any other GCN 1.1 or 1.2 architecture?

Infinisearch · Sep 6, 2015

Ext3h said:
I wouldn't trust in the ACEs being responsible for that step size of 128.

Isn't the single command list case when the CP is being used?

lanek · Sep 6, 2015

Ext3h said:
I wouldn't trust in the ACEs being responsible for that step size of 128. It's IMHO more likely that two programs each got batched prior to sending to the GPU, so it's still actually wave fronts of 64. Doesn't look like the driver has been packing them in parallel though, or the speedup would have been more significant.

What I AM surprised about though, is the steps of 64 on the Cap Verde chip - shouldn't that thing only have 2 ACEs with only a single queue each result in a wavefront size of 2? And it's still keeping up latency wise, so it really can't have a lower level of parallelism than any other GCN 1.2 chip, despite only having two queues.

Why? What is the driver doing on GCN 1.0? And why hasn't it done the same thing on any other GCN 1.1 or 1.2 architecture?

10 Wavefront by CU on GCN1.o, but they can initiate the work on a different CU ( crossbar ). This said, the scalar engine will surely got a role in it, in GCN1.2, the scalar can be used for use the ALU of another CU for balance the load.. remember that GCN like to be overloaded, not the invert.

Razor1 · Sep 6, 2015

lanek said:
10 Wavefront by CU on GCN1.o, but they can initiate the work on a different CU ( crossbar ). This said, the scalar engine will surely got a role in it, in GCN1.2, the scalar can be used for use the ALU of another CU for balance the load.. remember that GCN like to be overloaded, not the invert.

Well it can only use the CU's next to the original CU right?

lanek · Sep 6, 2015

Razor1 said:
Well it can only use the CU's next to the original CU right?

I think, but, seriously, we need someone who have work on this arch for go further on this .. There's surely a lot of information that we miss ..

DX12 Performance Discussion And Analysis Thread

cduchesne

Attachments

SimBy

Attachments

Jawed

Benny-ua

Attachments

Gedeon

Attachments

Nobu

fellix

BRiT

(>• •)>⌐■-■ (⌐■-■)

Jawed

Jawed

Nobu

Jawed

Gedeon

Benny-ua

Attachments

Jawed

Ext3h

Infinisearch

lanek

Razor1

lanek

Similar threads