DX12 Performance Discussion And Analysis Thread

What are the advantages of these versus profiling tools in licensed engines? You get memory allocation profiling tools in Unity/Unreal for example.
Unreal and Unity GPU profiling tools aren't that great for graphics programmers. They offer only basic counters and timing brackets. You'd want to know the reason why the shader is slow. PC PIX is a good start. They are adding hardware GPU counter support soon. That will allow you to know something about the hardware bottlenecks. Hopefully it gets closer to the console profiling experience.
 
I thought this was demoed at Build last year or the year before.
Has it really taken so long to be released?
 
Unreal and Unity GPU profiling tools aren't that great for graphics programmers. They offer only basic counters and timing brackets. You'd want to know the reason why the shader is slow. PC PIX is a good start. They are adding hardware GPU counter support soon. That will allow you to know something about the hardware bottlenecks. Hopefully it gets closer to the console profiling experience.

That's not completely accurate. Unity has a frame debugger that lets you step through draw calls. I solved a lot of problems this way. You're right though, it won't help you solve all issues (especially ones that only exist on certain IHVs), but it's more than basic counters and timing brackets. :)
 
DXIL (SM 6.0) compiler first public preview is out: https://github.com/Microsoft/DirectXShaderCompiler

HURRAY!
 
Shader Model 6.0 will arrive with the Creators Update. Nvidia/AMD should then have WDDM 2.2 drivers ready for release to support it.
 
That's not completely accurate. Unity has a frame debugger that lets you step through draw calls. I solved a lot of problems this way. You're right though, it won't help you solve all issues (especially ones that only exist on certain IHVs), but it's more than basic counters and timing brackets. :)
I was talking about profiling. Frame debuggers are a different thing. Unity's frame debugger is nice, but RenderDoc clearly exceeds it and works with any engine. RenderDoc, for example, allows you to step through shaders and inspect variables (registers) on every line of shader execution. You can also edit shaders on the fly and check the modified results (and timings). RenderDoc has excellent resource viewers (with captured human-readable resource names). You can inspect buffers in a raw binary view or specify any type (and it reinterpret-casts the data before showing it). RenderDoc also captures constant buffer layouts, making it easy to check whether all constants are correctly set for draws and dispatches. PC graphics development without RenderDoc is awful. I would say RenderDoc matches console debugging tools.

Console profiling tools, however, are way ahead of generic PC profiling tools. You get thousands of HW counters and various analysis reports based on the counter values (ALU, bandwidth, cache, bank conflicts, geometry pipeline stats, etc.). Occupancy and bottleneck graphs are invaluable in performance fine-tuning, especially with async compute and overlapping draws/dispatches. The new PC PIX will soon expose GPU counters. This will be a huge improvement over staring at a per-draw/dispatch millisecond counter without knowing why the GPU behaves the way it does.
 
................
Console profiling tools however are way ahead of generic PC profiling tools. ........

And that is somewhat of a real mystery to me. Looking at the nature of both systems, logically the PC should have a considerably larger lead in available tools (and that's not the case).
 
Why is it a mystery? As sebbbi said, there are generic profiling tools on PC. The problem is that all the really interesting stuff is proprietary and may change drastically with a change of GPU architecture (for example from Kepler to Maxwell, or from Polaris to Vega). At the same time, it does not translate well across different vendors (what you might be interested in on NV hardware might not be the same as on AMD hardware). IHVs have a much, much better tool suite for their own hardware.
 
I wish the graphs didn't swap DX11/DX12 positions sometimes. Just a minor gripe.

Also, you have JavaScript NaN errors on the RotTR graph, mainly on hover.
Normally, it should be automagically sorted with the highest performance on top - at least that's our standard way of doing it. I understand that it does incur a minor inconvenience when clicking through.

As for the NaN errors: I cannot reproduce them. They might have something to do with language settings and a different interpretation of the decimal separator. In Germany, we use a „,“ as the decimal separator instead of a "." (wow, that looked awkward with typographic quotation marks).
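
To illustrate the kind of mismatch suspected here, the following is a minimal Python sketch, not the site's actual chart code (which is JavaScript and not shown in this thread). The helper name parse_number and its decimal_sep parameter are hypothetical. It mimics a parser that expects a point as the decimal separator and yields NaN when handed a comma-formatted value, much as Number("54,19") evaluates to NaN in JavaScript.

```python
import math


def parse_number(text: str, decimal_sep: str = ".") -> float:
    """Parse a decimal string written with the given separator.

    Returns NaN instead of raising, to mimic how a chart script
    might silently end up with NaN on malformed input.
    (Hypothetical helper, for illustration only.)
    """
    normalized = text.strip().replace(decimal_sep, ".")
    try:
        return float(normalized)
    except ValueError:
        return float("nan")


# A comma-formatted value read by a parser that expects a point:
wrong = parse_number("54,19")

# The same value read with the separator it was written with:
right = parse_number("54,19", decimal_sep=",")

print(math.isnan(wrong), right)
```

If the chart data and the parsing code disagree about the separator in this way, every hover value would come out as NaN only for users whose settings trigger the mismatch, which would also explain why it cannot be reproduced on the author's machine.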
 
New drivers, some more patches applied and results on FX-8350+R9 Nano are thus:
http://www.pcgameshardware.de/DirectX-12-Software-255525/Specials/Direct-X-11-Vulkan-1218975/
Most gains are in 720p, when the CPU of choice (the hobbled 8350) is the bottleneck; otherwise we are looking at the classic Ashes and Hitman DX12 poster children. Results from the rest of the games do more harm than good for the image of DX12; some games even post negative results when CPU limited (Deus Ex and Warhammer). I second the calls for testing with a better CPU and GPU.
 
Though it did not improve performance at all in the AMD-modified D3D12nBodyGravity sample, it looks like the latest drivers (17.2.1) re-enabled async compute for GCN1 GPUs. Not sure how much this will impact games (currently my 280 is used as a slave GPU for non-game applications), but at least this should cause fewer problems for developers.

Here's a capture from GPUView. As you can see, the compute queue is back and runs concurrently with the default queue. Please note that some flip and presentation issues occur because the GCN1 GPU does not directly output to the monitor (which is controlled by an R9 380X, a GCN3 GPU) and everything is handled by WDDM magic.

200as.png


And here is the log of that small CS test:

Compute only:
1. 54.19ms
2. 54.17ms
3. 54.17ms
4. 54.18ms
5. 54.17ms
6. 54.18ms
7. 54.17ms
8. 54.16ms
9. 54.17ms
10. 54.18ms
11. 54.19ms
12. 54.18ms
13. 54.17ms
14. 54.17ms
15. 54.17ms
16. 54.17ms
17. 54.18ms
18. 54.16ms
19. 54.18ms
20. 54.17ms
21. 54.15ms
22. 54.14ms
23. 54.14ms
24. 54.14ms
25. 54.15ms
26. 54.15ms
27. 54.14ms
28. 54.14ms
29. 54.14ms
30. 54.17ms
31. 54.15ms
32. 54.14ms
33. 54.14ms
34. 54.14ms
35. 54.14ms
36. 54.14ms
37. 54.14ms
38. 54.14ms
39. 54.14ms
40. 54.15ms
41. 54.14ms
42. 54.15ms
43. 54.15ms
44. 54.15ms
45. 54.15ms
46. 54.15ms
47. 54.15ms
48. 54.15ms
49. 54.14ms
50. 54.15ms
51. 54.15ms
52. 54.15ms
53. 54.15ms
54. 54.16ms
55. 54.15ms
56. 54.16ms
57. 54.15ms
58. 54.15ms
59. 54.15ms
60. 54.16ms
61. 54.15ms
62. 54.15ms
63. 54.16ms
64. 54.18ms
65. 54.17ms
66. 54.15ms
67. 54.15ms
68. 54.15ms
69. 54.15ms
70. 54.15ms
71. 54.15ms
72. 54.16ms
73. 54.15ms
74. 54.15ms
75. 54.17ms
76. 54.15ms
77. 54.14ms
78. 54.15ms
79. 54.15ms
80. 54.16ms
81. 54.14ms
82. 54.16ms
83. 55.18ms
84. 54.19ms
85. 54.15ms
86. 54.14ms
87. 54.14ms
88. 54.15ms
89. 54.16ms
90. 54.15ms
91. 54.14ms
92. 54.15ms
93. 54.17ms
94. 54.15ms
95. 54.14ms
96. 54.15ms
97. 54.15ms
98. 54.15ms
99. 54.17ms
100. 54.15ms
101. 54.14ms
102. 54.15ms
103. 54.15ms
104. 54.16ms
105. 54.18ms
106. 54.18ms
107. 54.15ms
108. 54.14ms
109. 54.15ms
110. 54.14ms
111. 54.14ms
112. 54.14ms
113. 54.14ms
114. 54.15ms
115. 54.14ms
116. 54.15ms
117. 54.15ms
118. 54.14ms
119. 54.14ms
120. 54.14ms
121. 54.14ms
122. 54.14ms
123. 54.14ms
124. 54.14ms
125. 54.15ms
126. 54.14ms
127. 54.14ms
128. 54.14ms
Graphics only: 56.29ms (29.81G pixels/s)
Graphics + compute:
1. 56.25ms (29.82G pixels/s)
2. 56.23ms (29.84G pixels/s)
3. 56.23ms (29.84G pixels/s)
4. 56.25ms (29.83G pixels/s)
5. 56.25ms (29.83G pixels/s)
6. 56.23ms (29.84G pixels/s)
7. 56.25ms (29.82G pixels/s)
8. 56.24ms (29.83G pixels/s)
9. 56.24ms (29.83G pixels/s)
10. 56.23ms (29.84G pixels/s)
11. 56.23ms (29.84G pixels/s)
12. 56.24ms (29.83G pixels/s)
13. 56.23ms (29.84G pixels/s)
14. 56.24ms (29.83G pixels/s)
15. 56.23ms (29.84G pixels/s)
16. 56.24ms (29.83G pixels/s)
17. 56.23ms (29.84G pixels/s)
18. 56.24ms (29.83G pixels/s)
19. 56.24ms (29.83G pixels/s)
20. 56.23ms (29.84G pixels/s)
21. 56.24ms (29.83G pixels/s)
22. 56.23ms (29.84G pixels/s)
23. 56.25ms (29.83G pixels/s)
24. 56.23ms (29.84G pixels/s)
25. 56.24ms (29.83G pixels/s)
26. 56.24ms (29.83G pixels/s)
27. 56.26ms (29.82G pixels/s)
28. 56.23ms (29.83G pixels/s)
29. 56.23ms (29.84G pixels/s)
30. 56.24ms (29.83G pixels/s)
31. 56.24ms (29.83G pixels/s)
32. 56.24ms (29.83G pixels/s)
33. 56.24ms (29.83G pixels/s)
34. 56.24ms (29.83G pixels/s)
35. 56.23ms (29.84G pixels/s)
36. 56.25ms (29.83G pixels/s)
37. 56.24ms (29.83G pixels/s)
38. 56.24ms (29.83G pixels/s)
39. 56.24ms (29.83G pixels/s)
40. 56.24ms (29.83G pixels/s)
41. 56.24ms (29.83G pixels/s)
42. 56.23ms (29.84G pixels/s)
43. 56.24ms (29.83G pixels/s)
44. 56.24ms (29.83G pixels/s)
45. 56.24ms (29.83G pixels/s)
46. 56.23ms (29.84G pixels/s)
47. 56.23ms (29.84G pixels/s)
48. 56.24ms (29.83G pixels/s)
49. 56.24ms (29.83G pixels/s)
50. 56.25ms (29.83G pixels/s)
51. 56.24ms (29.83G pixels/s)
52. 56.24ms (29.83G pixels/s)
53. 56.23ms (29.84G pixels/s)
54. 56.24ms (29.83G pixels/s)
55. 56.26ms (29.82G pixels/s)
56. 56.24ms (29.83G pixels/s)
57. 56.24ms (29.83G pixels/s)
58. 56.24ms (29.83G pixels/s)
59. 56.24ms (29.83G pixels/s)
60. 56.25ms (29.82G pixels/s)
61. 56.23ms (29.83G pixels/s)
62. 56.23ms (29.84G pixels/s)
63. 56.24ms (29.83G pixels/s)
64. 56.23ms (29.84G pixels/s)
65. 56.23ms (29.84G pixels/s)
66. 56.23ms (29.84G pixels/s)
67. 56.24ms (29.83G pixels/s)
68. 56.23ms (29.84G pixels/s)
69. 56.23ms (29.84G pixels/s)
70. 56.23ms (29.84G pixels/s)
71. 56.23ms (29.83G pixels/s)
72. 56.23ms (29.84G pixels/s)
73. 56.23ms (29.84G pixels/s)
74. 56.23ms (29.84G pixels/s)
75. 56.23ms (29.84G pixels/s)
76. 56.24ms (29.83G pixels/s)
77. 56.23ms (29.84G pixels/s)
78. 56.23ms (29.84G pixels/s)
79. 56.23ms (29.84G pixels/s)
80. 56.23ms (29.83G pixels/s)
81. 56.23ms (29.84G pixels/s)
82. 56.23ms (29.84G pixels/s)
83. 56.23ms (29.84G pixels/s)
84. 56.23ms (29.84G pixels/s)
85. 56.24ms (29.83G pixels/s)
86. 56.23ms (29.84G pixels/s)
87. 56.23ms (29.84G pixels/s)
88. 56.23ms (29.84G pixels/s)
89. 56.23ms (29.84G pixels/s)
90. 56.23ms (29.84G pixels/s)
91. 56.23ms (29.83G pixels/s)
92. 56.23ms (29.84G pixels/s)
93. 56.25ms (29.82G pixels/s)
94. 56.24ms (29.83G pixels/s)
95. 56.23ms (29.83G pixels/s)
96. 56.24ms (29.83G pixels/s)
97. 56.39ms (29.75G pixels/s)
98. 56.47ms (29.71G pixels/s)
99. 56.23ms (29.84G pixels/s)
100. 56.23ms (29.84G pixels/s)
101. 56.24ms (29.83G pixels/s)
102. 56.23ms (29.84G pixels/s)
103. 56.23ms (29.84G pixels/s)
104. 56.23ms (29.84G pixels/s)
105. 56.26ms (29.82G pixels/s)
106. 56.23ms (29.84G pixels/s)
107. 56.24ms (29.83G pixels/s)
108. 56.24ms (29.83G pixels/s)
109. 56.24ms (29.83G pixels/s)
110. 56.23ms (29.83G pixels/s)
111. 56.23ms (29.84G pixels/s)
112. 56.23ms (29.84G pixels/s)
113. 56.23ms (29.84G pixels/s)
114. 56.23ms (29.84G pixels/s)
115. 56.23ms (29.84G pixels/s)
116. 56.23ms (29.84G pixels/s)
117. 56.23ms (29.84G pixels/s)
118. 56.23ms (29.84G pixels/s)
119. 56.23ms (29.84G pixels/s)
120. 56.23ms (29.84G pixels/s)
121. 56.23ms (29.84G pixels/s)
122. 56.23ms (29.84G pixels/s)
123. 56.23ms (29.84G pixels/s)
124. 56.24ms (29.83G pixels/s)
125. 56.23ms (29.84G pixels/s)
126. 56.23ms (29.83G pixels/s)
127. 56.23ms (29.84G pixels/s)
128. 56.23ms (29.84G pixels/s)
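
One rough way to read the log above is to compare the combined time against the two standalone times. The following is a minimal Python sketch under stated assumptions: the three inputs are typical values picked by eye from the log (54.15 ms compute only, 56.29 ms graphics only, 56.24 ms graphics + compute), and the overlap_fraction helper is a hypothetical name, not part of the test tool that produced the log.

```python
def overlap_fraction(compute_ms: float, graphics_ms: float, combined_ms: float) -> float:
    """Estimate how much of the shorter workload was hidden by concurrency.

    1.0 means the combined run cost no more than the longer workload alone;
    0.0 means the combined run cost as much as running both back to back.
    """
    serial_ms = compute_ms + graphics_ms        # cost with no overlap at all
    shortest_ms = min(compute_ms, graphics_ms)  # the most that can be hidden
    hidden_ms = serial_ms - combined_ms         # time saved by overlapping
    return max(0.0, min(1.0, hidden_ms / shortest_ms))


# Typical values read from the log (assumption: representative, not averaged).
compute_ms = 54.15
graphics_ms = 56.29
combined_ms = 56.24

serial_ms = compute_ms + graphics_ms
fraction = overlap_fraction(compute_ms, graphics_ms, combined_ms)

print(f"back-to-back estimate: {serial_ms:.2f} ms")
print(f"measured combined:     {combined_ms:.2f} ms")
print(f"overlap fraction:      {fraction:.2f}")
```

With these numbers the combined run takes about as long as the graphics workload alone, rather than the roughly 110 ms a back-to-back execution would suggest, which is consistent with the compute queue running concurrently in this test.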
 
Back in March last year I raised:
More for the developers out there.
Any idea just how much work it would take for NVIDIA to move the most-used core aspects of GameWorks to DX12, specifically the functionality related to asynchronous compute/shaders?
I just wonder if there is more than one possible headache NVIDIA is experiencing with regard to the async compute debate, and I can imagine that, as far as they are concerned, GameWorks must work going forward (at least several aspects of it anyway).
Not saying GameWorks is good or bad here (it does seem cumbersome, to say the least); just wondering if this is also part of the logistics behind NVIDIA being quiet to date on this subject, given that Kollock of Oxide suggests support for async compute does exist in a driver they have, although it is currently disabled (and again, how that support is implemented is open to interpretation).
Mahigan seems to be making assumptions for NVIDIA, so I prefer not to rely on everything he mentions.


Thanks
It finally looks like Nvidia answered this at the GDC event.
Additionally, it will be interesting to see how this pans out in terms of multithreading support and performance for CPU-oriented features such as PhysX (not the only GameWorks feature able to use the CPU side), which scales well up to 3 threads.
http://physxinfo.com/news/11327/multithreaded-performance-scaling-in-physx-sdk/

Yeah, I appreciate GameWorks gets love or hate depending on who you talk to, as does its influence on gaming, but fingers crossed this actually helps to improve its "bolt-on" related performance and makes it much less cumbersome.
Anyway, the liquid and flame/smoke particle simulation demos have come a long way.
Cheers
 
It seems they found good grounds for using async compute on their GPUs. GPU PhysX apparently represents a poor case for GPU utilization in general, and NV intends to fix that through async, given that async is almost useless to them under normal rendering due to their very high utilization rate.

but fingers crossed this actually helps to improve its "bolt-on" related performance and makes it much less cumbersome.
I hope to see The Division and Tomb Raider implementing VXAO and HFTS under DX12. This will bring the DX12 path to parity with the DX11 path in the visual aspect.
 
Unfortunately, it's going to be quite a few years before pre-Pascal generations are replaced by consumers and they start seeing gains from async. People will be hanging on to their Maxwell cards for a long time since they still perform well under DX11 and simply don't see gains under DX12.
 
It seems they found good grounds for using async compute on their GPUs. GPU PhysX apparently represents a poor case for GPU utilization in general, and NV intends to fix that through async, given that async is almost useless to them under normal rendering due to their very high utilization rate.


I hope to see The Division and Tomb Raider implementing VXAO and HFTS under DX12. This will bring the DX12 path to parity with the DX11 path in the visual aspect.
PhysX would run better for Nvidia on their GPUs, but developers prefer to keep it more neutral and use the CPU option, as in, I think, The Witcher 3.

But more generally, beyond my post: yeah, Nvidia is suggesting you are going to see improvements in Tomb Raider and other games with the next 'Game Ready Driver Optimized for DX12':
Game Ready Driver Optimized for DX12
NVIDIA also revealed an upcoming Game Ready Driver optimized for DirectX 12 games. The company refined code in the driver and worked side by side with game developers to deliver performance increases of up to 16 percent on average across a variety of DirectX 12 games, such as Ashes of the Singularity, Gears of War 4, Hitman, Rise of the Tomb Raider and Tom Clancy's The Division.(1)
http://nvidianews.nvidia.com/news/nvidia-announces-gameworks-dx12
Link also covers aspects of GameWorks.

Cheers
 