DX12 Performance Discussion And Analysis Thread

Well that's not out yet. :p

Besides, there's no GPU today that could drive it. You need DP 1.3 for 4K @ 120 Hz. Perhaps some MST scheme over DP 1.2? But even then that's less than ideal.
 
http://www.techradar.com/reviews/pc...s/dell-up3017q-oled-4k-monitor-1311504/review

https://pcmonitors.info/dell/dell-up3017q-4k-uhd-oled-monitor/

4K
30"
120 Hz
OLED (perfect viewing angles)
400,000:1 contrast
0.1ms latency
10 bit colors
100% Adobe RGB, 97.8% DCI-P3
What about 100% Rec2020 12-bit? Anyway, I still haven't been able to find a *OLED panel completely without glare issues in high-contrast scenes, even on phones. *OLED technologies are still immature imo; remember that TFT LCDs (LED, IPS and IPS-like included) took over two decades to become decent...
 
http://www.techradar.com/reviews/pc...s/dell-up3017q-oled-4k-monitor-1311504/review

https://pcmonitors.info/dell/dell-up3017q-4k-uhd-oled-monitor/

4K
30"
120 Hz
OLED (perfect viewing angles)
400,000:1 contrast
0.1ms latency
10 bit colors
100% Adobe RGB, 97.8% DCI-P3

Heh, yeah, I reported on that back in the OLED thread somewhere. I want it, but it's just a teensy bit out of my price range. For OLED I'd be willing to go down to a small 30" monitor again. Actually, I still use a 30" 16:10 monitor in portrait as a side monitor now.

Regards,
SB
 
Witcher 2 is noticeably smoother but other games (Bioshock Infinite, Max Payne 3) are sluggish in comparison. Hard to explain - it feels like the game is updating player position and movement slower than the reported fps.
Yeah, not all games handle higher refresh rates very well. Sometimes they simply can't hit the frame rates. Other times they can, but since they simulate on a fixed timestep it doesn't help much beyond making camera movement slightly smoother. Still others simulate on a fixed timestep but interpolate to get smoother animation at higher frame rates. Suffice it to say, it varies a lot!
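
For what it's worth, here is a minimal sketch of the fixed-timestep-plus-interpolation pattern described above. It's not from any particular engine; the 60 Hz step, the State struct and Render() are just placeholders to show the idea:

Code:
#include <chrono>

// Hypothetical game state; just enough to show the pattern.
struct State { float x = 0.0f, vx = 120.0f; };

// Render a state blended between the previous and current simulation steps.
void Render(const State& prev, const State& curr, float alpha) {
    float x = prev.x + (curr.x - prev.x) * alpha;  // interpolated position
    // ... submit draw calls using x ...
    (void)x;
}

int main() {
    using clock = std::chrono::steady_clock;
    const double dt = 1.0 / 60.0;   // fixed simulation step (60 Hz)
    double accumulator = 0.0;
    State prev, curr;
    auto last = clock::now();

    for (int frame = 0; frame < 1000; ++frame) {  // stand-in for the real game loop
        auto now = clock::now();
        accumulator += std::chrono::duration<double>(now - last).count();
        last = now;

        // The simulation always advances in fixed 1/60 s steps,
        // no matter how fast the display refreshes.
        while (accumulator >= dt) {
            prev = curr;
            curr.x += curr.vx * static_cast<float>(dt);
            accumulator -= dt;
        }

        // At 120+ Hz the renderer blends between the last two sim states,
        // so movement still looks smooth even though the sim runs at 60 Hz.
        // Without this step you only get the "slightly smoother camera" case.
        Render(prev, curr, static_cast<float>(accumulator / dt));
    }
}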
 
HOCP did an interesting run with the AotS benchmark. They used a top-of-the-line i7-5960X (8 cores, 16 threads), first handicapped it to 1.2 GHz, then fully overclocked it from 3.5 GHz to 4.5 GHz, to see what happens:

When the CPU is handicapped:
The classic picture returns, one we have seen several times before: NVIDIA rules when the CPU is the bottleneck in DX11, and even in DX12.
AMD suffers massively when the CPU is the bottleneck in DX11, but comes back in full force in DX12.

When the CPU is unlocked:
NVIDIA practically evens out, delivering a slight increase in performance.
AMD is a little faster, but nothing to write home about.

Thoughts?
[Attached: HOCP AotS CPU scaling benchmark charts]



http://www.hardocp.com/article/2016...l_cpu_scaling_gaming_framerate/5#.Vxqg6JPc6Ds

[Attached: HOCP AotS CPU scaling benchmark chart]
 
Nothing new really. Nvidia isn't API/CPU bottlenecked and doesn't gain anything from DX12 with a decent CPU. That's exactly what we are seeing in most DX12 and DX11 benchmarks.
 
I can tell you that you don't want to run the game on a 6970. :) I beat the final campaign mission on an i5 4570 + 6970 2GB @ 1080p and it stuttered badly once the fighting got thick. I had it on the Low preset, I believe. It may have been partially the 2GB VRAM; the game will happily fill even 4GB of VRAM from what I've seen.

On the other hand, I saw in the Steam forum that the developers are looking into a stuttering issue caused by the way they do terrain deformation.
http://steamcommunity.com/app/228880/discussions/1/392184289304101683/#p2


I usually play on an i5 3570 @ 4.3 GHz with a GTX 970, and that runs the game great at 2560x1440 on the Extreme preset. DX12 mode seems solid.
 
HOCP did an interesting run with the AotS benchmark. They used a top-of-the-line i7-5960X (8 cores, 16 threads), first handicapped it to 1.2 GHz, then fully overclocked it from 3.5 GHz to 4.5 GHz, to see what happens:

When the CPU is handicapped:
The classic picture returns, one we have seen several times before: NVIDIA rules when the CPU is the bottleneck in DX11, and even in DX12.
AMD suffers massively when the CPU is the bottleneck in DX11, but comes back in full force in DX12.

When the CPU is unlocked:
NVIDIA practically evens out, delivering a slight increase in performance.
AMD is a little faster, but nothing to write home about.

Thoughts?
This is also seen with the PS2 emulator, where the CPU is always the bottleneck. Last time I checked, PCSX2 performs somewhat worse with AMD GPUs no matter how low you drop the resolution, and that is most certainly a CPU bottleneck: ridiculously high resolutions don't impact framerate on modern AMD GPUs, and even old NVIDIA/AMD cards from around 2010 could go from 1650x1080 up to 3000x2000 without seeing much of a drop in framerate in PCSX2.
 
HOCP did an interesting run with the AotS benchmark. They used a top-of-the-line i7-5960X (8 cores, 16 threads), first handicapped it to 1.2 GHz, then fully overclocked it from 3.5 GHz to 4.5 GHz, to see what happens:

When the CPU is handicapped:
The classic picture returns, one we have seen several times before: NVIDIA rules when the CPU is the bottleneck in DX11, and even in DX12.
AMD suffers massively when the CPU is the bottleneck in DX11, but comes back in full force in DX12.

When the CPU is unlocked:
NVIDIA practically evens out, delivering a slight increase in performance.
AMD is a little faster, but nothing to write home about.

Thoughts?
[Attached: HOCP AotS CPU scaling benchmark charts]

Actually, what is more interesting:

In a CPU-limited scenario, the Fury X basically manages to equal the GTX 980 Ti: 38.3 FPS versus 39.0 FPS, an ~1.8% lead for the Fury X, which is virtually a tie.

However, when there is more CPU performance to be had, the Fury X pulls ahead: 40.3 FPS versus 43.6 FPS, an ~8.2% lead for the Fury X.

So basically, the faster the CPU you have, the farther ahead the Fury X is going to pull. At least in Ashes of the Singularity.
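
Just to make the arithmetic explicit (a trivial sketch; the FPS figures are the ones quoted above):

Code:
#include <cstdio>

// Relative lead of a over b, in percent.
static double LeadPercent(double a, double b) { return (a / b - 1.0) * 100.0; }

int main() {
    // CPU handicapped (1.2 GHz): 39.0 vs 38.3 FPS -> ~1.8% lead for the Fury X
    std::printf("CPU-limited lead: %.1f%%\n", LeadPercent(39.0, 38.3));
    // CPU overclocked (4.5 GHz): 43.6 vs 40.3 FPS -> ~8.2% lead for the Fury X
    std::printf("Unlocked lead:    %.1f%%\n", LeadPercent(43.6, 40.3));
}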

But the biggest thing is that, unlike with DX11, which AMD never seemed to get a good grasp on optimizing for, with DX12 AMD is far less reliant on having a good CPU in order to get good performance.

Again, however, this is just one isolated benchmark. It'll be interesting to see whether this heralds a general trend for AMD or whether it's just an outlier.

Regards,
SB
 
Ran a couple of benchmarks with AotS on my notebook. The 860M (GM107) appears to prefer DirectX 11: not only does it run slightly better, there is also occasional graphics corruption with D3D12.

NV driver 364.72

Code:
== Hardware Configuration ================================================
GPU 0:     NVIDIA GeForce GTX 860M 
CPU:     GenuineIntel
     Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
Physical Cores:       4
Logical Cores:       8
Physical Memory:      8112 MB
Allocatable Memory:     134217727 MB
==========================================================================
Quality Preset:           Standard
==========================================================================

Resolution:         1920x1080
Fullscreen:         True
Bloom Quality:         High
PointLight Quality:       High
Glare Quality:         Low
Shading Samples:        4 million
Terrain Shading Samples:      8 million
Shadow Quality:         Low
Temporal AA Duration:       0
Temporal AA Time Slice:       0
Multisample Anti-Aliasing:     2x
Texture Rank :         1

Code:
API:             DirectX

== Total Avg Results =================================================
Total Time:            60.008465 ms per frame
Avg Framerate:            21.714109 FPS (46.053005 ms)
Weighted Framerate:          21.282160 FPS (46.987713 ms)
Average Batches per frame:        13357.952148 Batches
==========================================================================
Code:
API:             DirectX 12

== Total Avg Results =================================================
Total Time:            60.001305 ms per frame
Avg Framerate:            20.881351 FPS (47.889622 ms)
Weighted Framerate:          20.032330 FPS (49.919308 ms)
CPU frame rate (estimated if not GPU bound):    56.753651 FPS (17.620012 ms)
Percent GPU Bound:          98.123894 %
Driver throughput (Batches per ms):      3520.489746 Batches
Average Batches per frame:        13116.093750 Batches
==========================================================================
the "code" bbcode doesn't seem to be doing fixed width properly for some reason.
 
What is interesting is comparing Guru3D, which uses the internal preset benchmark, with pcgameshardware, which actually benchmarks the game in a heavy draw-call zone.
Notice that NVIDIA actually seems to perform better in-game than in the preset benchmark; this was also reported by a couple of reviewers when they analysed Dragon Age Origins and compared the built-in benchmark to one of the busiest zones.

Furthermore, it seems pcgameshardware noticed a couple of anomalies in the graphics quality comparison between DX11 and DX12, and also in memory protection behaviour.
http://www.pcgameshardware.de/Hitma...Episode-2-Test-Benchmarks-DirectX-12-1193618/

The NVIDIA 980 Ti in their in-game monitoring performs very well in both DX11 and DX12 up to 1080p; there are no gains at higher resolutions with DX12, but it still competes well against AMD.
They do something similar to PCPerspective, using a mix of Intel's PresentMon and also looking at dropped-frame/frametime behaviour.
Cheers
 
The way I see it, Pascal's dynamic load balancing is functionally equivalent to GCN's async shaders. At least I haven't seen anything to indicate otherwise.
Dynamic load balancing is a thing, yes, and it is a hardware feature. But it's nowhere near the same as, or even remotely comparable to, GCN's async execution via the independent command lists dispatched by the ACE units.

Dynamic load balancing is only for efficiently switching between compute and graphics workloads inside a single command list, or rather for eliminating the need for a full command buffer flush every time the partition scheme changes.

So you can essentially now:
  1. Upload the next compute-only command list while the previous mixed command list is still executing, as the SMMs may now switch modes lazily after they have finished the graphics portion.
  2. Vice versa when switching back to graphics.
  3. The penalty for a driver screwup when you mix compute and graphics inside a single command list is also eliminated.

Technically, that means there is no longer a scheduling problem just from having compute portions in there, and thereby you avoid stalling the command processor.

What it doesn't provide yet is the resource sharing or the truly asynchronous scheduling that AMD's hardware features. So using asynchronous queues for compute is now only (almost...) "for free", but it's still not gaining you anything.

And without triggering actual, explicit preemption, you are not gaining truly asynchronous, independent execution yet either. You are still subject to all side effects resulting from cooperative scheduling.
Nvidia didn't talk about async, only preemption.
But they are unfortunately still referring to their preemption extension for DX11 as "Async Compute" too. On purpose.
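
For reference, here is roughly what the application side of "async compute" submission looks like in D3D12: a minimal sketch that just creates a direct (graphics) queue and a separate compute queue on an already-created device. Whether work on the two queues actually overlaps on the GPU is entirely up to the hardware and driver, which is exactly the distinction being argued above:

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch only: create one direct (graphics) queue and one compute queue.
// The API lets the application express independent streams of work;
// concurrent execution is not guaranteed by D3D12 itself.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&graphicsQueue));

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // Recorded command lists are then submitted independently, e.g.:
    //   graphicsQueue->ExecuteCommandLists(1, &gfxLists);
    //   computeQueue->ExecuteCommandLists(1, &computeLists);
    // Cross-queue dependencies are expressed with fences (ID3D12Fence,
    // Signal/Wait), not by ordering inside a single command list.
}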
 
Dynamic load balancing is only for efficiently switching between compute and graphics workloads inside a single command list...

Is that an assumption on your part or is there evidence to support that?

In the Pascal reviewers guide nVidia explicitly mentions overlapping PhysX kernels with graphics tasks. Are you saying that those DirectX and CUDA tasks are submitted to the GPU in the same command list?

Technically, nVidia can claim compliance with DX12 async compute for marketing purposes as there is no "requirement" that independent tasks run concurrently on the hardware.

So much speculation, so little data!
 
Is that an assumption on your part or is there evidence to support that?

In the Pascal reviewers guide nVidia explicitly mentions overlapping PhysX kernels with graphics tasks. Are you saying that those DirectX and CUDA tasks are submitted to the GPU in the same command list?
Sorry, oversimplification on my side. I only referred to draw and dispatch calls going via the graphics command processor (which also handles the "compute" queues in DX12), and forgot about the independent command processor handling CUDA.

Yes, the same benefits also apply to grids dispatched from the independent HW queues used for CUDA, which makes perfect sense if the reallocation happens independently of either command queue.

But the point is: it looks as if Nvidia also managed to get rid of the stall on the GPC altogether, which previously caused a lot of problems. Several Maxwell performance guidelines, such as "avoid mixing compute and graphics" or "don't toggle between compute and graphics queues", are now void.

The old "CUDA has access to command queues which should be exposed as compute queues in DX12 rather than doing everything on the GPC" complaint appears to remain valid though. I've not seen any indicator that they've fixed this yet.
 
What does it mean when people say Nvidia has software async compute? A software scheduler, or more driver-level management of async compute compared to AMD? Is it still true for Pascal, or has it never been true?
 