DX12 Performance Discussion And Analysis Thread

Isn't that for the "4GB" Nvidia card where only 3.5GB is "fast" memory and the rest is "slow" memory?
It likely goes deeper. This advice is just saying you can't necessarily allocate your entire VRAM as one big chunk of contiguous virtual address space and keep it "resident". That's a really bad idea regardless of whether it's possible, so it's good advice all around. Work with several smaller chunks of VRAM and make stuff resident/non-resident as necessary. A single big chunk is too inflexible and inefficient to move around.
 
Isn't that for the "4GB" Nvidia card where only 3.5GB is "fast" memory and the rest is "slow" memory?
Maybe a virtual address limitation on what can be addressed directly?
The one architecture where I've seen this mentioned explicitly is Haswell's, but maybe some of Nvidia's architectures have a similar address ceiling?
 
Does GCN have fine grain preemption or fine grain sharing? I thought it was the latter.


GCN has both fine grain preemption and sharing, but I think the sharing can only occur over the L1 cache (in realtime apps) with a CU adjacent to the original CU; when using the L2 cache it's global to all CUs, but there is a performance penalty.
 
Isn't that for the "4GB" Nvidia card where only 3.5GB is "fast" memory and the rest is "slow" memory?

Yup, which goes along with the GTX 660 Ti and its 1.5GB at 192-bit plus 512MB at 64-bit.

I don't think there are modern lower-end parts using TurboCache, but I imagine that would also be a big problem.

You still cannot reserve the full VRAM pool anyway... As I remember, we cannot ask to reserve the full VRAM pool, and there is an "extra budget" VRAM pool that is eventually subject to WDDM trim notifications (with 5 seconds to free it if required).

It likely goes deeper. This advice is just saying you can't necessarily allocate your entire VRAM as one big chunk of contiguous virtual address space and keep it "resident".
That's a really bad idea regardless of whether it's possible, so it's good advice all around. Work with several smaller chunks of VRAM and make stuff resident/non-resident as necessary. A single big chunk is too inflexible and inefficient to move around.
Oh, I hadn't noticed the MSDN remarks for MakeResident():

Applications must handle MakeResident failures, even if there appears to be enough residency budget available. Physical memory fragmentation and adapter architecture quirks can preclude the utilization of large contiguous ranges.

So the question now is: can we retrieve the "preferred" chunk size of an adapter? Because I don't like "try-catch" strategies at all.. ù.ù
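
In the meantime, a rough sketch of the "several smaller chunks" approach (my own illustration, not from the docs; 'device' and 'adapter3' are assumed to exist already and the chunk size is just an example): query the budget with IDXGIAdapter3::QueryVideoMemoryInfo and carve VRAM into modest heaps instead of one big allocation.

#include <d3d12.h>
#include <dxgi1_4.h>
#include <vector>

// Sketch only: allocate VRAM as several modest heaps, staying inside the
// OS-reported budget, instead of one big chunk kept permanently resident.
std::vector<ID3D12Heap*> AllocateVramInChunks(ID3D12Device* device,
                                              IDXGIAdapter3* adapter3,
                                              UINT64 totalBytes,
                                              UINT64 chunkBytes /* e.g. 64 MiB */)
{
    std::vector<ID3D12Heap*> heaps;

    DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
    adapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info);

    D3D12_HEAP_DESC desc = {};
    desc.SizeInBytes     = chunkBytes;
    desc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    desc.Flags           = D3D12_HEAP_FLAG_ALLOW_ONLY_BUFFERS;

    for (UINT64 allocated = 0; allocated < totalBytes; allocated += chunkBytes)
    {
        // Stop before exceeding the budget; note MakeResident() can still fail
        // even below the budget (fragmentation etc.), per the MSDN remark above.
        if (info.CurrentUsage + chunkBytes > info.Budget)
            break;

        ID3D12Heap* heap = nullptr;
        if (FAILED(device->CreateHeap(&desc, __uuidof(ID3D12Heap), (void**)&heap)))
            break;

        // Heaps start out resident; a real allocator would Evict()/MakeResident()
        // individual chunks as the working set changes, and treat a MakeResident()
        // failure as a cue to evict something else first.
        heaps.push_back(heap);

        adapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info);
    }
    return heaps;
}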
 
  • Make use of NvAPI (when available) to access other Maxwell features
    • Advanced Rasterization features
      • Bounding box rasterization mode for quad based geometry
      • New MSAA features like post depth coverage mask and overriding the coverage mask for routing of data to sub-samples
      • Programmable MSAA sample locations
    • Fast Geometry Shader features
      • Render to cube maps in one geometry pass without geometry amplifications
      • Render to multiple viewports without geometry amplifications
      • Use the fast pass-through geometry shader for techniques that need per-triangle data in the pixel shader
    • New interlocked operations
    • Enhanced blending ops
    • New texture filtering ops
Are those things available in the non-NDA version of NVAPI?
Why doesn't AMD expose custom MSAA sample locations on D3D? :(

More about MSAA custom sample points: https://mynameismjp.wordpress.com/2015/09/13/programmable-sample-points/

I want this in the next update of D3D!
 
It's occurred to me that NVAPI stuff has been wired into UE4 for the Fable Legends test, which is why some passes are dramatically faster on NVidia.
 
It's occurred to me that NVAPI stuff has been wired into UE4 for the Fable Legends test, which is why some passes are dramatically faster on NVidia.

Well, it could be interesting to do a study on this... I'm curious to see how much the shaders are replaced by the Nvidia drivers in correlation with NvAPI.
 
  • Minimize the use of barriers and fences
    • Any barrier or fence can limit parallelism
  • Group barriers in one call to ID3D12CommandList::ResourceBarrier
    • This way the worst case can be picked instead of sequentially going through all barriers
  • Don’t sequentially call ID3D12CommandList::ResourceBarrier with just one barrier
    • This doesn’t allow the driver to pick the worst case of a set of barriers
  • Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call.
  • Don’t create too many threads or too many command lists
    • Too many threads will oversubscribe your CPU resources, whilst too many command lists may accumulate too much overhead
Looks like Nvidia's driver is really a picky eater.

Now combine that with Nvidia's older advice: "No shader should run longer than 1ms", and it can be summed up as: "Don't even try to offload any scheduling or to create backpressure on Nvidia hardware. It won't work. The driver will break down."
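
For reference, the "group barriers in one call" item from the list above boils down to something like this (resource names and states are made up purely for illustration):

#include <d3d12.h>

// Batch the transitions into a single ResourceBarrier() call so the driver sees
// the whole set at once, instead of issuing one call per barrier.
void TransitionForReading(ID3D12GraphicsCommandList* cl,
                          ID3D12Resource* colorA,
                          ID3D12Resource* colorB,
                          ID3D12Resource* depth)
{
    D3D12_RESOURCE_BARRIER barriers[3] = {};

    barriers[0].Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barriers[0].Transition.pResource   = colorA;
    barriers[0].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barriers[0].Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barriers[0].Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    barriers[1] = barriers[0];
    barriers[1].Transition.pResource   = colorB;

    barriers[2] = barriers[0];
    barriers[2].Transition.pResource   = depth;
    barriers[2].Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;

    // One call with three barriers, not three calls with one barrier each.
    cl->ResourceBarrier(3, barriers);
}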
 
Personally, I'm more worried about this one:

  • Don’t create too many threads or too many command lists
    • Too many threads will oversubscribe your CPU resources, whilst too many command lists may accumulate too much overhead
 
Looks like Nvidia's driver is really a picky eater.

Now combine that with Nvidia's older advice: "No shader should run longer than 1ms", and it can be summed up as: "Don't even try to offload any scheduling or to create backpressure on Nvidia hardware. It won't work. The driver will break down."

Which of those recommendations stands out as an Nvidia glass jaw?
Recommending that fences and barriers be used only as much as they are needed is generally useful. A barrier by definition is limiting parallelism, and too many means limiting it unnecessarily.
Is it the statement that the driver will not try to infer whether a barrier is the true limit of a series of barriers if the programmer hides the necessary context?
Is it the recommendation that you don't use the API's freedom to arbitrarily choke the CPU?


Personally, I'm more worried about this one:

  • Don’t create too many threads or too many command lists
    • Too many threads will oversubscribe your CPU resources, whilst too many command lists may accumulate too much overhead

On AMD you certainly want the inverse: divide, divide, divide. Or maybe I misunderstand it.

That's a pretty generic recommendation. Is there a scenario where having too many CPU threads generating too many command lists is somehow going to make an AMD system any happier?
 
Which of those recommendations stands out as an Nvidia glass jaw?
Recommending that fences and barriers be used only as much as they are needed is generally useful. A barrier by definition is limiting parallelism, and too many means limiting it unnecessarily.
Is it the statement that the driver will not try to infer whether a barrier is the true limit of a series of barriers if the programmer hides the necessary context?
Is it the recommendation that you don't use the API's freedom to arbitrarily choke the CPU?




That's a pretty generic recommendation. Is there a scenario where having too many CPU threads generating too many command lists is somehow going to make an AMD system any happier?

If we look at the OpenCL kernel compiler, I would say yes... but my knowledge is not at your level. And looking at GCN, I can imagine the architecture is made for it (pure hypothesis), or at least can absorb it more easily...

I remember a whitepaper at Siggraph about it (thanks to console graphics engine developers, who do an excellent job of exposing their findings about GCN)... I'm not sure I can find it; I should have it somewhere.
 
That's a pretty generic recommendation. Is there a scenario where having too many CPU threads generating too many command lists is somehow going to make an AMD system any happier?
Yes. Total of 64 queues in hardware. Each single one happy to accept a decent backlog. No re-shuffling done in software. Too many threads will eventually even increase GPU utilization on AMD further, even with tiny command lists and several dozen barriers.

It's much, MUCH harder to get CPU limited with AMD's hardware, as the driver does significantly less work.
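
In D3D12 terms nothing stops you from spinning up a bunch of compute queues and feeding them from separate threads; whether they map onto distinct hardware queues is up to the driver and hardware. A quick sketch ('device' assumed to exist, queue count illustrative):

#include <d3d12.h>

// Sketch: create several compute queues that worker threads can submit to
// independently.
ID3D12CommandQueue* g_computeQueues[8] = {};

void CreateComputeQueues(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    for (int i = 0; i < 8; ++i)
        device->CreateCommandQueue(&desc, __uuidof(ID3D12CommandQueue),
                                   (void**)&g_computeQueues[i]);
}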
 
Yes. Total of 64 queues in hardware. Each single one happy to accept a decent backlog. No re-shuffling done in software. Too many threads will eventually even increase GPU utilization on AMD further, even with tiny command lists and several dozen barriers.

It's much, MUCH harder to get CPU limited with AMD's hardware, as the driver does significantly less work.

According to this comparison:
http://www.anandtech.com/show/9659/fable-legends-directx-12-benchmark-analysis/3

It is somewhere on the order of >10% more CPU-limited on an i3.
It is ~15% more CPU-limited on an i5.
It is ~0% more CPU limited on an i7, with AMD regressing with a higher thread count from the i5.

Did the i7 turn off some of those 64 queues, or could it be that asking the driver/CPU/OS to juggle more threads and list-generation events is not a sure path to boundless performance growth?
 
PS: Remember the "benchmark" we had earlier in this thread? The one where Nvidia essentially failed the "sequential mode"?

Turns out it wasn't "sequential" by definition. It was just a single command list for every single compute command, as opposed to all compute commands in a single command list in "regular" mode.

Zero parallelism between command lists on Nvidia; only parallelism inside a single command list worked, up to a level of 32 concurrent commands.

On the contrary, GCN 1.2 even required that type of workload to fully utilize the HWS unit, and pushed from 64 concurrent commands from a single command list to 128 concurrent commands from multiple lists. And that was still only 1/4th of GCN 1.2's power used.
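
Roughly, the two submission patterns being compared look like this (sketch only; 'queue', 'lists', 'allocators', 'pso' and 'numKernels' are placeholders, and root signature / UAV binding is omitted):

#include <d3d12.h>

// "Sequential" mode from that test: one command list per compute command.
void SubmitSequentialMode(ID3D12CommandQueue* queue,
                          ID3D12GraphicsCommandList** lists,
                          ID3D12CommandAllocator** allocators,
                          ID3D12PipelineState* pso,
                          UINT numKernels)
{
    for (UINT i = 0; i < numKernels; ++i)
    {
        allocators[i]->Reset();
        lists[i]->Reset(allocators[i], pso);
        lists[i]->Dispatch(1, 1, 1);
        lists[i]->Close();
    }
    queue->ExecuteCommandLists(numKernels,
                               reinterpret_cast<ID3D12CommandList* const*>(lists));
}

// "Regular" mode: all compute commands recorded into a single command list.
void SubmitRegularMode(ID3D12CommandQueue* queue,
                       ID3D12GraphicsCommandList* list,
                       ID3D12CommandAllocator* allocator,
                       ID3D12PipelineState* pso,
                       UINT numKernels)
{
    allocator->Reset();
    list->Reset(allocator, pso);
    for (UINT i = 0; i < numKernels; ++i)
        list->Dispatch(1, 1, 1);
    list->Close();

    ID3D12CommandList* raw = list;
    queue->ExecuteCommandLists(1, &raw);
}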

Didn't someone here post something like a limit of GCN threads at around 2688? (or something like that, I don't have the number in mind, but way over 2000).. That leaves a good margin, no?
Upper limit of 640 active wavefronts. Or 40k threads. Lower limit 16k threads.

But thread-level concurrency isn't decisive alone in this case. GCN also offers decent concurrency on both a per-command and a per-command-list basis. Nvidia doesn't offer the last one at all, only the first two, and even the second one only to a rather limited degree.
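
(For reference, the 40k figure is just the 64-wide GCN wavefront: 640 wavefronts × 64 threads/wavefront = 40,960 ≈ 40k threads.)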
 
According to this comparison:
http://www.anandtech.com/show/9659/fable-legends-directx-12-benchmark-analysis/3

It is somewhere on the order of >10% more CPU-limited on an i3.
It is ~15% more CPU-limited on an i5.
It is ~0% more CPU limited on an i7, with AMD regressing with a higher thread count from the i5.

Did the i7 turn off some of those 64 queues, or could it be that asking the driver/CPU/OS to juggle more threads and list-generation events is not a sure path to boundless performance growth?
Single compute queue only, so only a single one in hardware as well. Also no backpressure on that compute queue, like, at all, according to GPUView screenshots. About 95% of the load has been handled by the graphics command processor, with async compute enabled, that is.

Penalties inside a single queue due to comparatively high latencies on low-speed CPUs can't be avoided with DX12 either. If command lists are separated by barriers which only the CPU can fulfill, it's going to be noticeable.

I know you guys don't trust the results, but take the results from Extremetech for comparison, namely the measurable performance gains a processor with an oversized L3 cache gave to both vendors at 720p and 1080p, compared to regular consumer CPUs. That's not actually so much a CPU limit in terms of peak performance, but pure latency, in this case noticeably reduced by fewer L3 cache misses despite lower clock speed.
 
Barriers (at the task invocation level) are a mechanism that requires careful usage. They don't simply enforce that a kernel invocation waits behind another kernel invocation (or copy), they prevent the successor from starting until the predecessor has completely finished. Sometimes they are necessary, because the first task writes to random parts of the target and the second cannot use that target as input if writes from the first task would get lost.

Any time you ask a GPU to "completely finish" you're going to waste a lot of shader/TEX/ROP/memory cycles. The invocation that's finishing will gradually go from using the whole GPU down to using none of the GPU as work runs out. The invocation that starts after the barrier will take time to fill the GPU with work, too.

So, if you can, you should avoid an algorithm that requires task-level barriers.

Alternatively, you create multiple, parallel, queues each of which has its own sequence of tasks separated by barriers. But, now you need an algorithm which can be chopped up into such small pieces as to be spread over multiple queues.

If you can chop up an algorithm like this, then you can also stripe the tasks in a single queue: A1, B1, C1, A2, B2, C2.
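
As a rough D3D12 sketch of that striping (illustration only: the three chains share one PSO here just to keep it short, and root signature / UAV binding is assumed to be done by the caller):

#include <d3d12.h>

// Record three independent two-pass chains (A, B, C) striped in one command list,
// with a single batched UAV barrier between the passes.
void RecordStripedChains(ID3D12GraphicsCommandList* cl,
                         ID3D12Resource* bufA,
                         ID3D12Resource* bufB,
                         ID3D12Resource* bufC,
                         UINT groups)
{
    // First passes of the three independent chains, no barriers between them,
    // so the GPU can keep filling up as each one drains.
    cl->Dispatch(groups, 1, 1);   // A1 -> writes bufA
    cl->Dispatch(groups, 1, 1);   // B1 -> writes bufB
    cl->Dispatch(groups, 1, 1);   // C1 -> writes bufC

    // One batched UAV barrier covering all three chains.
    D3D12_RESOURCE_BARRIER barriers[3] = {};
    ID3D12Resource* bufs[3] = { bufA, bufB, bufC };
    for (int i = 0; i < 3; ++i)
    {
        barriers[i].Type          = D3D12_RESOURCE_BARRIER_TYPE_UAV;
        barriers[i].UAV.pResource = bufs[i];
    }
    cl->ResourceBarrier(3, barriers);

    // Second passes, each depending only on its own chain's first pass.
    cl->Dispatch(groups, 1, 1);   // A2 -> reads bufA
    cl->Dispatch(groups, 1, 1);   // B2 -> reads bufB
    cl->Dispatch(groups, 1, 1);   // C2 -> reads bufC
}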

The upshot is that it takes time to examine these implementation alternatives, multiplied by the types of GPU you might encounter. And there may be algorithms that obviate the need to wait for an invocation to completely finish, but they have to be found.

Hopefully low-level D3D12 coders will accept that they can write auto-tuning task managers rather than drudge through the full nightmare of catering for all the quirks. And PC gamers are used to frobnicating their graphics options to get the performance/IQ balance they want, so finding algorithms that are relatively stable in their performance profile across IQ and GPU capabilities is prolly more important.
 
I know you guys don't trust the results, but take the results from Extremetech for comparison, namely the measurable performance gains a processor with an oversized L3 cache gave to both vendors at 720p and 1080p, compared to regular consumer CPUs. That's not actually so much a CPU limit in terms of peak performance, but pure latency, in this case noticeably reduced by fewer L3 cache misses despite lower clock speed.

If you want to use results on an i7 5960, there is also context here:
http://techreport.com/review/29090/fable-legends-directx-12-performance-revealed/4

This highlights a trend noted with Anandtech's numbers, where AMD stops scaling at four threads and starts to regress. Nvidia's implementation appears to have a higher fixed overhead, but it is able to scale to higher thread counts before it starts to level off and slightly regress.
Anandtech's i5 numbers are numerically higher than the 5960 numbers at Extremetech, so we're looking at differing forms of CPU dependence on an immature platform.

Because of the unknowns, I am not willing to use some pretty vanilla recommendations for DX12 development to indict implementations.
Another premature bit of speculation is that I can think of one kind of CPU activity that can thrash L1 and L2 caches, but can fall back to a large L3.
 
Upper limit of 640 active wavefronts. Or 40k threads. Lower limit 16k threads.

But thread-level concurrency isn't decisive alone in this case. GCN also offers decent concurrency on both a per-command and a per-command-list basis. Nvidia doesn't offer the last one at all, only the first two, and even the second one only to a rather limited degree.

I didn't have the numbers in mind... GCN likes high pressure on the threading side.

At 3Ddillettante.. this benchmark uses UE4 (or 4.1) and seems to make heavy use of NvAPI (basically it is coded for it).. not for GCN.

There's absolutely no logical reason for the scaling over CPU threads...
 