Driver optimizations and the API

Two separate but related questions.

1. Besides custom versions of shaders, are there any other optimizations that driver writers employ to improve the performance of client programs? If you can't share specifics, a simple yes or no would help.

2. Let's say an API were to expose the capability to set up a bunch of outlines describing how to render a frame. You'd then select one outline at the beginning of a frame and step through it while rendering. Would that enable driver writers to perform optimizations, or perform optimizations they couldn't perform before? I ask since the outline would let you know what's going on ahead of time and potentially make sense of exactly what the client programmer is trying to accomplish.
 
The most obvious IMO is probably memory related, such as playing tricks with the location of textures and other buffers (in main memory or in local DRAM). The API already provides hints about where a buffer can or should reside, but I imagine that a driver writer who knows what a game does can be even smarter about it.
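For concreteness, a minimal D3D11 sketch of that kind of hint (the device is assumed to exist; the size and usage are just illustrative):

#include <d3d11.h>

// The CPU rewrites this buffer every frame, so D3D11_USAGE_DYNAMIC hints the
// driver to place it somewhere CPU-visible rather than deep in local VRAM.
HRESULT CreatePerFrameConstants(ID3D11Device* device, ID3D11Buffer** out)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = 256;                        // illustrative size
    desc.Usage          = D3D11_USAGE_DYNAMIC;        // CPU write, GPU read
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    return device->CreateBuffer(&desc, nullptr, out);
}

A driver that knows the game's actual update pattern could second-guess even these hints, which is the point above.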

Another one that's sometimes used in DSP land is playing tricks with caches: e.g. tweaking the cache line replacement policy based on the expected traffic pattern. That sounds like a good candidate for some textures as well.
 
What silent_guy said.

You wouldn't believe how much drivers compensate for popular games. IHVs need those games to run great on their hardware, as they'll be used for benchmarking and comparison, so they'll add a good amount of custom code to handle them...

(That also implies that there's room for improving the way they use the API... which is interesting in itself.)
 
(That also implies that there's room for improving the way they use the API... which is interesting in itself.)

Let's take a fresh example with no legacy and so on: The Crew's port on PS4.
They clearly state that GNMX, with its more advanced support, eases development but prevents optimal performance due to its obvious overhead compared to GNM.
And generic layers like DX11 take an even bigger hit, since they need to be generic enough to support multiple IHVs.
 
There are a lot of things you can be smart about when you know how resources are used by a given application. For example, your general (and reasonable) policy could be to flush commands to the GPU on render target change, but in a specific case (some game, application, benchmark, whatever) you'd put the command buffer aside because you know that the app flip-flops between two RTs heavily before it flushes them. Or it could be the other way around: you defer everything and batch as much as you can before you flush, except for some app where you know the usage would benefit from you flushing immediately on RT switch.
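Roughly the kind of per-app switch that implies inside a driver; all names below are hypothetical, not any real driver's internals:

// Hypothetical driver-side flush policy, selected per application profile.
struct CommandBuffer { /* recorded GPU commands */ };

void SubmitToGpu(CommandBuffer&) { /* hand the batch to the KMD / hardware queue */ }

enum class FlushPolicy { FlushOnRtChange, DeferAndBatch };

struct AppProfile { FlushPolicy onRtChange = FlushPolicy::FlushOnRtChange; };

void OnRenderTargetChange(CommandBuffer& cb, const AppProfile& app)
{
    // General (and reasonable) default: kick the work to the GPU early.
    if (app.onRtChange == FlushPolicy::FlushOnRtChange) {
        SubmitToGpu(cb);
        return;
    }
    // DeferAndBatch: this app is known to flip-flop between two RTs,
    // so keep recording and batch as much as possible before flushing.
}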

Two things you definitely want to minimize are how chatty the UMD and KMD are, and the latency between flushing a workload and receiving results. So, again, in the general case you may want to hold onto a command buffer until you've been asked to flush, but perhaps waiting for too long is, well, waiting for too long and you should partition the stuff. There's also a lot you can do WRT state changes, as most (custom) game engines are pretty liberal about them. Clearing an RT several times over before settling on the actual bg color is quite common. Resetting viewports and scissors, that kind of stuff. Basically the driver has to mask a lot of redundant calls from the application.
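A sketch of that redundancy masking, with hypothetical names (a real driver would do this per state block, not just for viewports):

#include <cstring>

struct Viewport { float x, y, w, h, zmin, zmax; };

struct StateCache {
    Viewport current{};
    bool valid = false;
};

// Swallow redundant viewport sets so they never reach the hardware.
bool SetViewportFiltered(StateCache& cache, const Viewport& vp)
{
    if (cache.valid && std::memcmp(&cache.current, &vp, sizeof vp) == 0)
        return false;                 // redundant call: masked, no GPU work
    cache.current = vp;
    cache.valid = true;
    // ... emit the actual hardware viewport command here ...
    return true;
}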

There are also a lot of obvious memory-related optimizations you can do which come from semantics (this was mentioned above). E.g. the memory layout of a surface could be something smarter than strided if you know that it won't be accessed from the CPU. The APIs (well, the DX/WDDM spec) have many of these obvious cases spelled out for the driver developer.
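For instance, a hypothetical driver-side decision along these lines (the actual rules live in the DX/WDDM spec):

enum class SurfaceLayout { Linear, Tiled };

// If the CPU is guaranteed never to map this surface, the driver is free to
// pick a GPU-friendly tiled/swizzled layout instead of a strided one.
SurfaceLayout ChooseLayout(bool cpuWillMap)
{
    if (!cpuWillMap)
        return SurfaceLayout::Tiled;  // GPU-only: better sampling locality
    return SurfaceLayout::Linear;     // CPU mapping needs a predictable stride
}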
 
You wouldn't believe how much drivers compensate for popular games. IHVs need those games to run great on their hardware, as they'll be used for benchmarking and comparison, so they'll add a good amount of custom code to handle them...

Do all driver optimizations rely on exe detection? Would an API as outlined above allow those optimizations to be propagated to all client apps with similar behavior patterns, thus reducing the load on driver teams and improving performance for non-AAA apps?
 
Oh, and to clarify the original idea: it's not necessarily about just knowing what is currently going on and what is going to happen next... it's also knowing what definitely won't be happening now, or, moreover, in the current phase of execution.
 
Do all driver optimizations rely on exe detection?
I would have to guess that all of the drivers can. I don't know if all of the drivers do. Having the ability to tweak per application is very helpful for us during development, but we definitely don't want to have tweaks in the production driver, as this is not reliable. So basically you structure the driver in a way that makes it perform well in all scenarios and not suck when developers do something weird. Which leads to the second answer.

Would an API as outlined above allow those optimizations to be propagated to all client apps with similar behavior patterns?
I think the idea outlined by you won't work for the same reason drivers have to overcome certain problems with application code. Basically, the more options you give developers to tell you what's right for them, the more chances they have to mess things up. There are many, many different pieces of HW out there with different quirks and dos/don'ts. If you can invest a lot of time and monies into development and testing, then sure. That's why Epic, Crytek and DICE are thrilled about Mantles, Metals and D3D12. What everyone else needs is a set of simple APIs that prevent one from shooting his or her foot, and a good set of tools to measure performance and hint at potential improvements. If you set the viewport to draw a straight line, or perform "for n in range(0...100) do select RT0; draw obj[n]; select RT1; draw obj[n]; end;", or do something else that's bonkers, then no set of hints and outlines is going to make your renderer more sane.
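Spelled out, here's that bonkers loop next to what the driver would rather see; SetRenderTarget/Draw/rt0/rt1/obj are stand-ins for whatever API you like:

// Stand-in declarations so the sketch is self-contained.
struct RenderTarget {};
struct Object {};
void SetRenderTarget(RenderTarget&) {}
void Draw(const Object&) {}

RenderTarget rt0, rt1;
Object obj[100];

// Bonkers: two RT switches per object -- 200 tiny workloads for 100 objects.
void PingPong() {
    for (int n = 0; n < 100; ++n) {
        SetRenderTarget(rt0); Draw(obj[n]);
        SetRenderTarget(rt1); Draw(obj[n]);
    }
}

// Sane: group draws by render target -- two switches total.
void Grouped() {
    SetRenderTarget(rt0);
    for (int n = 0; n < 100; ++n) Draw(obj[n]);
    SetRenderTarget(rt1);
    for (int n = 0; n < 100; ++n) Draw(obj[n]);
}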
 
Basically, the more options you give developers to tell you what's right for them, the more chances they have to mess things up.

But there's nothing to mess up: if the outline presented is incorrect, then their program won't work. This isn't a case of providing hints but of restructuring how you submit work to the GPU. As for extremely poorly written code, the same holds true of the current situation... i.e. nothing can be done for it.
 
Just an attempt by a layman, based on how driver optimizations are implemented, to propose a way for those optimizations to reach a greater number of client programs at less cost to IHVs, by detecting behavioral patterns exposed by the way the API is structured.

What complexity are you referring to? Honestly, I'm not really seeing what you're seeing.
 
Works both ways. ;)

I think what would help (me at least) is if you could provide an example of what an application would expose to the driver, how, and why. Perhaps then I'd be able to see how this would help the driver optimize something better.
 
Works both ways. ;)

I think what would help (me at least) is if you could provide an example of what an application would expose to the driver, how, and why. Perhaps then I'd be able to see how this would help the driver optimize something better.

What do you mean when you say "works both ways"? I've never heard it used in this context.

I haven't fleshed out this idea at all yet because I didn't know what kinds of optimizations, besides custom shaders, driver writers employ; hence this thread. Perhaps if I have time I'll take a couple of weekends and see what I can come up with without concrete information; no promises though.
 
"Works both ways" as in "I'm not seeing what you're seeing either." ;)
Ah, that makes a whole lot of sense, thanks. It seems I'm a fool.

edit - a fool because I was born and raised stateside and I still didn't get what Dominik D meant when he said "works both ways".
 
Well, here's my first attempt at API design; enjoy.
This is just for example purposes, so it's a jumble of thoughts and obviously incomplete.

When pipeline state objects are created, scan for traits (stored as a bitmap).

Pass type - a bitmap that stores allowable traits
or
listPSO entity - a list of pipeline state objects
(used to identify different types of passes: z-only, opaque, stream-out...)

listRT entity - a list of render targets (not necessarily created yet)

FRAME entity - the basis of rendering; contains a series of passes, non-nested loops???

PASS entity - this is where you script setup and cleanup for draw loops.
ex. PASS zonly(params)
{
    passtype(zonly) or set_listPSO(zonly)   // passtype preferable???
    // setup render targets: index into the list of render targets
    Drawloop();   // sets up an API state where only setting render state and drawing is allowed;
                  // a break call is used to exit the draw loop, continue to draw again;
                  // if listPSO is used, index into it for set_listPSO
    // cleanup code
}

Any suggestions? Besides taking compute shaders into account (I gave them no thought).

Edit - I also gave no thought to geometry shaders, and while I gave some thought to texture use, I wasn't sure it'd be helpful.
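To make the usage concrete, here's roughly how a client might drive this per frame; every name below is hypothetical and just follows the sketch above:

#include <vector>

// Hypothetical client-side view of the outline API proposed above.
struct Pass { virtual void Execute() = 0; virtual ~Pass() = default; };
using Outline = std::vector<Pass*>;

void BeginFrame(const Outline&) { /* driver inspects the whole frame's shape */ }
void EndFrame()                 { /* flush; work outside the outline is illegal */ }

void RenderFrame(const Outline& outline)
{
    // Selecting the outline up front tells the driver not only what will
    // happen, but what definitely won't happen in this phase of execution.
    BeginFrame(outline);
    for (Pass* pass : outline)
        pass->Execute();   // inside a pass: only render-state sets and draws
    EndFrame();
}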
 