PC driver overhead

purpledog

Well, that's all in the title: what's this famous driver overhead?
OK, the question might be a bit general, so here's a more precise list:

- What is taking so long (and how much does it take)?
A lot of small calls, OS badness, stupid latency, too much abstraction... What's your experience here?

- Does it tend to increase?
DirectX 10 is coming with such nice features that the CPU will probably end up computing them for the GPU :cry:

- Why don't vendors provide the 3D community with direct access to the GPU? OK, doing so might reveal a bit of the hardware itself, but does it really? In the end we are sending triangles, state changes and shaders...

- How to profile it, how to overcome it?
 
purpledog said:
Well, that's all in the title: what's this famous driver overhead?
OK, the question might be a bit general, so here's a more precise list:

- What is taking so long (and how much does it take)?
A lot of small calls, OS badness, stupid latency, too much abstraction... What's your experience here?

There are many things that will steal your CPU cycles. One big issue is the transition from the program to the kernel driver and back; these context changes are very expensive. Another problem is that the driver needs to check a large array of render states and find the right hardware commands over and over again, every time one of these states is changed between two draw calls. Most of these concepts date from a time when GPUs were only "simple" 3D accelerators with a much smaller feature set and operating systems were less strict about security.
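To make that second point a bit more concrete, here is a toy sketch (in C++, with invented state ids, opcodes and a fake translation step; no real driver looks like this): the API only hands the driver generic render states, so before every draw it has to walk whatever changed and re-translate it into hardware-specific command words.

```cpp
// Toy illustration only; state ids, opcodes and the translation step are invented.
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int kNumRenderStates = 256;        // D3D9-era drivers track hundreds of loose states

struct ToyDriver {
    uint32_t state[kNumRenderStates] = {};   // generic API-level values
    std::bitset<kNumRenderStates> dirty;     // which ones changed since the last draw
    std::vector<uint32_t> cmdBuffer;         // hardware command words being built

    void SetRenderState(int id, uint32_t value) {
        if (state[id] == value) return;      // squash redundant changes early
        state[id] = value;
        dirty.set(id);
    }

    uint32_t TranslateToHardware(int id, uint32_t value) const {
        return (uint32_t(id) << 24) | (value & 0xFFFFFF);  // stand-in for a per-state lookup table
    }

    void Draw() {
        // The recurring cost: re-deriving hardware commands from generic state
        // before every draw that follows a state change.
        for (int id = 0; id < kNumRenderStates; ++id) {
            if (!dirty.test(id)) continue;
            cmdBuffer.push_back(TranslateToHardware(id, state[id]));
        }
        dirty.reset();
        cmdBuffer.push_back(0xD3000001u);    // pretend "draw" opcode
        // ...and eventually the filled buffer is handed to the kernel-mode driver,
        // which is where the expensive user/kernel transition comes in.
    }
};

int main() {
    ToyDriver drv;
    drv.SetRenderState(12, 1);   // e.g. enable blending
    drv.SetRenderState(12, 1);   // redundant: filtered out
    drv.Draw();                  // translation happens here
    drv.Draw();                  // nothing dirty: much cheaper
    return 0;
}
```

Pre-Vista, bookkeeping like this plus the kernel round trip sits on the hot path of every draw call.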

purpledog said:
- Does it tend to increase?
DirectX 10 is coming with such nice features that the CPU will probably end up computing them for the GPU :cry:

Direct3D 10 (not DirectX 10) has many new features but uses a new driver interface that needs less CPU power. Some of these new Vista driver interface parts can be used with older Direct3D versions too.

purpledog said:
- Why don't vendors provide the 3D community with direct access to the GPU? OK, doing so might reveal a bit of the hardware itself, but does it really? In the end we are sending triangles, state changes and shaders...

Sure, direct access would be faster, but it would come at a high price. Games would only run on hardware they know. Every time you bought a new card you could only hope that your old games would still run. We had this back in the DOS days and it was a horror.

purpledog said:
- How to profile it, how to overcome it?

As I have already written, the driver overhead will be reduced with Vista and it will be reduced even more with Direct3D 10. But there will always be some overhead. This is the price that you have to pay for a flexible system.
 
Demirug said:
Sure, direct access would be faster, but it would come at a high price. Games would only run on hardware they know. Every time you bought a new card you could only hope that your old games would still run. We had this back in the DOS days and it was a horror.

One API per card is bad, agreed. One could argue that one API for all cards is just as bad. I'm kind of dreaming here, but it would be interesting to have something in the middle, let's say 3 or 4 driver models, and each GPU chooses to implement one (or several) of them.

Now let's play the devil's advocate :devilish:
You're saying that the DOS days were a horror. Granted. But has it really changed? Even if the driver is unified, real-time software checks hardware caps to achieve decent results. Some of it even directly checks which GPU it is dealing with.

So I'm not even sure that this nice level of abstraction is making engineers' lives easier!
Are there any PC game programmers willing to (dis)approve of that?
 
Demirug said:
As I have already written, the driver overhead will be reduced with Vista and it will be reduced even more with Direct3D 10. But there will always be some overhead. This is the price that you have to pay for a flexible system.

So what percentage of performance would you estimate is lost on a PC compared to a console, in terms of CPU, memory and GPU power?

And how much do you think that will change with Vista/D3D10?
 
purpledog said:
it would be interesting to have something in the middle, let's say 3 or 4 driver models, and each GPU chooses to implement one (or several) of them.
So you have your GPU, which implements one or two of the models, and your game, which implements one or two as well, and none of them necessarily match.

What in that situation is better than the current system? :devilish:
 
Guden Oden said:
So you have your GPU, which implements one or two of the models, and your game, which implements one or two as well, and none of them necessarily match.

What in that situation is better than the current system? :devilish:

The idea was to code for ALL the different models, a bit like a multi-platform developer. Basically, instead of doing a PC/PS3/XBOX2 version, one has to produce a PC1/PC2/PC3/PS3/XBOX2 version.

At first glance, this is more work without real benefits. Still, my assumption was that good performance would be easier to achieve because each card would have its favorite model (1, 2 or 3), closer to what its hardware really is.

I guess this kind of idea only works if it's possible to make such a simple partition of all the GPUs on the market... Hmm... Doesn't really look good! Perhaps some kind of "per-vendor 3D library"? I wonder if NVidia/ATI would be happy to develop their own 3D PC API...
 
purpledog said:
The idea was to code for ALL the different models, a bit like a multi-platform developer. Basically, instead of doing a PC/PS3/XBOX2 version, one has to produce a PC1/PC2/PC3/PS3/XBOX2 version.

At first glance, this is more work without real benefits. Still, my assumption was that good performance would be easier to achieve because each card would have its favorite model (1, 2 or 3), closer to what its hardware really is.

I guess this kind of idea only works if it's possible to make such a simple partition of all the GPUs on the market... Hmm... Doesn't really look good! Perhaps some kind of "per-vendor 3D library"? I wonder if NVidia/ATI would be happy to develop their own 3D PC API...

Modern video cards just aren't that far apart feature-set-wise these days; part of that is a result of having to run D3D games fast. Supporting old cards is a bigger issue than supporting ATI's and NVidia's cutting-edge stuff.

All you'd get if you moved to 3 different APIs is either only one of them being used, or devs using some common subset.

The primary issue is the ring transition when the app makes an OS Kernel level call.

It makes PC games very sensitive to batch counts (DrawIndexedPrimitive etc. need to be called an absolute minimum number of times). But you're not just paying the increased cost on the DrawPrim calls; applications are now constructed to minimise batch counts, and they can spend a lot of CPU time doing it. Just fixing the batch overhead wouldn't help old games as much as it helps newer games built for a driver model where it's not an issue.
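As an illustration of where that CPU time goes, here is a hedged sketch (names and structure invented, not from any particular engine) of the sort of sort-and-merge pass an engine runs every frame just to keep the draw-call count down.

```cpp
// Illustrative sketch only: a per-frame pass that groups draw requests by state
// so each run of identical state costs one state setup and one draw call.
#include <algorithm>
#include <cstdio>
#include <vector>

struct DrawRequest {
    int materialId;     // stands in for the whole state block (shader, textures, blend, ...)
    int firstIndex;     // range in a shared, pre-merged index buffer
    int indexCount;
};

void SubmitBatched(std::vector<DrawRequest> requests) {
    // CPU work the app does purely to reduce batch count.
    std::sort(requests.begin(), requests.end(),
              [](const DrawRequest& a, const DrawRequest& b) { return a.materialId < b.materialId; });

    int drawCalls = 0;
    for (size_t i = 0; i < requests.size();) {
        size_t j = i;
        while (j < requests.size() && requests[j].materialId == requests[i].materialId) ++j;
        // One state setup + one draw for the run [i, j), assuming geometry that shares a
        // material was baked into one vertex/index buffer (or is instanced) so the run
        // really can go out in a single DrawIndexedPrimitive-style call.
        // SetMaterialState(requests[i].materialId);   // validation + translation cost
        // DrawIndexedPrimitive(...);                  // kernel transition cost (pre-Vista)
        ++drawCalls;
        i = j;
    }
    printf("%zu objects submitted in %d draw calls\n", requests.size(), drawCalls);
}

int main() {
    SubmitBatched({{2, 0, 300}, {1, 300, 60}, {2, 360, 90}, {1, 450, 30}});  // -> 2 draw calls
    return 0;
}
```

None of this work has anything to do with what ends up on screen; it exists only because each batch is expensive to submit.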
 
I do not believe the user-kernel transition is the major source of overhead. The major source of overhead is a poor impedance match between various parts of the stack and the hardware. The pre-Vista graphics stack is 10 years old, with only modest evolution to match changing hardware.

The sources of overhead take one of two forms: spending too much time on the CPU fondling data before passing it on to the hardware, or blocking for pipeline hazards (e.g., state changes that can't be pipelined). The former is harder to fix since it is largely a matter of driver developer style and discipline. There are often tradeoffs between consolidating work in one place to simplify maintenance, or extra checks that need to be done to ensure apps can't crash the hardware, or extra shim processing to make the GPU execution go faster (or fix poor usage of the API by a game), or things like squashing redundant state changes. The shame is when this work gets repeated in multiple parts of the stack (the left hand doesn't know what the right hand is doing).
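For a flavour of the "extra checks" category, here is a minimal sketch (hypothetical function and limits, not any real runtime's code) of the defensive validation a user-mode shim repeats per call so a buggy app can't hand the GPU garbage; the complaint above is that the runtime and the driver often both do this, because neither layer trusts the other.

```cpp
// Hypothetical shim in front of the real draw path; names and limits are invented.
#include <cstdint>
#include <cstdio>

struct VertexBufferDesc { uint32_t vertexCount; uint32_t strideBytes; };

// Returns false (and skips the draw) instead of letting bad parameters reach the GPU.
bool ValidatedDraw(const VertexBufferDesc& vb, uint32_t first, uint32_t count,
                   uint32_t maxHwVertexCount) {
    if (count == 0) return true;                                  // squash no-op calls
    if (first > vb.vertexCount || count > vb.vertexCount - first) // out-of-bounds read
        return false;
    if (count > maxHwVertexCount)                                 // hardware limit
        return false;
    // Only now would the draw command be written into the hardware buffer.
    // EmitHardwareDraw(first, count);
    return true;
}

int main() {
    VertexBufferDesc vb{1000, 32};
    printf("%d %d\n", ValidatedDraw(vb, 0, 1000, 1 << 20),   // accepted
                      ValidatedDraw(vb, 900, 200, 1 << 20)); // rejected: past end of buffer
    return 0;
}
```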

Things like pipeline interlocks can defeat user-mode buffering and make the kernel transitions seem more expensive, but fundamentally the problem is that most operations need to be fully pipelined. Even with full pipelining, something like turning a render target around from a write target to a read target (as a texture) back-to-back is going to stall, and applications simply have to avoid doing this.
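A toy model of that render-target turnaround (entirely invented, just to show the shape of the hazard): writes are fire-and-forget, but reading a resource whose write hasn't retired yet forces the pipeline to drain, whereas scheduling unrelated work in between hides it.

```cpp
// Toy in-order pipeline model; not how real hardware or drivers track hazards.
#include <cstdio>
#include <queue>
#include <set>

struct ToyGpuQueue {
    std::queue<int> inFlightWrites;   // resources with writes still in the pipeline
    std::set<int>   pendingWrites;
    int stalls = 0;

    void RenderTo(int res) {                // write: just gets pipelined, no waiting
        inFlightWrites.push(res);
        pendingWrites.insert(res);
    }

    void RetireOldestWrite() {              // happens as other work flows through the pipe
        if (inFlightWrites.empty()) return;
        pendingWrites.erase(inFlightWrites.front());
        inFlightWrites.pop();
    }

    void SampleFrom(int res) {              // read as a texture: must wait for its write
        if (pendingWrites.count(res)) {
            ++stalls;                       // drain the whole pipeline -- the expensive part
            while (!inFlightWrites.empty()) RetireOldestWrite();
        }
    }
};

int main() {
    ToyGpuQueue q;
    q.RenderTo(7);
    q.SampleFrom(7);          // back-to-back turnaround -> stall

    q.RenderTo(8);
    q.RetireOldestWrite();    // stands in for doing other rendering first
    q.SampleFrom(8);          // write already retired -> no stall
    printf("stalls: %d\n", q.stalls);       // prints 1
    return 0;
}
```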

I don't think multiple APIs make any difference - though one might suggest D3D8/9/10 or SM2/3/4 are already three different APIs. Consoles reduce overhead by dumping the "attention to detail" part onto the developers. This isn't nearly so hard for a fixed piece of hardware, but note the long learning curve before apps take full advantage of the hardware. PC APIs try to avoid this approach because it is virtually impossible for anyone to keep track of across a large range of hardware, and as a result they typically don't take full advantage of the hardware. That is the price paid to allow rapid and "unbridled" innovation in PC graphics hardware.
 
I understand that Direct3D 10 will eliminate lots of the calls made from the driver to the graphics card. I am not exactly sure how.
 
Interesting stuff, a bit confusing though... Let me rephrase.
So far, and as far as I understand it, the driver overhead is due to:

- DLL to DLL transition: "transfer from the program to the kernel driver and back", also called "ring transition when the app makes an OS Kernel level call" as well as "user-kernel transition".
I'm not quite sure what you're talking about here. Very naively perhaps, I see that as the traditional cost of a function call: some code must be loaded, some variables pushed on the stack... Basically, when one DLL calls another DLL, one can expect this kind of additional cost. Is that the right interpretation?

- standard to dedicated GPU calls translation: "the driver needs to check a large array of render states and find the right hardware commands over and over again".
I see that as a translation between a standard Windows language and a GPU-specific one. This is a solved problem with Vista, as the driver directly produces GPU-specific calls. Right?

- Robust GPU: "There are often tradeoffs between consolidating work in one place to simplify maintenance, or extra checks that need to be done to ensure apps can't crash the hardware".
So basically, defensive programming slows the whole thing down, because Windows doesn't want a bad programmer to crash the machine through some GPU misuse... It makes sense, but I thought the Direct3D debug/retail modes were supposed to handle that sort of thing.

- ???: "poor impedance match between various parts of the stack and the hardware"
Need some clarification here!

- ???: "It makes PC games very sensitive to batch counts"
Help again!!! What do you mean by "batch count"?

- OS badness: nobody has talked about OS requirements, which kind of surprises me... GPUs are more and more considered "just another processor". This means they have to sync with other threads and so on. Isn't this kind of practice liable to introduce further driver overhead?
 
There are several sources of CPU overhead. I'd say these are the most important ones:

1) Switching between rings (kernel mode / user mode). I don't know the exact cost of this, but it's way more expensive than the cost of a function call. This is more of a problem in D3D than OGL due to the driver model. In OGL a draw call goes into user space and the driver can batch many calls before going into kernel mode. In D3D the driver goes into kernel mode on every call.

2) Validation. This costs a lot.

3) Building command buffers. First the D3D runtime builds its own buffer, then passes it to the driver, which rebuilds it into something the hardware understands. It's double work.

D3D10 should solve all these problems. 1) is solved by splitting the driver into two parts, a user-mode part and a kernel-mode part. 2) is solved by using state blocks that are prevalidated. 3) is solved by a thinner runtime where more stuff is passed on directly to the driver.
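A rough sketch of what prevalidated state blocks buy you, with made-up types rather than the real D3D10 interfaces (the actual ones are immutable objects such as ID3D10BlendState, created up front): the validation cost is paid once at creation time, and binding at draw time is essentially a pointer swap.

```cpp
// Made-up types for illustration; not the actual D3D9/D3D10 APIs.
#include <memory>
#include <stdexcept>

struct BlendDesc { int srcFactor; int dstFactor; bool enable; };

// "D3D9 style": the runtime sees loose states, so it re-checks them around every draw.
void DrawWithLooseState(const BlendDesc& d) {
    if (d.srcFactor < 0 || d.dstFactor < 0)       // repeated on the hot path
        throw std::invalid_argument("bad blend factor");
    // ... translate and emit, every time
}

// "D3D10 style": validate once at creation, hand back an immutable object.
struct ValidatedBlendState { BlendDesc desc; };   // nothing left to verify later

std::shared_ptr<const ValidatedBlendState> CreateBlendState(const BlendDesc& d) {
    if (d.srcFactor < 0 || d.dstFactor < 0)
        throw std::invalid_argument("bad blend factor");
    return std::make_shared<const ValidatedBlendState>(ValidatedBlendState{d});
}

void DrawWithStateObject(const ValidatedBlendState& s) {
    (void)s;  // binding is basically a pointer swap; no per-draw re-validation
}

int main() {
    auto blend = CreateBlendState({1, 2, true});  // cost paid once, at load time
    for (int i = 0; i < 1000; ++i) DrawWithStateObject(*blend);
    return 0;
}
```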
 
According to docs on AMD64 processors, the cost of a CALL+RET with ring change is roughly 220 clock cycles, while the cost of an ordinary function call (CALL+RET without ring change) is roughly 8 cycles. Something similar is IIRC the case with Intel processors as well.
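If you want to measure something like this yourself, here is a rough micro-benchmark sketch; it assumes x86-64 Linux with GCC or Clang (a Windows build would need a different cheap kernel call), and the exact numbers will vary by CPU and kernel.

```cpp
// Rough measurement sketch, not a rigorous benchmark: no serialization around
// RDTSC, no warm-up, and frequency scaling is ignored.
#include <cstdint>
#include <cstdio>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>

__attribute__((noinline)) static int plain_call(int x) { return x + 1; }

int main() {
    const int iters = 1000000;
    volatile long sink = 0;

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; ++i)
        sink += plain_call(i);             // ordinary CALL+RET, stays in user mode
    uint64_t t1 = __rdtsc();

    for (int i = 0; i < iters; ++i)
        sink += syscall(SYS_getpid);       // about the cheapest real kernel entry and exit
    uint64_t t2 = __rdtsc();

    printf("function call:     ~%llu cycles\n", (unsigned long long)((t1 - t0) / iters));
    printf("kernel round trip: ~%llu cycles\n", (unsigned long long)((t2 - t1) / iters));
    return 0;
}
```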
 
arjan de lumens said:
According to docs on AMD64 processors, the cost of a CALL+RET with ring change is roughly 220 clock cycles, while the cost of an ordinary function call (CALL+RET without ring change) is roughly 8 cycles. Something similar is IIRC the case with Intel processors as well.

AFAIK Windows doesn't use a CALL+RET for ring changes; all kernel API calls are interrupt based. And you need to add the whole overhead of going through the driver stack. Microsoft works with a figure of ~5000 cycles for a one-way trip. This gives us ~10000 cycles from the runtime to the driver and back, without any work done by the driver or the runtime.
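To put that ~10000-cycle round trip in perspective, a back-of-the-envelope sketch; the draw-call count, frame rate and clock speed here are assumptions for illustration, not figures anyone quoted.

```cpp
// Back-of-the-envelope only; every input except the 10000-cycle figure is assumed.
#include <cstdio>

int main() {
    const double cyclesPerCall  = 10000.0;  // runtime -> driver -> runtime, per the figure above
    const double callsPerFrame  = 1000.0;   // assumed batch count for a fairly busy scene
    const double framesPerSec   = 60.0;
    const double cpuHz          = 3.0e9;    // assumed 3 GHz CPU

    const double overheadCycles = cyclesPerCall * callsPerFrame * framesPerSec;
    printf("%.0f%% of the CPU spent just crossing the user/kernel boundary\n",
           100.0 * overheadCycles / cpuHz); // -> 20%
    return 0;
}
```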
 
5000 cycles, gods! What could the CPU possibly be doing during that period of time?? And why did MS create this setup in the first place, because they're idiots, or was there a valid reason somewhere way back?

If OGL has been able to batch stuff up without doing so much extra CPU overhead, surely D3D must have been able to do this also. So we're back at the "idiots" stage again I guess. Considering D3D was a catastrophe up until at least version 6, is there a chance they did it this way because they just didn't know any better?

And from a hardware POV, why does a ring change cost so many cycles?
 
Demirug said:
AFAIK Windows doesn't use a CALL+RET for ring changes; all kernel API calls are interrupt based. And you need to add the whole overhead of going through the driver stack. Microsoft works with a figure of ~5000 cycles for a one-way trip. This gives us ~10000 cycles from the runtime to the driver and back, without any work done by the driver or the runtime.
INT+IRET with a ring change is about as fast as CALL+RET with a ring change (~200 cycles). That still leaves us with ~9800 cycles of overhead that isn't directly due to the CPU hardware ring change itself...?
 