One point is the driver overhead: Vista virtualizes all video memory, so it also pushes data to the drivers when and how it wants. That intermediate buffering overhead is what makes Vista generally slower than XP (and also makes it more difficult to do application-specific optimizations, like drivers did for a lot of games on WinXP).
Well, D3D9 was faster even on Vista and Windows 7.
The main difference between D3D9 and D3D10 is the constant handling.
In D3D10 mode, the driver has to assume that some new constants may have been set, which means that at least the constant cache needs to be flushed. In the worst case it means that some shader optimizations are either done on a per-draw-call basis or not enabled at all.
In D3D9 mode, the driver presumably gets the constants you set via the command buffer. If you don't set any, it knows all the data is up to date and every shader can be kept as-is.
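To illustrate, here is a minimal sketch of that D3D9 path. The register index and the dirty-flag bookkeeping are placeholders I made up, not anything from the posts above:

    #include <d3d9.h>

    // Sketch of the D3D9 path: constants reach the driver only through
    // explicit Set* calls, so if nothing is set, nothing needs flushing.
    void DrawObject(IDirect3DDevice9* device, const float* worldViewProj,
                    bool constantsDirty)
    {
        if (constantsDirty)
        {
            // One 4x4 matrix = 4 float4 registers, starting at register 0.
            device->SetVertexShaderConstantF(0, worldViewProj, 4);
        }
        // With no Set* call above, the driver knows every constant and all
        // shader optimizations from the previous draw are still valid.
        device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, 100);
    }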
Doesn't make sense to me.
I do update the constants all the time; at the very least I need to update the transform matrices for the object animation, and light positions and such.
With D3D9 I have to make a separate call for each constant that I update. With D3D10, instead, I just map the entire constant buffer in one go, put the new values in, and unmap it.
So in D3D10 I specifically tell the driver "I'm done with it, the constant buffer is up to date now", where with D3D9 it doesn't know what is going on exactly.
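For reference, the D3D10 update path I'm describing looks roughly like this. A minimal sketch: the Constants struct is made up for illustration, and the buffer is assumed to have been created with D3D10_USAGE_DYNAMIC and D3D10_CPU_ACCESS_WRITE, and bound once with VSSetConstantBuffers:

    #include <d3d10.h>
    #include <cstring>

    struct Constants
    {
        float worldViewProj[16]; // must match the cbuffer layout in the shader
        float lightPos[4];
    };

    void UpdateConstants(ID3D10Buffer* cb, const Constants& data)
    {
        void* mapped = nullptr;
        // WRITE_DISCARD hands back a fresh region and tells the driver the
        // old contents are garbage, so no readback or synchronization needed.
        if (SUCCEEDED(cb->Map(D3D10_MAP_WRITE_DISCARD, 0, &mapped)))
        {
            std::memcpy(mapped, &data, sizeof(data));
            cb->Unmap(); // "I'm done; the whole buffer is up to date now."
        }
    }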
In my case I update all constants every frame anyway, because I use very simple shaders.
Also, what you're saying isn't entirely correct. You can have multiple constant buffers, and you should organize them by how frequently they are updated. All of this should make D3D10 more efficient when used properly: you don't need to push the whole set of constants over the bus, only the buffer you're updating at the time. And since you do the update in a single go, it should get maximum performance from a burst transfer over the bus.
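For example, the frequency-based split could look like this. Just a sketch; the slot numbers and buffer contents are illustrative:

    #include <d3d10.h>

    // HLSL side, for reference:
    //   cbuffer PerFrame  : register(b0) { float4x4 viewProj; float4 lightPos; };
    //   cbuffer PerObject : register(b1) { float4x4 world; };

    void BindConstantBuffers(ID3D10Device* device,
                             ID3D10Buffer* perFrame,   // mapped once per frame
                             ID3D10Buffer* perObject)  // mapped once per draw call
    {
        ID3D10Buffer* buffers[2] = { perFrame, perObject };
        device->VSSetConstantBuffers(0, 2, buffers);
        // Only the small perObject buffer is re-uploaded between draw calls;
        // the perFrame data stays put for the whole frame.
    }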
However, in my case the constant buffers were very small: only one matrix and a few float values. So bandwidth shouldn't be an issue anyway.
I wonder if it may have something to do with thread safety. D3D9 isn't thread-safe by default, and I never used the D3DCREATE_MULTITHREADED flag to get a thread-safe device. D3D10 is the other way around: the device is thread-safe unless you create it with the D3D10_CREATE_DEVICE_SINGLETHREADED flag, so you always get a thread-safe instance by default, which would explain at least some of the extra overhead.
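For the record, the flags in question, as a sketch; the window handle and present parameters are assumed to be set up elsewhere:

    #include <d3d9.h>
    #include <d3d10.h>

    // D3D9: single-threaded unless you opt in with D3DCREATE_MULTITHREADED
    // (which, as said above, I never did).
    HRESULT CreateDevice9(IDirect3D9* d3d9, HWND hWnd,
                          D3DPRESENT_PARAMETERS* pp, IDirect3DDevice9** out)
    {
        return d3d9->CreateDevice(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd,
                                  D3DCREATE_HARDWARE_VERTEXPROCESSING, // no MULTITHREADED
                                  pp, out);
    }

    // D3D10: thread-safe by default; D3D10_CREATE_DEVICE_SINGLETHREADED is
    // the opt-out that skips the internal locking.
    HRESULT CreateDevice10(ID3D10Device** out)
    {
        return D3D10CreateDevice(nullptr, D3D10_DRIVER_TYPE_HARDWARE, nullptr,
                                 0, // or D3D10_CREATE_DEVICE_SINGLETHREADED
                                 D3D10_SDK_VERSION, out);
    }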