How to accurately profile a D3D program?

Hi all:
I'm currently working on a benchmark program and need to accumulate the time the GPU spends on rendering. The question is how to do that accurately. For now I just put two timers around the DrawPrimitive() calls to avoid the overhead of other D3D calls. Is that the correct way to do it, or is there a better approach?
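Roughly what I'm doing at the moment (a D3D9-style sketch; #include <windows.h> and <d3d9.h> assumed, device and geometry setup elsewhere, triCount is just a placeholder):

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&t0);
device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, triCount);   // the call I'm trying to time
QueryPerformanceCounter(&t1);

// accumulate what I assumed was the "GPU time" for this call
gpuTimeMs += 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;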
 
Since VPUs are massively parallel, you really can't get a view of rendering without timing very many frames. Flushing the VPU even once per frame can cost you 30% performance or more.
 
Also, if you just put a timer start before DrawPrimitive and a timer end after it, you will only get the time the CPU needed to execute the DrawPrimitive call; the GPU will (probably) not have finished by then (it might not even have begun rendering). So put the timer start at the beginning of the frame (BeginScene) and the timer end after Present.
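Something like this (a rough D3D9-style sketch, device and draw calls assumed), keeping in mind that even this only measures until the driver has accepted Present, not necessarily until the GPU has finished:

LARGE_INTEGER freq, frameStart, frameEnd;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&frameStart);        // timer start at the beginning of the frame
device->BeginScene();
// ... all the draw calls for this frame ...
device->EndScene();
device->Present(NULL, NULL, NULL, NULL);
QueryPerformanceCounter(&frameEnd);          // timer end after Present

double frameMs = 1000.0 * (frameEnd.QuadPart - frameStart.QuadPart) / freq.QuadPart;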
 
As said, DrawPrimitive time is just the time of the DrawPrimitive call itself; it is not the time spent drawing, since everything is batched up by the driver in a way that is transparent to your program.

If you want to accurately discover where your rendering time went, you will need much more detailed tools.
Also, don't forget that the tool changes the measurement (you need a precise model of what you are measuring before you do any sort of calculation and extrapolate from there).

Some developers have access to the perf tools from NVIDIA; perhaps you should try to contact them.
 
LeGreg said:
As said, DrawPrimitive time is just the time of the DrawPrimitive call itself; it is not the time spent drawing, since everything is batched up by the driver in a way that is transparent to your program.
Yes, I know very little about what's happening underneath the code; that's why I got abnormal results earlier and came here for help. But I'm just a hobbyist developer, and it's not likely the IHVs are willing to spend any resources on guys like me. I've actually tried a few times with nVIDIA and ATi, and there's been no response at all. :cry:
 
Profiling specific calls is pretty much impossible. Even attempting to get per-frame times can give you numbers that aren't entirely useful; if the driver/card has a really large command FIFO, even the per-frame times you get can be fairly meaningless. This became really obvious when developing the D3DBench tool. The results I got from the 3DMark2001 Dragothic High test (linked below) are 'per frame', and you can see a huge variation in per-frame times.

http://www.users.on.net/triforce/draghigh.png

The only real way to test what you want is to redraw the same primitive lots of times, over lots of frames, and get the average time per call that way.
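For example, something along these lines (a sketch only; D3D9-style, with the call count, frame count and triCount picked arbitrarily):

const int CALLS_PER_FRAME = 1000;
const int FRAMES = 500;

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);

for (int f = 0; f < FRAMES; ++f)
{
    device->BeginScene();
    for (int i = 0; i < CALLS_PER_FRAME; ++i)
        device->DrawPrimitive(D3DPT_TRIANGLELIST, 0, triCount);  // same primitive every time
    device->EndScene();
    device->Present(NULL, NULL, NULL, NULL);
}

QueryPerformanceCounter(&t1);
double totalMs   = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
double msPerCall = totalMs / (double(FRAMES) * CALLS_PER_FRAME);   // average cost of one call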
 
FWIW, I know of no way to get an accurate GPU-only timing in Windows.

If you can guarantee that you are not CPU-bound over the entire frame, that you have no calls that cause the driver to synchronise (bad calls to Lock), and that you don't fill the push buffer with less than one frame's worth of data, then after the first 3 or 4 frames (once the push buffer is full) the time between swap calls will be the time the rendering actually took. But that's a lot of ifs.
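A sketch of what that would look like, assuming all of the above holds (RenderFrame, device and totalFrames are placeholders):

const int WARMUP_FRAMES = 4;                  // let the push buffer fill first
LARGE_INTEGER freq, prev, now;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&prev);

double renderMsTotal = 0.0;
int    timedFrames   = 0;

for (int f = 0; f < totalFrames; ++f)
{
    RenderFrame(device);                      // BeginScene ... draw ... EndScene
    device->Present(NULL, NULL, NULL, NULL);

    QueryPerformanceCounter(&now);
    if (f >= WARMUP_FRAMES)                   // only count frames once the buffer is full
    {
        renderMsTotal += 1000.0 * (now.QuadPart - prev.QuadPart) / freq.QuadPart;
        ++timedFrames;
    }
    prev = now;
}
double avgFrameMs = renderMsTotal / timedFrames;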

The GPU probably has performance counters that should be able to report accurate times; I guess it's possible the individual vendors might expose these in some SDK.
 
Timing between BeginScene and Present won't be accurate either. The Present call returns immediately - it doesn't wait until rendering has completed unless the card is already a full frame ahead. In theory if you issue two additional Present calls with no rendering, you'll get a pretty accurate time as long as vsync is off or you've rendered plenty of frames.
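A rough sketch of that idea (D3D9-style, vsync assumed off, device and rendering assumed set up elsewhere):

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);

device->BeginScene();
// ... the rendering you actually want to time ...
device->EndScene();
device->Present(NULL, NULL, NULL, NULL);

// Two extra Presents with no rendering, so the frame above can't still be
// queued up in the driver when the timer stops.
device->Present(NULL, NULL, NULL, NULL);
device->Present(NULL, NULL, NULL, NULL);

QueryPerformanceCounter(&t1);
double ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;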

WHQL has some strict rules on this which I can't remember, but I think the limit is 1.99 frames (i.e. you can't have two Present calls outstanding in the command queue). This rule has not always been strictly obeyed.... keep an eye on that mouse lag ;)

I would like to expose some perf counter generalities - so an application can get a rough idea of how idle the VPU is and what it could do differently to get better usage of it. Unfortunately it's not a simple process - there are technical and political obstacles as well as just a lack of API support. Don't hold your breath...
 
There used to be ways to do this prior to DX7, i.e. ways to synchronise on both the VBlank and GPU completion.

They were removed because devs abused them. I know this was one of the biggest complaints by early Xbox devs, and MS did add the functionality to the Xbox version of DX.

In general those waits are not something you'd want to do in a game, but they are really useful for profiling. On our Xbox title we had a callback on render complete so we could isolate GPU time, but even that isn't sufficient to guarantee an accurate measurement. It can still be skewed if the app is starving the GPU through part of the frame, which early Xbox titles commonly did.

MS now has excellent profiling tools on Xbox for the GPU, and I guess from the XNA announcement, the intention is to make some of those available to PC devs.
 