Driver overhead in GetRenderTargetData implementation
I'm using DX9 to render and then capture the image to system memory. The capture is done with IDirect3DDevice9::GetRenderTargetData(). I'm on Windows Vista with a GeForce 8800 GT.
The render target is 782x160 at 4 bytes per pixel (~0.5 MB). The capture takes 0.5 ms on my system, which works out to ~1 GB/s. The real bandwidth, measured with the CUDA bandwidthTest sample, is ~3 GB/s.
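For reference, here is a minimal sketch of the capture path I'm timing (hypothetical helper, error handling omitted; assumes the render target is a non-multisampled D3DFMT_A8R8G8B8 surface):

#include <d3d9.h>
#include <windows.h>
#include <stdio.h>

// Copy the current render target into system memory and time the call.
void CaptureRenderTarget(IDirect3DDevice9* device, UINT width, UINT height)
{
    IDirect3DSurface9* rt = NULL;
    device->GetRenderTarget(0, &rt);

    // The destination must be an offscreen plain surface in D3DPOOL_SYSTEMMEM
    // with the same size and format as the render target.
    IDirect3DSurface9* sysmem = NULL;
    device->CreateOffscreenPlainSurface(width, height, D3DFMT_A8R8G8B8,
                                        D3DPOOL_SYSTEMMEM, &sysmem, NULL);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    // The call in question: synchronously reads the GPU surface back
    // into the system-memory surface.
    device->GetRenderTargetData(rt, sysmem);

    QueryPerformanceCounter(&t1);
    double ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
    // 782 * 160 * 4 bytes ~= 0.5 MB; at 0.5 ms that is ~1 GB/s effective.
    printf("capture: %.3f ms\n", ms);

    sysmem->Release();
    rt->Release();
}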
I had always wondered where the gap comes from, and I assumed there was some constant per-call overhead in GetRenderTargetData.
With the new GPUView tool, I was finally able to look into it.
To my surprise, I discovered that GetRenderTargetData is implemented as 3 separate command buffers submitted to the kernel driver and GPU, with quite a lot of "dead" GPU time in between where the UMD is working. The GPU time of these 3 command buffers totals ~170 µs, which for the ~0.5 MB surface works out almost exactly to the real ~3 GB/s bandwidth (500,480 bytes / 170 µs ≈ 2.9 GB/s).
So, the questions are:
1. Why 3 submissions (with the kernel-mode switch overhead, DXGK overhead, etc.)?
2. What is the UMD doing for the remaining ~330 µs of the 0.5 ms, given that no format conversion or other processing is involved?