D3D9 texture paging with solid FPS (and turbo cache problems)

zqw

Newcomer
I've got a D3D9 app that needs to load textures on the fly while maintaining a solid 60fps (16.66ms/frame) framerate. We control a limited set of hardware targets, but they are currently all low-end (GeForce FX 5200, GeForce 6200 TurboCache).

I'm pretty stuck. So, thanks in advance for any ideas/pointers.

-Zach


First the questions:
Are there any examples/tutorials on loading textures and minimizing pipeline stalls online?
Is there a way to disable the 'virtual memory' on the 6200 turbo cache card for testing?
Can you pick out anything goofy below?
Are there any (free?) profiler tools to help me pinpoint my bottleneck more than I have? I'm about to try NVPerfHUD.



And here's a pile of details:
Currently, the app has a 'draw thread' that just draws like mad in a window with vsync on. A second 'control thread' receives instructions over the network and does all the setup/teardown of textures/shaders, etc. We were using D3DXCreateTextureFromFileEx(), but it was making a lock that stopped the draw thread and caused 'hitching' in the frame rate. So, I've separated the disk I/O from the d3d stuff. We use 3-5 large non-pow2 textures at a time - 640x480, 800x600, 1280x720, etc.

The draw thread basically spends all of its time inside device->Present(NULL, NULL, NULL, NULL).

Currently the control thread works like this. Keep in mind that this texture is completely unknown to the draw thread. I'm just fighting the d3d global lock.
load png to gdi+ bitmap
D3DXCreateTexture(D3DUSAGE_DYNAMIC)
tex->LockRect(D3DLOCK_DISCARD)
memcpy rows to fill d3d texture
tex->UnlockRect(0)
application lock (enter critical section)
add new objects/textures to draw thread work queue
application unlock
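The handoff above can be sketched in plain C++. This is a minimal model, not the poster's actual code: the decode step is faked, the D3D texture is stood in for by a pixel buffer, and std::mutex plays the role of the Win32 critical section. The point is the pattern - the control thread does the slow work on its own time and only takes the shared lock for a quick queue push, so the draw thread is never blocked for long.

```cpp
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Decoded image ready for the draw thread to turn into a real
// IDirect3DTexture9. PendingTexture is a hypothetical stand-in type.
struct PendingTexture {
    int width = 0, height = 0;
    std::vector<uint8_t> pixels;  // tightly packed BGRA rows
};

// Shared work queue guarded by a lock (std::mutex here; the original
// post uses a Win32 CRITICAL_SECTION).
struct TextureQueue {
    std::mutex lock;
    std::queue<PendingTexture> items;

    void push(PendingTexture t) {
        std::lock_guard<std::mutex> g(lock);  // "application lock"
        items.push(std::move(t));             // hand off to draw thread
    }
    bool tryPop(PendingTexture& out) {
        std::lock_guard<std::mutex> g(lock);
        if (items.empty()) return false;
        out = std::move(items.front());
        items.pop();
        return true;
    }
};

// Control thread: decode outside the lock, then enqueue. The draw
// thread later pops items and does its D3D work itself, so nothing
// here touches the device.
void controlThreadLoad(TextureQueue& q, int w, int h) {
    PendingTexture t;
    t.width = w; t.height = h;
    t.pixels.assign(static_cast<size_t>(w) * h * 4, 0xFF);  // fake decode
    q.push(std::move(t));
}
```

The draw thread would call tryPop() once per frame and do the CreateTexture/LockRect/memcpy/UnlockRect sequence for whatever it receives.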


This has allowed the hated 5200 (bad pixel shader perf) to run smoothly at 640x480 even under an artificial worst-case stress test (continuous texture churn), but the 6200 TC still has a small 'hitch' in framerate when textures load. Under the stress test it drops to an unsmooth 30FPS, and the drop appears to be independent of texture/window resolution. (I tried forcing all 256x256 textures in a 640x200 window.) The 6200 TC can spend ~60ms inside device->Present() when adding a texture. I think the shader is fine; when textures aren't changing it's happy running the pixel shaders at 720p.

I'm using QPC for timings, but it can be hard to tell what is happening in the draw thread since the Nvidia driver appears to buffer more than one frame of commands. But, it looks like the D3D locks are only held during texture create, and lock/fill/unlock. When running multithreaded, it can take quite a while for the command thread to acquire the D3D lock (CreateTexture and LockRect), but when running a single thread it is ~0.5ms. So, I think it's fine if the control thread waits on the draw thread. The texture fill/unlock takes only 2ms after the lock is acquired.
 
Sounds like you've got the general idea right...

Much of the loading/creation time for resources is spent I/O waiting or upload waiting...

Also, it's not entirely clear from your post - but you're not using D3D across multiple threads (with D3DCREATE_MULTITHREADED), are you? Best advice is to completely isolate it to a single thread - whilst the flag might make the application design simpler, there is no reason to have MT D3D. Only one thread can access the driver/GPU at a time, so you'll just end up passing a lock around... ;)

The only thing I've done that you might want to try is a streaming form of loading - but this only works if you can handle half-second (or more) delays between issuing a request and having a renderable texture. You can offset the cost greatly if you have pre-generated mipmaps and DDS files, though.

What you do is have your worker thread stream in the raw binary - but not in one big chunk; allow the thread to yield every so often. Either once it's loaded, or during loading, you copy small blocks over to the D3D thread, which can then upload small chunks per frame. The basic idea is to avoid the big one-time copy operations and break them down into lots of smaller operations.
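The chunked-upload idea above can be shown with a small sketch. Assumptions: the destination is a plain buffer standing in for the locked rect of a D3D texture, and the row pitch is the same on both sides (real LockRect pitches often differ, which is why you copy row by row in the first place). The function name and signature are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies at most `rowsPerFrame` rows of a decoded image into the
// destination each call, returning the next row to resume from. Calling
// this once per frame spreads a big texture fill over many frames
// instead of one long stall.
int uploadChunk(const std::vector<uint8_t>& src, std::vector<uint8_t>& dst,
                int rowPitch, int totalRows, int startRow, int rowsPerFrame) {
    int endRow = std::min(startRow + rowsPerFrame, totalRows);
    for (int r = startRow; r < endRow; ++r) {
        std::memcpy(&dst[static_cast<size_t>(r) * rowPitch],
                    &src[static_cast<size_t>(r) * rowPitch], rowPitch);
    }
    return endRow;  // caller stores this and resumes next frame
}
```

The draw thread keeps a cursor per in-flight texture and advances it each frame until the whole image has been copied, at which point the texture becomes renderable.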

btw, with regards to timing - make sure you read "Accurately Profiling Direct3D API Calls" in the SDK documentation, and you might well be better off digging deep with PIX...

Jack
 
You want to stay away from initializing D3D in multithreaded mode.

We usually load the data on a separate thread, then dispatch the texture to a queue for creation on the primary D3D thread.

Our main app loop limits the time that can be spent calling create texture, something like:

(No real APIs harmed in the construction of this pseudocode)

Timer.Reset();
while(!FifoIsEmpty() && Timer.Get() < someThreshold)
{
    item = getNextItemFromFifo();
    item.doD3DCreateStuffHere();
}

We do one additional thing to deal with the asynchrony at the application level: the app never deals directly with these resources. But all that does is ease the burden on the application code.
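The budgeted loop above can be written as compilable C++ using std::chrono for the timer. This is a sketch of the same idea, not ERP's real code; the queue of std::function work items is an assumption standing in for whatever "do D3D create stuff" means in the real app.

```cpp
#include <chrono>
#include <deque>
#include <functional>

// Drains queued creation work until either the queue is empty or the
// per-frame time budget is spent. Called once per frame from the draw
// thread. Returns how many items were processed.
int pumpCreateQueue(std::deque<std::function<void()>>& fifo,
                    std::chrono::microseconds budget) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    int done = 0;
    while (!fifo.empty() && clock::now() - start < budget) {
        fifo.front()();   // e.g. CreateTexture + LockRect/fill/UnlockRect
        fifo.pop_front();
        ++done;
    }
    return done;
}
```

Note the budget only bounds *starting* new items; one oversized item can still blow past it, which is another argument for breaking fills into small chunks.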
 
Slightly off topic...

Does the multithreaded flag simply 'pass a lock around', or does it actually restrict you to rendering from the creation thread?

The reason I ask is because my current project has a pool of task threads, and while rendering to one device never occurs on more than one thread at once, it does jump from pool thread to thread.

The only problem I've run into is that D3D doesn't like being released on a thread that differs from the creation thread... *grumble*
 
Graham said:
Does the multi-threaded tag simply 'pass a lock around' or does it actually restrict you to rendering from the creation thread?
Well, strictly speaking, no - it's not intentionally passing a lock around. But in practice it's near enough the same thing. Given that only one thread can access the runtime/driver at a time, the CritSec locks out all the other threads, and as soon as it's done another thread picks it up and does the same thing... repeat until thoroughly bored...

If you search the DirectXDev mailing list archives, there have been some interesting discussions on quite what D3DCREATE_MULTITHREADED does and why it should(n't) be used...

Jack
 
Thanks for all the info. And yes, I was doing D3D calls from both threads with the D3D multithreaded flag.

I'm in the process of moving (the calling of) all d3d code into my 'draw thread' by implementing a D3DInit queue like ERP outlined. I'm trying to avoid changing interfaces and avoid deadlocks ATM. I'll report back.


Meanwhile, if anyone is interested: the 6200 TurboCache box might end up being OK. It looks like the real bottleneck was the lack of hyperthreading on that one box (we can recreate it on a 6200 AGP box by disabling HT). So, we're also looking at thread priorities and throttling our message dispatcher for those cases.
 
FWIW:

Implementing the 'D3D Init queue' that gets worked on when the draw thread has 'spare time' really smoothed things out. I wasn't able to come up with a great metric to figure out the maximum amount of work to do without impacting frame rate. If someone has one, I'd like to hear it. I went with a very conservative approach, and the latency is ok.

After that, getting all D3D calls into a single thread (and running w/o the multithreaded flag) didn't noticeably help. But, I already had very few batches per frame.

The issue on non-HT CPUs was mainly resolved by simply boosting the priority of the draw thread. The 6200TC card turned out to be a nice performer - especially considering its 64-bit memory interface and the 'virtual memory' stuff.

Thanks for the advice!
 