Efficient memory mapping / Page fault performance bugs

Ext3h

Background info

Performance bug in the Windows 10 page fault handler
Partial workaround

Why the two links above? Because that bug has a pretty severe impact. Whenever you felt like a plain memcpy to a mapped buffer was far slower than it should ever possibly have been, it's likely that you fell into the very same trap.

A problem with all memory mapping APIs available in DX / Vulkan is that we have no control over which memory is used for the mapping, and neither the drivers nor the runtime take care of handing us properly initialized, committed memory.

Take e.g. ID3D11DeviceContext::Map with D3D11_MAP_WRITE_DISCARD. The pointer returned by that method points to memory which is still mapped to the zero page; it's not actually backed by RAM yet. Not only are you forced to perform an additional copy, but writes to the target memory region also trigger the page fault handler constantly in order to fix up the zero-page mapping.
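
A minimal sketch of that pattern, assuming a buffer created with D3D11_USAGE_DYNAMIC and CPU write access (the helper name and parameters are mine, not from any API); the memcpy into the mapped pointer is exactly where the per-page soft faults pile up:

```cpp
#include <d3d11.h>
#include <cstring>

// Hypothetical helper: the usual Map / memcpy / Unmap path for a dynamic buffer.
// The memcpy is where every untouched page of the mapped range takes a soft
// page fault, because the returned memory is typically still zero-page backed.
bool UploadViaMapDiscard(ID3D11DeviceContext* ctx, ID3D11Buffer* buffer,
                         const void* src, size_t size)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (FAILED(ctx->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        return false;

    std::memcpy(mapped.pData, src, size); // extra copy + page faults on fresh pages
    ctx->Unmap(buffer, 0);
    return true;
}
```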

Fancy trying ID3D11DeviceContext::UpdateSubresource instead? The interface of this method looks like it's copy-free, but that is also far from true. It only behaves copy-free if the GPU happens to be idle. If it isn't, the runtime tries to be sneaky and performs an internal copy for you, whether you want it or not. Things don't really get much better from this point on.
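
For comparison, a sketch of the UpdateSubresource path (again with placeholder names); whether the call really is copy-free or silently copies the source into internal staging memory depends entirely on whether the GPU still has the resource in flight:

```cpp
#include <d3d11.h>

// Hypothetical helper: the "looks copy-free" path. If the GPU is still using
// `buffer`, the runtime copies `src` into an internal staging allocation before
// returning, so the call is only actually copy-free when the resource is idle.
void UploadViaUpdateSubresource(ID3D11DeviceContext* ctx, ID3D11Buffer* buffer,
                                const void* src, UINT size)
{
    D3D11_BOX box = { 0, 0, 0, size, 1, 1 };            // bytes [0, size) of the buffer
    ctx->UpdateSubresource(buffer, 0, &box, src, 0, 0); // pitches are ignored for buffers
}
```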


As far as I can tell, there is no method in all of Vulkan/DX11/DX12 which would allow a guaranteed copy-free buffer upload directly using user-provided memory as the source. Neither is there one which would at least provide you with suitable CPU-side staging buffers.

Why is that? "Security" considerations, the risk that the memory may be modified at inopportune points in time? Understandable, but does that really justify such a severe performance impact?

Just so you don't get me wrong:
A copy itself isn't bad; you wouldn't even notice a straight copy from committed RAM to committed RAM. Where it gets bad is when the runtime starts allocating buffers directly from the kernel (instead of an in-process heap), or when the memory can't be trivially written to because it isn't fully committed yet.

What could the APIs improve?

Well, for starters: if I request "mapped" memory, guarantee that the host-side memory is already pinned to RAM. And do so without thrashing the system memory management, say by finally using an in-process heap so the Windows kernel doesn't have to take all the load. This would not even break the existing interface!

Preferable would be the ability to provide user-defined CPU-side buffers instead, and, rather than the API pretending to be non-blocking, to either block outright or provide a callback / signal on completion.
I don't mind having to allocate these buffers via a specialized API, as long as I can reuse them as often as I want, and as long as their lifetime is not tied to the lifetime of a GPU-side resource.


What can we do to hack around it?

So far I have only found a single workaround: using the VirtualLock / VirtualUnlock Windows APIs. For the page fault handler performance bug, that is the only working solution when using the "Map" style interfaces.
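
A minimal sketch of that workaround (the wrapper is mine; only VirtualLock / VirtualUnlock and the D3D11 calls are real APIs): locking the freshly mapped range commits and pins all of its pages in one go, so the following memcpy no longer takes a soft fault per page. Note that VirtualLock can fail if the range exceeds the process' minimum working set, in which case the working set has to be raised first, e.g. via SetProcessWorkingSetSize.

```cpp
#include <windows.h>
#include <d3d11.h>
#include <cstring>

// Hypothetical wrapper around the Map / memcpy / Unmap path with the
// VirtualLock workaround: VirtualLock commits and pins the whole mapped range
// up front, so the memcpy no longer takes a soft page fault per page.
bool UploadViaMapWithVirtualLock(ID3D11DeviceContext* ctx, ID3D11Buffer* buffer,
                                 const void* src, size_t size)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (FAILED(ctx->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        return false;

    // Fault the destination in with a single call instead of page by page.
    // This can fail if `size` exceeds the minimum working set of the process;
    // in that case the working set would have to be raised beforehand.
    BOOL locked = VirtualLock(mapped.pData, size);

    std::memcpy(mapped.pData, src, size);

    if (locked)
        VirtualUnlock(mapped.pData, size); // release the lock again

    ctx->Unmap(buffer, 0);
    return true;
}
```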

It also helps to trigger all unavoidable page faults and Windows API calls on the same thread. This unfortunately means having a single, heavy thread which has to contain every API call that could possibly trigger the page fault handler.
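
One possible way to structure that, sketched with plain standard C++ threading (the queue design is my own, not from the post): funnel every job that may fault, i.e. Map, the memcpy into mapped memory, Unmap, VirtualLock, through one dedicated upload thread.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical single "upload thread": every piece of work that may hit the
// page fault handler (Map, memcpy into mapped memory, Unmap, VirtualLock, ...)
// is pushed here as a job, so those faults stay off all other worker threads.
class UploadThread
{
public:
    UploadThread() : worker_([this] { Run(); }) {}

    ~UploadThread()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Queue a job; it will run on the single upload thread.
    void Submit(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void Run()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty())
                    return; // done_ was set and the queue is drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(); // all the page faults happen here, on this one thread
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_; // declared last so it starts after the other members
};
```

A caller would then do e.g. uploader.Submit([&]{ UploadViaMapWithVirtualLock(ctx, buffer, data, size); }); instead of mapping from its own thread.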

Keeping everything on one thread also means avoiding buffer mappings in a deferred context. As elegant as it sounds, it will kill your performance.
 