Asynchronous GPU-to-CPU memory transfer

purpledog

Newcomer
Is there any way I can transfer data (a texture for instance, preferably a big one) from the video memory to the main memory in an asynchronous manner?

I guess the answer comes with different flavors depending on the PC API:
- d3d9
- d3d10
- opengl
- cuda
- CTM

And why not:
- PS3
- Xbox360
 
Thanks, here is the corresponding example code found in:
http://www.opengl.org/registry/specs/ARB/pixel_buffer_object.txt

Any clue about the other API/platform?



// Example adapted from the ARB_pixel_buffer_object spec.
const int imagewidth  = 640;
const int imageheight = 480;
const int imageSize   = imagewidth * imageheight * 4;

#define BUFFER_OFFSET(i) ((char *)NULL + (i))

GLuint imageBuffers[2];
void *pboMemory1, *pboMemory2;

glGenBuffers(2, imageBuffers);

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[0]);
glBufferData(GL_PIXEL_PACK_BUFFER_ARB, imageSize / 2, NULL,
             GL_STREAM_READ);

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[1]);
glBufferData(GL_PIXEL_PACK_BUFFER_ARB, imageSize / 2, NULL,
             GL_STREAM_READ);

// Render to the framebuffer.
glDrawBuffer(GL_BACK);
renderScene();

// Bind two different buffer objects and start the glReadPixels
// asynchronously. Each call returns directly after starting the
// DMA transfer.
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[0]);
glReadPixels(0, 0, imagewidth, imageheight / 2, GL_BGRA,
             GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[1]);
glReadPixels(0, imageheight / 2, imagewidth, imageheight / 2, GL_BGRA,
             GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

// Process the partial images. Mapping a buffer waits for any
// outstanding DMA transfer into that buffer to finish.
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[0]);
pboMemory1 = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
processImage(pboMemory1);

glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[1]);
pboMemory2 = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
processImage(pboMemory2);

// Unmap the image buffers.
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[0]);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, imageBuffers[1]);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
 
In D3D10:

  1. Create a D3D10_USAGE_STAGING texture with the D3D10_CPU_ACCESS_READ flag. This allocates memory optimized for CPU reads -- likely in cached system RAM.
  2. Call CopyResource() or CopySubresourceRegion() to copy from whatever the source resource is (e.g. your rendertarget) into the staging resource.
  3. (optional) Call Flush() to make sure the copy command gets sent to the GPU immediately.
  4. (optional) Call Map() on the staging resource with D3D10_MAP_READ and the D3D10_MAP_FLAG_DO_NOT_WAIT flag. This returns DXGI_ERROR_WAS_STILL_DRAWING if the copy hasn't finished yet; you can then go do something else for a while and try again later.
  5. Once you've got nothing else to do, wait for the copy to finish by calling Map() without the DO_NOT_WAIT flag.
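
The steps above can be sketched roughly like this. This is a hedged, untested sketch: `device`, `renderTarget`, `doSomethingElse`, and `readPixels` are assumed/hypothetical names, and error handling is omitted.

// Sketch of the staging-readback pattern described above.
D3D10_TEXTURE2D_DESC desc;
renderTarget->GetDesc(&desc);
desc.Usage          = D3D10_USAGE_STAGING;   // step 1
desc.BindFlags      = 0;
desc.CPUAccessFlags = D3D10_CPU_ACCESS_READ;
desc.MiscFlags      = 0;

ID3D10Texture2D *staging = NULL;
device->CreateTexture2D(&desc, NULL, &staging);

// Step 2: queue the GPU-side copy into the staging resource.
device->CopyResource(staging, renderTarget);

// Step 3 (optional): push the command buffer to the GPU now.
device->Flush();

// Step 4: poll without blocking; do other work between attempts.
D3D10_MAPPED_TEXTURE2D mapped;
HRESULT hr;
while ((hr = staging->Map(0, D3D10_MAP_READ,
                          D3D10_MAP_FLAG_DO_NOT_WAIT, &mapped))
           == DXGI_ERROR_WAS_STILL_DRAWING)
{
    doSomethingElse();  // hypothetical CPU-side work
}

// Step 5: the Map has succeeded (or failed for another reason).
if (SUCCEEDED(hr))
{
    readPixels(mapped.pData, mapped.RowPitch);  // hypothetical consumer
    staging->Unmap(0);
}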
 
I thought CUDA was async for memory copies, but it appears not:
The only functions from the runtime that are not asynchronous are the functions that perform memory copies between the host and the device, the functions that initializes and terminates interoperability with a OpenGL or Direct3D, the functions that register, unregister, map, and unmap an OpenGL buffer object or a Direct3D vertex buffer, and the functions that free memory.
 
Cool! Do you know if that's available in D3D9 as well?
D3D9 isn't as good as D3D10 because it doesn't have the new virtualization of WDDM under it (D3D9Ex is probably better though).

You can use D3DLOCK_DONOTWAIT for a lock on a surface, but I can't remember off the top of my head if you can actually get a CPU-accessible surface to/from a texture with an async copy. You'd have to dig around in the docs for that.

hth
Jack
 
The right D3D9 function to copy data from GPU memory to CPU memory is GetRenderTargetData. But at the moment I am not sure how asynchronously it works. I need to check the WDK for more information.
 
I thought CUDA was async for memory copies, but it appears not:

That's not completely true: in the 0.8 version it was completely synchronous, but in the current 1.0 version you can launch a memory copy and let the CPU do other work after that. However, once you issue the next CUDA function, it will block until the previous copy has completed.

OTOH, it's currently(?) not possible to copy memory between host and device and run a kernel on the GPU at the same time.
 
That's not completely true: in the 0.8 version it was completely synchronous, but in the current 1.0 version you can launch a memory copy and let the CPU do other work after that. However, once you issue the next CUDA function, it will block until the previous copy has completed.

Are you talking about cudaMemcpy (for instance)?
I cannot find any reference about that. Any idea?
 
The right D3D9 function to copy data from GPU memory to CPU memory is GetRenderTargetData. But at the moment I am not sure how asynchronously it works. I need to check the WDK for more information.

I had a look on the web, but there's no clear answer.

Apparently, in the best case, GetRenderTargetData triggers the transfer, and a "lock" of the surface makes sure the transfer is over. As far as I can tell, there's no way to check whether it's complete without forcing a sync.

But that's the best case, and I have the impression that it's highly dependent on the driver/hardware... For instance, in some configurations:
- the call to GetRenderTargetData triggers a sync
- GetRenderTargetData doesn't even trigger the transfer, but the lock does

Can anyone clarify?
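
For what it's worth, that best case would look roughly like this in D3D9. A hedged sketch only: `device`, `renderTarget`, `width`, `height`, `doSomethingElse`, and `processImage` are assumed/hypothetical names, and, as discussed above, how much of this actually overlaps with CPU work is driver-dependent.

// Assumed: an IDirect3DDevice9 *device and an IDirect3DSurface9
// *renderTarget of width x height, format D3DFMT_A8R8G8B8.
IDirect3DSurface9 *sysmem = NULL;
device->CreateOffscreenPlainSurface(width, height, D3DFMT_A8R8G8B8,
                                    D3DPOOL_SYSTEMMEM, &sysmem, NULL);

// Queue the GPU->CPU copy. How asynchronous this is (whether it only
// triggers the transfer or forces a sync) depends on the driver.
device->GetRenderTargetData(renderTarget, sysmem);

// Poll with D3DLOCK_DONOTWAIT; D3DERR_WASSTILLDRAWING means the
// transfer has not completed yet.
D3DLOCKED_RECT rect;
while (sysmem->LockRect(&rect, NULL,
                        D3DLOCK_READONLY | D3DLOCK_DONOTWAIT)
           == D3DERR_WASSTILLDRAWING)
{
    doSomethingElse();  // hypothetical CPU-side work
}

processImage(rect.pBits, rect.Pitch);  // hypothetical consumer
sysmem->UnlockRect();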
 
Are you talking about cudaMemcpy (for instance)?
I cannot find any reference about that. Any idea?

As I tried to look for information, I couldn't find any real reference myself. It looks like I started to confuse asynchronous kernel execution with memory copies. Sorry about that! :oops:
 
http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf

4.5.1.5 Asynchronicity
[...]
The only functions from the runtime that are not asynchronous are the functions that perform memory copies between the host and the device, the functions that initializes and terminates interoperability with a OpenGL or Direct3D, the functions that register, unregister, map, and unmap an OpenGL buffer object or a Direct3D vertex buffer, and the functions that free memory.

:cry:
 
From the CUDA forums:
We are always working on reducing driver overhead, but I think the biggest benefit forthcoming will be CUDA v1.1's improved support for CPU/GPU overlap. The CPU will be able to continue executing code (including the driver) while the GPU is memcpy'ing or processing data, so driver overhead at least will be hidden as long as the GPU is busy.

Apps will need some updating to add the needed synchronization.
:yes:
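
Under the forthcoming stream API (CUDA 1.1), that CPU/GPU overlap would look roughly like this. A sketch only, with `N`, `dev_buf`, `doSomethingElse`, and `processResults` as assumed/hypothetical names; error checking is omitted.

// Overlapping a device->host copy with CPU work via a CUDA stream.
// dev_buf is assumed to already hold the GPU results.
float *host_buf;
cudaStream_t stream;
cudaStreamCreate(&stream);

// Async copies require page-locked (pinned) host memory.
cudaMallocHost((void **)&host_buf, N * sizeof(float));

// Returns to the CPU immediately; the copy runs in the background.
cudaMemcpyAsync(host_buf, dev_buf, N * sizeof(float),
                cudaMemcpyDeviceToHost, stream);

doSomethingElse();  // hypothetical CPU-side work overlapping the copy

// Explicit synchronization before touching host_buf -- this is the
// "needed synchronization" the forum post mentions apps must add.
cudaStreamSynchronize(stream);
processResults(host_buf);  // hypothetical consumer

cudaFreeHost(host_buf);
cudaStreamDestroy(stream);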
 
Good news...

I wonder if there's some "awkward limitation" to that?

Obviously some of the memory bandwidth is taken up, so the CPU cannot access the main RAM at normal speed. Same thing for the GPU with the video RAM.

But that's the ideal case; I'm wondering if there's more to it...
 