Delta resource updates

Ext3h

This is about the old topic of resource updates.

If you go through the published material from the past five years or so, you essentially only see techniques that upload buffers to the GPU in their entirety.

What does differ is how the buffers are allocated, and in rare cases you see use of ID3D11DeviceContext::UpdateSubresource / glBufferSubData and the like for updating sub-allocations, but ultimately it always boils down to a direct write to the final buffer, either via a discard strategy or a synchronous map.

What I personally found to work quite well is to record only deltas as updates are signaled by the engine (actually: record dirty entities, serialize them into a delta in bulk, and copy that to a buffer) and to patch the persistent / resident buffers with compute shaders. And while at it, also switch the data layout in that step: from an interleaved / packed form for delta recording / upload to formats better suited for further processing on the GPU.
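A minimal host-side sketch of the recording step described above, assuming a simple dirty list per frame (the names DeltaRecorder, markDirty and buildDelta are illustrative, not from any engine):

```cpp
#include <cstdint>
#include <vector>

// Host-side mirror of the shader structs: only dirty entities are
// serialized into a compact DataUpdate array each frame.
struct Vec4 { float x, y, z, w; };

struct Data {
    Vec4 position;
    // ... further correlated attributes
};

struct DataUpdate {
    uint32_t offset; // destination slot in the resident buffer
    Data     data;
};

struct DeltaRecorder {
    std::vector<uint32_t> dirty;    // entity indices marked this frame
    std::vector<Data>     entities; // authoritative host-side state

    void markDirty(uint32_t entity) { dirty.push_back(entity); }

    // Serialize dirty entities into a bulk delta, then clear the dirty list.
    // The returned array would be copied into a mapped staging buffer and
    // consumed by the patch compute shader.
    std::vector<DataUpdate> buildDelta() {
        std::vector<DataUpdate> delta;
        delta.reserve(dirty.size());
        for (uint32_t e : dirty)
            delta.push_back({ e, entities[e] });
        dirty.clear();
        return delta;
    }
};
```

The delta stays in the packed, interleaved form here; the layout switch to GPU-friendly per-attribute buffers happens only in the compute shader.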

The resident buffers have been pre-allocated with sufficient spare room to grow, and even the initial content is streamed by this technique.
Code:
struct Data {
    // Long list of attributes where updates are correlated
    vec4 position;
    ...
};

struct DataUpdate {
    uint offset;
    Data data;
};

layout(std430, binding = 0) readonly restrict buffer src0
{
    DataUpdate data[];
} dataUpdate;

layout(std430, binding = 1) writeonly restrict buffer dst0
{
    vec4 data[];
} position;

...

uniform int uUpdateSize;

layout(local_size_x = 64) in;

void main()
{
    uint id = gl_GlobalInvocationID.x;
    if (id < uint(uUpdateSize))
    {
        DataUpdate update = dataUpdate.data[id];
        position.data[update.offset] = update.data.position;
        ...
    }
}

// As you may have realized, this is OpenGL 4.3 syntax, so yes, this approach works perfectly with the older APIs too
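For completeness, a sketch of the host side of the dispatch: with local_size_x = 64 the group count has to round up so the tail of the update list is covered, and the in-shader bounds check discards the surplus invocations. All names below (patchProgram, deltaBuffer, residentBuffer, uUpdateSizeLoc) are illustrative assumptions, not from the shader above:

```cpp
#include <cstdint>

// Must match local_size_x in the compute shader.
constexpr uint32_t kLocalSizeX = 64;

// Round up so partial workgroups at the end of the delta are still launched.
uint32_t groupCount(uint32_t updateCount) {
    return (updateCount + kLocalSizeX - 1) / kLocalSizeX;
}

// The actual GL dispatch would then look roughly like:
//   glUseProgram(patchProgram);
//   glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, deltaBuffer);
//   glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, residentBuffer);
//   glUniform1i(uUpdateSizeLoc, (GLint)updateCount);
//   glDispatchCompute(groupCount(updateCount), 1, 1);
//   glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
```

The memory barrier is needed before any pass that reads the patched resident buffer as an SSBO.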

In naive performance tests, this proved to outperform every other option by far. Specifically:
  • the combination of delta buffer + compute shader outperformed any form of Map, be it a discard strategy with a full re-upload or sparse writes through a synchronous map. Just in terms of raw update time, not even accounting for pipeline stalls.
  • significantly lower overhead compared to any sparse update method provided by the individual graphics APIs.
  • lower VRAM footprint compared to any buffer-rotation strategy. That benefit even holds if you start to pre-record partial deltas for future frames!
  • trivial to buffer / split updates if the amount of data exceeds the available transfer volume for a frame.
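The splitting mentioned in the last point is simple because each slice of the update list is a valid stand-alone dispatch. A hypothetical sketch, with the per-frame budget expressed in update records (splitDelta and UpdateRange are made-up names for illustration):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One slice of the delta, uploaded and dispatched in a single frame.
struct UpdateRange {
    uint32_t first; // index of the first DataUpdate in this slice
    uint32_t count; // number of updates handled this frame
};

// Split a delta that exceeds the per-frame transfer budget into
// consecutive slices; later slices are simply deferred to later frames.
std::vector<UpdateRange> splitDelta(uint32_t totalUpdates,
                                    uint32_t perFrameBudget) {
    std::vector<UpdateRange> frames;
    for (uint32_t first = 0; first < totalUpdates; first += perFrameBudget)
        frames.push_back({ first,
                           std::min(perFrameBudget, totalUpdates - first) });
    return frames;
}
```

Since the patch shader only writes the slots named in each slice, deferring part of a delta just means those entities keep their previous state for a frame or two.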
What came as quite a surprise was how robust this approach is with regard to recording deltas out of order, i.e. in whatever order was most efficient from the engine's perspective (with regard to host-side cache hit rate). The samples may not have been representative, but there was no significant slowdown when the destination offsets were sparse or even "randomly" ordered during the initial, dense update.

Of course this isn't a novel invention (see e.g. https://on-demand.gputechconf.com/g...bisch-pierre-boudier-gpu-driven-rendering.pdf page 31), but it's somewhat of a surprise to see so little awareness of something that works so well.
 