We had discussion related to custom vertex fetch in another thread, and I decided to write a little bit more about it.
DX10 introduced SV_VertexID system value semantic in the vertex shader. This special input gives you access to the vertex index (from the index buffer) inside the vertex shader. This allows you to manually fetch the vertex data from typed buffers or structured buffers (or using custom vfetch on Xbox 360), instead of using the hardware vertex buffer fetch. It is worth noting that custom vertex fetch is fully compatible with the hardware index buffer, and fully supports vertex shader result caching. Only one instance of a vertex shader is executed for each unique index buffer value (assuming the previous VS result of the same index is found in the post-transform cache, aka the parameter cache). Modern GPUs (such as GCN) cache input vertex data in the general purpose L1/L2 caches (no matter whether it originates from a vertex buffer or some other indexable buffer type).
Using multiple typed buffers in SoA layout (one for position, one for UV, one for tangent frame) is actually faster than using "hardware" vertex buffers on AMD GPUs. This is because: a) the GPU can coalesce the reads, b) the GPU can interleave loads with ALU (hide the latency of the position/UV/tangent loads while transforming the position/tangents, etc).
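A minimal HLSL sketch of custom vertex fetch with an SoA layout (the register assignments, `viewProj` constant and attribute formats are illustrative assumptions, not from the original post):

```hlsl
// SoA layout: one typed buffer per attribute, all indexed by SV_VertexID.
Buffer<float3> positions : register(t0);
Buffer<float2> uvs       : register(t1);
Buffer<float4> tangents  : register(t2); // e.g. quaternion tangent frame

cbuffer Transforms : register(b0)
{
    float4x4 viewProj;
};

struct VSOut
{
    float4 pos     : SV_Position;
    float2 uv      : TEXCOORD0;
    float4 tangent : TEXCOORD1;
};

VSOut main(uint vertexId : SV_VertexID)
{
    // vertexId is the value read from the index buffer, exactly the same
    // value the hardware vertex fetch would have used.
    VSOut o;
    o.pos     = mul(viewProj, float4(positions[vertexId], 1.0f));
    o.uv      = uvs[vertexId];
    o.tangent = tangents[vertexId];
    return o;
}
```

No input layout or vertex buffer bindings are needed for this shader; the index buffer alone drives the fetch.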
When preprocessing vertex data for GPU rendering, it is a common practice to duplicate vertices that have different UV or different normal/tangent. This is required, because the hardware vertex fetch only supports one index buffer.
With programmable vertex fetch, you can trick the index buffering hardware by bit packing two 16 bit indices into one 32 bit index. The low bits point to the position data, and the high bits point to the UV/tangent data. Use SoA layout to separate the position data from the other data, and unpack the high/low bits of SV_VertexID inside the vertex shader to get the two indices. This way you don't need to duplicate any vertex data. This saves both memory and memory bandwidth (as vertex data is more likely to be inside the L1/L2 caches). This trick doesn't generate any extra vertex shader invocations, as there are exactly as many different indices in the index buffer (and in the same order).
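The unpacking is just a couple of bit operations in the vertex shader. An HLSL sketch (buffer names and registers are assumptions for illustration):

```hlsl
// Two 16 bit indices packed into each 32 bit index buffer entry:
// low 16 bits select the position, high 16 bits select the UV/tangent set.
Buffer<float3> positions : register(t0);
Buffer<float2> uvs       : register(t1);

cbuffer Transforms : register(b0) { float4x4 viewProj; };

struct VSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };

VSOut main(uint packedId : SV_VertexID)
{
    uint positionIndex = packedId & 0xFFFF; // low 16 bits
    uint uvIndex       = packedId >> 16;    // high 16 bits

    VSOut o;
    o.pos = mul(viewProj, float4(positions[positionIndex], 1.0f));
    o.uv  = uvs[uvIndex];
    return o;
}
```

The post-transform cache still deduplicates on the full 32 bit packed value, so each unique (position, UV) combination is shaded at most once, just like with replicated vertices.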
16 high/low bits limit you to 65536 different positions and 65536 different UV/tangent pairs. So this only supports vertex buffers up to 65536 * vertexSize bytes (a similar limitation to 16 bit indices).
If you need more vertices, there is a way around this limit as well. Let's assume you never need more than four different UV/tangent pairs per position (a highly reasonable assumption). You reserve the two high bits of the 32 bit index for the UV/tangent set index. The low 30 bits are used as a DWORD aligned memory offset into a raw buffer (raw address = low bits * 4). This offset is unpacked in the vertex shader, and all the vertex data (this time in AoS layout) is read from a single raw buffer. UV is read from baseVertexAddress + positionDataSize + uvSize * uvSetIndex, where uvSetIndex is decoded from the highest 2 bits of SV_VertexID. In this layout, the vertex data always starts with the position data, followed by 1-4 UV/tangent pairs. Vertex size is thus variable. No data replication is needed at all. Total vertex data size is identical to the previous trick, but this trick supports vertex buffers up to 4 GB in size (= limitless in practice). Again, there are no extra vertex shader invocations compared to a traditional indexed draw call (with replicated vertex data).
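An HLSL sketch of this variable-size vertex scheme using a raw (byte address) buffer. The attribute sizes here (12 byte position, 8 byte UV set) are illustrative assumptions; a full tangent frame would make the per-set stride larger:

```hlsl
// Variable size vertices in a single raw buffer. Each index buffer entry:
// high 2 bits = UV/tangent set index, low 30 bits = DWORD offset of the vertex.
ByteAddressBuffer vertexData : register(t0);

cbuffer Transforms : register(b0) { float4x4 viewProj; };

struct VSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };

VSOut main(uint packedId : SV_VertexID)
{
    uint uvSetIndex        = packedId >> 30;               // 0-3
    uint baseVertexAddress = (packedId & 0x3FFFFFFF) * 4;  // DWORD aligned byte address

    // Position always comes first (12 bytes), then 1-4 UV sets (8 bytes each here).
    float3 position  = asfloat(vertexData.Load3(baseVertexAddress));
    uint   uvAddress = baseVertexAddress + 12 + 8 * uvSetIndex;
    float2 uv        = asfloat(vertexData.Load2(uvAddress));

    VSOut o;
    o.pos = mul(viewProj, float4(position, 1.0f));
    o.uv  = uv;
    return o;
}
```

The offline preprocessing step writes each unique position followed by its UV/tangent sets, then emits index buffer entries pointing at the vertex start with the appropriate set index in the top bits.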
There are other data compression tricks possible with abusing the index buffering hardware and custom vertex fetch.