We had discussion related to custom vertex fetch in another thread, and I decided to write a little bit more about it.
DX10 introduced SV_VertexID system value semantic in the vertex shader. This special input gives you access to the vertex index (from the index buffer) inside the vertex shader. This allows you to manually fetch the vertex data from typed buffers or structured buffers (or using custom vfetch on Xbox 360), instead of using the hardware vertex buffer fetch. It is worth noting that custom vertex fetch is fully compatible with the hardware index buffer, and fully supports vertex shader result caching. Only one instance of a vertex shader is executed for each unique index buffer value (assuming the previous VS result of the same index is found in the post-transform cache, aka the parameter cache). Modern GPUs (such as GCN) cache input vertex data in the general purpose L1/L2 caches (no matter whether it originates from a vertex buffer or some other indexable buffer type).
Using multiple typed buffers in SoA layout (one for position, one for UV, one for tangent frame) is actually faster than using "hardware" vertex buffers on AMD GPUs. This is because: a) the GPU can coalesce the reads, b) the GPU can interleave loads with ALU (hide the latency of the position/UV/tangent loads while transforming the position/tangents, etc).
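A minimal HLSL sketch of custom vertex fetch with an SoA layout (the register assignments, `viewProj` constant and attribute formats are illustrative assumptions, not from the original post):

```hlsl
// SoA layout: one typed buffer per attribute, all indexed by SV_VertexID.
Buffer<float3> positions : register(t0);
Buffer<float2> uvs       : register(t1);
Buffer<float4> tangents  : register(t2); // e.g. quaternion tangent frame

cbuffer Transforms : register(b0)
{
    float4x4 viewProj;
};

struct VSOut
{
    float4 pos     : SV_Position;
    float2 uv      : TEXCOORD0;
    float4 tangent : TEXCOORD1;
};

VSOut main(uint vertexId : SV_VertexID)
{
    // vertexId is the value read from the index buffer, exactly the same
    // value the hardware vertex fetch would have used.
    VSOut o;
    o.pos     = mul(viewProj, float4(positions[vertexId], 1.0f));
    o.uv      = uvs[vertexId];
    o.tangent = tangents[vertexId];
    return o;
}
```

No input layout or vertex buffer bindings are needed for this shader; the index buffer alone drives the fetch.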
When preprocessing vertex data for GPU rendering, it is a common practice to duplicate vertices that have different UV or different normal/tangent. This is required, because the hardware vertex fetch only supports one index buffer.
With programmable vertex fetch, you can trick the index buffering hardware by bit packing two 16 bit indices into one 32 bit index. The low bits point to the position data, and the high bits point to the UV/tangent data. Use SoA layout to separate the position data from the other data, and unpack the high/low bits of SV_VertexID inside the vertex shader to get the two indices. This way you don't need to duplicate any vertex data. This saves both memory and memory bandwidth (as vertex data is more likely to be inside the L1/L2 caches). This trick doesn't generate any extra vertex shader invocations, as there are exactly as many different indices in the index buffer (and in the same order).
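The unpacking is just a couple of bit operations in the vertex shader. An HLSL sketch (buffer names and registers are assumptions for illustration):

```hlsl
// Two 16 bit indices packed into each 32 bit index buffer entry:
// low 16 bits select the position, high 16 bits select the UV/tangent set.
Buffer<float3> positions : register(t0);
Buffer<float2> uvs       : register(t1);

cbuffer Transforms : register(b0) { float4x4 viewProj; };

struct VSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };

VSOut main(uint packedId : SV_VertexID)
{
    uint positionIndex = packedId & 0xFFFF; // low 16 bits
    uint uvIndex       = packedId >> 16;    // high 16 bits

    VSOut o;
    o.pos = mul(viewProj, float4(positions[positionIndex], 1.0f));
    o.uv  = uvs[uvIndex];
    return o;
}
```

The post-transform cache still deduplicates on the full 32 bit packed value, so each unique (position, UV) combination is shaded at most once, just like with replicated vertices.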
16 high/low bits limit you to 65536 different positions and 65536 different UV/tangent pairs. So this only supports vertex buffers up to 65536 * vertexSize bytes (a similar limitation to 16 bit indices).
If you need more vertices, there is a way around this limit as well. Let's assume you never need more than four different UV/tangent pairs per position (a highly reasonable assumption). You reserve the two high bits of the 32 bit index for the UV/tangent set index. The low 30 bits are used as a DWORD aligned memory offset into a raw buffer (raw address = low bits * 4). This offset is unpacked in the vertex shader, and all the vertex data (this time in AoS layout) is read from a single raw buffer. UV is read from baseVertexAddress + positionDataSize + uvSize * uvSetIndex, where uvSetIndex is decoded from the highest 2 bits of SV_VertexID. In this layout, the vertex data always starts with the position data, followed by 1-4 UV/tangent pairs. Vertex size is thus variable. No data replication is needed at all. Total vertex data size is identical to the previous trick, but this trick supports vertex buffers up to 4 GB in size (= limitless in practice). Again, there are no extra vertex shader invocations compared to a traditional indexed draw call (with replicated vertex data).
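An HLSL sketch of this variable-size vertex scheme using a raw (byte address) buffer. The attribute sizes here (12 byte position, 8 byte UV set) are illustrative assumptions; a full tangent frame would make the per-set stride larger:

```hlsl
// Variable size vertices in a single raw buffer. Each index buffer entry:
// high 2 bits = UV/tangent set index, low 30 bits = DWORD offset of the vertex.
ByteAddressBuffer vertexData : register(t0);

cbuffer Transforms : register(b0) { float4x4 viewProj; };

struct VSOut { float4 pos : SV_Position; float2 uv : TEXCOORD0; };

VSOut main(uint packedId : SV_VertexID)
{
    uint uvSetIndex        = packedId >> 30;               // 0-3
    uint baseVertexAddress = (packedId & 0x3FFFFFFF) * 4;  // DWORD aligned byte address

    // Position always comes first (12 bytes), then 1-4 UV sets (8 bytes each here).
    float3 position  = asfloat(vertexData.Load3(baseVertexAddress));
    uint   uvAddress = baseVertexAddress + 12 + 8 * uvSetIndex;
    float2 uv        = asfloat(vertexData.Load2(uvAddress));

    VSOut o;
    o.pos = mul(viewProj, float4(position, 1.0f));
    o.uv  = uv;
    return o;
}
```

The offline preprocessing step writes each unique position followed by its UV/tangent sets, then emits index buffer entries pointing at the vertex start with the appropriate set index in the top bits.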
There are other data compression tricks possible with abusing the index buffering hardware and custom vertex fetch.