Vertex shaders and backface culling

Nick

Veteran
Hi all,

I was wondering how modern GPUs combine vertex shaders and backface culling. Obviously it's very beneficial to perform backface culling as early as possible, but shaders can do much more than just one matrix transform to calculate camera-space vertex positions. So is it even possible at all to do backface culling before the vertex shaders? Splitting the shader into a transform part and a lighting+texturing part seems one way, but it involves extra overhead and could have bad worst-case performance. Or are they just using far more advanced tricks?

Just curious...
 
It's not entirely straightforward, because backface culling is performed on triangles, and each vertex might belong to multiple triangles.

You could do backface culling with a list of face normals and the camera position transformed into model space. That would require just a single dp3 per triangle. But it's not possible with current APIs, because there is no way to provide face normals.
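
Something like this (a minimal sketch of the point-camera variant: one dp3 against the camera position plus a precomputed plane offset per triangle; all types and names are illustrative):

```cpp
// Minimal sketch: per-triangle backface culling in model space, assuming
// the application could precompute a plane (face normal + offset) per
// triangle. All types and names are illustrative.
#include <cstddef>
#include <cstdint>

struct Vec3 { float x, y, z; };

static float dp3(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

struct TrianglePlane
{
    Vec3  normal;  // precomputed face normal (model space)
    float d;       // precomputed dp3(normal, v0), the plane offset
};

// A triangle is frontfacing when the camera lies on the positive side of
// its plane. With the camera position transformed into model space once
// per mesh, this costs one dp3 plus a compare per triangle.
void cullBackfaces(const TrianglePlane* planes, std::size_t count,
                   const Vec3& cameraModelSpace, std::uint8_t* frontfacing)
{
    for (std::size_t i = 0; i < count; i++)
        frontfacing[i] = dp3(planes[i].normal, cameraModelSpace) > planes[i].d;
}
```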

Splitting the shader is a possibility. You could start by calculating only the clip-space position per vertex and writing it to the post-transform vertex cache, along with a flag indicating that further processing is required. Backface culling (and maybe clipping) then fetches the position data of three vertices from the post-transform cache and determines whether the triangle is going to be visible. If so, for each vertex that requires further processing the second part of the vertex shader is executed, writing more data to the post-transform cache and clearing the flag.

It's a bit more complex, though, especially regarding the scheduling of "second half" vertex tasks, fetching vertex data from memory, and hiding the latency when further processing is required.
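
In CPU pseudocode the idea would look something like this (a rough sketch; the transform and lighting are placeholders and all names are invented, and real hardware would use dedicated scheduling logic rather than loops):

```cpp
#include <array>
#include <cstddef>
#include <initializer_list>
#include <vector>

using Vec4 = std::array<float, 4>;

struct CacheEntry
{
    Vec4 pos;          // clip-space position from phase 1
    Vec4 diffuse;      // example phase-2 output (lighting etc.)
    bool needsPhase2;  // flag: "further processing required"
};

// Phase 1: position only (a stand-in for the mvp transform).
static void runPositionPart(const Vec4& in, CacheEntry& out)
{
    out.pos = in;
    out.needsPhase2 = true;
}

// Phase 2: the rest of the shader (a stand-in for lighting/texturing).
static void runRemainderPart(CacheEntry& out)
{
    out.diffuse = {1.0f, 1.0f, 1.0f, 1.0f};
    out.needsPhase2 = false;  // shade each shared vertex only once
}

// Signed area of the projected triangle; <= 0 means backfacing.
// Clipping and w <= 0 handling are omitted for brevity.
static bool frontfacing(const Vec4& a, const Vec4& b, const Vec4& c)
{
    float ax = a[0] / a[3], ay = a[1] / a[3];
    float bx = b[0] / b[3], by = b[1] / b[3];
    float cx = c[0] / c[3], cy = c[1] / c[3];
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax) > 0.0f;
}

void processTriangles(const std::vector<unsigned>& indices,
                      const std::vector<Vec4>& input,
                      std::vector<CacheEntry>& cache)
{
    // Phase 1: compute only the clip-space position per vertex.
    // (A real post-transform cache would also avoid recomputing
    // shared vertices here; omitted for brevity.)
    for (unsigned i : indices)
        runPositionPart(input[i], cache[i]);

    // Cull, then finish shading only vertices of surviving triangles.
    for (std::size_t t = 0; t + 2 < indices.size(); t += 3)
    {
        unsigned i0 = indices[t], i1 = indices[t + 1], i2 = indices[t + 2];
        if (!frontfacing(cache[i0].pos, cache[i1].pos, cache[i2].pos))
            continue;
        for (unsigned i : {i0, i1, i2})
            if (cache[i].needsPhase2)
                runRemainderPart(cache[i]);
    }
}
```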
 
But in reality, they just brute-force it: transform all verts and cull in 2D at the rasterization stage.
 
Thanks for the information! Does the fixed-function pipeline also not perform early backface culling, then? Its performance is comparable to shaders if I'm not mistaken, but there early backface culling is a lot easier to implement, and it would be stupid not to do it.

I just find it hard to believe that they would use the brute-force approach, considering that early culling potentially improves geometry performance by 2x. It also seems best to avoid letting backfacing polygons enter the next stage at all.
 
Anyone got additional information on this?

I've been searching in vain for papers about advanced backface culling, but they all assume a fixed-function pipeline so the camera position can be transformed into object space. That's a big optimization, but it doesn't work with programmable shaders (without modification). Considering that modern graphics cards have comparable performance for the fixed-function and programmable pipelines, they must either be skipping the whole optimization (a considerable waste of transistors and clock cycles) or be using other approaches for early backface culling. I'll do some detailed experiments once I get back to my development system, but it would be great if anyone would share their expertise in this intriguing matter.

Is this the real reason why NVIDIA beats ATI with a much smaller chip? ;)
 
On modern GPUs the fixed-function pipeline is converted to shaders. With Vista they don't even have to do it in the driver, because the DirectX runtime contains the necessary modules.

Doing anything other than brute force here would require complex logic and larger caches between the different functional units. Maybe we will see some early cull-outs in chips with a unified shader architecture.
 
I believe most games aren't vertex-limited, so the brute-force approach must not be hurting much, if that's what's used. Unified shaders will give us more brute-force power too.

It seems to me that a deferred renderer would be trying to solve some of the same problems. With a programmable pipeline does a deferred renderer split the vertex shader into a position shader and everything else into a second shader? If it does break out a position shader then it could perform backface culling before finishing the vertex shader.

Maybe you won't find the backface culling paper you're looking for, but you might find the necessary ideas in other research. I don't know of any such papers; I'm just theorizing. If you find anything good, let us know.
 
To clarify what Xmas said, remember that vertex shaders process vertices, not triangles. Even if a triangle is backfacing, that doesn't necessarily mean you can skip shading its vertices. Take a look at this simple triangle strip:

[Image: backfaceculling.png — a triangle strip consisting of triangles A, B and C]


Here triangles A and C are frontfacing, while B is backfacing. Still, all vertices of B need to be fully shaded to produce A and C.

Assume we could split the shader into a position part and an "other" part. How much could we save? If we assume every other triangle is backfacing, we'd still not save 50%, for the reason above. On a low-res model we'd save just a little; on a high-res model we'd save more. Say on average we'd be able to cut off maybe 40% of the vertices. Those culled vertices still need to run the position part. That may or may not be a significant portion of the shader: for something like characters it may be a full vertex skinning implementation, which could be well over half the shader; in other cases it could be a simple mvp * vertex computation. If we assume the position part is 1/4 of the shader on average, the amount of work we still need to do is 60% + 1/4 * 40% = 70%, so we'd save 30% of the work. Given how seldom vertex shaders are the bottleneck, I'm not sure it would be worth it to optimize for this case.
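
In general, with a culled vertex fraction $c$ and a position part that is a fraction $p$ of the shader, the remaining work is

$$\text{work} = (1 - c) + p \cdot c, \qquad \text{savings} = c \cdot (1 - p),$$

so with $c = 0.4$ and $p = 1/4$ the savings are $0.4 \cdot 0.75 = 30\%$.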
 
Humus said:
To clarify what Xmas said, remember that vertex shaders process vertices, not triangles. Even if a triangle is backfacing, that doesn't necessarily mean you can skip shading its vertices. Take a look at this simple triangle strip:
...
Here triangles A and C are frontfacing, while B is backfacing. Still, all vertices of B need to be fully shaded to produce A and C.
Don't worry, I know backface culling theory and every other aspect of the standard graphics pipeline quite well. I wrote swShader, which evolved into SwiftShader. Thanks anyway, it's a nice example.
Assume we could split the shader into a position part and a "other" part. How much could we save? If we assume every other triangle is backfacing, we'd still not save 50% for the reason above.
Actually, you might. Imagine a sphere close to the camera with a 90-degree FOV: less than 50% of the geometry will be frontfacing. OK, this is an extreme example, but perspective definitely helps increase the culling ratio.
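
As a quick sanity check: for a sphere of radius $r$ with the camera at distance $d$ from its center, the frontfacing region the camera can see is a cap covering

$$\frac{2\pi r^2 (1 - r/d)}{4\pi r^2} = \frac{1}{2}\left(1 - \frac{r}{d}\right)$$

of the surface. That approaches 50% only as $d \to \infty$, and drops toward 0% as the camera approaches the surface.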
These still need to perform the position part. That may or may not be a significant portion of the shader. For something like characters it may be a full vertex skinning implementation, which could be well over half the shader. In other cases it could be a simple mvp * vertex computation.
Skinning is definitely expensive, but when I look at e.g. UT 2007 screenshots I see huge and highly detailed worlds. That's all static geometry with a single matrix transform for position but many instructions for lighting and texturing, which also consumes precious bandwidth.
If we assume the position part is 1/4 of the shader on average, that means the amount of work we need to do is 60% + 1/4 * 40% = 70%. So we'd save 30% of the work. Given how seldom vertex shaders are the bottleneck I'm not sure it would be worth it to try to optimize for this case.
I'm not so much concerned about vertex processing being the bottleneck (on current hardware). But if you can save 30% transistors for vertex processing and a bit of bandwidth this simply allows cheaper and faster hardware. Doesn't every $ and % count?

Experimenting with software rendering shows me that it does make a difference to cull early. So with a unified hardware architecture, even if there can't really be a vertex processing bottleneck, every clock cycle you waste there has a direct influence on framerates...
 
Theoretically (that is, unless I am missing something) the driver could do this splitting itself for each vertex shader that is submitted/compiled.
It would trace backwards from the oPos register through all the computations that influence it, yielding a simple shader that only does the transform part, and factor out the rest of the shader (set-up, texcoord computations, fancy stuff) into a separate vertex program.
The simple transform shader then runs for each and every vertex, but is only followed by the second shader when the triangle is visible.
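
A sketch of that backward trace (hypothetical instruction/register types; a real driver would work on its own IR, and this ignores write masks and the fact that remainder instructions reading temporaries computed in the position part would need those values forwarded or recomputed):

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Instruction
{
    std::string dest;              // e.g. "r0", "oPos", "oT0"
    std::vector<std::string> src;  // source registers
};

// Split the shader into the instructions oPos depends on (the "position
// part") and everything else (the "remainder part").
void splitShader(const std::vector<Instruction>& shader,
                 std::vector<Instruction>& positionPart,
                 std::vector<Instruction>& remainderPart)
{
    std::set<std::string> live = {"oPos"};

    // Walk backwards: an instruction belongs to the position part if it
    // writes a register that a later position-part instruction reads.
    std::vector<bool> inPositionPart(shader.size(), false);
    for (std::size_t i = shader.size(); i-- > 0; )
    {
        if (live.count(shader[i].dest))
        {
            inPositionPart[i] = true;
            live.erase(shader[i].dest);  // killed by this write
            live.insert(shader[i].src.begin(), shader[i].src.end());
        }
    }

    for (std::size_t i = 0; i < shader.size(); i++)
        (inPositionPart[i] ? positionPart : remainderPart).push_back(shader[i]);
}
```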

But as Humus says, that's probably only worthwhile in SWVP; in real hardware a vertex shading bottleneck happens rarely.
 
OK, stupid question time, isn't this where the geometry "loop" in D3D10 comes in? VS(-GS)-streamout-VS-GS?

Additionally, if occlusion queries become practical in D3D10 (hey, I'm no developer, just guessing), won't the vertex shading work saved by occlusion culling dwarf the much smaller savings of backface culling? Particularly if the geometry loop is exploited?

Jawed
 
Nick said:
Skinning is definitely expensive, but when I look at e.g. UT 2007 screenshots I see huge and highly detailed worlds. That's all static geometry with a single matrix transform for position but many instructions for lighting and texturing, which also consumes precious bandwidth.

Well, if history is any indication I'd guess that UT2007 will be mainly CPU-limited, and fragment-limited at higher resolutions. So even if you can save vertex shading power, I'm not so sure it would help that much. Keep in mind that an increase in the number of vertices also increases the load on the fragment shader, due to more polygon edges in the image.

Nick said:
I'm not so much concerned about vertex processing being the bottleneck (on current hardware). But if you can save 30% transistors for vertex processing and a bit of bandwidth this simply allows cheaper and faster hardware. Doesn't every $ and % count?

Sure, but I'm not convinced that you can save 30% of the transistors, especially considering that backface culling isn't used for everything. I'm not a hardware guy so I can only guess what a vertex pipe supporting this kind of early out would cost in terms of transistors, but it's clear that it's non-trivial to implement. In a software shading path it's probably easier and more beneficial as well.
 
Jawed said:
OK, stupid question time, isn't this where the geometry "loop" in D3D10 comes in? VS(-GS)-streamout-VS-GS?

Well, you can use that to reduce the vertex processing in multipass rendering. D3D10 in general has a bunch of tools that can be used to reduce the vertex shading load, so redundant computation should become less of a problem.

Jawed said:
Additionally, if occlusion queries become practical in D3D10 (hey I'm no developer, just guessing) won't the vertex shading work saved by occlusion culling overwhelm the much smaller workload saving of backface culling? Particularly if the geometry loop is exploited?

Occlusion queries are practical already and can be used to save a large amount of vertex shading if used properly.
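
For the D3D9 path, the usual pattern is to issue a query around a cheap bounding volume and skip the expensive draw call when nothing is visible. A minimal sketch (error handling omitted; drawBoundingBox is an assumed helper, and the query is assumed to have been created with CreateQuery(D3DQUERYTYPE_OCCLUSION, ...)):

```cpp
#include <d3d9.h>

// Assumed helper: renders the object's bounding box with color and
// depth writes disabled.
void drawBoundingBox(IDirect3DDevice9* device);

bool objectVisible(IDirect3DDevice9* device, IDirect3DQuery9* query)
{
    query->Issue(D3DISSUE_BEGIN);
    drawBoundingBox(device);
    query->Issue(D3DISSUE_END);

    DWORD visiblePixels = 0;
    // Spinning like this stalls the pipeline; real code would consume
    // the result a frame later or do other work in between.
    while (query->GetData(&visiblePixels, sizeof(visiblePixels),
                          D3DGETDATA_FLUSH) == S_FALSE)
    {
    }
    return visiblePixels > 0;  // skip the real draw call when zero
}
```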
 
Nick said:
I'm not so much concerned about vertex processing being the bottleneck (on current hardware). But if you can save 30% transistors for vertex processing and a bit of bandwidth this simply allows cheaper and faster hardware. Doesn't every $ and % count?
I don't think backface culling will save any bandwidth as the process is internal to the GPU pipeline.
 
Jawed said:
OK, stupid question time, isn't this where the geometry "loop" in D3D10 comes in? VS(-GS)-streamout-VS-GS?
While the geometry loop could be used for backface culling, I don't expect it to be used like this in general. Bandwidth is precious, so I'd expect developers to spend a little extra vertex processing rather than use streamout. Now, if the game is already using a streamout loop, then it might make sense to cull in the GS, as this will save bandwidth in addition to future vertex processing.
 
3dcgi said:
I don't think backface culling will save any bandwidth as the process is internal to the GPU pipeline.
Vertex shaders require bandwidth for reading input registers (i.e. vertex buffer streams) and writing output registers. If you can skip executing a major part of the vertex shader for backfacing triangles, you're saving bandwidth, no?
 
The pertinent part of the original question (paraphrased):
So is it even possible at all to do backface culling before the vertex shaders? ... with modern GPUs...
No, not with the latest graphics chips, and not until we know where the vertices are between vertex shading and rasterization, since that is the stage (the "hole", if you like) where backface culling currently happens in the GPU. I'm sure you already know this and that you're probably asking for workarounds (which some have attempted to contribute in this thread), but I thought I'd just state the obvious: there are no known "tricks" (like you had hoped!) even with the latest graphics chips.
 
I should add that this limitation would not be a significant problem if vertex shaders were small, which has been the case in all DX8/DX9 games thus far (relatively speaking). The concern is of course performance, where the delta due to this limitation is small, which is what I think this thread is about.

However, current/latest GPUs might (keyword) actually have an optimization that detects when the vertex shader doesn't change the vertex position (but just passes it through from one of the input streams) and avoids invoking the vertex shader in that case. Probably something we should think about.
 
Nick said:
Vertex shaders require bandwidth for reading input registers and writing output registers (i.e. vertex buffer streams). If you can skip executing a major part of the vertex shader for backfacing triangles you're saving bandwidth, not?
I see what you mean. I was thinking write bandwidth, but you would save read bandwidth from the vertex buffer. There's a reason I typically preface something with "I think". I forget stuff. :smile:

I'm not sure when vertex buffer streams are used, but for general rendering the output registers are on-chip, and the bandwidth to them is enough to sustain the peak rate.
 
3dcgi said:
I'm not sure when vertex buffer streams are used, but in general rendering output registers are on chip and bandwidth to these is enough to sustain the peak rate.
Input registers are directly linked to the vertex streams, so they consume off-chip bandwidth. Output registers go to the vertex cache, which is on-chip (I'm not sure if skipping the second half helps to save on bandwidth and cache space, but in theory it could). However, R520/80 supposedly has a very big memory bus, and although that's probably mostly for passing around pixel shader data, I assume early backface culling could still save some transistors or free up bandwidth for pixel processing.
 