Vertex shaders and backface culling

Nom De Guerre said:
No, not with the latest graphics chips and not until we know where the vertices are between vertex shading and rasterization, since that is the stage (the "hole", if you like) where backface culling currently happens in the GPU.
That's a very interesting formulation.

Does this mean the vertex shaders do get executed, but as soon as the position is written (for all three vertices) the backface test is performed and the shader is interrupted? This suddenly makes a lot of sense! Reorder the vertex shader so that writing the position happens as early as possible. The backface culling unit can then watch the vertex cache for triangles that have their positions computed, and stop every shader unit that's working on the vertices of a back-facing triangle (so it can do more useful work). This way the shader doesn't have to be split, so there's no overhead caused by dependencies (requiring the same instructions in both parts) and no extra setup. So no worst case either, compared to just brute-force executing everything. A win-win situation, am I right?

I don't think this translates to software easily, though. I could perform a check after writing the position to see whether to invoke the backface culling, but that seems really hard to implement and does add overhead. Anyway, I now have some new insights into the problem and there are a few things I'm going to experiment with... thanks!
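For what it's worth, the shape of the idea in software would be something like this (just a sketch with made-up types and stubbed helpers; positionPart and attributePart stand in for the two halves of a compiled vertex shader, and the positions they produce are assumed to already be in screen space):

```cpp
struct Vec4   { float x, y, z, w; };
struct Vertex { Vec4 objPos; Vec4 normal; };
struct Shaded { Vec4 screenPos; Vec4 color; };

// Cheap half of the shader: typically one matrix transform (stubbed).
static Vec4 positionPart(const Vertex& v) { return v.objPos; }

// Expensive half: lighting, texture coordinate setup, etc. (stubbed).
static Vec4 attributePart(const Vertex& v) { return v.normal; }

// Screen-space signed area; > 0 means front-facing for CCW winding.
static bool frontFacing(const Vec4& a, const Vec4& b, const Vec4& c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x) > 0.0f;
}

// Shade one triangle: positions first, then cull, then the rest on demand.
static bool shadeTriangle(const Vertex in[3], Shaded out[3])
{
    for (int i = 0; i < 3; ++i)
        out[i].screenPos = positionPart(in[i]);

    if (!frontFacing(out[0].screenPos, out[1].screenPos, out[2].screenPos))
        return false;  // back-facing: the expensive half never runs

    for (int i = 0; i < 3; ++i)
        out[i].color = attributePart(in[i]);
    return true;
}
```

The obvious catch (which comes up later in this thread) is shared vertices: a vertex skipped here may still be needed by a neighbouring front-facing triangle.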
 
Nom De Guerre said:
However, current/latest GPUs might (keyword) actually have an optimization to detect when the vertex shader doesn't change the vertex position (but just passes it through from one of the input streams) and avoid invoking the vertex shader in that case.
Why would any shader leave the position unchanged? That would require software processing to put the clip-space position directly into the vertex stream. Or is that exactly what you're implying: letting the driver (not the application) process the position in software, so the hardware can do early backface culling and no time is wasted on back-facing triangles?
Nom De Guerre said:
Probably something we should think about.
I am intrigued by your user name, post count and the 'we' in this sentence... ;)
 
I am certain your experiments will gain you some knowledge, but I am not certain whether you will learn anything new or worthwhile wrt the topic at hand.

The fact of the matter is that, regardless of API/latest-GPU/whathaveyous, backface culling is not done until all (another keyword) of the vertex shaders have been evaluated. While you could imagine separating the vertex shader into one part that calculates the position vector and another part that calculates everything else, I don't think it would be worthwhile.

Unless you have something specific in mind that may be ground-breaking, I'd say the keyword here is "worthwhile", from all perspectives and all considerations.
 
Nick said:
I am intrigued by your user name, post count and the 'we' in this sentence... ;)
Username: It's exactly what it means, "pseudonym", in French. It's used a lot on the Internet, so I was pleasantly surprised it was available as a username on this forum.

Post Count: I registered during the weekend (I think). My very first post on this forum resides in this thread.

The "we": I meant "us", as in those of us on this forum.
 
Nick said:
This suddenly makes a lot of sense! Reorder the vertex shader so that writing the position happens as early as possible. The backface culling unit can then watch the vertex cache for triangles that have their positions computed, and stop every shader unit that's working on the vertices of a back-facing triangle (so it can do more useful work). This way the shader doesn't have to be split, so there's no overhead caused by dependencies (requiring the same instructions in both parts) and no extra setup. So no worst case either, compared to just brute-force executing everything. A win-win situation, am I right?
Hmm, I can see how the out-of-order scheduler in a Xenos-like unified shader could "kill" a vertex very easily in this situation.

The problem with a Xenos-like USA is that vertices are shaded in batches, so in Xenos, for instance, all 16 vertices would need to be killed to make a difference.

Also, I suppose it's worth pointing out that predicating vertices for back-facing triangles is a bit similar to predicating primitives for tiled rendering. Not hugely so, but as Xenos came up...

Jawed
 
Nick said:
Does this mean the vertex shaders do get executed, but as soon as the position is written (for all three vertices) the backface test is performed and the shader is interrupted? This suddenly makes a lot of sense! Reorder the vertex shader so that writing the position happens as early as possible. The backface culling unit can then watch the vertex cache for triangles that have their positions computed, and stop every shader unit that's working on the vertices of a back-facing triangle (so it can do more useful work). This way the shader doesn't have to be split, so there's no overhead caused by dependencies (requiring the same instructions in both parts) and no extra setup. So no worst case either, compared to just brute-force executing everything. A win-win situation, am I right?
There is added complexity to the hardware, and there is overhead for having interruptible and resumable vertex threads that have to store their state somewhere.

I think it's easier to have a fixed split, with the second part of the shader being able to read back intermediate results the first part wrote to the post-transform cache. But that means more read ports for the post-transform cache.
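In software terms the fixed split might look roughly like this (a sketch with stubbed transforms; the CacheEntry layout is made up and merely stands in for a post-transform cache slot widened by one intermediate value):

```cpp
struct Vec4 { float x, y, z, w; };

struct CacheEntry          // stand-in for one post-transform cache slot
{
    Vec4 clipPos;          // consumed by culling and triangle setup
    Vec4 eyePos;           // intermediate result shared by both halves
};

static Vec4 mulModelView(const Vec4& v)  { return v; }       // stubbed transform
static Vec4 mulProjection(const Vec4& v) { return v; }       // stubbed transform
static Vec4 light(const Vec4& eyePos)    { return eyePos; }  // stubbed lighting

// First part: runs for every vertex, writes position plus intermediates.
void positionPass(const Vec4& objPos, CacheEntry& slot)
{
    slot.eyePos  = mulModelView(objPos);  // shared work, done once
    slot.clipPos = mulProjection(slot.eyePos);
}

// Second part: runs only for vertices that survived culling, and reads
// the intermediate back instead of re-executing the shared instructions.
void attributePass(const CacheEntry& slot, Vec4& color)
{
    color = light(slot.eyePos);  // this read-back is the extra port
}
```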
 
Nom De Guerre said:
Username: It's exactly what it means, "pseudonym", in French. It's used a lot on the Internet, so I was pleasantly surprised it was available as a username on this forum.

Post Count: I registered during the weekend (I think). My very first post on this forum resides in this thread.

The "we": I meant "us", as in those of us on this forum.
Nice. I highly appreciate your inspiring posts. Welcome! :)
 
Jawed said:
The problem with a Xenos-like USA is that vertices are shaded in batches, so in Xenos, for instance, all 16 vertices would need to be killed to make a difference.
Interesting. Obviously, for any kind of early backface culling to work, the vertices should be processed independently. Am I right that this would require (too) much extra logic to control instruction flow, so the brute-force approach is actually more efficient despite the extra shader execution?
 
Xmas said:
There is added complexity to the hardware, and there is overhead for having interruptible and resumable vertex threads that have to store their state somewhere.
They don't really have to be resumed. They can just continue executing the shader, and a few clock cycles later the culling unit can interrupt them when necessary.
I think it's easier to have a fixed split, with the second part of the shader being able to read back intermediate results the first part wrote to the post-transform cache. But that means more read ports for the post-transform chache.
If you split the shader there is definitely a lot of overhead. What I envisioned was the original shader in one piece, with the position being written as early as possible. As soon as the three positions of a triangle are written to the vertex cache, the culling unit can start, and a few cycles later interrupt the shaders (which just continued to run) for back-facing triangles. No need for extra temporary storage.

There is a chance that the complete vertex is still needed by a front-facing triangle, but that shouldn't happen too often. In most situations computing the position is only one matrix transform, so the wasted time is minimal (compared to fully shading all the vertices of all back-facing triangles).

Anyway, thanks for making me realize it's not a pure win-win. Only experiments can tell now whether it works.
 
Nick said:
Jawed said:
The problem with a Xenos-like USA is that vertices are shaded in batches, so in Xenos, for instance, all 16 vertices would need to be killed to make a difference.
Interesting. Obviously, for any kind of early backface culling to work, the vertices should be processed independently. Am I right that this would require (too) much extra logic to control instruction flow, so the brute-force approach is actually more efficient despite the extra shader execution?
Wait a second... How does that work with dynamic branching? Doesn't that require shaders to be able to execute asynchronously, depending on the branches? Or does it really process batches of 16 vertices and wait for each shader to finish, to stay synchronous, or something? Sorry, I don't know much about this hardware. Could you elaborate on this a little?

Won't future unified architectures have fully independent vertex processing, so early backface culling becomes possible and useful?
 
Nick said:
Wait a second... How does that work with dynamic branching? Doesn't that require shaders to be able to execute asynchronously, depending on the branches? Or does it really process batches of 16 vertices and wait for each shader to finish, to stay synchronous, or something? Sorry, I don't know much about this hardware. Could you elaborate on this a little?
All vertices within a batch share the same program counter; if a vertex in a batch 'diverges', all the possible paths will be executed on all the vertices (and results will be discarded/selected as needed to maintain an effective per-vertex branching capability).
So you don't save any time... it would probably even be slower, as branching does not come for free.
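To illustrate (a toy sketch, not how any particular chip is wired): with one shared program counter, both sides of a branch execute for the whole batch and a per-vertex select keeps the right result, so early-killing a single vertex buys nothing:

```cpp
constexpr int kBatch = 16;  // batch size used in the Xenos example above

// "if (cond) out = pathA else out = pathB" over one SIMD batch:
void branchOverBatch(const float cond[kBatch], const float a[kBatch],
                     const float b[kBatch], float out[kBatch])
{
    for (int i = 0; i < kBatch; ++i)
    {
        float resultA = a[i] * a[i] + 1.0f;  // path A runs for every vertex
        float resultB = b[i] * 0.5f;         // path B runs for every vertex too
        out[i] = (cond[i] > 0.0f) ? resultA : resultB;  // per-lane select
    }
}
```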
Won't future unified architectures have fully independent vertex processing, so early backface culling becomes possible and useful?
Who knows... at this time the only full MIMD vertex shader architecture I'm aware of is the current NVIDIA implementation.
 
nAo said:
Who knows... at this time the only full MIMD vertex shader architecture I'm aware of is the current NVIDIA implementation.
Which seems to imply that you know that R5xx vertex shading hardware is SIMD (or multi-SIMD)... :?:

(Maybe this is generally known. I seem to remember some question marks over dynamic branching performance in R5xx but I can't remember the conclusions.)

Jawed
 
Jawed said:
Which seems to imply that you know that R5xx vertex shading hardware is SIMD (or multi-SIMD)... :?:
Jawed
The only thing I know about it is that ATI discourages dynamic branching in vertex shaders..
 
Nick said:
What I envisioned was the original shader in one piece, with the position being written as early as possible. As soon as the three positions of a triangle are written to the vertex cache, the culling unit can start, and a few cycles later interrupt the shaders (which just continued to run) for back-facing triangles. No need for extra temporary storage.
If you do culling in NDC/screen space then you will also need to perform near-plane frustum clipping and the perspective division prior to the cull test. For a HW pipeline this may introduce sufficient delay to mask any gains from the culler for all but the more complicated vertex shading programs. There are definite gains to be had for a SW pipeline, though, if the vertex shading and clipping/culling cannot be done in parallel.
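As a sketch of that ordering (illustrative, not any specific pipeline): the screen-space test only works after the division by w, and the division is only safe once near-plane clipping has guaranteed w > 0 for all three vertices:

```cpp
struct Vec4 { float x, y, z, w; };

// Cull test in NDC; assumes the triangle was already clipped against
// the near plane, so w > 0 holds and the divisions below are safe.
bool frontFacingAfterDivide(Vec4 a, Vec4 b, Vec4 c)
{
    a.x /= a.w; a.y /= a.w;  // perspective division, clip space -> NDC
    b.x /= b.w; b.y /= b.w;
    c.x /= c.w; c.y /= c.w;

    // Signed area in NDC; the sign convention depends on winding order.
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x) > 0.0f;
}
```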
 
Nick said:
If you split the shader there is definitely a lot of overhead. What I envisioned was the original shader in one piece, with the position being written as early as possible. As soon as the three positions of a triangle are written to the vertex cache, the culling unit can start, and a few cycles later interrupt the shaders (which just continued to run) for back-facing triangles. No need for extra temporary storage.
Uhm - wouldn't this be constantly interrupting shaders on shared vertices that eventually need to be executed anyway, wasting time in the process?
I can't see this working without enough topology information to determine reject/accept for a vertex only when all triangles sharing it have been backface-tested. And vertex streams normally don't carry this kind of data with them.

I did work on a software implementation of this back in the PS2 days, and the overhead of early culling tends to mean the shader must be pretty darn expensive before it starts paying off (unless I worked on discrete triangle lists, in which case the processing becomes trivial, with no vertex sharing).
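For an indexed triangle list, the two-pass shape of that idea would be roughly this (an illustration, not the PS2 implementation I mentioned): transform all positions first, then mark the vertices that at least one front-facing triangle still references:

```cpp
#include <cstdint>
#include <vector>

struct Vec4 { float x, y, z, w; };

// Screen-space signed-area test; > 0 means front-facing for CCW winding.
static bool frontFacing(const Vec4& a, const Vec4& b, const Vec4& c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x) > 0.0f;
}

// Given already-transformed positions, decide per vertex whether the
// expensive part of the shader still has to run.
std::vector<bool> markNeededVertices(const std::vector<Vec4>& screenPos,
                                     const std::vector<uint32_t>& indices)
{
    std::vector<bool> needed(screenPos.size(), false);
    for (size_t t = 0; t + 2 < indices.size(); t += 3)
    {
        uint32_t i0 = indices[t], i1 = indices[t + 1], i2 = indices[t + 2];
        if (frontFacing(screenPos[i0], screenPos[i1], screenPos[i2]))
            needed[i0] = needed[i1] = needed[i2] = true;
    }
    // A vertex is rejected only when every triangle sharing it was
    // back-facing, which is exactly the topology condition above.
    return needed;
}
```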
 
Nick said:
If you split the shader there is definitely a lot of overhead.
There's hardware overhead, but not necessarily much overhead wrt shader execution.
 
Fafalada said:
Uhm - wouldn't this be constantly interrupting shaders on shared vertices that eventually need to be executed anyway, wasting time in the process?
Indeed, that's some waste, but there shouldn't be that many vertices shared between front-facing and back-facing triangles (outline vertices), and this actually improves with high-detail geometry. Furthermore, there should be a roughly equal chance that an outline vertex is first processed as part of a front-facing or a back-facing triangle. So in theory only half of these vertices pay for the position part plus the whole shader; all other vertices are processed optimally.

So if a fraction S of all vertices is on the outline (shared), the position part of the shader takes a fraction X of the full shader's execution time, and a fraction F of all vertices belongs exclusively to front-facing triangles, we get a relative workload (compared to brute force) of F * 1 + S * (2 + X) / 2 + (1 - S - F) * X. Estimating S = 15%, X = 30% and F = 40%, we get roughly 70%. And I believe these estimates are pretty conservative (though only experiments can tell), so a solid 30% lower workload really seems worth it. Early frustum culling could improve that even more.
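Plugging those estimates into the formula, as a sanity check (the values are my rough guesses, not measurements):

```cpp
#include <cstdio>

int main()
{
    const float S = 0.15f;  // fraction of vertices on the outline (shared)
    const float X = 0.30f;  // position part's cost relative to the full shader
    const float F = 0.40f;  // fraction of vertices on front-facing triangles only

    // F vertices: full shader. Shared vertices: half cost 1, half cost 1 + X.
    // The remaining back-facing-only vertices: position part only.
    float workload = F * 1.0f + S * (2.0f + X) / 2.0f + (1.0f - S - F) * X;
    std::printf("relative workload: %.1f%%\n", workload * 100.0f);  // ~70.8%
}
```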
 
Fafalada said:
Uhm - wouldn't this be constantly interrupting shaders on shared vertices that eventually need to be executed anyway, wasting time in the process?
I can't see this working without enough topology information to determine reject/accept for a vertex only when all triangles sharing it have been backface-tested. And vertex streams normally don't carry this kind of data with them.

I did work on a software implementation of this back in the PS2 days, and the overhead of early culling tends to mean the shader must be pretty darn expensive before it starts paying off (unless I worked on discrete triangle lists, in which case the processing becomes trivial, with no vertex sharing).

This matches my experience with "software implementations" as well.
It's actually faster just to block-transform the verts, but I could see cases where expensive shaders might save something.
 
Nick said:
Nom De Guerre said:
However, current/latest GPUs might (keyword) actually have an optimization to detect when the vertex shader doesn't change the vertex position (but just passes it through from one of the input streams) and avoid invoking the vertex shader in that case.
Why would any shader leave the position unchanged? That would require software processing to put the clip-space position directly into the vertex stream. Or is that exactly what you're implying: letting the driver (not the application) process the position in software, so the hardware can do early backface culling and no time is wasted on back-facing triangles?
That's just a speculation/guess of mine (hence my keyword). I don't know whether the IHVs are actually doing this, but it seems like a really easy and logical fast path for an IHV to implement.

I think, however, that whether or not the IHVs do this shouldn't really affect most game developers' design decisions, since it's just an optimization, and if the GPU doesn't do it, there's no reasonable way the ISVs can do it themselves (software vertex shading wouldn't be a viable option!).
 