Triangle setup

AFAIK nvidia hw does not perform any geometric clipping, so this step is not required.
You always need 1/w for perspective-correct interpolation, no? Since divisions are expensive, it's better to compute 1/w once per vertex than three times per triangle.
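As a rough C sketch of why one reciprocal per vertex suffices (all names here are illustrative, not from any actual hardware): both attr/w and 1/w are linear in screen space, so they can be interpolated directly, with a single divide per pixel to recover the perspective-correct value.

```c
/* Sketch: perspective-correct interpolation with one reciprocal per vertex.
   Names are illustrative, not taken from any actual driver or hardware. */
typedef struct { float attr; float w; } Vertex;

/* Per vertex: the one expensive division, done once. */
float setup_inv_w(const Vertex *v) { return 1.0f / v->w; }

/* Per pixel: attr/w and 1/w are both linear in screen space, so they can be
   interpolated with barycentric weights (b0 + b1 + b2 = 1); one divide per
   pixel then recovers the perspective-correct attribute. */
float interp(const Vertex v[3], const float inv_w[3],
             float b0, float b1, float b2)
{
    float num = b0 * v[0].attr * inv_w[0]
              + b1 * v[1].attr * inv_w[1]
              + b2 * v[2].attr * inv_w[2];
    float den = b0 * inv_w[0] + b1 * inv_w[1] + b2 * inv_w[2];
    return num / den;
}
```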

Anyway, I have to read those articles in detail to see how it all fits together...
 
This is the main element of setup computations today: computing dz/dx and dz/dy, and then computing slopes for the interpolants. Any of the computations above could be used; the exact math is generally a trade secret. On ATI HW, the setup will in general compute all of these values at a rate of a prim/cycle, but if there are too many interpolants, the rate can drop somewhere in the pipeline, which affects primitive throughput (I'm being vague here). There's also some other work typically done in setup, such as wide-line computations, sprites, trivial reject, back-face reject, etc. Some of those could be moved up or down the pipe, depending on things. But the setup generally operates in screen space, so that screen-based computations (such as wide lines) can be done here or lower in the pipe.
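For reference, a minimal sketch of the plane-equation style of setup hinted at above, for a single interpolant such as z. The real math is, as noted, a trade secret; every name below is illustrative only.

```c
/* Sketch: plane-equation setup for one linear interpolant (e.g. z).
   Given screen-space positions and a value at each vertex, solve the 2x2
   system for the gradients d(val)/dx and d(val)/dy via Cramer's rule. */
typedef struct { float dvdx, dvdy, v0; } Plane;

Plane setup_plane(float x0, float y0, float v0,
                  float x1, float y1, float v1,
                  float x2, float y2, float v2)
{
    /* Twice the signed triangle area; its sign also serves back-face
       reject, and a zero value flags a degenerate (trivially rejected)
       triangle. */
    float area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    float inv = 1.0f / area2;
    Plane p;
    p.dvdx = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) * inv;
    p.dvdy = ((v2 - v0) * (x1 - x0) - (v1 - v0) * (x2 - x0)) * inv;
    p.v0   = v0; /* value at (x0, y0); evaluate as v0 + dx*dvdx + dy*dvdy */
    return p;
}
```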
 
Thanks for the info, sereric!

Why do you mention the z gradients separately from the other interpolants? Does this have something to do with z-cull?

Since not all interpolants are needed immediately (every shader instruction uses at most 1 texture coordinate and 1 interpolated color), does some of the gradient computation happen in multiple clock cycles?

I didn't know ATI hardware did wide lines entirely in hardware! Since it requires screen space operations like you say, does this also mean the 'plane equation approach' is used instead of the 'inverse matrix approach'?
 
It seems next-generation GPUs might use shader ALUs to set up triangles...
The problem with doing these kinds of computations in the shader ALUs is the precision needed. For large displays / large amount of super-sampling and/or multi-sampling, the precision needed for the setup computations to maintain invariance or prevent cracks generally exceeds fp32, regardless of the approach you take to make those computations.

So you're left with several options then: Don't support high resolutions, allow cracks/non-invariance/other artifacts, or increase the precision of some or all your shader ALUs.


Does this also mean the 'plane equation approach' is used instead of the 'inverse matrix approach'?
You can also use barycentrics. There's more than one way of computing things :)
 
Thanks for the info, sereric!

Why do you mention the z gradients separately from the other interpolants? Does this have something to do with z-cull?

Since not all interpolants are needed immediately (every shader instruction uses at most 1 texture coordinate and 1 interpolated color), does some of the gradient computation happen in multiple clock cycles?

I didn't know ATI hardware did wide lines entirely in hardware! Since it requires screen space operations like you say, does this also mean the 'plane equation approach' is used instead of the 'inverse matrix approach'?

Z is different as it has higher precision requirements than interpolants. In fact, as Bob mentioned, barycentrics could be used for all interpolants and then you just depend on rasterization.

Yes, we can do wide lines in HW. I have some sort of patent on that (many years ago).

There are multiple ways, but all equivalent, to compute the slopes. Also, different scan converters/rasterizers might operate differently and want different setup data.
 
So you're left with several options then: Don't support high resolutions, allow cracks/non-invariance/other artifacts, or increase the precision of some or all your shader ALUs.
Obviously the first option is a no-go; the third one might be viable if the area cost of upping ALU precision is still less than the area required for dedicated ALUs in the setup engine.
 
Are they still using texkill to do user plane clipping?

They only ever did this for 3D clip planes; scissoring and near-plane clipping were never done this way.

Actually I've never really understood how nvidia does their near-plane clipping, since they don't have the pre-divide-by-W values in the clipping unit. My assumption is that they must fudge the divide in some way to allow them to back out the value even in the case of a degeneracy. I do know that they lose precision in the case of a near clip.
 
You can also use barycentrics. There's more than one way of computing things :)
Could you elaborate on this or point me to some papers? If I remember correctly, barycentric coordinates assign (0, 0), (1, 0) and (0, 1) to the vertex positions, so interpolants would just become a linear interpolation of these barycentric coordinates. But doesn't this require some transformation from (x, y) coordinates to barycentric coordinates, and then a lot of multiplies and additions to get the interpolants? It saves the setup cost but adds a lot of per-pixel work, at first thought...
 
The problem with doing these kinds of computations in the shader ALUs is the precision needed. For large displays / large amount of super-sampling and/or multi-sampling, the precision needed for the setup computations to maintain invariance or prevent cracks generally exceeds fp32, regardless of the approach you take to make those computations.
I fail to see why you would get cracks (between triangles I assume), or anything like that. Doesn't the rasterizer use fixed-point computations?
 
There are multiple ways, but all equivalent, to compute the slopes. Also, different scan converters/rasterizers might operate differently and want different setup data.
There are even more approaches? :cool: Where can I learn about these?

Thanks a lot!
 
Could you elaborate on this or point me to some papers? If I remember correctly, barycentric coordinates assign (0, 0), (1, 0) and (0, 1) to the vertex positions, so interpolants would just become a linear interpolation of these barycentric coordinates. But doesn't this require some transformation from (x, y) coordinates to barycentric coordinates, and then a lot of multiplies and additions to get the interpolants? It saves the setup cost but adds a lot of per-pixel work, at first thought...

You could iterate barycentrics as well as X,Y,Z per pixel, then use the barycentrics to compute the other interpolants such as color and texture coordinates. The operations involved are pretty simple, per pixel. You would only need to iterate a few fixed elements then, regardless of the number of colors or texture coordinates you had.
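A hedged sketch of that scheme, with made-up names; perspective correction (the 1/w interpolation discussed earlier) is omitted for brevity.

```c
/* Sketch: iterate only the barycentrics (plus x, y, z) across the triangle,
   then expand any number of attributes from them with simple MADs.
   Illustrative names; not any vendor's actual scheme. */
#define MAX_ATTRS 16

typedef struct {
    float a0[MAX_ATTRS];   /* attribute values at vertex 0        */
    float da1[MAX_ATTRS];  /* per-attribute deltas: a1 - a0       */
    float da2[MAX_ATTRS];  /* per-attribute deltas: a2 - a0       */
} AttrSetup;

/* Per pixel: two iterated barycentrics b1, b2 (b0 = 1 - b1 - b2 is
   implicit), then one MAD pair per attribute, no matter how many
   attributes the shader actually uses. */
void expand_attrs(const AttrSetup *s, float b1, float b2,
                  int n, float out[])
{
    for (int i = 0; i < n; ++i)
        out[i] = s->a0[i] + b1 * s->da1[i] + b2 * s->da2[i];
}
```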
 
I fail to see why you would get cracks (between triangles I assume), or anything like that. Doesn't the rasterizer use fixed-point computations?
You do need to convert from floating-point to fixed-point (if that's what you're using), and if your source values lost some bits, you can't recover them by moving to fixed-point after the fact.

Plus, if you round wrong, you also get to extrapolate outside the triangle, which doesn't always produce pleasing results :)
 
I fail to see why you would get cracks (between triangles I assume), or anything like that. Doesn't the rasterizer use fixed-point computations?

While you would normally use fixed-point for the rasterizer itself, you still need to set up some data that the rasterizer can work on, involving a float->fixed conversion somewhere in the path.

For completely robust edge equations for rasterization, you have at least 2 problems that need to be addressed in order to achieve 100% robust rendering:
  • Two polygons sharing an edge must be given the exact same equation (except for sign) for the edge (failures here result in gaps along the edge)
  • Multiple edges going through a vertex (that is shared between multiple polygons) must all cross exactly through the same point (failures here result in samples around a vertex being drawn too few or too many times)
The first problem should not be particularly hard; the second one is however quite difficult. In particular, trying to solve it by just throwing a random >fp32 precision level at it does NOT result in 100% robust rasterization; rather, you need to plan in detail how you are going to do all of the rounding steps in the edge equation calculation.
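To make the first problem concrete, here is a minimal sketch (illustrative grid size, not any vendor's rounding plan): snapping vertices to a fixed subpixel grid and evaluating edge functions in exact integer arithmetic guarantees that two triangles sharing an edge see the same equation up to sign. The shared-vertex problem still needs a carefully designed rounding plan on top of this.

```c
/* Sketch: fixed-point edge equations. Grid resolution is an illustrative
   choice, not any particular hardware's. */
#include <math.h>
#include <stdint.h>

#define SUBPIXEL_BITS 8   /* 256 subpixel positions per pixel */

/* Snap a floating-point screen coordinate to the subpixel grid once,
   per vertex, so every triangle touching that vertex sees the same
   snapped position. */
static int64_t snap(float v)
{
    return (int64_t)lrintf(v * (float)(1 << SUBPIXEL_BITS));
}

/* Integer edge function: the sign tells which side of the edge
   (ax,ay)->(bx,by) the sample (px,py) lies on. The arithmetic is exact,
   so a shared edge evaluates bit-for-bit identically (up to sign) in
   both triangles. */
static int64_t edge(int64_t ax, int64_t ay, int64_t bx, int64_t by,
                    int64_t px, int64_t py)
{
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}
```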
 
You could iterate barycentrics as well as X,Y,Z per pixel, then use the barycentric to compute the other interpolants such as color and texture coordinates. The operation involved are pretty simple, per pixel. You would only need to iterate a few fixed elements then, regardless of the number of colors or texture coordinates you had.
The operations per pixel are simple, but still more than just an addition for regular interpolation, right? Although I guess it fits nicely with the multiply required for perspective correction. It's a tradeoff between higher setup cost with lower per-pixel cost and lower setup cost with higher per-pixel cost, right? And the latter is preferred nowadays to maximize parallelism?
 
You do need to convert from floating-point to fixed-point (if that's what you're using), and if your source values lost some bits, you can't recover them by moving to fixed-point after the fact.
If two triangles share an edge, they share the exact same vertices and they are rounded to the same fixed-point coordinates. So how could this create cracks? Or are we talking about 'T-joints' between polygons?
Plus, if you round wrong, you also get to extrapolate outside the triangle, which doesn't always produce pleasing results
Isn't that solvable with some careful rounding?
 
The operations per pixel are simple, but still more than just an addition for regular interpolation, right? Although I guess it fits nicely with the multiply required for perspective correction. It's a tradeoff between higher setup cost with lower per-pixel cost and lower setup cost with higher per-pixel cost, right? And the latter is preferred nowadays to maximize parallelism?

It requires MAD operations. It's easier to iterate a fixed set of items and then a bunch of values based on them, than to iterate a variable number of components. The amount of math is the same, but centralizing the rasterization is simpler.
 
They only ever did this for 3D clip planes; scissoring and near-plane clipping were never done this way.

Actually I've never really understood how nvidia does their near-plane clipping, since they don't have the pre-divide-by-W values in the clipping unit. My assumption is that they must fudge the divide in some way to allow them to back out the value even in the case of a degeneracy. I do know that they lose precision in the case of a near clip.

According to the OpenGL standard, clipping is to be done before the final per-vertex division-by-W; this way, a W=0 vertex doesn't cause clipping to blow up.

If you do not want to do geometric near-plane clipping at all, the alternative is to build into your rasterizer explicit support for rendering of "external triangles" (as described by the page nAo linked to near the beginning of this thread).
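A minimal sketch of clipping one edge against the near plane in clip space, before the divide by w, following the OpenGL convention that the near plane is z = -w; names are illustrative.

```c
/* Sketch: near-plane clipping in homogeneous clip space, i.e. before the
   divide by w, so a vertex with w = 0 is harmless. Illustrative, not any
   vendor's implementation. */
typedef struct { float x, y, z, w; } ClipVtx;

/* Signed distance to the near plane (z = -w). Inside when d >= 0; this
   stays finite even when w = 0. */
static float near_dist(const ClipVtx *v) { return v->z + v->w; }

/* Intersection of edge a->b with the near plane; all components are
   interpolated linearly in clip space, with no division by w involved. */
static ClipVtx clip_near(const ClipVtx *a, const ClipVtx *b)
{
    float da = near_dist(a), db = near_dist(b);
    float t = da / (da - db);      /* da and db have opposite signs */
    ClipVtx r;
    r.x = a->x + t * (b->x - a->x);
    r.y = a->y + t * (b->y - a->y);
    r.z = a->z + t * (b->z - a->z);
    r.w = a->w + t * (b->w - a->w);
    return r;
}
```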
 
If two triangles share an edge, they share the exact same vertices and they are rounded to the same fixed-point coordinates.
Not necessarily. Arjan gave an example in an earlier post. You do need to pick the right rasterization rules for matching up edges, but that's not strictly sufficient.

You need to be vertex order invariant (Triangle ABC renders exactly the same as BCA, for example), and you do need to render the same pixels if you translate triangles on screen (without changing anything else). Because floats have more precision around 0 than around large numbers, you need to be careful that you compute things correctly and that you have enough precision to make this work both around the origin of the screen and around the opposite corner.

With really large screen sizes, this can become a problem.
 
It requires MAD operations. It's easier to iterate a fixed set of items and then a bunch of values based on them, than to iterate a variable number of components.
I see. But as I noted above there's only a maximum of 2 interpolants required per shader instruction, so two adders could do the job (or even one with some code restructuring) instead of full multiply-add units?
The amount of math is the same, but centralizing the rasterization is simpler.
I can't see how the amount of math is the same. I do understand that this is good for tiny triangles (around the size of a quad) where setup time is crucial.

Thanks again for the insight.
 