Triangle setup: what exactly is it?

Really it depends on how SIMDs share screen-space tiles. E.g. in ATI, all 10 SIMDs can be working on fragments for a common tile, I believe. So put the fragment pool in the shader export block and then triangle ID tracking is in one place (or 2 in Cypress, I suppose).
You need a data structure for that tracking ... let's call it, I dunno, a scoreboard.
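
To make that concrete, here's a rough sketch of what such a per-tile scoreboard could track. Purely illustrative: the field names, sizes and FIFO scheme are made up, not any real design.

/* Hypothetical per-tile scoreboard: track which triangle IDs still have
   fragments in flight for one screen-space tile, so exports can be kept
   in primitive order. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_TRIS_IN_FLIGHT 16

typedef struct {
    uint32_t tri_id[MAX_TRIS_IN_FLIGHT];        /* triangles with outstanding quads */
    uint32_t pending_quads[MAX_TRIS_IN_FLIGHT]; /* quads not yet exported           */
    uint32_t head, tail;                        /* FIFO order = primitive order     */
} tile_scoreboard;

/* A triangle may only start exporting to the tile once it reaches the head
   of the FIFO, which preserves API primitive order within the tile. */
static bool tile_can_export(const tile_scoreboard *sb, uint32_t tri_id)
{
    return sb->head != sb->tail &&
           sb->tri_id[sb->head % MAX_TRIS_IN_FLIGHT] == tri_id;
}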
 
Why complicate setup by doing early-Z concurrently? Particularly if there are multiple early-Z buffers as well as multiple instances of the setup kernel?
Wait, misunderstood the question ...

It doesn't make much sense to do setup in shaders and then do rasterization in fixed-function hardware again, IMO. For small triangles (sub tile size) you will be doing Hi-Z operations at almost a 1:1 ratio of tris to tiles, so it kinda makes sense to roll it all into one.
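
To make the Hi-Z part concrete, a trivial sketch of the kind of coarse per-tile test meant here. The conventions (one "farthest depth" value per tile, larger depth = farther) are assumptions, not any specific hardware.

#include <stdbool.h>

/* Coarse Hi-Z reject for one screen tile: if the triangle's nearest possible
   depth is still behind everything already drawn in the tile, the whole
   triangle can be skipped for that tile before any per-pixel work. */
static bool hiz_tile_rejects(float tile_max_depth, float tri_min_depth)
{
    return tri_min_depth > tile_max_depth;   /* conservative test only */
}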
 
You need a data structure for that tracking ... let's call it, I dunno, a scoreboard.
I was talking about where the scoreboard lives. With multiple SIMDs sharing a tile, it's extra work to put the scoreboard inside the SIMDs (they would need to communicate), so putting it in a pool afterwards is simpler.

Jawed
 
AFAIK, TBDRs sort triangles into tiles and then rasterize them in submission order.

Within a tile, a TBDR faces the same setup bottlenecks as a streaming renderer (both triangle setup and scoreboarding). But, multiple tiles can be rendered in parallel with no intercommunication required between those cores.

For example, Larrabee exploits pixel/block-level parallelism via SIMD instructions, but makes no claims about parallelization between triangles on a single core. Instead, every LRB core works on a different tile, so you can be processing as many triangles as there are cores.

The patents are extremely sparse (there is some old stuff, but it precedes unified pipelines). Generally though, if you want to do parallel setup you will at the very least need to cut up the hierarchical-Z stage into separate portions, and you will need slightly longer queues after the vertex shader to be able to feed those roughly equally; apart from that I don't see any big overhead.

I think you're optimistic, but correct that larger vertex queues are a big part of being able to do parallel setup. The entire concept of TBDR really boils down to 'absolutely gigantic buffers between vertex and fragment processing.'
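
For reference, the binning step being described can be sketched like this. A toy version in C: tile size, resolution and names are made up, and it assumes already clipped, on-screen triangles. Each bin keeps submission order, and each tile's bin can later be rasterized by a different core with no communication between them.

#include <math.h>
#include <stdlib.h>

#define TILE_SIZE 64
#define TILES_X   ((1920 + TILE_SIZE - 1) / TILE_SIZE)
#define TILES_Y   ((1080 + TILE_SIZE - 1) / TILE_SIZE)

typedef struct { float x[3], y[3]; int id; } tri_t;
typedef struct { int *ids; int count, cap; } bin_t;

static bin_t bins[TILES_Y][TILES_X];

static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

/* Bin one triangle into every tile its screen-space bounding box overlaps. */
static void bin_triangle(const tri_t *t)
{
    int tx0 = clampi((int)floorf(fminf(fminf(t->x[0], t->x[1]), t->x[2])) / TILE_SIZE, 0, TILES_X - 1);
    int tx1 = clampi((int)floorf(fmaxf(fmaxf(t->x[0], t->x[1]), t->x[2])) / TILE_SIZE, 0, TILES_X - 1);
    int ty0 = clampi((int)floorf(fminf(fminf(t->y[0], t->y[1]), t->y[2])) / TILE_SIZE, 0, TILES_Y - 1);
    int ty1 = clampi((int)floorf(fmaxf(fmaxf(t->y[0], t->y[1]), t->y[2])) / TILE_SIZE, 0, TILES_Y - 1);

    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++) {
            bin_t *b = &bins[ty][tx];
            if (b->count == b->cap) {
                b->cap = b->cap ? b->cap * 2 : 16;
                b->ids = realloc(b->ids, (size_t)b->cap * sizeof *b->ids);
            }
            b->ids[b->count++] = t->id;   /* submission order preserved per bin */
        }
}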
 
Triangle setup is very vendor specific. All it does is set up the parameters that the rasterizer after it needs in order to draw the triangle on the screen (or not).

So 1 tri/clock for the triangle setup is not a problem as long as the triangles are large enough.
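
A quick back-of-the-envelope example of "large enough" (the numbers are made up, not any real part):

/* At 1 triangle/clock setup and a 32 pixel/clock shading rate, setup becomes
   the bottleneck whenever the average triangle covers fewer than 32 pixels. */
#include <stdio.h>

int main(void)
{
    const double setup_rate_tris_per_clk = 1.0;
    const double shade_rate_pix_per_clk  = 32.0;
    double breakeven = shade_rate_pix_per_clk / setup_rate_tris_per_clk;
    printf("setup-limited below %.0f pixels per triangle\n", breakeven);
    return 0;
}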
 
You only need linear interpolation of Z, at least for the purposes of Z-buffering.

Yeah, you are right. It's the W buffer which needs a hyperbolic interpolator. But I think since current pixel shaders have a depth input, they probably need a correct Z value too (although that can be done in the shading units?)
 
Yeah, you are right. It's the W buffer which needs a hyperbolic interpolator. But I think since current pixel shaders have a depth input, they probably need a correct Z value too (although that can be done in the shading units?)
You don't really want hyperbolic interpolation for "W" either. Using 1/W (and a floating point representation) is cheaper and more accurate. Ask any Dreamcast :)
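
A sketch of why 1/W is so cheap: anything of the form value/w, including the constant 1 (giving 1/w itself), is linear in screen space, so a 1/W depth buffer can be written with one add per pixel and no divide. Generic conventions, not Dreamcast specifics.

/* Fill one scanline of a 1/W depth buffer: 1/w is linear in screen x, so a
   single add per pixel suffices and no per-pixel divide is needed. */
static void write_inv_w_scanline(float *depth_row, int x0, int x1,
                                 float inv_w_at_x0, float d_inv_w_dx)
{
    float inv_w = inv_w_at_x0;
    for (int x = x0; x < x1; x++) {
        depth_row[x] = inv_w;   /* store 1/w directly as depth */
        inv_w += d_inv_w_dx;    /* linear step in screen x     */
    }
}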
 
Just for reference ... I assumed that the homogeneous interpolation method was pretty much standard now (i.e. whatever/w and 1/w are computed in the vertex shader, linearly interpolated for the pixel shader, and then you divide whatever/w by 1/w for the correct result per pixel). Is that right?

Messing around with specialized hyperbolic interpolators seems a bit archaic.
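
Spelled out as code, the method in question looks roughly like this. A generic sketch, not any particular vendor's datapath; l0/l1/l2 are ordinary screen-space (linear) barycentrics of the pixel.

typedef struct {
    float attr_over_w[3];  /* attribute / w at each vertex (done "in the VS") */
    float inv_w[3];        /* 1 / w at each vertex                            */
} persp_tri;

/* The divide at the end is what makes the result perspective correct. */
static float interp_perspective(const persp_tri *t, float l0, float l1, float l2)
{
    float num   = l0 * t->attr_over_w[0] + l1 * t->attr_over_w[1] + l2 * t->attr_over_w[2];
    float denom = l0 * t->inv_w[0]       + l1 * t->inv_w[1]       + l2 * t->inv_w[2];
    return num / denom;   /* = attribute, correct per pixel */
}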
 
Just for reference ... I assumed that the homogeneous interpolation method was pretty much standard now (i.e. whatever/w and 1/w are computed in the vertex shader, linearly interpolated for the pixel shader, and then you divide whatever/w by 1/w for the correct result per pixel). Is that right?
It is for all DX9+ class hardware I know about.
 
Just for reference ... I assumed that the homogeneous interpolation method was pretty much standard now (i.e. whatever/w and 1/w are computed in the vertex shader, linearly interpolated for the pixel shader, and then you divide whatever/w by 1/w for the correct result per pixel). Is that right?
I think it's smarter to just determine the perspective weights at the setup stage and just do multiplications for each attribute. In other words, for the three vertices 'whatever' winds up being (1,0,0) and (0,1,0). These calculations are needed for the edge functions anyway during homogeneous rasterization. Now figure out the final value for each pixel (a, b, 1-a-b) and you're set for as many attributes as you want. I don't think anyone actually does it for each pixel, though; instead the division is only done for the center of each quad and slopes are used to determine the value for each pixel.

A slight wrinkle has been thrown into the equation with DX11 because the sampling center (centroid or normal) can change on the fly in the pixel shader, so you have to adjust the weights by a bit.
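
For the curious, here is roughly how those weights fall out of the edge functions of homogeneous rasterization mentioned above (a sketch in the spirit of that formulation; variable names are mine): the per-pixel edge-function values built from the clip-space (x, y, w) of the vertices, normalized by their sum, are exactly the perspective-correct a, b and 1-a-b.

typedef struct { float x, y, w; } clip_vtx;   /* clip-space x, y, w (z not needed here) */

/* 2D-homogeneous edge function for the edge opposite a vertex: built from the
   cross product of the other two vertices, evaluated at post-divide screen
   coordinates (px, py) — i.e. the same space as x/w, y/w. */
static float edge_eval(clip_vtx a, clip_vtx b, float px, float py)
{
    float cx = a.y * b.w - a.w * b.y;   /* coefficient of px */
    float cy = a.w * b.x - a.x * b.w;   /* coefficient of py */
    float cc = a.x * b.y - a.y * b.x;   /* constant term     */
    return cx * px + cy * py + cc;
}

/* Perspective-correct barycentric weights for one pixel: the normalized edge
   function values ARE the 'a', 'b', '1-a-b' being discussed.  Signs depend on
   winding; degenerate (zero-area) triangles are not handled. */
static void persp_weights(clip_vtx v0, clip_vtx v1, clip_vtx v2,
                          float px, float py, float out[3])
{
    float e0 = edge_eval(v1, v2, px, py);
    float e1 = edge_eval(v2, v0, px, py);
    float e2 = edge_eval(v0, v1, px, py);
    float sum = e0 + e1 + e2;           /* proportional to interpolated 1/w */
    out[0] = e0 / sum;
    out[1] = e1 / sum;
    out[2] = e2 / sum;                  /* any attribute is then a*A0 + b*A1 + c*A2 */
}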
 
I kinda doubt it; the perspective errors caused by doing linear interpolation for the pixels in a quad, rather than just doing the perspective divide correctly per pixel, hardly seem worth it. The weights for the linear interpolation are of course only calculated once per pixel, but the same goes for calculating 1/(1/w) ... obviously I glossed over a few things.
 
I kinda doubt it; the perspective errors caused by doing linear interpolation for the pixels in a quad, rather than just doing the perspective divide correctly per pixel, hardly seem worth it.
You think so? We're talking about a half pixel of linear interpolation. If you choose the right slope then errors should be virtually undetectable. Back in the day, software renderers linearly interpolated over far larger areas to minimize the divisions.

The weights for the linear interpolation are of course only calculated once per pixel
I'm talking about perspective-correct weights. Imagine if you have an attribute 'a' which has a value of 1 at vtx1, 0 at vtx2, and 0 at vtx3. Also, make an attribute 'b' which has a value of 0 at vtx1, 1 at vtx2, and 0 at vtx3. Now calculate 'a/w' and 'b/w' at each vertex, interpolate, and divide by interpolated 1/w. For each pixel, you now have 'a', 'b', and can calculate '1-a-b'. These are the perspective-correct weights. Also interesting is that 'a' and 'b' (and 'c') may be needed anyway for rasterization (see section 5.1 here).

Now for each attribute, it's a simple a*Attr1+b*Attr2+(1-a-b)*Attr3 for each pixel. It makes more sense to me than making a and b regular linear weights and calculating {a*(Attr1*(1/w))+b*(Attr2*(1/w))+(1-a-b)*(Attr3*(1/w))}*(1/(1/w)), even when considering that Attr_*(1/w) can be calculated once per vertex per attribute.

Now, as I mentioned before, DX11 means the weights aren't quite the same for each pixel, but it should be easy to handle if the quad-centre method is used.
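
For completeness, one way the quad-centre approximation could look. This is only a sketch: the choice of slope here is the crudest possible one (treat w as constant across the quad), which is exactly the judgement call being debated.

/* Approximate perspective-correct interpolation over a 2x2 quad with a single
   divide.  Inputs: attr/w and 1/w at the quad centre, plus the screen-space
   per-pixel deltas of attr/w. */
static void shade_quad_approx(float aow_c, float daow_dx, float daow_dy,
                              float iw_c, float out[2][2])
{
    float w_c    = 1.0f / iw_c;     /* the single divide (RCP) per quad */
    float attr_c = aow_c * w_c;     /* exact value at the quad centre   */
    /* Crude slope choice: scale the linear slopes of attr/w by the centre's w. */
    float ddx = daow_dx * w_c;
    float ddy = daow_dy * w_c;
    for (int y = 0; y < 2; y++)
        for (int x = 0; x < 2; x++)
            out[y][x] = attr_c + (x - 0.5f) * ddx + (y - 0.5f) * ddy;
}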
 
http://forum.beyond3d.com/showthread.php?t=35443

You'll see a discussion of the plane equation (post 22 onwards) derived from the Oberman presentation/paper on the multifunction interpolator (I don't know of any working web link for those documents :cry: ). ATI used to generate the barycentrics for the SPI to consume (stuffing finalised attributes per fragment into the registers belonging to the fragment), but Evergreen puts them into LDS along with attributes.

Also I briefly referred to the PIOR flag in NVidia for keeping fragments in order or allowing them to go out of order.

Jawed
 
You think so?
Let's say you use Evergreen: if you do 1 pel per shader the extra divisions are de facto free (1 per VLIW, communicated through LDS). Doing a quad per shader (a mismatch between interpolation and pixel shader stage granularity might cause headaches) it's 3 cycles extra ... is that really going to make much of a difference?
 
You'll see a discussion of the plane equation (post 22 onwards) derived from the Oberman presentation/paper on the multifunction interpolator (I don't know of any working web link for those documents :cry: ).
The link to the paper wasn't working this morning, but it is now:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.4275&rep=rep1&type=pdf

I misremembered that paper. I thought A, B, and C were unique to each quad. Since they aren't, it must mean that the U they are interpolating is not the attribute, but instead Attr/w. The final value is arrived at by multiplying by 1/(1/w). So that means at least NVidia isn't doing the mathematical optimization I mentioned.

ATI used to generate the barycentrics for the SPI to consume (stuffing finalised attributes per fragment into the registers belonging to the fragment), but Evergreen puts them into LDS along with attributes.
Yup. I liked the old ATI strategy as it was very clean. No need to store per-vertex attributes near the shaders, often duplicating data across SIMDs, and attributes are usually needed only once early in the shader as initial data so the added register pressure is minimal. This method hurt in some poorly written theoretical tests, but so what?

DX11, however, allows a pixel shader to dynamically decide where the pixel center is for interpolation (centroid vs regular and maybe others). Interpolating before running the shader and stuffing the registers with the results is no longer possible.
 
Let's say you use Evergreen: if you do 1 pel per shader the extra divisions are de facto free (1 per VLIW, communicated through LDS). Doing a quad per shader (a mismatch between interpolation and pixel shader stage granularity might cause headaches) it's 3 cycles extra ... is that really going to make much of a difference?
It's fine for Cypress, but for lower end hardware and especially previous generations where it was all done without the ALUs, doing a divide per pixel can be expensive.
 
After viewport transform and before stream out within the PolyMorph Engine:

GF100 Whitepaper said:
Attribute setup follows, transforming post-viewport vertex attributes into plane equations for efficient shader evaluation.

And from the Arith17 slides "Flow of data for simple per-pixel perspective correct texture lookup and blending":


  • InterpAttr 1/w
  • RCP to form per-pixel w, needs to be ~1 ulp
  • InterpAttr S/w and T/w
  • Multiply S/w and T/w by per-pixel w to form S and T
  • Texture lookup based on S and T
  • InterpAttr R/w, G/w, B/w
  • Multiply by per-pixel w to form per-pixel versions of R,G,B
  • Use FMAD’s to blend texture R,G,B with per-pixel interpolated attribute R,G,B
Jawed
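
Read as straight-line code, that quoted flow might look like this for one pixel. A sketch only: interp_attr() stands in for the multifunction interpolator's plane-equation evaluation A*px + B*py + C, and tex2d() is a dummy placeholder for the texture unit.

typedef struct { float A, B, C; } plane_eq;   /* produced per attribute at setup */

static float interp_attr(plane_eq p, float px, float py)
{
    return p.A * px + p.B * py + p.C;         /* linearly interpolates attr/w */
}

static void tex2d(float s, float t, float rgb[3])  /* placeholder texture unit */
{
    rgb[0] = s; rgb[1] = t; rgb[2] = 1.0f;
}

static void shade_pixel(plane_eq inv_w, plane_eq s_ow, plane_eq t_ow,
                        plane_eq r_ow, plane_eq g_ow, plane_eq b_ow,
                        float px, float py, float out_rgb[3])
{
    float one_over_w = interp_attr(inv_w, px, py);  /* InterpAttr 1/w           */
    float w = 1.0f / one_over_w;                    /* RCP to form per-pixel w  */
    float S = interp_attr(s_ow, px, py) * w;        /* InterpAttr S/w, then *w  */
    float T = interp_attr(t_ow, px, py) * w;        /* InterpAttr T/w, then *w  */
    float tex_rgb[3];
    tex2d(S, T, tex_rgb);                           /* texture lookup on S, T   */
    float R = interp_attr(r_ow, px, py) * w;        /* InterpAttr R/w, G/w, B/w */
    float G = interp_attr(g_ow, px, py) * w;        /*   then *w per pixel      */
    float B = interp_attr(b_ow, px, py) * w;
    out_rgb[0] = tex_rgb[0] * R;                    /* blend: a simple modulate */
    out_rgb[1] = tex_rgb[1] * G;                    /* here; the slide just     */
    out_rgb[2] = tex_rgb[2] * B;                    /* says "FMADs"             */
}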
 