Triangle setup: what exactly is it?

I have been thinking about the triangle setup stage of the pipeline for a while now, and I am curious why it has not been sped up beyond 1 tri/clk. Even GF100's 4 tris/clk seems rather low when ALUs, bandwidth, etc. have increased by two orders of magnitude over the last decade.

A little bit of googling led me to this
http://www.extremetech.com/article2/0,2845,1155159,00.asp

It seems to suggest that triangle setup is just the calculation of slopes. That means 2 subtractions, and one division. Big deal. All of it costs 3 flops so far. Let's multiply that by 10. So 30 flops for one edge. 90 flops for 1 triangle. Let's make it 100 flops per triangle.

Cypress can do 1600 FMAs per clock. So, even with this loose estimate, it should be able to do 16 tris per clock. :eek:
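Just to make the back-of-the-envelope concrete, here is roughly what the "slopes only" view of setup amounts to in code (struct layout and names are made up purely for illustration):

Code:
// Minimal sketch of the "slopes only" view of setup from the article:
// 2 subtractions and 1 division per edge, as in the estimate above.
struct Vec2 { float x, y; };

struct EdgeSlopes {
    float dxdy[3];   // x step per scanline for each of the 3 edges
};

EdgeSlopes setupSlopes(const Vec2 v[3])
{
    EdgeSlopes e;
    for (int i = 0; i < 3; ++i) {
        const Vec2& a = v[i];
        const Vec2& b = v[(i + 1) % 3];
        e.dxdy[i] = (b.x - a.x) / (b.y - a.y);   // real hardware guards dy == 0
    }
    return e;
}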

With this kind of disparity between the fixed-function rate and what the shaders could do, I wonder why setup has not been made into just another kernel. I am sure the disparity (math-wise) is there in GF100 too. So why have 4 hardware setup units?

What am I missing? :oops:
 
The various APIs require that triangles be rendered in the order in which they were submitted by the application. So if the application draws triangles A, B and then C, then the GPU must make it look like, regardless of any internal parallelism, triangle A is drawn first, then B, and then C.

There are several ways of achieving this, with various pros and cons for each way. The one chosen by IHVs so far has been to process triangles in order through Setup, avoiding a costly pixel-level or sample-level sorting.

Having Setup process multiple triangles simultaneously becomes complicated because of the ordering constraint.

The actual computation part of Setup is, of course, parallelizable, as demonstrated by gf100.
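As a toy illustration of that constraint (Triangle, SetupOutput, setupTriangle() and emitToRasterizer() are hypothetical stubs, not anything real): the math can be farmed out, but the hand-off downstream is the serialising point.

Code:
// Illustration only: the per-triangle math can run out of order, but the
// results must be handed downstream in submission order.
#include <future>
#include <vector>

struct Triangle    { /* post-VS vertex data */ };
struct SetupOutput { /* edge equations, plane equations, etc. */ };

SetupOutput setupTriangle(const Triangle& t);     // order-independent math
void emitToRasterizer(const SetupOutput& s);      // consumer expects API order

void setupBatchInOrder(const std::vector<Triangle>& tris)
{
    std::vector<std::future<SetupOutput>> pending;
    pending.reserve(tris.size());

    // Kick off setup for the whole batch; completion order is arbitrary.
    for (const Triangle& t : tris)
        pending.push_back(std::async(std::launch::async,
                                     [&t] { return setupTriangle(t); }));

    // Drain results strictly in submission order, stalling on the oldest
    // triangle if it isn't finished yet -- this is the serialising point.
    for (auto& p : pending)
        emitToRasterizer(p.get());
}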
 
What am I missing?

That a ton of that power goes toward pixel shaders. I suppose the question on my mind is why the prioritization in favor of pixel shaders as opposed to geometry? But that's likely an easy answer: memory bandwidth... and dealing with the inefficiencies of micropoly-sized geometry.

Back when the prioritization likely started to manifest, we couldn't draw tens of thousands of polys per game character/object, but we wanted to. The closest we could get were hacks via pixel shaders, which were easier to achieve than the huge buses, registers and, back then, unfathomable quantities of RAM we'd need to do it all in geometry. That's just my theory, though.

Today we're playing catch-up. Hardware has gotten a lot more general than it was; we can use our processing power for a lot more now (pixels, verts, et cetera), and the hardware is more appropriate than it *was* for supporting more geometry... but we're still only now reaching the point where we could theoretically rasterize nearly whatever we please. We still can't feed hundreds of gigabytes of vertex data and hundreds of gigabytes of pixel data to the GPU per second, so we work around it: we chuck over simplified geometry, tessellate up to our higher-density mesh, displace to approximate what we wanted in the first place, and call it a day.

Makes me wonder how things will look in five years. Will we see a focus on parallel geometry engines and much more capable data buses? Where exactly will the designs go, which direction?
 
It seems to suggest that triangle setup is just the calculation of slopes. That means 2 subtractions, and one division. Big deal. All of it costs 3 flops so far. Let's multiply that by 10. So 30 flops for one edge. 90 flops for 1 triangle. Let's make it 100 flops per triangle.

It's not just the position though. You also need to calculate the interpolated color, Z, texture coordinates, and/or any other per-vertex parameters. The interpolation of Z needs a hyperbolic interpolator, and the interpolation of texture coordinates needs doubly hyperbolic interpolators.

Also you shouldn't count one division as one "flop" because division is far more costly than other flops (add and mul).
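A rough sketch of what the perspective-correct part adds, assuming a single texture coordinate pair per vertex (names and layout are illustrative, not any particular hardware): there's a divide per vertex during setup, plus a reciprocal per covered pixel.

Code:
// Perspective-correct ("hyperbolic") interpolation works on attribute/w and
// 1/w, and needs a reciprocal per pixel on top of the per-vertex divide.
struct Vertex { float x, y, w, u, v; };          // one texture coordinate pair

struct PerspAttrib { float u_over_w, v_over_w, one_over_w; };

PerspAttrib prepareVertex(const Vertex& vtx)
{
    PerspAttrib p;
    p.one_over_w = 1.0f / vtx.w;                 // per-vertex division
    p.u_over_w   = vtx.u * p.one_over_w;
    p.v_over_w   = vtx.v * p.one_over_w;
    return p;
}

// After linearly interpolating the three PerspAttrib fields across the
// triangle, each covered pixel still needs its own reciprocal:
float recoverU(const PerspAttrib& lerped)
{
    return lerped.u_over_w / lerped.one_over_w;
}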
 
Traditionally that's seen as part of rasterization and not setup. For the most part still true, but tile level Z interpolation is part of "modern" setup.
 
It's not just the position though. You also need to calculate the interpolated color, Z, texture coordinates, and/or any other per-vertex parameters. The interpolation of Z needs a hyperbolic interpolator, and the interpolation of texture coordinates needs doubly hyperbolic interpolators.

Vertex attribute interpolation is already done in ALUs, so can't be a big deal. I am thinking from the POV of running setup in shader code.

Also you shouldn't count one division as one "flop" because division is far more costly than other flops (add and mul).

I left more than an OoM margin for this and other details.
 
Vertex attribute interpolation is already done in ALUs, so can't be a big deal. I am thinking from the POV of running setup in shader code.

Well, it's still possible to do that, of course. To my understanding, Larrabee has no triangle setup engine and setup is done in software, utilizing LNI.
 
The patents are extremely sparse (there is some old stuff, but it precedes unified pipelines). Generally though, if you want to do parallel setup you will at the very least need to cut up the hierarchical-Z stage into extra separate portions, and you will need slightly longer queues after the vertex shader to be able to feed those roughly equally. Apart from that I don't see any big overhead.

Of course following this argument you could conclude that I think ATI has got most of the hardware necessary to double their setup speed already, which is exactly what I think.
 
Bob's already said it.

Setup isn't the sole triangle-ordering synchronisation point though. Each fragment takes an indeterminate time to shade, allowing fragments from newer triangles to theoretically overtake fragments from older triangles (if the pixel shading load-balancer allows it). But triangle order has to be respected when fragments are written to the render target. (In general, there are times when this isn't an issue.)

(ATI pools completed fragments post-shading to ensure correct ordering, I think. NVidia scoreboards pixel shading hardware threads, I believe, to prevent overtaking.)
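(Conceptually, the scoreboarding side of it looks something like this toy sketch, which is an illustration of the idea rather than either vendor's actual mechanism:)

Code:
// A fragment for pixel (x, y) from a newer triangle may not launch while an
// older fragment for the same pixel is still in flight.
#include <cstdint>
#include <unordered_set>

class PixelScoreboard {
    std::unordered_set<uint32_t> inFlight;        // pixels with a pending fragment
    static uint32_t key(uint16_t x, uint16_t y) { return (uint32_t(y) << 16) | x; }
public:
    bool canLaunch(uint16_t x, uint16_t y) const { return inFlight.count(key(x, y)) == 0; }
    void launch(uint16_t x, uint16_t y)          { inFlight.insert(key(x, y)); }
    void retire(uint16_t x, uint16_t y)          { inFlight.erase(key(x, y)); }  // older work done, newer may proceed
};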

If you have a screen-space tiled rasteriser then you need to ensure correct triangle order per tile, rather than globally, which simplifies things a little. If you have a geometry-binned tiled rasteriser then it's even easier.

So the crux of this, as I see it, is how much communication and what kind of data is communicated, if either: setup is distributed, or if a single setup distributes work to all rasterisers. Whether setup is a kernel or fixed-function isn't a particularly big deal (see Larrabee). The coherency/communication aspects are the issue.

Larrabee minimises setup/triangle synchronisation issues by incurring the communication costs of binning triangles to video memory: coarse rasterisation into screen-space tiles, writing bins of geometry + meta-data and then reading them back later. This is a win because these bins are much cheaper on memory than the random nonsense GPUs engage in, repeatedly RMW'ing bits of a render target.
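A toy version of that binning pass might look like the following (fixed-size square tiles, bounding-box coarse rasterisation; all names illustrative). Per-bin order still matches submission order because triangles are binned in order.

Code:
#include <algorithm>
#include <cstdint>
#include <vector>

struct Bounds { float minX, minY, maxX, maxY; };   // post-transform screen bounds

std::vector<std::vector<uint32_t>> binTriangles(
    const std::vector<Bounds>& tris, int width, int height, int tileSize)
{
    const int tilesX = (width  + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<uint32_t>> bins(tilesX * tilesY);

    for (uint32_t i = 0; i < tris.size(); ++i) {
        const Bounds& b = tris[i];
        int tx0 = std::max(0, int(b.minX) / tileSize);
        int ty0 = std::max(0, int(b.minY) / tileSize);
        int tx1 = std::min(tilesX - 1, int(b.maxX) / tileSize);
        int ty1 = std::min(tilesY - 1, int(b.maxY) / tileSize);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(i);   // triangle index + meta-data
    }
    return bins;
}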

When triangle order doesn't matter, e.g. when writing a shadow map, NVidia's distributed setup doesn't need to synchronise. GF100 should have massive throughput for shadow rendering with the nasty little triangles you get from tessellation :D

(I wonder if it's shadow buffer generation that leads to the 15% setup dependency that's reported in the B3D article for HD5870.)

Jawed
 
The various APIs require that triangles be rendered in the order in which they were submitted by the application. So if the application draws triangles A, B and then C, then the GPU must make it look like, regardless of any internal parallelism, triangle A is drawn first, then B, and then C.

What about tile based deferred renderers? Don't they process the triangles in the setup stage out of order?
 
What about tile based deferred renderers? Don't they process the triangles in the setup stage out of order?

The point is to produce a result consistent with what would be produced if they were processed in the correct order. Generally, in a deferred renderer, if alpha blending is enabled, it has to maintain an ordering table for each pixel to make sure the triangles are drawn in the correct order.
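Something along these lines, conceptually (names and layout are illustrative, not any shipping TBDR): opaque fragments can be resolved with a depth test alone, but blended fragments have to be replayed per pixel in submission order.

Code:
#include <algorithm>
#include <cstdint>
#include <vector>

struct BlendedFrag {
    uint32_t submitIndex;    // position in the original draw order
    float    depth;
    uint32_t color;          // packed RGBA
};

// Called once per pixel at resolve time with that pixel's surviving
// translucent fragments.
void resolveBlended(std::vector<BlendedFrag>& frags)
{
    // Whatever order the tile processed triangles in, blending is applied
    // as if they arrived in submission order.
    std::sort(frags.begin(), frags.end(),
              [](const BlendedFrag& a, const BlendedFrag& b) {
                  return a.submitIndex < b.submitIndex;
              });
    for (const BlendedFrag& f : frags) {
        // ...depth test and blend f.color over the pixel's current value...
        (void)f;
    }
}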
 
ATI pools completed fragments post-shading to ensure correct ordering, I think. NVidia scoreboards pixel shading hardware threads, I believe, to prevent overtaking.
AFAICS, regardless of which end of the pixel shader you ensure sequencing at, you will need a scoreboard ... except that if you do it at the end you need a counter instead of a single bit.
 
I am trying to work out the impediments in making setup yet-another-kernel. What do you think they are?
Well, you need a 16-banked hierarchical-Z cache/scoreboard for each SIMD engine you wanted to run it on (which would probably be only a couple of them, so it wouldn't be an entirely unified architecture any more).
 
AFAICS, regardless of which end of the pixel shader you ensure sequencing at, you will need a scoreboard ... except that if you do it at the end you need a counter instead of a single bit.
Really it depends on how SIMDs share screen-space tiles, e.g. in ATI 10 SIMDs can all be working on fragments for a common tile, I believe. So put the fragment pool in the shader export block and then triangle-ID tracking is in one place (or 2 in Cypress, I suppose).

Someone should dig into NVidia's old patents to see how rasterisation/tiling is handled.

Jawed
 
Well, you need a 16-banked hierarchical-Z cache/scoreboard for each SIMD engine you wanted to run it on (which would probably be only a couple of them, so it wouldn't be an entirely unified architecture any more).
Why complicate setup by doing early-Z concurrently? Particularly if there are multiple early-Z buffers as well as multiple instances of the setup kernel?

Jawed
 