Triangle size vs. performance

Recently I made a post about LOD on gamedev.net; the issue of triangle size came up, and I was directed to this webpage:

http://www.g-truc.net/post-0662.html

I was under the impression that triangles were rasterized into quads, and that the quads (or pixels) were then batched up into warps/wavefronts to be scheduled for execution. However, the performance drops seen by the author of that page surprised me and seem to run counter to what I thought. Could anyone take the time to explain the mechanics behind the performance drops the author measured?

Thanks in advance for any help.
 
The webpage you linked does explore some answers to your question. I'm not the best qualified to answer further and others will probably correct/add to this, but I'll give it a shot anyway.

Basically, there are a bunch of other pieces of the rendering pipeline outside of the ones you mention, and they can cause bottlenecks. For example:

- Every triangle needs triangle setup, which calculates the edge equations for the triangle. AFAIK modern GPUs have a triangle setup rate of 2 triangles per cycle. That's far below the peak pixel output provided by 32 ROPs, which is attainable with simple enough shaders: with one-pixel triangles, setup alone caps you at roughly 2 pixels per clock.
- Triangles need to be rasterized, and generally barycentric coordinates need to be calculated (I'm not sure whether that's skipped when the fragment shader has no varyings; it's possible the same hardware always does both). Rasterization determines where the edges of the triangle are using the edge equations. Performing this over adjacent pixels allows for some simplifications over a fully general formula, so rasterization tends to take place over some grid of pixels for the same triangle (see the sketch after this list).
- The GPU has a hard limit on how fast it can process primitive commands, and AFAIK you can't get near the 2 triangles/clock limit with separate draw commands. Tessellation is one way to break this barrier, but the linked example doesn't use it.
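
To make the "edge equations over a grid of pixels" part a bit more concrete, here's a minimal CPU-side sketch of edge-function rasterization that visits the screen in 2x2 quads. It's only a mental model; real hardware uses fixed-point snapping, hierarchical traversal and so on, and every name in it is made up for illustration:

```cpp
// Minimal CPU sketch of edge-function rasterization over 2x2 quads.
// Purely illustrative: real GPUs use fixed-point coordinates, tiled/
// hierarchical traversal, etc. All names here are invented for this example.
#include <algorithm>
#include <cstdio>

struct Vec2 { float x, y; };

// Edge function: sign tells which side of the directed edge a->b a point is on.
static float edgeFn(const Vec2& a, const Vec2& b, float px, float py) {
    return (px - a.x) * (b.y - a.y) - (py - a.y) * (b.x - a.x);
}

// Rasterize one triangle, visiting pixels in 2x2 quads and reporting the
// per-quad coverage mask (bit i set = pixel i of the quad is inside).
// Vertices are assumed to be ordered so all edge functions are >= 0 inside.
static void rasterizeTriangle(const Vec2& v0, const Vec2& v1, const Vec2& v2) {
    // "Triangle setup": bounding box, snapped to even coordinates so every
    // visited quad is aligned the way quad-based shading would expect.
    int minX = (int)std::min({v0.x, v1.x, v2.x}) & ~1;
    int minY = (int)std::min({v0.y, v1.y, v2.y}) & ~1;
    int maxX = (int)std::max({v0.x, v1.x, v2.x});
    int maxY = (int)std::max({v0.y, v1.y, v2.y});

    for (int y = minY; y <= maxY; y += 2) {
        for (int x = minX; x <= maxX; x += 2) {
            unsigned mask = 0;
            for (int i = 0; i < 4; ++i) {           // the 4 pixels of the quad
                float px = x + (i & 1) + 0.5f;      // sample at the pixel centre
                float py = y + (i >> 1) + 0.5f;
                bool inside = edgeFn(v0, v1, px, py) >= 0.0f &&
                              edgeFn(v1, v2, px, py) >= 0.0f &&
                              edgeFn(v2, v0, px, py) >= 0.0f;
                if (inside) mask |= 1u << i;
            }
            if (mask)   // quad touches the triangle -> it would be shaded
                std::printf("quad (%d,%d) coverage mask %x\n", x, y, mask);
        }
    }
}

int main() {
    rasterizeTriangle({1.0f, 1.0f}, {2.0f, 7.0f}, {6.0f, 2.0f});
}
```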

There are probably some other things. But the example given only really exposes this because it has such an unrealistically low shader load; with more intensive shaders it's possible to have a lot of small triangles and still be shader limited. The quad granularity you mentioned does follow you throughout the pipeline, though, because fragment shaders are executed in 2x2 quads.
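
To put a rough number on that quad granularity (my own back-of-the-envelope, not taken from the article):

```cpp
// Back-of-the-envelope quad efficiency (illustrative only): a triangle
// covering 'covered' pixels that touches 'quads' 2x2 quads still pays for
// 4 lanes per quad, helper lanes included.
#include <cstdio>

int main() {
    int covered = 3;              // e.g. a tiny 3-pixel triangle
    int quads   = 2;              // spread over two 2x2 quads
    int shadedLanes = quads * 4;  // lanes actually launched for shading
    std::printf("useful work: %d / %d lanes (%.0f%%)\n",
                covered, shadedLanes, 100.0 * covered / shadedLanes);
    // -> 3 / 8 lanes (38%); large triangles approach 100%.
}
```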
 
I think the reason "super-tiles" (larger than 2x2) are needed is the z-buffer tiling and maybe the MSAA tiles. You couldn't manage the case of multiple concurrent 2x2 quads, all with the same plane equation, wanting to inject themselves into the same z-buffer tile. In the worst case (you could allow it by locking the tile while it's written to) this creates edge cases (not to mention corner cases, literally) in the z-tile representation that wouldn't otherwise exist, while the tile is sequentially updated by fragments belonging to the same triangle. That can tip the z-buffer compression into uncompressed mode, from which you can't reconstruct the plane equation and revive the z-tile once the whole triangle is done. You can't even really say when the 2x2 quads of a rasterized triangle finish/converge if you allow the rasterizer to work asynchronously while still supporting the z-buffer.

If no z-buffer is involved it could be way faster, but the hardware rasterizer is probably very much optimized for the z-buffered case. A rasterizer like that (no z-buffer) would be easier to build in compute, at competitive speed.
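
To make the compressed/uncompressed distinction a bit more concrete, here's a toy model of a plane-equation compressed z-tile. The real formats are undocumented, so every field and rule in here is a guess; it's only meant to show why a fragment from a "foreign" plane forces the tile to expand:

```cpp
// Toy model of a plane-equation compressed depth tile (8x8 pixels here).
// Real HTILE/z-compression formats are undocumented; this only sketches the
// general idea. Depth testing is omitted for brevity, and the tile is
// assumed to already hold the plane of the triangle currently covering it.
#include <array>

struct ZPlane { float dzdx, dzdy, z0; };     // z = z0 + x*dzdx + y*dzdy

struct ZTile {
    bool compressed = true;
    ZPlane plane{};                          // valid only while compressed
    std::array<float, 8 * 8> z{};            // valid only when uncompressed

    static bool samePlane(const ZPlane& a, const ZPlane& b) {
        return a.dzdx == b.dzdx && a.dzdy == b.dzdy && a.z0 == b.z0;
    }

    void writeFragment(int x, int y, float zFrag, const ZPlane& srcPlane) {
        if (compressed && samePlane(srcPlane, plane))
            return;                          // fragment lies on the stored plane
        if (compressed) {
            // A fragment from a different plane arrived: expand the tile.
            for (int py = 0; py < 8; ++py)
                for (int px = 0; px < 8; ++px)
                    z[py * 8 + px] = plane.z0 + px * plane.dzdx + py * plane.dzdy;
            compressed = false;              // hard to ever re-compress after this
        }
        z[y * 8 + x] = zFrag;
    }
};

int main() {
    ZTile tile;
    tile.plane = {0.001f, 0.002f, 0.5f};                          // one plane covers the tile
    tile.writeFragment(3, 3, 0.5f + 3 * 0.001f + 3 * 0.002f, tile.plane);  // stays compressed
    tile.writeFragment(4, 4, 0.4f, {0.0f, 0.0f, 0.4f});           // different plane -> expand
}
```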
 
You couldn't manage the case of multiple concurrent 2x2 quads, all with the same plane equation, wanting to inject themselves into the same z-buffer tile.

You mean from different triangles right?

-------------------------

BTW I didn't even consider that his testing methodology might be wrong... any thoughts about the possibility?
 
Not only different ones. If you crack the triangle into independent sub-regions (say 2x2) before they hit the larger z-tile, in the hope of gaining some concurrency, you'd have a hard time determining whether they belong to the same data or not, and you'd also have to stitch the pieces back together even if you did know. That's an unnecessary nightmare.
And that's without even thinking about depth-writing pixel shaders: say 8x8 = 64 fragments from 64 different triangles in flight, possibly all concurrently wanting to access the same z-tile and failing, so that the tile could stay compressed instead of going uncompressed.

The rasterizer is effectively tiled at the largest tile size any of the participating sub-systems employs, and it serializes the binned triangle streams per tile. I'm sure the hardware designers do everything they can to make the best of this inconvenient bottleneck, by maintaining as much shared calculated data as possible, which in turn makes the system impossible to configure/adapt at runtime, i.e. inflexible.
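
A very rough way to picture "binning and serializing per tile", purely as a mental model and not a description of any actual chip:

```cpp
// Toy picture of tiled binning: triangles are sorted into the screen tiles
// they touch, and each tile then consumes its list strictly in submission
// order. The tile size and everything else here is an arbitrary assumption.
#include <cstdio>
#include <vector>

constexpr int kTileSize  = 8;   // assume the largest participating tile (e.g. the z-tile)
constexpr int kTilesWide = 4;
constexpr int kTilesHigh = 4;

struct Tri { int id; int minX, minY, maxX, maxY; };  // screen-space bounds

int main() {
    std::vector<Tri> tris = { {0, 0, 0, 10, 10}, {1, 6, 6, 20, 12}, {2, 25, 25, 30, 30} };

    // Bin: each tile collects the IDs of the triangles overlapping it, in order.
    std::vector<int> bins[kTilesWide * kTilesHigh];
    for (const Tri& t : tris)
        for (int ty = t.minY / kTileSize; ty <= t.maxY / kTileSize; ++ty)
            for (int tx = t.minX / kTileSize; tx <= t.maxX / kTileSize; ++tx)
                bins[ty * kTilesWide + tx].push_back(t.id);

    // "Serialize per tile": within one tile, triangles are handled one after
    // another, which is what keeps the z-tile updates well ordered.
    for (int i = 0; i < kTilesWide * kTilesHigh; ++i) {
        if (bins[i].empty()) continue;
        std::printf("tile %2d:", i);
        for (int id : bins[i]) std::printf(" tri%d", id);
        std::printf("\n");
    }
}
```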

It's interesting to note that vertex-stream optimization for better cache behaviour is in conflict with concurrent z-tile accesses. If you submit strips you lower memory latency but raise the locking potential; if you try to never submit adjacent triangles, for maximum rasterization throughput, you get more latency. It might not be very pronounced yet, but it possibly will be the day we have num(CU) == num(Rasterizer). What are the strategies going to be for achieving maximum rasterizer occupation?
 
I would agree with Ethatron. The depth compression hardware of modern GPUs is quite complex and works in larger macro blocks. This hardware is optimized for certain patterns, and tiny one-pixel triangles surely aren't a priority.

PC DirectX doesn't allow you to access the compressed depth (plane) data structures directly, but consoles and Mantle do (according to the DICE Mantle presentation, HTILE access is part of the API). Once Mantle is no longer under NDA we should have full public documentation about these things, and we could have more productive discussions as well.

The newest incarnations of Maxwell (v2) and GCN (v2) also support delta color compression to save backbuffer bandwidth. I am quite sure this hardware is also block based and designed for similar usage patterns as the depth compression. Too bad neither company has given us any technical details about their algorithms.
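
Just to illustrate the general idea of block-based delta compression (and again, this is purely a guess at the concept, since neither vendor has documented the real format):

```cpp
// Toy illustration of block-based delta color compression: store one anchor
// value and per-pixel deltas; the block only stays "compressed" if every
// delta fits in a small number of bits. The actual Maxwell/GCN schemes are
// undocumented, so block size, delta width, etc. are arbitrary here.
#include <cstdint>
#include <cstdio>

// Can an 8-pixel block of 8-bit channel values be stored as anchor + 4-bit
// signed deltas? If not, it would have to stay uncompressed.
bool blockCompresses(const uint8_t px[8]) {
    int anchor = px[0];
    for (int i = 1; i < 8; ++i) {
        int delta = px[i] - anchor;
        if (delta < -8 || delta > 7)   // doesn't fit in 4 signed bits
            return false;
    }
    return true;
}

int main() {
    const uint8_t smooth[8] = {120, 121, 122, 122, 123, 124, 125, 126};
    const uint8_t noisy[8]  = {120,  10, 240,  55, 200,   3, 180,  90};
    std::printf("smooth block compresses: %d\n", blockCompresses(smooth));
    std::printf("noisy block compresses:  %d\n", blockCompresses(noisy));
}
```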
 
you'd have a hard time determining whether they belong to the same data or not, and you'd also have to stitch the pieces back together even if you did know. That's an unnecessary nightmare.
Is it really that bad? (I'd think about it myself but don't have time right now.)

The rasterizer is effectively tiled at the largest tile size any of the participating sub-systems employs, and it serializes the binned triangle streams per tile. I'm sure the hardware designers do everything they can to make the best of this inconvenient bottleneck, by maintaining as much shared calculated data as possible, which in turn makes the system impossible to configure/adapt at runtime, i.e. inflexible.
His GTX 680 numbers agree with what you're saying (for 8x8 tiles with 1, 4, 16 pixel quads), but not the other two cards. Also, any ideas on the non-square tile performance differentials? The only thing I could think of was render-target swizzling on uncompressed tiles.
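
For reference, one common swizzle is Morton/Z-order, where the x and y bits are interleaved so that small square neighbourhoods stay contiguous in memory. Whether these particular cards lay out uncompressed tiles that way is pure speculation on my part, but it does show why wide-vs-tall footprints can hit memory differently:

```cpp
// Morton/Z-order address sketch: interleave x and y bits so square
// neighbourhoods map to contiguous addresses. Whether the cards in question
// actually swizzle uncompressed render-target tiles like this is an
// assumption; the point is only that horizontal and vertical pixel runs end
// up with different address strides.
#include <cstdint>
#include <cstdio>

static uint32_t interleaveBits(uint32_t v) {     // spread 16 bits to even positions
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

static uint32_t mortonAddress(uint32_t x, uint32_t y) {
    return interleaveBits(x) | (interleaveBits(y) << 1);
}

int main() {
    // A horizontal 4-pixel run vs a vertical one under the swizzle:
    for (uint32_t x = 0; x < 4; ++x)
        std::printf("(%u,0) -> %u\n", x, mortonAddress(x, 0));   // 0, 1, 4, 5
    for (uint32_t y = 0; y < 4; ++y)
        std::printf("(0,%u) -> %u\n", y, mortonAddress(0, y));   // 0, 2, 8, 10
}
```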
 