Think harder
Yeah, the limit for maximum tessellation is 2 triangles per vertex, but lower tessellation levels (sporadic use or adaptive) make that lower. I suppose it's fair to say that in fact it will tend to be higher, because patches that are to be tessellated are likely to be parts of continuous meshes, such as characters or terrain or buildings, rather than bitty things dotted about the frame.
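As a rough sanity check on the 2 triangles per vertex figure, here is a minimal sketch that counts vertices and triangles for a single, isolated patch cut into a uniform grid. It assumes plain uniform subdivision with an integer factor n on every edge, which is not the actual D3D11 tessellator pattern, and the function names are made up for illustration.

```python
def quad_patch_counts(n):
    """Vertices and triangles for one quad patch cut into an n x n grid.

    Assumption: uniform subdivision, integer factor n on every edge.
    """
    vertices = (n + 1) ** 2   # (n + 1) points along each side
    triangles = 2 * n * n     # each of the n * n cells splits into 2 triangles
    return vertices, triangles


def tri_patch_counts(n):
    """Vertices and triangles for one triangle patch, uniform factor n."""
    vertices = (n + 1) * (n + 2) // 2   # triangular number of grid points
    triangles = n * n
    return vertices, triangles


for n in (1, 4, 16, 64):
    qv, qt = quad_patch_counts(n)
    tv, tt = tri_patch_counts(n)
    print(f"n={n:2d}  quad: {qt / qv:.3f} tris/vert   tri: {tt / tv:.3f} tris/vert")
```

For an isolated patch the ratio climbs towards 2 as n grows but never reaches it, and at low factors it is well below 2. Vertices shared along the edges of neighbouring patches in a continuous mesh would push the effective ratio closer to 2.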
Well that sort of kills my theory. But why do you think Damien found that if you need to access multiple vertices in a shader the speed goes down?
Depends how the data's packed I suppose. I've never really spent any time looking at VS and my efforts to find representative HS and DS shaders have so far come to naught. I should try the DX SDK.
I know, but the point I was making is that an iterative tessellation algorithm doesn't make sense. Even if it was used, it doesn't even explain 1 vert every 6 clocks, because you get lots of verts with each pass.
I think TS has a vertex-centric view with the triangles coming out in the wash, because tessellation factors are per edge, not per triangle.
I don't see the savings, nor do I see any explanation for 1 vert every 3 or 6 clocks
I don't really understand the mechanics of the Xenos/R600 tessellator, and why the D3D11 tessellator is "materially" different to the extent that Evergreen has two distinct tessellators, one for each type of work.
Perhaps the Xenos tessellator is a per-triangle tessellator, and the limit case of 2 triangles per vertex results in the halving of throughput (since vertex rate determines triangle setup rate, rather than vice-versa)? But the D3D11 tessellator, which has to support much higher amplification, reverts to a slower algorithm that's edge-centric?
It's a very different argument when you consider scale. 512 fragments/clk including all the compressed encoding/decoding, Z loading, HiZ, etc. is way, way more expensive than generating a pair of [0..1] numbers every two clocks.
Exactly. Generating 10 million visible triangles per frame whose average area is <1 pixel requires vastly more hardware support in the rasterisation/fragment shading/back-end part of the GPU, and is therefore pointless.
I can see how increasing setup/culling speed is a bit of a pain due to the implications throughout the pipeline, but this is just laziness. Tessellation is a self-contained fixed function unit. Factors go in, coordinates come out.
And most of a GPU's work is due to an order of magnitude of amplification derived from those resulting triangles.
I really think we'll find that extreme tessellation (sub-pixel triangles) on ATI is slow because hardware threads are basically empty and the rasteriser is mostly shooting blanks. I also raise a question on PTVC, and I suspect a limit of 512 hardware threads in flight also plays a significant part.
If you look back at old presentations about ATI tessellation (R600 era) you'll find hundreds of fps being quoted for 1 or more million triangles. Those numbers simply don't jibe with Heaven and I think it's down to the sub-pixel triangles (whose pixel shaders are long-winded), not the absolute count of them - since we know that Heaven's absolute triangle counts are low (i.e. < 2 million - though obviously some multiple of that for extreme mode).
No, that's not what I'm talking about. I'm saying that the number of triangles output by this SDK sample is a lot lower than a game would need. The 35% and 11% numbers aren't very useful.
It's not triangle count that matters, it's area per triangle. Any kind of adaptive algorithm is going to stop before it paints multiple triangles per fragment. Otherwise just delete the MSAA hardware and totally re-model the fragment shading part of the fixed-function GPU architecture.
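To put a number on area per triangle, here is a back-of-envelope sketch. The 1920x1200 resolution and the `overdraw` parameter are my own assumptions for illustration, not figures from any benchmark, and the function name is hypothetical.

```python
def avg_pixels_per_triangle(width, height, visible_triangles, overdraw=1.0):
    """Average screen area (in pixels) covered by each visible triangle.

    Assumptions: every visible triangle lands on screen, and `overdraw`
    is a guessed average number of layers covering each pixel.
    """
    return width * height * overdraw / visible_triangles


# Hypothetical 1920x1200 frame; the triangle counts are the round
# figures mentioned in this thread (2 million and 10 million).
for tris in (2_000_000, 10_000_000):
    area = avg_pixels_per_triangle(1920, 1200, tris)
    print(f"{tris:>10,} triangles -> {area:.2f} pixels per triangle")
```

Under those assumptions 10 million visible triangles average well under one pixel each, while 2 million still average slightly more than one pixel, which is why the area per triangle matters more than the headline count.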
Which, incidentally, makes me wonder about GTX480's rasterisers:
http://forum.beyond3d.com/showpost.php?p=1414371&postcount=5449
If MSAA is killing GTX480's performance (or is it the AF?...), is there some interaction there between Z rasterisation and tessellation?
Though I remain cautious about the hardwareluxx graph until someone reproduces that HD5870 result, as the results/graphs here:
http://www.pcgameshardware.com/aid,...Fermi-performance-benchmarks/Reviews/?page=16
don't tally.
Jawed