In general, with lower tessellation factors and anisotropy, you'll get notably less than 2 triangles per extra vertex. Go count some tessellated patches' triangles and vertices if you don't believe me. Earlier, I counted one tessellated patch with 146 triangles and 87 vertices.
That's because you're double-counting the edge vertices. They get reused in adjacent patches. Okay, I'll agree that you can't cache the verts on every edge, so it'll be slightly lower than two, but that's it.
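For anyone who wants to check the 146/87 numbers, here is a rough sketch in Python. It assumes the patch is a simple triangulated disk (no holes), so Euler's formula gives T = 2V - B - 2, with B the number of boundary vertices. The function names and the `shared_fraction` parameter are made up for illustration. Setting it to 0.5 assumes every edge vertex is shared by exactly two patches and fully reused, which is optimistic.

```python
def boundary_vertices(triangles: int, vertices: int) -> int:
    """Boundary vertex count implied by T = 2V - B - 2 (triangulated disk, an assumption)."""
    return 2 * vertices - triangles - 2


def triangles_per_vertex(triangles: int, vertices: int, shared_fraction: float = 0.0) -> float:
    """Triangles per vertex when a fraction of each boundary vertex is 'paid for' by neighbours.

    shared_fraction = 0.0 charges every boundary vertex to this patch alone;
    shared_fraction = 0.5 assumes each boundary vertex is split with one neighbouring patch.
    """
    boundary = boundary_vertices(triangles, vertices)
    interior = vertices - boundary
    effective = interior + boundary * (1.0 - shared_fraction)
    return triangles / effective


if __name__ == "__main__":
    print(boundary_vertices(146, 87))
    print(triangles_per_vertex(146, 87))        # patch counted in isolation
    print(triangles_per_vertex(146, 87, 0.5))   # edge vertices shared with a neighbour
```

Under those assumptions the patch has 26 edge vertices. Counted in isolation the ratio is about 1.68; with the edge vertices split between two patches it comes out at about 1.97.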
I don't remember hearing about face factor
Well, it's there: SV_InsideTessFactor. There's nothing weird about it. You're right, that image is using the same tessellation factor for all edges on a mesh, so it's not going to give you any idea of what fancy tessellation structures look like.
AMD says that the architecture's comfort zone bottoms out at 8 fragments per clock, in effect. It's not "my scenario", it's how the hardware works.
It has nothing to do with comfort zone. If you have tiny triangles, all samples fit in one quad most of the time. If you can only feed one triangle every three clocks, then you will only get one quad into the shading engine every three clocks.
The rasteriser is the bottleneck on hardware thread generation, I presume: a new hardware thread can be started once every 4 cycles per group of 10 SIMDs - and only one SIMD can start a hardware thread in any 4 cycle window, when setup is exporting 1-pixel triangles.
TS, in this scenario, is producing triangles every 3 cycles. So the SIMDs can't go any faster. They're starved by the huge granularity of rasterisation and thread generation, not by lack of triangles.
You're confusing threads with quads or doing something else. One thread every four clocks is the max you will ever need because a rasterizer can only generate 4 quads per clock at most and a thread has 16 quads in it. If you have one-pixel triangles being fed to each rasterizer once every six clocks (on average), then each rasterizer will only produce a quad once every six clocks and sit idle 83% of the time. The thread will not be full until it gets 16 quads, which means one thread every 96 clocks.
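The arithmetic behind the 83% and 96-clock figures, as a small Python sketch. It assumes each one-pixel triangle produces exactly one quad and keeps the rasterizer busy for one clock. The names are illustrative only.

```python
QUADS_PER_THREAD = 16  # a thread holds 16 quads, as stated above


def idle_fraction(clocks_per_triangle: int) -> float:
    """Fraction of clocks a rasterizer sits idle if it is busy one clock per triangle."""
    return 1.0 - 1.0 / clocks_per_triangle


def clocks_per_thread(clocks_per_triangle: int) -> int:
    """Clocks needed to fill one thread when each triangle contributes a single quad."""
    return QUADS_PER_THREAD * clocks_per_triangle


if __name__ == "__main__":
    print(idle_fraction(6))      # roughly 0.83
    print(clocks_per_thread(6))  # 16 quads at one quad per six clocks
```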
The rasterizers can handle six times the triangle throughput of the TS before they become a bottleneck. Before that, though, setup will be a bottleneck, but even that can handle 3x the speed of the TS.
You're missing the point: this architecture is designed for big triangles.
You're missing the point. This rasterizer can run at 6.25% efficiency with single-pixel triangles instead of the 1% it does now due to limitations by the tessellator & setup. That's a big difference.
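One way to arrive at the 6.25% and 1% figures, sketched in Python. It assumes "efficiency" means covered pixels divided by the pixel slots the rasterizer could have filled, with a peak of 4 quads per clock and 4 pixels per quad. The function and parameter names are hypothetical.

```python
def raster_efficiency(pixels_per_triangle: int,
                      clocks_per_triangle: int,
                      peak_quads_per_clock: int = 4,
                      pixels_per_quad: int = 4) -> float:
    """Covered pixels divided by the pixel slots available over the same clocks."""
    slots = clocks_per_triangle * peak_quads_per_clock * pixels_per_quad
    return pixels_per_triangle / slots


if __name__ == "__main__":
    # One single-pixel triangle per clock: the rasterizer itself is the limit.
    print(raster_efficiency(1, 1))
    # One single-pixel triangle every six clocks: limited by tessellator and setup.
    print(raster_efficiency(1, 6))
```

With one triangle per clock that is 1/16, and with one triangle every six clocks it is 1/96, or just over 1%.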
I'm talking about triangles entering rasterisation.
So you're saying 50M triangles before culling? No game does anywhere near that, so what's the point of this strawman example?
It says that the maximum triangles in any draw call is 1.6 million, coming out of TS.
So? I'm talking about triangle counts per frame in Heaven. Why do you care about count per call?
All academic for an architecture that likes big triangles, I'm afraid. This is completely the wrong workload.
Again, missing the point. What I just showed you is that the performance hit is entirely due to the 3-cycle per tri hit of the tessellator. There is no evidence that inefficiency of the "rasterisation/fragment shading/back-end" makes tessellator improvements pointless. Your argument is completely bunk.
ATI can chop the performance hit of tessellation by a factor of six without major architectural changes. All they have to do is make the tessellator generate triangles faster and double the speed of the triangle setup/culling.
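A minimal bottleneck model of that claim in Python, using the rates quoted in this thread: the tessellator at one triangle every 3 clocks, setup at 3x that, and the rasterizers at 6x that. It only models triangle throughput (the slowest stage wins) and ignores everything else in the frame. The "upgraded" tessellator rate of 2 triangles per clock is a hypothetical value chosen so that it stops being the limit.

```python
from fractions import Fraction


def pipeline_rate(tessellator: Fraction, setup: Fraction, rasterizers: Fraction) -> Fraction:
    """Triangles per clock through the pipeline: the slowest stage sets the pace."""
    return min(tessellator, setup, rasterizers)


# Rates in triangles per clock, taken from the figures quoted above.
TS_RATE = Fraction(1, 3)
SETUP_RATE = 3 * TS_RATE    # setup handles 3x the tessellator rate
RASTER_RATE = 6 * TS_RATE   # rasterizers handle 6x the tessellator rate

baseline = pipeline_rate(TS_RATE, SETUP_RATE, RASTER_RATE)

# Hypothetical upgrade: faster tessellator (assumed 2 tris/clock) and doubled setup.
upgraded = pipeline_rate(Fraction(2), 2 * SETUP_RATE, RASTER_RATE)

if __name__ == "__main__":
    print(baseline, upgraded, upgraded / baseline)
```

In this model the upgraded pipeline is limited by setup and the rasterizers together at 2 triangles per clock, six times the baseline.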