I said think harder!
Three/four vertices are always trivial. For a tri-patch, it's (0,0), (1,0), and (0,1). Every vertex generated after those creates two triangles, regardless of tessellation factors. (I use 'vertex' loosely in this paragraph.)
In general, with lower tessellation factors and anisotropy, you'll get notably fewer than 2 triangles per extra vertex. Go count some tessellated patches' triangles and vertices if you don't believe me. Earlier I counted one tessellated patch with 146 triangles and 87 vertices.
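To put a number on that count, here is a rough sketch using only the 146/87 figures above. It assumes the counted patch was a tri-patch with 3 trivial corner vertices; that is my assumption for illustration, and the function name is made up.

```python
# Figures from the tessellated patch counted above.
triangles = 146
vertices = 87


def triangles_per_extra_vertex(triangles, vertices, corner_vertices=3):
    """Triangles generated per vertex beyond the trivial corner vertices.

    corner_vertices=3 assumes a tri-patch; pass 4 for a quad-patch.
    """
    return triangles / (vertices - corner_vertices)


ratio_tri = triangles_per_extra_vertex(triangles, vertices)
ratio_quad = triangles_per_extra_vertex(triangles, vertices, corner_vertices=4)

print(f"tri-patch assumption:  {ratio_tri:.2f} triangles per extra vertex")
print(f"quad-patch assumption: {ratio_quad:.2f} triangles per extra vertex")
```

Either way the ratio comes out below 2 for this patch.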
There's also a factor for the face. The edge factors are necessary for continuity between patches. When two patches have different factors, you need to have vertices on the edges match up or you get ugly seam problems. All four factors are used to tessellate.
I don't remember hearing about face factor and I was under the impression developers have to manage adjacent-edge tessellation factors (and orientation) very carefully in order to ensure there are no Ts.
I don't see how a face factor can work when a patch's edges, each with a different factor, each abut another patch with its own face factor.
If you compare these two shots:
http://unigine.com/devlog/090928-dragon_no_tesselation_wire1.jpg
http://unigine.com/devlog/090928-dragon_tesselation_wire1.jpg
it appears they are tessellating all triangles' edges to the same factor regardless of size. Perhaps that's their work-around for Ts.
It's not pointless by any means. Cypress has 20 SIMDs, yet in your scenario it's getting fed one quad every three clocks.
AMD says that the architecture's comfort zone bottoms out at 8 fragments per clock, in effect. It's not "my scenario", it's how the hardware works.
Unless you have a 60 cycle shader (up to 60 fetches and 2400 flops), fragment shading ability is sitting idle. The RBEs can handle 24x the tessellator throughput. The rasterizer, according to Dave, can handle 6x the throughput.
The rasteriser is the bottleneck on hardware thread generation, I presume: a new hardware thread can be started once every 4 cycles per group of 10 SIMDs - and only one SIMD can start a hardware thread in any 4 cycle window, when setup is exporting 1-pixel triangles.
TS, in this scenario, is producing triangles every 3 cycles. So the SIMDs can't go any faster. They're starved by the huge granularity of rasterisation and thread generation, not by lack of triangles.
The architecture is designed for big triangles, spanning >64 fragments, with a single triangle coming out of setup and, in the best case, being sent to both rasterisers and fully occupying them both, resulting in the generation of a total of 128 fragments during 4 cycles.
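A back-of-envelope sketch of that gap, using only the figures in this post: 128 fragments per 4 cycles in the best case, against one 1-pixel triangle every 3 cycles from TS. Treating each tiny triangle as exactly one useful fragment is my simplification, and the variable names are just for illustration.

```python
# Best case from the post: both rasterisers fully occupied by one big triangle.
BEST_CASE_FRAGMENTS = 128   # fragments generated in one window
WINDOW_CYCLES = 4           # length of that window in cycles

# Tiny-triangle case from the post: TS emits one triangle every 3 cycles.
CYCLES_PER_TS_TRIANGLE = 3
FRAGMENTS_PER_TINY_TRIANGLE = 1  # assumption: 1-pixel triangles, one useful fragment each

peak_fragments_per_clock = BEST_CASE_FRAGMENTS / WINDOW_CYCLES
tiny_fragments_per_clock = FRAGMENTS_PER_TINY_TRIANGLE / CYCLES_PER_TS_TRIANGLE

print(f"big triangles:  {peak_fragments_per_clock:.1f} fragments/clock")
print(f"tiny triangles: {tiny_fragments_per_clock:.2f} fragments/clock")
print(f"gap: roughly {peak_fragments_per_clock / tiny_fragments_per_clock:.0f}x")
```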
That's a silly argument. First of all, it's not a factor of 10, it's less than a factor of four. Second, you need a lot of work to pack samples together and reduce that amplification. Third, doing so doesn't always speed up processing, because texturing can't share LOD calcs between all pixels of a quad.
You're missing the point, this architecture is designed for big triangles.
Finally, and most importantly, it's a lame excuse. If your quads only have a few samples to be written, that's no reason to have 80% of your SIMDs outputting zero quads/clk.
Don't shoot the messenger.
BTW, 10 million triangles does not mean <1 pixel area. 50% are frustum culled due to object-level CPU culling granularity. 40% of the rest are backface culled. Half of the rest are in the shadow map (or more, due to off screen but casting shadows). Over half of the rest are invisible due to overdraw. So now we're down to screen res divided by (10M * 50% * 60% * 50% * 40%). That is most certainly not <1 pix/tri avg area.
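For reference, the culling chain in that estimate works out as below. The percentages are the ones quoted; the 1920x1200 resolution is purely an assumption of mine, since no screen resolution is given.

```python
submitted_triangles = 10_000_000

# Fraction of triangles that SURVIVE each stage, per the quoted figures.
survival = {
    "frustum culling (50% culled)": 0.50,
    "backface culling (40% culled)": 0.60,
    "not in shadow map (half are)": 0.50,
    "visible after overdraw": 0.40,
}

remaining = submitted_triangles
for stage, fraction in survival.items():
    remaining *= fraction
    print(f"{stage:32s} -> {remaining:12,.0f} triangles")

# Assumed resolution, for illustration only.
screen_width, screen_height = 1920, 1200
pixels_per_triangle = (screen_width * screen_height) / remaining
print(f"average visible area: {pixels_per_triangle:.2f} pixels per triangle")
```

That is the average over triangles that end up visible on screen, which is a different population from the triangles entering rasterisation.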
I'm talking about triangles entering rasterisation.
And where do you get that information? B3D's review says that three states have 3.7 million triangles, and that accounts for only 71% of the frame time, so there's more.
It says that the maximum triangles in any draw call is 1.6 million, coming out of TS.
A lot of the remaining frame time will be taken with post-processing.
Let's assume a little more than 4M triangles from tessellation. Without tessellation, GPU time limited by geometry is probably minimal, so those triangles probably add 12M cycles to the frame time, or 14ms. B3D's numbers show the following fps without/with tessellation at different AA settings: 64/38, 43/27, 34/23. Render time differences: 11ms, 14ms, 14ms.
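The numbers in that estimate can be checked with a short sketch. The 3 cycles per triangle and the fps pairs come from the post; the 850 MHz core clock is an assumption on my part, chosen because it is what makes 12M cycles come to about 14ms.

```python
CORE_CLOCK_HZ = 850e6        # assumption: not stated in the post
CYCLES_PER_TRIANGLE = 3      # tessellator rate discussed above
tessellated_triangles = 4_000_000

# Estimated extra frame time if the frame is limited by triangle throughput.
tess_cycles = tessellated_triangles * CYCLES_PER_TRIANGLE
tess_cost_ms = tess_cycles / CORE_CLOCK_HZ * 1000.0

# (fps without tessellation, fps with tessellation) at the three AA settings.
fps_pairs = [(64, 38), (43, 27), (34, 23)]


def frame_time_ms(fps):
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


deltas_ms = [frame_time_ms(with_t) - frame_time_ms(without_t)
             for without_t, with_t in fps_pairs]

print(f"estimated tessellation cost: {tess_cost_ms:.1f} ms")
for (without_t, with_t), delta in zip(fps_pairs, deltas_ms):
    print(f"{without_t} -> {with_t} fps: +{delta:.1f} ms per frame")
```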
All academic for an architecture that likes big triangles, I'm afraid. This is completely the wrong workload. It's like giving NV40 nested, divergent, control flow in a pixel shader. It'll just puke it back in your face.
Jawed