Cedar doesn't have the same prim rates as the rest of the stack. Cypress, Juniper and Redwood are the same as each other.
So Dave, can you give us an idea as to why Cypress can only make one tessellated vertex every 6 clocks, or one tri every 3?
Yeah, the limit at maximum tessellation is 2 triangles per vertex, but lower tessellation levels (sporadic or adaptive use) make that ratio lower.
I said think harder!
The first three/four vertices are always trivial. For a tri-patch, they're (0,0), (1,0) and (0,1). Every vertex generated after those creates two triangles, regardless of tessellation factors. (I use 'vertex' loosely in this paragraph.)
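You can see that limit with a quick count for a uniformly tessellated tri-patch; this is a sketch of mine using the standard closed forms for a triangular grid with integer tess factor n:

    # Triangles per generated vertex for a uniform tri-patch.
    for n in (2, 4, 8, 16, 64, 256):
        verts = (n + 1) * (n + 2) // 2   # vertices in the triangular grid
        tris = n * n                     # triangles produced
        print(n, tris / (verts - 3))     # approaches 2 as n grows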
I think TS has a vertex-centric view with the triangles coming out in the wash, because tessellation factors are per edge, not per triangle.
There's also a factor for the face. The edge factors are necessary for continuity between patches: when two patches have different factors, the vertices on the shared edge need to match up or you get ugly seam problems. All four factors (three edge plus one inside, for a tri-patch) are used to tessellate.
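To make the seam point concrete, here's a toy sketch of mine (uniform spacing along an edge; real tessellators also support fractional spacing, but the matching argument is the same):

    # Vertices two patches generate along a shared edge.
    def edge_vertices(v0, v1, factor):
        # factor+1 uniformly spaced points from v0 to v1.
        return [tuple(a + (b - a) * i / factor for a, b in zip(v0, v1))
                for i in range(factor + 1)]

    v0, v1 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
    left  = edge_vertices(v0, v1, 4)   # patch A tessellates the shared edge
    right = edge_vertices(v0, v1, 4)   # patch B uses the same edge factor
    assert left == right               # identical points: watertight seam
    # Mismatched factors put different points on the same edge: T-junctions, cracks.
    assert set(left) != set(edge_vertices(v0, v1, 5))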
Exactly. Generating 10 million visible triangles per frame whose average area is <1 pixel requires vastly more hardware support in the rasterisation/fragment shading/back-end part of the GPU, and is therefore pointless.
It's not pointless by any means. Cypress has 20 SIMDs, yet in your scenario they're getting fed one quad every three clocks. Unless you have a 60-cycle shader (up to 60 fetches and 2400 flops), fragment-shading capacity is sitting idle. The RBEs can handle 24x the tessellator throughput. The rasterizer, according to Dave, can handle 6x the throughput. Even the dated setup engine can handle 3x the throughput.
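That 60-cycle figure is just the quoted Cypress numbers multiplied out:

    # 20 SIMDs fed one 2x2 quad every 3 clocks from the tessellator.
    simds = 20
    quads_per_clk = 1 / 3.0
    # A given SIMD only sees a new quad every simds/quads_per_clk clocks,
    # so any shader shorter than that leaves it idle between quads:
    print(simds / quads_per_clk)   # 60 clocks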
And most of a GPU's work is due to an order of magnitude of amplification derived from those resulting triangles.
That's a silly argument. First of all, it's not a factor of 10; it's less than a factor of four (see the sketch below). Second, you need a lot of work to pack samples together and reduce that amplification. Third, doing so doesn't always speed up processing, because texturing can't share LOD calcs between all pixels of a quad.
Finally, and most importantly, it's a lame excuse. If your quads only have a few samples to be written, that's no reason to have 80% of your SIMDs outputting zero quads/clk.
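Here's a little simulation of mine of that factor-of-four bound, assuming standard 2x2 quad-based fragment shading (the triangle sizes and counts are arbitrary):

    import random

    def edge(px, py, x0, y0, x1, y1):
        # Signed test: which side of the edge (x0,y0)->(x1,y1) is (px,py) on?
        return (x1 - x0) * (py - y0) - (y1 - y0) * (px - x0)

    def covered_pixels(tri):
        # Pixel centres covered by the triangle (scan its bounding box).
        (ax, ay), (bx, by), (cx, cy) = tri
        pix = set()
        for y in range(int(min(ay, by, cy)), int(max(ay, by, cy)) + 1):
            for x in range(int(min(ax, bx, cx)), int(max(ax, bx, cx)) + 1):
                w = (edge(x + 0.5, y + 0.5, ax, ay, bx, by),
                     edge(x + 0.5, y + 0.5, bx, by, cx, cy),
                     edge(x + 0.5, y + 0.5, cx, cy, ax, ay))
                if all(v >= 0 for v in w) or all(v <= 0 for v in w):
                    pix.add((x, y))
        return pix

    random.seed(1)
    shaded = covered = 0
    for _ in range(5000):
        ox, oy = random.uniform(4, 60), random.uniform(4, 60)
        tri = [(ox + random.uniform(-1.5, 1.5), oy + random.uniform(-1.5, 1.5))
               for _ in range(3)]
        pix = covered_pixels(tri)
        covered += len(pix)
        shaded += 4 * len({(x // 2, y // 2) for (x, y) in pix})  # 2x2 quads touched
    # Each touched quad contains at least one covered pixel, so this is < 4:
    print("quad amplification:", shaded / covered)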
BTW, 10 million triangles does not mean <1 pixel area. 50% are frustum culled due to object-level CPU culling granularity. 40% of the rest are backface culled. Half of the rest are in the shadow map (or more, counting triangles that are off screen but still cast shadows). Over half of the rest are invisible due to overdraw. So now we're down to screen res divided by (10M * 50% * 60% * 50% * 40%). That is most certainly not <1 pix/tri avg area.
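Multiplied out (percentages as above; the screen res is my assumption for illustration):

    visible = 10e6 * 0.50 * 0.60 * 0.50 * 0.40   # frustum, backface, shadow pass, overdraw
    print(visible)                # 600,000 visible triangles
    print(1920 * 1200 / visible)  # ~3.8 px/tri average area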
Since we know that Heaven's absolute triangle counts are low (i.e. <2 million, though obviously some multiple of that for extreme mode).
And where do you get that information? B3D's review says that three states have 3.7 million triangles, and that accounts for only 71% of the frame time, so there's more.
Let's assume a little more than 4M triangles from tessellation. Without tessellation, GPU time limited by geometry is probably minimal, so at one triangle per 3 clocks those triangles probably add 12M cycles to the frame time, or 14ms at Cypress's 850MHz. B3D's numbers show the following fps without/with tessellation at different AA settings: 64/38, 43/27, 34/23. Render time differences: 11ms, 14ms, 14ms.
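Checking that arithmetic (tri rate as per Dave's figures, fps pairs as per B3D's):

    print(4e6 * 3 / 850e6 * 1000)   # ~14.1 ms of added geometry time
    for fps_off, fps_on in [(64, 38), (43, 27), (34, 23)]:
        print(round(1000 / fps_on - 1000 / fps_off, 1))   # 10.7, 13.8, 14.1 ms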
Clearly, tessellation time alone is more than enough to account for the performance impact.
It's not triangle count that matters, it's area per triangle.
And what do you think happens to area per triangle in a real game, with triangles off screen, backfacing, in shadow maps, with higher-detail displacement maps needing more tessellation, etc.? The DXSDK sample has one square mesh that fits entirely within the screen.
Though I remain cautious about the hardwareluxx graph until someone reproduces that HD5870 result.
As do I. There's a review out there that shows the GTX480 having a lower AA impact than Cypress much of the time.