That's because you're double counting the edge vertices. They get reused in adjacent patches. Okay, I'll agree that you can't cache the verts on every edge, so it'll be slightly lower than two, but that's it.
Ah yes, indeed, double-counting patch-edge vertices is my mistake.
Well it's there. SV_InsideTessFactor. There's nothing weird about it.
I thought you were trying to suggest that this was involved in making vertices on common edges align, but you were merely referring to another factor.
It has nothing to do with comfort zone. If you have tiny triangles, all samples fit in one quad most of the time. If you can only feed one triangle every three clocks, then you will only get one quad into the shading engine every three clocks.
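To make the rate argument concrete, here's a minimal sketch. The shader width and the one-quad-per-tiny-triangle assumption are illustrative, not measured Cypress figures; the point is only that rasteriser triangle rate, not ALU width, becomes the cap:

```python
# Sketch of the bottleneck above: if every triangle is so small that all
# of its covered samples fit in a single 2x2 quad, and the rasteriser can
# only emit one triangle every 3 clocks, then at most one quad (4 pixel
# slots) reaches the shading engine every 3 clocks -- regardless of how
# wide the shader array is. shader_width=320 is an illustrative figure.

def shaded_pixel_rate(tri_per_clock, pixels_per_quad=4, shader_width=320):
    """Effective pixels/clock when each tiny triangle fills one quad at most."""
    quads_per_clock = tri_per_clock  # one quad per tiny triangle
    return min(quads_per_clock * pixels_per_quad, shader_width)

# One triangle every 3 clocks -> 4/3 pixel slots per clock into shading,
# against a shader array that could consume hundreds per clock:
print(shaded_pixel_rate(1 / 3))  # ~1.33 pixels/clock, i.e. <1% utilisation
```

And many of those 4 slots per quad are helper pixels outside the tiny triangle, so real utilisation is even worse than this sketch suggests.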
You're assuming that the hardware can put multiple triangles into a hardware thread or even that multiple triangles can occupy a fragment-quad.
Going back to R300, ATI's architecture is based on small hardware threads. I'm still unclear on the actual count of fragments in R300's threads, whether it's 16 or 64 or 256 etc. But compare this with NV40 which we know has a monster hardware thread allocation across all pixel shading SIMDs, running into thousands for the simplest case of minimal register allocation per fragment.
Why, in this era, would ATI support multiple triangles per hardware thread, when hardware threads are small and when there are so few small triangles, ever?
R520 has a hardware thread size of 16. All the later high-end GPUs have grown this as an artefact of the ALU:TEX increases, so that we now stand at 64. For games in general and for moderate amounts of tessellation, 64 is fine, because average triangle sizes are large enough to occupy a significant portion of all pixel shading hardware threads.
But the point is, the basic architecture has been the same all along: SPI creates a thread of fragments' attributes at the rate of 1 or 2 attributes per fragment per clock, for a single triangle at a time.
Cypress deletes SPI and so some of SPI's responsibility for controlling register allocation/initiation has been dumped on the overall thread control unit. Now LDS has to be populated with barycentrics and attributes, for on-demand interpolation by fragments.
My theory is Cypress was due to have a multi-triangle, variable-throughput, thread allocator/LDS-populator. That, perhaps coupled with other changes to tessellation/setup/rasterisation, would have provided the tiny-triangle heft. But that was all dropped.
So you're saying 50M triangles before culling? No game does anywhere near that, so what's the point of this strawman example?
Since I believe that single-pixel triangle tessellation is a strawman, I wonder why I'm here, frankly.
Anyway, any decent adaptive tessellation routine will cull patches based on things like back-facing, viewport and occlusion querying, so your strawman of 50M triangles before culling is irrelevant. These approaches to tessellation will make multi-million rasterised triangles per frame practical.
Overdraw is always going to be a problem, even with a deferred renderer.
So? I'm talking about triangle counts per frame in Heaven. Why do you care about count per call?
The B3D graph shows that the longest draw call is ~28% of frame time. I don't know the frame rate at that time, but let's say it was 20fps, 50ms. Assuming 1.6 million triangles in 14ms (though that could have been 1M triangles in the same time, the article is very vague), that's 114M triangles per second coming out of TS.
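A quick back-of-envelope check of those numbers, using only the estimates stated above (the 20fps frame rate and 1.6M triangle count are assumptions, as noted, not measurements):

```python
# Reproduce the arithmetic: 20fps -> 50ms frame, longest draw call ~28%
# of that, assumed 1.6M triangles in the call. All inputs are the
# estimates from the text, not measured values.

frame_time_s = 1 / 20              # 20 fps -> 50 ms
call_time_s = 0.28 * frame_time_s  # ~14 ms for the longest draw call
triangles = 1.6e6

tri_per_sec = triangles / call_time_s
print(f"{tri_per_sec / 1e6:.0f}M triangles/s")  # 114M triangles/s
```

If it was 1M triangles rather than 1.6M, the same sum gives ~71M triangles per second, so the conclusion is sensitive to the article's vagueness.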
Again, missing the point. What I just showed you is that the performance hit is entirely due to the 3-cycle per tri hit of the tessellator. There is no evidence that inefficiency of the "rasterisation/fragment shading/back-end" makes tessellator improvements pointless. Your argument is completely bunk.
On a close-up of the dragon the frame rate is about 45fps without tessellation. That implies a very substantial per-pixel workload - something like 260 cycles (2600 ALU cycles) per pixel assuming vertex load is negligible - and without knowing what proportion is full-screen passes. Anyway, with tessellation on, it doesn't require much of a drop in average fragments per triangle to kill pixel shading performance.
One of the factors here is we don't know what Heaven's doing per frame. Their site talks about advanced cloud rendering, for example. That'd be a workload that's unaffected by tessellation.
One of the things B3D could have done was to evaluate performance and draw call times with tessellation but no shadowing. I'm not sure how the night time portions of Heaven work, whether there's any shadowing involving tessellated geometry.
ATI can chop the performance hit of tessellation by a factor of six without major architectural changes. All they have to do is make the tessellator generate triangles faster and double the speed of the triangle setup/culling.
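The factor of six falls straight out of the rates already discussed, assuming the tessellator's baseline of one triangle every 3 clocks and a setup/culling rate of one triangle per clock (both assumptions carried over from earlier in the thread):

```python
# Where the "factor of six" comes from, under the assumption that the
# tessellator currently emits one triangle every 3 clocks and setup/culling
# handles one triangle per clock.

baseline_clocks_per_tri = 3.0  # tessellator-limited: 1 tri / 3 clocks
new_clocks_per_tri = 0.5       # tessellator fast enough to feed setup
                               # doubled to 2 tris / clock

speedup = baseline_clocks_per_tri / new_clocks_per_tri
print(speedup)  # 6.0
```

So triangle throughput goes from 1/3 per clock to 2 per clock on the triangle-limited portion of the frame, with the rest of the pipeline untouched.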
Ah yes, it was so easy and obvious, it's a feature of Cypress.
Jawed