Well, I don't know what to say. I'm just telling you the DX11 specs for tessellation.
That's the spec for computation, I believe, not storage. And since we were talking about the space consumed in LDS, 8 bytes per vertex is the correct figure.
You're making a lot more assumptions than that. There's no reason to simultaneously store all the triangles produced from tessellating a thread full of patches.
It's currently impossible anyway, since LDS is too small when TF=64.
You have an HS thread producing patch data: store that in GDS until the input buffer is full (64k is overkill), then let the HS thread sit idle until there's room again. In the meantime, TS is cranking out barycentric points one at a time, DS is consuming them 64 at a time, and DS has no need to finish the shader any faster than 64 per clock because setup can't consume them any faster than that.
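To make the buffering scheme concrete, here's a toy single-threaded sketch of that producer/consumer flow. Everything in it (capacity, rates, total work) is an illustrative assumption, not measured hardware behaviour; the point is only that a bounded buffer lets the faster HS stage sit idle while the slower downstream stages drain it.

```python
from collections import deque

# Toy model of the proposed flow: HS fills a bounded buffer and idles when
# it's full, while TS/DS drain it at a slower fixed rate. All sizes and
# rates here are invented for illustration.
BUFFER_CAPACITY = 64   # patches the buffer can hold (assumption)
HS_RATE = 2            # patches produced per "tick" when not stalled (assumption)
DS_RATE = 1            # patches consumed per "tick" (assumption)
TOTAL_PATCHES = 256    # total work (assumption)

buffer = deque()
produced = consumed = hs_stalls = tick = 0

while consumed < TOTAL_PATCHES:
    # HS: produce until the buffer is full, then sit idle.
    for _ in range(HS_RATE):
        if produced < TOTAL_PATCHES:
            if len(buffer) < BUFFER_CAPACITY:
                buffer.append(produced)
                produced += 1
            else:
                hs_stalls += 1  # "let the HS thread sit idle until there's room"
    # TS/DS: drain at the (slower) downstream rate.
    for _ in range(DS_RATE):
        if buffer:
            buffer.popleft()
            consumed += 1
    tick += 1

print(f"{tick} ticks; HS idle production slots: {hs_stalls}")
```

With the producer rate above the consumer rate, the buffer fills once and HS idles from then on, which is exactly the steady state being described.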
Of course. DS shaders tend to take a while.
There's no need to store 337*16 triangles.
Well, for a square patch the correct number would be 450*16 (thanks to AlexV for the correction - doh for not even thinking about it). Clearly LDS can't store them all. A ring buffer would have no trouble storing them all, of course. Or the 8192 vertices that TF=64 produces.
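A quick back-of-envelope check, combining the 8-bytes-per-vertex figure from earlier with Cypress's 32 KB of LDS per SIMD (the 8192-vertex count is the one quoted above):

```python
# Back-of-envelope: can LDS hold a fully tessellated TF=64 patch's vertices?
LDS_BYTES = 32 * 1024    # LDS per SIMD on Cypress
BYTES_PER_VERTEX = 8     # storage figure from the discussion above
VERTICES_TF64 = 8192     # vertex count quoted above for TF=64

needed = VERTICES_TF64 * BYTES_PER_VERTEX
print(f"needed: {needed // 1024} KB vs LDS: {LDS_BYTES // 1024} KB")
# needed: 64 KB vs LDS: 32 KB -> LDS can't hold them all, whereas a ring
# buffer consumed incrementally has no such limit.
```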
And when TS is working its way through a patch that results in thousands of vertices, it can't load-balance itself and switch to another patch when the destination LDS for the first patch runs out of capacity.
Enough patches from differing SIMDs? Nothing wrong with a FIFO. Why would the TS stall? It's by far the slowest stage in the pipeline.
B3D article says HS is 5% arithmetic throughput - i.e. 1 of Cypress's 20 SIMDs. I think that's probably too low for the generic case and more a reflection of the workload balance they gave the chip in the test. But still, there's likely to be a tendency to minimise the number of SIMDs running locked pairs of HS/DS.
If enough SIMDs are allocated to HS/DS then even with high TF there should be enough patches for TS to consume. But we've got zero data on SIMD allocation in high-tessellation scenarios...
So what? There are a lot of SIMDs per chip, and you could even be clever in your compilation so that the HS exists alongside another shader in the same SIMD.
That's not a compilation problem.
It's an open question whether a SIMD can support more than 2 different types of shader. With HS and DS appearing to be a locked pair, a third type seems unlikely... This is primarily a register allocation problem: when allocating registers, how do you segregate the shader types and avoid fragmentation with more than 2 types present?
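One speculative way to see why 2 types is a natural limit (a sketch of the fragmentation argument, not a claim about the actual hardware allocator): two shader types can grow toward each other from opposite ends of the register file, keeping each type's region contiguous, while a hypothetical third type has no free end to own.

```python
# Speculative illustration, not actual hardware behaviour: with a single
# register file, two shader types can grow toward each other from opposite
# ends without fragmenting. A third type has nowhere contiguous to grow.
REG_FILE_SIZE = 256   # registers per SIMD lane (illustrative assumption)

bottom = 0            # next free register growing up (type A, e.g. HS)
top = REG_FILE_SIZE   # next free register growing down (type B, e.g. DS)

def alloc_type_a(n):
    """Allocate n registers for shader type A from the bottom end."""
    global bottom
    if bottom + n > top:
        raise MemoryError("register file full")
    start = bottom
    bottom += n
    return range(start, start + n)

def alloc_type_b(n):
    """Allocate n registers for shader type B from the top end."""
    global top
    if top - n < bottom:
        raise MemoryError("register file full")
    top -= n
    return range(top, top + n)

hs_regs = alloc_type_a(32)
ds_regs = alloc_type_b(48)
print(len(hs_regs), len(ds_regs), "free:", top - bottom)
```

Freeing in LIFO order per end never fragments; a third type would have to land in the middle and split the free region.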
If Cayman uses off-die buffering when tessellation is active, one possibility is that all output from HS and TS goes off-die. That would let asymmetric HS/DS configurations occupy SIMDs and permit more fluid load-balancing of the HS and DS workloads.
Occupation of SIMDs is not the performance problem on Evergreen or NI; triangle throughput is.
So off-die buffering is irrelevant?
You're being really narrow-minded if you think off-die buffering can only improve performance if granularity sensitivity is a problem. There's all sorts of possibilities.

Cayman's T ALU says hi.
Which you still haven't elucidated.
If ATI GPUs can do 1 tri per clock with a regular VS using VBs/IBs from RAM, then they should be able to do the same with a DS, whether it fetches the barycentric coords from RAM, GDS or LDS. IMO the only possible bottleneck is the tessellator itself or a data path.
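As a toy statement of that argument: end-to-end triangle rate is the minimum over the stage rates, so if the DS-input fetch path matches what a VS already achieves with VBs/IBs, only the tessellator (or some other data path) can be the limiter. The rates below are invented for illustration:

```python
# Toy pipeline-throughput model: triangles/clock end to end is bounded by
# the slowest stage, regardless of where DS fetches its inputs from.
# All rates are invented assumptions.
stage_rates = {
    "tessellator": 0.33,            # tris/clock (assumption)
    "DS fetch (RAM/GDS/LDS)": 1.0,  # assumed to match VS-style VB/IB fetch
    "DS ALU": 1.0,
    "setup": 1.0,
}
bottleneck = min(stage_rates, key=stage_rates.get)
print(f"effective rate = {stage_rates[bottleneck]} tris/clk, bound by {bottleneck}")
```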
The "data path". Why does ATI's GS always push data off die?