TS data is very small, though. It's just 4 bytes per vertex if you use a triangle strip, and close to half that with vertex caching. If you can stage just one kilobyte then you have several wavefronts of vertices buffered up.
Where do you get 4 bytes per vertex from? I'm seeing TS output in examples as float2 or float3.
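For what it's worth, here's the sizing under both readings. The packed 16-bit format is an assumption for illustration (it would explain the 4-byte figure), not anything documented; the float2 case matches the examples:

```python
# Rough sizing of TS (tessellator) output staged in 1KB, under two
# assumed per-vertex formats: domain coordinates packed as two 16-bit
# fixed-point values (4 bytes), versus a full float2 (8 bytes).
# Both formats are illustrative assumptions, not documented.

WAVEFRONT = 64          # vertices per hardware batch
STAGING = 1024          # 1KB of staging space

for name, bytes_per_vertex in [("packed 16-bit uv", 4), ("float2 uv", 8)]:
    vertices = STAGING // bytes_per_vertex
    print(f"{name}: {vertices} vertices = {vertices // WAVEFRONT} wavefronts")
# packed 16-bit uv: 256 vertices = 4 wavefronts
# float2 uv: 128 vertices = 2 wavefronts
```

Either way, 1KB of staging holds multiple wavefronts of domain coordinates.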
Tessellation factor of 15 generates 337 triangles (slide 9):
http://developer.download.nvidia.com/presentations/2010/gdc/Tessellation_Performance.pdf
15 is what R600 can do, so it's nothing special in today's terms.
If an HS hardware thread covers 16 patches (4 control points per patch for terrain tessellation using quad patches = 64 control points sharing a hardware thread) and TS generates 337 triangles per patch, then that's ~5.4K triangles/vertices (16 × 337 = 5,392), about 42KB assuming 8 bytes per vertex. Obviously, DS will drain those triangles as TS produces them, in batches of 64 vertices (that's ~84 batches).
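The buffering arithmetic above checks out, treating the triangle count as an approximate vertex count as the post does:

```python
# Sanity-check the per-HS-thread buffering numbers: 16 quad patches,
# 337 triangles each (tess factor 15, per the NVIDIA slide), 8 bytes
# per DS-input vertex, drained in wavefronts of 64 vertices.

PATCHES = 16
TRIS_PER_PATCH = 337
BYTES_PER_VERTEX = 8
WAVEFRONT = 64

tris = PATCHES * TRIS_PER_PATCH         # ~one vertex per triangle
print(tris)                             # 5392 (~5.4K)
print(tris * BYTES_PER_VERTEX / 1024)   # 42.125 (~42KB)
print(tris / WAVEFRONT)                 # 84.25 (~84 batches)
```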
Which is more frequent: a vertex exported from TS or a vertex shaded by DS? DS needs to be <16 cycles to keep up with TS, if there's one SIMD running HS/DS.
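One way to arrive at a 16-cycle figure, if we assume TS exports one vertex per clock and a 16-lane SIMD takes 4 clocks per 64-wide wavefront instruction (both assumptions on my part, not documented):

```python
# Speculative reconstruction of the DS cycle budget: if TS fills a
# 64-vertex batch at 1 vertex/clock, a new DS wavefront's worth of
# work arrives every 64 clocks; a 16-lane SIMD spends 4 clocks per
# wavefront instruction, so DS gets ~16 instruction slots per batch.

TS_VERTS_PER_CLOCK = 1       # assumed TS export rate
WAVEFRONT = 64
CLOCKS_PER_WAVE_INSTR = 4    # 64 lanes / 16 ALUs

clocks_per_batch = WAVEFRONT / TS_VERTS_PER_CLOCK      # 64 clocks
budget = clocks_per_batch / CLOCKS_PER_WAVE_INSTR
print(budget)                # 16.0 instruction slots per DS wavefront
```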
Slide 25 is another hint about ATI: Moving work to DS instead of PS may not be a win. Why would that be? Probably because the population of DS threads in flight at any one time is too small.
So the count of DS threads in flight is open to question. The more of them, the more aggregate LDS is available to support the output from HS and TS. But DS count is locked to HS count by SIMD usage, which means that HS/DS load-balancing isn't independent - it would be like having a non-unified GPU, back to the bad old days of VS-dedicated and PS-dedicated shader pipes.
Then we're left not knowing the wall-clock duration of a DS invocation. The longer it runs, the more coarse-grained the usage of LDS becomes.
GS uses coarse-grained ring buffers off-die, seemingly for these reasons. The consumption of the ring buffer is not tied to specific SIMDs and it's not severely limited in capacity.
NVidia uses L2 to smooth these coarse-grained lumps of data, and it uses load-balancing across the entire GPU twixt stages to maximise overall throughput. Neither of these options seems to be available in Cypress.