I sure as hell hope they aren't doing tessellation with the GS. It's pretty easy to do tessellation with a geometry shader on any DX10 hardware, but the parallelism creates a huge buffering requirement: you start with a batch of tessellation primitives (64, I assume) and then, as the GS code executes, every primitive in the batch emits a triangle simultaneously, so the worst-case output of the entire batch has to be buffered at once before any of it can drain.
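To put a rough number on that, here's a back-of-the-envelope sketch. The 64-wide batch is my assumption above, and the per-primitive output count and vertex size are invented purely for illustration, not vendor figures:

```cpp
// Worst-case sizing for parallel GS amplification. All figures are
// hypothetical illustration, not real hardware or API limits.
#include <cstdio>

int main() {
    const long long batchSize       = 64; // primitives amplified in lock-step (assumed)
    const long long maxVertsPerPrim = 64; // a modest GS maxvertexcount for tessellation
    const long long floatsPerVertex = 8;  // position plus a few attributes (assumed)
    const long long bytesPerFloat   = 4;

    // Every lane amplifies simultaneously, so the whole batch's worst-case
    // output must be resident before any of it can be consumed downstream.
    long long bytes = batchSize * maxVertsPerPrim * floatsPerVertex * bytesPerFloat;
    printf("worst-case GS output buffer: %lld KiB per batch\n", bytes / 1024);
    return 0;
}
```

Even with those modest numbers that's 128KiB per batch, which is why GS amplification on DX10 hardware tends to be so painful.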
I wasn't suggesting that GS was used for tessellation. Merely that GS is "always on" if TS is active. But in light of what you say next, plus a smidgen of thinking time, I think that's irrelevant to TS, per se.
Good fixed-function tessellation would take one primitive as input and generate all of its tessellation coordinates (and the triangles connecting them) before moving to the next primitive. These only need to be buffered before going into the domain shader, and they should require far less space per wavefront than a normal vertex shader's inputs.
I think the buffering would only need to be deep enough to fully populate a hardware thread — something like the sketch below.
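Here's a toy serial tessellator to make that concrete: one patch is fully tessellated before the next is touched, and domain points are dispatched to the domain shader as soon as a 64-wide hardware thread can be filled. The triangular-domain loop and all the sizes are hypothetical, just to show the buffering depth involved:

```cpp
// Toy model of fixed-function tessellation feeding 64-wide hardware threads.
// Hypothetical structure, not any vendor's actual tessellator.
#include <cstdio>
#include <vector>

struct DomainPoint { float u, v; int patchId; };

// Emit every (u,v) coordinate for one patch at an integer tess factor.
static void tessellatePatch(int patchId, int tf, std::vector<DomainPoint>& out) {
    for (int i = 0; i <= tf; ++i)
        for (int j = 0; j <= tf - i; ++j)          // triangular domain
            out.push_back({ float(i) / tf, float(j) / tf, patchId });
}

int main() {
    const size_t waveSize = 64;            // buffer just one hardware thread's worth
    std::vector<DomainPoint> buffer;
    buffer.reserve(waveSize);

    for (int patch = 0; patch < 8; ++patch) {      // strictly one patch at a time
        std::vector<DomainPoint> pts;
        tessellatePatch(patch, 6, pts);
        for (const DomainPoint& p : pts) {
            buffer.push_back(p);
            if (buffer.size() == waveSize) {       // a full wave: hand off to the DS
                printf("dispatch DS wave (up to patch %d)\n", p.patchId);
                buffer.clear();
            }
        }
    }
    if (!buffer.empty()) printf("dispatch partial DS wave\n");
    return 0;
}
```

The point being that the live buffer never exceeds one wavefront of bare (u, v, patch) records, versus the whole-batch blow-up in the GS case.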
I haven't studied the document, but this "small buffer" for post-TS data might correspond to the small buffer that's being proposed for post-GS data, i.e. the same buffer suits either.
To be honest I can't see why LDS can't be used as this "small buffer" for the naked vertices produced by TS, similar to the way that vertex attributes are buffered in LDS for consumption by interpolation instructions during pixel shading (and, as you say, taking less space). A sketch of the idea follows.
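CUDA's shared memory is the closest public analogue to LDS, so here's how I imagine that staging working — a purely illustrative kernel, with the patch evaluation replaced by a stand-in function of (u,v):

```cuda
// Shared memory standing in for LDS as the small post-TS buffer.
// Illustrative only; names and sizes are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

struct Vertex { float x, y, z, w; };     // a "naked" post-tessellation vertex

__global__ void domainStage(const float2* domainPts, Vertex* out, int n) {
    __shared__ Vertex lds[64];           // one hardware thread's worth of vertices
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 uv = domainPts[i];
        // stand-in for real patch evaluation: any function of (u,v) will do
        lds[threadIdx.x] = { uv.x, uv.y, uv.x * uv.y, 1.0f };
    }
    __syncthreads();
    // a downstream stage (setup, interpolation) would consume lds in place;
    // copying to memory here just keeps the example observable
    if (i < n) out[i] = lds[threadIdx.x];
}

int main() {
    const int n = 64;
    float2 h_uv[n];
    for (int i = 0; i < n; ++i) h_uv[i] = make_float2(i / 64.0f, 1.0f - i / 64.0f);

    float2* d_uv; Vertex* d_out;
    cudaMalloc(&d_uv, sizeof(h_uv));
    cudaMalloc(&d_out, n * sizeof(Vertex));
    cudaMemcpy(d_uv, h_uv, sizeof(h_uv), cudaMemcpyHostToDevice);

    domainStage<<<1, 64>>>(d_uv, d_out, n);

    Vertex h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(Vertex), cudaMemcpyDeviceToHost);
    printf("vertex 1: %.3f %.3f %.3f %.1f\n",
           h_out[1].x, h_out[1].y, h_out[1].z, h_out[1].w);
    cudaFree(d_uv); cudaFree(d_out);
    return 0;
}
```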
I think the "small buffer" approach was the crux of NVidia's design for GS buffering, using shared memory (I reckon) to capture a batch of data from a hardware thread, before optionally submitting for setup or writing the triangles to memory (stream out). In Fermi shared memory is also known as L1 cache and because it's integrated with the L2 cache there's a fairly robust architecture in place to deal with the clumps of data that TS can produce, and to deal with problem of delivering vertices/triangles to the screen-space tiles for rasterisation, as the tiles are spread across all the SIMDs.