According to:
http://download.microsoft.com/download/f/2/d/f2d5ee2c-b7ba-4cd0-9686-b6508b5479a1/Direct3D10_web.pdf
each primitive can output 1024 32b values (e.g. 256 Vec4 FP32s), which is 4KB. The diagram doesn't agree with the text, though: for some reason it shows only 128 32b values (32 x 4 x 32b).
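To put numbers on that discrepancy, a trivial sketch (the 1024-value limit is from the text of that PDF; the 128-value figure is my reading of the diagram):

```python
# Per-primitive GS output limit: the PDF's text versus its diagram.
BYTES_PER_VALUE = 4                      # one 32b value

text_limit = 1024                        # "1024 32b values" per primitive (text)
diagram_limit = 32 * 4                   # 32 x 4 x 32b shown in the diagram

print(text_limit * BYTES_PER_VALUE)      # 4096 bytes = 4KB
print(diagram_limit * BYTES_PER_VALUE)   # 512 bytes
print(text_limit // 4)                   # 256 Vec4 FP32s
```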
There's no reason to expect ATI hardware to work with batches as large as 64 pipes x 8 primitives = 512 primitives.
The batch sizes for vertices/primitives/fragments will prolly all come out the same. I'm gonna guess 64 objects per batch: 16 wide and four clocks long (R580 is 12 wide and four clocks long; Xenos is 16 wide and four clocks long).
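For what it's worth, here's that guess as arithmetic (the R580 and Xenos figures are the usual width x clocks numbers; the D3D10 row is pure guesswork on my part):

```python
# Batch size = SIMD width (objects per clock) x number of clocks per batch.
# R580 and Xenos are the commonly quoted arrangements; "D3D10 guess" is mine.
batch_shapes = {
    "R580":        (12, 4),   # 48 objects per batch
    "Xenos":       (16, 4),   # 64 objects per batch
    "D3D10 guess": (16, 4),   # 64 objects per batch
}
for name, (width, clocks) in batch_shapes.items():
    print(f"{name}: {width * clocks} objects per batch")
```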
So a single GS batch could write as much as 64 input primitives x 4KB = 256KB. That's even more memory than Bob indicated. I'd guess that's about 100x more memory than a DX9 GPU's post-vertex-transform cache (PVTC), which normally only deals with tens of vertices and their attribute data at a time, maximum.
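Rough numbers, with the DX9 PVTC size being nothing more than a guess at tens of vertices carrying a handful of Vec4 attributes each:

```python
# Worst-case PGSC footprint for one GS batch versus a guessed DX9 PVTC.
prims_per_batch = 64                      # my batch-size guess from above
max_bytes_per_prim = 1024 * 4             # full 1024 32b values = 4KB
pgsc_worst_case = prims_per_batch * max_bytes_per_prim
print(pgsc_worst_case // 1024, "KB")      # 256 KB

# Guess: ~32 vertices, each with ~16 FP32 attributes (both numbers invented).
pvtc_guess = 32 * 16 * 4                  # 2 KB
print(pgsc_worst_case / pvtc_guess, "x")  # ~128x, i.e. roughly 100x bigger
```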
Even as a minimum, with a unified design shading 64 vertices in a batch, the post-VS cache would prolly be 2x as big as in a DX9 GPU (32 vertices, say). And that doesn't provide any leeway for load-balancing twixt stages, which a unified architecture needs.
While the GS is producing vertices and their attributes and putting them into the post-GS cache (PGSC), the following stages (Clip+Project+...) are consuming those vertices: "pixel shading consumes the vertex data". The relative rates at which vertices are put into and taken out of the PGSC are obviously unpredictable (on the way out: how many triangles are culled? how many fragments does each triangle produce? how long are the pixel shaders? etc.).
Since a unified architecture is predicated upon using inter-stage queues to smooth out the scheduling of batches, perhaps it's worth arguing that the cost of a large PGSC isn't so troublesome: unified architectures seem to use lots of memory as a trade for enhanced per-ALU utilisation (the register file seems to be over-sized, too). Memory is cheap (and easy to make redundant for good yield), though as it grows it gets slower and harder to access.
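Here's a toy model of that smoothing argument: a bounded queue (standing in for the PGSC) between a producer whose output rate jumps around (the GS) and a consumer whose drain rate jumps around (clip/setup/pixel shading). All the rates and capacities are invented; it's just to illustrate why a bigger queue buys fewer stalls:

```python
import random

random.seed(0)

def stall_rate(capacity, steps=10_000):
    """Fraction of steps on which the producer (GS) is blocked by a full queue."""
    occupancy, stalls = 0, 0
    for _ in range(steps):
        produced = random.randint(0, 8)   # GS output per step: unpredictable
        consumed = random.randint(0, 8)   # drain per step: culling, shader length...
        if occupancy + produced > capacity:
            stalls += 1                   # queue full: GS would have to wait
            produced = capacity - occupancy
        occupancy = max(0, occupancy + produced - consumed)
    return stalls / steps

for capacity in (16, 64, 256):
    print(f"queue capacity {capacity:4d}: producer stalled {stall_rate(capacity):.1%} of steps")
```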
You could deliberately size the PGSC so that it can only absorb, say, 128 32b values (instead of 1024) per primitive. The command processor would then issue "smaller" batches than normal if the PGSC can't hold all the data produced by the GS program (e.g. when it wants to output the full 1024 32b values). I dare say the command processor doesn't know in advance how much data that's going to be, so it might be forced to simply junk the GS-output of some pipes and re-submit those primitives in a following batch. So the overall throughput of primitives is cut (as though some pipes are running empty) as the amount of per-primitive data produced by the GS increases.
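Purely as speculation about how such a command processor might cope, a sketch: launch a batch, see how much each GS invocation actually emitted, keep what fits in the budgeted PGSC and junk/resubmit the rest (the 64 x 128 budget is just the example figure from above):

```python
# Speculative: admit GS outputs into a PGSC budgeted at 128 32b values per
# primitive; anything that doesn't fit gets junked and resubmitted later.
PGSC_CAPACITY = 64 * 128                  # 64 prims x 128 32b values

def admit(emitted_per_prim):
    """emitted_per_prim: how many 32b values each GS invocation actually wrote."""
    kept, resubmit, used = [], [], 0
    for prim, emitted in enumerate(emitted_per_prim):
        if used + emitted <= PGSC_CAPACITY:
            kept.append(prim)
            used += emitted
        else:
            resubmit.append(prim)         # junked: run again in a later batch
    return kept, resubmit

# 64 primitives all emitting the full 1024 values: only 8 survive each pass,
# so primitive throughput drops by 8x as GS output per primitive grows.
kept, resubmit = admit([1024] * 64)
print(len(kept), "kept,", len(resubmit), "to resubmit")
```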
I suppose you have to ask what the typical data amplification produced by the GS is going to be, and how ATI and NVidia will choose to support it.
It seems to me that future D3D10 GPUs can scale performance (the number of primitives being shaded concurrently) by increasing the size of the post-GS cache, as well as by increasing the total number of pipes available for GS. Is GS vertex amplification an "architectural by-way" that'll be circumvented by D3D11? I doubt it: what's the alternative to data amplification within the GS?
I get the strong feeling that the limit of 1024 32b values per primitive has been cut down from a much larger number. Whether that's because, with batching, the PGSC just gets silly huge, or because one IHV thinks that's already way too big compared with a DX9 GPU's PVTC, who knows?
Jawed