So with ATI, excluding the GS path, all intermediate data between VS and PS doesn't ever hit main memory.
Yep, it seems the position buffer and parameter cache are on-die, and load-balancing twixt VS and PS can be driven directly by these buffers, i.e. by their percentage-full and perhaps rate-of-change.
Just the GS path is messy and hits memory, because of VS output sharing between GS invocations.
Yes, two ring buffers are used to balance the workload and provide queues that can adjust independently. When you have triangles in the VS input buffers producing random numbers of triangles in the PS input buffers (i.e. position buffer and parameter cache), you have a fairly fundamental ordering problem, not to mention a quantity-of-data problem.
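As a toy model of the percentage-full load-balancing I'm imagining (all the buffer sizes, thresholds and amplification ratios below are made up, not R600's):

```python
# Toy model of load balancing between a producer (VS/GS) and a consumer (PS)
# via a bounded on-die queue. Sizes, watermarks and amplification are
# illustrative numbers only, not anything from a real GPU.
import random
from collections import deque

RING_CAPACITY = 256          # entries in the on-die queue (hypothetical)
HIGH_WATER = 0.75            # above this, prefer scheduling PS work
LOW_WATER = 0.25             # below this, prefer scheduling VS/GS work

ring = deque()

def producer_step():
    """One VS/GS invocation: emits a variable number of primitives (0..4 here)."""
    for _ in range(random.randint(0, 4)):
        if len(ring) < RING_CAPACITY:   # real hardware would stall instead of dropping
            ring.append("prim")

def consumer_step():
    """One PS batch: consumes a handful of primitives if any are queued."""
    for _ in range(min(3, len(ring))):
        ring.popleft()

def schedule():
    """Pick which stage gets the next slot, based only on how full the ring is."""
    fill = len(ring) / RING_CAPACITY
    if fill > HIGH_WATER:
        consumer_step()
    elif fill < LOW_WATER:
        producer_step()
    else:
        # middle ground: alternate; rate-of-change could be the tie-breaker
        producer_step()
        consumer_step()

for _ in range(1000):
    schedule()
print(f"final occupancy: {len(ring)}/{RING_CAPACITY}")
```

The point being that the scheduler only needs the queue's occupancy (and maybe its trend) to decide where to spend the next slot, so random amplification doesn't break the scheme - it just makes the occupancy noisier.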
One thing that I've not seen is how R600 etc. deal with the volume of data produced by tessellation amplification, which is up to 15x. Maybe that hard limit means it stays on-die?
The interesting thing about the DC shader is that it can effectively be load balanced against PS consumption of the position buffer and parameter cache - i.e. DC acts as a surrogate VS from the point of view of keeping pixel shading fed. If the GS is a culling rather than amplifying process it seems there's a strong risk of running out of work for PS to do - but that risk seems fundamental to any architecture.
The obvious problem with this double-ring-buffer approach is that there are 5 memory operations. Obviously there are fewer than that, on average, per vertex if amplification is the norm, but even in the extreme best case there's still a minimum of 2 memory operations per vertex (hmm, should be able to work it out with the 1024-scalars-per-invocation limit). The bandwidth cost is obviously troubling, too. ALU:TEX is supposed to be going up, though.
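Back-of-envelope, with made-up per-vertex sizes (a float4 position plus 8 attribute scalars) and assuming each ring-buffer hop costs at least one write and one later read of the data:

```python
# Back-of-envelope: traffic per GS invocation through a memory ring buffer,
# under the D3D10 cap of 1024 scalar outputs per invocation.
# Scalars-per-vertex (12) and bytes-per-scalar (4) are my assumptions.
MAX_OUTPUT_SCALARS = 1024        # D3D10 per-invocation GS output limit
SCALARS_PER_VERTEX = 12          # e.g. float4 position + 8 attribute scalars
BYTES_PER_SCALAR = 4

max_amplification = MAX_OUTPUT_SCALARS // SCALARS_PER_VERTEX   # 85 vertices
bytes_out = max_amplification * SCALARS_PER_VERTEX * BYTES_PER_SCALAR

# Each ring-buffer hop costs at least a write plus a later read of that data.
traffic_per_hop = 2 * bytes_out
print(max_amplification, bytes_out, traffic_per_hop)   # 85, 4080, 8160
```

So at full amplification that's roughly 4 KB written and then read again per hop, per invocation, before you even get to the ordering problem.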
On the other hand, this is scalable. You can see why AMD was miffed at the 1024-scalars per GS-invocation limit that was written into D3D10, when they'd built a dataflow that doesn't care.
Whether it's due to ordering or volume of data, it seems to me a GPU must be able to spill inter-kernel buffers to memory. With something like Larrabee it's all through the cache, so there it's a question of deciding how many cache lines to use...
Append buffers are the general case for this. It would seem reasonable to assume these buffers can be cached, much like global atomics are in NVidia. So then it's really a question of sizing and load-balancing...
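Conceptually an append buffer is just storage plus a counter that's bumped atomically to reserve a slot. A crude single-threaded sketch, with a purely hypothetical on-die/spill split, to frame the sizing question:

```python
# Minimal single-threaded model of an append buffer: storage plus a counter
# that hardware would bump with an atomic fetch-and-add to reserve a slot.
# The on-die/off-die split and its size are hypothetical.
class AppendBuffer:
    def __init__(self, on_die_slots=64):
        self.count = 0                    # hardware: atomic fetch-and-add
        self.on_die = [None] * on_die_slots
        self.spilled = {}                 # overflow lands in "memory"

    def append(self, item):
        slot = self.count
        self.count += 1
        if slot < len(self.on_die):
            self.on_die[slot] = item      # fast path: stays on-die / in cache
        else:
            self.spilled[slot] = item     # slow path: spills to memory

buf = AppendBuffer(on_die_slots=4)
for i in range(10):
    buf.append(i)
print(buf.count, buf.spilled)             # 10, slots 4..9 spilled
```

Sizing the fast portion and deciding when to spill is then the same kind of watermark decision as with the ring buffers above.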
For DX11, can (IA, VS, HS) be grouped into one kernel? Meaning the kernel manually fetches vertex inputs, and HS outputs are stored in the post-transform cache instead of VS outputs?
My head hurts every time I look at HS->TS->DS.
It seems that HS amplifies a patch's control points that are shaded in the VS, similar to the way in which GS amplifies vertices shaded by VS, using the "constant function".
There's such a nice diagram of this - hope Jack doesn't mind the linkage - from:
http://www.gamedev.net/community/forums/mod/journal/journal.asp?jn=316777
which also covers his experiments with using a CS to generate weightings ahead of terrain tessellation.
The GDC09 presentation is also useful:
http://developer.amd.com/gpu_assets/GDC09_D3D11Tessellation.pps
(I can't find this on NVidia's site - there's a PDF of it available on the GDC09 site.) It shows the usage of 10 control points per triangle more explicitly. I'm unclear on whether 10 is the limit here - I just don't know enough about HS/TS/DS.
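For what it's worth, here's the toy picture of HS->TS->DS I've settled on - 3 control points, a naive uniform tessellation pattern and linear evaluation are all simplified stand-ins, not what D3D11 actually specifies:

```python
# Toy HS -> TS -> DS dataflow for one triangle patch, just to show which
# stage produces what. All the details are deliberately oversimplified.

def hull_shader(vs_control_points):
    """Per patch: output control points, plus 'constant function' data
    (the tessellation factors) computed once per patch."""
    out_control_points = vs_control_points          # pass-through here
    patch_constants = {"tess_factor": 4}            # constant-function output
    return out_control_points, patch_constants

def tessellator(patch_constants):
    """Fixed function: turns tess factors into domain points (barycentrics).
    This is a naive uniform pattern, not the real D3D11 partitioning."""
    n = patch_constants["tess_factor"]
    points = []
    for i in range(n + 1):
        for j in range(n + 1 - i):
            u, v = i / n, j / n
            points.append((u, v, 1.0 - u - v))
    return points

def domain_shader(control_points, uvw):
    """Per domain point: evaluate the surface from the patch control points.
    Linear interpolation here; a real DS would evaluate e.g. a PN-triangle."""
    u, v, w = uvw
    return tuple(u * a + v * b + w * c
                 for a, b, c in zip(*control_points))

patch = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]   # from the VS
cps, consts = hull_shader(patch)
verts = [domain_shader(cps, p) for p in tessellator(consts)]
print(len(verts), "tessellated vertices")   # 15 for tess_factor 4
```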
Jawed