It still doesn't mean much for BW, especially for NI's 1-2 triangles per clock peak.
Well, you've probably seen my reply to 3dcgi by now, which indicates that each vertex output by TS effectively consumes 16 bytes, an entire register, as it's written to the register file, not LDS. And that cost is multiplied by the DS's entire register allocation.
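To put a rough number on that cost (the per-thread GPR count here is a number I've made up purely for illustration, not something from a dump):

```python
# Back-of-envelope sketch (my assumptions, not vendor figures): each vertex the
# TS emits becomes a DS thread, so the register file doesn't just hold the
# 16-byte domain coordinate - it holds that thread's whole GPR allocation.
BYTES_PER_GPR = 16            # one 128-bit vec4 register
THREADS_PER_WAVEFRONT = 64    # Evergreen wavefront width

def per_vertex_bytes(ds_gprs):
    """Register-file footprint per TS-output vertex, given the DS's GPR count."""
    return ds_gprs * BYTES_PER_GPR

def per_wavefront_bytes(ds_gprs):
    """Footprint of one DS wavefront (64 TS-output vertices)."""
    return THREADS_PER_WAVEFRONT * per_vertex_bytes(ds_gprs)

# e.g. a DS compiled to 20 GPRs (hypothetical): 320 bytes per vertex,
# 20 KiB per wavefront of 64 vertices.
print(per_vertex_bytes(20), per_wavefront_bytes(20))
```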
I don't think you're reading my post properly. There is no reason to store all the vertices produced by a patch.
The basis of my position isn't that all the vertices need to be stored. It's that when a huge lump of vertices is the result of one or a few patches owned by a single SIMD, TS throughput can be affected, subject to the total count of SIMDs that can accept TS output.
In the trivial case (which I suspect isn't realistic): if there's only one SIMD running HS/DS, then TS has to stop while waiting for DS threads to complete. More realistically, several SIMDs will be there to take on the DS workload.
Regardless of the number of SIMDs occupied by HS/DS, when a patch results in TS outputting more than X vertices (X dependent on register allocation of DS), TS is going to stall because it can't multi-thread patches - it treats them strictly sequentially. That's my interpretation, and I suspect it's a major factor in the performance cliff we see. Cypress falls to 1/6 to 1/10 of GTX480 throughput in the worst cases.
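Here's a back-of-envelope sketch of what I mean by X. The register file size, the DS GPR count and the domain-point formula are all my assumptions, purely for illustration:

```python
# Hypothetical stall-threshold estimate. Assumptions: a 256 KiB register file
# per SIMD, a DS compiled to 20 GPRs per thread, and uniform integer
# tessellation of a tri domain giving roughly (n + 1)(n + 2) / 2 domain
# points for factor n.
REGFILE_BYTES = 256 * 1024
BYTES_PER_GPR = 16

def max_resident_ds_vertices(ds_gprs):
    """How many DS threads (one per TS-output vertex) one SIMD can hold."""
    return REGFILE_BYTES // (ds_gprs * BYTES_PER_GPR)

def domain_points_tri(n):
    """Approximate TS output for a tri-domain patch at uniform integer factor n."""
    return (n + 1) * (n + 2) // 2

X = max_resident_ds_vertices(20)                  # ~819 vertices resident
print(X, domain_points_tri(15), domain_points_tri(64))
# Factor 15 -> ~136 points per patch: fits comfortably.  Factor 64 -> ~2145
# points: a single patch exceeds what one SIMD can hold, so TS has to wait
# for DS wavefronts to retire before it can finish emitting that patch.
```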
Fermi probably stalls its TSs too, when amplification is very high. There are certainly warnings:
http://www.highperformancegraphics.org/media/Hot3D/HPG2010_Hot3D_NVIDIA.pdf
Not much more than typical vertex shaders.
For patches like that, you just have to write the 3 or 4 control points to GDS. Then any SIMD can do the DS.
That would be migration/sharing of the workload, something like Fermi does, which appears not to be part of Evergreen. Although, to be fair, there's no hard evidence for this.
In a way it is. If you compile shaders to use the same number of registers per fragment, then you can basically have an ubershader to work on wavefronts using any of those shaders.
I imagine the architecture would have to be re-jigged for that kind of support. The implicit inputs to a pixel shader, say, aren't like the inputs to a DS. This ubershader would be of a type distinct from all those currently implemented.
When did I say it was irrelevant? Yes, I did give you a possibility: there could be a data path bottleneck somewhere.
Under "Hardware Tessellator Progression", the slide says "Gen 8 - AMD Radeon HD 6900 Series - Scalability and off-chip buffering". You keep trying to dodge around "off-chip buffering" as though it has nothing to do with making tessellation faster. If moving data off-die is a performance win it's probably for the same reason as seen in GS: coarse granularity data, in huge wodges, is too voluminous to keep on-die.
Anyway, it turns out that LDS wasn't the buffer under strain; it was the register file, which appears to make the strain worse...
You could have bank conflicts,
Yes, those can definitely happen. But why would they scale with tessellation factor? LDS reads are solely for the HS params that are inputs to DS, i.e. control points and tessellation factors.
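That's why I don't expect per-patch LDS traffic to grow with the factor. A sketch, where the control-point count and attribute layout are hypothetical:

```python
# The HS output that the DS reads back from LDS is a fixed-size record per
# patch, independent of how finely the patch gets tessellated.
BYTES_PER_VEC4 = 16

def hs_output_bytes_per_patch(control_points=3, attrs_per_cp=2, tess_factor_vec4s=1):
    """LDS bytes the DS reads back for one patch - constant per patch."""
    return (control_points * attrs_per_cp + tess_factor_vec4s) * BYTES_PER_VEC4

# Raising the tessellation factor adds DS threads reading this same record,
# not more LDS bytes per patch.
print(hs_output_bytes_per_patch())   # 112 bytes for this made-up layout
```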
or maybe limitations from all fragments in the DS accessing the same control point.
Broadcast is fine in ATI as far as I know.
Caches used for regular vertex processing (where all data comes from RAM) may alleviate that.
The L1 texture/vertex cache you mean?
The DS I referenced earlier does read stuff from RAM (two VFETCH instructions). I suspect it's something to do with the original stream of patches - perhaps just scale/bias and offset data for the patch buffer, to enable calculation of the right address in LDS to fetch HS params from.
In the same vein, the HS I referenced earlier reads two ints using a single VFETCH, which appear to be used to generate LDS write addresses.
In both cases, VFETCH doesn't look like it could be a bottleneck: there's plenty of ALU work in HS/DS for the SIMD to hide the VFETCH latency behind.
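For what it's worth, this is the kind of thing I'm imagining those two VFETCHed values feed into - everything here (names, layout, numbers) is my guess:

```python
# Purely a guess: a stride/offset pair that turns a patch index into the LDS
# address of the HS's output record for that patch.
def hs_param_lds_address(patch_index, patch_stride_bytes, patch_base_offset):
    """LDS byte address of the HS output record for a given patch."""
    return patch_base_offset + patch_index * patch_stride_bytes

# With the (hypothetical) stride and offset fetched once, every DS thread in
# a patch resolves the same record address - the VFETCH cost is tiny next to
# the ALU work in the shader.
addr = hs_param_lds_address(patch_index=5, patch_stride_bytes=112, patch_base_offset=0)
print(addr)   # 560
```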