You have to rasterise down to the hierarchical-Z resolution to cull in the first place ... so there's a part of the pipeline, sitting behind setup proper, that does rasterisation ... let's just call it the rasteriser, okay?
(Hierarchical fragment rejection is nice, even better with a fast path for small triangles, but it doesn't make sense to count it as part of setup.)
The output of setup is coarse rasterisation: screen-space, tile-resolution rasterisation (plus triangle data, of course).
So as long as the early-Z system works at that coarse granularity, you can easily and conservatively reject dozens of small, tessellation-generated triangles that all fit within a single screen-space tile.
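To make the idea concrete, here's a purely illustrative sketch of that tile-resolution rejection test in C. All the names, the tile size and the depth convention (smaller z = nearer) are my assumptions, not anybody's actual hardware:

```c
#include <stdbool.h>

#define TILE_SIZE 8  /* assumed screen-space tile edge in pixels */

typedef struct {
    float min_x, min_y, max_x, max_y;  /* screen-space bounding box */
    float min_z;                       /* nearest depth of the triangle */
} TriBounds;

/* hz_max[ty * tiles_x + tx] holds the farthest depth stored in each tile.
 * A triangle is conservatively hidden if its nearest point is farther
 * than the farthest depth already stored for every tile it touches. */
bool hiz_reject(const TriBounds *t, const float *hz_max, int tiles_x)
{
    int tx0 = (int)t->min_x / TILE_SIZE;
    int ty0 = (int)t->min_y / TILE_SIZE;
    int tx1 = (int)t->max_x / TILE_SIZE;
    int ty1 = (int)t->max_y / TILE_SIZE;

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            /* Nearer than the tile's farthest stored depth:
             * potentially visible here, so we can't cull. */
            if (t->min_z < hz_max[ty * tiles_x + tx])
                return false;
    return true;  /* hidden in every tile it touches: cull */
}
```

The point is that the test is per-tile, not per-pixel, so a whole cluster of tessellated micro-triangles sharing one tile can be rejected with a single comparison each.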
GS is the most obvious place to do this kind of pre-setup culling, because it's the first time that a post-tessellation triangle comes into existence.
I'm wondering if it's possible to move all the non-position attribute calculation out of DS into GS (e.g. normal or colour per vertex). GS can decide whether to cull the triangle, so that it never reaches setup. If GS does emit the triangle, it just makes sure that all the attributes are generated. This is normal stuff for GS. Manipulating the shaders like this is something the driver can do.
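The per-triangle decision a GS would make is cheap. Here's a sketch in C of one such test, signed screen-space area for back-face and degenerate culling; the names and the epsilon threshold are illustrative assumptions on my part:

```c
#include <stdbool.h>

typedef struct { float x, y; } Vec2;  /* projected screen position */

/* Twice the signed area of the triangle (counter-clockwise positive). */
static float signed_area2(Vec2 a, Vec2 b, Vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* Returns true if the GS should emit the triangle (and only then go on
 * to compute the non-position attributes), false if it can be culled
 * before setup ever sees it. */
bool gs_should_emit(Vec2 a, Vec2 b, Vec2 c)
{
    const float eps = 1e-6f;  /* assumed degenerate-area threshold */
    return signed_area2(a, b, c) > eps;  /* culls back-facing and zero-area */
}
```

Only triangles that pass this test pay for attribute evaluation, which is exactly why moving the attribute work out of DS and behind the cull could be a win.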
Trying to cull stuff pre-tessellation is another ballgame, as that paper you referred to earlier indicates. All I'm suggesting is that even without that kind of technique, there are opportunities for NVidia to improve the performance of setup - either by culling triangles before they get there, or culling them before they're exported.
Regardless, I'm hopeful that NVidia's implemented setup as a kernel, making setup scalable. Though I'd still like to see evidence that tessellation is likely to make GPUs setup-limited in games (not synthetics). Being rasterisation-, fillrate- or shader-limited will still very much be the norm, and tessellation only increases pressure on those. It's really a question of whether setup becomes a bottleneck due to tessellation.
Oh, and I suppose it's worth asking: is setup at 1 triangle per clock (in GPUs that work that way) because the early-Z system can't run any faster? Is early-Z the real bottleneck? If so, perhaps that's the heart of NVidia's improvements in Fermi.
Jawed