I can't find what you're referring to.
Tom Forsyth had a presentation at SIGGRAPH 2008.
All I can find is figure 13 in Seiler et al. where "Pre-Vertex" is ~1-2% of workload.
Wouldn't allocating triangles to a bin require the rasterization portion of the workload as well?
Forsyth's slides apparently included this in the front-end estimate.
Seiler refers to the processing of a PrimSet as occurring on a single core, which effectively makes it a serialisation point. In general Larrabee will be working on multiple PrimSets in parallel (i.e. it can overlap processing of draw calls, so long as per-tile primitive ordering is respected).
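Roughly how I picture that ordering constraint, as a pure sketch (the structures and the sequence-number trick are mine, not from the paper): any front-end core bins whichever PrimSet it was handed, and each tile's back end only consumes chunks in PrimSet order.

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct BinChunk {
        uint64_t primSetSeq;          // draw-call / PrimSet order this chunk belongs to
        std::vector<uint32_t> tris;   // triangles from that PrimSet landing in this tile
    };

    struct TileQueue {
        std::map<uint64_t, BinChunk> pending;   // keyed by PrimSet sequence number
        uint64_t nextSeq = 0;

        // Front-end cores (working on different PrimSets concurrently) push chunks.
        // Simplifying assumption: a PrimSet that touches nothing in this tile still
        // pushes an empty chunk so the sequence can advance.
        void push(BinChunk c) { pending.emplace(c.primSetSeq, std::move(c)); }

        // The back end pulls strictly in PrimSet order, so per-tile primitive
        // ordering holds even though the PrimSets were binned in parallel.
        bool pop(BinChunk& out) {
            auto it = pending.find(nextSeq);
            if (it == pending.end()) return false;   // an earlier PrimSet isn't binned yet
            out = std::move(it->second);
            pending.erase(it);
            ++nextSeq;
            return true;
        }
    };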
Since this work is relatively undemanding, you might argue that it simply isn't worth spreading across chips, but it doesn't sound intrinsically single-chip.
The actual costs I see are the creation of a bin and then having any core pick up a bin for processing; both would be more expensive to do across a chip boundary.
Forsyth's slides also indicated that a bin contains tris, shaded verts, and rasterized fragments.
I'm not sure whether the fragments would be a concern for the distribution phase that might pass over the interconnect.
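For what it's worth, this is the sort of bin layout I have in mind; the fields and sizes are my guesses rather than anything from Forsyth's slides, but it at least shows which parts would have to travel to a remote tile owner and which wouldn't:

    #include <cstdint>
    #include <vector>

    struct ShadedVertex { float pos[4]; float attrs[8]; };   // attribute count is a guess
    struct BinTriangle  { uint32_t v0, v1, v2; };            // indices into verts below

    struct TileBin {
        std::vector<ShadedVertex> verts;   // front-end output: would cross the interconnect
        std::vector<BinTriangle>  tris;    // front-end output: would cross the interconnect
        // Rasterised fragments are generated by the back end on the tile's own core,
        // so I'd expect them to stay local rather than travel with the bin.
    };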
Since the memory subsystem should maintain a coherent image of memory across the chips, there is no algorithmic reason why it would be single-chip.
However, the costs of this work have only been evaluated as sufficiently low for a single-chip scenario.
Other multi-chip rendering methods often opt for duplicating setup work. This is the case for GPUs, and also for a number of distributed rendering schemes for CPUs, though in those cases there is often a trip over a network interconnect that raises costs even further.
Tessellation is my biggest question mark with Larrabee, specifically whether a just-in-time approach is used for the creation of bins: create bins with untessellated patches, and then when rasterisation/shading/back-end work starts on a tile, the bin is tessellated. This inevitably causes leakage across tile boundaries though (new vertices can't be constrained to a tile), which makes it seem unworkable. Dunno.
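To make the leakage concrete, here's roughly what "tessellate when the back end starts" would look like; the tile size, patch layout and toy tessellator are all made up:

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct Tri   { float x[3], y[3]; };
    struct Patch { float cp[16][2]; };        // control points, layout assumed

    constexpr int kTileSize = 128;            // assumed tile dimensions

    // Stand-in tessellator: a real one would emit many triangles per patch.
    std::vector<Tri> tessellate(const Patch&) {
        return { Tri{{0.f, 130.f, 0.f}, {0.f, 0.f, 130.f}} };
    }

    // If the bin holds patches and tessellation happens when the tile's back end
    // starts, the generated triangles can land in other tiles entirely.
    std::map<std::pair<int, int>, std::vector<Tri>> tessellateAndRebin(const Patch& p) {
        std::map<std::pair<int, int>, std::vector<Tri>> perTile;
        for (const Tri& t : tessellate(p)) {
            int tx = static_cast<int>(t.x[0]) / kTileSize;   // crude: bin by first vertex
            int ty = static_cast<int>(t.y[0]) / kTileSize;
            perTile[{tx, ty}].push_back(t);                  // may not be the local tile
        }
        return perTile;
    }

Any triangle that lands outside the owning tile has to be shipped to a neighbour's bin or rediscovered there, which is exactly the binning work the scheme was trying to defer.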
The flexibility of the software pipeline is the reason Forsyth's estimate for front-end work spans such a wide range.
It's 10% if attribute, vertex, and tessellation work is deferred to the back end. It's variable because those three can be done in either the front or the back end.
Bin size would be most amenable to sending to another chip if this work is deferred, but back-end burden and bin spread would be worse.
If done in the front-end, bin size becomes much larger and more costly to send to a remote pool of cores, though the bins themselves would be much better-behaved.
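Some back-of-the-envelope numbers for that trade-off, with every per-vertex size assumed by me:

    #include <cstdio>

    int main() {
        // Deferred split: the bin carries little more than vertex indices plus a
        // reference back to the source data.
        const int deferredBytesPerTri = 3 * 4 + 8;             // 3 indices + small header

        // Front-end split: fully shaded vertices (position + attributes) travel in the bin.
        const int shadedVertexBytes   = 4 * 4 + 8 * 4;         // float4 position + 8 float attrs
        const int expandedBytesPerTri = 3 * shadedVertexBytes + 3 * 4;

        std::printf("deferred: %d B/tri, expanded: %d B/tri (%.1fx larger)\n",
                    deferredBytesPerTri, expandedBytesPerTri,
                    static_cast<double>(expandedBytesPerTri) / deferredBytesPerTri);
        return 0;
    }

Even with those rough figures the deferred bin is several times smaller per triangle, which is why it looks friendlier to push over a chip link, at the price of redoing attribute/vertex work in the back end.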
If the front-end is duplicated, we roughly double the computation required for PrimSet dispersal and front-end work, but with minimal increase in synchronization or bandwidth burden on the interface. The developer would be much freer to decide where to put work between the front and back ends.
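A sketch of what a duplicated front-end could look like, with the tile-to-chip mapping simply assumed to be a checkerboard:

    #include <cstddef>
    #include <vector>

    struct Tri { int minTileX, minTileY, maxTileX, maxTileY; };   // tile-space bounds

    // Assumed checkerboard ownership of tiles across chips.
    bool chipOwnsTile(int chipId, int numChips, int tx, int ty) {
        return ((tx + ty) % numChips) == chipId;
    }

    // Every chip runs this over the full PrimSet, but only keeps triangles for its
    // own tiles, so the front end never has to push bin data across the chip link.
    std::size_t binOnChip(int chipId, int numChips, const std::vector<Tri>& tris) {
        std::size_t binned = 0;
        for (const Tri& t : tris)
            for (int ty = t.minTileY; ty <= t.maxTileY; ++ty)
                for (int tx = t.minTileX; tx <= t.maxTileX; ++tx)
                    if (chipOwnsTile(chipId, numChips, tx, ty))
                        ++binned;   // stand-in for "append t to this chip's bin (tx, ty)"
        return binned;
    }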
The PrimSet distribution by one core is actually well-suited to the likely ring-bus configuration Larrabee will use. If it is anything like the polarity-shifting method used by Beckton, only a small subset of cores will be available for the control core to send updates to in a given cycle. This is fine, as the control core can only serially send out updates for a handful of cores anyway.
Given the speed and bandwidth of the on-chip bus, the costs for this are probably safe to accept.
I'd be curious to see how this works for dispersing updates to an ever-increasing number of cores and then over a chip-to-chip link, which is both more constrained and higher latency than the ring-bus.
It's also the case that if a bin is set up and ready for back-end processing, a scheme that is not aware of multi-chip NUMA is going to generate much more traffic over the interconnect, something that will not happen if the setup scheme has duplicate front-ends that specifically minimize inter-chip rendering traffic.
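By NUMA-aware I mean something like this pickup policy (again just a sketch):

    #include <cstddef>
    #include <optional>
    #include <vector>

    struct BinRef { int homeChip; int tileX, tileY; };

    // A core prefers bins whose memory is homed on its own chip and only reaches
    // across the link when it would otherwise sit idle.
    std::optional<BinRef> pickBin(int myChip, std::vector<BinRef>& ready) {
        for (std::size_t i = 0; i < ready.size(); ++i) {
            if (ready[i].homeChip == myChip) {
                BinRef b = ready[i];
                ready.erase(ready.begin() + i);
                return b;
            }
        }
        if (!ready.empty()) {                 // last resort: steal a remote bin
            BinRef b = ready.front();
            ready.erase(ready.begin());
            return b;
        }
        return std::nullopt;
    }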
I'm sceptical Intel plans to brute-force tessellation...
Tessellation is about creating more triangles. At some level, amplifying the number of triangles and then turning them into a bandwidth and latency cost is a liability for any scheme that apportions work heedless of chip location.
It would be functional so long as Intel keeps inter-chip coherence, but Larrabee's bandwidth savings would be undermined if the chip link is saturated, even if absolute GB/s consumption is lower.
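Some made-up numbers to show the scale; every constant here is an assumption, but the shape of the result is the point:

    #include <cstdio>

    int main() {
        const double trisPerFrame      = 1.0e6;   // pre-tessellation triangle count (assumed)
        const double amplification     = 16.0;    // average tessellation factor (assumed)
        const double bytesPerBinnedTri = 64.0;    // bin payload per triangle (assumed)
        const double framesPerSecond   = 60.0;
        const double remoteFraction    = 0.5;     // share of bins homed on the other chip

        const double gbPerSec = trisPerFrame * amplification * bytesPerBinnedTri *
                                framesPerSecond * remoteFraction / 1.0e9;
        std::printf("cross-chip bin traffic: ~%.1f GB/s\n", gbPerSec);
        return 0;
    }

Tens of GB/s of bin traffic over a chip-to-chip link is a very different proposition from the same traffic staying on the on-die ring.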
I guess in theory Intel could massively overspecify the inter-chip connections, but that sounds expensive.