Tom Forsyth gave a presentation at SIGGRAPH 2008.
Found it; I keep forgetting about that one.
Wouldn't allocating triangles to a bin require the rasterization portion of the workload as well?
Forsyth's slides apparently included this in the front-end estimate.
It's tile-level rasterisation only, so it's very cheap in terms of rasterisation, but it requires that all vertex shading affecting the position attribute has been computed.
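As a rough sketch of what that tile-level binning step amounts to (the names, tile size and conservative bounding-box test are my own assumptions, not anything from Forsyth's slides): transform positions only, take each triangle's screen-space bounding box, and record the triangle in every tile-sized bin it might overlap.

    // Hypothetical sketch of tile-level binning: positions only, no attribute
    // shading, no per-pixel rasterisation.  Assumes 64x64-pixel tiles;
    // off-screen culling and an exact triangle/tile overlap test are omitted.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    constexpr int kTileSize = 64;                   // assumed tile dimension in pixels

    struct ScreenPos { float x, y; };               // post-transform screen position
    struct Bin { std::vector<uint32_t> triIds; };   // "flimsy" bin: triangle IDs only

    void BinTriangle(uint32_t triId, const ScreenPos v[3],
                     std::vector<Bin>& bins, int tilesX, int tilesY)
    {
        // Screen-space bounding box of the triangle.
        float minX = std::min({v[0].x, v[1].x, v[2].x});
        float maxX = std::max({v[0].x, v[1].x, v[2].x});
        float minY = std::min({v[0].y, v[1].y, v[2].y});
        float maxY = std::max({v[0].y, v[1].y, v[2].y});

        // Clamp to the tile grid and note the triangle in every bin it may touch.
        int tx0 = std::max(0, int(minX) / kTileSize);
        int tx1 = std::min(tilesX - 1, int(maxX) / kTileSize);
        int ty0 = std::max(0, int(minY) / kTileSize);
        int ty1 = std::min(tilesY - 1, int(maxY) / kTileSize);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].triIds.push_back(triId);
    }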
Slide 26 says front-end is ~10% of the entire compute effort.
The actual cost I see is the creation of a bin and then having any core pick up a bin for processing. Both would be more expensive than the tile-level rasterisation itself.
Forsyth's slides also indicated that a bin contains tris, shaded verts, and rasterized fragments.
I'm not sure whether the fragments would be a concern for the distribution phase, which might be passing data over the interconnect.
The amount of data in a bin varies, you've described a heavy-weight bin. A flimsy bin with nothing more than triangle IDs would be cheap in a multi-chip solution. This would make the back-end more compute-heavy, which would hide some of the latency associated with NUMA. This trade-off between light/heavy is seen in current games where developers elect either to compute all attributes during vertex shading or leave some of them for computation during pixel shading (these are attributes derived from other attributes, normally) - you can view this as a form of compression of the per-vertex data.
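To make the light/heavy distinction concrete, here's an illustrative sketch of the two extremes (field names and sizes are my own guesses, not Larrabee's actual bin layout):

    // Illustrative only: two extremes of bin payload.  Sizes are ballpark guesses.
    #include <cstdint>
    #include <vector>

    // "Flimsy" bin: just references into data that stays put on the owning chip.
    // Cheap to ship across a chip link; the back end re-fetches/re-derives the rest.
    struct FlimsyBin {
        std::vector<uint32_t> triIds;                          // ~4 bytes per triangle
    };

    // "Heavy-weight" bin: everything the back end needs, precomputed up front.
    // Minimal back-end work, but far more data to move if the bin changes chips.
    struct ShadedVertex { float pos[4]; float attribs[12]; };  // ~64 bytes each
    struct Fragment     { uint16_t x, y; float depth; };       // ~8 bytes each

    struct HeavyBin {
        std::vector<uint32_t>     triIds;
        std::vector<ShadedVertex> verts;       // fully shaded attributes
        std::vector<Fragment>     fragments;   // already-rasterised coverage
    };

Anything between those two extremes is possible, which is exactly the attribute-compression trade-off described above.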
Since the memory subsystem should maintain a coherent image of memory across the chips, there is no algorithmic reason why it would be single-chip.
The costs of this work have been evaluated as being sufficiently low only for a single-chip scenario, however.
Consumption of vertex data is basically a streaming problem, i.e. quite latency tolerant if you have some decent buffers. Due to the connectivity of triangles, strips, etc., vertex data never fits neatly into cache lines, so the best approach is just to read big-ish chunks rather than individual vertices/triangles (sketched below). So two chips (conventional or Larrabee) consuming from a common stream are going to be slightly more wasteful in this regard - similar to the wastage that occurs with different vertex orderings in the post-transform vertex cache (PTVC).
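A minimal sketch of that chunked-read idea, with an arbitrary chunk size and no bounds checking, purely to illustrate the access pattern:

    // Sketch: stream vertex data in large aligned chunks rather than per-vertex.
    // kChunkBytes is an arbitrary assumption; the point is that one fill spans
    // many cache lines, so neighbouring triangles usually hit data already fetched.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    constexpr size_t kChunkBytes = 16 * 1024;   // must be a power of two here

    struct VertexChunk {
        uint8_t data[kChunkBytes];
        size_t  baseOffset = SIZE_MAX;          // offset of data[0] in the vertex buffer
    };

    // Returns a pointer to the vertex at byteOffset, refilling the chunk if needed.
    // (Bounds checks against the end of the vertex buffer are omitted.)
    const uint8_t* FetchVertex(const uint8_t* vertexBuffer, size_t byteOffset,
                               VertexChunk& chunk)
    {
        if (byteOffset < chunk.baseOffset ||
            byteOffset >= chunk.baseOffset + kChunkBytes) {
            chunk.baseOffset = byteOffset & ~(kChunkBytes - 1);   // align down
            std::memcpy(chunk.data, vertexBuffer + chunk.baseOffset, kChunkBytes);
        }
        return chunk.data + (byteOffset - chunk.baseOffset);
    }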
But Larrabee can run multiple render states in parallel. So most trivially you can have the two chips working independently. Whether two successive render states are working with the same vertex inputs (e.g. shadow buffer passes, one per light?) or whether they're independent vertex inputs, the wastage is down purely to NUMA effects.
The flexibility of the software pipeline is the reason Forsyth's estimate for front-end work spans such a wide range.
It's 10% if attribute, vertex, and tessellation work are deferred to the back end. It's variable because those three can be done in either the front or the back end.
Bins would be at their smallest, and so most amenable to sending to another chip, if this work is deferred, but back-end burden and bin spread would be worse.
If done in the front-end, bins become much larger and more costly to send to a remote pool of cores, though the bins themselves would be much better behaved.
Bin spread should fall if flimsy bins are used, since the tiles can then be larger, and a triangle lands in fewer large tiles than small ones.
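Back-of-the-envelope illustration (tile sizes and the 40x40-pixel bounding box are made-up numbers): the worst-case number of bins a triangle can land in shrinks quickly as tiles grow.

    // Toy worst-case bin spread: how many TxT tiles a WxW bounding box can overlap.
    #include <cstdio>

    int MaxSpan(int extent, int tile) { return (extent - 1) / tile + 2; }

    int main() {
        const int tiles[] = {32, 64, 128};
        for (int tile : tiles)
            std::printf("40x40 box, %3dpx tiles: up to %d bins\n",
                        tile, MaxSpan(40, tile) * MaxSpan(40, tile));
        // Prints 9, 4 and 4 bins respectively; the typical case for the larger
        // tiles is a single bin.
    }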
Back-end burden would be perfectly spread across both chips. Sure, two chips won't achieve 100% scaling - we aren't expecting that. Even Intel's estimates/simulations for scaling with core count on a single chip aren't linear...
If the front-end is duplicated, we roughly double the computation required for PrimSet dispersal and front-end work, but with minimal increase in synchronization or bandwidth burden on the interface. The developer would be much freer to decide where to put work between the front and back ends.
I don't understand how you get double.
The PrimSet distribution by one core is actually well-suited to the likely ring-bus configuration Larrabee will use.
I don't understand what you mean by PrimSet distribution. Each PrimSet can run independently on any core. The data each produces is a stream of bins. They consume vertex streams and, if they already exist, render target tiles.
Some scheduler, somewhere, must then assign bin sets to cores. This is not a heavy task. The back-end has to consume the bins and create/update the render target tile. The scheduler isn't delivering bin data to the cores tasked with back-end work.
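A sketch of how light that scheduling step can be (my own illustration, not Larrabee's actual task system): the "scheduler" is just a shared counter handing out tile indices, and each back-end core pulls the bin contents for itself.

    // Trivial bin-set scheduler: workers grab tile indices from a shared counter
    // and fetch/process the bins themselves, so the scheduler never moves bin data.
    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>

    struct Bin { std::vector<uint32_t> triIds; };

    void ProcessBin(int tileIndex, const Bin& bin) {
        // Back-end work: rasterise triIds into the render-target tile, shade, blend...
        (void)tileIndex; (void)bin;
    }

    void RunBackEnd(const std::vector<Bin>& bins, int numWorkers) {
        std::atomic<int> nextTile{0};                   // the entire "scheduler"
        std::vector<std::thread> workers;
        for (int w = 0; w < numWorkers; ++w) {
            workers.emplace_back([&] {
                for (;;) {
                    int tile = nextTile.fetch_add(1);   // claim the next tile
                    if (tile >= int(bins.size())) break;
                    ProcessBin(tile, bins[tile]);       // core pulls bin data itself
                }
            });
        }
        for (auto& t : workers) t.join();
    }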
It's also the case that once a bin is set up and ready for back-end processing, a scheme that is not aware of multi-chip NUMA is going to generate much more traffic over the interconnect - traffic that would not arise if the setup scheme has duplicate front-ends that specifically minimize inter-chip rendering traffic.
If these are flimsy bins then I don't see the issue. If these are bins with in-progress render target tiles, then that's a bit more costly. Clearly heavy-weight bins are going to be the most costly. There's zero reason to build a multi-chip non-NUMA-aware software pipeline - Intel clearly intends not to build a one-size-fits-all software pipeline. Though I'll happily agree that multi-chip is low priority until single-chip is working really well, apart from anything else because it's harder.
Tessellation is about creating more triangles. At some level, amplifying the triangle count and then turning those extra triangles into a bandwidth and latency cost is a liability that any scheme which apportions work heedless of chip location will take on.
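Rough arithmetic with entirely invented numbers, just to show how amplification scales the link traffic when bins are assigned without regard to chip locality:

    // Made-up numbers: tessellation amplification multiplies the per-frame
    // triangle records that a chip-location-blind scheme pushes over the link.
    #include <cstdio>

    int main() {
        const double inputTris      = 1.0e6;   // pre-tessellation triangles (assumed)
        const double amplification  = 8.0;     // tessellation factor (assumed)
        const double bytesPerTri    = 16.0;    // flimsy-bin record per triangle (assumed)
        const double remoteFraction = 0.5;     // share of bins landing on the other chip

        double linkBytes = inputTris * amplification * bytesPerTri * remoteFraction;
        std::printf("~%.0f MB of triangle records per frame over the chip link\n",
                    linkBytes / (1024.0 * 1024.0));   // ~61 MB with these guesses
    }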
It would be functional so long as Intel keeps inter-chip coherence, but Larrabee's bandwidth savings would be undermined if the chip link is saturated, even if absolute GB/s consumption is lower.
I guess in theory Intel could massively overspecify the inter-chip connections, but that sounds expensive.
Overall, though, I would expect that a given memory-bandwidth:link-bandwidth ratio would serve Larrabee better than it would a traditional GPU. You have a huge amount of programmer freedom with Larrabee to account for the vicissitudes of NUMA.
Jawed