It's tile-level rasterisation only, so it's very cheap in terms of rasterisation, but it requires that all vertex shading that affects the position attribute has been computed.
Slide 26 says the front-end is ~10% of the entire compute effort.
I must have mixed some of Seiler's description and Forsyth's together mentally. I had thought the full amount of the rasterization remained with the minimal front-end solution.
The amount of data in a bin varies; you've described a heavy-weight bin. A flimsy bin holding nothing more than triangle IDs would be cheap in a multi-chip solution.
That would be what I was commenting about when I mentioned the lower range.
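To put rough numbers on the two extremes, here's a sketch of the per-triangle record each kind of bin might hold. The exact fields and sizes are my own guesses rather than anything from the Seiler paper; the ratio between them is the point.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical "heavy-weight" bin record: post-transform positions and
// interpolants are written straight into the bin, so the back-end never
// touches the original vertex stream again.
struct HeavyBinEntry {
    float    clipPos[3][4];   // post-transform positions for the 3 vertices
    float    attribs[3][8];   // interpolants (UVs, normals, colours, ...)
    uint32_t stateId;         // render state the triangle was submitted under
};                            // 148 bytes per triangle with these guesses

// Hypothetical "flimsy" bin record: just enough to re-fetch (or re-shade) the
// triangle on whichever chip ends up owning the tile.
struct FlimsyBinEntry {
    uint32_t primSetId;       // which PrimSet the triangle came from
    uint32_t triangleIndex;   // index within that PrimSet
};                            // 8 bytes per triangle

// A bin is simply a growable list of entries for one screen tile.
template <typename Entry>
using Bin = std::vector<Entry>;

int main() {
    // Shipping flimsy bins across an interconnect moves roughly 18x less data
    // here, at the cost of redoing or re-fetching vertex work on the consumer.
    std::printf("heavy: %zu bytes, flimsy: %zu bytes\n",
                sizeof(HeavyBinEntry), sizeof(FlimsyBinEntry));
    return 0;
}
```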
There would be additional costs in the case of tessellation spawning new triangles, which might worsen bin spread, I would think.
It would also dictate a specific style of front-end.
Programmers could not opt to have a heavy front-end in the multi-chip case, and they may find themselves constrained by this.
Running the front end on both chips would possibly allow this choice to remain.
This would make the back-end more compute-heavy, which would hide some of the latency associated with NUMA.
Yes and no.
It depends on just how naive the solution is.
Worst-case, a lot of the necessary data resources do not have local copies on the second chip, and then the entire thing is throttled by the interconnect.
The way back-end cores eagerly snatch up the next available tile was not described as taking into account any locality.
This is probably fine in the single-chip case since it's all the same memory controllers and ring bus.
It can incur additional costs if that self-assignment causes a core to grab a tile whose bin lives on the other chip.
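To spell out what I mean by the eager scheme versus something chip-aware, a sketch along these lines (the queue layout, the two-chip split and the ownership field are all assumptions of mine, not how Intel described it):

```cpp
#include <atomic>
#include <cstdint>
#include <optional>
#include <vector>

struct Tile { uint32_t id; uint32_t homeChip; };   // homeChip: which DRAM pool holds its bin

// Naive scheme: one global counter, every back-end core grabs the next ready
// tile regardless of where its bin set lives. Fine on one chip, where all
// cores sit behind the same memory controllers and ring bus.
std::optional<Tile> grabNextTileNaive(std::atomic<uint32_t>& next,
                                      const std::vector<Tile>& readyTiles) {
    uint32_t i = next.fetch_add(1, std::memory_order_relaxed);
    if (i >= readyTiles.size()) return std::nullopt;
    return readyTiles[i];
}

// Chip-aware variant: prefer tiles whose bins live in this chip's DRAM pool,
// and only steal a remote tile when the local list runs dry. This is the kind
// of locality the eager scheme, as described, does not attempt.
std::optional<Tile> grabNextTileLocal(uint32_t myChip,
                                      std::atomic<uint32_t> next[2],
                                      const std::vector<Tile> readyPerChip[2]) {
    for (uint32_t pass = 0; pass < 2; ++pass) {
        uint32_t chip = (myChip + pass) % 2;       // local first, then remote
        uint32_t i = next[chip].fetch_add(1, std::memory_order_relaxed);
        if (i < readyPerChip[chip].size()) return readyPerChip[chip][i];
    }
    return std::nullopt;                            // nothing left anywhere
}

int main() { return 0; }
```

On one chip the two behave much the same; on two chips the second version only pays the interconnect when its local list runs dry.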
Consumption of vertex data is basically a streaming problem, i.e. quite latency tolerant, if you have some decent buffers.
Would these buffers be in local memory per-chip or in the L2 caches?
An additional concern is that this only turns into a streaming problem once the remote cores are aware that the additional vertex work is available. (edit for clarity: so we must factor in synchronization costs that do not exist otherwise)
The most latency-tolerant methods would lean most heavily on bandwidth and buffers to work their magic, and we have an unknown ceiling in interconnect bandwidth. It's probably safe to assume interconnect bandwidth << DRAM bandwidth.
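What I mean by leaning on buffers is roughly the classic double-buffered prefetch below; the chunk size is a placeholder and the fetch/shade routines are stand-ins, since we don't know the interconnect's real behaviour:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kChunkBytes = 64 * 1024;   // assumed chunk size, not from the paper

// Stand-ins for whatever really moves and consumes the data. The memcpy here
// models a blocking fetch; the real thing would be asynchronous (DMA, prefetch
// instructions, or another thread issuing loads).
void fetchChunk(const uint8_t* src, uint8_t* dst, size_t bytes) { std::memcpy(dst, src, bytes); }
void shadeVertices(const uint8_t*, size_t) { /* position/attribute shading would go here */ }

// Classic double buffering: while cores shade the vertices in one buffer, the
// next chunk is already in flight. The remote fetch stays hidden as long as
// shade_time >= chunk_bytes / interconnect_bandwidth; if interconnect
// bandwidth is much lower than DRAM bandwidth, that inequality is harder to
// satisfy and the buffers have to grow.
void streamVertexData(const uint8_t* stream, size_t totalBytes) {
    std::vector<uint8_t> buf[2] = { std::vector<uint8_t>(kChunkBytes),
                                    std::vector<uint8_t>(kChunkBytes) };
    size_t offset = 0;
    size_t pending = std::min(kChunkBytes, totalBytes);
    fetchChunk(stream, buf[0].data(), pending);                   // prime buffer 0
    for (int cur = 0; offset < totalBytes; cur ^= 1) {
        size_t bytes = pending;
        offset += bytes;
        pending = std::min(kChunkBytes, totalBytes - offset);
        if (pending != 0)                                         // kick off the next fetch
            fetchChunk(stream + offset, buf[cur ^ 1].data(), pending);
        shadeVertices(buf[cur].data(), bytes);                    // would overlap the in-flight fetch
    }
}

int main() {
    std::vector<uint8_t> vertexStream(1 << 20);                   // 1 MB of dummy vertex data
    streamVertexData(vertexStream.data(), vertexStream.size());
    return 0;
}
```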
Vertex data, due to the connectivity of triangles, strips, etc., never neatly fits precisely into cache lines, so the best approach is just to read big-ish chunks rather than individual vertices/triangles.
I figured alignment issues and cache-line problems fall into the noise.
I think reading from a bin in memory would be done in chunks anyway.
So two chips (conventional or Larrabee) consuming from a common stream are going to be slightly more wasteful in this regard - it's similar to the wastage that occurs with different vertex orders in PTVC.
I have my doubts about how similar they can be.
Any low-level event that leads to waste in the multi-chip case is something that, even if rare, is in my figuring likely to be 2-10x as expensive to handle versus a waste problem that stays local.
This is why I am reluctant to assume that something that is 10% of the load in a single-chip scenario stays at 10% with multiple chips.
Assigning a PrimSet to a core might take tens of cycles with one chip.
Assigning triangles to bins can take place at the full bandwidth of the chip DRAM bus.
A back-end thread detecting that a bin is ready would take tens of cycles, and reading bin contents in the back end can use full bandwidth.
With multiple chips and completely naive assignment, these assumptions will fail roughly half the time.
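Back-of-envelope, with numbers that are purely my own assumptions: if about half of those operations land remote and a remote operation costs 2-10x its local equivalent, the front-end's 10% share doesn't stay put.

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    // All numbers here are assumptions for illustration, not measurements.
    const double frontEndShare  = 0.10;   // ~10% of total work on one chip (slide 26)
    const double remoteFraction = 0.5;    // naive assignment: ~half the ops land remote
    for (double remotePenalty : {2.0, 10.0}) {
        // Expected cost multiplier on the front-end's memory/sync operations.
        double multiplier = (1.0 - remoteFraction) + remoteFraction * remotePenalty;
        std::printf("penalty %4.1fx -> front-end costs ~%.1f%% of the original total\n",
                    remotePenalty, frontEndShare * multiplier * 100.0);
    }
    return 0;
}
```

That conflates compute with memory and synchronization costs, so treat it as an upper-ish bound on the effect, but it's why I'm wary of the 10% figure carrying over.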
But Larrabee can run multiple render states in parallel. So most trivially you can have the two chips working independently. Whether two successive render states are working with the same vertex inputs (e.g. shadow buffer passes, one per light?) or whether they're independent vertex inputs, the wastage is down purely to NUMA effects.
This would come down to the complexity of the dependency graph and the number of independent nodes in it.
On the scale of granularity where per-frame barriers are coarsest and per-pixel or per-fragment is finest, coordinating at render-state level would be medium-to-coarse synchronization. It would be additional synchronization, and that would be a performance penalty even with one chip.
Some of this may be unavoidable regardless of the scheme used: those buffers eventually have to be consumed, so much of that data will need to cross the interconnect anyway.
It might be that this can safely be done by demand-streaming to each chip, or perhaps a local copy of each buffer will exist in each memory pool.
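Coming back to the "run the render states independently" idea: the coordination I have in mind is roughly a wave-by-wave walk of the dependency graph, handing independent states (say, one shadow-buffer pass per light) to whichever chip is free. The representation below is mine, just to show where the extra synchronization point sits; it assumes the graph is acyclic.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One entry per render state; dependsOn holds indices of states whose output
// buffers (shadow maps, etc.) this state reads.
struct RenderState {
    std::vector<uint32_t> dependsOn;
};

// Toy scheduler: states with no unfinished dependencies can run in parallel,
// up to one per chip. The join at the end of each wave is the medium-to-coarse
// synchronization point -- it exists on one chip too, but with two chips it is
// also where output buffers may have to cross the interconnect before a
// dependent state can start.
void scheduleRenderStates(const std::vector<RenderState>& states, size_t numChips) {
    std::vector<size_t> remaining(states.size());
    std::vector<bool>   done(states.size(), false);
    for (size_t i = 0; i < states.size(); ++i) remaining[i] = states[i].dependsOn.size();

    size_t finished = 0;
    while (finished < states.size()) {
        std::vector<size_t> runnable;                        // e.g. N independent shadow passes
        for (size_t i = 0; i < states.size(); ++i)
            if (!done[i] && remaining[i] == 0) runnable.push_back(i);

        size_t issued = std::min(runnable.size(), numChips); // one state per chip this wave
        // ... the issued states would render here, each on its own chip ...
        for (size_t k = 0; k < issued; ++k) {                // join: retire the wave
            size_t s = runnable[k];
            done[s] = true;
            ++finished;
            for (size_t j = 0; j < states.size(); ++j)
                for (uint32_t d : states[j].dependsOn)
                    if (d == s) --remaining[j];
        }
    }
}

int main() {
    // Two independent shadow passes (0, 1) feeding a main pass (2).
    std::vector<RenderState> states = { {}, {}, {{0, 1}} };
    scheduleRenderStates(states, 2);
    return 0;
}
```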
Bin spread should fall if flimsy bins are used, since the tiles can then be larger.
The bin sets are stored in main memory, though.
The Seiler paper posited the number of color channels and format precision as the factors in deciding tile size.
A bin's contents could be streamed on demand from a chip's DRAM pool, so why would this impact the tile size?
Granted, if for some reason the bin were in the other chip's memory pool due to a non-NUMA-aware setup, the costs in latency, bandwidth or memory buffering would be higher.
Actually, if the scheme is that naive, it wouldn't know to add additional buffering and the chips would just stall a lot.
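For reference, the dependence I was pointing at: Seiler et al. size tiles so that a full render-target tile (colour plus depth at whatever precision and sample count is in use) fits in a core's L2, which is why channel count and format precision drive the tile dimensions. A rough version of that calculation, with the cache budget as my assumption:

```cpp
#include <cstdio>

// Largest power-of-two square tile whose colour + depth data fits in the given
// cache budget. The 128 KB budget below is my assumption (a slice of a 256 KB
// L2, leaving room for other working data), not a number from the paper.
unsigned tileDim(unsigned bytesPerPixel, unsigned cacheBudgetBytes) {
    unsigned dim = 1;
    while ((2u * dim) * (2u * dim) * bytesPerPixel <= cacheBudgetBytes) dim *= 2;
    return dim;
}

int main() {
    const unsigned budget = 128 * 1024;
    struct { const char* desc; unsigned bytesPerPixel; } formats[] = {
        { "32-bit colour + 32-bit depth",            8 },
        { "FP16 colour + 32-bit depth",             12 },
        { "4x MSAA, 32-bit colour + 32-bit depth",  32 },
    };
    for (const auto& f : formats) {
        unsigned d = tileDim(f.bytesPerPixel, budget);
        std::printf("%-40s -> %3ux%u tile\n", f.desc, d, d);
    }
    return 0;
}
```

So on this reading, it's the render-target footprint rather than the weight of the bins that caps the tile size.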
I don't understand how you get double.
I'm talking about running the exact same front-end setup process on both chips simultaneously, so each operation would be performed twice. There would be two bin structures, one per chip; each chip would own a portion of the screen space so that it can avoid writing triangle updates to non-local bins, and cores would not try to work with data from the other chip's DRAM pool unless absolutely necessary.
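Concretely, something along these lines, where the ownership test is the only place the two chips' otherwise identical front-end passes diverge (the even/odd tile-column split and the tile constants are just placeholder policy):

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t kTileSize = 64;   // pixels per tile edge (placeholder)
constexpr uint32_t kTilesX   = 32;   // screen width in tiles (placeholder)

struct BinnedTriangle { uint32_t primSetId, triangleIndex; };

// Each chip holds its own bin structure in its own DRAM pool.
struct LocalBins {
    std::vector<std::vector<BinnedTriangle>> bins;   // one bin per tile
    explicit LocalBins(size_t tileCount) : bins(tileCount) {}
};

// Placeholder ownership policy: chip 0 owns even tile columns, chip 1 owns odd.
bool ownsTile(uint32_t chip, uint32_t tileX, uint32_t /*tileY*/) {
    return (tileX & 1u) == chip;
}

// The same front-end runs on both chips over the same triangle list; the only
// divergence is that each chip drops bin updates for tiles it does not own,
// so no front-end write ever crosses the interconnect.
void binTriangle(uint32_t chip, LocalBins& myBins,
                 uint32_t primSet, uint32_t tri,
                 uint32_t minX, uint32_t minY, uint32_t maxX, uint32_t maxY) {
    for (uint32_t ty = minY / kTileSize; ty <= maxY / kTileSize; ++ty)
        for (uint32_t tx = minX / kTileSize; tx <= maxX / kTileSize; ++tx) {
            if (!ownsTile(chip, tx, ty)) continue;    // other chip's problem
            myBins.bins[ty * kTilesX + tx].push_back({primSet, tri});
        }
}

int main() {
    LocalBins chip0Bins(kTilesX * 32);                // 32 tile rows, say
    binTriangle(/*chip=*/0, chip0Bins, /*primSet=*/7, /*tri=*/42,
                /*minX=*/10, /*minY=*/10, /*maxX=*/200, /*maxY=*/90);
    return 0;
}
```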
I don't understand what you mean by PrimSet distribution. Each PrimSet can run independently on any core. The data each produces is a stream of bins. They consume vertex streams and, if they already exist, render target tiles.
Some scheduler, somewhere, must then assign bin sets to cores.
This is what I was talking about.
The control core that assigns PrimSets to other cores can do so in a manner that fits well with a possible implementation of the ring-bus.
The other cores don't magically become aware of the assignment without some kind of ring-bus transaction.
Overall, though, I would expect that a memory-bandwidth:link-bandwidth ratio of X would serve Larrabee better than it would a traditional GPU. You have a huge amount of programmer freedom with Larrabee to account for the vicissitudes of NUMA.
I've been discussing Intel's own software rasterizer, though. I'm not clear on just how much can be changed by a programmer who isn't messing with Intel's driver and all that.
Sure, if a developer rolled their own solution they could do what they want.
I think that any multi-chip solution from Intel would include software that was modified so that as much work gets done locally as possible, even if at the cost of duplicated computation.