There's always a bottleneck somewhere
Others have insisted that since Larrabee is a CPU and uses x86, it won't have bottlenecks...
I wouldn't be surprised if the interconnect-to-RAM bandwidth ratio is better in Larrabee (i.e. less of a performance constraint, relative to a single chip) than in traditional GPUs. If any of them ever go multi-chip, of course.
It shouldn't be too hard to do, given how bad the bandwidth is between GPUs.
As long as performance scales adequately with multiple chips, who cares? It's a question of whether it sells, not absolute performance. People have been buying AFR X2 junk for years now, putting up with frankly terrible driver support.
It would depend on how much that 10% expands in a multi-chip case, and on what counts as adequate scaling.
Any increase in that percentage for a dual-chip case is a penalty taken out of the doubled peak, compared with a lower-overhead single-chip solution.
Without actual simulations, it would be little more than hand-waving on my part to guess how much that 10% could expand.
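To put rough numbers on the hand-waving anyway (every figure here is a placeholder, not a measurement), this is the sense in which the penalty comes out of the doubled peak:

```python
# Back-of-the-envelope scaling sketch; all numbers are placeholders.
def effective_throughput(num_chips, per_chip_peak, overhead_fraction):
    """Peak scaled by chip count, minus whatever fraction the scheme's
    overheads (binning, bin spread, remote fetches) eat."""
    return num_chips * per_chip_peak * (1.0 - overhead_fraction)

single = effective_throughput(1, 1.0, 0.10)  # assume ~10% overhead on one chip
dual   = effective_throughput(2, 1.0, 0.15)  # pure guess: overhead grows to ~15%
print(f"dual-chip speedup over single: {dual / single:.2f}x")  # ~1.89x, not 2x
```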
I can't work out what you're quantifying here.
I was listing out areas of the current scheme that can potentially jump chip, and that have some unquantified cost that is currently assumed to be acceptable.
These are areas where either the cycle count can jump by an order of magnitude, or the bandwidths can drop by an order of magnitude.
Ultimately the key thing about Larrabee is it has lots of FLOPs per watt and per mm² and is heavily dependent on GPU-like implicit (4-way per core) and explicit (count of fibres) threading to hide systematic latencies.
So whether the latency is due to a texture fetch, a gather or data that is non-local, the key question is can the architecture degrade gracefully? Maybe only once it's reached version X?
At a hardware level, Larrabee is not as latency tolerant as a GPU.
At a software level, a long-enough strand appears able to cover texture latency in a single-chip case.
A texture read or remote fetch across chips would take even longer than that, though I'm not sure what it would cost.
The fiber would have to be compiled to be longer, which may carry penalties similar to increasing GPU batch size.
I'm not sure what practical limits there are to fiber length.
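As a rough illustration of why fiber length matters (the latencies and per-strand work figures below are assumptions, not anything Intel has published):

```python
# Rough latency-hiding sketch; the cycle counts and issue model are assumptions.
def strands_needed(latency_cycles, work_cycles_per_strand):
    """How many strands a fibre needs so that, by the time the last strand
    has issued its fetch, the first strand's data has (hopefully) arrived."""
    return -(-latency_cycles // work_cycles_per_strand)  # ceiling division

local_texture  = strands_needed(latency_cycles=200,  work_cycles_per_strand=16)
remote_texture = strands_needed(latency_cycles=1000, work_cycles_per_strand=16)
print(local_texture, remote_texture)  # 13 vs 63 strands per fibre, under these guesses
```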
I think I completely misinterpreted what you said before. I'm not sure why you say bin spread is going to get worse with flimsy binning of triangles.
I was thinking about the case of deferring something like tessellation to the back-end. Bins would already have been farmed out to the other chip when newly generated triangles might cross tile boundaries and need to update other bins.
In the final result, that spread would happen anyway, so in retrospect I was abusing the term.
The bin-spread percentage measures the number of triangles that span bins; that may not actually change, but the timing and cost of handling it might.
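To make "spread" concrete, a toy example of a triangle whose bounding box straddles a tile boundary and so lands in two bins (the tile size and coordinates are arbitrary):

```python
# Toy bin-spread sketch; tile size and coordinates are arbitrary.
TILE = 64  # pixels per tile edge (assumed)

def bins_touched(tri, tile=TILE):
    """Return the set of (tile_x, tile_y) bins a triangle's bounding box overlaps."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    x0, x1 = int(min(xs)) // tile, int(max(xs)) // tile
    y0, y1 = int(min(ys)) // tile, int(max(ys)) // tile
    return {(tx, ty) for tx in range(x0, x1 + 1) for ty in range(y0, y1 + 1)}

# A small triangle sitting on a tile boundary spreads into two bins;
# with a multi-chip split, those bins could live on different chips.
print(bins_touched([(60, 10), (70, 10), (65, 20)]))  # {(0, 0), (1, 0)}
```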
I'd hope there'd be performance counters and that the programmers make the pipeline algorithms adaptive. The fact that there are choices about the balance of front-end and back-end processing indicates some adaptivity. Though that could be as naive as a "per-game" setting in the driver.
I know modern chips have a ton of performance monitors, though I'd be curious how many a core based on the P55 will sport.
Larrabee should run shader compilation on-chip, so maybe it can make adjustments.
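Something as crude as a feedback loop on a spread counter would already count as adaptive; a purely hypothetical sketch (the counter names and the threshold are invented):

```python
# Hypothetical adaptive split between front-end and back-end tessellation;
# counter names and the threshold are invented for illustration only.
def choose_tessellation_stage(counters, spread_threshold=0.25):
    """Defer tessellation to the back-end only while cross-bin (and, in a
    multi-chip case, cross-chip) spread stays under the threshold."""
    spread = counters["cross_bin_triangles"] / max(counters["triangles"], 1)
    return "back_end" if spread < spread_threshold else "front_end"

print(choose_tessellation_stage({"triangles": 1_000_000,
                                 "cross_bin_triangles": 120_000}))
# -> 'back_end'; if the spread grows past 25%, fall back to front-end expansion
```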
Yes, I agree in general, since computation is cheap and, relatively speaking, cheaper in multi-chip. All I'm saying is that Intel has a lot of potential flexibility (if it's serious about multi-chip) and it's a matter of not leaving gotchas in the architecture. Considering the appalling maturation path of multi-GPU so far, Intel could hardly do worse. The only risk seems to be that consumers get sick of multi-chip (price/performance, driver woes).
The next question is when Larrabee is small enough and cool enough for such a setup.
The chip in the die shot certainly doesn't look like a promising candidate, but perhaps at a later node.
The two-large-chips-on-one-card scheme GPU makers have been using may be a fluke.
Even if Larrabee does sport better scaling, the market that would benefit from this is already very niche.
Of course, now that we've learned that R800 doesn't have dual setup engines and is merely rasterising at 32 pixels per clock, it does put the prospects of any kind of move to multiple setup engines (and multi-chip setup) way off in the future.
Yeah, that was pretty underwhelming.
Does this help much for tessellation?
The peak triangle throughput numbers don't change.
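Rough arithmetic for why (the clock and rates below are Cypress-ish assumptions, not official figures): with a single setup engine at one triangle per clock, tessellation amplification just shrinks how many input primitives can feed that fixed ceiling.

```python
# Why tessellation doesn't lift peak triangle throughput; the numbers are
# rough, Cypress-like assumptions rather than published specs.
clock_hz        = 850e6   # assumed core clock
setup_per_clock = 1       # single setup engine, one triangle per clock
raster_px_clock = 32      # the quoted 32 pixels per clock rasterisation

peak_tris_per_s = clock_hz * setup_per_clock           # ~850 Mtri/s either way
print(f"pixel rate: {clock_hz * raster_px_clock / 1e9:.1f} Gpix/s")
for amplification in (1, 16, 64):                      # tessellation factors
    input_prims = peak_tris_per_s / amplification      # patches/s that saturate setup
    print(f"amp x{amplification:>2}: setup caps input at {input_prims / 1e6:,.1f} M prims/s")
```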