That seems reasonable. The description appears to be at a somewhat higher level, where the particular implementation details of the DX11 pipeline would not be material.

Looking at slide 22, the only way I can interpret tessellation being done in the back-end (along with GS) is if TS is synonymous with VS->HS->TS->DS (i.e. it is not a reference purely to the TS stage).
The reduced amount of data passed between the front and back ends would lead to bandwidth savings. Each core could potentially try to read from the same set of attributes, but this should be forwarded as needed within the cache hierarchy relatively quickly, and there is hopefully no write traffic to those locations in this phase.

The advantages of delaying some DS work would include reduced storage in global memory and re-distribution of workload (e.g. later DS might lead to better load-balancing).
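To make the storage saving concrete, here is a toy sketch of the deferred-DS idea: the binner evaluates only enough of the domain shader to place each point in a tile, and stores a compact (patch, u, v) reference instead of the full post-DS vertex; the back end re-runs the full DS per bin at pickup. All function names and the patch layout are my own illustrative assumptions, not anything from the slides.

```python
# Hypothetical sketch of deferring domain-shader (DS) work until bin pickup.
# Bins store compact (patch_id, u, v) references rather than full post-DS
# vertices, which is where the global-memory/bandwidth saving comes from.

def domain_shade(patch, u, v):
    # Stand-in DS: bilinear position plus a bundle of scaled attributes.
    px = patch["x0"] + u * patch["dx"]
    py = patch["y0"] + v * patch["dy"]
    return {"pos": (px, py), "attrs": [u * a for a in patch["attrs"]]}

def bin_front_end(patches, tile_size, grid_w, grid_h):
    """Front end: tessellate and bin, storing only domain references."""
    bins = {(tx, ty): [] for tx in range(grid_w) for ty in range(grid_h)}
    for pid, patch in enumerate(patches):
        for u, v in patch["domain_points"]:
            # Evaluate position only, to find the covering tile; the full
            # attribute set is NOT computed or stored at this stage.
            pos = domain_shade(patch, u, v)["pos"]
            tile = (int(pos[0] // tile_size), int(pos[1] // tile_size))
            if tile in bins:
                bins[tile].append((pid, u, v))   # compact reference
    return bins

def back_end(bins, patches, tile):
    """Back end: re-run the full DS per bin, at pickup time."""
    return [domain_shade(patches[pid], u, v) for pid, u, v in bins[tile]]
```

In this toy version each binned vertex costs three small values instead of a position plus every interpolated attribute; the trade is that position evaluation happens twice, once for binning and once at pickup.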
There may be ordering and atomicity constraints for GS. Perhaps if the scheduler can determine that there is no interaction between invocations, they can be allowed to persist over the non-deterministic delay between binning and bin pickup.

GS can do a variety of things. If GS is used merely to delete vertices/triangles then in theory it can be delayed until after binning - again this is a load-balancing question, I think, i.e. run GS across lots of cores as they pick up bins, rather than on a few cores while creating bins.
Maybe there are some other usages of GS that are amenable to delayed execution (e.g. generating attributes)?
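As a toy illustration of the cull-only case: when the GS does nothing but delete primitives and its invocations are independent (no cross-invocation ordering or atomics, per the caveat above), running it per bin at pickup yields the same surviving set as running it before binning. The names and triangle representation below are my own illustrative assumptions.

```python
# Hypothetical sketch: a GS that only culls primitives can be deferred from
# the binning phase to the bin-pickup phase, provided its invocations are
# independent. The deferred schedule spreads GS work across the many cores
# that pick up bins, instead of concentrating it in the binner.

def gs_cull(tri):
    # Stand-in geometry shader that only deletes triangles (e.g. culling
    # degenerate/zero-area ones); it emits the input unchanged or nothing.
    return tri if tri["area"] > 0 else None

def eager_schedule(tris, bin_of):
    """Run GS before binning: GS load lands on the cores creating bins."""
    bins = {}
    for tri in tris:
        kept = gs_cull(tri)
        if kept is not None:
            bins.setdefault(bin_of(kept), []).append(kept)
    return bins

def deferred_schedule(tris, bin_of, tile):
    """Bin everything first, then run GS per bin at pickup."""
    bins = {}
    for tri in tris:
        bins.setdefault(bin_of(tri), []).append(tri)
    return [t for t in bins.get(tile, []) if gs_cull(t) is not None]
```

The cost of deferral in this toy model is that culled triangles still occupy bin storage; the benefit is that GS execution parallelises across bin-pickup cores.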
This appears possible. Intel claimed earlier that the rasteriser part of the pipeline wasn't the hard part.

By the way, the term "rasteriser" is often used to describe all of these stages: setup->rasterisation->pixel shading->output merger (ROP). So it's possible to interpret the statement about the lack of a fixed-function rasteriser as actually describing the lack of a fixed-function "setup->rasterisation->pixel shading->output merger". To be honest I think this is very likely the correct interpretation.
Larrabee may have been near the top end of what is possible for a PCIe graphics accelerator, at about 2/3 of the die. The rest of the die had IO, controllers, UVD, texture blocks, and miscellaneous logic.

I pretty much always thought it would be years before Intel was competitive at the enthusiast end, but process would eventually allow it to catch up. A major question for the other IHVs is what proportion of die space ends up being programmable compute, and the higher that rises the more competitive Intel becomes.
A good amount of the uncore would need to scale as well, otherwise the compute portion would be strangled.
x86 penalty aside, the decision to use full cores for that 2/3 of the die was also a contributing factor to the size and power concerns.
There can be programmable processing units either way, but past a certain number of fully-fledged CPU cores the utility of adding even more would have diminished. There was a lot of front-end and support silicon for the amount of vector resources one got per core.