LRB does not need any extra chip to midwife its multi-GPU work. It's all done in software, so just gang the two chips together at the hardware level and they're good to go.
There is no such thing as a pure software solution: you need to run it on some piece of hardware eventually.
Larrabee will have to execute the same set of input commands as any other GPU; after all, it's going to comply with DX10 etc.
The intra-frame dependencies are inherent to those commands. When you render a pixel that depends on a texture rendered in an earlier pass, you have an intra-frame dependency. When you do a Z-only prepass before shading the pixels, you have an intra-frame dependency.
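To make that concrete, here's a minimal sketch of those two dependencies inside a single frame. All the names (RenderPass, Resource, the pass names) are invented for illustration and don't correspond to any real API; the point is just that the shading pass cannot start until its producers have finished, wherever they ran.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct Resource { std::string name; bool written = false; };

struct RenderPass {
    std::string name;
    std::vector<Resource*> reads;   // must have been written earlier in this frame
    std::vector<Resource*> writes;
};

// A pass can only start once everything it reads has already been produced.
bool ready(const RenderPass& p) {
    for (const Resource* r : p.reads)
        if (!r->written) return false;
    return true;
}

int main() {
    Resource depth{"depth_only_z"}, shadowMap{"shadow_map"}, color{"back_buffer"};

    RenderPass zPrepass{"z_prepass",       {},                   {&depth}};
    RenderPass shadows {"shadow_map_pass", {},                   {&shadowMap}};
    RenderPass shading {"main_shading",    {&depth, &shadowMap}, {&color}};

    std::cout << std::boolalpha
              << "main_shading ready? " << ready(shading) << "\n";   // false

    // Run the producer passes. If they ran on another chip, their outputs
    // now live in that chip's local memory and still have to be fetched.
    for (RenderPass* p : {&zPrepass, &shadows})
        for (Resource* w : p->writes) w->written = true;

    std::cout << "main_shading ready? " << ready(shading) << "\n";   // true
}
```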
You're going to have to store those pieces of data somewhere, and they will have to be accessed later on. The latter part has always been the problem: how do you make one chip access that data efficiently when it's stored in the local memory of another?
It's no different from an SMP system (which, contrary to what you imagine, is not exactly a champion of linear performance scaling) where one thread needs to wait for the results of another.
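The SMP analogy in miniature, using nothing but standard C++ threading: the consumer blocks until the producer has delivered its result, so adding more cores (or more chips) does nothing for that part of the work. On a multi-chip setup the equivalent wait also pays for moving the data across the inter-chip link.

```cpp
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // "Thread A": produce some intermediate data (think: a rendered texture).
    std::future<std::vector<int>> produced = std::async(std::launch::async, [] {
        std::vector<int> data(1 << 20);
        std::iota(data.begin(), data.end(), 0);
        return data;
    });

    // "Thread B": depends on A's output; .get() blocks until A is done,
    // serializing this part of the frame no matter how much hardware you have.
    std::vector<int> data = produced.get();
    long long sum = std::accumulate(data.begin(), data.end(), 0LL);
    std::cout << "consumer saw " << data.size() << " elements, sum " << sum << "\n";
}
```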
In a system where you have dependencies, it's vastly more efficient to keep everything under control on the same chip than to spread it out over many: a high-bandwidth inter-chip data-sharing interface is hard to design, no matter what's sitting on either side of it.
In practice, you'll almost always end up in a NUMA situation, where local memory access is an order of magnitude faster than accessing data across the inter-chip bus... This is no different from an SMP system. There's a reason NUMA-aware Linux schedulers exist...
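A rough way to see the NUMA point for yourself, assuming a Linux box with libnuma installed (compile with -lnuma): pin a buffer to one node and time touching it. On a multi-socket machine, repeating this with a node far from the running thread shows the local-vs-remote gap. This is only a crude sketch, not a proper benchmark.

```cpp
#include <numa.h>
#include <chrono>
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::puts("no NUMA support on this machine");
        return 0;
    }
    const size_t bytes = 256u << 20;   // 256 MiB
    const int node = 0;                // pick another node (<= numa_max_node()) to see the remote cost
    char* buf = static_cast<char*>(numa_alloc_onnode(bytes, node));
    if (!buf) return 1;

    // First pass: fault the pages in so they really land on 'node'.
    for (size_t i = 0; i < bytes; i += 64) buf[i] = 1;

    // Second pass: time the accesses themselves (one byte per cache line).
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (size_t i = 0; i < bytes; i += 64) sum += buf[i];
    auto t1 = std::chrono::steady_clock::now();

    std::printf("touched node-%d memory in %lld ms (checksum %lld)\n", node,
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
        sum);
    numa_free(buf, bytes);
}
```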
Even in AFR (alternate frame rendering) you have dependencies, but they're only inter-frame, e.g. when a frame uses the previous one for reflections.
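A toy model of that case (all names invented for illustration): even frames go to chip 0, odd frames to chip 1, and any frame that samples the previous frame's output finds it sitting in the *other* chip's local memory, so it has to be copied across the inter-chip link first.

```cpp
#include <cstdio>

struct FrameOutput { int frame = -1; int ownerChip = -1; };

int main() {
    FrameOutput previous;                 // result of the last completed frame
    const bool usesPreviousFrame = true;  // e.g. reflections based on the last frame

    for (int frame = 0; frame < 4; ++frame) {
        int chip = frame % 2;             // AFR: alternate chips per frame

        if (usesPreviousFrame && previous.frame >= 0 && previous.ownerChip != chip) {
            // Inter-frame dependency crossing the chip boundary:
            // stall (or prefetch) while the buffer is transferred.
            std::printf("frame %d on chip %d: copy frame %d output from chip %d\n",
                        frame, chip, previous.frame, previous.ownerChip);
        }

        std::printf("frame %d rendered on chip %d\n", frame, chip);
        previous = {frame, chip};
    }
}
```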
I'm afraid your view is a little bit naive...