Interesting, thanks! As for cost, what I was really wondering is whether it'd logically be cheaper with a VERY fast on-package bus (>= 100 GB/s), since that would likely open up possibilities for CPU-GPU collaboration nearly as good as what a single-chip approach could deliver. And I'm not thinking specifically about Nehalem here, of course.
As long as the package isn't too complex (say, only two chips and not a massive amount of IO going off-package), the situation would be similar to how Intel has used MCMs to stay ahead of AMD in the core-count race without sinking too much design effort into a "native" design too soon.
It's not directly applicable, though, because that solution didn't have a dedicated on-package bus; the two chips just hang off the same shared bus.
Naive multicore, where each core could just as easily stand on its own, can do well with an MCM solution as long as it doesn't involve too many die (dies? dice?).
In the case of CPU/GPU cooperation, it sounds like most near-term plans keep the GPU highly separate, with Fusion perhaps hanging the GPU off the internal crossbar.
At that level of separation, current paradigms already have a lot of latency compensation built in. An MCM strikes me as good enough, assuming something can be done about the yield hit the packaging step itself incurs.
Considering the yield curve on overly large multicore single-die solutions, however, an MCM sounds like a good initial step. It seems worth at least one process node transition for a given doubling of the core count: two known-good half-size dies on the current node yield about as well per die as the doubled monolithic die would only after a full-node shrink.
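To put a rough number on that intuition, here's a minimal back-of-envelope sketch. It assumes a simple Poisson yield model (Y = exp(-D*A)), and the die area and defect density figures are made up purely for illustration, not taken from any real process:

```python
import math

def die_yield(area_cm2: float, defect_density: float) -> float:
    """Fraction of good dies under a simple Poisson defect model."""
    return math.exp(-defect_density * area_cm2)

A = 1.4   # hypothetical area of one N-core die, cm^2
D = 0.5   # hypothetical defect density, defects per cm^2

y_small  = die_yield(A, D)      # one N-core die, current node
y_big    = die_yield(2 * A, D)  # monolithic 2N-core die, same node
y_shrunk = die_yield(A, D)      # 2N-core die after a full-node shrink
                                # (~0.5x area, same D assumed, which is
                                # optimistic early in a new node)

print(f"N-core die, current node : {y_small:.1%}")   # ~49.7%
print(f"2N-core die, current node: {y_big:.1%}")     # ~24.7%
print(f"2N-core die, next node   : {y_shrunk:.1%}")  # ~49.7%

# Silicon cost per good part (ignoring packaging and test cost):
# the monolithic part pays for the whole 2A die even when one half
# is bad, while an MCM can pair dies that already passed wafer test.
mono_cost = 2 * A / y_big       # wafer area burned per good monolithic part
mcm_cost  = 2 * (A / y_small)   # two known-good dies per package
print(f"monolithic / MCM silicon cost: {mono_cost / mcm_cost:.2f}x")  # ~2.0x
```

Real defect distributions cluster, which softens the monolithic penalty somewhat, and the MCM pays extra in packaging and test, but the direction of the effect holds: the MCM on the current node matches the per-die yield the monolithic part only reaches after the shrink.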
The equation changes if the performance level and the number of separate chips involved go up. The closer one gets to a POWER4/5-style MCM, the more costly and niche the product becomes.