Your experience correlates well with articles I have read. High-speed I/O usually requires fairly high voltage to maintain signal integrity, which is something of a vicious circle with regard to power consumption: the higher the frequency, the higher the voltage. I do think costs can be significantly lower for an APU than for discrete components on a mature process.
Yes, yields will always be better for two smaller chips than for one big one, but at some point on a mature process they'll be good enough, and eventually the costs of packaging, testing, shipping, and installing two chips will exceed those of one bigger chip.
I don't know where the break-even point is, but AMD seems to think it's in their business interest to sell a combined CPU+GPU chip instead of two lower-cost chips.
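The break-even argument above can be sketched with a back-of-the-envelope model: a classic Poisson yield curve, comparing one big die against two half-size dice as defect density falls. All the numbers here (wafer cost, usable area, packaging cost, defect densities) are made-up illustrative assumptions, not real process data.

```python
import math

WAFER_COST = 5000.0        # $ per wafer (assumed)
WAFER_AREA = 70000.0       # mm^2 usable on a 300 mm wafer (approx.)
PACKAGE_TEST_COST = 5.0    # $ per packaged, tested chip (assumed)

def die_cost(area_mm2, defect_density_per_mm2):
    """Cost per good die: wafer cost spread over good dice, plus packaging/test."""
    yield_frac = math.exp(-defect_density_per_mm2 * area_mm2)  # Poisson yield model
    dice_per_wafer = WAFER_AREA / area_mm2
    good_dice = dice_per_wafer * yield_frac
    return WAFER_COST / good_dice + PACKAGE_TEST_COST

# One 400 mm^2 SoC vs. two 200 mm^2 discrete chips, at falling defect densities.
for d0 in (0.005, 0.002, 0.0005):  # defects per mm^2 (immature -> mature)
    one_big = die_cost(400, d0)
    two_small = 2 * die_cost(200, d0)
    print(f"D0={d0}: one big ${one_big:.2f} vs two small ${two_small:.2f}")
```

With these numbers, the two small dice win easily on the immature process, but once defect density drops far enough the single chip's saved packaging and test cost tips the balance the other way, which is the crossover being debated here.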
From my experience in the embedded world, I've seen that I/O can be a significant contributor to power. I've seen very high-speed I/O (at least for our application, ~1 GB/s) consume around 15-20% of the power budget (~10 W). Granted, some of that could have been our own implementation's fault, but I imagine that keeping CPU/GPU I/O on chip would be very beneficial, especially if there's a large on-chip eDRAM.
Using a silicon interposer for the two chips would actually mitigate some of this problem, thanks to shorter traces and better signal conditions.
Regarding a break-even point (die-size-wise) for when to stay discrete and when to go SoC: that of course depends very much on the maturity of the process in question and on the chip design. A design with fine-grained redundant elements will help yields significantly. The GPU part may be better than the CPU in that regard, but this is where you can tweak your design to better suit an SoC, or to stay discrete.
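To first order, the redundancy effect can be sketched by assuming defects that land in repairable (redundant) area don't kill the die, so only the non-redundant "critical" area counts toward yield. The areas, defect density, and repairable fraction below are illustrative assumptions only.

```python
import math

def die_yield(area_mm2, defect_density, repairable_frac=0.0):
    """Poisson yield counting only defects in the non-repairable area."""
    critical_area = area_mm2 * (1.0 - repairable_frac)
    return math.exp(-defect_density * critical_area)

# A hypothetical 400 mm^2 die at 0.002 defects/mm^2:
print(die_yield(400, 0.002))        # no redundancy
print(die_yield(400, 0.002, 0.5))   # half the area repairable (e.g. GPU shader arrays)
```

This is why the GPU half of an APU is more forgiving: its highly regular shader and memory arrays give a large repairable fraction, while CPU control logic mostly does not.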
Packaging and testing one chip design instead of two is certainly a significant cost reduction as well, as you point out.