A10 APU with a small dedicated GPU
Small, power efficient, cheap and potent.
An A10 5800K with a dedicated 6670 GPU pulls around 142 W during gaming on a PC; maybe factor that up to 160 W+ in a console, because every inch of the hardware is always run at full load.
There is quite a bit of redundancy present.
The A10 die measures 246 mm², the HD6670 (Turks) measures 118 mm² in 40nm (and can't be shrunk that much because of the 128 Bit interface). The A10 has a 128 Bit DDR3 interface, 8 ROPs, 24 TMUs and 6 CUs (VLIW4). It provides a PCI-Express interface, display outputs, UVD engine and so on. The Turks chip also has a 128 Bit memory interface (albeit a combined DDR3/GDDR5 one), 8 ROPs, 24 TMUs, 6 CUs (but of the VLIW5 type, which makes the whole proposition highly unlikely from the start in my opinion) and again provides 16 PCIe lanes, display outputs, UVD ...
That is going to be expensive to manufacture, requires a more complex layout of the board/package (if you want to put both on an MCM in the Wii U style) and simply contains a lot of unneeded stuff. If you start cutting parts out of both chips anyway, you might as well build something reasonable from the start.
If you want to keep the Bulldozer/Piledriver modules, two of them measure 61 mm² in 32nm, or maybe around 45 to 50 mm² in 28nm. Add the northbridge/uncore/glue of Trinity and you may arrive at 70 mm² for the 4 cores including the glue to the GPU part. Now add, just for the sake of the argument, a full Pitcairn chip, which includes 20 CUs with 80 TMUs (they could use 18 active CUs for high yields), a 256 Bit GDDR5 interface, 32 ROPs, of course display outputs, a UVD engine and also 16 PCI-Express lanes. And all that in some additional 212 mm². You actually don't need the 16 PCIe lanes. Two or four would be entirely sufficient to connect to some southbridge (or you could integrate that on die as well). The northbridge of Trinity also already includes a memory controller, which can be scrapped or, let's say, unified with the one from the GPU, saving a few mm². In the end, you would end up with a die only slightly larger than the A10, but with vastly higher performance.
The SoC solution would have
20 CUs (1280 SPs)
80 TMUs
32 ROPs
two way setup with 32 Pixel/clock raster
256 Bit GDDR5 interface
All in < 300mm² (probably doable in <280mm²) in 28nm. It would shrink to <200mm² in 20nm, but further shrinks would get complicated because of the 256Bit interface (that's an incentive for an alternative memory solution). And on top of that one would have quite a few more resources than the combined ones of the APU+GPU combination, it's easier to use them in an efficient way, too! One also spares the complications of two different memory pools and such stuff.
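The area arithmetic above can be written out as a small sketch. It only adds up the figures already quoted in this post; the constant and function names are made up for illustration, and the 64 SPs and 4 TMUs per CU are the per-CU ratios implied by the 20 CU / 1280 SP / 80 TMU numbers in the list. The SoC sum is an upper bound, since the savings from dropping PCIe lanes and merging the memory controllers are not quantified here.

```python
# Die sizes in mm², taken from the figures quoted in the post.
A10_DIE_MM2 = 246          # Trinity A10, 32nm
TURKS_DIE_MM2 = 118        # HD6670 (Turks), 40nm
CPU_AND_GLUE_MM2 = 70      # two PD modules + uncore, estimated for 28nm
PITCAIRN_DIE_MM2 = 212     # full Pitcairn, 28nm

# Per-CU ratios implied by 20 CUs / 1280 SPs / 80 TMUs.
SPS_PER_CU = 64
TMUS_PER_CU = 4


def combo_area_mm2() -> int:
    """Total silicon of the proposed APU + dedicated GPU combination."""
    return A10_DIE_MM2 + TURKS_DIE_MM2


def soc_area_upper_bound_mm2() -> int:
    """CPU part plus an unmodified Pitcairn, before removing any redundancy."""
    return CPU_AND_GLUE_MM2 + PITCAIRN_DIE_MM2


def gpu_resources(compute_units: int) -> dict:
    """Shader and texture unit counts for a given number of CUs."""
    return {
        "sps": compute_units * SPS_PER_CU,
        "tmus": compute_units * TMUS_PER_CU,
    }


if __name__ == "__main__":
    print("APU + GPU combo:", combo_area_mm2(), "mm²")
    print("SoC upper bound:", soc_area_upper_bound_mm2(), "mm²")
    print("20 CUs:", gpu_resources(20))
    print("6 CUs: ", gpu_resources(6))
```

Even before cutting anything, the single die comes out smaller than the two separate chips added together, while carrying more than three times the CUs of either GPU part in the combo.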
Of course it relies on porting the Piledriver core to 28nm (easiest would be TSMC's, as the GCN implementation already exists for this process), but that is rumored to happen anyway (getting the GPU parts of the APUs to run well on GF's 32nm SOI process took a bit of time [and die space!]; maybe IBM's/GF's 28nm bulk process makes it easier, or the engineers can use some of the experience from the 32nm port). And in the middle of 2013 (when mass production should be running at full steam) the 28nm process should be pretty mature and a lot cheaper than it was a year ago.
If one doesn't want to take the risk of porting the CPU cores to another process, and wants the additional benefit of a synthesizable core (which can be ported to another fab or shrunk much more easily than a BD derivative), AMD has the Jaguar core, which offers exactly that. The cores are quite small, so you can easily fit twice the number of cores in the same space (2.9 mm² each in TSMC 28nm; one "module" of four cores including 2MB L2 is probably slightly smaller than a BD/PD module with the same amount of L2 when both are at 28nm), offer comparable IPC (but at a lower maximum clock) and don't consume a lot of power. Within a ~35W power budget for the CPU cores alone (leaving 100+W for the GPU part [which allows decent clock speeds], if one limits the power consumption of the whole console to <200W), Jaguar is probably even faster in a well threaded engine (which should be the case), because PD cores can't use their high clock ceiling in that power constrained scenario.
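The power split mentioned above can be sketched the same way. The 35 W, 100 W and 200 W figures are the ones from this post; treating the remainder as a single "everything else" bucket (memory, drives, PSU losses and so on) is an assumption made only for illustration, as are the names.

```python
# Power figures in watts, as quoted in the post.
CONSOLE_LIMIT_W = 200   # upper limit for the whole console
CPU_BUDGET_W = 35       # approximate budget for the CPU cores alone
GPU_BUDGET_W = 100      # lower bound ("100+W") for the GPU part


def remaining_budget_w(limit: int, cpu: int, gpu: int) -> int:
    """Watts left for everything else (assumed single bucket) after CPU and GPU."""
    remaining = limit - cpu - gpu
    if remaining < 0:
        raise ValueError("CPU and GPU budgets exceed the console limit")
    return remaining


if __name__ == "__main__":
    rest = remaining_budget_w(CONSOLE_LIMIT_W, CPU_BUDGET_W, GPU_BUDGET_W)
    print("Left for the rest of the system:", rest, "W")
```

Since the GPU figure is a lower bound, any extra watts given to the GPU come directly out of that remainder.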
My personal view is that one wouldn't build it as described (basically slapping some CPU cores on a Pitcairn); it should only demonstrate that one could get a vastly better overall solution than this proposed APU+GPU combo within the same power budget and a similar die size or cost budget. But the solutions we will see in XBoxNext/720 and Orbis/PS4 could be somewhat close. One would probably see some fine tuning of the size and layout of the GPU part, maybe the removal of a few features (what does one need 6 display outputs for on a console?) or the addition of some others (a few additional HSA features of Sea Islands? As it is a closed system, one could get similar functionality in a leaner and meaner way). And there is of course the question of the memory interface. 128 Bits are probably cheaper in the long run, even if that may be more expensive in the short term when eDRAM or stacked memory (which could get quite large and even has the potential to form the main memory pool) is used for compensation.