We already know how "dense" it is: 40mm^2 at 45nm for 32MB.
Also, we know Durango's is slow at 100GB/s; heck, that would be slow even for an eDRAM bank.
When including overhead, at 45nm Mosys 1T-SRAM-Q (an expensive process, 4x density) would end up at ~34mm^2 and IBM eDRAM at ~45mm^2 (an even more expensive process, 3x density). SRAM on the same process would be 133mm^2 at 45nm, so it would be below 60mm^2 at 28nm. The same data from Mosys shows a 50% overhead for SRAM at 90nm, 65nm and 45nm, so I'm using it for 28nm too. It all fits; the numbers still give me 60mm^2 no matter how I slice them, even when using a die shot of the WiiU and crunching the numbers in reverse.
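To make the arithmetic explicit, here is a minimal sketch of those figures; the 4x/3x density multipliers and the 50% SRAM overhead are the numbers quoted above, while the ideal (28/45)^2 area scaling is my own assumption:

```python
# Back-of-envelope check of the 32MB memory-area figures quoted above.
sram_45 = 133.0                      # mm^2, 32MB of SRAM at 45nm, overhead included
mosys_1t_q = sram_45 / 4             # 4x density -> ~33mm^2 (quoted as 34mm^2)
ibm_edram = sram_45 / 3              # 3x density -> ~44mm^2 (quoted as 45mm^2)
sram_28 = sram_45 * (28 / 45) ** 2   # assumed ideal area scaling -> ~52mm^2, "below 60mm^2"

print(f"1T-SRAM-Q @45nm: {mosys_1t_q:.0f} mm^2")
print(f"IBM eDRAM @45nm: {ibm_edram:.0f} mm^2")
print(f"SRAM @28nm:      {sram_28:.0f} mm^2")
```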
It's the wrong reference. Your mistake comes from not differentiating between a last-level cache and a simple pool of RAM; that's why you end up with such a hilariously wrong number. You need a better explanation for the WiiU's RAM area than "it's dense".
I can't elaborate on your estimate, but if the scratchpad is indeed in that ballpark, ~60mm^2, it would make sense to me to cut corners and try to get the chip under 185mm^2.
A Cap Verde with that amount of scratchpad would be in that ballpark. Though looking at Durango, there are 2 more CUs and more memory controllers, and as I see it you also have to fit the I/O (as in Xenos) to feed the CPU, etc. It would definitely miss the mark.
It really makes me wonder which GPU architecture MSFT chose for its GPU.
Juniper was ~1 billion transistors and Cap Verde is 1.5 billion, so the scaling is almost perfect wrt transistor density. I wonder how many transistors a 12 SIMD GPU based on the Cayman architecture would "weigh".
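As a quick sanity check of that density claim, using the commonly quoted die sizes (166mm^2 for Juniper, 123mm^2 for Cap Verde):

```python
# Transistor density: Juniper (40nm) vs Cap Verde (28nm).
juniper_tr, juniper_mm2 = 1.04e9, 166.0
verde_tr, verde_mm2 = 1.50e9, 123.0

d_juniper = juniper_tr / juniper_mm2 / 1e6   # ~6.3 Mtransistors/mm^2
d_verde = verde_tr / verde_mm2 / 1e6         # ~12.2 Mtransistors/mm^2
print(f"density ratio: {d_verde / d_juniper:.2f}x")  # ~1.95x, close to an ideal 2x full-node jump
```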
It is worth noticing that the 4 SIMD (256sp, VLIW4) version of Trinity beats in most cases (or at worst matches) the highest-end Llano part, which features 5 SIMD (400sp, VLIW5).
It is quite a feat; too bad there weren't that many products released based on that architecture, so it is tough to gauge how good its "perf per transistor" is.
I wish we could compare, say, Juniper, a 10 SIMD part based on the Cayman architecture, and Cap Verde wrt performance and performance per transistor. Looking at both Cayman and Trinity, I would assert that a 10 SIMD / 16 ROP part would be in the ballpark of Juniper wrt transistor count while outperforming it in every scenario. I would also bet that it would perform closer to Cap Verde than to Juniper.
Overall I wonder if MSFT could have looked at the existing AMD architectures, asked for an estimate of how "big" a VLIW4 design would be on TSMC's 28nm process, and decided it was good enough.
Overall the transistor density could be lower than in Cap Verde, but it could still be a "win" with regard to die size. Say the scaling is 0.6 (GCN pulls a perfect 0.5): assuming a 1 billion transistor chip (~Juniper) with 10 SIMDs and 16 ROPs (VLIW4 design), you get a 100mm^2 chip vs 123mm^2 for Cap Verde. Saving 23mm^2 may not sound like much, but MSFT could definitely be after that kind of "win".
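A one-liner version of that estimate (the 0.6 area scaling is the assumption stated above):

```python
# Hypothetical 28nm VLIW4 part: ~Juniper (10 SIMDs, 16 ROPs, ~1B transistors)
# shrunk from 40nm with an assumed 0.6 area scaling (GCN managed ~0.5).
juniper_mm2, verde_mm2 = 166.0, 123.0
vliw4_28nm = juniper_mm2 * 0.6   # ~100mm^2
print(f"VLIW4 @28nm: ~{vliw4_28nm:.0f}mm^2, saving ~{verde_mm2 - vliw4_28nm:.0f}mm^2 vs Cap Verde")
```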
If they could get two chips, <=185mm^2 for the GPU and <=80mm^2 for the CPU, I could definitely see the system selling for cheap, as I said, pretty much replacing the existing SKUs (roughly the silicon budget of Xenos alone at the 360 launch).
They may do like Nintendo and put the two chips on an MCM with a single cooling solution.
Power consumption could be surprisingly low:
25 Watts for the CPU doesn't sound out of place.
45-55 Watts for the GPU (extrapolating from the Cap Verde-based HD 7750, which has fewer SIMDs, the same number of ROPs, a tad higher clock speed, and a more power-hungry memory controller, though Durango has more of them; the figure sounds "right").
~80 Watts for the heart of the system.
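Tallying up those guesses:

```python
# Core silicon power budget, summing the estimates above.
cpu_w = 25
gpu_w_lo, gpu_w_hi = 45, 55
print(f"core silicon: {cpu_w + gpu_w_lo}-{cpu_w + gpu_w_hi} W")  # 70-80 W
```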
EDIT
All this is not super clear; to put it another way, I wonder if AMD itself, for Trinity, may have vouched for its VLIW4 design based on its own merits more than on timeline issues.
It seems that a lot of the wins in GCN GPUs were already baked into the VLIW4 design. Looking at how Juniper compares to Cap Verde, I would assert that a "GCNed" Redwood could have ended up at ~900 million transistors (up from ~600 million, scaling by the same growth ratio as Juniper to Cap Verde). That's a beefy increase in silicon budget, and it is not clear to what extent it would have beaten the VLIW4 design in Trinity, which (in transistors) weighs mostly the same as the Redwood integrated in Llano and has an extra SIMD to play with.
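The Redwood estimate, spelled out; the assumption is that the GCN transistor overhead observed between Juniper and Cap Verde applies proportionally to Redwood:

```python
# "GCNed" Redwood: scale Redwood's ~627M transistors by the Juniper -> Cap Verde growth ratio.
redwood_tr = 0.627e9
growth = 1.50e9 / 1.04e9                                  # ~1.44x
print(f"~{redwood_tr * growth / 1e9:.2f}B transistors")   # ~0.90B, i.e. ~900 million
```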
Looking at the choices made by Nvidia from Fermi to Kepler, one has to wonder if MSFT could have come to the same conclusion while comparing AMD's VLIW4 and GCN designs, i.e. that the price paid for the massive increase in compute performance is too high.
I'm not saying that GCN doesn't bring improvements in the graphics department, rather that they are less than what extra SIMDs/ROPs could buy you.
It's a quite inaccurate way to view things, but take Cap Verde's die size and a 0.6 scaling (GCN achieved ~0.5) from 40nm to 28nm to get the size Cap Verde would have been at 40nm: a ~205mm^2 chip. Juniper is 166mm^2, which is ~20% smaller; the other way around, the resulting chip is ~23% bigger than Juniper.
Now consider that a VLIW4 part of ~170mm^2 would outperform Juniper. Grow that part (add SIMDs) until you reach a die area of around 205mm^2.
You would end up with roughly half an HD 6970: 12 SIMDs and 16 ROPs.
Shrink it (same 0.6 scaling as used before) and, instead of Cap Verde, you have a part with 12 SIMDs, a 20% increase in shading power (or, put the other way, Cap Verde has 83% of the shading power of that hypothetical part). Does GCN's improved efficiency make up for that? For graphics alone, I would bet in most cases no.
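The whole back-of-envelope in one place (same assumed 0.6 scaling; a sketch, not a simulation):

```python
# VLIW4 vs GCN at iso-area, as sketched above.
verde_mm2, juniper_mm2, scale = 123.0, 166.0, 0.6
verde_at_40nm = verde_mm2 / scale                 # ~205mm^2 "40nm equivalent"
print(f"Juniper is {(1 - juniper_mm2 / verde_at_40nm) * 100:.0f}% smaller")             # ~19%
print(f"the 40nm equivalent is {(verde_at_40nm / juniper_mm2 - 1) * 100:.0f}% bigger")  # ~23%

# Half a Cayman (12 SIMDs, 16 ROPs) fits that ~205mm^2 budget; shrunk at 0.6 it lands
# near Cap Verde's size with 12 SIMDs instead of 10:
print(f"shading power: {12 / 10:.1f}x Cap Verde (Cap Verde = {10 / 12:.0%} of it)")
```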