Wii U hardware discussion and investigation *rename

I agree. But I must also say that the GPU and the layout of the SIMDs look a bit strange. The size of the SIMD blocks would be consistent with a ~15% higher density layout than one sees in Brazos. Not completely impossible given the maturity of 40nm, AMD's experience with it, and the low clock target, especially if it uses an older iteration of the VLIW architecture (DX10.1 R700 generation instead of DX11 R800 generation) as a base.
But there is more. I think function noticed already the halved number of register banks in the SIMDs compared to other implementations of the VLIW architecture. I glossed over that by saying that each one simply holds twice the amount of data (8kB instead of 4kB) and everything is fine. It's not like the SRAM stuff takes significantly less space on the Wii U die than it does on Brazos (it's roughly in line with the assumed generally higher density).
But thinking about it, each VLIW group needs parallel access to a certain number (four) of individually addressed register banks each cycle. The easiest way to implement this is to use physically separate banks. That saves the hassle of implementing multiported SRAM (but is also the source of some of the register read port restrictions of the VLIW architectures). Anyway, if each visible SIMD block were indeed 40 SPs (8 VLIW groups), there should be 32 register banks (as there are on Brazos as well as Llano and Trinity [btw., Trinity's layout of the register files of the half-SIMD blocks looks really close to the register files of GCN's blocks containing two vALUs]). But there are only 16 (though obviously twice the size, if we are going with the 15% increased density). So either they are dual ported (in which case the increased density over Brazos is even more amazing) or something really fishy is going on. Before the Chipworks guy said the GPU die is 40nm TSMC (they should be able to tell), I would have proposed thinking again about that crazy-sounding idea of a 55nm die (with then only 160 SPs, of course). :oops:
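To make the bank arithmetic concrete, here's a trivial sketch (Python; the "4 banks per VLIW group" figure is the assumption from my post, not a measured fact):

```python
# Sanity check of the register-bank counting above.

def expected_banks(vliw_groups, banks_per_group=4):
    """Separate single-ported banks needed if each group gets its own set."""
    return vliw_groups * banks_per_group

print(expected_banks(8))       # 32: what 40 SPs (8 VLIW5 groups) should show,
                               #     as on Brazos, Llano and Trinity
print(expected_banks(8) // 2)  # 16: matches the die shot if banks are dual ported
print(expected_banks(4))       # 16: also matches if a block holds only 4 groups (20 SPs)
```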

Thanks for the analysis. If the SRAM registers are dual ported, is this something that could be visibly discerned under close inspection?
 
What's with his excitement over embedded DRAM? Consoles have been using it since Gamecube and PS2.

And it was costly there as well. Look at it this way, if you will - compare the size of the CUs to the size of the EDRAM. It's staggering how much more Nintendo spent on EDRAM rather than on ALU power in those terms. And that's not even taking into account that having DRAM on chip affects the whole process, and thus attainable clocks and associated parameters, and of course yields.

How would it have affected the overall cost of the console if they had simply ditched the EDRAM and instead gone with a 128-bit interface to DDR3? Or GDDR5, for that matter? Back-of-the-envelope math says that at least the first alternative would have been cheaper, and possibly the second as well, though that is a tougher call, especially over time.

The amount of EDRAM on the chip is, in relation to the whole of it, huge. The defining feature, unquestionably. Even though I estimated it to that size earlier in the thread it is still an eye-opener to actually see the die shot. I can't say I understand Nintendo's decision, but it can't be said that they haven't committed resources to their approach.
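Just to put numbers on that back-of-the-envelope, a quick peak bandwidth comparison of the alternatives (the data rates are my assumptions for illustration, not confirmed Wii U design points):

```python
# Peak bandwidth for the memory alternatives discussed above.

def bandwidth_gb_s(bus_width_bits, transfer_rate_mt_s):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mt_s / 1000

print(bandwidth_gb_s(64, 1600))   # 12.8 GB/s: the actual 64-bit DDR3-1600
print(bandwidth_gb_s(128, 1600))  # 25.6 GB/s: alternative 1, 128-bit DDR3
print(bandwidth_gb_s(128, 4000))  # 64.0 GB/s: alternative 2, 128-bit GDDR5 (assumed 4 GT/s)
```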
 
It might be expensive, and if it is, that's very bad for Nintendo.
That GPU is so small it just can't be all that expensive. Even with the multichip module it's not going to be very expensive; mounting a whopping two (small and tiny, respectively) dies on one substrate isn't exactly rocket science these days. It's been done in tons of devices in the past, including one of Nintendo's previous consoles...
 
I agree. But I must also say that the GPU and the layout of the SIMDs look a bit strange. The size of the SIMD blocks would be consistent with a ~15% higher density layout than one sees in Brazos. Not completely impossible given the maturity of 40nm, AMD's experience with it, and the low clock target, especially if it uses an older iteration of the VLIW architecture (DX10.1 R700 generation instead of DX11 R800 generation) as a base.
But there is more. I think function noticed already the halved number of register banks in the SIMDs compared to other implementations of the VLIW architecture. I glossed over that by saying that each one simply holds twice the amount of data (8kB instead of 4kB) and everything is fine. It's not like the SRAM stuff takes significantly less space on the Wii U die than it does on Brazos (it's roughly in line with the assumed generally higher density).
But thinking about it, each VLIW group needs parallel access to a certain number (four) of individually addressed register banks each cycle. The easiest way to implement this is to use physically separate banks. That saves the hassle of implementing multiported SRAM (but is also the source of some of the register read port restrictions of the VLIW architectures). Anyway, if each visible SIMD block were indeed 40 SPs (8 VLIW groups), there should be 32 register banks (as there are on Brazos as well as Llano and Trinity [btw., Trinity's layout of the register files of the half-SIMD blocks looks really close to the register files of GCN's blocks containing two vALUs]). But there are only 16 (though obviously twice the size, if we are going with the 15% increased density). So either they are dual ported (in which case the increased density over Brazos is even more amazing) or something really fishy is going on. Before the Chipworks guy said the GPU die is 40nm TSMC (they should be able to tell), I would have proposed thinking again about that crazy-sounding idea of a 55nm die (with then only 160 SPs, of course). :oops:

I've tried thinking about this for a while... I'm pretty tired, so I might make some obvious mistake; please bear with me :/

The virtue of SIMD is that you can treat the entire vector like one big register, and can access it via one very wide single register file port. So the amount of individual addressing you need shouldn't scale with the number of lanes (VLIW groups) in the SIMD. Of course this will be limited if the SIMD is physically broken up over multiple blocks like this one appears to be.

If using R700 series as a base this document may be helpful: http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf

Section 4.7.4 is helpful. One of GPR.X, GPR.Y, GPR.Z, and GPR.W can be read each cycle. GPR.Trans takes a read port from one of these. There's also a constant file which I think can supply one read per cycle (over 4 cycles?). I'm not sure where writes fit in, though, or if they can fit in with this too (and if so, whether the SRAMs need to be both read- and write-ported). So a VLIW SIMD block may only need something like 5-6 SRAM banks dedicated to it.
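To summarize that port accounting in one place (the counts are my interpretation of the doc, so treat them as assumptions):

```python
# Rough count of the SRAM banks one SIMD block might need, under the
# reading of R700-Family ISA section 4.7.4 above. All counts are my
# interpretation, not a confirmed hardware description.

gpr_read_banks = 4  # GPR.X / .Y / .Z / .W: one element of each readable per cycle
trans_reads    = 0  # GPR.Trans borrows a read port from one of the four above
constant_file  = 1  # roughly one constant read per cycle (over 4 cycles?)
write_banks    = 1  # guess: writes may need their own bank/port; unclear from the doc

print(gpr_read_banks + trans_reads + constant_file + write_banks)  # 6, i.e. ~5-6
```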
 
That GPU is so small it just can't be all that expensive. Even with the multichip module it's not going to be very expensive; mounting a whopping two (small and tiny, respectively) dies on one substrate isn't exactly rocket science these days. It's been done in tons of devices in the past, including one of Nintendo's previous consoles...

What was the die size for the GPU?
 
What was the die size for the GPU?

About 150mm^2... not really what I'd characterize as small; it's not that much smaller than a quad-core IB, for instance (160mm^2), and it's larger than dual-core GT2 IB (120mm^2) and Cape Verde (123mm^2). Maybe "medium" sized.

Still, I'd like to know where this $100 estimate comes from. The CPU die and little flash chip should cost next to nothing. I don't know how much MCMing it all together costs but it can't be that bad if it was still in the $130 Wii. If eDRAM is such a price burden then they made a pretty weird decision using it in the CPU instead of SRAM.
 
[Image: WKS8EXj.png, scaled comparison of GPU die shots]


Those are in scale, as precise as I could get it. The Wii U GPU doesn't seem to hold the same amount of SIMD logic per partition. :???:
Take into account the difference in manufacturing process.
 
I agree. But I must also say that the GPU and the layout of the SIMDs look a bit strange. The size of the SIMD blocks would be consistent with a ~15% higher density layout than one sees in Brazos. Not completely impossible given the maturity of 40nm, AMD's experience with it, and the low clock target, especially if it uses an older iteration of the VLIW architecture (DX10.1 R700 generation instead of DX11 R800 generation) as a base.

There's another possibility that occurred to me after you pointed out the additional edram columns earlier. Perhaps the shader blocks are swollen by including a redundant VLIW5 vector unit.

(Edit: I'm typing this in bed while falling asleep, and have messed up. My suggestion would swell a 40-shader block to be even bigger, *increasing* density. I'm an idiot.

Never mind, push on. Leave it here in case there's some value to the redundancy idea.
)

We're assuming there are 8, and adding a ninth would increase the size of the non-SRAM area by 12.5%. In such a scenario I could imagine it going in the centre of the cross-shaped area, with traces from all of the other vector units continuing into the centre to the redundant unit. Then you could maybe laser off the traces that weren't needed (all of them, if the 8 outer units were all good).

I might have a naive view of the complexity of redundancy, but if the chip is as expensive, and the edram as difficult to manufacture, as Chipworks say, then you might not want to throw away an otherwise good chip for the sake of a single bad VLIW5 vector unit.

The block shape kind of fits.
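For what it's worth, the yield argument is easy to sketch with a binomial model (the per-unit yield number below is made up, purely illustrative):

```python
# Why a spare VLIW5 unit could pay off: probability that a shader block
# is usable, with and without one redundant unit.
from math import comb

def block_yield(p_good, units, required):
    """P(at least `required` of `units` independent units are defect-free)."""
    return sum(comb(units, k) * p_good**k * (1 - p_good)**(units - k)
               for k in range(required, units + 1))

p = 0.98  # assumed per-unit yield, not a real number
print(block_yield(p, 8, 8))  # need all 8 good:   ~0.851
print(block_yield(p, 9, 8))  # 8 of 9, one spare: ~0.987
```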
 
[Image: WKS8EXj.png, scaled comparison of GPU die shots]


Those are in scale, as precise as I could get it. The Wii U GPU doesn't seem to hold the same amount of SIMD logic per partition. :???:
Take into account the difference in manufacturing process.

... so ... following on from my last post, maybe it's 4 x VLIW5 + 1 redundant. It might help soak up some of the space.

I bet Grall hasn't had as much fun with his Wii U as we've had looking at the GPU die shot. :)
 
... so ... following on from my last post, maybe it's 4 x VLIW5 + 1 redundant. It might help soak up some of the space.

I bet Grall hasn't had as much fun with his Wii U as we've had looking at the GPU die shot. :)

So it could be less than 352 GFLOPS?
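For reference, the 352 figure is just SPs × clock × 2 (one multiply-add per clock), using the commonly reported 550 MHz GPU clock; so yes, fewer SPs scales it down directly:

```python
# Where 352 GFLOPS comes from, and what fewer SPs would mean.
# 550 MHz is the commonly reported Wii U GPU clock; treat it as an assumption.

def gflops(sps, clock_mhz, flops_per_sp=2):  # 2 = one multiply-add per clock
    return sps * clock_mhz * flops_per_sp / 1000

print(gflops(320, 550))  # 352.0, the usual 320 SP assumption
print(gflops(160, 550))  # 176.0, e.g. if the blocks only hold half the SPs
```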
 
The amount of EDRAM on the chip is, in relation to the whole of it, huge. The defining feature, unquestionably. Even though I estimated it to that size earlier in the thread it is still an eye-opener to actually see the die shot. I can't say I understand Nintendo's decision, but it can't be said that they haven't committed resources to their approach.

Nintendo probably feels they need the eDRAM to replace the 24MB of 1T-SRAM available on Wii's Hollywood chip. I don't know if the higher-latency DDR3 would have caused problems for any Wii games, but Nintendo has always been extremely conservative with their BC approaches. They seem to want it as risk-free and as close to perfect as possible, and that often means including a bunch of the same, equivalent, or better hardware.

... so ... following on from my last post, maybe it's 4 x VLIW5 + 1 redundant. It might help soak up some of the space.

So you think the block has 5 VLIW groups instead of 8. If this is really done on TSMC 40nm it should have denser shaders than Brazos, not less dense. As Gipsel says, they'd have fewer features if they lack DX11 support, but they could also save more space by lacking double precision support, for instance.
 
... so ... following on from my last post, maybe it's 4 x VLIW5 + 1 redundant.
I highly doubt the wuugpu contains redundant anything, much less VLIW units...

I bet Grall hasn't had as much fun with his Wii U as we've had looking at the GPU die shot. :)
Nope. I actually haven't had any fun at all with the wuu, it's been one unmitigated disaster from beginning to end!
 
About 150mm^2... not really what I'd characterize as small; it's not that much smaller than a quad-core IB, for instance (160mm^2), and it's larger than dual-core GT2 IB (120mm^2) and Cape Verde (123mm^2). Maybe "medium" sized.

Still, I'd like to know where this $100 estimate comes from. The CPU die and little flash chip should cost next to nothing. I don't know how much MCMing it all together costs but it can't be that bad if it was still in the $130 Wii. If eDRAM is such a price burden then they made a pretty weird decision using it in the CPU instead of SRAM.

Thanks. Yes, that $100 figure was why I was asking, because it's just baffling. You get about 330 dies that size from a 300mm wafer at 85% yield. At that price that'd be $33,000 for a processed wafer, when TSMC typically charges $4500 for a 40nm processed wafer. Whoa.
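Sketching that math out (gross-die count here is the naive area ratio; the ~330 figure above presumably also discounts edge/partial dies):

```python
# Back-of-envelope wafer math from the post above.
import math

wafer_area = math.pi * (300 / 2) ** 2  # 300mm wafer: ~70,686 mm^2
gross_dies = wafer_area / 150          # ~471 dies at 150mm^2, ignoring edge loss
good_dies  = gross_dies * 0.85         # ~400 at 85% yield; ~330 after edge discount

print(round(good_dies))  # 401 (naive count)
print(330 * 100)         # $33,000 implied per wafer at $100/die
print(4500)              # vs a typical TSMC 40nm processed-wafer price
```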
 
So you think the block has 5 VLIW groups instead of 8. If this is really done on TSMC 40nm it should have denser shaders than Brazos, not less dense. As Gipsel says, they'd have fewer features if they lack DX11 support, but they could also save more space by lacking double precision support, for instance.

I don't know what's in there; I'm just looking at the shape of the shader block (short and fat, unlike R770 and Llano) and thinking about possible arrangements. Go with 8+1 if you think it makes a better case (maybe it doesn't).

The picture that fellix posted is interesting, though. Maybe an edram-friendly process doesn't lend itself to the same density (or greater) as Brazos achieved. Or maybe it makes no difference.
 
I've created a highly detailed technical schematic to illustrate one possible explanation for the shader block shape and size/density discrepancy apparent in the scaled shots fellix posted.

I don't have an image hosting account and so have uploaded it as an attachment.
 

Attachment: normally redundant 1.jpg (69 KB)