Wii U hardware discussion and investigation *rename

My opinion - for whatever it's worth - is that there is no black and white answer. What most of us here suspected for a long time is still probably the case, and that is that it's weaker in some ways and stronger in others.

Going by analysis of the early games, and by the die sizes, processes, power consumption and clocks, it seemed that the Wii U had more "shader power", less "CPU power", and that there was some issue limiting effective fillrate, particularly when transparency was involved. And I think all of that still stands.

If I had to give an opinion it would be that the Wii U is about on par with current systems, give or take, and that it'd really struggle with next gen ports even at half the resolution and/or half the frame rate.

thanks! really appreciate your time.
 
^ Stop asking when people are still deciphering the information; they heard you the first time.

Either way, it wouldn't be a certain answer regardless. The Wii U does some things better than the 360 and PS3, and it does some things worse; it's not a straight up-or-down situation.

The main memory bandwidth is slower, the eDRAM is larger but slower, the GPU is 50% more powerful give or take a few percentage points, and the CPU is marginally weaker.
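For what it's worth, here's the back-of-envelope math behind that ~50% figure, using the 320 SPs at ~550 MHz commonly assumed in this thread for Latte (an assumption, not a confirmed spec):

# Peak programmable GFLOPS, counting 2 FLOPs (MADD) per ALU per clock
latte_gflops = 320 * 2 * 0.550    # = 352 GFLOPS (assumed 320 SPs @ ~550 MHz)
xenos_gflops = 240 * 2 * 0.500    # = 240 GFLOPS (48 vec5 ALUs = 240 lanes @ 500 MHz)
print(latte_gflops / xenos_gflops)    # ~1.47, i.e. roughly 50% more on paper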

thanks for the answer, and yeah i asked again because some guy was saying 550 gflops on neogaf and he was supposed to be some kind of tech expert, so everybody was believing him.
 
[Image WkoSmXc.png: Scaled comparison of single banks from both eDRAM pools.]
 
I was thinking that the smaller, probably faster block of edram might have another role to play.

If it's 4 MB then that'd be enough for Z at 1280 x 720. This would free up BW in the large eDRAM pool for colour. If a game can't fit its Z buffer in the smaller, faster chunk of eDRAM then without tiling it'd have to go in the larger pool and contend for BW. This might explain why a game like BLOPs 2 with its MSAA runs so badly on the Wii U whenever there's a sniff of transparency.
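A quick sanity check on why 4 MB would be enough, assuming a 4-byte-per-pixel depth/stencil format (D24S8 is just an example here, not a confirmed spec):

# 720p Z/stencil buffer size at 4 bytes per pixel
width, height, bytes_per_pixel = 1280, 720, 4
z_mb = width * height * bytes_per_pixel / (1024 * 1024)
print(z_mb)    # ~3.52 MB, which fits in a 4 MB pool with room to spare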

We still need to understand why the Wii U has foliage cut out of games like Darksiders 2 and Tekken, and why BLOPs chokes on transparency. Slow edram for colour still seems like the most likely scenario.

I believe this is due to architectural differences, unfamiliarity with how to maximise/optimise the unorthodox Espresso chip configuration, immature library documentation & toolsets, sub-optimal BW use, engine adaptation & porting, etcetera. Also, iirc, the most mature SDK was released in November of 2012. Essentially, without a somewhat extensive retooling, PS3 & 360 ports will always suffer by comparison. I can assure you that 'X', Bayonetta 2 (which may steal Nintendo's E3 show), 3D Mario, Zelda, Retro's project (my personal system seller along with 'X'), SMTXFE, as well as other proprietary software will not suffer from the aforementioned issues. The CPU is not a major bottleneck, btw. Ancel, Shin'en, Frozenbyte, Gearbox, as well as my own sources haven't cited the CPU as a significant hurdle to extracting performance from the hardware. Even Pikmin will have no problem with transparencies or complex environmental geometry, as will be shown in some of the aforementioned software.

so with 352 gflops, bandwidth and cpu bottlenecks, is it weaker or stronger than current gen? your opinion would be welcomed.

Stronger than current gen; these bottlenecks are grossly exaggerated. As in any system, engines must be tailored to exploit platform strengths while minimizing weaknesses. As I've said before, the CPU is indeed weaker than the 360's, but this is a heavily GPU-reliant system. Also, is slower RAM suddenly unusable? A truly functional tessellation unit, better texture IQ, 50%+ more dedicated system RAM, advanced realtime lighting & shadowing capabilities: taken as a whole, this bests the current generation, albeit not by any vast margin. In these discussions I find we always discount the tablet's capabilities, rendering a totally unique game scene, replete with shading & geometry, completely independent of the main screen.
 
This is a scary notion from DF:

While there's still room for plenty of debate about the Wii U hardware, the core fundamentals are now in place and effectively we have something approaching a full spec. It took an extraordinary effort to get this far and you may be wondering quite why it took a reverse engineering specialist using ultra-magnification photography to get this information, when we already know the equivalent data for Durango and Orbis. The answer is fairly straightforward - leaks tend to derive from development kit and SDK documentation and, as we understand it, this crucial information simply wasn't available in Nintendo's papers, with developers essentially left to their own devices to figure out the performance level of the hardware.
 
I had no idea the documentation was that dire, yikes! Though this does explain some of the porting issues... this will definitely have a negative impact on 3rd party support.
 
[Image WkoSmXc.png: Scaled comparison of single banks from both eDRAM pools.]
The small pool is probably just 2 MB in total. It simply carries more overhead.
But have you noticed that the rightmost blocks of the small pool have a different size? They are not composed of 16x16 blocks, but are 18 wide. Just redundancy (but if that's the case, why doesn't the large pool show this? It should be more important there) or some sign of parity/ECC protection?
 
The small pool is probably just 2 MB in total. It simply carries more overhead.

Oh well, there goes my "4 MB fast buffer" hypothesis. Probably some kind of cache (big high BW texture cache?) or there for BC.

But have you noticed that the rightmost blocks of the small pool have a different size? They are not composed of 16x16 blocks, but are 18 wide. Just redundancy (but if that's the case, why doesn't the large pool show this? It should be more important there) or some sign of parity/ECC protection?

Nice spot!

Perhaps there are redundant elements within the larger blocks, so no additional ones are needed? Or they just settle for 31.9 MB (for example) and round up to 32 when talking about it on the dev kits (if the OS or pad functions reserve some eDRAM, it's not like developers would be able to tell anyway)?
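On the parity idea in the quoted post, a minimal sketch: if each sub-array really is 16 data columns wide (an assumption), one parity bit per byte would land exactly on the observed width.

# Hypothetical per-byte parity on a 16-column data array
data_cols = 16
parity_cols = data_cols // 8      # one parity bit per 8 data bits -> 2 extra columns
print(data_cols + parity_cols)    # 18, matching the 18-wide blocks seen in the small pool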
 
It's normal for newer architectures to be more efficient than older ones, but I'm wondering if it might still be easier to get high effective utilisation of Xenos than Latte.

The 360 has more TMUs per vector unit, and a "texture data crossbar" (thanks, B3D article) that should mean each vector unit has a higher maximum and average input of bilinear filtered texels available. No doubt the newer Latte architecture is more efficient, but would this be enough to overcome a big relative deficit of TMUs per vector unit and no crossbar?

The same might be true for the ROPs affecting GPU utilisation too - on the 360 they can always run at full tilt, and there are more of them in relation to the number of vector units. The Wii U has relatively fewer ROPs and they are unlikely to be as efficient as the 360's "magic ROPs" embedded into the daughter die with effectively unlimited BW.
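To put rough numbers on those ratios (treating the commonly assumed 16 TMUs, 8 ROPs and 320 SPs for Latte as guesses rather than confirmed specs), a quick sketch:

# Ratio comparison under the assumed configurations:
# Xenos: 48 vec ALUs, 16 TMUs, 8 ROPs @ 500 MHz; Latte (assumed): 64 VLIW5 groups, 16 TMUs, 8 ROPs @ 550 MHz
xenos_tmus_per_alu = 16 / 48      # 1 TMU per 3 vector ALUs
latte_tmus_per_group = 16 / 64    # 1 TMU per 4 VLIW5 groups
xenos_rops_per_alu = 8 / 48       # 1 ROP per 6 vector ALUs, with "free" daughter-die BW behind them
latte_rops_per_group = 8 / 64     # 1 ROP per 8 VLIW5 groups, without that BW cushion
print(xenos_tmus_per_alu, latte_tmus_per_group)   # ~0.33 vs 0.25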

Just to be clear, I'm talking about % utilisation and not about rawr powah where the Wii U should always be ahead (more shaders, higher clock). The assumption is that the Wii U should always be a lot more efficient, but maybe it's not always that clear cut.

Edit: In other words, it might be easier to keep the Xenos ALUs busy, based on a very simplistic "paper specs" comparison. Or maybe not. Maybe someone with experience of the Xbox 360 and R7xx development can say?
 
Some new comments from a Chipworks employee were posted on the NeoGAF thread:

Jim Morrison said:
Been reading some of the comments on your thread and have a few of my own to use as you wish.

1. This GPU is custom.
2. If it was based on ATI/AMD or a Radeon-like design, the chip would carry die marks to reflect that. Everybody has to recognize the licensing. It has none. Only the Renesas name, which is a former unit of NEC.
3. This chip is fabricated in a 40 nm advanced CMOS process at TSMC and is not low tech.
4. For reference's sake, the Apple A6 is fabricated in a 32 nm CMOS process and is also designed from scratch. Its manufacturing cost, in volumes of 100k or more, is about $26 - $30 a pop. Over 16 months that degrades to about $15 each.
a. Wii U only represents like 30M units per annum vs iPhone which is more like 100M units per annum. Put things in perspective.
5. This Wii U GPU costs about $20 - $40 more than that each, making it a very expensive piece of kit. Combine that with the IBM CPU and the Flash chip all on the same package and this whole thing is closer to $100 a piece when you add it all up.
6. The Wii U main processor package is a very impressive piece of hardware when it's said and done.

Trust me on this. It may not have water cooling and heat sinks the size of a brownie, but it's one slick piece of silicon. eDRAM is not cheap to make. That is why not everybody does it. Because it's so damn expensive.

This resulted in lots of fanfare that'll probably spill over here - but I think there's one really serious misconception going around. When he says that this is custom he means that it's not using licensed hard macros. That doesn't mean that it's using a totally new GPU microarchitecture, and that information isn't something you can tell just by looking at the die. You don't need to have markings for an implementation of IP licensed in RTL form. For comparison, he also says Apple's A6 is completely custom; we know that Apple is licensing plenty of IP, like the GPU and media encode/decode.

So we don't know anything about how similar or different it is from other GPUs. There's zero question that Nintendo is licensing the GPU from AMD: http://blogs.amd.com/play/2012/09/2...ertainment-with-proud-technology-partner-amd/ and that technology can be based on anything. There could be some customizations to fit Nintendo's needs but I can't fathom that AMD really designed a whole new GPU just for Wii U...
 
What's with his excitement over embedded DRAM? Consoles have been using it since Gamecube and PS2.
 
This resulted in lots of fanfare that'll probably spill over here
Not sure there's a whole lot of custom logic in that GPU; GPUs are mostly coded in a high-level RTL language; one of the few major exceptions was Nvidia's shader processors from the 8800 GTX series up until the 580 GTX.

Looking at the die shot, apart from the SRAM and DRAM arrays and I/Os it's almost entirely "sea of transistors" type logic, i.e. not custom layout...

Also, I think Nintendo would be unrealistically lucky to sell anywhere near 30 million wuu/yr. 15, maybe, but the wuu looks more like a GameCube successor to me, and that ended up at what, around 22 million total lifetime sales? Seeing that post-Xmas sales curve crash doesn't inspire a lot of confidence in the console's success.

Btw... Anyone spotted the ARM "starlet" successor core yet? Assuming it's there, of course. Might be near-impossible to find I suppose, considering how small it ought to be. It could masquerade as almost anything on that die. :) Also, there's supposed to be a DSP too somewhere, right?
 
There could be some customizations to fit Nintendo's needs but I can't fathom that AMD really designed a whole new GPU just for Wii U...
I agree. But I must also say that the GPU and the layout of the SIMDs look a bit strange. The size of the SIMD blocks would be consistent with a ~15% higher density layout than one sees in Brazos. Not completely impossible given the maturity of 40nm, AMD's experience with it, and the low clock target, especially if it uses an older iteration of the VLIW architecture (DX10.1 R700 generation instead of DX11 R800 generation) as a base.
But there is more. I think function already noticed the halved number of register banks in the SIMDs compared to other implementations of the VLIW architecture. I glossed over that by saying that each one simply holds twice the amount of data (8 kB instead of 4 kB) and everything is fine. It's not like the SRAM takes significantly less space on the Wii U die than it does on Brazos (it's roughly in line with the assumed generally higher density).
But thinking about it, each VLIW group needs parallel access to a certain number (four) of individually addressed register banks each cycle. The easiest way to implement this is to use physically separate banks. That saves the hassle of implementing multiported SRAM (but is also the source of some register read port restrictions of the VLIW architectures). Anyway, if each visible SIMD block is indeed 40 SPs (8 VLIW groups), there should be 32 register banks (as there are on Brazos as well as Llano and Trinity [btw., Trinity's layout of the register files of the half SIMD blocks looks really close to the register files of GCN's blocks containing two vALUs]). But there are only 16 (though obviously twice the size if we go with the 15% increased density). So either they are dual ported (in which case the increased density over Brazos is even more amazing) or something really fishy is going on. Before the Chipworks guy said the GPU die is 40nm TSMC (they should be able to tell), I would have proposed thinking again about that crazy-sounding idea of a 55nm die (with then only 160 SPs, of course). :oops:
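Restating that bank count as arithmetic, with 40 SPs per visible SIMD block taken as the working assumption:

# Register bank count expected under the usual VLIW5 layout
vliw_groups = 40 // 5                  # 8 VLIW5 groups per 40-SP block
banks_expected = vliw_groups * 4       # 32 banks of 4 kB each, as on Brazos/Llano/Trinity
banks_observed = 16                    # what the die shot appears to show
print(banks_expected * 4 / banks_observed)   # 8 kB per bank if total capacity is unchanged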
 
Not sure there's a whole lot of custom logic in that GPU; GPUs are mostly coded in a high-level RTL language; one of the few major exceptions was Nvidia's shader processors from the 8800 GTX series up until the 580 GTX.

Looking at the die shot, apart from the SRAM and DRAM arrays and I/Os it's almost entirely "sea of transistors" type logic, i.e. not custom layout...

I'd still make the distinction between synthesized RTL (with whatever constraints in place) and a full GPU hard macro placed on that die. I figure that's all he really meant by custom, with the remark that the latter would have die markings identifying it (does anyone know if this is really a given/requirement?)

If we're talking hand layouts I don't think it applies to most of A6 either.

Also, I think Nintendo would be unrealistically lucky to sell anywhere near 30 million wuu/yr. 15, maybe, but the wuu looks more like a GameCube successor to me, and that ended up at what, around 22 million total lifetime sales? Seeing that post-Xmas sales curve crash doesn't inspire a lot of confidence in the console's success.

30m/year is insane. Especially if we're talking on average, where even Wii hasn't come anywhere close to that. I don't think he's that familiar with game console sales.
 
Not sure there's a whole lot of custom logic in that GPU; GPUs are mostly coded in a high-level RTL language; one of the few major exceptions was Nvidia's shader processors from the 8800 GTX series up until the 580 GTX.

Looking at the die shot, apart from the SRAM and DRAM arrays and I/Os it's almost entirely "sea of transistors" type logic, i.e. not custom layout...
I don't think he meant a custom or hand layout. I would interpret that guy as saying it's probably a custom-designed GPU in the sense that AMD created a fully custom RTL description for Nintendo (a design owned by Nintendo, not AMD), not just some small adjustments to a licensed Radeon GPU design. I don't know if I should believe that.
 
I agree. But I must also say that the GPU and the layout of the SIMDs look a bit strange. The size of the SIMD blocks would be consistent with a ~15% higher density layout than one sees in Brazos. Not completely impossible given the maturity of 40nm, AMD's experience with it, and the low clock target, especially if it uses an older iteration of the VLIW architecture (DX10.1 R700 generation instead of DX11 R800 generation) as a base.
But there is more. I think function already noticed the halved number of register banks in the SIMDs compared to other implementations of the VLIW architecture. I glossed over that by saying that each one simply holds twice the amount of data (8 kB instead of 4 kB) and everything is fine. It's not like the SRAM takes significantly less space on the Wii U die than it does on Brazos (it's roughly in line with the assumed generally higher density).
But thinking about it, each VLIW group needs parallel access to a certain number (four) of individually addressed register banks each cycle. The easiest way to implement this is to use physically separate banks. That saves the hassle of implementing multiported SRAM (but is also the source of some register read port restrictions of the VLIW architectures). Anyway, if each visible SIMD block is indeed 40 SPs (8 VLIW groups), there should be 32 register banks (as there are on Brazos as well as Llano and Trinity [btw., Trinity's layout of the register files of the half SIMD blocks looks really close to the register files of GCN's blocks containing two vALUs]). But there are only 16 (though obviously twice the size if we go with the 15% increased density). So either they are dual ported (in which case the increased density over Brazos is even more amazing) or something really fishy is going on. Before the Chipworks guy said the GPU die is 40nm TSMC (they should be able to tell), I would have proposed thinking again about that crazy-sounding idea of a 55nm die (with then only 160 SPs, of course). :oops:

160 SPs a good possibility? :oops:
 
As a complete novice when it comes to matters of such a technical nature, I have a question that may or may not make me look like an idiot. But since I'd like to know the answer, I'm sure you guys here are the best to answer it - plus there's no such thing as a stupid question, right?

Anyway, given that so much of the die space is unaccounted for, yes, one of the possible suggestions is that it is there for fixed-function features that allow for greater performance - though I'm certainly in no place to suggest or confirm it either way.

That said, given the low power draw of the system, could it even be remotely possible that developers don't have access to the full power of every feature at once?

What I mean is, could it be that there are X amount of units solely for simple vertex manipulation, and developers could use those if they wished, but then that would reduce the amount of programmable shaders they had access to? The point being that they'd be faster at a specific operation, but it might not be usable in all games, and in those cases you can rely on the other shaders, but you are ultimately sacrificing total performance for a custom appearance.

If everything were usable at once with maximum efficiency, and there was a power gap to be exploited, surely there would be a shred of evidence to support it by now, correct?

Is that a silly thing to suggest, a possibility, or just how they work anyway?
 