Wii U hardware discussion and investigation *rename

So you think games are running massively more efficiently on the PC than on the Wii U?

I think the fact that the Wii U has a 3-core CPU that seems to be about 1/3rd the size of the quad-core K10.5 in Llano, at some 2/3rds of the core frequency, along with about half the memory bandwidth, might have something to do with the performance difference between Llano and the Wii U.

TBH, I think I (and many others) would be a lot more confident about the Wii U's performance if it had just a Llano A3500M with 128bit DDR3 1333MHz.
That way no one would be questioning whether it was more powerful than the X360.

Instead, Nintendo just decided to spend a couple of millions developing a crappy custom CPU for the sake of a crappy BC feature that some 200 people in the world will actually care about.
Oh well, that crappy decision is made and it's up to us to vote with our wallets.


But I digress: the CPU and memory bandwidth in the Wii U seem to be so much worse than Llano's and X360's that we can't really determine that it has a 160 ALU GPU instead of a 320 ALU one because of the performance difference in early games.
 
The CPU wouldn't prevent the GPU from running the games at higher resolutions and with better AA, and wouldn't explain why Sonic runs at the lowest resolution of all three platforms, or why a game like Tekken with dynamic resolution has to drop resolution to maintain performance (just like it does on the 360).

I think it's pretty clear that the Wii U GPU isn't delivering the performance that a 16:320:8 550 MHz part on the PC does (and that's putting it mildly). So far no game has shown anything like the jump in performance that you would expect even in the far from ideal PC environment.

So either every game is bottlenecked down to 360 levels and the GPU is wasting a huge chunk of its time (on every game) or Latte isn't actually as powerful as a HD 5550.
 
I agree that there is probably something missing. The 320 number doesn't seem to match up with anything. The layout of the SIMD looks like it's the same as for 20 ALUs, with the same number of cache blocks. The only thing explaining 320 SPs is the supposed 40nm process and the blocks being slightly too big. Even that doesn't explain it fully.

The SIMD blocks are 60% the size of Llano's and only about 30% larger than Bobcat's 20 SPs. Even on 40nm, it's pretty absurd that the density increased so much. We also don't have conclusive evidence it is 40nm. The only thing that pins it at 40nm right now seems to be the eDRAM size, which is a really rough estimate from what I can tell.

There are too many unconfirmed things. I don't even know how everyone jumped onto the 320 SP ship so fast. So far the similarities of the SIMD blocks with Bobcat's should point at 20 shaders per block on a larger manufacturing process. That's what you'd get if you only looked at the SIMD blocks.

I find it's much more likely that they found a way to pack eDRAM slightly denser than that they somehow packed the ALU logic smaller and cut away half the cache blocks. Or maybe the whole chip is 40nm but the logic isn't packed very densely because it wasn't originally designed for that process and fab. This is all much more likely from my point of view than magically having 320 SPs in so little space.

Or maybe the chip is 28nm and we do have 320 SPs and the eDRAM is actually 48MB with 16MB reserved for the OS!
 
I think the fact that the Wii U has a 3-core CPU that seems to be about 1/3rd the size of the quad-core K10.5 in Llano, at some 2/3rds of the core frequency, along with about half the memory bandwidth, might have something to do with the performance difference between Llano and the Wii U.

...

But I digress: the CPU and memory bandwidth in the Wii U seem to be so much worse than Llano's and X360's that we can't really determine that it has a 160 ALU GPU instead of a 320 ALU one because of the performance difference in early games.
If Nintendo designed in such low level bottlenecks in CPU and RAM performance, would it make any sense to chuck in a GPU that will be constantly throttled? A 320 shader part would be more GPU than the platform can support, it seems, which means a 320 shader part would be wasted silicon and wasted money, and that's really not in keeping with Nintendo's philosophy! The logic behind proportionally massive GPU power isn't there, because things like GPGPU need data to work on and high throughput. Ergo, if a 320 shader part can be shown to be far more capable than the 360, as function has done, that is pretty conclusive for me. Nintendo wouldn't put in a part that capable and then completely gimp it unless their engineers are incompetent.
 
My current preferred hypothesis is:

The Wii U GPU is 40nm, the shaders are in 8 blocks of 20, and the reason they're big / low density is that they're shrinks of 80/65 nm R6xx based designs originally mooted for an "HD Gamecube" platform that never came to pass. It's the only current hypothesis that adequately explains:

- Shader block sizes (why they look 55nm)
- Number of register banks
- edram density
- Marcan's "R600" references
- The "old" architectures for both the CPU and GPU
- The very Xbox 360 like level of performance

It might also explain the transparency performance if Nintendo decided to ditch the b0rked MSAA resolve / blend hardware in R6xx, or if the edram was originally intended to be on an external bus.

If Nintendo had intended to release a "GC HD" in 2006/2007 then R6xx on 80/65 nm and a multicore "overclocked" Gekko on 90 or 65 nm are precisely what they would have gone for. 90% of the 360 experience at 1/2 of the cost and 1/3 of the power consumption. Use 8~16 MB of edram and 256~512 MB of GDDR3 2000 and you've almost got Xbox 360 performance in a smaller, quieter machine that doesn't RRoD and vibrate.

Would have been quite something.

Edit: for anyone on NeoGaf reading this, I'm not suggesting that the Wii U design was lying around for years, I'm suggesting that the core technology used in the Wii U may have been decided on a long time ago when Nintendo had different objectives. Specifics of the Wii U design, such as clocks, edram quantity, process, main memory type, all that custom stuff for handling two displays etc would be much newer.

At this rate, you're eventually going to call it an overclocked N64.

Anyway, the contradiction here is that it eliminates the "it's more modern so it can match 360 with a low number of SPs" theory, since in this case the GPU wouldn't be more modern and, therefore, would just flat-out be 30% weaker than 360 on top of having a CPU that's 30-50% weaker and RAM that's 50% slower.
 
I wouldn't be surprised if most of the hardware has been ready for years. They may have been sitting on it, riding out Wii's useful lifetime. Or maybe they had a change in plans that necessitated a delay (ie tablet stuff). If it had launched, a die shrink would have eventually happened, so this could be them launching the shrink. Who knows...

If it is really R600-based that's kind of fascinating in a way. I guess. :)
 
Looks like that is the case... I figured the eDRAM logic handled color and depth separately, but I guess not. It's confusing since a lot of stuff out there refers to hi-Z as a form of early-Z (which it is, I guess).

I'd think hierarchical means that it won't test all 4 pixels in a 2x2 tile. But if the SIMDs actually skip the pixel shading it would be early. Flipper/Hollywood has early Z and I don't think Nintendo would drop such a thing.

Btw, say that the Wii U GPU indeed has hardwired per-pixel light attenuation and vector normalization. What would be a feasible approach to implement it? If they'd utilize 8 pipelines the FLOPS would go sky-high, or am I wrong?

Edit: with sky-high I exaggerate of course, but +50 GFLOPS or so?
 
At this rate, you're eventually going to call it an overclocked N64.

Anyway, the contradiction here is that it eliminates the "it's more modern so it can match 360 with a low number of SPs" theory, since in this case the GPU wouldn't be more modern and, therefore, would just flat-out be 30% weaker than 360 on top of having a CPU that's 30-50% weaker and RAM that's 50% slower.

The R6xx series are all more modern than Xenos, and they're VLIW5 designs that should be more capable at a given clock speed/shader count. Thanks to DRS and Barbarian we also now know that the 360 doesn't have early Z either, while I'm pretty sure R6xx do, so that should improve efficiency.

There might well be nothing in the idea of the Wii U tracing its evolution back to an R6xx based design, but I really think there's something in the idea of 160 shaders.
 
Wow, only 160 SP - even I never thought the Wii U would be this weak and felt everyone was just exaggerating for effect since it wasn't a true next gen box.

Someone should tell DF to update their article.

If it's a 160 SP part, what kind of FLOPS are we looking at?
 
Wow, only 160 SP - even I never thought the Wii U would be this weak and felt everyone was just exaggerating for effect since it wasn't a true next gen box.

Someone should tell DF to update their article.

If it's a 160 SP part, what kind of FLOPS are we looking at?

I think someone said about 350 for 320 SPs, so around half that, or 170-180 GFLOPS, which seems in line with the idea that it's roughly 90% of Xbox 360 performance, and it fits the memory bandwidth.
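For reference, a minimal sketch (Python) of that back-of-the-envelope arithmetic, assuming the usual 2 FLOPs per ALU per clock for a multiply-add and the confirmed 550 MHz GPU clock; actual utilization will be lower:

# Peak-FLOPS sketch: ALUs x 2 ops per clock (multiply-add) x clock speed.
clock_hz = 550e6
for alus in (160, 320):
    gflops = alus * 2 * clock_hz / 1e9
    print(f"{alus} ALUs @ 550 MHz -> {gflops:.0f} GFLOPS peak")
# 160 ALUs -> 176 GFLOPS, 320 ALUs -> 352 GFLOPS,
# matching the ~350 and ~170-180 figures above.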
 
There might well be nothing in the idea of the Wii U tracing its evolution back to an R6xx based design,

Mind you, given the nature of these monolithic designs, you might even find stuff dating farther back, which just adds to the uncertainty of claiming it's necessarily R6xx - these are all evolutionary designs.

*shrug*
 
So, the idea is that the GPU would be inefficient at 320 SPs because of a slow CPU and slow RAM, so it makes more sense for the GPU to be 160 SPs. Yes, this seems much more efficient: an R700 part that performs at 7 GFLOPS per watt... How again does this make sense? Or does 14 GFLOPS per watt make better sense? Ports performing badly should never be evidence of something like this, especially when some of them were done in only 6 months. If you could reason like that then we could point to the HD remakes of PS2 games running on PS3 and 360 at 60+ fps at 1080p. The logic here is in the gutter.
 
So, the idea is that the GPU would be inefficient at 320 SPs because of a slow CPU and slow RAM, so it makes more sense for the GPU to be 160 SPs. Yes, this seems much more efficient: an R700 part that performs at 7 GFLOPS per watt... How again does this make sense? Or does 14 GFLOPS per watt make better sense? Ports performing badly should never be evidence of something like this, especially when some of them were done in only 6 months. If you could reason like that then we could point to the HD remakes of PS2 games running on PS3 and 360 at 60+ fps at 1080p. The logic here is in the gutter.
What number are you using? As already posted, this whole GPU die has about 15 watts to work with, and there is more than just a GPU core on this chip.

If everything else takes a watt, that leaves you at 176 GFLOPS @ 12.57 GFLOPS per watt or 352 GFLOPS @ 25.14 GFLOPS per watt.

Looking at AMD GPUs at 40nm, e.g. the Radeon HD 5550 [320:16:8 @ 550MHz]: 9.03 GFLOPS per watt.
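For what it's worth, a minimal sketch (Python) of that perf-per-watt comparison, assuming ~14 W for the GPU core itself (the ~15 W die budget minus a watt for everything else, as above) and the ~39 W TDP implied by the 9.03 GFLOPS/W figure for the HD 5550:

# Perf-per-watt sketch; the 14 W GPU-core budget is the assumption from the post above.
gpu_core_watts = 14.0
for sps, peak_gflops in ((160, 176), (320, 352)):
    print(f"{sps} SPs: {peak_gflops / gpu_core_watts:.2f} GFLOPS/W")
# Radeon HD 5550 reference: 352 GFLOPS peak over a ~39 W TDP.
print(f"HD 5550: {352 / 39:.2f} GFLOPS/W")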
 
I'd think the GPU most likely is not hitting the TDP or anything. It most likely has a TDP of ~30W and uses 15-20W or so during normal gaming. Only things like Futuremark or GPU burn tests can actually hit the TDP.
 
Here is an interesting little test performed by a user named Blu on NeoGAF.

http://www.neogaf.com/forum/showpost.php?p=47593495&postcount=3295

It compares the Wii's Broadway CPU at 729MHz to an AMD Bobcat core running at 1.33GHz. Normalized for clockspeed, Broadway completed the test ~26% faster than the Bobcat core. For this particular workload, then, a Broadway @ 1.243GHz would perform similarly to a 1.6GHz Bobcat. Use of paired singles can make quite the difference, and I have to wonder how well many of these multi-platform games are utilizing the FPU in Espresso.
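Roughly, the clock normalization behind those figures works out like this (a Python sketch assuming performance scales linearly with clock for this workload; the ~26% figure is rounded, so it doesn't land exactly on 1.243GHz):

# Clock-normalized comparison, assuming linear scaling with clock for this workload.
broadway_per_clock_advantage = 1.26   # Broadway ~26% faster per clock in Blu's test
bobcat_target_mhz = 1600
equivalent_broadway_mhz = bobcat_target_mhz / broadway_per_clock_advantage
print(f"Broadway needs ~{equivalent_broadway_mhz:.0f} MHz to match a 1.6 GHz Bobcat")
# -> ~1270 MHz, in the same ballpark as the ~1.24 GHz quoted above.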


As regards the GPU, since this is a custom chip, is it possible or probable that we could be looking at 30 ALUs or 6 VLIW5 pipelines per cluster on Latte? It would be an odd number, but it could explain the size discontinuity relative to some of AMD's other chips. It would also be an identical arrangement to Xenos, 240:16:8, just with the ALUs broken into more SIMDs. It doesn't explain the register sizes though...
 
Here is an interesting little test performed by a user named Blu on NeoGAF.

http://www.neogaf.com/forum/showpost.php?p=47593495&postcount=3295

It compares the Wii's Broadway CPU at 729MHz to an AMD Bobcat core running at 1.33GHz. Normalized for clockspeed, Broadway completed the test ~26% faster than the Bobcat core. For this particular workload, then, a Broadway @ 1.243GHz would perform similarly to a 1.6GHz Bobcat. Use of paired singles can make quite the difference, and I have to wonder how well many of these multi-platform games are utilizing the FPU in Espresso.


As regards the GPU, since this is a custom chip, is it possible or probable that we could be looking at 30 ALUs or 6 VLIW5 pipelines per cluster on Latte? It would be an odd number, but it could explain the size discontinuity relative to some of AMD's other chips. It doesn't explain the register sizes though...
Interesting. I guess the 64-bit SIMDs really bottleneck Bobcat in those tests. I would expect integer performance to be more in favor of the Bobcat.

I really doubt there was that much customization of the shader blocks for the GPU. The layout looks very standard.
 