Wii U hardware discussion and investigation *rename

What disturbs me about the "160 SPUs" hypothesis is not only the glaring disparity between eDRAM and SPU density. It is also that every developer testimony I've seen has said that the GPU has performance margins over the 360, typically allowing them to implement some minor new feature(s).

The 360 has 240 (VLIW5) SPUs, and the hypothesis posits the WiiU to have 160 (VLIW5) SPUs of a generationally very close architecture. The clock frequency is similar, so if all other things were equal, the 360 would have 50% higher raw ALU capabilities. But this clashes with the sentiments of developers on record. You could assume that Nintendo has commissioned modifications to the ALU-blocks that make them a whole lot more performant, but that raises the question of why you would compare SPU counts in the first place if they don't correlate reasonably to relative ALU capabilities.

So I'm inclined to think of the WiiU's shader array as having 320 SPUs, for the simple reasons that it makes sense in terms of die area and, just as importantly, that it jibes with what developers are saying - that the WiiU has a somewhat more performant GPU than its previous generation competitors. Regardless of what the actual configuration of the WiiU GPU is, putting its capabilities a small fraction over the 360 seems like a reasonable summation of developer testimonials in terms of real-world capability.
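
To make the arithmetic in that comparison explicit, here's a minimal sketch of the two hypotheses as stated. It takes the post's own premise at face value - both parts VLIW5 at similar clocks - which later posts in this thread dispute:

```python
# Relative raw ALU throughput under the premise that both chips are
# VLIW5 at similar clocks, so throughput scales directly with SP count.
XBOX360_SPS = 240
for wiiu_sps in (160, 320):
    ratio = XBOX360_SPS / wiiu_sps
    print(f"WiiU with {wiiu_sps} SPs: 360 has {ratio:.2f}x the raw ALU")
# 160 SPs -> 1.50x (the '50% higher' figure); 320 SPs -> 0.75x
```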
 
So we can't really know anything from the die photo then?
Major functional blocks can be identified in some cases, but the smaller they are, the more ambiguous it becomes. E.g., this section over here could be an array of TMUs - maybe. It could possibly be something else; who knows for sure! Without proprietary inside information it's all guesses - more or less educated, sure, but still only guesses.

Guessing the bandwidth of the eDRAM array by looking at a photo like this (i.e., not even an x-ray image that shows internal details, just a standard surface photo) is basically impossible. Modern complex ICs like this one always have many layers of interconnect fabric; the lowest layers are the most important, but also completely invisible due to their minute size, and they might not even be exposed at all by the etching process. We simply can't tell by eye how the chip is wired up, and the bandwidth depends directly (in part) on the physical wiring.

The 360 has 240 (VLIW5) SPUs
AFAIK, the Xenos GPU uses scalar processors, not VLIW5.
 
So I'm inclined to think of the WiiU's shader array as having 320 SPUs ... putting its capabilities a small fraction over the 360 seems like a reasonable summation of developer testimonials in terms of real-world capability.

Well, looking at, say, Mario Kart 8, it would certainly seem that you would have more.

But looking at that die shot, with the leaked info about 2 SIMD arrays of 80 SPUs lining up nicely with it, it is pretty damning. Which leaves us with, well, shrugged shoulders.

I don't know. I know the 360's VLIW5 wasn't particularly efficient, with on average 3 out of 5 lanes getting used. I don't know HOW, but maybe Nintendo found a way to up that average (seriously, anyone - HOW?). Maybe the 'memory intensive' design removes a lot of unnecessary bottlenecks; maybe it loses on peak theoretical performance but surpasses considerably on usable real-world performance. I don't know.

I guess every piece of the puzzle we can uncover will only help.

Major functional blocks can be identified in some cases, but the smaller they are, the more ambiguous it becomes. ... Guessing the bandwidth of the eDRAM array by looking at a photo like this is basically impossible.


AFAIK, the Xenos GPU uses scalar processors, not VLIW5.
I guess that would explain a little about the 360's shaders.

Anyways.

[image: table of NEC eDRAM macro specifications]


I'm assuming we've been looking at the UX7LSeD 128 Kw x 256 b. The size is right (32 Mb), the timing is right - entering mass market production at the end of 2007 - whereas I feel the 40nm successor, arriving in the second half of 2009, would have been too late for Nintendo's comfort.

So... what's the bandwidth of this product's 32 Mb macros?
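
For a rough sense of scale, here's a back-of-the-envelope sketch. It assumes each macro exposes its full 256-bit interface at one transfer per clock, runs at the GPU's reported ~550 MHz, and that MEM1's 32 MB is built from 8 such macros - all assumptions, not confirmed specs:

```python
# Hypothetical eDRAM bandwidth estimate for the UX7LSeD-class macro.
# Assumptions (not confirmed): 256-bit interface per macro, one transfer
# per clock, macros clocked at the GPU's reported ~550 MHz.
BITS_PER_MACRO = 256   # datasheet width: 128 Kwords x 256 bits
CLOCK_HZ = 550e6       # WiiU GPU clock reported from the leaked clocks
MACROS = 8             # 8 x 32 Mbit = 32 MB, matching MEM1 (assumed)

per_macro_gbps = BITS_PER_MACRO * CLOCK_HZ / 8 / 1e9    # bits/s -> GB/s
print(f"per macro:  {per_macro_gbps:.1f} GB/s")          # ~17.6 GB/s
print(f"aggregate: {per_macro_gbps * MACROS:.1f} GB/s")  # ~140.8 GB/s
```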
 
Xenos has 48 shader ALUs that can each co-issue a Vec4 and a Scalar instruction. Vec4+scalar. It's an older configuration similar to ATI D3D9 chips. It has also been posted in this thread many times and we have a search function here!!!!!! ;)

Now I really don't know what this means for efficiency compared to VLIW5, though. One would think that the newer architecture is more efficient, but there is no PC part with unified Vec4+scalar to compare against VLIW5.

There was an aborted PC project called R400 that was probably unified Vec4+scalar. But ATI didn't bring unified shaders to the PC until VLIW5, for unknown reasons. Xenos is perhaps a form of that R400.
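
For reference, this issue structure is where the "240" figure used throughout the thread comes from - a quick sketch using the known 500 MHz Xenos clock and counting a multiply-add as two flops:

```python
# Xenos: 48 ALUs, each co-issuing Vec4 + scalar = 5 lanes per ALU.
ALUS, LANES_PER_ALU, CLOCK_HZ = 48, 4 + 1, 500e6
lanes = ALUS * LANES_PER_ALU          # 240 "stream processor" equivalents
gflops = lanes * 2 * CLOCK_HZ / 1e9   # MADD = 2 flops per lane per clock
print(lanes, gflops)                  # 240 lanes, 240.0 GFLOPS
```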
 
Xenos has 48 shader ALUs that can each co-issue a Vec4 and a Scalar instruction. Vec4+scalar. It's an older configuration similar to ATI D3D9 chips. It has also been posted in this thread many times and we have a search function here!!!!!! ;)
LOL! Existence of the search function duly noted. :D Thanks for the Xenos correction, btw.

Btw, been thinking about the abnormal memory distribution of the wuu - could it be that the 1GB reserve is used as a pre-caching buffer for downloadable game titles?

It totally would not surprise me if the BR drive is considerably faster than the on-board flash in the wuu, meaning that downloaded game titles couldn't keep up with disc-based ones, which would cause problems for streaming titles and so on. A large read-ahead buffer would help with that, especially if the game could send hints to the OS about what to pre-load into the buffer (like, say, when a player turns onto a different street in a free-roaming title like the GTA series).

Of course, this is all speculation, but what other purpose could the 1GB reserve possibly have...? Nintendo sure isn't adding any new features to the system which might have use for all that memory.
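
Purely to illustrate the kind of hint-driven read-ahead being speculated about, a minimal sketch - every name and size here is hypothetical; nothing is known about Nintendo's actual OS:

```python
# Illustrative sketch only; nothing here reflects Nintendo's real OS.
from collections import OrderedDict

class FlashStore:
    """Stand-in for the slow backing store (on-board flash)."""
    def read(self, chunk_id):
        return bytes(64 * 1024)          # pretend every chunk is 64 KB

class ReadAheadCache:
    def __init__(self, capacity_bytes, storage):
        self.capacity = capacity_bytes
        self.storage = storage
        self.cache = OrderedDict()       # chunk_id -> bytes, LRU order
        self.used = 0

    def hint(self, chunk_ids):
        # The game hints which chunks it will need soon (say, the player
        # turned onto a new street); prefetch them into the reserve.
        for cid in chunk_ids:
            if cid not in self.cache:
                self._insert(cid, self.storage.read(cid))

    def read(self, chunk_id):
        if chunk_id in self.cache:       # fast path: served from RAM
            self.cache.move_to_end(chunk_id)
            return self.cache[chunk_id]
        data = self.storage.read(chunk_id)
        self._insert(chunk_id, data)     # slow path: hit the flash
        return data

    def _insert(self, cid, data):
        self.cache[cid] = data
        self.used += len(data)
        while self.used > self.capacity: # evict least-recently-used chunks
            _, old = self.cache.popitem(last=False)
            self.used -= len(old)

cache = ReadAheadCache(1 * 2**30, FlashStore())  # the speculated 1GB reserve
cache.hint(["street_42_geometry", "street_42_textures"])
```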
 
I don't think the entire 1GB reserve serves any purpose as of yet. If it does, it remains a mystery from what Nintendo shows from a functionality standpoint.
 
Only if we use our amateur tea-leaf reading to put the console down.

That's your area of expertise, as your vague (and grossly inaccurate) handwaving below demonstrates.

What disturbs me about the "160 SPUs" hypothesis is not only the glaring disparity between eDRAM and SPU density.

There is no glaring disparity between edram and SPU density. We've gone over this. It's not a TSMC-fabbed chip and it's not laid out by AMD. Direct comparison of SPU density with AMD/TSMC 40nm chips is a very, very poor basis for comparison. (Register counts, however, will transfer across processes.)

You went quiet for a few months after insulting people for questioning whether it was a TSMC chip (you claimed TSMC was a 'FACT!!'), but it turns out you were wrong, and now you seem to have forgotten that the basis for your 'SPU density' argument isn't there.

It is also that every developer testimony I've seen has said that the GPU has performance margins over the 360, typically allowing them to implement some minor new feature(s).

Actually most of them didn't add new features. Some did, and at least one has said they could have if they had more time. And some games reduced GPU load in specific areas.

But this is only meaningful if you want to claim that the Wii U couldn't improve anything without having more 'raw ALU' than the 360. And that's fundamentally wrong, given what we have seen happen in the PC space from which these GPUs are derived.

The 360 has 240 (VLIW5) SPUs, and the hypothesis posits the WiiU to have 160 (VLIW5) SPUs of a generationally very close architecture. The clock frequency is similar, so if all other things were equal, the 360 would have 50% higher raw ALU capabilities.

Wrong again! And again, it's something that's come up in this thread a number of times. The 360 isn't VLIW5 - it's an entire generation (or more) behind - and 'raw ALU' is only 23% higher on the Xbox 360, nothing like 50%. And god knows how much more efficient VLIW5 will be. In things like triangle setup and raw fillrate the Wii U will be outright faster.

Not to mention that the Wii U won't have the same tiling penalties and that it can read from edram without a copy out / resolve to main ram (and even read from and write to the same buffer).

The 360 doesn't even have early z rejection!


But this clashes with the sentiments of developers on record.

Not at all, especially as the 'this' (above) is basically wrong.

You could assume that Nintendo has commissioned modifications to the ALU-blocks that make them a whole lot more performant ...

There's absolutely no need to make that assumption.

And a 320-shader part would be able to trounce the 360, running 360 games easily at much higher resolutions, just like it does on the PC. Seeing minor improvements is not justification for vastly more powerful hardware.
 
There was an aborted PC project called R400 that was probably unified Vec4+scalar. But ATI didn't bring unified shaders to the PC until VLIW5, for unknown reasons. Xenos is perhaps a form of that R400.

I don't doubt it's the cancelled R400, ported to 90nm and with some tweaks, maybe.
I think it was too early. Everyone wrote that the HD 2600 sucked (R600 technology); that's because a unified GPU architecture is transistor hungry, what with the front-end and scheduling stuff. So R400 would probably have been expensive and too slow.
 
Seeing as how the shader units only take up a small portion of the die space, how influential are the other components to a GPU's performance? For example, could the ALUs actually be pretty comparable in performance, with the rest of the hardware on the die responsible for the increased performance that developers are seeing? And is fillrate 100% ALU bound? I ask because Fuzzywuzzygames said they were having fillrate issues on 360, but had no such issues with Wii U, even with higher resolution and post-processing effects. All the attention has been on the shader ALUs, but perhaps the ROPs and texture units are also much more efficient than they were on Xenos?

The focus has been on the ALUs, but perhaps the fillrate performance is better because of the ROPs. Wii U runs 50 MHz faster than Xenos, giving the ROPs a 10% advantage in fillrate, and I would assume that, just like the ALUs, there have been design changes to the ROPs that give better performance. The same clock speed advantage applies to the texture units as well, and again I would think that the texture units in Wii U are better than Xenos'.
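
A quick sketch of that clock-derived fillrate gap. Xenos' 500 MHz and 8 ROPs are known; the Wii U's 550 MHz comes from the leaked clocks, and its 8-ROP count is the die-shot assumption:

```python
# Pixel fillrate = ROPs x clock (one pixel per ROP per clock).
def gpixels_per_s(rops, clock_hz):
    return rops * clock_hz / 1e9

xenos = gpixels_per_s(8, 500e6)             # 4.0 Gpixels/s (known)
latte = gpixels_per_s(8, 550e6)             # 4.4 Gpixels/s (assumed 8 ROPs)
print(f"{latte / xenos - 1:.0%} advantage") # the 10% figure above
```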
 
Seeing as how the shader units only take up a small portion of the die space, how influential are the other components to a GPU's performance?
I assume there is a lot of other logic that is purely for BC purposes. Some may be fixed-function hardware that is accessible to the developer, but I doubt it would make a big difference. The TMUs and ROPs might be better than the ones in Xenos, but the configuration doesn't seem to lend itself to being revolutionary.
 
I assume there is a lot of other logic that is purely for BC purposes. Some may be fixed-function hardware that is accessible to the developer, but I doubt it would make a big difference.

I don't see any reason for that whatsoever; a programmable-shader GPU would have no problem perfectly emulating the Wii's simplistic color combiner.

Particularly when the people doing it have the precise information on how to emulate the older device.


Seeing as how the shader units only take up a small portion of the die space, how influential are the other components to a GPU's performance? ... perhaps the ROPs and texture units are also much more efficient than they were on Xenos?

Well, I don't have anything on the Wii U end, but I think I can tell why the 360 was having trouble.

On the 360, the ROPs were on the daughter die with the embedded DRAM - a well-known story. This enables that 256 GB/s bus that's brought up so often. Sounds like excellent operational bandwidth for filling frame buffers and other ROP-bound operations.

Unfortunately, when it comes time to move data out or in, the picture is much less rosy. First off, it turns out that 10 MB of eDRAM just isn't enough for everything that needs to get done in one pass - it was capacity-screwed. So devs came up with a workaround: data had to be transferred out of the eDRAM at MUCH less than 256 GB/s while the next portion of the frame moved in, was processed, moved out, joined up, and was sent on, having to cross every bridge on its way out and pay the time toll.
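
The capacity squeeze is easy to put numbers on. A sketch using the standard 720p figures (32-bit color plus 32-bit depth/stencil per sample - the usual assumption, not anything from this thread):

```python
# Why 10 MB of eDRAM forces tiling on the 360: a 720p framebuffer with
# MSAA, 32-bit color and 32-bit depth/stencil quickly outgrows it.
WIDTH, HEIGHT = 1280, 720
BYTES_PER_SAMPLE = 4 + 4          # 32-bit color + 32-bit Z/stencil
for msaa in (1, 2, 4):
    size_mb = WIDTH * HEIGHT * msaa * BYTES_PER_SAMPLE / 2**20
    tiles = -(-size_mb // 10)     # ceiling division against 10 MB
    print(f"{msaa}x MSAA: {size_mb:.1f} MB -> {int(tiles)} tile(s)")
# 1x fits in one tile; 2x needs 2 tiles; 4x needs 3 tiles.
```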
 
It's crazy to think about, but prior to the launch of the Wii U we had very little info: basically a few developer comments that it has a lot of memory, a more modern GPU that is a bit more powerful than the 360's, and a CPU that is a little slow. Fast forward 18 months: thousands of posts on the net, hundreds of pages for this thread alone, a couple of die photos released and a couple of clock speeds from a hacker - and how much more do we really know than 18 months ago? Not much. There are some really smart people here, but even on this forum the specifics seem to remain a mystery.
 
I don't see any reason for that whatsoever; a programmable-shader GPU would have no problem perfectly emulating the Wii's simplistic color combiner.

I might be wrong, but wasn't the fact that they opted for 1:1 hardware BC speculated to have something to do with piracy? They must have had some reason to go that route, anyway.
 
[chart: game performance vs. peak GFLOPS for various GPUs, from the Real World Tech article linked below]


http://www.realworldtech.com/gpu-memory-bandwidth/2/

Back on to the tech side of things: Beyond3D has a link on the front page that, after reading it, seems very relevant to the Wii U's situation when comparing it to the 360 and PS3.

As this chart shows, there can be a big disparity between two cards with the same GFLOP rating. I think a lot of people, myself included, have struggled to grasp how a 176 GFLOP GPU can not only match a 240 GFLOP GPU but also surpass it by a decent margin. As we can see from the chart, a 192 GFLOP GeForce 9800M GTS is able to outperform the 249 GFLOP GeForce GT 435M by a respectable margin. It makes me wonder if the GPUs in the 360 and PS3 were actually memory-bandwidth starved. The article also talks about the importance of memory controller performance, which can also make a pretty big difference. Wii U's GPU would have far superior texture caches and presumably a much better memory controller, and the bandwidth of the eDRAM seems to be letting the GPU perform more like a graphics card with GDDR5 memory than the same card with DDR3.
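
For reference, the 176 figure is derived the same way as the Xenos 240 earlier in the thread, from the assumed die-shot configuration (160 SPs at the leaked ~550 MHz - an inference, not a confirmed spec):

```python
# Peak FLOPS = lanes x 2 (multiply-add counts as two ops) x clock.
lanes, clock_hz = 160, 550e6       # assumed 160 SPs; leaked ~550 MHz clock
print(lanes * 2 * clock_hz / 1e9)  # 176.0 GFLOPS under the 160 SP reading
```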
 
Yup, game performance is determined only partially by shader ALU performance. 9800M has around double the memory bandwidth and pixel/texture fillrate compared to GT 435M.

I thought I saw in this thread that someone figured out that WiiU's GPU has a rather atypical 4 TMU & 4 RBE setup. That's not much texture power. RV770 has 40 TMUs with 16 ROPs for example. The ratio was/is usually around 2:1 TMU:ROP.
 
I thought I saw in this thread that someone figured out that WiiU's GPU has a rather atypical 4 TMU & 4 RBE setup. That's not much texture power.


Does that make any sense? It seems like it's become mostly accepted that the GPU is 160 stream processors, 8 texture units, and 8 ROPs.
 
Does that make any sense? It seems like it's become mostly accepted that the GPU is 160 stream processors, 8 texture units, and 8 ROPs.
It's still strange, since Radeons typically have around a 2:1 TU:ROP ratio, and 8-ROP Radeons also have >= 320 SPs. Latte has a unique allotment of hardware.
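
If the 8-TMU reading is right, the raw texture-rate comparison isn't flattering - a sketch, where Xenos' 16 bilinear TMUs at 500 MHz are known and Latte's 8 TMUs are the die-shot guess:

```python
# Texture fillrate = TMUs x clock (one bilinear fetch per TMU per clock).
def gtexels_per_s(tmus, clock_hz):
    return tmus * clock_hz / 1e9

xenos = gtexels_per_s(16, 500e6)  # 8.0 Gtexels/s (known)
latte = gtexels_per_s(8, 550e6)   # 4.4 Gtexels/s (assumed 8 TMUs)
print(xenos, latte)
```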
 
No question, but since there is no way to find a like-for-like comparison, some educated guesses have been made. As some people here have concluded, an HD 5550 with its 320 SPUs has no trouble outclassing the 360 and PS3, so it's doubtful that Latte is that capable. For me, it's all about seeing where the performance really lines up, and seeing a chart with multiple GPUs of similar GFLOP ratings score so differently really showed just how much performance can vary. When you factor in increased shader efficiency, better texture caches, improved Z compression, early Z test and fast Z clear, and excellent bandwidth for shader and rendering operations thanks to the eDRAM, the real-world gap can close considerably.
 