Part I: Expresso and his parents.
The grandfather, Gekko, was introduced to the world inside the GameCube (GCN) in 2001. The Gekko microprocessor is a heavily modified PowerPC 750CXe, a RISC CPU, and the most notable modification is the addition of arithmetic instructions geared toward 3D graphics (the paired-single SIMD extensions). With 32 64-bit FPRs and 32 GPRs, Gekko can churn through plenty of math, not forgetting its two 32-bit integer units and a single 64-bit FPU (which, in paired-single mode, behaves like two 32-bit FPUs). Add another FPU and imagine what this processor could do! Moving on.
Gekko was more than a CPU; it was a GPCPU. Sure, I made that term up and I'm exaggerating a bit, but the point is that Gekko works alongside a fixed-function GPU/northbridge combo called Flipper. A lot of enthusiasts will say "it's not fixed function, it can do shaders" and so on and so forth. Simple fact is no, it cannot, though Flipper does have TEV stages, which we'll discuss in a bit. Because of Flipper's limitations, Gekko was often tasked with vertex transformations and sometimes parts of primitive setup (which somewhat contradicts the statement ahead). Beyond that you're stuck with Flipper's fixed-function T&L unit and its 8 hardware lights, in line with the OpenGL 1.x specification. As a result, most user-defined, custom lighting was performed on the Gekko microprocessor; a typical example is local/finite lighting. Gekko is tightly coupled to the Flipper chip, so much so that it even has to go through Flipper to reach main memory (24 megabytes of 1T-SRAM).
Onto basic specs: Gekko is rated at roughly 1.9 GFLOPS. With its short 4-stage pipeline, Gekko can fetch up to 4 instructions per cycle and retire 2.
Processor clock * FLOPs per instruction * floating-point lanes = GFLOPS
486 MHz * 2 * 2 = 1.944 GFLOPS (and 1.944 GIPS)
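As a quick sanity check of that formula, here's a minimal sketch in Python; the clock speeds and the 4-FLOPs-per-cycle assumption (a paired-single multiply-add, folding the "2 * 2" above together) are this article's figures, not official documentation:

```python
# Peak-throughput sketch: clock (MHz) * FLOPs per core per cycle * cores.
# The 4 FLOPs/cycle figure assumes a paired-single multiply-add
# (two lanes, two ops each); treat all numbers as assumptions.
def peak_gflops(clock_mhz, flops_per_cycle=4, cores=1):
    return clock_mhz * flops_per_cycle * cores / 1000.0

print(peak_gflops(486))            # Gekko    -> 1.944 GFLOPS
print(peak_gflops(729))            # Broadway -> 2.916 GFLOPS
print(peak_gflops(1243, cores=3))  # Expresso -> 14.916 GFLOPS
```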
An unsurprisingly low number, but it more than gets the job done. Since Gekko and Flipper are best buds, let's talk about Flipper briefly, even though it's somewhat beyond the scope of this article. Flipper is clocked at the speed of Gekko's front-side bus, 162 MHz. Flipper should also be considered the northbridge, since it handles audio (an integrated audio DSP), does I/O, and is Gekko's route to main memory. The actual Flipper graphics core is, in two words, fixed function: 4 pixel pipelines with 1 texture unit each, giving us 648 Mpixel/s and 648 Mtexel/s. For a 640x480 24-bit framebuffer (RGB8 or RGBA6) this is fine. Most of the confusion regarding Flipper is about its programmability, and the source of it is the 16 TEV stages built into the graphics chip. Their functionality can be compared to NVIDIA's register combiners. NVIDIA's implementation is essentially "you have a handful of inputs that you combine into a few distinct outputs"; how you decide to combine those inputs, and which operations you perform to get your output, is the programmability.
thakis@users.sourceforge.net claims each TEV stage takes 4 inputs and produces 1 output.
The operations allowed by both NV_register_combiners and GX's TEVs are not arbitrary, so there are still limits on what the user can do. Any additional functionality the TEV offers is beyond me, but increasing the number of TEV stages has a cost: much like adding extra passes, each extra stage adds per-pixel work, so fill rate drops as the stage count climbs. Beyond my knowledge, but this is the basis of so-called GCN/Wii shaders: GX TEV stages.
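To make the comparison concrete, here's a toy model of one combiner-style stage in Python. The blend formula (output = d + (1 - c) * a + c * b, optionally biased and scaled) is how the GX TEV color op is usually described in homebrew documentation; treat the exact form as an assumption. The point is simply "four inputs, a fixed operation set, one output":

```python
# Toy model of one TEV/register-combiner style stage: four RGB inputs,
# a fixed blend operation, one output. Chaining 16 of these is the
# GameCube/Wii notion of a "shader".
def tev_stage(a, b, c, d, bias=0.0, scale=1.0):
    # Lerp a and b by c, add d, then bias/scale, clamping to [0, 1]
    # as a fixed-point pipeline would.
    lerp = tuple((1.0 - cc) * aa + cc * bb for aa, bb, cc in zip(a, b, c))
    return tuple(min(max((l + dd + bias) * scale, 0.0), 1.0)
                 for l, dd in zip(lerp, d))

# Example: modulate a texture sample by a vertex colour, a classic one-stage setup.
tex, vtx, black = (0.8, 0.6, 0.4), (0.5, 0.5, 0.5), (0.0, 0.0, 0.0)
print(tev_stage(black, tex, vtx, black))   # -> (0.4, 0.3, 0.2)
```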
Broadway, the CPU of the Wii, is not much different. Information regarding Broadway is sparse. The most notable differences are a smaller fabrication process, lower power consumption, a higher clock rate, and additional registers. With the publicly available information, Broadway looks like a joke of an improvement, and Gekko is much disappointed in this child. Since the Wii is an overall system improvement (from CPU to RAM to GPU), its performance is arguably, but justifiably, pegged at roughly twice that of the GameCube.
Part II: Expresso
Crying in shame, Broadway then had three inseparable triplet sons who call themselves Expresso. Expresso is still a derivative of Gekko: roughly 70% faster per core, with the combined intellect of three. A triple-core processor. Clocked at 1.24 GHz, Expresso has a theoretical rating of 14.9 GFLOPS, an outstanding jump from Broadway's 2.9 GFLOPS. Compared to modern x86 microprocessors this is insanely weak, but let's put it into perspective. The Wii U architecture is very different from the GCN/Wii. On the processor die is a very large eDRAM cache which undoubtedly has a lot of bandwidth. Bandwidth shouldn't have been an issue before, but with a multi-core processor things start to change. With this extra layer of the memory hierarchy, the time the CPU sits idle is lessened, further improving Expresso's utilization over its predecessors. Furthermore, look at Gekko's old duties: it needed to perform vertex transformations, and developers would often perform custom lighting effects on this chip, alongside physics, AI, and so on.
In the HD realm we're dealing with resolutions up to roughly 3 times as large in each dimension, and the number of pixels grows with the square of that, so 1920x1440 has 9 times the pixels of 640x480 and demands roughly 9 times the pixel processing. When it comes to games rendering natively at 1280x720, Expresso beats this curve. At 1920x1080, however, Expresso falls short, even though that resolution has 25% fewer pixels than the 4:3 1920x1440 case. The shortfall more than balances itself out once you remember that Expresso no longer needs to do costly vertex transformations. With techniques such as tiled rendering it's quite possible to keep using Expresso the way Gekko was used all these years, but why would we want to do that? This is where we move on to the Wii U's GPU.
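A quick worked example of that pixel-count scaling, just arithmetic on the resolutions named above:

```python
# Pixel-count ratios between the resolutions discussed above.
def pixels(w, h):
    return w * h

sd     = pixels(640, 480)      # 307,200
hd720  = pixels(1280, 720)     # 921,600
hd1080 = pixels(1920, 1080)    # 2,073,600
full43 = pixels(1920, 1440)    # 2,764,800 (the 4:3 "square-law" case)

print(full43 / sd)       # 9.0  -> 9x the pixels of 640x480
print(hd720 / sd)        # 3.0  -> 720p is "only" 3x SD
print(hd1080 / full43)   # 0.75 -> 1080p has 25% fewer pixels than 1920x1440
```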
Latte is by no means related to Flipper or Hollywood; it's a customized Radeon R700-series GPU. Latte and Expresso share the same package, so it sounds like some nice data is going between them... and very fast.
Latte is equipped with 320 stream processors (based on die teardowns) at ~549.9 MHz, for roughly 352 GFLOPS.
16 TMUs and 8 ROPs, for 8.8 Gtexel/s and 4.4 Gpixel/s.
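The same back-of-the-envelope math behind those figures; the stream processor count comes from die-shot analysis and the ~550 MHz clock is the commonly quoted figure, so treat both as assumptions rather than official specs:

```python
# Theoretical throughput for the assumed Latte configuration.
clock_ghz = 0.5499           # ~549.9 MHz
sps, tmus, rops = 320, 16, 8

gflops  = sps * 2 * clock_ghz    # 2 FLOPs per stream processor per cycle (multiply-add)
gtexels = tmus * clock_ghz       # one texel per TMU per cycle
gpixels = rops * clock_ghz       # one pixel per ROP per cycle

print(round(gflops), round(gtexels, 1), round(gpixels, 1))   # 352 8.8 4.4
```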
Wait a minute... doesn't the PS3 have the SAME pixel fillrate and a higher texel fillrate? 352 GFLOPS isn't even twice Xenos, the GPU of the Xbox 360. What is this... garbage? For one, Latte suffers from far fewer limitations on data types and operands, offers fuller functionality, and in the end achieves better hardware utilization.

Let's start with the most obvious: 4.4 Gpixel/s is within the range of the last-gen consoles (excluding the Wii). No improvement, so what's the deal? Consumer graphics cards generally complete the 3DMark Vantage fillrate test at roughly 1/3 of their theoretical fillrate. There is a strong correlation between framebuffer bandwidth and fragment operations; in many cases you run out of memory bandwidth before you ever peak at the card's rated pixel fillrate. In other words, the power of Latte is entirely dependent on the bandwidth of the 32 MB eDRAM. If it provides less than 64 GB/s then it has failed; the magic number is between 64 GB/s and 96 GB/s of sustained bandwidth. (Note: the Xbox 360's eDRAM daughter die has 256 GB/s of internal bandwidth but must cross a bus providing only 32 GB/s.) If Latte's eDRAM falls within this range, then running games at 1080p would prove very little challenge for the hardware (which doesn't seem to be the case, though). Within that range, other hardware limits would start to show themselves before you hit the brick wall of becoming framebuffer-bandwidth bound. Most games are rarely texel-fillrate bound; put another way, the number of TMUs generally has a negligible performance impact, more or less so on a console (texture operations get faster, or carry a lower penalty, when you do lean on them).

Now that we've exposed Latte a bit, you can't help but wonder why some games still don't run at native 1080p. There are three likely reasons (and more). The first is that the eDRAM bandwidth falls short of the sweet spot, or sits at its lower end. The second is that the console is near its pixel fillrate limit. The last is that there aren't enough ALUs to perform the desired lighting calculations, i.e. the game is shader-limited. If you've hit the bandwidth limit you should seriously consider a tiled renderer instead of a traditional forward/deferred engine, and if you hit the pixel fill limit even with tiled rendering, then it's time for Expresso to save the day... assuming you didn't tie the triplets down to keep them from their lady. Expresso has four things going for it: faster clock speeds, a triple-core architecture, bandwidth, and faster memory access times. Higher processor utilization without the awful burden of vertex transformations; a roughly fivefold increase in processor performance, free of vertex work, plus the lighting savings of tiled rendering. By no means is this a dirty trick; the Cell microprocessor knows it all too well, and is in fact envious, wondering why its owners paired it with such a limiting GPU. "Sure, I'll calculate that with a 32-bit float! Give me four clock cycles," or "Ooh, a texture fetch!" (forgets it was doing shader processing and goes after texels). I made those up, but the point remains: the RSX is nowhere near as robust as the GPU found in the Xbox 360.
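Back to the bandwidth point: here's a rough illustration, in Python, of why that 64-96 GB/s window isn't arbitrary. The per-pixel byte counts are assumptions for the sake of the estimate, and real workloads add texture fetches and post-processing on top:

```python
# Bandwidth needed just to sustain Latte's peak pixel rate with blending and
# depth testing: RGBA8 colour read + write plus 32-bit depth/stencil read +
# write per pixel (illustrative assumptions, no compression).
peak_pixels_per_s = 4.4e9
bytes_per_pixel   = 2 * 4 + 2 * 4     # 16 bytes of framebuffer traffic per pixel

required_gb_s = peak_pixels_per_s * bytes_per_pixel / 1e9
print(required_gb_s)   # 70.4 GB/s, squarely inside the 64-96 GB/s range above
```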
Latte is a refined GPU, but her capabilities can only be judged by the sustained bandwidth to the eDRAM. Last but not least are the four 512 MB, 16-bit DDR3-1600 Hynix chips, split into two 1 GB regions. I won't explain much here because I haven't found anything special about this memory. 12.8 GB/s of main memory bandwidth is low by today's standards, but Expresso could undoubtedly use close to the rated transfer rate, and a couple of years ago this would have been perfectly respectable system memory.
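That 12.8 GB/s figure falls straight out of the bus width and transfer rate; a one-liner's worth of Python, assuming the four chips sit on a single 64-bit channel:

```python
# Main-memory bandwidth of the assumed configuration: four 16-bit DDR3-1600
# chips form a 64-bit bus moving 1600 megatransfers per second.
bus_bytes   = 4 * 16 // 8     # 8 bytes per transfer
transfers_s = 1600e6
print(bus_bytes * transfers_s / 1e9)   # 12.8 GB/s
```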
Part III: How does the Wii U stack up against the PS3 and Xbox 360?
The PS3 has by far the most powerful CPU solution in a console (part of why it was so expensive); in terms of raw floating-point throughput it still outperforms most CPUs found in commercial desktops. The Cell microprocessor: a master PPE, 6 slave SPEs, 1 retired SPE, and 1 I-do-what-the-fuck-I-want SPE. Computationally it's powerful; we're probably two more generations away from seeing a console with a CPU solution this strong (possibly one generation, if the current-gen console life span runs longer than 6 years). The Xbox 360, on the other hand, features a lame-duck king called Xenon. Certainly bright, but strict about in-order execution: "Can I do this..." "NO! Do this first, THEN do THAT!" Each Xenon core can run two threads concurrently; I won't confirm nor deny it, but it's said the design is such that when two threads are active the core effectively runs each at half speed. In terms of GFLOPS, Cell > Xenon > Expresso. Whoa, time out! Expresso is down in the dumps! We'll explain this later, but let's move on to the GPUs.
We already discussed the young, forgetful RSX (Reality Synthesizer). Not only does this GPU have less computational performance than Xenos, it's also bandwidth starved, which might explain some of his invasive nature. On top of that come hardware limitations, such as the inability to execute shader math and a texture fetch at the same time; another is that floating-point operations are heavily discouraged on this GPU... I'd need to reread the GeForce 7 series whitepaper and programming guides to give concrete examples. Brother to Xenon is the GPU Xenos, a kind, hardworking scholar with a daughter whose name he refuses to speak. They work together as a pair. The 10 MB daughter die contains the functionality for Z-buffering, anti-aliasing, and some other non-texture work, essentially framebuffer operations: a die with 256 GB/s of internal bandwidth, bursting compressed data across a 32 GB/s bus to Xenos. Impressive; this scholar is a lot less picky about data types and is just plain smarter than the young 'un. In terms of GFLOPS,
Latte > Xenos > RSX. We all came to this conclusion when developers praised the Wii U GPU, so it's no surprise.
Let's explain the logic behind Expresso. The Expresso team is rather special... and not in a bad way. I've already stressed how the CPU was used for vertex transformations and primitive setup alongside custom lighting, all while running traditional game code. Thus the limiting factor on a game's polygon count became how many triangles the CPU could push through the hardware, meaning the GPU's maximum triangle rate more often than not (if not always) far exceeded what developers could actually feed it. It also means Gekko and its descendants carry genuinely useful SIMD and vector instructions (some call the SIMD weak, since it's a paired-single extension rather than full AltiVec). Add to that the fact that Expresso is FAR more likely to reach its peak potential thanks to sensible hardware design. Ports run slower partly because the Wii U simply isn't an Xbox 360 or a PS3; in fact, odds are that porting Xbox 360-style code to the PS3, and vice versa, would cause significant slowdowns too, more so than what's observable on the Wii U. Last but not least, Expresso has a large amount of eDRAM, which favors multithreaded processing and reduces memory latency. We'll rest the case there for now and begin our next discussion.
Part IV: <Insert Here>
The GCN was built from the ground up to reach its potential, and both successors have kept that faith. Expresso's biggest advantage over the current-gen processors is its high IPC. For a given clock cycle, per core, Expresso outperforms the CPUs of both the Xbox One and the PS4. Being a lean RISC design, its instruction latencies are generally lower than those of the competing consoles' processors (some instructions on AMD's Bulldozer family have latencies greater than 32 ns). Obviously, no matter how you look at it, the three brothers cannot compete with the almighty Bobcat family and its litter of eight. What is certain, though, is that Expresso is consistent; it will always run at high utilization, one might even say on its worst days. The GPU comparison isn't even fair, really... but it comes back to Latte's eDRAM. You've got OpenGL 3.x-class hardware vs. OpenGL 4.x-class hardware; 0.35 TFLOPS vs. 1.31 TFLOPS vs. 1.84 TFLOPS. We haven't even gotten into pixel and texel fillrate, though the former is less relevant. How effective the Xbox One's eSRAM bandwidth really is lies beyond this article's scope, but it requires careful optimization and even then isn't guaranteed to match the PS4's bandwidth. As things stand, the PS4 holds the bandwidth advantage for its APU. But wouldn't that be bad? When GPUs touch their main memory the latency is atrocious! Well, it's not GDDR5's fault; GPUs are simply built to tolerate latency and use that bandwidth to its fullest, and once everything is put into perspective, DDR3 and GDDR5 have roughly the same latency. With Sony's PS4 holding the bandwidth advantage, the Xbox One's GPU is already at a clear loss; how much better it is than Latte depends on how the 32 MB eSRAM behaves, which ultimately determines how much of its bandwidth is "real". We cannot stress enough the disadvantage Latte is at. That 32 MB of eDRAM is what determines whether she can hang with the big boys or back down like a dog.
Last but not least, the Wii U is a high-performance console in one particular sense: it will reach its peak performance more readily than any console of the previous or current generation. Every extra mile pushed into its already superb hardware balance is key to determining the gap between the Wii U and the Xbox One. Not to sound biased, but the plain truth is that the PS4 is the most powerful current-gen system, while the Wii U will keep its predecessors' crown as the most efficient current-gen system. The Xbox One? Use the power of the cloud and cheat its way to 4x performance to take the crown [laughs]. Not really, but who knows?
Wii U addendum.
Many of us go to Wikipedia as a starting point for research. When it comes to the Wii U's GPU, the article there is disorganized and relatively inaccurate. The information here may also be inaccurate, in the sense that most of it is not confirmed by Nintendo, but compare the topics discussed here against Wikipedia and judge which is more "believable".
This addendum is more or less about the Wii U's GPU, Latte. Forward rendering has been getting a lot of attention, you might even say a revival, with the Forward+ / tiled forward rendering technique. We're going to be running AMD's Tiled Lighting demo, which you can get here:
http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
My graphics card is an AMD Radeon HD 7770 clocked at 1.2 GHz (core) and 1.4 GHz (RAM). I can actually clock it to 1.25 GHz, faster than the processor of the Wii U (of course, certain applications become unstable, so it doesn't necessarily count).
AMD's Graphics Core Next (GCN) architecture is different from that of its predecessors. Because of this, the performance achieved here may not be a good representation despite similar clock rates.
The goal will be to underclock the GPU to a reasonable rate.
Specs: AMD Radeon HD 7770 @ 300 MHz core & 1000 MHz RAM
640 SPs
16 ROPs
40 TMUs
Statistics:
Processing Power | 384.0 GFLOPS
Pixel Fillrate   | 4.8 Gpixel/s
Texel Fillrate   | 12.0 Gtexel/s
Memory Bandwidth | 64.0 GB/s
Max TDP          | 40 W
Max Geometry Assembly Rate: 5 million triangles per frame @ 60 frames per second (lower than Latte's)
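The numbers above follow from the same throughput arithmetic used earlier; a quick check in Python (the 128-bit bus width and GDDR5's four data transfers per memory clock are assumptions about this particular card):

```python
# Sanity-check of the underclocked HD 7770 figures listed above.
core_ghz, mem_mhz = 0.300, 1000
sps, rops, tmus, bus_bits = 640, 16, 40, 128

print(sps * 2 * core_ghz)                  # 384.0 GFLOPS
print(rops * core_ghz)                     # 4.8   Gpixel/s
print(tmus * core_ghz)                     # 12.0  Gtexel/s
print(bus_bits // 8 * mem_mhz * 4 / 1000)  # 64.0  GB/s (GDDR5, 4 transfers per clock)
```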
Findings:
The shadows rendered in this demo are a frame-rate killer (too accurate), so we'll disable them. Transparent objects are also disabled. My monitor is 1680x1050, so we're down about 15% in pixel area compared to 1080p. With maximum triangle density and grid objects we can achieve roughly 1.25 million polygons for the scene. Due to bandwidth constraints I turned off anti-aliasing, which is only available as 4x MSAA; considering the frame-rate penalty, we could probably have gotten away with 2x at a negligible cost. With the shadow-caster lights we managed to keep our frame time under 16 ms, and with the random lights we're able to render roughly 640 lights. Scaling up to a full 1080p pixel count, we'd expect roughly a 15% increase in frame time, resulting in about 54.3 fps (18.4 ms). Upon initial review it might seem a little ridiculous how well our mock Latte runs this demo.
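For reference, the frame-time arithmetic behind that 54.3 fps estimate; the 15% penalty for the missing pixels is the estimate above, not a measurement at 1080p:

```python
# Frame-time bookkeeping: fps and milliseconds per frame are reciprocals,
# so a 15% frame-time penalty on 16 ms lands at 18.4 ms.
def ms_to_fps(ms):
    return 1000.0 / ms

base_ms   = 16.0
scaled_ms = base_ms * 1.15   # assumed ~15% cost of the extra 1080p pixels
print(round(scaled_ms, 1), round(ms_to_fps(scaled_ms), 1))   # 18.4 54.3
```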
However, because this is a GPU demo it uses practically no CPU power; in fact it would likely still run well even if I underclocked my CPU by a factor of four. The number of draw calls in this application is incredibly low as well. Here's a more likely scenario, or at least how I would do it.
Perform culling on Core 1 of Expresso (Core 1 has more L2 cache than Cores 0 and 2), break up the geometry, and issue more draw calls. Due to the larger cache, Core 1 is better suited to "complicated" tasks, and many of them. Issuing more draw calls CAN improve the GPU's performance (smaller VBO/VAO sizes). With that in mind, AI should also run on the same core, leaving physics and software lighting to the other two cores; a sketch of that split follows below. A game rendering potentially one million polygons at 60 fps is quite an easy feat for the Wii U. The console does have fillrate issues, which is no surprise. For a tech demo targeted at a Radeon HD 7970, the fact it scales down this well is impressive.
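A hypothetical sketch of that work split in Python; the job names and the "cache-hungry work goes to Core 1" rule are purely illustrative, not any real engine or SDK API:

```python
# Hypothetical per-frame job assignment across Expresso's three cores:
# core 1 (the big-L2 core) takes the cache-hungry jobs (culling, draw-call
# building, AI); cores 0 and 2 split physics and software lighting.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cache_hungry: bool

def assign_jobs(jobs):
    cores = {0: [], 1: [], 2: []}
    small_cores = [0, 2]
    for job in jobs:
        if job.cache_hungry:
            cores[1].append(job.name)                       # big-L2 core
        else:
            target = min(small_cores, key=lambda c: len(cores[c]))
            cores[target].append(job.name)                  # balance the rest
    return cores

frame = [Job("frustum culling", True), Job("draw-call building", True),
         Job("AI", True), Job("physics", False), Job("software lighting", False)]
print(assign_jobs(frame))
# {0: ['physics'], 1: ['frustum culling', 'draw-call building', 'AI'], 2: ['software lighting']}
```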
Aside from moving some work to the CPU, we've got an optimized graphics API (GX2) and, interestingly, the TeraScale architecture itself. The RV700-series design would, in the best case, allow the GPU to achieve 704 GFLOPS; compared to the GCN architecture, there are four times as many ALUs as there are SIMD units. Ultimately, running computationally complex shader programs (heavy on arithmetic, not flow control) will be easy on the Wii U, but it will require careful optimization for data that isn't well suited to the hardware (ideally sets of four values, one per ALU, e.g. RGBA color manipulation). When it comes to well-designed shaders the GPU can hold its own, but right now, at native 1080p, Latte is leaving a lot of her potential on the table, so most pre-existing engines not designed around Full HD rendering won't hit it on this console. This is evident in Donkey Kong Country: Tropical Freeze, which only managed to scale an essentially 480p-era game up to 720p, whereas Mario Kart 8 looks just as good (IMO better) at 1080p.
Here's something interesting: OpenGL 3.x-class hardware can perform instancing, which lets a developer render a field of trees with only a few draw calls. Using a Geeks3D instancing demo, we're able to realize 8 million triangles at 28 fps rendering windowed at 720p. Adding 2x MSAA costs only about 2 ms (26 fps), and increasing the GPU memory bandwidth from 64 GB/s to 96 GB/s gives 38 fps without AA; with 4x MSAA we see only about a 1 ms difference in frame time (36 fps). What if the application were optimized for the Wii U's ALUs? Clocking the GPU at 400 MHz gave no appreciable difference in frame time, still 38 fps despite 512 GFLOPS, and lowering the memory bandwidth (to 24 GB/s) had no appreciable impact either.
NOTE: [The GPU raised its own core clock during the high-bandwidth tests. That boost to 400 MHz is what produced the increase in performance.]
*Another interesting data point is Geeks3D's OpenGL 2.1 HDR, DoF and Radial Blur demo. At 64 GB/s we get 60 fps, and once we jump to 96 GB/s the hardware barely renders past 61 fps. Activating 8x MSAA, we're still rendering at 60 fps; this sort of bandwidth headroom shows how nearly "free" anti-aliasing can be at 720p. Raising the resolution to 1680x1050 we see a small drop with 8x MSAA: 58 fps as opposed to 60, a small but measurable increase in frame time.
*To keep the comparisons fair, the core clock for these tests is 400 MHz, which keeps the card within the Wii U's plausible GFLOPS range.
We were not limited by pixel fillrate during any of these tests, which suggests we were shader/ALU limited; in other words, to match them the Wii U would need to supply something close to the 512 GFLOPS these tests effectively used. My GPU at my personal overclocked frequencies would be roughly equivalent to the Xbox One's GPU (take away the bandwidth difference and it'd be roughly 1.4x the power). Various tests would place the Xbox One's GPU at BEST at roughly 2.5x Latte's rendering performance (and realize we're comparing apples to oranges). With the tiled forward demo it's also possible our geometry assembly was hurting the frame rate by up to 33% (to verify this we'd need to run a GPU with a lower SP count at a frequency chosen to match the GFLOPS my GPU operated at).
This will be the last set of conclusions and notes. The gap between the Wii U and the Xbox One, versus the gap between the Xbox One and the PS4: I estimate the former at no more than about a 57% difference. Why? The GFLOPS difference between the Wii U and the Xbox One is at WORST about 960 GFLOPS. **However, AMD's VLIW architecture excels at raw ALU work, which can be a big win and in the best case shrinks that figure to about 810 GFLOPS. The pixel and texel fillrates are sure losses for Nintendo's hardware and there is no overcoming them; then again, the PS4 has almost double the Xbox One's pixel fillrate. The total pixel-fillrate difference is larger between the Xbox One and the PS4 than between the Xbox One and the Wii U, and this is already evident in cross-platform games on the two competing consoles.

Memory architecture is another factor. The PS4 has 176 GB/s of unified memory, while the Xbox One has 68 GB/s of main memory plus 102 GB/s (204 GB/s counting both directions) of GPU eSRAM. To keep up with the PS4's bandwidth, developers will need tricks and sacrifices when designing or porting their games. Either console thwarts the Wii U on paper, but the memory the Wii U provides operates at a lower latency, so memory accesses will undoubtedly cost more on the PS4 and Xbox One than on the Wii U.

Then you've got the fact that AMD's Jaguar cores have (relatively) low IPC. Across all 8 cores, the Xbox One and PS4 land at around 60 GFLOPS, and that conjecture is based on a synthetic test. The Wii U CPU, being a lean RISC design, will execute most instructions in less time than either console's CPU, meaning superior single-threaded performance; multithreaded applications are the future, but there is overhead associated with them. The Xbox 360 probably had the best overall single-core performance and thus could handle more draw calls than any other mainstream game console (Direct3D 9 draw calls were bloated, so it may not be evident). There is a strong possibility the Wii U can keep up with, and possibly exceed, the number of draw calls used by the other two current-gen consoles. That in turn means support for a large number of polygons and ultimately gives graphic designers a lot of creative freedom. The Wii U's performance will be consistent and its efficiency high. With that said, the CPUs of the last-gen consoles were incredibly powerful CPU solutions, while the current consoles pair weaker CPUs with more powerful GPUs.
**This resulted from a misinterpretation of how the stream processors are counted. 320 stream processors with 4 ALUs each would theoretically give at least 4x the GFLOPS. In reality it is (I think) 80 or 64 VLIW SIMD units with 4 or 5 ALUs each, and those ALUs are what add up to the 320-400 stream processors in the first place.
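For reference, the arithmetic behind the gap figures quoted above and below, using this article's own TFLOPS numbers (the ~500 GFLOPS "effective" figure for Latte is the best-case VLIW argument, which the footnote above partly walks back):

```python
# Raw GFLOPS gaps quoted in this section, using the article's figures.
wii_u, xbox_one, ps4 = 350, 1310, 1840

print(xbox_one - wii_u)   # 960 GFLOPS: the "at worst" Wii U vs. Xbox One gap
print(ps4 - xbox_one)     # 530 GFLOPS: the "about 500" PS4 vs. Xbox One gap
print(xbox_one - 500)     # 810 GFLOPS: best case if Latte behaves like ~500 effective GFLOPS
```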
The ultimate truth is that there will be a discernible gap between each console. The PS4 and Xbox One are frequently lumped together, but the PS4 has a sizeable GFLOPS advantage over the Xbox One (about 500 GFLOPS, apples to apples) and an even bigger lead in pixel fillrate: roughly 12.4 Gpixel/s of difference, about 4 Gpixel/s more than the Wii U vs. Xbox One gap. Unlike last generation, the difference in quality between the PS4 and the Xbox One will be as clear as night and day. The difference between the Wii U and the Xbox One can be described as roughly 2.5x, while the difference between the Xbox One and the PS4 might be described as 2.0x. When a Sony fan brags, understand that they have the right; the Xbox One will be left behind cleaning up the massive pile of pixels spilling from the PS4. Because of how efficient Nintendo's console is, the gap with the Xbox One comes down to finer details. Best case scenario, the Xbox One ends up slapped right in the middle between the PS4 and the Wii U (good news for Nintendo fans, bad news for Microsoft).