Part I: Expresso and his parents.
The grandfather, Gekko, was introduced to the world inside the GameCube (GCN) in 2001. The Gekko microprocessor is a heavily modified PowerPC 750CXe, a RISC CPU, and the most notable modification is the addition of arithmetic instructions geared toward 3D graphics (the paired-single SIMD extensions). With 32 64-bit FPRs and 32 GPRs, Gekko can churn through plenty of math, not forgetting its two 32-bit integer units and a single 64-bit FPU (which, in paired-single mode, behaves like two 32-bit FPUs). Add another FPU and imagine what this processor could do! Moving on.
Gekko was more than a CPU; it was a GPCPU. Sure, I made that term up and I'm exaggerating a bit, but the point is that Gekko works alongside a fixed-function GPU/northbridge combo called Flipper. A lot of enthusiasts will say "it's not fixed function, it can do shaders" and so on and so forth. Simple fact is no, it cannot, though Flipper does have TEV stages, which we'll discuss in a bit. Because of Flipper's limitations, Gekko was often tasked with vertex transformations and sometimes parts of primitive setup (which somewhat contradicts the statement ahead). Beyond that you're stuck with Flipper's fixed-function T&L unit and its 8 hardware lights, in line with the OpenGL 1.x specification. As a result, most user-defined, custom lighting was performed on the Gekko microprocessor; a typical example is local/finite lighting. Gekko is tightly coupled to the Flipper chip, so much so that it even has to go through Flipper to reach main memory (24 megabytes of 1T-SRAM).
Onto basic specs: Gekko is rated at roughly 1.9 GFLOPS. With its short 4-stage pipeline, Gekko can fetch up to 4 instructions per cycle and retire 2.
Processor clock * FLOPs per instruction * floating-point lanes = GFLOPS
486 MHz * 2 * 2 = 1.944 GFLOPS (and 1.944 GIPS)
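As a quick sanity check of that formula, here's a minimal sketch in Python; the clock speeds and the 4-FLOPs-per-cycle assumption (a paired-single multiply-add, folding the "2 * 2" above together) are this article's figures, not official documentation:

```python
# Peak-throughput sketch: clock (MHz) * FLOPs per core per cycle * cores.
# The 4 FLOPs/cycle figure assumes a paired-single multiply-add
# (two lanes, two ops each); treat all numbers as assumptions.
def peak_gflops(clock_mhz, flops_per_cycle=4, cores=1):
    return clock_mhz * flops_per_cycle * cores / 1000.0

print(peak_gflops(486))            # Gekko    -> 1.944 GFLOPS
print(peak_gflops(729))            # Broadway -> 2.916 GFLOPS
print(peak_gflops(1243, cores=3))  # Expresso -> 14.916 GFLOPS
```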
An unsurprisingly low number, but it more than gets the job done. Since Gekko and Flipper are best buds, let's talk about Flipper briefly, even though it's somewhat beyond the scope of this article. Flipper is clocked at the speed of Gekko's front-side bus, 162 MHz. Flipper should also be considered the northbridge, since it handles audio (an integrated audio DSP), does I/O, and is Gekko's route to main memory. The actual Flipper graphics core is, in two words, fixed function: 4 pixel pipelines with 1 texture unit each, giving us 648 Mpixel/s and 648 Mtexel/s. For a 640x480 24-bit framebuffer (RGB8 or RGBA6) this is fine. Most of the confusion regarding Flipper is about its programmability, and the source of it is the 16 TEV stages built into the graphics chip. Their functionality can be compared to NVIDIA's register combiners. NVIDIA's implementation is essentially "you have a handful of inputs that you combine into a few distinct outputs"; how you decide to combine those inputs, and which operations you perform to get your output, is the programmability.
thakis@users.sourceforge.net claims each TEV stage takes 4 inputs and produces 1 output.
The operations allowed by both NV_register_combiners and GX's TEVs are not arbitrary, so there are still limits on what the user can do. Any additional functionality the TEV offers is beyond me, but increasing the number of TEV stages has a cost: much like adding extra passes, each extra stage adds per-pixel work, so fill rate drops as the stage count climbs. Beyond my knowledge, but this is the basis of so-called GCN/Wii shaders: GX TEV stages.
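To make the comparison concrete, here's a toy model of one combiner-style stage in Python. The blend formula (output = d + (1 - c) * a + c * b, optionally biased and scaled) is how the GX TEV color op is usually described in homebrew documentation; treat the exact form as an assumption. The point is simply "four inputs, a fixed operation set, one output":

```python
# Toy model of one TEV/register-combiner style stage: four RGB inputs,
# a fixed blend operation, one output. Chaining 16 of these is the
# GameCube/Wii notion of a "shader".
def tev_stage(a, b, c, d, bias=0.0, scale=1.0):
    # Lerp a and b by c, add d, then bias/scale, clamping to [0, 1]
    # as a fixed-point pipeline would.
    lerp = tuple((1.0 - cc) * aa + cc * bb for aa, bb, cc in zip(a, b, c))
    return tuple(min(max((l + dd + bias) * scale, 0.0), 1.0)
                 for l, dd in zip(lerp, d))

# Example: modulate a texture sample by a vertex colour, a classic one-stage setup.
tex, vtx, black = (0.8, 0.6, 0.4), (0.5, 0.5, 0.5), (0.0, 0.0, 0.0)
print(tev_stage(black, tex, vtx, black))   # -> (0.4, 0.3, 0.2)
```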
Broadway, the CPU of the Wii, is not much different. Information regarding Broadway is sparse. The most notable differences are a smaller fabrication process, lower power consumption, a higher clock rate, and additional registers. With the publicly available information, Broadway looks like a joke of an improvement, and Gekko is much disappointed in this child. Since the Wii is an overall system improvement (from CPU to RAM to GPU), its performance is arguably, but justifiably, pegged at roughly twice that of the GameCube.
Part II: Expresso
Crying in shame, Broadway then had three inseparable triplet sons who call themselves Expresso. Expresso is still a derivative of Gekko: roughly 70% faster per core, with the combined intellect of three. A triple-core processor. Clocked at 1.24 GHz, Expresso has a theoretical rating of 14.9 GFLOPS, an outstanding jump from Broadway's 2.9 GFLOPS. Compared to modern x86 microprocessors this is insanely weak, but let's put it into perspective. The Wii U architecture is very different from the GCN/Wii. On the processor die is a very large eDRAM cache which undoubtedly has a lot of bandwidth. Bandwidth shouldn't have been an issue before, but with a multi-core processor things start to change. With this extra layer of the memory hierarchy, the time the CPU sits idle is lessened, further improving Expresso's utilization over its predecessors. Furthermore, look at Gekko's old duties: it needed to perform vertex transformations, and developers would often perform custom lighting effects on this chip, alongside physics, AI, and so on.
In the HD realm we're dealing with resolutions up to roughly 3 times as large in each dimension, and the number of pixels grows with the square of that, so 1920x1440 has 9 times the pixels of 640x480 and demands roughly 9 times the pixel processing. When it comes to games rendering natively at 1280x720, Expresso beats this curve. At 1920x1080, however, Expresso falls short, even though that resolution has 25% fewer pixels than the 4:3 1920x1440 case. The shortfall more than balances itself out once you remember that Expresso no longer needs to do costly vertex transformations. With techniques such as tiled rendering it's quite possible to keep using Expresso the way Gekko was used all these years, but why would we want to do that? This is where we move on to the Wii U's GPU.
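A quick worked example of that pixel-count scaling, just arithmetic on the resolutions named above:

```python
# Pixel-count ratios between the resolutions discussed above.
def pixels(w, h):
    return w * h

sd     = pixels(640, 480)      # 307,200
hd720  = pixels(1280, 720)     # 921,600
hd1080 = pixels(1920, 1080)    # 2,073,600
full43 = pixels(1920, 1440)    # 2,764,800 (the 4:3 "square-law" case)

print(full43 / sd)       # 9.0  -> 9x the pixels of 640x480
print(hd720 / sd)        # 3.0  -> 720p is "only" 3x SD
print(hd1080 / full43)   # 0.75 -> 1080p has 25% fewer pixels than 1920x1440
```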
Latte is by no means related to Flipper or Hollywood; it's a customized Radeon R700-series GPU. Latte and Expresso share the same package, so it sounds like some nice data is going between them... and very fast.
Latte is equipped with 320 stream processors (based on die teardowns) at ~549.9 MHz, for roughly 352 GFLOPS.
16 TMUs and 8 ROPs, for 8.8 Gtexel/s and 4.4 Gpixel/s.
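The same back-of-the-envelope math behind those figures; the stream processor count comes from die-shot analysis and the ~550 MHz clock is the commonly quoted figure, so treat both as assumptions rather than official specs:

```python
# Theoretical throughput for the assumed Latte configuration.
clock_ghz = 0.5499           # ~549.9 MHz
sps, tmus, rops = 320, 16, 8

gflops  = sps * 2 * clock_ghz    # 2 FLOPs per stream processor per cycle (multiply-add)
gtexels = tmus * clock_ghz       # one texel per TMU per cycle
gpixels = rops * clock_ghz       # one pixel per ROP per cycle

print(round(gflops), round(gtexels, 1), round(gpixels, 1))   # 352 8.8 4.4
```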
Wait a minute... doesn't the PS3 have the SAME pixel fillrate and a higher texel fillrate? 352 GFLOPS isn't even twice Xenos, the GPU of the Xbox 360. What is this... garbage? For one, Latte suffers from far fewer limitations on data types and operands, offers fuller functionality, and in the end achieves better hardware utilization.

Let's start with the most obvious: 4.4 Gpixel/s is within the range of the last-gen consoles (excluding the Wii). No improvement, so what's the deal? Consumer graphics cards generally complete the 3DMark Vantage fillrate test at roughly 1/3 of their theoretical fillrate. There is a strong correlation between framebuffer bandwidth and fragment operations; in many cases you run out of memory bandwidth before you ever peak at the card's rated pixel fillrate. In other words, the power of Latte is entirely dependent on the bandwidth of the 32 MB eDRAM. If it provides less than 64 GB/s then it has failed; the magic number is between 64 GB/s and 96 GB/s of sustained bandwidth. (Note: the Xbox 360's eDRAM daughter die has 256 GB/s of internal bandwidth but must cross a bus providing only 32 GB/s.) If Latte's eDRAM falls within this range, then running games at 1080p would prove very little challenge for the hardware (which doesn't seem to be the case, though). Within that range, other hardware limits would start to show themselves before you hit the brick wall of becoming framebuffer-bandwidth bound. Most games are rarely texel-fillrate bound; put another way, the number of TMUs generally has a negligible performance impact, more or less so on a console (texture operations get faster, or carry a lower penalty, when you do lean on them).

Now that we've exposed Latte a bit, you can't help but wonder why some games still don't run at native 1080p. There are three likely reasons (and more). The first is that the eDRAM bandwidth falls short of the sweet spot, or sits at its lower end. The second is that the console is near its pixel fillrate limit. The last is that there aren't enough ALUs to perform the desired lighting calculations, i.e. the game is shader-limited. If you've hit the bandwidth limit you should seriously consider a tiled renderer instead of a traditional forward/deferred engine, and if you hit the pixel fill limit even with tiled rendering, then it's time for Expresso to save the day... assuming you didn't tie the triplets down to keep them from their lady. Expresso has four things going for it: faster clock speeds, a triple-core architecture, bandwidth, and faster memory access times. Higher processor utilization without the awful burden of vertex transformations; a roughly fivefold increase in processor performance, free of vertex work, plus the lighting savings of tiled rendering. By no means is this a dirty trick; the Cell microprocessor knows it all too well, and is in fact envious, wondering why its owners paired it with such a limiting GPU. "Sure, I'll calculate that with a 32-bit float! Give me four clock cycles," or "Ooh, a texture fetch!" (forgets it was doing shader processing and goes after texels). I made those up, but the point remains: the RSX is nowhere near as robust as the GPU found in the Xbox 360.
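Back to the bandwidth point: here's a rough illustration, in Python, of why that 64-96 GB/s window isn't arbitrary. The per-pixel byte counts are assumptions for the sake of the estimate, and real workloads add texture fetches and post-processing on top:

```python
# Bandwidth needed just to sustain Latte's peak pixel rate with blending and
# depth testing: RGBA8 colour read + write plus 32-bit depth/stencil read +
# write per pixel (illustrative assumptions, no compression).
peak_pixels_per_s = 4.4e9
bytes_per_pixel   = 2 * 4 + 2 * 4     # 16 bytes of framebuffer traffic per pixel

required_gb_s = peak_pixels_per_s * bytes_per_pixel / 1e9
print(required_gb_s)   # 70.4 GB/s, squarely inside the 64-96 GB/s range above
```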
Latte is a refined GPU, but her capabilities can only be judged by the sustained bandwidth to the eDRAM. Last but not least are the four 512 MB, 16-bit DDR3-1600 Hynix chips, split into two 1 GB regions. I won't explain much here because I haven't found anything special about this memory. 12.8 GB/s of main memory bandwidth is low by today's standards, but Expresso could undoubtedly use close to the rated transfer rate, and a couple of years ago this would have been perfectly respectable system memory.
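That 12.8 GB/s figure falls straight out of the bus width and transfer rate; a one-liner's worth of Python, assuming the four chips sit on a single 64-bit channel:

```python
# Main-memory bandwidth of the assumed configuration: four 16-bit DDR3-1600
# chips form a 64-bit bus moving 1600 megatransfers per second.
bus_bytes   = 4 * 16 // 8     # 8 bytes per transfer
transfers_s = 1600e6
print(bus_bytes * transfers_s / 1e9)   # 12.8 GB/s
```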
Part III: How does the Wii U stack up against the PS3 and Xbox 360?
The PS3 has by far the most powerful CPU solution in a console (part of why it was so expensive); in terms of raw floating-point throughput it still outperforms most CPUs found in commercial desktops. The Cell microprocessor: a master PPE, 6 slave SPEs, 1 retired SPE, and 1 I-do-what-the-fuck-I-want SPE. Computationally it's powerful; we're probably two more generations away from seeing a console with a CPU solution this strong (possibly one generation, if the current-gen console life span runs longer than 6 years). The Xbox 360, on the other hand, features a lame-duck king called Xenon. Certainly bright, but strict about in-order execution: "Can I do this..." "NO! Do this first, THEN do THAT!" Each Xenon core can run two threads concurrently; I won't confirm nor deny it, but it's said the design is such that when two threads are active the core effectively runs each at half speed. In terms of GFLOPS, Cell > Xenon > Expresso. Whoa, time out! Expresso is down in the dumps! We'll explain this later, but let's move on to the GPUs.
We already discussed the young, forgetful RSX (Reality Synthesizer). Not only does this GPU have less computational performance than Xenos, it's also bandwidth starved, which might explain some of his invasive nature. On top of that come hardware limitations, such as the inability to execute shader math and a texture fetch at the same time; another is that floating-point operations are heavily discouraged on this GPU... I'd need to reread the GeForce 7 series whitepaper and programming guides to give concrete examples. Brother to Xenon is the GPU Xenos, a kind, hardworking scholar with a daughter whose name he refuses to speak. They work together as a pair. The 10 MB daughter die contains the functionality for Z-buffering, anti-aliasing, and some other non-texture work, essentially framebuffer operations: a die with 256 GB/s of internal bandwidth, bursting compressed data across a 32 GB/s bus to Xenos. Impressive; this scholar is a lot less picky about data types and is just plain smarter than the young 'un. In terms of GFLOPS,
Latte > Xenos > RSX. We all came to this conclusion when developers praised the Wii U GPU, so it's no surprise.
Let's explain the logic behind Expresso. The Expresso team is rather special... and not in a bad way. I've already stressed how the CPU was used for vertex transformations and primitive setup alongside custom lighting, all while running traditional game code. Thus the limiting factor on a game's polygon count became how many triangles the CPU could push through the hardware, meaning the GPU's maximum triangle rate more often than not (if not always) far exceeded what developers could actually feed it. It also means Gekko and its descendants carry genuinely useful SIMD and vector instructions (some call the SIMD weak, since it's a paired-single extension rather than full AltiVec). Add to that the fact that Expresso is FAR more likely to reach its peak potential thanks to sensible hardware design. Ports run slower partly because the Wii U simply isn't an Xbox 360 or a PS3; in fact, odds are that porting Xbox 360-style code to the PS3, and vice versa, would cause significant slowdowns too, more so than what's observable on the Wii U. Last but not least, Expresso has a large amount of eDRAM, which favors multithreaded processing and reduces memory latency. We'll rest the case there for now and begin our next discussion.
Part IV: <Insert Here>
The GCN was built from the ground up to reach its potential, and both successors have kept that faith. Expresso's biggest advantage over the current-gen processors is its high IPC. For a given clock cycle, per core, Expresso outperforms the CPUs of both the Xbox One and the PS4. Being a lean RISC design, its instruction latencies are generally lower than those of the competing consoles' processors (some instructions on AMD's Bulldozer family have latencies greater than 32 ns). Obviously, no matter how you look at it, the three brothers cannot compete with the almighty Bobcat family and its litter of eight. What is certain, though, is that Expresso is consistent; it will always run at high utilization, one might even say on its worst days. The GPU comparison isn't even fair, really... but it comes back to Latte's eDRAM. You've got OpenGL 3.x-class hardware vs. OpenGL 4.x-class hardware; 0.35 TFLOPS vs. 1.31 TFLOPS vs. 1.84 TFLOPS. We haven't even gotten into pixel and texel fillrate, though the former is less relevant. How effective the Xbox One's eSRAM bandwidth really is lies beyond this article's scope, but it requires careful optimization and even then isn't guaranteed to match the PS4's bandwidth. As things stand, the PS4 holds the bandwidth advantage for its APU. But wouldn't that be bad? When GPUs touch their main memory the latency is atrocious! Well, it's not GDDR5's fault; GPUs are simply built to tolerate latency and use that bandwidth to its fullest, and once everything is put into perspective, DDR3 and GDDR5 have roughly the same latency. With Sony's PS4 holding the bandwidth advantage, the Xbox One's GPU is already at a clear loss; how much better it is than Latte depends on how the 32 MB eSRAM behaves, which ultimately determines how much of its bandwidth is "real". We cannot stress enough the disadvantage Latte is at. That 32 MB of eDRAM is what determines whether she can hang with the big boys or back down like a dog.
Last but not least, the Wii U is a high-performance console in one particular sense: it will reach its peak performance more readily than any console of the previous or current generation. Every extra mile pushed into its already superb hardware balance is key to determining the gap between the Wii U and the Xbox One. Not to sound biased, but the plain truth is that the PS4 is the most powerful current-gen system, while the Wii U will keep its predecessors' crown as the most efficient current-gen system. The Xbox One? Use the power of the cloud and cheat its way to 4x performance to take the crown [laughs]. Not really, but who knows?
Wii U addendum.
Many of us go to Wikipedia as a starting point for research. When it comes to the Wii U's GPU, the article there is disorganized and relatively inaccurate. The information here may also be inaccurate, in the sense that most of it is not confirmed by Nintendo, but compare the topics discussed here against Wikipedia and judge which is more "believable".
This addendum is more or less about the Wii U's GPU, Latte. Forward rendering has been getting a lot of attention, you might even say a revival, with the Forward+ / tiled forward rendering technique. We're going to be running AMD's Tiled Lighting demo, which you can get here:
http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
My graphics card is an AMD Radeon HD 7770 clocked at 1.2 GHz (core) and 1.4 GHz (RAM). I can actually clock it to 1.25 GHz, faster than the processor of the Wii U (of course, certain applications become unstable, so it doesn't necessarily count).
AMD's Graphics Core Next (GCN) architecture is different from that of its predecessors. Because of this, the performance achieved here may not be a good representation despite similar clock rates.
The goal will be to underclock the GPU to a reasonable rate.
Specs: AMD Radeon HD 7770 @ 300 MHz core & 1000 MHz RAM
640 SPs
16 ROPs
40 TMUs
Statistics:
Processing Power | 384.0 GFLOPS
Pixel Fillrate   | 4.8 Gpixel/s
Texel Fillrate   | 12.0 Gtexel/s
Memory Bandwidth | 64.0 GB/s
Max TDP          | 40 W
Max Geometry Assembly Rate: 5 million triangles per frame @ 60 frames per second (lower than Latte's)
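The numbers above follow from the same throughput arithmetic used earlier; a quick check in Python (the 128-bit bus width and GDDR5's four data transfers per memory clock are assumptions about this particular card):

```python
# Sanity-check of the underclocked HD 7770 figures listed above.
core_ghz, mem_mhz = 0.300, 1000
sps, rops, tmus, bus_bits = 640, 16, 40, 128

print(sps * 2 * core_ghz)                  # 384.0 GFLOPS
print(rops * core_ghz)                     # 4.8   Gpixel/s
print(tmus * core_ghz)                     # 12.0  Gtexel/s
print(bus_bits // 8 * mem_mhz * 4 / 1000)  # 64.0  GB/s (GDDR5, 4 transfers per clock)
```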
Findings:
The shadows rendered in this demo are a frame-rate killer (too accurate), so we'll disable them. Transparent objects are also disabled. My monitor is 1680x1050, so we're down about 15% in pixel area compared to 1080p. With maximum triangle density and grid objects we can achieve roughly 1.25 million polygons for the scene. Due to bandwidth constraints I turned off anti-aliasing, which is only available as 4x MSAA; considering the frame-rate penalty, we could probably have gotten away with 2x at a negligible cost. With the shadow-caster lights we managed to keep our frame time under 16 ms, and with the random lights we're able to render roughly 640 lights. Scaling up to a full 1080p pixel count, we'd expect roughly a 15% increase in frame time, resulting in about 54.3 fps (18.4 ms). Upon initial review it might seem a little ridiculous how well our mock Latte runs this demo.
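For reference, the frame-time arithmetic behind that 54.3 fps estimate; the 15% penalty for the missing pixels is the estimate above, not a measurement at 1080p:

```python
# Frame-time bookkeeping: fps and milliseconds per frame are reciprocals,
# so a 15% frame-time penalty on 16 ms lands at 18.4 ms.
def ms_to_fps(ms):
    return 1000.0 / ms

base_ms   = 16.0
scaled_ms = base_ms * 1.15   # assumed ~15% cost of the extra 1080p pixels
print(round(scaled_ms, 1), round(ms_to_fps(scaled_ms), 1))   # 18.4 54.3
```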
However, because this is a GPU demo it uses practically no CPU power; in fact it would likely still run well even if I underclocked my CPU by a factor of four. The number of draw calls in this application is incredibly low as well. Here's a more likely scenario, or at least how I would do it.
Perform culling on Core 1 of Expresso (Core 1 has more L2 cache than Cores 0 and 2), break up the geometry, and issue more draw calls. Due to the larger cache, Core 1 is better suited to "complicated" tasks, and many of them. Issuing more draw calls CAN improve the GPU's performance (smaller VBO/VAO sizes). With that in mind, AI should also run on the same core, leaving physics and software lighting to the other two cores; a sketch of that split follows below. A game rendering potentially one million polygons at 60 fps is quite an easy feat for the Wii U. The console does have fillrate issues, which is no surprise. For a tech demo targeted at a Radeon HD 7970, the fact it scales down this well is impressive.
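A hypothetical sketch of that work split in Python; the job names and the "cache-hungry work goes to Core 1" rule are purely illustrative, not any real engine or SDK API:

```python
# Hypothetical per-frame job assignment across Expresso's three cores:
# core 1 (the big-L2 core) takes the cache-hungry jobs (culling, draw-call
# building, AI); cores 0 and 2 split physics and software lighting.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cache_hungry: bool

def assign_jobs(jobs):
    cores = {0: [], 1: [], 2: []}
    small_cores = [0, 2]
    for job in jobs:
        if job.cache_hungry:
            cores[1].append(job.name)                       # big-L2 core
        else:
            target = min(small_cores, key=lambda c: len(cores[c]))
            cores[target].append(job.name)                  # balance the rest
    return cores

frame = [Job("frustum culling", True), Job("draw-call building", True),
         Job("AI", True), Job("physics", False), Job("software lighting", False)]
print(assign_jobs(frame))
# {0: ['physics'], 1: ['frustum culling', 'draw-call building', 'AI'], 2: ['software lighting']}
```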
Aside from moving some work to the CPU, we've got an optimized graphics API (GX2) and, interestingly, the TeraScale architecture itself. The RV700-series design would, in the best case, allow the GPU to achieve 704 GFLOPS; compared to the GCN architecture, there are four times as many ALUs as there are SIMD units. Ultimately, running computationally complex shader programs (heavy on arithmetic, not flow control) will be easy on the Wii U, but it will require careful optimization for data that isn't well suited to the hardware (ideally sets of four values, one per ALU, e.g. RGBA color manipulation). When it comes to well-designed shaders the GPU can hold its own, but right now, at native 1080p, Latte is leaving a lot of her potential on the table, so most pre-existing engines not designed around Full HD rendering won't hit it on this console. This is evident in Donkey Kong Country: Tropical Freeze, which only managed to scale an essentially 480p-era game up to 720p, whereas Mario Kart 8 looks just as good (IMO better) at 1080p.
Here's something interesting: OpenGL 3.x-class hardware can perform instancing, which lets a developer render a field of trees with only a few draw calls. Using a Geeks3D instancing demo, we're able to realize 8 million triangles at 28 fps rendering windowed at 720p. Adding 2x MSAA costs only about 2 ms (26 fps), and increasing the GPU memory bandwidth from 64 GB/s to 96 GB/s gives 38 fps without AA; with 4x MSAA we see only about a 1 ms difference in frame time (36 fps). What if the application were optimized for the Wii U's ALUs? Clocking the GPU at 400 MHz gave no appreciable difference in frame time, still 38 fps despite 512 GFLOPS, and lowering the memory bandwidth (to 24 GB/s) had no appreciable impact either.
NOTE: [The GPU raised its own core clock during the high-bandwidth tests. That boost to 400 MHz is what produced the increase in performance.]
*Another interesting data point is Geeks3D's OpenGL 2.1 HDR, DoF and Radial Blur demo. At 64 GB/s we get 60 fps, and once we jump to 96 GB/s the hardware barely renders past 61 fps. Activating 8x MSAA, we're still rendering at 60 fps; this sort of bandwidth headroom shows how nearly "free" anti-aliasing can be at 720p. Raising the resolution to 1680x1050 we see a small drop with 8x MSAA: 58 fps as opposed to 60, a small but measurable increase in frame time.
*To keep the comparisons fair, the core clock for these tests is 400 MHz, which keeps the card within the Wii U's plausible GFLOPS range.
We were not limited by pixel fillrate during any of these tests, which suggests we were shader/ALU limited; in other words, to match them the Wii U would need to supply something close to the 512 GFLOPS these tests effectively used. My GPU at my personal overclocked frequencies would be roughly equivalent to the Xbox One's GPU (take away the bandwidth difference and it'd be roughly 1.4x the power). Various tests would place the Xbox One's GPU at BEST at roughly 2.5x Latte's rendering performance (and realize we're comparing apples to oranges). With the tiled forward demo it's also possible our geometry assembly was hurting the frame rate by up to 33% (to verify this we'd need to run a GPU with a lower SP count at a frequency chosen to match the GFLOPS my GPU operated at).
This will be the last set of conclusions and notes. The gap between the Wii U and the Xbox One, versus the gap between the Xbox One and the PS4: I estimate the former at no more than about a 57% difference. Why? The GFLOPS difference between the Wii U and the Xbox One is at WORST about 960 GFLOPS. **However, AMD's VLIW architecture excels at raw ALU work, which can be a big win and in the best case shrinks that figure to about 810 GFLOPS. The pixel and texel fillrates are sure losses for Nintendo's hardware and there is no overcoming them; then again, the PS4 has almost double the Xbox One's pixel fillrate. The total pixel-fillrate difference is larger between the Xbox One and the PS4 than between the Xbox One and the Wii U, and this is already evident in cross-platform games on the two competing consoles.

Memory architecture is another factor. The PS4 has 176 GB/s of unified memory, while the Xbox One has 68 GB/s of main memory plus 102 GB/s (204 GB/s counting both directions) of GPU eSRAM. To keep up with the PS4's bandwidth, developers will need tricks and sacrifices when designing or porting their games. Either console thwarts the Wii U on paper, but the memory the Wii U provides operates at a lower latency, so memory accesses will undoubtedly cost more on the PS4 and Xbox One than on the Wii U.

Then you've got the fact that AMD's Jaguar cores have (relatively) low IPC. Across all 8 cores, the Xbox One and PS4 land at around 60 GFLOPS, and that conjecture is based on a synthetic test. The Wii U CPU, being a lean RISC design, will execute most instructions in less time than either console's CPU, meaning superior single-threaded performance; multithreaded applications are the future, but there is overhead associated with them. The Xbox 360 probably had the best overall single-core performance and thus could handle more draw calls than any other mainstream game console (Direct3D 9 draw calls were bloated, so it may not be evident). There is a strong possibility the Wii U can keep up with, and possibly exceed, the number of draw calls used by the other two current-gen consoles. That in turn means support for a large number of polygons and ultimately gives graphic designers a lot of creative freedom. The Wii U's performance will be consistent and its efficiency high. With that said, the CPUs of the last-gen consoles were incredibly powerful CPU solutions, while the current consoles pair weaker CPUs with more powerful GPUs.
**This resulted from a misinterpretation of how the stream processors are counted. 320 stream processors with 4 ALUs each would theoretically give at least 4x the GFLOPS. In reality it is (I think) 80 or 64 VLIW SIMD units with 4 or 5 ALUs each, and those ALUs are what add up to the 320-400 stream processors in the first place.
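For reference, the arithmetic behind the gap figures quoted above and below, using this article's own TFLOPS numbers (the ~500 GFLOPS "effective" figure for Latte is the best-case VLIW argument, which the footnote above partly walks back):

```python
# Raw GFLOPS gaps quoted in this section, using the article's figures.
wii_u, xbox_one, ps4 = 350, 1310, 1840

print(xbox_one - wii_u)   # 960 GFLOPS: the "at worst" Wii U vs. Xbox One gap
print(ps4 - xbox_one)     # 530 GFLOPS: the "about 500" PS4 vs. Xbox One gap
print(xbox_one - 500)     # 810 GFLOPS: best case if Latte behaves like ~500 effective GFLOPS
```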
The ultimate truth is that there will be a discernible gap between each console. The PS4 and Xbox One are frequently lumped together, but the PS4 has a sizeable GFLOPS advantage over the Xbox One (about 500 GFLOPS, apples to apples) and an even bigger lead in pixel fillrate: roughly 12.4 Gpixel/s of difference, about 4 Gpixel/s more than the Wii U vs. Xbox One gap. Unlike last generation, the difference in quality between the PS4 and the Xbox One will be as clear as night and day. The difference between the Wii U and the Xbox One can be described as roughly 2.5x, while the difference between the Xbox One and the PS4 might be described as 2.0x. When a Sony fan brags, understand that they have the right; the Xbox One will be left behind cleaning up the massive pile of pixels spilling from the PS4. Because of how efficient Nintendo's console is, the gap with the Xbox One comes down to finer details. Best case scenario, the Xbox One ends up slapped right in the middle between the PS4 and the Wii U (good news for Nintendo fans, bad news for Microsoft).