Wii U hardware discussion and investigation *rename

Probably a fair amount of performance-sensitive game code is still compiled C/C++/whatever and not hand-written assembly, so you're at the mercy of the compiler to do a good job scheduling instructions, hinting branches, and performing prefetches (although that part at least can probably be done with builtins and intrinsics). And while compilers have gotten fairly good, I don't think they can really do as well as skilled humans at these tasks - particularly when they end up increasing register pressure more than a human would and cause more spills.
Sure, but there's a lot you can do in C/C++ to make sure the compiler doesn't jeopardize your attempts to write performant code. It all starts with data optimization (stuff like Mike Acton's "if there's one, there's many" mantra), minimizing branches, dereferences, false cache-line sharing and many, many more. Even the OoO CPU in the Wii U will benefit from a lot of these. I'd be shocked to learn that existing, fine-tuned code gets a two- or three-fold boost just from being run on the Wii U. Compilers are not that dumb, and where they were, fixing mistakes of this magnitude was the low-hanging fruit developers probably reached for first. That being said...
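A quick made-up illustration of the data-layout point before moving on (the struct and function names below are invented for the sketch, not taken from any engine): laying the hot data out as parallel arrays means the update loop only touches the fields it actually uses and streams through memory linearly, which helps in-order and OoO cores alike.

#include <cstddef>

// Array-of-structs: updating positions drags every field (health, name
// pointer, ...) through the cache along with the data actually used.
struct EntityAoS {
    float x, y, z;
    float vx, vy, vz;
    int health;
    const char* name;
};

void update_positions_aos(EntityAoS* e, std::size_t count, float dt) {
    for (std::size_t i = 0; i < count; ++i) {
        e[i].x += e[i].vx * dt;
        e[i].y += e[i].vy * dt;
        e[i].z += e[i].vz * dt;
    }
}

// Struct-of-arrays: the same update reads and writes only the position
// and velocity streams - fewer cache lines touched, and much easier for
// a compiler (or hand-written SIMD) to vectorize.
struct EntitiesSoA {
    float *x, *y, *z;
    float *vx, *vy, *vz;
    std::size_t count;
};

void update_positions_soa(EntitiesSoA& e, float dt) {
    for (std::size_t i = 0; i < e.count; ++i) {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
        e.z[i] += e.vz[i] * dt;
    }
}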

Even if we do take perfectly scheduled code - i.e., no stalls that could have been scheduled around - and all branches hinted well in advance or predicted, Broadway is still better off per clock. While they're both primarily 2-wide cores (and both have 4-instruction fetch and AFAICT both can fold branches in the frontend), Broadway has two ALUs while the PPEs only have one. Given that the PPEs are in-order and have two-cycle ALU latency, you're going to see a lot of code get a lot faster per cycle since it'll be able to run back-to-back ALU instructions in one cycle instead of two.
This makes the assumption (which honestly I don't know whether it is or isn't correct) that the CPU is mostly used for scalar operations and only minor portions of the code have been vectorized and utilize the VMX units (or equivalent). There's a lot of stuff you can't vectorize, but I can't imagine that this is the core of what's being run on the CPU (but my imagination is obviously limited, so I'd love to hear if I'm completely wrong here). And AFAIR from ancient (and perhaps debunked by now) articles by Jon Stokes - the 360 has two VMX units and I'd expect the Wii U cores to have the same number of vector units. But of course here I assume that 1. this is the case and 2. developers could and did vectorize a lot of code that's run on the CPU.

In the real world, predicting or pre-hinting all code branches very well is just impossible. If you have code that branches every 5-6 instructions, you don't have enough room to provide a hint far enough in advance to avoid the mispredict penalty. You'd at least need a way to cascade a bunch of hints in flight, and that'd be a big added challenge. The mispredict penalty on the PPEs is absolutely massive: 24 cycles vs just a few cycles for Broadway.
Sure. I simply assume that this is a well-known fact to all the developers and a lot of data and algorithms are built around not branching if it's unnecessary. Sometimes it's better to calculate something and discard the results. Even if there are no branchless conditionals (conditional loads/exchanges/whatever) on the PPEs, I assume people use bit hacks for these where it makes sense.
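For example, a generic sketch of the kind of bit hack meant here (nothing PPE-specific, helper names invented): both inputs are computed and one result is kept, so there is no branch to mispredict.

#include <cstdint>

// Branchless integer select: returns a when cond is non-zero, else b.
// The mask is all ones when cond != 0 and all zeros otherwise.
static inline std::int32_t select_i32(std::int32_t cond, std::int32_t a, std::int32_t b) {
    std::int32_t mask = -static_cast<std::int32_t>(cond != 0);
    return (a & mask) | (b & ~mask);
}

// "Calculate and discard": max(x, 0) with no branch, assuming the usual
// arithmetic right shift on signed integers.
static inline std::int32_t clamp_to_zero(std::int32_t x) {
    return x & ~(x >> 31);
}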
 
IMO having worked in and outside the games industry, the games industry has about the same mix of experts as most other industries doing large scale software development.

20 years ago when team sizes were small and everyone in the industry was self taught and highly motivated it was different. When you're hiring out of college and it's just a job to a lot of people, you're in no better position than any other industry.

There are exceptional teams in the games industry, but that's true of many other industries as well.
Sure, but. :) Large developers/publishers have dedicated optimization ninjas, and there are guidelines and procedures against making code dumb. On top of that I'd say that the majority of AAA developers use middleware, pushing a lot of optimization work to the middleware providers. I find it hard to believe that there is low-hanging fruit in UE or CE or Havok or pretty much anything "mainstream" out there. Sure, you have to tune code for your scenario (the Rocksteady guys probably did a lot of changes to UE to enable a seamless sandbox; Midway/NetherRealm did a lot of changes to get 60fps out of UE), but not every game is that unique. And even for games relying less on middleware we see a clear trend of engine-building efforts consolidating under various umbrellas (Frostbite development is a good example).

I guess my point is that with 4 devs on a team everyone has to be careful; with 40 people you're dealing with multiple layers of code and delegating responsibilities in a way that the performance-sensitive code is mostly worked on by your best. I mean - it wouldn't make sense to give the top-priority tasks to grads. :/
 
Sure, but. :) Large developers/publishers have dedicated optimization ninjas, and there are guidelines and procedures against making code dumb. On top of that I'd say that the majority of AAA developers use middleware, pushing a lot of optimization work to the middleware providers. I find it hard to believe that there is low-hanging fruit in UE or CE or Havok or pretty much anything "mainstream" out there. Sure, you have to tune code for your scenario (the Rocksteady guys probably did a lot of changes to UE to enable a seamless sandbox; Midway/NetherRealm did a lot of changes to get 60fps out of UE), but not every game is that unique. And even for games relying less on middleware we see a clear trend of engine-building efforts consolidating under various umbrellas (Frostbite development is a good example).

I guess my point is that with 4 devs on a team everyone has to be careful; with 40 people you're dealing with multiple layers of code and delegating responsibilities in a way that the performance-sensitive code is mostly worked on by your best. I mean - it wouldn't make sense to give the top-priority tasks to grads. :/

This is way off topic but ...
I think you're massively overestimating general code quality, and the level of optimization across the codebase as a whole.

Havok gets used for player/enemy-to-world collision detection all the time; it's one of my pet hates. I'd guess that a dedicated solution to the problem most of the teams using it this way are actually solving would be 20-50x faster than Havok, but it's easier and quicker to get running with Havok, and painful to replace later.

A lot of game code is just whatever is fastest to get it working. Game quality is more about speed of iteration than it is about code quality.

Doesn't mean the code is bad, just that there isn't an emphasis on performance for most of the code; games are 1M+ lines of code, it isn't like 1992 when your average SNES or Genesis game was perhaps 40K lines of assembler.

Hell I've worked on big teams where the majority of "game programmers" couldn't explain a cache hierarchy.

Yes, there is a class of code, usually core engine code (and I have a very narrow view of what that includes), where people still fret over clock cycles; that's the <5% of the code volume that's taking 30-50% of the time.
 
I honestly haven't looked at a lot of 360 game code recently, but I'd guess there is still a pretty heavy amount of CPU work for things like animation and this is where the WiiU core would struggle in comparison.

Then devs will have to get used to doing that kind of stuff mostly on the GPU, if they want optimal performance.
Seems much depends on just how powerful that GPU is.
 
That's what threading is. If you have a thread that has its own parallel code stream and execution units, you've got a core. ;) Threading is about optimising use of execution units by running multiple streams of code through the processor. Depending on the number and type of execution units, threading can have no benefits at all.

Well I guess the point is the same either way: the Wii U CPU and Xenon aren't comparable thread to thread (among everything else, apparently).
 
A lot of game code is just whatever is fastest to get it working. Game quality is more about speed of iteration than it is about code quality.

He who first reinvents the LISP machine/Burroughs 5000 style architecture will own the world. ;)
 
Sure, but there's a lot you can do in C/C++ to make sure the compiler doesn't jeopardize your attempts to write performant code. It all starts with data optimization (stuff like Mike Acton's "if there's one, there's many" mantra), minimizing branches, dereferences, false cache-line sharing and many, many more. Even the OoO CPU in the Wii U will benefit from a lot of these. I'd be shocked to learn that existing, fine-tuned code gets a two- or three-fold boost just from being run on the Wii U. Compilers are not that dumb, and where they were, fixing mistakes of this magnitude was the low-hanging fruit developers probably reached for first. That being said...

I'm not saying that there isn't a lot you can do to tune source code to make it friendlier on these processors. What I'm saying is that the compiler won't completely eliminate the benefit of reordering and decent branch prediction on the CPU. That was your claim - that OoO wouldn't help significantly because the code is already tuned for in-order.

Nobody is saying two or three-fold boost, I don't know where this is coming from. Something like a 20% improvement in perf/clock easily qualifies as significant.

This makes the assumption (which honestly I don't know whether it is or isn't correct) that the CPU is mostly used for scalar operations and only minor portions of the code have been vectorized and utilize the VMX units (or equivalent). There's a lot of stuff you can't vectorize, but I can't imagine that this is the core of what's being run on the CPU (but my imagination is obviously limited, so I'd love to hear if I'm completely wrong here). And AFAIR from ancient (and perhaps debunked by now) articles by Jon Stokes - the 360 has two VMX units and I'd expect the Wii U cores to have the same number of vector units. But of course here I assume that 1. this is the case and 2. developers could and did vectorize a lot of code that's run on the CPU.

I'm not assuming that VMX isn't used, I'm assuming that integer-heavy code is still reasonably common. Often even alongside VMX code, for instance for address generation and flow control. But in between VMX code as well - for instance, the lack of gather/scatter instructions means sometimes you need to move stuff into integer registers for computed loads/stores, and you will often want to use the ALUs for this since the load/store unit has weak address generation. AFAIK there's a huge penalty for transferring registers between VMX and the integer ALU, so you probably generally don't want to try using VMX as a second integer port.

Back to my original point that a lot of code is C/C++/whatever and not ASM, which you agree with - compilers are even further from optimal when it comes to vectorization. Although game developers may be using ones that are better than what I'm used to. Still, it's one of those things that benefits a lot from tight control which is difficult to communicate in high level source code.

Broadway doesn't have VMX. It has paired singles (2x32-bit FP32) that execute in one FPU pipe, so it can do 2 FMAs (4 FLOPs) per cycle. The VMX/VMX-128 units on the Cell PPE/Xenon respectively can perform 4 FMAs (8 FLOPs) per cycle, but they can also do a bunch of other stuff over several other pipelines, like 8/16/32-bit integer SIMD, permutes, and vector loads/stores, and can issue two of these per cycle. So they can get a lot more stuff done.

Sure. I simply assume that this is a well-known fact to all the developers and a lot of data and algorithms are built around not branching if it's unnecessary. Sometimes it's better to calculate something and discard the results. Even if there are no branchless conditionals (conditional loads/exchanges/whatever) on the PPEs, I assume people use bit hacks for these where it makes sense.

It's true that you can often do things branchless (like with predication or equivalent). But often it'll be slower on CPUs that do have good branch prediction, so it's a balancing act: if you're not writing assembly, then you probably want to try to keep the code reasonably well performing for all of your target platforms.
 
Well I guess the point is the same either way: the Wii U CPU and Xenon aren't comparable thread to thread (among everything else, apparently).

You would do better to compare them core to core, rather than thread to core.


A short pipeline OoO CPU core might only run at 200 MHz, but could sustain 0.45 IPC.
A longer pipeline in-order CPU core could run at 600 MHz, but might only sustain 0.15 IPC.

Combine the two and both CPUs give you 90 million instructions per second. You get the same amount of work done, just in different ways.
 
You seem to be under the mistaken presumption that Broadway (and Gekko, and the original PowerPC 750s for that matter) was in-order in the first place. The rumors that OoO was "added" to Wii U on top of Broadway are incorrect. When we talk in this thread about Wii U having only weak OoO it's based on what is publicly known about Broadway and nothing else.
Fair point. I did know that, but it's a buried memory. ;) We're probably looking at pretty much exactly a tri-core Broadway.

I'm not saying that there isn't a lot you can do to tune source code to make it friendlier on these processors. What I'm saying is that the compiler won't completely eliminate the benefit of reordering and decent branch prediction on the CPU. That was your claim - that OoO wouldn't help significantly because the code is already tuned for in-order.

Nobody is saying two or three-fold boost, I don't know where this is coming from.
The comparison between Wuu's Espresso CPU and Xenon. I don't think it's been raised in this thread directly, but there are those thinking features of Espresso can make all the difference, and it's worth identifying how much difference they can really make. 20% perf/clock is valuable in a chip but not going to pull Espresso up to Xenon performance overall. What's hard to pin down is how much code running on XB360 is vectorised and leaning on the SIMDs, and how much is branchy 'integer' work. E.g., using ERP's example of Havok: if Havok is moderately optimised for Xenon and runs well across its fat VMX units, what will devs encounter using the same lib on Wuu? Will Havok need to be refactored for GPGPU work? In which case, how much does that take from the graphics potential of the machine? Whereas if the collision engine in games (let's say UE3 as a common engine) was written to run on the 'integer' side of the CPU, Espresso could probably trump Xenon.
 
I'm sorry if I'm butting in, but the discussion in this thread seems to be leaning towards code quality and optimization across different architectures. Wouldn't the Wii U be in a rather enviable position though? Big development houses have had to port games from the far more grunty Xbox 360 and PS3 to the diminutive Wii for ages. And though this has probably been limited mainly to assets, I assume they've relied on an ever-expanding set of ever more mature compilers, libraries, middleware and practices that have been tweaked since the GameCube era for their general code.

Circumventing the lack of VMX and other discrepancies between the PS360 PowerPC implementations and Espresso should really have been tackled in some form already too, with new solutions available thanks to the GPGPU configuration. GPGPU isn't a magic bullet of course, and it makes the performance discussion kind of strange. The consensus here seems to be that an apples-to-apples comparison isn't possible, and it sort of devolves into what kind of fruit salad can be made with the different components at hand. Primarily whether or not the Wii U based fruit salad has the potential to taste better... and that comes down to the chefs involved, I suspect, more now than we've seen between hardware generations before.

Or am I being too optimistic?
 
GPGPU isn't a magic bullet of course
Particularly not if the GPU is based on Radeon HD 4xxx-series as has been rumored/speculated up until now; almost all GPGPU applications on HD4xxx, including top of the line cards, had weak or outright bad performance on that architecture due to things like its VLIW5 configuration, and a cache hierarchy which was poorly designed for compute tasks. There were probably other issues as well.

Harsh reality is we really have no evidence whatsoever that GPGPU was ever any concern at all during wuu's development. Mostly it's just wishful thinking and word of mouth/rumors all originating from the same source(s) being repeated and taken more or less as fact. Most likely, wuu's as poorly equipped for GPGPU as it is for vector processing on its CPU cores.
 
Particularly not if the GPU is based on Radeon HD 4xxx-series as has been rumored/speculated up until now; almost all GPGPU applications on HD4xxx, including top of the line cards, had weak or outright bad performance on that architecture due to things like its VLIW5 configuration, and a cache hierarchy which was poorly designed for compute tasks. There were probably other issues as well.

Harsh reality is we really have no evidence whatsoever that GPGPU was ever any concern at all during wuu's development. Mostly it's just wishful thinking and word of mouth/rumors all originating from the same source(s) being repeated and taken more or less as fact. Most likely, wuu's as poorly equipped for GPGPU as it is for vector processing on its CPU cores.
Except the whole GPGPU stuff isn't based on rumors, it's a feature highlighted by Nintendo itself, both in public presentations and in the documentation for developers. If that's something they were focussing on, and that definitely seems to be the case, they most likely changed and extended the GPU accordingly.
 
Fair point. I did know that, but it's a buried memory. ;) We're probably looking at pretty much exactly a tri-core Broadway.

The comparison between Wuu U's Espresso CPU and Xenon. I don't think it's been raised in this thread directly, but there are those thinking features of Espresso can make all the difference, and it's worth identifying how much difference they can really make. 20% perf/clock is valuable in a chip but not going to pull Espresso up to Xenon performance overall. What's hard to pin down is how much code running on XB360 is vectorised and leaning on the SIMDs, and how much is branchy 'integer' work. eg, using ERPs example of Havok, if Havok is moderately optimised for Xenon and runs well across its fat VMX units, what will devs encounter using the same lib on Wuu? Will Havok need to be refactored for GPGPU work? In which case how much does that take from the graphics potential of the machine? Whereas if the collision engine on games (lets say UE3 as a common engine) was written to run on the 'integer' side of the CPU, Espresso could probably trump Xenon.

Damn, I can't help myself.
It is impossible to make direct benchmark comparisons unless you have access to both the Wii U and either HD twin. But if we use Geekbench, and compare the PS3 and a PPC7447 (similar to the 750, but with AltiVec and an improved bus interface), their integer and floating point scores at 3.2GHz and 1.25GHz respectively are:
INTEGER AGGREGATE:
PS3: 920
PPC7447: 879
FLOATING POINT AGGREGATE:
PS3: 702
PPC7447: 925

So... even IF the Broadway core is largely untouched, at 1.25GHz it should still be roughly at the level of the PPE, and by extension a Xenon core at 3.2GHz, excluding SIMD FP. And of course, the Wii U CPU has a much more sympathetic memory subsystem than the Mac mini I used for this comparison, and I still believe IBM has done a bit more for its $1 billion than just tack on a new L2 cache interface and rudimentary SMP support.
 
After the "enhanced Broadway" stuff got confirmed, I was looking through some old rumors as I distinctly remembered some dude posting just that more than a year ago. This guy also posted back then that SMP was implemented using a ringbus - which strikes me as odd. I believe no conventional CPU uses a ringbus for SMP? I googled a bit, and apparently, Larrabee was meant to use a ringbus, and a few DSPs use ringbusses, but that seems to be it...
 
It's nearly impossible to make good comparisons between Espresso and Xenon on paper specs alone; the two architectures are radically different, so it's hard to know how they will perform.

I remember years ago when PS3 vs 360 was all the rage, I thought naively that the PPU from Cell was basically equivalent to one core from Xenon. Both cores were really, really similar from a high-level point of view: same units, same frequency, same latencies for instructions... I had a hard time seeing how they would not perform the same (except for the enhanced VMX unit on Xenon). But at that time someone on this forum with knowledge of the matter (I think it was ERP, but I could be mistaken) told me that in their benchmarks they were seeing really different performance from the two cores. So even with two cores that are really similar, the reality was not so easy to guess from paper specs alone.
 
A lot of game code is just whatever is fastest to get it working. Game quality is more about speed of iteration than it is about code quality.
Agreed. That seems to be almost universally true for game logic and UI code (and basically all game-specific code). Fast iteration time is critical to fine-tune the gameplay and make it fun.

Technology code tends to be developed differently, but that's usually less than 10% of the whole code base.
Hell I've worked on big teams where the majority of "game programmers" couldn't explain a cache hierarchy.
That's also true for smaller teams :). Most game programmers don't need to worry about things like cache hierarchies and store forwarding stalls. As long as technology programmers understand the low-level hardware, things usually proceed fine. Technology code (graphics rendering, physics simulation, area/ray/etc queries) tends to use the majority of CPU cycles, so designing this part of the code to be as efficient as possible is often a huge step towards the goal. Of course low-level programmers need to monitor the performance of game code, and once in a while fix performance bottlenecks caused by the game/UI code as well.

Most modern games use Flash based UI engines, so basically there's no hope to get UI code to run well, no matter what you do :)
Nobody is saying two or three-fold boost, I don't know where this is coming from. Something like a 20% improvement in perf/clock easily qualifies as significant.
I think the 2x-3x boost figure comes from the rumors of the WiiU clock rate. Compared to Xenon's 3.2 GHz clock rate, it needs much more than 20% IPC improvement to match the performance.
I'm not assuming that VMX isn't used, I'm assuming that integer-heavy code is still reasonably common. Often even alongside VMX code, for instance for address generation and flow control. But in between VMX code as well - for instance, the lack of gather/scatter instructions means sometimes you need to move stuff into integer registers for computed loads/stores, and you will often want to use the ALUs for this since the load/store unit has weak address generation. AFAIK there's a huge penalty for transferring registers between VMX and the integer ALU, so you probably generally don't want to try using VMX as a second integer port.

Back to my original point that a lot of code is C/C++/whatever and not ASM, which you agree with - compilers are even further from optimal when it comes to vectorization. Although game developers may be using ones that are better than what I'm used to. Still, it's one of those things that benefits a lot from tight control which is difficult to communicate in high level source code.
VMX code is mostly used in performance-critical areas only (less than 5% of the code base). However, these areas of the code can easily use more than 50% of your CPU cycles.

I don't know about other developers, but we do not use any compiler autovectorization tools. Autovectorization is just too fragile to work properly (especially on an architecture that has no scalar<->vector register moves and no store forwarding).

Basically all our VMX code is hand-written intrinsics. We don't have lots of it, but we have enough to run many of our performance-critical parts completely in VMX. On Xenon, you do not want to mix (tiny pieces of) vectorized code with standard code. You want to run large, aggressively unrolled number-crunching loops that are pure VMX. This is because the VMX pipelines are very long, and you have big penalties in moving data between register sets (you need to do it through the L1 cache, but you have no store forwarding, so you have to wait a long time before the data becomes available). OoO CPUs have register renaming, so loop unrolling is not required there, but fortunately Xenon has lots of vector registers, so loop unrolling gives the compiler lots of options to reorder instructions / cascade registers to keep the long VMX pipelines filled.

Another reason why this kind of brute-force VMX batch processing is a good fit for Xenon is memory prefetching. You have to manually cache prefetch, or you will suffer A LOT. Batch processing has a predictable memory access pattern, so it's easy to prefetch things in time (even if you are using some kind of (cache-line) bucketed structures). Running this kind of unrolled, cache-friendly vector processing code on a modern OoO core doesn't improve its performance that much, since the code doesn't need branch predictors, automated cache prefetching, register renaming, etc, etc.
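For readers who haven't written this kind of code, here is a rough sketch in plain AltiVec/VMX intrinsics (not Xenon's VMX128 extensions, and not taken from any shipping codebase) of what such an unrolled, prefetched batch loop looks like. Sixteen-byte alignment and a count that is a multiple of 16 floats are assumed to keep it short; real Xenon code would use dcbt/xdcbt directly rather than the generic GCC prefetch builtin.

#include <altivec.h>
#include <cstddef>

// y[i] = a[i] * b[i] + y[i], 16 floats (four vectors) per iteration.
// Unrolling gives the in-order core independent vec_madd chains to hide
// the long VMX latency; the prefetches pull upcoming lines in early.
void madd_batch(float* y, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 16) {
        __builtin_prefetch(a + i + 64);
        __builtin_prefetch(b + i + 64);
        __builtin_prefetch(y + i + 64);

        __vector float a0 = vec_ld( 0, a + i), b0 = vec_ld( 0, b + i), y0 = vec_ld( 0, y + i);
        __vector float a1 = vec_ld(16, a + i), b1 = vec_ld(16, b + i), y1 = vec_ld(16, y + i);
        __vector float a2 = vec_ld(32, a + i), b2 = vec_ld(32, b + i), y2 = vec_ld(32, y + i);
        __vector float a3 = vec_ld(48, a + i), b3 = vec_ld(48, b + i), y3 = vec_ld(48, y + i);

        y0 = vec_madd(a0, b0, y0);
        y1 = vec_madd(a1, b1, y1);
        y2 = vec_madd(a2, b2, y2);
        y3 = vec_madd(a3, b3, y3);

        vec_st(y0,  0, y + i);
        vec_st(y1, 16, y + i);
        vec_st(y2, 32, y + i);
        vec_st(y3, 48, y + i);
    }
}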
It's true that you can often do things branchless (like with predication or equivalent). But often it'll be slower on CPUs that do have good branch prediction, so it's a balancing act: if you're not writing assembly, then you probably want to try to keep the code reasonably well performing for all of your target platforms.

(+ lots of other branching discussion in this thread)
Talk about branching is always close to my heart :)

First of all, most branches inside tight loops are just bad code. Clean code should separate common cases from special cases. You should take the branch out of the loop, and execute the code (that was inside the branch) only for the elements that have that property (extract it to another separate loop, preferably to a separate file). This style of programming makes the code easier to read & understand, and allows you to easily extend it (add new special case functionality without the need to modify existing code). This kind of processing is also very cache friendly (if you also extract the data needed by each special property to a separate linear array).
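A minimal made-up illustration of that restructuring (the entity type and functions are invented for the example): the special case moves out of the hot loop and gets its own loop, driven by a list of just the elements that have the property.

#include <vector>

// Hypothetical types/functions, just for the sketch.
struct Entity { float x, vx, health; bool on_fire; };
inline void integrate(Entity& e, float dt)         { e.x += e.vx * dt; }
inline void apply_burn_damage(Entity& e, float dt) { e.health -= 5.0f * dt; }

// Branchy version: the special case is tested inside the hot loop for
// every element, even though few elements ever take it.
void update_all(std::vector<Entity>& entities, float dt) {
    for (Entity& e : entities) {
        integrate(e, dt);
        if (e.on_fire)
            apply_burn_damage(e, dt);
    }
}

// Restructured version: the common case runs in a tight, branch-free
// loop, and the special case is handled separately for only the
// elements that actually have the property.
void update_all_split(std::vector<Entity>& entities,
                      const std::vector<int>& burning_indices, float dt) {
    for (Entity& e : entities)
        integrate(e, dt);
    for (int i : burning_indices)
        apply_burn_damage(entities[i], dt);
}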

Random branches are always hard to predict. If you don't analyze your data, and do nothing to control your branching behaviour (for example sorting your structures to improve branching regularity), you have to design around the worst case. For structures that might at some point in the game contain around half of the elements with a certain property (requiring branching), it means you have to estimate a 50% branch mispredict rate (as elements are in "random" order).

How much does a single branch mispredict cost? On Sandy Bridge it's 14-17 cycles. How much work can Sandy Bridge do in that time? It can do 14-17 eight-wide AVX additions + multiplies (+ lots of loads, stores and address generation in other ports). An AVX-optimized 4x4 matrix multiply executes in 12 cycles (http://software.intel.com/en-us/forums/topic/302778). A mispredicted branch costs more than a 4x4 matrix multiply! And Haswell makes things even worse (it has two FMA ports + more integer ports). It will be able to do two matrix multiplies for each mispredicted branch :(

Branches that are easy to predict are often fine (as long as they do not pollute the code base). We had a discussion about branching with a colleague of mine. It was basically about things dying in the game, and whether it would be better to have a listener-based solution to tell entities to remove references to a dead entity, or to do a "null check" on use (we don't use pointers, so "null check" is not exactly what we do). In this case a single check (branch instruction) during usage costs a single cycle, because it always returns true until the entity dies (it's always predicted properly). When the entity dies, you pay a 20+ cycle misprediction penalty for each entity that had a reference to it (the next time it tries to access it). In comparison, the listener-based solution causes a 600-cycle cache miss for each entity that needs to be informed of the death (and likely another cache miss for each entity that registers itself as a listener). The single cycle (of a predicted branch) can often be masked out (executed in parallel to other instructions in separate pipelines). So it is a better-performing option. And it also sidesteps all the problems of programmers occasionally forgetting to register listeners, or to send the proper messages around... That has never happened, heh heh :)
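As a generic stand-in for the "check on use" idea (the post above notes the real scheme is not literally a null check, so the generation-checked handle below is just one common way to get the same well-predicted branch, with invented names):

#include <cstdint>
#include <vector>

// A handle stays valid while the slot's generation matches; freeing the
// slot bumps the generation, so stale handles fail the check on use.
struct Handle { std::uint32_t index; std::uint32_t generation; };

struct EntityPool {
    std::vector<std::uint32_t> generations; // one counter per slot

    bool alive(Handle h) const {
        // Nearly always true until the entity dies, so the branch at the
        // call site predicts correctly and costs roughly one cycle.
        return generations[h.index] == h.generation;
    }
};

// Usage at a reference site (no listener bookkeeping needed):
//   if (pool.alive(target)) { /* use the entity */ }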

Branchless constructs (for isel, fsel, min, max, etc.) are often very good, but you have to be careful to use the correct bit-shuffling tricks for each platform, as all modern CPUs are superscalar but have different numbers of pipelines/ports for different instructions. But that's nothing a good inlined function that is #ifdef'd differently for each platform doesn't handle (of course it's also important to name these functions properly, and teach your programmers to use them). If some platform performs better with branches, you can of course easily change those select functions to use branches on that platform.
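Something along these lines, as a hedged sketch (the platform macros and the per-platform implementation choices are purely illustrative):

#if defined(TARGET_X86_SSE)
#include <xmmintrin.h>
#endif

// One inlined helper per construct; each platform gets whichever form
// performs best there, and call sites never change.
#if defined(TARGET_PPC)
// Written so an optimizing PPC compiler can emit fsel (no branch).
inline float fast_min(float a, float b) { return (a - b >= 0.0f) ? b : a; }
#elif defined(TARGET_X86_SSE)
// SSE has a dedicated branchless scalar min instruction.
inline float fast_min(float a, float b) {
    return _mm_cvtss_f32(_mm_min_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
#else
// Fallback: on CPUs with good branch prediction a plain ternary is fine.
inline float fast_min(float a, float b) { return (a < b) ? a : b; }
#endif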

Branchless constructs are more important for GPU programming, as 32/64 threads in the same warp need to have identical control flow. They are also important for VMX programming (do multiple simultaneous branchless operations in vector registers, and keep the results there).
 