Wii U hardware discussion and investigation *rename

FWIW Wii U CPU is 3 cores at 1.7x the Wii CPU clock. So 3x1.7 = 5.1x Wii CPU. Then with more cache maybe you can call it 6x. I got that from a GAF post, but I guess 6x isn't a horrible generational leap, though back in the PS360 day it wasn't a great one either.


It's worth nothing ;)

Seriously though - 'multipliers' have no quantifiable value or meaning in this instance. All that's telling you is the cumulative clock speed is 5.1 times the Wii clock speed. And that somehow the larger caches equate to an additional "0.9x"....something.

Sorry. Pet peeve of mine.
 
Can we at least stop the "but it's OoO and therefore needs code optimized differently" crap.
The bulk of optimizations done on a game usually center around algorithms and more importantly data layout, they benefit every architecture.
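
To make the data-layout point concrete, a minimal hypothetical C++ sketch (names and fields made up, not from any actual engine): reorganising an array-of-structs into a struct-of-arrays so the hot loop only pulls the fields it reads through the cache. The win is in cache-line utilisation, so it helps in-order and out-of-order cores alike.

Code:
#include <cstddef>
#include <vector>

// Array-of-structs: the integration loop below would drag color and
// flags through the cache even though it never reads them.
struct ParticleAoS {
    float pos[3];
    float vel[3];
    float color[4];   // never touched by the hot loop
    int   flags;      // never touched by the hot loop
};

// Struct-of-arrays: the hot fields are contiguous, so every cache line
// fetched is fully used. Same algorithm, better layout.
struct ParticlesSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
};

void integrate(ParticlesSoA& p, float dt) {
    for (std::size_t i = 0; i < p.px.size(); ++i) {
        p.px[i] += p.vx[i] * dt;
        p.py[i] += p.vy[i] * dt;
        p.pz[i] += p.vz[i] * dt;
    }
}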

In fact it being OoO just means it's less sensitive to instruction scheduling by the compiler.

Unless the WiiU compiler is really bad and there is some reason to believe it's going to get really good, the CPU is what it is.

Yes, before anyone mentions it, people do still write assembler for small, often-called functions, usually for vector optimizations. Usually the number of lines we're talking about is trivial, and it's likely they'll do similar optimizations on WiiU.
 
On Wii the ARM CPU ran an OS called IOS which managed system devices for games. The hacker guys got things like USB game loaders working by reworking IOS operation. But the ARM CPU did not run the System Menu. In fact the System Menu is not running when a game is.

I'm sure we'll know exactly how WiiU works soon enough.
 
Edit:
..
Edit2:

It doesn't make sense to edit this again now, as I lost the first answer.
 
That's what threading is. If you have a thread that has its own parallel code stream and execution units, you've got a core. ;) Threading is about optimising use of execution units by running multiple streams of code through the processor. Depending on the number and type of execution units, threading can have no benefits at all.

For Wii U's CPU capability, we need to know firstly the peak throughput in terms of execution units, and then compare its efficiency vs. code running on Xenon.

SMT will give the processor something to do while waiting for a cache miss and will lessen the average impact of a branch misprediction if some of the instructions in flight are from the other thread. No processor will be able to schedule around misses in the LLC so unless there are no misses and no branch mispredicts you'll get some benefit, even if you're keeping the execution units fed otherwise.

There are some deficits in sharing the caches and branch prediction though (and whatever else you share, if anything - usually those two things are shared).

A core like Broadway has very minor reordering capabilities. It'll help a lot in scheduling around execution data dependencies including L1 latency, but it'll barely reduce the impact of L1 misses and do close to nothing for L2 misses. So it probably would have benefited a lot from SMT.

Probably not as much as the Xenon/Cell PPE though.
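
As a rough illustration of the kind of code this is about (hypothetical C++ sketch, names made up): in a dependent pointer chase, even a large reorder window has nothing else to issue while a load misses, which is exactly where a second hardware thread would soak up the idle cycles.

Code:
// Dependent pointer chase: the next load's address comes from the
// previous load, so when n->next misses in L2 (or further out) there
// is nothing else for the core to issue -- a small reorder window like
// Broadway's or a large one makes little difference here. A second
// hardware thread could use those idle cycles, which is the SMT
// benefit described above.
struct Node { Node* next; int payload; };

int sum_list(const Node* n) {
    int sum = 0;
    while (n) {
        sum += n->payload;  // tiny amount of ALU work per element
        n = n->next;        // serialising load, likely to miss
    }
    return sum;
}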
 
SMT will give the processor something to do while waiting for a cache miss and will lessen the average impact of a branch misprediction if some of the instructions in flight are from the other thread. No processor will be able to schedule around misses in the LLC so unless there are no misses and no branch mispredicts you'll get some benefit, even if you're keeping the execution units fed otherwise.
In the early days of Cell discussion, someone posted a link into cache efficiency that showed cache misses in some architecture (I think x86) were pretty rare. Adding in that devs have now gotten used to structuring their data to feed stream processors, it should only be a small subset of workloads where SMT can be any benefit. I'm thinking single digit percentages. Certainly not order-of-magnitude gains on efficient game code, such that a hypothetical Xenon with multithreading and OOE can compete with the same chip sans SMT and IOE at twice the frequency. SMT is great in a fully multitasking OS running all sorts of code, but should be of little significant value to well-written game code.
 
In the early days of Cell discussion, someone posted a link into cache efficiency that showed cache misses in some architecture (I think x86) were pretty rare. Adding in that devs have now gotten used to structuring their data to feed stream processors, it should only be a small subset of workloads where SMT can be any benefit. I'm thinking single digit percentages. Certainly not order-of-magnitude gains on efficient game code, such that a hypothetical Xenon with multithreading and OOE can compete with the same chip sans SMT and IOE at twice the frequency. SMT is great in a fully multitasking OS running all sorts of code, but should be of little significant value to well-written game code.

I would bet it's probably in the 20-40% range. When you are doing a lot of work on data streams it's usually not in the cache (in fact you don't want it to pollute the cache in many instances). Even if you read through memory perfectly linearly, the odds are you process it faster than you can fetch it, so you're going to stall on every cache line.
Also IME the bulk of game code isn't particularly optimal.
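
A minimal sketch of the streaming case being described (hypothetical C++; __builtin_prefetch is the GCC/Clang builtin, which on PowerPC lowers to dcbt, and the prefetch distance is an assumed tuning value): the per-element work is tiny, so without fetching ahead the loop stalls on roughly every cache line.

Code:
#include <cstddef>

// Streaming transform: the work per element is one multiply, so the
// core finishes a cache line's worth of elements long before the next
// line arrives. Prefetching a few lines ahead hides some of that, at
// the cost of hand tuning the distance per platform.
void scale(float* dst, const float* src, std::size_t n, float k) {
    const std::size_t ahead = 64;  // elements ahead; assumed tuning value
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n)
            __builtin_prefetch(&src[i + ahead]);
        dst[i] = src[i] * k;
    }
}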

There is probably a class of code where the WiiU core is as fast as the PS360 core, because cache (predominantly L1) misses dominate performance. But I would be surprised if it could keep pace across the bulk of a game codebase.

I honestly haven't looked at a lot of 360 game code recently, but I'd guess there is still a pretty heavy amount of CPU work for things like animation and this is where the WiiU core would struggle in comparison.
 
In the early days of Cell discussion, someone posted a link into cache efficiency that showed cache misses in some architecture (I think x86) were pretty rare. Adding in that devs have now gotten used to structuring their data to feed stream processors, it should only be a small subset of workloads where SMT can be any benefit. I'm thinking single digit percentages.

You mean only a single digit percentage of workloads will see any benefit or you'd only see an average of a single digit percentage gain? Because I don't agree with either, but the former is especially questionable.

Look at how much SMT helps on Atom, where > 50% gains are common by running a second thread on the same core. That has prefetch instructions and probably better automatic prefetchers than Broadway, but it's still not enough. The two processors are not that different in terms of execution resources. A tiny amount of OoO isn't enough to change the balance from helping a ton to barely helping at all. Saying that cache misses are "pretty rare" is meaningless by itself. If only one in a hundred loads miss yet the penalty is 200 cycles then it's still a huge deal. The benefits apply to L1 icache misses too, which for some workloads can be as or even more common than L1 dcache misses.
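
Putting assumed numbers on that (back-of-envelope only, nothing measured): with roughly 30% of instructions being loads, a 1% load miss rate and a ~200-cycle penalty, an ideal 0.5 CPI more than doubles.

Code:
#include <cstdio>

// Back-of-envelope only; every number below is an assumption, not a
// measurement. Even a 1% load miss rate is expensive when the penalty
// is ~200 cycles.
int main() {
    const double loads_per_instr = 0.30;  // assumed fraction of loads
    const double miss_rate       = 0.01;  // "one in a hundred loads"
    const double miss_penalty    = 200.0; // cycles, as in the post
    const double base_cpi        = 0.5;   // ideal 2-wide issue

    const double stall_cpi = loads_per_instr * miss_rate * miss_penalty;
    std::printf("CPI %.2f -> %.2f (%.0f%% slower)\n",
                base_cpi, base_cpi + stall_cpi,
                100.0 * stall_cpi / base_cpi);   // 0.50 -> 1.10
}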

Certainly not order-of-magnitude gains on efficient game code, such that a hypothetical Xenon with multithreading and OOE can compete with the same chip sans SMT and IOE at twice the frequency. SMT is great in a fully multitasking OS running all sorts of code, but should be of little significant value to well-written game code.

No one's stipulating order of magnitude gains. You don't need an OS running a bunch of heavy stuff at the same time to benefit from SMT on 3 cores. Console games can (and do) run more than 3 threads heavily and can (and do) benefit from more than 3 hardware threads. For instance all of the games being ported from the other consoles. You're claiming that Xenon's SMT was a waste of hardware and that is the one complaint I have not heard anyone make.
 
You mean only a single digit percentage of workloads will see any benefit or you'd only see an average of a single digit percentage gain?
Single digit gain in performance, making assumptions as to what game engine code looks like. ERP has other ideas based on much more relevant experience. ;)

No one's stipulating order of magnitude gains...You're claiming that Xenon's SMT was a waste of hardware and that is the one complaint I have not heard anyone make.
No, I'm saying that those who believe OoOE or multithreading can make all the difference in a processor are mistaken. Multithreading is a cheap way in terms of silicon to gain extra performance, but it's a very different sort of extra performance to adding execution units, increasing clocks, etc. It's dependent on what you're doing when. SMT is only raised here as Rangers mentioned it and it was worth covering what SMT brings to the table.

Maybe Sebbbi has some relevant knowhow with his XB360 work?
 
In total throughput, Xenon will be more powerful. But in terms of executing game code, a lot depends on the code devs are using. I believe Xenon will be more powerful because I believe devs are writing optimised code that makes efficient use of the processor, but it's wrong to compare the straight numbers. GHz*threads is not at all accurate!

There's a lot of existing code developers are and will be using. It was written and tuned for in-order, deep pipeline processors so it's not like an OoO CPU will run it significantly faster. If anything it's harder to optimize for an OoO CPU because it's less predictable than its less "brainy" counterpart*. Unless existing code gets dropped on the floor and developers stop thinking about branching, pairing and what not, there's little reason to believe that Wii U will get an advantage due to architectural differences. It's also important to keep in mind that code won't suddenly be written for Wii U alone (not by 3rd parties and not any time soon at least) so it will have to behave well on PS360. Calling GHz*threads inaccurate is really a straw man here. I haven't seen anyone doing the brain-dead math here saying that a*b=k*c*d therefore 360 is k times more powerful than Wii U. ;)

*Intel concluded some time ago that wimpy cores (deep pipelines, specialized, in-order) may give more raw power than brainy ones but very few people know how to write decent code for those. Game developers are a fairly rare breed that does.
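
For what "thinking about branching" tends to look like in practice, a minimal hypothetical C++ sketch (function names made up): both versions do the same clamp; the branchless one is the style written for deep-pipeline in-order cores, and it loses nothing on an OoO core, which is why already-tuned code doesn't suddenly see a big win there.

Code:
#include <algorithm>
#include <cstddef>

// Data-dependent branches like this get mispredicted on unsorted data,
// which is very expensive on a deep in-order pipeline.
void clamp_branchy(int* v, std::size_t n, int lo, int hi) {
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] < lo)      v[i] = lo;
        else if (v[i] > hi) v[i] = hi;
    }
}

// The tuned version: nothing to mispredict (min/max typically compile
// to select-style code). It runs just as well on an OoO core -- it
// simply no longer needed the trick.
void clamp_branchless(int* v, std::size_t n, int lo, int hi) {
    for (std::size_t i = 0; i < n; ++i)
        v[i] = std::min(std::max(v[i], lo), hi);
}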
 
Intel's targeting a different market with PCs. You have legacy code and a massive range of developer abilities writing on a zillion different platforms. x86 has had a very strong requirement in making bad code run fast. In precision software engineering jobs with a high standard of engineering, all those hardware extras don't achieve a great deal, in theory. However, as mentioned many times this gen, developers don't want to be writing CPU-hand-holding code. They'd much rather be able to whack code onto the console and have it run well without effort.

I'm not quite sure where Nintendo have gone with Wii U. Seems to be an awkward middle ground. Weak OoOE means, I presume, developers will still need to spend time optimising their code, especially given the lack of raw power in the CPU. If Nintendo had gone with simple cores with more grunt, I'd understand. If they went with a really easy-to-use CPU, I'd understand. But Wii U seems to offer not much of either. Almost as if their conversation with IBM went something like:

"We want it small and BC."
"Okay, we'll take the existing Broadway design and go multicore."
"But we want it better too."
"Okay, we can add some good out-of-order execution to make it easier to write for."
"Great. Only don't make it too big."
"Um, right. So we'll add...a little out-of-order support?"
"Yeah!"
 
There's a lot of existing code developers are and will be using. It was written and tuned for in-order, deep pipeline processors so it's not like an OoO CPU will run it significantly faster. If anything it's harder to optimize for an OoO CPU because it's less predictable than its less "brainy" counterpart*. Unless existing code gets dropped on the floor and developers stop thinking about branching, pairing and what not, there's little reason to believe that Wii U will get an advantage due to architectural differences. It's also important to keep in mind that code won't suddenly be written for Wii U alone (not by 3rd parties and not any time soon at least) so it will have to behave well on PS360. Calling GHz*threads inaccurate is really a straw man here. I haven't seen anyone doing the brain-dead math here saying that a*b=k*c*d therefore 360 is k times more powerful than Wii U. ;)

*Intel concluded some time ago that wimpy cores (deep pipelines, specialized, in-order) may give more raw power than brainy ones but very few people know how to write decent code for those. Game developers are a fairly rare breed that does.

Probably a fair amount of performance sensitive game code is still compiled C/C++/whatever and not hand written assembly, so you're at the mercy of the compiler to do a good job scheduling instructions, hinting branches, and performing prefetches (although that part at least can probably be done w/builtins and intrinsics). And while compilers have gotten fairly good I don't think they can really do as well as skilled humans at these tasks - particularly when they end up increasing register pressure more than a human would and cause more spills.
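
For what source-level hinting looks like (hypothetical sketch using GCC/Clang-style builtins; console toolchains have their own spellings for the same thing, and the function and macro names here are made up):

Code:
#include <cstddef>

// Whether hints like these actually help depends on the compiler and
// on how early the ISA lets the hint take effect -- which is the
// limitation discussed below.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int count_bad(const int* data, std::size_t n) {
    int errors = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (UNLIKELY(data[i] < 0)) {  // rare path, hinted as unlikely
            ++errors;
            continue;
        }
        // hot-path work would go here
    }
    return errors;
}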

Even if we do take perfectly scheduled code - ie, no stalls that could have been scheduled around - and all branches hinted at well in advance or predicted, Broadway is still better off per clock. While they're both primarily 2-wide cores (and both have 4-instruction fetch and AFAICT both can fold branches in the frontend), Broadway has two ALUs while the PPEs only have one. Given that the PPEs are in-order and have two-cycle ALU latency, you're going to see a lot of code get a lot faster per-cycle since it'll be able to run back-to-back ALU instructions in one cycle instead of two.
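
A toy example of the back-to-back ALU case (hypothetical C++; the latencies are the assumed figures stated above):

Code:
// Each add depends on the previous value of 'acc'. With the assumed
// latencies, a PPE-style in-order core with a single 2-cycle ALU can
// issue the add only every other cycle, while a Broadway-style core
// with two 1-cycle ALUs runs the adds back to back and pairs the
// loop-counter update into the second ALU.
int checksum(const int* v, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}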

In the real world, predicting or pre-hinting all code branches very well is just impossible. If you have code that branches every 5-6 instructions you don't have enough room to provide a hint far enough in advance to avoid the mispredict penalty. You'd at least need a way to cascade a bunch of hints in flight, and that'd be a big added challenge. The mispredict penalty on the PPEs is absolutely massive, 24 cycles vs just a few cycles for Broadway.

I'm not quite sure where Nintendo have gone with Wii U. Seems to be an awkward middle ground. Weak OoOE means, I presume, developers will still need to spend time optimising their code, especially given the lack of raw power in the CPU. If Nintendo had gone with simple cores with more grunt, I'd understand. If they went with a really easy-to-use CPU, I'd understand. But Wii U seems to offer not much of either. Almost as if their conversation with IBM went something like:

"We want it small and BC."
"Okay, we'll take the existing Broadway design and go multicore."
"But we want it better too."
"Okay, we can add some good out-of-order execution to make it easier to write for."
"Great. Only don't make it too big."
"Um, right. So we'll add...a little out-of-order support?"
"Yeah!"

You seem to be under the mistaken presumption that Broadway (and Gekko, and the original PowerPC 750s for that matter) was in-order in the first place. The rumors that OoO was "added" to Wii U on top of Broadway are incorrect. When we talk in this thread about Wii U having only weak OoO it's based on what is publicly known about Broadway and nothing else.

Taking Broadway cores and keeping them internally identical while only changing some details of the L2 interface and coherency is a lot easier than changing the whole core to add a "little" more OoO or execution resources, and it makes BC a lot more reliable. It's not hard to see why Nintendo would want to do this. It's pretty much the bare minimum that gives them some amount of reasonable multithreading (the clock speed bump was probably free after the die shrinks).
 
Intel's targeting a different market with PCs. You have legacy code and a massive range of developer abilities writing on a zillion different platforms. x86 has had a very strong requirement in making bad code run fast. In precision software engineering jobs with a high standard of engineering, all those hardware extras don't achieve a great deal, in theory. However, as mentioned many times this gen, developers don't want to be writing CPU-hand-holding code. They'd much rather be able to whack code onto the console and have it run well without effort.
I bet people would rather write nice code w/o thinking about arch. But they have been thinking about arch while writing their code for some time now. Otherwise we'd have seen less improvement on consoles in the past 6 years or so. If you write any high-frequency code for PS360, you'll face the need to optimize - streamline, get rid of jumps/calls, etc. A lot of engine code is being reused from title to title. There's a lot of code that works great on PS360 that won't benefit a lot from an OoO CPU, and it's not like developers will start writing code cowboy style - PS360 is still there as a target for their titles.

And yes, Intel is in a different market, so your statement validates my previous claim. Intel is in the "brainy" market: lots of code of varying quality. Console code is mostly driven by guys with a lot of expertise. A lot of effort is being put into optimizations. This is why game developers can and do deal with the "wimpy" cores that are in PS360.
 
And yes, Intel is in a different market, so your statement validates my previous claim. Intel is in the "brainy" market: lots of code of varying quality. Console code is mostly driven by guys with a lot of expertise. A lot of effort is being put into optimizations. This is why game developers can and do deal with the "wimpy" cores that are in PS360.

This hasn't been true for over a decade.

IMO having worked in and outside the games industry, the games industry has about the same mix of experts as most other industries doing large scale software development.
20 years ago when team sizes were small and everyone in the industry was self taught and highly motivated it was different. When you're hiring out of college and it's just a job to a lot of people, you're in no better position than any other industry.

There are exceptional teams in the games industry, but that's true of many other industries as well.
 