Wii U hardware discussion and investigation

Well Google found this for me:

https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/A88091CAFE0F19CE852575EE0073078A/$file/To%20CL%20-%20CL%20Special%20Features%206-22-09.pdf

Section 4, "Paired-Single Precision Floating Point Operations", doesn't seem to give any indication of SIMD-like integer capabilities...?
 
Thanks again.

So the performance difference for integer workloads is potentially even greater than for floating point workloads. So much for Nbench-based beard stroking then.

After 7+ years of vectorising as many performance critical tasks as possible, I guess it makes sense that developers looked at the Wii U with its float-only paired singles and groaned.
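
To put that in concrete terms, here's a rough plain-C sketch (the function and data names are made up) of why paired singles only help float-heavy loops: you get two single-precision lanes per op, but nothing for packed integers, so byte and int crunching stays scalar on Espresso.

Code:
/* Rough sketch, hypothetical functions/data: paired singles give two
 * single-precision float lanes per op, but no packed-integer lanes. */
#include <stdint.h>

/* Float workload: a compiler (or hand-written asm) targeting
 * Gekko/Broadway/Espresso can map each pair of adjacent floats onto one
 * paired-single multiply-add. Assumes n is even. */
void scale_bias_f32(float *dst, const float *src, float s, float b, int n)
{
    for (int i = 0; i < n; i += 2) {       /* two lanes per paired-single op */
        dst[i + 0] = src[i + 0] * s + b;
        dst[i + 1] = src[i + 1] * s + b;
    }
}

/* Integer workload: with no packed-integer unit, every element is a separate
 * scalar op; this is where the gap vs. VMX/SSE-class SIMD opens up. */
void scale_bias_u8(uint8_t *dst, const uint8_t *src, int s, int b, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)((src[i] * s + b) >> 8);
}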
 
Yeah, although people often do use "integer" synonymously with "scalar" or "branchy" code, stuff that doesn't vectorize well. Some people are still scratching their heads over what integer SIMD is useful for, although I personally use it like crazy (and rarely use float SIMD).
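
For illustration only (this is x86 SSE2, nothing Wii U specific), the sort of byte-crunching loop where integer SIMD earns its keep, e.g. brightening 8-bit pixels with saturation, sixteen at a time:

Code:
/* Illustration only (x86 SSE2, not Wii U code): the kind of byte-crunching
 * where integer SIMD pays off, e.g. brightening 8-bit pixels with saturation. */
#include <emmintrin.h>
#include <stdint.h>

void brighten_u8(uint8_t *px, int n, uint8_t amount)
{
    __m128i add = _mm_set1_epi8((char)amount);
    int i = 0;
    for (; i + 16 <= n; i += 16) {                        /* 16 pixels per op */
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
        v = _mm_adds_epu8(v, add);                        /* saturating add */
        _mm_storeu_si128((__m128i *)(px + i), v);
    }
    for (; i < n; i++) {                                  /* scalar tail */
        int v = px[i] + amount;
        px[i] = (uint8_t)(v > 255 ? 255 : v);
    }
}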

There was at least one dev who gave an interesting insight into Wii U optimization: he basically said to go for L2 cache locality or go home. On aggregate Wii U has substantially more L2 cache than XBox360.

Sizing out algorithms to try to fit a particular cache size is often not very high on a typical developer's check list, although in the case of XBox360 it was probably pretty vital to be as cache resident as possible since the main RAM latency was so bad. It could be that the latency isn't that great on Wii U either, what with it going through a memory controller on the GPU. So that means resize your stuff to try to get a better hit-rate on Wii U's L2 if you can.

But the weird 2MB + 512KB + 512KB local L2 structure is very different from a shared 1MB, and I'm sure that presents some of its own challenges (there could be a substantial penalty for sharing data structures between cores now; it could even be as bad as going out to memory or even worse, depending on how they did it).
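
A minimal sketch of that advice, with made-up names and a hypothetical record type: process the data in blocks sized to stay resident in one core's 512KB L2, and run every pass over a block before moving on, instead of streaming the whole working set through the cache once per pass.

Code:
/* Minimal sketch of "size for the L2 or go home". BLOCK_BYTES, Item and the
 * update passes are made-up names purely for illustration. */
#include <stddef.h>

#define L2_BYTES    (512 * 1024)
#define BLOCK_BYTES (L2_BYTES / 2)   /* leave headroom for code, stack, other data */

typedef struct { float pos[3]; float vel[3]; } Item;   /* hypothetical 24-byte record */

static void update_pass_a(Item *it) { it->pos[0] += it->vel[0]; }  /* stand-in work */
static void update_pass_b(Item *it) { it->vel[1] *= 0.99f; }       /* stand-in work */

void update_all(Item *items, size_t count)
{
    size_t per_block = BLOCK_BYTES / sizeof(Item);
    for (size_t base = 0; base < count; base += per_block) {
        size_t end = base + per_block < count ? base + per_block : count;
        /* Run both passes over one cache-sized block before moving on, instead
         * of streaming the entire array through L2 once per pass. */
        for (size_t i = base; i < end; i++) update_pass_a(&items[i]);
        for (size_t i = base; i < end; i++) update_pass_b(&items[i]);
    }
}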
 
One would hope the CPU die contains some mechanism to share data between CPU caches. Otherwise you'll be doing a round-trip through that not-particularly-awesome 12GB/s main RAM, which'll be hammered by GPU accesses at the same time.
 
I agree, one would definitely hope, but I would have hoped for a lot of things that Nintendo didn't do here :p
 
I had originally thought that the 2MB L2 of the "master core" might have been a way of trying to engineer what was effectively an L3 cache without actually engineering the L3 bit, i.e. L2 cache misses from the "slave" cores would benefit from improved performance and save on main memory bandwidth by pulling data from the large L2 of the master core.

But it sounds like that's the last thing that you'd want. It sounds like the large L2 of the master core is actually there to assuage the fears of developers who are scared shitless of symmetrical multicore development, and who need a core that they can fall back on that can easily handle the L2 accesses for large data sets.

Kind of seems to go against the idea of symmetrical multicore, i.e. some algorithms will need to be optimised for the non-master cores rather than the master core, or they'll suffer from performance penalties. 3 x 1MB would have made a lot more sense from a layperson's perspective: fulfilling the objective (as stated explicitly in the Project Cafe leaked slides) of making 360 ports easily manageable by allowing L2 datasets of the same size as the 360's shared cache to fit in each of Espresso's CPU L2 caches.
 
I'm having difficulties understanding how 3x1MB caches would be superior to 2x1MB + 1x2MB, from any perspective.

Just doesn't make sense to me, but perhaps there's some drug that's currently lacking in my system that could shed additional enlightenment...? :)
 
Espresso is 1 x 2MB and 2 x 0.5 MB for a total of 3 MB.

I was suggesting 3 x 1 MB (same total L2 and presumably same die size) so that the "weaker" 0.5 MB cores would have improved performance (and have access to the same amount of L2 as each Xenon core), and so that you could run any task on any core with the same level of performance (greater flexibility in scheduling tasks).
 
Can you ever have too much cache?
Yes. The more you have, the slower it is to access (higher latency). A balance has to be struck between quantity and speed. To gain the benefits of different quantity and speed combinations, different cache layers are used - instruction and data caches, L1, L2, and sometimes L3, each smaller and faster than the next (this concept extends to RAM and storage, all increasing capacity at a reduction in speed).
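
The size/latency trade-off is easy to see with a bog-standard pointer-chasing microbenchmark; a quick-and-dirty sketch below (exact numbers will vary wildly between machines):

Code:
/* Quick-and-dirty pointer chase: time per dependent load jumps each time the
 * working set outgrows a cache level (L1 -> L2 -> main RAM). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define HOPS 10000000L

static double ns_per_hop(size_t elems)
{
    size_t *next = malloc(elems * sizeof *next);
    for (size_t i = 0; i < elems; i++) next[i] = i;
    for (size_t i = elems - 1; i > 0; i--) {   /* Sattolo shuffle: one big cycle */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    volatile size_t p = 0;                     /* volatile so the loop isn't optimised away */
    clock_t t0 = clock();
    for (long h = 0; h < HOPS; h++) p = next[p];   /* dependent loads: pure latency */
    double ns = 1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / HOPS;
    free(next);
    return ns;
}

int main(void)
{
    for (size_t kb = 16; kb <= 8192; kb *= 2)
        printf("%5zu KB working set: %.1f ns per access\n",
               kb, ns_per_hop(kb * 1024 / sizeof(size_t)));
    return 0;
}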
 
Espresso is 1 x 2MB and 2 x 0.5 MB for a total of 3 MB.
Ah, right. I forgot; my bad.

and so that you could run any task on any core with the same level of performance (greater flexibility in scheduling tasks).
I wonder what real performance difference there is between half and one meg of cache, considering say an Intel Core i-series CPU has only 256k L2 (although paired with an L3) and manages quite well. Perhaps only on the order of a few percent in most cases? Making cache four times larger for one out of the three cores is odd though, but then again SO MUCH of the wuu is just fricken odd, so better not get hung up on this particular oddity, eh? :)
 
If the developer who swears that they only got acceptable performance by optimizing for Wii U's cache sizes is to be believed (sorry I don't remember the source) then I doubt the difference between 512KB and 2MB is negligible. It's not just that Core i has L3 cache but also most likely much lower latency (and higher bandwidth) main RAM, much more reordering capability, automatic prefetching, SMT, and more sophisticated memory controllers with more concurrency. Many more facilities for hiding the cost of a cache miss.
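
One of the few levers you do get on a core without those facilities is software prefetch. A small sketch, assuming a GCC/Clang-style compiler (__builtin_prefetch is their builtin; the gather pattern itself is made up), of manually hiding part of the miss cost:

Code:
/* Sketch: hide part of the miss latency by prefetching future gather targets,
 * the kind of irregular access a simple hardware prefetcher can't predict.
 * __builtin_prefetch is a GCC/Clang builtin; everything else is made up. */
float gather_sum(const float *data, const int *idx, long n)
{
    float sum = 0.0f;
    for (long i = 0; i < n; i++) {
        if (i + 8 < n)                                    /* look a few iterations ahead */
            __builtin_prefetch(&data[idx[i + 8]], 0, 1);  /* 0 = read, 1 = low temporal locality */
        sum += data[idx[i]];
    }
    return sum;
}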

It makes you think, if half as much L2 cache would have been about as good then they really shouldn't have bothered going with eDRAM on this. But then again, I don't know if there was a great technical reason for this, they could have just been suckered into whatever IBM wanted to sell them. Ideally they wouldn't be getting IBM to fab this at all, and would have managed some kind of CPU on the same die as the rest of the stuff.
 
Yes. The more you have, the slower it is to access (higher latency). A balance has to be struck between quantity and speed. To gain the benefits of different quantity and speed combinations, different cache layers are used - instruction and data caches, L1, L2, and sometimes L3, each smaller and faster than the next (this concept extends to RAM and storage, all increasing capacity at a reduction in speed).
Cache size seems to affect power usage as well. (That's the reason why NVIDIA proposed an additional 1KB L0 cache.)
 
If the developer who swears that they only got acceptable performance by optimizing for Wii U's cache sizes is to be believed (sorry I don't remember the source) then I doubt the difference between 512KB and 2MB is negligible.
GameCube only had 256k cache and ~2.6GB/s main RAM bandwidth (although ridiculously low latency, at least from the GPU side, since that's where the memory controller is), and it did fine. 512k caches get a 90+ percent hit rate in the typical case, so I have a hard time seeing how the diff between 512k and 2MB would really be all that monstrous. Must be something else going on as well, methinks, if your source is correct.
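
For what it's worth, a back-of-envelope AMAT (average memory access time) calculation with assumed, not measured, numbers (roughly a 20-cycle L2 hit and a 300-cycle miss to main RAM) shows why the last few percent of hit rate is where the time goes:

Code:
/* AMAT = hit_time + miss_rate * miss_penalty. The 20/300 cycle figures are
 * assumptions for illustration, not measured Wii U numbers. */
#include <stdio.h>

int main(void)
{
    const double hit = 20.0, penalty = 300.0;
    const double rates[] = { 0.90, 0.95, 0.98 };
    for (int i = 0; i < 3; i++)
        printf("hit rate %.0f%% -> %.0f cycles average\n",
               rates[i] * 100.0, hit + (1.0 - rates[i]) * penalty);
    return 0;   /* prints 50, 35 and 26 cycles respectively */
}

Going from 90% to 98% hits nearly halves the average access cost when misses are that expensive, so a bigger L2 can matter more than the raw hit-rate numbers suggest.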

It makes you think, if half as much L2 cache would have been about as good then they really shouldn't have bothered going with eDRAM on this.
L2 wouldn't have helped GPU performance; without eDRAM the system would have choked on that pathetic 12GB/s main RAM B/W.

But then again, I don't know if there was a great technical reason for this, they could have just been suckered into whatever IBM wanted to sell them.
You really think nintendo's dumb enough to be suckered into a deal like this? They've been building consoles for thirty years, you'd think they would know better than THAT at least. :)

Ideally they wouldn't be getting IBM to fab this at all
Why not? IBM has an advanced eDRAM process, and they're the developer of the CPU core, so I would think they'd be ideal for ninty to work with. Especially since they probably have spare fab capacity too and might be able to offer a good price, since they only make chips for the big iron enterprise market, which isn't all that big despite the moniker. :)
 
Do you think it gets 90% hits on data access too? My guess is you can hold a pretty big model in that cache and perform the operations you want to do on it sequentially.
 
GameCube only had 256k cache and ~2.6GB/s main RAM bandwidth (although ridiculously low latency, at least from the GPU side, since that's where the memory controller is), and it did fine. 512k caches get a 90+ percent hit rate in the typical case, so I have a hard time seeing how the diff between 512k and 2MB would really be all that monstrous. Must be something else going on as well, methinks, if your source is correct.

GameCube's competitor, the PS2, had no L2 cache at all... did it not do fine? What is your criteria here exactly? We're talking about older systems with much lower clock speeds working on different datasets; things change.

Surely you don't think everyone else in the industry was also wrong for making caches bigger and bigger.

L2 wouldn't have helped GPU performance; without eDRAM the system would have choked on that pathetic 12GB/s main RAM B/W.

I'm not sure you're aware that there are two eDRAM pools on the system. The L2 cache itself is implemented with eDRAM. I'm not talking about the eDRAM on the GPU chip.

You really think nintendo's dumb enough to be suckered into a deal like this?

Yes.

Why not? IBM has an advanced eDRAM process, and they're the developer of the CPU core, so I would think they'd be ideal for ninty to work with. Especially since they probably have spare fab capacity too and might be able to offer a good price, since they only make chips for the big iron enterprise market, which isn't all that big despite the moniker. :)

So do they need the eDRAM for L2 cache or not? If you think 512KB per core is enough then the answer would be "no." The GPU chip has nothing to do with IBM.

You're right, they're ideal for Nintendo because Nintendo over-values backwards compatibility. Had they sacrificed it and moved on to a better CPU they could have done a unified SoC, and a more powerful one at that. Everyone else is dropping BC; I don't think they needed it to survive, not at the expense of a better system.

Mind you, going with Renesas is looking pretty bad too.
 
720p 32-bit color + Z needs only 7.2MB regardless of the platform. No need to tile this res on the 360; it fits in eDRAM just fine.
720p with FSAA is 14/28 MB for 2x/4x. I'm not sure what Megafenix is talking about, because it is the same everywhere, but you would have to tile with FSAA.
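
Those figures check out from the raw arithmetic: 1280x720 pixels, 4 bytes of color plus 4 bytes of Z per sample, scaled by the MSAA sample count (no compression assumed).

Code:
/* Framebuffer footprint: width * height * (color + Z bytes) * MSAA samples. */
#include <stdio.h>

int main(void)
{
    const double w = 1280, h = 720, bytes_per_sample = 4 /* color */ + 4 /* Z */;
    const int samples[] = { 1, 2, 4 };
    for (int i = 0; i < 3; i++) {
        double mib = w * h * bytes_per_sample * samples[i] / (1024.0 * 1024.0);
        printf("%dxAA: %.1f MB\n", samples[i], mib);   /* ~7.0, 14.1 and 28.1 MB */
    }
    return 0;
}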

Do we know how much eDRAM is there?
 