Wii U hardware discussion and investigation *rename

That's probably fruitless speculation, I'd think. It may also be that core 1 has 2 MB of L2 only because the chip would have been pad-limited otherwise - this is nothing but speculation as well, I might add, but seeing how tiny the die is, it just could be true...

Even so, there's an awful lot of die area in between the L2 and cores in particular, as well as along the top edge of the image, that isn't labelled as anything specific. Do those portions actually serve a purpose, or are they for all intents and purposes just dead space?
 
That raises the question of whether 1 main core with a large 2 MB L2 cache and 2 supporting cores with 512 kB L2 caches extract the most concurrency and performance from the transistor budget. Looking at the die, it seems like 1 core is roughly the size of 1 MB of L2 cache, which means Nintendo could alternatively have gone with 4 symmetric cores, each with 512 kB of L2, without significantly increasing the die area. It's no doubt situation dependent, but overall, does the main core's extra 1.5 MB of L2 make up for not having a 4th core?
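
As a rough back-of-the-envelope check on that trade-off - a sketch only, assuming one core really does take about the die area of 1 MB of L2 as eyeballed above - the two layouts come out to roughly the same area:

Code:
# Toy die-area comparison in "MB-of-L2 equivalents".
# Assumption (from the eyeball estimate above): one core ~ area of 1 MB of L2.
CORE_AREA = 1.0

asymmetric = 3 * CORE_AREA + (2.0 + 0.5 + 0.5)  # 3 cores, 2 MB + 512 kB + 512 kB L2
symmetric  = 4 * CORE_AREA + 4 * 0.5            # 4 cores, 512 kB L2 each

print(f"asymmetric 3-core layout: ~{asymmetric:.1f} area units")  # ~6.0
print(f"symmetric 4-core layout:  ~{symmetric:.1f} area units")   # ~6.0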

Well, I was specifically referring to chips with a large number of cores...

The more capacity your cache has, the less you have to go out to main memory, and the closer your average memory access latency is to the very fast latency of your cache rather than the slow latency of main memory.

The smaller your cache, the higher the probability that the data you want isn't there and you 'miss' the cache. Missing a read from an instruction cache can grind the entire thread to a halt for the time it takes to access main memory, which can seriously cripple performance.
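
The textbook way to put numbers on that is average memory access time (AMAT) = hit time + miss rate x miss penalty. A minimal sketch, with purely illustrative cycle counts (not Espresso figures):

Code:
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative latencies only: ~10-cycle L2 hit, ~200-cycle round trip to main memory.
for hit_rate in (0.89, 0.92, 0.94, 0.97):
    print(f"L2 hit rate {hit_rate:.0%}: average access ~{amat(10, 1 - hit_rate, 200):.0f} cycles")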

But another core is an entire extra core that can process simultaneously, and that's a big deal. I mean, these are both really important to performance.

On principle, on a chip that already has a lot of cores, I would absolutely choose more cache over another core, but IIRC, after you get to 1 MB your miss probability is really, really low, and increasing it further just won't help much more.

But on a tri-core, I'd choose another core.

Then again, I don't design custom CPUs, so...
 
On principle, on a chip that already has a lot of cores, I would absolutely choose more cache over another core, but IIRC, after you get to 1 MB your miss probability is really, really low, and increasing it further just won't help much more.

While the overall reasoning is sound, I'd shy away from picking a cut-off number for cache size.
First, because average hit rate (averaged over a wide set of applications) increases along a smooth function, making any cut-off point rather arbitrary (why 89%? why not 92% or 94%?).

Second, because cache hit rates are very application specific, depending on the code AND the data set, the average latency of a cache hit vs. the average latency of a cache miss, and to what extent the data bursted in from main memory is actually useful... It's really complex, and doesn't boil down to a single ideal cache size for either L1, L2 or L3.
At some point, using additional die area for CPU cores vs. cache will indeed balance out fairly evenly when looked at over a great number of general applications, but since it is so application and data-access dependent, it is pretty damn difficult to pinpoint the exact optimum. However, I wouldn't sweat it too much, since the optimum seems to typically be fairly flat, i.e. overall the differences of going either way will be modest when you are close to it. And the optimum for general code is typically quite low on cores. (For general x86 code it is lower than 4, and actually lower than 2, last I looked. Tons of caveats though, and a console CPU isn't running general x86 code either.)
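
To illustrate why any single cut-off is arbitrary: if you assume the rough rule of thumb that miss rate falls with roughly the square root of cache size (a crude power-law fit over "average" workloads, not a property of any particular application, and the 10% miss rate at 256 kB is an invented baseline), the curve just keeps flattening smoothly with no obvious knee:

Code:
# Crude diminishing-returns illustration, assuming miss_rate ~ 1/sqrt(cache size).
# The 10% miss rate at 256 kB is an invented baseline for the example.
BASE_KB, BASE_MISS = 256, 0.10

for size_kb in (256, 512, 1024, 2048, 4096):
    miss = BASE_MISS * (BASE_KB / size_kb) ** 0.5
    print(f"{size_kb:>4} kB L2: ~{1 - miss:.1%} hit rate")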
 
Even so, there's an awful lot of die area in between the L2 and cores in particular, as well as along the top edge of the image, that isn't labelled as anything specific. Do those portions actually serve a purpose, or are they for all intents and purposes just dead space?

Is it possible that it's area used to run traces to cope with the eDRAM's latency without having to buffer it? I hope that doesn't sound totally silly.
 
While the overall reasoning is sound, I'd shy away from picking a cut-off number for cache size.

First, because average hit rate (averaged over a wide set of applications) increases along a smooth function, making any cut-off point rather arbitrary (why 89%? why not 92% or 94%?).

Second, because cache hit rates are very application specific, depending on the code AND the data set, the average latency of a cache hit vs. the average latency of a cache miss, and to what extent the data bursted in from main memory is actually useful... It's really complex, and doesn't boil down to a single ideal cache size for either L1, L2 or L3.

At some point, using additional die area for CPU cores vs. cache will indeed balance out fairly evenly when looked at over a great number of general applications, but since it is so application and data-access dependent, it is pretty damn difficult to pinpoint the exact optimum. However, I wouldn't sweat it too much, since the optimum seems to typically be fairly flat, i.e. overall the differences of going either way will be modest when you are close to it. And the optimum for general code is typically quite low on cores. (For general x86 code it is lower than 4, and actually lower than 2, last I looked. Tons of caveats though, and a console CPU isn't running general x86 code either.)

Heh, thanks for the info... and possibly a brain hemorrhage :p
 
Don't forget that the Wii CPU had 1 MB of L2 too (IIRC); it may be there for BC and easy porting of Nintendo engines!

Plus, a large cache makes it easier to develop for, runs cooler, and is probably quite good for real-world performance (and framerate) too? I'd prefer to have less but stable performance rather than seeing it stutter each time it takes on a few more things and isn't really up to the task because of unpredictable performance.


We spent years hearing how hard the PS360 memory (sub)systems hit performance; now we have a console that is 1.5-2x better in raw numbers and has a great memory system, plus four times the RAM. It should be quite interesting to see what they can extract from it.
 
Wii's Broadway CPU only has 256KB of L2 cache. It's a shrink of the old Gamecube Gekko CPU which also had 256KB of L2. 256KB of on-die L2 cache was standard for processors of that era - Coppermine P3, Willamette P4, and Palomino AthlonXP all had 256KB of L2.
 
http://barefeats.com/g4up2.html

For interest, I haven't seen any benchmarks between the 512kB L2 PowerPC 750FX and the 1MB L2 PowerPC 750GX, but there are some benchmarks available comparing the 512kB L2 PowerPC 7447A and 1MB L2 PowerPC 7448. In the game results, doubling the L2 cache to 1MB yields a 13% fps boost in Doom 3 and an 11% boost in Halo, which is pretty effective. The caveat is that the QuickSilver Power Mac these G4 processor upgrade kits were tested on had a 133MHz FSB and PC133 SDRAM, so they were probably bottlenecked by system memory, which could exaggerate the benefits of more L2 cache.
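
That caveat can be sketched with Amdahl-style arithmetic: the fps gain from a bigger L2 is bounded by the fraction of frame time spent stalled on memory, so a system starved by a 133MHz FSB has more to gain. The stall fractions below are invented for illustration, not measured from those Macs:

Code:
# Amdahl-style sketch: fps gain if a bigger L2 removes some share of memory-stall time.
def fps_gain(stall_fraction, stall_reduction):
    new_frame_time = (1 - stall_fraction) + stall_fraction * (1 - stall_reduction)
    return 1 / new_frame_time - 1

# Same hypothetical cache improvement (removes 30% of stalls) on two hypothetical systems:
print(f"memory-starved system (40% of frame time stalled): +{fps_gain(0.40, 0.30):.0%} fps")  # ~+14%
print(f"fast-memory system    (15% of frame time stalled): +{fps_gain(0.15, 0.30):.0%} fps")  # ~+5%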
 
http://barefeats.com/g4up2.html

For interest, I haven't seen any benchmarks between the 512kB L2 PowerPC 750FX and the 1MB L2 PowerPC 750GX, but there are some benchmarks available comparing the 512kB L2 PowerPC 7447A and 1MB L2 PowerPC 7448. In the game results, doubling the L2 cache to 1MB yields a 13% fps boost in Doom 3 and an 11% boost in Halo, which is pretty effective. The caveat is that the QuickSilver Power Mac these G4 processor upgrade kits were tested on had a 133MHz FSB and PC133 SDRAM, so they were probably bottlenecked by system memory, which could exaggerate the benefits of more L2 cache.

G4s? Don't those have AltiVec? I remember paired singles used to be pretty good, until things like AltiVec came along. I wonder if Nintendo improved the paired-singles vector math at all. Or perhaps they see GPGPU as a solution/compromise.

I actually have some 750CX vs. FX vs. GX documents around that detail the differences in cache size, branch prediction, and whatnot. I'll dig them up.
 
Wii's Broadway CPU only has 256KB of L2 cache. It's a shrink of the old Gamecube Gekko CPU which also had 256KB of L2. 256KB of on-die L2 cache was standard for processors of that era - Coppermine P3, Willamette P4, and Palomino AthlonXP all had 256KB of L2.

Thanks for the correction.

http://barefeats.com/g4up2.html

For interest, I haven't seen any benchmarks between the 512kB L2 PowerPC 750FX and the 1MB L2 PowerPC 750GX, but there are some benchmarks available comparing the 512kB L2 PowerPC 7447A and 1MB L2 PowerPC 7448. In the game results, doubling the L2 cache to 1MB yields a 13% fps boost in Doom 3 and an 11% boost in Halo, which is pretty effective. The caveat is that the QuickSilver Power Mac these G4 processor upgrade kits were tested on had a 133MHz FSB and PC133 SDRAM, so they were probably bottlenecked by system memory, which could exaggerate the benefits of more L2 cache.


Isn't the main problem the latency? Didn't that increase with DDR3? Maybe a faster FSB helped, but on the other hand DDR3 may have made those improvements moot?
 
Or perhaps they see GPGPU as a solution/compromise.
AMD's VLIW5 arch is very impractical for GPGPU; performance is shit most of the time. It was designed for graphics, not compute. Considering how weak wuu's GPU is to begin with, we won't be seeing much GPGPU on it, I'm wagering. Since it's based on Radeon HD 4000-era tech, it most likely lacks all the GCN optimizations for co-scheduling compute and graphics tasks and so on; it probably won't mix well, kind of like running compute tasks on an older GPU in Windows and trying to use the desktop for anything at the same time...
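
The usual explanation for why VLIW5 struggles with compute is slot occupancy: the compiler has to pack up to five independent ops into each bundle, and, roughly speaking, graphics shaders tend to fill the slots far better than scalar-heavy compute kernels. A toy illustration with hypothetical occupancy figures (not measured Latte numbers):

Code:
# Toy effective-throughput comparison for a 5-wide VLIW lane.
# Occupancy figures are hypothetical, just to show the shape of the problem.
SLOTS = 5

for workload, occupancy in (("graphics shader", 0.70), ("compute kernel", 0.40)):
    effective = SLOTS * occupancy
    print(f"{workload:>15}: ~{effective:.1f} of {SLOTS} slots busy per clock "
          f"(~{occupancy:.0%} of peak ALU throughput)")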
 
AMD's VLIW5 arch is very impractical for GPGPU; performance is shit most of the time. It was designed for graphics, not compute. Considering how weak wuu's GPU is to begin with, we won't be seeing much GPGPU on it, I'm wagering. Since it's based on Radeon HD 4000-era tech, it most likely lacks all the GCN optimizations for co-scheduling compute and graphics tasks and so on; it probably won't mix well, kind of like running compute tasks on an older GPU in Windows and trying to use the desktop for anything at the same time...

Considering the customization done to this GPU from its original R700 base, I'm not sure we can be certain that whatever is in Latte is VLIW5.
 
If it isn't VLIW5, then why did they start off with R700? It'd be like building a custom truck by taking an existing truck, tearing it all down and throwing everything away, then re-building a new truck with all-new components. Doesn't make sense, and I'm sure this isn't what they've done.
 
If it isn't VLIW5, then why did they start off with R700? It'd be like building a custom truck by taking an existing truck, tearing it all down and throwing everything away, then re-building a new truck with all-new components. Doesn't make sense, and I'm sure this isn't what they've done.
Nintendo apparently started developing Latte in 2009, when R700 was the current generation of AMD GPUs. IIRC, development of the chip wasn't finished until 2011, so there was time for the chip to evolve from its base. Considering the difference in the appearance of the internal workings of the processor compared to the original R700, it is not something to completely ignore.
 
Nintendo apparently started developing Latte in 2009, when R700 was the current generation of AMD GPUs.
Doesn't make sense to fixate on a piece of hardware which will have become completely obsolete when you don't intend to launch for nearly half a decade. Did either Sony or MS focus on obsolete tech in their consoles? No. Methinks nintendo picked R700 not because they've been slow-cooking wuugpu since 2009, but rather because it was cheaper than any other IP that AMD could offer, and that actual hardware development was (way) shorter than four years.

Shit, in four years nintendo could have had AMD brew up an entirely novel architecture from start to finish rather than start off with an existing one.

IIRC, development of the chip wasn't finished until 2011, so there was time for the chip to evolve from its base.
There's no reason to assume that the base is anything but what it looks like, especially with nintendo. VLIW5 is part of the fundamental workings of that hardware design; you change that and you need to change pretty much everything. Even going for Cayman's slightly refined VLIW4 (which is also mostly crap for GPGPU, one might add) means you get DX11 as well, and we have no information whatsoever that wuu supports DX11.

Considering the difference in the appearance of the internal workings of the processor compared to the original R700, it is not something to completely ignore.
Custom layout is one thing, fundamental inner workings something else. Occam's razor speaks against that, plus nintendo's cheapskatedness as well.
 
Doesn't make sense to fixate on a piece of hardware which will have become completely obsolete when you don't intend to launch for nearly half a decade. Did either Sony or MS focus on obsolete tech in their consoles? No. Methinks nintendo picked R700 not because they've been slow-cooking wuugpu since 2009, but rather because it was cheaper than any other IP that AMD could offer, and that actual hardware development was (way) shorter than four years.

Shit, in four years nintendo could have had AMD brew up an entirely novel architecture from start to finish rather than start off with an existing one.

There's no reason to assume that the base is anything but what it looks like, especially with nintendo. VLIW5 is part of the fundamental workings of that hardware design; you change that and you need to change pretty much everything. Even going for Cayman's slightly refined VLIW4 (which is also mostly crap for GPGPU, one might add) means you get DX11 as well, and we have no information whatsoever that wuu supports DX11.

Custom layout is one thing, fundamental inner workings something else. Occam's razor speaks against that, plus nintendo's cheapskatedness as well.

I think you need to read all of the Latte thread on NeoGAF before making such blanket statements. There are folks there spending hours upon hours looking into the chip and coming up with very interesting theories.

You may still be right but I don't think it's as black and white as you're making out.
 
Nintendo went all in and spent years developing a very highly customised, almost unique GPU arch in order to do something that R7xx could do on its own - deliver Xbox 360 level performance.

Nope. Really not feeling it.
 