Wii U hardware discussion and investigation *rename

That's probably fruitless speculation, I'd think. It may also be that core 1 has 2 MB of L2 only because the chip would have been pad-limited otherwise - this is nothing but speculation as well, I might add, but seeing how tiny the die is, it just could be true...

Even so, there's an awful lot of die area in between the L2 and cores in particular, as well as along the top edge of the image, that isn't labelled as anything specific. Do those portions actually serve a purpose, or are they for all intents and purposes just dead space?
 
That raises the question of whether 1 main core with a large 2 MB L2 cache and 2 supporting cores with 512 kB L2 caches extract the most concurrency and performance from the transistor budget. Looking at the die, it seems like 1 core is roughly the size of 1 MB of L2 cache, which means Nintendo could alternatively have gone with 4 symmetric cores, each with 512 kB of L2, without significantly increasing the die area. It's no doubt situation dependent, but overall, does the main core's extra 1.5 MB of L2 make up for not having a 4th core?
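
As a rough back-of-the-envelope check on that trade-off - a sketch only, assuming one core really does take about the die area of 1 MB of L2 as eyeballed above - the two layouts come out to roughly the same area:

Code:
# Toy die-area comparison in "MB-of-L2 equivalents".
# Assumption (from the eyeball estimate above): one core ~ area of 1 MB of L2.
CORE_AREA = 1.0

asymmetric = 3 * CORE_AREA + (2.0 + 0.5 + 0.5)  # 3 cores, 2 MB + 512 kB + 512 kB L2
symmetric  = 4 * CORE_AREA + 4 * 0.5            # 4 cores, 512 kB L2 each

print(f"asymmetric 3-core layout: ~{asymmetric:.1f} area units")  # ~6.0
print(f"symmetric 4-core layout:  ~{symmetric:.1f} area units")   # ~6.0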

Well, I was specifically referring to chips with a large number of cores...

The more capacity your cache has, the less you have to go out to main memory, and the closer your average memory access latency is to the very fast latency of your cache rather than the slow latency of main memory.

The smaller your cache, the higher the probability that the data you want isn't there and you 'miss' the cache. Missing a read from an instruction cache can grind the entire thread to a halt for the time it takes to access main memory, which can seriously cripple performance.
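
The textbook way to put numbers on that is average memory access time (AMAT) = hit time + miss rate x miss penalty. A minimal sketch, with purely illustrative cycle counts (not Espresso figures):

Code:
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative latencies only: ~10-cycle L2 hit, ~200-cycle round trip to main memory.
for hit_rate in (0.89, 0.92, 0.94, 0.97):
    print(f"L2 hit rate {hit_rate:.0%}: average access ~{amat(10, 1 - hit_rate, 200):.0f} cycles")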

But another core is an entire extra core that can process simultaneously, and that's a big deal. I mean, these are both really important to performance.

On principle, on a chip that already has a lot of cores, I would absolutely choose more cache over another core, but IIRC, after you get to 1 MB your miss probability is really, really low, and increasing it further just won't help much more.

But on a tri-core, I'd choose another core.

Then again, I don't design custom CPUs, so...
 
On principle, on a chip that already has a lot of cores, I would absolutely choose more cache over another core, but IIRC, after you get to 1 MB your miss probability is really, really low, and increasing it further just won't help much more.

While the overall reasoning is sound, I'd shy away from picking a cut-off number for cache size.
First, because average hit rate (averaged over a wide set of applications) increases along a smooth function, making any cut-off point rather arbitrary (why 89%? why not 92% or 94%?).

Second, because cache hit rates are very application specific, depending on the code AND the data set, the average latency of a cache hit vs. the average latency of a cache miss, and to what extent the data bursted in from main memory is actually useful... It's really complex, and doesn't boil down to a single ideal cache size for either L1, L2 or L3.
At some point, using additional die area for CPU cores vs. cache will indeed balance out fairly evenly when looked at over a great number of general applications, but since it is so application and data-access dependent, it is pretty damn difficult to pinpoint the exact optimum. However, I wouldn't sweat it too much, since the optimum seems to typically be fairly flat, i.e. overall the differences of going either way will be modest when you are close to it. And the optimum for general code is typically quite low on cores. (For general x86 code it is lower than 4, and actually lower than 2, last I looked. Tons of caveats though, and a console CPU isn't running general x86 code either.)
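
To illustrate why any single cut-off is arbitrary: if you assume the rough rule of thumb that miss rate falls with roughly the square root of cache size (a crude power-law fit over "average" workloads, not a property of any particular application, and the 10% miss rate at 256 kB is an invented baseline), the curve just keeps flattening smoothly with no obvious knee:

Code:
# Crude diminishing-returns illustration, assuming miss_rate ~ 1/sqrt(cache size).
# The 10% miss rate at 256 kB is an invented baseline for the example.
BASE_KB, BASE_MISS = 256, 0.10

for size_kb in (256, 512, 1024, 2048, 4096):
    miss = BASE_MISS * (BASE_KB / size_kb) ** 0.5
    print(f"{size_kb:>4} kB L2: ~{1 - miss:.1%} hit rate")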
 
Even so, there's an awful lot of die area in between the L2 and cores in particular, as well as along the top edge of the image, that isn't labelled as anything specific. Do those portions actually serve a purpose, or are they for all intents and purposes just dead space?

Is it possible that it's area used to run traces to cope with the eDRAM's latency without having to buffer it? I hope that doesn't sound totally silly.
 
While the overall reasoning is sound, I'd shy away from picking a cut-off number for cache size.

First, because average hit rate (averaged over a wide set of applications) increases along a smooth function, making any cut-off point rather arbitrary (why 89%? why not 92% or 94%?).

Second, because cache hit rates are very application specific, depending on the code AND the data set, the average latency of a cache hit vs. the average latency of a cache miss, and to what extent the data bursted in from main memory is actually useful... It's really complex, and doesn't boil down to a single ideal cache size for either L1, L2 or L3.

At some point, using additional die area for CPU cores vs. cache will indeed balance out fairly evenly when looked at over a great number of general applications, but since it is so application and data-access dependent, it is pretty damn difficult to pinpoint the exact optimum. However, I wouldn't sweat it too much, since the optimum seems to typically be fairly flat, i.e. overall the differences of going either way will be modest when you are close to it. And the optimum for general code is typically quite low on cores. (For general x86 code it is lower than 4, and actually lower than 2, last I looked. Tons of caveats though, and a console CPU isn't running general x86 code either.)

Heh, thanks for the info... and possibly a brain hemorrhage :p
 
Don't forget that the Wii CPU had 1 MB of L2 too (IIRC); it may be there for BC and easy porting of Nintendo engines!

Plus, a large cache makes it easier to develop for, runs cooler, and is probably quite good for real-world performance (and framerate) too? I'd prefer to have less but stable performance rather than seeing it stutter each time it takes on a few more things and isn't really up to the task because of unpredictable performance.


We spent years hearing how hard the PS360 memory (sub)systems hit performance; now we have a console that is 1.5-2x better in raw numbers and has a great memory system, plus four times the RAM. It should be quite interesting to see what they can extract from it.
 
Wii's Broadway CPU only has 256KB of L2 cache. It's a shrink of the old Gamecube Gekko CPU which also had 256KB of L2. 256KB of on-die L2 cache was standard for processors of that era - Coppermine P3, Willamette P4, and Palomino AthlonXP all had 256KB of L2.
 
http://barefeats.com/g4up2.html

For interest, I haven't seen any benchmarks between the 512kB L2 PowerPC 750FX and the 1MB L2 PowerPC 750GX, but there are some benchmarks available comparing the 512kB L2 PowerPC 7447A and 1MB L2 PowerPC 7448. In the game results, doubling the L2 cache to 1MB yields a 13% fps boost in Doom 3 and an 11% boost in Halo, which is pretty effective. The caveat is that the QuickSilver Power Mac these G4 processor upgrade kits were tested on had a 133MHz FSB and PC133 SDRAM, so they were probably bottlenecked by system memory, which could exaggerate the benefits of more L2 cache.
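
That caveat can be sketched with Amdahl-style arithmetic: the fps gain from a bigger L2 is bounded by the fraction of frame time spent stalled on memory, so a system starved by a 133MHz FSB has more to gain. The stall fractions below are invented for illustration, not measured from those Macs:

Code:
# Amdahl-style sketch: fps gain if a bigger L2 removes some share of memory-stall time.
def fps_gain(stall_fraction, stall_reduction):
    new_frame_time = (1 - stall_fraction) + stall_fraction * (1 - stall_reduction)
    return 1 / new_frame_time - 1

# Same hypothetical cache improvement (removes 30% of stalls) on two hypothetical systems:
print(f"memory-starved system (40% of frame time stalled): +{fps_gain(0.40, 0.30):.0%} fps")  # ~+14%
print(f"fast-memory system    (15% of frame time stalled): +{fps_gain(0.15, 0.30):.0%} fps")  # ~+5%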
 
http://barefeats.com/g4up2.html

For interest, I haven't seen any benchmarks between the 512kB L2 PowerPC 750FX and the 1MB L2 PowerPC 750GX, but there are some benchmarks available comparing the 512kB L2 PowerPC 7447A and 1MB L2 PowerPC 7448. In the game results, doubling the L2 cache to 1MB yields a 13% fps boost in Doom 3 and an 11% boost in Halo, which is pretty effective. The caveat is that the QuickSilver Power Mac these G4 processor upgrade kits were tested on had a 133MHz FSB and PC133 SDRAM, so they were probably bottlenecked by system memory, which could exaggerate the benefits of more L2 cache.

G4s? Don't those have AltiVec? I remember paired singles used to be pretty good, until things like AltiVec came along. I wonder if Nintendo improved the paired-singles vector math at all. Or perhaps they see GPGPU as a solution/compromise.

I actually have some 750CX vs. FX vs. GX documents around that detail the differences in cache size, branch prediction, and whatnot. I'll dig them up.
 
Wii's Broadway CPU only has 256KB of L2 cache. It's a shrink of the old Gamecube Gekko CPU which also had 256KB of L2. 256KB of on-die L2 cache was standard for processors of that era - Coppermine P3, Willamette P4, and Palomino AthlonXP all had 256KB of L2.

Thanks for the correction.

http://barefeats.com/g4up2.html

For interest, I haven't seen any benchmarks between the 512kB L2 PowerPC 750FX and the 1MB L2 PowerPC 750GX, but there are some benchmarks available comparing the 512kB L2 PowerPC 7447A and 1MB L2 PowerPC 7448. In the game results, doubling the L2 cache to 1MB yields a 13% fps boost in Doom 3 and an 11% boost in Halo, which is pretty effective. The caveat is that the QuickSilver Power Mac these G4 processor upgrade kits were tested on had a 133MHz FSB and PC133 SDRAM, so they were probably bottlenecked by system memory, which could exaggerate the benefits of more L2 cache.


Isn't the main problem the latency? Didn't that increase with DDR3? Maybe a faster FSB helped, but on the other hand DDR3 may have made those improvements moot?
 
Or perhaps they see GPGPU as a solution/compromise.
AMD's VLIW5 arch is very impractical for GPGPU; performance is shit most of the time. It was designed for graphics, not compute. Considering how weak wuu's GPU is to begin with, we won't be seeing much GPGPU on it, I'm wagering. Since it's based on Radeon HD 4000-era tech, it most likely lacks all the GCN optimizations for co-scheduling compute and graphics tasks and so on; it probably won't mix well, kind of like running compute tasks on an older GPU in Windows and trying to use the desktop for anything at the same time...
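
The usual explanation for why VLIW5 struggles with compute is slot occupancy: the compiler has to pack up to five independent ops into each bundle, and, roughly speaking, graphics shaders tend to fill the slots far better than scalar-heavy compute kernels. A toy illustration with hypothetical occupancy figures (not measured Latte numbers):

Code:
# Toy effective-throughput comparison for a 5-wide VLIW lane.
# Occupancy figures are hypothetical, just to show the shape of the problem.
SLOTS = 5

for workload, occupancy in (("graphics shader", 0.70), ("compute kernel", 0.40)):
    effective = SLOTS * occupancy
    print(f"{workload:>15}: ~{effective:.1f} of {SLOTS} slots busy per clock "
          f"(~{occupancy:.0%} of peak ALU throughput)")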
 
AMD's VLIW5 arch is very impractical for GPGPU; performance is shit most of the time. It was designed for graphics, not compute. Considering how weak wuu's GPU is to begin with, we won't be seeing much GPGPU on it, I'm wagering. Since it's based on Radeon HD 4000-era tech, it most likely lacks all the GCN optimizations for co-scheduling compute and graphics tasks and so on; it probably won't mix well, kind of like running compute tasks on an older GPU in Windows and trying to use the desktop for anything at the same time...

Considering the customization done to this GPU from its original R700 base, I'm not sure we can be certain that whatever is in Latte is VLIW5.
 
If it isn't VLIW5, then why did they start off with R700? It'd be like building a custom truck by taking an existing truck, tearing it all down and throwing everything away, then re-building a new truck with all-new components. Doesn't make sense, and I'm sure this isn't what they've done.
 
If it isn't VLIW5, then why did they start off with R700? It'd be like building a custom truck by taking an existing truck, tearing it all down and throwing everything away, then re-building a new truck with all-new components. Doesn't make sense, and I'm sure this isn't what they've done.
Nintendo apparently started developing Latte in 2009, when R700 was the current generation of AMD GPUs. IIRC, development of the chip wasn't finished until 2011, so there was time for the chip to evolve from its base. Considering the difference in the appearance of the internal workings of the processor compared to the original R700, it is not something to completely ignore.
 
Nintendo apparently started developing Latte in 2009, when R700 was the current generation of AMD GPUs.
Doesn't make sense to fixate on a piece of hardware which will have become completely obsolete when you don't intend to launch for nearly half a decade. Did either Sony or MS focus on obsolete tech in their consoles? No. Methinks nintendo picked R700 not because they've been slow-cooking wuugpu since 2009, but rather because it was cheaper than any other IP that AMD could offer, and that actual hardware development was (way) shorter than four years.

Shit, in four years nintendo could have had AMD brew up an entirely novel architecture from start to finish rather than start off with an existing one.

IIRC, development of the chip wasn't finished until 2011, so there was time for the chip to evolve from its base.
There's no reason to assume that the base is anything but what it looks like, especially with nintendo. VLIW5 is part of the fundamental workings of that hardware design; you change that and you need to change pretty much everything. Even going for Cayman's slightly refined VLIW4 (which is also mostly crap for GPGPU, one might add) means you get DX11 as well, and we have no information whatsoever that wuu supports DX11.

Considering the difference in the appearance of the internal workings of the processor compared to the original R700, it is not something to completely ignore.
Custom layout is one thing, fundamental inner workings something else. Occam's razor speaks against that, plus nintendo's cheapskatedness as well.
 
Doesn't make sense to fixate on a piece of hardware which will have become completely obsolete when you don't intend to launch for nearly half a decade. Did either Sony or MS focus on obsolete tech in their consoles? No. Methinks nintendo picked R700 not because they've been slow-cooking wuugpu since 2009, but rather because it was cheaper than any other IP that AMD could offer, and that actual hardware development was (way) shorter than four years.

Shit, in four years nintendo could have had AMD brew up an entirely novel architecture from start to finish rather than start off with an existing one.

There's no reason to assume that the base is anything but what it looks like, especially with nintendo. VLIW5 is part of the fundamental workings of that hardware design; you change that and you need to change pretty much everything. Even going for Cayman's slightly refined VLIW4 (which is also mostly crap for GPGPU, one might add) means you get DX11 as well, and we have no information whatsoever that wuu supports DX11.

Custom layout is one thing, fundamental inner workings something else. Occam's razor speaks against that, plus nintendo's cheapskatedness as well.

I think you need to read all of the Latte thread on NeoGAF before making such blanket statements. There are folks there spending hours upon hours looking into the chip and coming up with very interesting theories.

You may still be right but I don't think it's as black and white as you're making out.
 
Nintendo went all in and spent years developing a very highly customised, almost unique GPU arch in order to do something that R7xx could do on its own - deliver Xbox 360 level performance.

Nope. Really not feeling it.
 