Wii U hardware discussion and investigation *rename

The CPU has been stated to be basically the same as a 1998 model in terms of features and performance/clock, just with higher clocks. Something like Bobcat should be massively faster and you don't need three Broadways for BC.

An updated Broadway makes BC a lot easier. It has instructions that other PPC processors just don't. And Nintendo may be paranoid about timing, not wanting a processor that could be slower under some pathological condition, even under a very small timespan where it could break a game due to (usually unintended) reliance on timing.

You can combat incompatible instructions with code translation but once you do even the smallest amount of this you open up a huge can of worms software-wise. You could add hardware support for the instructions with a fairly small amount of logic in another CPU uarch but I expect IBM will open up a big new project bill for any small amount of work that breaks into a different hardware block and opens up a bunch of verification requirements.
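
(To make the "can of worms" concrete: before translating anything you first have to find the offending encodings. Below is a minimal sketch in C of scanning for the Gekko/Broadway paired-single instructions -- the quantized load/store primary opcodes 56/57/60/61 and the arithmetic group under primary opcode 4, as per the public Broadway documentation. The demo buffer and what you'd actually do on a hit are purely hypothetical.)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Primary opcode = top 6 bits of a 32-bit PowerPC instruction word. */
static inline unsigned primary_opcode(uint32_t insn) { return insn >> 26; }

/* Nonzero if the word uses Gekko/Broadway paired-single extensions that a
   stock PPC core would trap on: quantized load/store opcodes
   (psq_l/psq_lu/psq_st/psq_stu = 56/57/60/61) and the paired-single
   arithmetic group under primary opcode 4. */
static int is_paired_single(uint32_t insn)
{
    switch (primary_opcode(insn)) {
    case 4:                       /* ps_add, ps_mul, ps_madd, ...   */
    case 56: case 57:             /* psq_l, psq_lu                  */
    case 60: case 61:             /* psq_st, psq_stu                */
        return 1;
    default:
        return 0;
    }
}

/* Scan instruction words (already loaded as host-order uint32_t) and count
   how many would need patching or emulation. */
static size_t count_paired_single(const uint32_t *code, size_t count)
{
    size_t hits = 0;
    for (size_t i = 0; i < count; i++)
        if (is_paired_single(code[i]))
            hits++;
    return hits;
}

int main(void)
{
    /* Dummy buffer standing in for a code segment. */
    uint32_t demo[4] = {
        0x7C0802A6u,   /* mflr r0                                   */
        0xE0030000u,   /* primary opcode 56: psq_l form             */
        0x10220072u,   /* primary opcode 4: paired-single arithmetic */
        0x4E800020u    /* blr                                       */
    };
    printf("%zu of 4 words use paired-single encodings\n",
           count_paired_single(demo, 4));
    return 0;
}
```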

As for the GPU, VLIW5 is several generations old now and R7xx wasn't even the most recent version of it. Time for customisation may be an issue that meant newer designs couldn't be used, but even so, such an old technology base seems a little strange when the highly customised Durango is looking at GCN+.

VLIW5 is old, but its short-term successor VLIW4 isn't that old in comparison - we first heard about Wii U in spring 2011 and there were supposed to be playable units in June 2011. This timeline isn't at all comparable to Durango's, and there's no way GCN would have been an option; even VLIW4 was only a few months old at this point.

I do find it likely that Wii U has been in development for a long time, but I think Nintendo likes long design cycles for their consoles. 3DS uses an ancient CPU for its release date too, but that doesn't mean that Nintendo was originally planning on releasing it in 2008 when the DS was selling like crazy.

Nope, but the only fab node that seems to make sense wrt the 32 MB of eDRAM is 40nm.

The 32MB of eDRAM is also the only thing presented that we can even draw a reasonable density measurement for. All of the rest is based on assumptions about what the blocks contain.

There are some ideas about some SRAM cell sizes on other parts of the die, like where Marcan thinks the ARM9 is. Maybe someone should look at what that density's like. Unfortunately I'm not aware that NEC has published SRAM density numbers, nor am I aware of any other die shots with known quantities that we can compare against (this is assuming NEC manufactured it, but at this point I think that's more likely - the Xbox 360, for instance, used NEC to manufacture the eDRAM daughter dies).
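
(If anyone does take measurements, the density comparison itself is just arithmetic. Quick sketch below -- every area number is a placeholder I made up, not something measured off the shot:)

```c
#include <stdio.h>

/* Bits per square millimetre for a memory macro of known capacity.
 * Every area below is a placeholder, NOT a measurement off the die shot. */
static double bits_per_mm2(double capacity_bits, double area_mm2)
{
    return capacity_bits / area_mm2;
}

int main(void)
{
    const double BITS_PER_MB = 1024.0 * 1024.0 * 8.0;

    /* Hypothetical block areas -- plug in real numbers once measured. */
    double edram_area_mm2 = 40.0;
    double sram_area_mm2  = 3.0;

    double edram_density = bits_per_mm2(32.0 * BITS_PER_MB, edram_area_mm2);
    double sram_density  = bits_per_mm2( 1.0 * BITS_PER_MB, sram_area_mm2);

    printf("eDRAM: %.3g Mbit/mm^2\n", edram_density / 1e6);
    printf("SRAM : %.3g Mbit/mm^2\n", sram_density  / 1e6);

    /* Next step would be comparing against published cell sizes for the
     * candidate processes (40/45/55 nm), which is the part we're missing. */
    return 0;
}
```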
 
An updated Broadway makes BC a lot easier. It has instructions that other PPC processors just don't. And Nintendo may be paranoid about timing, not wanting a processor that could be slower under some pathological condition, even under a very small timespan where it could break a game due to (usually unintended) reliance on timing.

You can combat incompatible instructions with code translation but once you do even the smallest amount of this you open up a huge can of worms software-wise. You could add hardware support for the instructions with a fairly small amount of logic in another CPU uarch but I expect IBM will open up a big new project bill for any small amount of work that breaks into a different hardware block and opens up a bunch of verification requirements.

BC makes a convincing argument for including an updated Broadway, and in my (unpopular!) hypothesis about an R6xx-based "Gamecube HD" in ~2007 I was thinking an updated Gekko would be useful both for that reason and because it would be capable but power efficient.

In 2012 however, I can't help wondering why you wouldn't just include 1 Broadway core and, say, 2 far more capable Bobcat cores. You might even be able to put the Bobcat cores on the GPU die (assuming 40 nm TSMC). Perhaps the cost of including both types would be too high though, and the need for BC won out.

VLIW5 is old, but its short-term successor VLIW4 isn't that old in comparison - we first heard about Wii U in spring 2011 and there were supposed to be playable units in June 2011. This timeline isn't at all comparable to Durango's, and there's no way GCN would have been an option; even VLIW4 was only a few months old at this point.

Well it was VLIW4 that I was thinking of. But with the need to add so much Wii U-specific stuff to Latte, I guess maybe the schedules were a little tight. RV7xx does seem a little odd due to its age, but maybe it was leaner and better suited to Nintendo's needs.

I do find it likely that Wii U has been in development for a long time, but I think Nintendo likes long design cycles for their consoles. 3DS uses an ancient CPU for its release date too, but that doesn't mean that Nintendo was originally planning on releasing it in 2008 when the DS was selling like crazy.

Well my thought with a "Gamecube HD" would be that it was in development because Nintendo didn't know that the Wii was going to sell like crazy, or perhaps that they weren't even sure what to do in 2005 (the Wiimote began development as a GC accessory)!

The 32MB of eDRAM is also the only thing presented that we can even draw a reasonable density measurement for. All of the rest is based on assumptions about what the blocks contain.

There are some ideas about some SRAM cell sizes on other parts of the die, like where Marcan thinks the ARM9 is. Maybe someone should look at what that density's like. Unfortunately I'm not aware that NEC has published SRAM density numbers, nor am I aware of any other die shots with known quantities that we can compare against (this is assuming NEC manufactured it, but at this point I think that's more likely - the Xbox 360, for instance, used NEC to manufacture the eDRAM daughter dies).

MS switched from NEC to TSMC for their eDRAM during the height of the RRoD crisis when the 80nm (probably) Falcon came out. Not sure who currently manufactures it. Maybe NEC again?

Comparing SRAM from other parts of the chip seems like a really good idea, but one which I'm not qualified to do. I think AlStrong should step up to bat; he's been consistently the most accurate with his figures so far!
 
In 2012 however, I can't help wondering why you wouldn't just include 1 Broadway core and, say, 2 far more capable Bobcat cores. You might even be able to put the Bobcat cores on the GPU die (assuming 40 nm TSMC). Perhaps the cost of including both types would be too high though, and the need for BC won out.

You know how Iwata said that they tried to integrate as much of the original Wii functionality into the new design as possible? This isn't really anything new to Nintendo. When they provide backwards compatibility they like to either augment the original hardware to be more powerful or delegate it to some other part of the design that has a useful auxiliary purpose.

There are exceptions though. The GBA included the Game Boy's 8080-derived CPU, but it wasn't accessible in GBA mode at all. That may have had something to do with it being a 5V part, and they did drop support from the GB Micro entirely.

But it's questionable if two Bobcat cores would really be better than three of what they have. And they wouldn't have gotten good results out of trying to use PPC and x86 cores simultaneously.

For some reason Nintendo also wanted a large amount of eDRAM-based cache and IBM was willing to give them a design that could work with this. It could also be that IBM was never willing to license a Broadway-compatible core for manufacturing by another fab (or that it involved porting work that no one, including IBM, was willing to do for a price Nintendo was willing to pay). This would have limited them to a separate IBM-manufactured die for the CPU, and would have made it harder to put a non-IBM CPU on it. We'll have to see if the two dies are integrated further down the road, but consider that it never happened on Wii, where the CPU die was even smaller. It could be that IBM is interested in selling Wii U's CPU standalone like they did Broadway; they may have had some application in mind where the eDRAM makes sense (of course, I'm not sure what Nintendo's alleged $1b paid for in this scenario).
 
Yeah, but I wanted to try something that might justify having large 20-shader blocks as opposed to super-dense 40-shader blocks that blow past Brazos and Llano for density.
I understand your motivation, but my humble opinion is that we should first try to get a clearer idea of what we are looking at before we devise theories of how it came to be, not the other way around.

Well that's interesting to know, and means that what marcan found isn't proof of my hypothesis.
I don't think marcan was particularly sure of what he was looking at either, but he found it compelling enough evidence that Latte is Radeon-based, which it indeed is.

Old as opposed to new (or current). The CPU has been stated to be basically the same as a 1998 model in terms of features and performance/clock, just with higher clocks. Something like Bobcat should be massively faster and you don't need three Broadways for BC. As for the GPU, VLIW5 is several generations old now and R7xx wasn't even the most recent version of it. Time for customisation may be an issue that meant newer designs couldn't be used, but even so, such an old technology base seems a little strange when the highly customised Durango is looking at GCN+.
I'm not sure I share the 'GCN = new, everything else = old' sentiment. After all, AMD's latest E-series of discrete GPUs ('E' = embedded) is nothing but Turks. The fact that Nintendo most likely did not have access to GCN tech back when they started work on the Wii U is another matter.

Re the CPU - I share Exophase's opinion that 2x Bobcats would not have been worth it over 3x Espressos. I might have something more substantial to add on this particular subject once I get a certain toolchain to comply (I have so much to say about GCC's policy to castrate the ppc750cl support, unless one executes certain rituals, but that's a subject for another thread).

Not exceeding the 360 in a 2012 design is quite something, when even a half crippled Llano from 2011 can do it on the CPU side and a mobile version can easily do it on the GPU side. I'm looking for a deeper reason for such an incredibly low level of performance although I'll freely admit that there doesn't actually have to be one.
As I mentioned, let's first figure out what we're looking at before we decide how we feel about it.


On a different note, what are your thoughts on the ROP and/or BW situation? I made this post yesterday, spurred on by various NeoGaf / B3D comments:

http://forum.beyond3d.com/showpost.php?p=1703968&postcount=4564

Possibly onto something, or a load of nonsense?
Well, it's an interesting theory that IMO has a right to exist. I admit I may give extra weight to it due to my own inability to discern anything resembling clear-cut ROPs in there, but I'm not a seasoned die reader. My opinion is that, for all we know, Latte might not have dedicated ROP blocks per se, just because it has some very close coupling between the FB and the rest of the pipeline.
 
R600 ISA had rather strange TMU organization -- all the texture units were placed in a separate block, much like a dedicated multiprocessor with a shared L1 sampling cache, that scaled in width with the "normal" SIMDs. If the "A-group", as labeled on the die-shot, is indeed such a structure, running along the two shader multiprocessors, the shared L1t cache could be the large SRAM pool just above the first A-block.
Sadly, there's not a single die-shot available to the public from an R600-class GPU to compare against.
 
R600 ISA had rather strange TMU organization -- all the texture units were placed in a separate block, much like a dedicated multiprocessor with a shared L1 sampling cache, that scaled in width with the "normal" SIMDs. If the "A-group", as labeled on the die-shot, is indeed such a structure, running along the two shader multiprocessors, the shared L1t cache could be the large SRAM pool just above the first A-block.
Sadly, there's not a single die-shot available to the public from an R600-class GPU to compare against.

Question for yah: Do caches typically compose their own blocks in these types of layouts? Or could R700's 8kb L1 texture cache per TMU actually lie inside the actual TMU blocks?
 
Question for yah: Do caches typically compose their own blocks in these types of layouts? Or could R700's 8kb L1 texture cache per TMU actually lie inside the actual TMU blocks?
Texture quads in post-R600 architectures are local to each SIMD multiprocessor, together with the L1T caches, and their number scales with the number of multiprocessors, not with the width as in R600.
 
Texture quads in post-R600 architectures are local to each SIMD multiprocessor, together with the L1T caches, and their number scales with the number of multiprocessors, not with the width as in R600.

I understand. However, after R700, AMD effectively halved the number of TMUs per SIMD core in their lower-end cards. Might I suggest that's what we're seeing here? Not that the hardware is beyond R700 - just that Nintendo changed the ratios.

My original question more bluntly put is this: What the heck does an L1 texture cache look like and could they reside inside the presumed TMU blocks on the die?
 
The stuff we're looking at is little more than some identifiable SRAM blocks and repeated macros. The layout is custom (probably synthesized to a large degree, with large blocks placed down by hand) so you can't really recognize much by the arrangement of the SRAM blocks.

The only thing I can tell you about cache is that it'll look like a big array of SRAM for the data adjacent to a smaller array of SRAM for the tags. How large the tags will be relative to the data will depend on the cache line size. GPUs rely on texture cache more for spatial locality of reference than temporal so they don't tend to be very large, especially if they're not particularly compute oriented, and they probably use pretty big lines. GPU cache controllers will also tend to have additional features like converting texture coordinates to linear ones and stuff to help support texture decompression at some level.

The big block (1MB?) of SRAM in the upper left of the die may have tags and cache controller in its upper right-hand corner. 1MB is really big for texture cache facilitating this level of GPU capability, but if the SRAM is needed for something else anyway it might have been deemed worthwhile to have it as a cache.

The block immediately above what's being called the left-most TMU might contain cache as well (could be around 128KB). But these are all really just barely educated guesses.
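
To put a rough number on "how large the tags will be relative to the data", here's a quick sketch for a hypothetical direct-mapped 1 MB cache with 64-byte lines and 32-bit physical addresses -- none of these parameters are known for Latte, they're purely illustrative:

```c
#include <stdio.h>

int main(void)
{
    /* All parameters hypothetical -- chosen only to show the arithmetic. */
    const unsigned long long cache_bytes = 1ULL << 20; /* 1 MiB data array */
    const unsigned line_bytes = 64;                    /* cache line size  */
    const unsigned addr_bits  = 32;                    /* physical address */
    const unsigned ways       = 1;                     /* direct-mapped    */

    unsigned long long lines = cache_bytes / line_bytes;            /* 16384 */
    unsigned offset_bits = 0, sets_bits = 0;
    for (unsigned b = line_bytes; b > 1; b >>= 1) offset_bits++;     /* log2 */
    for (unsigned long long s = lines / ways; s > 1; s >>= 1) sets_bits++;

    unsigned tag_bits   = addr_bits - sets_bits - offset_bits;       /* 12   */
    unsigned state_bits = 2;                       /* e.g. valid + dirty     */

    unsigned long long tag_array_bits = lines * (tag_bits + state_bits);
    printf("lines=%llu, tag=%u bits/line, tag array ~= %.1f KiB "
           "vs %llu KiB of data\n",
           lines, tag_bits, tag_array_bits / 8.0 / 1024.0,
           cache_bytes / 1024);
    return 0;
}
```

Even with generous state bits the tag array comes out at a few percent of the data array, which is why tags tend to show up as a visibly smaller SRAM strip sitting next to the big one.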
 
The stuff we're looking at is little more than some identifiable SRAM blocks and repeated macros. The layout is custom (probably synthesized to a large degree, with large blocks placed down by hand) so you can't really recognize much by the arrangement of the SRAM blocks.

The only thing I can tell you about cache is that it'll look like a big array of SRAM for the data adjacent to a smaller array of SRAM for the tags. How large the tags will be relative to the data will depend on the cache line size. GPUs rely on texture cache more for spatial locality of reference than temporal so they don't tend to be very large, especially if they're not particularly compute oriented, and they probably use pretty big lines. GPU cache controllers will also tend to have additional features like converting texture coordinates to linear ones and stuff to help support texture decompression at some level.

The big block (1MB?) of SRAM in the upper left of the die may have tags and cache controller in its upper right-hand corner. 1MB is really big for texture cache facilitating this level of GPU capability, but if the SRAM is needed for something else anyway it might have been deemed worthwhile to have it as a cache.

The block immediately above what's being called the left-most TMU might contain cache as well (could be around 128KB). But these are all really just barely educated guesses.

This helps a lot. Thank you for the detailed answer. The reason I was interested is that the SRAM blocks inside the proposed TMUs look like they might be about right for 8 kb of texture cache. However, if tags are an identifiable feature of a cache, that cannot be the case.
 
The only way is to dig into the hardware and see which GPU generation it most closely resembles. If it's R6xx it's probably true, if it's R7xx it's probably not, and if it's anything later it's definitely not.
You know, the scariest point, according to DF, is that even the devs may not know! :runaway:
 
I understand your motivation, but my humble opinion is that we should first try to get a clearer idea of what we are looking at before we devise theories of how it came to be, not the other way around.

Well that's certainly the logical way to do it. But is it the fun way? :???:

Re the CPU - I share Exophase's opinion that 2x Bobcats would not have been worth it over 3x Espressos. I might have something more substantial to add on this particular subject once I get a certain toolchain to comply (I have so much to say about GCC's policy to castrate the ppc750cl support, unless one executes certain rituals, but that's a subject for another thread).

You're thinking of doing some benches comparing the 750 and Bobcat? That would be really interesting; it would be good to see them both compared to the PPE in Cell too.
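
Not that I'm volunteering numbers, but the skeleton of such a bench is simple enough. A rough sketch of a dependent-op latency loop in C -- iteration count and the op mix are arbitrary placeholders, and the interesting part would be the compiler flags and which operations you chain:

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Dependent chain of integer multiply-adds: each iteration needs the
 * previous result, so the loop time approximates op latency rather than
 * throughput. Iteration count and the constant are arbitrary. */
int main(void)
{
    const uint32_t iters = 100000000u;
    volatile uint32_t seed = 12345u;  /* volatile: keep the compiler honest */
    uint32_t x = seed;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint32_t i = 0; i < iters; i++)
        x = x * 2654435761u + i;      /* dependent mul + add */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("result=%u  %.2f ns/iteration\n", x, secs / iters * 1e9);
    return 0;
}
```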

Well, it's an interesting theory that IMO has a right to exist. I admit I may give extra weight to it due to my own inability to discern anything resembling clear-cut ROPs in there, but I'm not a seasoned die reader. My opinion is that, for all we know, Latte might not have dedicated ROP blocks per se, just because it has some very close coupling between the FB and the rest of the pipeline.

Well I missed one of the TMU blocks when I first looked at the die shot so I'm probably not qualified to look for ROPs (that are possibly in two separate units, or even not there at all). There must be some positive things that greatly increased flexibility could bring to the Wii U if it wasn't simply trying to run Xbox ports though - higher quality sub-sample based AA applied intelligently to key edges (a la cell shading) perhaps?
 
The GPR file organization on R600 was taken from R520 and even had its own small L1 cache -- quite different from what AMD did later in R700. I don't see such a structure in "Latte" here, where the SIMDs are clearly integrated with the GPRs.
 
You know, the scariest point, according to DF, is that even the devs may not know! :runaway:

Well it's one way to prevent hardware documentation leaking!

I really don't understand what Nintendo have to gain by keeping so many basic hardware secrets. Any negative PR from having low specs would surely be dwarfed by any impact on the quality of the platform's software.
 
GPUs rely on texture cache more for spatial locality of reference than temporal so they don't tend to be very large, especially if they're not particularly compute oriented, and they probably use pretty big lines.
Radeons use a 64-byte cache line size, no larger than on CPUs.
The big block (1MB?) of SRAM in the upper left of the die may have tags and cache controller in its upper right-hand corner. 1MB is really big for texture cache facilitating this level of GPU capability, but if the SRAM is needed for something else anyway it might have been deemed worthwhile to have it as a cache.
I thought the current idea is that it's there for backwards compatibility. It's supposed to be the 1 MB texture memory that, together with the 2 MB eDRAM array, forms the 3 MB of dedicated video memory in Wii mode.
The GPR file organization on R600 was taken from R520 and even had its own small L1 cache -- quite different from what AMD did later in R700. I don't see such a structure in "Latte" here, where the SIMDs are clearly integrated with the GPRs.
Register file cache in R600?!? First time I've heard of it. Register file organization was kept pretty constant all the way from R600 to R900. They kept the same four x, y, z, w banks and have the same read port and cycle restrictions. Or do you have details?
The major difference between R600 and R700 is basically the much more scalable TMU setup integrated into the SIMDs, not the registers.

Edit:
Or do you mean the small (8 kB for the whole GPU, shared by all SIMDs) cache in the stream out/in datapath? Wasn't that just a glorified write-combining buffer (later incarnations had it too; in Cypress it has grown to 128 kB [and even got atomic units])?
 
Register file cache in R600?!? First time I've heard of it. Register file organization was kept pretty constant all the way from R600 to R900. They kept the same four x, y, z, w banks and have the same read port and cycle restrictions. Or do you have details?
The major difference between R600 and R700 is basically the much more scalable TMU setup integrated into the SIMDs, not the registers.

Edit:
Or do you mean the small (8 kB for the whole GPU, shared by all SIMDs) cache in the stream out/in datapath? Wasn't that just a glorified write-combining buffer (later incarnations had it too; in Cypress it has grown to 128 kB [and even got atomic units])?
Here's a quote from B3D's own R600 analysis on the matter:
For local memory access, the shader core can load/store from a huge register file that takes up more area on the die than the ALUs for the shader core that uses it. Accesses can happen in 'scalar' fashion, one 32-bit word at a time from the application writer's point of view, which along with the capability of co-issuing 5 completely random instructions (we tested using truly terrifying auto-generated shaders) makes ATI's claims of a superscalar architecture perfectly legit. Shading performance with more registers is also very good, indeed we've been able to measure that explicitly with shaders using variable numbers of registers, where there's no speed penalty for increasing them or using odd numbers. It's arguably one of the highlights of the design so far, and likely a significant contributor to R600's potential GPGPU performance as well.

Access to the register file is also cached, read and write, by an 8KiB multi-port cache. The cache lets the hardware virtualise the register file, effectively presenting any entry in the cache as any entry in the larger register file. It's unclear which miss/evict scheme they use, or if there's prefetching, but they'll want to maximise hits for the running threads of course.
And since there isn't a die-shot of any R600 SKU, R520 can give some clues to the general layout, considering the architectural similarities with R600 -- note the big dark rectangle in the middle:

[Image: R520 die shot - die-big.jpg]
 
Interesting post from Fourth Storm on NeoGAF:

Here's some more fuel to add to the fire apropos ROPs. Check out this photo of Llano. It's a VLIW5 APU design that actually seems to share more similarities with Latte than the old RV770 die:

http://images.anandtech.com/reviews/cpu/amd/llano/review/desktop/49142A_LlanoDie_StraightBlack.jpg

Any guesses as to where the ROPs are there? Besides the obvious structures (LDS, ALU, TMU, texture L1), I am hard pressed to find two blocks that are exactly the same in layout/SRAM banks.

Perhaps, as they did with the ALUs in both Llano and Latte, Renesas/AMD/whoever were able to fit what were formerly two blocks into one.

In other words, 8 ROPs in one block.

...Maybe?
 