Wii U hardware discussion and investigation

Interesting post from fourth storm on NeoGAF.

Here's some more fuel to add to the fire apropos ROPs. Check out this photo of Llano. It's a VLIW5 APU design that actually seems to share more similarities with Latte than the old RV770 die does:

http://images.anandtech.com/reviews/cpu/amd/llano/review/desktop/49142A_LlanoDie_StraightBlack.jpg

Any guesses as to where the ROPs are there? Besides the obvious structures (LDS, ALU, TMU, Texture L1), I am hard-pressed to find two blocks that are exactly the same in layout/SRAM banks.

Perhaps, as they did with the ALUs in both Llano and Latte, Renesas/AMD/whoever were able to fit what were formerly two blocks into one.

In other words, 8 ROPs in one block.

...Maybe?
AMD does like a 2:1 texture unit to ROP ratio; 16 texture units should mean 8 ROPs. Only in their APUs do they differ: Llano has 20 texture units and 8 ROPs. The proposed location on NeoGAF (just above the DDR interface) does look pretty good from my point of view, but I have no idea what they look like, so what do I know.

The die of the Wii U does seem to share a lot of similar blocks with both Llano and Bobcat. I think some of the blocks on the Wii U would be DRAM controllers (which I would imagine are pretty large), which would not be in Llano.

Llano's ROPs are probably next to the DDR interface, I would imagine.
 

So does this change your opinion on the 20 shaders per block theory?

Edit: it seems fourth storm was mistaken and wrote this:
Ah, the chip is face down on the motherboard. Of course! I concede defeat on this one.
 
Radeons use a 64-byte cache line size, no larger than on CPUs.

The more I post the more obvious it is I don't know anything about the fine details of these GPUs, isn't it.. ;) I really slept through GPU uarch stuff for a long time.. At any rate, thanks for your corrections and insights, it's been very informative.

64 bytes is pretty standard on CPUs these days, but it wasn't that long ago that it was considered pretty large. Do any other GPUs use larger lines? Did they in the past?

I thought the current idea is that it's there for backwards compatibility. It's supposed to be the 1 MB texture memory that, together with the 2 MB eDRAM array, forms the 3 MB of dedicated video memory in Wii mode.

That's what Marcan's saying anyway. I don't have a reason to doubt this, but I'm taking it with a grain of salt since he seems to speak really authoritatively on some items he's just speculating about.

But he has gotten names for these places and memory offsets, which would strongly support that they're sized in the manner he suggests. 2MB + 1MB does fit Hollywood well. I am going to guess that they wanted 1MB SRAM for Wii U's sake though, since I don't think Wii BC called for it.
 

The 1 MB SRAM actually makes a lot of sense for Wii BC, as I understand it. The texture cache on Flipper/Hollywood was quite a bit more efficient than even the 2 MB frame buffer. MoSys broke the 1 MB up into 32 macros, each of which could be accessed at the same time. As darkblu will tell you, this resulted in some extremely hard to emulate low latency texture fetches. It seems as if even Renesas' eDRAM was not up to the task of matching this freakish decade-old design.
 
Having a bunch of parallel accesses doesn't do anything for latency, it just means it's a really wide bus. No reason why it needs to be SRAM, which 1T-SRAM wasn't in the first place. I don't see a reason why someone else's eDRAM couldn't match the performance when it can clock something like twice as high.
 

Well yes, it is a 512-bit bus with higher bandwidth than the framebuffer. But if used correctly, the 32 addressable macros could also amount to fewer overall page switches in the cache. Additionally, we really don't know what kind of latency Renesas' eDRAM boasts. I think 1T-SRAM is 1 clock cycle, and Wii U likely downclocks for Wii BC mode.
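
To put rough numbers on that (purely a back-of-envelope; the 32-macro figure and the 162/243 MHz clocks are just the commonly published Flipper/Hollywood specs, nothing confirmed about how Latte runs in BC mode):

Code:
# Rough peak bandwidth of the 1 MB texture memory, assuming 32 macros with a
# 16-bit interface each (the published GC/Wii figures, not measured on Latte).
MACROS = 32
BITS_PER_MACRO = 16
BUS_BITS = MACROS * BITS_PER_MACRO                # 512 bits = 64 bytes per cycle
for name, clock_mhz in [("Flipper", 162), ("Hollywood", 243)]:
    gb_per_s = BUS_BITS / 8 * clock_mhz * 1e6 / 1e9
    print(f"{name} ({clock_mhz} MHz): {BUS_BITS}-bit bus -> {gb_per_s:.1f} GB/s peak")
# Flipper (162 MHz): 512-bit bus -> 10.4 GB/s peak
# Hollywood (243 MHz): 512-bit bus -> 15.6 GB/s peak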
 

What does page switches mean here..?

Supposedly the texture buffer (most sources call it a cache but I'm not aware of it operating like a real cache) had a sustainable latency of 5ns, which would be about a clock cycle. 1T-SRAM is really nothing more than eDRAM with an embedded controller to make it look like SRAM externally - if you look at embedded 1T-SRAM vs eDRAM both solutions will of course have an embedded controller, meaning that the only difference is in the exact nature of the eDRAM layout and controller logic.
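
As a quick sanity check on the "about a clock cycle" part, using the usual published clocks (nothing Latte-specific assumed here):

Code:
# How 5 ns compares to one clock cycle at the usual published GPU clocks.
for name, clock_mhz in [("Flipper", 162), ("Hollywood", 243)]:
    cycle_ns = 1000.0 / clock_mhz
    print(f"{name}: {clock_mhz} MHz -> {cycle_ns:.2f} ns per cycle")
# Flipper: 162 MHz -> 6.17 ns per cycle
# Hollywood: 243 MHz -> 4.12 ns per cycle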

Incidentally, the GameCube information I'm looking at says Flipper actually was manufactured on an NEC process and using NEC's eDRAM (NEC and Renesas merged so they're basically the same company now).

http://www.angelfire.com/mt/psprecinct/news/gcspecs.html

This was continued to the Wii, of course:

http://www.ign.com/articles/2006/06/19/nec-and-mosys-announce-wii-hardware-partnerships

If Wii U's GPU was manufactured by Renesas there's no reason why its eDRAM design would be anything but an evolution of what was in Flipper and Hollywood.
 
As darkblu will tell you, this resulted in some extremely hard to emulate low latency texture fetches.
...Why on earth would you need to emulate (allegedly) low latency texture fetches? The hardware draws a polygon. How many cycles of latency at whatever stage of the process has no bearing on the visual appearance of the polygon when it's finished being drawn. This is all irrelevant.

This religious worship of supposed secret Nintendo sauce really needs to stop. Be rational for once, for chrissakes. There's nothing particularly special about that old gamecube rasterizer. It's just a really, really, really old piece of hardware that wasn't even particularly good in the first place. Ask ERP about his experiences with it for example.
 
What does page switches mean here..?

Supposedly the texture buffer (most sources call it a cache but I'm not aware of it operating like a real cache) had a sustainable latency of 5ns, which would be about a clock cycle. 1T-SRAM is really nothing more than eDRAM with an embedded controller to make it look like SRAM externally - if you look at embedded 1T-SRAM vs eDRAM both solutions will of course have an embedded controller, meaning that the only difference is in the exact nature of the eDRAM layout and controller logic.

Incidentally, the GameCube information I'm looking at says Flipper actually was manufactured on an NEC process and using NEC's eDRAM (NEC and Renesas merged so they're basically the same company now).

http://www.angelfire.com/mt/psprecinct/news/gcspecs.html

This was continued to the Wii, of course:

http://www.ign.com/articles/2006/06/19/nec-and-mosys-announce-wii-hardware-partnerships

If Wii U's GPU was manufactured by Renesas there's no reason why its eDRAM design would be anything but an evolution of what was in Flipper and Hollywood.

No doubt that 1T-SRAM and Renesas eDRAM are quite similar. I am not a developer by any means, and stand to be corrected, but I have done quite a bit of reading on the subject, and all sources seem to agree: more page switches in any cache read equate to that much more latency per switch. Renesas' eDRAM may be comparable, but splitting it into 32 pools, each with a 16-bit interface, seems to be beyond even the high-speed 2 MB eDRAM pool on Wii U. That seems to have a 256-bit interface over 2 MB.
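
For what it's worth, here's a toy way to picture the page switch argument. The page size, macro mapping and access pattern below are completely made up for illustration, not Flipper or Latte specifics:

Code:
# Toy model: each macro owns a contiguous slice of a 1 MB pool and keeps one
# "page" open; an access to a different page in that macro counts as a switch.
TOTAL = 1 << 20          # 1 MB pool
PAGE = 2048              # invented page/row size, just for the illustration

def page_switches(addresses, macros):
    slice_size = TOTAL // macros
    open_page = {}                       # macro index -> page currently open
    switches = 0
    for addr in addresses:
        macro = addr // slice_size
        page = addr // PAGE
        if open_page.get(macro) != page:
            switches += 1
            open_page[macro] = page
    return switches

# Alternate reads from two textures sitting in different halves of the pool.
reads = []
for i in range(0, 16384, 32):
    reads += [0x00000 + i, 0x80000 + i]

for macros in (1, 32):
    print(f"{macros:2d} macro(s): {page_switches(reads, macros)} "
          f"page switches out of {len(reads)} reads")
#  1 macro(s): 1024 page switches out of 1024 reads
# 32 macro(s): 16 page switches out of 1024 reads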

...Why on earth would you need to emulate (allegedly) low latency texture fetches? The hardware draws a polygon. How many cycles of latency at whatever stage of the process has no bearing on the visual appearance of the polygon when it's finished being drawn. This is all irrelevant.

This religious worship of supposed secret Nintendo sauce really needs to stop. Be rational for once, for chrissakes. There's nothing particularly special about that old gamecube rasterizer. It's just a really, really, really old piece of hardware that wasn't even particularly good in the first place. Ask ERP about his experiences with it for example.

Take a step back, Grall. Nobody is worshiping anything here. We are merely attempting to make sense of what we are seeing. There is a 1 MB SRAM cache on Latte that is locked off to devs and conspicuously adjacent to a 2MB eDRAM pool that is also locked off. Nintendo plainly put it there for some functional reason. BC is the obvious answer.

Likewise, nobody is claiming that Flipper is beyond any other graphics hardware - modern or ancient. The point is that the GPU had an abnormally large texture cache broken into an array of independent macros and coupled with a decently wide bus. Emulating this performance perfectly in all software, as Nintendo apparently require, is not trivial without matching those qualities in hardware. Sure, we see Dolphin pulling off great results without any eDRAM or 1T-SRAM, but that is not the frame-for-frame perfect emulation Nintendo require in home consoles.
 
No doubt that 1T-SRAM and Renesas eDRAM are quite similar. I am not a developer by any means, and stand to be corrected, but I have done quite a bit of reading on the subject, and all sources seem to agree: more page switches in any cache read equate to that much more latency per switch. Renesas' eDRAM may be comparable, but splitting it into 32 pools, each with a 16-bit interface, seems to be beyond even the high-speed 2 MB eDRAM pool on Wii U. That seems to have a 256-bit interface over 2 MB.

I get how changing DRAM pages is a latency problem, but I don't think it's something 1T-SRAM had a special advantage for, outside of offering buffering that you could have done some other way. My guess is Renesas will still offer a 1T-SRAM equivalent even if it's not called that; they may have fully absorbed the IP or otherwise superseded it - the 2 MB and 32 MB pools could be derived from it or superior (but this is all still assuming Renesas did the manufacturing in the first place).

But it looks like what you're saying is that the fab just isn't offering small enough macros to get the needed access width out of 1 MB. That makes sense, and I agree it could explain the choice of SRAM.

Likewise, nobody is claiming that Flipper is beyond any other graphics hardware - modern or ancient. The point is that the GPU has an abnormally large texture cache broken into an array of independent macros and coupled with a decently wide bus. Emulating this performance perfectly in all software, as Nintendo aim to do, is not trivial without matching those qualities. Sure, we see Dolphin pulling off great results without any eDRAM or 1T-SRAM, but that is not the frame-for-frame perfect emulation Nintendo require in home consoles.

First and foremost, if Nintendo used the rest of Hollywood's GPU logic then it'll need comparable RAM for textures; there's no getting around that. Regardless of what latencies the software needs, the hardware is going to need what Hollywood had. It just doesn't have the ability to hide latency like more modern GPUs do.

I agree with what you're saying about Nintendo's preferences here.. We could debate how much Nintendo could have emulated the whole GPU in software and if there was any risk of complication at all (Dolphin isn't perfect and of course typically has access to much better GPU hardware), but quality software emulation for BC just isn't Nintendo's style (not counting Virtual Console).

Now I know Iwata said that they used a novel hardware approach that was more intelligent than putting Hollywood on the chip, but we don't know to what extent this is the case. Just by having that 32 MB of eDRAM and sharing the FB and TB RAMs with the Wii U hardware, they've already done enough to make that claim.
 
...Why on earth would you need to emulate (allegedly) low latency texture fetches? The hardware draws a polygon. How many cycles of latency at whatever stage of the process has no bearing on the visual appearance of the polygon when it's finished being drawn. This is all irrelevant.
You're not making sense in this part. Relax and re-read it. Try rephrasing it, if you will.

This religious worship of supposed secret Nintendo sauce really needs to stop. Be rational for once, for chrissakes. There's nothing particularly special about that old gamecube rasterizer. It's just a really, really, really old piece of hardware that wasn't even particularly good in the first place. Ask ERP about his experiences with it for example.
The cube also happens to have a top-performing dependent-reads rasterizer for its time (I doubt ERP had much issue with that); if you're going to emulate that, you'd have to be as good or better. As for the religious worship part - I think you're getting too emotionally involved.
 
Anybody catch the Sonic All-Stars Racing face-off from Digital Foundry? The Wii U had the worst resolution of all three versions. I'm starting to think the Wii U has 160 SPs like Function and Esrever suggested, or it has some nasty bottlenecks.
 
Here's a quote from B3D's own R600 analysis on the matter:
So you meant the small 8 kB buffer, which sits in the stream-out data path (i.e. outside of the shader array!). I know AMD mentioned this virtualized register file in some presentation, but let's think about it. The R600 ISA (this point is identical across all the VLIW architectures) allows each element access to 128 registers (16 bytes per reg, split across the 4 xyzw banks). That equals 2 kB per element in a wavefront, or 128 kB for a single wavefront. The VLIW architectures run 2 wavefronts in an interleaved fashion, so the register file was basically sized to run 2 wavefronts with the maximum register allocation allowed by the architecture. The UTDP would usually schedule only as many wavefronts per SIMD as the SIMD could hold without running out of registers. Register eviction because there wasn't enough space never kicked in to begin with (if it is avoidable, which it mostly is). And even if it had, the 8 kB size rules out any real help from this cache.
8192 bytes / 64 = 128 bytes per element = 8 registers for a single wavefront. But you have several wavefronts per SIMD on several SIMDs running on the chip (and the granularity for register allocation is 4 registers). The useful window for register eviction to this cache is so tiny that it never gets used for this purpose. If register eviction is necessary, it is basically done to global memory and either gets streamed through or simply bypasses this cache.
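Spelling that arithmetic out (same numbers as above, nothing new added):
Code:
# The register file arithmetic from above, spelled out.
regs_per_element = 128      # r0..r127
bytes_per_reg = 16          # 4 x 32-bit components (xyzw)
wavefront_size = 64         # elements per wavefront

per_element = regs_per_element * bytes_per_reg        # 2048 B  = 2 kB
per_wavefront = per_element * wavefront_size          # 131072 B = 128 kB
print(per_element, per_wavefront)

# The 8 kB stream-out buffer, divided across one wavefront:
print(8192 // wavefront_size // bytes_per_reg)        # 8 registers per element
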
And as I said already, this cache is not a one-off specialty of R600. It was kept and even grown to 128 kB in Cypress. The global atomics were handled by it.

To sum it up, the "virtualized" register file meant in practice not much more than that each wavefront could address its architectural registers from r0 to r127 (restrictions apply because of temporaries and so on) or a smaller range, while the UTDP allocated as many as necessary and gave the SIMD an offset into the physical register file, which is added to each "virtual" register index from that wavefront to get the physical one. The result is that several wavefronts can share the reg file without collisions or even knowing about each other.
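If it helps, here is a minimal sketch of that base-plus-offset idea. The names and structure are purely illustrative, not how the UTDP actually implements it:
Code:
# Hypothetical base+offset remapping: each wavefront addresses r0..rN, and the
# hardware adds a per-wavefront base to reach the physical register file.
class RegisterFile:
    def __init__(self, total_regs):
        self.total = total_regs
        self.next_free = 0
        self.base = {}                    # wavefront id -> physical base

    def allocate(self, wavefront, regs_needed):
        # UTDP-style allocation of a contiguous window per wavefront
        assert self.next_free + regs_needed <= self.total, "no room, would spill"
        self.base[wavefront] = self.next_free
        self.next_free += regs_needed

    def physical(self, wavefront, virtual_index):
        return self.base[wavefront] + virtual_index

rf = RegisterFile(total_regs=256)
rf.allocate("wave0", 24)
rf.allocate("wave1", 24)
print(rf.physical("wave0", 3))    # 3
print(rf.physical("wave1", 3))    # 27
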
R520 can give some clues to the general layout, considering the architectural similarities with R600
R520 has significant architectural differences from R600. The shaders and register files of R600 are much closer to R700 than to R520. Claiming R600 is close to R520 in this respect is comparable to saying GCN resembles VLIW4 (which is not completely without merit in some aspects, but is also quite misleading).
 
Somebody will be able to figure it out once they can run code on it. Developers can already run code on it, so one of them may figure it out and make a statement about it.
 
If I see it right, the quarter-size SIMDs (it has 2 SIMD engines with just 20 SPs/4 VLIW5 groups each) are roughly the same size as 20 SPs in RV770. But the resolution could be better, and the register banks look a bit small in comparison (and I'm somehow missing a few in one of the SIMDs). So that's not a definitive assessment.
 
Is it possible you guys will never figure out the SP count on this GPU?
If it is indeed produced on TSMC's 40G (i.e. if the SRAM array in the upper left corner is 1 MB and not just 512 kB), the amount of SRAM in the SIMDs strongly indicates a 1 MB register file size in total (or they used two different SRAM cells with a factor-of-two difference in density, which is highly unlikely). The comparison with Brazos also shows that each individual SRAM bank is significantly larger; comparing just the area where the SRAM cells are (the dark area is control, the brighter area is the actual cells, i.e. the blue part in the Brazos shot), it's quite close to a factor of two. That indeed hints at a reorganization of the SRAM banks and a modification of how register file access works, which in turn could be a sign of larger modifications to the base VLIW architecture. Or AMD did this reorganization just to squeeze out the last few tenths of a mm² needed per SIMD. Either way, the large register file wouldn't make sense without an appropriate amount of computing resources. Even if it were a more heavily modified architecture, its capabilities should end up close to a 320 SP part.
160 SPs is only a possibility if Nintendo opted for some really strange decisions; I think it is quite unlikely. I would be more interested in the actual changes they made to the SIMDs, but I guess those won't become public for quite some time.
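
For reference, the back-of-envelope behind the 320 SP figure, under the assumption (and it is only an assumption) that Latte keeps RV770-style register sizing of 256 kB of GPRs per 16-wide VLIW5 SIMD, i.e. per 80 SPs:

Code:
# Assumption: RV770-style sizing of 256 kB of GPRs per 16-wide VLIW5 SIMD
# (2 wavefronts x 64 elements x 2 kB, matching the earlier register file math).
total_rf_kb = 1024                  # the ~1 MB of SIMD SRAM read off the die shot
rf_per_simd_kb = 2 * 64 * 2         # = 256 kB per SIMD of 80 SPs
simds = total_rf_kb // rf_per_simd_kb
print(simds, "SIMDs ->", simds * 80, "SPs")    # 4 SIMDs -> 320 SPs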
 
And as I said already, this cache is not a one-off specialty of R600. It was kept and even grown to 128 kB in Cypress. The global atomics were handled by it.
R600's was a one-off in that it was a read/write cache. In the following generations it was write-only. Your size analysis explains why the read caching was dropped.
 