I know how SIMD works logically. I'm talking about the physical implementation. And there you have a few boundary conditions. The important one in this context is that you don't want to build four huge 64kB register banks sitting far away from the individual ALUs. Instead you split them up and place them very close to the ALUs. This pays off twofold: it reduces the access latency and it saves power. As no SIMD lane can see the registers of other lanes anyway, there is no point in sharing the register files. It's simpler, faster and even more power efficient.

I've tried thinking about this for a while... I'm pretty tired so I might make some obvious mistake, please bear with me :/
The virtue of SIMD is that you can treat the entire vector like one big register and access it via a single, very wide register file port. So the amount of individual addressing you need shouldn't scale with the number of lanes (VLIW groups) in the SIMD. Of course this will be limited if the SIMD is broken up over multiple blocks.
And it actually enables a trick a single shared register file couldn't do easily. Each SIMD lane can add an offset to the index into the register file, so each lane can in fact address a different register with the single instruction shared by all lanes (look it up in the ISA manual under "relative addressing", with the offset given in the AR register set by the MOVA_* instruction). Afaik, GCN got rid of this messy stuff.
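To make that concrete, here is a minimal C sketch of how I picture the indexing: the instruction carries one base GPR index shared by everybody, and with relative addressing each lane adds its own AR value on top, so the lanes can end up in different registers. The names and the simple index math are my own illustration, not taken from the manual.

#include <stdio.h>

#define LANES 4   /* just a handful of lanes for the example */

int main(void)
{
    int base_gpr = 5;                 /* GPR index encoded once in the shared instruction */
    int ar[LANES] = {0, 1, 2, 3};     /* per-lane offset, set earlier by a MOVA_* instruction */

    for (int lane = 0; lane < LANES; ++lane) {
        int effective_gpr = base_gpr + ar[lane];   /* each lane may land on a different GPR */
        printf("lane %d reads GPR %d\n", lane, effective_gpr);
    }
    return 0;
}

Without relative addressing all the ar[] entries are effectively zero and the whole SIMD reads the same GPR index through one wide port, which is exactly the cheap case from above.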
I know all the ISA manuals from R600, R700, Evergreen, HD6900, and Southern Islands (that's what AMD called the GPUs in the manuals).

If using the R700 series as a base, this document may be helpful: http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf
The register addressing is really awful for the VLIW architectures. But anyway:

Section 4.7.4 is helpful. One read each of GPR.X, GPR.Y, GPR.W, and GPR.Z can happen per cycle, and GPR.Trans takes a read port from one of these. There's also a constant file which I think can supply one read per cycle (over 4 cycles?). I'm not sure where writes fit in though, whether they can share this too (and if so, whether the SRAMs need to be both read and write ported). So a VLIW SIMD block may only need something like 5-6 SRAM banks dedicated to it.
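Just to put numbers on that guess, a back-of-the-envelope tally in C under those assumptions (the bank split is my assumption, not something the manual states):

#include <stdio.h>

int main(void)
{
    int gpr_component_banks = 4;   /* GPR.X, GPR.Y, GPR.W, GPR.Z; GPR.Trans only borrows a port */
    int constant_file_banks = 1;   /* one read per cycle, if I read the doc right */
    int extra_write_bank    = 1;   /* only needed if writes can't share the read SRAMs */

    printf("lower bound: %d banks\n", gpr_component_banks + constant_file_banks);
    printf("upper bound: %d banks\n", gpr_component_banks + constant_file_banks + extra_write_bank);
    return 0;
}

That's where the 5-6 figure comes from.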
Let's forget for a moment that each lane has to work on 4 new elements every 4 cycles. Then there is the task of reading up to 12 operands over 3 cycles (some bank swizzling encoded in the instruction takes care of the distribution to the VLIW slots; we don't talk about that here, it's another mess. We also don't consider constants or literals, which don't come from the reg file, nor the PS and PV "pipeline registers", which serve as a kind of compiler-directed data forwarding because the results of the previous instruction are not yet written back to the reg file). For that you clearly need 4 banks. Those banks are not only physically separate, but already logically separate in the ISA. The remaining 4th cycle is used to write back the four results of the previous instruction, by the way. So all four cycles could be fully used up.
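As a toy model of that 4-cycle cadence for a single element in a single lane (ignoring swizzling, constants/literals and PS/PV, as said; the code is just my sketch of the schedule described above):

#include <stdio.h>

int main(void)
{
    const int banks = 4;

    for (int cycle = 0; cycle < 4; ++cycle) {
        if (cycle < 3)
            printf("cycle %d: read %d operands, one per bank (%d of 12 so far)\n",
                   cycle, banks, (cycle + 1) * banks);
        else
            printf("cycle %d: write back the %d results of the previous instruction\n",
                   cycle, banks);
    }
    return 0;
}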
But now we come back to the fact that each SIMD lane actually executes 4 elements of the wavefront over four cycles in a pipelined fashion. That would mean during the first cycle one reads four 32-bit values from the 4 banks for the first element. In the next cycle one has to read both the next four operands for the first element and the first four operands for the second element. And so on. In the end this means one has to constantly read 12 values per cycle and also write 4 values per cycle. That basically needs 4 banks with 4 ports each (3 read, one write; fetching data is done in separate clauses and may steal the necessary cycles when changing clauses).
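The same toy model with 4 elements in flight per lane, each shifted by one cycle, shows the steady state: three elements are always in a read phase (3 x 4 = 12 reads) while one is writing back (4 writes), every single cycle. Again just my sketch of the argument, not how the hardware is actually scheduled.

#include <stdio.h>

int main(void)
{
    const int banks = 4;

    for (int cycle = 0; cycle < 4; ++cycle) {
        int reads = 0, writes = 0;
        for (int elem = 0; elem < 4; ++elem) {
            int phase = (cycle + elem) % 4;   /* where this element is in its 4-cycle cadence */
            if (phase < 3) reads  += banks;   /* operand fetch cycle */
            else           writes += banks;   /* write-back cycle */
        }
        printf("cycle %d: %d reads, %d writes\n", cycle, reads, writes);
    }
    return 0;
}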
But one could have the idea that one simply fetches and writes 128-bit values (4x32 bit) for all 4 elements processed in that lane together, and everything would be fine. The relative addressing of GPRs would destroy this, as the 4 elements can access different GPRs. But I strongly suspect that it uses the same mechanism as a GPR-indexed access of constants, called waterfalling (it basically serializes the 4 elements in each lane; not as bad as serializing all 64 elements in a wavefront for a GPR-indexed constant file access). I really doubt they implemented 4 times the register ports/banks just for this basically never used feature. As I mentioned, GCN dropped support for this (and I guess nobody will notice).
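A last sketch of how I imagine that waterfalling fallback (the structure is my guess, borrowing the waterfalling idea mentioned above for GPR-indexed constant reads): if the 4 elements of a lane end up with the same effective GPR index, one wide 128-bit access covers all of them; as soon as relative addressing makes them diverge, the accesses get serialized element by element.

#include <stdio.h>

int main(void)
{
    int base_gpr = 7;
    int ar[4] = {0, 0, 2, 0};   /* per-element offsets; element 2 diverges */

    int common  = base_gpr + ar[0];
    int uniform = 1;
    for (int e = 1; e < 4; ++e)
        if (base_gpr + ar[e] != common)
            uniform = 0;

    if (uniform) {
        printf("one 128-bit access to GPR %d for all 4 elements\n", common);
    } else {
        for (int e = 0; e < 4; ++e)   /* waterfall: serialize the elements */
            printf("separate access to GPR %d for element %d\n", base_gpr + ar[e], e);
    }
    return 0;
}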