Wii U hardware discussion and investigation *rename

TheLump · Jun 24, 2013

Were there ever any die photos of an e6760 available? I'm not implying anything - I just remember the discussion at the time and it would be interesting to compare the differences between the two chips.

AlNom · Jun 24, 2013

TheLump said:
Were there ever any die photos of an e6760 available? I'm not implying anything - I just remember the discussion at the time and it would be interesting to compare the differences between the two chips.

Don't think so, no. The comparison was about TDP & general perf-level more than anything IIRC.

(((interference))) · Jun 25, 2013

Also, it might have fewer FLOPS than Xenos on paper, but the Wii U arch is probably a good deal more efficient than the R500, or whatever architecture Xenos uses.

Like how GCN is at 100% efficiency vs Xenos at 60%.

3dilettante · Jun 25, 2013

There's one aspect where the VLIW architecture could not avoid falling below peak: shaders that did not have sufficient ILP to fill up all issue slots.
GCN does avoid that case, but there are no real-world metrics by which it would be considered 100% efficient.
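
To illustrate the ILP point, here is a toy sketch (not a model of any real AMD compiler or hardware): a greedy packer that fills 5-wide bundles with whatever operations are ready. The names `pack_vliw` and `utilization` and the two example shaders are made up for illustration, and the model ignores latencies, register port limits and the special role of the transcendental slot.

```python
def pack_vliw(deps, width=5):
    """Greedily pack ops into bundles of at most `width` slots.

    `deps` maps each op name to the set of ops it depends on.
    An op may only issue in a bundle after all of its inputs
    were produced by an earlier bundle.
    """
    remaining = dict(deps)
    done = set()
    bundles = []
    while remaining:
        ready = [op for op, needs in remaining.items() if needs <= done]
        if not ready:
            raise ValueError("dependency cycle")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        for op in bundle:
            del remaining[op]
    return bundles


def utilization(bundles, width=5):
    """Fraction of issue slots that actually carry an op."""
    used = sum(len(b) for b in bundles)
    return used / (len(bundles) * width)


# Hypothetical shader with no ILP: every op needs the previous result.
chain = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}}
# Hypothetical shader with plenty of ILP: five independent ops.
wide = {name: set() for name in "vwxyz"}

print(f"{utilization(pack_vliw(chain)):.0%}")  # 20%
print(f"{utilization(pack_vliw(wide)):.0%}")   # 100%
```

In this toy model the dependent chain leaves four of five slots empty in every bundle, which is the case a scalar-issue design avoids; it says nothing about the other reasons any GPU falls short of peak.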

(((interference))) · Jun 25, 2013

Well, I really wouldn't know, but that's what's in the Durango dev docs: ~60% efficiency for Xenos vs near 100% for GCN.
http://www.neogaf.com/forum/showpost.php?p=48282244&postcount=109

TheLump · Jun 25, 2013

AlNom said:
Don't think so, no. The comparison was about TDP & general perf-level more than anything IIRC.

Oh ok, thanks. There was some (idle) speculation about Latte being based off of the same thing (i.e. a downclocked, modified e6760 chip without the GDDR5 on board) wasn't there? It's obviously not, now that we have more info of course...

That was all due to the AMD rep email saga (which was obviously not a reliable source, as concluded at the time). I do wonder why the cs rep was so happy to confirm something like that though. I was one of the ones who got the same response from AMD when testing the email claim, so at the time it seemed plausible (until I came back down to earth and realised it's unlikely a rep would actually know that info for certain). Regardless, it would be interesting to see a die shot of an e6760 just to see how far off those claims were. Maybe it would give us an insight into what AMD were doing with MCM packages around that time, which might help with some of the lingering uncertainties regarding Latte's layout? Unlikely but worth a thought.

PSA: I am not for a second implying or trying to spark an implication that the two chips are in any way closely related, just curious how the two would compare given all the debate in the past!

Entropy · Jun 25, 2013

BobbleHead said:
Except his flat-out written statement is only partially correct. It is not fabbed at TSMC; it is fabbed at Renesas. He can't tell TSMC vs Renesas from the work they did, but he can tell 40nm vs 55nm.

Here is a source for it being fabbed at Renesas:
http://www.eetasia.com/ART_8800678216_499489_NT_f90242a2.HTM

If you don't have a login you can get to that same article from Google without a login via:
http://www.google.com/url?sa=t&rct=...3DfFwTTWt5pAKSA&bvm=bv.48293060,d.cGE&cad=rja

That article was posted in November 2012, and she initially declares: "Although no teardown specialists have worked on the Wii U yet,"

Well, by now they have, and they say the die is manufactured by TSMC.

She then goes on to assert, based on Renesas referring to a gaming-oriented SoC and on Nintendo's video, that Renesas is indeed involved with the Wii U. Those are her sources. The point of her article is to assert that Renesas collects revenue from the part, which doesn't actually contradict TSMC doing the fabbing of the GPU. (Besides, Renesas has been using TSMC for manufacture on the 40nm node, and will deepen that relationship moving forward to finer lithography as Renesas is moving towards a fab-lite model, focussing on IP, design and synthesis. Which is good, since Nintendo will likely want to shrink that GPU at some point in the future.)

Again, since this was about SPU count: if we have physical evidence and a credible source for the process node (and we do), AND the most easily recognizable features on the die are the SPU blocks, AND measurements versus other SPU blocks by AMD at both 55 and 40nm say that we are looking at 320 SPUs, then Occam's Razor, "Don't cross the bridge for water", et cetera all mandate that we should accept that we are dealing with 320 SPUs, and look elsewhere to explain less than stellar performance with (some of) the initial batch of 360 ports. It's not as if there aren't other very obvious areas of concern for such ports, so using those to instead question the best data point available to us is, in my view, very strange.

function · Jun 25, 2013

You are using "Occam's Razor" incorrectly. It would not allow you to assume re-engineering of the SRAM register banks into a form that has never been seen anywhere else. It works on competing hypotheses that both correctly explain what can be observed.

You offer no explanation for there only being enough register banks for 160 shaders. To do so actually adds complexity and complication; it does not make things simpler. Ignoring what can be observed does not qualify as a correct use of Occam's Razor.

Hold whatever opinion you want to, but know that you are using Occam's Razor incorrectly, and not in a small way.

pc999 · Jun 25, 2013

(((interference))) said:
Well, I really wouldn't know, but that's what's in the Durango dev docs: ~60% efficiency for Xenos vs near 100% for GCN.
http://www.neogaf.com/forum/showpost.php?p=48282244&postcount=109

Does anyone know the efficiency numbers for VLIW5 pipeline architectures?

Rangers · Jun 25, 2013

(((interference))) said:
Well, I really wouldn't know, but that's what's in the Durango dev docs: ~60% efficiency for Xenos vs near 100% for GCN.
http://www.neogaf.com/forum/showpost.php?p=48282244&postcount=109

That was on one example shader, surely chosen for just that purpose. A best case scenario for GCN.

I'm sure GCN is hugely more efficient, of course.

ltcommander.data · Jun 26, 2013

pc999 said:
Does anyone know the efficiency numbers for VLIW5 pipeline architectures?

Those 60% Xenos efficiency numbers are about right. AMD reported an average 3.4 slot utilization for VLIW5 as the reason why they dropped an ALU from their SPU and went VLIW4 in Cayman.

Which does bring up another angle to the 160, 240, 320 SP Latte debate. What if Nintendo took a 320 SP RV730 and made it VLIW4 by cutting every 5th SP resulting in 256 SPs? With better ability to code close to metal on consoles, developers could probably get better slot utilization than the 3.4 on PC, but the 5th SP would still be underutilized. By getting rid of the 5th SP, the Wii U loses a little bit of peak performance, which may contribute to why it may not be performing as well as some people think a 320 SP GPU should, but they save die space and power. It's still VLIW based so it wouldn't require the design effort that other custom work would and some of the changes made in Cayman could be used as a reference.
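
The arithmetic behind those two figures can be checked with a short sketch. The 3.4 average and the 320 SP starting point come from the post above; the function names are made up for illustration, and the calculation only covers peak slot counts, not how real shaders would schedule on a narrower unit.

```python
VLIW5_WIDTH = 5
AVG_SLOTS_USED = 3.4  # average VLIW5 slot utilization quoted in the post


def slot_utilization(avg_used, width):
    """Average fraction of issue slots filled per bundle."""
    return avg_used / width


def sp_count_after_dropping_lane(total_sps, old_width=5, new_width=4):
    """SP count if every `old_width`-wide unit is cut to `new_width` lanes."""
    units, remainder = divmod(total_sps, old_width)
    if remainder:
        raise ValueError("SP count is not a multiple of the unit width")
    return units * new_width


print(f"{slot_utilization(AVG_SLOTS_USED, VLIW5_WIDTH):.0%}")  # 68%
print(sp_count_after_dropping_lane(320))                       # 256
```

So 3.4 of 5 slots works out to 68%, which is in the same range as the ~60% figure, and removing one lane from each of the 64 five-wide units in a 320 SP part leaves 256 SPs.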

Fourth Storm · Jun 29, 2013

ltcommander.data said:
Those 60% Xenos efficiency numbers are about right. AMD reported an average 3.4 slot utilization for VLIW5 as the reason why they dropped an ALU from their SPU and went VLIW4 in Cayman.

Which does bring up another angle to the 160, 240, 320 SP Latte debate. What if Nintendo took a 320 SP RV730 and made it VLIW4 by cutting every 5th SP resulting in 256 SPs? With better ability to code close to metal on consoles, developers could probably get better slot utilization than the 3.4 on PC, but the 5th SP would still be underutilized. By getting rid of the 5th SP, the Wii U loses a little bit of peak performance, which may contribute to why it may not be performing as well as some people think a 320 SP GPU should, but they save die space and power. It's still VLIW based so it wouldn't require the design effort that other custom work would and some of the changes made in Cayman could be used as a reference.

That would be nice, but the shader blocks only have enough register pools for exactly 160 shaders. Also, I'm quite confident that there are 8 TMUs, which makes perfect sense for two SIMD cores. Here's a post I made on GAF explaining how I was able to identify the TMUs/L1 cache: http://www.neogaf.com/forum/showpost.php?p=59514681&postcount=5604

Entropy · Jul 9, 2013

Fourth Storm said:
That would be nice, but the shader blocks only have enough register pools for exactly 160 shaders. Also, I'm quite confident that there are 8 TMUs, which makes perfect sense for two SIMD cores. Here's a post I made on GAF explaining how I was able to identify the TMUs/L1 cache: http://www.neogaf.com/forum/showpost.php?p=59514681&postcount=5604

Sorry that I haven't been able to come by in a while.
Let's look at the 160 SPU shader hypothesis, and why there are good reasons to question it.

It is based on two underlying assumptions:
1. The SRAM blocks of Latte have to be arranged exactly as on the RV770 (visually as well as logically).
2. The analysis of the constituents of these particular blocks on the die shot is correct.

Which would lead to this conclusion:
3. The total number of SPUs is 160, arranged in 8 groups of 20.

But since the density of an SPU ALU block is equivalent to what AMD provided in 2007/8 on 55nm lithography, it also follows that either:
4a. Latte is actually produced in 55nm lithography.
or
4b. The Latte SPU blocks are roughly half the density of what AMD produced on their 40nm Brazos platform, two years earlier, with DX11 capable ALUs. For whatever reason.

4a is contradicted by both Chipworks, and the fact that the eDRAM density is quite close to what IBM is achieving on their 32nm Power7+! It is actually better than any product I've managed to find on 40nm. (55nm eDRAM cell sizes are a factor of two larger again when compared to 40/45nm cell sizes from TSMC/Renesas/IBM.) To achieve such high eDRAM density on 40nm, we have to assume that process maturity and relatively low clock targets have contributed to the good result. Which is not inconceivable, after all. But to assume that it could produce yet another factor of two in density... Suffice to say that there isn't a single example even in the ballpark of such density on 55nm anywhere.

4b is simply bizarre. Latte is introduced two years after AMD's Ontario, has less demanding clock targets, and is assumed by some to be less complex. Assuming that Latte's SPU blocks under these circumstances would be roughly half the density of Ontario's is very, very strange. How could that be?

So there it is - if you assume that points 1 and 2 are true, then you paint yourself into a very difficult corner where you have to justify how either point 4a or 4b could be correct.
Personally I prefer to question point 1. AMD has been modifying their VLIW GPU architecture for the better part of a decade by now; of course they can change (the appearance of) the SRAM blocks!

In which case 320 SPUs is a good match for all data we have.
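
As a rough sanity check on the "factor of two" between 55nm and 40nm, here is a sketch of ideal geometric scaling. It assumes area shrinks with the square of the node name, which real processes only approximate (node names are labels, and SRAM, logic and eDRAM scale differently), so treat the result as a ballpark only. The function name is made up for illustration.

```python
def ideal_density_gain(old_node_nm, new_node_nm):
    """Density ratio under ideal scaling: area goes with feature size squared."""
    return (old_node_nm / new_node_nm) ** 2


gain = ideal_density_gain(55, 40)
print(f"55nm -> 40nm: about {gain:.2f}x denser under ideal scaling")
```

Under that assumption a block laid out on 40nm would be a little under twice as dense as the same block on 55nm, which is the size of the gap being argued over.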

liolio · Jul 9, 2013

Entropy said:
Sorry that I haven't been able to come by in a while.
Let's look at the 160 SPU shader hypothesis, and why there are good reasons to question it.

It is based on two underlying assumptions:
1. The SRAM blocks of Latte have to be arranged exactly as on the RV770 (visually as well as logically).
2. The analysis of the constituents of these particular blocks on the die shot is correct.

Which would lead to this conclusion:
3. The total number of SPUs is 160, arranged in 8 groups of 20.

But since the density of an SPU ALU block is equivalent to what AMD provided in 2007/8 on 55nm lithography, it also follows that either:
4a. Latte is actually produced in 55nm lithography.
or
4b. The Latte SPU blocks are roughly half the density of what AMD produced on their 40nm Brazos platform, two years earlier, with DX11 capable ALUs. For whatever reason.

4a is contradicted by both Chipworks, and the fact that the eDRAM density is quite close to what IBM is achieving on their 32nm Power7+! It is actually better than any product I've managed to find on 40nm. (55nm eDRAM cell sizes are a factor of two larger again when compared to 40/45nm cell sizes from TSMC/Renesas/IBM.) To achieve such high eDRAM density on 40nm, we have to assume that process maturity and relatively low clock targets have contributed to the good result. Which is not inconceivable, after all. But to assume that it could produce yet another factor of two in density... Suffice to say that there isn't a single example even in the ballpark of such density on 55nm anywhere.

4b is simply bizarre. Latte is introduced two years after AMD's Ontario, has less demanding clock targets, and is assumed by some to be less complex. Assuming that Latte's SPU blocks under these circumstances would be roughly half the density of Ontario's is very, very strange. How could that be?

So there it is - if you assume that points 1 and 2 are true, then you paint yourself into a very difficult corner where you have to justify how either point 4a or 4b could be correct.
Personally I prefer to question point 1. AMD has been modifying their VLIW GPU architecture for the better part of a decade by now; of course they can change (the appearance of) the SRAM blocks!

In which case 320 SPUs is a good match for all data we have.

I think you are right IF there is indeed 32MB of eDRAM on the GPU die; I don't remember Nintendo confirming that information. Though as nobody, in rumors or leaks, has questioned that amount, I think you are right that we might be dealing with 320 SPs.

Fourth Storm · Jul 9, 2013

Entropy said:
Sorry that I haven't been able to come by in a while.
Let's look at the 160 SPU shader hypothesis, and why there are good reasons to question it.

It is based on two underlying assumptions:
1. The SRAM blocks of Latte have to be arranged exactly as on the RV770 (visually as well as logically).
2. The analysis of the constituents of these particular blocks on the die shot is correct.

Which would lead to this conclusion:
3. The total number of SPUs is 160, arranged in 8 groups of 20.

But since the density of an SPU ALU block is equivalent to what AMD provided in 2007/8 on 55nm lithography, it also follows that either:
4a. Latte is actually produced in 55nm lithography.
or
4b. The Latte SPU blocks are roughly half the density of what AMD produced on their 40nm Brazos platform, two years earlier, with DX11 capable ALUs. For whatever reason.

4a is contradicted by both Chipworks, and the fact that the eDRAM density is quite close to what IBM is achieving on their 32nm Power7+! It is actually better than any product I've managed to find on 40nm. (55nm eDRAM cell sizes are a factor of two larger again when compared to 40/45nm cell sizes from TSMC/Renesas/IBM.) To achieve such high eDRAM density on 40nm, we have to assume that process maturity and relatively low clock targets have contributed to the good result. Which is not inconceivable, after all. But to assume that it could produce yet another factor of two in density... Suffice to say that there isn't a single example even in the ballpark of such density on 55nm anywhere.

4b is simply bizarre. Latte is introduced two years after AMD's Ontario, has less demanding clock targets, and is assumed by some to be less complex. Assuming that Latte's SPU blocks under these circumstances would be roughly half the density of Ontario's is very, very strange. How could that be?

So there it is - if you assume that points 1 and 2 are true, then you paint yourself into a very difficult corner where you have to justify how either point 4a or 4b could be correct.
Personally I prefer to question point 1. AMD has been modifying their VLIW GPU architecture for the better part of a decade by now; of course they can change (the appearance of) the SRAM blocks!

In which case 320 SPUs is a good match for all data we have.

320 SPUs would be a good match for all the data only if we knew for sure that the chip was manufactured on TSMC 40nm, there were 32 register pools in each of the SPU blocks (as in Llano, Brazos, and Trinity), and there existed more than two blocks of TMUs.

At the risk of sounding brash, I can say with 99% certainty that I've correctly identified the TMU blocks in that link I posted. As much as we can say for sure which blocks are the shaders, I believe we can now do this with the TMUs. The more die shots I have compared, the more obvious it seems.

The size of the SPU blocks is the only outlier in our data, really, but I believe function already mentioned that it can be explained when you take into account the different fab house (Renesas vs TSMC). I have attempted to clarify this in the past (although it seems not to have taken hold, unfortunately) that the 40nm TSMC was Jim's guess or a guess from one of his colleagues after taking an initial glance at the die. I followed up with him shortly after on the subject and he said that they had not performed any precise gate measuring, and that 40nm and 55nm were actually pretty hard to tell apart without getting those figures.

I have actually heard through the grapevine that Latte is manufactured on Renesas' 45nm process node. This makes a great deal of sense to me as it seems apparent from the heat spreader labels (which do not mention TSMC) and articles like this one here that the chip is being manufactured by Renesas in-house and is not outsourced. 45nm fits right in with Renesas' current production lines.

To my knowledge, Latte is the first Radeon GPU that Renesas have worked on, and this lack of prior knowledge may have contributed to a less than optimal size for some of the components, which are practically hand-crafted for TSMC's fab lines anyway. Of course, Renesas' engineers have done a bang up job on integrating the eDRAM, something that I doubt TSMC would have been able to pull off. But once you combine this lack of previous experience, larger process node, the possibility of extra transistors here and there to run the BC shim layer, and possibly even just a less dense design target to avoid a RROD situation, I believe we start painting a decent picture as to why those SPU blocks are so large.

function · Jul 9, 2013

For what it's worth I think Fourth Storm's reasoning is sound and I agree with him. He does a better job of explaining it than I have too. The work on identifying the TMUs shows real commitment.

I think the below is a particularly important point (that I wasn't aware of / had forgotten / drunk?) to bear in mind given where the discussion is currently:

Fourth Storm said:
... (Renesas vs TSMC). I have attempted to clarify this in the past (although it seems not to have taken hold, unfortunately) that the 40nm TSMC was Jim's guess or a guess from one of his colleagues after taking an initial glance at the die. I followed up with him shortly after on the subject and he said that they had not performed any precise gate measuring, and that 40nm and 55nm were actually pretty hard to tell apart without getting those figures.

Based on the evidence for 8 TMUs (I'd originally assumed 16), it makes me think again that maybe textured fill rate rather than ROP BW might be involved in some of the high-overdraw alpha-texture performance hits we saw in multiplatform Wii U games (CoD etc).
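
A back-of-the-envelope sketch of why the TMU count matters for textured fill rate. The 550 MHz clock is an assumption that does not come from this thread, and one texel per TMU per clock is a simplification; only the 8 vs 16 ratio is independent of both. The function name is made up for illustration.

```python
ASSUMED_CLOCK_MHZ = 550.0  # assumption, not stated in the thread


def texel_rate_gtexels(tmus, clock_mhz, texels_per_tmu_per_clock=1):
    """Peak texel rate in Gtexels/s for a given TMU count and clock."""
    return tmus * texels_per_tmu_per_clock * clock_mhz * 1e6 / 1e9


rate_8 = texel_rate_gtexels(8, ASSUMED_CLOCK_MHZ)
rate_16 = texel_rate_gtexels(16, ASSUMED_CLOCK_MHZ)
print(f"8 TMUs:  {rate_8:.1f} Gtexels/s")
print(f"16 TMUs: {rate_16:.1f} Gtexels/s")
print(f"ratio:   {rate_8 / rate_16:.2f}")
```

Whatever the real clock is, going from an assumed 16 TMUs to 8 halves the peak textured fill rate, which is why the revised count changes where the bottleneck might be.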

Grall · Jul 9, 2013

liolio said:
I think you are right IF there is indeed 32MB of eDRAM on the GPU die

It can't be 16MB (half the RAM to fit 55nm rather than 40), because allegedly the Wii U uses eDRAM to emulate the Wii's 1T-SRAM main memory, and the Wii/GC has 24MB of 1T-SRAM main memory. Also, it probably isn't 24MB of eDRAM on the die, since the number of visible banks doesn't seem to match; i.e., the bank count looks like an even power of 2.

So it's probably 32MB after all.
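
The elimination above can be written out as a small sketch. It only encodes the two constraints from the post (enough room for the 24MB the Wii expects, and a power-of-two size to match the visible banks); the function names and the candidate list are made up for illustration.

```python
WII_MAIN_MEMORY_MB = 24  # 1T-SRAM main memory size cited in the post


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... using the single-set-bit trick."""
    return n > 0 and (n & (n - 1)) == 0


def plausible_edram_sizes(candidates_mb, required_mb=WII_MAIN_MEMORY_MB):
    """Keep sizes that can hold the emulated memory and are a power of two."""
    return [mb for mb in candidates_mb
            if mb >= required_mb and is_power_of_two(mb)]


print(plausible_edram_sizes([16, 24, 32]))  # [32]
```

16MB fails the capacity test, 24MB fails the power-of-two test, and 32MB passes both.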

Exophase · Jul 9, 2013

ltcommander.data said:
Those 60% Xenos efficiency numbers are about right. AMD reported an average 3.4 slot utilization for VLIW5 as the reason why they dropped an ALU from their SPU and went VLIW4 in Cayman.

Which does bring up another angle to the 160, 240, 320 SP Latte debate. What if Nintendo took a 320 SP RV730 and made it VLIW4 by cutting every 5th SP resulting in 256 SPs? With better ability to code close to metal on consoles, developers could probably get better slot utilization than the 3.4 on PC, but the 5th SP would still be underutilized. By getting rid of the 5th SP, the Wii U loses a little bit of peak performance, which may contribute to why it may not be performing as well as some people think a 320 SP GPU should, but they save die space and power. It's still VLIW based so it wouldn't require the design effort that other custom work would and some of the changes made in Cayman could be used as a reference.

The problem is that the 5th execution unit in AMD's VLIW5 actually does unique operations, so you can't just drop it. Cayman redistributed the operations to the other four units.

Grall said:
It can't be 16MB (half the RAM to fit 55nm rather than 40), because allegedly the Wii U uses eDRAM to emulate the Wii's 1T-SRAM main memory, and the Wii/GC has 24MB of 1T-SRAM main memory. Also, it probably isn't 24MB of eDRAM on the die, since the number of visible banks doesn't seem to match; i.e., the bank count looks like an even power of 2.

So it's probably 32MB after all.

Marcan says that 32MB of GPU RAM is directly addressable at some location. He's pretty convincingly demonstrated he can run code on the thing so I'd consider his word on this to be very reliable.

ltcommander.data · Jul 11, 2013

Exophase said:
The problem is that the 5th execution unit in AMD's VLIW5 actually does unique operations, so you can't just drop it. Cayman redistributed the operations to the other four units.

Yeah, I meant drop one out of every five ALUs, specifically one of the simple ALUs, rather than the actual 5th complex ALU. Keeping the dedicated t-unit should involve fewer hardware and driver changes than implementing 4 beefed up ALUs.

For a more general question regarding the possibility of cross-platform games between the Wii U and the Xbox One and PS4: is it possible that a game targeting 1080p60 on the Xbox One (assuming no cloud compute offloading, which makes it the lower graphical target compared to the PS4) could run without drastic graphical cutbacks at 720p30 on the Wii U?

Grall · Jul 11, 2013

How would you perform 4D transforms and whatnot with just 3 simple units? The T-unit doesn't execute those instructions AFAIK (that's why AMD's earlier designs were VLIW5 to begin with...)
