Make educated guesses of Durango/Orbis die sizes, TDPs, and costs based on VGLeaks

It's probably a DRAM array with a SRAM interface of some sort.

It doesn't have to be Mosys' 1T-SRAM, which has very short latency; it could have longer latencies (and higher density).

Cheers
 
Some forumers over on SemiAccurate think the chip or SoC might be stacked on SRAM. I think that kinda explains how they can integrate the huge die with everything else... Charlie says it's an active interposer. In other words SRAM = interposer. Also, I don't think it needs to be 28nm in that case.

The primary reason to use SRAM over EDRAM for large arrays is that the processing steps required for DRAM compromise logic performance. If it is a different die altogether I would expect them to use DRAM, since you have at least four times as much capacity then.

Cheers
 
The primary reason to use SRAM over EDRAM for large arrays is that the processing steps required for DRAM compromise logic performance. If it is a different die altogether I would expect them to use DRAM, since you have at least four times as much capacity then.

Maybe they valued latency over capacity then? And any yield or sourcing issues too.
 
Maybe they valued latency over capacity then? And any yield or sourcing issues too.

There's a limit to what SRAM can do, or needs to do for latency if it's a separate die or if it's on the far side of the same chip. If it's not linked as tightly as an L1 or L2 cache, the latency of the memory is added on top of the miss latency of the nearer caches, signal transit time, and arbitration overhead.

IBM has shown that having EDRAM as a last-level cache can be mostly equivalent to an SRAM (just several times bigger) for the sorts of chips it makes, save some corner cases where the slower DRAM arrays and banking conflicts can cause latency to flare up to the point that IBM's design can't hide that they aren't SRAM.

In some circumstances, EDRAM for large memory pools can provide a benefit because its density is such that the physical distances signals need to travel are smaller.
If it's across an interface as well, then SRAM's benefits are swamped even further.
 
Right, that's why, if it's all in the same package as the interposer, there wouldn't be any latency problems. I think that's the only thing that can explain all the whys and hows of going with SRAM.
 
Is somebody like Aaron Pink around?
TSMC claims a peak density of 3900 Kgates/mm^2; I don't know how that translates into MB, and I guess it depends on the structure of that pool of memory.
I guess it could give us quite some reliable estimates based on that raw data.

My take is that it is plain SRAM on the same die as the GPU.

For reference, Intel's 22nm values:
“a high-density 0.092 µm2 cell, a low-voltage 0.108 µm2 cell, and a high-performance 0.130 µm2 cell. The SRAM operated at 4.6 GHz at 1 V.”
TSMC's high-performance cell @28nm is 0.127 µm2.
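
As a rough way to translate that quoted logic density into memory capacity, here is a minimal sketch (Python). The 4-transistors-per-gate (NAND2-equivalent) convention and the 6T cell are my assumptions, and a dedicated SRAM macro is laid out much denser than random logic, so this only gives an upper bound on area:

```python
# Rough translation of TSMC's quoted peak logic density into SRAM capacity.
# Assumptions: gate density counted in 2-input-NAND equivalents (~4 transistors
# per gate) and a 6T SRAM cell.
GATE_DENSITY_PER_MM2 = 3900e3      # 3900 Kgates/mm^2, as quoted above
TRANSISTORS_PER_GATE = 4           # NAND2-equivalent convention (assumption)
TRANSISTORS_PER_BIT = 6            # 6T cell

bits_per_mm2 = GATE_DENSITY_PER_MM2 * TRANSISTORS_PER_GATE / TRANSISTORS_PER_BIT
mb_per_mm2 = bits_per_mm2 / (8 * 2**20)
print(f"~{mb_per_mm2:.2f} MB/mm^2 at logic density "
      f"-> ~{32 / mb_per_mm2:.0f} mm^2 for 32 MB")
# ~0.31 MB/mm^2 and ~103 mm^2 -- a real SRAM array built from the 0.127 um^2
# cell is roughly 3x denser, which is why cell size is the better starting point.
```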
 
6 transistors/bit * 8 * 32,000,000 = ~1.5 billion transistors for the actual memory. That would probably be less than 120mm2 @28nm, but there would be other logic on there.
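
Spelling that arithmetic out as a quick sanity check (a minimal sketch; it uses 32 × 10^6 bytes as in the post above and the 0.127 µm2 test-chip cell, so the area is a cells-only floor with no decoders, sense amps or redundancy):

```python
# Sanity check of the "1.5 billion transistors" figure and the raw cell area.
BYTES = 32_000_000                  # 32 MB, counted as 32e6 bytes like the post
BITS = BYTES * 8
TRANSISTORS = BITS * 6              # 6T cell
CELL_UM2 = 0.127                    # TSMC 28nm HP test-chip cell

raw_area_mm2 = BITS * CELL_UM2 / 1e6    # 1 mm^2 = 1e6 um^2
print(f"{TRANSISTORS / 1e9:.2f} billion transistors, "
      f"~{raw_area_mm2:.0f} mm^2 of raw cells")
# -> 1.54 billion transistors and ~33 mm^2 of bare cells; anything beyond
#    that is array overhead plus whatever other logic sits on the die.
```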

Wow, that is really big, I never expected it to be so large! With ~70mm2 for the CPU and ~140mm2 for the GPU... they will almost reach the chip size of a Radeon 7950/70.
 
Durango may actually have a bigger SoC than Orbis. The eSRAM will "make up" for Orbis' CU advantage in die area, and then there will also be the DMEs too, but we don't know their transistor count or the die space they take.
 
http://www.chipworks.com/blog/technologyblog/2012/12/11/a-review-of-tsmc-28-nm-process-technology/
Earlier this year, we completed a limited analysis of the high density SRAM on the AMD Radeon™ HD 7970 215-0821060 graphics processor, which was fabricated with TSMC’s HP process. Our TEM analysis confirmed the 215-0821060 transistor structure was identical to that seen in the Altera Stratix V device, as would be expected since both are based on the TSMC 28 nm HP process. The 215-0821060 features a 0.16 µm2 6T-SRAM with the transistors arranged in a uniaxial layout. By contrast the 90 nm ATI 215PADAKA12FG graphics processor extracted from ATI Radeon X1950 Pro Graphics Card had a SRAM cell that is over five times bigger, at 0.86 µm2.
32MB would be only 40mm2?
 
Why is there such a big difference between the SRAM test chip (0.127) and an array on a real GPU (0.16)?
 
Hilariously wrong.
How is that post supposed to be helpful? I mean, it seems that you don't know how much overhead there is to put those memory cells together either.
I don't know, MrFox doesn't know either; it is not an issue by itself, but I don't see why you answered in such a mocking tone... At least MrFox dug into an article (which I actually read too while I was myself searching for information before posting) and brought some extra information to the conversation.
On the other hand, your post is hilariously useless, to follow your wording, and while my point is not to sound harsh, you can see for yourself that it is an unpleasant way to comment on someone's post.

40mm^2 would indeed be without overhead (a useless bunch of memory cells), but how big is the overhead? Even that is unclear, as the cell size in TSMC's own test chip and the size measured in a real GPU are different, as MrFox is pointing out.

For a cache, the overhead has to be really big. A 50% overhead would give ~60mm^2; 100% gets to 80mm^2.
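
The 40 / 60 / 80 mm^2 numbers in this exchange can be reproduced directly from the Chipworks cell size; a small sketch (the overhead percentages are the guesses from these posts, not measured values):

```python
# Reproduce the 40 / 60 / 80 mm^2 figures: Chipworks' measured 0.16 um^2 cell
# plus guessed array-overhead factors (the overheads are pure speculation).
BITS = 32_000_000 * 8
CELL_UM2 = 0.16                     # 6T cell measured on the HD 7970 (Chipworks)

raw_mm2 = BITS * CELL_UM2 / 1e6
for overhead in (0.0, 0.5, 1.0):    # 0%, 50%, 100% overhead
    print(f"{overhead:>4.0%} overhead: ~{raw_mm2 * (1 + overhead):.0f} mm^2")
# -> ~41, ~61, ~82 mm^2 for 32 MB of 6T SRAM
```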

My guess is that Durango is made of 2 chips:
1) the GPU, which includes 32MB of eSRAM.
Looking at Cape Verde and what could be the size of the scratchpad (let's go with 80mm^2), I could see the chip being in the same ballpark as Pitcairn. That would be with 12 CUs and something based on GCN.
There would be no trouble fitting the IO (a 256-bit bus to the RAM and a reasonably fast link to the CPU, ~30GB/s).
2) the CPU: 8 Jaguar cores and the L2 would not take much room, even with extra units serving specific purposes (security, sound, what not). If there are only CPU cores, cache and IO, 80mm^2 is definitely enough. Adding stuff would not make the chip "big" any way I look at it.

Overall, if I follow that line of thinking, I could see the whole set-up being pretty affordable. The silicon budget would not be much higher than what we have in the last Xbox revision.
Down the line the plan would be to shrink the system only once, putting the CPU and GPU on the same SoC.

Still, the GPU(/north bridge) would be above 185mm^2 and as such a tad more costly than it could be. Overall the system design seems really biased toward "low" production costs, to the point where it gets me to wonder if MSFT could have pushed further.
The tiniest GPU that sold with a 256-bit bus was barely above 190mm^2. I wonder if it would be doable to get all the IO into a chip that is 185mm^2 or just below.
I do not know how DDR3 vs GDDR5 compare wrt the number of pins you need; I guess some people here know.

So, going further down my line of thinking, I could see MSFT using something that is neither GCN nor a previous AMD GPU architecture.
I think that a significant amount of the transistors AMD spent on GCN may be deemed useless for MSFT's primary use, which is graphics.
The GCN architecture has to keep track of more threads (vs the VLIW4/5 architectures), the amount of "memory" on chip has increased to meet DX requirements, etc. You have the ACEs.
How much space that amounts to is unknown, though looking at the transistor count there was a beefy increase from, say, Juniper to Cape Verde; it turned out well in both perf and transistor density, though.
When all is said and done, I could see MSFT cutting corners and taking parts of different GPU architectures to try to get the GPU as tiny as they can get it to be.

If I look at the SIMDs alone, the difference in efficiency is not that big between the VLIW4 architectures and GCN. GCN dealing with graphics should keep 100% of its ALUs busy (in an ALU-bound scenario, obviously), whereas according to AMD's own data, on average (in the same ALU-bound scenario) the VLIW keeps 3.8 ALUs out of 4 busy => 95%.

I could see them use smaller "global data share" and "local data share" pools, disregarding the requirements they set in the PC realm.
I could see them pass on the improvements GCN GPUs brought wrt tessellation; in a closed environment I could see Cayman-level performance being enough. MSFT has enough grip to enforce on publishers a proper level of adaptive tessellation (the level matching the performance of their hardware).

I could see them pass on what seems to be a reworked block in GCN GPUs which might include the command processor and the ACEs.

The one thing they should definitely take from GCN (not sure of my wording here... I mean use the GCN ROPs) is the ROPs, which seem way more efficient than the ones in previous architectures.

Overall I could see something in many regards closer to the HD69xx (or what is in Trinity) than to the discrete GPUs AMD ships nowadays.

If my goal were to be really cheap, I would make quite some trade-offs to get the chip to 185mm^2 (or just below). Those trade-offs save money, and that could be the part the engineers could not compromise on, whereas the amount of performance lost would be minimal in percent vs, say, the ~220mm^2 GCN-based GPU/chip I was discussing at the beginning of my post.

Actually I would go as far as cutting the number of SIMDs to 10 if needed. Which, looking at the rumors we have, could mean that 2 SIMDs are included on the CPU die. Again it would possibly be less optimal, but looking at costs as the primary driver for the system... Edit: actually I could see them using GCN, as compute performance could be more relevant.

In which case, MSFT could end up with something like this (I'm not pretending that it would have no impact on performance, I'm arguing it would have a reasonable impact):
1) a ~185mm^2 chip including the GPU and the scratchpad memory.
2) an 80/90mm^2 chip including the 8 Jaguar cores, the L2 and a 2-SIMD GPU.
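
Adding up the guesses in this post as a rough silicon budget (a sketch only: every figure below is one of the speculative numbers assumed above, and the split between GPU logic and scratchpad on the big die is my own rough partition to make the ~185mm^2 total work):

```python
# Rough silicon budget for the two-chip guess above. Every number here is one
# of the speculative figures from this post (or my own partition of them),
# not a measurement.
gpu_chip_mm2 = {
    "GPU logic (GCN-ish, ~12 CUs) + 256-bit DDR3 IO + CPU link": 105,  # my split
    "32 MB eSRAM scratchpad incl. overhead": 80,                       # guess above
}
cpu_chip_mm2 = {
    "8 Jaguar cores + L2 + glue/IO": 70,   # guess above
    "2 extra SIMDs + misc units": 15,      # guess
}

print(f"GPU / north-bridge die: ~{sum(gpu_chip_mm2.values())} mm^2")
print(f"CPU die:                ~{sum(cpu_chip_mm2.values())} mm^2")
# -> ~185 mm^2 and ~85 mm^2, matching the ballparks in the post.
```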

Now, along with DDR3, I could see MSFT pretty much replacing the existing 360 SKUs (maybe not the Arcade, if they want the HDD to be standard, which I would wish for) with the new system, keeping the pricing structure and subsidizing the difference in BOM between those products.
To me the kind of specs we hear about Durango hints at a pretty aggressive pricing strategy.

Edit: If the IO is an issue, and if it is doable to connect 8GB to a 192-bit bus (looking at some PCs shipping with a Bobcat APU, so a 64-bit bus, along with 4GB of RAM, it should be doable), I would make that trade-off too. Definitely, looking at the GPU (# of ROPs and ALUs, and the amount of bandwidth the scratchpad is supposed to provide), it is unclear to what extent less bandwidth to the main RAM would alter performance; 60+GB/s sounds a bit overkill.
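
For the bus-width trade-off in that edit, peak DDR3 bandwidth scales linearly with bus width; a quick sketch (the DDR3-2133 speed grade is my assumption, not something from the rumours):

```python
# Peak DDR3 bandwidth vs. bus width (the DDR3-2133 speed grade is an assumption).
DDR3_MT_PER_S = 2133                # mega-transfers per second, per pin
for bus_bits in (64, 192, 256):
    gb_per_s = DDR3_MT_PER_S * 1e6 * bus_bits / 8 / 1e9
    print(f"{bus_bits:>3}-bit bus: ~{gb_per_s:.0f} GB/s peak")
# -> ~17, ~51, ~68 GB/s; dropping from 256-bit to 192-bit costs ~17 GB/s of
#    peak main-RAM bandwidth, before counting whatever the scratchpad provides.
```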

Edit 2: None of these efforts/arbitrations might be needed, depending on the actual size of the scratchpad + move engines; still, it could impact the IO.
 
How is that post supposed to be helpful?

It wasn't. It was posted to let him know he was wrong. If I was paid to teach people things I might bother to make a long post why, but since I am not being paid, I'll let people do their own research and figure it out.
 
It wasn't. It was posted to let him know he was wrong. If I was paid to teach people things I might bother to make a long post why, but since I am not being paid, I'll let people do their own research and figure it out.
The spirit of discussion does promote an open sharing of knowledge and ideas without material recompense. You're under no obligation of course, but a lack of at least a line of broad correction moves your contribution outside of the scope of discussion. I can understand not doing that for trolls, but MrFox is partaking in a legitimate line of conversation.
 
It wasn't. It was posted to let him know he was wrong. If I was paid to teach people things I might bother to make a long post why, but since I am not being paid, I'll let people do their own research and figure it out.

According to the Chipworks die shot the big edram block on the Wii U is ~40 mm^2 (or just under). Are you saying that it's not, or that it's not got 32 MB of edram? Or am I misunderstanding completely?
 
It wasn't. It was posted to let him know he was wrong. If I was paid to teach people things I might bother to make a long post why, but since I am not being paid, I'll let people do their own research and figure it out.
Ninjaprime, I'm laughing at the superior intellect.

Previously I brought up that Mosys describes their 1T cell size with the overhead included, making it easy to calculate memory density. Afterwards, I wanted to give the 0.16 figure as a better starting point than 0.127 because it's based on a real implementation (not a test chip). I wasn't sure what the norm was for calculating the overhead, or what needed to be included when talking about density (because I assumed it's not linear, and depends on the chosen granularity, width, and probably other things), so I was expecting others to add that information, or give a rule of thumb similar to what Mosys gives for their memory... but I don't have money to give you, because that information wants to be free and you are holding it hostage.

Anyway, for the difference between 0.127 and 0.16:
http://www.realworldtech.com/iedm-2010/6/
The 0.127um2 cells are tuned for maximum density, but require 1.1V for operation. Trading off density for lower operating voltage (e.g. in SRAMs used with logic), TSMC also provides a 0.155um2 cell that requires ~0.7V.
So even if we had the "test chip" density, it's probably not a useful figure. It also explains why the GPU had a 0.16 cell size.

It would be a gain, it's just not possible. These chips are most likely 28nm, which is pretty much the same scale-wise as Intel's 32nm. The largest server chip Intel made on 32nm was Westmere-EX, which had 10 cores and only 30MB of L3 cache. The L3 array takes up ~40% of the chip, and it's a giant 513mm^2 chip. It would be ridiculous for a console to use up 200mm^2 for a 32MB chunk of cache.
EDIT: I maintain my educated guess of 40mm2 with an overhead of 50%, so 60mm2. It's less than a third of your 200mm2 guess, I think I'm closer.
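
To see why the Westmere-EX comparison and the 40-60mm^2 estimate can both be self-consistent, here is a quick sketch (the 40% L3-area share and the cell size are the figures quoted in this thread; a server L3 also carries tags, ECC and far more periphery than a plain scratchpad would):

```python
# Compare the implied density of Westmere-EX's L3 with a bare 28nm SRAM array.
WEX_DIE_MM2 = 513
WEX_L3_SHARE = 0.40                 # "~40% of the chip", as claimed above
WEX_L3_MB = 30

wex_l3_mm2 = WEX_DIE_MM2 * WEX_L3_SHARE
print(f"Westmere-EX L3: ~{wex_l3_mm2:.0f} mm^2 "
      f"-> ~{wex_l3_mm2 / WEX_L3_MB:.1f} mm^2 per MB")

CELL_UM2 = 0.16                     # Chipworks, TSMC 28nm
raw_mm2_per_mb = 8 * 2**20 * CELL_UM2 / 1e6
print(f"TSMC 28nm cells only: ~{raw_mm2_per_mb:.2f} mm^2 per MB "
      f"(~{raw_mm2_per_mb * 2:.2f} with 100% overhead)")
# The server L3 spends ~5x more area per MB than bare 28nm cells (and ~2.5x
# more than the 100%-overhead guess), so 32 MB needn't cost anywhere near 200 mm^2.
```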
 
According to the Chipworks die shot the big edram block on the Wii U is ~40 mm^2 (or just under). Are you saying that it's not, or that it's not got 32 MB of edram? Or am I misunderstanding completely?

eDRAM vs SRAM, and probably dense low-perf eDRAM on the Nintendo part, if I had to guess.
 