> Both sell for much more than what would be budgeted for a console component, so the extra cash absorbs some of the premium.

Well, indeed, Nvidia makes money, and the vendors too. Though I remember how averse people were to the idea of any manufacturer, be it Sony or MSFT, using a bus wider than in prior designs.
> Making an off-chip memory pool would mean Durango would have a 256-bit DDR3 bus, plus something equivalent to the 1024-bit on-die bus the eSRAM uses. The bus could be wide and modestly clocked, or narrow and fast, which Crystalwell seems to be doing. The wide method means the chip is going to have its perimeter dominated by the DDR3 and daughter-die interfaces. The irony would be that, without an on-die eSRAM bloating the chip, more work might be needed to provide the necessary pad space for the daughter die. The narrower and faster interface would work if you have expertise in fast and custom interfaces and want to expend the effort on a custom high-speed link, but that's several ifs that don't seem to fit here.

I will answer that later in this post.
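In the meantime, just to put rough numbers on the wide-versus-narrow tradeoff in that quote, a quick sketch (the 1024-bit / 800 MHz eSRAM figures are my assumption based on the publicly discussed specs):

```python
# Wide-vs-narrow tradeoff for an off-chip pool: hold the eSRAM's
# ballpark bandwidth (assumed 1024-bit bus at 800 MHz, ~102 GB/s)
# constant and vary how many pins carry it.
TARGET_BYTES_PER_S = (1024 / 8) * 0.8e9   # ~102.4e9 bytes/s

for width_bits in (1024, 256, 64):
    per_pin = TARGET_BYTES_PER_S / (width_bits / 8)  # transfer rate needed
    print(f"{width_bits:>4}-bit link -> {per_pin / 1e9:4.1f} GT/s signaling")
# 1024-bit ->  0.8 GT/s: easy signaling, but pads dominate the perimeter
#  256-bit ->  3.2 GT/s
#   64-bit -> 12.8 GT/s: few pads, but needs a custom high-speed interface
```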
> What were they going to shrink it to?

I'm not sure I get what you mean. Alstrong was speaking of the cost overhead of the smart eDRAM in the 360 and why it was not integrated into either Xenos or the SoC (Valhalla, the last 360 revision). I answered that they might actually be happy with it (after all, it is a pretty tiny, most likely high-yield chip). Say they were to do a last revision shrinking Valhalla to 32nm: they could consider a process that would allow the eDRAM to be integrated into the main chip, but I would not be too surprised if they didn't, and instead went with an even tinier main chip plus the already tiny smart eDRAM chip.
> I think there was a limited pool of options they could have drawn from.

I pretty much agree: you have a tiny pool of fast memory (be it on-chip or off-chip), a wider bus, or faster memory, aka GDDR5. I will "elaborate" below.
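For scale, a quick sketch of the peak bandwidth each of those options buys (the clocks are assumptions taken from the publicly discussed specs, e.g. a PS4-style GDDR5 setup):

```python
# Rough peak bandwidth for each option, assuming publicly discussed
# clocks (DDR3-2133, 1024-bit eSRAM at 800 MHz, 5.5 GT/s GDDR5).
def gbs(width_bits, transfers_per_s):
    """Peak bandwidth in GB/s: bytes per transfer times transfer rate."""
    return (width_bits / 8) * transfers_per_s / 1e9

print(f"wider bus, 256-bit DDR3-2133   : {gbs(256, 2.133e9):6.1f} GB/s")  # ~68.3
print(f"tiny fast pool, 1024-bit eSRAM : {gbs(1024, 0.8e9):6.1f} GB/s")   # ~102.4
print(f"faster memory, 256-bit GDDR5   : {gbs(256, 5.5e9):6.1f} GB/s")    # ~176.0
```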
> Copying Crystalwell means having a variant of Intel's high-performance 22nm process for the eDRAM, the resources of Intel for designing the scheme, the expertise to make it somewhat affordable (if too expensive for a console), and the price flexibility to charge enough for a novel product. AMD (anybody not Intel?) has none of these.

Well, I remember that we already had that discussion; you actually made your points pretty well.
I'm not saying that MSFT or AMD had the option, but if they had, I would think they would have chosen it over giving up a lot of die space on the main die. As I said, I remember we already went there, but I have come to think a bit more about it.
Back to the part I did not answer, and to the last part of your post. To put it shortly, I really wonder if only Intel has the know-how; after all, not so long ago an "off-chip cache" (usually an L3) was not that uncommon in server parts.
With regard to the type of connection between the APU and an "off-chip" memory pool, cache or not, I actually think that AMD would have opted for the same approach as Intel. I think so because that is what they did with Xenos (though that link was 3 times slower).
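A quick sanity check on that "3 times slower" claim (the numbers are from my memory of the reported specs, so treat them as assumptions):

```python
# Assumed figures, from memory of the reported specs:
#   Xenos -> daughter-die link (2005): ~32 GB/s
#   Crystalwell OPIO (2013): ~50 GB/s each way, ~100 GB/s aggregate
xenos_link_gbs = 32.0
crystalwell_gbs = 2 * 50.0
print(f"Crystalwell vs Xenos link: {crystalwell_gbs / xenos_link_gbs:.1f}x")  # ~3.1x
```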
The last part of the equation is the size of the cache, and I wonder if this could be the trickiest part (for me). Intel stated that they could have gone with 32MB because their cache hit rate was already above 95% most of the time (or was it 92%?). I think it would be doable for AMD to have a 32MB off-chip L3 manufactured on a 40nm process; the chip should not be too big (the WiiU GPU is made on such a process and is a whopping 150mm^2). The issue is whether AMD could reach such a high hit rate; if not, 32MB would not be "good enough".
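To make that sizing argument concrete, a small sketch (the demand bandwidth is a hypothetical number, purely for illustration):

```python
# Why the hit rate, not the raw size, is the make-or-break number:
# whatever misses the cache lands on the slow main-memory bus.
def dram_traffic_gbs(demand_gbs, hit_rate):
    """GB/s of demand traffic that falls through to main memory."""
    return demand_gbs * (1.0 - hit_rate)

demand = 100.0  # hypothetical GB/s of demand traffic
for hit_rate in (0.95, 0.90, 0.80):
    print(f"hit rate {hit_rate:.0%} -> "
          f"{dram_traffic_gbs(demand, hit_rate):5.1f} GB/s to DRAM")
# 95% -> 5 GB/s, 80% -> 20 GB/s: losing 15 points of hit rate quadruples
# the main-memory traffic, so 32MB is only "good enough" if the hit
# rate actually holds up.
```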
End of my rant, but I really wonder whether AMD actually considered the option (be it for MSFT or for themselves).
Off-chip L3 caches are nothing new (though they used to be way tinier), and AMD designed Xenos plus its smart eDRAM around a pretty fast and narrow link (for its time), so it could be doable (though I agree it would prove difficult to match the size of Intel's implementation).
Actually, I thought about it some more and came up with another explanation for why AMD might never have considered such an approach. Not that long ago AMD owned its foundry, and then it was bound to GF (GlobalFoundries); now they have more options. When you think about it, a Crystalwell type of chip would cost them money, but they could sell it at a profit, instead of having board partners (sorry, I don't remember the exact term... Gigabyte, Sapphire, MSI, etc.) buy only the CPU, APU or GPU from them while getting expensive memory from other vendors (pretty much the same situation as Intel, except that for Intel, Crystalwell doesn't cost much, hence higher margins).
I could see how AMD, in troubled times and with pretty constant changes in top management, might not have considered all the options it has now that it is no longer bound to a single foundry.