The pros and cons of eDRAM/ESRAM in next-gen

According to both AMD and Nvidia, future GPUs will be using HBM for their main memory: 512GB/s with 4 stacks, and the memory is sampling right now. By the time we can put 256MB on die, HBM will have been on its second gen for a while (1TB/s and 64GB total RAM in 4 stacks in 2016 or 2017). How does 256MB on die make any sense for next gen?

If every vendor selects HBM as their next main RAM, a large internal memory pool is no longer an intelligent design. The GPU/APU technologies will have reached a point of evolution where stacked memory has naturally solved the problems that the XB1 ESRAM was trying to solve: bandwidth, pad area, cost, and power.
 
I thought it was because they decided early on that their setup required 8GB of RAM, and at that time it seemed improbable that this could be achieved with anything other than DDR3. This was not unrealistic, given that Sony basically couldn't confirm for certain that 8GB of GDDR5 was feasible until early 2013.

So given those parameters, improving on DDR3's bandwidth had to come from something sitting alongside the DDR3, or they'd have had to split memory pools, which is not a popular choice these days (unless on PC perhaps ;) ).

It's a nice story, but from bkilian's info the change from 4GB to 8GB came late. Late 2011.

It seems to me like Microsoft is just hung up on ESRAM/EDRAM. They start all their designs from that point and don't consider anything else. This is more or less what bkilian said: the design always featured ESRAM, before any other major decisions, such as the amount of RAM, had been made.

And they still would have gotten away with it (scooby doo voice) if Sony hadn't made the late bold move to 8GB GDDR5. Then we'd be comparing 4GB to 8GB and saying "MS picked the ESRAM so they could have 8GB RAM".

The "lack" of compute is to some extent a separate issue. They probably could have pretty easily glommed on 2-6 more CU's (They considered enabling the redundant 2 anyway, which would have put them at 14 with no change to die size). They just didn't for reasons of being cheap. I suppose if they also had wanted 32 ROPs, things might start getting unwieldy, but I think more CU's could have been pretty easy.
 
For the most part, they're all related to one con, and that is the fact that it's not large enough.

Whenever you discuss ES/DRAM, you always get a lot of people wanting more. Why not 64, 128, 256? Then all the problems are solved!

Of course a pool that big would just be too large, which is the problem with it in the first place.

32MB is a crazy small amount of memory in this day and age, though.



1024-bit wide bus, read/write in the same cycle, max throughput of 2048 bits per clock cycle.
Can perform latency-sensitive tasks.
Built into the SoC, so cooling can be kept central to the rest of the console; no need to passively or actively cool external RAM.
Guaranteed bandwidth, with 7 of every 8 read/write cycles usable (1 bubble cycle), versus CPU/GPU bandwidth contention and an unknown amount of RAM available due to the OS.
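
As a rough sanity check on those numbers, here's a back-of-the-envelope sketch (my own assumptions: the 853MHz clock and the bubble applied to one direction only), which lands close to the ~109GB/s and ~204GB/s figures usually quoted for the ESRAM:

```python
# Back-of-the-envelope ESRAM bandwidth from the figures above.
# Assumptions (mine, not from the thread): 1024-bit bus each way, 853 MHz
# clock, and the "1 bubble in 8" penalty applied to the second direction.

BUS_WIDTH_BITS = 1024
CLOCK_MHZ = 853

one_way = BUS_WIDTH_BITS / 8 * CLOCK_MHZ / 1000  # GB/s, read-only or write-only (~109)
peak = one_way + one_way * 7 / 8                 # read + write with bubbles (~205)
raw_ceiling = one_way * 2                        # no bubbles at all (~218)

print(f"one-way:     {one_way:.0f} GB/s")
print(f"peak (7/8):  {peak:.0f} GB/s")
print(f"raw ceiling: {raw_ceiling:.0f} GB/s")
```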


We don't really know about latency; the path could go through enough memory crossbars that the advantage may not amount to much. Even if the ESRAM itself is physically low latency, going through the memory pipe to the GPU could nerf that. 3dilletante has written about that, much of which is above my comprehension level :p.

The power advantage is likely real and little mentioned. However, in the end you end up with the X1 at 120 watts and the PS4 at 140 watts. Even if that entire difference were down to the ESRAM (it's surely not, as GDDR5 and 18 CUs both draw significantly more than DDR3 and 12 CUs, granted that X1 does have higher clocks), it's still not even close to worth significantly lower performance.

The lack of CPU contention for the ESRAM is probably a real advantage, and nice. It may even be why X1 games have often shown better AF, as that's alleged to be a bandwidth issue.

Wouldn't another advantage be that in theory, X1 has 272 GB/s of peak bandwidth, vs less for a singular pool of GDDR5? Not much has been talked about this and I'm not sure how real world viable it would be to burst from both DDR3 and ESRAM at once, but certainly on paper you can achieve higher bandwidth that way.
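
For what it's worth, that paper figure appears to be just the two peaks added together. The sketch below assumes DDR3-2133 on a 256-bit bus plus the 204GB/s ESRAM figure, and it only holds if both buses are saturated at the same time:

```python
# Where a "272 GB/s" paper figure can come from (illustrative only).
DDR3_BUS_BITS = 256
DDR3_RATE_MTPS = 2133                     # DDR3-2133, million transfers per second
ESRAM_PEAK_GBPS = 204                     # commonly quoted ESRAM peak

ddr3_peak = DDR3_BUS_BITS / 8 * DDR3_RATE_MTPS / 1000    # ~68 GB/s
combined = ddr3_peak + ESRAM_PEAK_GBPS                   # ~272 GB/s, both buses saturated

print(f"DDR3 peak: {ddr3_peak:.0f} GB/s, combined on paper: {combined:.0f} GB/s")
```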
 
Embedded memory isn't a con. It was a cost-effective solution to improve graphics performance due to MS's choice of unified memory, just like the 360.

The discussion of Pro/Con revolves around what the competition is using and IMO the entire discussion is moot.


And this point is exactly what causes the discussion to end prematurely.

All I'm saying is, if MS came to B3D and said, "Listen guys, we're going to commission B3D 5 million dollars to make us an all-new engine that fully exploits the architecture," what aspects would you exploit? I'm not being stubborn by not acknowledging the obvious, but the discussion keeps walking down the same path. I'd be interested in reading about what isn't obvious.
 
iroboto said:
All I'm saying is, if MS came to B3D and said, "Listen guys, we're going to commission B3D 5 million dollars to make us an all-new engine that fully exploits the architecture," what aspects would you exploit? I'm not being stubborn by not acknowledging the obvious, but the discussion keeps walking down the same path. I'd be interested in reading about what isn't obvious.

Hindsight is always 20/20.
 
Hindsight is always 20/20.


Agreed. Without all the hard work you guys put in, we never would have gotten to this stage. But we're plateauing at the business implications of why the ESRAM + DDR is in place and no longer discussing the technical merits of the memory setup. We need to move on to how developers will engineer with this setup, because discovery is learning and learning is fun!
 
Okay...wasn't this already covered a while ago?

[attached image: ESRAM slide]
 
We don't really know about latency; the path could go through enough memory crossbars that the advantage may not amount to much. Even if the ESRAM itself is physically low latency, going through the memory pipe to the GPU could nerf that. 3dilletante has written about that, much of which is above my comprehension level :p.
I'll be honest here, I just reiterated what was written; I actually need a primer on what latency-sensitive operations would be. Whenever I think latency, I'm only thinking about how closely you can cut the clock cycles on one operation before the next operation can continue.

The power advantage is likely real and little mentioned. However, in the end you end up with the X1 at 120 watts and the PS4 at 140 watts. Even if that entire difference were down to the ESRAM (it's surely not, as GDDR5 and 18 CUs both draw significantly more than DDR3 and 12 CUs, granted that X1 does have higher clocks), it's still not even close to worth significantly lower performance.
One of the things I wanted to learn more about was VRAM, as it's actively cooled on my Kepler, but I've noticed that on some cards I've seen it's not cooled. I've never known what happens if VRAM overheats, or what would cause it other than upping the voltage, so any discussion points here would be great.

The lack of CPU contention for the ESRAM is probably a real advantage, and nice. It may even be why X1 games have often shown better AF, as that's alleged to be a bandwidth issue.
This might be worth a spin off for PS4 based thread, but how often would the CPU be reading/writing to RAM while the render code occurs? Is this actually an issue?
Wouldn't another advantage be that in theory, X1 has 272 GB/s of peak bandwidth, vs less for a singular pool of GDDR5? Not much has been talked about this and I'm not sure how real world viable it would be to burst from both DDR3 and ESRAM at once, but certainly on paper you can achieve higher bandwidth that way.

I'm not sure if it's an advantage or just part of the design. If the CPU is constantly moving resources from HDD into DDR3, it's taking up the DDR3 side of the bandwidth. The DMAs are assisting in moving the information from the DDR3 into the ESRAM for processing. You are peaking, but you aren't necessarily doing more work, unless the CPU is doing work that would be done on the GPU and working with it directly on the DDR3 in isolation from the work being done on the GPU/sram combo.
^^^ is this even worthwhile?
 
Okay...wasn't this already covered a while ago?

It was, I've referenced the slide a couple of times myself. It's very high level, and it says we should expect better utilization of the ESRAM in the 3rd+ wave of titles.

Would be interesting to know how many cycles the DMEs are saving the GPU from having to fetch information itself.
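
No idea about actual cycle counts, but here's a toy timing model (every number in it is invented) of the usual argument: if a move engine stages the next chunk into ESRAM while the GPU is busy with the current one, the copies mostly hide behind GPU work instead of adding to the frame time:

```python
# Toy model: GPU fetching each chunk itself (serial) vs. a move engine
# staging chunks into ESRAM while the GPU works on the previous one
# (overlapped). All figures are invented for illustration.

CHUNK_MB = 32                 # e.g. refilling the full 32 MB working set
DDR3_GBPS = 68                # DDR3 peak bandwidth
GPU_MS_PER_CHUNK = 2.0        # GPU work per chunk (invented)
N_CHUNKS = 16

copy_ms = CHUNK_MB / 1024 / DDR3_GBPS * 1000   # time to move one chunk over DDR3

serial = N_CHUNKS * (copy_ms + GPU_MS_PER_CHUNK)                  # fetch, then work, repeat
overlapped = copy_ms + N_CHUNKS * max(GPU_MS_PER_CHUNK, copy_ms)  # copies hidden behind work

print(f"copy per chunk: {copy_ms:.3f} ms")
print(f"serial:     {serial:.2f} ms per frame")
print(f"overlapped: {overlapped:.2f} ms per frame")
```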
 
This might be worth a spin off for PS4 based thread, but how often would the CPU be reading/writing to RAM while the render code occurs? Is this actually an issue?

We know there is an issue on PS4 where the bandwidth drops when the CPU and GPU are using memory at the same time. I suspect the same thing happens for X1 when using DDR3. It seems nice that the ESRAM does not have to worry about the CPU.

I'm not sure if it's an advantage or just part of the design. If the CPU is constantly moving resources from HDD into DDR3, it's taking up the DDR3 side of the bandwidth. The DMAs are assisting in moving the information from the DDR3 into the ESRAM for processing. You are peaking, but you aren't necessarily doing more work, unless the CPU is doing work that would be done on the GPU and working with it directly on the DDR3 in isolation from the work being done on the GPU/sram combo.
^^^ is this even worthwhile?

Well, there is only 32MB to fill up. It's just not much. What I'm wondering is a scenario where you put 32MB (or a partial amount of that) of data in ESRAM, leave it there and use it over and over for a certain time, and also pull out of DDR3 simultaneously. Can you achieve a lot of bandwidth? And how relevant is this usage model? And of course there would be many "mix n match" scenarios, where one could presumably see a benefit (some transferring in/out of ESRAM, but also using some repeat data out of ESRAM).

Of course it's all wayyyyy over my head to talk about that kind of programming.
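
As a rough illustration of that scenario (not from anyone in the thread; every workload number below is invented), whichever pool ends up carrying the larger share of a frame's traffic sets the pace, so the combined figure only approaches the paper peak if most of the repeat traffic really does stay inside the 32MB:

```python
# Toy model of "keep hot data resident in ESRAM, stream the rest from DDR3".
# All workload numbers are invented; only the two peak figures are the
# commonly quoted ones.

ESRAM_GBPS = 204
DDR3_GBPS = 68

frame_traffic_gb = 3.0        # total read+write traffic per frame (invented)
resident_fraction = 0.6       # share of traffic hitting ESRAM-resident data (invented)

esram_gb = frame_traffic_gb * resident_fraction
ddr3_gb = frame_traffic_gb - esram_gb

# If both pools are used in parallel, the slower-to-finish one sets the frame's memory time.
frame_s = max(esram_gb / ESRAM_GBPS, ddr3_gb / DDR3_GBPS)
effective_gbps = frame_traffic_gb / frame_s

print(f"memory-bound frame time: {frame_s * 1000:.2f} ms")
print(f"effective bandwidth:     {effective_gbps:.0f} GB/s")
```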
 
Console redesigns lower the cost by going to a smaller process and simplifying the design. Changing the design in a way that adds complexity is not how it works.

Why should changing the memory system in the SOC during a redesign be a significant complexity factor? The redesign costs should be negligible compared to the component costs. I'm quite sure they'll switch to DDR4/128 when there's a cost benefit. The same could have happened with GDDR5 to HBM.
 
Why should changing the memory system in the SOC during a redesign be a significant complexity factor?

Because they are on the die?

The redesign costs should be negligible compared to the component costs.

R&D is more expensive than that.

I'm quite sure they'll switch to DDR4/128 when there's a cost benefit. The same could have happened with GDDR5 to HBM.

I can't respond to sentences formed in such a fashion. Basically I'm saying that your whens, ifs, and could-haves are highly improbable in the "real world".
 
Except one is far more likely to happen than the other. R&D that leads to vastly lower product costs will be invested in. It's silly to think otherwise.
 
Cost benefits would be weighed against the consistency of the platform.
I think we'd need a better idea on how the new standards behave on the same workloads and access patterns before fixating on cost savings.

There are penalties based on the physical properties of the DRAM and interface that don't scale with speed, and doing something like halving the device count leaves fewer chips and channels to handle the same number of fixed-length penalties. There are some undesirable drop-offs in efficiency that come with that, and those can have a larger impact than fiddling with a few latency parameters while keeping the interface the same.
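
To illustrate the non-scaling-penalty point with a toy model (every number below is invented, and it's not a characterisation of any particular DRAM standard): doubling the transfer rate shrinks the data bursts but not the fixed-time penalties, so the usable fraction of peak drops.

```python
# Toy model: fraction of peak bandwidth left over once fixed-time penalties
# (activates, precharges, turnarounds) are accounted for. Numbers invented.

def usable_fraction(rate_gtps, burst_bytes=64, channel_bytes=4,
                    penalty_ns=10.0, penalties_per_kb=2.0):
    """Share of peak bandwidth usable over a 1 KB stretch of traffic."""
    burst_ns = burst_bytes / channel_bytes / rate_gtps   # time to transfer one burst
    data_ns = (1024 / burst_bytes) * burst_ns            # pure transfer time per KB
    overhead_ns = penalties_per_kb * penalty_ns          # fixed penalties per KB
    return data_ns / (data_ns + overhead_ns)

for rate in (2.133, 4.266):   # e.g. a part at DDR3-2133 rates vs. one twice as fast
    print(f"{rate:.3f} GT/s: {usable_fraction(rate):.0%} of peak usable")
```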

If there's more confidence that a new standard can meet or exceed the old in all those corner cases, it would be safer, but even then hobbling faster hardware so that it can't outrace the old revisions is something that has been done before.

Even if it someday saves money, care should be taken that it doesn't stumble over already existing software implementations.
 
We know there is an issue on PS4 where the bandwidth drops when the CPU and GPU are using memory at the same time. I suspect the same thing happens for X1 when using DDR3. It seems nice that the ESRAM does not have to worry about the CPU.



Well, there is only 32MB to fill up. It's just not much. What I'm wondering is a scenario where you put 32MB (or a partial amount of that) of data in ESRAM, leave it there and use it over and over for a certain time, and also pull out of DDR3 simultaneously. Can you achieve a lot of bandwidth?


I guess the real question we should be asking is whether having an overabundance of bandwidth has a positive effect on the system. If it does, then bandwidth would be a bottleneck for the X1. The reason I brought up active/central cooling for embedded RAM vs. the passively cooled setup for PS4 is that, at the end of the day, one is more likely to withstand being pushed with higher clocks and voltages than the other.

If MS was desperate enough, would they clock the GPU even higher? Or, like the Kinect reserve (which didn't need to be there), perhaps the clocks were targeted to be higher (looking at the cooling solution) but the software just wasn't stable enough yet?

Every ~50MHz is 8-9GB/s more bandwidth for esram when calculating with the 7/8 bubble. It would increase the performance of the whole pipeline as well. 1200 MHz would be 270GB/s with esram alone.
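
For what it's worth, the two figures in that post seem to come from different baselines. The sketch below (my assumptions: the 7/8 figure applied to everything for the theoretical peak, and the roughly 150GB/s real-world figure that has been cited for shipping titles) roughly reproduces both the 8-9GB/s per 50MHz and the ~270GB/s at 1200MHz:

```python
# ESRAM bandwidth scaling with clock, two ways (both approximations):
# the theoretical peak with the 7/8 bubble applied to both directions
# (as the post above does), and a scaled real-world figure.

BYTES_PER_CLOCK_PEAK = 2048 / 8 * 7 / 8   # 1024-bit r+w bus, 1 bubble in 8 -> 224 B/clk
BASE_CLOCK_MHZ = 853
REAL_WORLD_GBPS_AT_BASE = 150             # approximate figure cited for real titles

def peak_gbps(mhz):
    return BYTES_PER_CLOCK_PEAK * mhz / 1000

def real_world_gbps(mhz):
    return REAL_WORLD_GBPS_AT_BASE * mhz / BASE_CLOCK_MHZ

for mhz in (853, 903, 1200):
    print(f"{mhz} MHz: theoretical ~{peak_gbps(mhz):.0f} GB/s, "
          f"scaled real-world ~{real_world_gbps(mhz):.0f} GB/s")
```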
 
On second thought, probably scratch the device-count difference; my mistake for posting tired. If two chips hang off the same channel, they would receive the same commands. The challenge to scaling from a doubled transfer rate combined with non-scaling penalties would more likely remain.
 
If MS was desperate enough, would they clock the GPU even higher? Or, like the Kinect reserve (which didn't need to be there), perhaps the clocks were targeted to be higher (looking at the cooling solution) but the software just wasn't stable enough yet?

Viable stable operation of the chip at higher frequencies aside, going faster puts stress on the power and cooling systems. Microsoft engineered the Xbox One to last for 10 years because of the whole TV thing, but if they have abandoned this, then it may be an acceptable compromise to run faster knowing the power and cooling system will clock out earlier. However, I can't see them eking out much more clock speed; I'd be surprised at anything above 1.8GHz, as they've bumped it once already. Of course, if they are willing to live with a certain failure rate, replacing those boxes with more capable ones... :eek:
 
Because they are on the die?

R&D is more expensive than that.

I can't respond to sentences formed in such a fashion. Basically I'm saying that your whens, ifs, and could-haves are highly improbable in the "real world".

If a redesign helps them save cost or adds important features, it will happen. Otherwise we would still be using the same 90nm Xbox 360 from 2005 without HDMI.
 
If a redesign helps them save cost or adds important features, it will happen. Otherwise we would still be using the same 90nm Xbox 360 from 2005 without HDMI.

Right, so did the 360 change its memory system dramatically by adapting to new tech...?
I think you are changing the subject and drawing a bad analogy with adding the HDMI.
 
Agreed. Without all the hard work you guys put in, we never would have gotten to this stage. But we're plateauing at the business implications of why the ESRAM + DDR is in place and no longer discussing the technical merits of the memory setup. We need to move on to how developers will engineer with this setup, because discovery is learning and learning is fun!
A noble sentiment, but we literally can't do that without more low-level info. Of greatest importance are the mysterious latencies. For everything else, we basically have a couple of BW figures and caveats along with the limited size, so software design considerations are all about fitting the workloads into the limited pool rather than maximising its advantages. There are no known advantages to XB1's ESRAM over alternative memory setups with the same BW. If XB1 had 200 GB/s peak RAM (or whatever one wants to peg the BW figure at), it'd be functionally identical AFAWK. Unless we get a leak showing the latencies are truly minimal, that won't come into it.

I suppose there's a case that ESRAM provides a higher peak BW than otherwise possible, as long as engines are designed to maximise it. One could try discussing engine designs that enable peak BW with carefully managed memory ops. Without an actual engine to look at and analyse, though, I doubt that'd be at all productive.

As far as I'm concerned, this thread is spent. ES/DRAM provides a BW advantage at a given cost, with the compromise of software complexity and difficulty using that BW. Given the way RAM is progressing, it looks like ES/DRAM are dead ends for future hardware, meaning the subject ends with this generation, where it's not really doing much. Plotting the importance of ES/DRAM over time, it has steadily declined from great back in the PS2 and GC days, where it provided high BW otherwise unobtainable, to irrelevant now and going forwards.
 