What I still don't understand is this: if even armchair architects can see the major faults/cons of going with ESRAM, how could MS have proceeded with it?
Because the armchair architects are ignorant of most of the considerations and cost projections. They also seem ignorant of their ignorance, and hence vocal and opinionated.
To me the only logical explanation is that they needed the TV stuff, which required 8GB, so badly that they were willing to sacrifice yields, ease of programmability, and graphical power.
ESRAM was decided on before the move to 8GB. SRAM is also pretty defect tolerant (spare rows/columns can be mapped in), so the yield hit is smaller than you might assume.
I know it would have been a much better choice to go with a separate ARM SoC to handle ALL of the TV stuff, including having its own RAM. HDMI passthrough mode with the "xbox turned off" would cost 5-10 watts at most, instead of 70+ watts. For the Kinect stuff, the system could be kicked out of standby to handle voice requests.
A separate SoC would have added cost and board complexity. A separate pool of RAM would not have added to game-accessible BW. The Xbone isn't just doing HDMI pass-through. And you cannot use voice to kick the Kinect into voice recognition mode - that is a contradiction.
Anyway, this way the Xbox could have launched with 4-6GB of GDDR5 main RAM, and the (in that case) useless ESRAM transistors could have been left out, allowing for a GPU competitive with the PS4.
4 GB of RAM would have left the Xbox One with either a 128-bit bus (and hence still requiring ESRAM) or a commitment to 8 x 512MB chips, likely ensuring a high cost per MB of RAM over the platform's lifetime.
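For a rough sense of the bandwidth side of that trade-off, here is a minimal sketch of the peak-bandwidth arithmetic. The signalling rates are my own assumptions (5.5 Gbps GDDR5 as on the PS4, DDR3-2133 as shipped in the Xbox One), not figures from the posts above:

```python
# Peak theoretical bandwidth = (bus width in bits / 8) * per-pin data rate in Gbps.
# All data rates below are assumed, commonly quoted figures.

def peak_bw_gbs(bus_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s for a given memory interface."""
    return bus_bits / 8 * data_rate_gbps

configs = {
    "256-bit GDDR5 @ 5.5 Gbps (PS4-style)":        peak_bw_gbs(256, 5.5),    # ~176 GB/s
    "128-bit GDDR5 @ 5.5 Gbps (hypothetical 4GB)":  peak_bw_gbs(128, 5.5),    # ~88 GB/s
    "256-bit DDR3-2133 (Xbox One as shipped)":      peak_bw_gbs(256, 2.133),  # ~68 GB/s
}

for name, bw in configs.items():
    print(f"{name}: {bw:.0f} GB/s")
```

A 128-bit GDDR5 bus lands at roughly half the PS4's bandwidth, which is why that configuration would still have wanted an on-die buffer.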
There are some synthetic, hypothetical situations in which ESRAM could prove useful, but they are far outweighed and outnumbered by the associated cons. That's my conclusion, at least.
The ESRAM is useful in every Xbox One game. There are even real, in-game situations where it offers performance advantages over a 256-bit GDDR5 configuration.
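For context on that last claim, here's a sketch of the combined-peak-bandwidth argument, using widely reported figures that are my assumptions (roughly 204 GB/s peak read+write for the 32 MB of ESRAM after the 853 MHz upclock, plus ~68 GB/s of DDR3); real sustained rates are lower:

```python
# Hypothetical peak-bandwidth comparison; all figures are assumed, widely reported numbers.
ESRAM_PEAK_GBS = 204.0   # 32 MB ESRAM, simultaneous read+write peak
DDR3_PEAK_GBS  = 68.3    # 256-bit DDR3-2133 main memory
GDDR5_PEAK_GBS = 176.0   # 256-bit GDDR5 @ 5.5 Gbps (PS4-style)

combined = ESRAM_PEAK_GBS + DDR3_PEAK_GBS
print(f"ESRAM + DDR3 combined peak: {combined:.0f} GB/s "
      f"vs 256-bit GDDR5: {GDDR5_PEAK_GBS:.0f} GB/s")
# The catch: only data that fits in the 32 MB (typically render targets) sees the
# ESRAM rate, so whether this beats a flat GDDR5 pool depends on the workload.
```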