And even with their considerable process advantages, the edram pool is only for the most expensive SKUs Intel ships. I would hazard a guess that just the MCM + edram accounts for a significant proportion of the entire die cost for the XB1.
And even with their 1.6 GHz edram, Intel are topping out at less BW from their off-die memory than MS gets from their on-die esram.
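Rough numbers, using the publicly quoted peaks (so take these as approximate rather than measured):

```python
# Rough peak comparison using publicly quoted figures (approximate).
crystalwell_bw  = 50 * 2       # Haswell edram link: ~50 GB/s each way, ~100 GB/s aggregate
esram_bw_oneway = 0.853 * 128  # XB1 esram: 853 MHz x 128 B/cycle ~= 109 GB/s in one direction
esram_bw_peak   = 204          # MS's quoted peak with concurrent read + write
print(crystalwell_bw, round(esram_bw_oneway), esram_bw_peak)  # 100 109 204
```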
I think the pie-in-the-sky "1000 GB/s" PS4 slide has been pretty successful at convincing people that MS's esram sucks. It's easier to make a PowerPoint slide than to engineer a processor.
They like to talk about energy consumption/efficiency being a huge factor in all areas of hardware design, including esram/edram.
If energy consumption was a factor, it was a very, very wise decision to make the power savings of esram over the superior bandwidth and superior size of edram a significant factor in the choice, and in the overall development of the hardware. I don't think gamers anywhere who spend $400-500 on a console, $60 a year for online, and a hundred or more on games a year would tolerate a $2-a-year jump in their energy bill from a console with 10% higher power consumption during gameplay.
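For what it's worth, that $2-a-year figure does fall out of plausible assumptions; the wattage, play time and electricity price below are my guesses, not measured numbers:

```python
# All inputs are illustrative assumptions, not measurements.
console_watts  = 120                    # assumed gameplay power draw
extra_watts    = console_watts * 0.10   # the hypothetical 10% penalty
hours_per_year = 3 * 365                # ~3 hours of gameplay a day
price_per_kwh  = 0.12                   # USD, rough residential rate

extra_kwh  = extra_watts * hours_per_year / 1000
extra_cost = extra_kwh * price_per_kwh
print(round(extra_kwh, 1), round(extra_cost, 2))  # ~13.1 kWh, ~$1.58 a year
```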
The electricity bill is only one factor, as AMD's and Intel's continued focus on processor power draw shows. MS chose a power envelope and engineered a fast solution to fit within it, to be manufactured within their budget constraints.
For better or worse, they wanted a silent console, and they could only spend so much on cooling. The heatsink in the Xbox One is already more expensive than both of the heatsinks in the original 360 combined. I'd wager it's more expensive than the one in the more power-hungry PS4, too.
So can I assert the following:
In general this setup has higher theoretical bandwidth, is harder to program for, harder to master and slower to maximize, but with the pro of a higher ceiling than a simpler architecture. If you were to pinpoint a true weakness, it would be not having monumentally more bandwidth than the competition; instead esram has about 25% more bandwidth (over a simpler competing external architecture). They likely could have gone with higher bandwidth (edram) and more CUs (but it would have been less than 32 MB of working space), which may have been much harder to program for.
How much more performance would MS be able to extract from "monumentally more" esram BW? From within the esram, they already have much more than +25% peak BW per CU, as they have fewer CUs. And there will be many situations where even this doesn't add much.
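A quick per-CU comparison, using the commonly quoted peak figures (so again approximate):

```python
# Peak bandwidth per CU, using commonly quoted figures (approximate).
xb1_esram_peak = 204   # GB/s, concurrent read/write peak
xb1_cus        = 12
ps4_gddr5_peak = 176   # GB/s
ps4_cus        = 18

print(round(xb1_esram_peak / xb1_cus, 1))   # ~17.0 GB/s per CU
print(round(ps4_gddr5_peak / ps4_cus, 1))   # ~9.8 GB/s per CU -> roughly 74% more, not 25%
```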
More bandwidth from an off-die edram pool would have required a very wide off-chip path - much wider than the 360 used (and MS were specifically trying to get away from that design) and wider than even Intel use on their 22 nm Iris-enabled uber processors.
And if you're talking on-chip, then who's going to make that for them...? Intel? Nope. Renesas on their 45 nm node? Nope. IBM on their 32 nm node (at probably a larger die size and goodness knows what engineering cost)?
I would assert that edram was not a realistic option within their constraints: off-die edram would have probably netted them less BW at possibly higher power draw, and on-die edram would have been difficult to source, given that possibly no-one could have manufactured it for them.
The esram's one real weakness is apparently its small size. Even another 16 MB (~40 mm^2) would significantly alter the proposition of using large g-buffers or texturing from it. And at an extra ~$10 that still seems more attractive than current on- or off-die edram.
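The ~$10 looks about right to me, although the cost per mm^2 below is purely my own assumption:

```python
# Die-cost guesstimate; the cost-per-mm^2 figure is an assumption, not a known number.
extra_esram_mb = 16
mm2_per_mb     = 40 / 16   # ~2.5 mm^2 per MB, from the ~40 mm^2 estimate above
cost_per_mm2   = 0.22      # assumed yielded 28 nm silicon cost, USD per mm^2
print(round(extra_esram_mb * mm2_per_mb * cost_per_mm2, 2))  # ~8.8, i.e. in the ~$10 ballpark
```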
DMEs are there to help saturate the bus, much like they would over PCIe.
Using DMEs to saturate the esram would likely also saturate the main memory bus and kill CPU performance through contention.
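To put rough numbers on the contention point (the DME figure is the commonly quoted combined peak, the rest is simple arithmetic):

```python
# Rough contention arithmetic; figures are quoted peaks, not sustained rates.
ddr3_peak = 68.3   # GB/s, XB1 main memory (256-bit DDR3-2133)
dme_copy  = 25.6   # GB/s, quoted combined peak of the four move engines

print(round(dme_copy / ddr3_peak * 100), "% of DDR3 peak eaten by copies alone")  # ~37%
print(round(ddr3_peak - dme_copy, 1), "GB/s left before the CPU and GPU even start")
```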
DMEs are there to allow "processor-free" transfer of data between memory pools, and copies within the same pool (most likely main RAM).