The pros and cons of eDRAM/ESRAM in next-gen

Had eDRAM been ready at the 28nm node at GloFo or TSMC during the Xbox One development timeline, and given that you can fit approx. 3x as much eDRAM in the same space as 8T SRAM, do you think MS would have gone with a full 90+MB (same real-estate allocation) and maintained the same chip size, or allocated more towards CUs? Or shrunk the die and saved on chip costs? Or something in the middle?

Maybe 2-4 more CUs, and a 50-60MB eDRAM cache, giving developers a much heftier scratchpad for buffers and other high-bandwidth assets.
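
For a sense of scale, here is a minimal back-of-envelope sketch in Python. The 32MB baseline is the Xbox One's actual ESRAM size; the 3x density ratio and the 56MB middle-ground figure are just the assumptions from this post, not measured values.

```python
# Back-of-envelope for the "same real estate" scenario: how much eDRAM fits
# in the ESRAM's footprint, assuming the ~3x density figure claimed above.
esram_mb = 32            # Xbox One's ESRAM capacity
density_ratio = 3.0      # assumed eDRAM : 8T-SRAM capacity per unit area

edram_same_area_mb = esram_mb * density_ratio
print(f"Same footprint as 32MB of ESRAM -> ~{edram_same_area_mb:.0f}MB of eDRAM")

# Middle-ground option: a 50-60MB cache frees roughly 40% of that footprint
# for extra CUs or a smaller die.
edram_target_mb = 56
print(f"A {edram_target_mb}MB eDRAM pool would use ~"
      f"{edram_target_mb / edram_same_area_mb:.0%} of the original footprint")
```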

What ifs are so much fun.

The decision was supposedly down to ease of fabbing; ESRAM gives you more fabrication options, supposedly. Also, I'm not sure what the situation is with how easy on-die eDRAM is.

For 2 more CUs they could have just enabled the two redundant ones, which, word is, they strongly considered but rejected, perhaps in favor of an upclock, though I see no reason for them to be mutually exclusive other than perhaps penny-pinching.
 
Regardless of eDRAM vs. ESRAM or 14 vs. 16 CUs, it was their choice to do what they did. I mean, it would have only been nominally bigger anyway.

Here is 64MB ESRAM @ 408GB/s, 32 ROPs, 16 CUs on the same process for comparison. It's only slightly larger, so you get a few fewer chips per wafer, and at the end of the day it would have cost them a couple of bucks more per chip. Hindsight being what it is, they probably could have afforded it and should have done it. It could have been done even more efficiently than what I've shown here by spreading the memory controllers out along the edge to keep the redundant ESRAM in place to improve yields. Just trying to provide a sense for it.

[Attached image: XB1SOC-2.jpg]
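
To put rough numbers on the "few fewer chips per wafer" point, a hedged sketch: the ~363 mm² figure is the widely reported Xbox One SoC die size, while the $5,000 wafer cost and the ~410 mm² size of the enlarged variant are purely illustrative assumptions.

```python
import math

# Rough dies-per-wafer / cost-per-die comparison for the "slightly larger die"
# argument. Wafer cost and the larger die's area are assumptions; the ~363 mm^2
# baseline is the commonly reported Xbox One SoC figure.
def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    """Standard gross-die estimate: wafer area / die area minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    return math.floor(math.pi * radius ** 2 / die_area_mm2
                      - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

wafer_cost_usd = 5000.0            # assumed 28nm wafer cost
for label, area in [("shipped die (~363 mm^2)", 363.0),
                    ("64MB/16CU variant (assumed ~410 mm^2)", 410.0)]:
    dies = gross_dies_per_wafer(area)
    print(f"{label}: {dies} gross dies, ~${wafer_cost_usd / dies:.2f} per die (before yield)")
```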
 
They'd probably just shrink it. The same die area on different nodes doesn't cost the same, and adding another 50MB of ESRAM isn't something you can do without a redesign plus a lot more development time.
It's the same node; the APUs are both on 28nm. I'm not talking about increasing the size of the ESRAM scratchpad, I'm talking about switching to eDRAM in the hypothetical situation that it were a mature process at the foundries. eDRAM has roughly 3x smaller real estate and transistor count than 8T SRAM. BTW, that's not a mistake, and it's not an 8:1 difference.

The decision was supposedly down to ease of fabbing; ESRAM gives you more fabrication options, supposedly. Also, I'm not sure what the situation is with how easy on-die eDRAM is.

For 2 more CUs they could have just enabled the two redundant ones, which, word is, they strongly considered but rejected, perhaps in favor of an upclock, though I see no reason for them to be mutually exclusive other than perhaps penny-pinching.
According to the Digital Foundry interview, the distinguished engineer said it wasn't an available option at the foundries at that node, and it sounds like they might have preferred eDRAM. There are also articles backing up that foundries were transitioning away from maturing their eDRAM processes.
Digital Foundry: And there wasn't really any actual guarantee of availability of four-gigabit GDDR5 modules in time for launch. That's the gamble that Sony made which seems to have paid off. Even up until very recently, the PS4 SDK docs still refer to 4GB of RAM. I guess Intel's Haswell with eDRAM is the closest equivalent to what you're doing. Why go for ESRAM rather than eDRAM? You had a lot of success with this on Xbox 360.

Nick Baker: It's just a matter of who has the technology available to do eDRAM on a single die.

Digital Foundry: So you didn't want to go for a daughter die as you did with Xbox 360?

Nick Baker: No, we wanted a single processor, like I said. If there'd been a different time frame or technology options we could maybe have had a different technology there but for the product in the timeframe, ESRAM was the best choice.

Regardless of eDRAM vs. ESRAM or 14 vs. 16 CUs, it was their choice to do what they did. I mean, it would have only been nominally bigger anyway.

Here is 64MB ESRAM @ 408GB/s, 32 ROPs, 16 CUs on the same process for comparison. It's only slightly larger, so you get a few fewer chips per wafer, and at the end of the day it would have cost them a couple of bucks more per chip. Hindsight being what it is, they probably could have afforded it and should have done it. It could have been done even more efficiently than what I've shown here by spreading the memory controllers out along the edge to keep the redundant ESRAM in place to improve yields. Just trying to provide a sense for it.

[Attached image: XB1SOC-2.jpg]

Thanks.
 
To me the disappointing part was that there was no developer around this time pushing them to go the extra mile. Around the 360 launch, Epic screamed for more RAM and demoed what Gears would look like with 256MB vs. 512MB. Sony was forced to take stock of itself after that kind of meeting and went all-in on 8GB of GDDR5.

MS was simply too comfortable with their position as the best dev environment and shot for good enough rather than the best they could possibly muster. A slight alteration to their silicon budget would have made all the difference in the world, and they wouldn't be in the situation they are in now. If PS4 games didn't look and run better, and devs had more wiggle room for render targets in ESRAM, then Kinect, multitasking, etc. would look much better, everything else being equal.

64MB ESRAM @ 408GB/s, 32 ROPs and 16 CUs. Think about it.
 
It wouldn't change anything... from the various PDFs about the future of Xbox, it is quite obvious that they knew they'd have the weaker console (based on the projected prices of both consoles) and they didn't care.
 
Things would be different if they were cheaper as per that preso. Trust me. If they knew then what they know now, the Xbox One would not be the same as it is.
 
To me the disappointing part was that there was no developer around this time pushing them to go the extra mile.

A slight alteration to their silicon budget would have made all the difference in the world, and they wouldn't be in the situation they are in now. If PS4 games didn't look and run better, and devs had more wiggle room for render targets in ESRAM, then Kinect, multitasking, etc. would look much better, everything else being equal.


Changing RAM capacity goes as far as changing the chips put into an otherwise identical final console.
Mess with the silicon and you mess with the end product of a process that is far longer and far more expensive.

RAM capacity is a number that can be readily given, and it actually has direct bearing on things developers work with and can measure.
Arbitrary silicon parameters on a chip that was designed years before devs can see it, with a billion unknown variables, provide nothing concrete or rational for developers to push on.

64MB ESRAM @ 408GB/s, 32 ROPs and 16 CUs. Think about it.

How is that simple?
 
Yup, that's major surgery. They were already at 5 billion transistors; that would have pushed them well over 7. Does even Titan/GK110 have 7?
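
Rough math behind that estimate: the 6T-per-bit cell is from the earlier posts, while the 1.3x array overhead and the transistor budget for two more CUs are assumptions plugged in purely for illustration.

```python
# Rough math behind the transistor-count worry: a 6T SRAM cell is six
# transistors per bit, so doubling the ESRAM is the dominant cost. The array
# overhead factor and the budget for two extra CUs are assumptions.
bits_per_mb = 1024 * 1024 * 8
extra_esram_mb = 32                          # going from 32MB to 64MB
array_overhead = 1.3                         # assumed decoders/sense amps/redundancy

extra_sram = extra_esram_mb * bits_per_mb * 6 * array_overhead
baseline = 5.0e9                             # Microsoft's quoted ~5 billion
extra_cus = 0.2e9                            # assumed cost of 2 more CUs + wider back end

total = baseline + extra_sram + extra_cus
print(f"Extra 32MB of 6T ESRAM: ~{extra_sram / 1e9:.1f}B transistors")
print(f"Hypothetical total: ~{total / 1e9:.1f}B (GK110/Titan is ~7.1B)")
```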
 
Yup, that's major surgery. They were already at 5 billion transistors; that would have pushed them well over 7. Does even Titan/GK110 have 7?
I dunno, but my dream machine was (still is) a sick Xbox One called Xbox One 3D, or just Talisman (like the old Project Talisman, a TBDR GPU architecture created by Microsoft).

It would feature the exact same GPU that it has now, BUT a GPU for each eye (I miss the days when you could name a GPU, like say... Xenos).

So it would be dual GPU, exactly the same SoC, with each GPU featuring 32 MB of eSRAM.
 
64MB ESRAM @ 408GB/s, 32ROPS and 16CU's. Think about it.

Actually, a much cheaper/smarter modification would've been to turn the ESRAM into a full-on L3 cache, similar to what Intel did with Iris Pro. It is pretty shocking they didn't think of this, given that (allegedly) the ESRAM is full-blown 6-transistor on-chip memory. It's just missing the cache controller!!!
 
An L3 cache would be CPU coherent, however.
Coherent bandwidth for the chip is 30 GB/s, and that is the high water mark for every other design using AMD's current architecture.
 
An L3 cache would be CPU coherent, however.
Coherent bandwidth for the chip is 30 GB/s, and that is the high water mark for every other design using AMD's current architecture.

The GPU already has an L2 cache that I believe is non-coherent with the CPU. Can't the ESRAM just be an L3 cache to that L2 cache?
I admit I don't fully understand the hardware implications, so perhaps what I'm suggesting is not that feasible or cheap to accomplish, but it does seem somewhat of a waste to have a billion SRAM transistors sit on a chip and only be manually addressable.
 
hm... where does the ROP cache fit into the hierarchy :?:

The L2 on the GPU is for the shader/tex.
 
Actually, a much cheaper/smarter modification would've been to turn the ESRAM into a full-on L3 cache, similar to what Intel did with Iris Pro. It is pretty shocking they didn't think of this, given that (allegedly) the ESRAM is full-blown 6-transistor on-chip memory. It's just missing the cache controller!!!

Would this have been on balance a good or bad thing?

You would lose the ability to program/manually tune.

I don't see Iris Pro necessarily tearing up the performance benchmarks.
 
Iris Pro is in a different performance bracket altogether, and the L4 eDRAM bandwidth is ~100GB/s aggregate.

Indeed, Iris Pro is meant as a laptop-level graphics chip. It has about 50% of the ALU and bandwidth of Xbox One, I believe. But the embedded RAM is a better design all around: 128MB of it, it's eDRAM (so way more compact), and more importantly it's part of the cache hierarchy.
In general most cache memories have some ability to lock a portion so it's manually addressable if needed, but being a true cache makes it just work on everything out of the box.
 
Well, Iris Pro is 852 GFLOPS, which is ~2/3 of the XOne GPU. Seems as much in XOne's class as XOne is in PS4's, then.

I just mean I don't see the 128MB of L4 cache in Iris Pro making it somehow perform like a 1.5 TFLOP GPU. Or even punch above its weight at all, really, other than in bandwidth.

I guess if the goal is just ultimate ease of use maybe cache is the way to go. But would I be wrong in thinking manual control is the best way to chase performance, which is what you really want in a console that's under fixed specs for 6 years?

But I'm out of my knowledge league here. Just my ill informed ideas from 10,000 feet.
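
For reference, the ~2/3 figure above falls out of the usual peak-FLOPS arithmetic. A minimal sketch with the commonly cited unit counts and clocks; treat them as approximations rather than exact product specs.

```python
# Where the "~2/3" figure comes from, using the usual peak-throughput formula:
# GFLOPS = FLOPs issued per clock x clock (GHz). Lane counts and clocks are
# the commonly cited figures and should be treated as approximations.
def peak_gflops(flops_per_clock, clock_ghz):
    return flops_per_clock * clock_ghz

iris_pro = peak_gflops(40 * 16, 1.3)      # 40 EUs x ~16 FLOPs/clk, ~1.3GHz boost
xbox_one = peak_gflops(768 * 2, 0.853)    # 12 CUs x 64 lanes, FMA, 853MHz
ps4      = peak_gflops(1152 * 2, 0.8)     # 18 CUs x 64 lanes, FMA, 800MHz

print(f"Iris Pro ~{iris_pro:.0f} GFLOPS, XB1 ~{xbox_one:.0f}, PS4 ~{ps4:.0f}")
print(f"Iris Pro / XB1 = {iris_pro / xbox_one:.2f}; XB1 / PS4 = {xbox_one / ps4:.2f}")
```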
 
The GPU already has an L2 cache that I believe is non-coherent with the CPU. Can't the ESRAM just be an L3 cache to that L2 cache?
The GPU's memory hierarchy appears to be too primitive to extend in that way.
The L2 is already physically sliced to match memory controllers, and it serves as the common coherence client for the CUs so that the GPU can be at least weakly coherent within itself.

The GPU's idea of coherence works because the CU L1s are write-through and the physically partitioned L2 means data can only spill to one place. No coherence checking is needed because there is only one place data can be cached.
An L3 being pasted on creates another place data could be cached, and that would break the GPU as it is.

The eSRAM basically stands on the other side of a crossbar as if it's a sort-of memory controller, so by manually addressing it and treating it like a spot of main memory, it's a unique non-cached piece of memory that basically means most of the GPU can operate without a redesign.
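
A toy sketch of that routing idea in Python: the slice count, interleave granule, and ESRAM aperture address below are all invented for illustration and are not the real Xbox One address map; the point is only that a given address has exactly one home, so no cross-slice coherence checks are needed.

```python
# Toy model of the single-home-for-data point above: with the L2 physically
# sliced per memory channel, any given physical address can only ever live in
# one slice, so no cross-slice snooping is needed, while the ESRAM aperture is
# routed straight across the crossbar like another memory controller.
NUM_L2_SLICES = 4
ESRAM_BASE = 0x8000_0000
ESRAM_SIZE = 32 * 1024 * 1024

def route(addr: int) -> str:
    """Return which unit services a physical address in this toy model."""
    if ESRAM_BASE <= addr < ESRAM_BASE + ESRAM_SIZE:
        return "ESRAM (manually addressed, not cached in L2)"
    slice_id = (addr >> 8) % NUM_L2_SLICES   # interleave on 256-byte granules
    return f"L2 slice {slice_id} -> DRAM channel {slice_id}"

for a in (0x0000_1200, 0x0000_1300, ESRAM_BASE + 0x40):
    print(hex(a), "->", route(a))
```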

hm... where does the ROP cache fit into the hierarchy :?:

The ROP caches are separate from the vector/texture cache path, and they get their data over an export bus from the CUs instead of the load/store units. They're similarly aligned with memory channels like the L2 slices, but the two cache types don't really operate together.
 
And now the obvious question :p: why not increase the size of both the L2s and the ROP caches? Or is it that any practical size increase wouldn't be anywhere near as useful as just having the scratchpad?
 
Well, Iris Pro is 852 GFLOPS, which is ~2/3 of the XOne GPU. Seems as much in XOne's class as XOne is in PS4's, then.

I just mean I don't see the 128MB of L4 cache in Iris Pro making it somehow perform like a 1.5 TFLOP GPU. Or even punch above its weight at all, really, other than in bandwidth.

I guess if the goal is just ultimate ease of use maybe cache is the way to go. But would I be wrong in thinking manual control is the best way to chase performance, which is what you really want in a console that's under fixed specs for 6 years?

But I'm out of my knowledge league here. Just my ill informed ideas from 10,000 feet.
I would avoid basing the comparison completely on FLOPS counts; the latest Nvidia GPUs show that FLOPS tell a very limited part of the story.
Not that I expect Iris Pro to punch above its weight, but I'm not sure it performs as it does because of FLOPS count or bandwidth.
Usually Intel GPUs do terribly with AA; I suspect other issues are holding back performance, like weak ROPs or fixed-function hardware.

If the question is about ESRAM vs. eDRAM, I think MSFT's answer is pretty clear: they had no choice because of costs and available tech.
Crystalwell costs Intel "peanuts" and they sell it with huge margins. Now, for anybody else, I could see those 80mm² of silicon on a pretty advanced lithography costing a lot of money.
Then, leaving price alone for a second, could AMD have done something worthy out of it? I wonder; they have yet to fix the L3 on their main-line CPUs, the "unification" of the memory subsystem in their APUs should only come with their next round of products, etc.

My POV is that the Core i7-4770R is a better chip than Durango or Liverpool overall: the CPU performance is not in the same ballpark and power consumption is better, yet it is still not a decent gaming rig, and the price sucks with respect to the gaming performance it provides.

Intel's solution is great, but for other actors I wonder if GDDR5 is a better bet, as the cost of eDRAM and the R&D associated with Crystalwell might very well cover the extra pennies GDDR5 would cost.

Microsoft's solution, i.e. using ESRAM, could be a good one, though I wonder about the implementation. Even once devs get their heads wrapped around the size limitation of the ESRAM, we are still looking at something that in some regards performs like a 16-ROP GPU attached to GDDR5 through a 128-bit bus, though with a lot more RAM. I think it will be a while before the 2GB of RAM of cards like the R7 260X and GTX 750 Ti turns into a severe limitation.
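
To put numbers behind that comparison, a quick bandwidth sketch: the GDDR5 data rate is a typical value rather than a specific card's spec, while the DDR3 and ESRAM figures are the commonly quoted Xbox One numbers.

```python
# Sanity-checking the "16-ROP GPU on a 128-bit GDDR5 bus" comparison with the
# usual formula: GB/s = (bus width in bits / 8) x data rate in GT/s.
# The GDDR5 data rate is a typical value, not an exact product spec.
def bandwidth_gbs(bus_bits: int, data_rate_gtps: float) -> float:
    return bus_bits / 8 * data_rate_gtps

gddr5_128bit = bandwidth_gbs(128, 6.0)     # e.g. an R7 260X-class card, ~6 GT/s GDDR5
xb1_ddr3     = bandwidth_gbs(256, 2.133)   # Xbox One's 256-bit DDR3-2133
xb1_esram    = 109                         # GB/s each way, ~204 GB/s peak combined

print(f"128-bit GDDR5 card: ~{gddr5_128bit:.0f} GB/s")
print(f"XB1 DDR3-2133:      ~{xb1_ddr3:.0f} GB/s, plus ESRAM ~{xb1_esram} GB/s each way")
```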

I guess it's going to work, yet the whole thing looks costly: lots of silicon dedicated to ESRAM, a 256-bit bus to the main RAM, fast DDR3.
------------------------------------

Overall I wonder if the issue is actually not ESRAM vs. eDRAM but UMA vs. NUMA design.
The internals of AMD's APUs still seem a bit messy to me; even Sony stated that the bandwidth available to the GPU drops significantly when the CPU accesses the RAM (IIRC, and I don't know to what extent it affects performance in real-world usage / over my head).
Was UMA ready for prime time, especially for MSFT, for which an all-GDDR5 system was out of the picture?

Looking at how the performance of low/mid-range GPUs fares against this generation of consoles, I wonder if NUMA would have turned into such an issue.

I'm thinking of something like this (rough bandwidth numbers in the sketch below):
6/8 GB of DDR3 on a 128-bit bus, cheap and standard 1600 parts.
1/2 GB of fast GDDR5 on a 128-bit bus.
A single chip, but with the CPU and GPU connected through a fast on-chip PCI Express-type link: a "discrete GPU on chip" kind of setup rather than a "not quite ready for prime time heterogeneous processor wannabe".
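
Rough numbers for the two pools, a sketch under assumed data rates; the GDDR5 speed in particular is just a plausible 2013 value rather than part of a worked-out design.

```python
# Rough bandwidth for the hypothetical split pools above; the data rates are
# assumed mainstream-2013 values, not a concrete specification.
def bandwidth_gbs(bus_bits: int, data_rate_gtps: float) -> float:
    return bus_bits / 8 * data_rate_gtps

cpu_pool = bandwidth_gbs(128, 1.6)    # 6-8GB of plain DDR3-1600
gpu_pool = bandwidth_gbs(128, 5.5)    # 1-2GB of GDDR5 at ~5.5 GT/s

print(f"CPU pool, 128-bit DDR3-1600: ~{cpu_pool:.1f} GB/s")
print(f"GPU pool, 128-bit GDDR5:     ~{gpu_pool:.0f} GB/s")
```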

The chip would have been a lot tinier, and cheaper. Depending on the memory setup selected, they may also have saved on the memory price; DDR3-2133 still comes at a nice premium over its vanilla 1600 ancestor.
If they were willing to cut corners, 1GB could have done the job at the cost of enforcing the use of virtual texturing / tiled resources (maybe not a great idea).
It might also have saved quite a bit in R&D expenses.
 