Do you have any insight into the idleness of the eight cores of the PS4/XBox Jaguar processors as they wait for main memory?
I don't have any benchmarks for those. For either console, it should be closer to 6 cores, since the other two appear to be reserved. There aren't exact numbers, but Durango was listed as having ~190 cycles of main memory latency and Orbis 220+.
That's well beyond Jaguar's ability to reorder around a stall. At 2 instructions per cycle, its 64 op window can only last 32 cycles.
Splitting the difference at ~200 cycles, that's ~168 cycles of nothing.
If the scenario is one miss to main memory every 200 cycles, it yields ~0.32 IPC, or 16% of peak IPC, assuming everything else is perfect (it isn't).
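Just to put that arithmetic in one place, here's a quick sketch (pure toy model; the 2-wide issue, 64-op window, and ~200-cycle latency are the numbers above, and everything else is idealized):

```python
# Toy model: one miss to main memory every LATENCY cycles, everything else ideal.
PEAK_IPC = 2       # Jaguar is 2-wide
WINDOW_OPS = 64    # reorder window depth, in ops
LATENCY = 200      # assumed round trip to main memory, in cycles

covered = WINDOW_OPS / PEAK_IPC          # 32 cycles the window can hide
stalled = LATENCY - covered              # ~168 cycles of nothing
effective_ipc = WINDOW_OPS / LATENCY     # 64 ops retired per 200 cycles = 0.32
print(covered, stalled, effective_ipc, effective_ipc / PEAK_IPC)
# 32.0 168.0 0.32 0.16  -> 16% of peak
```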
The point of caches is to keep that worst-case scenario from happening so much.
As an iffy proxy for Jaguar's L1, a 32KB Intel cache was profiled at a 5-10% L1 miss rate in SPEC 2k.
The per-core share of Jaguar's L2 is 16x bigger than the L1, which, assuming a square-root relationship between miss rate and capacity, means the fraction of accesses that miss all the way to main memory should be 1.25-2.5% (L2 misses come out of L1 misses).
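For the sake of showing my work, here's that scaling estimate as a sketch (the square-root rule of thumb is the big assumption; the cache sizes and L1 miss rates are the figures above):

```python
import math

L1_KB = 32
L2_PER_CORE_KB = 512                  # Jaguar's 2MB L2 shared by a 4-core module -> 512KB/core
l1_miss_rates = (0.05, 0.10)          # 32KB Intel L1 profiled on SPEC 2k, used as a proxy

scale = math.sqrt(L2_PER_CORE_KB / L1_KB)            # 16x the capacity -> ~4x fewer misses
memory_miss_rates = tuple(m / scale for m in l1_miss_rates)
print(memory_miss_rates)              # (0.0125, 0.025) -> 1.25-2.5% of accesses go to memory
```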
I'm unfortunately hopping between SPEC versions here, but benchmarks in a more recent version had 14-40% of their instruction mix composed of memory loads.
I'm mangling the math by combining the lowest of both ranges and the highest of both, which I haven't really verified is correct, but I'll go with it just for the theory.
0.0125*0.14 = 0.00175, call it ~0.17% of instructions hitting main memory on the low end, and 0.025*0.4 = 0.01, or 1%, on the high end.
In a contrived scenario of pure stall or perfect work where I hopefully don't screw up massively:
0.17% of 1000 instructions is 1.7 misses to memory.
That is 1000 instructions / 2 instructions per clock = 500 cycles of work, plus 1.7 misses * ~170 stall cycles each (the ~168 from above, rounded) = ~289 cycles of stall. This is 500 cycles of work out of a total of 789 clocks elapsed, or roughly 63% of peak.
1% of 1000 instructions is 10 misses. That's 1000/2 = 500 cycles of work and 10 * 170 = 1700 cycles of stall, for 500/2200 = ~23% of peak.
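The same toy model in one place, in case anyone wants to poke at the numbers (assumes pure work-or-stall with ~170 stall cycles per serialized miss, as above):

```python
def fraction_of_peak(misses_per_instr, instrs=1000, peak_ipc=2, stall_per_miss=170):
    """Pure work-or-stall toy model: every instruction retires at peak IPC,
    and every miss to memory pays its full ~170-cycle stall on its own."""
    work_cycles = instrs / peak_ipc
    stall_cycles = instrs * misses_per_instr * stall_per_miss
    return work_cycles / (work_cycles + stall_cycles)

print(fraction_of_peak(0.0017))   # ~0.63 -> ~63% of peak at 0.17% misses
print(fraction_of_peak(0.01))     # ~0.23 -> ~23% of peak at 1% misses
```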
Unfortunately the reality is that things are way more complex than this. We're not approaching that level of pure work or pure stall without a very specific instruction mix and a lot of luck, if it really can be done that way. On top of that, the CPUs will frequently overlap misses, so two stalls to main memory don't lead to 2x the stall cycles if they are launched close together. Jaguar can have up to 8 misses in flight. I haven't really figured in writes or instruction cache traffic, and haven't covered hazards or core contention, TLB fills, fused ops, varying cache latencies, branch mispredicts, and so on and on.
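To give a rough feel for why the overlap matters so much, here's the same toy model with an average overlap factor bolted on. The overlap factor is entirely made up; the 8-miss cap and the per-miss stall are the figures above:

```python
def fraction_of_peak_with_overlap(misses_per_instr, avg_overlap, instrs=1000,
                                  peak_ipc=2, stall_per_miss=170):
    """Same toy model, but avg_overlap misses share a single latency window.
    avg_overlap is hypothetical; 8 is Jaguar's in-flight miss limit."""
    avg_overlap = min(avg_overlap, 8)
    work_cycles = instrs / peak_ipc
    stall_cycles = instrs * misses_per_instr * stall_per_miss / avg_overlap
    return work_cycles / (work_cycles + stall_cycles)

print(fraction_of_peak_with_overlap(0.01, 1))   # ~0.23, misses fully serialized
print(fraction_of_peak_with_overlap(0.01, 4))   # ~0.54, if 4 misses overlap on average
```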
While eight cores sounds impressive, I just can't imagine keeping them all fed while the GPU is accessing memory, especially on the XBone. It seems to me the cores would spend much of their time running up against the "memory wall".
It's not particularly clear that this is any better for Orbis, since the northbridge link for the CPUs is a third slower and its memory latency may be measurably worse.
The memory wall is a problem for everyone.
I'm inclined to think it would have been better to toss out four cores and replace the area they took up with an L3 cache or maybe a larger L2.
The missing peak performance would be noticeable.
With something like the system reservation and OS services, 1-2 of those cores would be at least partially taken away from developers, leaving two weak cores.
Would it be possible under such a setup to still utilise the power of a big CPU for certain CPU-intensive aspects of a game that don't necessarily require close integration with the GPU? Or is that basically what you're talking about here?:
That, and there are workloads that have a lot of intermediate work that can be swamped by bus transfers, but then have only a final result writeback, like various image processing routines. In that case, even if the cores are weaker, it strips out all the intermediate copies that just eat up time. Even if the final copy is still a cost, it's a more limited and occasional expense that can be amortized more readily.
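As a rough illustration of that amortization, with completely invented stage times rather than anything measured:

```python
def pipeline_ms(stages, compute_ms, copy_ms, final_copy_ms, shared_memory):
    """Hypothetical cost model: with a shared memory space the intermediate
    copies disappear and only the final result writeback is paid."""
    transfers = final_copy_ms if shared_memory else stages * copy_ms + final_copy_ms
    return stages * compute_ms + transfers

# Invented numbers: 5 stages of 1ms compute, 2ms per bus copy, 2ms final writeback.
print(pipeline_ms(5, 1.0, 2.0, 2.0, shared_memory=False))  # 17.0 -> copies dominate
print(pipeline_ms(5, 1.0, 2.0, 2.0, shared_memory=True))   # 7.0  -> one occasional copy
```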
The fact that we tend to see very little (if any) speed-up from upping PCI-E bandwidth was, I'd assumed, evidence that the CPU and GPU don't actually need a huge amount of bandwidth between them to operate at full capacity (assuming you have enough memory local to the GPU). I'd be interested to better understand why that's not the case?
The transfers can be such an obstacle that various algorithms are simply not used.
One possibly extreme, but not entirely unrepresentative, example of how transfers make many GPGPU workloads a nonstarter even for high-end GPUs:
http://www.extremetech.com/wp-content/uploads/2012/06/memcached_useful-calculations_AFDS.jpg
For these workloads, it's worse than a waste of time to try it, so it isn't done.
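That chart boils down to a break-even question: the GPU's compute win has to be larger than the time spent shuttling data over the bus. A minimal sketch of that comparison, with invented numbers rather than anything from the linked slide:

```python
def offload_wins(cpu_ms, gpu_ms, bytes_moved, bus_gb_per_s):
    """Hypothetical check: GPU compute plus bus transfer vs. just doing it on the CPU."""
    transfer_ms = bytes_moved / (bus_gb_per_s * 1e9) * 1e3
    return gpu_ms + transfer_ms < cpu_ms

# Invented numbers: a 10x kernel speedup, but 256MB each way over ~12 GB/s effective PCI-E.
print(offload_wins(cpu_ms=20.0, gpu_ms=2.0, bytes_moved=2 * 256e6, bus_gb_per_s=12))
# False -> the transfers swamp the 10x kernel win, so the work stays on the CPU
```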
It's not certain at this point how much the consoles will leverage compute with the shared die, memory space, and higher bandwidth, but it's one of the first times that the idea wasn't shot down outright.