The ESRAM in Durango as a possible performance aid

Thanks, so GDDR5 and DDR3 are not that different.

What about 6T SRAM and 1T SRAM?

1T SRAM = eDRAM in smaller nodes. But... e.g. TSMC also has dense 6T SRAM.

eDRAM is not so nice for leakage and power (e.g. charge pumps to generate a higher voltage for your bit lines). And
don't forget costs (4-6 extra mask steps, so 10-15% extra mask costs and higher wafer costs) and yield.
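For a rough sense of where that 10-15% figure comes from, here's a back-of-envelope sketch (the ~40-mask baseline is my assumption, not a figure from the post):

```python
# Back-of-envelope check on the eDRAM mask-cost figure.
# Assumption: a logic process at these nodes uses roughly 40 mask steps.
baseline_masks = 40

for extra in (4, 6):  # extra mask steps quoted for eDRAM
    print(f"{extra} extra masks ~= {extra / baseline_masks:.0%} extra mask cost")
# 4 extra masks ~= 10% extra mask cost
# 6 extra masks ~= 15% extra mask cost
```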

If I were MS, I would go for the SiP option (stack a DRAM die on top of your main die, like Intel seems to be doing for Haswell...
and what they of course did in the original Xbox 360).
 
At the small nodes, I believe that eDRAM is actually less leaky than SRAM. You have 1 capacitor and 1 transistor vs 6 transistors leaking. Leakage is getting worse and worse and really starting to dominate at the smaller nodes; that used to not be the case.

https://www.power.org/wp-content/uploads/2010/11/Wire_Speed_Presentation_5.5_-_Final4.pdf

IBM's A2 presentation on page 13 describes eDRAM cells vs SRAM cells with regards to power and performance. They replaced the L2 SRAM with one based on eDRAM cells. Similar performance, but half the area and 20% of the power. eDRAM latency looks to be about 2.5x worse.
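To make the quoted trade-off concrete, here's a small sketch that applies those ratios (half the area, 20% of the power, ~2.5x latency) to a hypothetical SRAM L2. The baseline numbers are made-up placeholders; only the ratios come from the presentation:

```python
# Scale a hypothetical SRAM L2 by the eDRAM ratios cited from IBM's A2 slides.
# Baseline values below are placeholders; the ratios (0.5x area, 0.2x power,
# ~2.5x latency) are the figures quoted above.
sram_l2 = {"area_mm2": 20.0, "power_w": 4.0, "latency_cycles": 20}

edram_l2 = {
    "area_mm2": sram_l2["area_mm2"] * 0.5,             # half the area
    "power_w": sram_l2["power_w"] * 0.2,               # 20% of the power
    "latency_cycles": sram_l2["latency_cycles"] * 2.5, # ~2.5x worse latency
}
print(edram_l2)
# {'area_mm2': 10.0, 'power_w': 0.8, 'latency_cycles': 50.0}
```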
 
If I were MS, I would go for the SiP option (stack a DRAM die on top of your main die, like Intel seems to be doing for Haswell...
and what they of course did in the original Xbox 360).
You can't stack dies with the thermal profile of an APU-type chip, it'd nuke itself with the heat trapped inside of it. Haswell and the various versions of the 360 all use dies set side-by-side in a traditional MCM manner.
 
Rather than anything else, it just seems like a way of having a lot of cheap memory while maintaining some degree of performance.

So it's a performance aid over just having DDR3, not special sauce.

Though it'd be interesting to see the latency figures for ESRAM vs DDR3 and GDDR5.
I was asking in the Orbis technical thread and the answer was that the difference in DDR3 vs GDDR5 latency was not significant, but I never got actual figures.

I can imagine that latency figures (temporal) may be a bit better for eSRAM, but the major advantage of SRAM would be its closeness (on chip). The difference in latency from the controller and the RAM itself between eSRAM and off-chip RAM may be small, but the latency introduced by going off-chip is probably not.
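As a rough decomposition (all numbers are generic assumptions for illustration, not Durango figures): the raw flight time over a few centimetres of PCB trace is tiny, so most of the off-chip penalty sits in everything wrapped around the wire, like the pad ring, PHY and controller queuing:

```python
# Rough decomposition of an off-chip DRAM access, separating wire "distance"
# from the rest. All numbers are generic assumptions, not Durango figures.
trace_cm = 5.0              # chip-to-DRAM trace length
prop_cm_per_ns = 15.0       # ~half the speed of light on FR-4
flight_ns = 2 * trace_cm / prop_cm_per_ns   # round trip on the wire

dram_core_ns = 40.0         # row/column access inside the DRAM (tRCD + CL)
interface_ns = 20.0         # pad ring, PHY, controller queuing (both ends)

total_ns = flight_ns + dram_core_ns + interface_ns
print(f"wire flight: {flight_ns:.2f} ns of {total_ns:.1f} ns total")
# wire flight: 0.67 ns of 60.7 ns total
```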
 
So how big would the difference be between say the PS4's GDDR5 setup and the on die ESRAM in Durango?

It's probably more than just looking at eSRAM versus GDDR5. Onion+ isn't there for nothing, so even Sony realizes the need for lower latency access to memory. The problem is there is no info on how much latency is removed by avoiding the L1 and L2 caches of the GPU, or whether eSRAM allows a similar bypass.
 
So, if they're tossing around a really high 5 billion transistor count, and it seems a huge chunk of that must be dedicated to eSRAM (I'm seeing posts saying 2 billion transistors; that's the size of a 7850 GPU!)... shouldn't we really expect this to increase performance?

If not, why didn't they just go with dumb EDRAM? If the only purpose was bandwidth aid, wouldn't that have been better?
 
EDRAM locks you into certain foundries since it requires specialized manufacturing techniques while SRAM is the same as the rest of the logic on the die and can be made anywhere.
 
EDRAM locks you into certain foundries since it requires specialized manufacturing techniques while SRAM is the same as the rest of the logic on the die and can be made anywhere.


I mean, it just really seems odd. I saw one post ballparking it: 5b transistors, must be broken down as 2b GPU, 2b eSRAM, 1b CPU/everything else...

I mean, he just dedicated 4b to the GPU; that's a 7970. Why not use an actual 7970 then?

Something just doesn't add up to me. Options:

- Microsoft screwed up big time, making a weak console that's also expensive (certainly this will be a popular opinion).

- The eSRAM is somehow much less weighty than its transistor count indicates, i.e. it can be packed into a much smaller space etc., so the transistor count is not close to a true indication of its cost, which is much less.

- The eSRAM is really useful / gives the GPU a major power-efficiency help (what this thread was about, with no conclusive answer that I can tell, but overall not seeming greatly positive).

Also, I'm not sure that "locking you into certain foundries", while intriguing info, matters so much?

What of the EDRAM in 360? Seemed to work out ok.

There's much talk about the advantage of being able to fab anywhere, but it always just ends up being TSMC, or at best one or two other big options, doesn't it? Always found that a bit odd lol.
 
eDRAM on Xbox 360 was ultimately never integrated into any of the other dies, so it always carried that additional cost. There have been some cases of eDRAM integrated with big logic, but I expect it increases the cost of the entire die by a fair amount.

If the eSRAM is 32MB of 6T SRAM with 9 bits per byte for ECC, that'd be 1.8b transistors. Would be less w/o ECC, more if it's 8T (doubt it). So 2b sounds like the right ballpark. Big SRAM blocks may well be the densest part of the die. TSMC says a 6T SRAM cell at 28nm is 0.127um^2, which would make 32Mx9-bit around 40mm^2, which isn't that bad (not sure what controllers and other overhead would add). Chances are good that in a few years manufacturing cost will go down more than it would have for eDRAM.
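Those figures check out; here they are as a worked calculation (this also covers the ~1.6b no-ECC figure mentioned below):

```python
# Re-derive the transistor-count and area estimates from the post.
MIB = 2**20
bits_no_ecc = 32 * MIB * 8   # 32MB at 8 bits per byte
bits_ecc    = 32 * MIB * 9   # 32MB at 9 bits per byte (ECC)

for label, bits in (("no ECC", bits_no_ecc), ("ECC", bits_ecc)):
    print(f"{label}: {bits * 6 / 1e9:.2f}b transistors (6T cells)")
# no ECC: 1.61b transistors (6T cells)
# ECC: 1.81b transistors (6T cells)

cell_um2 = 0.127             # TSMC 28nm 6T SRAM cell size
area_mm2 = bits_ecc * cell_um2 / 1e6
print(f"raw cell area: {area_mm2:.1f} mm^2 (before controllers and overhead)")
# raw cell area: 38.4 mm^2
```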
 
The ESRAM can be a coherent cache for both CPU and GPU and useful in HSA. It has low latency, which will be good for GPU compute, and is probably low power. I think the Xbox One will probably be somewhat less power hungry than the PS4, but I have nothing to really back that up; it will be even harder to tell when Kinect and Move and whatnot are working at the same time.

The die is gigantic for a SoC, probably bigger than the PS4 die. I think it will be more expensive in the short run to do this, but it will probably get cheaper with shrinks, more so than using GDDR5 would.

I doubt they would need ECC in a home console. 6T SRAM at 6 transistors per bit is about 1.6b transistors for 32MB. That is a pretty massive chunk of the transistor budget. Can't really say if it was worth it without knowing what MS is planning and what their line of thought is.
 
The ESRAM can be a coherent cache for both CPU and GPU and useful in HSA.

According to who? Cache needs a lot of extra die area for tags and a controller, and adds the headache of putting it in between the bus instead of on a separate bus.
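To put a number on that tag overhead, here's a sketch with assumed parameters (64-byte lines, 16-way set associativity, 40-bit physical addresses; none of these are known Durango values):

```python
# Estimate tag-array overhead if the 32MB eSRAM were managed as a cache.
# All parameters are assumptions for illustration, not known Durango values.
capacity  = 32 * 2**20   # bytes
line_size = 64           # bytes per cache line
ways      = 16
phys_bits = 40           # physical address width

lines = capacity // line_size
sets  = lines // ways
offset_bits = line_size.bit_length() - 1             # 6
index_bits  = sets.bit_length() - 1                  # 15
tag_bits    = phys_bits - index_bits - offset_bits   # 19
state_bits  = 3                                      # valid/dirty/coherence

tag_bytes = lines * (tag_bits + state_bits) / 8
print(f"{tag_bytes / 2**20:.2f} MB of tags/state "
      f"({tag_bytes / capacity:.1%} overhead, plus comparators and control logic)")
# 1.38 MB of tags/state (4.3% overhead, plus comparators and control logic)
```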
 
Not to minimize the importance of bandwidth at all, but what other benefit does esram provide the console? Wouldn't that budget have been better used on GPU CUs? Sony has faster memory and more compute for the same cost I take it?
 
It's only useful to have more CU's if you can feed them, and that requires bandwidth and the ability to hide memory latency.
Caches help with the latter, but CU's are idle a lot of the time waiting on memory, or other parts of the render pipeline.
Doubling the CU count does not make a part twice as fast except in artificial tests.
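A toy roofline-style model of that point (all numbers are illustrative assumptions, not Durango or Orbis figures): once the workload is bandwidth-bound, adding CUs stops helping.

```python
# Toy roofline model: achievable throughput is the lower of the compute peak
# and bandwidth * arithmetic intensity. All numbers are illustrative.
def achievable_gflops(cus, gflops_per_cu, bw_gbs, flops_per_byte):
    peak = cus * gflops_per_cu            # compute-bound ceiling
    bw_limit = bw_gbs * flops_per_byte    # bandwidth-bound ceiling
    return min(peak, bw_limit)

bw = 170.0         # GB/s of memory bandwidth (assumed)
intensity = 8.0    # FLOPs per byte of traffic for the workload (assumed)

for cus in (12, 18, 24):
    g = achievable_gflops(cus, 100.0, bw, intensity)
    print(f"{cus} CUs -> {g:.0f} GFLOPS achievable")
# 12 CUs -> 1200 GFLOPS
# 18 CUs -> 1360 GFLOPS
# 24 CUs -> 1360 GFLOPS   (doubling CUs bought ~13%, not 2x)
```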
 
According to who? Cache needs a lot of extra die area for tags and a controller, and adds the headache of putting it in between the bus instead of on a separate bus.
If PRT can leverage eSRAM, perhaps there's a much coarser level of coherence at a page or texture chunk level?
The consistency model AMD is using for its upcoming unified memory is very weak, so it doesn't need to be as responsive or well-ordered as a CPU peer would be.

Not to minimize the importance of bandwidth at all, but what other benefit does esram provide the console? Wouldn't that budget have been better used on GPU CUs? Sony has faster memory and more compute for the same cost I take it?

I don't think anyone can say the solutions have the same cost. In an ideal world with decent yields and process node transitions, the eSRAM solution could do better.
Whether it's significantly better would depend on how much of the idealized cost savings are realized, and how well Sony can wrangle cheap memory and improve its PCB and chip packaging. The latter is something Sony has an interest in being good at.
 
It's only useful to have more CU's if you can feed them, and that requires bandwidth and the ability to hide memory latency.
Caches help with the latter, but CU's are idle a lot of the time waiting on memory, or other parts of the render pipeline.
Doubling the CU count does not make a part twice as fast except in artificial tests.

So functionally eSRAM helps keep the CUs fed with as many wavefronts as they can handle, if managed properly. Interesting. I'd really like to get a better understanding of how the move engines and eSRAM work together and why MS went this route over a more robust and future-proof graphics solution.

We also heard nothing about SHAPE today...
 
I hope sebbbi doesn't mind me re-pasting his reply on the technical thread here....



On Xbox 360, the EDRAM helps a lot with backbuffer bandwidth. For example in our last Xbox 360 game we had a 2 MRT g-buffer (deferred rendering, depth + 2x8888 buffers, same bit depth as in CryEngine 3). The g-buffer writes require 12 bytes of bandwidth per pixel, and all that bandwidth is fully provided by EDRAM. For each rendered pixel we sample three textures. Textures are block compressed (2xDXT5+1xDXN), so they take a total of 3 bytes per sampled texel. Assuming a coherent access pattern and trilinear filtering, we multiply that cost by 1.25 (25% extra memory touched by trilinear), and we get a texture bandwidth requirement of 3.75 bytes per rendered pixel. Without EDRAM the external memory bandwidth requirement is 12+3.75 bytes = 15.75 bytes per pixel. With EDRAM it is only 3.75 bytes. That is a 76% saving (over 4x external memory bandwidth cost without EDRAM). Deferred rendering is a widely used technique in high end AAA games. It is often criticized for being bandwidth inefficient, but developers still love to use it because it has lots of benefits. On Xbox 360, the EDRAM enables efficient usage of deferred rendering.

Also a fast read/write on chip memory scratchpad (or a big cache) would help a lot with image post processing. Most of the image post process algorithms need no (or just a little) extra memory in addition to the processed backbuffer. With large enough on chip memory (or cache), most post processing algorithms become completely free of external memory bandwidth. Examples: HDR bloom, lens flares/streaks, bokeh/DOF, motion blur (per pixel motion vectors), SSAO/SSDO, post AA filters, color correction, etc, etc. The screen space local reflection (SSLR) algorithm (in Killzone Shadow Fall) would benefit the most from fast on chip local memory, since tracing those secondary rays from the min/max quadtree acceleration structure has quite an incoherent memory access pattern. Incoherent accesses are latency sensitive (lots of cache misses) and the on chip memories tend to have smaller latencies (of course it's implementation specific, but that is usually true, since the memory is closer to the execution units, for example Haswell's 128 MB L4 should be lower latency than the external memory). I would expect to see a lot more post process effects in the future as developers are targeting cinematic rendering with their new engines. Fast on chip memory scratchpad (or a big cache) would reduce bandwidth requirement a lot.
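Sebbbi's g-buffer arithmetic checks out; here it is as a worked calculation (the per-pixel figures come straight from his post above):

```python
# Re-derive the per-pixel bandwidth figures from sebbbi's post.
gbuffer = 12.0          # bytes/pixel of g-buffer writes (depth + 2x 8888)
texture = 3.0 * 1.25    # 3 bytes/texel (2xDXT5 + 1xDXN) * 1.25 for trilinear

without_edram = gbuffer + texture   # all traffic hits external memory
with_edram    = texture             # g-buffer writes stay on chip

print(f"without EDRAM: {without_edram} B/px, with EDRAM: {with_edram} B/px")
print(f"saving: {1 - with_edram / without_edram:.0%}, "
      f"ratio: {without_edram / with_edram:.1f}x")
# without EDRAM: 15.75 B/px, with EDRAM: 3.75 B/px
# saving: 76%, ratio: 4.2x
```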
 
The eSRAM is really useful / gives the GPU a major power-efficiency help (what this thread was about, with no conclusive answer that I can tell, but overall not seeming greatly positive).

MS does have that patent for dramatically improved methodology for processing tiled assets efficiently, which relies extensively on display planes and the low latency mem pool of the eSRAM iirc. The low latency mem pool allows the payoff for depth tiling to offer significant boosts to processing efficiency I think. Don't have the link to the patent anymore. :/
 