The pros and cons of eDRAM/ESRAM in next-gen

Well, it is a fact that there are structures on a chip that do not scale as well as others. Those typically tend to be the physical structures. So it's not that the ESRAM will scale "better", but having a smaller memory PHY will penalise you less.
 
SK hynix's website is interesting ...

"SK hynix will enhance product portfolios with HBM technology to diversify into various applications such as Graphic card, Network/HPC and PC/Game console."

[Images: proGraphic3.gif, proGraphic4.gif]

REF: https://www.skhynix.com/gl/products/graphics/graphics_info.jsp
 
eSRAM is very fast memory. In general, the biggest challenge that game developers are facing is memory access patterns: while we have a lot of computation power, the cost of memory access has increased substantially over the last ten years compared to the cost of arithmetic instructions. “As long as you are in registers you are fine, but as soon as you need to access memory, it becomes slower. So the challenge is to access memory in the most efficient way.
“Therefore memory access patterns are the most important optimization strategies. So it’s not about counting cycles, but about thinking how we can re-factor an algorithm so that we can access memory in a more efficient way. eSRAM is part of that.
“For example, with a compute shader you can access cache memory (thread group shared memory), so you can re-factor your algorithm so that it uses this memory better, resulting in huge and substantial speed-ups. With the Xbox One, the introduction of eSRAM follows a similar idea.
“The memory-expensive draw calls can be rendered into eSRAM. When you don’t need so much memory bandwidth, you use the regular system memory. You have to plan ahead; you have to think about how you are going to use the memory in the most optimal way. So eSRAM gives you an advantage if you do this. For one of our games, we used eSRAM by first creating an Excel sheet that shows how we are going to use eSRAM through the stages of the rendering pipeline. This helped us utilize the speed improvements that were coming from the eSRAM.”

http://gamingbolt.com/xbox-ones-esr...-expensive-draw-calls-can-be-rendered-into-it
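
To make the “thread group shared memory” refactoring above a bit more concrete, here is a minimal sketch of the same idea in CUDA (shared memory playing the role of HLSL's groupshared / LDS). The 3-tap filter, block size and buffer setup are illustrative assumptions, not anything from the article: each input element is fetched from DRAM once per thread block and then reused out of on-chip memory by neighbouring threads, instead of every thread re-reading its neighbours from DRAM.

```cuda
#include <cuda_runtime.h>

const int BLOCK  = 256;  // threads per block
const int RADIUS = 1;    // 3-tap box filter

__global__ void boxFilterShared(const float* in, float* out, int n)
{
    // On-chip staging buffer: the block's elements plus a halo on each side.
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Each global element is read once per block and then reused from
    // shared memory by its neighbours.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;   // halo element to the left of the block
        int right = gid + BLOCK;    // halo element to the right of the block
        tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
        tile[lid + BLOCK]  = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i % 7);

    boxFilterShared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The eSRAM planning described in the quote is the same reshaping one level up: keep the heavily reused render targets in the small fast pool and stream everything else from main memory.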
 
PS4 and PC developers will say the same thing about memory access patterns. This is probably the most important aspect of achieving good performance with GPUs. As more developers use compute and complicated shaders they will discover this.
 
PS4 and PC developers will say the same thing about memory access patterns. This is probably the most important aspect of achieving good performance with GPUs. As more developers use compute and complicated shaders they will discover this.

But are they at a disadvantage since the PS4/PC do not have ESRAM like the X1? Or does ESRAM bring the X1 onto an equal footing with the PC/PS4?
 
But are they at a disadvantage since the PS4/PC do not have ESRAM like the X1?

Devs don't have to worry about fitting into a limited memory space with their framebuffers, but anything they do that reduces overall bandwidth consumption helps everyone. You might look at sebbbi's posts on framebuffer management or tiled particles & ROP caches.

For compute shaders, they'd benefit from targeting L1/LDS/sharedmem/GDS/L2 closer to the shader units themselves.

As Shifty mentioned above, it echoes sentiments from last gen, where developing for Cell first (tasks need to fit into the relatively small LS) could yield benefits on other platforms (e.g. the X360's 1MB L2 shared across 6 HW threads).
 
But are they at a disadvantage since the PS4/PC do not have ESRAM like the X1? Or does ESRAM bring the X1 onto an equal footing with the PC/PS4?
Pretty much every strategy that favours XB1's ESRAM brings the same benefits to every other platform. The only possible exception would be some technique that benefits from lower latency, and as we've no idea what the latency figures are for ESRAM, speculation in that field is going to be extremely flaky. To date, I think every dev talking about the ESRAM has been talking about bandwidth and reducing that. Tile based resources or careful memory management are equally beneficial to other architectures.
 
Pretty much every strategy that favours XB1's ESRAM brings the same benefits to every other platform. The only possible exception would be some technique that benefits from lower latency, and as we've no idea what the latency figures are for ESRAM, speculation in that field is going to be extremely flaky. To date, I think every dev talking about the ESRAM has been talking about bandwidth and reducing that. Tile based resources or careful memory management are equally beneficial to other architectures.
Again, the eSRAM is exclusive to the GPU, and you get really high bandwidth from it. You don't really need to save bandwidth; you need to use it correctly. Also, currently you must use techniques that don't depend on latency. With eSRAM you can use techniques that do depend on it.
 
And you get really high bandwidth from it. You don't really need to save bandwidth; you need to use it correctly.
Everything needs to save bandwidth. That's why we have compressed textures and optimised framebuffer packing methods. BW is a valuable, easily overexploited commodity.
Also, currently you must use techniques that don't depend on latency. With eSRAM you can use techniques that do depend on it.
Which I already mentioned.
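
To put rough numbers behind "BW is a valuable, easily overexploited commodity" and the Excel-sheet budgeting mentioned earlier, here is a back-of-envelope sketch (plain host-side arithmetic). The texture size, render-target formats, target count and frame rate are illustrative assumptions, not figures from any shipped title.

```cuda
#include <cstdio>

int main()
{
    const double MB = 1024.0 * 1024.0;

    // Compressed textures: BC1 stores 4 bits/pixel vs. 32 bits for RGBA8.
    const double tex_rgba8 = 2048.0 * 2048.0 * 4.0 / MB;  // 16 MB
    const double tex_bc1   = 2048.0 * 2048.0 * 0.5 / MB;  //  2 MB
    printf("2048x2048 texture: %.0f MB RGBA8 vs %.0f MB BC1\n",
           tex_rgba8, tex_bc1);

    // Framebuffer footprint: three RGBA8 G-buffer targets + 32-bit depth.
    const int    w = 1920, h = 1080;
    const double rts = 4.0 * w * h * 4.0 / MB;
    printf("G-buffer + depth: %.1f MB (vs. 32 MB of eSRAM)\n", rts);

    // Touching those targets once for write and once for read, per frame
    // at 60 fps, before overdraw, blending, shadow maps or texture
    // sampling are counted.
    const double traffic_gbs = 2.0 * rts * 60.0 / 1024.0;
    printf("Naive read+write traffic: ~%.1f GB/s\n", traffic_gbs);
    return 0;
}
```

Those four targets barely fit in 32 MB, which is exactly why the per-pass planning matters, and the last figure is only the floor: overdraw, blending, multiple passes and texture fetches multiply it quickly.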
 
Seems that other people are using the same "super high tech" Excel-sheet-based ESRAM optimization technique as we are :D
PS4 and PC developers will say the same thing about memory access patterns. This is probably the most important aspect of achieving good performance with GPUs. As more developers use compute and complicated shaders they will discover this.
Memory access patterns have been the most important thing in achieving good performance on CPUs for a long time now (and it took some people an awfully long time to understand this, and many universities still teach it completely wrong to new students). It's the same now for GPUs, since all modern GPUs have proper cache hierarchies and the algorithms are now more complex (true gather & scatter based algorithms, instead of just gather in pixel shaders or scatter in vertex shaders).
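
A minimal CUDA sketch of what the access-pattern difference looks like on a GPU; the stride and buffer size are arbitrary choices for illustration. Both kernels do identical arithmetic, only the addressing differs: in the first, neighbouring threads read neighbouring addresses, so one cache line/DRAM burst serves a whole warp; in the second, each thread lands on a different line and most of every fetched line is wasted.

```cuda
#include <cuda_runtime.h>

__global__ void gatherCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;                 // contiguous, coalesced reads
}

__global__ void gatherStrided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n] * 2.0f;  // same ALU work, scattered reads
}

int main()
{
    const int n = 1 << 22;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    const int threads = 256, blocks = (n + threads - 1) / threads;
    gatherCoalesced<<<blocks, threads>>>(in, out, n);
    gatherStrided  <<<blocks, threads>>>(in, out, n, 32);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Timed with nvprof or CUDA events, the strided version is typically several times slower despite doing exactly the same math, which is the "memory over ALU" point in practice.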
 
Memory access patterns have been the most important thing in achieving good performance on CPUs for a long time now (and it took some people an awfully long time to understand this, and many universities still teach it completely wrong to new students). It's the same now for GPUs, since all modern GPUs have proper cache hierarchies and the algorithms are now more complex (true gather & scatter based algorithms, instead of just gather in pixel shaders or scatter in vertex shaders).

A specific example?
 
A specific example?
Most universities don't even teach performance-critical programming unless you take some special courses (and there aren't many available). Nowadays C/C++ isn't even that commonly used a language in schools/universities anymore. Writing cache-efficient code in Java (or many other currently popular languages) is very hard, since you can't control the placement of objects in memory (and long pointer chains are very frequent). The common thinking is that low-level optimization is the responsibility of the compiler. Unfortunately, compilers aren't good at optimizing memory access patterns.
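
For a concrete (if toy) illustration of the object-placement point, here is a host-side C++ sketch (no GPU involved) contrasting a Java-style linked object graph with a flat array; the sizes are arbitrary. The difference is dependent pointer loads scattered across the heap versus a linear walk over contiguous memory that the cache and prefetcher handle well.

```cuda
#include <cstddef>
#include <numeric>
#include <vector>

struct Node { float value; Node* next; };   // Java-style: one object per element

// One dependent load per element; the next address is unknown until the
// current node has arrived, so the prefetcher can't run ahead.
float sumChain(const Node* head)
{
    float s = 0.0f;
    for (const Node* node = head; node != nullptr; node = node->next)
        s += node->value;
    return s;
}

// Linear, predictable traversal over contiguous memory.
float sumContiguous(const std::vector<float>& v)
{
    return std::accumulate(v.begin(), v.end(), 0.0f);
}

int main()
{
    const std::size_t count = 1 << 20;
    std::vector<Node>  nodes(count);
    std::vector<float> flat(count, 1.0f);

    for (std::size_t i = 0; i + 1 < count; ++i)
        nodes[i] = { 1.0f, &nodes[i + 1] };
    nodes.back() = { 1.0f, nullptr };

    // In this toy the nodes happen to sit contiguously; on a real managed
    // heap (Java, C#) they generally do not, which is exactly the problem.
    float a = sumChain(&nodes[0]);
    float b = sumContiguous(flat);
    return (a == b) ? 0 : 1;
}
```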
 
In my studies (I graduated 2 years ago), I didn't even have any mandatory C/C++ classes; everything was optional. I chose CG classes, which meant I had to learn it (or rather, I had begun doing it myself even earlier).

Since then, they've changed the curriculum to include C in the third semester, at least.
 
SK hynix's website is interesting ...

"SK hynix will enhance product portfolios with HBM technology to diversify into various applications such as Graphic card, Network/HPC and PC/Game console."

REF: https://www.skhynix.com/gl/products/graphics/graphics_info.jsp

Yep, this is awesome, depending on how long it takes to become affordable. Last year there was a nice theory that the GPU industry could move toward HBM in two steps. The first would be XB1-style, with a small HBM pool plus DDR4 (making GDDR5 obsolete, so the PS4 would get stuck in a niche). It now looks like both AMD and Nvidia are going all-out, 100% HBM. I'm guessing that could explain why GDDR5M disappeared unceremoniously: HBM becomes the best of both worlds, and the PS4 could contribute a huge, stable demand, bringing the cost down.

Microsoft is in a strange place: there are no HBM configurations that make sense, other than downclocking 512GB/s down to 68GB/s, or waiting for the second generation to have fewer stacks. I'm thinking maybe they are going to move to WideIO2 instead. Lower power, lower cost, etc... It can easily do 68GB/s with four stacks, they'd keep an edge in cost and power requirements, and WideIO2 will be used in phones/tablets, so there's an insane volume already guaranteed. That would be a very cool "slim" revision.

... or I'm overly optimistic and stacked memory is only going to happen next gen :???:
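
For what it's worth, the bandwidth figures above fall out of simple width × transfer-rate arithmetic. The parameters below are assumptions (an XB1-style 256-bit DDR3-2133 interface, and first-generation HBM at 1 Gbps per pin on a 1024-bit interface per stack); the WideIO2 per-stack number is just the post's own 68/4 claim, not something confirmed from a spec.

```cuda
#include <cstdio>

int main()
{
    // GB/s = (bus width in bits / 8) * per-pin transfer rate in Gbps
    auto gbs = [](double bus_bits, double gbps_per_pin) {
        return bus_bits / 8.0 * gbps_per_pin;
    };

    printf("256-bit DDR3-2133       : %.1f GB/s\n", gbs(256.0, 2.133));      // ~68
    printf("1 HBM1 stack (1024-bit) : %.0f GB/s\n", gbs(1024.0, 1.0));       // 128
    printf("4 HBM1 stacks           : %.0f GB/s\n", 4.0 * gbs(1024.0, 1.0)); // 512
    printf("4 WideIO2 stacks        : ~68 GB/s (per the post above)\n");
    return 0;
}
```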
 
Most universities don't even teach performance-critical programming unless you take some special courses (and there aren't many available). Nowadays C/C++ isn't even that commonly used a language in schools/universities anymore. Writing cache-efficient code in Java (or many other currently popular languages) is very hard, since you can't control the placement of objects in memory (and long pointer chains are very frequent). The common thinking is that low-level optimization is the responsibility of the compiler. Unfortunately, compilers aren't good at optimizing memory access patterns.

That's fair. Do you still do assembly optimization and/or inlining?
 
That's fair. Do you still do assembly optimization and/or inlining?

M$'s compiler killed inline assembly with the 2k10+ compilers, unfortunately. Intrinsics are not powerful enough, since you do not have real control over what is happening when you do weird things...

But I was not aware that C/C++ isn't mandatory any more... ugh.
 
Microsoft is in a strange place: there are no HBM configurations that make sense, other than downclocking 512GB/s down to 68GB/s, or waiting for the second generation to have fewer stacks.

Why do you think they would need to match the memory bandwidth of the previous console in a new revision? IMHO the system (HW and OS) is far too complex for games to depend on exact bandwidth/latency timings. As long as a new memory system has equal or better bandwidth/latency, that should work.
 
Why do you think they would need to match the memory bandwidth of the previous console in a new revision? IMHO the system (HW and OS) is far too complex for games to depend on exact bandwidth/latency timings. As long as a new memory system has equal or better bandwidth/latency, that should work.
I was thinking it has to behave exactly the same, so they would simulate the exact DDR3 timings by adding delays. They can't allow a new slim model to run games faster: games would have to be tested on two hardware configurations, studios would be angry, and early adopters would be angry.
 
That's not really necessary. First off, if the target development platform is still the original configuration, then the baseline performance is guaranteed. Secondly, changing one aspect is not likely to drastically change the overall performance.
 