The pros and cons of eDRAM/ESRAM in next-gen

Camille Mirey, CEO of Persistant Studios:

“Actually, the small memory isn’t problematic, it only means you can use 32MB at a time, but you can map/unmap virtual memory at will on those very fast 32MB,” Camille explains. We found this statement interesting as the eSRAM has been criticized before for its small size.

“And this is usually used for render targets and blending-intensive operations. So it will indirectly help particles [effects], by making high overdraw be faster.”

http://gamingbolt.com/xbox-ones-esr...an-help-in-better-particle-effects-popcorn-fx
 
Camille Mirey, CEO of Persistant Studios....

So the eSRAM is pretty good for particles. Maybe overdraw too? I remember joker454 and others commenting on why the X360 version of Bayonetta was so much better framerate-wise, as that console's eDRAM has insane amounts of bandwidth, which is good for overdraw.

More detailed quotation:

The Xbox One’s eSRAM, although limited in size, has an extremely high bandwidth, with a peak speed of 204 GB/s. In our lengthy interview with Camille, we asked whether there is potential for better particle effects using eSRAM, or whether it is more of a bottleneck given its extremely small size.

“Actually, the small memory isn’t problematic, it only means you can use 32MB at a time, but you can map/unmap virtual memory at will on those very fast 32MB,” Camille explains.

We found this statement interesting as the eSRAM has been criticized before for its small size.

Camille believes that this could actually result in better particle effects, as it makes overdraw faster. “And this is usually used for render targets and blending-intensive operations. So it will indirectly help particles [effects], by making high overdraw be faster,” she further explains.
You can also read it on http://www.cinemablend.com/games/Aztez-Confirmed-1080p-60fps-Xbox-One-PS4-63835.html
 
The 'better' remark isn't qualified, so I'm reading it as 'better than if there were no ESRAM'. The 204 GB/s BW is a best-case figure that I don't think will be achieved with particle rendering (particles fit in the caches, so you're only using write speed). It's a different system from the XB360's, with its blending inside the eDRAM, so the two aren't comparable. Ultimately the ESRAM isn't providing any advantage over standard RAM with the same BW.
 
pMax also makes a very valid point: until the last-minute availability of GDDR5 chips dense enough for a full 8GB, the 8GB DDR3 + ESRAM design was looking pretty smart. Alas for MS, swapping RAM chips is a fairly easy step compared to doing even a single revision of an SoC design.

This brings up what is probably a very silly question. Would the X1 have seen any meaningful benefit if MS had decided to boost it to 16GB of DDR3? I don't mean in terms of graphics performance... more in terms of disk caching, room for pre-calculated baked data, and having a spec on their data sheet that Sony wouldn't have been able to equal. :p
 
The 'better' remark isn't qualified, so I'm reading it as 'better than if there were no ESRAM'. The 204 GB/s BW is a best-case figure that I don't think will be achieved with particle rendering (particles fit in the caches, so you're only using write speed). It's a different system from the XB360's, with its blending inside the eDRAM, so the two aren't comparable. Ultimately the ESRAM isn't providing any advantage over standard RAM with the same BW.

Yes, absolutely; this often-cited "peak" 204GB/s BW is different from the regular peak BW figures of GDDR5 and DDR3 RAM.

The Xbox architects themselves said that the best bandwidth they ever measured in a best-case test of the ESRAM was 140-150GB/s. That's a ~72% maximum utilization of the peak BW in a simple test. This 72% number is explained by the specific constraints occurring during simultaneous reads/writes of the ESRAM.

That's lower than the standard ~90% utilization rate reached in a similar best-case scenario with GDDR5 RAM. Also, this 72% can't be reached for all tasks; I suppose some tasks don't benefit from (or need) simultaneous reads and writes at all, or have an unbalanced read/write load.

Like you said, if you do only particle rendering and only need to write, then your peak BW is only 109GB/s, and the standard maximum real BW will then probably be around 90% of that, so you'll only reach ~98GB/s of usable BW for read-only or write-only tasks.
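Quick sanity check of those numbers in Python (the ~90% one-way utilization is this post's assumption, not a measured figure):

```python
# ESRAM bandwidth figures quoted in this thread.
peak_combined = 204.0   # GB/s, peak with simultaneous read + write
peak_one_way = 109.0    # GB/s, read-only or write-only peak

# Best measured combined bandwidth per the Xbox architects: 140-150 GB/s.
print(f"{140 / peak_combined:.0%} to {150 / peak_combined:.0%} utilization")  # 69% to 74%

# Assumed ~90% achievable utilization for write-only traffic (e.g. particles).
print(f"{peak_one_way * 0.90:.0f} GB/s usable")  # 98 GB/s
```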
 
Yes, absolutely; this often-cited "peak" 204GB/s BW is different from the regular peak BW figures of GDDR5 and DDR3 RAM.

The Xbox architects themselves said that the best bandwidth they ever measured in a best-case test of the ESRAM was 140-150GB/s. That's a ~72% maximum utilization of the peak BW in a simple test. This 72% number is explained by the specific constraints occurring during simultaneous reads/writes of the ESRAM.

Semantics, but that's not what the architects said. They said:
Of course if you're hitting the same area over and over and over again, you don't get to spread out your bandwidth and so that's one of the reasons why in real testing you get 140-150GB/s rather than the peak 204GB/s...

I don't read that as a rare, edge-case scenario; I read that as a more realistic average in real-world applications, rather than the peak of 204 GB/s. (Which is 218GB/s now after the GPU speed bump?)
 
(Which is 218GB/s now after the GPU speed bump?)

You can't sustain writes for absolutely every single cycle. The writes is known to insert a bubble [a dead cycle] occasionally... one out of every eight cycles is a bubble so that's how you get the combined 204GB/s as the raw peak that we can really achieve over the ESRAM.

It's just 204, from the same link.
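For what it's worth, the 204 figure drops straight out of that quote. A rough reconstruction, assuming the widely reported 853MHz GPU clock and a 128-byte (1024-bit) path per direction per cycle:

```python
# Rough reconstruction of the 204GB/s combined ESRAM peak.
# Assumptions: 853 MHz GPU clock (post-speed-bump) and a 128-byte
# path per direction per cycle.
clock_hz = 853e6
bytes_per_cycle = 128

one_way = clock_hz * bytes_per_cycle / 1e9
print(f"One-way peak: {one_way:.1f} GB/s")        # ~109.2 GB/s

# Per the quote: writes lose one cycle in eight to a bubble, so the
# combined peak is a full read stream plus a 7/8-duty write stream.
combined = one_way * (1 + 7 / 8)
print(f"Combined peak: {combined:.1f} GB/s")      # ~204.7 GB/s
```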
 
"There are four 8MB lanes, but it's not a contiguous 8MB chunk of memory within each of those lanes. Each lane, that 8MB is broken down into eight modules. This should address whether you can really have read and write bandwidth in memory simultaneously,"

I have a question; I don't know if it was discussed before.

There are 4 lanes and each lane has 8MB. My question is: do the 4 lanes share the bandwidth?
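Taking the quote at face value (4 lanes × 8 modules of 1MB each), the bandwidth presumably comes from striping addresses across all four lanes, so any one buffer uses all of them at once. A purely hypothetical sketch, since the actual address interleaving isn't public:

```python
# Hypothetical mapping of a 32MB ESRAM address to (lane, module), with
# 4 lanes x 8 modules of 1MB each as in the quote. The real interleaving
# isn't public; this just shows how striping consecutive 64-byte lines
# across lanes would let a single buffer draw on all four lanes at once.
LINE = 64                # assumed interleave granularity, in bytes
LANES = 4
MODULE = 1 << 20         # 1MB per module

def esram_location(addr):
    lane = (addr // LINE) % LANES                 # stripe lines across lanes
    module = (addr // (LANES * MODULE)) % 8       # which 1MB module within the lane
    return lane, module

# Consecutive 64-byte lines land on different lanes:
for addr in (0, 64, 128, 192, 256):
    print(hex(addr), esram_location(addr))
```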
 
Yes, absolutely; this often-cited "peak" 204GB/s BW is different from the regular peak BW figures of GDDR5 and DDR3 RAM.

The Xbox architects themselves said that the best bandwidth they ever measured in a best-case test of the ESRAM was 140-150GB/s. That's a ~72% maximum utilization of the peak BW in a simple test. This 72% number is explained by the specific constraints occurring during simultaneous reads/writes of the ESRAM.

That's lower than the standard ~90% utilization rate reached in a similar best-case scenario with GDDR5 RAM. Also, this 72% can't be reached for all tasks; I suppose some tasks don't benefit from (or need) simultaneous reads and writes at all, or have an unbalanced read/write load.

Like you said, if you do only particle rendering and only need to write, then your peak BW is only 109GB/s, and the standard maximum real BW will then probably be around 90% of that, so you'll only reach ~98GB/s of usable BW for read-only or write-only tasks.
The architects didn't talk about a best case with the 140-150GB/s figure; they were talking about real-life code (still not highly optimized). The best case would still be 204GB/s.
Then you're not counting the CPU, which hurts the effective bandwidth of the GDDR5 memory pool (according to Sony, roughly twice the bandwidth the CPU actually needs is lost). Also, your operations are reads and writes that must alternate over and over (which results in lost cycles, too), so you won't reach the 90% you're writing about; you're far away from that.
Also, we are not talking about theoretical benchmarks, we are talking about games, so many reads and many writes are mixed from the CPU and GPU.
What people also forget: the ESRAM is for the GPU only; it is not shared across the whole system. You also have the DDR3 memory you can use at the same time.
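To make the CPU-contention point concrete, here's a toy model using this post's own rule of thumb that each GB/s the CPU needs costs roughly twice that in effective pool bandwidth (the 20GB/s CPU demand below is an illustrative guess, not a published figure):

```python
# Toy model of CPU contention on a shared GDDR5 pool, using this post's
# rule of thumb: each GB/s the CPU actually needs costs ~2 GB/s of
# effective pool bandwidth. The CPU demand below is an illustrative guess.
gddr5_peak = 176.0    # GB/s (PS4's GDDR5 peak)
cpu_demand = 20.0     # GB/s, hypothetical CPU traffic

gpu_effective = gddr5_peak - 2.0 * cpu_demand
print(f"Left for the GPU: {gpu_effective:.0f} GB/s")  # 136 GB/s
```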
 
particles fit in the caches so you're only using write speed
Not true at all. Alpha blending needs an equal amount of read and write bandwidth. ROP caches are very small (hundreds of kilobytes) compared to a 1080p RGBA16F footprint (~16MB), so they don't help that much, as particles are randomly scattered around the screen. For soft particles you need to read an extra 32 bits per pixel (the depth value). This read uses the L2 cache (but again, L2 is only hundreds of kilobytes in size, and you'd need several megabytes to make a big difference).
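The blending point is easy to see in code: a standard "over" blend has to read the existing framebuffer value before writing the new one, so every blended pixel costs one color-buffer read and one write. A minimal sketch:

```python
# Standard "over" alpha blending: the destination pixel must be READ
# before the result is WRITTEN back, so blending costs equal color-buffer
# read and write bandwidth for every blended pixel.
def blend_over(src_rgb, src_a, dst_rgb):
    # one framebuffer read (dst_rgb)...
    out = tuple(src_a * s + (1.0 - src_a) * d for s, d in zip(src_rgb, dst_rgb))
    # ...and one framebuffer write (the returned value)
    return out

print(blend_over((1.0, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0)))  # (0.5, 0.0, 0.5)
```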

---

However, you can render particles in small (for example 128x128 pixel) screen-space tiles. This way the ROP cache and the L2 cache can hold the whole tile. The GPU reads the RGBA16F color buffer and the 32-bit depth buffer once per pixel (from memory). All blending/overdraw and repeated depth sampling (for soft particles) occur in the caches, and the result (RGBA16F per pixel) is written once to memory. Total reads thus become 8+4=12 bytes per pixel and total writes become 8 bytes per pixel. I don't know if any games use this technique yet; nobody has mentioned it in their next-gen rendering articles.
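For illustration, here's a minimal CPU-side sketch of the binning step (the names and the screen-space bounding-box representation are made up; as noted below, a real implementation would do this in a compute shader):

```python
# Minimal CPU-side sketch of binning particles into 128x128 screen tiles.
# Particles spanning multiple tiles are appended to each tile's list (and
# hence processed more than once), which is the extra geometry cost noted.
TILE = 128

def bin_particles(particles, screen_w, screen_h):
    tiles_x = (screen_w + TILE - 1) // TILE
    tiles_y = (screen_h + TILE - 1) // TILE
    bins = [[] for _ in range(tiles_x * tiles_y)]
    for i, (x0, y0, x1, y1) in enumerate(particles):  # screen-space AABB per particle
        tx0 = max(0, int(x0) // TILE)
        ty0 = max(0, int(y0) // TILE)
        tx1 = min(tiles_x - 1, int(x1) // TILE)
        ty1 = min(tiles_y - 1, int(y1) // TILE)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[ty * tiles_x + tx].append(i)
    return bins

# Example: one particle straddling two tiles is binned (and processed) twice.
print(bin_particles([(100.0, 10.0, 180.0, 60.0)], 1920, 1080)[0:2])  # [[0], [0]]
```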

My prototype of this technique doubled particle rendering performance on a Radeon 7970. The 7970's fill rate is 29.6 GP/s. HDR soft particle rendering needs 8*2+4=20 bytes per pixel of bandwidth (if we disregard the read cost of DXT-compressed particle textures). The total thus becomes 592 GB/s at full fill rate. The 7970 has "only" 264 GB/s of bandwidth, so it is highly bandwidth-starved in this case. Tiling moves all the bandwidth-heavy processing to the caches, so the BW requirement is massively reduced, allowing the 7970 to reach its full fill rate. The downside of tiled particle rendering is that you need to bin the particles into tiles and you pay some extra geometry cost (because some particles belong to multiple tiles and thus need to be processed twice). Binning is done most efficiently in compute shaders (limiting this technique mostly to DX11+ engines, and also explaining why most developers haven't adopted it yet).
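Checking that arithmetic:

```python
# Bandwidth demand of untiled HDR soft particles at full fill rate (7970).
fill_rate = 29.6              # GP/s
bytes_per_pixel = 8 * 2 + 4   # RGBA16F read + write (blending) + 32-bit depth read
demand = fill_rate * bytes_per_pixel
print(f"{demand:.0f} GB/s needed vs 264 GB/s available "
      f"({demand / 264:.1f}x bandwidth starved)")  # 592 GB/s, ~2.2x
```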
 
Okay, thanks for the correction. So are particles a good candidate to reach higher BW on XB1's ESRAM?
Particle rendering (RGBA16F HDR soft particles) is the pass where we use the most total BW. The numbers are really good. I am not going to post our numbers here, because I am not willing to participate in the console wars :)
 
Cool. So ESRAM is a win in this case. It's good to know, as that's really where a scratchpad should shine. It does make one think, though: if the caches were larger, could particles be rendered as almost all writes? Then again, animated particles would immediately break that, so I suppose there's no point designing for it. My idea of a particle is, say, a 64x64 32-bit RGBA image, which is certainly small enough for a low-level cache.
 
It does make one think, though: if the caches were larger, could particles be rendered as almost all writes?
No, even if the caches were big enough to remove all overdraw, you'd still have to load every pixel of the backbuffer once (8 bytes read per pixel) and store it once (8 bytes written per pixel). Also, each pixel reads the depth buffer once (4 bytes per pixel) for soft particle blending. So in total that's 8 bytes of writes per pixel and 12 bytes of reads. Obviously you also need to read the particle texture as well, but that's likely DXT compressed, so it's only 1 byte per pixel (likely 2, as you also need a normal map). All the particle textures used by a single frame are unlikely to fit into the GPU L2 cache, so particle textures will be read multiple times. It's hard to estimate how much this costs, but you could roughly estimate that total BW usage is around 2x more reads than writes (16 bytes read per pixel + 8 bytes written per pixel).

At 1080p that is 32 MB of reads + 16 MB of writes = super cheap. Even an integrated netbook GPU (or a phone/tablet) could manage that BW cost easily. Caches are nice if you hit them :)
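The per-frame totals work out like this (straight arithmetic on the per-pixel figures above):

```python
# Per-frame memory traffic for tiled 1080p HDR soft particles, using the
# rough per-pixel estimate above: 16 bytes read + 8 bytes written per pixel.
pixels = 1920 * 1080
reads_mb = pixels * 16 / 2**20
writes_mb = pixels * 8 / 2**20
print(f"{reads_mb:.0f} MB reads + {writes_mb:.0f} MB writes per frame")  # 32 + 16
# Even at 60 fps that is under 3 GB/s of total bandwidth: "super cheap".
```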

Then again, animated particles would immediately break that, so I suppose there's no point designing for it.
Particles do animate, and each frame might have a separate texture (usually all the textures are put into one big atlas). Thus the particle textures themselves don't fit in L2 easily, but the biggest BW cost comes from the backbuffer reads & writes and the depth buffer read, and tiling solves those issues elegantly.
 
@sebbi,

I remember that the Halo Reach engine utilized cheap particles so that they could draw lots of them, which looks more impressive. Is this a similar technique to what you are doing?
 
Surely I'm not the only one who finds this statement confusing at best.
Me too. My understanding of alpha blending is that it's basically compositing two or more things, so surely you need more reads than writes!?! :???:
 
No, even if the caches were big enough to remove all overdraw, you'd still have to load every pixel of the backbuffer once (8 bytes read per pixel) and store it once (8 bytes written per pixel). Also, each pixel reads the depth buffer once (4 bytes per pixel) for soft particle blending. So in total that's 8 bytes of writes per pixel and 12 bytes of reads. Obviously you also need to read the particle texture as well, but that's likely DXT compressed, so it's only 1 byte per pixel (likely 2, as you also need a normal map). All the particle textures used by a single frame are unlikely to fit into the GPU L2 cache, so particle textures will be read multiple times. It's hard to estimate how much this costs, but you could roughly estimate that total BW usage is around 2x more reads than writes (16 bytes read per pixel + 8 bytes written per pixel).

At 1080p that is 32 MB of reads + 16 MB of writes = super cheap. Even an integrated netbook GPU (or a phone/tablet) could manage that BW cost easily. Caches are nice if you hit them :)

Particles do animate, and each frame might have a separate texture (usually all the textures are put into one big atlas). Thus the particle textures themselves don't fit in L2 easily, but the biggest BW cost comes from the backbuffer reads & writes and the depth buffer read, and tiling solves those issues elegantly.
Now that you mention the term "animate" and animation, I always wondered this about the new consoles, and after some of the things you've talked about regarding new graphics techniques that are going to become standard, the question is... :smile2:

Could the new consoles animate the Final Fantasy movie in real time?
 