The pros and cons of eDRAM/ESRAM in next-gen

Seems obvious now you mention it, but I'd not thought about this
Fast graphics or scratchpad memory is nothing new for us long-time console developers and has been the standard on Sony, Microsoft and Nintendo platforms for a long, long time. So we are always thinking about the memory layout and how to optimize it further. A good thing about the new DX11 GPUs is the native integer processing support. This makes it more efficient to bit-pack data tightly and to reconstruct float data from the packed representation (reinterpret cast is free, so you can abuse the IEEE floating point standard bit representation freely).
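To illustrate the kind of bit trickery sebbbi is describing, here is a minimal C++ sketch (my own example, not code from any engine) that packs two quantized values into one 32-bit word and moves the result through a float channel by reinterpreting the raw IEEE 754 bits; in a shader the same thing is done with asuint/asfloat and integer instructions.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Reinterpret the raw IEEE 754 bits of a float (free on GPUs: asuint/asfloat).
static uint32_t float_bits(float f)       { uint32_t u; std::memcpy(&u, &f, 4); return u; }
static float    bits_to_float(uint32_t u) { float f;    std::memcpy(&f, &u, 4); return f; }

// Pack two [0,1] values into one 32-bit word, 16 bits each.
static uint32_t pack2x16(float a, float b) {
    uint32_t ia = static_cast<uint32_t>(a * 65535.0f + 0.5f);
    uint32_t ib = static_cast<uint32_t>(b * 65535.0f + 0.5f);
    return (ia << 16) | ib;
}
static void unpack2x16(uint32_t p, float& a, float& b) {
    a = static_cast<float>(p >> 16)     / 65535.0f;
    b = static_cast<float>(p & 0xFFFFu) / 65535.0f;
}

int main() {
    // The packed word can live in an integer render target channel, or be
    // smuggled through a float channel via bit reinterpretation.
    uint32_t packed = pack2x16(0.25f, 0.75f);
    float carrier   = bits_to_float(packed);   // store as a "float"
    float a, b;
    unpack2x16(float_bits(carrier), a, b);     // recover on read
    std::printf("%f %f\n", a, b);
    return 0;
}
```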

Nothing in my post was specific to a single platform, except using the number 32 as the target for size optimization. I would do similar optimizations for other platforms as well. I am sure MJP would too (this layout was from a prototype).
 
If you mean MSAA, then I would say not likely at 1080p. MSAA doesn't really work with deferred rendering. I'd personally prefer advances in post-AA, like what Ryse is doing.

Correct me if I'm wrong, but 2x MSAA doubles the size of the depth buffer and 4x MSAA quadruples it.
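For a rough sense of scale, here is the arithmetic at 1080p (my own back-of-the-envelope numbers, assuming a plain 32-bit depth/stencil sample and ignoring any compression the hardware may apply):

```cpp
#include <cstdio>

int main() {
    const double width = 1920.0, height = 1080.0;
    const double bytesPerSample = 4.0;            // 32-bit depth/stencil
    const double mib = 1024.0 * 1024.0;

    for (int samples : {1, 2, 4}) {
        double sizeMiB = width * height * bytesPerSample * samples / mib;
        std::printf("%dx depth buffer: %.1f MiB\n", samples, sizeMiB);
    }
    // Prints roughly 7.9, 15.8 and 31.6 MiB: 2x MSAA doubles it, 4x quadruples it,
    // and the MSAA'd color targets grow the same way.
    return 0;
}
```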

It seems to me that you are right. However a proper PP-AA would be good, too.

Thanks again.
 
It seems to me that you are right. However a proper PP-AA would be good, too.
Post-process AA is quite good at 1080p. The pixels are so small that the missing sub-pixel information doesn't hurt as much as it does at 720p. Most last-gen AAA games were sub-HD (1152x720, 608p, etc.) and had cheap post-AA. That combination caused problems such as edge crawling and blurriness. 1080p pixels cover roughly 2.5x less area than those sub-HD pixels, making these issues much less problematic, and the new algorithms (SMAA) are also slightly better than the old ones. Post-AA also handles transparencies properly; MSAA, not so well.
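Where the 2.5x figure comes from (simple pixel-count arithmetic, using 1152x720 as the sub-HD example):

```cpp
#include <cstdio>

int main() {
    const double hd1080 = 1920.0 * 1080.0;   // 2,073,600 pixels
    const double subHD  = 1152.0 * 720.0;    //   829,440 pixels (a common last-gen target)
    // Same screen area divided into 2.5x more pixels => each pixel covers 2.5x less area.
    std::printf("pixel count ratio: %.1fx\n", hd1080 / subHD);   // 2.5
    return 0;
}
```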

MSAA is possible with deferred shading, but it requires quite a bit of effort to get the whole rendering pipeline working both efficiently and without any visual issues. It is still considerably slower than the best PP methods.
 
MJP's g-buffer layout is actually only two RTs in the g-buffer rendering stage and one RT in the lighting stage. And a depth buffer, of course. Quite normal stuff.

On GCN you want to pack your data into 64 bpp (4 x 16-bit integer) render targets, because that doubles your fill rate compared to using more traditional 32 bpp RTs (GCN can do 64-bit fills at the same ROP rate as 32-bit fills).

I assume that the packing is like this:
Gbuffer1 = normals + tangents (64 bit)
Gbuffer2 = diffuse + brdf + specular + roughness (64 bits)
Depth buffer (32 bits)

Without any modifications this takes 40 megabytes of memory (1080p).
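A quick check of that figure, assuming the layout above (two 64-bit targets plus a 32-bit depth buffer, no MSAA):

```cpp
#include <cstdio>

int main() {
    const double pixels   = 1920.0 * 1080.0;
    const double gbuffer1 = 8.0;   // 64-bit RT: normals + tangents
    const double gbuffer2 = 8.0;   // 64-bit RT: diffuse + brdf + specular + roughness
    const double depth    = 4.0;   // 32-bit depth buffer
    const double mib      = 1024.0 * 1024.0;

    double total = pixels * (gbuffer1 + gbuffer2 + depth) / mib;
    std::printf("G-buffer footprint at 1080p: %.1f MiB\n", total);   // ~39.6 MiB, i.e. ~40 MB
    return 0;
}
```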

The lighting step doesn't need an extra 8 MB for a 4x16f RT, because a compute shader can simultaneously read and write the same resource, allowing you to do lighting "in place", writing the output over the existing g-buffer. This is also very cache friendly, since the read pulls the cache lines into L1 and the write thus never misses L1 (GCN has fully featured read & write caches).
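A minimal CPU-side sketch of the in-place idea (my own illustration; on the GPU this would be a compute shader reading and writing the same UAV rather than C++): the lit result simply overwrites the g-buffer texel it was computed from, so no extra output target is ever allocated.

```cpp
#include <cstdint>
#include <vector>

// One packed 64-bit g-buffer texel (illustrative layout, not any particular engine's).
struct GBufferTexel {
    uint64_t packed;   // albedo + material params, bit-packed
};

// Placeholder: unpack, light, repack. The details don't matter for the memory pattern.
static uint64_t shadeTexel(uint64_t packed) {
    // ... unpack albedo/normal, evaluate lights, pack the final HDR color ...
    return packed ^ 0xFFFFull;   // dummy transform standing in for real lighting
}

// "In-place" lighting: read a texel, compute its lit color, write it back over
// the same memory. No separate lighting render target is needed.
void lightInPlace(std::vector<GBufferTexel>& gbuffer) {
    for (GBufferTexel& texel : gbuffer) {
        texel.packed = shadeTexel(texel.packed);   // read-modify-write, same resource
    }
}

int main() {
    std::vector<GBufferTexel> gbuffer(1920 * 1080);   // one 64-bit target at 1080p
    lightInPlace(gbuffer);
    return 0;
}
```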

It's also trivial to get this layout down from 40 MB to 32 MB. Replace gbuffer1 with a 32-bit RT (32 MB target reached at 1080p). Store the normal as 11+11 bits using a Lambert azimuthal equal-area projection; you can't see any quality difference. 5+5 bits for the tangents is enough (4 bits for the exponent = mip level + 1 bit of mantissa). 11+11+5+5=32. Also, if you only use the tangents for shadow mapping / other planar projections, you don't need them at all, since you can analytically calculate the derivatives from the stored normal vector.
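A sketch of that normal packing (my own implementation of the standard Lambert azimuthal equal-area encoding quantized to 11+11 bits; sebbbi's exact bit layout may differ, and the 5+5-bit tangent mini-float is omitted):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Encode a unit normal (view space, z > -1) into 11+11 bits using the
// Lambert azimuthal equal-area projection: p = n.xy / sqrt(2*(1+n.z)),
// which lands inside the unit disk and is then remapped and quantized.
uint32_t encodeNormal11_11(float nx, float ny, float nz) {
    float f  = std::sqrt(2.0f * (1.0f + nz));   // only breaks down at nz == -1
    float px = nx / f * 0.5f + 0.5f;            // remap to [0,1]
    float py = ny / f * 0.5f + 0.5f;
    uint32_t ix = static_cast<uint32_t>(px * 2047.0f + 0.5f);   // 11 bits
    uint32_t iy = static_cast<uint32_t>(py * 2047.0f + 0.5f);   // 11 bits
    return (ix << 11) | iy;                     // upper 10 bits left free for tangents etc.
}

void decodeNormal11_11(uint32_t enc, float& nx, float& ny, float& nz) {
    float px = static_cast<float>((enc >> 11) & 0x7FF) / 2047.0f;
    float py = static_cast<float>( enc        & 0x7FF) / 2047.0f;
    float fx = px * 4.0f - 2.0f;                // back to projection coordinates
    float fy = py * 4.0f - 2.0f;
    float f  = fx * fx + fy * fy;
    float g  = std::sqrt(1.0f - f * 0.25f);
    nx = fx * g;
    ny = fy * g;
    nz = 1.0f - f * 0.5f;
}

int main() {
    // Round-trip a test normal (already unit length).
    float n[3] = {0.36f, 0.48f, 0.8f};
    uint32_t enc = encodeNormal11_11(n[0], n[1], n[2]);
    float rx, ry, rz;
    decodeNormal11_11(enc, rx, ry, rz);
    std::printf("%f %f %f -> %f %f %f\n", n[0], n[1], n[2], rx, ry, rz);
    return 0;
}
```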

This layout is highly efficient for both g-buffer rendering and lighting, and of course also for post processing, since all your heavy data fits in the fast memory. Shadow maps obviously need to be sampled from main memory during the lighting, but this actually works out great, since the lighting pass wouldn't otherwise use any main memory BW at all (it would be completely unused = wasted).

Great info, thanks for posting.

I take it this means you guys had no trouble getting Trials Fusion to run @ 1080p60fps on the bone ;)
 
I looked at the Chipworks die shot of the Xbox One chip, and the 32MB of eSRAM is actually two 16MB eSRAM macros/blocks, if I'm correct. The Xbox One CPU also has an L3 cache/eSRAM between the two CPU modules, so I wonder whether that L3 cache/eSRAM is directly accessible to the GPU, and how big it is...

In case it is available, could it be possible to offload some information there?

Well, it should be around 2+MB of memory. Maybe (just an assumption) this is a buffer for the 4 move engines.
The main eSRAM blocks can be accessed in 512KB blocks, so the 4 move engines may also have 512KB each to free up the main eSRAM quickly, so it can be used for other stuff while its content is being moved to main memory. This way the eSRAM doesn't have to wait until the data has been moved to main memory before new data can be written into the fast buffer.
The move engines can also do some processing, so they would need their own fast memory for that; otherwise they would have to go through the slow main memory, which I can't believe.
 
MJP's g-buffer layout is actually only two RTs in the g-buffer rendering stage and one RT in the lighting stage. And a depth buffer, of course. Quite normal stuff.

On GCN you want to pack your data into 64 bpp (4 x 16-bit integer) render targets, because that doubles your fill rate compared to using more traditional 32 bpp RTs (GCN can do 64-bit fills at the same ROP rate as 32-bit fills).

I assume that the packing is like this:
Gbuffer1 = normals + tangents (64 bit)
Gbuffer2 = diffuse + brdf + specular + roughness (64 bits)
Depth buffer (32 bits)

Without any modifications this takes 40 megabytes of memory (1080p).

The lighting step doesn't need an extra 8 MB for a 4x16f RT, because a compute shader can simultaneously read and write the same resource, allowing you to do lighting "in place", writing the output over the existing g-buffer. This is also very cache friendly, since the read pulls the cache lines into L1 and the write thus never misses L1 (GCN has fully featured read & write caches).

It's also trivial to get this layout down from 40 MB to 32 MB. Replace gbuffer1 with a 32-bit RT (32 MB target reached at 1080p). Store the normal as 11+11 bits using a Lambert azimuthal equal-area projection; you can't see any quality difference. 5+5 bits for the tangents is enough (4 bits for the exponent = mip level + 1 bit of mantissa). 11+11+5+5=32. Also, if you only use the tangents for shadow mapping / other planar projections, you don't need them at all, since you can analytically calculate the derivatives from the stored normal vector.

This layout is highly efficient for both g-buffer rendering and lighting, and of course also for post processing, since all your heavy data fits in the fast memory. Shadow maps obviously need to be sampled from main memory during the lighting, but this actually works out great, since the lighting pass wouldn't otherwise use any main memory BW at all (it would be completely unused = wasted).


Food for thought ....


Andrew Goossen (Xbox One architect):

Of course with Xbox One we're going with a design where ESRAM has the same natural extension that we had with eDRAM on Xbox 360, to have both going concurrently. It's a nice evolution of the Xbox 360 in that we could clean up a lot of the limitations that we had with the eDRAM. The Xbox 360 was the easiest console platform to develop for, it wasn't that hard for our developers to adapt to eDRAM, but there were a number of places where we said, "Gosh, it would sure be nice if an entire render target didn't have to live in eDRAM," and so we fixed that on Xbox One where we have the ability to overflow from ESRAM into DDR3 so the ESRAM is fully integrated into our page tables and so you can kind of mix and match the ESRAM and the DDR memory as you go.

Sometimes you want to get the GPU texture out of memory and on Xbox 360 that required what's called a "resolve pass" where you had to do a copy into DDR to get the texture out - that was another limitation we removed in ESRAM, as you can now texture out of ESRAM if you want to. From my perspective it's very much an evolution and improvement - a big improvement - over the design we had with the Xbox 360. I'm kind of surprised by all this, quite frankly.
 
But why are developers having a hard time with the ESRAM when they aren't dependent on it as much as they were with the EDRAM on the 360?
 
But why are developers having a hard time with the ESRAM when they aren't dependent on it as much as they were with the EDRAM on the 360?
I do not believe that devs have a hard time with it; I would think it is just new, and the MSFT environment, going by the report, is still evolving / not mature.
I would not discount business considerations either: the user base is still small, they did not have much time, and in a shifting environment, fine-tuning the engine at this point in time might not be the priority, and so on.
The situation on the PS4 might not be perfect either, but its simple design and software environment might make up for that, partly at least.

I think the issue is overstated; things should get better "fast" (by that I mean it should take a lot less time to adapt to the system's pros/cons than it took on the 360).
I've read no "real" complaint so far, nothing that compares to what one could read in the PS360 era.
 
Post-process AA is quite good at 1080p. The pixels are so small that the missing sub-pixel information doesn't hurt as much as it does at 720p. Most last-gen AAA games were sub-HD (1152x720, 608p, etc.) and had cheap post-AA. That combination caused problems such as edge crawling and blurriness. 1080p pixels cover roughly 2.5x less area than those sub-HD pixels, making these issues much less problematic, and the new algorithms (SMAA) are also slightly better than the old ones. Post-AA also handles transparencies properly; MSAA, not so well.

MSAA is possible with deferred shading, but it requires quite a bit of effort to get the whole rendering pipeline working both efficiently and without any visual issues. It is still considerably slower than the best PP methods.

Do you have triple buffering at 1080p too?
 
Why would a CPU limitation result in lower rendering resolutions? The games are running at 60fps; the question is why the Xbox One is struggling to get above 720p at that framerate. This includes MGSV, CoD:Ghosts, Battlefield 4 and Titanfall.

IMO, the reason is quite simply the PS4. If you have two platforms, one with more logic on a simpler architecture, it's a simple choice for developers to target that system as their ideal machine. Then you take the second platform, the Xbox One, with less logic and a more complicated architecture, and something's gotta give.

If there was no PS4, I have no doubts that games would be pushing 1080/30 and 1080/60 consistently on the One, though of course with slightly less demanding visuals.

The reason, I believe, is that it's simply easier to focus on getting the max out of the preferred platform and limit the secondary one in either framerate or resolution (depending on where your main bottlenecks are). It's a lot easier than, say, downgrading assets and complexity to reach the same framerate and resolution on paper.

This may improve over time due to better libraries and resources, or it may not. I guess that entirely depends on the success of the Xbox One. If it falls further behind in sales and marketshare, developers might continue on this path (or eventually drop support altogether for some games). I think the difference is exaggerated a bit due to these reasons.
 
I do not believe that devs have a hard time with it.
It depends what people mean by 'hard time'. It's probably not too taxing to work within the design, but if they are also targeting two other platforms without any memory-structure concerns, the added faff of managing buffers on XB1 could constitute a 'hard time'. Let's say they have a renderer up and running nicely on PS4 and PC with all the features they want, and then they port it to XB1 and it doesn't fit nicely. They either have to re-engineer the FB structure, shaders, etc. for everyone, or come up with an XB1-specific build, or just leave it to wrestle with a less-than-ideal FB structure and substandard performance. That's probably where the 'hard time' regarding launch-title performance comes in.
 
On PS4, the max read peak is still the full BW, ~150. On XB1 it's ~100, right?

It would be interesting to see, in many-reads/few-writes scenarios, which one (GDDR5 or eSRAM) gets the better benefit in the end.

I do not think those situations are so infrequent, after all.
 
On PS4, the max read peak is still the full BW, ~150. On XB1 it's ~100, right?

It would be interesting to see, in many-reads/few-writes scenarios, which one (GDDR5 or eSRAM) gets the better benefit in the end.

I do not think those situations are so infrequent, after all.

No, max read performance on Xbox One is 109GB/s + DDR3.
Max write performance is 109GB/s + DDR3.

Max mixed performance is 204GB/s + DDR3.

The eSRAM and the DDR3 can be used at the same time, and it is also possible to read and write the eSRAM at the same time. Another difference is that you have the really high bandwidth for just 32MB of memory, while for the big memory pool you don't have that high bandwidth.
If you are only writing/reading a 32MB chunk of the whole GDDR5 memory, you get really low bandwidth for that chunk if the 32MB is not spread across all memory modules (you cannot access one module with the full bandwidth). You can only reach almost the same performance (but not the low latency) for the render target (if it fits inside 32MB) if you spread it across all memory modules. If you really used all the bandwidth the eSRAM has, you would have a problem achieving the same with GDDR5, but normally you wouldn't really use all the bandwidth.

But at least this means you can read and write inside the 32MB really, really often, something you couldn't do with DRAM.
At the same time, if your render target is bigger than 32MB, you lose some speed, but you are still faster than having your whole render target in GDDR5 memory (in theory).

The bandwidth shouldn't be a problem at all on the Xbone; in current games it is more the drivers and the GPU that are just too slow to use it.

But it would be interesting to know how much actual bandwidth the render target consumes in current games.
 
No, max read performance on Xbox One is 109GB/s + DDR3.
Max write performance is 109GB/s + DDR3.

Max mixed performance is 204GB/s + DDR3.

The eSRAM and the DDR3 can be used at the same time, and it is also possible to read and write the eSRAM at the same time. Another difference is that you have the really high bandwidth for just 32MB of memory, while for the big memory pool you don't have that high bandwidth.
If you are only writing/reading a 32MB chunk of the whole GDDR5 memory, you get really low bandwidth for that chunk if the 32MB is not spread across all memory modules (you cannot access one module with the full bandwidth). You can only reach almost the same performance (but not the low latency) for the render target (if it fits inside 32MB) if you spread it across all memory modules. If you really used all the bandwidth the eSRAM has, you would have a problem achieving the same with GDDR5, but normally you wouldn't really use all the bandwidth.

But at least this means you can read and write inside the 32MB really, really often, something you couldn't do with DRAM.
At the same time, if your render target is bigger than 32MB, you lose some speed, but you are still faster than having your whole render target in GDDR5 memory (in theory).

The bandwidth shouldn't be a problem at all on the Xbone; in current games it is more the drivers and the GPU that are just too slow to use it.

But it would be interesting to know how much actual bandwidth the render target consumes in current games.


You're forgetting that GDDR5 operates at a MUCH higher clock rate, so in the same cycle that you run a read and write for the eSRAM, GDDR5 could have completed more than 6 read OR write cycles.

Remember:

GDDR5 - 176GB/s (256 pins * 5500MHz / 8 bits)
1 cycle transfers 256 bits, done in 0.182 ns

DDR3 - 68GB/s (256 pins * 2133MHz / 8 bits)
1 cycle transfers 256 bits, done in 0.469 ns

eSRAM - 109GB/s + 95GB/s = 204GB/s (128 bytes * 853MHz) (this is the theoretical max, which is, quite frankly, unachievable)
1 cycle transfers 1024+896 bits, done in 1.172 ns

In one cycle of the DDR3, the GDDR5 runs ~2.5 cycles.
In one cycle of the eSRAM, the GDDR5 runs ~6.4 cycles.
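For what it's worth, the cycle-time arithmetic above checks out (a sketch using only the clock rates quoted in this thread; real DRAM adds latency and turnaround on top of this):

```cpp
#include <cstdio>

int main() {
    // Effective data clocks quoted in the thread (MHz) and bits moved per clock.
    const double gddr5_mhz = 5500.0, gddr5_bits = 256.0;
    const double ddr3_mhz  = 2133.0, ddr3_bits  = 256.0;
    const double esram_mhz = 853.0,  esram_bits = 1024.0 + 896.0;  // read + average write

    auto report = [](const char* name, double mhz, double bits) {
        double ns_per_cycle = 1000.0 / mhz;                       // 1/MHz is microseconds; *1000 = ns
        double gbytes_per_s = mhz * 1e6 * bits / 8.0 / 1e9;
        std::printf("%-6s %6.1f GB/s, %.3f ns per cycle\n", name, gbytes_per_s, ns_per_cycle);
    };
    report("GDDR5", gddr5_mhz, gddr5_bits);   // ~176 GB/s, 0.182 ns
    report("DDR3",  ddr3_mhz,  ddr3_bits);    // ~68 GB/s,  0.469 ns
    report("eSRAM", esram_mhz, esram_bits);   // ~205 GB/s, 1.172 ns

    std::printf("GDDR5 cycles per DDR3 cycle:  %.1f\n", gddr5_mhz / ddr3_mhz);   // ~2.6
    std::printf("GDDR5 cycles per eSRAM cycle: %.1f\n", gddr5_mhz / esram_mhz);  // ~6.4
    return 0;
}
```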

Thus, if we look at it from a "benefit of reading and writing" standpoint, there is pretty much no benefit.
So what if you can do them both together?
I'll just break the load up into two different cycles of read and write and there will be no issues!

And your example is a really odd one that aims to break the PS4 by forcing it to do what the XB1 does, which probably nobody will even try to do.
Why would you want to write to a small 32MB block so many times if you can just distribute it across the full 8GB, and write to lots of different 32MB blocks at once / read from lots of 32MB blocks?
You're applying a 32MB limit to a system that has no such limit, so this mode of operation won't even be attempted.




This also brings up an industrial-engineering point of view when comparing the types of memory.
From a bandwidth point of view, GDDR5 is actually favored over eSRAM if other things (like size and bandwidth) are held constant.

Small batches at high frequency generally allow much better throughput than large batches at lower frequency.

Thus, if you have two hypothetical memory systems, both with a bandwidth of 100GB/s and running under similar conditions, the one operating at 4000MHz with 200 pins will achieve better throughput than one operating at 1000MHz with 800 pins.
 
If you are only writing/reading a 32MB chunk of the whole GDDR5 memory, you get really low bandwidth for that chunk if the 32MB is not spread across all memory modules (you cannot access one module with the full bandwidth)

...you believe that memory is spread sequentially across every module and that you achieve 256-bit bus speed only if you access 8 bytes at this address and 8 bytes at the address that is 512MB later? Yeah, of course. I am *sure* AMD designed their MCT+DCT to work that way :rolleyes:

On a more realistic approach... you missed my question.

Given L2+*NB+MCT etc., what if my shaders read a lot of data and write relatively little (like into a buffer)? Could/would the GDDR5 (not considering precharge...) achieve much better results?

I am just thinking of the most likely data-feed scenarios that come to mind. Any ideas?
 
You're forgetting that GDDR5 operates at a MUCH higher clock rate, so in the same cycle that you run a read and write for the eSRAM, GDDR5 could have completed more than 6 read OR write cycles.
You need more cycles for one read on DRAM; GDDR5 is even worse than DDR3 in that respect. There are cycles that are not effective at all. The higher clock rate only helps to reduce the latency, not the number of cycles needed; a higher clock rate means that more cycles get lost on GDDR5/DDR3. Only if you read or write large chunks can you use the cycles a bit more efficiently on DRAM. And yes, the latencies are much lower on the eSRAM, which means the bandwidth can be used more effectively (small reads/writes). The higher clock rate makes the latency of GDDR5 and DDR3 almost equal (latency, not bandwidth), but the eSRAM still has much better latency than DDR3. Latency only matters where many small operations are done; GDDR5 is good for big things (like textures) that are not accessed frequently. And if you often switch between reads and writes, you lose even more cycles, which again means a loss of bandwidth.
And because MS uses DDR3 with lower bandwidth than the GDDR5 in the PS4, they can't afford to lose any bandwidth on those ineffective cycles, so they use eSRAM to compensate: many of the small operations can be done in the small eSRAM where they don't hurt the bandwidth.

Thus, if we look at it from a "benefit of reading and writing" standpoint, there is pretty much no benefit.
So what if you can do them both together?
I'll just break the load up into two different cycles of read and write and there will be no issues!

And your example is a really odd one that aims to break the PS4 by forcing it to do what the XB1 does, which probably nobody will even try to do.
Why would you want to write to a small 32MB block so many times if you can just distribute it across the full 8GB, and write to lots of different 32MB blocks at once / read from lots of 32MB blocks?
You're applying a 32MB limit to a system that has no such limit, so this mode of operation won't even be attempted.
Please think about it for a moment. You have small buffers on the GPU that must be filled with data; that means small reads/writes. The 32MB was only the render target. On PS4 it might be a little bit bigger, because you "only" have one memory pool, but the situation is still the same: if the render target is not spread all over the memory, you can never reach the eSRAM bandwidth. The worst case would be that your render target sits in just one physical memory module (one module is 512MB of memory), so you would be limited to ~11GB/s max. The only way to get it faster is to spread it across all memory modules so you can theoretically reach the max bandwidth, but now you have even smaller chunks, which reduces the effectiveness of DRAM memory tricks, and you actually lose more bandwidth.

I'm not saying it is the holy grail, but the eSRAM is really, really fast for its size.
And all I said in my last post was that bandwidth is not the limiting factor in Xbone development so far.

...you believe that memory is spread sequentially across every module and that you achieve 256-bit bus speed only if you access 8 bytes at this address and 8 bytes at the address that is 512MB later? Yeah, of course. I am *sure* AMD designed their MCT+DCT to work that way :rolleyes:
That would be the absolute worst case. But at the same time you don't want to block the 256-bit memory interface for just one read/write; there are parallel requests that also want to read/write something. Every module can only do one task at a time and then switch (DRAM loses many cycles on that). So the max bandwidth strongly depends on how well the data is shared between the modules and which other tasks want to read/write in those modules. The problem here is that you don't want to read/write one big render target, you only want small parts of it, and that costs the DRAM many cycles.

On a more realistic approach... you missed my question.

Given L2+*NB+MCT etc., what if my shaders read a lot of data and write relatively little (like into a buffer)? Could/would the GDDR5 (not considering precharge...) achieve much better results?

I am just thinking of the most likely data-feed scenarios that come to mind. Any ideas?
No, the eSRAM would still be much faster, but only if it fits into the eSRAM. It would even be faster if it were only a bit bigger than the eSRAM, because you can also use the DDR3 at the same time to read/write.
The big plus of the eSRAM is that it can almost read and write simultaneously. Together with the bandwidth and lower latencies, for parallel tasks (like almost everything a GPU does) the eSRAM is the fastest way to read/write anything. But if you are forced to use the DDR3 (because most data is still in there), the GDDR5 would be better ;)
 
If you are only writing/reading a 32MB chunk of the whole GDDR5 memory, you get really low bandwidth for that chunk if the 32MB is not spread across all memory modules (you cannot access one module with the full bandwidth).
AMD stripes its graphics memory across all channels. For proper utilization, either the eSRAM stripes addresses across its arrays or the software has to do so.
It has banking conflicts as well.


Given L2+*NB+MCT etc., what if my shaders read alot of data and write relatively less (like in a buffer). Could/would the gddr5 (not considering precharge...) achieve much better results?

I am just thinking to most possible data-feed scenarios that may come to my mind. Any idea?

External DRAM works best with long periods of the same kind of access, due to the amount of time it takes to physically turn the bus and device around. Failure to do so will massively gut its bandwidth. eSRAM should be relatively immune to this, although some turnaround may be the reason why it cannot achieve a perfect doubling of its bandwidth in read+write scenarios.

The two memory pools emphasize different scenarios. DRAM likes heavily lopsided read/write mixes, or at least mixes where it can go for as long as possible without switching modes.
eSRAM doesn't get near its peak with mixes that don't get near 50:50, but it can handle the mixed access case without requiring that reads and writes occur in exclusive chunks.
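A toy model of that turnaround effect (entirely made-up numbers, purely for illustration; real memory controllers are far more sophisticated): if the bus loses a fixed number of dead cycles every time it switches between reading and writing, effective DRAM bandwidth collapses as the mix approaches 50:50, while a memory that can read and write in the same cycle barely notices.

```cpp
#include <cstdio>

// Fraction of peak bandwidth kept when 'switches' read<->write turnarounds occur
// per 'burstCycles' useful transfer cycles, each turnaround costing 'penaltyCycles'.
double dramEfficiency(double burstCycles, double switches, double penaltyCycles) {
    double useful = burstCycles;
    double wasted = switches * penaltyCycles;
    return useful / (useful + wasted);
}

int main() {
    const double peakGBs = 176.0;      // hypothetical GDDR5 peak
    const double penalty = 8.0;        // assumed dead cycles per read/write turnaround

    // Long same-direction streams: one turnaround per 1000 useful cycles.
    std::printf("streaming:   %.0f GB/s\n", peakGBs * dramEfficiency(1000.0, 1.0, penalty));
    // Heavily interleaved reads/writes: one turnaround every 8 useful cycles.
    std::printf("interleaved: %.0f GB/s\n", peakGBs * dramEfficiency(8.0, 1.0, penalty));
    return 0;
}
```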
 
No, max read performance on Xbox One is 109GB/s + DDR3.
Max write performance is 109GB/s + DDR3.

Max mixed performance is 204GB/s + DDR3.

No, and it's not even the peak bandwidth in practice because of the read/write mix. The XB1 architects said in a Eurogamer interview that they reached a maximum of 140-150GB/s (let's say 145GB/s) in some rare special cases.

You can't sustain writes for absolutely every single cycle. The writes is known to insert a bubble [a dead cycle] occasionally... one out of every eight cycles is a bubble so that's how you get the combined 204GB/s as the raw peak that we can really achieve over the ESRAM. And then if you say what can you achieve out of an application - we've measured about 140-150GB/s for ESRAM.
If the XB1 architects, in a simple limited scenario, reached a maximum of 145GB/s bandwidth with the eSRAM, probably nobody will ever do better than them.

In a similar simple scenario we know GDDR5 can reach 90-91% of the peak BW, so roughly 160GB/s out of 176GB/s in a real ideal case.

Without DDR3, in a similar ideal case, 160GB/s > 145GB/s. You can then add 50-55GB/s of BW from the DDR3, reaching ~200GB/s maximum ideal cumulative BW.

You can add that to the external memory and say that that probably achieves in similar conditions 50-55GB/s and add those two together you're getting in the order of 200GB/s across the main memory and internally
In conclusion:
XB1 - 200GB/s cumulative BW with a PS2-style slow RAM / fast VRAM split
PS4 - 160GB/s unified memory.
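A rough sanity check of the numbers quoted above (a sketch using only the peak and quoted figures from this thread; it says nothing about what real game workloads achieve):

```cpp
#include <cstdio>

int main() {
    // eSRAM: 109GB/s each way, but one cycle in eight is a bubble when mixing
    // reads and writes, which is where the quoted 204GB/s combined peak comes from.
    const double esram_one_way = 109.0;
    double esram_combined_peak = esram_one_way + esram_one_way * 7.0 / 8.0;   // ~204 GB/s
    double esram_measured      = 145.0;       // figure quoted from the interview

    // GDDR5 at ~91% efficiency vs. measured eSRAM + DDR3 in a similar ideal case.
    double ps4_ideal = 176.0 * 0.91;          // ~160 GB/s
    double xb1_ideal = esram_measured + 55.0; // ~200 GB/s cumulative

    std::printf("eSRAM combined peak:    %.0f GB/s\n", esram_combined_peak);
    std::printf("PS4 ideal:              %.0f GB/s\n", ps4_ideal);
    std::printf("XB1 ideal (cumulative): %.0f GB/s\n", xb1_ideal);
    return 0;
}
```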
 
No, and it's not even the peak bandwidth in practice because of the read/write mix. The XB1 architects said in a Eurogamer interview that they reached a maximum of 140-150GB/s (let's say 145GB/s) in some rare special cases.

If the XB1 architects, in a simple limited scenario, reached a maximum of 145GB/s bandwidth with the eSRAM, probably nobody will ever do better than them.

In a similar simple scenario we know GDDR5 can reach 90-91% of the peak BW, so roughly 160GB/s out of 176GB/s in a real ideal case.

Without DDR3, in a similar ideal case, 160GB/s > 145GB/s. You can then add 50-55GB/s of BW from the DDR3, reaching ~200GB/s maximum ideal cumulative BW.

In conclusion:
XB1 - 200GB/s cumulative BW with a PS2-style slow RAM / fast VRAM split
PS4 - 160GB/s unified memory.

We've been over this. The architects got the 145GB/s figure for the eSRAM by measuring real games. It is not some benchmark test that maxes out the bandwidth with bogus code, so if you are comparing game code on the X1 against synthetic code on the PS4, I don't see how that's a good comparison.
 