The Xbox 1's ESRAM

sebbbi · Jun 27, 2013

liolio said:
Anandtech made a statement while reviewing Haswell+CrystalWell that was a bit mysterious to me. Iirc he stated that Intel did not keep the frame buffer in CW. We don't know what that cache policies are but I would think that at some points it could be (or part of it) in the L4. Anyway if true I think that whatever happens later some render targets are written first straight to the main RAM (out of the ROPs and their caches).

nAo said in the Haswell thread that Intel's ROP caches are backed up by the whole cache hierarchy (http://beyond3d.com/showpost.php?p=1756702&postcount=550). He is working at Intel so that information is likely correct. We should assume that both L3 and L4 (in GT3e) cache ROP results. This doesn't mean that the whole frame buffer would (always) be in the L4 cache, but the least recently accessed cache lines of the frame buffer are likely there.

It is good that Intel doesn't permanently reserve some part of the L4 cache as GPU frame buffer, as all the CPU memory requests also go though it. Someone might want to use the L4 cache to boost pure CPU based processing (such as sparse voxel octree rendering)

.

Gipsel · Jun 27, 2013

liolio said:
Well Trinity manages to render with less that half of that figure with less ROPs.

I guess the performance target for XB1 as well as PS4 is well above Trinity.

liolio said:
Sebbbi had this interesting post about optimizing for the size of the ROPs cache (in the core section), his optimization was dependant on the amounts of cache which is depend to the amount of RBE partition in the GPU, the PS4 has twice the ROPs and matching cache structures of Durango.

Even with that it is much too small. The PS4 likely has 128kB color cache in the ROPs (if it is the same size as in AMD GPUs). That is way too small if you don't resort to rendering quite small tiles as sebbi tested (which is not very convenient or even impossible to do efficiently in the general case). To get a significant and consistent effect, one would need to increase the size by two orders of magnitudes or so (at least several megabytes), so that it can hold major parts or better the complete render target, which would reduce the needed bandwidth for ROP operations. Currently, these ROP caches merely use the spatial locality of the fragments in a wavefront to use the memory bandwidth more efficiently (loading and evicting complete tiles of the render target at once). It coalesces reads and combine writes.

AlNom · Jun 27, 2013

Gipsel said:
That is way too small if you don't resort to rendering quite small tiles as sebbi tested (which is not very convenient or even impossible to do efficiently in the general case).

TBDS ought to be a decent fit or was that what was tested :?:

Gipsel · Jun 27, 2013

AlNets said:
what was tested

The discussion starts here. It was basically some particle rendering.

sebbbi said:
(Slightly OT, continued from my above post)

I did some extra testing with the GCN ROP caches. Assuming the mobile versions have equally sized ROP caches than my Radeon 7970 (128 KB), it looks like rendering in 128x128 tiles might help these new mobile APUs very much, as APUs are very much BW limited.

Particle rendering has a huge backbuffer BW cost, especially when rendering HDR particles to 4x16f backbuffer. Our system renders all particles using a single draw call (particle index buffer is depth sorted, and we use premultiplied alpha to achieve both additive and alpha channel blended particles simultaneously). It is actually over 2x faster to brute force render a draw call containing 10k particles 60 times to 128x128 tiles (move scissor rectangle across a 1280x720 backbuffer) compared to rendering it once (single draw call, full screen). And you can achieve this kinds of gains by spending 15 minutes (just a brute force hack). With a little bit of extra code, you can skip particle quads (using geometry shader) that do not land on the active 128x128 scissor area (and save most of the extra geometry cost). This is a good way to reduce particle overdraw BW cost to zero. A 128x128 tile is rendered (alpha blended) completely inside the GPU ROP cache. This is especially a good technique for low BW APUs, but it helps even the Radeon 7970 GE (with massive 288 GB/s BW).

With this technique, soft particles gain even more, since the full screen depth texture reads (128x128 area) fits the GCN 512/768 KB L2 cache (and become BW free as well). Of course Kepler based chips should have similar gains (but I don't have one for testing).

If techniques like this become popular in future, and developers start to spend lots of time in optimizing for the modern GPU L2/ROP caches, it might make larger GPU memory pools (such as the Haswell 128 MB L4 cache) less important. It's going to be interesting too see how things pan out.

liolio · Jun 27, 2013

sebbbi said:
nAo said in the Haswell thread that Intel's ROP caches are backed up by the whole cache hierarchy (http://beyond3d.com/showpost.php?p=1756702&postcount=550). He is working at Intel so that information is likely correct. We should assume that both L3 and L4 (in GT3e) cache ROP results. This doesn't mean that the whole frame buffer would (always) be in the L4 cache, but the least recently accessed cache lines of the frame buffer are likely there.

It is good that Intel doesn't permanently reserve some part of the L4 cache as GPU frame buffer, as all the CPU memory requests also go though it. Someone might want to use the L4 cache to boost pure CPU based processing (such as sparse voxel octree rendering) .

Ok that makes a lot more sense than that elusive sentence I read at Anandtech more in line with what a cache should be /do.

Gipsel said:
I guess the performance target for XB1 as well as PS4 is well above Trinity.

Well that is not fair to what I wrote :

my-self said:
Well Trinity manages to render with less that half of that figure with less ROPs.
The system is to behave in a different manner as the PS4 anyway so I'm not sure "DDR3 would not be fast enough" is a good way to approach things.

Durango could definitely render in the main RAM, it has as (~) much bandwidth to it as a HD77xx.

And I did not say either that the scratchpad would be left untouched.

Even with that it is much too small. The PS4 likely has 128kB color cache in the ROPs (if it is the same size as in AMD GPUs). That is way too small if you don't resort to rendering quite small tiles as sebbi tested (which is not very convenient or even impossible to do efficiently in the general case). To get a significant and consistent effect, one would need to increase the size by two orders of magnitudes or so (at least several megabytes), so that it can hold major parts or better the complete render target, which would reduce the needed bandwidth for ROP operations. Currently, these ROP caches merely use the spatial locality of the fragments in a wavefront to use the memory bandwidth more efficiently (loading and evicting complete tiles of the render target at once). It coalesces reads and combine writes.

Interesting details, I asked something on the matter to sebbbi in the afforementioned thread, actually reading your comment sort of answer my question /why optimize for the size of the whole caches vs local share of the color cache. I won't go into the detail of explaining what I was not getting right.

Still are you sure it would not be possible to extend that approach. The main overhead seems to be the number of drawcalls. On console they are supposedly extremely cheap and actually even on PC Andrew L made clear that overhead of drawcalls is greatly over-estimated. I mean sebbbi got a x2 speed-up, it is neat as it doesn't seem to rely on external bandwidth.

Gipsel · Jun 27, 2013

liolio said:
Still are you sure it would not be possible to extend that approach. The main overhead seems to be the number of drawcalls. On console they are supposedly extremely cheap and actually even on PC Andrew L made clear that overhead of drawcalls is greatly over-estimated. I mean sebbbi got a x2 speed-up, it is neat as it doesn't seem to rely on external bandwidth.

I'm far from being an expert on this matter, but I would assume that it doesn't look too rosy if one has some heavy geometry work where one can't determine beforehand in which tile the result will end up. That means it has to be duplicated for each tile. If one isn't absolutely bandwidth limited and the shaders are almost completely idle, small tiles (sebbi used 60 tiles for it's 1280x720 render target, 1080p would be more like 130 tiles) will get prohibitive. That means one would like to have significantly larger caches to use fewer but larger tiles. Otherwise this tiling approach will only help a certain subset of problems suited to it.

sebbbi · Jun 27, 2013

Gipsel said:
doesn't look too rosy if one has some heavy geometry work where one can't determine beforehand in which tile the result will end up. That means it has to be duplicated for each tile.

That's true if you have a traditional CPU driven renderer that renders each (whole) object with a separate draw call (or uses traditional geometry instancing). However if you have a GPU (compute shader) driven renderer, you can split your geometry to much smaller patches, and the tiling geometry overhead will hurt much less. For this technique to be efficient in general case (triangle mesh rendering) it needs the whole renderer to be designed around it.

However for particle effect rendering (one vertex per particle + alpha blending with huge overdraw), simple geometry shader culling (not create quad for particles not in viewport) should be enough to make it quite efficient. Of course compute shader could be used to bin particles to tiles to further improve the efficiency.

I must point out that my GPU cache optimization trick (that was taken from another thread) was about Tahiti, Kepler and Haswell ROP caches (and L3 in case of Haswell). I will leave the Xbox One speculation for you guys

BeyondTed · Jul 2, 2013

Gunhead said:
AFAICS, ESRAM doesn't mean "embedded SRAM"; it's a specific type of SRAM designed by Enhanced Memory. Microsoft aparently licenced the IP from them?

Do you have any interesting links for Enhanced Memory Systems, Inc?

I tried to find some but not much yet. Some association with Ramtron in Colorado Springs. Also some link that would not load about them combining SRAM with eDRAM.

Would be interested in any good technical links for them and their products.

dobwal · Jul 4, 2013

BeyondTed said:
Do you have any interesting links for Enhanced Memory Systems, Inc?

I tried to find some but not much yet. Some association with Ramtron in Colorado Springs. Also some link that would not load about them combining SRAM with eDRAM.

Would be interested in any good technical links for them and their products.

I looked around and could find anything other than it was a company based on a partnership between Ramtron and Cypress that seems defunct.

However, MS may have rolled their own SRAM solution. The BSC (Barcelona Super Computing) Microsoft Research Centre has been researching what they call RDC/dvSRAM/eSRAM ("eSRAM" is not readily used but appeared in a scientific journal figure in place of "RDC") which is a hardware based Transactional Memory cache. Its reconfigurable SRAM that can act as general purpose SRAM or reconfigure itself into a HTM solution where half of the memory serves shadow copies.

It could explain why we have rumors MS is having yield or heat problems with eSRAM. It also could be an explanation of the "5 billion transistors " figure as its two 6 T cell connected together with 2 inverters which are paired to 4 extra transistors. Its also able to perform dual writes per cycle. It is designed to write the same data to two different cells.

blakjedi · Jul 4, 2013

dobwal said:
IIt could explain why we have rumors MS is having yield or heat problems with eSRAM. It also could be an explanation of the "5 billion transistors " figure as its two 6 T cell connected together with 2 inverters which are paired to 4 extra transistors. Its also able to perform dual writes per cycle. It is designed to write the same data to two different cells.

Ah that explains why they were surprised at its capability to actually meet exceed their projections. Excellent find again Mr. Dobwal. This is getting interesting. Onto Gamescom.

Esrever · Jul 4, 2013

liolio said:
When I look at the size of durango SoC I really wonder if MSFT went with eSRAM because they though it was better or simply because they had no choice.
I mean it is unclear if AMD could have done something akin to CrystalWell or even considered doing it (I hope they will react to Intel move and "copy" it). Intel acknowledged that they did no really need more than 32MB (though I guess they will find use for the others 96MB I would be surprised if they don't), looking at the WiiU producing 32MB of external cache should not result in a crazy big chip with the matching costs.
Now Intel cache hit rates are awesome may be CW would not serve AMD chip as successfully.

Either way on chip scratchpad provides significant performances improvements (outside of bandwidth). So far we can only guess.

My personal bet is that MSFT engineers might be looking CW with envy (and so might be Sony or I would guess lot of people in the industry).

Microsoft has said they went with ESRAM for easier developer access. Edram is inflexible and requires a controller and logic. Not something MS wanted to deal with again. They wanted fast memory that the gpu and cpu can access with low latency. Crystalwell doesn't actually do much except caches for the GPU and has a huge bus connected to it. I doubt MS wanted something like that in the first place.That and the $1000 price tag of it makes it impossible for a console.

LoStranger said:
Hi there,

I'm aware the X1 employs 32MB ESRAM @ 102 GB/S I wanted to ask a few questions regarding this

1) Why the decision to go with ESRAM instead of GDDR5 particularly considering the fact the Xbox 360 used GDDR3 Unified memory

2) What benefits does having 8GB DDR3 + 32 MB of ESRAM have over the PS4'S 8GB GDDR5

Latency and efficiency are some advantages. GDDR5 is great but it doesn't offer what MS is going for. It was uncertain if it would be ready in time. MS would have probably gone with a GDDR5 + ESRAM system if they thought they could push for it early in the life cycle.

The latency for the Esram will allow for better performance of HSA because the intercommunication between CPU and GPU if MS chooses to implement it in such a way. The whole system if built for efficiency. They are banking on shrinking of the ESRAM with die shrinks to keep cost down in the future.

The tiled memory MS showed off at BUILD are some things they can do with that memory that MS has been planning for a while I would guess. If they do it right, the memory subsystem in the xbox will be more robust than in the playstation, with dedicate engine for compression and memory movement, the bandwidth advantage shouldn't be much of a difference. The latency advantages will be helpful if there is enough ESRAM for the CPU and GPU. The Jaguar cores are put into 4 core modules while have very hard time communicating between modules, they also lack an L3. Meaning MS can leverage this as a decent CPU advantage if they choose to. It all depends on how much ESRAM there is to go around when they are trying to use the GPU and CPU.

If MS reserves 4-8MB for cpu+GPU L3 16MB for tiled memory access and 8 MB for a partial frame buffer, it would allow for better cpu performance, a good gpu memory implementation and generally better efficiency with anything that runs one both the CPU and GPU. I don't know what MS is actually doing with the memory so thats all just my speculation. MS didn't just was 1.6B transistor so they could say they have a 5B transistor chip. There is uses for it and they know it. They had their vision. They know what they are doing in terms of SoC design.

TheWretched · Jul 4, 2013

So, by saying "it writes the same data to two cells simultaneously", is it akin to say Raid 1? I mean, what "good" would such a duplication of data be, for RAM?

Tchock · Jul 4, 2013

If anything Crystalwell seems over-engineered to serve two generations of Iris graphics; since it's built on a n-1 process and the end product is high margin, cost/size is much less of an issue. I think any Broadwell/Haswell v2 chip might still use the same die considering Intel would be thinking of hi-DPI displays beyond HD resolutions too.

liolio · Jul 4, 2013

Tchock said:
If anything Crystalwell seems over-engineered to serve two generations of Iris graphics; since it's built on a n-1 process and the end product is high margin, cost/size is much less of an issue. I think any Broadwell/Haswell v2 chip might still use the same die considering Intel would be thinking of hi-DPI displays beyond HD resolutions too.

Well it is not, they will deploy it on any product they deem worthy. It is coming to Xeon Phi for example.
I wonder if it could help in the mobile realm to (though nothing is announced yet). I think they chose something big enough so it is "stable" for a while.

Esrever, I disagree MSFT did not have the choice to use CrystalWell, plain and simple and that is not correct from my POV:

Crystalwell doesn't actually do much except caches for the GPU and has a huge bus connected to it.

Crystal web is connected by a pretty narrow, high speed link. A cache does a lot, a lot more than a scratchpad. Though latency to off chips memory is always higher than on chip memory be it cache or scratchpad, I agree. Though as we speak of Intel the latency figures are quite low, but I guess it is the same with the main memory be it for their CPU or GPU the latency for uncached access to the RAM are impressively low, way faster than the competition.
And that I'm not sure about how it makes sense:

Microsoft has said they went with ESRAM for easier developer access. Edram is inflexible and requires a controller and logic. Not something MS wanted to deal with again. They wanted fast memory that the gpu and cpu can access with low latency.

Cache is as straight forward as it gets, you can optimize for it but it "works" by itself even if you don't. Scratchpad doesn't work by it-self it requires too controllers and logic.

dobwal · Jul 4, 2013

TheWretched said:
So, by saying "it writes the same data to two cells simultaneously", is it akin to say Raid 1? I mean, what "good" would such a duplication of data be, for RAM?

http://www.ece.ubc.ca/~aamodt/papers/wwlfung.micro2011.pdf

3dilettante · Jul 4, 2013

Tchock said:
If anything Crystalwell seems over-engineered to serve two generations of Iris graphics; since it's built on a n-1 process and the end product is high margin, cost/size is much less of an issue. I think any Broadwell/Haswell v2 chip might still use the same die considering Intel would be thinking of hi-DPI displays beyond HD resolutions too.

The last I saw for Crystalwell, it was made on a variant of Intel's 22nm process.

Gubbi · Jul 5, 2013

Speculation: The size is down to two things:
1. Pad limitations.
2. Multiple purposes. I'm confident we'll see Xeon products utilizing the 128MB edram cache for use with DDR3 memory subsystem, it'll do wonders for a lot of workloads.

Cheers

TheWretched · Jul 5, 2013

dobwal said:
http://www.ece.ubc.ca/~aamodt/papers/wwlfung.micro2011.pdf

I am sorry, but I have little time to read and understand a technical paper at the moment, thus I asked this question here to people who have read it (and I "quoted" a person, too).

So again... why would it make sense to write the same data to two different cells? Unless you can read said data from both cells at twice the speed, it doesn't make a lot of sense to me.

liolio · Jul 5, 2013

Gubbi said:
Speculation: The size is down to two things:
1. Pad limitations.
2. Multiple purposes. I'm confident we'll see Xeon products utilizing the 128MB edram cache for use with DDR3 memory subsystem, it'll do wonders for a lot of workloads.

Cheers

I remember reading some posts on RWT stating that CW is an OTS now at Intel, they indeed aimed at more than graphics with it.
Some were wondering about the mobile realm to, 128MB may allow to fit the basic kernel+ a few light weigtht services and thus enabling the RAM to be completely turned off. I guess it may require a couple of optimizations but that is an interesting idea.

Esrever · Jul 6, 2013

how much performance would a 32MB cache gain for the CPU? This would be the easiest and dumbest way to use ESRAM, just use the whole thing as a cache.

http://www.xbitlabs.com/articles/cpu/display/core2duo-e7200_4.html
this shows a C2D with 3mb vs 6mb of cache and its about a 10% improvement on that. I can only assume it gets better with 32MBs and more CPU resources to use the cache.

The Xbox 1's ESRAM

sebbbi

Gipsel

AlNom

Moderator

Gipsel

liolio

Aquoiboniste

Gipsel

sebbbi

BeyondTed

dobwal

blakjedi

Esrever

TheWretched

Tchock

liolio

Aquoiboniste

dobwal

3dilettante

Gubbi

TheWretched

liolio

Aquoiboniste

Esrever

Similar threads