Questions about Xbox One's ESRAM & Compute

Well, unlike the 360, the X1's main memory offers mediocre bandwidth for the GPU.

Another benefit of eSRAM is its higher bandwidth, partly due to the dual-ported nature of this type of 6T SRAM. If they had gone with DRAM, not only would there be higher latency, but there would be no dual porting, effectively cutting the theoretical peak bandwidth.
 
What if your data needs more than ~68 GB/s of read bandwidth? Then you have to have it in the eSRAM.

You have to factor in that the 68 GB/s covers both reads and writes and is shared with the other components, whereas the eSRAM should be used exclusively by the GPU. Then there is the issue of DRAM behaviour, which can hurt you when you switch between reads and writes, and when those come from different components (CPU, GPU, audio, ...).
So in reality you don't even have the full 68 GB/s for the GPU (GDDR5 would be a little worse at the same bandwidth, but since it usually offers much higher bandwidth, it isn't as bad). This is where the move engines can do their job by loading only tiles of textures into the eSRAM. That not only saves memory but also a lot of bandwidth and CPU cycles (the move engines should do it much faster than the CPU could), and even some GPU cycles.
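To put rough numbers on that tiling idea, here is a back-of-the-envelope sketch (texture size, tile size, and the number of tiles touched per frame are all made-up assumptions, not real figures):

```python
# Illustrative arithmetic only: compare streaming a whole texture into
# eSRAM with streaming only the tiles a frame actually touches.
BYTES_PER_TEXEL = 4             # RGBA8
TEX_W = TEX_H = 2048            # assumed 2048x2048 texture
TILE = 64                       # assumed 64x64 tile

full_texture = TEX_W * TEX_H * BYTES_PER_TEXEL    # bytes for the whole texture
one_tile = TILE * TILE * BYTES_PER_TEXEL          # bytes per tile
visible_tiles = 200                               # assumed per-frame working set

print(full_texture // 2**20, "MiB whole texture")         # 16 MiB
print(visible_tiles * one_tile // 2**20, "MiB of tiles")  # 3 MiB
```

Even with generous assumptions, moving only the touched tiles is a fraction of the cost of moving the whole texture.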

Most of the GPU's reads and writes are done in the eSRAM, so the DDR3 bandwidth is not as important for the GPU (textures are not everything) as it would be with only one memory pool.
The only thing is, the textures must be loaded into RAM once, or streamed intelligently, but that shouldn't consume that much bandwidth.

One small benefit of the eSRAM is that it runs at the same clock speed as the GPU, so you save some synchronization cycles (once you have your data there). Again, this would be worse with just the external memory pool. It should be a small efficiency boost for the GPU.
 
Most of the GPU's reads and writes are done in the eSRAM, so the DDR3 bandwidth is not as important for the GPU (textures are not everything) as it would be with only one memory pool.

Saying that the GPU can perform most of the read/writes on a 32MB eSRAM is a very unrealistic statement.

The move engine doesn't bypass the 68 GB/s the DDR3 has either, unless it obtained the data elsewhere.
 
Saying that the GPU can perform most of the read/writes on a 32MB eSRAM is a very unrealistic statement.
Not all reads/writes, but most of them, yes.
Many small reads/writes cripple DRAM bandwidth (68 GB/s is the peak only if you do nothing but reads, or nothing but writes), and if multiple components want to read/write you don't have the slightest chance of reaching that peak. If you can do most of your reads/writes in the eSRAM, that not only saves bandwidth, it is also faster, because no other component will disturb the GPU. And once you've been writing for a while and want to do something else, you move the result to DRAM in "one" write operation (thanks to the move engines, without bothering the CPU). That saves the DRAM many reads/writes. Used this way, the eSRAM is just a buffer, but you are free to use it any other way.
E.g., in the Xbox 360 the eDRAM was used to save main-memory bandwidth. Why should it be different this time, when you've got even more of it, lower latency, more bandwidth, and more capabilities (e.g., the move engines)?
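A toy traffic count of that buffering idea (the pass count and target size are assumptions for illustration, not measurements from any real workload):

```python
# Toy model: a chain of render passes where each intermediate result is
# written once and read once by the next pass.
MB = 2**20
target = 32 * MB      # assumed render target that just fits in eSRAM
passes = 4            # assumed number of passes in the chain

# Everything in DRAM: each pass writes its result, and every intermediate
# result is read back by the following pass.
dram_only = passes * target + (passes - 1) * target

# Intermediates kept in eSRAM: DRAM only sees the final resolve copy.
with_esram = target

print(dram_only // MB, "MB of DRAM traffic without eSRAM")   # 224 MB
print(with_esram // MB, "MB with eSRAM as a scratch buffer") # 32 MB
```

The absolute numbers are invented; the point is just that keeping intermediates in the fast pool multiplies the savings across the whole pass chain.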


The move engine doesn't bypass the 68 GB/s the DDR3 has either, unless it obtained the data elsewhere.
You're right, it doesn't bypass the 68 GB/s (I never wrote that). But because only tiles of a texture are loaded (really small fractions), the reads are much smaller, and you don't have to write every tiny result back to the DRAM. That is what helps save bandwidth.

Every read/write has overhead: you don't just spend bandwidth on the bytes you read/write, you also lose many cycles. For large reads/writes the overhead is not that alarming, but the smaller the transfers get, the more inefficient the DRAM becomes. That's why you won't get anywhere near the 68 GB/s if you're only using the DRAM.
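A crude model of that effect (the overhead and transfer costs are invented numbers, chosen only to show the trend, not to match real DDR3 timings):

```python
# Each DRAM access pays a fixed overhead (row activation etc., assumed)
# before the data itself transfers; small accesses amortize it badly.
PEAK_GBPS = 68.0
OVERHEAD_CYCLES = 20      # assumed fixed cost per access
CYCLES_PER_KB = 10        # assumed transfer cost proportional to size

def effective_bandwidth(kb_per_access: float) -> float:
    """Effective GB/s when every access moves kb_per_access kilobytes."""
    transfer = kb_per_access * CYCLES_PER_KB
    return PEAK_GBPS * transfer / (transfer + OVERHEAD_CYCLES)

for size in (0.25, 1, 4, 64):
    print(f"{size:6} KB accesses -> {effective_bandwidth(size):5.1f} GB/s")
```

With these assumed costs, quarter-kilobyte accesses see only a small fraction of peak, while large streaming accesses get close to it, which is the shape of the argument above.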

The good thing is that every Xbox is a dev kit, so we can test this at home :)
 
You're right, it doesn't bypass the 68 GB/s (I never wrote that). But because only tiles of a texture are loaded (really small fractions), the reads are much smaller, and you don't have to write every tiny result back to the DRAM. That is what helps save bandwidth.

Every read/write has overhead: you don't just spend bandwidth on the bytes you read/write, you also lose many cycles. For large reads/writes the overhead is not that alarming, but the smaller the transfers get, the more inefficient the DRAM becomes. That's why you won't get anywhere near the 68 GB/s if you're only using the DRAM.

The good thing is that every Xbox is a dev kit, so we can test this at home :)

WRT the Move Engines, texture tiling means swizzling, not the PRT texture tiling that people think of.
 
WRT the Move Engines, texture tiling means swizzling, not the PRT texture tiling that people think of.

It's converting linear data into 8*8 (usually) square tiles.

These tiles are then spread across the ESRAM to help give the ideal access pattern for simultaneous read/write.
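A minimal sketch of such a linear-to-tiled mapping (a generic 8x8 row-major tiling for illustration; the real hardware modes are more elaborate):

```python
# Map texel coordinates to their offset in an 8x8-tiled layout.
TILE = 8

def tiled_offset(x: int, y: int, width: int) -> int:
    """Texel index of (x, y) when the surface is stored as 8x8 tiles."""
    tiles_per_row = width // TILE
    tile_x, tile_y = x // TILE, y // TILE          # which tile
    in_x, in_y = x % TILE, y % TILE                # position inside the tile
    tile_index = tile_y * tiles_per_row + tile_x
    return tile_index * TILE * TILE + in_y * TILE + in_x

# All 64 texels of one 8x8 block land in one contiguous run:
width = 64
offsets = [tiled_offset(x, y, width) for y in range(8) for x in range(8)]
print(offsets[0], offsets[-1])   # 0 63
```

So an 8x8 neighbourhood that was scattered across eight rows in linear order becomes one contiguous 64-texel block, which is what makes the access pattern friendly.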
 
Well, unlike the 360, the X1's main memory offers mediocre bandwidth for the GPU.

Another benefit of eSRAM is its higher bandwidth, partly due to the dual-ported nature of this type of 6T SRAM. If they had gone with DRAM, not only would there be higher latency, but there would be no dual porting, effectively cutting the theoretical peak bandwidth.

Actually, it's a bit worse than the 360 (compared to the PS3) when compared to the PS4, but not by much.

The 360's GPU had 22.4 GB/s to main memory. The PS3's GPU had the same 22.4 GB/s to GDDR3, but could also access the XDR for an extra 25.6 GB/s, bringing the total bandwidth available to the GPU to 48 GB/s. The ratio between them is 2.14.

The PS4 has a single 176 GB/s memory pool and the Xbox One a 68 GB/s one; the ratio between them is 2.59.
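Those ratios, recomputed from the quoted figures:

```python
# Last generation: 360 GPU vs PS3 GPU total bandwidth.
x360_gpu = 22.4                # GB/s, 360 GPU to main memory
ps3_gpu = 22.4 + 25.6          # GB/s, GDDR3 plus XDR access
print(round(ps3_gpu / x360_gpu, 2))    # 2.14

# This generation: PS4 unified pool vs Xbox One DDR3 (eSRAM not counted).
ps4 = 176.0
xb1_ddr3 = 68.0
print(round(ps4 / xb1_ddr3, 2))        # 2.59
```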

And this time the system was designed with hardware compression in mind, and per the vgleaks info the data stays compressed almost the entire time, so the effective bandwidth should be even bigger.
 
It's converting linear data into 8*8 (usually) square tiles.

These tiles are then spread across the ESRAM to help give the ideal access pattern for simultaneous read/write.
Actually, there are a whole lot of different tiling modes (with GCN there are apparently 32 different ones, optimal one depends on the format, and AMD renamed it now to swizzle as to not confuse people with the PRT tiles). And this is not just done for the eSRAM. When copying texture data from the system RAM to VRAM on a PC, the data also get swizzled/tiled. It helps the DRAM access basically in the same way as it helps the eSRAM access. And it is also good for optimizing the spatial locality for the caches (you need to fetch less cachelines to cover the area the texture filters sample). That's why it is a capability which is also part of the usual DMA engines.
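A small illustration of the cache-line point: count the distinct 64-byte lines a 4x4 filter footprint touches in a linear layout versus an 8x8-tiled one (surface width, texel size, and footprint are assumptions chosen for the example):

```python
# Distinct 64-byte cache lines touched by a 4x4 texel footprint,
# linear layout vs 8x8 tiles.
LINE = 64
BPT = 4            # bytes per texel (RGBA8 assumed)
WIDTH = 1024       # assumed surface width in texels
TILE = 8

def linear_addr(x, y):
    return (y * WIDTH + x) * BPT

def tiled_addr(x, y):
    tiles_per_row = WIDTH // TILE
    tile = (y // TILE) * tiles_per_row + (x // TILE)
    return (tile * TILE * TILE + (y % TILE) * TILE + (x % TILE)) * BPT

def lines_touched(addr_fn, x0, y0, size=4):
    return len({addr_fn(x0 + dx, y0 + dy) // LINE
                for dy in range(size) for dx in range(size)})

print(lines_touched(linear_addr, 0, 0))  # 4 lines in the linear layout
print(lines_touched(tiled_addr, 0, 0))   # 2 lines when 8x8 tiled
```

Fewer lines fetched per filter footprint is exactly the spatial-locality benefit described above.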
 
Actually, there are a whole lot of different tiling modes (with GCN there are apparently 32 different ones, optimal one depends on the format, and AMD renamed it now to swizzle as to not confuse people with the PRT tiles). And this is not just done for the eSRAM. When copying texture data from the system RAM to VRAM on a PC, the data also get swizzled/tiled. It helps the DRAM access basically in the same way as it helps the eSRAM access. And it is also good for optimizing the spatial locality for the caches (you need to fetch less cachelines to cover the area the texture filters sample). That's why it is a capability which is also part of the usual DMA engines.

Absolutely, I was trying to help clarify Beta's comment because I've also seen tiling/swizzling confused with swizzling operators.
 
Why would you copy data from ddr3 to esram before beginning to work on it? That copy operation is limited to 68 GB/s just as working directly from the ddr3 would be. Data should go from ddr3 directly to gpu, with intermediate results possibly written back to esram. Subsequent steps may read from and write to DDR or esram pools as needed depending on where the data resides and whether it would most benefit from high bandwidth or not. This talk of an extra step of copying to esram first, then beginning work, just doesn't make any sense to me.
 
Why would you copy data from ddr3 to esram before beginning to work on it? That copy operation is limited to 68 GB/s just as working directly from the ddr3 would be. Data should go from ddr3 directly to gpu, with intermediate results possibly written back to esram. Subsequent steps may read from and write to DDR or esram pools as needed depending on where the data resides and whether it would most benefit from high bandwidth or not. This talk of an extra step of copying to esram first, then beginning work, just doesn't make any sense to me.

Of course you can skip this step, but the move engines work asynchronously, so you can give them something to do while you work on something else.
 
Doing that, essentially prefetching data into the eSRAM, only makes sense when the DDR bandwidth isn't already being heavily utilized, as the move engines still operate over the common buses. With 68 GB/s to DDR, I'm not sure how often that will be the case.
 
If you have static data that you think would strongly benefit from being in ESRAM, then (I would assume that) the only way to get it in there is to copy it. For instance if you pre-voxelized your static data and wanted to cone trace or ray cast into it you'd want the bandwidth, but since the GPU isn't generating it every frame you'd need to copy it into ESRAM before the part of the frame that uses it. The same would probably go for any GPU-generated data that needs to be used in the frame long after it was generated.
 
You only use the Move Engines when it makes sense to, but the spin suggests that you would ALWAYS use the Move Engine and/or GPU, which consumes BW; then you've "wasted" BW because you have to copy. See how a fallacy is formed right there?
 
If you have static data that you think would strongly benefit from being in ESRAM, then (I would assume that) the only way to get it in there is to copy it. For instance if you pre-voxelized your static data and wanted to cone trace or ray cast into it you'd want the bandwidth, but since the GPU isn't generating it every frame you'd need to copy it into ESRAM before the part of the frame that uses it. The same would probably go for any GPU-generated data that needs to be used in the frame long after it was generated.
That's very true, but don't forget the eSRAM is only 32 MB. The very maximum you'll be copying across from DDR3 to eSRAM is 32 MB per frame. Worst case is about a gig a second, and you'll be making net gains on that BW by having it in eSRAM; otherwise you wouldn't bother moving it there. ;)
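For what it's worth, the arithmetic behind "about a gig a second" (assuming a full 32 MB refill every frame at 30 fps):

```python
# Worst-case DDR3 -> eSRAM copy traffic: refill all 32 MB each frame.
MB = 2**20
esram_size = 32 * MB
fps = 30                                   # assumed frame rate
print(esram_size * fps / 1e9, "GB/s")      # ~1.0 GB/s
# At 60 fps it would be ~2 GB/s, still a small slice of the 68 GB/s.
```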
 
If you have static data that you think would strongly benefit from being in ESRAM, then (I would assume that) the only way to get it in there is to copy it. For instance if you pre-voxelized your static data and wanted to cone trace or ray cast into it you'd want the bandwidth, but since the GPU isn't generating it every frame you'd need to copy it into ESRAM before the part of the frame that uses it. The same would probably go for any GPU-generated data that needs to be used in the frame long after it was generated.

Most tree-based or tree-derived search structures have great temporal locality for the top nodes and virtually none for the leaves (i.e. most leaf nodes are untouched). You only want to store the top n levels in eSRAM and let the rest remain in main memory, and only if you're thrashing the GPU caches. Say you can store the top 10 levels of your search tree in a few MB, but the next two levels would require an order of magnitude more. You'd probably be better off with the few megs and let other tasks use the remaining eSRAM. Transferring a few MB into eSRAM every frame is virtually free (i.e. < 0.25% of per-frame bandwidth @ 60 Hz).
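Checking the "< 0.25% of per-frame bandwidth @ 60 Hz" figure (the exact transfer size is an assumption here, taking "a few MB" as 2.8 MB):

```python
# Fraction of one frame's DDR3 bandwidth spent copying a few MB to eSRAM.
ddr3_bps = 68e9
per_frame_bytes = ddr3_bps / 60     # ~1.13 GB of DDR3 traffic per frame
transfer_bytes = 2.8e6              # assumed "few MB" of top tree levels
fraction = transfer_bytes / per_frame_bytes
print(f"{fraction:.2%}")            # ~0.25%
```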

IMO, the real strength of the ESRAM, together with memory mapping, is the flexibility it offers to do really sophisticated placement of structures in memory. It will also keep developers busy for a few years before they get the maximum from the system.

Cheers
 
I wasn't trying to make any comment on performance or drawbacks, I was just answering the question "Why would you copy data from ddr3 to esram before beginning to work on it?"
 