Questions about Xbox One's ESRAM & Compute

onQ

Veteran
I've been wondering about the Xbox One's 32MB of ESRAM & how it could work with compute.

Could it be possible to just use the DDR3 for the normal graphical tasks while using the 32MB of ESRAM for the compute shaders, or to have a whole lighting engine that's run in compute using the fast embedded RAM?
 
Yes. The XB1 memory architecture appears to be completely flexible, so devs can do whatever they want with it. Whether you'd want to or not is a different question. If you only want 68 GB/s for your entire graphics and processing workloads, go for it!
 
It's worth mentioning that it isn't all or nothing. You could reserve a few MB for compute and have the GPU use the rest for graphics.

Latency-sensitive algorithms benefit the most from ESRAM, so you might tailor your memory layout to get the maximum benefit: for example, keeping the top nodes of a k-d tree, quadtree or whatever search structure you use in ESRAM, and the bottom nodes and leaves in DDR3 memory.
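To make that concrete, here's a minimal HLSL sketch of such a split. All names, the node layout and the split depth are hypothetical, and how a resource actually gets placed in ESRAM is platform-specific and not expressible in stock HLSL:

Code:
// Hypothetical split: the first TOP_LEVELS of the tree live in an
// ESRAM-resident buffer, the rest in a DDR3-resident buffer.
struct KdNode
{
    float splitPos;
    uint  axisAndFlags; // packed split axis + leaf flag
    uint  children;     // packed child indices
    uint  payload;
};

#define TOP_LEVELS     10                         // hypothetical split depth
#define TOP_NODE_COUNT ((1u << TOP_LEVELS) - 1u)  // nodes in the top levels

StructuredBuffer<KdNode> TopNodes    : register(t0); // ESRAM-resident (low latency)
StructuredBuffer<KdNode> BottomNodes : register(t1); // DDR3-resident

KdNode FetchNode(uint index)
{
    // The upper levels are touched on every traversal, so they benefit
    // most from the low-latency pool; the lower levels are hit rarely.
    if (index < TOP_NODE_COUNT)
        return TopNodes[index];
    return BottomNodes[index - TOP_NODE_COUNT];
}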

Cheers
 
For example, keeping the top nodes of a k-d tree, quadtree or whatever search structure you use in ESRAM, and the bottom nodes and leaves in DDR3 memory.

In most games, where the 3D view is relative to the player's position, I could see potential for these optimizations. I still haven't been able to figure out the impact of PRT on ESRAM, but it seems like a potential candidate for performance. Maybe someone with experience in this field can elaborate/explain whether or not PRT would be a good candidate for ESRAM usage?
 
Didn't MS demonstrate tiled resources using 16 MB of memory to display a 3 GB texture? Is it possible that just utilizing half of the eSRAM would effectively let Durango house 3+ GB of texture data on chip?

They talked about how tiled resources work by treating portions of GPU memory and DDR3 as a sort of large unified texture buffer. I guess the DMEs could manage things like moving texture data between the ESRAM and DDR3 while the GPU is busy performing internal operations, as well as using the swizzle logic to manipulate texture data residing in ESRAM and DDR3 before it hits the GPU.
 
Didn't MS demonstrate tiled resources using 16 MB of memory to display a 3 GB texture? Is it possible that just utilizing half of the eSRAM would effectively let Durango house 3+ GB of texture data on chip?
No. 16 MBs of ESRAM allows for 16 MBs of texture. If that 16 MBs is exactly what you need to texture every pixel on screen, it's enough. If you need more than 16 MBs of tiles to render the screen, you'll need to copy tiles into the ESRAM, which is something of a waste of BW if you aren't going to reuse the tile. If the scene changes, you need to load in more tiles also.

Allocating ESRAM to textures depends on reutilisation. Frame-to-frame changes could be low enough that it's plausible, although then you have less RAM for everything else, and a key point of PRT is reducing BW requirements. So you'd be using up ESRAM space to save a tiddly amount of BW and to reduce latency on texturing, which is one area where GPUs are specifically designed to accommodate high latency. Doesn't make sense to me.

Using PRT for non-texture data (especially stuff you write to as well as read) could make some sense as others suggest, but storing your level textures in there seems pointless to me. Maybe for destruction (like writing bullet holes to textures), keeping the immediate tiles in, drawing on them, and then swapping them out as the player moves would be worth doing?
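As a rough sketch of that bullet-hole idea (all names and sizes hypothetical), a compute shader could splat a decal into a tile kept resident in the fast memory:

Code:
Texture2D<float4>   Decal : register(t0);
RWTexture2D<float4> Tile  : register(u0); // e.g. a 128x128 resident tile

cbuffer DecalParams : register(b0)
{
    int2 impactTexel; // impact position in tile coordinates
    int2 decalSize;
};

[numthreads(8, 8, 1)]
void SplatDecal(uint3 tid : SV_DispatchThreadID)
{
    if (any(tid.xy >= (uint2)decalSize))
        return;

    // DX11 allows typed UAV *stores* to float4 formats; we deliberately
    // avoid reading the tile back, since typed UAV loads are restricted.
    float4 d = Decal[tid.xy];
    if (d.a > 0.5f)
        Tile[impactTexel + (int2)tid.xy - decalSize / 2] = d;
}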
 
No. 16 MBs of ESRAM allows for 16 MBs of texture. If that 16 MBs is exactly what you need to texture every pixel on screen, it's enough. If you need more than 16 MBs of tiles to render the screen, you'll need to copy tiles into the ESRAM, which is something of a waste of BW if you aren't going to reuse the tile. If the scene changes, you need to load in more tiles also.

Allocating ESRAM to textures depends on reutilisation. Frame-to-frame changes could be low enough that it's plausible, although then you have less RAM for everything else, and a key point of PRT is reducing BW requirements. So you'd be using up ESRAM space to save a tiddly amount of BW and to reduce latency on texturing, which is one area where GPUs are specifically designed to accommodate high latency. Doesn't make sense to me.

Using PRT for non-texture data (especially stuff you write to as well as read) could make some sense as others suggest, but storing your level textures in there seems pointless to me. Maybe for destruction (like writing bullet holes to textures), keeping the immediate tiles in, drawing on them, and then swapping them out as the player moves would be worth doing?

Thanks shifty.

Looked around and found the most detailed description of PRT I could come across. It contains more information than the more general sources provide, and it was written by AMD for a SIGGRAPH 2012 virtual texturing course.

http://www.jurajobert.com/data/Virtual_Texturing_in_Software_and_Hardware_course_notes.pdf
 
I've been wondering about the Xbox One's 32MB of ESRAM & how it could work with compute.

Could it be possible to just use the DDR3 for the normal graphical tasks while using the 32MB of ESRAM for the compute shaders, or to have a whole lighting engine that's run in compute using the fast embedded RAM?

It's possible that the embedded RAM makes read-write buffers a "fast path". Normally, when you do multiple passes with a compute shader, you have the choice between:
a) double-buffering the buffer you work on, so that input and output are two different resources, one read-only, the other write-only
b) using a single read-write buffer
This has a few implications for performance. The read-write buffer cannot be a texture; it has to be a rw UAV, and rw UAVs need to support atomics. In effect, memory I/O via rw UAVs is slow in comparison to ro/wo UAVs and textures. On the other hand, depending on the architecture, rw accesses can go through the cache and bandwidth is saved. There are also some texture scenarios which are faster or slower: if read as unfiltered 4-byte blobs, textures (in the disguise of a ro UAV) can be read at maximum memory-controller throughput if you get the access pattern right; if filtered, though, they have to go through the [sometimes] slowest path of the memory system.

So, embedded RAM might make rw UAVs just about as fast as regular-RAM ro UAVs, with the positive side effects that you have fewer resources allocated, less bandwidth spent, and, if done smartly, none or only a small fraction of the original writes going off-chip (which is good for the CPU cache).
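A minimal HLSL sketch of the two options above (resource names hypothetical):

Code:
// (a) double-buffered: read-only input, write-only-style output;
// swap the two bindings between passes.
StructuredBuffer<float>   Src : register(t0);
RWStructuredBuffer<float> Dst : register(u0);

[numthreads(64, 1, 1)]
void PassDoubleBuffered(uint3 tid : SV_DispatchThreadID)
{
    Dst[tid.x] = 0.5f * (Src[tid.x] + Src[tid.x ^ 1]);
}

// (b) a single read-write buffer: in-place update through one rw UAV.
RWStructuredBuffer<float> Data : register(u1);

[numthreads(64, 1, 1)]
void PassInPlace(uint3 tid : SV_DispatchThreadID)
{
    // Each thread touches only its own element, so this is race-free;
    // reads and writes both go through the same rw UAV.
    Data[tid.x] = Data[tid.x] * 0.5f + 0.25f;
}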
 
The read-write buffer cannot be a texture; it has to be a rw UAV, and rw UAVs need to support atomics. In effect, memory I/O via rw UAVs is slow in comparison to ro/wo UAVs and textures. On the other hand, depending on the architecture, rw accesses can go through the cache and bandwidth is saved. There are also some texture scenarios which are faster or slower.
I thought UAVs can also have a texture format? And GCN can also write to texture formats through the TMUs (it also supports atomic operations on textures, but one doesn't have to use them, of course, if one wants higher-speed accesses), not only through the ROPs/memexport as with earlier architectures. And generally, UAVs and all buffers and textures use exactly the same caches (L1 and L2 are basically general read-write caches used for everything apart from ROP exports; just the instruction and scalar data caches are read-only). It may just be a matter of (un)necessary synchronisation when one thing appears to be faster than another.
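For illustration, a texture-typed UAV with atomics, which plain DX11 HLSL already allows on 32-bit integer formats (a hypothetical histogram kernel):

Code:
Texture2D<float>  Luma      : register(t0);
RWTexture2D<uint> Histogram : register(u0); // R32_UINT, 256x1 texels

[numthreads(8, 8, 1)]
void BuildHistogram(uint3 tid : SV_DispatchThreadID)
{
    uint bin = (uint)(saturate(Luma[tid.xy]) * 255.0f);
    // Atomic add on a texture-typed UAV; needed here because many
    // threads may increment the same bin concurrently.
    InterlockedAdd(Histogram[uint2(bin, 0)], 1u);
}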
 
UAVs must be mapped onto datatypes divisible by four. That is, int and float, and only one, two or four of them can be fetched simultaneously. Doubles are emulated with floats, and there can only be two. Structs are longer chains of fetches of those.
UAVs are not shader resource views, aka SRVs, which are what in earlier APIs would be a "texture".
There are three paths through the memory system, in Cypress at least, and I'm sure it's still like that in GCN. Filtered textures need to be fetched specially, of course. Unfiltered textures might sometimes also be fetched specially, e.g. BTC-compressed ones or bit-fields; type conversion is necessary in that case. Others like RGBAf can be mapped directly onto the fast read-only UAV path, and no type conversion takes place, just a simple static_cast. That one's almost able to reach peak bandwidth (see the OpenCL documentation). The read-write UAV path needs to consider atomics and is comparatively slow.
 
I'm relatively sure that this changed a bit with GCN and the former restrictions are less severe or simply nonexistent. And what the APIs on the PC side allow wouldn't apply in the XB1 case anyway; what the hardware can do is what matters. Of course, it may not be so easy to write block-compressed textures, but it is probably rare that one wants to do that from within a compute shader.
And what do you mean by saying atomics need to be considered and it gets slower? If one doesn't use atomics, one doesn't use them; they only slow you down if you use them.

edit:
Just had a look; these are the data formats GCN can access, either as a buffer, as an image (texture; which can be anything: 1D, 1D array, 2D, 2D array, 2D MSAA, 2D MSAA array, 3D, or cube, i.e. one can also write individual MSAA samples of a multisampled texture, for instance), or through the ROPs (CB):
[Images: tables of the GCN buffer and image data formats, with per-format read/write support flags]
As you see, it's a bit more than just 4-byte-aligned formats. And as said, images and buffers both use the same caches and both support atomics. Btw, I think MS let AMD add support for a few more formats.
 
It's a bit of an incomplete list, as it misses "buffer rw". The "rw" case is the one that supports atomics, and it's quite different from streaming reads or writes.
An SRV can only be "r"ead or only be "w"ritten, but not "rw". You cannot bind two different SRVs, "r" and "w", pointing to the same memory region, which means texture access in the traditional sense is always unidirectional streaming.
In the DX DirectCompute model, all structured buffers (all UAVs) need to have element sizes aligned to ints; the demultiplexing of the individual members appears as additional shader instructions. If GCN can do more, for example fetch a single byte from a memory region, we may just have found something where a compute shader in "Mantle HLSL" could have considerably more/better features than a DX11.2 one.
I don't see the table saying that you can, e.g., write 16-bit values (down-converted from 32-bit ints or floats) at full speed. It's just full of booleans. :)

Anyway. A bandwidth tester could reveal if all the different UAV/SRV configs fall into one or into three speed classes.
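Something like the following pair of kernels would do as a starting point for such a tester (hypothetical sketch: one streams through a ro SRV, the other reads and writes through a single rw UAV, and you'd time large dispatches of each):

Code:
StructuredBuffer<float4>   Src    : register(t0);
RWStructuredBuffer<float4> Dst    : register(u0);
RWStructuredBuffer<float4> DataRw : register(u1);

[numthreads(256, 1, 1)]
void CopyViaSrv(uint3 tid : SV_DispatchThreadID)
{
    // Streaming read through a ro SRV, streaming write through a UAV.
    Dst[tid.x] = Src[tid.x];
}

[numthreads(256, 1, 1)]
void CopyViaRwUav(uint3 tid : SV_DispatchThreadID)
{
    // Read and write through the same rw UAV; each thread owns its
    // element, so correctness isn't an issue for the measurement.
    DataRw[tid.x] = DataRw[tid.x] * 2.0f;
}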
 
It's a bit of an incomplete list, as it misses "buffer rw". The "rw" case is the one that supports atomics, and it's quite different from streaming reads or writes.
No, the list is complete (save for some special formats MS may have let AMD add). You just missed the point. The GCN hardware simply has no concept of a write-only buffer. There are data formats it can write and a few more formats it can read (different for simple buffers and for image formats, which also support texture sampling). And that's an inclusive list in the sense that it can also read all formats it can write to. It is simply the case that every format writable by the hardware can be bound as a read/write buffer (or texture) to the shader. It is as simple as that. That the hardware has further restrictions on whether it can apply atomics (only to 32- and 64-bit-per-pixel [not per-component!] formats) doesn't change that. As I said, one should forget about the API restrictions from the PC space here. On a console, you can simply present it as it is: use everything you want (and the hardware supports) as a r/w buffer/texture, but be aware that atomics work only on a restricted set of data formats.
And before someone comes with the argument that r/w buffers without atomics make no sense, that is simply not true.
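A simple example where a r/w buffer without atomics is perfectly sensible: each thread reads and writes only its own element, as in this hypothetical particle-integration kernel.

Code:
struct Particle
{
    float3 pos;
    float3 vel;
};

RWStructuredBuffer<Particle> Particles : register(u0);

cbuffer Frame : register(b0)
{
    float dt;
};

[numthreads(64, 1, 1)]
void Integrate(uint3 tid : SV_DispatchThreadID)
{
    // Read-modify-write on a rw UAV with no atomics at all: each
    // thread owns exactly one particle, so nothing needs synchronising.
    Particle p = Particles[tid.x];
    p.vel += float3(0.0f, -9.81f, 0.0f) * dt;
    p.pos += p.vel * dt;
    Particles[tid.x] = p;
}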

I don't see the table saying that you can, e.g., write 16-bit values (down-converted from 32-bit ints or floats) at full speed. It's just full of booleans. :)

Anyway. A bandwidth tester could reveal if all the different UAV/SRV configs fall into one or into three speed classes.
This data conversion should be "full speed" for filtering purposes; at least it doesn't incur an additional penalty for texture filtering. It means at least 4 values can be converted per clock cycle (equalling the filtering output speed). Whether the conversion can be faster than that for non-filtered accesses (buffer reads/writes or image reads/writes without an applied sampler) would be an interesting test. Maybe one would see a speed difference depending on whether one applies data conversion or not (GCN is able to do both, one just has to use the right instructions; up to 16 accesses/clock without sampling and conversion can potentially be served from the vector L1).

Otherwise, the hardware basically just cares whether it is a linear buffer or some tiled/swizzled texture format, which may influence the cache hit rate depending on the access pattern. It's of course going to be faster if the data layout fits the access pattern. And with just 8-bit-sized reads or writes, one isn't going to hit any bandwidth limit (as the number of accesses probably limits well before the raw bandwidth does). But aside from this, there should be no inherent bandwidth difference.
 
This thread will be much more useful after this coming weekend as DF is prepping articles to cover the subject then.
 
Let me just add a definitive source of information about the formats and "r", "w" and "rw" distinctions:
http://msdn.microsoft.com/en-us/library/windows/desktop/ff728749(v=vs.85).aspx

Direct3D 11's Unordered Access View (UAV) of a Texture1D, Texture2D, or Texture3D resource supports random access reads and writes to memory from a compute shader or pixel shader. However, Direct3D 11 supports simultaneously both reading from and writing to only the DXGI_FORMAT_R32_UINT texture format. For example, Direct3D 11 does not support simultaneously both reading from and writing to other, more useful formats, such as DXGI_FORMAT_R8G8B8A8_UNORM. You can use only a UAV to random access write to such other formats, or you can use only a Shader Resource View (SRV) to random access read from such other formats. Format conversion hardware is not available to simultaneously both read from and write to such other formats.

GCN1 and GCN2 can of course offer what DX11 doesn't allow, which, especially in this case, would make Mantle a pot of gold.
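The usual DX11-level workaround implied by that quote is to create the UAV with DXGI_FORMAT_R32_UINT over the RGBA8 data and do the format conversion by hand in the shader; a sketch:

Code:
RWTexture2D<uint> Image : register(u0); // R32_UINT view over RGBA8 data

float4 UnpackRgba8(uint v)
{
    return float4(v & 0xFF, (v >> 8) & 0xFF,
                  (v >> 16) & 0xFF, (v >> 24) & 0xFF) / 255.0f;
}

uint PackRgba8(float4 c)
{
    uint4 b = (uint4)(saturate(c) * 255.0f + 0.5f);
    return b.x | (b.y << 8) | (b.z << 16) | (b.w << 24);
}

[numthreads(8, 8, 1)]
void DarkenInPlace(uint3 tid : SV_DispatchThreadID)
{
    // A read-modify-write on a single typed UAV, legal in DX11 because
    // the view format is R32_UINT; the RGBA8 conversion is done manually.
    float4 c = UnpackRgba8(Image[tid.xy]);
    Image[tid.xy] = PackRgba8(c * 0.5f);
}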
 