Questions about Xbox One's ESRAM & Compute

Discussion in 'Console Technology' started by onQ, Sep 29, 2013.

  1. onQ

    onQ
    Veteran

    Joined:
    Mar 4, 2010
    Messages:
    1,540
    Likes Received:
    56
    I've been wondering about the Xbox One's 32MB of ESRAM & how it could work with compute.

    Could it be possible to just use the DDR3 for the normal graphical tasks while using the 32MB of ESRAM for the compute shaders, or even to have a whole lighting engine that's run in compute out of the fast embedded RAM?
     
  2. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,576
    Likes Received:
    16,034
    Location:
    Under my bridge
    Yes. The XB1 memory architecture appears to be completely flexible, so devs can do whatever they want with it. Whether you'd want to or not is a different question. If you only want 68 GB/s for your entire graphics and processing workloads, go for it!
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,640
    Likes Received:
    1,082
    It's worth mentioning that it isn't all or nothing. You could reserve a few MB for compute and have the GPU use the rest for graphics.

    Latency-sensitive algorithms benefit the most from ESRAM, so you might tailor your memory layout to get the maximum benefit; like keeping the top nodes of a KD-tree, quadtree, or whatever search structure you use in ESRAM, and leaving the bottom nodes and leaves in DDR3 memory.

    Cheers
     
  4. Aenima

    Newcomer

    Joined:
    Jul 15, 2010
    Messages:
    65
    Likes Received:
    19
    In most games, where the 3D view is relative to the player's position, I could see potential for these optimizations. I still haven't been able to figure out the impact of PRT on ESRAM, but it seems like a potential candidate for a performance win. Maybe someone with experience in this field can explain whether or not PRT would be a good candidate for ESRAM usage?
     
    #4 Aenima, Sep 29, 2013
    Last edited by a moderator: Sep 29, 2013
  5. dobwal

    Legend Veteran

    Joined:
    Oct 26, 2005
    Messages:
    5,763
    Likes Received:
    2,035
    Didn't MS demonstrate tiled resources using 16 MB of memory to display a 3 GB texture? Is it possible that utilizing just half of the eSRAM would effectively let Durango house 3+ GB of texture data on chip?

    They talked about how tiled resources work by treating portions of GPU memory and DDR3 as a sort of large unified texture buffer. I guess the DMEs could manage things like moving texture data between the eSRAM and DDR3 while the GPU is busy performing internal operations, as well as using the swizzle logic to manipulate texture data residing in eSRAM and DDR3 before it hits the GPU.
     
  6. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,576
    Likes Received:
    16,034
    Location:
    Under my bridge
    No. 16 MB of ESRAM allows for 16 MB of texture. If that 16 MB is exactly what you need to texture every pixel on screen, it's enough. If you need more than 16 MB of tiles to render the screen, you'll need to copy tiles into the ESRAM, which is something of a waste of BW if you aren't going to reuse the tile. If the scene changes, you also need to load in more tiles.

    Allocating ESRAM to textures depends on reutilisation. Frame-to-frame changes could be low enough that it's plausible, although then you have less RAM for everything else, and a key point of PRT is reducing BW requirements. So you'd be using up ESRAM space to save a tiddly amount of BW and reduce latency on texturing, which is one area where GPUs are specifically designed to accommodate high latency. Doesn't make sense to me.

    Using PRT for non-texture data (especially stuff you write to as well as read) could make some sense as others suggest, but storing your level textures in there seems pointless to me. Maybe it'd be worth doing for destruction, like writing bullet holes to textures: keep the immediate tiles resident, draw into them, and then swap them out as the player moves?
     
  7. dobwal

    Legend Veteran

    Joined:
    Oct 26, 2005
    Messages:
    5,763
    Likes Received:
    2,035
    Thanks shifty.

    Looked around and found the most detailed description of PRT I could come across. It contains more information than more general sources provide, and it was written by AMD for a SIGGRAPH 2012 virtual texturing course.

    http://www.jurajobert.com/data/Virtual_Texturing_in_Software_and_Hardware_course_notes.pdf
     
    #7 dobwal, Sep 30, 2013
    Last edited by a moderator: Sep 30, 2013
  8. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    942
    Likes Received:
    402
    It's possible that the embedded RAM makes read-write buffers a "fast path". Normally, when you do multiple passes with a compute shader, you have the choice between
    a) double-buffering the buffer you work on, so input and output are two different resources, one read-only, the other write-only, or
    b) using a single read-write buffer.
    This has a few implications for performance. The read-write buffer cannot be a texture; it has to be a rw UAV, and rw UAVs need to support atomics. In effect, memory I/O via rw UAVs is slow in comparison to ro/wo UAVs and textures. On the other hand, depending on the architecture, rw accesses can go through the cache and bandwidth is saved. There are also some texture scenarios which are faster or slower. If read as unfiltered 4-byte blobs, textures - in the disguise of a ro UAV - can be read at maximum memory-controller throughput if you get the access pattern right. If filtered, though, they have to go through the [sometimes] slowest path of the memory system.

    So, embedded RAM might make rw UAVs just about as fast as regular-RAM ro UAVs, with the positive side effect that you have fewer resources allocated and less bandwidth spent, and, if done smartly, none or only a small fraction of the original writes go off-chip (that's good for the CPU cache).
     
  9. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I thought UAVs can also have a texture format? And GCN can also write to texture formats through the TMUs (it also supports atomic operations on textures, though one doesn't have to use them, of course, if one wants higher-speed accesses), not only through ROPs/memexport as with earlier architectures. And generally, UAVs, all buffers and textures use exactly the same caches (L1 and L2 are basically general read-write caches used for everything apart from ROP exports; just the instruction and scalar data caches are read-only). It may be just a matter of (un)necessary synchronisation when one thing appears to be faster than the other.
     
  10. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    942
    Likes Received:
    402
    UAVs must be mapped onto datatypes divisible by four. That is int and float, and only one, two or four of them can be fetched simultaneously. Doubles are emulated with floats, and only two of them can be fetched at a time. Structs are longer chains of fetches of those.
    UAVs are not shader resource views aka SRVs, which are what earlier APIs would call a "texture".
    There are three paths through the memory system in Cypress at least, and I'm sure it's still like that in GCN. Filtered textures need to be fetched specially, of course. Unfiltered textures might sometimes be fetched specially too, e.g. BTC-compressed ones or bit-fields; what's necessary in that case is type conversion. Others like RGBAf can be mapped directly onto the fast read-only UAV path and no type conversion takes place, just a simple static_cast. That one is almost able to reach peak bandwidth (see the OpenCL documentation). The read-write UAV path needs to consider atomics and is comparatively slow.
     
  11. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I'm relatively sure that this changed a bit with GCN and the former restrictions are less severe or simply nonexistent. And what the APIs on the PC side allow wouldn't apply in the XB1 case anyway; what the hardware can do is what matters. Of course it may not be so easy to write block-compressed textures, but it's probably rare that one wants to do that from within a compute shader.
    And what do you mean by atomics needing to be considered making it slower? If one doesn't use atomics, one simply doesn't use them. They only slow you down if you do.

    edit:
    Just had a look; these are the data formats GCN can access, either as a buffer, as an image (texture; which can be anything: 1D, 1D array, 2D, 2D array, 2D MSAA, 2D MSAA array, 3D, or cube, i.e. one can also write individual MSAA samples in a multisampled texture, for instance), or through the ROPs (CB):
    [images: tables of the buffer and image data formats supported by GCN]
    As you can see, it's a bit more than just 4-byte-aligned formats. And as said, images and buffers both use the same caches and both support atomics. Btw, I think MS let AMD add support for a few more formats.
     
    #11 Gipsel, Sep 30, 2013
    Last edited by a moderator: Oct 1, 2013
  12. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    Is that 10.10.10.2 format only 7e3 again, or is there hw alpha blending for 6e4 this time around?
     
  13. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    According to this article, the XB1 fully supports both 7e3 and 6e4 render target formats.
     
  14. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    942
    Likes Received:
    402
    The list is a bit incomplete, as it misses "buffer rw". The "rw" case is the one that supports atomics, and it's quite different from streaming reads or writes.
    An SRV can only be "r"ead or only be "w"ritten, but not "rw". You cannot bind two different SRVs, one "r" and one "w", pointing to the same memory region, which means texture access in the traditional sense is always unidirectional streaming.
    In the DX DirectCompute model, all structured buffers - all UAVs - need to have element sizes aligned to ints, and the demultiplexing of the individual members appears to take additional shader instructions. If GCN can do more, for example fetch a single byte from a memory region, we may just have found something where a compute shader in "Mantle HLSL" could certainly have much more/better features than a DX11.2 one.
    I don't see the table saying that you can e.g. write 16-bit values (down-converted from 32-bit ints or floats) at full speed. It's just full of booleans. :)

    Anyway. A bandwidth tester could reveal whether all the different UAV/SRV configs fall into one or into three speed classes.
     
  15. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    No, the list is complete (save for some special formats MS may have let AMD add). You just missed the point. The GCN hardware simply has no concept of a write-only buffer. There are data formats it can write, and a few more formats it can read (different for simple buffers and for the image formats that also support texture sampling). And it's an inclusive list in the sense that it can also read every format it can write. It is simply the case that every format writable by the hardware can be bound as a read/write buffer (or texture) to the shader. It is as simple as that. That the hardware has further restrictions on whether it can apply atomics (only to 32 and 64 bit per pixel [not per component!] formats) doesn't change that. As I said, one should forget about the API restrictions from the PC space here. On a console, you can simply present it as it is: use everything you want (that the hardware supports) as a r/w buffer/texture, but be aware that atomics work only on a restricted set of data formats.
    And before someone comes with the argument that r/w buffers without atomics make no sense: that is simply not true.

    This data conversion should be "full speed" for filtering purposes; at least it doesn't incur an additional penalty for texture filtering. It means at least 4 values can be converted per clock cycle (equalling the filtering output speed). Whether the conversion could be faster than that for non-filtered accesses (buffer reads/writes, or image reads/writes without an applied sampler) would be an interesting test. Maybe one sees a speed difference depending on whether data conversion is applied or not (GCN is able to do both; one just has to use the right instructions, and up to 16 accesses/clock without sampling and conversion can potentially be served from the vector L1).
    Otherwise, the hardware basically only cares whether it is a linear buffer or some tiled/swizzled texture format, which may influence the cache hit rate depending on the access pattern. It's of course going to be faster if the data layout fits the access pattern. And with just 8-bit-sized reads or writes, one isn't going to hit any bandwidth limit (the number of accesses probably limits you well before the raw bandwidth does). But aside from this, there should be no inherent bandwidth difference.
     
    #15 Gipsel, Oct 2, 2013
    Last edited by a moderator: Oct 2, 2013
  16. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    942
    Likes Received:
    402
    Well, let's see what the GCN2 docs tell us, hopefully before too long.
     
  17. astrograd

    Regular

    Joined:
    Feb 10, 2013
    Messages:
    418
    Likes Received:
    0
    This thread will be much more useful after this coming weekend as DF is prepping articles to cover the subject then.
     
  18. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    942
    Likes Received:
    402
    Let me just add a definitive source of information about the formats and the "r", "w" and "rw" distinctions:
    http://msdn.microsoft.com/en-us/library/windows/desktop/ff728749(v=vs.85).aspx

    GCN1 and GCN2 can of course offer what DX11 doesn't allow. Which, especially in this case, would make Mantle a pot of gold.
     
  19. onQ

    onQ
    Veteran

    Joined:
    Mar 4, 2010
    Messages:
    1,540
    Likes Received:
    56

    Could this be the reason why some devs said that the ESRAM was a pain to use?
     
  20. Ceger

    Newcomer

    Joined:
    Aug 21, 2013
    Messages:
    59
    Likes Received:
    1
    Which devs would that be? Honest question.
     