The pros and cons of eDRAM/ESRAM in next-gen

Discussion in 'Console Technology' started by Shifty Geezer, Jan 8, 2012.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It's risky to extrapolate from a platform whose TDP is close to 1/10 that of a console. The load performance scenario is likely only a few watts away from what the console APU idles at.

    The software test is designed to show how the system operates in a tightly power-constrained environment on a synthetic workload designed to make the trade-off obvious.
     
  2. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
The problem with directly adding the values is that while there are situations where this is applicable, there are others where the esram is twiddling its super-fast thumbs while the CPU and GPU choke on the limited DDR3 BW.

    A little more DDR3 BW might have allowed higher utilisation of the esram BW.

DDR3-2400 was available last year, but the additional cost over 2133 would likely have been prohibitive, and that's assuming AMD's DDR3 controller would have made effective use of the faster memory.
     
  3. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
Just a thought, but could tiling (the whole screen) to reduce buffer footprints, and then reading textures and geometry from esram, allow developers to effectively manage almost all GPU access to main memory by hand?

    Manually managing access by transferring correctly sized chunks of data using DMA might allow contention issues affecting the CPU to be minimised, right ...?

The figures for esram BW make it look like the esram has 'spare' BW on its hands whenever it isn't doing FP16 blending. If copying into esram used more esram BW but was a net win overall, then it would seem to be a good idea.
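The tiling idea above can be sketched as a quick footprint calculation: once a tile's render targets (plus any textures and geometry staged alongside them) fit in 32 MB, the GPU could work almost entirely out of esram. A minimal sketch, assuming illustrative byte-per-pixel and staging figures rather than anything X1-specific:

```cpp
#include <cassert>
#include <cstdint>

// 32 MB of ESRAM, as on X1.
constexpr uint64_t kEsramBytes = 32ull * 1024 * 1024;

// Bytes needed for one tile: the screen's pixels divided across `tiles`,
// times bytes per pixel across all render targets, plus staged data.
uint64_t tileFootprint(uint64_t width, uint64_t height, uint64_t tiles,
                       uint64_t bytesPerPixel, uint64_t stagedBytes) {
    return (width * height / tiles) * bytesPerPixel + stagedBytes;
}

// Smallest power-of-two tile count at which one tile fits in ESRAM.
uint64_t minTiles(uint64_t width, uint64_t height,
                  uint64_t bytesPerPixel, uint64_t stagedBytes) {
    for (uint64_t tiles = 1; tiles <= 64; tiles *= 2)
        if (tileFootprint(width, height, tiles, bytesPerPixel, stagedBytes)
                <= kEsramBytes)
            return tiles;
    return 0;  // doesn't fit even at 64 tiles
}
```

At 1080p with a hypothetical 20 bytes/pixel of render targets and 4 MB of staged data, splitting the screen in two is already enough: `minTiles(1920, 1080, 20, 4ull << 20)` returns 2.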
     
  4. Laa-Yosh

    Laa-Yosh I can has custom title?
    Legend Subscriber

    Joined:
    Feb 12, 2002
    Messages:
    9,568
    Likes Received:
    1,455
    Location:
    Budapest, Hungary
    So far it doesn't look like bandwidth would be X1's main weakness. The small size of the ESRAM and the lower amount of GPU power seem to be more serious issues.
     
  5. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,411
    Location:
    Wrong thread
    Main memory contention issues are significant, according to the Metro dev at least. It's affecting both CPU and GPU performance on X1.

    Nothing can be done about the processing power unfortunately, but it may be possible to work around contention issues.
     
  6. liquidboy

    Regular

    Joined:
    Jan 16, 2013
    Messages:
    416
    Likes Received:
    77
A question I've had since hearing about esram a year ago is: how does one code for it?!

If it's a logical extension of the Xbox 360, we had XNA Studio, which automatically did some allocation for us..

e.g. predicated tiling on Xbox 360

Does XB1 have something similar with DX 11.x extensions... (seeing as XNA is not an option this generation)..

OR does MS have WinRT APIs and/or C++ AMP to let us target internal shared memory ... much like what C++ AMP tiling provides?

As the architects said above, we have 32 MB: 4 lanes times 8 modules per lane

= 32 modules total (1 MB each) that can average 140-150 GB/s of read/write across them in parallel..

So are these 32 tiles in a C++ AMP sense?!
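Those figures (4 lanes × 8 modules × 1 MB) suggest addresses are striped across the lanes so that linear reads and writes spread over all modules in parallel. A sketch of what such a mapping could look like; the 256-byte stripe granularity below is a pure assumption for illustration, since the real interleaving isn't public:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kLanes          = 4;
constexpr uint32_t kModulesPerLane = 8;
constexpr uint32_t kModuleBytes    = 1u << 20;  // 1 MB per module
constexpr uint32_t kStripeBytes    = 256;       // assumed stripe granularity

struct EsramLoc { uint32_t lane, module, offset; };

// Map a flat ESRAM address to (lane, module, offset) under the assumed
// striping: consecutive 256-byte stripes rotate across the 4 lanes.
EsramLoc locate(uint32_t addr) {
    uint32_t stripe   = addr / kStripeBytes;
    uint32_t lane     = stripe % kLanes;
    uint32_t laneAddr = (stripe / kLanes) * kStripeBytes
                      + addr % kStripeBytes;
    return { lane, laneAddr / kModuleBytes, laneAddr % kModuleBytes };
}
```

A linear burst then touches every lane in turn: `locate(0)` lands on lane 0, `locate(256)` on lane 1, `locate(768)` on lane 3, and the capacity works out to 4 × 8 × 1 MB = 32 MB.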

    I wish the devs tied to NDA were allowed to share some insights...

note: Microsoft has gone on record stating that the Xbox team invested heavily in C++ AMP for Xbox One; I'm betting that is the approach to best utilise the HW (accelerators & shared memory etc. in XB1) ...
     
    #786 liquidboy, Sep 2, 2014
    Last edited by a moderator: Sep 2, 2014
  7. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680
    Switching from DX11 to DX12 could help, if it reduces the CPU use and touches memory less often. Keeping the ESRAM utilized by DMAing data in/out would be the best method of avoiding contention on the DDR3.
     
  8. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680
    The GPU can see both DDR3 and the ESRAM. There should be some API functions that would DMA data between the two pools of memory. Should not be complicated, but it would be "manual," as far as I know. Coming up with good algorithms to keep ESRAM full of useful data at the right times is the tricky part.
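The "keep ESRAM full of useful data at the right times" part can be sketched as simple double-buffering: split the ESRAM into two halves and let the DMA fill one half while the GPU consumes the other. The `dmaUpload`/`renderTile` steps below are hypothetical stand-ins for whatever the real API calls are; this only shows the overlap pattern:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Produce the ordered command stream for `tiles` tiles, ping-ponging
// between two ESRAM halves so the upload of tile N+1 overlaps the
// rendering of tile N. Returns a log of steps instead of issuing
// real commands.
std::vector<std::string> schedule(int tiles) {
    std::vector<std::string> log;
    for (int t = 0; t < tiles; ++t) {
        log.push_back("dmaUpload tile " + std::to_string(t) +
                      " -> half " + std::to_string(t % 2));
        if (t > 0)  // the previous tile renders while this upload runs
            log.push_back("renderTile " + std::to_string(t - 1));
    }
    if (tiles > 0)
        log.push_back("renderTile " + std::to_string(tiles - 1));
    return log;
}
```

For two tiles the stream is: upload 0; upload 1 overlapped with render 0; render 1. The DDR3 reads (uploads) and the ESRAM-heavy rendering then rarely wait on each other, which is the whole contention-avoidance argument in miniature.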
     
  9. oldschoolnerd

    Newcomer

    Joined:
    Sep 13, 2013
    Messages:
    65
    Likes Received:
    8
Whilst these algorithms to keep the ESRAM full of data needed by the GPU are tricky, the problem is far from insurmountable. Surely tiling much larger data structures in and out of DDR3 RAM using the move engines mitigates the comparatively small size of the ESRAM to the point where it's ... big enough. This leaves the X1 in the enviable position of having a dedicated, high-bandwidth, low-latency memory pool for the GPU that removes contention against the DDR3. The CPU gets uncontended access to DDR3, excepting the asynchronous move engines, which, being latency tolerant rather than latency sensitive, are exactly the sort of contending memory clients you want if you must have any contention at all.

It really does seem to add up. The DDR3 has a max achievable bandwidth of 50-55 GB/s. The CPU can have up to 30, and if I remember correctly the move engines are allowed up to 25... almost like it was fully thought out to be like this, with the design goal of minimising the contention that you are otherwise stuck with when using a single pool of shared memory.

Then of course you have the GPU with a contention-free 100-150 GB/s of bandwidth to ESRAM, depending on the concurrency of read and write activity.
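Those numbers can be put into a tiny budget check. The figures are the ones quoted above (max achievable DDR3, CPU ceiling, move-engine ceiling), taken at face value rather than measured:

```cpp
#include <cassert>

// Figures quoted in the post above, in GB/s.
constexpr double kDdr3Achievable = 55.0;  // max achievable DDR3 bandwidth
constexpr double kCpuCeiling     = 30.0;  // CPU's maximum share
constexpr double kDmeCeiling     = 25.0;  // move engines' maximum share

// DDR3 bandwidth left for the GPU once the CPU and the move engines
// have taken their shares of the bus.
double gpuDdr3Leftover(double cpuUse, double dmeUse) {
    double left = kDdr3Achievable - cpuUse - dmeUse;
    return left > 0.0 ? left : 0.0;
}
```

With the CPU and move engines both maxed out, the DDR3 is exactly saturated (30 + 25 = 55) and the GPU gets nothing from it, which is presumably why the design pushes the GPU's heavy traffic onto the ESRAM instead.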

    Back to the original thread topic of the pros and cons, there seem to be plenty of pros with the only real cons being the complexity of software required to extract maximum performance, meaning it will be a while before we see titles using it to its max.
     
  10. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680
    It would be interesting to hear a bit more about ESRAM and how it's being used by titles right now to know if they're just sticking render targets in there and leaving it, or if anyone is actually making use of it by copying different render targets or other data in and out of the ESRAM. Without DMAing the data you need in/out, it seems like ESRAM wouldn't give too much of a benefit because you'd be hitting DDR3 a lot.

    One of the inherent drawbacks of ESRAM is the size of it. Right now it just doesn't seem to be practical/cost-effective for a console to have a larger ESRAM. Your whole renderer would have to be designed around the small size, to make sure you can receive one of its bigger benefits, which is avoiding contention with the CPU.
     
  11. mosen

    Regular

    Joined:
    Mar 30, 2013
    Messages:
    452
    Likes Received:
    152
Is there any specific advantage for eSRAM in read/modify/write operations compared to DRAMs? Like fine-granularity read/modify/write? Or latency? Is it possible for XB1's GPU to perform exclusive read/modify/write sequences on the same buffer in eSRAM?
     
  12. steveOrino

    Regular

    Joined:
    Feb 11, 2010
    Messages:
    549
    Likes Received:
    242

Maybe someone could help me understand this. I know MS wanted to go with SRAM because of procurement issues, but in the end did it really save them anything? I read the papers from Intel and Samsung about how SRAM isn't meeting their requirements (takes too much die space, is difficult to shrink, is power hungry) and why embedded DRAM is the way forward for their future products.
     
  13. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    Digital Foundry: Perhaps the most misunderstood area of the processor is the ESRAM and what it means for game developers. Its inclusion sort of suggests that you ruled out GDDR5 pretty early on in favour of ESRAM in combination with DDR3. Is that a fair assumption?

    Nick Baker: Yeah, I think that's right. In terms of getting the best possible combination of performance, memory size, power, the GDDR5 takes you into a little bit of an uncomfortable place. Having ESRAM costs very little power and has the opportunity to give you very high bandwidth. You can reduce the bandwidth on external memory - that saves a lot of power consumption as well and the commodity memory is cheaper as well so you can afford more. That's really a driving force behind that. You're right, if you want a high memory capacity, relatively low power and a lot of bandwidth there are not too many ways of solving that.

    Andrew Goossen: Right. By fixing the clock, not only do we increase our ALU performance, we also increase our vertex rate, we increase our pixel rate and ironically increase our ESRAM bandwidth. But we also increase the performance in areas surrounding bottlenecks like the drawcalls flowing through the pipeline, the performance of reading GPRs out of the GPR pool, etc. GPUs are giantly complex. There's gazillions of areas in the pipeline that can be your bottleneck in addition to just ALU and fetch performance.

    If you go to VGleaks, they had some internal docs from our competition. Sony was actually agreeing with us. They said that their system was balanced for 14 CUs. They used that term: balance. Balance is so important in terms of your actual efficient design. Their additional four CUs are very beneficial for their additional GPGPU work. We've actually taken a very different tack on that. The experiments we did showed that we had headroom on CUs as well. In terms of balance, we did index more in terms of CUs than needed so we have CU overhead. There is room for our titles to grow over time in terms of CU utilisation, but getting back to us versus them, they're betting that the additional CUs are going to be very beneficial for GPGPU workloads. Whereas we've said that we find it very important to have bandwidth for the GPGPU workload and so this is one of the reasons why we've made the big bet on very high coherent read bandwidth that we have on our system.

    I actually don't know how it's going to play out of our competition having more CUs than us for these workloads versus us having the better performing coherent memory. I will say that we do have quite a lot of experience in terms of GPGPU - the Xbox 360 Kinect, we're doing all the Exemplar processing on the GPU so GPGPU is very much a key part of our design for Xbox One. Building on that and knowing what titles want to do in the future. Something like Exemplar... Exemplar ironically doesn't need much ALU. It's much more about the latency you have in terms of memory fetch [latency hiding of the GPU], so this is kind of a natural evolution for us. It's like, OK, it's the memory system which is more important for some particular GPGPU workloads.
     
  14. MetalSpirit

    Newcomer

    Joined:
    Jan 29, 2014
    Messages:
    48
    Likes Received:
    1
    Is this really so?

As far as I know, the CPU cannot access the ESRAM, so adding the two figures is not valid unless you only consider the GPU.

    But then we have this:

[IMG]

    So... I highly doubt those numbers are ever available in reality.
     
  15. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    It is in reference to the GPU having total bandwidth of that size. As for CPU access:

    Digital Foundry: And you have CPU read access to the ESRAM, right? This wasn't available on Xbox 360 eDRAM.

    Nick Baker: We do but it's very slow.
     
  16. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
I'm fairly certain that the scope of this topic needs to be expanded. Looking at the esram in isolation is what keeps this discussion going in circles. The memory system needs to be looked at as a whole: DMEs, DDR3 and ESRAM.

MS developed a more complex memory system, and the choices that were made would likely align with their goals. Hindsight is 20/20, I agree, but a long time ago in Feb 2013 you guys were onto the engineers working on X1.

    http://beyond3d.com/showpost.php?p=1713230&postcount=840

Dobwal makes an interesting hypothesis that X1 leverages an FPGA to interface with the memory system, posted here:

    http://beyond3d.com/showpost.php?p=1713503&postcount=849

We never followed up on that, though. Regardless of whether an FPGA is used or not, the DMEs are a common recurring theme from a long time ago. Why DMEs from DDR to embedded RAM? What's the point?
     
  17. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
It makes no sense to use an FPGA in this situation; you gain nothing and lose everything.

An FPGA is hotter, larger, slower and more expensive than a comparable ASIC at the kind of scale the XB1 is produced at.
     
  18. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680
They don't use an FPGA. If they did, it would show up on one of the teardown BOMs. It doesn't really make sense as a solution.
     
  19. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
Let's ignore the FPGA comment; that clearly derailed things. Why DMEs?
     
  20. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680
The DMEs are customized versions of the DMA engines available on AMD GCN. I can't remember the exact features they added. They were obviously customized with a particular intent; I don't know the details.
     