The pros and cons of eDRAM/ESRAM in next-gen

3dilettante · Sep 1, 2014

mosen said:
XB1 launched with DX11, it should use less power with DX12.

http://blogs.msdn.com/b/directx/arc...-high-performance-and-high-power-savings.aspx

It's risky to extrapolate from a platform whose TDP is close to 1/10 that of a console. The load performance scenario is likely only a few watts away from what the console APU idles at.

The software test is designed to show how the system operates in a tightly power-constrained environment on a synthetic workload designed to make the trade-off obvious.

function · Sep 2, 2014

liquidboy said:
as the architects have stated numerous times

Realistic

140-150GB/s for ESRAM (internal)
50-55GB/s for DDR (external)

Total =190-205 GB/s

The problem with directly adding the values is that while there are situations where this is applicable, there are situations where the esram is twiddling its super fast thumbs while the CPU and GPU choke on the limited DDR3 BW.

A little more DDR3 BW might have allowed higher utilisation of the esram BW.

DDR3 2400 mHz was available last year but the additional cost of this over 2133 would likely have been prohibitive, and that's assuming that AMD's DDR3 controller would have made effective use of the faster memory.

function · Sep 2, 2014

Just a thought, but could tiling (the whole screen) to reduce buffer footprints, and then reading textures and geometry from esram allow developers to effectively manually manage almost all GPU access from main memory?

Manually managing access by transferring correctly sized chunks of data using DMA might allow contention issues affecting the CPU to be minimised, right ...?

The figures for esram BW make it look like when the esram isn't doing FP16 blending it has 'spare' BW on its hands. It copying into esram used more esram BW but was a net win then it would seem to be a good idea.

Laa-Yosh · Sep 2, 2014

So far it doesn't look like bandwidth would be X1's main weakness. The small size of the ESRAM and the lower amount of GPU power seem to be more serious issues.

function · Sep 2, 2014

Main memory contention issues are significant, according to the Metro dev at least. It's affecting both CPU and GPU performance on X1.

Nothing can be done about the processing power unfortunately, but it may be possible to work around contention issues.

liquidboy · Sep 2, 2014

question ive had since hearing about esram a year ago was how does one code for it?!

If its a logical extension of XBox360, we had XNA Studio, that automatically did some allocation for us..

eg. predicated tiling in xbo360

Does XB1 have something similar with Dx 11.x extensions... (seeing as XNA is not an option this generation)..

OR does MS have WinRT API's and or C++ AMP to allow us target internal shared memory ... Much like what C++ AMP tiling

As the architects said above, we have 32 MB, 4 lanes times 8 modules per lane

= 32 total modules (1 mb each) that can have avg read/write across them in parallel of 140-150GB/s..

So are these 32 tiles in a C++ amp sense ?!

I wish the devs tied to NDA were allowed to share some insights...

note: Microsoft have gone on record stating that the Xbox team invested heavily in c++ amp for Xbox One, im betting that is the approach to best utilize HW (accelerators & shared memory etc in XB1) ...

Scott_Arm · Sep 2, 2014

function said:
Main memory contention issues are significant, according to the Metro dev at least. It's affecting both CPU and GPU performance on X1.

Nothing can be done about the processing power unfortunately, but it may be possible to work around contention issues.

Switching from DX11 to DX12 could help, if it reduces the CPU use and touches memory less often. Keeping the ESRAM utilized by DMAing data in/out would be the best method of avoiding contention on the DDR3.

Scott_Arm · Sep 2, 2014

liquidboy said:
question ive had since hearing about esram a year ago was how does one code for it?!

If its a logical extension of XBox360, we had XNA Studio, that automatically did some allocation for us..

eg. predicated tiling in xbo360

Does XB1 have something similar with Dx 11.x extensions... (seeing as XNA is not an option this generation)..

OR does MS have WinRT API's and or C++ AMP to allow us target internal shared memory ... Much like what C++ AMP tiling

As the architects said above, we have 32 MB, 4 lanes times 8 modules per lane

= 32 total modules (1 mb each) that can have avg read/write across them in parallel of 140-150GB/s..

So are these 32 tiles in a C++ amp sense ?!

I wish the devs tied to NDA were allowed to share some insights...

note: Microsoft have gone on record stating that the Xbox team invested heavily in c++ amp for Xbox One, im betting that is the approach to best utilize HW (accelerators & shared memory etc in XB1) ...

The GPU can see both DDR3 and the ESRAM. There should be some API functions that would DMA data between the two pools of memory. Should not be complicated, but it would be "manual," as far as I know. Coming up with good algorithms to keep ESRAM full of useful data at the right times is the tricky part.

oldschoolnerd · Sep 2, 2014

Scott_Arm said:
The GPU can see both DDR3 and the ESRAM. There should be some API functions that would DMA data between the two pools of memory. Should not be complicated, but it would be "manual," as far as I know. Coming up with good algorithms to keep ESRAM full of useful data at the right times is the tricky part.

Whilst these algorithms to keep the ESRAM full of data needed by the GPU are tricky, it is far from insurmountable. Surely tiling much larger data structures in and out of DDR3 ram using the move engines mitigate the comparatively small size of the ESRAM to the point where its ...big enough. This leaves the x1 in the enviable position of having a dedicated, high bandwidth, low latency memory pool for the GPU that removes contention against the DDR3. The CPU gets uncontended access to DDR3,excepting the asynchronous move engines which due to being very latency sensitive are exactly the sort of contending memory clients you want if you must have any contention at all.

It really does seem to add up. The DDR3 has a max achievable bandwidth of 50 - 55GB/s. The CPU can have up to 30, and if I remember correctly the move engines are allowed up to 25...almost like it was fully thought out to be like this, with the design goal of minimising the contention that you are otherwise stuck with when using a single pool of shared memory.

Then of course you have the GPU with a contention free 100 - 150GB/s of bandwidth to ESRAM depending on the concurrency of read and write activity.

Back to the original thread topic of the pros and cons, there seem to be plenty of pros with the only real cons being the complexity of software required to extract maximum performance, meaning it will be a while before we see titles using it to its max.

Scott_Arm · Sep 2, 2014

It would be interesting to hear a bit more about ESRAM and how it's being used by titles right now to know if they're just sticking render targets in there and leaving it, or if anyone is actually making use of it by copying different render targets or other data in and out of the ESRAM. Without DMAing the data you need in/out, it seems like ESRAM wouldn't give too much of a benefit because you'd be hitting DDR3 a lot.

One of the inherent drawbacks of ESRAM is the size of it. Right now it just doesn't seem to be practical/cost-effective for a console to have a larger ESRAM. Your whole renderer would have to be designed around the small size, to make sure you can receive one of its bigger benefits, which is avoiding contention with the CPU.

mosen · Sep 5, 2014

Is there any specific advantage for eSRAM in read/modify/write operations compered to DRAMs? Like fine granularity read/modify/write? Or latency? Is it possible for XB1's GPU to perform exclusive read/modify/write sequences on the same buffer on eSRAM?

steveOrino · Sep 5, 2014

Scott_Arm said:
Right now it just doesn't seem to be practical/cost-effective for a console to have a larger ESRAM.

Maybe someone could help me understand this. I know MS wanted to go with Sram because of procurement issues but in the end it it really save them anything? I read the papers from Intel and Samsung about how sram isn't meeting their requirements (Takes too much die space, difficult to shrink, power hungry) and why embedded dram is the way forward for their future products.

iroboto · Sep 5, 2014

Digital Foundry: Perhaps the most misunderstood area of the processor is the ESRAM and what it means for game developers. Its inclusion sort of suggests that you ruled out GDDR5 pretty early on in favour of ESRAM in combination with DDR3. Is that a fair assumption?

Nick Baker: Yeah, I think that's right. In terms of getting the best possible combination of performance, memory size, power, the GDDR5 takes you into a little bit of an uncomfortable place. Having ESRAM costs very little power and has the opportunity to give you very high bandwidth. You can reduce the bandwidth on external memory - that saves a lot of power consumption as well and the commodity memory is cheaper as well so you can afford more. That's really a driving force behind that. You're right, if you want a high memory capacity, relatively low power and a lot of bandwidth there are not too many ways of solving that.

Andrew Goossen: Right. By fixing the clock, not only do we increase our ALU performance, we also increase our vertex rate, we increase our pixel rate and ironically increase our ESRAM bandwidth. But we also increase the performance in areas surrounding bottlenecks like the drawcalls flowing through the pipeline, the performance of reading GPRs out of the GPR pool, etc. GPUs are giantly complex. There's gazillions of areas in the pipeline that can be your bottleneck in addition to just ALU and fetch performance.

If you go to VGleaks, they had some internal docs from our competition. Sony was actually agreeing with us. They said that their system was balanced for 14 CUs. They used that term: balance. Balance is so important in terms of your actual efficient design. Their additional four CUs are very beneficial for their additional GPGPU work. We've actually taken a very different tack on that. The experiments we did showed that we had headroom on CUs as well. In terms of balance, we did index more in terms of CUs than needed so we have CU overhead. There is room for our titles to grow over time in terms of CU utilisation, but getting back to us versus them, they're betting that the additional CUs are going to be very beneficial for GPGPU workloads. Whereas we've said that we find it very important to have bandwidth for the GPGPU workload and so this is one of the reasons why we've made the big bet on very high coherent read bandwidth that we have on our system.

I actually don't know how it's going to play out of our competition having more CUs than us for these workloads versus us having the better performing coherent memory. I will say that we do have quite a lot of experience in terms of GPGPU - the Xbox 360 Kinect, we're doing all the Exemplar processing on the GPU so GPGPU is very much a key part of our design for Xbox One. Building on that and knowing what titles want to do in the future. Something like Exemplar... Exemplar ironically doesn't need much ALU. It's much more about the latency you have in terms of memory fetch [latency hiding of the GPU], so this is kind of a natural evolution for us. It's like, OK, it's the memory system which is more important for some particular GPGPU workloads.

MetalSpirit · Sep 5, 2014

liquidboy said:
as the architects have stated numerous times

Realistic

140-150GB/s for ESRAM (internal)
50-55GB/s for DDR (external)

Total =190-205 GB/s

Is this really so?

As far as I know, The CPU cannot access the ESRAM, so adding the two figures is not valid unless you only consider the GPU.

But then we have this:

So... I highly doubt those numbers are ever available in reality.

iroboto · Sep 5, 2014

It is in reference to the GPU having total bandwidth of that size. As for CPU access:

Digital Foundry: And you have CPU read access to the ESRAM, right? This wasn't available on Xbox 360 eDRAM.

Nick Baker: We do but it's very slow.

iroboto · Sep 5, 2014

I'm fairly certain that the scope of this topic needs to be expanded. Looking at esram in isolation is what is causing this discussion loop and loop. The whole memory system needs to be looked at as a whole: DMEs, DDR and ESRAM.

MS developed a more complex memory system and the choices that were made would likely align with the goals. Hindsight is 20/20 I agree but a long time ago in feb of 2013 you guys were onto the engineers working on x1.

http://beyond3d.com/showpost.php?p=1713230&postcount=840

Dobwal makes an interesting hypothesis that x1 leverages fpga to interface the memory system posted here:

http://beyond3d.com/showpost.php?p=1713503&postcount=849

We never followed up on that though. Regardless of whether fpga is used or not DME is a common recurring theme from a long time ago. Why DME from DDR to embedded RAM? What's the point?

Betanumerical · Sep 5, 2014

iroboto said:
I'm fairly certain that the scope of this topic needs to be expanded. Looking at esram in isolation is what is causing this discussion loop and loop. The whole memory system needs to be looked at as a whole: DMEs, DDR and ESRAM.

MS developed a more complex memory system and the choices that were made would likely align with the goals. Hindsight is 20/20 I agree but a long time ago in feb of 2013 you guys were onto the engineers working on x1.

http://beyond3d.com/showpost.php?p=1713230&postcount=840

Dobwal makes an interesting hypothesis that x1 leverages fpga to interface the memory system posted here:

http://beyond3d.com/showpost.php?p=1713503&postcount=849

We never followed up on that though. Regardless of whether fpga is used or not DME is a common recurring theme from a long time ago. Why DME from DDR to embedded RAM? What's the point?

It makes no sense to use a FPGA in this situation, you gain nothing and lose everything.

A FGPA is hotter, larger, slower and more expensive then a comparable ASIC on the kind of scale that the XB1 is produced at.

Scott_Arm · Sep 5, 2014

They don't use an FPGA. It would be on one of the teardown BOMs. It doesn't really make sense as a solution.

iroboto · Sep 5, 2014

let's ignore the FPGA comment, clearly that derailed - why DMEs?

Scott_Arm · Sep 5, 2014

iroboto said:
let's ignore the FPGA comment, clearly that derailed - why DMEs?

The DMEs are customized versions of the DMA available on AMD GCN. I can't remember the exact features that they added. THey were obviously customized with a particular intent. Don't know the details.

The pros and cons of eDRAM/ESRAM in next-gen

3dilettante

function

None functional

function

None functional

Laa-Yosh

I can has custom title?

function

None functional

liquidboy

Scott_Arm

Scott_Arm

oldschoolnerd

Scott_Arm

mosen

steveOrino

iroboto

Daft Funk

MetalSpirit

iroboto

Daft Funk

iroboto

Daft Funk

Betanumerical

Scott_Arm

iroboto

Daft Funk

Scott_Arm

Similar threads