The pros and cons of eDRAM/ESRAM in next-gen

I was under the impression that the wider the memory controller, the harder it is to scale down the whole motherboard, just from what I have picked up on these threads, and that is why console manufacturers are loath to use them.

They would much rather use a next-gen memory/eDRAM instead to get the bandwidth, as that is more cost effective in the long run... although judging by what was said above, eDRAM seems off the table as well. I would wager either a big slab of XDR2, a 256-bit bus + GDDR5, or an ace card... GDDR6?
I don't think they will go with a split memory architecture/pool.
 
Scanned through the seminar topics of this year's ISSCC.

Found this one a bit intriguing.

A 3D System Prototype of an eDRAM Cache Stacked Over Processor-Like Logic Using Through-Silicon Vias
M. Wordeman¹, J. Silberman¹, G. Maier², M. Scheuermann¹
¹IBM T. J. Watson, Yorktown Heights, NY
²IBM Systems and Technology Group, Fishkill, NY

This solution lets you make your main CPU on a different (cheaper) process than the eDRAM and still get a really high-bandwidth connection to the eDRAM through the TSVs instead of a massive number of pins. Still, this was a prototype, and I doubt that either MS or Sony wants to bet on such an unproven technology for their upcoming console launches.
 
I was under the impression that the wider the memory controller, the harder it is to scale down the whole motherboard, just from what I have picked up on these threads, and that is why console manufacturers are loath to use them.

That's my understanding too. If the 360 had gone with a 256-bit bus they'd still be stuck with 8 memory chips, just like the launch systems, and it could have placed additional restrictions on how small they could make the system.

They would much rather use a next-gen memory/eDRAM instead to get the bandwidth, as that is more cost effective in the long run... although judging by what was said above, eDRAM seems off the table as well. I would wager either a big slab of XDR2, a 256-bit bus + GDDR5, or an ace card... GDDR6?
I don't think they will go with a split memory architecture/pool.

I don't get the feeling that embedded memory is any more off the cards this generation than last generation. Nintendo appear to be using it despite going for a relatively low-power device that could possibly (probably?) be serviced by a 128-bit bus and GDDR5 in terms of raw bandwidth (at least it looks that way if you compare to the PS360 and the PC space).

My big concern about eDRAM is that it might discourage developers from using accurate, subsample-based forms of AA, meaning we end up with popping pixels, unnecessarily blurry/artefacty shader-only AA, and screen-space effects that are no better in motion than they are now (I'm looking at you, light shafts!).
 
I would assume eDRAM is good if it is more flexible than the Xbox 360 implementation and if the amount of it is sufficient.

For eDRAM to be truly useful it should just be a fast memory area addressable like regular video memory. That way we can put some render targets in regular video memory if we like and others in eDRAM, and there is no need to ever resolve or copy data between the two areas, unless of course the game finds that to be the most efficient approach for its workload. For deferred rendering we wouldn't have to fit the whole g-buffer setup in eDRAM. If we put, say, the first two g-buffers in eDRAM and the depth buffer plus another couple of g-buffers in regular video memory, we would still benefit from the extra bandwidth to the buffers that fit.
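Something like this is what I have in mind, just as a rough sketch - the allocator API below is made up for illustration, not any real console interface:
Code:
// Hypothetical pools and allocator: placement is an app-level choice,
// not something baked into the hardware. Not a real API.
#include <cstdint>

enum class Pool { EDram, VideoMem };

struct RenderTarget {
    uint32_t width, height, bytesPerPixel;
    Pool     pool;   // where this surface lives
};

// Imaginary creation call: the pool is just another parameter.
RenderTarget CreateRT(uint32_t w, uint32_t h, uint32_t bpp, Pool pool) {
    return RenderTarget{ w, h, bpp, pool };
}

int main() {
    // 1080p deferred setup: the most bandwidth-hungry g-buffers go to eDRAM,
    // the rest stay in regular video memory - no resolve/copy step needed.
    RenderTarget gbuf0 = CreateRT(1920, 1080, 4, Pool::EDram);
    RenderTarget gbuf1 = CreateRT(1920, 1080, 4, Pool::EDram);
    RenderTarget depth = CreateRT(1920, 1080, 4, Pool::VideoMem);
    RenderTarget gbuf2 = CreateRT(1920, 1080, 4, Pool::VideoMem);
    (void)gbuf0; (void)gbuf1; (void)depth; (void)gbuf2;
    return 0;
}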

What about eDRAM/large caches to help the CPU and GPU talk and divide work? It would be pretty useful to pass significantly sized buffers between computing units without going through main memory.

This generation we have seen people do graphics on SPUs, which in a way is mostly a workaround for a poor GPU, but I think we'll still see this sort of thing going forward. Ideally all memory should be accessible from all units with no bandwidth restrictions in any direction, using a unified address space. That way the app could make wise choices about where to put resources based on required bandwidth: in system memory (slow), video memory (faster) or eDRAM (very fast). There could certainly be CPU tasks that would benefit from huge bandwidth.
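Just to sketch the "place by required bandwidth" idea - the tiers and threshold numbers below are invented placeholders, not figures for any real machine:
Code:
#include <cstdio>

// Three invented bandwidth tiers; the threshold figures are placeholders.
enum class Tier { SysMem, VidMem, EDram };

// Pick the slowest tier that still satisfies the estimated bandwidth need.
Tier Place(double requiredGBperSec) {
    if (requiredGBperSec > 100.0) return Tier::EDram;
    if (requiredGBperSec > 20.0)  return Tier::VidMem;
    return Tier::SysMem;
}

int main() {
    printf("light accumulation -> %d\n", static_cast<int>(Place(150.0))); // eDRAM
    printf("CPU particle sim   -> %d\n", static_cast<int>(Place(40.0)));  // video mem
    printf("audio streaming    -> %d\n", static_cast<int>(Place(2.0)));   // system mem
    return 0;
}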

Provided enough flexibility, any amount of eDRAM could be a good thing, with more being better of course. With the restrictions it put on the Xbox 360, it hindered more than it helped.
 
Would this kind of architecture make sense?:

RAM <-> GPU <-> ED-RAM* <-> CPU <-> RAM

*Or equivalent

Would it be beneficial to have the eDRAM used as a pass-through for data both from the GPU back to the CPU and vice versa? That way both the CPU and GPU could benefit from the additional bandwidth as needed, depending on the program's architecture.
 
At that point, you might be concerned about the I/O interfaces on the three chips and the impact on future die reductions/combining.

I'm not sure what the implications there are for latency or avoiding contention.

Actually... would it be possible to do this then:
Code:
                  RAM
                   ||
GPU <-> eDRAM <-> CPU
Not sure how that'd work out for passing information back and forth, i.e. the memory controller config.
 
Provided enough flexibility, any amount of eDRAM could be a good thing, with more being better of course. With the restrictions it put on the Xbox 360, it hindered more than it helped.
Hmmm. Could you give examples of how you'd use eDRAM at different amounts, assuming bandwidth is always in excess? Let's say 10 MB, 32 MB and 100 MB. Unless rendering in tiles, I'd have thought you'd need a minimum amount to fit a render target.
 
Sure, you need to fit the entire render target. Or I suppose it would be possible to put a partial render target in eDRAM and the rest in video RAM, but I wouldn't expect that sort of flexibility. Still, let's say you run 1080p and deferred rendering, with AA through FXAA/MLAA/whatever. One g-buffer render target then takes 7.9MB, so even with only 10MB you could at least fit one. With 32MB you could fit four, so either three buffers plus the depth buffer, or four buffers with the depth buffer in video memory. Or you put two buffers in eDRAM and the rest in video memory and then put the FP16 light accumulation buffer in eDRAM. Or you can put the shadow maps in eDRAM, especially if you want many light sources with shadows. Or you put a color correction volume lookup texture in eDRAM because it has such an irregular access pattern that it's constantly thrashing the texture cache; backed by eDRAM it could act as a large higher-level cache. There are many possibilities.
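For reference, the arithmetic behind those numbers is nothing more than pixel counts (no particular hardware assumed):
Code:
// Quick check of the figures above: size of one 32-bit 1080p render target
// and how many fit in a given eDRAM budget. Pure arithmetic, no API assumed.
#include <cstdio>

int main() {
    const double bytes = 1920.0 * 1080.0 * 4.0;       // RGBA8 / D24S8: 4 B per pixel
    const double mib   = bytes / (1024.0 * 1024.0);   // ~7.91 MiB, the "7.9MB" above
    printf("one 1080p 32-bit target: %.2f MiB\n", mib);
    printf("fit in 10 MiB: %d\n", static_cast<int>(10.0 / mib));   // 1
    printf("fit in 32 MiB: %d\n", static_cast<int>(32.0 / mib));   // 4
    // An FP16 RGBA light accumulation buffer (8 B per pixel) would be ~15.8 MiB.
    printf("FP16 RGBA target: %.2f MiB\n", 2.0 * mib);
    return 0;
}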
 
AlStrong said:
Actually... would it be possible to do this then:
I'm not entirely clear about the diagram; are you suggesting a GPU with no access to RAM? That's pretty much the PS2, then.
 
I'm not entirely clear about the diagram; are you suggesting a GPU with no access to RAM? That's pretty much the PS2, then.

Hm... I was thinking of a unified memory space with the CPU and GPU both having access to the eDRAM.
 
Hm... I was thinking of a unified memory space with the CPU and GPU both having access to the eDRAM.

So what you are proposing is something similar to what the 360 already had, just with more flexibility and CPU access?
In your scenario, would the eDRAM be integrated right onto the GPU, giving the whole GPU the bandwidth, with some kind of link to the CPU, or vice versa?

Surely not another daughter-die scenario?
 
So what you are proposing is something similar to what the 360 already had, just with more flexibility and CPU access?
The 360's CPU has no access to the eDRAM; the eDRAM is part of the GPU memory system only.

Al's topology would have to be either:
Code:
     --- RAM ---
    /           \
CPU-             -GPU
    \           /
     -- eDRAM --
or
Code:
      RAM 
       |   
CPU--eDRAM--GPU
or
Code:
    eDRAM 
      |   
CPU--RAM--GPU
I don't know what memory controller system could be employed. I'd have thought a unified address space with eDRAM populating the first however many megs would be easiest, but I'm still not convinced eDRAM is a good option anyhow. ;)
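As a minimal sketch of that "first however many megs" mapping, assuming a single flat address space and a placeholder 32 MB of eDRAM:
Code:
// One flat address space; the decode step just checks the address against
// the eDRAM size. The 32 MiB figure is an arbitrary placeholder.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kEDramBytes = 32ull * 1024 * 1024;   // assumed 32 MiB of eDRAM

enum class Backing { EDram, Dram };

Backing Decode(uint64_t addr) {
    return addr < kEDramBytes ? Backing::EDram : Backing::Dram;
}

int main() {
    printf("0x0000000 -> %s\n", Decode(0x0000000) == Backing::EDram ? "eDRAM" : "DRAM");
    printf("0x4000000 -> %s\n", Decode(0x4000000) == Backing::EDram ? "eDRAM" : "DRAM");
    return 0;
}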
 
So what you are proposing is something similar to what the 360 already had, just with more flexibility and CPU access?
In your scenario, would the eDRAM be integrated right onto the GPU, giving the whole GPU the bandwidth, with some kind of link to the CPU, or vice versa?

Surely not another daughter-die scenario?

Another daughter-die scenario might make sense if it's the only cost-effective way to get as much silicon (and fab choice) as they require.

Putting the eDRAM and memory controller on the CPU, and having IBM fab it, while having a "normal" 28nm GPU from TSMC (or wherever), might be an option. The CPU would have lower-latency access to both memory pools and the GPU would need only one bus.

If IBM are designing the CPU then they'll have the most experience/expertise at designing for their own manufacturing process, and it seems reasonable that they'll be doing both CPUs and eDRAM on 32nm (although maybe not in time, or cheaply enough, for a late-2013 console?).
 
I might be misunderstanding something, but isn't it basically impossible, or at least very complicated (= expensive), to have the CPU and GPU have parallel direct access to either eDRAM or RAM? I would think that if the device comes with a unified memory pool, either the CPU or the GPU will be behaving as a memory controller for the other, similarly to the XB360.
 
AlStrong said:
Hm... I was thinking of a unified memory space with the CPU and GPU both having access to the eDRAM.
That's a PSP :cool:
And while it's all single-die, topologically the eDRAM was "on the GPU" (the CPU goes through the main bus to get to any memory).

GameCube "almost" fits - the eDRAM is in the CPU address range, but the GPU sees each pool differently in terms of what data it can read/write to it.
 
I might be misunderstanding something, but isn't it basically impossible, or at least very complicated (= expensive), to have the CPU and GPU have parallel direct access to either eDRAM or RAM? I would think that if the device comes with a unified memory pool, either the CPU or the GPU will be behaving as a memory controller for the other, similarly to the XB360.
The way I understand it, the dilemma is the following: either you have a memory controller that serves all its clients (more or less) symmetrically (classic UMA style), and thus suffers from bus contention among those clients, or you have privileged clients, and the more privileged and 'intimate' one of those clients is to the memory pool, the closer that client gets to an eDRAM scenario (subject to actual bus performance). Apparently you can grant each and every agent in the system access to a given pool, but as their number increases, and/or the more 'fair' you try to be with those clients, the further you'll be from providing an eDRAM-performing pool for any one of them (assuming equal BW and latency needs for each client). The alternative to that is multi-ported eDRAM, and I guess that indeed becomes very expensive; plus, I'm not aware how many read/write ports to a macro you can get before it becomes infeasible. But I'm sure there are knowledgeable posters who can shed light on that.
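A toy model of that trade-off, with made-up numbers: one pool shared by four clients, where giving one client a larger arbitration weight pushes it towards eDRAM-like bandwidth and everyone else away from it:
Code:
// Illustrative only: total pool bandwidth and weights are invented figures.
#include <cstdio>
#include <vector>
#include <numeric>

int main() {
    const double poolBW = 200.0;                               // GB/s, assumed total
    const std::vector<double> weights = {4.0, 1.0, 1.0, 1.0};  // one privileged client
    const double total = std::accumulate(weights.begin(), weights.end(), 0.0);
    for (size_t i = 0; i < weights.size(); ++i)
        printf("client %zu: %.1f GB/s\n", i, poolBW * weights[i] / total);
    // Fully 'fair' arbitration would instead give each of the four clients 50 GB/s.
    return 0;
}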
 
My 2 cents on ESRAM bandwidth fight.


I made a thread in 3D Architectures & Chips.

It was about Bonaire (7790), asking if the GPU was a weak link in GCN, because it has a higher FLOP count than the 7850 yet performs slower, I think up to 20% slower.

One of those who replied told me that the 7850 has a little more bandwidth than it needs, while the 7790 maybe has a little less than it needs.

Which brings me to my point: why would a GPU with 1.31 TF need 192GB/s, let alone 204GB/s?

There is a point where more bandwidth will not serve you; I am sure that is the reason why the 7770 doesn't have a 256-bit bus - 153GB/s would be a waste for it.

I also know that even with lower bandwidth one GPU can beat or match another. The benchmark below uses different hardware, but it raises a good point about effective bandwidth usage.

http://www.anandtech.com/bench/product/550?vs=647

7950 vs 660 Ti

The 660 Ti can actually match and beat the 7950 in some tests, despite having almost 100GB/s less bandwidth than the 7950:

144GB/s for the 660 Ti

240GB/s for the 7950.

Now I know it is not the same hardware, but it shows how the 660 Ti has a serious bandwidth disadvantage and yet stays right there with the 7950.

I think this whole fight is silly; in the end, whether it's 109GB/s or 204GB/s, I am sure either is more than the Xbox One will need to operate well.
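To put those figures in the same units - bytes of bandwidth per FLOP, using only the numbers quoted above (whether any given ratio is "enough" obviously depends on the workload):
Code:
// Back-of-envelope conversion: GB/s of external bandwidth per FLOP of compute.
#include <cstdio>

int main() {
    const double tflops = 1.31;                    // GPU compute figure quoted above
    const double bw[]   = { 109.0, 192.0, 204.0 }; // GB/s figures from the thread
    for (double b : bw)
        printf("%.0f GB/s over %.2f TFLOPs = %.3f bytes per FLOP\n",
               b, tflops, b / (tflops * 1000.0));
    return 0;
}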
 
My 2 cents on ESRAM bandwidth fight.

...

I think this whole fight is silly; in the end, whether it's 109GB/s or 204GB/s, I am sure either is more than the Xbox One will need to operate well.

On PCs, though, the hardware has to fit whatever software exists.

On consoles, the software is fitted to the hardware.

If you had a 3-teraflop GPU with 120 GB/s on PC, it would be strangled and only, I don't know, maybe 1.5 teraflops would actually be used; the rest would be wasted. But on a console people would find ways to use all that compute power, because on console the name of the game is maximizing fixed resources. It still wouldn't be an ideal design, but people would try to use that compute.

In your example, let's say you designed a game only for the 7950 and used all of that 240 GB/s. Then if you tried to run that game on the 660 you'd have problems.

So there's some danger in looking for answers in PC hardware and software, although I think it can serve as a very general guide.
 
Without more disclosure about the architectures concerned, it is tough to compare GPU cards based on external bandwidth alone.

I don't mean there is no disclosure, just that I don't know, nor am I likely to understand, the detailed specs for GCN, Fermi or Kepler.

If I look at something like the GTX640 / GK208 (so the "new" one), I could think that higher-end cards (Nvidia or AMD) might have texture units and ROPs "in excess" with respect to the resolution the actual user may run at.
Going by the same line of thinking, Bonaire actually looks more balanced (way more) to me than the HD 7850. It is too bad that AMD can't play with the number of texture units as Nvidia does; they might save some die space (and so money/margins) with performance remaining mostly unchanged.

To me it looks like the XB1 should indeed experience no shortage of bandwidth (on- or off-chip). It was fine with ~70GB/s off-chip and 100GB/s on-chip; now, with the increase in on-chip bandwidth, that is even truer.

I hope that AMD tries to mimic Durango for its own APUs. I see no reason for them to pack more than 4 CPU cores in their next APUs; Kaveri already carries 8 CUs, and I see no reason to increase that amount, as the APU (without resorting to GDDR5) is already completely bandwidth starved.
Once the 22nm process is available they should try to incorporate a lot of SRAM on the chip and have the drivers to deal with it.
Having the CPU use it (with existing software) would be another matter.
 
I was wondering if someone could help me understand what TRAM is...

Basically I'm a developer and I'm getting my head around the new features in DirectX 11.2, particularly Direct2D and its new "Block Compression DDS" feature (video at the 37-minute mark onwards - http://channel9.msdn.com/Events/Build/2013/3-191 ).

Basically DX11.2 allows developers to save up to 80% of disk footprint, as well as get improved GPU resource utilization and faster GPU load times; e.g. an 8MB image asset can be reduced to a 0.9MB DDS.

All these DDS compressed textures are natively compressed and handled by the HW, i.e. it's not a software-based compression solution.
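For what it's worth, the standard BCn block-compression rates roughly line up with that 8MB -> ~1MB figure; whether the DX11.2 feature is plain BCn or adds further packing on top is just my guess, not something stated in the talk or the patent:
Code:
// Sizes of an example 2048x1024 image, uncompressed versus standard BCn rates.
#include <cstdio>

int main() {
    const double pixels = 2048.0 * 1024.0;   // example image, chosen to give 8 MB raw
    const double mb = 1024.0 * 1024.0;
    printf("RGBA8 (4 B/px):   %.1f MB\n", pixels * 4.0 / mb);   // 8.0 MB uncompressed
    printf("BC1   (0.5 B/px): %.1f MB\n", pixels * 0.5 / mb);   // 1.0 MB, ~8:1
    printf("BC7   (1 B/px):   %.1f MB\n", pixels * 1.0 / mb);   // 2.0 MB, ~4:1
    return 0;
}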

Anyway, I tried to trace the Microsoft patent that covers this feature to see if it gave clues on how it is implemented at the HW level, and I believe I may have found it: "High dynamic range texture compression" - https://www.google.com/patents/US20...X&ei=nMYoUp6oN6aTigezxYDQDQ&ved=0CEkQ6AEwAzgo

Now I believe I understand how the patent defines this feature: basically there's a special "texture memory" that holds the DDS, and that is what is accessed by the graphics processors (Figure 1, label 156, is the texture memory) - https://patentimages.storage.googleapis.com/US20120242674A1/US20120242674A1-20120927-D00000.png

This 156 "Texture Memory" holds the compressed texture (DDS) and is accessible from the CPU/GPU and other co-processors etc.

It's defined in the patent as follows:

the memory management unit 162 may read the decompressed textures 164 from a texture memory 156 to facilitate real-time rendering. The texture memory 156 may be specialized RAM (TRAM) that is designed for rapid I/O, facilitating high performance processing for the GPU 154 in rendering images, including 3-D images, from the decompressed textures 164.
Now, I assumed it was eSRAM, much like what we have in the Xbox One on die in the main SoC... BUT what is this TRAM? Is it just a form of eSRAM?

Anyway, sorry for the long-winded question, BUT I thought I'd ask the experts here at Beyond3D; it's clear there are many here who know what they're talking about ;)
 