The ESRAM in Durango as a possible performance aid

So, yeah. I've been wondering about this for a while, spurred by some of ERP's posts and a post or two by fafalada (a programmer who posts on NeoGAF and here as well, I believe).

The crux seems to be that the low latency of the ESRAM can help the GPU be more efficient.

Of course I'm not a programmer and don't know that much about tech, so hopefully some others can weigh in.

Basically, what kind of cool things could be done PS2 style with the ESRAM?
 
To start, I have read a post somewhere on here that Nvidia GPUs often outperform AMD ones per flop because they have lower latency caches. I did some rough research and, comparing some Nvidia GPUs to AMD GCN ones (GCN only, because VLIW would have skewed the efficiency too much), I calculated that the GTX 660 seems about 18% more efficient per flop than Southern Islands (you have to be careful here and account for average boost clocks, mind you). But the real star was Fermi: I calculated that the GTX 570 (rated at 1.41 TFLOPS), compared to the 1.8 TFLOPS 7850, was 12% faster at 1080p (570 > 7850) according to the TechPowerUp performance summary across a suite of games. So the 570 is about 43% faster than the 7850 per flop! I specifically chose the 570 because it features near identical memory bandwidth to the 7850. Is that because of low latency cache, or why else are Fermi and, to a lesser extent, Kepler so much more efficient?
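To show my working, here's the rough arithmetic as a quick Python sketch; the TFLOPS ratings and the 12% figure are just the approximate numbers quoted above, nothing more precise than that:

```python
# Rough per-flop comparison, GTX 570 vs HD 7850, using approximate paper specs
# and the TechPowerUp relative 1080p performance figure quoted above.
gtx570_tflops = 1.41     # GTX 570 peak single-precision TFLOPS (approx.)
hd7850_tflops = 1.80     # HD 7850 peak single-precision TFLOPS (approx.)
relative_perf = 1.12     # GTX 570 measured ~12% faster at 1080p

# How much more real-world performance the 570 gets out of each theoretical flop.
per_flop_advantage = relative_perf * hd7850_tflops / gtx570_tflops - 1
print(f"GTX 570 per-flop advantage: {per_flop_advantage:.0%}")   # ~43%
```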
 
Forgive the ignorance of the question, but hasn't this already been discussed in the tech threads, with the conclusion that the possible efficiency gains were so low they didn't really matter? As in, it was already running about as well as could be expected, so any gains would be minimal and relatively inconsequential?
 
To start, I have read a post somewhere on here that Nvidia GPUs often outperform AMD ones per flop because they have lower latency caches.
SiSoft has done GPU latency measurements for various AMD/Intel/Nvidia GPUs. It's an interesting read: http://www.sisoftware.net/?d=qa&f=gpu_mem_latency. Unfortunately the charts do not include GCN based Radeons (7000 series). Fermi was the first Nvidia GPU to have a proper cache hierarchy, and GCN is the first AMD architecture to have one. It would be interesting to see both of them compared (and Kepler too).

How much does latency affect performance? It depends entirely on data access patterns and slackness (how many extra threads you have ready to run if the current ones stall). Hiding latency is rather easy in traditional rendering tasks (access patterns are simple -> cache hit ratio is high + shaders have a low GPR count -> you have lots of threads to run if a subset of them stalls). Memory latency is more important for compute tasks (more complex access patterns + higher GPR count in shaders = less slackness).
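As a toy illustration of the slackness point (all numbers invented, Little's-law style hand-waving rather than real hardware behaviour):

```python
# Toy latency-hiding model: how many wavefronts need to be resident on a SIMD
# so that others can keep the ALUs busy while one waits on a cache miss.
# All numbers are invented for illustration.
def wavefronts_needed(miss_latency_cycles, alu_cycles_between_misses):
    # While one wavefront stalls, the others must cover the wait with ALU work.
    return 1 + miss_latency_cycles / alu_cycles_between_misses

# Typical pixel shader: high hit rate, long stretches of math between misses.
print(wavefronts_needed(400, 200))   # ~3 wavefronts is enough

# Scattered-access compute kernel: frequent misses, little math between them.
print(wavefronts_needed(400, 25))    # ~17 wavefronts needed

# If high GPR usage caps residency at, say, 4-8 wavefronts per SIMD, the second
# case stalls regularly, and cutting the miss latency is the lever that's left.
```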
 
I was wondering about that too lately. Is ESRAM something that will bring benefits to Durango compared to the PS4, like eDRAM did (lots of higher-res particles, for example), or is the ESRAM simply there to aid the overall bandwidth of the system, with no advantages at all over the PS4 architecture?

I guess it's too early to say any of that with certainty, but it'd be cool if any of the guys in here shed some light on this.
 
What puzzles me is why such anemic bandwidth for the 32 MB of ESRAM? Does being limited to 102 GB/s of bandwidth have to do with it being very low latency? When I heard rumors of eDRAM/ESRAM in X720, I had fantasies of 256 GB/s bandwidth or greater, all with lower latency than main RAM. So why such a low bandwidth?
 
Isn't HSA supposed to improve efficiency? I have read that both APUs (PS4 and Durango) will have similar improvements.

More than one thing can improve efficiency. Whether RAM latency does or not, I'm unsure, but that's really the subject of the thread, I think.
 
Isn't HSA supposed to improve efficiency? I have read that both APUs (PS4 and Durango) will have similar improvements.
The OP is asking about lower latency RAM access. That's the topic.

I'm not suitably well informed to know the answer, but I expect latency not to make a huge difference. In GCN, a wavefront is presumably prepared ahead of the ALUs needing it (prefetch textures and line up instructions in the pipe). Everything should be working from caches, same as a CPU, and the occasions when the caches fail should be fairly few and far between, especially with GPU workloads. Latency only impacts your processing if you want data and it's not to hand, and then you have to wait for it, which means your data management is failing.
 
The OP is asking about lower latency RAM access. That's the topic.

I'm not suitably well informed to know the answer, but I expect latency not to make a huge difference. In GCN, a wavefront is presumably prepared ahead of the ALUs needing it (prefetch textures and line up instructions in the pipe). Everything should be working from caches, same as a CPU, and the occasions when the caches fail should be fairly few and far between, especially with GPU workloads. Latency only impacts your processing if you want data and it's not to hand, and then you have to wait for it, which means your data management is failing.

Yes, it largely comes down to how good the caches are and how much pending work there is that can be done.
It's hard to know how much effect the lower latency would have. It's my impression that even in the best performing PC games, GPU utilization isn't very high overall (I've heard numbers like 40-50% thrown around), but that's for many different reasons.
For example, the ALUs are generally massively underutilized when processing vertex wavefronts, and that has nothing to do with latency. It's one of the reasons Sony has talked about using compute to process vertices.
Obviously the ALUs are also massively underutilized when rendering shadows or other simple geometry; again, reduced latency doesn't help you there either.
What you'd need to know is how much of the underutilization is due to cache misses, and how much having the resource in low latency memory reduces that. I doubt most PC GPUs have enough performance counters to even answer the first part of that question.
You could get a bullshit estimate by replacing every texture in a game with a 1x1 to effectively eliminate cache misses, but that doesn't answer the question in isolation, as it also greatly reduces the bandwidth being consumed.
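To put made-up numbers on that (purely illustrative, not measurements), here's why the overall figure can sit around 40-50% even before you ask how much cache misses contribute:

```python
# Toy frame breakdown: (share of frame time, ALU utilization in that phase).
# Every number is invented purely to illustrate the argument above.
phases = {
    "shadow passes (trivial shaders)": (0.20, 0.10),
    "vertex-heavy geometry":           (0.15, 0.30),
    "main shading passes":             (0.50, 0.65),
    "post-processing":                 (0.15, 0.60),
}

overall = sum(share * util for share, util in phases.values())
print(f"Overall ALU utilization: {overall:.0%}")        # ~48% with these numbers

# Even if lower latency lifted the main shading passes from 0.65 to 0.75,
# the overall number only moves by about five points.
phases["main shading passes"] = (0.50, 0.75)
overall = sum(share * util for share, util in phases.values())
print(f"With better latency hiding: {overall:.0%}")     # ~53%
```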
 
I'd prefer the ESRAM on an interposer with the SoC, like the eDRAM in the current Xbox 360, using the local memory channel from the GPU.
 
What puzzles me is why such anemic bandwidth for the 32 MB of ESRAM? Does being limited to 102 GB/s of bandwidth have to do with it being very low latency? When I heard rumors of eDRAM/ESRAM in X720, I had fantasies of 256 GB/s bandwidth or greater, all with lower latency than main RAM. So why such a low bandwidth?

That is a bit curious, and Mark Cerny's interview with Gamasutra included something of a hip check at that figure when he suggested that had Sony gone for embedded memory they would have used something exponentially faster, in the 1TBps range. That would have been more like the PS2 where the VRAM is 15 times as fast as main memory.

It's also important to note the ESRAM in Durango isn't a cache, so it's up to the programmer to make sure data is there and not off chip in the main DDR3 pool in the first place for any latency advantage to manifest, and there will be trade offs in terms of where you're storing your framebuffer, etc. If you keep it in ESRAM to avoid saturating the main memory bus, you limit the utility of the ESRAM for compute, and if you store your framebuffer in the DDR3 you could end up limiting the kind of frame buffer effects you can manage. I don't know, maybe you could earmark a few megabytes and write a software caching algorithm SPE style, but will the benefits actually justify that kind of effort, or would anyone even bother?
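For a rough sense of that trade-off, a back-of-the-envelope sketch; the per-pixel byte counts, overdraw and pass counts below are assumptions picked purely for illustration, with only the 68.3 GB/s DDR3 figure coming from the rumoured specs:

```python
# Back-of-the-envelope render-target traffic at 1080p/60.
# All per-pixel byte counts below are assumptions picked for illustration.
pixels_per_second = 1920 * 1080 * 60

gbuffer_bytes  = 16   # assumed 4-target G-buffer write
depth_bytes    = 8    # depth write plus a later read
lighting_bytes = 24   # G-buffer read + HDR target write
blend_bytes    = 48   # colour read+write+depth read at ~4x particle overdraw
post_bytes     = 24   # a few full-screen post passes, read + write

per_pixel = gbuffer_bytes + depth_bytes + lighting_bytes + blend_bytes + post_bytes
traffic_gb_s = pixels_per_second * per_pixel / 1e9

ddr3_bw = 68.3   # rumoured Durango DDR3 bandwidth, GB/s
print(f"Render-target traffic: ~{traffic_gb_s:.0f} GB/s "
      f"({traffic_gb_s / ddr3_bw:.0%} of the DDR3 bus, before any texture reads)")
```

With numbers in that ballpark, keeping the render targets in ESRAM frees up a decent chunk of the DDR3 bus for textures and CPU traffic, but then those same 32 MB aren't sitting around as a low latency scratchpad for compute, which is the trade-off above.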
 
That is a bit curious, and Mark Cerny's interview with Gamasutra included something of a hip check at that figure when he suggested that had Sony gone for embedded memory they would have used something exponentially faster, in the 1TBps range. That would have been more like the PS2 where the VRAM is 15 times as fast as main memory.


This is what Cerny said:

"I think you can appreciate how large our commitment to having a developer friendly architecture is in light of the fact that we could have made hardware with as much as a terabyte of bandwidth to a small internal RAM, and still did not adopt that strategy."

It's just an example of how the PS4 could have been if Sony had prioritized bandwidth over memory amount and opted for an "unfriendly" architecture.

This is what he says about eDRAM:

"The memory is not on the chip, however. Via a 256-bit bus, it communicates with the shared pool of RAM at 176 GB per second.
One thing we could have done is drop it down to 128-bit bus which would drop the bandwidth to 88 gigabytes per second, and then have eDRAM on chip to bring the performance back up again...
We did not want to create some kind of puzzle that the development community would have to solve in order to create their games. And so we stayed true to the philosophy of unified memory"
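The bus arithmetic in that quote is straightforward to check (GDDR5 at an effective 5.5 Gbps per pin, which is what 176 GB/s over 256 bits implies):

```python
# GDDR5 bandwidth = bus width in bytes x effective data rate per pin.
def gddr5_bandwidth_gb_s(bus_bits, gbps_per_pin):
    return bus_bits / 8 * gbps_per_pin

print(gddr5_bandwidth_gb_s(256, 5.5))   # 176.0 GB/s -- the PS4 figure Cerny quotes
print(gddr5_bandwidth_gb_s(128, 5.5))   #  88.0 GB/s -- the narrower option he rejected
```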
 
What puzzles me is why such anemic bandwidth for the 32 MB of ESRAM? Does being limited to 102 GB/s of bandwidth have to do with it being very low latency?
The only reason I can think of is that having a very fast eDRAM pool would require the interface to the actual DRAM banks to be extremely wide, since apparently DRAM itself won't clock very high. I.e., GDDR5 DRAM cores run at a couple hundred MHz at most with wide read/write ports on-chip, and extra logic then funnels the data to the narrow, fast interconnect to the memory controller... from what I understand of it.

And a very wide on-chip interface would mean a tremendous amount of wiring on that chip, which on an already very complicated ASIC could be troublesome to route, perhaps. We're up to as much as 11 metal layers on a high-end microprocessor already, would a 2-4kbit wide DRAM interface push that even further?

...That's the only reason I can think of off-hand, anyway. I.e., a "weaker" eDRAM implementation would be easier, and thus less costly to implement/build. Or else MS is playing down the importance of multisample antialiasing (in favor of post-process blurring, for example), and extra bandwidth is thus simply considered unnecessary. But that also implies a cost saving measure...
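Some quick arithmetic on why the on-chip width balloons, assuming the ESRAM moves a fixed number of bytes per cycle at the rumoured ~800 MHz GPU clock:

```python
# On-chip interface width needed for a given bandwidth at a given clock.
# The 102.4 GB/s and 800 MHz figures are the rumoured ones; the rest is arithmetic.
def interface_width_bits(bandwidth_gb_s, clock_ghz):
    bytes_per_cycle = bandwidth_gb_s / clock_ghz
    return int(bytes_per_cycle * 8)

print(interface_width_bits(102.4, 0.8))    #  1024 bits for the rumoured ESRAM figure
print(interface_width_bits(256.0, 0.8))    #  2560 bits for a 256 GB/s "fantasy" pool
print(interface_width_bits(1000.0, 0.8))   # 10000 bits for Cerny's ~1 TB/s example
```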

*shrug*

Hopefully we get some answers at/around the 21st (hopefully without too much PR spin, but maybe that's too much to hope for). Anyway, it's not THAT far away now. *takes a deeeeeeep breath*
 
What puzzles me is why such anemic bandwidth for the 32 MB of ESRAM? Does being limited to 102 GB/s of bandwidth have to do with it being very low latency?
IMO it's a Nintendo-like design decision...

Sony mentioned they could have had a TB/s eDRAM ... and really that's what you need for it to be interesting.
 
Rather than anything else, it just seems like a way of having a lot of cheap memory while maintaining some degree of performance.

So it's a performance aid over just having DDR3, not special sauce.

Though it'd be interesting to see the latency figures for ESRAM vs DDR3 and GDDR5.
I was asking in the Orbis technical thread and the answer was that the difference in DDR3 vs GDDR5 latency was not significant, but I never got actual figures.
 
AMD has a couple of patents on embedding memory into a GPU, and it's not related to cost saving.

http://appft1.uspto.gov/netacgi/nph...&RS=(AN/"advanced+micro+devices"+AND+embedded)

http://appft1.uspto.gov/netacgi/nph...&RS=(AN/"advanced+micro+devices"+AND+embedded)

I haven't fully read the patents, but one is basically a way to facilitate a high bandwidth and low latency memory system, and a way to use embedded memory to allow GPUs in a multi-GPU configuration to create a large unified on-chip memory pool.

The second seems to refer to a more efficient approach for when wavefronts output data needed by subsequent wavefronts in GPGPU-based tasks. I'm guessing in this case it's only applicable when the L2 isn't enough.

IBM has this patent where the I/O interconnect is used to inject data into caches to avoid the latency of main memory.

http://www.google.com/patents/US20090157977

Here is an AMD patent for using a split memory system for AA.

http://www.google.com/patents/US201...a=X&ei=XtyGUc_nLKnb4AOFsYCACw&ved=0CDoQ6AEwAQ
 
I was asking in the Orbis technical thread and the answer was that the difference in DDR3 vs GDDR5 latency was not significant, but I never got actual figures.
http://forum.beyond3d.com/showpost.php?p=1714988&postcount=132
Actually, if you look it up, the latencies of usual DDR3 modules measured in nanoseconds are in the exact same range as those of GDDR5. They use the same memory cells after all; just the interface is a bit different. Or to put a number on it, GDDR5 @ 6 Gbps may run at a CAS latency of 17 cycles (the 1.5 GHz ones; I only looked it up for one series of Hynix, and at 1 GHz it supports a CAS latency of 12 cycles), which equates to 11.3 ns. DDR3-2133 rarely comes at latencies below 11 cycles (the 1066 MHz ones), which would be 10.3 ns. Higher latencies for DDR3 are actually common. t_RP, t_RCD and t_RAS are also basically the same when measured in ns (10-12 ns for t_RP and t_RCD, 28 ns for t_RAS). The Hynix GDDR5 I was looking at actually supports latencies of 11.3-10-12-28 when expressed in nanoseconds in the usual order. In cycles @ 1066 MHz the closest would be 12-11-13-30.
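The cycles-to-nanoseconds conversion behind those numbers, using the timings from the post above (ns = cycles / command clock in GHz):

```python
# CAS latency in nanoseconds = cycles / command clock (GHz).
def cas_ns(cycles, command_clock_ghz):
    return cycles / command_clock_ghz

print(f"GDDR5 6 Gbps, CL17 @ 1.5 GHz:  {cas_ns(17, 1.5):.1f} ns")    # ~11.3 ns
print(f"GDDR5 4 Gbps, CL12 @ 1.0 GHz:  {cas_ns(12, 1.0):.1f} ns")    # ~12.0 ns
print(f"DDR3-2133, CL11 @ 1.066 GHz:   {cas_ns(11, 1.066):.1f} ns")  # ~10.3 ns
```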
 