I know this question may sound very newbie.
How can the main RAM be having all these contention problems from the CPU and GPU?
At a fundamental level, any device has physical constraints on its behavior, and the DRAM bus and DRAM arrays come with very significant restrictions and latencies.
DRAM favors highly linear accesses and long stretches of pure reads or pure writes. There are historical reasons for this, like cutting costs by making read and write traffic share the same wires, and prioritizing array density to the point that the DRAM arrays themselves have not gotten faster in years.
Access a bank that hasn't had enough time to get ready, or force the bus and the DRAM to switch from reading to writing, and you start getting long stretches with no memory traffic at all.
A single switch from read mode to write mode on GDDR5, for example, costs roughly as much time as 28 or so data transfers.
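To put that in bytes, here is a minimal sketch, assuming the whole 256-bit GDDR5 bus turns around together so that one transfer slot moves 32 bytes (the 32-byte width is my assumption for illustration; the 28-transfer figure is from above). The per-turnaround cost works out to on the order of 900 bytes of traffic that never happens:

```c
/* Back-of-the-envelope cost of one read->write turnaround on GDDR5. */
#include <stdio.h>

int main(void) {
    const int lost_transfers     = 28;  /* transfer slots idled by the turnaround (from the text) */
    const int bytes_per_transfer = 32;  /* 256-bit bus moves 32 bytes per transfer (assumption)   */

    printf("~%d bytes of bus time lost per read->write switch\n",
           lost_transfers * bytes_per_transfer);
    return 0;
}
```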
The oft-quoted figures of memory operations taking hundreds of cycles are best-case numbers, valid when the DRAM is actually ready to respond to the CPU's request. Get the access pattern wrong, and the DRAM wastes more time getting everything ready than actually sending data.
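Here is a minimal sketch of that effect (assuming a POSIX system for clock_gettime; not a rigorous benchmark). It reads the same 256 MiB buffer once sequentially and once with a 16 KiB stride; the strided walk wastes most of every cache-line fill and keeps forcing new DRAM rows open, so it typically lands at a fraction of the sequential speed even though it touches exactly the same data:

```c
/* Sequential vs. strided reads over a buffer much larger than the caches. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)  /* 64M ints = 256 MiB, far larger than any cache */

static double walk(const int *a, size_t stride) {
    struct timespec t0, t1;
    volatile long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < stride; s++)        /* cover every element exactly once */
        for (size_t i = s; i < N; i += stride)
            sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;  /* touch the pages up front */

    double seq = walk(a, 1);     /* linear: friendly to prefetchers and DRAM rows */
    double str = walk(a, 4096);  /* 16 KiB stride: hops across pages and banks    */
    printf("sequential: %.3f s, strided: %.3f s (%.1fx slower)\n",
           seq, str, str / seq);
    free(a);
    return 0;
}
```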
It is difficult to get this right; even the best CPU memory controllers, in fairly favorable benchmarks, fall tens of percent short of peak bandwidth. In practice, CPUs don't commonly consume that much bandwidth anyway, thanks to their caches and to code that doesn't need a memory access every cycle. They tend to care more about getting smaller amounts of data quickly.
GPUs are generally better at using large amounts of bandwidth, and they do it by accepting long latencies and keeping a massive number of accesses in flight. Throw enough accesses at the memory system over time, and a favorable pattern tends to fall out of it.
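A rough way to see why so many accesses need to be in flight is Little's law: sustained bandwidth is roughly the number of bytes outstanding divided by the latency of each access. A sketch with illustrative numbers (the 176 GB/s peak corresponds to a 256-bit GDDR5 bus at 5.5 Gbps; the 300 ns latency and 64-byte access size are assumptions for the arithmetic):

```c
/* Little's law sketch: to sustain B bytes/s at latency L seconds,
 * roughly B * L bytes must be outstanding at all times. */
#include <stdio.h>

int main(void) {
    const double bandwidth_Bps = 176e9;   /* 256-bit GDDR5 at 5.5 Gbps (assumed peak) */
    const double latency_s     = 300e-9;  /* assumed end-to-end memory latency        */
    const double request_bytes = 64.0;    /* one cache-line-sized access              */

    double in_flight_bytes    = bandwidth_Bps * latency_s;
    double in_flight_requests = in_flight_bytes / request_bytes;
    printf("~%.0f KB (~%.0f requests) must be in flight to sustain peak bandwidth\n",
           in_flight_bytes / 1e3, in_flight_requests);
    return 0;
}
```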
If both clients are going full-bore, the CPU takes a latency hit because the GPU is occupying the bus some of the time, and the GPU loses bandwidth because the CPU's insistence on low latency forces its long stretches of traffic to be cut off prematurely.
And because it's generally a bad idea for the GPU and CPU to touch the exact same portions of memory, the DRAM loses even more time hopping between distant regions of the address space, which means more dead cycles.
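As a purely hypothetical illustration of that interleaving cost: if every switch between the GPU's long bursts and other traffic costs a fixed turnaround penalty, bus efficiency falls off quickly as the bursts get chopped shorter. This is a toy model, not how any real memory controller is specified:

```c
/* Toy model of bus efficiency when long GPU bursts are interrupted.
 * All numbers are illustrative. */
#include <stdio.h>

int main(void) {
    const double switch_cost = 28.0;                  /* transfer slots lost per switch (from the text) */
    const double bursts[]    = { 256.0, 64.0, 16.0 }; /* hypothetical GPU burst lengths, in transfers   */

    for (int i = 0; i < 3; i++) {
        double efficiency = bursts[i] / (bursts[i] + switch_cost);
        printf("burst of %3.0f transfers -> ~%2.0f%% of peak bandwidth\n",
               bursts[i], efficiency * 100.0);
    }
    return 0;
}
```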
How can they possibly be using so much memory bandwidth on the CPU side?
A Jaguar core can read 16 bytes per cycle for its SSE/AVX operations. With 8 of them at 1.75 GHz, that's 224 GB/s before counting any writes. It is possible to write code that spams memory traffic like that, but it is not a compelling scenario for a low-power CPU to target.
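The arithmetic behind that figure, using only the numbers above:

```c
/* Peak read demand of 8 Jaguar cores, each issuing one 16-byte load per cycle. */
#include <stdio.h>

int main(void) {
    const double bytes_per_cycle = 16.0;    /* one 128-bit SSE/AVX load per core per cycle */
    const double cores           = 8.0;
    const double clock_hz        = 1.75e9;

    printf("peak read demand: %.0f GB/s\n", bytes_per_cycle * cores * clock_hz / 1e9);
    return 0;
}
```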
The assumption is that the caches reduce the need for external bandwidth, and the actual northbridge bus that connects the cores to memory is significantly narrower and likely clocked at a lower multiplier than the cores, most likely for power and complexity reasons. That is what leads to the roughly 30 GB/s limit on coherent bandwidth.
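A quick consequence of those two numbers: for 224 GB/s of potential demand to fit through a roughly 30 GB/s coherent path, the caches have to absorb something like 85-90% of the traffic. A sketch of that ratio, using only the figures above:

```c
/* How much traffic the caches must absorb for core demand to fit the coherent path. */
#include <stdio.h>

int main(void) {
    const double core_demand_GBps = 224.0;  /* peak read demand from all 8 cores (above) */
    const double coherent_GBps    = 30.0;   /* coherent bandwidth limit (above)          */

    double external_fraction = coherent_GBps / core_demand_GBps;
    printf("at most ~%.0f%% of that demand can go off-chip;\n"
           "the caches have to satisfy the other ~%.0f%%\n",
           external_fraction * 100.0, (1.0 - external_fraction) * 100.0);
    return 0;
}
```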