The pros and cons of eDRAM/ESRAM in next-gen

Strange · Feb 20, 2014

Goodtwin said:
Its hard to believe that the memory is such a problem for developers. In that DF article about the Wii U, the developer said that memory bandwidth was not an issue on the Wii U, and that console has severely limited bandwidth to the main memory, 12.8GB/s compared to the Xbox One 68GB/s. I would assume the 32MB of sram would work in a similar manner to Wii U's edram, freeing up bandwidth taken by the buffers. Not to mention the fact that the GPU is far more advanced than the Wii U's GPU, so I would think it would have even better texture cache, but perhaps that is not the case. When you take a look at the difference in fidelity between COD Ghost on X1 compared to current gen consoles, its hard to fathom that memory bandwidth is the reason the game cant be rendered in 1080p on X1. Anandtech did an anylsis of the Gamecube GPU back in the day, and spoke of the 2MB of onboard memory for the Zbuffer saving tons of memory bandwidth, the 32 MB of sram should be freeing up enough bandwidth to make the 68GB/s to the main DDR3 memory sufficient. Perhaps the additional shader performance ramps of memory bandwidth requirements significantly, but so far is seems like the Xbox One is under performing for a closed box system. Its spec sheet would lead you to believe it should have little to no trouble running games like Ghost and Battlefield in 1080p, but obviously this is not the case.

In the WiiU's case, apparently the bandwidth wasn't a problem because other issues (CPU) were a even worse bottleneck than the bandwidth.

If you gave the WiiU a much more capable CPU, we probably have no doubt that bandwidth would become an issue, but then, with the better CPU it would probably then have a much more commanding edge over the PS3/360 (and still far behind the XB1 or the PS4) unlike now, and thus it wouldn't really matter that much.

3dilettante · Feb 21, 2014

Neighborhoodcroak said:
The resolution differences is irrelevant of the CPUs. Truth be told if you gave the Xbox 360, PS3, Xbox One, and PS4 OpenGL 2.x APIs then you'll find you can run more draw calls on the last gen consoles.

As already noted, this is unlikely since the consoles operate on a lower level than either major API.
In addition, the low-level access and awareness of storage formats and memory arrangement allow for various tricks and shortcuts that are not visible when hidden by an abstraction layer.
Adding the API to the older consoles may have the effect of requiring more draw calls and fewer optimizations as a result.

Most API calls and drivers are single-threaded and actually the number of CPU GFLOPS dropped from last gen.

Do you believe the process of submitting draw calls is heavily gated by SIMD FP throughput?

On a more relevant note bandwidth on the Xbox One is only a partial bottleneck. The bandwidth is 109 GB/s bidirection (218 GB/s front and back)

It's not symmetric. It's 109 GB/s with pure read or pure write, and 204 GB/s with an ideal access pattern when mixed.

Nick Baker ~"it's rare to be able to achieve that over long periods of time so typically an external memory interface you run at 70-80 per cent efficiency"

One thing to note is that tests have shown that ROPs are good at exceeding the average utilization, at 90% and up, which may have a modest impact here because the eSRAM's utilization with a blending test was shown to be less efficient.

Neighborhoodcroak · Feb 21, 2014

3dilettante said:
As already noted, this is unlikely since the consoles operate on a lower level than either major API.
In addition, the low-level access and awareness of storage formats and memory arrangement allow for various tricks and shortcuts that are not visible when hidden by an abstraction layer.
Adding the API to the older consoles may have the effect of requiring more draw calls and fewer optimizations as a result.

I used OpenGL 2.x to make a apples to apples comparison. The point being tested was the performance of the CPU's in submitting draw calls. Why compare a bloated API such as Direct3D 9 to OpenGl 2.x? I'm fully aware PC gaming APIs are higher level/different than consoles, for very good reasons.

3dilettante said:
Do you believe the process of submitting draw calls is heavily gated by SIMD FP throughput?

Not really. However it takes instruction cycles and validation in the driver so IPC count plays a large role. CISC architecture processors are not well known for their high IPC rates, more particularly AMD's bulldozer and it's mobile offspring. Core per Core I have no doubt the 360's Xenos outperforms the jaguar cores in IPC.
GFLOPS is usually (but not always) a good indication of the relative computation power of a CPU.

3dilettante said:
It's not symmetric. It's 109 GB/s with pure read or pure write, and 204 GB/s with an ideal access pattern when mixed.

You got me there without a doubt. Should've looked at it more

3dilettante said:
One thing to note is that tests have shown that ROPs are good at exceeding the average utilization, at 90% and up, which may have a modest impact here because the eSRAM's utilization with a blending test was shown to be less efficient.

Baker's statement has nothing to do with ROPs. It was the sustained bandwidth for the ESRAM. It's understood ROPs and TMUs reach close to their rated peak values provided neither is starved for bandwidth. Benchmarks (aside 3D Marks pixel fill) shows this.

function · Feb 21, 2014

Neighborhoodcroak said:
Core per Core I have no doubt the 360's Xenos outperforms the jaguar cores in IPC.

I doubt it, didn't MS's own analysis show an average of something appalling like 0.2 IPC per core in 360 games? I doubt a Jaguar core at half the clockspeed has trouble outperforming a Xenon core.

GFLOPS is usually (but not always) a good indication of the relative computation power of a CPU.

Really doesn't seem to apply in the case of Xenon (or Cell for some workloads). Dual core Athlon 64's (with only 64 bit SIMD) were considered a good match for the entire Xenon CPU by Capcom. At the time they said this, the very fastest weren't going beyond 2.6 gHz.

(((interference))) · Feb 21, 2014

In all this doom and gloom about the XB1 being weaker (which was expected anyway); people should also keep in mind that XB1 has an on paper CPU advantage; the cores are clocked 9.3% higher and most audio processing can be offloaded to SHAPE (perhaps saving 0.5 to 1 or more CPU cores - depending on the type of game).

That's something not apparent now, but which might come to bear fruit as the generation progresses (like with devs getting to grips with ESRAM).

Neighborhoodcroak · Feb 21, 2014

function said:
I doubt it, didn't MS's own analysis show an average of something appalling like 0.2 IPC per core in 360 games? I doubt a Jaguar core at half the clockspeed has trouble outperforming a Xenon core.

Really doesn't seem to apply in the case of Xenon (or Cell for some workloads). Dual core Athlon 64's (with only 64 bit SIMD) were considered a good match for the entire Xenon CPU by Capcom. At the time they said this, the very fastest weren't going beyond 2.6 gHz.

Apparently I have to re-evaluate and do more Xenon research then. This information is rather surprising to me.

taisui · Feb 21, 2014

You can have a block of memory across the both the esram and ddr3, it's transparent to the coding. It's probably not a wise thing to do, but it's probably not going to matter until the code stress the memory system.

3dilettante · Feb 21, 2014

Neighborhoodcroak said:
Not really. However it takes instruction cycles and validation in the driver so IPC count plays a large role. CISC architecture processors are not well known for their high IPC rates, more particularly AMD's bulldozer and it's mobile offspring. Core per Core I have no doubt the 360's Xenos outperforms the jaguar cores in IPC.
GFLOPS is usually (but not always) a good indication of the relative computation power of a CPU.

I'm not sure what words to use to indicate how bad the previous gen CPUs were. Jaguar is pretty decent for what it is, and with the exception of certain vector loads it isn't in the same realm as the gimpy cores it replaces.

Baker's statement has nothing to do with ROPs. It was the sustained bandwidth for the ESRAM. It's understood ROPs and TMUs reach close to their rated peak values provided neither is starved for bandwidth. Benchmarks (aside 3D Marks pixel fill) shows this.

ROPs are one of the, if not the most, dominant bandwidth consumers on the chip.
They exceed by far anything the CPU section can muster, either due to the cores or the more limited bandwidth connection they have to memory, and they are frequently coupled with memory controllers in GPUs for a reason. The eSRAM's bus width nicely matches up with the FP16 Z throughput of the ROP section, which is a very happy coincidence if not something more purposeful.

We can only come to an incomplete conclusion about bandwidth utilization if we ignore the things that are supposed to be using it, and falling back to general utilization in that article may have been a little disingenuous.

Rangers · Feb 21, 2014

function said:
Which game and what PC setup are you thinking of?

Exactly my question! This could be true, but I need more evidence.

Offhand I saw a Titanfall bench where a 7770, analogous to One GPU, ran Titanfall at 1080P at average of 36 FPS.

Tons and tons of caveats there, but it fits. Basically One couldn't run Titanfall at 1080 and hope for 60 FPS.

One probably has (much?) more effective GPU bandwidth, but is also likely in a weaker platform (CPU, system RAM), but will then have the advantage of more optimization...how should it all shake out? I dont know...at a glance I'd think maybe they ought to be able to push Titanfall One to 1080P...

HTupolev · Feb 21, 2014

Neighborhoodcroak said:
Apparently I have to re-evaluate and do more Xenon research then. This information is rather surprising to me.

The reason that Xenon is able to be clocked so high on such a large process is that it has lots of pipeline stages. Deep pipelines put a ton of complications on achieving high IPC, some of which can't really be circumvented even in theory. For instance, if you have an op that every subsequent op depends on, you have no choice but to wait for that op to finish before clocking more ops in; the more pipeline stages, the more clocks the hit.
Not that Jaguar has a short pipeline either, but it's doing more to mitigate that.

And Xenon just plain has some quirks which can leave it vulnerable to funky stalls. Google "load hit store" for a fine case of Xenon just not doing as well as most processors.

I wouldn't be terribly surprised if a single Xenon core could beat a single Jaguar core at certain tasks (particularly some computational ones), but Jaguars probably tend to enjoy significantly higher IPC.

Gubbi · Feb 21, 2014

3dilettante said:
ROPs are one of the, if not the most, dominant bandwidth consumers on the chip.
They exceed by far anything the CPU section can muster, either due to the cores or the more limited bandwidth connection they have to memory, and they are frequently coupled with memory controllers in GPUs for a reason. The eSRAM's bus width nicely matches up with the FP16 Z throughput of the ROP section, which is a very happy coincidence if not something more purposeful.

As you say, the ROPs and the bandwidth of the ESRAM is matched perfectly

The low utilization of the ESRAM bandwidth is likely a result of being unable to fit your buffers in there and using main memory to some extent. If you increase your resolution, you increase the chunk held in main ram and ESRAM utilization goes even lower (ie. spending more time on the main memory bus).

It'll take a few years before developers have game engines optimized for the smallish ESRAM. The current crop of fat G-buffer implementations aren't doing XB1 any favours, that's for sure.

It's the reverse of last gen, where Sony launched expensive and spend the entire gen trying to catch up performance-wise (and never quite getting there).

Cheers

Goodtwin · Feb 21, 2014

Strange said:
In the WiiU's case, apparently the bandwidth wasn't a problem because other issues (CPU) were a even worse bottleneck than the bandwidth.

If you gave the WiiU a much more capable CPU, we probably have no doubt that bandwidth would become an issue, but then, with the better CPU it would probably then have a much more commanding edge over the PS3/360 (and still far behind the XB1 or the PS4) unlike now, and thus it wouldn't really matter that much.

I kind of doubt that. From what I have read, CPU's tend to be more latency bound than bandwidth, and with Nintendo's love for large amounts of L2 cache, that should reduce bandwidth requirements of the CPU. I think with the Wii U, seeing as how its a more modern 176Gflop GPU its bandwidth requirements are going to be far less than a 1.2 Tflop GPU powering the X1, and the edram plus the DDR3 memory provides sufficient bandwidth, enough that its not problem. The GPU hits its limits before memory bandwidth becomes a problem.

Inuhanyou · Feb 21, 2014

(((interference))) said:
In all this doom and gloom about the XB1 being weaker (which was expected anyway); people should also keep in mind that XB1 has an on paper CPU advantage; the cores are clocked 9.3% higher and most audio processing can be offloaded to SHAPE (perhaps saving 0.5 to 1 or more CPU cores - depending on the type of game).

That's something not apparent now, but which might come to bear fruit as the generation progresses (like with devs getting to grips with ESRAM).

We're talking about the smart choice of using ESRAM here. And the fact of the matter is the choice between the two competitor's solutions is plain to see because of the scratchpad being too small in size.

If course even if we were talking about CPU's you'd still be incorrect, because we know now by certain benchmarking tools that PS4's CPU is able to be exploited more fully, possibly because more cores are unlocked for games. And Shape is dedicated to Kinect processing, it has nothing to do with "offloading from CPU", your not using that for actual game cycles.

taisui · Feb 21, 2014

given that the 360 titles sometimes renders at 640p (728k pixels) without tiling on the 10MB edram, and that 1080p (2074k pixels), is less than 3X of 640p, I am curious on why people kept saying that 32MB esram is too small for 1080p?

AlNom · Feb 21, 2014

G-buffer, intermediate buffers, shadowmaps, FP16...

Neighborhoodcroak · Feb 21, 2014

Deferred rendering is a very common rendering technique. As AlNets said G-buffers and render targets take a good portion of not only memory footprint but bandwidth. Using MSAA/MSAA rendered surface increases the memory footprint. Because of this it requires trick optimization to utilize as much of the main system RAM without saturating its bandwidth.

A single 8bpp 1920x1080 RGBA framebuffer alone costs 16.6MB uncompressed. Compression will only go so far, put otherwise it can full up very quickly if you're not careful.

taisui · Feb 21, 2014

deferred rendering is not common, most games still use forward rendering, and it's just not gonna fit w/o doing some serious tricks. From what I see, there's is a difference saying that it's a hardware limitation versus software limitation, and I feel it's the later and not the former, though most people might claim otherwise.

On a side node, the 360 edram die contains some AA logic, maybe that's one of the big differences vs. the scratchpad esram.

Shifty Geezer · Feb 21, 2014

taisui said:
deferred rendering is not common, most games still use forward rendering...

Do we have any sort of stats on what renderers games use?

Neighborhoodcroak · Feb 21, 2014

taisui said:
deferred rendering is not common, most games still use forward rendering, and it's just not gonna fit w/o doing some serious tricks.

You got me there, I got caught up in the moment. The Frostbyte Engine (the version in BF3 and probably BF4) is deferred rendered that's for certain.

Do we have any sort of stats on what renderers games use?

You can probably look it up. Obscure games AFAIK don't seem to use it all too much and like taisui said most games actually do use forward rendering.

Well there isn't much use for a g-buffer in forward rendering so Render Targets would be the next memory hog on the list I think.

taisui · Feb 21, 2014

Shifty Geezer said:
Do we have any sort of stats on what renderers games use?

"true" deferred "shading" (fully deferred), we are talking about <20, killzones, metros, stalkers, trines.

deferred "lighting" (lighting stage only), less memory requirement, but more rendering pass, a bit more common but still less than 30, but more recent AAA titles are adapting to this, Halo, Assassins, MGSV, etc. This is partly forward rendering though.

Then rest of the games, are all forward rendering.

It'd be nice to tally up a yearly trending because it's getting more common, though I think deferred lighting is still going to be more common with full deferred.

The pros and cons of eDRAM/ESRAM in next-gen

Strange

3dilettante

Neighborhoodcroak

function

None functional

(((interference)))

Neighborhoodcroak

taisui

3dilettante

Rangers

HTupolev

Gubbi

Goodtwin

Inuhanyou

taisui

AlNom

Moderator

Neighborhoodcroak

taisui

Shifty Geezer

uber-Troll!

Neighborhoodcroak

taisui

Similar threads