The pros and cons of eDRAM/ESRAM in next-gen

Then why bother with caching? Just go to the RAM every time, no?

Ideally, the vast majority of accesses should hit either the L1 or local L2.

Main memory latency is the sum of all the on-die latencies, plus the latency of the memory itself.
As the memory is a standard type, its contribution is relatively constant across all platforms that use it, and there are platforms with vastly lower latencies.
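
To put rough numbers on that (a purely illustrative sketch; these are made-up, order-of-magnitude placeholders, not measured figures for any console), the point is simply that the DRAM device is only the last term in the sum:

```cuda
// Illustrative only: a main-memory access pays for every on-die stage it
// traverses before the DRAM itself responds. All values are placeholder
// magnitudes, not measurements of any real platform.
#include <cstdio>

int main()
{
    const double l1_miss_ns     = 5.0;   // L1 lookup + miss handling
    const double l2_miss_ns     = 15.0;  // L2 lookup + miss handling
    const double fabric_ns      = 20.0;  // on-die interconnect + memory controller queueing
    const double dram_device_ns = 50.0;  // the DRAM itself

    printf("total: ~%.0f ns, of which the DRAM device is only %.0f ns\n",
           l1_miss_ns + l2_miss_ns + fabric_ns + dram_device_ns, dram_device_ns);
    return 0;
}
```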
 
Yeah, I know what you were saying; I'm just being picky about your wording. In any case, both the SRAM and the DRAM are behind the same controller, and I think the architects mentioned that latency is not much of a contributing factor. Also remember that with graphics tasks, where huge amounts of data are being computed, the cache helps when locality is high and on writes, but the nature of GPU operations is simply tolerant of high latency.
 
SRAM always has a latency advantage over DRAM; that's just how it is.

Whether that translates into any actual performance gains in gaming is hard to say, since graphics workloads are highly tolerant of latency.

Sure, when working with nice, predictable, mostly linear accesses, but what about cases where the nature of the work itself can't be easily predicted? Meaning: tasks that have little to do with a normal graphics workload, even though they're done for graphics (like, I dunno, ray tracing a voxel tree XD). Would GPUs still be tolerant of latency then? Or maybe even something simpler like a DOF post-process effect... Are those still mainly ALU-bound rather than memory-bound? I've seen some papers that suggest they are memory-bound, but I have nearly zero experience with them...

Compute shaders are often memory (bandwidth and/or latency) bound. Most CUDA optimization guides talk extensively about memory optimizations, while ALU optimizations are not discussed as much (since ALU isn't usually the main bottleneck for most algorithms on modern GPUs).
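
As a minimal sketch of the kind of thing those guides spend their pages on (generic CUDA, nothing console-specific; the kernel names and the stride are made up), the same trivial arithmetic is or isn't memory-bound purely depending on the access pattern:

```cuda
// Generic illustration: identical ALU work, very different memory behaviour.
__global__ void saxpy_coalesced(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads hit consecutive addresses
    if (i < n)
        y[i] = a * x[i] + y[i];
}

__global__ void saxpy_strided(int n, int stride, float a, const float* x, float* y)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // scattered addresses, poor coalescing
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```

The optimization guides are mostly about turning the second pattern into the first; the ALU side of a kernel rarely needs the same attention.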

If that's the case, why do they keep increasing the ALU count so seemingly disproportionately to everything else? Why not just increase the onboard memory or, I don't know, make the scheduler more akin to a CPU's so they can bring performance up?

And would you say, then, that things like PhysX effects, even more so than graphics, could benefit from memory optimizations?
 
If that's the case, why do they keep increasing the ALU count so seemingly disproportionately to everything else?

There's no easy way to increase bandwidth at the same pace with current technology.

We only got a ~3x memory bandwidth increase in these consoles due to technological limits, while everything else increased much more.
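
Back-of-the-envelope, using the usual public ballpark figures (treat them as approximate):

```cuda
// Host-side arithmetic only; ballpark public figures, approximate.
#include <cstdio>

int main()
{
    const double x360_bw_gbs = 22.4;   // Xbox 360 GDDR3 main memory
    const double xb1_bw_gbs  = 68.0;   // Xbox One DDR3 main memory
    const double x360_gflops = 240.0;  // Xenos, approx.
    const double xb1_gflops  = 1310.0; // Xbox One GPU, approx.

    printf("bandwidth scaling: ~%.1fx\n", xb1_bw_gbs / x360_bw_gbs); // ~3x
    printf("ALU scaling:       ~%.1fx\n", xb1_gflops / x360_gflops); // ~5.5x
    return 0;
}
```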
 
If that's the case, why do they keep increasing the ALU count so seemingly disproportionately to everything else?
Why not just increase the onboard memory or, I don't know, make the scheduler more akin to a CPU's so they can bring performance up?
It keeps a number of workloads from becoming ALU-limited, and ALU capacity is something that can be scaled with less effort and not as much overhead by copy-pasting CUs.
The GPU is still quite capable of utilizing extra resources. The underlying arbitration and control logic underpinning Durango is found in devices ranging from Orbis to Hawaii without needing heavy redesign. While the GPU does heavily emphasize transistor density, its modest circuit speeds are more in line with what can be readily manufactured and fewer heroic measures are needed for it to be acceptable.
The workloads that benefit from this sort of compute are those with high arithmetic density, those that can leverage the specialized fixed-function hardware, or those that would see very high miss rates to main memory from any reasonable on-die memory pool.
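
For a rough sense of what "high arithmetic density" means here (a roofline-style sketch with approximate public figures, not anything from the platform holders): against ~1.31 TFLOPS and 68 GB/s of DDR3, a kernel needs on the order of 19 FLOPs per byte of main-memory traffic before it stops being bandwidth-bound.

```cuda
// Roofline-style break-even point; approximate figures, host-side arithmetic only.
#include <cstdio>

int main()
{
    const double peak_gflops = 1310.0; // ~1.31 TFLOPS, approx.
    const double ddr3_gbs    = 68.0;   // main memory bandwidth, approx.
    printf("break-even: ~%.0f FLOPs per byte of DRAM traffic\n", peak_gflops / ddr3_gbs);
    return 0;
}
```

Anything below that line only keeps the ALUs fed if it hits the caches or, on this design, the eSRAM.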

Bumping the architecture to be more CPU-like would involve more engineering work and have different scaling limits. If we want to know what a more CPU-like architecture would be like given that this is AMD, we can look to the Jaguar cores next to the GPU.

Jaguar had a longer gestation period before it could replace its predecessor, and the design was not fully validated or bug-fixed until Puma. The CPU section needed more work to improve things over Bobcat, especially its L2 cache. The CPU cores are not world-beating performance-wise, and the CPU memory subsystem with the two Jaguar modules is not very good. The GPU handles another order of magnitude of memory traffic, and for its class it is an architecture for which there are a few arguable alternatives. Jaguar does not distinguish itself so well on its own terms.

The GPU is very good at what it does, while an attempt to push it towards Jaguar or Bulldozer promises compromises for what the GPU is good at, with the upside of being a poor Jaguar or Bulldozer, at best.
 
From Lionhead



Compute shader work? Because of low latency?? Is this the first reference?


It's not the first reference to that tweet in this thread, at least.

eSRAM and DRAM accesses were designed to be handled mostly transparently past initial setup, so the architecture doesn't place any barriers based on the kind of shader that is making those accesses.
Vgleaks off-handedly mentioned using a CU to move data from memory to eSRAM at higher bandwidths than a move engine can do on its own, so I would consider it more noteworthy if for some reason developers couldn't use it for GPU compute.

The eSRAM is where most of the system's bandwidth would be found, and it might lower latency by some undisclosed amount.
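
Purely to illustrate what "using a CU to move data" amounts to (a generic CUDA-style sketch; the actual Durango APIs and eSRAM addressing are not public, so this shows the general technique, not the real interface): a wide, grid-strided copy kernel will soak up whatever bandwidth the memory system exposes to it.

```cuda
// Generic GPU copy kernel, illustration only; not the Durango/XB1 API.
// Each thread moves 16 bytes per iteration via vector loads/stores and
// strides by the whole grid, so it scales to any buffer size.
__global__ void copy_buffer(const float4* __restrict__ src,
                            float4* __restrict__ dst,
                            size_t count)  // number of float4 elements
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < count; i += stride)
    {
        dst[i] = src[i];
    }
}
```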
 
This discussion has come up before. But my worry is that the Xbox just won't have the CUs to spare for nice compute.

Kinda OT I guess, but it's the stupidest thing about the design, imo. They wanted it to do all this extra cool stuff (like snapped apps), but didn't build in a surplus of compute power to do it. Even 2-4 extra CUs would have been wonderful.

Well, they claim they have extra compute, I guess *shrug*

Like "hey, we're building this box to take over the world...but we're going to skimp on the actual processing"
 
Sure, when working with nice, predictable, mostly linear accesses, but what about cases where the nature of the work itself can't be easily predicted? Meaning: tasks that have little to do with a normal graphics workload, even though they're done for graphics (like, I dunno, ray tracing a voxel tree XD). Would GPUs still be tolerant of latency then? Or maybe even something simpler like a DOF post-process effect... Are those still mainly ALU-bound rather than memory-bound? I've seen some papers that suggest they are memory-bound, but I have nearly zero experience with them...



If that's the case, why do they keep increasing the ALU count so seemingly disproportionately to everything else? Why not just increase the onboard memory or, I don't know, make the scheduler more akin to a CPU's so they can bring performance up?

And would you say, then, that things like PhysX effects, even more so than graphics, could benefit from memory optimizations?

I think GPUs and CPUs have come closer than ever before, but GPGPU techniques still rely heavily on getting the data to line up nicely so that the GPU can do the FP ops efficiently. It's just different from how a CPU would operate.
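
A concrete (and entirely illustrative, names made up) example of "getting the data to line up nicely": the same per-particle update over an array-of-structures versus a structure-of-arrays layout. On most GPUs the SoA version coalesces into wide memory transactions and the AoS one does not, even though the arithmetic is identical.

```cuda
// Array-of-structures: each thread's load lands 16 bytes away from its neighbour's.
struct ParticleAoS { float x, y, z, pad; };

__global__ void advance_aos(ParticleAoS* p, int n, float dt, float vx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x += vx * dt;
}

// Structure-of-arrays: consecutive threads touch consecutive floats,
// which the hardware coalesces into a few wide transactions.
__global__ void advance_soa(float* x, int n, float dt, float vx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += vx * dt;
}
```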
 
This discussion has come up before. But my worry is that the Xbox just won't have the CUs to spare for nice compute.
Maybe it's "Balanced at 8"

In any case, the eSRAM is taking up die area that would have allowed them to add more CUs.
 
They probably could have added a couple/few more CUs before the 363 mm^2 chip got unwieldy. Then again, the bandwidth would have stayed relatively fixed.

Remember, if nothing else they considered enabling the yield-disabled CUs, which would have put them at 14 with no change to the chip size.
 
Right, just responding to the suggestion that "the eSRAM took the CUs". In a way yes, in a way no.

If they'd had the will, I think they could have easily grafted 2-8 more CUs onto the design (2 just by eating the yield cost; any more would have had to be added well back in the planning stage). Of course, it would likely have been very ALU-heavy, just like Sony's design allegedly is, because there's not really a way to up the bandwidth without a major redesign. This also ignores any possible ROP limitations; I'm only speaking about extra compute.
 
But they wanted to sell it for a profit with Kinect and at a reasonable price (the initial target was apparently $399).

Hence the weak hardware, and yet they still didn't manage to hit that brief, so they ditched Kinect.

Why the XB1 doesn't have extra CUs for compute all comes down to cost and profitability.
 
As far as CUs for compute go, it doesn't have to have extra compute units that might not always be used for graphics. Both companies have stated that they can use spare cycles on compute units that are performing graphics tasks. I'm sure both first-party and third-party devs will develop unique and new ways to gain performance on these GPUs. In fact, from the PDFs Crytek have released, they had all the lighting, shading and culling running on one CU, with only the special skin shaders running on other units.
To me that means that, if used properly, a beautiful game like Ryse has CU overhead to spare. Now, having that ALU overhead doesn't mean they aren't running into a bandwidth bottleneck.
 
But they wanted to sell it for a profit with Kinect and at a reasonable price (the initial target was apparently $399).

Hence the weak hardware, and yet they still didn't manage to hit that brief, so they ditched Kinect.

Why the XB1 doesn't have extra CUs for compute all comes down to cost and profitability.
I may have missed it, but I've noticed you've used that $399 figure more than once; where is that figure coming from? I don't see how they could have reached the performance they were aiming for at that price without subsidizing the system.

As for the hardware, it may not be as powerful as some would want it to be, but one can hardly argue that MSFT wanted a weak system, Kinect or not; the system's BOM is higher than the PS4's BOM.
Looking at the SoC alone along with the RAM, the investments of MSFT and Sony are damn close too.

It just doesn't perform well compared to its BOM, or more accurately compared to its rival, which is a completely different thing.


EDIT

I think the issue is not the pros and cons of eDRAM (or eSRAM); it's more about putting 8GB of RAM above everything else when they decided to scale Yukon's performance up, which drove MSFT into a corner as far as design choices are concerned.
 
My take is that MSFT is tired of burning tons of cash only to break even a few years in, and then having the losing cycle repeat with the next console gen. SCE barely made any money in FY13 and starts bleeding again in FY14, so much for "winning" the console "war."
 
Sony predicts PS4 will make more money than PS2. If so, ESRAM wasn't a requirement to avoid financial disaster.

Well, Sony's game division lost money again last quarter, and there's not a lot besides the PS4 that should be causing it.

That said, I wouldn't doubt the PS4 hardware is close to break-even at worst. Each day that passes, especially this early in the cost curve, it should improve.

The PS2 was not all that wildly profitable; it seems unlikely the PS4 will sell as much as the PS2, but the hardware should be well more profitable, so overall it makes sense.

But yeah, the only thing the eSRAM gained was the ability to use DDR3 instead of GDDR5. I think that's a significant cost saving, at least (whether it's "worth it" is another debate). Others are now trying to say GDDR5 and DDR3 are relatively similar in cost, which I think is ridiculous, but that's tough to disprove since GDDR5 isn't sold separately, so the argument will continue. The only way to settle it is if the base X1 significantly undercuts the base PS4 on pricing for most of its life (or doesn't). And even that, of course, won't prove anything, since we will never know the margins each manufacturer is accepting on the hardware.

A case could be made that MS screwed up, but I'd like to see the whole gen play out first (perhaps in three years the XOne will have reached mostly performance parity, or at least close enough that consumers don't care, or perhaps indeed the X1 will end up undercutting the PS4 on price for most of the gen).

If they did screw up, it's because of their unhealthy obsession with a small, fast RAM cache on the GPU. As bkilian said, no other design was even considered. I didn't even love it in the 360...
 