The pros and cons of eDRAM/ESRAM in next-gen

Quick answer is yes, as DDR3 alone would not be able to sustain the bandwidth need, but what exactly do you mean by "esram<->ddr3 reads/writes"?

I mean to say transfers between DDR3 and ESRAM. The ESRAM, so to speak, has to be "fed" data, as well as feed data out (e.g. a game might have one copy out per frame to main memory for each front buffer write).
Now this doesn't really impact the ESRAM, given that its real-world bandwidth averages ~120GB/s IIRC (I might be off by 10GB/s, I don't have a photographic memory).

But the DDR3 has less bandwidth to work with, as well as memory contention issues from sharing bandwidth between the two processing units (all shared-memory architectures suffer from this).

So my question is: does the combination of all the ESRAM to/from DDR3 reads and writes take up a sizeable amount of the DDR3 bandwidth in most games?
 
At least going by the disclosures, many operations using the ESRAM shouldn't need feeding from DDR3 to the ESRAM.
The way I interpret some of the disclosures, intermediate buffers residing wholly in the ESRAM would not hit DDR3. If data must spill, hopefully it is used more than it is spilled. Whatever the final result is could be written directly to DDR3 by the GPU.
The GPU would need to be fed external data from DDR3, but that's an access that's going to happen anyway. All the work in the middle could potentially never be noticed by main memory.
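(As a rough illustration of that point, here's a back-of-the-envelope traffic model. The buffer sizes and reuse factor are invented purely for illustration; nothing here is a real API or a measured figure.)

```python
# Hypothetical per-frame traffic accounting for one rendering pass, in MB.
TEXTURES_IN  = 100.0   # source data that has to come from DDR3 either way
INTERMEDIATE = 30.0    # scratch buffers, read and written several times
FINAL_TARGET = 8.0     # final result written out to DDR3 once

def ddr3_traffic(intermediates_in_esram, reuse_factor=4):
    """Approximate DDR3 traffic (MB per frame) for this pass."""
    traffic = TEXTURES_IN + FINAL_TARGET          # unavoidable either way
    if not intermediates_in_esram:
        # every pass over the intermediates now hits main memory instead
        traffic += INTERMEDIATE * reuse_factor
    return traffic

print(ddr3_traffic(True))    # 108.0 -- the work in the middle never touches DDR3
print(ddr3_traffic(False))   # 228.0 -- same pass with intermediates in main memory
```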

For other cases, like maintaining residency for reads, it wouldn't be done unless there's an expectation that the data transferred once from DDR3 is going to be read back multiple times.
Keeping resident data that is read in, modified, and then written back should probably only happen if the reuse is several times beyond the read+write that would be happening anyway.

Even for limited-reuse data, access patterns that punish the DDR3 may still be worth moving to ESRAM. What's the point in quibbling over a fraction of the DRAM's bandwidth if the pattern makes it lose over half its peak?

There may be dynamic cases where the allocation process is too inflexible to catch that specific subsets of data should have stayed in main memory, but otherwise each ESRAM transaction should eliminate more than its share of main memory transactions, inflating its apparent influence over the DDR bus as a whole.

The alternative is a case where an optimization was missed or the ESRAM was already fully subscribed; in that case the ESRAM transfer traffic's percentage of DDR3 goes down, as would performance. As such, it might be better to see a good chunk of DDR3 bandwidth taken up by ESRAM transfers.
 
I mean to say transfers between DDR3 and ESRAM. The ESRAM, so to speak, has to be "fed" data, as well as feed data out (e.g. a game might have one copy out per frame to main memory for each front buffer write).

Well, whatever needs to be "fed" would still need to be "fed" in a single memory pool configuration. When you read from DDR3 and write back to DDR3, it still consumes the same amount of bandwidth as writing to ESRAM; there's no difference that way.

We can work through the "frame buffer copy" as an example, though. Let's say 60fps at 1080p with a typical RGBA frame buffer, and let's say that you need to tile it twice; that's ~1GB/s of bandwidth for "copying" the frame buffer.
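(For anyone who wants to check that figure, the arithmetic is roughly as follows, assuming a 32-bit RGBA8 buffer and the "tile it twice" assumption above.)

```python
width, height    = 1920, 1080
bytes_per_pixel  = 4      # RGBA8
fps              = 60
copies_per_frame = 2      # "tile it twice"

frame_bytes = width * height * bytes_per_pixel     # ~8.3 MB per frame
copy_bw = frame_bytes * fps * copies_per_frame     # bytes per second
print(copy_bw / 1e9)                                # ~1.0 GB/s
```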

However, in practice what a developer should do is just read from the ESRAM and write back to the DDR3 on the last rendering pass, so there's really no "copy." And I don't know how the three display planes work in this regard.

But the DDR3 has less bandwidth to work with, as well as memory contention issues from sharing bandwidth between the two processing units (all shared-memory architectures suffer from this).
That's very true, and it means the dedicated ESRAM would not be suffering from contention, and that's where most of the reads/writes would happen at the pixel level.

So my question is: does the combination of all the ESRAM to/from DDR3 reads and writes take up a sizeable amount of the DDR3 bandwidth in most games?
Again, that depends on what you mean by "sizable", as in an actual number in your mind.

Sidebar: The BW is not the bottleneck here. If the BW were the limiting factor, then the June SDK wouldn't have helped at all; that's the only logical conclusion.
 
Sidebar: The BW is not the bottleneck here. If the BW were the limiting factor, then the June SDK wouldn't have helped at all; that's the only logical conclusion.
Why wouldn't GPU code scale up with the longer GPU time slice when it's BW bound?

The code will execute proportionally more work if it has more time to run. Otherwise things don't add up. There would need to be a bottleneck that changes based on the time slice width, which I cannot see.
 
Why wouldn't GPU code scale up with the longer GPU time slice when it's BW bound?

The code will execute proportionally more work if it has more time to run. Otherwise things don't add up. There would need to be a bottleneck that changes based on the time slice width, which I cannot see.

Wouldn't more work require more bandwidth, in general? (say, the 1080p narrative)

If you are bandwidth bound, it means you are approaching the theoretical limit of the physical hardware, and nothing you do will help unless you upgrade the hardware to add more bandwidth.
The fact that you can increase performance by giving the GPU more time means it was never bandwidth bound to begin with, which was my whole point.
 
Wouldn't more work require more bandwidth, in general?
When you're dealing with time slices, more time effectively is more BW, in a long-term time-averaged sense. You're not going to be making memory accesses when you're not using the GPU.

The June SDK doesn't change the "instantaneous" BW available to the GPU, but a game is still able to make more total memory accesses over time.
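(A minimal sketch of that time-averaged view; the peak figure used here is only a placeholder, not an official number.)

```python
def effective_bw(peak_gb_s, gpu_duty_cycle):
    """Long-term average bandwidth the game can actually use."""
    return peak_gb_s * gpu_duty_cycle

print(effective_bw(68.0, 0.90))  # ~61.2 GB/s with a 10% slice reserved
print(effective_bw(68.0, 1.00))  # 68.0 GB/s with the reserve removed
```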
 
When you're dealing with time slices, more time effectively is more BW, in a long-term time-averaged sense. You're not going to be making memory accesses when you're not using the GPU.

The June SDK doesn't change the "instantaneous" BW available to the GPU, but a game is still able to make more total memory accesses over time.

You can't make more memory accesses if you are already bandwidth bound; that's a physical limit and has nothing to do with utilization.
Other than that, what you and MrFox said is no different from what I've said.
 
You can't make more memory accesses if you are already bandwidth bound
If you spend 90% of 1 second accessing a 100MB/s bus, you can achieve up to 90MB of transfer during that second. If you spend 100% of 1 second accessing the same bus, you can achieve up to 100MB of transfer during that second. If your task is bound by the transfer of data, you're going to be able to get more done in the second case than in the first. Unless a game on XB1 is able to buffer a cartoonishly massive amount of memory access requests and carry them out (somehow without contention issues) during the system's time slice, even a completely BW-bound game should see markedly better performance by getting rid of the time slice.
 
If you spend 90% of 1 second accessing a 100MB/s bus, you can achieve up to 90MB of transfer during that second. If you spend 100% of 1 second accessing the same bus, you can achieve up to 100MB of transfer during that second. If your task is bound by the transfer of data, you're going to be able to get more done in the second case than in the first. Unless a game on XB1 is able to buffer a cartoonishly massive amount of memory access requests and carry them out (somehow without contention issues) during the system's time slice, even a completely BW-bound game should see markedly better performance by getting rid of the time slice.

Don't disagree, but apparently you have a very different definition of what bandwidth bound means; other than that, you and I are saying the same thing in gist (that the X1 does not have a bandwidth problem).

By definition, you simply cannot be bandwidth bound if you are able to improve performance without adding physical bandwidth.
 
But that's the thing, BW *is* being added. So now you can continue along at your previously BW-bound rate, but as you're doing it for longer (with the time slice reduced or removed) you've moved more data.

Instantaneous BW for the device remains the same, but both the amount of processing and the amount of BW available to the application (the game) are effectively increased by however much reserved time slice is removed.

MS themselves referred to an increase in BW available with the June SDK update. BW bound operations won't become less BW bound, but total work done and data moved will increase.

Incidentally, after MS's comments about bandwidth increases I found myself wondering if the reserved slice had applied to both esram and main ram, especially given how tightly coupled the two memory pools seem to be (the ROPs can write just as easily to both; reserving the esram bus but not the main bus would seem inconsistent).
 
But that's the thing, BW *is* being added. So now you can continue along at your previously BW-bound rate, but as you're doing it for longer (with the time slice reduced or removed) you've moved more data.

So that was a comment on the subject of "sizable transfer" between ESRAM<->DDR3, in the context of whether such a "sizable transfer" would cause bandwidth saturation and memory contention (I assume, for simplicity's sake, at the hw level).

At this point I think that one line was taken way out of context, especially since I don't actually disagree with your main points, if that makes better sense.
 
you and I are saying the same thing in gist (that the X1 does not have a bandwidth problem).
I haven't really commented on that question (my entire point is that the characteristics of things before and after the update don't say much about whether something is BW bound or not), though I'd be a little surprised if it were a tremendous issue.

By definition, you simply cannot be bandwidth bound if you are able to improve performance without adding physical bandwidth.
That's an oddly black-and-white view of the situation.

Realistically, it's possible to be both limited by the throughput of the bus and limited because you don't have enough time where you're in control of the processor, such that increasing either would give a correspondingly significant increase in performance compared with increases in other areas. Most people would say that, in this circumstance, you are bandwidth bound, especially since the limitation due to the required time slice would roughly be an across-the-board percentage decrease in available performance, and thus not really an interesting thing to comment on regarding the characteristics of any particular task.
 
That's an oddly black-and-white view of the situation.
It's not a view, it's the very definition of "bandwidth bound." If you believe there is a different definition, then what you thought I meant is not what I meant, in which case further discussion on this is moot.
 
It's not a view, it's the very definition of "bandwidth bound."
Eh, the recognized definitions of these things aren't very strict.

In typical use, "bound" often implies a roughly proportional (or similar) relationship between increases in the "bound" resource and the overall performance. In situations where the term comes up, it's used because this "bound" resource is a severe limiting factor, and increases in other resources wouldn't be useful because they're bottlenecked on it. Usually, this occurs when the "bound" resource is the exclusive available resource with these characteristics. But in terms of definitions, nothing in the terminology demands true exclusivity; exclusivity only occurs because it tends to be the context where the phrasing becomes most fluid and useful.

And even if we demand exclusivity in our definition, we could still use it here by recognizing that in the obvious flow of the English being used in this thread, a term like "pre-June-SDK XB1 GPU" would be in reference to the limited "90%-duty-cycle" GPU, not to the full GPU with recognition of the 10% time slice.

IMO either way of looking at it is preferable to your English, where it's seemingly impossible to use the term "bound" when there are multiple hard limiting factors, for no obvious benefit.

If you believe there is a different definition, then what you thought I meant is not what I meant, in which case further discussion on this is moot
I suppose.

This does seem to be turning into a linguistics conversation.
 
semantics... :nope:

I have to agree with HTupolev. "Bandwidth bound" has been used most often to describe code that spends the majority of its time waiting for memory I/O, the context being that adding more GFLOPS will add negligible performance in these situations, while adding more memory bandwidth would be very beneficial.

But no matter how you define it, even with 100% bandwidth-bound code, the June SDK update would still make GPU code about 11% faster. The GB/s and GFLOP/s still have the same ratio; bottlenecks, whatever they are, remain proportional.
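(Where the ~11% comes from, purely from the ratio of time slices; the 10% figure is the reserve discussed in this thread.)

```python
reserved = 0.10                    # GPU time reserved before the June SDK
speedup  = 1.0 / (1.0 - reserved)  # 100% of GPU time vs. 90%
print((speedup - 1.0) * 100)       # ~11.1% more time, bandwidth, and FLOPs alike
```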

I think Pixel's question was a good one and 3dilettante gave an interesting answer. Maybe the discussion can continue from there?
 
Indeed, it was interesting.

It seems that esram + dram memory operations will usually be additive in terms of the non-copy operations they allow, with the exceptions being cases where the speed-up from having the data waiting in sram is worth the BW cost of a pre-emptive copy in.

This does mean paying more attention to when and where your memory accesses occur than on PC & PS4, however, and the easiest way to use the esram is still probably to just cram your buffers into the esram and write out when you're done.

Perhaps we'll see the true effectiveness of even this simple approach as games without the 10% reserve land on Digital Foundry's doorstep ...
 
Indeed, it was interesting.

It seems that esram + dram memory operations will usually be additive in terms of the non-copy operations they allow, with the exceptions being cases where the speed-up from having the data waiting in sram is worth the BW cost of a pre-emptive copy in.

This does mean paying more attention to when and where your memory accesses occur than on PC & PS4, however, and the easiest way to use the esram is still probably to just cram your buffers into the esram and write out when you're done.
This is the simplest approach, yes, but it is not the most efficient. You waste a lot of bandwidth if you just fill the esram with your buffers, render targets or whatever; you must swap memory from sram to dram and back if you want to get a real advantage out of the sram.
Just using it as render target memory is a quick win, but far from being optimized, and that's what most developers do right now. When the tools are optimized and the developers have some more know-how about the platform, the sram should do great things for that system.
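(To illustrate that idea, a purely hypothetical sketch: the buffer names and traffic estimates are invented, and a naive greedy heuristic stands in for whatever a real engine would actually do. The point is only that which buffers occupy the 32MB of esram should be driven by how much traffic they generate per byte, and the set can change across the frame as buffers are swapped in and out.)

```python
ESRAM_MB = 32.0

# (name, size in MB, estimated traffic in MB per frame) -- all invented numbers
buffers = [
    ("gbuffer_albedo",   8.0, 120.0),
    ("gbuffer_normals",  8.0, 150.0),
    ("depth",            4.0, 200.0),
    ("hdr_target",       8.0, 180.0),
    ("shadow_map",      16.0,  60.0),
    ("final_backbuffer", 8.0,  30.0),
]

# Greedy: most traffic per byte first, until the esram budget is spent.
remaining = ESRAM_MB
placement = {}
for name, size, traffic in sorted(buffers, key=lambda b: b[2] / b[1], reverse=True):
    if size <= remaining:
        placement[name] = "esram"
        remaining -= size
    else:
        placement[name] = "ddr3"

print(placement)  # depth, hdr_target and the g-buffers land in esram; the rest stay in ddr3
```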
 