Understanding XB1's internal memory bandwidth *spawn

Why is the time step relevant? If all you did was blending for a whole frame, then sure, you could average 150GB/s for a long time, but no one does that. The real question is, "for real workloads, is the ESRAM bandwidth going to be a bottleneck?" And the answer is obviously no. The ESRAM can provide the bandwidth required for most workloads requiring 16 or fewer ROPs, as long as you're doing a mix of reads and writes.

Will the full bandwidth be utilized all the time? Of course not. Do you drive your car at 60 miles an hour 100% of the time? No. In fact, my vehicle informs me that my average speed is 25 miles an hour, despite the fact that I have no trouble maintaining 60 mph when required. And that my peak speed is apparently 120mph.

Most of any frame will be low bandwidth. The only time you need full bandwidth is during ROP heavy operations like blending, which typically take a small fraction of a frame.

There, complete with car analogy.
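
For a rough sense of what "ROP heavy" means in numbers, here's a quick back-of-envelope sketch. The 16 ROPs and 853MHz clock are the commonly quoted XB1 figures; the 4-byte read plus 4-byte write per blended pixel (an RGBA8 target) is a simplifying assumption of mine:

[code]
# Back-of-envelope: bandwidth demanded by ROP-limited alpha blending.
# Assumptions: 16 ROPs, 853 MHz GPU clock, RGBA8 render target blended
# every cycle (4-byte destination read + 4-byte write), colour only.
rops = 16
clock_hz = 853e6
bytes_per_pixel = 4 + 4   # destination read + blended write

demand_gbs = rops * clock_hz * bytes_per_pixel / 1e9
print(f"Blend-limited demand: {demand_gbs:.0f} GB/s")
# -> ~109 GB/s for colour alone; add depth test/update traffic and such
#    bursts push towards the ESRAM's combined read/write figures.
[/code]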




Thank you for putting into words what most of us wanted to say but had the misfortune of not being able to. You bridged the gap of dispute between many members.
 
Prefetching solves issues of high latency. A 'low latency' system, if not expanded on, means having the data on hand for the logic to work on. The very best solution to that is to have the data ready for the logic instead of having to wait to fetch it from somewhere, regardless of how low the latency is. So broad remarks about a system with low latency don't have any real meaning for me. Without details, my interpretation is to assume the designers look more at dataflow than at getting particularly fast RAM in there, because that's the better solution most of the time. System-wide low latency can even extend to aspects like IO. In the DF interview (I've read it now ;)), they talk of using the flash for the OS so as not to interfere with the HDD head (as some of us in that discussion suggested was the most sensible use of it) - that's a way to decrease latency on HDD reads.
I know all about the flash; I was arguing with people that it would be used for OS functionality to enable instant switching, because the HDD needs to be free to provide unimpeded data transfer for games, and the OS needs unimpeded program loading to support MS's multitasking vision.

Back to the GPU architecture and Boyd Multerer's comment:
Beyond the dual-ported nature of certain varieties of 6T SRAM (and the X1's 6T) and higher types, that screenshot made me wonder if the latency advantage was another reason they put high-real-estate SRAM in there and not eDRAM like the IBM POWER chips. There may be modifications to the architecture so there is some benefit to the low-latency eSRAM. They could have tripled their embedded cache size with eDRAM (though not sextupled it, despite DRAM only needing a single transistor per bit).

Despite good prediction and precaching, are there certain types of GPU operations that lend themselves to low latency, such as ROP operations?
 
Why is the time step relevant? If all you did was blending for a whole frame, then sure, you could average 150GB/s for a long time, but no one does that. The real question is, "for real workloads, is the ESRAM bandwidth going to be a bottleneck?" And the answer is obviously no. The ESRAM can provide the bandwidth required for most workloads requiring 16 or fewer ROPs, as long as you're doing a mix of reads and writes.

Will the full bandwidth be utilized all the time? Of course not. Do you drive your car at 60 miles an hour 100% of the time? No. In fact, my vehicle informs me that my average speed is 25 miles an hour, despite the fact that I have no trouble maintaining 60 mph when required. And that my peak speed is apparently 120mph.

Most of any frame will be low bandwidth. The only time you need full bandwidth is during ROP heavy operations like blending, which typically take a small fraction of a frame.

There, complete with car analogy.

Beautiful example. ;)
 
Back to the GPU architecture and Boyd Multerer's comment:
Beyond the dual-ported nature of certain varieties of 6T SRAM and higher, why else would they put huge-real-estate SRAM in there and not eDRAM like the IBM POWER chips? There may be modifications to the architecture so there is some benefit to the low-latency eSRAM. They could have tripled their cache size with eDRAM (though not sextupled it, despite DRAM only needing a single transistor per bit).

http://www.eurogamer.net/articles/digitalfoundry-the-complete-xbox-one-interview

Nick Baker: It's just a matter of who has the technology available to do eDRAM on a single die.
 
What do you mean by running an actual title?
An actual title will never continuously use 150+GB/s of eSRAM bandwidth, or it is likely severely bandwidth limited (as the requirements usually spike a lot over the course of a frame).
Furthermore, it gets really hard to quantify the used bandwidth exactly outside of slightly more synthetic tests, as a lot of factors kick in, like the often quite efficient Z compression (which also works without MSAA) and the caching in the ROPs, for instance. And in a real game you will also likely have overdraw from the same batch of geometry at some points during your rendering, where the ROP caches can help a bit to reduce the external bandwidth requirements. Or an early Z test kicks the fragments out of the pipeline even before the pixel shader! For texture or buffer accesses it is even more important, as the relevant caches are larger.

You basically need synthetic test cases to minimize the effect of the caches and other "efficiency helpers" in the GPU to get reliable bandwidth numbers. Otherwise you measure a blend of cache bandwidth and external bandwidth. You can compare that with latency measurements on CPUs, where the prefetchers got smarter and basically broke a lot of the latency benchmarks at some point. While this is of course good for performance (as they also work in real software), it may not tell you everything about latency or bandwidth in this case (and the PS4 has more/larger caches).
The problem is that everything is more complicated in the real world. ;)
One simply can't give a simple average, as it varies a lot with the use case. And if one makes up some conditions to derive some kind of "average", it will not be a very meaningful number, as under different conditions it will look different. And taking one game isn't going to help much, as another one using different rendering techniques will have different needs. And as devs will tune their engines to use what is there, this will probably lead to a shift in the observed "average".
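
To put some (purely invented) numbers on that last point: the memory interface only sees what the ROP caches and Z compression let through, so the same rendering work can report very different "external" bandwidth. A toy model, with illustrative values that are not XB1 figures:

[code]
# Toy model: measured external bandwidth is a blend of cache and memory
# traffic. All values below are invented for illustration only.
raw_rop_traffic_gbs = 180.0   # what the ROPs actually move
rop_cache_hit_rate  = 0.30    # fraction absorbed by the ROP caches
depth_share         = 0.25    # fraction of raw traffic that is depth
z_compression       = 2.0     # depth traffic shrinks by this factor

colour = raw_rop_traffic_gbs * (1 - depth_share) * (1 - rop_cache_hit_rate)
depth  = raw_rop_traffic_gbs * depth_share * (1 - rop_cache_hit_rate) / z_compression
print(f"External bandwidth actually observed: {colour + depth:.0f} GB/s")
# -> ~110 GB/s at the pins for 180 GB/s of ROP work, which is why you need
#    near-synthetic cases to isolate the external bandwidth.
[/code]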

Couldn't the "real world" bandwidth numbers have been measured using performance counters on the GPU?
 
Couldn't the "real world" bandwidth numbers have been measured using performance counters on the GPU?
Just checked: somewhat recent AMD GPUs offer performance counters named CBMemRead and CBMemWritten, so that is a possibility. But as explained (also by bkilian with his car analogy), a normal game won't sustain 150GB/s over longer periods (i.e. a frame or even half of it). You need something closely resembling a fillrate test, or you will have just bursts of this bandwidth use. I mean, it would be something like 1.2 kB per pixel at 1080p60. What are you supposed to do with that much data per pixel? Especially as MS has said they got this for blending operations? If they did, they weren't doing much else, which means it was basically a fillrate test (at least for the period they measured).
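
The per-pixel figure is just this division (assuming an even spread over a 1920x1080, 60fps frame):

[code]
# What sustaining 150 GB/s over a whole 1080p60 frame would mean per pixel.
bandwidth_bytes_per_s = 150e9
pixels_per_second = 1920 * 1080 * 60   # ~124 million

per_pixel = bandwidth_bytes_per_s / pixels_per_second
print(f"{per_pixel:.0f} bytes of eSRAM traffic per pixel, every frame")
# -> ~1200 bytes (~1.2 kB) per pixel, far beyond normal render target use.
[/code]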
 
Just checked: somewhat recent AMD GPUs offer performance counters named CBMemRead and CBMemWritten, so that is a possibility. But as explained (also by bkilian with his car analogy), a normal game won't sustain 150GB/s over longer periods (i.e. a frame or even half of it). You need something closely resembling a fillrate test, or you will have just bursts of this bandwidth use. I mean, it would be something like 1.2 kB per pixel at 1080p60. What are you supposed to do with that much data per pixel? Especially as MS has said they got this for blending operations? If they did, they weren't doing much else, which means it was basically a fillrate test (at least for the period they measured).

Nick Baker didn't say anything about doing blending operations, much less a blending fillrate test for this measurement. All he said was "that is real code that is running at that bandwidth." He didn't say that it was the average over the course of the frame; it's unclear what the time period is. I think all he really needs to demonstrate is that the period is long enough that it could have been indefinite if that's what the game's workload actually mandated.

 
Nick Baker didn't say anything about doing blending operations, much less a blending fillrate test for this measurement. All he said was "that is real code that is running at that bandwidth."
And that says exactly nothing. A fillrate test also executes real code. Otherwise it wouldn't run. The blending stuff comes from the earlier MS statement on the matter and it is the kind of code to run to demonstrate high bandwidth use. It's a natural fit.
 
And that says exactly nothing. A fillrate test also executes real code. Otherwise it wouldn't run. The blending stuff comes from the earlier MS statement on the matter and it is the kind of code to run to demonstrate high bandwidth use. It's a natural fit.

He explicitly said it in contrast with a synthetic test or diagnostic. I'm pretty sure that when he said "real code" he didn't just mean any code, period; what exactly would be the point of that? Of course it's code. The strong implication is that it comes from a game.
 
A: this is the theoretical peak at 204...
B: that's just theoretical, what's the real blah blah blah
A: the real number we measure is 150...
B: well you could be using test code, not game code blah blah blah blah
A: we measured Forza at 150...
B: well that's a racing game, what about different genre blah blah blah
A: Ryse is at the same range...
B: Crytek's a big shop studio, little guys can't do it blah blah blah
A: well we have this indie developer....
B: he's a genius, that doesn't count for the common case blah blah blah

I just feel I had to get this out. If you have a problem believing the Earth is round, that's really just your problem.
 
As a side note, be it 135 or somewhere between 140 and 150 GB/s, I don't find the figure hard to swallow; actually it is not that high for on-chip memory.
 
Nick Baker didn't say anything about doing blending operations, much less a blending fillrate test for this measurement. All he said was "that is real code that is running at that bandwidth." He didn't say that it was the average over the course of the frame; it's unclear what the time period is. I think all he really needs to demonstrate is that the period is long enough that it could have been indefinite if that's what the game's workload actually mandated.
He explicitly said it in contrast with a synthetic test or diagnostic. I'm pretty sure that when he said "real code" he didn't just mean any code, period; what exactly would be the point of that? Of course it's code. The strong implication is that it comes from a game.
I'm still not sure what you are trying to discuss. Of course one can use a code fragment or a short period (a single millisecond) from a game where it basically resembles a fillrate test (with blending). If you want to quantify the usable bandwidth in any way (and they talk about bandwidth all the time), you look for these situations or construct test cases for it. How else should it work?
And btw., while you are right that Nick Baker didn't explicitly mention blending at that point, Andrew Goossen did right after that passage of Nick's you referred to:
Andrew Goossen in that same interview about that bandwidth issue said:
Andrew Goossen: If you're only doing a read you're capped at 109GB/s, if you're only doing a write you're capped at 109GB/s. To get over that you need to have a mix of the reads and the writes but when you are going to look at the things that are typically in the ESRAM, such as your render targets and your depth buffers, intrinsically they have a lot of read-modified writes going on in the blends and the depth buffer updates. Those are the natural things to stick in the ESRAM and the natural things to take advantage of the concurrent read/writes.
I think I used the phrase "natural fit". :rolleyes:

To sum it up, I don't get your point. What do you want to discuss?
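
For what it's worth, a quick sketch of how a read/write mix gets past the one-way cap. The 109GB/s per-direction figure is straight from the interview; treating the second direction as usable on 7 of every 8 cycles is my assumption about how the 204GB/s peak is arrived at:

[code]
# eSRAM bandwidth sketch. 109 GB/s per direction is the stated cap; the
# 7-of-8-cycles concurrency factor is an assumption used to reconstruct
# the quoted 204 GB/s theoretical peak.
one_way_gbs = 109.0
concurrency = 7 / 8          # assumed fraction of cycles with both directions active

combined_peak = one_way_gbs * (1 + concurrency)
print(f"Theoretical combined peak: {combined_peak:.0f} GB/s")   # ~204

measured = 145.0             # middle of the quoted 140-150 GB/s range
print(f"Measured / peak: {measured / combined_peak:.0%}")       # ~71%
[/code]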
 
I find the back and forth on the ESRAM bandwidth amusing.

204GB/s theoretical bandwidth to 32MB of memory makes it look like the least likely place to suffer a bottleneck, even if only 130-150GB/s is actually usable.

The other pool only has 68GB/s bandwidth. If you want a decently performing AAA game you're going to have render buffers in ESRAM irrespective of its performance, because it's clearly better than not using it. If you end up fill or ROP limited you'll be reducing overdraw, blending and resolution as much as possible. It's not necessarily the screen res which has to suffer either; I've worked on games which have been able to optimise this type of situation by reducing cube map render buffer resolution.

It's hard to look at any of this in isolation. I can't imagine that the One's designers didn't anticipate a fill rate issue given the hardware display plane scaling and blending.
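
To make the pool comparison concrete, here's an illustrative look at a single full-screen read-modify-write pass on a 32-bit 1080p target, using the figures quoted above and ignoring everything else contending for bandwidth:

[code]
# Time for one full-screen read+write of a 32-bit 1080p render target
# from each pool, in isolation (illustrative only).
target_bytes = 1920 * 1080 * 4     # ~8.3 MB
traffic = target_bytes * 2         # read + write

for name, bw_gbs in [("DDR3 (68 GB/s)", 68.0), ("eSRAM (~150 GB/s)", 150.0)]:
    ms = traffic / (bw_gbs * 1e9) * 1e3
    print(f"{name}: {ms:.2f} ms per full-screen blend pass")
# At 60fps there are only ~16.7 ms per frame, so stacking many such passes
# on the smaller pool eats the frame budget much faster.
[/code]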
 
Now that we know pretty much all the high-level details of Xbox One, I think it would be valuable to focus the discussion on workloads and the most interesting ways to use the ESRAM. There are many questions, which I guess will have different answers depending on the game, but are interesting nonetheless.
1) When using multiple render targets, is it suitable, from a bandwidth perspective, to put some of the render targets in the DDR3 pool? Or is the DDR3 bandwidth going to be just enough for CPU, geometry and textures? (See the sizing sketch after this list.)
2) Do all render targets in a deferred renderer require roughly the same amount of read/writes? Or is it possible to pick high read/write render targets and store only them in the ESRAM?
3) How does Forward+ compare to Deferred rendering in terms of render targets size and bandwidth requirements?
4) Is tiling going to be more or less costly than in the previous generation? Which techniques can be used to reduce the cost of tiling?
5) Is it suitable to split render targets in areas with high and low overdraw, in order to determine which pages should be put in ESRAM and which ones should stay in DDR3? Or is it too variable from frame to frame?
6) Will it be common for multiplatform games to use the majority of the PS4 bandwidth for texturing or geometry, requiring ports to use the ESRAM as an asset cache rather than as a target for the ROPs?
7) Will the flexibility of ESRAM compared to the Xbox 360 EDRAM provide any advantages in things like shadow rendering? Or is the ESRAM size too small for this?
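
Regarding 1), 2) and 5), a quick sizing sanity check. The G-buffer layout below is hypothetical (not any particular engine's), just to get a feel for how 1080p render targets compare to the 32MB of ESRAM:

[code]
# Does a typical-looking deferred G-buffer fit into 32 MB of eSRAM at 1080p?
# The layout is hypothetical, chosen only to illustrate the sizes involved.
width, height = 1920, 1080
targets_bytes_per_pixel = {
    "albedo (RGBA8)":           4,
    "normals (RGBA8/10:10:10)": 4,
    "spec/roughness (RGBA8)":   4,
    "depth/stencil (D24S8)":    4,
    "HDR light accum (FP16x4)": 8,
}
total_mb = sum(bpp * width * height for bpp in targets_bytes_per_pixel.values()) / (1024 ** 2)
print(f"G-buffer total: {total_mb:.1f} MB of the 32 MB eSRAM")
# -> ~47 MB: it doesn't all fit, which is exactly why splitting targets (or
#    pages of targets) between eSRAM and DDR3 is an interesting question.
[/code]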
 
I'm still not sure what you are trying to discuss. Of course one can use a code fragment or a short period (a single millisecond) from a game where it basically resembles a fillrate test (with blending). If you want to quantify the usable bandwidth in any way (and they talk about bandwidth all the time), you look for these situations or construct test cases for it. How else should it work?

As far as I'm concerned, "using a section that resembles a fillrate test" and "running a fillrate test" are not the same thing. The important distinction is that if you're writing a synthetic test you can be doing something totally unrealistic that the GPU likes a lot more than anything realistic. Maybe it's representative, maybe it isn't. A game is at least representative of something.

And btw., while you are right that Nick Baker didn't explicitly mention blending at that point, Andrew Goossen did right after that passage of Nick's you referred to:
I think I used the phrase "natural fit". :rolleyes:

Still doesn't mean that the test measurement figure was exclusively over a burst using blending. I'm sorry if I'm not giving you enough of a discussion here by saying I think you're assuming too much, because that's all I'm doing.
 
As far as I'm concerned, "using a section that resembles a fillrate test" and "running a fillrate test" are not the same thing. The important distinction is that if you're writing a synthetic test you can be doing something totally unrealistic that the GPU likes a lot more than anything realistic. Maybe it's representative, maybe it isn't. A game is at least representative of something.
A fillrate test is also representative of something. :LOL:
Furthermore, each game or even each different phase of rendering a frame of the same game will be representative of something else, something we don't know.
But here, they looked specifically for high bandwidth usage situations, so we are given a hint as to what it should be representative of: the bandwidth. ;)
Still doesn't mean that the test measurement figure was exclusively over a burst using blending.
Of course it does. At least given the circumstances, like what they are talking about (high bandwidth usage scenarios, and which specific one [blending] they think of!). Given the reasoning in that interview, as well as by several people here in the thread, it would be quite hard to explain how else they got 150+GB/s out of the eSRAM.
I'm sorry if I'm not giving you enough of a discussion here by saying I think you're assuming too much, because that's all I'm doing.
If you refuse to assume some stuff, you will never arrive at any conclusion. The key is to make reasonable assumptions which are most likely fulfilled. ;)
 
A fillrate test is also representative of something. :LOL:
Furthermore, each game or even each different phase of rendering a frame of the same game will be representative of something else, something we don't know.
But here, they looked specifically for high bandwidth usage situations, so we are given a hint as to what it should be representative of: the bandwidth. ;)
Of course it does. At least given the circumstances, like what they are talking about (high bandwidth usage scenarios, and which specific one [blending] they think of!). Given the reasoning in that interview, as well as by several people here in the thread, it would be quite hard to explain how else they got 150+GB/s out of the eSRAM.
If you refuse to assume some stuff, you will never arrive at any conclusion. The key is to make reasonable assumptions which are most likely fulfilled. ;)

The problem people are having is that I'm not sure they are pulling this from a specific context. The viewpoint is that they provided average bandwidth figures for actual code, which is taken by all (except a few here) to mean actual games. That creates a large variation in scenarios of what goes across the bus at any given time. So everyone accepts it as a believable statement that the bandwidth only really averages 70-80% utilisation (MS should be quite knowledgeable, especially as this is an evolution of their technology from the 360), and thus that this should be attributable to all bandwidths. But you keep pushing that there is no difference between actual code and synthetic code (code produced to test isolated situations), and then support that GDDR5 hits over 90% based on synthetic code aspects.

The problem people are having is specifically that they are not sure if you are saying that the ESRAM bandwidth and DDR3 bandwidth are for some reason relegated to 70-80%, but GDDR5 is not, or if you are just pointing out the data simply as data presented.

If MS provided a bunch of synthetic tests that showed 90%+ of the bandwidth, would you then take that at face value, or argue that it cannot achieve that? Again, are you simply taking data samples as provided, or making assumptions that define the bandwidths in terms of what could only be called limitations of one over another?
 