Understanding XB1's internal memory bandwidth *spawn

What matters a lot is that time step. You can believe the number and still want to know how long it was measured over.

What's also interesting is the 50-55GB/s number. I'm just wondering how much bandwidth a CPU would want and need in next-gen games; it has a 30GB/s bus to the DRAM probably for a good reason.

You are linking completely unrelated dots together and trying to retrofit a faulty conclusion.
 
You are linking completely unrelated dots together and trying to retrofit a faulty conclusion.

Are you going to provide answers, or are you going to continually attempt to attack my questions? I'm interested in how much CPU bandwidth a game would likely use, and suddenly I'm trying to draw a faulty conclusion? What?

4 core (8 thread) Haswell at 3.5 GHz (3.9 GHz turbo) is fine with 25.6 GB/s (and it shares that bandwidth with the HD 4600 IGP). Jaguar cores have around half the clocks and lower IPC compared to Haswell. Bandwidth shouldn't be a problem for them.
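As a rough back-of-the-envelope check (the dual-channel DDR3-1600 configuration is where the 25.6 GB/s figure comes from; the Jaguar clock and the IPC ratio below are my own assumptions, not from sebbi's post):

```python
# Rough check of the figures above (Jaguar clock and IPC ratio are assumed).
channels = 2
bus_width_bytes = 8          # 64-bit per DDR3 channel
transfer_rate = 1600e6       # transfers/s for DDR3-1600
peak_bw = channels * bus_width_bytes * transfer_rate / 1e9
print(f"Dual-channel DDR3-1600 peak: {peak_bw:.1f} GB/s")   # ~25.6 GB/s

# Scale per-core demand down for Jaguar: roughly half the clock, lower IPC.
haswell_clock, jaguar_clock = 3.5, 1.75   # GHz (Jaguar clock assumed)
ipc_ratio = 0.6                            # assumed Jaguar-vs-Haswell IPC ratio
relative_demand = (jaguar_clock / haswell_clock) * ipc_ratio
print(f"Rough per-core bandwidth demand vs Haswell: {relative_demand:.0%}")
```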

Thank you sebbi, it's always appreciated to get some real answers. I was looking more for how much bandwidth games will use, but the maximum needed for a CPU to operate well is just as good.
 
Boyd Multerer would agree with you. See below.
Look at the wording Boyd uses very carefully. 'Dataflow', 'right data in the right cache at the right time', 'makes all the difference in the world' - it sounds like not just bandwidth but also low latency is extremely important for the architecture.
Prefetching solves issues of high latency. A 'low latency' system, if not expanded on, means having the data on hand for the logic to work on. The very best solution to that is to have the data ready for the logic instead of having to wait to fetch it from somewhere, regardless of how low the latency is. So broad remarks about a system with low latency don't have any real meaning for me.

Without details, my interpretation is to assume the designers look more at dataflow than at getting particularly fast RAM in there, because that's the better solution most of the time. System-wide low latency can even extend to aspects like IO. In the DF interview (I've read it now ;)), they talk of using the flash for the OS so it doesn't interfere with the HDD head (as some of us suggested in that discussion as the most sensible use of it) - that's a way to decrease latency on HDD reads.
 
They told you, you didn't want to listen, and I thought the subject of this malarkey had already been banned.
It was banned from the XB1 hardware investigation thread. The discussion is moved here so those unwilling to believe the figures can continue to argue about it. ;)
 
Prefetching solves issues of high latency. A 'low latency' system, if not expanded on, means having the data on hand for the logic to work on.

I doubt GPUs use prefetching to any significant degree. Prefetching is a guess: if you guess wrong you've wasted the bandwidth - and bandwidth is at a premium. In fact, I'd expect the opposite to be true: memory requests are delayed slightly to see if more adjacent requests come down the pipe, so memory accesses can be coalesced.
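A minimal sketch of that coalescing idea (the 64-byte line size and the addresses are purely illustrative assumptions):

```python
# Illustrative only: group per-thread addresses into 64-byte transactions,
# the way adjacent requests can be coalesced into fewer memory accesses.
LINE = 64  # assumed transaction/cache-line size in bytes

def coalesce(addresses):
    """Return the set of 64-byte-aligned lines the requests touch."""
    return {addr // LINE for addr in addresses}

adjacent = [i * 4 for i in range(16)]        # 16 consecutive 4-byte reads
scattered = [i * 256 for i in range(16)]     # 16 strided reads

print(len(coalesce(adjacent)))   # 1 transaction -> bandwidth-friendly
print(len(coalesce(scattered)))  # 16 transactions -> bandwidth wasted
```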

GPUs solve the latency issue by working on a lot of things in parallel. A CU has 4 SIMDs, each with 10 active waves. Each instruction in a wave takes 4 cycles to complete. If all waves are active, there are 40 cycles - about 50ns - between consecutive instructions from the same wave. For a CU to stall, every wave would have to be blocked on a data dependency (i.e. a cache-missing memory request). That's not a common mode of operation for a GPU - the caches and per-CU register space are tailored to fit most graphics workloads.

However, if you can't populate all 10 wavefronts of a CU, either because you don't have enough independent jobs to run, or because each job puts pressure on the registers, latency tolerance falls.
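The arithmetic behind that latency tolerance, as a small sketch (the ~800 MHz GPU clock is an assumed figure for illustration):

```python
# Sketch of the issue-interval arithmetic above (GPU clock is assumed).
waves_per_simd = 10
cycles_per_instruction = 4
gpu_clock_hz = 800e6  # assumed ~800 MHz

# With all 10 waves resident on a SIMD, a given wave issues an instruction
# once every waves * 4 cycles - the latency it can hide "for free".
cycles_between_issues = waves_per_simd * cycles_per_instruction
ns_between_issues = cycles_between_issues / gpu_clock_hz * 1e9
print(f"{cycles_between_issues} cycles ~= {ns_between_issues:.0f} ns per wave")

# Fewer resident waves (e.g. heavy register pressure) shrinks that window.
for waves in (10, 6, 3, 1):
    window = waves * cycles_per_instruction / gpu_clock_hz * 1e9
    print(f"{waves} waves/SIMD -> ~{window:.0f} ns of latency hidden")
```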

Cheers
 
I doubt GPUs use prefetching to any significant degree. Prefetching is a guess...
I didn't mean prefetching as a technical cache term, but fetching the data ahead of its being used, which is the standard GPU design. A GPU shader doesn't start with "fetch texture. Wait until it arrives. Twiddle thumbs. Finally do something." It'll send for texture data ahead of when it's needed - in current parlance, by deferring to another wavefront that is ready to go. No processor sits waiting for data as a normal course of events any more. The times when that happens are an unwanted shortcoming of the memory system, for which lower latency means less waiting but isn't really a solution engineers are relying on. Reducing the latency between when an ALU wants to work on data and when that data is available is more about making sure the data is moving and available at the time the ALU is ready to work on it than about making the delay to the fetch as short as possible, and that's what these engineers are almost certainly talking about when they mention reducing latencies.
 
Dear lord...you cannot be serious with this post. :rolleyes:
Of course I am. Because they did what I said and the known facts back it up.
Latency is almost irrelevant for ROP exports, as the CUs themselves as well as the ROP caches are designed to handle it. And they do quite well at it. A pixel shader doesn't care about the latency of the memory the ROPs have to deal with. It's a fire-and-forget operation: the shader proceeds as soon as the data are read from the registers and sent to the ROPs. After that, it is all handled by the ROPs alone, which are - I repeat - designed to handle the DRAM latency quite well.

AMD is on record saying that bandwidth is a limitation to ROP performance more than anything else (evident by keeping their 32 ROPs and just fitting a wider 384-bit interface to Tahiti, a logic backed up by fillrate tests), and MS said exactly the same when asked about the potential ROP advantage of the PS4. They only talked bandwidth, and when asked about latency they said that the GPU is designed to be very tolerant of it. You have to come up with usage scenarios other than ROP exports (and there are some!) to benefit from a reduced latency. And frankly, we don't actually know what the latency is. It's probably not an order of magnitude shorter than for the DRAM; it could be a relatively small factor after all.
 
I would like further clarification from someone who knows more to be honest.

This is what DF said previously.

That says the same thing? Factor in the upclock and you get just the same figure we were given now.

And it says this comes from real game code on the machine in a scenario with lots of blending... If you are not doing much blending, wouldn't you need a lot less bandwidth in the first place? The important thing to know is that if a title needs tons of bandwidth for expensive blending operations, the esram provides it.

I guess that is how much bandwidth the scratchpad can deliver in a bandwidth-bound scenario.
It is quite a high measurement.
I'm not sure why (or worse, I suspect why) so many people are wary of MSFT's claims and can't take what they said at face value, while in the meantime... well...

Anyway, they are pretty honest - too much so in my opinion. They said that achievable bandwidth from the DDR3 is ~55GB/s (out of 67GB/s), which is not exactly PR friendly when you go out of your way to explain the choice you made in light of Sony's choice of lots of really fast memory.
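Roughly where that ratio sits, as a sketch (the 256-bit, 2133 MT/s DDR3 configuration is the commonly cited one; I'm assuming it here rather than quoting MS):

```python
# Rough sketch of where ~55 out of ~67-68 GB/s lands (config assumed).
bus_width_bytes = 256 // 8   # assumed 256-bit DDR3 interface
transfer_rate = 2133e6       # assumed DDR3-2133 transfer rate
peak = bus_width_bytes * transfer_rate / 1e9
print(f"Peak DDR3 bandwidth: {peak:.1f} GB/s")   # ~68.3 GB/s

achieved = 55
print(f"Achieved / peak: {achieved / peak:.0%}") # ~80%, a realistic efficiency
```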

That 55GB/s is the total DDR3 bandwidth? I assumed that was the bandwidth the GPU was seeing at a time... So actually, from that 55, the GPU would actually see less, because that is shared with the CPU...
 
That says the same thing? Factor in the upclock and you get just the same figure we were given now.

And it says this comes from real game code on the machine in a scenario with lots of blending... If you are not doing much blending, wouldn't you need a lot less bandwidth in the first place? The important thing to know is that if a title needs tons of bandwidth for expensive blending operations, the esram provides it.

To be honest, it says that they got 133GB/s in blending, which I have no doubt about. What I was talking about there was that it actually gave us more information on where the numbers actually came from, which was useful - if only it gave us some sort of time step as well. But alas, none of them do.
 
To be honest, it says that they got 133GB/s in blending, which I have no doubt about. What I was talking about there was that it actually gave us more information on where the numbers actually came from, which was useful - if only it gave us some sort of time step as well. But alas, none of them do.

Just a reminder: throw in the upclock and you have 140, on current code from when they tested. They said the average was 140-150, so nothing they provided is out of line with what you are running around asking about.
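The upclock arithmetic, for reference (800 to 853 MHz is the reported GPU clock change; I'm taking it as given here):

```python
# Scaling the measured 133 GB/s by the reported 800 -> 853 MHz GPU upclock.
measured = 133                    # GB/s, from the earlier DF figure
old_clock, new_clock = 800, 853   # MHz (reported clocks, assumed here)
print(f"{measured * new_clock / old_clock:.0f} GB/s")  # ~142 GB/s, i.e. roughly 140
```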
 
That says the same thing? Factor in the upclock and you get just the same figure we were given now.

And it says this comes from real game code on the machine in a scenario with lots of blending... If you are not doing much blending, wouldn't you need a lot less bandwidth in the first place? The important thing to know is that if a title needs tons of bandwidth for expensive blending operations, the esram provides it.



That 55GB/s is the total DDR3 bandwidth? I assumed that was the bandwidth the GPU was seeing at a time... So actually, from that 55, the GPU would actually see less, because that is shared with the CPU...

Again, that 55 was provided as a 70-80% measure, accounting for the average quality of the code measured.

The problem here is that MS was too honest in even discussing real world bandwidth numbers as opposed to letting things lie like everyone else with wonderful peak numbers.
 
The statement about the DDR3 bandwidth was meant to mean you won't typically max out the bandwidth on any memory interface, not just theirs.

The part of this discussion I'm not understanding is why people believe busses are (and should be) near saturation at all times.
 
To be honest, it says that they got 133GB/s in blending, which I have no doubt about. What I was talking about there was that it actually gave us more information on where the numbers actually came from, which was useful - if only it gave us some sort of time step as well. But alas, none of them do.
Why is the time step relevant? If all you did was blending for a whole frame, then sure, you could average 150GB/s for a long time, but no one does that. The real question is, "for real workloads, is the ESRAM bandwidth going to be a bottleneck?" And the answer is obviously no. The ESRAM can provide the bandwidth required for most workloads requiring 16 or fewer ROPs, as long as you're doing a mix of reads and writes.

Will the full bandwidth be utilized all the time? Of course not. Do you drive your car at 60 miles an hour 100% of the time? No. In fact, my vehicle informs me that my average speed is 25 miles an hour, despite the fact that I have no trouble maintaining 60 mph when required. And that my peak speed is apparently 120mph.

Most of any frame will be low bandwidth. The only time you need full bandwidth is during ROP heavy operations like blending, which typically take a small fraction of a frame.

There, complete with car analogy.
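To put some hypothetical numbers on that (the per-phase bandwidths and durations below are made up purely for illustration):

```python
# Hypothetical frame budget: full ESRAM bandwidth is only needed during
# the ROP-heavy (blending) part of the frame. All numbers are made up.
frame_ms = 16.7
phases = [
    ("geometry / shading", 12.0, 40),   # (name, duration ms, GB/s used)
    ("alpha blending",      3.0, 150),
    ("post-processing",     1.7, 70),
]
average = sum(ms * bw for _, ms, bw in phases) / frame_ms
print(f"Frame-average bandwidth: {average:.0f} GB/s")  # well below the 150 GB/s peak
```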
 
There, complete with car analogy.
And a good one for a change. ;)
I tried to explain before that if a game really averaged 150GB/s of bandwidth use from the eSRAM, it would look utterly bandwidth constrained, leaving no headroom for the usual spikes in bandwidth demand, and I questioned whether one could observe that in real games outside of more synthetic scenarios (or short moments during rendering a frame). But where I failed, that car analogy hopefully makes it clear to everyone. :)
 
That car analogy seems to make more sense with respect to a single RAM pool. I'm not sure how you could change it to be more accurate, maybe Turbo? ;)

I just wonder how well devs can keep the needed data in eSRAM and how much work it is. If it were simple and a driver could control it automagically, you would see this solution on the PC side. It must take some clever hand coding? Clearly the DDR3+eSRAM is not going to look and act like a 200GB/s single pool, but when and how they differ is going to be interesting. I guess we will have to wait and see once devs have shipped games and can offer insight.
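As a toy illustration of that hand-placement problem (the render-target formats and the greedy placement below are assumptions for illustration, not how any real title does it):

```python
# Toy illustration: deciding which 1080p render targets fit in the 32 MB
# scratchpad. Formats and the greedy choice are assumed for illustration.
ESRAM_BYTES = 32 * 1024 * 1024
W, H = 1920, 1080

targets = {                              # name -> bytes per pixel (assumed)
    "color (RGBA8)": 4,
    "depth/stencil (D24S8)": 4,
    "normals (RGBA16F)": 8,
    "HDR accumulation (RGBA16F)": 8,
}

used = 0
for name, bpp in targets.items():
    size = W * H * bpp
    where = "eSRAM" if used + size <= ESRAM_BYTES else "DDR3"
    if where == "eSRAM":
        used += size
    print(f"{name}: {size / 2**20:.1f} MB -> {where}")
print(f"eSRAM used: {used / 2**20:.1f} MB of 32 MB")
```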
 
Why is the time step relevant? If all you did was blending for a whole frame, then sure, you could average 150GB/s for a long time, but no one does that. The real question is, "for real workloads, is the ESRAM bandwidth going to be a bottleneck?" And the answer is obviously no. The ESRAM can provide the bandwidth required for most workloads requiring 16 or fewer ROPs, as long as you're doing a mix of reads and writes.

Will the full bandwidth be utilized all the time? Of course not. Do you drive your car at 60 miles an hour 100% of the time? No. In fact, my vehicle informs me that my average speed is 25 miles an hour, despite the fact that I have no trouble maintaining 60 mph when required. And that my peak speed is apparently 120mph.

Most of any frame will be low bandwidth. The only time you need full bandwidth is during ROP heavy operations like blending, which typically take a small fraction of a frame.

There, complete with car analogy.

This helps a lot. I had a feeling that there wouldn't be many bandwidth-constrained points in a frame, but I wasn't 100% sure. Thanks for the clarification.
 
Why is the time step relevant? If all you did was blending for a whole frame, then sure, you could average 150GB/s for a long time, but no one does that. The real question is, "for real workloads, is the ESRAM bandwidth going to be a bottleneck?" And the answer is obviously no. The ESRAM can provide the bandwidth required for most workloads requiring 16 or fewer ROPs, as long as you're doing a mix of reads and writes.

Will the full bandwidth be utilized all the time? Of course not. Do you drive your car at 60 miles an hour 100% of the time? No. In fact, my vehicle informs me that my average speed is 25 miles an hour, despite the fact that I have no trouble maintaining 60 mph when required. And that my peak speed is apparently 120mph.

Most of any frame will be low bandwidth. The only time you need full bandwidth is during ROP heavy operations like blending, which typically take a small fraction of a frame.

There, complete with car analogy.

Holy shit. Not only have you explained this in a way that anyone can understand (what matters is that the BW is there when it's needed), but more importantly you've actually gone and done the first good car analogy, ever.

The second point is almost worthy of song and praise.
 
I just wonder how well devs can keep the needed data in eSRAM and how much work it is. If it were simple and a driver could control it automagically, you would see this solution on the PC side.
The PC's problem space is a lot bigger. The hardware and software variability is massive, and output resolutions have a much wider range than on a one-architecture, one-bin, one-ish-resolution console.
The level of integration and customization a console has may also persist, although with some of the software initiatives, physical integration, and a common hardware platform it may not be as vast.

With a fixed graphics system and control of the whole platform, it may be a more tractable problem to provide a solution that will provide decent results out of the box.
 