Understanding XB1's internal memory bandwidth *spawn

Of course the hardware doesn't care. But the person for whom the work is being accomplished might very well care.
It was probably copied for a reason. So if step 1 in an algorithm is "Copy this" then the move to step 2 is progress.

On the other hand, moving one chunk of stuff from one place to another place does not rate any such adding-together.
It's pretty straightforward to produce code that performs some kind of data transformation without data reduction, reorientation, or amplification.
Within reason, additional work can be inserted into the copy scenario with no difference in bandwidth consumption from the point of view of the memory system.
A copy is a degenerate case of the above.

One thing has been accomplished, and therefore we count that work as having been done once.
The work is that transistors switched, wires flipped, and there are now memory cells holding values different from what they held before.
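
To make the copy-versus-transform point above concrete, here's a minimal sketch (hypothetical buffers, nothing platform-specific): a plain copy and a transform both read N bytes and write N bytes, so the memory system sees the same amount of traffic either way.

Code:
# Hypothetical 1 KiB source buffer and an equally sized destination.
src = bytearray(range(256)) * 4
dst = bytearray(len(src))

# 1) Plain copy: read src, write dst.
for i in range(len(src)):
    dst[i] = src[i]

# 2) Transform with no reduction or amplification: still read src, write dst.
for i in range(len(src)):
    dst[i] = src[i] ^ 0xFF  # arbitrary per-byte transformation

# Either way: len(src) bytes read plus len(dst) bytes written.
print("bytes read:", len(src), "bytes written:", len(dst))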
 
If only one pathway is ever usable at once, you can't. In a particular use-case (e.g. ESRAM-to-mem copy) where you are not using multiple pathways simultaneously, you won't be using all of that total system bandwidth, although that bandwidth obviously still exists and can be listed on a spec sheet.

Huh?

But an ESRAM->DDR3 copy IS using multiple pathways simultaneously.
 
I don't agree with that.

You're arbitrarily choosing to divvy up and then repeatedly count the same data in multiple parts of the same "system". The web server is serving data to me at 20MB/s. Period. That completely describes the "work" that is being done. Data started out on one machine and ended up on another. Are you also going to start adding additional duplicate 20 MB/s rates as the data leaves the server's hard drive, traverses its memory systems, its network adapter, various routers, my network adapter, my memory, my hard drive cache? There's no bottom to that rat hole, and it provides no additional insight about the raw download speed I see from that server.
How about looking at it in terms of money? Let's assume everyone pays $1 per megabyte for their internet usage (uplink and downlink).
In the scenario above, you would pay $20/second for your download. The owner of the server would _also_ pay $20/second for their uplink. Total money paid, $40/second. Right? In the eyes of the ISP, the bandwidth used is 40MB/s. In _every_ case where data is being moved from one location to another, it has to be both read, and written, so the bandwidth used is _always_ twice the data rate of the data being moved.
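
Or the same accounting in a few lines of Python, using just the numbers from the example above:

Code:
RATE_MB_S    = 20    # observed download rate, MB/s
PRICE_PER_MB = 1.00  # $1 per megabyte, billed on uplink and downlink alike

downlink_cost = RATE_MB_S * PRICE_PER_MB  # what the downloader pays per second
uplink_cost   = RATE_MB_S * PRICE_PER_MB  # what the server owner pays per second

print("total paid per second: $", downlink_cost + uplink_cost)  # $40/s
print("bandwidth billed by the ISP:", 2 * RATE_MB_S, "MB/s")     # 40 MB/s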
 
Let's try another example; be warned, I'm bad at analogies:

You have two garden hoses for watering, each allows a flow of 1 gallon per minute. So at max, you can have 2 gallons per minute flowing between the two of them; do you agree on this?

So you have 2 GPM of "bandwidth"

If you use both of them to get water into a bucket, you can get 2 gallons of water every minute, since there is 1 gallon per minute flowing in each hose, agree?

Now, if you decide to link up the 2 hoses into one longer hose, you can only then get 1 gallon of water every minute, correct? But does this change the maximum flow, which is 2 GPM?

Yes, the maximum flow of the new, reconfigured system is now just 1 GPM. The old system, which presumably incorporated twice as many spigots, was capable of double the flow. (This assumes that the maximum household flow rate is at least 2x greater than the max flow rate of any single spigot.)

Is that right? How does this apply to the XB1?

Hmm, does your analogy involve both hoses hooked to the same faucet? In which case, the difference between the two configurations would be minor. The 2x longer hose would achieve slightly less flow, due to increased friction and back pressure, but both configs are probably spigot-limited, which might swamp that effect. If we're talking about idealized hoses and water, I guess both configs would flow the same, both limited by the spigot flow rate. But... you said the hoses were the limiters of flow, so I don't know what to think...

In fact, you can turn the water off, and there's no water flowing through. Does that make the bandwidth 0 GPM, as if the hoses are not hoses anymore?

The system bandwidth should still be the same, even when the system is turned off. But two short hoses certainly have more bandwidth than one long one, regardless of utilization.

Again, I'm not sure how this relates to the argument we're having. Heck, I'm not sure what the argument is anymore! Could you please summarize the two opposing sides? I suspect that they aren't what I thought they were. For that, I apologize.

I think I'll just leave you guys to it. You were clearly doing fine without me. ;)
 
Again, I'm not sure how this relates to the argument we're having. Heck, I'm not sure what the argument is anymore! Could you please summarize the two opposing sides? I suspect that they aren't what I thought they were. For that, I apologize.
;)

The confusion is on data rate versus bandwidth.

chart:
DDR3->(68GBps)->GPU->(68GBps)->ESRAM

Some think that because the data transfer rate is 68GBps, the consumed bandwidth is therefore 68GBps.

But actually the BW consumed would be 136GBps, since there are 2 interfaces involved, each utilizing 68GBps.

As for how this relates to the XO, in a perfect world,
the peak theoretical BW will be 68 + 109 + 109 = 286GBps.

Here's a hint: consuming less BW for the same kind of work in the same number of cycles is actually better.
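
The same accounting as a rough sketch (the 68/109 figures are the ones used in this thread; this is just arithmetic, not a claim about any particular workload):

Code:
DDR3_BW     = 68    # GB/s
ESRAM_READ  = 109   # GB/s
ESRAM_WRITE = 109   # GB/s

peak_total = DDR3_BW + ESRAM_READ + ESRAM_WRITE  # 286 GB/s in a perfect world

# DDR3 -> GPU -> ESRAM transfer at 68 GB/s: two interfaces are busy at once.
data_rate   = 68         # GB/s of data actually being moved
bw_consumed = 68 + 68    # GB/s of interface bandwidth tied up doing it

print(peak_total)                          # 286
print(data_rate, bw_consumed)              # 68 moved, 136 consumed
print(round(bw_consumed / peak_total, 2))  # ~0.48 of the peak, spent on one copy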
 
How about looking at it in terms of money? Let's assume everyone pays $1 per megabyte for their internet usage (uplink and downlink).
In the scenario above, you would pay $20/second for your download. The owner of the server would _also_ pay $20/second for their uplink. Total money paid, $40/second. Right? In the eyes of the ISP, the bandwidth used is 40MB/s. In _every_ case where data is being moved from one location to another, it has to be both read, and written, so the bandwidth used is _always_ twice the data rate of the data being moved.

Still here. :cry:

Sure, I agree with that, assuming both parties are using that one ISP.

So, let's see, how does this relate to the XB1.... Moving data from (say) the ESRAM, to the system RAM at a max rate of 68 GB/s would "consume" 136 GB/s of the system's total BW "pool". You are in effect using all of the system RAM's bandwidth, and a goodly chunk of the ESRAM's bandwidth as well. It couldn't be used for anything else, after all.

So, I'd tend to think of that as a somewhat "diabolical" use case: You're sucking up a substantial portion of the system's potential BW, but achieving comparatively little moving-shit-around. Almost a worst case. (It sounds like a totally valid and perhaps typical use case though.)

If that was the best anyone could manage on this hardware, it might be correct to characterize the system's memory bandwidth as only being 68 GB/s. Certain forum warriors might have used this calculus in the past. No one here, of course.

But if I understand correctly, plopping a GPU down in the middle somewhere enables higher data transfer rates. In ideal cases you are transferring as much as you are "consuming". You gave an example of where the GPU was both reading and writing to the ESRAM, while simultaneously reading or writing to system RAM. (Not both!) If everything is balanced just so, you are not only consuming all of the bandwidth available, (~250GB/s) but actually transferring data at that rate as well. Woot!

As such, it would be correct to characterize the system's bandwidth as something like 250GB/s.

Is that close?

Is that what taisui was saying too? I guess it probably was. I apologize. I thought he was trying to double stuff to "pad" the XB1's specs. Quite the opposite, I guess.
 
Is that what taisui was saying too? I guess it probably was. I apologize. I thought he was trying to double stuff to "pad" the XB1's specs. Quite the opposite, I guess.

You got it. The max BW is always the same, and in general the more data you can move across all paths simultaneously (data rate), the more meaningful performance you'll get out of the system.

(hence if you have 2 garden hoses, don't link them up into a single one ;), if this is making any sense...)
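
If it helps, the hose version as a toy sketch (made-up variable names, same 1 gallon-per-minute hoses as before):

Code:
HOSE_GPM = 1.0                # each hose passes 1 gallon per minute
peak_capacity = 2 * HOSE_GPM  # the plumbing can always move 2 GPM in total

# Both hoses straight into the bucket (parallel paths, like balanced read+write):
delivered_parallel = 2 * HOSE_GPM  # 2 gallons arrive in the bucket every minute

# Hoses joined end to end (like a straight copy from one memory pool to the other):
delivered_chained = 1 * HOSE_GPM   # both hoses are full of moving water,
                                   # but only 1 gallon per minute actually arrives

print(peak_capacity, delivered_parallel, delivered_chained)  # 2.0 2.0 1.0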
 
Bandwidth is the maximum capacity of the restaurant at any given time.

If a restaurant has drive-through windows, then it has different service options at each.

PS4 has a capacity of 176 patrons

XB1 can only serve 68 patrons in the store, while it can serve 109 patrons each at the drive-up order window and the pick-up window (you called ahead).

It can theoretically service a total of 286 people at any time, and patrons at the pick-up and drive-through windows need not go through the store.

However, parking is a logistical nightmare, so it's rare that they will ever actually service all 286 people. The average service count is 68 customers inside plus 140 users at the drive-through and pick-up windows.

No, the 109 patrons each is more like McCafe.
 
Still here. :cry:

Sure, I agree with that, assuming both parties are using that one ISP.

So, let's see, how does this relate to the XB1.... Moving data from (say) the ESRAM, to the system RAM at a max rate of 68 GB/s would "consume" 136 GB/s of the system's total BW "pool". You are in effect using all of the system RAM's bandwidth, and a goodly chunk of the ESRAM's bandwidth as well. It couldn't be used for anything else, after all.

So, I'd tend to think of that as a somewhat "diabolical" use case: You're sucking up a substantial portion of the system's potential BW, but achieving comparatively little moving-shit-around. Almost a worst case. (It sounds like a totally valid and perhaps typical use case though.)

If that was the best anyone could manage on this hardware, it might be correct to characterize the system's memory bandwidth as only being 68 GB/s. Certain forum warriors might have used this calculus in the past. No one here, of course.

But if I understand correctly, plopping a GPU down in the middle somewhere enables higher data transfer rates. In ideal cases you are transferring as much as you are "consuming". You gave an example of where the GPU was both reading and writing to the ESRAM, while simultaneously reading or writing to system RAM. (Not both!) If everything is balanced just so, you are not only consuming all of the bandwidth available, (~250GB/s) but actually transferring data at that rate as well. Woot!

As such, it would be correct to characterize the system's bandwidth as something like 250GB/s.

Is that close?

Is that what taisui was saying too? I guess it probably was. I apologize. I thought he was trying to double stuff to "pad" the XB1's specs. Quite the opposite, I guess.
Yes, exactly. The total bandwidth available on the X1 is 280-odd GB/s. If you find yourself wasting it by copying memory around, you're doing something wrong. This applies equally if you have one memory pool or two.
 
Yes, exactly. The total bandwidth available on the X1 is 280-odd GB/s. If you find yourself wasting it by copying memory around, you're doing something wrong. This applies equally if you have one memory pool or two.

It's only 280GB/s if you're doing some very specific things. I thought the generally accepted obtainable bandwidth of the eSRAM was 133GB/s with alpha blending? Wouldn't that make the peak 201GB/s? Obviously it's not the peak peak, but it seems the theoretical peak of the interface is probably never going to be reached with the eSRAM except in some very, very narrow scenarios.
 
It's only 280GB/s if you're doing some very specific things. I thought the generally accepted obtainable bandwidth of the eSRAM was 133GB/s with alpha blending? Wouldn't that make the peak 201GB/s? Obviously it's not the peak peak, but it seems the theoretical peak of the interface is probably never going to be reached with the eSRAM except in some very, very narrow scenarios.

Theoretical BW is peak BW.
You are mixing it up with practical utilization.

This is not particular to the eSRAM; this is the case for all memory, single pool or not.
 
Theoretical BW is peak BW.
You are mixing it up with practical utilization.

This is not particular to the eSRAM; this is the case for all memory, single pool or not.

Fair enough, I'm just not sure how much of that 'theoretical bandwidth' the eSRAM is going to get in a practical scenario. I feel like it's a bit weird to compare the two theoretically when one will get nowhere near its peak.
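
To put a rough number on "nowhere near", using figures from this thread (the 133GB/s alpha-blending number above and the ~204GB/s combined ESRAM peak that comes up in the following posts; a back-of-the-envelope sketch, not a measurement):

Code:
ESRAM_PEAK     = 204  # GB/s, theoretical peak with reads and writes overlapped
ESRAM_MEASURED = 133  # GB/s, the alpha-blending figure quoted above
DDR3_PEAK      = 68   # GB/s

print(round(ESRAM_MEASURED / ESRAM_PEAK, 2))  # ~0.65 of the ESRAM peak
print(ESRAM_MEASURED + DDR3_PEAK)             # 201 GB/s, the figure quoted above
print(ESRAM_PEAK + DDR3_PEAK)                 # 272 GB/s if both pools hit their peaks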
 
Sure, but I think beta is asking what are the limiting factors for achieving the 204 figure rather than the 109 figure. The only "real world" example we've heard using the simultaneous read/write only gets you 2/3rds of the way there. Is that a problem with the technique needed to exploit the extra bandwidth, or are you limited by the throughput of the client devices? If, for example, there is no theoretical way to read and write more than 150GBps of data to the ESRAM based on external factors then it doesn't matter if the ESRAM itself is technically capable of more. The extra 54GBps is always waste in that scenario.
 
Sure, but I think beta is asking what are the limiting factors for achieving the 204 figure rather than the 109 figure. The only "real world" example we've heard using the simultaneous read/write only gets you 2/3rds of the way there. Is that a problem with the technique needed to exploit the extra bandwidth, or are you limited by the throughput of the client devices? If, for example, there is no theoretical way to read and write more than 150GBps of data to the ESRAM based on external factors then it doesn't matter if the ESRAM itself is technically capable of more. The extra 54GBps is always waste in that scenario.

As far as the math goes, the ROPs should be able to saturate the write bandwidth to the ESRAM, in theory. I wouldn't expect the read to be more limited than the write. Why the typical average case would fall below the sum of the two, I can't answer. Without knowing how the ESRAM is used, or if the DMEs play any role, it might be a tough question to answer.
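
For what it's worth, a back-of-the-envelope version of that claim, using commonly cited XB1 GPU figures (16 ROPs at 853MHz; the 8-bytes-per-pixel render target is my assumption, not something from this thread):

Code:
ROPS            = 16      # commonly cited XB1 ROP count (assumption for this sketch)
GPU_CLOCK_HZ    = 853e6   # commonly cited XB1 GPU clock
BYTES_PER_PIXEL = 8       # assume a 64-bit render target

write_bw_gbs = ROPS * GPU_CLOCK_HZ * BYTES_PER_PIXEL / 1e9  # colour writes only
blend_bw_gbs = 2 * write_bw_gbs                             # blending also reads the target

print(round(write_bw_gbs))  # ~109 GB/s of writes, enough to fill one direction
print(round(blend_bw_gbs))  # ~218 GB/s of read+write demand, above the ~204 GB/s peak

On those rough numbers, at least, the write side doesn't look like the obvious bottleneck.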
 
Sure, but I think beta is asking what are the limiting factors for achieving the 204 figure rather than the 109 figure. The only "real world" example we've heard using the simultaneous read/write only gets you 2/3rds of the way there. Is that a problem with the technique needed to exploit the extra bandwidth, or are you limited by the throughput of the client devices? If, for example, there is no theoretical way to read and write more than 150GBps of data to the ESRAM based on external factors then it doesn't matter if the ESRAM itself is technically capable of more. The extra 54GBps is always waste in that scenario.

You might be right and maybe that's why the data move engines are there? (Not sure)
 
Sure, but I think beta is asking what are the limiting factors for achieving the 204 figure rather than the 109 figure. The only "real world" example we've heard using the simultaneous read/write only gets you 2/3rds of the way there. Is that a problem with the technique needed to exploit the extra bandwidth, or are you limited by the throughput of the client devices?
That was one of my criticisms of the DF article that leaked it. It's a real-world example, but no context was given as to where it stood in the continuum of workloads that one would probably find running. Is it an example of a good utilization case, a mediocre one, or a bad one? The secret sources, via a non-technical writer, did not say.
What is the optimum mix?
What exactly were they measuring?
Are there cache and buffering effects that needed to be corrected for?

All we get is a number and a host of ambiguities and questions.

If, for example, there is no theoretical way to read and write more than 150GBps of data to the ESRAM based on external factors then it doesn't matter if the ESRAM itself is technically capable of more. The extra 54GBps is always waste in that scenario.
Technically capable in this case means being theoretically capable.


You might be right and maybe that's why the data move engines are there? (Not sure)

DMEs perform operations so that you don't waste a CU or two on it. A few specialized functions also reside on a few of them.
The bandwidth of the DMEs is inferior to that of the shader array, but the vector units have better things to do.
 
Fair enough, I'm just not sure how much of that 'theoretical bandwidth' the eSRAM is going to get in a practical scenario. I feel like it's a bit weird to compare the two theoretically when one will get nowhere near its peak.
Bandwidth applies in different ways under different workloads. There's no singular metric to understand the flow of data within a system. Peak BW is peak BW. It doesn't tell us utilisation though. Same as peak flops. But as it's the only metric one can realistically put out there for devs to understand the system, it's the one used. That doesn't mean you gain access to full BW at all times, but nor does it mean the peak BW can be discounted as meaningless. The situation is only muddied a little because of people making comparisons. PS4's situation is far simpler on paper, and people are trying to compare the different setups with different numbers that mean different things.

The old water analogy - PS4 has a reservoir holding 8 billion litres and a pipe providing ~170 billion litres a second. It can supply water to the Communal Purification Unit and Global Purification Units for washing at that rate. These will consume some water and send it back to the reservoir.

XB1 has a reservoir holding 8 billion litres and a pipe providing ~60 billion litres a second. It also has a tank that holds 32 million litres with a pipe providing ~109/206/133 billion litres a second. The CPU and GPU can be fed a total of ~270 billion litres of water a second, which is recycled back to the tank and reservoir (pipes are dual ported ;)).

That's one example. You can then explore what happens when tanks empty and need to be filled, which is the whole memory system, and latency, which is how long it takes to turn taps on and get water flowing. There are other analogies that describe utilisation under other circumstances. They can each be right even though they differ, and as such there's no singular understanding of the data BW within a system. It's only engineers and PR people who care - PR people can pick whichever representation is best for their system when marketing. "Our system has the highest data throughput," versus, "our system has the fastest minimum throughput and you'll never get less than this higher average," sort of thing.

Recognising that XB1 does have RAM bandwidths that combine, maybe now we can get back to understanding how the eSRAM gets different rates on the eSRAM bus? :D
 
That was fun! So now that I think I understand how this all works, would it be fair to summarise this in layman's terms as follows (and in relation to a single pool of GDDR5, which is ultimately what we're all trying to understand):
  • The XB1 has 272 GB/s of theoretically useable bandwidth
  • Compared with a single pool of GDDR5 this has 2 disadvantages in real world utilisation:
    1. It's made up of 2 separate memory pools, and so any direct copy of data between those pools without any sort of transformation to that data by the GPU will waste bandwidth performing an operation that wouldn't have been required in a single memory pool system.
    2. Even with no data copy between memory pools, the maximum useful bandwidth of the eSRAM can only be achieved if you perfectly balance read and write operations at all times. Whereas with GDDR5, since the full bandwidth is available at all times in any ratio between read and write, that's not a concern (see the sketch after this list).
  • So the bottom line is that the GPU in the XB1 will usually only be able to consume a smaller percentage of its total available bandwidth than the GPU of a different system that uses a single unified pool of GDDR5 (or DDR for that matter).
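
A rough sketch of point 2 (the 109/204 GB/s figures are the per-direction and combined ESRAM peaks discussed above; the model is deliberately simplistic):

Code:
def esram_achievable(read_fraction, per_dir_peak=109, combined_peak=204):
    """Rough upper bound on ESRAM bandwidth for a given read/write mix.

    read_fraction is the share of the traffic that is reads (0.0 to 1.0).
    Each direction is capped at per_dir_peak; the total is capped at combined_peak.
    """
    write_fraction = 1.0 - read_fraction
    limits = [combined_peak]
    if read_fraction > 0:
        limits.append(per_dir_peak / read_fraction)
    if write_fraction > 0:
        limits.append(per_dir_peak / write_fraction)
    return min(limits)

print(esram_achievable(0.5))  # balanced mix: ~204 GB/s
print(esram_achievable(1.0))  # pure reads: 109 GB/s
print(esram_achievable(0.8))  # read-heavy: ~136 GB/s

A single pool of GDDR5, by contrast, delivers its full rate at whatever read/write mix the workload asks for, which is the comparison point 2 is making.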
 