Understanding XB1's internal memory bandwidth *spawn

I know that. You didn't specify and I am simply simplifying. If the GPU is eating 256 bits from the DRAM in a cycle, then performing an operation and spitting that new data out to eSRAM at 256 bits per cycle, then 256 bits per cycle is the bandwidth being consumed at any one time.

This is a pipelined scenario. The system isn't loading data, stopping everything else it is doing, performing an operation, stopping everything it is doing, writing data, stopping everything it is doing, loading the next chunk, etc.
 
Wow, this thread sure got headache-inducing today....

To my tiny little mind, it all depends on whether the data being moved is, or even could be, "unique" or not. That determines whether it's kosher to add up the different chunks of bandwidth. If we're talking about different chunks of data, then sure, add away. If we're talking about the same data, moving through a series of path-legs, then no, you can't add up the different legs of that trip.

Examples:

1) One tiny chunk of data is moving from the ESRAM to the GPU, while simultaneously a different tiny chunk of data is moving out of the GPU and into DRAM. Both transfers are happening at the maximum speed possible over their respective pathways. In that case, the total bandwidth is measured by adding those two separate-but-simultaneous moves together. All possible bandwidth that is available for this particular task-set is being used, and it's an impressive 190-ish GB/s. You're doing two fairly impressive things at the same time, of course you get "credit" for both.

2) A chunk of data is being moved "directly" from ESRAM to RAM. Nothing else is happening. In this case, the max speed of that move will be limited to the slowest link in that chain. The RAM write speed in this case. There's nothing to add up, because there's only "one thing" happening here. The total bandwidth available for this particular task is again being used, but it's a relatively paltry 68 GB/s. You are only accomplishing one thing, so you only get to take credit for it once. (You can't count the two ends of the move as being separate things you accomplished.)

If we're talking about the capabilities of the system as a whole, the hardware as a whole, then you have to acknowledge "best case"* scenarios like example one. So, yes, in general "double counting" is a valid way to characterize the performance of the entire system. This is despite the fact that certain tasks (like example two) don't take advantage of all of that performance.

* Actually, example one is not a best case. Such a case would need to somehow be transferring in and out of ESRAM at whatever that magical combined rate is, while simultaneously maxing out the RAM link with different data. Something north of 200 GB/s, allegedly.
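
To put rough numbers on those two examples, here is a minimal Python sketch. The 109 GB/s per eSRAM direction and 68 GB/s DRAM figures are assumed round numbers (the "190-ish" total above presumably rests on a higher eSRAM figure); the only point is that independent pathways add, while a serial chain is limited by its slowest link.

Code:
# Assumed round figures, purely for illustration; the exact peaks are not the point.
ESRAM_ONE_WAY = 109.0  # GB/s for one eSRAM direction (read or write)
DRAM_LINK = 68.0       # GB/s for the DDR3 link

# Example 1: two different chunks moving over separate pathways at the same time,
# so the bandwidths add.
example_one = ESRAM_ONE_WAY + DRAM_LINK       # ~177 GB/s with these assumed figures

# Example 2: one chunk moving eSRAM -> DRAM; the slowest link in the chain limits it.
example_two = min(ESRAM_ONE_WAY, DRAM_LINK)   # 68 GB/s

print(example_one, example_two)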
 
2) A chunk of data is being moved "directly" from ESRAM to RAM. Nothing else is happening. In this case, the max speed of that move will be limited to the slowest link in that chain. The RAM write speed in this case. There's nothing to add up, because there's only "one thing" happening here. The total bandwidth available for this particular task is again being used, but it's a relatively paltry 68 GB/s. You are only accomplishing one thing, so you only get to take credit for it once. (You can't count the two ends of the move as being separate things you accomplished.)

Whether the data is manipulated or not, the hardware does not care and it does not matter. You are putting a human perception into what constitutes "meaningful".

Let me put it another way.

When you download a file from a web server at, say, 20 MB/s, how much total BW is consumed?

The answer is 40 MB/s, because the web server uploads at 20 MB/s and your PC downloads at 20 MB/s.

The file is transferred "directly" but it doesn't matter.
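
As a sketch of that accounting, with nothing assumed beyond the 20 MB/s figure from the example:

Code:
download_rate = 20.0  # MB/s, the rate the file actually moves at

server_uplink_busy = download_rate   # 20 MB/s of the server's uplink is occupied
pc_downlink_busy = download_rate     # 20 MB/s of the PC's downlink is occupied

data_rate = download_rate                            # the file still arrives at 20 MB/s
bw_consumed = server_uplink_busy + pc_downlink_busy  # 40 MB/s across the two endpoints

print(data_rate, bw_consumed)  # 20.0 40.0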
 
Let me ask both of you a question.

If I take GDDR5 RAM and customize it so that, instead of running eight 32-bit channels in parallel to a 256-bit interface, I run eight 32-bit channels serially to a 32-bit interface, can I claim the same internal bandwidth at the same speed rating as vanilla GDDR5?

Why not? Each channel is handling 32 bits of data per cycle, just like plain vanilla GDDR5.

When moving data from eSRAM to DRAM, even when moving through the GPU, the two buses lie along a serial path, so adding up the bandwidth of the two buses serves no purpose, because the data will never move at a rate of 136 GB/s.

Taisui and 3dilettante, you are both right about summing up eSRAM and DRAM bandwidth when the sources and destinations are different. If the GPU is reading from both the eSRAM and the DRAM, then it has 170 GB/s (or 177 GB/s) worth of bandwidth available, because the GPU is serving as the destination and both the eSRAM and the DRAM are serving as sources. The GPU is pulling data from a 256-bit interface and a 1024-bit interface every cycle. But I wouldn't claim 177 GB/s of bandwidth when the source is eSRAM and the destination is DRAM, or vice versa, because you are moving data from a 256-bit interface to a 1024-bit interface, or vice versa. That 256-bit interface acts as an impediment, allowing the source to satisfy reads at only 68 GB/s, or the destination to satisfy writes at only 68 GB/s.
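
For illustration, here is roughly what that channel question amounts to in numbers. The 5.5 GT/s GDDR5 data rate is an assumed figure, not one from the post; the point is that channels add when they feed a wide interface in parallel, but chaining them behind a 32-bit interface leaves only 32 bits crossing it per transfer.

Code:
DATA_RATE_GT_S = 5.5       # assumed GDDR5 data rate per pin
CHANNEL_WIDTH_BITS = 32
CHANNELS = 8

per_channel_gb_s = CHANNEL_WIDTH_BITS / 8 * DATA_RATE_GT_S   # 22 GB/s per channel

parallel_256bit = CHANNELS * per_channel_gb_s   # eight channels side by side: 176 GB/s
serial_32bit = per_channel_gb_s                 # eight channels chained behind a 32-bit
                                                # interface: still only 22 GB/s crosses it

print(parallel_256bit, serial_32bit)  # 176.0 22.0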
 
Thank you for confirming my beliefs.



It doesn't matter. When explicitly describing the bandwidth available between eSRAM and DRAM, the DDR3 bandwidth is the limiting factor.

68 GB/s of reads or 68 GB/s of writes from eSRAM to DRAM is an either/or proposition, so summing up those bandwidths is nonsensical.

DRAM to DRAM accommodates the 68 GB/s with half the bandwidth utilized by writes and half by reads. 136 GB/s of bandwidth using DRAM requires both reads and writes to independently utilize the entire bandwidth allowed by the DRAM simultaneously. A 512-bit interface with 2133 MHz DDR3 could drive read and write bandwidth at a total of 136 GB/s; a 256-bit interface cannot. The DDR3 can provide 68 GB/s of reads or writes, not 68 GB/s of reads and 68 GB/s of writes.
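
A quick sketch of that bus-width arithmetic, using only the 2133 MT/s and 256-bit figures quoted above:

Code:
TRANSFER_RATE_MT_S = 2133

def ddr3_peak_gb_s(bus_width_bits):
    # peak = (bus width in bytes) x (transfers per second)
    return bus_width_bits / 8 * TRANSFER_RATE_MT_S / 1000.0

bw_256 = ddr3_peak_gb_s(256)   # ~68.3 GB/s total, shared between reads and writes
bw_512 = ddr3_peak_gb_s(512)   # ~136.5 GB/s, what 68 of reads plus 68 of writes would need

# A DRAM-to-DRAM copy splits the single 256-bit interface between its two halves:
copy_reads = bw_256 / 2        # ~34 GB/s of reads
copy_writes = bw_256 / 2       # ~34 GB/s of writes

print(round(bw_256, 1), round(bw_512, 1), round(copy_reads + copy_writes, 1))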

In fact, if MS's 204.8 GB/s bandwidth figure between the GPU and eSRAM were derived in such a fashion, everyone here, along with any of our console-savvy, technically enlightened mothers, would pitch a fit and go "GTFOH".
Here's a scenario:
Read textures from RAM at 68GB/s
GPU does its magic with the textures and creates a framebuffer.
Write intermediate framebuffer to ESRAM at 109 GB/s
Read the intermediate framebuffer from ESRAM at 80GB/s

repeat.

Now imagine that at each time point the system is doing all three things at the same time:
Read Tex 1
Read Tex 2 : Write FB 1
Read Tex 3 : Write FB 2 : Read FB 1
Read Tex 4 : Write FB 3 : Read FB 2 <---- At this point, what is the instantaneous bandwidth utilization?
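
Using nothing but the figures in the scenario, the arithmetic being asked for is a plain sum once all three streams are in flight:

Code:
TEX_READ_DRAM = 68.0     # GB/s of texture reads from DRAM
FB_WRITE_ESRAM = 109.0   # GB/s of framebuffer writes into eSRAM
FB_READ_ESRAM = 80.0     # GB/s of framebuffer reads out of eSRAM

# At the "Read Tex 4 : Write FB 3 : Read FB 2" step, all three streams are
# in flight at once, so the instantaneous consumed bandwidth is their sum.
instantaneous = TEX_READ_DRAM + FB_WRITE_ESRAM + FB_READ_ESRAM

print(instantaneous)  # 257.0 GB/s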
 
Let me ask both of you a question.

If I take GDDR5 RAM and customize it so that, instead of running eight 32-bit channels in parallel to a 256-bit interface, I run eight 32-bit channels serially to a 32-bit interface, can I claim the same internal bandwidth at the same speed rating as vanilla GDDR5?

Why not? Each channel is handling 32 bits of data per cycle, just like plain vanilla GDDR5.

When moving data from eSRAM to DRAM, even when moving through the GPU, the two buses lie along a serial path, so adding up the bandwidth of the two buses serves no purpose, because the data will never move at a rate of 136 GB/s.
Aah, there's your problem. We're talking about the total amount of the available bandwidth _consumed_. If you're copying from DRAM to DRAM, the data is moving at 34 GB/s, but the bandwidth _consumed_ is 68 GB/s. Copying from RAM to ESRAM, the data moves at 68 GB/s, but the bandwidth _consumed_ (i.e., the bandwidth of the system that can no longer be used for other tasks) is 136 GB/s.
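
A small sketch of that distinction between data rate and consumed bandwidth, assuming the 68 GB/s DRAM and 109 GB/s eSRAM (one direction) figures used elsewhere in the thread:

Code:
def copy_within_one_pool(pool_bw):
    """Copy inside one memory pool: reads and writes share the same interface."""
    data_rate = pool_bw / 2          # DRAM -> DRAM moves data at ~34 GB/s
    consumed = pool_bw               # but the whole 68 GB/s interface is occupied
    return data_rate, consumed

def copy_between_pools(src_bw, dst_bw):
    """Copy between two pools: each side's interface is occupied at the copy rate."""
    data_rate = min(src_bw, dst_bw)  # DRAM -> eSRAM moves data at 68 GB/s
    consumed = 2 * data_rate         # 68 of DRAM reads plus 68 of eSRAM writes = 136
    return data_rate, consumed

print(copy_within_one_pool(68.0))       # (34.0, 68.0)
print(copy_between_pools(68.0, 109.0))  # (68.0, 136.0)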
 
Let me ask both of you a question.

When moving data from eSRAM to DRAM, even when moving through the GPU, the two buses lie along a serial path, so adding up the bandwidth of the two buses serves no purpose, because the data will never move at a rate of 136 GB/s.

You are working on a false assumption.

Just because a memory copy goes through the GPU this way doesn't mean that this is the ONLY way it can be used; it's not the same as linking up two buses in serial.
 
Here's a scenario:
Read textures from RAM at 68GB/s
GPU does its magic with the textures and creates a framebuffer.
Write intermediate framebuffer to ESRAM at 109 GB/s
Read the intermediate framebuffer from ESRAM at 80GB/s

repeat.

Now imagine that at each time point the system is doing all three things at the same time:
Read Tex 1
Read Tex 2 : Write FB 1
Read Tex 3 : Write FB 2 : Read FB 1
Read Tex 4 : Write FB 3 : Read FB 2 <---- At this point, what is the instantaneous bandwidth utilization?

Thank you for a scenario that should be more understandable in this context.
 
But I wouldn't claim 177 GB/s of bandwidth when the source is eSRAM and the destination is DRAM, or vice versa, because you are moving data from a 256-bit interface to a 1024-bit interface, or vice versa. That 256-bit interface acts as an impediment, allowing the source to satisfy reads at only 68 GB/s, or the destination to satisfy writes at only 68 GB/s.

The data is moving at a rate of 68 GB/s, but the BW consumed is 136 GB/s; they are different things. This holds true regardless of the source/destination.

If you do a copy from DRAM->DRAM, 34 GB/s read, 34 GB/s write, the total BW consumed is still 68 GB/s, not 34.
 
2) A chunk of data is being moved "directly" from ESRAM to RAM. Nothing else is happening. In this case, the max speed of that move will be limited to the slowest link in that chain. The RAM write speed in this case. There's nothing to add up, because there's only "one thing" happening here. The total bandwidth available for this particular task is again being used, but it's a relatively paltry 68 GB/s. You are only accomplishing one thing, so you only get to take credit for it once. (You can't count the two ends of the move as being separate things you accomplished.)

If you're copying anything larger than 64 bytes, this isn't what copying would do.
Data would be read into the CUs in a minimum of 64 byte chunks, which is the length of a cache line.
That's multiple RAM and eSRAM transfers per read or write issued.
The CU can in a single instruction start the read process for multiple addresses.
The same goes for writes.
Once a read instruction is complete, the CU can either line up more reads or fire up a write.
It can issue the next read the next issue cycle after a write, even though that write's data hasn't gone all the way through the long pipeline and all the buffers on the way to the eSRAM.

The next bit of data is being read in, even as the first chunk is being written out.
After an initial filling of the pipeline, chunk N is being written as chunk N+1 is coming in.
The offset is likely higher, since the pipelines in question are likely significantly longer than the simplified example.

Code:
R0  R1   R2   R3 ...
    W0   W1   W2  ...
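
The same schedule as a toy Python loop, just to make the overlap explicit (the real offset would be larger, as noted above):

Code:
N_CHUNKS = 4
for cycle in range(N_CHUNKS + 1):
    read = "R%d" % cycle if cycle < N_CHUNKS else "--"    # chunk being read in
    write = "W%d" % (cycle - 1) if cycle >= 1 else "--"   # chunk being written out
    print("cycle %d: %s %s" % (cycle, read, write))
# cycle 0: R0 --
# cycle 1: R1 W0   <- from here on, both the read and the write path are busy every cycle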
 
Let me put it another way.

When you download a file from a web server at, say, 20 MB/s, how much total BW is consumed?

The answer is 40 MB/s, because the web server uploads at 20 MB/s and your PC downloads at 20 MB/s.
I don't agree with that.

You're arbitrarily choosing to divvy up and then repeatedly count the same data in multiple parts of the same "system". The web server is serving to me at 20MB/s. Period. That completely describes the "work" that is being done. Data started out on one machine and ended up on another. Are you also going to start adding additional duplicate 20 MB/s rates as the data leaves the server's hard drive, traverses its memory systems, its network adapter, various routers, my network adapter, my memory, my hard drive cache? There's no bottom to that rat hole, and it provides no additional insight about the raw download speed I see from that server.

Your example describes one set of data being moved at a specified rate between two points. There's no need to make it more complicated than that. And if you do choose to make it more complicated, the amount of data moved over time still doesn't change. If you were talking about downloading from multiple servers simultaneously, while also doing processing and uploading, then sure, there would be value to a more nuanced tallying up of "total BW being used within a system". But that's because we would now be talking about a "bigger task".

Which takes me back to my "moving unique chunks of data" example. That's where you can add stuff. Not when it's the same data traversing multiple points on a serial path. That just isn't how we've agreed to characterize BW between two entities.

Maybe. That's the way I see it. (I think we actually agree that adding certain BWs is a valid way to characterize the XB1 hardware. You are just taking that concept too far, IMO, with your slice-apart-and-re-count methodology.)
 
Aah, there's your problem. We're talking about the total amount of the available bandwidth _consumed_. If you're copying from DRAM to DRAM, the data is moving at 34 GB/s, but the bandwidth _consumed_ is 68 GB/s. Copying from RAM to ESRAM, the data moves at 68 GB/s, but the bandwidth _consumed_ (i.e., the bandwidth of the system that can no longer be used for other tasks) is 136 GB/s.


NO!!! Bkilian, you are unequivocally, undeniably and straight up...syke.

Thanks, I see what the guys are talking about now.

Taisui and 3dilettante, if I caused endless frustration, my apologies.

I tend to conceptualize the data movement as a set number of bits moving per cycle, 256 bits or 1024 in this case.
 
I don't agree with that.

You're arbitrarily choosing to divvy up and then repeatedly count the same data in multiple parts of the same "system". The web server is serving to me at 20MB/s. Period. That completely describes the "work" that is being done. Data started out on one machine and ended up on another. Are you also going to start adding additional duplicate 20 MB/s rates as the data leaves the server's hard drive, traverses its memory systems, its network adapter, various routers, my network adapter, my memory, my hard drive cache? There's no bottom to that rat hole, and it provides no additional insight about the raw download speed I see from that server.

Your example describes one set of data being moved at a specified rate between two points. There's no need to make it more complicated than that. And if you do choose to make it more complicated, the amount of data moved over time still doesn't change. If you were talking about downloading from multiple servers simultaneously, while also doing processing and uploading, then sure, there would be value to a more nuanced tallying up of "total BW being used within a system". But that's because we would now be talking about a "bigger task".

Which takes me back to my "moving unique chunks of data" example. That's where you can add stuff. Not when it's the same data traversing multiple points on a serial path. That just isn't how we've agreed to characterize BW between two entities.

Maybe. That's the way I see it. (I think we actually agree that adding certain BWs is a valid way to characterize the XB1 hardware. You are just taking that concept too far, IMO, with your slice-apart-and-re-count methodology.)

No, it is correct in that the total bandwidth "consumed" is 40 MB/s in the example. 20 MB/s is unavailable to anything else on the web server's side, and 20 MB/s is unavailable to anything else on the PC's side.

Now let's turn that example around another way. You download 20 MB/s from another server. Another PC downloads 20 MB/s worth of stuff from the original web server.

With regard to the original PC and the original web server, 40 MB/s is still being "consumed"; they just now have a different source and destination, respectively.

With regard to all 4 devices in question, 20 MB/s is now being "consumed" on each of them. In other words, 20 MB/s worth of bandwidth is no longer available to be used by anything else on those 4 devices.

Regards,
SB
 
I don't agree with that.

You're arbitrarily choosing to divvy up and then repeatedly count the same data in multiple parts of the same "system".

Data rate is not the same as bandwidth, and the hardware does not care and does not know whether the data passing through it is any different.

Let's try another example; be warned, I'm bad at analogies:

You have two garden hoses for watering; each allows a flow of 1 gallon per minute. So at max, you can have 2 gallons per minute flowing between the two of them. Do you agree with this?

So you have 2 gallons per minute of "bandwidth".

If you use both of them to get water into a bucket, you can get 2 gallons of water every minute, since there is 1 gallon of water flowing through each hose, agree?

Now, if you decide to link up the 2 hoses into one longer hose, you can then only get 1 gallon of water every minute, correct? But does this change the maximum flow, which is 2 gallons per minute?

In fact, you can turn the water off so there's no water flowing through at all. Does that make the bandwidth 0 gallons per minute, as if the hoses were not hoses anymore?
 
Here's a scenario:
Read textures from RAM at 68GB/s
GPU does its magic with the textures and creates a framebuffer.
Write intermediate framebuffer to ESRAM at 109 GB/s
Read the intermediate framebuffer from ESRAM at 80GB/s

repeat.

Now imagine that at each time point the system is doing all three things at the same time:
Read Tex 1
Read Tex 2 : Write FB 1
Read Tex 3 : Write FB 2 : Read FB 1
Read Tex 4 : Write FB 3 : Read FB 2 <---- At this point, what is the instantaneous bandwidth utilization?


200 big macs in a car travelling 35 mph?
 
Ok, so out of all of this we can confirm that the PS4 actually has a max bandwidth of 352 GB/s. Good, let's move along then. I'm hungry.
 
Ok, so out of all of this we can confirm that the PS4 actually has a max bandwidth of 352 GB/s. Good, let's move along then. I'm hungry.

Not sure if you are really not getting it, or just being sarcastic, which I see no reason for.
Sure, if you are saying that the PS4 can do simultaneous R/W to the GDDR5, that'll make it 352 GB/s.
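
For reference, the arithmetic behind that 352 figure under the same counting convention. The 5.5 GT/s GDDR5 rate and 256-bit bus are assumed round numbers; whether counting reads and writes separately like this is meaningful is exactly what's being debated.

Code:
GDDR5_GT_S = 5.5        # assumed data rate per pin
BUS_WIDTH_BITS = 256

one_direction = BUS_WIDTH_BITS / 8 * GDDR5_GT_S   # 176.0 GB/s peak on the bus
read_plus_write = 2 * one_direction               # 352.0 GB/s if reads and writes
                                                  # are counted as separate consumers

print(one_direction, read_plus_write)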
 
Whether the data is manipulated or not, the hardware does not care and it does not matter. You are putting a human perception into what constitutes "meaningful".
Of course the hardware doesn't care. But the person for whom the work is being accomplished might very well care. (Some kid playing CoD who values fast frame rates.) A benchmark might "care". If being able to move data over two separate pathways, both at full speed, allows software to work better or run faster than otherwise, then that ability is meaningful and should thus be "counted".

And it has nothing to do with "manipulation", necessarily. It's about moving stuff. We're thinking about whether the ability to move two (or more) things at the same time provides a performance advantage compared to being able to move just one of those things in that same time. Doesn't that sound useful to you? Would that be a meaningful advantage to have? If yes, then we should be able to add those two moves together as part of a total peak/potential/theoretical BW spec.

On the other hand, moving one chunk of stuff from one place to another place does not rate any such adding-together. One thing has been accomplished, and therefore we count that work as having been done once. No separate extra "value" was added as it left one place, or entered another. Something was moved, once. BW specifies the rate at which it moved. There's no reason to add anything to anything in that case. 'Cause there's only one thing.

It seems simple to me: If the ability to use multiple pathways at the same time is both possible and could conceivably increase performance, we should add the BW of those pathways together. If only one pathway is ever usable at once, you can't. In a particular use-case (e.g. an ESRAM-to-mem copy) where you are not using multiple pathways simultaneously, you won't be using all of that total system bandwidth, although that bandwidth obviously still exists and can be listed on a spec sheet.
 
Here's a scenario:
Read textures from RAM at 68GB/s
GPU does its magic with the textures and creates a framebuffer.
Write intermediate framebuffer to ESRAM at 109 GB/s
Read the intermediate framebuffer from ESRAM at 80GB/s

repeat.

Now imagine that at each time point the system is doing all three things at the same time:
Read Tex 1
Read Tex 2 : Write FB 1
Read Tex 3 : Write FB 2 : Read FB 1
Read Tex 4 : Write FB 3 : Read FB 2 <---- At this point, what is the instantaneous bandwidth utilization?

I say that you should add 'em all up together. 68 + 109 + 80 = 257?

Those three streams of data are all different, are all in motion simultaneously, and all three could arguably be useful as soon as they get to where they are headed. (Losing any of those pathways would probably hurt overall performance.)

Am I right? What do I win? ;)
 