Understanding XB1's internal memory bandwidth *spawn

Discussion in 'Console Technology' started by zupallinere, Sep 11, 2013.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    This is a pipelined scenario. The system isn't loading data, stopping everything else it is doing, performing an operation, stopping everything it is doing, writing data, stopping everything it is doing, loading the next chunk, etc.
     
  2. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    So answer this:

    RAM->(reads at 34GBps)->GPU->(writes at 34GBps)->RAM

    What's the total BW?
     
  3. DaveNagy

    Newcomer

    Joined:
    Jan 18, 2013
    Messages:
    51
    Likes Received:
    0
    Wow, this thread sure got headache-inducing today....

To my tiny little mind, it all depends on whether the data being moved is, or even could be, "unique" or not. That determines whether it's kosher to add up the different chunks of bandwidth. If we're talking about different chunks of data, then sure, add away. If we're talking about the same data moving through a series of path-legs, then no, you can't add up the different legs of that trip.

    Examples:

1) One tiny chunk of data is moving from the ESRAM to the GPU, while simultaneously a different tiny chunk of data is moving out of the GPU and into DRAM. Both transfers are happening at the max speed possible over their respective pathways. In that case, the total bandwidth is measured by adding those two separate-but-simultaneous moves together. All possible bandwidth that is available for this particular task-set is being used, and it's an impressive 190-ish GB/s. You're doing two fairly impressive things at the same time, so of course you get "credit" for both.

    2) A chunk of data is being moved "directly" from ESRAM to RAM. Nothing else is happening. In this case, the max speed of that move will be limited to the slowest link in that chain. The RAM write speed in this case. There's nothing to add up, because there's only "one thing" happening here. The total bandwidth available for this particular task is again being used, but it's a relatively paltry 68 GB/s. You are only accomplishing one thing, so you only get to take credit for it once. (You can't count the two ends of the move as being separate things you accomplished.)

    If we're talking about the capabilities of the system as a whole, the hardware as a whole, then you have to acknowledge "best case"* scenarios like example one. So, yes, in general "double counting" is a valid way to characterize the performance of the entire system. This is despite the fact that certain tasks (like example two) don't take advantage of all of that performance.

    * Actually, example one is not a best case. Such a case would need to somehow be transferring in and out of ESRAM at whatever that magical combined rate is, while simultaneously maxing out the RAM link with different data. Something north of 200 GB/s, allegedly.
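The two examples above boil down to a couple of lines of arithmetic. Here's an illustrative sketch using figures quoted in this thread (109 GB/s for the ESRAM link, 68 GB/s for the DRAM link — assumed peaks, not official specs; with these particular numbers the parallel case comes to 177 GB/s, and the exact total depends on which ESRAM figure you use):

```python
# Illustrative only: peak figures as discussed in this thread, not official specs.
ESRAM_LINK = 109  # GB/s, assumed peak for the ESRAM pathway
DRAM_LINK = 68    # GB/s, assumed peak for the DRAM pathway

# Example 1: two different chunks moving over two pathways at once.
# Separate simultaneous transfers add up.
parallel_total = ESRAM_LINK + DRAM_LINK   # 177 GB/s of bandwidth in use

# Example 2: one chunk moving ESRAM -> DRAM along a serial path.
# The slowest link in the chain caps the rate; nothing to add up.
serial_rate = min(ESRAM_LINK, DRAM_LINK)  # 68 GB/s

print(parallel_total)  # 177
print(serial_rate)     # 68
```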
     
    #83 DaveNagy, Sep 12, 2013
    Last edited by a moderator: Sep 12, 2013
  4. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
Whether the data is manipulated or not, the hardware does not care and it does not matter. You are putting a human perception into what constitutes "meaningful".

    Let me put it another way.

When you download a file from a web server at, say, 20MB/s, how much total BW is consumed?

The answer is 40MB/s, because the web server uploads at 20MB/s, and your PC downloads at 20MB/s.

    The file is transferred "directly" but it doesn't matter.
     
  5. dobwal

    Legend

    Joined:
    Oct 26, 2005
    Messages:
    5,955
    Likes Received:
    2,325
Let me ask you both a question.

If I take GDDR5 RAM and customize it so that, instead of running 8 32-bit channels in parallel as a 256-bit interface, I run the 8 32-bit channels serially through a 32-bit interface, can I claim the same internal bandwidth at the same speed rating as vanilla GDDR5?

Why not? Each channel is handling 32 bits of data per cycle, just like plain vanilla GDDR5.

When moving data from eSRAM to DRAM, even when moving through the GPU, the two buses lie along a serial path, so adding up the bandwidth of the two buses serves no purpose: the data will never move at a rate of 136 GB/s.

Taisui and 3dilettante, you are both right about summing up eSRAM and DRAM bandwidth when the sources and destinations are different. If the GPU is reading from both the eSRAM and the DRAM, then it has 170 GB/s (or 177 GB/s) worth of bandwidth available, because you have the GPU serving as the destination and both eSRAM and DRAM serving as sources. The GPU is pulling data from a 256-bit interface and a 1024-bit interface every cycle. But I wouldn't claim 177 GB/s of bandwidth when the source is eSRAM and the destination is DRAM, or vice versa, because you are moving data from either a 256-bit interface to a 1024-bit interface or the other way around. That 256-bit interface acts as an impediment, allowing the source to satisfy reads at only 68 GB/s or the destination to satisfy writes at only 68 GB/s.
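The GDDR5 thought experiment above can be put in numbers. A minimal sketch, assuming a 5.5 GT/s effective data rate per pin purely for illustration (not a figure from the thread): eight 32-bit channels in parallel behave as one 256-bit interface, while chaining them serially leaves an effective width of only 32 bits, no matter how many channels sit behind it.

```python
CHANNELS = 8
CHANNEL_BITS = 32
DATA_RATE_GTPS = 5.5  # assumed GDDR5 effective transfer rate, illustrative only

# Parallel: all 8 channels transfer every cycle -> a 256-bit interface.
parallel_gbps = CHANNELS * CHANNEL_BITS / 8 * DATA_RATE_GTPS  # bytes/transfer * GT/s

# Serial: the data still funnels through a single 32-bit interface.
serial_gbps = CHANNEL_BITS / 8 * DATA_RATE_GTPS

print(parallel_gbps)  # 176.0 GB/s
print(serial_gbps)    # 22.0 GB/s
```

Each channel still "handles 32 bits per cycle" internally in the serial case, but the claimable bandwidth is set by the interface the data actually crosses.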
     
    #85 dobwal, Sep 12, 2013
    Last edited by a moderator: Sep 12, 2013
  6. bkilian

    Veteran

    Joined:
    Apr 22, 2006
    Messages:
    1,539
    Likes Received:
    3
    Here's a scenario:
    Read textures from RAM at 68GB/s
GPU does its magic with the textures and creates a framebuffer.
    Write intermediate framebuffer to ESRAM at 109 GB/s
    Read the intermediate framebuffer from ESRAM at 80GB/s

    repeat.

    Now imagine that at each time point the system is doing all three things at the same time:
    Read Tex 1
    Read Tex 2 : Write FB 1
    Read Tex 3 : Write FB 2 : Read FB 1
    Read Tex 4 : Write FB 3 : Read FB 2 <---- At this point, what is the instantaneous bandwidth utilization?
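Tallying that scenario step by step with the figures from the post (68 GB/s texture reads, 109 GB/s framebuffer writes, 80 GB/s framebuffer reads) gives the instantaneous utilization at each point. This is a sketch of the accounting only, not of any real scheduler:

```python
TEX_READ, FB_WRITE, FB_READ = 68, 109, 80  # GB/s, figures from the post

steps = [
    ("tex",),                   # Read Tex 1
    ("tex", "fb_w"),            # Read Tex 2 : Write FB 1
    ("tex", "fb_w", "fb_r"),    # Read Tex 3 : Write FB 2 : Read FB 1
    ("tex", "fb_w", "fb_r"),    # Read Tex 4 : Write FB 3 : Read FB 2
]
rates = {"tex": TEX_READ, "fb_w": FB_WRITE, "fb_r": FB_READ}

# Sum the rates of whatever transfers are in flight at each step.
totals = [sum(rates[op] for op in active) for active in steps]
for i, total in enumerate(totals, 1):
    print(f"step {i}: {total} GB/s in flight")
```

Once the pipeline is full (step 3 onward), all three transfers overlap, and the instantaneous utilization is 68 + 109 + 80 = 257 GB/s.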
     
  7. bkilian

    Veteran

    Joined:
    Apr 22, 2006
    Messages:
    1,539
    Likes Received:
    3
Aah, there's your problem. We're talking about the total amount of the available bandwidth _consumed_. If you're copying from DRAM to DRAM, the data is moving at 34GB/s, but the bandwidth _consumed_ is 68GB/s. Copying from RAM to ESRAM, the data moves at 68GB/s, but the bandwidth _consumed_ (i.e., the bandwidth of the system that can no longer be used for other tasks) is 136GB/s.
     
  8. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    You are working on a false assumption.

Just because a memory copy goes through the GPU this way doesn't mean that this is the ONLY way it can be used, and it's not the same as linking up two buses in serial.
     
  9. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,516
    Likes Received:
    24,424
    Thank you for a scenario that should be more understandable in the context.
     
  10. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
The data is moving at a rate of 68GB/s, but the BW consumed is 136GB/s; they are different things. This holds true regardless of the source/destination.

If you do a copy from DRAM->DRAM, 34GB/s read, 34GB/s write, the total BW consumed is still 68GB/s, not 34.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    If you're copying anything larger than 64 bytes, this isn't what copying would do.
    Data would be read into the CUs in a minimum of 64 byte chunks, which is the length of a cache line.
    That's multiple RAM and eSRAM transfers per read or write issued.
    The CU can in a single instruction start the read process for multiple addresses.
    The same goes for writes.
    Once a read instruction is complete, the CU can either line up more reads or fire up a write.
    It can issue the next read the next issue cycle after a write, even though that write's data hasn't gone all the way through the long pipeline and all the buffers on the way to the eSRAM.

    The next bit of data is being read in, even as the first chunk is being written out.
    After an initial filling of the pipeline, chunk N is being written as chunk N+1 is coming in.
    The offset is likely higher, since the pipelines in question are likely significantly longer than the simplified example.

    Code:
    R0  R1   R2   R3 ...
        W0   W1   W2  ...
    
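The overlap sketched above can be modeled with a toy cycle count. Assuming one read issues per cycle and each chunk's write completes a fixed number of cycles after its read (both assumptions, purely for illustration), copying N chunks costs roughly N cycles plus a one-time pipeline-fill cost, rather than the 2N cycles of a strictly load-stop-store sequence:

```python
def pipelined_copy_cycles(chunks: int, pipeline_depth: int) -> int:
    """Cycles to copy `chunks` cache lines when writes overlap later reads.

    One read issues per cycle; the write for chunk i completes
    `pipeline_depth` cycles after its read issues, so only the last
    write's latency is left exposed.
    """
    return chunks + pipeline_depth

def serialized_copy_cycles(chunks: int) -> int:
    """Cycles if every write must finish before the next read starts."""
    return 2 * chunks

print(pipelined_copy_cycles(1000, 50))  # 1050: the fill cost is paid once
print(serialized_copy_cycles(1000))     # 2000: half the throughput
```

A deeper pipeline only stretches the startup offset; steady-state throughput stays at about one chunk per cycle.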
     
  12. DaveNagy

    Newcomer

    Joined:
    Jan 18, 2013
    Messages:
    51
    Likes Received:
    0
    I don't agree with that.

    You're arbitrarily choosing to divvy up and then repeatedly count the same data in multiple parts of the same "system". The web server is serving to me at 20MB/s. Period. That completely describes the "work" that is being done. Data started out on one machine and ended up on another. Are you also going to start adding additional duplicate 20 MB/s rates as the data leaves the server's hard drive, traverses its memory systems, its network adapter, various routers, my network adapter, my memory, my hard drive cache? There's no bottom to that rat hole, and it provides no additional insight about the raw download speed I see from that server.

    Your example describes one set of data being moved at a specified rate between two points. There's no need to make it more complicated than that. And if you do choose to make it more complicated, the amount of data moved over time still doesn't change. If you were talking about downloading from multiple servers simultaneously, while also doing processing and uploading, then sure, there would be value to a more nuanced tallying up of "total BW being used within a system". But that's because we would now be talking about a "bigger task".

    Which takes me back to my "moving unique chunks of data" example. That's where you can add stuff. Not when it's the same data traversing multiple points on a serial path. That just isn't how we've agreed to characterize BW between two entities.

    Maybe. That's the way I see it. (I think we actually agree that adding certain BWs is a valid way to characterize the XB1 hardware. You are just taking that concept too far, IMO, with your slice-apart-and-re-count methodology.)
     
  13. dobwal

    Legend

    Joined:
    Oct 26, 2005
    Messages:
    5,955
    Likes Received:
    2,325

    NO!!! Bkilian, you are unequivocally, undeniably and straight up...syke.

    Thanks, I see what the guys are talking about now.

    Taisui and 3dilettante, if I caused endless frustration, my apologies.

I tend to conceptualize the data movement as a set number of bits moving per cycle: 256 or 1024 bits in this case.
     
  14. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    19,426
    Likes Received:
    10,320
No, it is correct in that the total bandwidth "consumed" is 40 MB/s in the example. 20 MB/s is unavailable to any other source on the webserver. 20 MB/s is unavailable to any other source on the PC.

    Now let's turn that example around another way. You download 20 MB/s from another server. Another PC downloads 20 MB/s worth of stuff from the original web server.

With regards to the original PC and original webserver, 40 MB/s is still being "consumed"; they just now have a different source and destination, respectively.

With regards to all 4 devices in question, 20 MB/s is now being "consumed" on each of them. In other words, 20 MB/s worth of bandwidth is no longer available to be used by anything else on those 4 devices.

    Regards,
    SB
     
  15. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
Data rate is not the same as bandwidth, and the hardware does not care and does not know whether the data passing through it is any different.

    Let's try another example, be warned I'm bad at analogy:

You have two garden hoses for watering, each of which allows a flow of 1 gallon per minute. So at max, you can have 2 gallons per minute flowing through the two of them; do you agree on this?

So you have 2 gal/min of "bandwidth".

If you use both of them to get water into a bucket, you can get 2 gallons of water every minute, since there is 1 gallon of water flowing in each hose, agree?

Now, if you decide to link up the 2 hoses into one longer hose, you can then only get 1 gallon of water every minute, correct? But does this change the maximum flow, which is 2 gal/min?

In fact, you can turn the water off, and there's no water flowing through. Does that make the bandwidth 0 gal/min, as if the hoses are not hoses anymore?
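The hose analogy in numbers, as a trivial sketch (1 gal/min per hose, as stated in the post): capacity is a property of the plumbing, while flow is just how you happen to be using it at the moment.

```python
HOSE_GPM = 1.0  # gal/min per hose, from the analogy

side_by_side = 2 * HOSE_GPM        # both hoses filling the bucket: 2.0 gal/min
chained = min(HOSE_GPM, HOSE_GPM)  # joined end to end: only 1.0 gal/min arrives
water_off = 0.0                    # current flow is zero...
capacity = 2 * HOSE_GPM            # ...but the plumbing can still carry 2.0 gal/min

print(side_by_side, chained, water_off, capacity)
```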
     
    #95 taisui, Sep 13, 2013
    Last edited by a moderator: Sep 13, 2013
  16. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680

    200 big macs in a car travelling 35 mph?
     
  17. upnorthsox

    Veteran

    Joined:
    May 7, 2008
    Messages:
    2,106
    Likes Received:
    380
Ok, so out of all of this we can confirm that the PS4 actually has a max bandwidth of 352GB/s. Good, let's move along then. I'm hungry.
     
  18. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
Not sure if you are really not getting it, or just being sarcastic, which I see no reason for.
Sure, if you are saying that the PS4 can do simultaneous R/W to the GDDR5, that'll make it 352GB/s.
     
  19. DaveNagy

    Newcomer

    Joined:
    Jan 18, 2013
    Messages:
    51
    Likes Received:
    0
    Of course the hardware doesn't care. But the person for which the work is being accomplished might very well care. (Some kid playing CoD who values fast frame rates.) A benchmark might "care". If being able to move data over two separate pathways, both at full speed, allows software to work better or run faster than otherwise, then that ability is meaningful and should thus be "counted".

And it has nothing to do with "manipulation", necessarily. It's about moving stuff. We're thinking about whether the ability to move two (or more) things at the same time provides a performance advantage as compared to being able to move just one of those things in that same time. Doesn't that sound useful to you? Would that be a meaningful advantage to have? If yes, then we should be able to add those two moves together as part of a total peak/potential/theoretical BW spec.

    On the other hand, moving one chunk of stuff from one place to another place does not rate any such adding-together. One thing has been accomplished, and therefore we count that work as having been done once. No separate extra "value" was added as it left one place, or entered another. Something was moved, once. BW specifies the rate at which it moved. There's no reason to add anything to anything in that case. 'Cause there's only one thing.

It seems simple to me: If the ability to use multiple pathways at the same time is both possible, and could conceivably increase performance, we should add the BW of those pathways together. If only one pathway is ever usable at once, you can't. In a particular use-case (e.g. ESRAM-to-mem copy) where you are not using multiple pathways simultaneously, you won't be using all of that total system bandwidth, although that bandwidth obviously still exists and can be listed on a spec sheet.
     
  20. DaveNagy

    Newcomer

    Joined:
    Jan 18, 2013
    Messages:
    51
    Likes Received:
    0
    I say that you should add'm all up together. 68 + 109 + 80 = 257?

Those three streams of data are all different, are all in motion simultaneously, and all three could arguably be useful as soon as they get to where they are headed. (Losing any of those pathways would probably hurt overall performance.)

    Am I right? What do I win? :wink:
     