Understanding XB1's internal memory bandwidth *spawn

Discussion in 'Console Technology' started by zupallinere, Sep 11, 2013.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It was probably copied for a reason. So if step 1 in an algorithm is "Copy this" then the move to step 2 is progress.

    It's pretty straightforward to produce code that performs some kind of data transformation without data reduction, reorientation, or amplification.
    Within reason, additional work can be inserted into the copy scenario with no difference in bandwidth consumption from the point of view of the memory system.
    A copy is a degenerate case of the above.

    The work is that the transistors switched, wires flipped, and there are now memory cells holding values different from what they held before.
     
  2. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    Huh?

    But an ESRAM->DDR3 copy IS using multiple pathways simultaneously.
     
  3. bkilian

    Veteran

    Joined:
    Apr 22, 2006
    Messages:
    1,539
    Likes Received:
    3
    How about looking at it in terms of money? Let's assume everyone pays $1 per megabyte for their internet usage (uplink and downlink).
    In the scenario above, you would pay $20/second for your download. The owner of the server would _also_ pay $20/second for their uplink. Total money paid, $40/second. Right? In the eyes of the ISP, the bandwidth used is 40MB/s. In _every_ case where data is being moved from one location to another, it has to be both read, and written, so the bandwidth used is _always_ twice the data rate of the data being moved.
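    bkilian's accounting reduces to one line of arithmetic. A toy sketch (Python; the $1/MB and 20 MB/s figures are the illustrative numbers from the post, not real tariffs):

    ```python
    # Toy model of the point above: moving data always costs bandwidth at
    # both ends (a read and a write), so the traffic the "ISP" sees is
    # twice the rate at which useful data actually moves.

    def consumed_bandwidth(data_rate_mbps: float) -> float:
        """One copy = one read + one write, each at the data rate."""
        return 2 * data_rate_mbps

    # Downloading at 20 MB/s: 20 MB/s down for you, 20 MB/s up for the
    # server, so the ISP bills 40 MB/s in total.
    print(consumed_bandwidth(20))  # -> 40
    ```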
     
  4. DaveNagy

    Newcomer

    Joined:
    Jan 18, 2013
    Messages:
    51
    Likes Received:
    0
    Yes, the maximum flow of the new, reconfigured system is now just 1Gps. The old system, which presumably incorporated twice as many spigots, was capable of double the flow. (This assumes that the maximum household flow rate is at least 2x greater than the max flow rate of any single spigot.)

    Is that right? How does this apply to the XB1?

    Hmm, does your analogy involve both hoses hooked to the same faucet? In which case, the difference between the two configurations would be minor. The 2x longer hose would achieve slightly less flow, due to increased friction and back pressure, but both configs are probably spigot-limited, which might swamp that effect. If we're talking about idealized hoses and water, I guess both configs would flow the same, both limited by the spigot flow rate. But... you said the hoses were the limiters of flow, so I don't know what to think...

    The system bandwidth should still be the same, even when the system is turned off. But two short hoses certainly have more bandwidth than one long one, regardless of utilization.

    Again, I'm not sure how this relates to the argument we're having. Heck, I'm not sure what the argument is anymore! Could you please summarize the two opposing sides? I suspect that they aren't what I thought they were. For that, I apologize.

    I think I'll just leave you guys to it. You were clearly doing fine without me. :wink:
     
  5. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    The confusion is between data rate and bandwidth.

    chart:
    DDR3->(68GBps)->GPU->(68GBps)->ESRAM

    Some think that because the data transfer rate is 68GBps, the consumed bandwidth is therefore 68GBps.

    But actually the BW consumed would be 136GBps, since there are 2 interfaces involved, each utilizing 68GBps.

    As for how this relates to the XO, in a perfect world,
    the peak theoretical BW will be 68 + 109 + 109 = 286GBps.

    Here's a hint: consuming less BW for the same kind of work in the same number of cycles is actually better.
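    taisui's accounting can be written out as a quick sketch (Python; the 68/109/109 GB/s peaks are the figures quoted in the thread, treated here as illustrative):

    ```python
    # Peak bandwidth is the sum of what every interface can carry at once.
    DDR3_PEAK = 68     # GB/s, main memory bus
    ESRAM_READ = 109   # GB/s, eSRAM read port
    ESRAM_WRITE = 109  # GB/s, eSRAM write port

    peak_total = DDR3_PEAK + ESRAM_READ + ESRAM_WRITE
    print(peak_total)  # -> 286

    # A DDR3 -> GPU -> eSRAM transfer at 68 GB/s occupies two interfaces,
    # so it consumes 136 GB/s of that pool while moving only 68 GB/s of
    # actual data.
    consumed = 2 * 68
    print(consumed)  # -> 136
    ```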
     
  6. DaveNagy

    Newcomer

    Joined:
    Jan 18, 2013
    Messages:
    51
    Likes Received:
    0
    Still here. :sad:

    Sure, I agree with that, assuming both parties are using that one ISP.

    So, let's see, how does this relate to the XB1.... Moving data from (say) the ESRAM, to the system RAM at a max rate of 68 GB/s would "consume" 136 GB/s of the system's total BW "pool". You are in effect using all of the system RAM's bandwidth, and a goodly chunk of the ESRAM's bandwidth as well. It couldn't be used for anything else, after all.

    So, I'd tend to think of that as a somewhat "diabolical" use case: You're sucking up a substantial portion of the system's potential BW, but achieving comparatively little moving-shit-around. Almost a worst case. (It sounds like a totally valid and perhaps typical use case though.)

    If that was the best anyone could manage on this hardware, it might be correct to characterize the system's memory bandwidth as only being 68 GB/s. Certain forum warriors might have used this calculus in the past. No one here, of course.

    But if I understand correctly, plopping a GPU down in the middle somewhere enables higher data transfer rates. In ideal cases you are transferring as much as you are "consuming". You gave an example of where the GPU was both reading and writing to the ESRAM, while simultaneously reading or writing to system RAM. (Not both!) If everything is balanced just so, you are not only consuming all of the bandwidth available, (~250GB/s) but actually transferring data at that rate as well. Woot!

    As such, it would be correct to characterize the system's bandwidth as something like 250GB/s.

    Is that close?

    Is that what taisui was saying too? I guess it probably was. I apologize. I thought he was trying to double stuff to "pad" the XB1's specs. Quite the opposite, I guess.
     
  7. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    You got it. The max BW is always the same, and in general the more data you can move across all paths simultaneously (data rate), the more meaningful performance you'll get out of the system.

    (hence if you have 2 garden hoses, don't link them up into a single one :wink:, if this is making any sense...)
     
    #107 taisui, Sep 13, 2013
    Last edited by a moderator: Sep 13, 2013
  8. Strange

    Veteran

    Joined:
    May 16, 2007
    Messages:
    1,698
    Likes Received:
    428
    Location:
    Somewhere out there
    No, the 109 patrons each is more like McCafe.
     
  9. bkilian

    Veteran

    Joined:
    Apr 22, 2006
    Messages:
    1,539
    Likes Received:
    3
    Yes, exactly. The total bandwidth available on the X1 is 280-odd GB/s. If you find yourself wasting it by copying memory around, you're doing something wrong. This applies equally whether you have one memory pool or two.
     
  10. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
    It's only 280GB/s if you're doing some very specific things. I thought the generally accepted obtainable bandwidth of the eSRAM was 133GB/s with alpha blending? Wouldn't that make the peak 201GB/s? Obviously that's not the absolute peak, but the theoretical peak of the interface is probably never going to be reached with the eSRAM, it seems, except in some very, very narrow scenarios.
     
  11. taisui

    Regular

    Joined:
    Aug 29, 2013
    Messages:
    674
    Likes Received:
    0
    Theoretical BW is peak BW.
    You are mixing it up with practical utilization.

    This is not particular to the eSRAM; this is the case for all memory, single pool or not.
     
  12. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
    Fair enough, I'm just not sure how much of that 'theoretical bandwidth' the eSRAM is going to get in a practical scenario. I feel like it's a bit weird to compare the two theoretically when one will get nowhere near its peak.
     
  13. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680
    Isn't peak bandwidth utilization rare on any platform?
     
  14. Brad Grenz

    Brad Grenz Philosopher & Poet
    Veteran

    Joined:
    Mar 3, 2005
    Messages:
    2,531
    Likes Received:
    2
    Location:
    Oregon
    Sure, but I think beta is asking what are the limiting factors for achieving the 204 figure rather than the 109 figure. The only "real world" example we've heard using the simultaneous read/write only gets you two-thirds of the way there. Is that a problem with the technique needed to exploit the extra bandwidth, or are you limited by the throughput of the client devices? If, for example, there is no theoretical way to read and write more than 150GBps of data to the ESRAM based on external factors, then it doesn't matter if the ESRAM itself is technically capable of more. The extra 54GBps is always waste in that scenario.
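    Brad's scenario can be put in numbers (the 204 GB/s peak and 150 GB/s client cap are the figures from his post; the cap itself is his hypothetical, not a measured limit):

    ```python
    # If external client devices can only source/sink 150 GB/s in total,
    # the eSRAM's 204 GB/s combined peak leaves 54 GB/s that can never
    # be used, no matter how clever the technique.
    ESRAM_PEAK = 204    # GB/s, combined read+write peak
    CLIENT_LIMIT = 150  # GB/s, hypothetical cap from external factors

    achievable = min(ESRAM_PEAK, CLIENT_LIMIT)
    stranded = ESRAM_PEAK - achievable
    print(achievable, stranded)  # -> 150 54
    ```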
     
  15. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,680
    As far as the math goes, the ROPs should be able to saturate the write bandwidth to the ESRAM, in theory. I wouldn't expect the read to be more limited than the write. Why the typical average case would fall below the sum of the two, I can't answer. Without knowing how the ESRAM is used, or whether the DMEs play any role, it might be a tough question to answer.
     
  16. rokkerkory

    Regular

    Joined:
    Sep 3, 2013
    Messages:
    371
    Likes Received:
    112
    You might be right and maybe that's why the data move engines are there? (Not sure)
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    That was one of my criticisms of the DF article that leaked it. It's a real-world example, but no context was given as to where it stood in the continuum of workloads that one would probably find running. Is it an example of a good utilization case, a mediocre one, or a bad one? The secret sources, via a non-technical writer, did not say.
    What is the optimum mix?
    What exactly were they measuring?
    Are there cache and buffering effects that needed to be corrected for?

    All we get is a number and a host of ambiguities and questions.

    Technically capable in this case means being theoretically capable.


    DMEs perform operations so that you don't waste a CU or two on it. A few specialized functions also reside on a few of them.
    The bandwidth of the DMEs is inferior to the shader array, but the vector units have better things to do.
     
  18. Billy Idol

    Legend

    Joined:
    Mar 17, 2009
    Messages:
    6,067
    Likes Received:
    907
    Location:
    Europe
    Lol!

    Oh, and thanks to bkilian for his explanations.
     
    #118 Billy Idol, Sep 13, 2013
    Last edited by a moderator: Sep 13, 2013
  19. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    Bandwidth applies in different ways under different workloads. There's no singular metric to understand the flow of data within a system. Peak BW is peak BW. It doesn't tell us utilisation though. Same as peak flops. But as it's the only metric one can realistically put out there for devs to understand the system, it's the one used. That doesn't mean you gain access to full BW at all times, but nor does it mean the peak BW can be discounted as meaningless. The situation is only muddied a little because of people making comparisons. PS4's situation is far simpler on paper, and people are trying to compare the different setups with different numbers that mean different things.

    The old water analogy - PS4 has a reservoir holding 8 billion litres and a pipe providing ~170 billion litres a second. It can supply water to the Communal Purification Unit and Global Purification Units for washing at that rate. These will consume some water and send it back to the reservoir.

    XB1 has a reservoir holding 8 billion litres and a pipe providing ~60 billion litres a second. It also has a tank that holds 32 million litres with a pipe providing ~109/206/133 billion litres a second. The CPU and GPU can be fed a total of ~270 billion litres of water a second, which is recycled back to the tank and reservoir (pipes are dual ported ;)).

    That's one example. You can then explore what happens when tanks empty and need to be filled, which is the whole memory system, and latency, which is how long it takes to turn taps on and get water flowing. There are other analogies that describe utilisation under other circumstances. They can each be right even though they differ, and as such there's no singular understanding of the data BW within a system. It's only engineers and PR people who care - PR people can pick whichever representation is best for their system when marketing. "Our system has the highest data throughput," versus, "our system has the fastest minimum throughput and you'll never get less than this higher average," sort of thing.

    Recognising that XB1 does have RAM bandwidths that combine, maybe now we can get back to understanding how the eSRAM gets different rates on the eSRAM bus? :D
     
  20. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    That was fun! So now that I think I understand how this all works, would it be fair to summarise it in layman's terms as follows (and in relation to a single pool of GDDR5, which is ultimately what we're all trying to understand):
    • The XB1 has 272 GB/s of theoretically usable bandwidth.
    • Compared with a single pool of GDDR5, this has 2 disadvantages in real-world utilisation:
      1. It's made up of 2 separate memory pools, so any direct copy of data between those pools without any sort of transformation of that data by the GPU will waste bandwidth performing an operation that wouldn't have been required in a single memory pool system.
      2. Even with no data copy between memory pools, the maximum useful bandwidth of the esram can only be achieved if you perfectly balance read and write operations at all times. With GDDR5, since the full bandwidth is available at all times in any ratio between read and write, that's not a concern.
    • So the bottom line is that the GPU in the XB1 will usually only be able to consume a smaller percentage of its total available bandwidth than the GPU of a system that uses a single unified pool of GDDR5 (or DDR, for that matter).
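    Point 2 of that summary can be sketched as a toy model (Python; the 109 GB/s per eSRAM direction is the thread's figure, while the 176 GB/s unified-pool peak is an illustrative assumption for comparison):

    ```python
    # Toy model: a dual-ported eSRAM only hits its combined peak when
    # reads and writes are balanced, because each direction is capped by
    # its own port. A unified pool serves any read/write mix up to one
    # shared peak. All figures in GB/s.

    def esram_throughput(read_gbps, write_gbps, port_peak=109.0):
        """Each direction is capped independently by its port."""
        return min(read_gbps, port_peak) + min(write_gbps, port_peak)

    def unified_throughput(read_gbps, write_gbps, peak=176.0):
        """One bus: reads and writes share a single peak in any ratio."""
        return min(read_gbps + write_gbps, peak)

    print(esram_throughput(109.0, 109.0))  # balanced -> 218.0
    print(esram_throughput(176.0, 0.0))    # read-only, capped -> 109.0
    print(unified_throughput(176.0, 0.0))  # unified serves it all -> 176.0
    ```

    The asymmetry in the last two lines is the whole point: demand that a unified pool absorbs in full gets clipped to one port's peak on the split design.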
     