How to understand the 560 GB/s and 336 GB/s memory pools of Series X *spawn*

Discussion in 'Console Industry' started by Metal_Spirit, Apr 10, 2020.

Thread Status:
Not open for further replies.
  1. Proelite

    Veteran Subscriber

    Joined:
    Jul 3, 2006
    Messages:
    1,620
    Likes Received:
    1,107
    Location:
    Redmond
    I've divided Killzone Shadow Fall's memory usage into two groups: one requiring fast access and one requiring slow access.

    Total: 3429MB

    Total: 1307MB

    So let's say that a console game usually requires around 25-33% of its memory to be in the fast pool.

    For next gen, that could mean a minimum of 8GB HBM + 32GB DDR5 (4GB for the OS).

    @3dilettante, any adjustments you think are needed to my division? Does a 1 to 2-3 fast-to-slow memory split ratio make sense?
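
    The split above can be sketched with a quick calculation. This assumes the 1307 MB group is the fast-access one, since that is the only reading consistent with the 25-33% and 1:2-3 figures in the post:

    ```python
    # Quick check of the fast/slow split described above.
    # Assumption: the 1307 MB group is the one needing fast access,
    # which matches the 25-33% and 1:2-3 fast-to-slow figures.
    fast_mb = 1307
    slow_mb = 3429
    total_mb = fast_mb + slow_mb

    fast_share = fast_mb / total_mb
    print(f"total: {total_mb} MB, fast share: {fast_share:.1%}")  # ~27.6%
    ```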
     
    #21 Proelite, Apr 10, 2020
    Last edited: Apr 10, 2020
  2. Metal_Spirit

    Regular

    Joined:
    Jan 3, 2007
    Messages:
    632
    Likes Received:
    397
    Not as I see it. You have two pools, 6 GB and 10 GB, and accessing one uses data channels that, during that time, cannot be used to access the other.
    So if you are accessing the slow memory, you must use at least six 16-bit channels to reach the full pool, reducing bandwidth on the 10 GB pool by 168 GB/s.
    So you will have 392 GB/s on the fast pool and 168 GB/s on the slow one, even if you access the slow memory just to "check the time of day".
    Of course, the average bandwidth will depend on the number of accesses, but regardless of the average, you cannot rely on more than 392 GB/s if you want your performance to be steady at all times.
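
    The arithmetic in this model can be sketched out. A toy sketch only: it assumes the Series X's 320-bit bus is made of 20 × 16-bit GDDR6 channels sharing 560 GB/s, and takes the post's premise that a slow-pool access ties up six of those channels:

    ```python
    # Sketch of the bandwidth-split model described above.
    # Series X: 320-bit bus = 20 x 16-bit channels, 560 GB/s total.
    TOTAL_BW = 560                       # GB/s
    CHANNELS = 20                        # 16-bit channels on a 320-bit bus
    per_channel = TOTAL_BW / CHANNELS    # 28 GB/s each

    # Premise of the post: a slow-pool access occupies at least
    # six of the 16-bit channels.
    slow_channels = 6
    slow_bw = slow_channels * per_channel   # 168 GB/s
    fast_bw = TOTAL_BW - slow_bw            # 392 GB/s left for the fast pool
    print(per_channel, slow_bw, fast_bw)    # 28.0 168.0 392.0
    ```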
     
  3. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,762
    Likes Received:
    2,639
    Location:
    Maastricht, The Netherlands
    So why not use the default setup, then? Why bother with split pools at all?
     
  4. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,516
    Likes Received:
    24,424
    No.
     
    function, tinokun and PSman1700 like this.
  5. Metal_Spirit

    Regular

    Joined:
    Jan 3, 2007
    Messages:
    632
    Likes Received:
    397
    Why not?
    I was assuming the slower RAM's bandwidth was fully used by the CPU, so I was talking about the GPU only.
     
  6. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,516
    Likes Received:
    24,424
    Developers are going to optimize all memory access patterns to get as close to 560 GB/s as possible when required.
     
    tinokun likes this.
  7. Proelite

    Veteran Subscriber

    Joined:
    Jul 3, 2006
    Messages:
    1,620
    Likes Received:
    1,107
    Location:
    Redmond
    I guess you can't rely on developers not to be complete bozos.
     
    PSman1700 likes this.
  8. Metal_Spirit

    Regular

    Joined:
    Jan 3, 2007
    Messages:
    632
    Likes Received:
    397
    I know that... but if you have a drop due to accesses to the 6 GB, and you were counting on the extra bandwidth, your performance will suffer.
     
  9. Metal_Spirit

    Regular

    Joined:
    Jan 3, 2007
    Messages:
    632
    Likes Received:
    397
    The bandwidth cut is something that cannot be avoided.
    You can minimize it as much as possible, but not remove it from the equation.
    Even less so if the CPU is working in this memory.
    I don't see any bozos, but I don't see any Houdinis either.
     
  10. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,516
    Likes Received:
    24,424
    "Doctor, it hurts when I do this."
    "Don't do that".

    Likewise, if games on PS5 were to spend 90 GB/s of bandwidth on audio and 22 GB/s on decompression, they would only have 336 GB/s of memory bandwidth left for the GPU and CPU.
    Developers won't do that.
     
    Silenti, function, VitaminB6 and 5 others like this.
  11. MrFox

    MrFox Deludedly Fantastic
    Legend

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    Because, like any distributed storage system, the access rate would become the average across the entire available memory space. In this case it would be 6 GB at 336 GB/s and 10 GB at 560 GB/s... so that would be a 476 GB/s average. That's the primary reason nobody makes such imbalanced buses.

    If they had the BOM for 20 GB they would have a unified system. As it is, they are working to mitigate the negative impact of having unequal capacities. They need to split it to prevent the excess requests to some chips from lowering the average bandwidth. The best known solution is to isolate the excess into a rarely accessed upper region: the 6 GB partition. It's a much better average than going random-access on a single partition, and it's worth the trouble of making that split.

    You get 560 GB/s only if the address space is spread evenly. So if they want more capacity, they either make all chips bigger equally or, as in this case, they make the excess of the few bigger chips a rarely accessed region.
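
    The 476 GB/s figure above is just a capacity-weighted average, as a quick sketch shows (hypothetical case: accesses spread evenly over all 16 GB of a 6 GB @ 336 GB/s + 10 GB @ 560 GB/s layout):

    ```python
    # Capacity-weighted average bandwidth for the hypothetical case
    # above: accesses spread evenly across one 6 GB partition at
    # 336 GB/s and one 10 GB partition at 560 GB/s.
    pools = [(6, 336), (10, 560)]    # (capacity GB, bandwidth GB/s)
    total_gb = sum(cap for cap, _ in pools)
    avg_bw = sum(cap * bw for cap, bw in pools) / total_gb
    print(avg_bw)    # 476.0
    ```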
     
    blakjedi, function, milk and 5 others like this.
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Some of the big bandwidth consumers at that time would have been the render targets. I am not sure about excluding the video memory portion that contains non-streaming textures and the streaming texture pool. Textures might be more spiky in consumption or dependent on filtering behavior, but they're being accessed by the highly parallel CUs and their vector memory pipelines. The non-streaming textures sound like those textures that were too in-demand to risk waiting on disk reads, and streaming textures might have slow stream-in to the pool, after which some could be very heavily hit by GPU reads.
    However, a good amount of the motivation for those pools' sizes would be the HDD bottleneck that the next gen is changing significantly, and things like sampler feedback may also make it possible to reduce the overall size. That said, with the performance of the GPUs and the ability to use resources in less time, whatever fraction of that category remains would be the more intensely used data.

    I don't think that's necessary. From the point of view of the DRAM, the granularity can be as fine as one 16-bit channel for 16 bus clocks before picking a different address range.
    There may be other constraints on the parallelism of the system, but accesses like cache line fills can be satisfied in 2-4 bursts from a single channel. Why reserve any more channels than are needed for a given transaction? A CPU's 64 byte line likely uses one channel for a few bursts, or at most two if they're in lockstep. 6 channels would pull in 192 bytes, which doesn't map to any basic action done by the system.
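
    The transaction sizes in that paragraph can be sketched out, assuming GDDR6's 16-bit channels and a burst length of 16 transfers:

    ```python
    # Transaction-size arithmetic for the paragraph above, assuming
    # GDDR6: 16-bit channels, burst length of 16 transfers.
    CHANNEL_BITS = 16
    BURST_LENGTH = 16
    bytes_per_burst = CHANNEL_BITS * BURST_LENGTH // 8   # 32 bytes

    # A 64-byte CPU cache line fits in 2 bursts on a single channel.
    cache_line_bytes = 64
    bursts_per_line = cache_line_bytes // bytes_per_burst   # 2

    # Reserving 6 channels for one burst each would pull in 192 bytes,
    # which maps to no basic transaction done by the system.
    print(bytes_per_burst, bursts_per_line, 6 * bytes_per_burst)   # 32 2 192
    ```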

    That aside, I don't understand the idea that if data comes from "slow" memory that it means there's some innate "slow" quality to the data that hurts overall performance. It's just bandwidth. If the GPU or CPU are pulling over 100 GB/s from a portion of regular memory, it's because the game needed that much data. The developer would have dictated that.
    If that slow functionality needs more bandwidth, why couldn't the developer map the allocation to the address range where it could get more?

    If the problem is that a given function needs 168 GB/s, while a different function needs 560 GB/s at the exact same time, that's just life. You'd still have a problem in a 20GB system if there are two pieces of functionality whose total demand exceeds the physical capabilities of the memory bus. There's only a difference if the developer somehow cannot fit a high-use target in the 10 GB range, but I haven't seen an argument that games are in imminent danger of having 10GB of render targets or the like.

    There's a separate debate to be had about contention between client types and the difficulty in getting max utilization due to DRAM sensitivity to access patterns. However, that sort of challenge is more about how accesses map to ranges with patterns in the KB or MB range, not 10GB or more.


    2.5 GB of the 6 GB normal space is OS reserve, and the platform's design should be trying to minimize its bandwidth impact in favor of what the game needs, unless the OS or some application takes over the foreground.
    I think it's risky to correlate the capacity of different spaces to bandwidth utilization. It also creates the image of a random or purposefully even distribution of accesses to all addresses in the system, which isn't the case.
    Some of the most dominant bandwidth consumers are streaming compute shaders or the ROPs writing and overwriting a finite set of render targets. That's a fraction of the overall memory capacity, but those ranges get hit many more times, and this is very predictable.
    The trade-off here is that Microsoft thinks developers that can provide lists of buffer capacity and frame-time budgets can figure out if something has significant stretches where it pulls close to or more than 24 bytes a cycle from DRAM. If so, the developer would be expected to move that memory target into an address range that can supply 40 bytes per cycle.
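
    The 24- and 40-bytes-per-cycle figures can be checked with a quick sketch, assuming 14 Gbps GDDR6 pins and 32-bit chip interfaces (one "cycle" here meaning one data transfer across the bus):

    ```python
    # Where the 24- and 40-bytes-per-cycle figures come from,
    # assuming 14 Gbps GDDR6 pins and 32-bit chip interfaces.
    PIN_RATE_GBPS = 14
    slow_bits = 6 * 32     # 6 chips x 32 bits = 192-bit slow-pool width
    fast_bits = 10 * 32    # 10 chips x 32 bits = 320-bit full-bus width

    print(slow_bits // 8, fast_bits // 8)    # 24 40 bytes per transfer
    # Multiplying back out recovers the headline bandwidth figures:
    print(slow_bits // 8 * PIN_RATE_GBPS,    # 336 GB/s
          fast_bits // 8 * PIN_RATE_GBPS)    # 560 GB/s
    ```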

    If a client pulls 336 GB/s from the slow pool, and something else needs 560 GB/s but just gets the unused channels, that's very high utilization. Another combination is that the 560GB/s load gets everything and the 336 GB/s load gets nothing, still high utilization of the bus. The real problem is that the game is written to need 896 GB/s, and having balanced capacity chips wouldn't help with that.

    If by "rarely accessed" we mean something that can be accessed at up to 336 GB/s.

    This happens if up to 10 GB of GPU data can fit in the GPU-optimized range. If it's not GPU-related, nobody will know. If there's a heavy GPU target in slow memory, is it because there's 10 GB of high-demand GPU assets already in the optimal range, or did the dev misplace the render target?
     
  13. MrFox

    MrFox Deludedly Fantastic
    Legend

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    The unused channels during an access to the 6 GB portion would deplete their queues. Any GPU algorithm would statistically spread requests equally across all channels to get an equivalent 560 GB/s. So having some requests resolved (on the channels that are still free) while the others stall serving the 6 GB portion would result in a proportional stall.

    If the additional requests going to the 336 GB/s portion don't have an equivalent number of requests going to the remaining channels, wouldn't that still imbalance the queues? OTOH, if the data is spread unevenly to make the unused channels useful in that time frame, they are then not balanced equally when the 6 GB portion is not in use, and cannot do 560 GB/s in total; it would instead starve the opposite queues.
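
    A toy model of this starvation effect: if GPU data is striped across all ten chips, then while the six chips holding the slow pool serve slow traffic, the striped traffic stalls with them. Under that assumption, the fraction of bus time spent on the slow pool bounds the sustained total (a sketch, not a claim about the actual memory controller):

    ```python
    # Toy model of the queue-starvation effect described above.
    # Assumption: fast-pool data is striped across all 10 chips, so
    # during the fraction of time the 6 slow-pool chips serve slow
    # traffic, striped traffic cannot complete and the bus delivers
    # only the slow pool's 336 GB/s.
    def effective_bw(p, fast_bw=560, slow_bw=336):
        """p: fraction of bus time spent serving the slow pool."""
        return p * slow_bw + (1 - p) * fast_bw

    for p in (0.0, 0.1, 0.25):
        print(f"p={p:.2f}: <= {effective_bw(p):.1f} GB/s")
    ```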
     
    DSoup and BRiT like this.
  14. Vhatt

    Joined:
    Mar 19, 2020
    Messages:
    8
    Likes Received:
    5
    Please correct me if I'm wrong, but GDDR6 chips are dual-channel (ported?) and allow for simultaneous data transfers (just like the ESRAM in the Xbox One?). The GPU only gets 10 GB at the full 560 GB/s when both channels are used, and the CPU can only access one channel of the larger chips, with the other given over to the GPU.
    But it makes me wonder: do all the operations of the GPU require the full 560 GB/s at all times? Wouldn't some operations require less than this, and if so, wouldn't we have a situation where one channel moves data out while the other moves fresh data in? Would that allow for greater efficiency? If so, wouldn't MS just create a virtually addressable memory space where the GPU gets 560 GB/s, the CPU gets 280 GB/s, and the lower-bandwidth tasks get the other 280 GB/s? I would think after that it comes down to the dev to order their memory usage to maximize the system?
     
    blakjedi likes this.
  15. dobwal

    Legend

    Joined:
    Oct 26, 2005
    Messages:
    5,955
    Likes Received:
    2,325
    Aren't you accessing some combo of 32-bit interfaces? Just because you divide the memory in half doesn't mean you have to divide the interface in half too.

    At least from what I've read, the CPU and GPU can't access RAM simultaneously.
     
    #35 dobwal, Apr 11, 2020
    Last edited by a moderator: Apr 11, 2020
    blakjedi likes this.
  16. MrFox

    MrFox Deludedly Fantastic
    Legend

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    Each channel can only access half of the data; this is not dual-ported RAM. It's indistinguishable from having two separate chips, each with its own part of the data.
     
    blakjedi and BRiT like this.
  17. Proelite

    Veteran Subscriber

    Joined:
    Jul 3, 2006
    Messages:
    1,620
    Likes Received:
    1,107
    Location:
    Redmond
    What I remember from 2013 is that texture data isn't bandwidth-intensive. Otherwise, Xbox One textures would have been half the fidelity of PS4 textures, and that didn't seem to be the case last gen.

    https://www.eurogamer.net/articles/digitalfoundry-the-complete-xbox-one-interview
     
    PSman1700 and BRiT like this.
  18. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
    Uhh. You quoted the wrong person :)
     
  19. dobwal

    Legend

    Joined:
    Oct 26, 2005
    Messages:
    5,955
    Likes Received:
    2,325
    I just realized that. I had to change my language to be a bit more diplomatic. LOL
     
  20. dobwal

    Legend

    Joined:
    Oct 26, 2005
    Messages:
    5,955
    Likes Received:
    2,325
    So you are advocating carving out a partition for the GPU's VRAM in both the fast portion and the slow portion of RAM?
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.