Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,959
    Location:
    Well within 3d
    For there to be additional bandwidth overhead due to GPU and CPU memory traffic, the CPU's access patterns don't need to be random, just different from the GPU's. If the CPU isn't hitting the same arrays or happens to be writing something when the GPU is content with mostly reads, there would be some additional cycles lost.
    The idea that CPU bandwidth consumption should be minimized if the GPU is bandwidth constrained was the point of the PS4 slide back when it was first released, although it doesn't seem like the zero bandwidth case is all that practical.
    As far as "prefetch, work in cache, write" goes, I don't know how broadly I should interpret your wording. While it is preferred for a working set to fit in cache, CPUs don't have full control over whether the cache hierarchy writes to memory, since it's not a local store. There are hardware prefetchers and software prefetch, but there are practical limits to how far ahead they can go for most workloads before bandwidth consumption on unnecessary reads becomes counterproductive, or before the cache starts evicting parts of it. Zen 2 has decently sized caches (not clear what the capacity is for the consoles), but high performance cores will quickly exhaust what they can hold in many cases.

    This is outside of cases where the CPUs or DMA controllers are expected to move data into system memory, which would have overhead.

    There is peer to peer DMA functionality, and there were somewhat recent Linux changes mentioning it for Zen. Perhaps if the drive works with that it could avoid a trip to main memory.
     
  2. Metal_Spirit

    Regular Newcomer

    Joined:
    Jan 3, 2007
    Messages:
    558
    Likes Received:
    341
    If you work with a 1-second offscreen margin on each side, that is enough to feed 9 GB of new data into memory, in other words, to replace 56.25% of your total memory content.
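    As a quick back-of-envelope check (taking the 16 GB of unified memory as a given and the 9 GB delivered per second as the assumption):
    Code:
        # Fraction of memory replaced per second under the assumptions above.
        total_memory_gb = 16.0
        new_data_gb = 9.0
        print(f"{100 * new_data_gb / total_memory_gb:.2f}% of memory replaced")  # 56.25%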
     
    KeanuReeves likes this.
  3. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,695
    Likes Received:
    171
    Location:
    In the land of the drop bears
    When I said memory spaces, I meant that you would need to specifically request that memory be striped in a way that gives the speed and access you want. Because Microsoft has only mentioned that there is a fast space and a slow space, it makes me think the parallel access iroboto mentioned is not something they did.
     
    BRiT likes this.
  4. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,837
    Likes Received:
    10,890
    Location:
    The North
    Or it might be, by the same logic, if somehow 10GB was striped for bandwidth and 6GB for generic access.

    I’m actually quite curious. I’ve learned a lot the last couple of days.
     
    blakjedi and BRiT like this.
  5. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    They may, but will they target XBSX as a baseline?

    Pretty close to RAM random-read speed on a PC.
    Where you cannot control where things are allocated or placed in RAM anyway.
    So you can think of it like a PC with 16GB of VRAM and 100+GB of ~DDR3-speed RAM.
    Still doesn't ring a bell? :)
     
    egoless, chris1515 and megre like this.
  6. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    294
    Likes Received:
    187
    The base clock is guaranteed for processors out in the market in an optimal operating environment (with some degree of tolerance); otherwise it would not be called base.

    [edit: These parameters are published years ahead of platform launch, so that partners can design against them and QA/binning has a reference model. That's also why we don't see ridiculous news of a big-brand, 125W-capable AM4 cooler causing your Ryzen to melt: our world of semiconductor manufacturing is built upon deterministic laws of chemistry and physics.]

    If we put aside the assumption of optimal operating environment, well... almost all modern processors throttle down to the minimum frequency (800 MHz for many AMD CPU designs), until it gets worse to the point where max junction temp is breached and auto shutdown kicks in.
     
    #1686 pTmdfx, Apr 1, 2020
    Last edited: Apr 1, 2020
  7. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    But that's not how the CPU should work in a game, though...
    In the end, what is rendered is the final result, therefore the CPU needs to work on the same buffers as the GPU.
    Maybe I'm just not interested in theoretical scenarios, only practical ones.

    Which was just a warning for the developers. We don't know if the PS4 ever landed on the right-hand side of that graph in any game. Do we? :)

    Yep, but you can predict things. And profile them.

    That still brings us back to the main question: what are the practical, typical loads?

    Yup. So the "base clock" is still a prediction, albeit a more conservative one. You could make other, even more conservative predictions about a hypothetical "100% load" scenario.
    Or we can safely assume that for some particularly bad loads we can go as low as possible, and the solution is simply not to use that load configuration.
     
  8. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,837
    Likes Received:
    10,890
    Location:
    The North
    What?
    Why? It's doing way more work per clock cycle. That's the only reason it's slowing down so much.
     
  9. dobwal

    Legend Veteran

    Joined:
    Oct 26, 2005
    Messages:
    5,435
    Likes Received:
    1,497
    AMD's base clocks are the lowest frequency their GPUs will run at in the presence of a power virus.
     
  10. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    294
    Likes Received:
    187
    Well... if you play this card, everything in the industry is merely a “conservative prediction”. Conditionality and error tolerance are not equivalent to unpredictability and indeterminism.

    Specifications (and the design & validation processes around them) exist to define constraints that, if satisfied, enable the chip to attain repeatable optimal performance as designed over the expected lifespan.
     
    #1690 pTmdfx, Apr 2, 2020
    Last edited: Apr 2, 2020
    PSman1700, iroboto and VitaminB6 like this.
  11. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    The map screen in HZD is also doing a lot of work per cycle. Is that really needed?
    Or even: should we optimize our TDP and cooling for that particular case? Why not?

    Agree.

    The point that the "naysayers" articulate is that somehow MSFT's claims about a "fixed clock" are much more honest/valuable than Sony's claims about "fixed power".
    I see them both as pretty optimistic targets, with no real difference.
     
    egoless and Mitchings like this.
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,959
    Location:
    Well within 3d
    Reading data into RAM should be more linear than other, more scattered workloads that might be considered random.
    GDDR6 DRAM pages are 2KB, and various forms of NAND have similar minimum page sizes. If striping across 16 or 20 GDDR6 channels, each channel would need to fill 2KB, so 32KB or 40KB in that scenario. That would be 8 or 10 4KB memory pages, and it might get convoluted to try fiddling with bytes to mess with alignment. It would take a stream of 64 linear writes to populate.
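    As a minimal sketch of that striping arithmetic (the 2KB open-page size is the figure above; the channel counts are the two configurations being discussed):
    Code:
        # Stripe width needed to touch every channel's open 2 KB DRAM page once.
        DRAM_PAGE = 2 * 1024  # bytes per open page, per GDDR6 channel
        for channels in (16, 20):
            stripe = channels * DRAM_PAGE
            print(f"{channels} channels -> {stripe // 1024} KB stripe, "
                  f"i.e. {stripe // 4096} 4KB OS pages")
        # 16 channels -> 32 KB stripe, i.e. 8 4KB OS pages
        # 20 channels -> 40 KB stripe, i.e. 10 4KB OS pages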

    For CPU write updates there would be barriers preventing them from working on the same addresses at the same time. The GPU's memory pipeline isn't coupled tightly enough to start reading through the buffer until the CPU signaled it was done. GPU write updates would be round trips to memory, either by not caching or flushing cache lines since the GPU caches cannot be snooped. Although in that case it's better than trying uncached reads by the CPU, which at least for older APUs were massive performance hits.
    At that point, one or the other should have moved on to other places in memory, and the horizon for aligned access within DRAM is on the order of 2KB pages, or 8-16KB if accepting some penalties within bank groups.



    I haven't seen a breakdown for PS4 games on that metric. I'm not sure if disclosure would be allowed. The point would be to encourage them to reduce bandwidth consumption, but given how games on PC continue to show the influence of DRAM speed and bandwidth even though the GPU has separate memory, I don't expect that they could make the CPU portion negligible.

    There's a pretty spotty record on that. AMD discourages software prefetch in most instances because there's a limit to how far those predictions hold, and the lower-overhead hardware prefetchers tend to win.
    However, there has been profiling on the amount of bandwidth increase due to prefetch traffic, and decreasing accuracy the further the prefetch runs ahead.

    Going by how PC games have shown measurable benefits with Zen based on memory speed, my conservative guess would be to initially try a safe footprint of 30-40 GB/s if working with a structure similar to games that cross platforms. Granted, that is a mixture of latency and bandwidth due to how closely the Infinity Fabric is linked to memory speed.
    On the other hand, the PS4 had <20GB/s for its Jaguar cores, and a coherent Onion bus with 10 GB/s read/write. A ~4x improvement in CPU capability allows extrapolating to 3-4x that for the new platform.
    I think the Jaguar module may have had a 16 byte link to the northbridge, likely running half core speed or at the GPU speed since those tended to line up. Two modules might have doubled that.
    Zen has twice the width and the fabric is running at 2x or more the speed, so that fits too.
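    A crude version of that extrapolation, where the PS4-era figure is the one quoted above and the 3-4x scaling is the hand-wavy assumption rather than a measured number:
    Code:
        ps4_coherent_bw_gbs = 10.0      # Onion bus read/write figure
        scale_low, scale_high = 3, 4    # assumed Jaguar -> Zen 2 platform gain
        print(f"~{ps4_coherent_bw_gbs * scale_low:.0f}-"
              f"{ps4_coherent_bw_gbs * scale_high:.0f} GB/s first-pass CPU footprint")
        # ~30-40 GB/s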
     
  13. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,837
    Likes Received:
    10,890
    Location:
    The North
    I mean, that's not the same thing I'm referring to. That's just an unlocked frame rate, where higher frequencies mean the chip can simply deliver more.

    If you're doing AVX2 workloads or lots of parallel processing together that is causing a downclock, that's something else entirely.
     
  14. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Dunno. The PC has a different set of trade-offs. And I don't think that streaming RAM->VRAM on PC has no impact on the GPU-accessible bandwidth.
    Not to mention that the CPU cannot write that much into RAM per frame anyway; it's too slow.
    So I would suspect that on PC, most of the time, the GPU and CPU work on similar data sets in a "read heavy" manner.

    0.5-1GB per frame? What for?
    Pathfinding for 1000 agents on real geometry? :)
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,959
    Location:
    Well within 3d
    There's minimal performance gain with PCIe 3.0 vs 4.0, so the footprint for streaming should be below 15 GB/s in most situations. Board RAM is on the order of 4-8 GB for many cards, and games are written to be paranoid about streaming, just as they are on the consoles. The PC memory is not a guaranteed amount, PCIe transfer utilization isn't consistent, and we've seen performance suffer if swapping starts to happen.
    Utilization of PCIe transactions favors larger payloads, which translates into more linear accesses in main memory if DMA hasn't bypassed it.
    If we are concerned about the impact of CPU data movement to the graphics domain, PCIe 3.0 or 4.0 x16 would appear to give 15-30 GB/s to be concerned about.
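    For reference, that bracket is essentially raw x16 link bandwidth in one direction, before packet and protocol overhead trims it a little further (a minimal sketch of the arithmetic):
    Code:
        # Raw PCIe x16 bandwidth, one direction, 128b/130b encoding (3.0 and 4.0).
        def pcie_x16_gbs(gt_per_lane):
            return gt_per_lane * (128 / 130) * 16 / 8   # GT/s per lane -> GB/s across 16 lanes

        print(f"PCIe 3.0 x16: ~{pcie_x16_gbs(8):.1f} GB/s")    # ~15.8
        print(f"PCIe 4.0 x16: ~{pcie_x16_gbs(16):.1f} GB/s")   # ~31.5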

    The CCXs with the fabric they have on the desktop could generate 115 GB/s of read traffic and the same amount for writes at the ~1.8 GHz ceiling of the fabric. Although the fabric may restrict things further, such as in the case of only allowing 16 bytes/cycle for writes to the already modest memory controllers.
    We don't know yet if the consoles keep to those limits or what may change with 16-20 channels of GDDR6. The peak values of the CCX are no longer hidden behind the limits of the memory interface.
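    A minimal sketch of where the 115 GB/s figure lands, using the desktop Zen 2 values mentioned above (32 B/cycle read port per CCX, the 16 B/cycle write caveat, ~1.8 GHz fabric clock):
    Code:
        FCLK_GHZ = 1.8          # approximate fabric clock ceiling
        ccx_count = 2
        print(f"reads:  {ccx_count * 32 * FCLK_GHZ:.1f} GB/s")   # 115.2
        print(f"writes: {ccx_count * 16 * FCLK_GHZ:.1f} GB/s")   #  57.6 if write ports are halved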


    For the purposes of DRAM channel utilization, my concern isn't whether the workloads are similar, it's whether the data being accessed is in the exact same few dozen kilobytes during a given memory controller's time window.

    Whatever they want; I'm just giving a conservative amount with hefty safety margins for the lifetime of a platform that hasn't launched yet. Perhaps whatever high-utilization vector code Cerny might have been concerned about. Even when such workloads utilize the local caches well, some can still demand above-average amounts of bandwidth.
    I'm also making allowances for system operations or functions that can produce bursts of high bandwidth for fractions of a second that may be on a critical path for dependent processes.
     
  16. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    3d, this quote from Andrew Goossen makes me think the Infinity Fabric connection from each CCD is still in that 100GB/s or so range, as it has been...
    "GPU optimal and standard offer identical performance for CPU audio and file IO. The only hardware component that sees a difference is the GPU."
     
    RagnarokFF and blakjedi like this.
  17. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    I also find it funny that Sony, while admitting that the PS5 is unable to run at max clocks with 100% ALU utilization (which is what it would take to hit the max TFLOP figure: 2.223GHz x 36 CUs x 128 FP ops/clk), takes the liberty of rounding that ~10.25 result up to 10.3. MS, on the other hand, simply discards 0.15 TFLOPS of actual compute and rounds down to 12 in their marketing.
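    For reference, the arithmetic behind both headline numbers (64 FP32 ALUs per CU, 2 ops per clock via FMA; clocks and CU counts are the publicly stated ones):
    Code:
        def tflops(clock_ghz, cus):
            return clock_ghz * cus * 64 * 2 / 1000   # GFLOPS -> TFLOPS

        print(f"PS5: {tflops(2.223, 36):.2f} TFLOPS")   # ~10.24 (Sony's stated 2.23 GHz gives ~10.28), marketed as 10.3
        print(f"XSX: {tflops(1.825, 52):.2f} TFLOPS")   # ~12.15, marketed as 12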
     
    davew, blakjedi, milk and 1 other person like this.
  18. Proelite

    Veteran Regular Subscriber

    Joined:
    Jul 3, 2006
    Messages:
    1,459
    Likes Received:
    818
    Location:
    Redmond
    It's because, when the yields are known and they bump the clocks up to 12.85TF, they can just put 13TF instead.

    Still April 1st.
     
    #1698 Proelite, Apr 2, 2020
    Last edited: Apr 2, 2020
    blakjedi, disco_, Silenti and 2 others like this.
  19. Globalisateur

    Globalisateur Globby
    Veteran Regular Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    3,498
    Likes Received:
    2,191
    Location:
    France
    That's not best case. That's the typical speed. Best case is 22GB/s (and we know speeds of 20GB/s are already reached, depending on the data being compressed, according to an actual dev).

    MS's 4.8GB/s is their best case, like the 6GB/s figure they announced (for textures I think, but it still hasn't been benchmarked on their machine, and how it will impact the CPU is rather unclear). MS still hasn't divulged the typical cases, and I doubt they will anyway.
     
    egoless and KeanuReeves like this.
  20. Metal_Spirit

    Regular Newcomer

    Joined:
    Jan 3, 2007
    Messages:
    558
    Likes Received:
    341
    Actually Nvidia states 1 to 2 GB. Consoles have 16. I would say it's a fair amount. RT on Xbox has to be done in the first 10 GB, so it's up to 20% memory usage just for the RT structure. Wouldn't it be nice if you could dump this, even if only in part, to the SSD?
    I would say so!

    As for the SSD... not really. A faster SSD alone will not free you from all constraints. A 10 times faster SSD on the PS4 would only bring you 2x gains, and although the proportions may change, this reality is common to all systems. To really take advantage of an SSD you need to get rid of a lot of other restrictions. That is what the PS5 did. Xbox has changes too, but as far as is public, not to the same extent.
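    One way to read that 10x-to-2x claim is as plain Amdahl's law, where only the raw I/O portion of a loading pass gets faster; the 55% I/O fraction below is an illustrative assumption, not a measured PS4 number:
    Code:
        def speedup(io_fraction, io_gain):
            return 1.0 / ((1.0 - io_fraction) + io_fraction / io_gain)

        print(f"{speedup(0.55, 10):.2f}x overall from a 10x faster drive")   # ~1.98x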

    We were talking about generic SSD gains. But we all know very well that tweet was a joke about the possible gains an SSD could bring to the PS5 over the X.
    In fact, this is the console forum!
    And in that case, you cannot dissociate the SSD from the fact that it works in conjunction with those changes. That's not something available in the PC space for a comparison to be made.

    Also, I would not say all SSDs would be enough to break any sort of I/O limit for some time. At least not in comparable ways. For instance, both consoles use dedicated compression and several optimizations on I/O.
    Yet Microsoft games will support the Xbox One. Can they really use these changes for anything meaningful in game concept and design?
    And when the current-gen consoles are left behind? PCs cannot reach those levels of data compression without sacrificing CPU performance. They do not have dedicated decompressors and other custom changes.
    And PCs are now part of the Xbox platform.
    Heck, most PCs have no SSD, and most of the ones that do have 120 to 256 GB, half used by Windows 10 and other installs.
    I see complaints in the Call of Duty community about their latest games: with all the patches the game is now over 100 GB in size, and people simply do not have enough SSD space.
    Besides, most of them cannot even reach a 1 GB/s transfer speed.
    So how will that work? Can we really compare those SSDs and the gains they can bring to performance and game design?
     
    egoless likes this.