Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. MrFox

    MrFox Deludedly Fantastic
    Legend Veteran

    Joined:
    Jan 7, 2012
    Messages:
    6,421
    Likes Received:
    5,827
    Yes, that's the point: it needs twice or more of everything upstream unless devs use the CPU to decompress some Oodle format. Games would need to be stored uncompressed on the SSD, which divides the effective throughput by the compression ratio, and RDO or BCPack multiplies that penalty even more. I don't see devs wanting that just because the DMA saves CPU cores (which scale better on PC anyway) and adds some small efficiency improvement from the cache scrubbers needing to flush the caches less often. They would just require more memory instead and stream over a larger time window.

    The question was whether cache scrubbers are useful in the reality of PC game design and hardware.
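    To put rough numbers on that (a back-of-the-envelope sketch; the 5.5 GB/s raw figure is Sony's public spec, the compression ratio is illustrative):

    ```cpp
    #include <cstdio>

    int main() {
        const double raw_gbps = 5.5;  // PS5 raw SSD bandwidth (public spec)
        const double ratio    = 1.6;  // illustrative Kraken-style compression ratio

        // With a hardware decompressor, the effective stream rate is raw * ratio.
        std::printf("effective with decompressor: %.1f GB/s\n", raw_gbps * ratio);

        // Without one, storing assets uncompressed keeps only the raw rate; matching
        // the console means every upstream link must carry ratio-times the data.
        std::printf("raw bandwidth needed to match: %.1f GB/s\n", raw_gbps * ratio);
        std::printf("install size penalty: %.0f%% larger\n", (ratio - 1.0) * 100.0);
    }
    ```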
     
    #2641 MrFox, May 27, 2020
    Last edited: May 27, 2020
  2. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,498
    Likes Received:
    775
    Usually high-end PC parts don't go very narrow with extremely high clocks; it's the contrary.
     
  3. disco_

    Newcomer

    Joined:
    Jan 4, 2020
    Messages:
    215
    Likes Received:
    170
    They don't need to. Consoles are on the opposite end of the spectrum when it comes to margins and chip binning.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,942
    Location:
    Well within 3d
    It seems to be a performance measure for the GPU. Data being read in will be written to its destination regardless of what state the GPU's execution is in.
    How the notification process works, and what sort of synchronization operations are needed with the scrubbers aren't clear.
    Sony may be trying to reduce the time and cache thrashing related to the global stall and cache invalidates, meaning the events are cheaper but are still used in the same fashion as the normal invalidates.
    If there's some kind of non-standard way of synchronizing with that hardware, maybe some workloads can use a custom localized barrier that might allow some parts of the GPU to be excluded from the stall--but that may be a more extensive change than what has been described.

    It hasn't been mentioned by MS, and the PS5 presentation indicated scrubbers were something customized for Sony, and that AMD did not find them compelling enough to include in its own IP despite their being available.
    This seems to be an optimization for one category of methods that is probably uncommon now and might be a subset of many other implementations. Optimizations like these may be nice to have, but might not find sufficient use or benefit in a broader market.
    The PS4's volatile flag may have helped make GPU compute more capable of working alongside graphics, but the concept didn't catch on anywhere else and nobody's indicated that the other GPUs suffered significantly for the lack of it.
    The PS4 had a form of triangle sieve that might have been a forerunner to the culling-focused primitive shaders in Vega, so the idea might make sense. However, the PS4's implementation in particular has only really been mentioned in the pre-launch articles in 2013, and I don't recall it being mentioned since.
    The PS4 Pro's ID buffer and checkerboard optimizations have had an unclear amount of adoption. Many of the leading engines found something other than checkerboard relatively quickly.
    There may be other areas that the XSX has emphasized, like sampler feedback customizations or other tweaks that might provide different benefits.

    The PC space has a wider range of hardware and has to worry about a broader legacy base that might not have the IO capability that would justify them. If there are PS5-specific ways of utilizing SSD data by shaders or GPU hardware that interface with the scrubbers in a non-standard way, that may make them less likely to be used.
    Discrete products have a PCIe bus to transfer over, and until there's more unified memory those explicit transfers may be heavyweight enough to exceed the savings from scrubbing.
    APUs might be better-positioned due to the single memory pool, but then we'd need one with more of a performance focus.

    Perhaps this is an optimization with a certain class of workloads in mind, such as virtual texturing like in the later Trials games? A virtual texturing cache is a range of memory addresses that may be updated by data from different disk locations or different assets based on how the GPU/renderer chooses to update it. Couple that with some of the ideas about how the latest Unreal demo may be virtualizing its geometry, and there could be objects, or subsets of them, at different levels of detail being read in or switched out of a limited working set.

    Assigning specific ranges within the virtual asset caches may see benefit from the scrubbers, since they could invalidate a given allocation without thrashing other in-progress objects and allow a new object to take it over. However, that may require a level of interaction between the scrubbers and shaders that might not match reality, more fine-grained synchronization than the hardware actually offers, and an unclear level of optimism with regard to SSD latency. A rough mental model of the contrast is sketched below.
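    Purely as a mental model (the scrubber interface isn't public, so every function here is hypothetical and the behaviour is inferred from the high-level description in the Road to PS5 talk):

    ```cpp
    #include <cstdint>
    #include <cstdio>

    // Hypothetical model only: Sony has not published a scrubber API.
    // Conventional path: after a DMA write into GPU-visible memory, correctness
    // is ensured with a global stall and a full cache invalidate, which also
    // evicts unrelated lines that other in-flight work was still using.
    void conventional_update(uint64_t dst, uint64_t bytes) {
        std::printf("DMA %llu bytes to %#llx\n",
                    (unsigned long long)bytes, (unsigned long long)dst);
        std::printf("  -> global stall + full L2 invalidate (thrashes everything)\n");
    }

    // Scrubbed path, as described at a high level in the talk: the coherency
    // engines forward the overwritten address range, and the scrubbers evict
    // only the matching lines, leaving other cached data intact.
    void scrubbed_update(uint64_t dst, uint64_t bytes) {
        std::printf("DMA %llu bytes to %#llx\n",
                    (unsigned long long)bytes, (unsigned long long)dst);
        std::printf("  -> scrub only [%#llx, %#llx) in the GPU caches\n",
                    (unsigned long long)dst, (unsigned long long)(dst + bytes));
    }

    int main() {
        // e.g. replacing one 64 KiB virtual-texture page in a resident cache
        conventional_update(0x20000000ULL, 64 * 1024);
        scrubbed_update(0x20000000ULL, 64 * 1024);
    }
    ```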
     
    tinokun, disco_, BRiT and 6 others like this.
  5. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    475
    Likes Received:
    196
    Cache invalidate seems weird to me. On PC, normal RAM has pretty much always been used as a cache; it was there before dedicated graphics RAM, even. It's also much faster in terms of latency than an SSD and has all the bandwidth you'd need, so if AMD can't see the benefit even there... well, maybe it's just another weird Cerny fixation, like a dedicated audio shader module or the ID Buffer thing. I mean, so a new virtual texture block streams in a frame or so earlier; even if you make the stall smaller, it's still probably not worth a stall.
     
    PSman1700, Dictator and pjbliverpool like this.
  6. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,793
    Likes Received:
    1,077
    Location:
    Guess...
    Yes, this was/is one of the sources of my original confusion. Since main RAM in a PC performs a similar function to the SSDs in the consoles (albeit smaller and faster), it suggests PC GPUs may also benefit from this. Unless the GPU sees system RAM, as opposed to VRAM, as the "last level cache", in reference to MrFox's post above.

     
    PSman1700 likes this.
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,942
    Location:
    Well within 3d
    The big limiters would be capacity and the initial-load latencies running into little-improved HDD performance and unusually modest RAM capacity improvements for the consoles this time around.
    Some devs are indicating that they have ambitions for scenes where the fidelity of some objects makes them impractical to load in their totality, or scenes whose total amount of detail is too large to hold in memory with traditional methods. Exactly how crazy they can get with asset size, given that the SSDs aren't massive, is a question yet to be answered.
    There have always been barriers and flushes at certain boundaries, like between frames and at context switches on some platforms. With glacial IO, the driver/OS could update resource status at those points, inside already high-overhead operations.

    A more responsive IO system doesn't need to waste as much RAM capacity on data that may not be used, but more frequent changes can take the synchronization and flush operations out of the shadow of the larger barriers they were hidden by.
    One additional thought I forgot until now is that Cerny mentioned the scrubbers invalidated targeted ranges in various caches, which may mean some other on-chip caches or buffers might get notifications as well, and the pitfalls for those may fall outside of the L2-focused examples I'm aware of.

    It's difficult to say how much of an impact the various tweaks have had, but it does seem like they haven't been game-changing.
    I'm still not sure how aggressive Sony thinks it's going to be with gathering data from the SSD, and how consistent its latency is expected to be. If the latency could be made reliably intra-frame, or even within specific phases of a frame's time budget, that could be at least notable. If the stall cost and penalties caused by wiping the cache for all other workloads can be significantly reduced, that might make some algorithms more practical.
    AMD could have multiple reasons for not using the tech in this way. It could be that the win is modest, or at least too modest for the broader set of constraints of AMD's market.
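    As a rough sense of scale for "reliably intra-frame" (all of these numbers are illustrative, not measured figures):

    ```cpp
    #include <cstdio>

    int main() {
        const double frame_ms = 1000.0 / 60.0;  // 16.7 ms budget at 60 fps
        const double phase_ms = frame_ms / 4.0; // e.g. one render phase's slice
        const double read_ms  = 1.0;            // hypothetical end-to-end SSD read

        std::printf("frame: %.1f ms, phase: %.1f ms, read: %.1f ms\n",
                    frame_ms, phase_ms, read_ms);
        std::printf("intra-frame reads %s viable at this latency\n",
                    read_ms < phase_ms ? "look" : "don't look");
    }
    ```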
     
    tinokun, BRiT, VitaminB6 and 2 others like this.
  8. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,025
    Likes Received:
    5,562
    To be fair, it's not @John Norum who's making it up, it's techpowerup on their placeholder PS5 specs page.

    His mistake is believing the data is based on anything other than whatever the TPU editor took out of his ass at the time, for both the PS5 and the SeriesX.
    80 ROPs on the SeriesX, 1750MHz base GPU clocks on the PS5, TDP for either chip, L2 cache sizes and many other datapoints are completely made up. They're only "cold numbers" in the sense of how dead useless they are.
    The SeriesX page even says there's only 10GB of GDDR6.
     
    Silenti, BRiT, Jay and 6 others like this.
  9. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,793
    Likes Received:
    1,077
    Location:
    Guess...
    Don't the consoles decompress everything off the SSD before it goes into main memory? Therefore the space utilised in memory (system or VRAM) would be the same as is used in the consoles' unified RAM, wouldn't it?

    So your relative issues in a PC environment with using uncompressed data would be SSD capacity and bandwidth from the disk to the rest of the system. Looking at those in turn:

    Bandwidth to the rest of the system - not a massive concern at the high end (PCIe 4.0), as there's ~7.5 GB/s available, which already exceeds one console's effective (post-decompression) throughput and comes reasonably close to the other's. PCIe 5.0, which may be due as early as 2021, would more than alleviate any remaining concerns. People with slower drives just have to suck it up and lower texture quality, or accept longer loading screens if they have the RAM capacity to pre-load more data. (Rough numbers at the end of this post.)

    SSD Capacity - This is an interesting one which I've been thinking about over the past few days. The obvious solution is just to buy more disk space, which is always an option with a PC, but it's not very elegant or efficient. A better option would be to have on-drive decompression. I know some drives do this already, but I'm not sure how much benefit it brings in those implementations. However, I'm talking about something standardised, perhaps even tied into DirectStorage, e.g. a DirectStorage compatible drive would feature a hardware decompression module similar to that found in the XSX. It wouldn't help with bandwidth, as it'd be on the wrong side of the PCIe bus, but it would completely alleviate the relative capacity deficit. It would also be a pretty awesome selling point for the SSD vendor, who could sell a 1TB drive but claim "an effective capacity of 2TB" using the onboard hardware decompression.

    Game vendors could then package 2 different distributables of their games, one with the DirectStorage compression applied (zlib + BCPACK) and one without. The one without simply takes up more space for people without a DirectStorage certified drive but the application would not see a difference since the data would be identical once it passes over the PCIe bus.

    The beauty of the decompression module is that it wouldn't have to apply to only DirectStorage certified games. You could presumably software compress (or even hardware if the drive was capable) any data you wish on that drive into the correct formats for on the fly decompression by any application that requires it. The application would presumably never need to know.
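    Checking the bandwidth and capacity claims with public figures (usable PCIe throughput and the 2:1 capacity claim are approximations):

    ```cpp
    #include <cstdio>

    int main() {
        const double pcie4_x4 = 7.5; // ~usable GB/s over PCIe 4.0 x4
        const double xsx_eff  = 4.8; // XSX typical post-decompression throughput
        const double ps5_eff  = 8.5; // midpoint of Sony's 8-9 GB/s typical figure

        std::printf("vs XSX effective: %.0f%% (exceeds it)\n",
                    100.0 * pcie4_x4 / xsx_eff);
        std::printf("vs PS5 effective: %.0f%% (reasonably close)\n",
                    100.0 * pcie4_x4 / ps5_eff);

        // An on-drive decompressor can't raise bus throughput, but it does
        // restore the capacity side of the trade:
        const double ratio = 2.0;    // vendor-style "1TB sold as 2TB effective"
        std::printf("1 TB at %.0f:1 -> %.0f TB effective\n", ratio, ratio);
    }
    ```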
     
    Tkumpathenurpahl, BRiT and PSman1700 like this.
  10. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,025
    Likes Received:
    5,562
    Both the consoles and the PC need to decompress the data.


    You still seem convinced publishers/developers can just push uncompressed game distributions for the PC...
    They can't... textures, geometry, shadowmaps, etc. still need to be compressed or they might not even fit a regular-sized SSD at all. Just google any game + "compression" and you'll find references to compression in error reports.
    It's also been explained that DXTC compressed textures (which GPUs work on) have a relatively low compression ratio, so when placed in storage the textures get further compressed over their DXTC form.
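    To put rough sizes on that (block-compressed GPU formats are fixed-rate; the extra storage-compression ratio here is illustrative):

    ```cpp
    #include <cstdio>

    int main() {
        const int    dim     = 4096;
        const double rgba8   = dim * (double)dim * 4;  // raw 32-bit texels
        const double bc7     = dim * (double)dim * 1;  // BC7: 1 byte/texel, fixed 4:1
        const double on_disk = bc7 / 1.5;              // illustrative extra ~1.5:1

        std::printf("raw RGBA8: %.0f MiB\n", rgba8 / (1024 * 1024));
        std::printf("BC7:       %.0f MiB (fixed 4:1)\n", bc7 / (1024 * 1024));
        std::printf("on disk:   %.1f MiB after storage compression\n",
                    on_disk / (1024 * 1024));
    }
    ```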

    Also, what are you calling a "DirectStorage certified drive"? An M.2 SSD with an ASIC dedicated to hardware decompression?
    If so, does that make sense? How would e.g. a RAID array work on such hardware?
    IMO it'd make more sense to develop an x16 PCIe card that houses one or more M.2 SSDs and has hardware decompression.


    Yes, some things in there are pretty wrong even for people without an NDA.

    If the SeriesX can maintain its substantially wider GPU at 1825MHz 100% of the time, why would the PS5, with a narrower GPU, need to go all the way down to 1750MHz?
    There are no limitations on what the GPU can address, and there's plenty of GPU data that can go to the slower pool without repercussions, so the 10GB for the GPU is just wrong.


    Why are you excusing the made-up data on techpowerup with a single fact about INT throughput (which may not even be exclusive to the SeriesX), when the concern you expressed is in the clock comparisons which are, in fact, based on completely made up numbers?
     
    DSoup likes this.
  11. BRiT

    BRiT Verified (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    15,473
    Likes Received:
    13,972
    Location:
    Cleveland
    There is no place on B3D for incorrect specifications.
    There is no place on the B3D forums for versus discussion.
    These will always be purged.
     
  12. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    12,440
    Likes Received:
    7,691
    Location:
    London, UK
    And is there a memory access mechanism in PS5 for the GPU to skip cache referral?

    Yes, and on console you can be sure that devs will primarily target the best compression, which for many will be the hardware-accelerated option, which, above all else, is one single standard.

    The RAID issue is a good one. With data striped across different drives, any compression may be entirely negated. For RAID support alone, the decompression needs to be done off-drive, further up the hardware I/O hierarchy.
     
  13. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,498
    Likes Received:
    775
    This has always been the case, and never been a real problem.
     
  14. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,025
    Likes Received:
    5,562
    Are you suggesting BCPack and Kraken won't be widely adopted, only zlib?
     
  15. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    4,645
    Likes Received:
    3,547
    Location:
    Barcelona Spain


    COD teams were using Kraken. And some people use it on PS4 because it decompresses faster than the hardware zlib DSP.
     
  16. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,793
    Likes Received:
    1,077
    Location:
    Guess...
    As I've said before, there is a difference between system-level IO compression/decompression of everything coming off the disk, and selective compression/decompression of specific data sets. The consoles do the former; I'm suggesting that PCs can do the latter, and therefore the decompression requirements would not have to be as high as the "5 Zen 2 cores" quoted for the PS5. It's a trade-off between disk space and CPU requirements that the consoles don't have to make, so they can simply compress everything.

    You get on average 45-64% additional compression with Kraken over the "uncompressed" data feed (which would already contain GPU native compressed textures) according to Sony's own figures.

    Are you suggesting that, in order not to overburden every system with fewer than 16 CPU cores, developers would not accept an install footprint 45-64% larger on the PC? Maybe they wouldn't, but that doesn't seem like the impossible scenario that you're painting it to be. (A worked example follows.)
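    Applying Sony's own 45-64% range to a hypothetical install size:

    ```cpp
    #include <cstdio>
    #include <initializer_list>

    int main() {
        const double compressed_gb = 100.0; // hypothetical Kraken-compressed install
        for (double gain : {0.45, 0.64})    // Sony's quoted range for Kraken
            std::printf("at %.0f%% gain: %.0f GB uncompressed on PC\n",
                        gain * 100.0, compressed_gb * (1.0 + gain));
    }
    ```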

    And that's before you consider that some data types could likely still be compressed and sent via the CPU for decompression because they represent a much lighter load than decompressing textures for example. So that potentially gains you some of that space back for minimal CPU burden.

    I'm suggesting that is one possible route that could be taken. Actually, I find it unlikely that Microsoft would stipulate a dedicated hardware decompression block in order to gain DirectStorage compliance; if drives need DirectStorage compliance at all, I think it will be more focused on their DMA capabilities, which is likely to be more important to get out into the market. That said, they could encourage the use of a hardware decompression block through some sort of "DirectStorage Ultra/Premium" type certification which does include such a block. Or SSD makers could simply do this off their own bat, given that it could be a good selling point in terms of capacity, but then it would be less likely to be tied into how games are packaged.

    The RAID point is a good one, but given that these drives are likely to be coming close to saturating the available PCIe bandwidth already, is there much benefit to RAID? Perhaps that's one technology they'd simply have to forgo in favour of the compression. Since these would be gaming-targeted drives rather than server drives, and already very high speed, I'm not seeing that as a huge loss.

    Such a card would still be limited by an x4 PCIe interface into the CPU on non-server chips, unless you were going to use it instead of a GPU, which wouldn't make much sense in a gaming system.

    Incidentally, a decompression block on the SSD would also allow for increased bandwidth use when PCIe 5.0 arrives, even if the new controllers are unable to saturate that bandwidth natively.
     
  17. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    12,440
    Likes Received:
    7,691
    Location:
    London, UK
    Nope, it's easy for devs to support one standard (Kraken on PS5, BCPack on XSX) but what compression will devs use to package their PC games? Or is the expectation that the hardware will support them all? And new ones in the future?

     
    egoless, TheAlSpark and disco_ like this.
  18. BRiT

    BRiT Verified (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    15,473
    Likes Received:
    13,972
    Location:
    Cleveland
    Well the hardware does support it all, using software CPU solutions. ;)
     
    milk and PSman1700 like this.
  19. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,793
    Likes Received:
    1,077
    Location:
    Guess...
    That's where the DirectStorage certification could potentially come in. Microsoft could stipulate that DirectStorage compatible games must have a zlib/BCPACK compressed distributable, and lead the way with UWP games. I'm not suggesting this is overly likely, but it's one potential solution to the problem.
     
  20. mrcorbo

    mrcorbo Foo Fighter
    Veteran

    Joined:
    Dec 8, 2004
    Messages:
    3,915
    Likes Received:
    2,617
    Compression/Decompression on the peripheral side of the PCIe bus makes little sense to me. You want that on the system side as "close" (shortest distance/fastest bus) to the system memory as possible and/or on the GPU as "close" to the video memory as possible.
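    A toy illustration of that asymmetry (the bus figure is approximate and the compression ratio illustrative):

    ```cpp
    #include <cstdio>

    int main() {
        const double pcie_gbps = 7.5; // ~usable GB/s over PCIe 4.0 x4
        const double ratio     = 2.0; // illustrative compression ratio

        // Decompress on the drive: uncompressed data crosses the bus.
        std::printf("drive-side:  %.1f GB/s of assets delivered\n", pcie_gbps);

        // Decompress on the system/GPU side: compressed data crosses the bus,
        // so the same link effectively delivers ratio-times as much asset data.
        std::printf("system-side: %.1f GB/s of assets delivered\n",
                    pcie_gbps * ratio);
    }
    ```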
     
    tinokun and DSoup like this.