Current Generation Hardware Speculation with a Technical Spin [post launch 2021] [XBSX, PS5]

Discussion in 'Console Technology' started by pjbliverpool, Feb 9, 2021.

  1. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,986
    Likes Received:
    15,717
    Location:
    The North
    [​IMG]
    If you look at where the Geometry Engine sits and follow the old fixed-function (FF) pipeline, you can see how many stages a triangle passes through before the primitive assembler/rasterizer performs back-face and frustum culling.
    Discarding triangles way up front gives a cumulative benefit down the line, because none of the later stages have to work on triangles that never needed it. The more triangles are removed from view, the bigger that benefit gets. This assumes developers are following a basic flow, of course, which is probably inaccurate but useful for a discussion of timing.

    My current understanding is:
    a) Even if you use the older pipeline, RDNA 2 is still biased towards triangle discard: it can discard a lot more triangles than it can rasterize. This is something we know from AMD presentations, and something both Cerny and Matt H. spoke to. But this would be non-compute-based culling.
    b) If developers decide to use primitive shaders to handle the culling up front, then you get that cumulative effect down the chain. I don't think a driver can necessarily know what needs to be culled, though I could be wrong, but it can do everything else in terms of compiling the front-end shaders into primitive shaders.
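
    A rough sketch of the "cumulative benefit" argument, in Python. The stage names and per-triangle costs are made-up numbers chosen only to show the shape of the saving, and the model ignores that positions must be computed before a triangle can be tested at all. It is not a description of how RDNA 2 or either console schedules work.

```python
# Toy model: culling late (at the rasterizer) vs. culling up front.
# All stage names and costs below are hypothetical.

STAGE_COST = {            # made-up cost units per triangle
    "vertex_shader": 4,
    "tessellation": 6,
    "geometry_shader": 3,
    "primitive_assembly": 1,
}
CULL_TEST_COST = 1        # made-up cost of one back-face/frustum test


def late_cull_cost(total_tris: int) -> int:
    """Every triangle runs all stages; the cull test happens at the end."""
    per_tri = sum(STAGE_COST.values()) + CULL_TEST_COST
    return total_tris * per_tri


def early_cull_cost(total_tris: int, visible_fraction: float) -> int:
    """Every triangle pays for the cull test; only survivors run the stages."""
    survivors = round(total_tris * visible_fraction)
    return total_tris * CULL_TEST_COST + survivors * sum(STAGE_COST.values())


def savings(total_tris: int, visible_fraction: float) -> float:
    """Fraction of work saved by culling up front instead of at the end."""
    late = late_cull_cost(total_tris)
    early = early_cull_cost(total_tris, visible_fraction)
    return 1.0 - early / late
```

    With these invented costs, the saving grows as the visible fraction shrinks and drops to zero when nothing is culled, which is the point being made above.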
     
    PSman1700, Pete, BRiT and 1 other person like this.
  2. cwjs

    Newcomer

    Joined:
    Nov 17, 2020
    Messages:
    164
    Likes Received:
    342
    Yup, that makes sense with what i read in the thread you linked, ty for the extra info.

    I do think 'a' is probably less significant in these cases than the potential of 'b'. Unless I'm really underestimating what better triangle discard does, dips of the size we're seeing, if they were attributable to culling, would have to come from developers not taking advantage of DX12U features for some reason while easily taking advantage of the equivalent PS5 features. Honestly, I wouldn't put that past MS at this point, but it doesn't seem super likely.
     
  3. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,986
    Likes Received:
    15,717
    Location:
    The North
    My only rebuttal here is that, yes, DX12 is a very hard animal to tame on PC. But on console it should be significantly more straightforward, and fewer of those issues should exist.
    In Hitman, a zoom-in is basically hardcore frustum culling. Lots of small geometry, moving grass, etc. tends to clog things up without a good way to manage it.
     
    cwjs likes this.
  4. goonergaz

    Veteran

    Joined:
    Jun 3, 2005
    Messages:
    4,186
    Likes Received:
    1,502
    We do have the Road to PS5. I just rewatched the part (from 18-20 mins, if you're interested) where Cerny talks about the I/O unit... I really don't understand why people are being dismissive about it. They have implemented several bespoke 'systems/functions' that seem to help stream data from the SSD and minimise things that would impact GPU performance (coherency engines/cache scrubbers), and it apparently happens automatically without any required dev knowledge.

    Anyway, hopefully as you say we will one day find out a bit more.
     
  5. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    14,870
    Likes Received:
    10,985
    Location:
    London, UK
    I don't know if anybody is dismissing it (again.. ignore list) but we're not seeing many games leverage PS5's supercharged I/O system yet. Astrobot and Spider Man Miles Morales do and both are impressive for their lack of load times. Booting into Miles Morales, a dense open world super hero game in around six seconds is still difficult to believe.

    As for the cache scrubbers: anything that improves use of cache is going to benefit performance, but whatever the cache scrubbers bring to the game is impossible to quantify and is going to vary from game to game, much like cache usage itself.

    Cache scrubbers could be massively helping or only marginally. ¯\_(ツ)_/¯
     
  6. Globalisateur

    Globalisateur Globby
    Veteran Regular Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    4,107
    Likes Received:
    3,027
    Location:
    France
    Demon's Souls is probably the best use of the custom I/O, as they do what Cerny described in the presentation: after the initial load, they gradually load the data they need just before it is needed. This is why Demon's Souls always has very dense scenes.

    And Nioh could be the first third party game using it.
     
    DSoup likes this.
  7. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    14,870
    Likes Received:
    10,985
    Location:
    London, UK
    I don't have Demon's Souls, but in terms of getting an AAA game off the drive and running from cold, Miles Morales is hard to beat. Sure, it's a different type of game, but the game is loading New York. In six seconds!!! :runaway:
     
    #27 DSoup, Feb 10, 2021
    Last edited: Feb 10, 2021
    Globalisateur and BRiT like this.
  8. goonergaz

    Veteran

    Joined:
    Jun 3, 2005
    Messages:
    4,186
    Likes Received:
    1,502
    The IO is not just for loading though, it's for streaming anything at any time from the SSD to the GPU. From what I understood the coherency engine/cache scrubbers could help minimise GPU stalls.

    "The I/O complex further houses two co-processors that help direct that data traffic. One co-processor is focused on SSD input-output that's specifically designed to bypass file read bottlenecks. The other handles memory mapping. The SRAM, or static RAM, which grants faster access to cached memory.
    Finally there's a block dedicated to coherency engines which work directly with the GPU to optimize caching. The engines communicate with the GPU, which "scrubbers" that clean up cached data on the GPU itself to improve efficiency and ensure the cache isn't filled with redundant data."
    Read more: https://www.tweaktown.com/news/7134...ep-dive-into-next-gen-storage-tech/index.html

    Also;
    https://forum.beyond3d.com/posts/2128115/
    The general cache invalidation process for the GCN/RDNA caches is a long-latency event. It's a pipeline event that blocks most of the graphics pipeline (command processor, CUs, wavefront launch, graphics blocks) until the invalidation process runs its course. This also comes up when CUs read from render targets in GCN, particularly after DCC was introduced and prior to the ROPs becoming L2 clients with Vega. The cache flush events are expensive and advised against heavily.

    So could this be what is causing the stutters on XSX?
     
  9. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    14,870
    Likes Received:
    10,985
    Location:
    London, UK
    If the GPU stalls are caused by stale data consuming valuable cache, then cache scrubbers will help. My understanding of the rudimentary operation of the cache scrubbers is that, whether data is in RAM or virtually mapped to files on the drive, if the cache contains data that has been overwritten (or remapped), then the corresponding data in cache is freed much earlier than if it were left to become "stale" and be freed naturally, meaning more relevant data can take its place.
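
    To make the distinction concrete, here is a toy Python model comparing a coarse "flush everything" invalidation with a targeted scrub of only the overwritten address range. The `ToyCache` class, its methods and the address ranges are all invented for illustration; nothing here is taken from documentation of the PS5 hardware.

```python
# Toy cache: full flush vs. targeted scrub of an overwritten address range.
# Purely illustrative; the class and the numbers are hypothetical.

class ToyCache:
    def __init__(self):
        self.lines = set()    # addresses currently held in the cache
        self.misses = 0

    def read(self, addr):
        """Touch an address; count a miss if it was not cached."""
        if addr not in self.lines:
            self.misses += 1
            self.lines.add(addr)

    def flush_all(self):
        """Invalidate everything, as a coarse cache flush would."""
        self.lines.clear()

    def scrub(self, start, end):
        """Invalidate only the addresses in [start, end)."""
        self.lines = {a for a in self.lines if not (start <= a < end)}


def misses_after_overwrite(use_scrub):
    cache = ToyCache()
    working_set = range(0, 100)
    for addr in working_set:          # warm the cache
        cache.read(addr)
    cache.misses = 0                  # only count misses after the overwrite
    if use_scrub:
        cache.scrub(90, 100)          # pretend addresses 90..99 were overwritten
    else:
        cache.flush_all()
    for addr in working_set:          # the next pass re-reads the working set
        cache.read(addr)
    return cache.misses
```

    In this toy setup the scrubbed cache only re-fetches the overwritten range, while the flushed cache re-fetches the whole working set. How often real workloads hit that situation is exactly the open question.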

    To what extent this actually happens is important to determine how beneficial the cache scrubber hardware is. But cache is precious and making the best use of it can make the difference between completing operations inside a frame and having to go to super-slow RAM and not completing operations inside a frame.

    To what extent this happens and is measurable, again, ¯\_(ツ)_/¯
     
    BRiT and goonergaz like this.
  10. thicc_gaf

    Regular Newcomer

    Joined:
    Oct 9, 2020
    Messages:
    324
    Likes Received:
    246
    They corrected themselves in another post and read in error. Suspected as such.
     
  11. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,986
    Likes Received:
    15,717
    Location:
    The North
    Not dismissing it, but not sure if it's a factor _today_. To put it plainly, dealing with latency is something developers have been doing for a very long time. For instance, GDDR has higher latency than regular DDR, but we have ways around this: the more CUs you have, the more memory latency you can cover, because each CU can hold a number of waves of work in flight. While some waves wait for their data to arrive, the CU switches to others whose data is ready.
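
    A back-of-the-envelope sketch of that latency-hiding idea, in Python. It assumes a deliberately simple model in which each wave alternates between a fixed burst of compute and a fixed memory wait, and a CU runs one wave at a time; the cycle counts in the usage note are invented, and real GPUs are far messier.

```python
import math


def waves_needed(compute_cycles, memory_latency):
    """Waves a CU must keep in flight so that some wave is always ready to run.

    Simplified model: each wave computes for `compute_cycles`, then waits
    `memory_latency` cycles for its data.
    """
    return math.ceil((compute_cycles + memory_latency) / compute_cycles)


def utilization(waves, compute_cycles, memory_latency):
    """Fraction of time the CU is busy under the same simplified model."""
    busy = waves * compute_cycles
    period = compute_cycles + memory_latency
    return min(1.0, busy / period)
```

    For example, with 50 cycles of compute per 400 cycles of memory wait (made-up figures), `waves_needed(50, 400)` gives 9, and a CU holding only 4 such waves would be busy less than half the time.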

    WRT the idea of cache scrubbers: for them to be a 'factor', the developer would, imo, have to be purposely programming in a way that exploits the hardware; otherwise the latency can be dealt with in other ways. So the developer must be aggressive enough with their timing that the latency-reduction properties of cache scrubbers are a necessity by design. And I don't expect many games today to require that, considering something like Control was designed around 4 cores and a slow spinning HDD. So to me it's not a non-factor, but likely not _the factor_ when looking at stuttering on Xbox.
     
    mr magoo likes this.
  12. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    14,870
    Likes Received:
    10,985
    Location:
    London, UK
    I don't know how you could even program to exploit cache scrubbers. This is passive hardware that triggers freeing non-stale data when certain conditions are met, i.e. cached data is no longer in real or virtualised memory because it has been overwritten. If your game environment / GPU access to data is that dynamic it seems impossible to program for it compared to say writing a bespoke chunk of code that runs within 32kb of L1 cache. Like the cache controller, this hardware is transparent for a reason.:yes:
     
    iroboto and BRiT like this.
  13. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,986
    Likes Received:
    15,717
    Location:
    The North
    Agreed, I don't know how you would either. The only example I can think of, and I'm likely wrong here, is when they discussed the idea that the I/O on PS5 is so effective that they could stream from the SSD within a frame and still have the texture arrive in time for rendering. Take a streaming pool of memory of, say, 720MB: if you reach a point where you need to release memory while bringing in new memory within the same frame, adhering to the pool size restriction, the new textures are going to be written over the ones that are now out of view. I think cache scrubbers may have an important role to play there in marking the stale data in caches.
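
    The streaming-pool scenario can be sketched as a fixed-budget pool with least-recently-used eviction. This is a minimal Python illustration only: the `StreamingPool` class, the LRU policy and the texture names are assumptions made for the example, not anything known about how a shipping engine or the PS5 manages its pool. Only the 720MB budget comes from the post above.

```python
from collections import OrderedDict


class StreamingPool:
    """Hypothetical fixed-budget texture pool with LRU eviction."""

    def __init__(self, budget_mb):
        self.budget_mb = budget_mb
        self.resident = OrderedDict()   # texture name -> size in MB
        self.used_mb = 0

    def touch(self, name):
        """Mark a resident texture as used this frame."""
        self.resident.move_to_end(name)

    def load(self, name, size_mb):
        """Bring a texture in, evicting old ones to stay within the budget.

        Returns the names whose memory was overwritten. In the scenario
        discussed here, any cached copies of those regions are now stale.
        """
        if name in self.resident:
            self.touch(name)
            return []
        if size_mb > self.budget_mb:
            raise ValueError("texture is larger than the whole pool")
        evicted = []
        while self.used_mb + size_mb > self.budget_mb:
            old_name, old_size = self.resident.popitem(last=False)
            self.used_mb -= old_size
            evicted.append(old_name)
        self.resident[name] = size_mb
        self.used_mb += size_mb
        return evicted
```

    The list returned by `load` is the interesting part: it names the regions that were just overwritten, which is where stale cache lines would have to be dealt with one way or another.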
     
    DSoup likes this.
  14. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,986
    Likes Received:
    15,717
    Location:
    The North
    Yeah, I think these synthetic benchmarks can really put some things into perspective here. I do believe triangle discard is being heavily underestimated. Cross-gen titles shouldn't be designed to completely bottleneck the system at the triangle level, so you won't see the performance gains listed here. But that doesn't mean you won't see major gains. I think primitive shader compilation vs. the traditional pipeline will give measurable performance benefits, like the ones we saw with the launch titles.

    At least the metrics here can sort of shed some light on how large gains can be when we move away from the traditional pipeline.

    https://videocardz.com/newz/ul-rele...t-results-of-nvidia-ampere-and-amd-rdna2-gpus
    [​IMG]

    • NVIDIA Ampere: 702%
    • AMD RDNA2: 547%
    • NVIDIA Turing (RTX): 409%
    • NVIDIA Turing: 244%
    Actual test looks like this:


    So every bit of savings does matter.



    ^^ They are rendering outside plus inside. XSX is fine outside, but inside it's clearly struggling with culling as the building obstructs the view. PS5, no problem. I can find the example on DMC 5 as well. I think you're seeing the difference in culling. How it's being culled needs to be verified, but that's what I'm seeing here.

    Similar issue at 14:05 although it may not be as obvious. The assumption here is that there are alpha issues, but if you rule out alpha, then you're talking about culling problems. If the play area here is too 'large' for XSX to cull, you can see it suffering. Unfortunately there is no way to tell how large a particular loading area is without talking to the developers. Once again, PS5 no problem.

    Basically, the challenge with culling is that you have limited triangle generation per cycle. If you're tossing away far more triangles than are being rendered, all that triangle creation is wasted.
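
    The arithmetic behind that point, as a small Python sketch. The front-end rate of 4 triangles per clock used in the usage note is a placeholder, not a claimed figure for either console.

```python
def visible_tris_per_clock(setup_rate, discard_fraction):
    """Triangles per clock that survive culling, for a fixed front-end rate."""
    if not 0.0 <= discard_fraction <= 1.0:
        raise ValueError("discard_fraction must be between 0 and 1")
    return setup_rate * (1.0 - discard_fraction)


def clocks_for_frame(visible_tris, setup_rate, discard_fraction):
    """Clocks the front end spends to deliver `visible_tris` on-screen triangles."""
    rate = visible_tris_per_clock(setup_rate, discard_fraction)
    if rate == 0:
        raise ValueError("nothing survives culling at this discard fraction")
    return visible_tris / rate
```

    With a placeholder rate of 4 triangles per clock, delivering one million visible triangles takes 250,000 clocks if nothing is discarded and 1,000,000 clocks if 75% of generated triangles are thrown away, so the front end spends four times as long for the same picture.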

    Inside building example 2 on DMC 5. No contest.
     
    #34 iroboto, Feb 11, 2021
    Last edited: Feb 11, 2021
    RagnarokFF, Pete, PSman1700 and 4 others like this.
  15. Allandor

    Regular Newcomer

    Joined:
    Oct 6, 2013
    Messages:
    587
    Likes Received:
    520
    So, 3D Mark has now implemented a Mesh Shader test.
    UL releases 3DMark Mesh Shaders Feature test, first results of NVIDIA Ampere and AMD RDNA2 GPUs - VideoCardz.com
    To make it short:
    • NVIDIA Ampere: 702%
    • AMD RDNA2: 547%
    • NVIDIA Turing (RTX): 409%
    • NVIDIA Turing: 244%
    Yes, this is a highly theoretical test, but it shows that there is much to gain with newer hardware.
    So Xbox has it. PlayStation should have something similar (as it should be part of RDNA2 and Sony might just give it another name). It really seems like the console hardware can make a few bigger jumps in future projects, when all those new features are used.

    edit:
    btw, the Radeon cards seem to have driver issues in this bench (as 6800 has more fps than the 6900), I guess AMD will fix this with a driver, but the difference between on and off is still a big jump.
     
    #35 Allandor, Feb 11, 2021
    Last edited: Feb 11, 2021
  16. Karamazov

    Veteran Regular

    Joined:
    Sep 20, 2005
    Messages:
    3,723
    Likes Received:
    3,649
    Location:
    France
    Sadly, we don't know what features the PS5 has. Sony went a bit the Nintendo route with tech info, so it may end up not having an equivalent feature at all.
     
    Johnny Awesome, PSman1700 and BRiT like this.
  17. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,986
    Likes Received:
    15,717
    Location:
    The North
    I've been talking heavily about geometry processing and triangle discard/culling advantages that PS5 could have over XSX in another thread. It may be worthwhile to move/merge that here since we're no longer really discussing any of the videos.

    Examples of where I believe XSX is suffering from major culling problems are in the post linked below. Once again, I believe either XSX is failing to cull well or PS5 is doing an extraordinary job at it, and this is starting to become the pattern I'm latching onto. Culling may actually do a good job of explaining the Corridor of Death in Control, as well as the major XSX drops in Hitman 3, in particular with the flowers (obstruction) and the zoomed-in sniper rifle (once again, a culling limit).
    https://forum.beyond3d.com/posts/2192110/

    Another example again.


    The hardest thing is that we're not actually sure which parts of the area are loaded for us to play in, since the game culls the stuff we can't see. So it's hard to say it's just this or that. But if you look at the frame graphs, this is unlikely to be a CPU issue, and I believe we're looking at triangle culling limitations again.

    This other area here in Cold War:

    This could be another area where triangle discard and generation matter more: being high up in the sky with complex geometry, you need to render a lot of triangles, and better discard will help considering how dense this particular scene is. The assumption is that COD uses compute shaders for culling, but triangle generation is still a key factor. PS5 should do about 22% more triangles than XSX.
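
    The 22% figure lines up with the ratio of the publicly stated GPU clocks (up to 2.23 GHz on PS5, 1.825 GHz on XSX). The Python sketch below works that out under one loud assumption that the thread does not confirm: that both front ends process the same number of primitives per clock. PS5's clock is also variable, so this is a peak-rate comparison only.

```python
PS5_PEAK_CLOCK_GHZ = 2.23    # publicly stated peak GPU clock (variable)
XSX_CLOCK_GHZ = 1.825        # publicly stated fixed GPU clock


def relative_throughput(clock_a, clock_b, prims_per_clock_a=1.0, prims_per_clock_b=1.0):
    """Ratio of peak primitive rates; equal per-clock rates are an assumption."""
    return (clock_a * prims_per_clock_a) / (clock_b * prims_per_clock_b)


ratio = relative_throughput(PS5_PEAK_CLOCK_GHZ, XSX_CLOCK_GHZ)
advantage_pct = (ratio - 1.0) * 100.0   # roughly 22 under the stated assumption
```

    If the per-clock rates differ between the two designs, the `prims_per_clock_*` parameters change the result accordingly.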
     
    #37 iroboto, Feb 11, 2021
    Last edited: Feb 11, 2021
  18. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,751
    Likes Received:
    6,878
    @iroboto I know some games use compute shaders to do coarse culling before feeding into the vertex pipeline, or something like that. If you just rely on the vertex shader pipeline you'll end up processing and shading many vertices before they're eventually culled by the fixed raster units. So you're wasting time shading vertices that you never needed to shade, and then wasting clock cycles on the fixed raster units by having them do more culling than necessary. I would have thought at least Assassin's Creed would be doing something smart to do coarse culling with compute shaders to alleviate that bottleneck. Maybe not?
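
    As a CPU-side illustration of what that coarse pass does, here is a minimal Python sketch that rejects whole clusters by testing a bounding sphere against frustum planes before any vertex is shaded. The cluster layout, the plane convention and the function names are assumptions for the example; a real engine would run something like this in a compute shader and usually add occlusion tests on top.

```python
def sphere_outside_plane(center, radius, plane):
    """True if the sphere lies entirely on the outside of the plane.

    `plane` is (nx, ny, nz, d) with a unit normal pointing into the view
    volume, so a point p is inside when n.p + d >= 0.
    """
    nx, ny, nz, d = plane
    signed_distance = nx * center[0] + ny * center[1] + nz * center[2] + d
    return signed_distance < -radius


def coarse_cull(clusters, planes):
    """Keep clusters whose bounding sphere is not fully outside any plane."""
    return [
        c for c in clusters
        if not any(sphere_outside_plane(c["center"], c["radius"], p) for p in planes)
    ]


def vertices_shaded(clusters):
    """Vertex-shader invocations needed for the given clusters."""
    return sum(c["vertex_count"] for c in clusters)


# Hypothetical scene: a view volume bounded to -10 <= x <= 10, three clusters.
PLANES = [(1.0, 0.0, 0.0, 10.0), (-1.0, 0.0, 0.0, 10.0)]
CLUSTERS = [
    {"center": (0.0, 0.0, 0.0), "radius": 1.0, "vertex_count": 64},
    {"center": (50.0, 0.0, 0.0), "radius": 1.0, "vertex_count": 64},
    {"center": (10.5, 0.0, 0.0), "radius": 1.0, "vertex_count": 64},
]
```

    The cluster far outside the volume is dropped without shading any of its vertices, while the one straddling the boundary is kept and left for the finer per-triangle culling later in the pipeline.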

    Edit: I know on PC there are still games, especially ones with legacy engines, where you can change the direction you're facing and watch the frame rate alter drastically, even though you're effectively looking at flat walls. They're most likely wasting a lot of time processing vertices that are occluded. There are places on the maps in Apex Legends that are like that, and I seem to remember the same issue in Remnant. You don't really notice it until you start trying to push past 60 fps to high framerates by lowering settings, and I'm assuming the bottleneck shifts from pixel/fragment shading to vertex shading. There are areas on apex maps that look relatively similar but facing one direction I can get 250fps and facing another direction I'll get 160.
     
    Pete, PSman1700, BRiT and 3 others like this.
  19. goonergaz

    Veteran

    Joined:
    Jun 3, 2005
    Messages:
    4,186
    Likes Received:
    1,502
    I think you’re onto something, but didn’t we get some examples of how great the Xbox is at culling?

    Or is that the mesh shaders?
     
  20. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,986
    Likes Received:
    15,717
    Location:
    The North
    Not that I can recall; if you have something, that would be great. IIRC, triangle discard was a Cerny marketing point. MS never touched it.
     
    Scott_Arm likes this.