Current Generation Hardware Speculation with a Technical Spin [post launch 2021] [XBSX, PS5]

Thanks a lot for the info -- this is something where my knowledge of the hardware side is way below my knowledge of the software side, so bear with me if this is a stupid question: automatically converting the standard fixed-function shader stages (vertex, etc.) into something the hardware is better designed for (more compute, etc.) is one thing, but where does culling get introduced here? Does the hardware have some way to know, or is something happening on the developer side? With mesh shaders, for example, they're not necessarily faster than the traditional fixed-function geometry path at all, at least for simple cases -- but because of how they're structured (task shaders are specialized compute shaders that dispatch mesh shaders, which in turn are specialized compute shaders that take the place of vertex shaders), you can relatively easily introduce huge culling benefits that just aren't practical in a straightforward way on the old pipeline. (There are some recent Xbox developer YouTube videos about this on the DX12 side.)
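To make the culling point concrete, here is a rough CPU-side C++ sketch of the kind of per-meshlet test a task shader performs before any mesh shader work is dispatched. Everything in it (the plane/sphere types, the two-plane frustum, the sample meshlets) is invented purely for illustration; it is not the DX12 API or any console SDK, just the shape of the idea.

// Rough, CPU-side C++ sketch of the per-meshlet test a task shader runs before
// deciding whether to dispatch mesh shader groups. All names and values are
// invented for illustration; this is not the DX12 or any console API.
#include <cstdio>
#include <vector>

struct Plane  { float nx, ny, nz, d; };  // plane equation: dot(n, p) + d = 0
struct Sphere { float x, y, z, r; };     // meshlet bounding sphere

// Conservative test: reject the meshlet only if its bounding sphere lies fully
// on the outside of at least one frustum plane.
static bool meshletVisible(const Sphere& s, const std::vector<Plane>& frustum) {
    for (const Plane& p : frustum) {
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.r) return false;
    }
    return true;
}

int main() {
    // Near and far planes only, for brevity; a real task shader tests all six
    // planes and usually layers back-face cone and occlusion tests on top.
    std::vector<Plane> frustum = { {0, 0, -1, -0.1f}, {0, 0, 1, 100.f} };

    std::vector<Sphere> meshlets = {
        {0, 0,  -5, 1},   // in front of the camera -> mesh shader group dispatched
        {0, 0, 200, 1},   // far beyond the far plane -> no work emitted at all
    };

    int dispatched = 0;
    for (const Sphere& m : meshlets)
        if (meshletVisible(m, frustum))
            ++dispatched;  // a task shader would emit its mesh shader group(s) here

    std::printf("%d of %zu meshlets dispatched; the rest never touch the pipeline\n",
                dispatched, meshlets.size());
}

A meshlet rejected at this stage never generates vertex work, primitive assembly or raster setup at all, which is exactly the kind of culling benefit that is awkward to get on the old pipeline.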
[Image: AMD RDNA graphics pipeline diagram showing the Geometry Engine and the fixed-function stages leading to the primitive assembler/rasterizer]

If you look at where the Geometry Engine sits and follow the old fixed-function pipeline, you can see how many stages the geometry passes through before back-face and frustum culling happens at the primitive assembler/rasterizer.
The advantage of discarding triangles way up front is a cumulative benefit down the line: none of the later stages have to work on triangles that never needed any operations. This can be monumental the more triangles are removed from view. This assumes developers are following a basic flow, of course -- probably inaccurate, but useful for discussing the timing.
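A rough back-of-envelope with made-up round numbers, purely to illustrate the cumulative effect:

triangles submitted per frame: 10,000,000
back-facing or outside the frustum: ~5,000,000 (50%)
culled at the primitive assembler/rasterizer: those 5M still paid for vertex fetch, vertex shading and primitive assembly before being thrown away
culled up front (task/primitive shader stage): that work never happens, and the fixed-function setup/raster hardware only ever sees the ~5M triangles that can actually contribute pixels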

I'm sort of at the understanding that
a) even if you use the older pipeline, RDNA 2 is still biased towards triangle discard, so it can discard a lot more triangles than it can rasterize -- something we know from AMD presentations and something Cerny spoke to, as did Matt H. But this would be non-compute-based culling.
b) and I believe that if developers decide to take advantage of primitive shaders to handle the culling up front, you get that cumulative effect down the chain. I do not think it's possible for a driver to know what needs to be culled, necessarily, but I could be wrong; it can do everything else in terms of compiling the front-end shaders into primitive shaders.
 
I'm sort of at the understanding that
a) even if you use the older pipeline, RDNA 2 is still biased towards triangle discard, so it can discard a lot more triangles than it can rasterize -- something we know from AMD presentations and something Cerny spoke to, as did Matt H.
b) and I believe that if developers decide to take advantage of primitive shaders to handle the culling up front, you get that cumulative effect down the chain. I do not think it's possible for a driver to know what needs to be culled, but it can do everything else in terms of compiling the front-end shaders into primitive shaders.

Yup, that makes sense with what I read in the thread you linked, thanks for the extra info.

I do think 'a' is probably less significant in these cases than the potential of 'b' -- unless I'm really underestimating what better triangle discard does, I think dips of the size we're seeing, if they were attributable to culling, would have to be driven by developers not taking advantage of DX12U features for some reason while easily taking advantage of the equivalent PS5 features. Which I wouldn't, honestly, put past MS at this point, but it doesn't seem super likely.
 
I mean, I think the "nothing much going on" tells us that something shady is going on (a bad DX12 renderer constantly hanging on fences, a serious tools problem, something wrong with the hardware) rather than a raw performance difference. With the Hitman example: I don't think that's actually very easy to cull. The kind of thorough geometric tests that can guarantee 'the flowers obstruct all views' are expensive and hard to get right -- I think the safer guess is that it has more to do with (1) the way the Hitman renderer works (maybe a relatively 'straightforward' forward renderer or something?), (2) the Xbox running at a much higher resolution, and (3) yeah, a PS5 hardware advantage on fill rate would make sense, but not one that big.
My only rebuttal here is that, yes, DX12 is a very hard animal to tame on PC. But on console it should be significantly more straightforward, and fewer of those issues exist.
In Hitman, a zoom-in is basically hardcore frustum culling. Usually lots of small geometry, moving grass, etc. tends to clog things up without a good way to manage it.
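To put a number on how aggressive that is (illustrative values, not Hitman's actual FOVs): the cross-section of the view frustum at a given distance scales roughly with tan(fov/2) in each axis. Going from a 90 degree FOV to a 10 degree sniper zoom means tan(45°) = 1.0 versus tan(5°) ≈ 0.087, so the visible area at any depth shrinks by a factor of about (1.0 / 0.087)^2 ≈ 130. Nearly everything that was inside the frustum a moment earlier is suddenly outside the side planes and has to be rejected somewhere in the pipeline.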
 
I think the answer to that is that nobody able to post here knows, which is probably why there is little discussion. The speculation-to-fact ratio is already bad in the console technical forums! Too many folks participate on the premise of wanting to learn, but they don't actually want to learn; they want their view to be correct.

Assuming Sony even has dev tools to measure the effectiveness of the cache scrubbers (and how would you even measure it?), maybe at some future GDC a dev will cover this.
We do have the Road to PS5 talk; I just re-watched the part (from 18-20 minutes, if you're interested) where Cerny talks about the I/O unit... I really don't understand why people are being dismissive about it. They have implemented several bespoke systems/functions that seem to help stream data from the SSD and minimise things that would impact GPU performance (coherency engines/cache scrubbers) - and it apparently happens automatically, without any required dev knowledge.

Anyway, hopefully as you say we will one day find out a bit more.
 
We do have the Road to PS5 talk; I just re-watched the part (from 18-20 minutes, if you're interested) where Cerny talks about the I/O unit... I really don't understand why people are being dismissive about it.

I don't know if anybody is dismissing it (again... ignore list), but we're not seeing many games leverage PS5's supercharged I/O system yet. Astro's Playroom and Spider-Man: Miles Morales do, and both are impressive for their lack of load times. Booting into Miles Morales, a dense open-world superhero game, in around six seconds is still difficult to believe.

As for the cache scrubbers: anything that improves use of the cache is going to benefit performance, but whatever the cache scrubbers bring to a given game is impossible to quantify and is going to vary from game to game - much like cache usage itself.

Cache scrubbers could be helping massively or only marginally. ¯\_(ツ)_/¯
 
I don't know if anybody is dismissing it (again... ignore list), but we're not seeing many games leverage PS5's supercharged I/O system yet. Astro's Playroom and Spider-Man: Miles Morales do, and both are impressive for their lack of load times. Booting into Miles Morales, a dense open-world superhero game, in around six seconds is still difficult to believe.

As for the cache scrubbers: anything that improves use of the cache is going to benefit performance, but whatever the cache scrubbers bring to a given game is impossible to quantify and is going to vary from game to game - much like cache usage itself.

Cache scrubbers could be helping massively or only marginally. ¯\_(ツ)_/¯
Demon's Souls is probably the best use of the custom I/O, as they do what Cerny described in the presentation: after the initial load they gradually stream in the data they need just before it is needed. This is why Demon's Souls always has very dense scenes.

And Nioh could be the first third-party game using it.
 
Demon's Souls is probably the best use of the custom I/O, as they do what Cerny described in the presentation: after the initial load they gradually stream in the data they need just before it is needed. This is why Demon's Souls always has very dense scenes.
I don't have Demon's Souls, but in terms of getting an AAA game off the drive and running from cold, Miles Morales is hard to beat. Sure, it's a different type of game, but the game is loading New York. In six seconds!!! :runaway:
 
I don't know if anybody is dismissing it (again... ignore list), but we're not seeing many games leverage PS5's supercharged I/O system yet. Astro's Playroom and Spider-Man: Miles Morales do, and both are impressive for their lack of load times. Booting into Miles Morales, a dense open-world superhero game, in around six seconds is still difficult to believe.

As for the cache scrubbers: anything that improves use of the cache is going to benefit performance, but whatever the cache scrubbers bring to a given game is impossible to quantify and is going to vary from game to game - much like cache usage itself.

Cache scrubbers could be helping massively or only marginally. ¯\_(ツ)_/¯
The I/O is not just for loading, though; it's for streaming anything at any time from the SSD to the GPU. From what I understood, the coherency engines/cache scrubbers could help minimise GPU stalls.

"The I/O complex further houses two co-processors that help direct that data traffic. One co-processor is focused on SSD input-output that's specifically designed to bypass file read bottlenecks. The other handles memory mapping. The SRAM, or static RAM, which grants faster access to cached memory.
Finally there's a block dedicated to coherency engines which work directly with the GPU to optimize caching. The engines communicate with the GPU, which "scrubbers" that clean up cached data on the GPU itself to improve efficiency and ensure the cache isn't filled with redundant data."
Read more: https://www.tweaktown.com/news/7134...ep-dive-into-next-gen-storage-tech/index.html

Also;
https://forum.beyond3d.com/posts/2128115/
The general cache invalidation process for the GCN/RDNA caches is a long-latency event. It's a pipeline event that blocks most of the graphics pipeline (command processor, CUs, wavefront launch, graphics blocks) until the invalidation process runs its course. This also comes up when CUs read from render targets in GCN, particularly after DCC was introduced and prior to the ROPs becoming L2 clients with Vega. The cache flush events are expensive and advised against heavily.

So could this be what is causing the stutters on XSX?
 
The I/O is not just for loading, though; it's for streaming anything at any time from the SSD to the GPU. From what I understood, the coherency engines/cache scrubbers could help minimise GPU stalls.

If the GPU stalls are caused by stale data consuming valuable cache, then cache scrubbers will help. My understanding of the rudimentary operation of the cache scrubbers is that whether data is in RAM or virtually mapped to files on the drive, if the cache holds data whose backing memory is overwritten (or remapped), then the corresponding lines in cache are freed much earlier than they would naturally age out as "stale" - meaning more relevant data can take their place.

To what extent this actually happens is important for determining how beneficial the cache scrubber hardware is. But cache is precious, and making the best use of it can be the difference between completing operations inside a frame and having to go out to comparatively slow RAM and not completing them inside the frame.

To what extent this happens and is measurable, again, ¯\_(ツ)_/¯
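For what it's worth, here is a purely conceptual C++ toy model of that distinction - dropping only the lines whose backing memory was just overwritten versus dumping the whole cache. Nothing in it corresponds to a real API or to how the hardware actually tracks lines; it's just the shape of the idea.

// Purely conceptual C++ toy model of the difference between a full GPU cache
// flush and a targeted "scrub" of only the lines whose backing memory was just
// overwritten. Invented types and sizes; not any real console API or hardware.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

using LineTag = uint64_t;                 // cache line identified by address / 64

struct ToyCache {
    std::unordered_set<LineTag> lines;    // tags currently resident

    // Full invalidation: everything goes. On real GCN/RDNA hardware this is the
    // long-latency, pipeline-blocking event mentioned earlier in the thread.
    void flushAll() { lines.clear(); }

    // Scrub: drop only lines whose backing addresses fall in [base, base+size),
    // i.e. the region that was just overwritten. Everything else stays hot.
    void scrubRange(uint64_t base, uint64_t size) {
        for (auto it = lines.begin(); it != lines.end(); ) {
            uint64_t addr = *it * 64;
            if (addr >= base && addr < base + size)
                it = lines.erase(it);
            else
                ++it;
        }
    }
};

int main() {
    ToyCache cache;
    for (uint64_t addr = 0; addr < (1u << 20); addr += 64)   // pretend 1 MB of data is cached
        cache.lines.insert(addr / 64);

    std::size_t before = cache.lines.size();
    cache.scrubRange(0x40000, 0x10000);   // a 64 KB texture region was just re-streamed
    std::printf("scrub dropped %zu of %zu lines; the rest stay valid\n",
                before - cache.lines.size(), before);
}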
 
We do have the Road to PS5 talk; I just re-watched the part (from 18-20 minutes, if you're interested) where Cerny talks about the I/O unit... I really don't understand why people are being dismissive about it. They have implemented several bespoke systems/functions that seem to help stream data from the SSD and minimise things that would impact GPU performance (coherency engines/cache scrubbers) - and it apparently happens automatically, without any required dev knowledge.

Anyway, hopefully as you say we will one day find out a bit more.
Not dismissing it, but not sure if it's a factor _today_. To put it plainly, dealing with latency is something developers have been doing for a very long time. For instance, GDDR has higher latency than regular DDR, but we have ways around this: the more CUs you have, the more memory latency you can absorb, because each CU can hold a number of waves of work in flight while waiting for data to arrive, and it switches back and forth between them as their data comes in.

WRT the idea of cache scrubbers, for them to be a 'factor', the developer would, imo, have to be purposely programming in a way that exploits the hardware; otherwise the latency can be dealt with in other ways. So the developer must be looking to be aggressive with their timing such that the latency-reduction properties of the cache scrubbers are a necessity by design. And I don't expect many games today to require that, considering something like Control was designed around 4 cores and a slow spinning HDD. So to me, not a non-factor, but likely not _the factor_ when looking at stuttering on Xbox.
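Rough illustrative numbers for the latency-hiding point (not actual RDNA 2 figures): if a cache-missing memory access takes on the order of 400 cycles to come back, and a wave can issue roughly 40 cycles of ALU work between its memory requests, then a SIMD needs about 400 / 40 = 10 resident waves to always have something ready to issue. More CUs, and more wave slots per CU, simply means more of that slack available to soak up latency.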
 
WRT the idea of cache scrubbers, for them to be a 'factor', the developer would, imo, have to be purposely programming in a way that exploits the hardware; otherwise the latency can be dealt with in other ways.
I don't know how you could even program to exploit the cache scrubbers. This is passive hardware that frees cached data early when certain conditions are met, i.e. the cached data is no longer backed by real or virtualised memory because it has been overwritten. If your game environment / GPU access to data is that dynamic, it seems impossible to program for it, compared to, say, writing a bespoke chunk of code that runs within 32 KB of L1 cache. Like the cache controller, this hardware is transparent for a reason. :yes:
 
I don't know how you could even program to exploit the cache scrubbers. This is passive hardware that frees cached data early when certain conditions are met, i.e. the cached data is no longer backed by real or virtualised memory because it has been overwritten. If your game environment / GPU access to data is that dynamic, it seems impossible to program for it, compared to, say, writing a bespoke chunk of code that runs within 32 KB of L1 cache. Like the cache controller, this hardware is transparent for a reason. :yes:
Agreed, I don't know how you would either. The only example I can think of, and I'm likely wrong here, is when they discussed the idea that the I/O on PS5 is so effective that you could stream from the SSD within a frame and the texture would still arrive in time for rendering. I suppose in the case of a streaming pool of memory, for instance 720 MB, if you're at a point where you need to release memory while bringing in memory at the same time within the frame, adhering to the pool size restriction, it's going to bring in new textures by writing over the ones that are now out of view, and I think the cache scrubbers may have an important role to play there in marking the stale data in the caches for those particular cases.
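A minimal sketch of that idea, assuming the 720 MB pool mentioned above carved into invented 64 KB tiles (this is not a real engine or SDK): the interesting moment is the overwrite, because that is when any GPU cache lines still holding the old tile's bytes become stale and a range-based scrub, rather than a full flush, would matter.

// Minimal sketch of a fixed-size streaming pool. The pool size comes from the
// post above; the tile size and everything else is invented for illustration.
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

constexpr std::size_t kTileBytes = 64 * 1024;
constexpr std::size_t kPoolTiles = (720ull << 20) / kTileBytes;

struct StreamingPool {
    std::vector<int> residentTileId;   // which tile currently lives in each slot (-1 = free)
    StreamingPool() : residentTileId(kPoolTiles, -1) {}

    // Stream a new tile into 'slot', overwriting whatever was there. Returns the
    // byte range within the pool that would need scrubbing from the GPU caches.
    std::pair<std::size_t, std::size_t> stream(std::size_t slot, int newTileId) {
        residentTileId[slot] = newTileId;          // old contents are gone from this instant
        return { slot * kTileBytes, kTileBytes };
    }
};

int main() {
    StreamingPool pool;
    auto [offset, size] = pool.stream(42, /*newTileId=*/1001);
    std::printf("scrub pool bytes [%zu, %zu) before the GPU samples the new tile\n",
                offset, offset + size);
}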
 
unless I'm really underestimating what better triangle discard does, I think dips of the size we're seeing, if they were attributable to culling, would have to be driven by developers not taking advantage of DX12U features for some reason,
Yea, I think these synthetic benchmarks can really put some things into perspective here. I do believe triangle discard is being heavily underestimated. Cross-gen titles shouldn't be designed to completely bottleneck the system at the triangle level, so you won't see performance gains like the ones listed here. But that doesn't mean you won't see major gains. I think a primitive shader compilation versus the traditional pipeline will show measurable performance benefits like the ones we saw with the launch titles.

At least the metrics here can sort of shed some light on how large gains can be when we move away from the traditional pipeline.

https://videocardz.com/newz/ul-rele...t-results-of-nvidia-ampere-and-amd-rdna2-gpus
What is a mesh shader?
In 3D graphics, a mesh is the set of vertices, edges and faces that define the shape of an object. In current graphics pipelines, all the geometry data in a mesh must be processed sequentially before any further steps can be taken. This can be a significant performance bottleneck.

Mesh shaders replace the old model with a new approach to geometry processing that simplifies the graphics pipeline while also giving developers more flexibility and control. Mesh shaders can process small sections of a mesh, called meshlets, in parallel with a much greater degree of flexibility and control.

Test mesh shader performance with 3DMark
The 3DMark Mesh Shader feature test shows how game engines can improve performance by using the mesh shader pipeline to efficiently cull geometry that is not visible to the camera.

The test scene is a hall containing many rows of highly detailed, carved pillars. As the camera moves through the scene, the pillars in the foreground block the view of those further back.

[Image: 3DMark Mesh Shaders feature test results (performance gain with mesh shaders on vs. off)]


  • NVIDIA Ampere: 702%
  • AMD RDNA2: 547%
  • NVIDIA Turing (RTX): 409%
  • NVIDIA Turing: 244%
The actual test looks like this (video embedded in the original post):

So every bit of savings matters.


^^ They are rendering outside plus inside; XSX is fine outside, but inside it's clearly struggling with culling here as the building obstructs the view. PS5 has no problem. I can find the same example in DMC 5 as well. I think you're seeing the difference in culling -- how it's being culled needs to be verified, but that's what I'm seeing here.

Similar issue at 14:05, although it may not be as obvious. The assumption here is that there are alpha issues, but if you rule out alpha, then you're talking about culling problems. If the play area here is too 'large' for XSX to cull, you can see it suffering. Unfortunately there is no way to tell how large a particular loaded area is without talking to the developers. Once again, PS5 has no problem.

Basically the challenge with culling is that you have a limited triangle setup rate per cycle; if you're tossing away far more triangles than you're rendering, all of that triangle setup throughput is being wasted.
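To hang some rough numbers on that (assuming the commonly quoted four primitives per clock for RDNA 2 and XSX's 1.825 GHz GPU clock; the cull ratio is made up): 4 x 1.825 GHz ≈ 7.3 billion triangles per second of raw setup/discard throughput. If three out of every four submitted triangles end up back-facing or outside the view, only ~1.8 billion per second of that budget went to triangles you actually wanted; the rest was spent discarding.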

Inside building example 2 on DMC 5. No contest.
 
So, 3DMark has now implemented a Mesh Shaders feature test.
UL releases 3DMark Mesh Shaders Feature test, first results of NVIDIA Ampere and AMD RDNA2 GPUs - VideoCardz.com
To make it short:
  • NVIDIA Ampere: 702%
  • AMD RDNA2: 547%
  • NVIDIA Turing (RTX): 409%
  • NVIDIA Turing: 244%
Yes, this is a highly theoretical test, but it shows that there is much to gain with newer hardware.
So Xbox has it. PlayStation should have something similar (it should be part of RDNA 2, and Sony might just give it another name). It really seems like the console hardware can make some bigger jumps in future projects, when all those new features are used.

edit:
btw, the Radeon cards seem to have driver issues in this benchmark (the 6800 gets more fps than the 6900); I guess AMD will fix this with a driver update, but the difference between on and off is still a big jump.
 
I've been talking heavily about geometry processing and triangle discard/culling advantages that PS5 could have over XSX in another thread. It may be worthwhile to move/merge that here since we're no longer really discussing any of the videos.

Examples of where I believe XSX is suffering from major culling problems are in the post below. Once again, I believe either XSX is failing to cull well or PS5 is doing an extraordinary job at it, but this is starting to become the pattern I'm latching onto. Culling may go a long way toward explaining the Corridor of Death in Control, and the issues with major drops on XSX in Hitman 3, in particular with the flowers (obstruction) and the sniper rifle zoom (once again, a culling limit).
https://forum.beyond3d.com/posts/2192110/

Another example.

So the hardest thing is that we're not actually sure which parts of the area are loaded for us to play, since the game culls the stuff we can't see, so it's hard to say it's just this or that. But if you look at the frame graphs, this is unlikely to be a CPU issue, and I believe we're looking at triangle culling limitations again.

This other area here in Cold War:
This could be another area where triangle discard and generation matter more, as you need to render a lot of triangles when you're high up in the sky looking over complex geometry, and having better discard will help with this considering how dense this particular scene is. The assumption is that COD uses compute shaders for culling, but triangle generation rate is still a key factor. PS5 should set up about 22% more triangles per second than XSX.
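That 22% is just the clock ratio, assuming both GPUs handle the same number of primitives per clock: 2.23 GHz / 1.825 GHz ≈ 1.22, so PS5's peak triangle setup/discard rate comes out roughly 22% higher.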
 
@iroboto I know some games use compute shaders to do coarse culling before feeding the vertex pipeline, or something like that. If you just rely on the vertex shader pipeline, you'll end up processing and shading many vertices before they're eventually culled by the fixed raster units. So you're wasting time shading vertices that you never needed to shade, and then wasting clock cycles on the fixed raster units by having them do more culling than necessary. I would have thought at least Assassin's Creed would be doing something smart with compute-shader coarse culling to alleviate that bottleneck. Maybe not?

Edit: I know on PC there are still games, especially ones with legacy engines, where you can change the direction you're facing and watch the frame rate change drastically, even though you're effectively looking at flat walls. They're most likely wasting a lot of time processing vertices that are occluded. There are places on the maps in Apex Legends that are like that, and I seem to remember the same issue in Remnant. You don't really notice it until you start trying to push past 60 fps to higher frame rates by lowering settings, and I'm assuming the bottleneck shifts from pixel/fragment shading to vertex shading. There are areas on Apex maps that look relatively similar, but facing one direction I can get 250 fps and facing another I'll get 160.
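For anyone unfamiliar with the pattern being described, here is a rough CPU-side C++ sketch of what such a coarse culling pre-pass produces: a compacted list of draw arguments for only the objects whose bounds survive the frustum test, which the GPU then consumes via indirect drawing. It's a toy illustration with invented types and values, not any particular engine's implementation.

// Toy illustration of a coarse culling pre-pass (normally a compute shader):
// test object bounds against the frustum and emit draw arguments only for survivors.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Plane  { float nx, ny, nz, d; };
struct Sphere { float x, y, z, r; };

// Mirrors the general layout of an indexed indirect draw argument record.
struct DrawArgs { uint32_t indexCount, instanceCount, firstIndex, baseVertex, baseInstance; };

struct Object { Sphere bounds; DrawArgs draw; };

static bool outsideAnyPlane(const Sphere& s, const std::vector<Plane>& planes) {
    for (const Plane& p : planes)
        if (p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d < -s.r)
            return true;                         // fully outside one plane -> culled
    return false;
}

int main() {
    // Two planes only for brevity; a real pass tests all six frustum planes.
    std::vector<Plane> frustum = { {0, 0, -1, -0.1f}, {0, 0, 1, 100.f} };

    std::vector<Object> scene = {
        { { 0, 0, -10, 2 }, { 3000, 1, 0,    0, 0 } },   // visible
        { { 0, 0, 500, 2 }, { 3000, 1, 3000, 0, 0 } },   // far beyond the far plane -> culled
    };

    std::vector<DrawArgs> compacted;             // what the compute pass would append to a GPU buffer
    for (const Object& o : scene)
        if (!outsideAnyPlane(o.bounds, frustum))
            compacted.push_back(o.draw);

    // The vertex pipeline only ever sees the surviving draws (e.g. via indirect execution).
    std::printf("%zu of %zu draws survive coarse culling\n", compacted.size(), scene.size());
}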
 
I've been talking heavily about geometry processing and triangle discard/culling advantages that PS5 could have over XSX in another thread. It may be worthwhile to move/merge that here since we're no longer really discussing any of the videos.

Examples of where I believe XSX is suffering from major culling problems are in the post below. Once again, I believe either XSX is failing to cull well or PS5 is doing an extraordinary job at it, but this is starting to become the pattern I'm latching onto. Culling may go a long way toward explaining the Corridor of Death in Control, and the issues with major drops on XSX in Hitman 3, in particular with the flowers (obstruction) and the sniper rifle zoom (once again, a culling limit).
https://forum.beyond3d.com/posts/2192110/

Another example.

So the hardest thing is that we're not actually sure which parts of the area are loaded for us to play, since the game culls the stuff we can't see, so it's hard to say it's just this or that. But if you look at the frame graphs, this is unlikely to be a CPU issue, and I believe we're looking at triangle culling limitations again.

This other area here in Cold War:
This could be another area where triangle discard and generation matter more, as you need to render a lot of triangles when you're high up in the sky looking over complex geometry, and having better discard will help with this considering how dense this particular scene is.
I think you’re onto something, but didn’t we get some examples of how great the Xbox is at culling?

Or is that the mesh shaders?
 
I think you’re onto something, but didn’t we get some examples of how great the Xbox is at culling?
Not that I can recall; if you have something, that would be great. IIRC triangle discard was a Cerny marketing point; MS never touched on it.
 