Culling is one of the mesh shader's biggest use cases. The point of using mesh shaders for culling is to break the geometry down into tiny 'meshlets' so that culling can happen at a finer granularity. AMD's primitive shaders are very much capable of breaking geometry down into meshlets.
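To make the "finer grain" point concrete, here's a minimal CPU-side sketch of the kind of per-meshlet test a mesh or task shader can run. The `MeshletBounds` layout is an assumption for illustration, not any vendor's actual format; the cone test itself follows the form used in common meshlet-culling write-ups (e.g. meshoptimizer's).

```cpp
#include <cmath>

// Hypothetical per-meshlet bounds, precomputed offline: a bounding
// sphere plus a normal cone conservatively containing all triangle
// normals in the meshlet.
struct MeshletBounds {
    float center[3];
    float radius;
    float coneAxis[3];   // representative facet normal, unit length
    float coneCutoff;    // cos of the cone half-angle
};

// Whole-meshlet backface test: if every triangle in the meshlet faces
// away from the camera, the meshlet can be skipped before a single
// vertex is shaded -- much finer grained than per-draw culling.
bool meshletIsVisible(const MeshletBounds& m, const float camPos[3]) {
    float toCenter[3] = { m.center[0] - camPos[0],
                          m.center[1] - camPos[1],
                          m.center[2] - camPos[2] };
    float dist = std::sqrt(toCenter[0] * toCenter[0] +
                           toCenter[1] * toCenter[1] +
                           toCenter[2] * toCenter[2]);
    // Camera inside the bounding sphere: keep the meshlet.
    if (dist < m.radius) return true;
    // Cull when the view direction lies entirely inside the
    // back-facing cone; otherwise the meshlet may be visible.
    float d = toCenter[0] * m.coneAxis[0] +
              toCenter[1] * m.coneAxis[1] +
              toCenter[2] * m.coneAxis[2];
    return d < m.coneCutoff * dist + m.radius;
}
```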
The headline motivation for mesh shaders is to break a serial bottleneck at the primitive setup stage of the geometry pipeline. The front end becomes more like a compute shader that can define the primitives it takes in, how they are arranged, and how they map to wavefront lanes or invocations further down the pipeline. The task/amplification stage adds more proactive control over how many primitives are used (task-shader LOD selection) or generated (the task/amplification shader sets the number of mesh shader workgroups).
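As a rough CPU analogue of the amplification idea: test each meshlet, build a payload of survivors, and launch only that many mesh-shader workgroups. The names and payload layout here are invented for the sketch; on D3D12 the final step would correspond to something like `DispatchMesh(survivorCount, 1, 1)`.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical payload the task stage hands to its mesh shaders:
// just the indices of the meshlets that survived culling.
struct TaskPayload {
    std::vector<uint32_t> survivingMeshlets;
};

// CPU analogue of a task/amplification shader: run a visibility
// predicate per meshlet, then "dispatch" one mesh-shader workgroup per
// survivor. The real GPU version does this in parallel across lanes.
TaskPayload runTaskStage(uint32_t meshletCount,
                         const std::function<bool(uint32_t)>& isVisible) {
    TaskPayload payload;
    for (uint32_t i = 0; i < meshletCount; ++i)
        if (isVisible(i))
            payload.survivingMeshlets.push_back(i);
    // On a real API this is where the stage would issue the equivalent
    // of DispatchMesh(payload.survivingMeshlets.size(), 1, 1).
    return payload;
}
```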
Nvidia may have had the least to say about culling because, by that point, its hardware culling capability had grown substantial enough that doing all the culling in software before primitives reached the hardware could actually cost performance.
Primitive shaders, as described, take the full stream of primitives from input assembly or the tessellation stage, try to cull from it, and try to best-fit the number of wavefronts and active lanes automatically.
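The "best fit the number of wavefronts and active lanes" part is essentially stream compaction. Here's a toy scalar version of what the hardware/driver would be doing automatically; the wave width of 64 is just the familiar GCN figure used for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kWaveSize = 64; // illustrative wavefront width

// Given per-primitive survival flags from culling, pack the surviving
// primitive IDs densely so downstream wavefronts run with full lanes
// instead of launching waves full of holes for culled primitives.
std::vector<uint32_t> compactPrimitives(const std::vector<bool>& survived) {
    std::vector<uint32_t> packed;
    for (uint32_t id = 0; id < survived.size(); ++id)
        if (survived[id])
            packed.push_back(id);
    return packed;
}

// Wavefronts actually needed after compaction (rounded up).
size_t wavesNeeded(size_t survivorCount) {
    return (survivorCount + kWaveSize - 1) / kWaveSize;
}
```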
A 'meshlet' on AMD hardware consists of up to 254 verts/128 prims per wave, and the primitive topology can very much be programmer-defined too.
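To illustrate what programmer-defined meshlets look like in practice, here's a naive builder that slices a triangle list into meshlets under the vertex/primitive caps quoted above (treat the caps as illustrative). Real builders, like meshoptimizer's, also optimize for vertex locality; this one just fills greedily.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Caps taken from the figures quoted above.
constexpr uint32_t kMaxVerts = 254;
constexpr uint32_t kMaxPrims = 128;

struct Meshlet {
    std::vector<uint32_t> vertices;  // indices into the vertex buffer
    std::vector<uint8_t>  localTris; // 3 local indices per triangle
};

// Greedily pack consecutive triangles into meshlets, remapping each
// global vertex index to a small local index so one meshlet's data can
// live on-chip for a single workgroup/wave.
std::vector<Meshlet> buildMeshlets(const std::vector<uint32_t>& indices) {
    std::vector<Meshlet> meshlets(1);
    std::unordered_map<uint32_t, uint8_t> localIndex;

    for (size_t tri = 0; tri + 2 < indices.size(); tri += 3) {
        // Count how many new vertices this triangle would add.
        uint32_t newVerts = 0;
        for (int k = 0; k < 3; ++k)
            if (!localIndex.count(indices[tri + k])) ++newVerts;
        // Start a fresh meshlet if either cap would be exceeded.
        Meshlet& m = meshlets.back();
        if (m.vertices.size() + newVerts > kMaxVerts ||
            m.localTris.size() / 3 + 1 > kMaxPrims) {
            meshlets.emplace_back();
            localIndex.clear();
        }
        Meshlet& cur = meshlets.back();
        for (int k = 0; k < 3; ++k) {
            uint32_t v = indices[tri + k];
            auto it = localIndex.find(v);
            if (it == localIndex.end()) {
                it = localIndex.emplace(
                         v, static_cast<uint8_t>(cur.vertices.size())).first;
                cur.vertices.push_back(v);
            }
            cur.localTris.push_back(it->second);
        }
    }
    return meshlets;
}
```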
Is there a link to AMD's pages describing how to define primitive shader meshlets for game programmers?
Also, the reason vertex/geometry/tessellation shaders don't work in D3D's mesh shader pipeline is mostly down to a hardware limitation on Nvidia's side (possibly Intel's as well), so Microsoft had to make this compromise; otherwise mesh shaders in D3D would have had no way of working on that hardware.
The apparent direction Microsoft is taking is that it doesn't want to fit mesh shaders into that pipeline. Vertex shaders are already amenable to conversion to either primitive or mesh shaders. There's no desire to continue using the tessellation stage, and general disinterest in geometry shaders going forward. AMD's primitive shaders, as currently known, attempt to cater to those legacy stages.
From what we've heard of primitive shaders with Vega, the variants were either easy for compilers to generate automatically or profoundly difficult to expose to programmers.
There are some hints in the Vega ISA document about how low-level it might have been, with an instruction designed to report which shader engine or engines are responsible for a given primitive's bounding box, likely used to determine which low-level state machines needed to be broadcast a triangle and which needed to be told to drop a primitive from their FIFOs, since there were message types that could cull primitives from shader engines. There are also driver changes and warnings about other operations needing to be very careful when addressing the shader engines or their queues, given bugs and a tendency to hard-lock the GPU when working with them.
Even the culling-shader version ran into problems, possibly because much of its work was already being done by compute shaders, and potentially because additional latency or occupancy issues in the more serial setup stages made the performance gains limited or unreliable. Navi's focus on latency and its dual-CU arrangement may have made the more modest level of culling in the current auto-generated primitive shaders feasible.
(edit: To clarify the point about compute shaders: by the time primitive shaders were introduced, developers were already using compute shaders that did most of what primitive shaders would have done.)
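For context on what those compute shaders did: a common pattern in the GPU-driven pipeline talks of that era was a compute pass that tested every triangle and wrote out a compacted index buffer before the draw ever launched. A scalar CPU sketch of the core backface test, with the projection step assumed to have happened already:

```cpp
#include <cstdint>
#include <vector>

// 2D position after projection to screen/NDC space (projection omitted).
struct Vec2 { float x, y; };

// Signed twice-area of the projected triangle: back-facing (negative
// with this winding convention) and zero-area triangles get dropped,
// which is the core of what compute-based pre-culling passes did,
// alongside frustum and small-primitive tests.
static bool frontFacing(Vec2 a, Vec2 b, Vec2 c) {
    float det = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    return det > 0.0f;
}

// Emit a compacted index buffer containing only surviving triangles,
// so the subsequent draw never sees the culled ones.
std::vector<uint32_t> cullTriangles(const std::vector<Vec2>& screenPos,
                                    const std::vector<uint32_t>& indices) {
    std::vector<uint32_t> out;
    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        uint32_t i0 = indices[i], i1 = indices[i + 1], i2 = indices[i + 2];
        if (frontFacing(screenPos[i0], screenPos[i1], screenPos[i2])) {
            out.push_back(i0);
            out.push_back(i1);
            out.push_back(i2);
        }
    }
    return out;
}
```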