Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

3dilettante · Dec 2, 2020

Globalisateur said:
From my memory of that github data the 2 NGG legacy values are there for Pro legacy, not PS4.

NGG was introduced but not really implemented with Vega. It wasn't cited as something brought into the Pro, although perhaps some element of it was brought in.
Rapid packed math is the Vega feature I remember being adopted by the Pro.

Could another interpretation be that NGG legacy is when vertex shader code is being converted or run through the new pipeline? Interpreting older code to run on NGG was a big part of the work done with Vega, and possibly a contributor to why it wasn't adopted then.

QPlayer · Dec 2, 2020

XSX needs the modern shader language.

Old code no way

chris1515 · Dec 3, 2020

Seeing hair technology in FIFA 21 or, to a lesser extent, Spider-Man: Miles Morales, UE5 Nanite and Lumen, Demon's Souls' level of geometry, Demon's Souls and Watch Dogs GI, or raytracing in Watch Dogs: Legion and Spider-Man: Miles Morales.

We begin to have an idea of what we can really do with PS5 and Xbox Series X|S as the target. This will be interesting when everything is merged together and when the PS4, XB1 and midgen consoles are left behind.

j^aws · Dec 6, 2020

iroboto said:
I would suspect that depends largely on how many render targets the engines use, which could demand more or less. Something like Doom Eternal can use well over 50 render targets per frame. That, I believe, is why triangle and fill rate numbers have to rise to meet the requirements from developers.

PS5 has 64 ROPs (Github), so its pixel fillrate has increased from previous gen and is plentiful:
64x2.23GHz = 142.7 Gpix/s

Triangle throughput:
PS5:
2x2.23GHz = 4.46 GTri/s (if 2 tri/cycle for RDNA2)
PS4 Pro:
4x0.911GHz = 3.64 GTri/s
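
As a quick check of the arithmetic above, here is a minimal Python sketch. It simply multiplies units per clock by clock speed, using the figures quoted in this post. The 2 tri/clk value for PS5 is the assumption stated above, not a confirmed figure, and `peak_rate` is just an illustrative helper name.

```python
def peak_rate(per_clock: float, clock_ghz: float) -> float:
    """Peak rate in giga-units per second: units per clock x clock in GHz."""
    return per_clock * clock_ghz

ps5_fill = peak_rate(64, 2.23)   # 64 ROPs, assuming 1 pixel per ROP per clock
ps5_tris = peak_rate(2, 2.23)    # assumes 2 tri/clk for RDNA2 (speculative)
pro_tris = peak_rate(4, 0.911)   # PS4 Pro: 4 tri/clk at 911 MHz

print(f"PS5 fillrate:  {ps5_fill:.1f} Gpix/s")
print(f"PS5 triangles: {ps5_tris:.2f} GTri/s")
print(f"Pro triangles: {pro_tris:.2f} GTri/s")
print(f"PS5 / Pro triangle ratio: {ps5_tris / pro_tris:.2f}")
```

Under these assumptions the PS5 peak triangle rate comes out roughly 1.2x the Pro's, despite being half as wide per clock.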

Triangle throughput has also increased from the PS4 Pro. And RDNA2 Raster Units apparently are closer to peak performance and therefore more efficient:
https://forum.beyond3d.com/threads/...6900-xt-2020-10-28.62091/page-53#post-2176773
Doesn't sound like this is a limitation.

iroboto said:
Interesting; okay, so I was thinking micro geometry - as in triangles that are < 16 pixels in size, so 8, 4 and 1 px explicitly and sub pixel triangles. These types of triangles tend to outright destroy the rasterization system entirely.
So you're basically saying there is an advantage here with larger triangles essentially. Or effectively, lower fidelity geometry, etc.
Yea that's a possibility of a small win here for RDNA 2 devices. Having hardware for actual micro-geometry (1 pixel triangles) where no one else did, I suspect would put the fidelity and performance well above the gap we see today. Typically you'll just stall the pipeline at resolutions above 1080p at least looking at some older GPU architectures.

I'm expecting a normal distribution of triangle sizes centred on roughly 16 fragments. From 32 down to 1 fragment triangles, I'm still expecting better performance than RDNA1 as well. Will need benchmarks to see. As games progress, triangles are getting smaller, so with a new Raster Unit with two scan converters per triangle, I'd expect efficiency gains for both larger and smaller triangles.

iroboto said:
Part of the reason I've been looking for this is because, IIRC, Demon's Souls uses very fine triangles, as mentioned in that developer diary. So I'm not exactly certain on their sizing just yet.

Was this the DF interview with Bluepoint? I recall the devs saying they would tessellate wireframe meshes down to triangles so small that you couldn't tell the difference - implying triangles are down to pixel sizes. It's a shame the interview didn't follow up on whether this was done with the new Geometry Engine and its capabilities.

3dilettante said:
Can you provide a reference again to the coarse/fine scan converter discussion?
The usage of the term for coarse seems a little out of line from what the term coarse rasterization would imply. Coarse rasterization's typical use case would yield no coverage information usable for wavefront launch.

https://forum.beyond3d.com/threads/...6900-xt-2020-10-28.62091/page-65#post-2177723
Coarse rasteriser (scan converter) is feeding multiple fine rasterisers. Not sure on the topology - parallel, series or combination of scan converters.

3dilettante said:
Coarse rasterization would help determine which tiles or regions of screen space may have coverage by a primitive. In hardware or primitive shader, it may help determine which shader engines are passed a primitive for additional processing. I wonder if some coarse rasterization checks can be handled by the geometry processor.
However, knowing a general region may be touched by a triangle isn't sufficient to generate coverage information for a pixel shader, so subsequent rasterization at finer granularity would be needed.

Using two scan converters to cover the same triangle may not pair well with their usage model with RDNA1. A wavefront only references one packer, which wouldn't align with more than one scan converter being involved in a wavefront's launch since packers are per-SC. However, a shader engine so far has hardware that works on launching one wavefront at a time.

For example for Navi21: 1 Shader Engine has 2 scan converters, working on 1 triangle. You could arrange those scan converters in a number of coarse and fine arrangements, and feed the Shader Arrays appropriately. With 4 Shader Engines and 8 scan converters working on 4 triangles, you could have some kind of network of coarse and fine rasterisation feeding 8 Shader Arrays.

3dilettante said:
It would become a problem in instances where one of the backwards compatibility modes is invoked, and iso-clock the PS5 would lose to the prior console it's trying to emulate.

From Github, Prim Legacy is still 4 triangles per cycle and depending on clocks for BC, you'll have enough triangle throughput. You could possibly use the NGG Legacy path as well. Perhaps the scan converters can be arranged to work on 1 triangle with 1 scan converter in Legacy mode, and 1 triangle with 2 scan converters in Fast/ Native mode.

3dilettante said:
NGG is a reorganization of the internal shaders used for geometry processing, which seems to include hardware and compiler changes. Primitive shaders are a part of it, but there were other changes like the merging of several internal shader types that were discussed separately--at least for prior GPU generations.

NGG Primitive Shaders still use fixed-function scan converters? IIRC, Raster Units are still involved, but not Prim Units.

3dilettante said:
Legacy hardware for the PS4 generation wouldn't have the ability to cull 2 primitives per cycle, so the 4 to 8 jump could come just from the marketed 8 pre-cull figure for the geometry processor in RDNA.
The specific tests have different settings, which may impact what is being measured. The specific format being processed and culling settings can change behavior, and 3.3 primitives from a 10-vertex peak rate makes sense assuming 3 vertices/triangle.

The suffixes aren't very clear in that Github table. However, there's only 1 NGG Legacy entry. This could be for PS4 Pro culling BC. Why have missing corresponding entries - pre-cull 8 for NGG Legacy and post-cull 4 for NGG Fast/ Native? Without distinction, I assume they correspond.

If NGG Legacy is pre-cull = 8 prim/cycle
And NGG Fast/ Native is pre-cull = 4 prim/cycle

Then NGG Fast/ Native is post-cull = 2 prim/ cycle
This feeds Prim Fast/ Native at 2 triangles per cycle - assumed PS5 fixed-function RDNA2 Raster Units

3dilettante said:
It may be a different mode, or it's not measuring the same facet of the front end as the others.

Yes, the suffixes/ labels are not explicit.

3dilettante said:
Given that the tests may not be comparable to one another, why assume a different value than what is given? Which entries say 2 primitives/clk?

From the Github leak, which is mainly for backwards compatibility, there is an entry missing, "peak Prim Fast". This is assumed to be the fixed-function native capability for rasterisation/ scan conversion and is absent, so in my previous post and the post above I deduced it in 2 different ways - as 2 triangles per clock.

3dilettante · Dec 7, 2020

j^aws said:
https://forum.beyond3d.com/threads/...6900-xt-2020-10-28.62091/page-65#post-2177723
Coarse rasteriser (scan converter) is feeding multiple fine rasterisers. Not sure on the topology - parallel, series or combination of scan converters.

That's different from having a coarse rasterizer working alongside a fine rasterizer. The coarse rasterizer is routing primitives to multiple finer rasterizers; the coarse rasterizer itself doesn't provide coverage, since it doesn't sample finely enough to give the appropriate coverage information.

For example for Navi21: 1 Shader Engine has 2 scan converters, working on 1 triangle. You could arrange those scan converters in a number of coarse and fine arrangements, and feed the Shader Arrays appropriately. With 4 Shader Engines and 8 scan converters working on 4 triangles, you could have some kind of network of coarse and fine rasterisation feeding 8 Shader Arrays.

Coarse rasterization would give a general screen space tile or tiles that a primitive might cover. That's too broad for the pixel coverage needed for a pixel shader wavefront's launch. What coarse rasterization can do is provide information on which rasterizers may be responsible for providing the per-pixel coverage data needed.

From Github, Prim Legacy is still 4 triangles per cycle and depending on clocks for BC, you'll have enough triangle throughput.

If it's 4 triangles per clock then I don't see the point in speculating about a 2-triangle per clock arrangement. Per the BC testing and some of the discussion of the boost and non-boost forms for the PS5's backwards compatibility, a fallback to PS4 or PS4 Pro clocks would not work very well if the hardware was half as wide.

NGG Primitive Shaders still use fixed-function scan converters? IIRC, Raster Units are still involved, but not Prim Units.

As of the last time AMD discussed them with any detail, yes. Vega's primitive shader still fed into a primitive assembler. Primitive shaders have optional levels of culling available, leaving it up to the primitive unit and rasterizer to make the final determination.

The suffixes aren't very clear in that Github table. However, there's only 1 NGG Legacy entry.

I'm seeing NGG Vertex legacy and NGG Prim legacy, which one are you referring to?

This could be for PS4 Pro culling BC. Why have missing corresponding entries - pre-cull 8 for NGG Legacy and post-cull 4 for NGG Fast/ Native? Without distinction, I assume they correspond.

Do you mean the ones that include the (tri list) descriptor? They may not be testing the same input mode, and the B column which isn't expanded fully in the screenshot indicates there are other possible differences.

If NGG Legacy is pre-cull = 8 prim/cycle
And NGG Fast/ Native is pre-cull = 4 prim/cycle

This is ignoring that the test names hint they aren't testing the same thing.

From the Github leak, which is mainly for backwards compatibility, there is an entry missing, "peak Prim Fast". This is assumed to be the fixed-function native capability for rasterisation/ scan conversion and is absent, so in my previous post and the post above I deduced it in 2 different ways - as 2 triangles per clock.

If fast launch is unique to NGG, the non-NGG tests wouldn't need to test it.

j^aws · Dec 7, 2020

3dilettante said:
Coarse rasterization would give a general screen space tile or tiles that a primitive might cover. That's too broad for the pixel coverage needed for a pixel shader wavefront's launch. What coarse rasterization can do is provide information on which rasterizers may be responsible for providing the per-pixel coverage data needed.

https://forum.beyond3d.com/threads/...6900-xt-2020-10-28.62091/page-53#post-2176773
A triangle can touch a single pixel or up to 32 in the above link. We still have 1 Raster Unit per Shader Engine and two scan converters per triangle. Scan Converters (rasterisers) and their detailed arrangements aren't explicit.

3dilettante said:
If it's 4 triangles per clock then I don't see the point in speculating about a 2-triangle per clock arrangement. Per the BC testing and some of the discussion of the boost and non-boost forms for the PS5's backwards compatibility, a fallback to PS4 or PS4 Pro clocks would not work very well if the hardware was half as wide.

This started with Navi21: Navi22, with half the number of Shader Engines and half the number of Scan Converters, would do half the number of triangles per cycle, at 2, and the speculation was that PS5 is similar.

The Github data isn't explicit enough to confirm or deny this, which is the point of the speculation.

3dilettante said:
Do you mean the ones that include the (tri list) descriptor? They may not be testing the same input mode, and the B column which isn't expanded fully in the screenshot indicates there are other possible differences.

Yes, I'm only looking at the Prim entries - NGG Prim Legacy, NGG Prim Fast and Prim Legacy. And I'm aware they may not be testing the same input method as it isn't explicit in the table. It makes more sense to test corresponding inputs for both Legacy and Fast. Still could be the opposite.

3dilettante said:
This is ignoring that the test names hint they aren't testing the same thing.

This isn't clear.

3dilettante said:
If fast launch is unique to NGG, the non-NGG tests wouldn't need to test it.

Yes, my point is that Github is testing BC and a corresponding Prim Fast entry is missing, so a deduction was made.

3dilettante · Dec 7, 2020

j^aws said:
https://forum.beyond3d.com/threads/...6900-xt-2020-10-28.62091/page-53#post-2176773
A triangle can touch a single pixel or up to 32 in the above link. We still have 1 Raster Unit per Shader Engine and two scan converters per triangle. Scan Converters (rasterisers) and their detailed arrangements aren't explicit.

The number of pixels a triangle covers isn't what differentiates coarse and fine rasterization. The ability of a given rasterizer stage to determine how many pixels a triangle covers determines whether it is coarse or fine rasterization. A coarse rasterizer generally cannot give pixel-level information, and might be as coarse as a screen tile or region of pixels/quads. A standard rasterizer would need to follow up with the actual coverage information, and the coarse rasterizer's output can be used to determine which rasterizers would need to do this.
A very coarse-level check was mentioned for Vega's primitive shaders, where there was an instruction used to look up how many shader engines (1:1 with rasterizer at that time) would be responsible for evaluating a given primitive. Such a coarse check might be handled with primitive shaders now, or may be part of the workload handled by the geometry processor.

This started with Navi21: Navi22, with half the number of Shader Engines and half the number of Scan Converters, would do half the number of triangles per cycle, at 2, and the speculation was that PS5 is similar.

The Github data isn't explicit enough to confirm or deny this, which is the point of the speculation.
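
To illustrate the distinction, here is a small Python sketch. It is not how AMD's hardware works internally; the tile size, the bounding-box test and the function names are all assumptions made for illustration. The coarse stage only says which tiles a triangle might touch, while the fine stage tests pixel centres against the triangle's edges (no top-left fill rule, for brevity) to produce actual coverage.

```python
from typing import Set, Tuple

Point = Tuple[float, float]
Tri = Tuple[Point, Point, Point]
TILE = 8  # hypothetical tile width and height in pixels

def edge(a: Point, b: Point, p: Point) -> float:
    """Signed area test: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def coarse_tiles(tri: Tri) -> Set[Tuple[int, int]]:
    """Conservative answer: every tile overlapped by the bounding box."""
    xs, ys = [v[0] for v in tri], [v[1] for v in tri]
    return {(tx, ty)
            for ty in range(int(min(ys)) // TILE, int(max(ys)) // TILE + 1)
            for tx in range(int(min(xs)) // TILE, int(max(xs)) // TILE + 1)}

def fine_coverage(tri: Tri) -> Set[Tuple[int, int]]:
    """Exact answer: pixels whose centres lie inside the triangle."""
    a, b, c = tri
    area = edge(a, b, c)
    if area == 0:
        return set()  # degenerate triangle covers nothing
    xs, ys = [v[0] for v in tri], [v[1] for v in tri]
    covered = set()
    for y in range(int(min(ys)), int(max(ys)) + 1):
        for x in range(int(min(xs)), int(max(xs)) + 1):
            p = (x + 0.5, y + 0.5)
            w = (edge(b, c, p), edge(c, a, p), edge(a, b, p))
            # Multiplying by area makes the test independent of winding order.
            if all(wi * area >= 0 for wi in w):
                covered.add((x, y))
    return covered

tri = ((6.0, 6.0), (10.0, 6.0), (6.0, 10.0))
tiles = coarse_tiles(tri)
pixels = fine_coverage(tri)
tiles_really_touched = {(x // TILE, y // TILE) for x, y in pixels}
print(len(tiles), len(pixels), len(tiles_really_touched))
```

For this triangle the coarse pass flags four tiles, but the fine pass finds covered pixels in only three of them, and only the fine pass yields anything usable as a coverage mask.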

The Github data gives 4 primitives per cycle natively. Its PS4 Pro mode gives 4, and the PS4 mode gives 2, which gives a decent pattern for the number of primitives the hardware can process per clock. I'm not following on why generally proven details on the actual hardware would be disputed by the settings of a different GPU.

Yes, I'm only looking at the Prim entries - NGG Prim Legacy, NGG Prim Fast and Prim Legacy. And I'm aware they may not be testing the same input method as it isn't explicit in the table. It makes more sense to test corresponding inputs for both Legacy and Fast. Still could be the opposite.

I'm not following your wording here. I was responding to your statement that there was only one legacy entry, when I found two. Your answer now gives three, and one of them isn't legacy.
If the fast case is a special case for NGG that isn't considered special for the legacy case, it wouldn't be necessary. One specific form of primitive input is being singled out, which may make it a condition for whatever fast launch is.

This isn't clear.

It's not a definitive statement, but they bothered to change the naming in a specific way, when if there were no change they wouldn't need to.

Yes, my point is that Github is testing BC and a corresponding Prim Fast entry is missing, so a deduction was made.

Perhaps if the full rows of that section were available, some of the differences may be partly explained. There are elements in the spreadsheet that explicitly measure functionality that do not have equivalents in the original hardware, like the earlier row on WGP mode LDS bandwidth or L1 graphics cache bandwidth.

j^aws · Dec 7, 2020

3dilettante said:
The number of pixels a triangle covers isn't what differentiates coarse and fine rasterization. The ability of a given rasterizer stage to determine how many pixels a triangle covers determines whether it is coarse or fine rasterization. A coarse rasterizer generally cannot give pixel-level information, and might be as coarse as a screen tile or region of pixels/quads. A standard rasterizer would need to follow up with the actual coverage information, and the coarse rasterizer's output can be used to determine which rasterizers would need to do this.
A very coarse-level check was mentioned for Vega's primitive shaders, where there was an instruction used to look up how many shader engines (1:1 with rasterizer at that time) would be responsible for evaluating a given primitive. Such a coarse check might be handled with primitive shaders now, or may be part of the workload handled by the geometry processor.

There seems to be definition vagueness. We know that the Raster Unit spits out 1-32 fragments per triangle. There are 2 scan converters involved from the driver leak. The 'coarse' and 'fine' arrangement of scan converters (rasterisers) are not detailed. We are speculating on how we get 1-32 fragments.

3dilettante said:
The Github data gives 4 primitives per cycle natively. Its PS4 Pro mode gives 4, and the PS4 mode gives 2, which gives a decent pattern for the number of primitives the hardware can process per clock. I'm not following on why generally proven details on the actual hardware would be disputed by the settings of a different GPU.

These are not explicit PS5 modes to tell us what the non-NGG, non-legacy PS5 native Raster Units are capable of. We have suffixes 'legacy', 'NGG' and 'fast' as below:

- peak prim legacy = 4 prim/clk (fixed-function)
- peak NGG legacy = 8 prim/clk (primitive shader)
- peak NGG fast = 3.3 prim/clk (weird, non-integer)
- peak NGG fast / scan conv = 4 prim/ clk (native?, primitive shader)

We are missing 'peak prim fast', the corresponding entry to 'peak prim legacy'.

As already mentioned, NGG legacy is twice NGG fast, with no extra details about them being pre or post cull.

1) You are trying to say NGG legacy is 8 because of pre cull, and NGG fast is 4 because of post cull. That's why it's higher. I'm saying we don't have that info, and it's just as valid to say they are both pre cull.

2) If they are both pre cull, why is NGG fast 4 instead of 8? Because for RDNA2 and Navi21, 8 to 4 ratio is pre to post cull ratio. But for Navi22, 4 to 2 ratio is expected for pre to post cull ratio.

Which means 2 prim/ cycle sent to 2 Raster Units for Navi22, instead of 4 prim/ cycle sent to 4 Raster Units for Navi21. If PS5 follows Navi22, then 2 prim/ cycle is expected. You are arguing 1) and I'm arguing 2) as far as I can see.
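
The two readings can be put side by side with a short Python sketch. The 2:1 pre-cull to post-cull ratio is the assumption from point 2) above, the 2.23 GHz clock is the PS5 figure used earlier in the thread, and the variable names are only illustrative.

```python
PRE_TO_POST_CULL = 2   # assumed ratio (8:4 on Navi21, 4:2 expected on Navi22)
PS5_CLOCK_GHZ = 2.23
NGG_FAST = 4           # prim/clk from the 'NGG fast / scan conv' entry

# Reading 1: the NGG fast figure is already post-cull.
reading1_post_cull = NGG_FAST
# Reading 2: the NGG fast figure is pre-cull, so halve it.
reading2_post_cull = NGG_FAST / PRE_TO_POST_CULL

for name, prims in (("1) post-cull", reading1_post_cull),
                    ("2) pre-cull", reading2_post_cull)):
    print(f"{name}: {prims:g} prim/clk -> {prims * PS5_CLOCK_GHZ:.2f} GTri/s")
```

Neither result is confirmed by the table; the sketch only shows how far apart the two interpretations land.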

3dilettante said:
I'm not following your wording here. I was responding to your statement that there was only one legacy entry, when I found two. Your answer now gives three, and one of them isn't legacy.
If the fast case is a special case for NGG that isn't considered special for the legacy case, it wouldn't be necessary. One specific form of primitive input is being singled out, which may make it a condition for whatever fast launch is.

See above, hope that is clearer.

3dilettante said:
It's not a definitive statement, but they bothered to change the naming in a specific way, when if there were no change they wouldn't need to.

It's not clear as mentioned above.

3dilettante said:
Perhaps if the full rows of that section were available, some of the differences may be partly explained. There are elements in the spreadsheet that explicitly measure functionality that do not have equivalents in the original hardware, like the earlier row on WGP mode LDS bandwidth or L1 graphics cache bandwidth.

Yes, the data isn't clear enough to explicitly confirm what I said above.

3dilettante · Dec 8, 2020

j^aws said:
There seems to be definition vagueness. We know that the Raster Unit spits out 1-32 fragments per triangle. There are 2 scan converters involved from the driver leak. The 'coarse' and 'fine' arrangement of scan converters (rasterisers) are not detailed. We are speculating on how we get 1-32 fragments.

The rasterizer stage produces coverage data for a 32/64-wide pixel shader wavefront. I'm guessing there's an assumption that the path between rasterizer and the launch hardware is sized for 32 pixels, although equality isn't necessary since there's no requirement that launches occur every cycle (GCN was 16:64). A triangle can cover more than 32 pixels, since that is a function of the dimensions of the triangle (could be screen-sized if necessary) but that would require additional wavefronts that cannot launch in the same cycle, blocking further geometry processing upstream. That coverage information is at the fine level of granularity, as coarse rasterization doesn't give an adequate answer for the individual elements in the coverage mask.
Having multiple rasterizers opens up the question of how many of them need to evaluate a given triangle. The straightforward approach would be to submit the same triangle to however many rasterizers there are, but that wastes their cycles in many cases because rasterizers are responsible for separate tiles of screen space, and most triangles touch fewer tiles than there are rasterizers. Coarse rasterization can flag which rasterizers need to be sent a triangle for coverage evaluation, which may make more sense if there are additional subdivisions in the scan conversion process beyond the original 4.
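
A rough Python sketch of that routing idea follows. The 32-pixel tile size and the 2x2 interleave of tiles across four rasterizers are purely hypothetical choices for illustration; the real screen-space mapping isn't documented here.

```python
from typing import Set, Tuple

TILE = 32             # hypothetical screen tile size in pixels
NUM_RASTERIZERS = 4   # the original four rasterizers

def owner(tx: int, ty: int) -> int:
    """Hypothetical 2x2 interleave of screen tiles across four rasterizers."""
    return (tx % 2) + 2 * (ty % 2)

def rasterizers_for(bbox: Tuple[int, int, int, int]) -> Set[int]:
    """Coarse check: rasterizers owning any tile the bounding box overlaps."""
    x0, y0, x1, y1 = bbox  # inclusive pixel bounds of the triangle
    return {owner(tx, ty)
            for ty in range(y0 // TILE, y1 // TILE + 1)
            for tx in range(x0 // TILE, x1 // TILE + 1)}

small = rasterizers_for((40, 40, 47, 45))      # sits inside one tile
straddle = rasterizers_for((60, 40, 70, 45))   # crosses one tile boundary
large = rasterizers_for((0, 0, 255, 255))      # spans many tiles

for name, ids in (("small", small), ("straddle", straddle), ("large", large)):
    skipped = NUM_RASTERIZERS - len(ids)
    print(f"{name}: send to {sorted(ids)}, {skipped} rasterizer(s) skipped")
```

With this mapping the small triangle goes to a single rasterizer instead of being broadcast to all four, which is the saving the coarse check is meant to provide.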

These are not explicit PS5 modes to tell us what the non-NGG, non-legacy PS5 native Raster Units are capable of. We have suffixes 'legacy', 'NGG' and 'fast' as below:

This is where having visibility on all the columns in that spreadsheet and their headers would give more information. I'd have to search where the data was discussed before, but there are columns for the native clocks at the time and modes coinciding with the PS4 Pro and PS4, with per-clock adjustments for things like the PS4 being half as wide as the Pro.
The pipeline for processing geometry would be present for at least 4 primitives per clock, going by the tests for the native, PS4 Pro BC mode, and 2 per clock in the compatibility mode for the PS4.
Much of the hardware is shared between the types, so I would think it would be more straightforward to maintain the throughput.

- peak prim legacy = 4 prim/clk (fixed-function)
- peak NGG legacy = 8 prim/clk (primitive shader)
- peak NGG fast = 3.3 prim/clk (weird, non-integer)
- peak NGG fast / scan conv = 4 prim/ clk (native?, primitive shader)

We are missing 'peak prim fast', the corresponding entry to 'peak prim legacy'.

The "weird" non-integer value may not be that weird if we don't delete the text related to triangle lists and the row related to there being 10 vertices per clock in NGG Vertex Fast mode.
There are 3 vertices per triangle, so I can imagine one way to get 3.3 triangles/clock from a process that supports 10 vertices/clock.
Since Fast always mentions (Tri list) it could be that it's part of the condition for the fast test, in which case we have the peak value for fast.
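
Spelled out as a minimal Python sketch, assuming a plain triangle list where no vertices are shared between triangles (the 10 vertices/clock figure is the NGG Vertex Fast row mentioned above):

```python
from fractions import Fraction

VERTS_PER_CLK = 10   # NGG Vertex Fast row
VERTS_PER_TRI = 3    # triangle list: each triangle brings its own 3 vertices

prims_per_clk = Fraction(VERTS_PER_CLK, VERTS_PER_TRI)
print(f"{prims_per_clk} = {float(prims_per_clk):.1f} prim/clk")
```

That gives 10/3, which rounds to the 3.3 prim/clk in the table.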

As already mentioned, NGG legacy is twice NGG fast, with no extra details about them being pre or post cull.

The rest of the process around primitive processing is sized for 4 primitives/clock, going by the wave launch section. Dropping NGG's throughput despite the hardware being mostly there with the legacy pipeline and wave launch path being sized for 4 doesn't seem necessary to me.

1) You are trying to say NGG legacy is 8 because of pre cull, and NGG fast is 4 because of post cull. That's why it's higher. I'm saying we don't have that info, and it's just as valid to say they are both pre cull.

The first part is a possibility I've mentioned.
I'm theorizing NGG fast's behavior may be related to the triangle list condition, which could have different constraints.
That they have different culling settings is something I've noted, though I don't know what the settings specifically mean.
The 3.3 prim/clk ratio aligns with the NGG Vertex Fast row that wasn't in the list posted, given triangles have 3 vertices.
The fractional throughput limit there may constrain the overall rate, but I don't know if that means other formats would have the same ceiling.
The Peak NGG Prim Fast(Tri list)/Scan Conv line may have some implication on the throughput or culling capabilities of the chip in fast mode since there are multiple scan converters. The BC testing's scaling from a 2 SE mode to a 4 SE mode would seem to indicate there are 4 however.

Which means 2 prim/ cycle sent to 2 Raster Units for Navi22, instead of 4 prim/ cycle sent to 4 Raster Units for Navi21. If PS5 follows Navi22, then 2 prim/ cycle is expected. You are arguing 1) and I'm arguing 2) as far as I can see.

Navi 22 is supposedly a 2 SE GPU, but the PS5's testing behavior is consistent with there being 4.

j^aws · Dec 8, 2020

3dilettante said:
The rasterizer stage produces coverage data for a 32/64-wide pixel shader wavefront. I'm guessing there's an assumption that the path between rasterizer and the launch hardware is sized for 32 pixels, although equality isn't necessary since there's no requirement that launches occur every cycle (GCN was 16:64). A triangle can cover more than 32 pixels, since that is a function of the dimensions of the triangle (could be screen-sized if necessary) but that would require additional wavefronts that cannot launch in the same cycle, blocking further geometry processing upstream. That coverage information is at the fine level of granularity, as coarse rasterization doesn't give an adequate answer for the individual elements in the coverage mask.
Having multiple rasterizers opens up the question of how many of them need to evaluate a given triangle. The straightforward approach would be to submit the same triangle to however many rasterizers there are, but that wastes their cycles in many cases because rasterizers are responsible for separate tiles of screen space, and most triangles touch fewer tiles than there are rasterizers. Coarse rasterization can flag which rasterizers need to be sent a triangle for coverage evaluation, which may make more sense if there are additional subdivisions in the scan conversion process beyond the original 4.

Are you using 'rasterisers' and 'scan converters' interchangeably?

Given a Shader Engine, there are 2 scan converters in 2 Shader Arrays, and the Raster Unit sits above SAs at the SE level processing 1 triangle. It isn't clear how we are getting 1-32 fragments still from 'coarse' and 'fine' scan converters. Is every scan converter upgraded to 1-32, or still 1-16, but in some combination? A whitepaper with pipeline breakdown would be great.

Then there are 8 Packers in each RDNA2 Shader Engine from the driver leak, so 2x2 quads would make 32 fragments being sent, which matches wave size.

3dilettante said:
This is where having visibility on all the columns in that spreadsheet and their headers would give more information. I'd have to search where the data was discussed before, but there are columns for the native clocks at the time and modes coinciding with the PS4 Pro and PS4, with per-clock adjustments for things like the PS4 being half as wide as the Pro.
The pipeline for processing geometry would be present for at least 4 primitives per clock, going by the tests for the native, PS4 Pro BC mode, and 2 per clock in the compatibility mode for the PS4.
Much of the hardware is shared between the types, so I would think it would be more straightforward to maintain the throughput.

Well, if you have a better spreadsheet that has more information, then that would make things clearer. As mentioned previously, what we have isn't explicit enough.

3dilettante said:
Since Fast always mentions (Tri list) it could be that it's part of the condition for the fast test, in which case we have the peak value for fast.

Yes, but the 'tri list' doesn't make it clear if the numbers are pre or post cull, as mentioned previously, and doesn't make it any more explicit for 'peak Prim Fast', so we are still looking at different views.

3dilettante said:
The rest of the process around primitive processing is sized for 4 primitives/clock, going by the wave launch section. Dropping NGG's throughput despite the hardware being mostly there with the legacy pipeline and wave launch path being sized for 4 doesn't seem necessary to me.

The 'wave launch' section has the first row for Work Items as 64 items/clock, which is GCN wavefronts as far as I can see. So, all subsequent entries in that section would be legacy BC, and seeing 4 prim/clk entries isn't surprising. Github was about BC, so RDNA's native capabilities are obscured.

3dilettante said:
The Peak NGG Prim Fast(Tri list)/Scan Conv line may have some implication on the throughput or culling capabilities of the chip in fast mode since there are multiple scan converters. The BC testing's scaling from a 2 SE mode to a 4 SE mode would seem to indicate there are 4 however.

Do you mean SEs as Shader Engines or Shader Arrays? What is 4 referring to? I expect 4 scan converters. Even Navi22 has 4 from the driver leak.

3dilettante said:
Navi 22 is supposedly a 2 SE GPU, but the PS5's testing behavior is consistent with there being 4.

Sorry, not following '4' - is that referring to 4 scan converters? Both Navi22 and PS5 should have the same numbers.

3dilettante · Dec 10, 2020

j^aws said:
Are you using 'rasterisers' and 'scan converters' interchangeably?

Rasterization as a process is synonymous with scan conversion; in this case I'm treating scan converters as a later stage in the rasterization hardware's functionality. Whether they're physically distinct in a manner that hasn't been indicated before isn't clear, but for the purposes of determining peak geometry rate the rasterizer block would be accepting the geometry first.

Given a Shader Engine, there are 2 scan converters in 2 Shader Arrays, and the Raster Unit sits above SAs at the SE level processing 1 triangle. It isn't clear how we are getting 1-32 fragments still from 'coarse' and 'fine' scan converters. Is every scan converter upgraded to 1-32, or still 1-16, but in some combination? A whitepaper with pipeline breakdown would be great.

Perhaps there's something about the links being used to reference posts, but I haven't seen what is supposed to indicate there are coarse and fine scan converters like you've claimed.
The amount of coverage information being used for pixel shader wavefront launch is equal to the size of the wavefront. Whether that coverage mask is filled with active lanes is based on how many pixels/quads a triangle is found to be touching, where a scan converter probably provides the information that populates the mask.
I'm not seeing a gain in restricting the amount of coverage information being generated by narrowing the scan converter output; a wavefront isn't going to launch until it has that information, irrespective of the number of pixels the triangle covers--which can be more than 32.
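
As a simplified illustration in Python: this sketch assumes pixels are shaded in 2x2 quads, that a wave32 wavefront holds 8 quads, and that one wavefront serves only one triangle. The last point is a simplification for the example, not a statement about how the hardware packs work, and the function names are made up.

```python
import math

WAVE_SIZE = 32   # lanes in a wave32 pixel shader wavefront
QUAD_LANES = 4   # pixels are shaded in 2x2 quads

def waves_needed(covered_quads: int) -> int:
    """Wavefronts required if every touched quad occupies four lanes."""
    return math.ceil(covered_quads * QUAD_LANES / WAVE_SIZE)

def lane_utilisation(covered_pixels: int, covered_quads: int) -> float:
    """Fraction of launched lanes that hold a genuinely covered pixel."""
    lanes = waves_needed(covered_quads) * WAVE_SIZE
    return covered_pixels / lanes

# A one-pixel triangle still needs a full coverage mask for one wavefront.
print(waves_needed(1), lane_utilisation(1, 1))
# A triangle fully covering a 16x16 pixel block (64 quads) needs 8 wavefronts.
print(waves_needed(64), lane_utilisation(256, 64))
```

Either way the mask is the size of the wavefront; the triangle's size only changes how many lanes are active and how many wavefronts are needed.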

The model I'm working with for now is what was documented in AMD's patent for a binning rasterizer, which is presumably the DSBR introduced with Vega. https://www.freepatentsonline.com/20190122417.pdf
What AMD has publicly described as its rasterizer covers the primitive batching module, accumulator, and a scan converter. If AMD has split or duplicated scan conversion hardware, the path from the binning and culling portion of the rasterizer would define peak geometry rate for triangles that are rendered.

Then there are 8 Packers in each RDNA2 Shader Engine from the driver leak, so 2x2 quads would make 32 fragments being sent, which matches wave size.

I'm not 100% certain on the identity of the packers in the driver leak, but if it's related to POPS packers in the ISA it's not how they would be used. A wavefront can reference a packer ID, but that ID is for all pixels in the wavefront. The point of it is to provide a way to detect that exports from different triangles' pixel shaders are hitting the same pixels, and the packer ID and the value given by that packer give the order those exports should retire in based on what sequence the triangle entered the rasterization process.

Well, if you have a better spreadsheet that has more information, then that would make things clearer. As mentioned previously, what we have isn't explicit enough.

It came up some time ago; I would need to see if any pages were attached to posts or if there are other repositories.

Yes, but the 'tri list' doesn't make it clear if the numbers are pre or post cull, as mentioned previously, and doesn't make it any more explicit for 'peak Prim Fast', so we are still looking at different views.

The vertex entry and the 3.3 throughput may indicate there is another factor. There are choices that can be made in terms of how the geometry is passed to the GPU that can have significant throughput impacts, although the most recent example of synthetics being used to check this was for Vega.

The 'wave launch' section has the first row for Work Items as 64 items/clock, which is GCN wavefronts as far as I can see. So, all subsequent entries in that section would be legacy BC. So seeing 4 prim/clk entries isn't surprising. Github was about BC, so RDNA native capabilities are obfuscated.

GCN has a 4-clock cadence, so 16 items would be brought up per clock, per shader engine.
The PS4 Pro's rate was 64, and it has 4 shader engines. The PS4's rate is 32, and it has 2 shader engines.
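
A rough sketch of that arithmetic, assuming a 64-wide GCN wavefront issued over a 4-clock cadence (the constant and function names are just for illustration, they are not from the Github data):

```python
WAVE_SIZE = 64       # GCN wavefront width
CADENCE_CLOCKS = 4   # a wavefront is issued over four clocks

def work_items_per_clock(shader_engines: int) -> int:
    """Peak work items launched per clock across all shader engines."""
    per_se = WAVE_SIZE // CADENCE_CLOCKS  # 16 items/clk per shader engine
    return per_se * shader_engines

print("PS4 (2 SEs):", work_items_per_clock(2))
print("PS4 Pro (4 SEs):", work_items_per_clock(4))
```

Read the same way, a 64 items/clock row would correspond to four shader engines' worth of launch hardware, if the per-SE rate is unchanged.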

Do you mean SEs as Shader Engines or Shader Arrays? What is 4 referring to? I expect 4 scan converters. Even Navi22 has 4 from the driver leak.

SE means shader engine, Navi 22 has two SEs in the leak.
At least so far, shader launch has seemed to be part of the hardware that is actually at the shader engine level, versus shader array.

Sorry, not following '4' - is that referring to 4 scan converters? Both Navi22 and PS5 should have the same numbers.

Shader engines.

eastmen · Dec 10, 2020

This is why I believe the digital-only versions of the consoles will be the most popular moving forward, and why I think console refreshes or next-gen consoles won't have Bluray drives.

I also think for next gen post PS5/Xbox Series, Bluray makes zero sense, as that would be the 4th generation with the same storage format and it's going to have issues for storage.

Bluray is 25/50 GB; XL is 100/128 GB.
Bluray has a maximum speed of 72MB/s @16x. I am not sure if XL discs can read faster; I can't find the info.
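
For a sense of scale, here is a back-of-the-envelope sketch of how long a full disc read would take at that peak rate. It assumes the drive sustains 16x across the whole disc, which real drives generally don't, and it assumes XL discs would read at the same rate, which is unconfirmed. The function name is illustrative.

```python
BD_1X_MB_PER_S = 4.5  # 1x Blu-ray is 36 Mbit/s, i.e. 4.5 MB/s

def full_read_minutes(capacity_gb: float, speed_multiple: int = 16) -> float:
    """Best-case minutes to read a whole disc at a constant peak rate."""
    rate_mb_per_s = BD_1X_MB_PER_S * speed_multiple  # 72 MB/s at 16x
    return capacity_gb * 1000 / rate_mb_per_s / 60

for capacity in (25, 50, 100, 128):
    print(f"{capacity} GB: {full_read_minutes(capacity):.1f} min")
```

Even in this best case a 100 GB disc takes over 20 minutes to read end to end.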

About the only thing Bluray has going for it is the disc cost. But it's going to be harder and harder to justify that disc cost when 75% of customers buy digital. The drive price is a price that is carried with every console sold that has one.

I think if we see any physical format in the next generation of consoles it will be some NAND cart. 128 gigs of NAND continues to get cheaper and cheaper, with some on Black Friday dipping under $10 and 256 gigs dipping to $20. Fast forward another 6 years and what would we have?

Pete · Dec 10, 2020

IIRC, the PC/console preorder split was 59/41. Assuming virtually no one on PC got a physical version, that’s 26/41=63% physical on console. 63%*41%*8mil=2mil copies, nothing to sneeze at.
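
Spelling that arithmetic out as a quick sketch. It assumes the "26" is the physical share of all preorders, the 8 million total is the figure used above, and the function name is only for illustration:

```python
def physical_console_copies(total=8_000_000,
                            console_share=0.41,
                            physical_share_of_total=0.26):
    """Return (physical fraction of console preorders, physical copies).

    Assumes every physical copy was sold on console, none on PC.
    """
    physical_on_console = physical_share_of_total / console_share
    copies = physical_on_console * console_share * total
    return physical_on_console, copies

fraction, copies = physical_console_copies()
print(f"{fraction:.0%} physical on console, about {copies / 1e6:.1f} million copies")
```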

Still, a $400 PS5 DE is nothing to sneeze at, either.

Globalisateur · Dec 10, 2020

eastmen said:
About the only thing Bluray has going for it is the disc cost. But it's going to be harder and harder to justify that disc cost when 75% of customers buy digital. The drive price is a price that is carried with every console sold that has one.

Many customers still can't realistically download 100GB games. I know I can't.

I think if we see any physical format in the next generation of consoles it will be some NAND cart. 128 gigs of NAND continues to get cheaper and cheaper, with some on Black Friday dipping under $10 and 256 gigs dipping to $20. Fast forward another 6 years and what would we have?

Game sizes of >300GB and even greedier publishers.

thicc_gaf · Dec 11, 2020

eastmen said:
This is why I believe the digital-only versions of the consoles will be the most popular moving forward, and why I think console refreshes or next-gen consoles won't have Bluray drives.

I also think for next gen post PS5/Xbox Series, Bluray makes zero sense, as that would be the 4th generation with the same storage format and it's going to have issues for storage.

Bluray is 25/50 GB; XL is 100/128 GB.
Bluray has a maximum speed of 72MB/s @16x. I am not sure if XL discs can read faster; I can't find the info.

About the only thing Bluray has going for it is the disc cost. But it's going to be harder and harder to justify that disc cost when 75% of customers buy digital. The drive price is a price that is carried with every console sold that has one.

I think if we see any physical format in the next generation of consoles it will be some NAND cart. 128 gigs of NAND continues to get cheaper and cheaper, with some on Black Friday dipping under $10 and 256 gigs dipping to $20. Fast forward another 6 years and what would we have?

NAND-based USB carts could definitely be a thing for at least one of the 10th-gen systems. Already in arcade/FEC markets, there are systems like the exa-Arcadia which use something like this as physical media for game delivery.

Microsoft already has some proto-form of this on the market with the expansion cards. A few years from now, cut down on the capacity a ton, keep the same bandwidth spec, maybe tune the decompression capabilities a bit, and 64 GB/128 GB, maybe even 256 GB cards for physical delivery can be doable at affordable prices. Though they'll probably still cost a bit more than an equivalent Blu-Ray disc by that time.

Anyway about Cyberpunk, the split is interesting but not surprising. Been reading a lot about the glitches and performance issues though, even on cards like the 3080 it seems to be very unoptimized. Hopefully by the time the next-gen patch is ready the glitches and performance issues will be fixed.

Globalisateur said:
Many customers still can't realistically download 100GB games. I know I can't.

Game sizes of >300GB and even greedier publishers.

xD that's a true if pessimistic outlook on that note. But even if any type of NAND-based USB cart doesn't happen, let's at least hope SSD sizes will be a lot bigger.

4 TB should be completely doable by then, probably at a lower cost than the 1 TB/768 GB equivalents cost the 9th-gen systems.

fehu · Dec 11, 2020

If you must lose compatibility with the disc, go directly to digital only.
The customers that you lose are nothing compared to the cost of maintaining an expensive niche format.

liams · Dec 11, 2020

They could just use straight-up USB memory sticks with a read-only mode on them. The benefit of using a USB-based solution and still requiring an install process is that you could introduce it as a distribution format whenever you wanted; you don't need to wait for a hard cut gen to gen.

Jay · Dec 11, 2020

liams said:
They could just use straight-up USB memory sticks with a read-only mode on them. The benefit of using a USB-based solution and still requiring an install process is that you could introduce it as a distribution format whenever you wanted; you don't need to wait for a hard cut gen to gen.

Disks are hardly even a distribution format now.
Not when day 1 patches are the same size as the install or only part is actually on the disk to begin with.

What I've always thought places like GameStop should do is provide a service where they download the games, and you can either rent the USB drive or bring your own and copy the game to it.
So still digital, but for people with data caps or bad net, they just pay a couple of bucks to use theirs. (Obviously they would have loads of games already downloaded, etc.)

liams · Dec 11, 2020

Jay said:
Disks are hardly even a distribution format now.
Not when day 1 patches are the same size as the install or only part is actually on the disk to begin with.

What I've always thought places like GameStop should do is provide a service where they download the games, and you can either rent the USB drive or bring your own and copy the game to it.
So still digital, but for people with data caps or bad net, they just pay a couple of bucks to use theirs. (Obviously they would have loads of games already downloaded, etc.)

I think we will see that on the Xbox side with GameStops in the near future. Once the expandable storage cards come down in price a bit and the world is a bit less covidy, I think they will roll something like that out. I could see them wanting to delay that at the moment because they don't want to be seen encouraging people to risk their health by going into stores. The expandable storage cards for the Xboxes are perfect for this. I could even see it being a free service: if Microsoft provided the kiosks for free to GameStop and GameStop provided the floor space for free, it's just another reason for customers to go into a GameStop, giving them more opportunity for sales.

Plus you can download and install any xbox game now, even if you don't own it, which would help with this. You wouldn't have to own every game that you transfer, so if you saw something cool you could just copy it over.

Rodéric · Dec 11, 2020

I think ROMs (cartridges or whatever read-only electronic system) are better than discs, and digital-only is not such a good idea; a physical version should always be an option IMO.
(Some people buy games at release, finish them and resell them to play more games. There's no reason to prevent them, and discs are sooo slow...)
