The definitions seem vague. We know that the Raster Unit spits out 1-32 fragments per triangle. Going by the driver leak, there are 2 scan converters involved. The 'coarse' and 'fine' arrangement of scan converters (rasterisers) is not detailed, so we are speculating on how we get 1-32 fragments.
The rasterizer stage produces coverage data for a 32/64-wide pixel shader wavefront. I'm guessing there's an assumption that the path between rasterizer and the launch hardware is sized for 32 pixels, although equality isn't necessary since there's no requirement that launches occur every cycle (GCN was 16:64). A triangle can cover more than 32 pixels, since that is a function of the dimensions of the triangle (could be screen-sized if necessary) but that would require additional wavefronts that cannot launch in the same cycle, blocking further geometry processing upstream. That coverage information is at the fine level of granularity, as coarse rasterization doesn't give an adequate answer for the individual elements in the coverage mask.
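To put rough numbers on the "more than 32 pixels needs additional wavefronts" point, here is a minimal sketch. It assumes coverage is packed perfectly into waves, which ignores 2x2 quad granularity and helper lanes, and `waves_needed` is just an illustrative name, not anything from the leak.

```python
import math

def waves_needed(covered_pixels: int, wave_width: int = 32) -> int:
    """Pixel shader wavefronts needed to hold a triangle's covered pixels.

    Assumes ideal packing (no quad padding, no helper lanes), so this is
    a lower bound on what real hardware would launch.
    """
    if covered_pixels <= 0:
        return 0
    return math.ceil(covered_pixels / wave_width)

print(waves_needed(20))               # fits in one 32-wide wave
print(waves_needed(33))               # spills into a second wave
print(waves_needed(1920 * 1080, 64))  # screen-sized triangle, 64-wide waves
```

Anything past the first wave cannot launch in the same cycle, which is where the upstream stall comes from.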
Having multiple rasterizers opens up the question of how many of them need to evaluate a given triangle. The straightforward approach would be to submit the same triangle to however many rasterizers there are, but that wastes their cycles in many cases because rasterizers are responsible for separate tiles of screen space, and most triangles touch fewer tiles than there are rasterizers. Coarse rasterization can flag which rasterizers need to be sent a triangle for coverage evaluation, which may make more sense if there are additional subdivisions in the scan conversion process beyond the original 4.
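As a rough illustration of the coarse step flagging which rasterizers need a triangle, here is a sketch under loud assumptions: 32x32 pixel tiles, 4 rasterizers interleaved in a 2x2 checkerboard, and a bounding-box test instead of a real edge test. None of those specifics come from the leak; `tile_owner` and `rasterizers_touched` are hypothetical names.

```python
TILE_SIZE = 32  # assumed tile edge in pixels, purely illustrative

def tile_owner(tx: int, ty: int) -> int:
    """Assumed mapping: 4 rasterizers own tiles in a 2x2 checkerboard."""
    return (tx % 2) + 2 * (ty % 2)

def rasterizers_touched(bbox):
    """Rasterizer ids whose tiles overlap a triangle's bounding box.

    bbox = (xmin, ymin, xmax, ymax) in pixels, inclusive. A bounding box
    is conservative: it can flag a rasterizer the triangle never covers.
    """
    xmin, ymin, xmax, ymax = bbox
    owners = set()
    for ty in range(ymin // TILE_SIZE, ymax // TILE_SIZE + 1):
        for tx in range(xmin // TILE_SIZE, xmax // TILE_SIZE + 1):
            owners.add(tile_owner(tx, ty))
    return owners

print(rasterizers_touched((3, 3, 10, 10)))    # small triangle, one tile
print(rasterizers_touched((30, 3, 40, 10)))   # straddles a tile boundary
print(rasterizers_touched((0, 0, 100, 100)))  # large triangle, all four
```

With a mapping like this, a small triangle only costs one rasterizer a cycle instead of all four.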
These are not explicit PS5 modes that tell us what the non-NGG, non-legacy PS5 native Raster Units are capable of. We have the suffixes 'legacy', 'NGG' and 'fast', as below:
This is where having visibility on all the columns in that spreadsheet and their headers would give more information. I'd have to search where the data was discussed before, but there are columns for the native clocks at the time and modes coinciding with the PS4 Pro and PS4, with per-clock adjustments for things like the PS4 being half as wide as the Pro.
The pipeline for processing geometry would be present for at least 4 primitives per clock, going by the tests for the native and PS4 Pro BC modes, and 2 per clock in the compatibility mode for the PS4.
Much of the hardware is shared between the types, so I would think it would be more straightforward to maintain the throughput.
- peak prim legacy = 4 prim/clk (fixed-function)
- peak NGG legacy = 8 prim/clk (primitive shader)
- peak NGG fast = 3.3 prim/clk (weird, non-integer)
- peak NGG fast / scan conv = 4 prim/clk (native?, primitive shader)
We are missing 'peak prim fast', the corresponding entry to 'peak prim legacy'.
The "weird" non-integer value may not be that weird if we don't delete the text related to triangle lists and the row related to there being 10 vertices per clock in NGG Vertex Fast mode.
There are 3 vertices per triangle, so I can imagine one way to get 3.3 triangles/clock from a process that supports 10 vertices/clock.
Since Fast always mentions (Tri list) it could be that it's part of the condition for the fast test, in which case we have the peak value for fast.
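The arithmetic behind that, as a quick sketch. The 10 vertices/clock figure is the NGG Vertex Fast row mentioned above; the assumption that a triangle list consumes 3 vertices per triangle with no reuse is mine, and `prims_per_clock` is just an illustrative helper.

```python
def prims_per_clock(vertex_rate: float, verts_per_prim: int) -> float:
    """Primitive rate if vertex throughput is the only limit."""
    return vertex_rate / verts_per_prim

# Triangle list: 3 vertices consumed per triangle, no sharing assumed.
tri_list_rate = prims_per_clock(10, 3)
print(f"{tri_list_rate:.1f} prim/clk")  # 3.3 prim/clk
```

A format with vertex reuse would not necessarily hit the same ceiling, which is why the (Tri list) condition matters for how the number is read.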
As already mentioned, NGG legacy is twice NGG fast, with no extra details about them being pre or post cull.
The rest of the process around primitive processing is sized for 4 primitives/clock, going by the wave launch section. Dropping NGG's throughput doesn't seem necessary to me, given that the hardware is mostly there with the legacy pipeline and the wave launch path is sized for 4.
1) You are trying to say NGG legacy is 8 because of pre cull, and NGG fast is 4 because of post cull. That's why it's higher. I'm saying we don't have that info, and it's just as valid to say they are both pre cull.
The first part is a possibility I've mentioned.
I'm theorizing NGG fast's behavior may be related to the triangle list condition, which could have different constraints.
That they have different culling settings is something I've noted, though I don't know what the settings specifically mean.
The 3.3 prim/clk ratio aligns with the NGG Vertex Fast row that wasn't in the list posted, given triangles have 3 vertices.
The fractional throughput limit there may constrain the overall rate, but I don't know if that means other formats would have the same ceiling.
The Peak NGG Prim Fast(Tri list)/Scan Conv line may have some implication for the throughput or culling capabilities of the chip in fast mode, since there are multiple scan converters. However, the BC testing's scaling from a 2 SE mode to a 4 SE mode would seem to indicate there are 4.
Which means 2 prim/cycle sent to 2 Raster Units for Navi22, instead of 4 prim/cycle sent to 4 Raster Units for Navi21. If PS5 follows Navi22, then 2 prim/cycle is expected. You are arguing 1) and I'm arguing 2) as far as I can see.
Navi 22 is supposedly a 2 SE GPU, but the PS5's testing behavior is consistent with there being 4.