Rasterization as a process is synonymous with scan conversion; in this case I'm treating scan converters as a later stage in the rasterization hardware's functionality. Whether they're physically distinct in a way that hasn't been indicated before isn't clear, but for the purposes of determining peak geometry rate, the rasterizer block would be accepting the geometry first.
Yes, the Raster Unit determines the peak geometry rate, but I am discussing the Scan Converters specifically: the change in fragment output per triangle from 1-16 for RDNA1 to 1-32 for RDNA2.
Perhaps there's something about the links being used to reference posts, but I haven't seen what is supposed to indicate there are coarse and fine scan converters like you've claimed.
In the other thread, discussion went:
- Navi 21 (RDNA2) has 8 Scan Converters, twice as many as Navi10 (RDNA1)
- Yet, both Navi21 and Navi10 have 4 Raster Units
- And both have peak geometry at 4 triangles per clock
- Later clarified that triangle fragment coverage for Navi21 increased from 1-16 to 1-32, and that geometry throughput is closer to peak
- Extra detail provided that 'coarse rasteriser' and 'fine rasterisers' are involved
The explanations centered on how twice as many Scan Converters are involved in converting 1 triangle to 1-32 fragments.
The amount of coverage information being used for pixel shader wavefront launch is equal to the size of the wavefront. Whether that coverage mask is filled with active lanes is based on how many pixels/quads a triangle is found to be touching, where a scan converter probably provides the information that populates the mask.
This coverage information being equal to the size of the wavefront being launched doesn't look the same across architectures. For example:
- GCN, 1-16 fragments, Wave64 across SIMD-16 over 4 cycles
- RDNA1, 1-16 fragments, Wave32 across SIMD-32 over 1 cycle?
- RDNA2, 1-32 fragments, Wave32 across SIMD-32 over 1 cycle?
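To put the three cases side by side, here is a minimal Python sketch of the arithmetic. It assumes (my simplification, not anything from AMD documentation) that a wavefront issues over wave size / SIMD width cycles, and that the upper end of the fragment range is the most lanes one triangle can fill from a single scan converter output. The names `issue_cycles`, `max_fill`, and `CASES` are made up for illustration.

```python
import math

def issue_cycles(wave_size, simd_width):
    """Cycles to issue one wavefront on one SIMD (assumed: ceil(wave / SIMD width))."""
    return math.ceil(wave_size / simd_width)

def max_fill(max_fragments, wave_size):
    """Largest fraction of a wavefront's lanes that one triangle's fragments could fill."""
    return max_fragments / wave_size

# (name, max fragments per triangle, wave size, SIMD width), taken from the list above
CASES = [
    ("GCN", 16, 64, 16),
    ("RDNA1", 16, 32, 32),
    ("RDNA2", 32, 32, 32),
]

for name, frags, wave, simd in CASES:
    print(f"{name}: {issue_cycles(wave, simd)} cycle(s) per wave, "
          f"up to {max_fill(frags, wave):.0%} of a Wave{wave} from one triangle")
```

Under those assumptions, RDNA2 is the only case where the fragment range lines up one to one with the wave size.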
I'm not seeing the gain in restricting the amount of coverage information being generated by narrowing the scan converter output. A wavefront isn't going to launch until it has that information, irrespective of the number of pixels the triangle covers, which can be more than 32.
Well, if 'coarse' rasterisation is sending 'partial' triangles to 'fine' scan converters, coverage should be less sparse, so more fragments are produced per scan converter. That also means more occupancy for the SIMD-32 units and better efficiency.
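As a rough illustration of what sparse versus dense coverage means for occupancy, here is a toy Python sketch. It is a plain edge-function coverage test over a hypothetical 8x4 pixel tile (32 pixels, one per Wave32 lane), not a model of AMD's scan converters; the tile shape, the lane mapping, and the function names are all assumptions.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def coverage_mask(tri, tile_w=8, tile_h=4):
    """Bitmask with one bit per pixel of the tile whose centre is inside the triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    mask = 0
    for y in range(tile_h):
        for x in range(tile_w):
            px, py = x + 0.5, y + 0.5  # sample at the pixel centre
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            # Accept either winding order; points exactly on an edge count as covered.
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            if inside:
                mask |= 1 << (y * tile_w + x)
    return mask

def occupancy(mask, wave_size=32):
    """Fraction of wavefront lanes that would be active for this mask."""
    return bin(mask).count("1") / wave_size

small = ((0, 0), (2.5, 0), (0, 2.5))    # touches only a few pixels of the tile
large = ((-1, -1), (20, -1), (-1, 20))  # covers the whole tile

for name, tri in (("small", small), ("large", large)):
    m = coverage_mask(tri)
    print(f"{name}: mask={m:#010x} occupancy={occupancy(m):.1%}")
```

A small triangle lights up only a handful of bits in the mask, so most lanes would sit idle, while a triangle spanning the tile fills every lane.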
The model I'm working with for now is what was documented in AMD's patent for a binning rasterizer, which is presumably the DSBR introduced with Vega.
https://www.freepatentsonline.com/20190122417.pdf
What AMD has publicly described as its rasterizer covers the primitive batching module, accumulator, and a scan converter. If AMD has split or duplicated scan conversion hardware, the path from the binning and culling portion of the rasterizer would define peak geometry rate for triangles that are rendered.
Something has changed for RDNA2, as we still have the same geometry throughput but double the number of scan converters.
BTW, that patent doesn't clarify Scan Conversion or the 1-32 fragment output. It's more of an efficient primitive culling algorithm with screen space tiling and depth testing before fragment shading. It looks like a TBDR-type stage incorporated into an IMR. Interestingly, the primitive batching module is sending partial triangles to the scan converter, as mentioned above.
This patent discusses hardware that makes fragment shading more efficient and saves bandwidth by binning and by using hidden surface removal with the accumulator and hierarchical-Z depth testing, with deferred shading like a TBDR. To make these fixed-function units more effective, faster GPU clocks now make sense for RDNA2 and PS5.
I'm not 100% certain on the identity of the packers in the driver leak, but if it's related to POPS packers in the ISA it's not how they would be used. A wavefront can reference a packer ID, but that ID is for all pixels in the wavefront. The point of it is to provide a way to detect that exports from different triangles' pixel shaders are hitting the same pixels, and the packer ID and the value given by that packer give the order those exports should retire in based on what sequence the triangle entered the rasterization process.
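A toy Python sketch of that ordering idea, purely to show the concept: exports that land on the same pixel are retired in the order their triangles entered rasterization. The tuple layout, the `retire_order` name, and the grouping by (pixel, packer ID) are my assumptions for illustration and are not taken from the ISA document.

```python
from collections import defaultdict

def retire_order(exports):
    """Group exports by (pixel, packer_id) and order each group by entry sequence.

    exports: iterable of (pixel, packer_id, sequence, label), where `sequence`
    stands in for the ordering value handed out by the packer (hypothetical).
    """
    groups = defaultdict(list)
    for pixel, packer_id, sequence, label in exports:
        groups[(pixel, packer_id)].append((sequence, label))
    # Lower sequence means the triangle entered rasterization earlier, so it retires first.
    return {key: [label for _, label in sorted(items)]
            for key, items in groups.items()}

exports = [
    ((3, 4), 0, 2, "triB"),  # arrives first but entered rasterization second
    ((3, 4), 0, 1, "triA"),
    ((5, 5), 0, 3, "triC"),  # no overlap, nothing to order against
]
print(retire_order(exports))
```

The point is only that the ordering value is shared by every pixel in a wavefront, so it orders whole wavefronts' exports rather than selecting individual lanes.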
In the driver leak, these Packers are associated with Scan Converters, and the number of Packers per Scan Converter has doubled from RDNA1 to RDNA2. Since both fragments per cycle and scan converter coverage have doubled in RDNA2, these Packers seem related.
GCN has a 4-clock cadence, so 16 items would be brought up per clock, per shader engine.
For the above, that's Wave64 per SIMD-16 over 4 clock cycles on native GCN. The Wave64 and the 16 items are per SIMD-16 unit (there are 4 in a GCN CU), not per shader engine.
RDNA1 has a legacy mode where 2 Wave32s across 2 SIMD-32 units per cycle would achieve 64 work items per cycle in Wave64 mode.
The PS4 Pro's rate was 64, and it has 4 shader engines. The PS4's rate is 32, and it has 2 shader engines.
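Taking those figures at face value, the per-engine arithmetic works out as in this short Python sketch. The `per_engine_rate` helper is just an illustrative name, and the shader engine counts are the ones quoted here, not independently verified.

```python
def per_engine_rate(total_rate, shader_engines):
    """Items per clock per shader engine, given a chip-wide rate."""
    return total_rate / shader_engines

# (chip-wide rate, shader engine count) as quoted in the post
consoles = {"PS4 Pro": (64, 4), "PS4": (32, 2)}

for name, (rate, engines) in consoles.items():
    print(f"{name}: {per_engine_rate(rate, engines):g} per clock per shader engine")
```

Both come out to 16 per clock per shader engine, which is the figure the 4-clock cadence argument rests on.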
PS4 Pro has 4 Shader Engines? I'm sure this is meant as 4 Shader Arrays. Likewise, PS4 has 2 Shader Arrays rather than 2 Shader Engines.
Likewise, Navi22 and PS5 have 4 Shader Arrays rather than 4 Shader Engines. Navi21 has 4 Shader Engines and 8 Shader Arrays for contrast.