I can see more than 2 per clock, so that cannot be the hard limit. They are probably running into some other bottleneck that keeps them from getting clearly above 3, let alone approaching 4.
edit: Just re-read the whitepaper. AMD says explicitly that each primitive unit can cull 2 triangles per clock and draw 1 (output to the rasterizer). Each of the four rasterizers can process 1 triangle per clock, test it for coverage and emit 16 pixels per clock. I haven't seen culled-triangle rates much above 8 GTri/s though; maybe the prim units aren't fed quickly enough, or the test runs into another bottleneck.
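As a rough sanity check on those numbers, here's a back-of-the-envelope calculation (the 4 primitive units and ~1.9 GHz clock are my assumptions for a Navi 10-class part, not figures from the whitepaper):

```python
# Back-of-the-envelope peak geometry rates. The unit counts and clock are
# assumptions for a Navi 10-class part, not numbers from the whitepaper.
PRIM_UNITS  = 4
RASTERIZERS = 4
CLOCK_GHZ   = 1.9

CULL_PER_CLK = 2   # culled triangles per primitive unit per clock
DRAW_PER_CLK = 1   # triangles output to the rasterizer per clock
PIX_PER_CLK  = 16  # pixels emitted per rasterizer per clock

peak_cull = PRIM_UNITS * CULL_PER_CLK * CLOCK_GHZ    # ~15.2 GTri/s
peak_draw = PRIM_UNITS * DRAW_PER_CLK * CLOCK_GHZ    # ~7.6 GTri/s
peak_fill = RASTERIZERS * PIX_PER_CLK * CLOCK_GHZ    # ~121.6 Gpix/s

print(f"peak culled: {peak_cull:.1f} GTri/s")
print(f"peak drawn:  {peak_draw:.1f} GTri/s")
print(f"peak fill:   {peak_fill:.1f} Gpix/s")
print(f"measured ~8 GTri/s culled is {8 / peak_cull:.0%} of the culling peak")
```

So the ~8 GTri/s figure is only a bit over half of the nominal culling peak, which is why it smells like a feeding or test-setup bottleneck rather than the prim units themselves.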
The efficiency gap in AMD's front end has been a topic of debate for generations. I think the first questions about scaling came up in the VLIW era, when the first "dual rasterizer" models were released and AMD didn't seem to benefit all that much from the second unit.
Fast-forward through years of product releases and the move to 4 rasterizers, and AMD fell even further from its theoretical peak.
The most recent GPUs did seem to catch up to the competition in a number of targeted benchmarks, however.
I think there were some posts by people with more inside knowledge about why this was, but I don't recall a definitive answer.
With two or four geometry blocks, there would have been the problem of deciding how to partition a stream of primitives between them, and of how to hand geometry that covered more than one screen tile from one to another.
There are code references to potential heuristics, such as moving from the first geometry engine to the second after a certain saturation on the first, round-robin selection, or maybe just feeding one engine at a time.
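Just to make the options concrete, here is a toy sketch of what such selection heuristics could look like (the class names, queue depths and threshold are mine, purely for illustration, not anything lifted from driver code):

```python
# Toy model of distributing a primitive stream across geometry engines.
# Queue depths, the saturation threshold and the policies themselves are
# illustrative guesses, not the actual driver/hardware heuristics.
from collections import deque
from itertools import cycle

class GeometryEngine:
    def __init__(self, name, capacity=32):
        self.name = name
        self.capacity = capacity
        self.queue = deque()

    def saturation(self):
        return len(self.queue) / self.capacity

def pick_spill(engines, threshold=0.75):
    """Feed the first engine until it saturates, then spill to the next."""
    for eng in engines:
        if eng.saturation() < threshold:
            return eng
    return min(engines, key=GeometryEngine.saturation)

def make_round_robin(engines):
    """Simple round-robin selection between engines."""
    order = cycle(engines)
    return lambda: next(order)

engines = [GeometryEngine(f"GE{i}") for i in range(4)]
for prim_id in range(100):
    pick_spill(engines).queue.append(prim_id)
print({e.name: len(e.queue) for e in engines})
```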
References to limitations in how a geometry engine can then pass shared geometry to other front ends show up in a few places, and also in AMD patents.
It does seem like there are challenges in how much overhead is incurred in feeding geometry to one or more front ends, where different scenarios can cause performance degradation for a given choice. The process for passing data between front ends and synchronizing them is also a potential bottleneck: these paths appear to be finicky in terms of synchronization and latency, and there is presumably some heavy crossbar hardware that is difficult to scale.
What Nvidia did to stay ahead of AMD for so long, or what AMD did that left it behind, isn't spelled out, to my knowledge.
I think AMD has proposed schemes for moving beyond input assemblers and rasterizers feeding each other through a crossbar network.
However, the rough outline of having up to 4 rasterizers responsible for a checkerboard pattern of tiles in screen space continues even into the purported leak for the big RDNA2 architecture.
In theory, some kind of distributed form of primitive shader might allow the architecture to drop the poorly-scaling crossbar, but no such scaling is in evidence. The centralized geometry engine seems to regress from some of these proposals, which attempted to make it possible to scale the front end out. Perhaps load-balancing between four peer-level geometry front ends proved more problematic than having a stage in the process that makes some of the distribution decisions ahead of the primitive pipelines.
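The checkerboard itself is easy to picture. A sketch of one plausible mapping of screen tiles to the 4 rasterizers (the tile size and bit selection are guesses on my part, not the real hardware mapping):

```python
# One plausible checkerboard mapping of screen tiles to 4 rasterizers:
# the low bit of the tile's x and y coordinates picks 1 of 4 units.
# The 32-pixel tile size and the bit choice are guesses for illustration.
TILE = 32

def rasterizer_for_pixel(x, y):
    tx, ty = x // TILE, y // TILE
    return (ty & 1) * 2 + (tx & 1)   # repeating 2x2 checkerboard, IDs 0..3

# Print which rasterizer owns each tile in a small corner of the screen.
for ty in range(4):
    print(" ".join(str(rasterizer_for_pixel(tx * TILE, ty * TILE))
                   for tx in range(8)))
# 0 1 0 1 0 1 0 1
# 2 3 2 3 2 3 2 3
# ...
```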
It doesn't work as easily as that. You don't get to mix multiple polygons in a single wavefront, due to a usually significant data dependency on per-triangle uniform vertex attributes, which is handled in the scalar data path. In order to mix like that, you would need to accept a 16x load amplification on the rasterizer output bandwidth, as you would have to drop the scalar path and the compacted inputs for a fully vectorised one. There is no cost-effective way to afford that amplification with hardware rasterization while the geometry engines are kept centralized.
EDIT: Maybe we could actually see this in a future architecture: "lone" pixels being caught in a bucket and then dispatched in a batch in a specialized, scalar-free variant of the fragment shader program. But that would still require a decentralised geometry engine to better cope with the increased bandwidth requirements, and a higher geometry throughput.
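To put a number on that 16x: with a wave64 and 2x2 quads there are 16 quads per wave, so going from one copy of the per-triangle uniform data per wave (scalar path) to one copy per quad is a 16x amplification. A rough illustration, with an assumed attribute footprint:

```python
# Rough illustration of the attribute-bandwidth amplification when triangles
# are mixed within a wave. The attribute footprint is an assumed example value.
WAVE_SIZE  = 64
QUAD_SIZE  = 4
ATTR_BYTES = 3 * 16   # e.g. three 16-byte plane equations per triangle (assumed)

quads_per_wave = WAVE_SIZE // QUAD_SIZE   # 16

# Today: one triangle per wave, so one copy of the uniform per-triangle data,
# fetched once through the scalar path.
scalar_path_bytes = ATTR_BYTES

# One triangle per quad: the data is no longer uniform across the wave and
# has to be replicated (or gathered) per quad through the vector path.
vector_path_bytes = ATTR_BYTES * quads_per_wave

print(f"quads per wave: {quads_per_wave}")
print(f"amplification:  {vector_path_bytes // scalar_path_bytes}x")   # 16x
```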
Triangle packing has been visited as a topic on various occasions, but it seems like in most cases the overheads are too extreme on SIMD hardware. One brief exception is the mention of possibly packing triangles for certain instanced primitives in some AMD presentations.
Rasterization, tessellation, culling and triangle setup are all distributed on RDNA in each shader array. What does the central “geometry processor” actually do?
It may play a part in deciding which shader engines/arrays have cycles allocated to processing geometry that straddles their screen tiles, and perhaps in some early culling that would otherwise be performed redundantly if the default process were to pass a triangle to every engine its bounding box indicates may be involved. Some references to primitive shader culling in Vega do rely on calculating a bounding box, with certain bits in the x and y dimensions indicating whether 1, 2, or 4 front ends are involved.
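A sketch of that bounding-box decision, reusing the 2x2 tile checkerboard assumed earlier (tile size and bit positions are still guesses):

```python
# Sketch: deciding which front ends a triangle touches from its bounding box,
# assuming a 2x2 checkerboard of 32-pixel screen tiles. All constants are guesses.
TILE_SHIFT = 5   # log2(32)

def front_ends_for_bbox(xmin, ymin, xmax, ymax):
    """Return the set of front-end IDs whose tiles the bounding box overlaps."""
    ids = set()
    for ty in range(ymin >> TILE_SHIFT, (ymax >> TILE_SHIFT) + 1):
        for tx in range(xmin >> TILE_SHIFT, (xmax >> TILE_SHIFT) + 1):
            ids.add((ty & 1) * 2 + (tx & 1))
    return ids

print(front_ends_for_bbox(3, 3, 10, 10))   # tiny bbox -> 1 front end
print(front_ends_for_bbox(3, 3, 40, 10))   # crosses a tile in x -> 2 front ends
print(front_ends_for_bbox(3, 3, 70, 70))   # crosses tiles in x and y -> all 4
```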
Why can't GPUs be designed to not shade in quads, so that micro polygons don't destroy efficiency?
Quads come in partly because there are built-in assumptions about gradients and interpolation that make 2x2 blocks desirable at the shader level. It's a common case for graphics, and a crossbar between 4 clients appears to be a worthwhile hardware investment in general, as various compute, shift, and cross-lane operations also offer shuffles or permutations between lanes in blocks of 4, either as an option or as intermediate steps.
Just removing quad functionality doesn't mean the SIMD hardware, cache structure, or DRAM architecture wouldn't still be much wider than necessary.
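For reference, the gradient assumption is why the 2x2 block is so baked in: screen-space derivatives are just differences between the lanes of a quad. A software sketch (not the actual hardware path):

```python
# Software sketch of how ddx/ddy fall out of 2x2 quad shading.
# Lanes within a quad are assumed to be laid out as:
#   0 1
#   2 3
def ddx(quad):
    """Coarse horizontal derivative for a quad given as [v0, v1, v2, v3]."""
    return quad[1] - quad[0]

def ddy(quad):
    """Coarse vertical derivative for the same quad."""
    return quad[2] - quad[0]

# A value increasing by 0.5 per pixel in x and 2.0 per pixel in y:
quad = [1.0, 1.5, 3.0, 3.5]
print(ddx(quad), ddy(quad))   # 0.5 2.0
# Even if only one pixel of the quad is actually covered, the other three
# "helper" lanes still have to execute so these differences can be formed.
```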
The micro polygon problem is mitigated somewhat by higher resolutions with their finer pixel grids. Probably doesn't help much though if your triangles are pixel-sized at 1080p.
One thing I noticed about many compute-based solutions for culling triangles is that a large number of them avoided putting the culling of triangles that are too small or that fall between sample points on the programmable hardware. Decisions like frustum or backface culling tended to be handled in a small number of instructions, and it seems like primitive shaders or CS sieves needed to be mindful of the overhead the culling work would add, since there would be a serial component and duplicated work for any non-culled triangles.
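To give a feel for the cost difference: backface/zero-area culling is a handful of multiplies and compares on the projected positions, while small-primitive culling needs the full bounding box tested against the sample grid. A rough sketch of both tests (not taken from any particular implementation, and assuming pixel centres at x.5/y.5):

```python
# Rough sketch of the cheap vs. costlier culling tests used in compute or
# primitive-shader sieves. Vertices are assumed to be in screen space already.
import math

def backface_or_degenerate(v0, v1, v2):
    """Signed-area test: a couple of multiplies and subtracts."""
    area2 = ((v1[0] - v0[0]) * (v2[1] - v0[1]) -
             (v2[0] - v0[0]) * (v1[1] - v0[1]))
    return area2 <= 0.0   # cull back-facing or zero-area (for this winding)

def small_primitive(v0, v1, v2):
    """Cull triangles whose bounding box misses every pixel centre.
    Needs min/max, rounding and per-axis comparisons - noticeably more work."""
    xmin = min(v0[0], v1[0], v2[0]); xmax = max(v0[0], v1[0], v2[0])
    ymin = min(v0[1], v1[1], v2[1]); ymax = max(v0[1], v1[1], v2[1])
    return (math.ceil(xmin - 0.5) > math.floor(xmax - 0.5) or
            math.ceil(ymin - 0.5) > math.floor(ymax - 0.5))

tri = [(10.1, 10.1), (10.3, 10.1), (10.2, 10.3)]   # tiny, between pixel centres
print(backface_or_degenerate(*tri), small_primitive(*tri))   # False True
```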
However, even if the pain point for the rasterizers were somehow handled, it's not so much the fixed-function block as the whole SIMD architecture behind it. SIMDs are 16-32 lanes wide (wavefronts/warps potentially wider), and without efficient packing, a rasterizer that handles small triangles efficiently would still generate mostly empty or highly divergent thread groups.
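The arithmetic behind that is brutal. Assuming 2x2 quad shading and one triangle per wave (i.e. no packing), a quick utilisation estimate looks like this; all numbers are illustrative:

```python
# Quick illustration of SIMD lane utilisation with micro polygons, assuming
# 2x2 quad shading. Illustrative numbers only.
QUAD = 4

def util_unpacked(wave_size, covered_pixels):
    """No packing: a whole wave is launched for one small triangle."""
    return min(covered_pixels, wave_size) / wave_size

def util_quad_packed(covered_pixels):
    """Hypothetical per-quad packing: one triangle per quad, but helper lanes
    in partially covered quads are still wasted."""
    return min(covered_pixels, QUAD) / QUAD

for covered in (1, 2, 4, 16):
    print(f"{covered:2d}-pixel triangle: "
          f"wave64 unpacked {util_unpacked(64, covered):6.1%}, "
          f"quad-packed {util_quad_packed(covered):6.1%}")
# A 1-pixel triangle uses ~1.6% of a wave64 without packing; even with ideal
# per-quad packing only 25% of its lanes do non-helper work.
```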