I can see more than 2 per clock, so that cannot be the hard limit. They are probably running into some other bottleneck that keeps them from getting clearly above 3, let alone approaching 4.
edit: Just re-read the whitepaper. AMD says explicitly that each primitive unit can cull 2 triangles per clock and draw 1 (output to the rasterizer). Each of the four rasterizers can process 1 triangle per clock, test it for coverage and emit 16 pixels per clock. I haven't seen culled-triangle rates much above 8 GTri/s though; maybe the prim units aren't fed quickly enough, or the test runs into another bottleneck.
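As a rough sanity check on those numbers, here's a back-of-the-envelope calculation (the 4 primitive units and ~1.9 GHz clock are my assumptions for a Navi 10-class part, not figures from the whitepaper):

```python
# Back-of-the-envelope peak geometry rates. The unit counts and clock are
# assumptions for a Navi 10-class part, not numbers from the whitepaper.
PRIM_UNITS  = 4
RASTERIZERS = 4
CLOCK_GHZ   = 1.9

CULL_PER_CLK = 2   # culled triangles per primitive unit per clock
DRAW_PER_CLK = 1   # triangles output to the rasterizer per clock
PIX_PER_CLK  = 16  # pixels emitted per rasterizer per clock

peak_cull = PRIM_UNITS * CULL_PER_CLK * CLOCK_GHZ    # ~15.2 GTri/s
peak_draw = PRIM_UNITS * DRAW_PER_CLK * CLOCK_GHZ    # ~7.6 GTri/s
peak_fill = RASTERIZERS * PIX_PER_CLK * CLOCK_GHZ    # ~121.6 Gpix/s

print(f"peak culled: {peak_cull:.1f} GTri/s")
print(f"peak drawn:  {peak_draw:.1f} GTri/s")
print(f"peak fill:   {peak_fill:.1f} Gpix/s")
print(f"measured ~8 GTri/s culled is {8 / peak_cull:.0%} of the culling peak")
```

So the ~8 GTri/s figure is only a bit over half of the nominal culling peak, which is why it smells like a feeding or test-setup bottleneck rather than the prim units themselves.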
The efficiency gap in AMD's front end has been a topic of debate for generations. I think the first questions about scaling came up in the VLIW era, when the first "dual rasterizer" models were released and AMD didn't seem to benefit all that much from the second unit.
Fast-forward through years of product releases and the move to 4 rasterizers, and AMD fell even further from its theoretical peak.
The most recent GPUs did seem to catch up to the competition in a number of targeted benchmarks, however.
I think there were some posts by people with more inside knowledge about why this was, but I don't recall a definitive answer.
With two or four geometry blocks, there would have been the problem of deciding how to partition a stream of primitives between them, and of how to hand geometry that covered more than one screen tile from one to another.
There are code references to potential heuristics, such as moving from the first geometry engine to the second after a certain saturation on the first, round-robin selection, or maybe just feeding one engine at a time.
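Just to make the options concrete, here is a toy sketch of what such selection heuristics could look like (the class names, queue depths and threshold are mine, purely for illustration, not anything lifted from driver code):

```python
# Toy model of distributing a primitive stream across geometry engines.
# Queue depths, the saturation threshold and the policies themselves are
# illustrative guesses, not the actual driver/hardware heuristics.
from collections import deque
from itertools import cycle

class GeometryEngine:
    def __init__(self, name, capacity=32):
        self.name = name
        self.capacity = capacity
        self.queue = deque()

    def saturation(self):
        return len(self.queue) / self.capacity

def pick_spill(engines, threshold=0.75):
    """Feed the first engine until it saturates, then spill to the next."""
    for eng in engines:
        if eng.saturation() < threshold:
            return eng
    return min(engines, key=GeometryEngine.saturation)

def make_round_robin(engines):
    """Simple round-robin selection between engines."""
    order = cycle(engines)
    return lambda: next(order)

engines = [GeometryEngine(f"GE{i}") for i in range(4)]
for prim_id in range(100):
    pick_spill(engines).queue.append(prim_id)
print({e.name: len(e.queue) for e in engines})
```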
References to limitations in how a geometry engine can then pass shared geometry to other front ends show up in a few places, and also in AMD patents.
It does seem like there are challenges in how much overhead is incurred in feeding geometry to one or more front ends, where different scenarios can cause performance degradation for a given choice. The process for passing data between front ends and synchronizing them is also a potential bottleneck: these paths appear to be finicky in terms of synchronization and latency, and there is presumably some heavy crossbar hardware that is difficult to scale.
What Nvidia did to stay ahead of AMD for so long, or what AMD did that left it behind, isn't spelled out, to my knowledge.
I think AMD has proposed schemes for moving beyond input assemblers and rasterizers feeding each other through a crossbar network.
However, the rough outline of having up to 4 rasterizers responsible for a checkerboard pattern of tiles in screen space continues even into the purported leak for the big RDNA2 architecture.
In theory, some kind of distributed form of primitive shader might allow the architecture to drop the poorly-scaling crossbar, but no such scaling is in evidence. The centralized geometry engine seems to regress from some of these proposals, which attempted to make it possible to scale the front end out. Perhaps load-balancing between four peer-level geometry front ends proved more problematic than having a stage in the process that makes some of the distribution decisions ahead of the primitive pipelines.
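The checkerboard itself is easy to picture. A sketch of one plausible mapping of screen tiles to the 4 rasterizers (the tile size and bit selection are guesses on my part, not the real hardware mapping):

```python
# One plausible checkerboard mapping of screen tiles to 4 rasterizers:
# the low bit of the tile's x and y coordinates picks 1 of 4 units.
# The 32-pixel tile size and the bit choice are guesses for illustration.
TILE = 32

def rasterizer_for_pixel(x, y):
    tx, ty = x // TILE, y // TILE
    return (ty & 1) * 2 + (tx & 1)   # repeating 2x2 checkerboard, IDs 0..3

# Print which rasterizer owns each tile in a small corner of the screen.
for ty in range(4):
    print(" ".join(str(rasterizer_for_pixel(tx * TILE, ty * TILE))
                   for tx in range(8)))
# 0 1 0 1 0 1 0 1
# 2 3 2 3 2 3 2 3
# ...
```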
It doesn't work as easily as that. You don't get to mix multiple polygons in a single wavefront, due to a usually significant data dependency on per-triangle uniform vertex attributes, which is handled in the scalar data path. In order to mix like that, you would need to accept a 16x load amplification on the rasterizer output bandwidth, as you would have to drop the scalar path and the compacted inputs for a fully vectorised one. There is no cost-effective way to afford that amplification with hardware rasterization while the geometry engines are kept centralized.
EDIT: Maybe we could actually see this in a future architecture: "lone" pixels being caught in a bucket and then dispatched in a batch in a specialized, scalar-free variant of the fragment shader program. But that would still require a decentralised geometry engine to better cope with the increased bandwidth requirements, and a higher geometry throughput.
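To put a number on that 16x: with a wave64 and 2x2 quads there are 16 quads per wave, so going from one copy of the per-triangle uniform data per wave (scalar path) to one copy per quad is a 16x amplification. A rough illustration, with an assumed attribute footprint:

```python
# Rough illustration of the attribute-bandwidth amplification when triangles
# are mixed within a wave. The attribute footprint is an assumed example value.
WAVE_SIZE  = 64
QUAD_SIZE  = 4
ATTR_BYTES = 3 * 16   # e.g. three 16-byte plane equations per triangle (assumed)

quads_per_wave = WAVE_SIZE // QUAD_SIZE   # 16

# Today: one triangle per wave, so one copy of the uniform per-triangle data,
# fetched once through the scalar path.
scalar_path_bytes = ATTR_BYTES

# One triangle per quad: the data is no longer uniform across the wave and
# has to be replicated (or gathered) per quad through the vector path.
vector_path_bytes = ATTR_BYTES * quads_per_wave

print(f"quads per wave: {quads_per_wave}")
print(f"amplification:  {vector_path_bytes // scalar_path_bytes}x")   # 16x
```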
Triangle packing has been visited as a topic on various occasions, but it seems like in most cases the overheads are too extreme on SIMD hardware. One brief exception is the mention of possibly packing triangles for certain instanced primitives in some AMD presentations.
Rasterization, tessellation, culling and triangle setup are all distributed on RDNA in each shader array. What does the central “geometry processor” actually do?
It may play a part in deciding which shader engines/arrays have cycles allocated to processing geometry that straddles their screen tiles, and perhaps in some early culling that would otherwise be performed redundantly if the default process were to pass a triangle to every engine its bounding box indicates may be involved. Some references to primitive shader culling in Vega do rely on calculating a bounding box, with certain bits in the x and y dimensions indicating whether 1, 2, or 4 front ends are involved.
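A sketch of that bounding-box decision, reusing the 2x2 tile checkerboard assumed earlier (tile size and bit positions are still guesses):

```python
# Sketch: deciding which front ends a triangle touches from its bounding box,
# assuming a 2x2 checkerboard of 32-pixel screen tiles. All constants are guesses.
TILE_SHIFT = 5   # log2(32)

def front_ends_for_bbox(xmin, ymin, xmax, ymax):
    """Return the set of front-end IDs whose tiles the bounding box overlaps."""
    ids = set()
    for ty in range(ymin >> TILE_SHIFT, (ymax >> TILE_SHIFT) + 1):
        for tx in range(xmin >> TILE_SHIFT, (xmax >> TILE_SHIFT) + 1):
            ids.add((ty & 1) * 2 + (tx & 1))
    return ids

print(front_ends_for_bbox(3, 3, 10, 10))   # tiny bbox -> 1 front end
print(front_ends_for_bbox(3, 3, 40, 10))   # crosses a tile in x -> 2 front ends
print(front_ends_for_bbox(3, 3, 70, 70))   # crosses tiles in x and y -> all 4
```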
Why can't GPUs be designed to not shade in quads, so that micro polygons don't destroy efficiency?
Quads come in partly because there are built-in assumptions about gradients and interpolation that make 2x2 blocks desirable at the shader level. It's a common case for graphics, and a crossbar between 4 clients appears to be a worthwhile hardware investment in general, as various compute, shift, and cross-lane operations also offer shuffles or permutations between lanes in blocks of 4, either as an option or as intermediate steps.
Just removing quad functionality doesn't mean the SIMD hardware, cache structure, or DRAM architecture wouldn't still be much wider than necessary.
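For reference, the gradient assumption is why the 2x2 block is so baked in: screen-space derivatives are just differences between the lanes of a quad. A software sketch (not the actual hardware path):

```python
# Software sketch of how ddx/ddy fall out of 2x2 quad shading.
# Lanes within a quad are assumed to be laid out as:
#   0 1
#   2 3
def ddx(quad):
    """Coarse horizontal derivative for a quad given as [v0, v1, v2, v3]."""
    return quad[1] - quad[0]

def ddy(quad):
    """Coarse vertical derivative for the same quad."""
    return quad[2] - quad[0]

# A value increasing by 0.5 per pixel in x and 2.0 per pixel in y:
quad = [1.0, 1.5, 3.0, 3.5]
print(ddx(quad), ddy(quad))   # 0.5 2.0
# Even if only one pixel of the quad is actually covered, the other three
# "helper" lanes still have to execute so these differences can be formed.
```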
The micro polygon problem is mitigated somewhat by higher resolutions with their finer pixel grids. Probably doesn't help much though if your triangles are pixel-sized at 1080p.
One thing I noticed about many compute-based solutions for culling triangles is that a large number of them avoided putting the culling of triangles that are too small or that fall between sample points on the programmable hardware. Decisions like frustum or backface culling tended to be handled in a small number of instructions, and it seems like primitive shaders or CS sieves needed to be mindful of the overhead the culling work would add, since there would be a serial component and duplicated work for any non-culled triangles.
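To give a feel for the cost difference: backface/zero-area culling is a handful of multiplies and compares on the projected positions, while small-primitive culling needs the full bounding box tested against the sample grid. A rough sketch of both tests (not taken from any particular implementation, and assuming pixel centres at x.5/y.5):

```python
# Rough sketch of the cheap vs. costlier culling tests used in compute or
# primitive-shader sieves. Vertices are assumed to be in screen space already.
import math

def backface_or_degenerate(v0, v1, v2):
    """Signed-area test: a couple of multiplies and subtracts."""
    area2 = ((v1[0] - v0[0]) * (v2[1] - v0[1]) -
             (v2[0] - v0[0]) * (v1[1] - v0[1]))
    return area2 <= 0.0   # cull back-facing or zero-area (for this winding)

def small_primitive(v0, v1, v2):
    """Cull triangles whose bounding box misses every pixel centre.
    Needs min/max, rounding and per-axis comparisons - noticeably more work."""
    xmin = min(v0[0], v1[0], v2[0]); xmax = max(v0[0], v1[0], v2[0])
    ymin = min(v0[1], v1[1], v2[1]); ymax = max(v0[1], v1[1], v2[1])
    return (math.ceil(xmin - 0.5) > math.floor(xmax - 0.5) or
            math.ceil(ymin - 0.5) > math.floor(ymax - 0.5))

tri = [(10.1, 10.1), (10.3, 10.1), (10.2, 10.3)]   # tiny, between pixel centres
print(backface_or_degenerate(*tri), small_primitive(*tri))   # False True
```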
However, even if the pain point for the rasterizers were somehow handled, it's not so much the fixed-function block as the whole SIMD architecture behind it. SIMDs are 16-32 lanes wide (wavefronts/warps potentially wider), and without efficient packing, a rasterizer that handles small triangles efficiently would still generate mostly empty or highly divergent thread groups.
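The arithmetic behind that is brutal. Assuming 2x2 quad shading and one triangle per wave (i.e. no packing), a quick utilisation estimate looks like this; all numbers are illustrative:

```python
# Quick illustration of SIMD lane utilisation with micro polygons, assuming
# 2x2 quad shading. Illustrative numbers only.
QUAD = 4

def util_unpacked(wave_size, covered_pixels):
    """No packing: a whole wave is launched for one small triangle."""
    return min(covered_pixels, wave_size) / wave_size

def util_quad_packed(covered_pixels):
    """Hypothetical per-quad packing: one triangle per quad, but helper lanes
    in partially covered quads are still wasted."""
    return min(covered_pixels, QUAD) / QUAD

for covered in (1, 2, 4, 16):
    print(f"{covered:2d}-pixel triangle: "
          f"wave64 unpacked {util_unpacked(64, covered):6.1%}, "
          f"quad-packed {util_quad_packed(covered):6.1%}")
# A 1-pixel triangle uses ~1.6% of a wave64 without packing; even with ideal
# per-quad packing only 25% of its lanes do non-helper work.
```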