Modern and Future Geometry Rasterizer layout? *spawn*

Digidi

I don't think this is entirely correct. The rasterizer's limit is how many polygons it can take in, not how many pixels it can make out of a single polygon. The question is how many polygons you can turn into pixels.

AMD stated for GCN that the worst case is a polygon no bigger than a pixel. In that case the old GCN rasterizer could put out only one pixel for that one polygon. If you instead have a rasterizer that can take two polygons, you can get out 2 pixels per clock. The more polygons a rasterizer can take in, the more pixels you get.
Page 60:
https://de.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah
 
As far as I'm aware both are limits. AFAIK both AMD's and Nvidia's rasterizers can take at most one triangle/clock and output at most 16 pixels/clock. So large triangles will take more than one cycle to complete, and small triangles will result in lower-than-peak pixel throughput.
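Taking the 1 triangle/clock and 16 pixels/clock figures above at face value, a toy model makes the two limits easy to see. This ignores binning, tiling and every other bottleneck, so it is only a sketch, not how the hardware actually schedules work:

```cpp
#include <algorithm>
#include <cstdio>

// Toy model of one rasterizer with the limits quoted above:
// at most 1 triangle accepted per clock, at most 16 pixels emitted per clock.
constexpr int kMaxPixelsPerClock = 16;

// Clocks needed to scan out one triangle covering `coveredPixels` pixels.
int clocksForTriangle(int coveredPixels) {
    // Even a triangle covering 0 or 1 pixels still occupies the rasterizer for a clock.
    return std::max(1, (coveredPixels + kMaxPixelsPerClock - 1) / kMaxPixelsPerClock);
}

int main() {
    // Large triangles are bound by the 16 px/clk limit; micro-triangles by the 1 tri/clk limit.
    for (int px : {256, 64, 16, 1}) {
        int clocks = clocksForTriangle(px);
        std::printf("%3d px -> %2d clocks, %5.2f px/clk\n", px, clocks, double(px) / clocks);
    }
}
```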
 
Thank you @Ryan Smith. Sorry, I'm not a native English speaker and I don't quite get it. So this means that after culling you get 2 primitives. Does this mean they have 2 rasterizers? Or does it happen before the rasterizer? Thank you in advance!
The situation where two back-facing primitives are culled indicates that the hardware can discard up to two primitives per cycle. At some point, every primitive that reaches the fixed-function hardware must either be rejected or be used to produce a set of threads of some kind corresponding to the pixels it covers.

The old implementation would spend one cycle per primitive regardless of whether it would produce any pixels for rendering. If the hardware can cull two triangles per cycle, a stream of triangles that need to be discarded can be discarded in half the time.
Depending on how many triangles are encountered that can be culled, this can speed up overall processing since there can be fewer cycles spent on triangles that do not produce output pixels. However, if the stream has mostly non-culled triangles, the throughput would be similar to before.
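To put rough numbers on that, here is a back-of-the-envelope model. It assumes culls and draws are not overlapped, which real hardware will do better than, so treat it only as an illustration of the proportionality:

```cpp
#include <cstdio>

// Cycles for a stream of `total` triangles of which `culledFraction` are discarded.
// Old behaviour per the post above: 1 cycle per primitive regardless of outcome.
// New behaviour: up to 2 culled primitives per cycle, drawn ones still 1 per cycle.
double oldCycles(double total) { return total; }
double newCycles(double total, double culledFraction) {
    double culled = total * culledFraction;
    double drawn  = total - culled;
    return culled / 2.0 + drawn;  // crude upper bound, no overlap assumed
}

int main() {
    const double total = 1'000'000;
    for (double f : {0.1, 0.5, 0.9}) {
        std::printf("culled %2.0f%%: old %.0f cycles, new %.0f cycles\n",
                    f * 100, oldCycles(total), newCycles(total, f));
    }
}
```

With mostly non-culled triangles the totals barely move; with a heavily culled stream the cycle count drops towards half, which is the behaviour described above.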
 
I don't think this is entirely correct. The rasterizer's limit is how many polygons it can take in, not how many pixels it can make out of a single polygon. The question is how many polygons you can turn into pixels.

AMD stated for GCN that the worst case is a polygon no bigger than a pixel. In that case the old GCN rasterizer could put out only one pixel for that one polygon. If you instead have a rasterizer that can take two polygons, you can get out 2 pixels per clock. The more polygons a rasterizer can take in, the more pixels you get.
Page 60:
https://de.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah
Yes, I tried to condense it down a bit. Micropolygons are dreaded by rasterizers, so people came up with compute-shader solutions for this, like AMD's GeometryFX. They use CS to cull micro-geometry to lessen the burden on the rasterizers and geometry engines: https://gpuopen.com/geometryfx/

Also a good read on primitives: https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf
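For anyone wondering what such a compute pre-pass actually checks, below is a CPU-side sketch of the two classic per-triangle tests (back-face and "too small to hit a pixel center") in the spirit of what those two links describe. The function names, winding convention and example coordinates are my own illustration, not GeometryFX's actual code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

struct Vec2 { float x, y; };  // post-projection position in pixel coordinates

// Back-face test via the sign of the triangle's doubled signed area.
// Assumes counter-clockwise triangles are front-facing (API/engine dependent).
bool backfaceCulled(Vec2 a, Vec2 b, Vec2 c) {
    float area2 = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    return area2 <= 0.0f;  // also rejects degenerate (zero-area) triangles
}

// Small-primitive test: if the screen-space bounding box rounds to the same
// value on either axis, the triangle cannot cover any pixel center
// (assuming pixel centers at .5 offsets, D3D-style).
bool smallPrimitiveCulled(Vec2 a, Vec2 b, Vec2 c) {
    float minX = std::min({a.x, b.x, c.x}), maxX = std::max({a.x, b.x, c.x});
    float minY = std::min({a.y, b.y, c.y}), maxY = std::max({a.y, b.y, c.y});
    return std::round(minX) == std::round(maxX) ||
           std::round(minY) == std::round(maxY);
}

int main() {
    // Tiny CCW triangle that misses every pixel center: front-facing, but culled as too small.
    Vec2 a{10.1f, 10.1f}, b{10.4f, 10.2f}, c{10.2f, 10.4f};
    std::printf("backface: %d  small-prim: %d\n",
                backfaceCulled(a, b, c), smallPrimitiveCulled(a, b, c));
}
```

Real implementations run these tests per triangle in a compute shader and compact the survivors into a new index buffer before the draw is submitted.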
 
Thank you 3dilettante and CarstenS for your explanations. Nice pages you have found, CarstenS, thank you for that.

My point is: we all talk about culling, but maybe culling is not the only issue. If you have fine meshes, then even after culling you have a lot of polygons to rasterize. These days small polygons are used more and more to create fine contours of surfaces and of people's faces.

If you have this case, only 1 pixel is created per polygon, which leaves the rest of the pipeline (shaders, ROPs) largely empty. The pipeline only works really well when the rasterizer puts out 16 pixels for 1 polygon; then you have enough pixels to fill all the shaders and ROPs.

So in programs with fine meshes, where each polygon is no bigger than a pixel, you would speed things up dramatically if one rasterizer could take two polygons per clock and send the resulting pixels to the shaders and ROPs. You would directly double your performance for this worst case.
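As a worked example of that worst case (again taking the 16 pixels/clock per rasterizer figure from earlier in the thread, and ignoring every other limit):

```cpp
#include <cstdio>

// Fraction of a rasterizer's 16 px/clk peak that reaches the shaders/ROPs
// when every triangle covers exactly one pixel.
int main() {
    const double peakPixelsPerClock = 16.0;
    for (int trianglesPerClock : {1, 2, 4}) {
        double pixelsPerClock = trianglesPerClock * 1.0;  // 1 px per micro-triangle
        std::printf("%d tri/clk -> %.0f px/clk = %4.1f%% of the 16 px/clk peak\n",
                    trianglesPerClock, pixelsPerClock,
                    100.0 * pixelsPerClock / peakPixelsPerClock);
    }
}
```

So doubling how many micro-triangles the rasterizer accepts per clock does double the pixel output in this regime, but it still feeds only a small fraction of the downstream shader/ROP capacity.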
 
As I understood it, and please correct me if I am wrong, NVIDIA can rasterize and cull a max of 6 primitives per cycle, as they have a distributed tessellation (PolyMorph) engine that can deal with this effectively. AMD on the other hand can cull 4 primitives but only rasterize about 2.
 
Navi is still a max of 4, of that I am sure; the question is: 4 culled and 4 rasterized, or just 4 culled?
I can see more than 2 per clock, so that cannot be the hard limit. Probably they are running into some other bottleneck that keeps them from reaching clearly more than 3, or even approaching 4.
edit: Just re-read the whitepaper. AMD says explicitly that each primitive unit can cull 2 triangles per clock and draw 1 (output to the rasterizer). Each of the four rasterizers can process 1 triangle per clock, test it for coverage and emit 16 pixels per clock. I haven't seen culled-triangle rates much above 8 GTri/s though; maybe the prim units are not fed quickly enough, or the test runs into another bottleneck.
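Working those whitepaper figures through at an assumed clock (1.9 GHz is my guess for the test clock, not a quoted number):

```cpp
#include <cstdio>

int main() {
    const double clockGHz     = 1.9;  // assumed clock, pick your own
    const int    primUnits    = 4;    // per the RDNA whitepaper numbers above
    const int    rasterizers  = 4;
    const int    culledPerClk = 2;    // per prim unit
    const int    drawnPerClk  = 1;    // per prim unit / rasterizer
    const int    pixelsPerClk = 16;   // per rasterizer

    std::printf("theoretical cull rate : %.1f GTri/s\n", primUnits * culledPerClk * clockGHz);
    std::printf("theoretical draw rate : %.1f GTri/s\n", rasterizers * drawnPerClk * clockGHz);
    std::printf("theoretical fill rate : %.1f GPix/s\n", rasterizers * pixelsPerClk * clockGHz);
}
```

An observed ~8 GTri/s is much closer to the ~7.6 GTri/s draw-path figure than to the ~15.2 GTri/s cull figure, which would fit the suspicion that the prim units are not actually being fed two triangles per clock in those tests.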
 
One PolyMorph engine at Nvidia can handle 0.5 polygons/clock, which leads to a maximum of 15 polygons/clock for Pascal. But there is maybe a cache limit, which results in 11 polygons/clock. So 11 polygons can be culled and only 6 polygons can be rasterized at Nvidia. Because Nvidia has 2 more rasterizers, maybe this is the reason why Nvidia gets more performance than AMD? Because in the worst-case scenario Nvidia can always put out 2 more polygons than AMD.

You can read my conversation about it here (in German), with good information from other people like Pixeljetstream:

https://www.forum-3dcenter.org/vbulletin/showthread.php?p=11466705&highlight=0.5#post11466705
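Spelling out that PolyMorph arithmetic (GP102 with 30 SMs and 6 GPCs is my assumption for "Pascal" here; adjust for other chips):

```cpp
#include <cstdio>

int main() {
    const int    smCount         = 30;   // assuming GP102 (e.g. Titan X Pascal)
    const double trisPerSmPerClk = 0.5;  // one PolyMorph engine per SM
    const int    rasterizers     = 6;    // one per GPC, 1 triangle/clock each

    std::printf("front-end, theoretical : %.1f tri/clk\n", smCount * trisPerSmPerClk);
    std::printf("front-end, per post    : ~11 tri/clk (possible cache limit)\n");
    std::printf("rasterized             : %d tri/clk\n", rasterizers);
}
```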
 
I can see more than 2 per clock, so that cannot be the hard limit. Probably they are running into some other bottleneck that keeps them from reaching clearly more than 3, or even approaching 4.
PCGH used to run some theoretical tests which included geometry. Do you know why they stopped?
 
So in programs with fine meshes, where each polygon is no bigger than a pixel, you would speed things up dramatically if one rasterizer could take two polygons per clock and send the resulting pixels to the shaders and ROPs. You would directly double your performance for this worst case.
It doesn't work quite as easily as that. You don't get to mix multiple polygons in a single wavefront, due to a usually significant data dependency on per-triangle uniform vertex attributes, which is handled in the scalar data path. In order to mix like that, you would need to accept a 16x load amplification on the rasterizer output bandwidth, as you would have to drop the scalar path and the compacted inputs in favour of a fully vectorised path. There is no cost-effective way to afford that amplification with hardware rasterization while geometry engines are kept centralized.
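A crude way to see where a 16x figure like that comes from (the attribute struct and its size are invented purely for illustration):

```cpp
#include <cstdio>

// Per-triangle data the pixel shader needs but that is identical for every
// pixel of that triangle (e.g. plane equations for attribute interpolation).
// Today this is fetched once per triangle via the scalar path.
struct TriangleUniforms { float planeEq[3][4]; };  // invented size: 48 bytes

int main() {
    const int    pixelsPerClock = 16;  // rasterizer output rate, per the thread
    const size_t perTriangle    = sizeof(TriangleUniforms);

    // One triangle feeding all 16 pixels: one copy of the uniforms per clock.
    size_t scalarPathBytes = perTriangle;
    // 16 one-pixel triangles per clock, fully vectorised: one copy per pixel.
    size_t vectorPathBytes = perTriangle * pixelsPerClock;

    std::printf("scalar path : %zu B/clk\nvector path : %zu B/clk (%zux)\n",
                scalarPathBytes, vectorPathBytes, vectorPathBytes / scalarPathBytes);
}
```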

EDIT: Maybe we could actually see this in a future architecture: "lone" pixels being caught in a bucket and then dispatched in a batch to a specialized, scalar-free variant of the fragment shader program. But that would still require a decentralised geometry engine to better cope with the increased bandwidth requirements, and a higher geometry throughput.
 
[...] while geometry engines are kept centralized.
Yet there seems to be some merit in doing exactly that (centralizing the geometry engines), since AMD has done so with the move to RDNA. At least that's what I've gathered from the whitepaper.
 