NVIDIA Fermi: Architecture discussion

Does this mean that the aggregate rasterisation area will be smaller than GT200's on mainstream derivatives?
We don't know yet, raster area may be variable (I'd expect so). If I'm honest, my biggest wonder in the last few days is whether there's really more than one unit at all. Parallelisable setup into one fixed-area rasteriser makes some sense, where it can work on up to four input triangles in a clock.
 
It's currently 8 pixels per clock per raster unit. If triangles are larger than 32 pixels you don't necessarily benefit; you only move the bottleneck to the rasterizers instead of triangle setup. Mainstream parts will be affected by Nvidia's choice of implementation, i.e. their number of GPCs.

Ok, now I get it.

Thanks!
 
I believe that if they keep this design going forward, then if and when they put support for 3+ monitors on one card, they'd have a leg up on ATI unless ATI changes its setup rate as well. The number of triangles being rendered across, say, three 24" WS LCDs is 3x as many as on a single 24". Take Crysis at 5-6M triangles per scene: across 3 monitors you get 15-18M. For a good framerate of, say, an average 60 FPS, you're talking 900M to 1.08B triangles per second needed. So if GTX3xx can do 2400-2800M triangles/sec, had they done even just 3 monitors on a single-GPU card, I'd say they'd have a very clear and distinct advantage.
I don't think the setup rate will in any way help to run more displays, as the primary limitation there is simply the fill rate. Basically, yes, you're drawing more triangles, but you're also drawing more pixels, so the pixel/triangle ratio doesn't change.
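For what it's worth, the arithmetic above checks out if you take the earlier poster's 5-6M triangles/scene estimate at face value (that figure is his assumption, not a measured number):

```python
# Rough sanity check of the triangle-rate arithmetic above; the 5-6M
# triangles per scene for Crysis is the earlier poster's estimate.
tris_per_scene = 6_000_000      # upper estimate for one monitor
monitors = 3
fps = 60
tris_per_sec = tris_per_scene * monitors * fps
print(tris_per_sec)             # 1_080_000_000, i.e. ~1.08B triangles/sec
```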
 
My interpretation is that the aggregate raster output rate is 32 pixels per half shader clock. Depending on where the shader clocks go and what the L2/ROP domain is set to, the bottleneck could shift between raster and ROP depending on the exact configuration and clocks of a given derivative.
 
So... It's actually doing four quarter-triangles per clock? More detail, but not faster...?

That would depend on the size of the triangles.
If it were 4 triangles that each need 32 pixels scanned, it would be 4 quarter-triangles per clock.
If it were 4 tiny triangles, it would be 4 whole triangles per clock.

In the case of a single 32-pixel raster block, it's 1 big 32-pixel triangle per clock.
If it's a tiny triangle that covers 8 pixels, it's one tiny triangle per clock.
If there were 4 8-pixel triangles, still 1 triangle per clock.
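The cases above can be captured in a toy model. This is purely illustrative and assumes a triangle is never split across rasterizers, which is exactly the assumption under discussion:

```python
import math

def raster_clocks(tri_pixels, n_rasters, pix_per_raster):
    """Toy model: each triangle is tied to one rasterizer (a triangle is
    not split across units - an assumption, not confirmed behaviour),
    and each rasterizer scans pix_per_raster pixels of its triangle per
    clock. Returns the total clocks to rasterize the batch."""
    free = [0] * n_rasters              # clock at which each unit is free
    for pixels in tri_pixels:
        r = free.index(min(free))       # earliest-free unit takes the tri
        free[r] += math.ceil(pixels / pix_per_raster)
    return max(free)

print(raster_clocks([8] * 4, 4, 8))    # 1: four tiny triangles in one clock
print(raster_clocks([32], 4, 8))       # 4: one 32-pixel tri on an 8-wide unit
print(raster_clocks([32] * 4, 4, 8))   # 4: four "quarter-triangles" per clock
print(raster_clocks([8] * 4, 1, 32))   # 4: a single 32-wide unit, 1 tri/clock
```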
 
It's unclear if GF100 can achieve full rate without tessellation, but it should be easy to test with a custom app. The potential limitation is that there is a single index buffer per draw call, so they need to parallelize processing of the index buffer to make a single draw command run faster than 1 triangle per clock. This is non-trivial.

Isn't that also the case for tessellated primitives? There is still a single index buffer of vertices/control points per draw call. If each rasterizer is indeed separate and independent then Rys' theory doesn't work, and it would mean that Nvidia has to sustain close to 4 triangles per clock in order to avoid being rasterization limited, even on a fully stocked GF100.
 
Isn't that also the case for tessellated primitives? There is still a single index buffer of vertices/control points per draw call. If each rasterizer is indeed separate and independent then Rys' theory doesn't work, and it would mean that Nvidia has to sustain close to 4 triangles per clock in order to avoid being rasterization limited, even on a fully stocked GF100.
Even if the index buffer is processed at 1 prim/clock, amplification in the tessellation stage can keep the setup/rasterizers busy. That's what I was trying to say. Not sure I understood your question.

Just noting that I'm not saying GF100 can't always run at 4 prims/clock. I don't know.
 
Well if I understand you correctly you're saying it's the amplification during tessellation that generates work for multiple raster units. But that doesn't explain how the 16 different tessellation units get fed in the first place - your draw call still has to be parcelled out.

Basically, if that's not the case then in non-tessellation scenarios GF100 would only have a rasterization rate of 8 pixels per clock, which would be a bit ridiculous (i.e. way too low).
 
It would take time to feed the 16 tessellators, but if each tessellator is amplifying eventually the input will catch up to the output.

Consider Cypress: it can set up 1 prim/clock and rasterizes 2 triangles, each at 16 pixels/clock. It can still run faster than 16 pixels/clock because many triangles cover 32 pixels or more, which gives the setup engine time to fill up the input buffers of the rasterizers. If it then hits a group of small triangles that have been buffered up, having 2 rasterizers allows it to finish the small triangles quicker than if it had a single 32-pixel rasterizer.

The same principle applies to GF100.
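The buffering argument can be sketched with a toy pipeline simulation. Everything here is an idealization I'm assuming for illustration: setup emits exactly 1 triangle/clock, buffering between setup and raster is unbounded, and a triangle is never split across rasterizers:

```python
import math

def frame_clocks(tri_pixels, n_rasters, pix_per_raster):
    """Toy pipeline: setup emits triangle i at clock i+1 (1 prim/clock);
    each rasterizer pulls the next buffered triangle when free and spends
    ceil(pixels / pix_per_raster) clocks scanning it. All rates and the
    unbounded buffer are assumptions for illustration only."""
    free = [0] * n_rasters                  # clock at which each unit is free
    finish = 0
    for ready, pixels in enumerate(tri_pixels, start=1):
        r = free.index(min(free))           # earliest-free rasterizer
        start = max(free[r], ready)         # can't start before setup emits it
        free[r] = start + math.ceil(pixels / pix_per_raster)
        finish = max(finish, free[r])
    return finish

# Eight large (64-px) triangles followed by eight tiny (8-px) ones:
# the big triangles let setup run ahead, so two 16-wide rasterizers
# chew through the buffered small triangles faster than one 32-wide unit.
batch = [64] * 8 + [8] * 8
print(frame_clocks(batch, 2, 16))   # 22 clocks with two 16-wide rasterizers
print(frame_clocks(batch, 1, 32))   # 25 clocks with one 32-wide rasterizer
```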
 
Yeah the buffering between the tessellator and rasterizer is clear but I thought we were talking about how draw calls are parcelled out to the tessellators in the first place?
 
I forgot the obvious thing, which is that patches are specified as lists, so it should be easy to split these draws between the tessellators. Or you can send separate objects (draws) to each cluster.
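To illustrate why lists are the easy case: a triangle (or patch) list can be cut on primitive boundaries, since no indices are shared across the cut. The splitting scheme below is hypothetical, just to show the idea; strips and fans can't be split this trivially, and in real hardware primitive order would still have to be restored downstream:

```python
def split_triangle_list(indices, n_units):
    """Hypothetical sketch: split a triangle-list index buffer on 3-index
    boundaries so n_units front ends can process chunks in parallel.
    Round-robin distribution; order restoration is ignored here."""
    assert len(indices) % 3 == 0, "triangle list must be a multiple of 3"
    tris = [indices[i:i + 3] for i in range(0, len(indices), 3)]
    chunks = [[] for _ in range(n_units)]
    for t, tri in enumerate(tris):
        chunks[t % n_units].extend(tri)
    return chunks

print(split_triangle_list(list(range(12)), 2))
# [[0, 1, 2, 6, 7, 8], [3, 4, 5, 9, 10, 11]]
```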
 
Consider Cypress, it can setup 1 prim/clock and rasterizes 2 triangles, each at 16 pixels/clock.
I don't think that's right. It rasterizes 1 per clock, and according to the B3D article, the two rasterizers shown in the diagrams were just an error.

It can still run faster than 16 pixels/clock because many triangles are 32 or greater pixels so it gives the setup engine time to fill up the input buffers of the rasterizer. If it hits a group of small triangles that have been buffered up having 2 rasterizers allows it to finish the small triangles quicker than if it had a single 32 pixel rasterizer.
Having more rasterizers than setup units does nothing for you, because you can't feed the rasterizers any quicker than 1 per clock. Buffering gobs of triangles between setup and rasterization is going to be very costly and not very effective.

GF100 has 16 Polymorph engines and 4 rasterizers because it can then cull triangles at up to 16 per clock. This is a very useful ability. This is also where unified shader architectures made the biggest impact, because long vertex shaders creating off-screen or backfacing triangles would be processed at one per clock instead of one every 5 clocks or whatever, depending on the vertex shader. Fermi will now blast through them an order of magnitude faster.

R100/GF3 did a similar thing with pixels. Blast through the invisible ones as fast as possible, because they're going to come in clumps too large to buffer and they'll hold back the rest of the pipeline.
 
I don't think that's right. It rasterizes 1 per clock, and according to the B3D article the two rasterizers in the diagrams was just an error.
In this case it's the B3D article that is in error.

Having more rasterizers than setup units does nothing for you, because you can't feed the rasterizers any quicker than 1 per clock.
Works fine on Cypress though ;)
 
I think it's purely because setup splits triangles into screen-space tiles, and it's those that are rasterised, not triangles per se (unless, of course, the triangle entirely fits within a single tile).

Jawed
 
GF100 has 16 Polymorph engines and 4 rasterizers because it can then cull triangles at up to 16 per clock.

It can only cull 4 per clock.

In the edge setup stage, vertex positions are fetched and triangle edge equations are computed. Triangles not facing the screen are removed via back-face culling. Each edge setup unit processes up to one point, line, or triangle per clock.

 