NVIDIA Fermi: Architecture discussion

Does this mean that the aggregate rasterisation area will be smaller than GT200's on mainstream derivatives?
We don't know yet, raster area may be variable (I'd expect so). If I'm honest, my biggest wonder in the last few days is whether there's really more than one unit at all. Parallelisable setup into one fixed-area rasteriser makes some sense, where it can work on up to four input triangles in a clock.
 
It's currently 8 pixels per clock per raster unit. If triangles are larger than 32 pixels you don't necessarily benefit; you only move the bottleneck to the rasterizers instead of triangle setup. Mainstream parts will be affected by Nvidia's choice of implementation, i.e. their number of GPCs.

Ok, now I get it.

Thanks!
 
I believe that if they keep this design going forward, then if and when they put support for 3+ monitors on one card, they'd have a leg up on ATI unless ATI changes its setup rate as well. The number of triangles being rendered across, say, three 24" WS LCDs is 3x as many as on a single 24". Take Crysis at 5-6M triangles per scene: across 3 monitors you get 15-18M. For a good framerate of, say, an average 60 FPS, you're talking 900M to 1.08B triangles per second needed. So if GTX3xx can do 2400-2800M triangles/sec, had they done even just 3 monitors on a single-GPU card, I'd say they'd have a very clear and distinct advantage.
I don't think the setup rate will in any way help to run more displays, as the primary limitation there is simply the fill rate. Basically, yes, you're drawing more triangles, but you're also drawing more pixels, so the pixel/triangle ratio doesn't change.
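For what it's worth, the arithmetic above checks out if you take the earlier poster's 5-6M triangles/scene estimate at face value (that figure is his assumption, not a measured number):

```python
# Rough sanity check of the triangle-rate arithmetic above; the 5-6M
# triangles per scene for Crysis is the earlier poster's estimate.
tris_per_scene = 6_000_000      # upper estimate for one monitor
monitors = 3
fps = 60
tris_per_sec = tris_per_scene * monitors * fps
print(tris_per_sec)             # 1_080_000_000, i.e. ~1.08B triangles/sec
```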
 
My interpretation is that the aggregate raster output rate is 32 pixels per half shader clock. Depending on where the shader clocks go and what the L2/ROP domain is set to, the bottleneck could shift between raster and ROP depending on the exact configuration and clocks of a given derivative.
 
So... It's actually doing four quarter-triangles per clock? More detail, but not faster...?

That would depend on the size of the triangles.
If it were 4 triangles that each need 32 pixels scanned, it would be 4 quarter-triangles per clock.
If it were 4 tiny triangles, it would be 4 whole triangles per clock.

In the case of a single 32-pixel raster block, it's 1 big 32-pixel triangle per clock.
If it's a tiny triangle that covers 8 pixels, it's one tiny triangle per clock.
If there were 4 8-pixel triangles, still 1 triangle per clock.
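The cases above can be captured in a toy model. This is purely illustrative and assumes a triangle is never split across rasterizers, which is exactly the assumption under discussion:

```python
import math

def raster_clocks(tri_pixels, n_rasters, pix_per_raster):
    """Toy model: each triangle is tied to one rasterizer (a triangle is
    not split across units - an assumption, not confirmed behaviour),
    and each rasterizer scans pix_per_raster pixels of its triangle per
    clock. Returns the total clocks to rasterize the batch."""
    free = [0] * n_rasters              # clock at which each unit is free
    for pixels in tri_pixels:
        r = free.index(min(free))       # earliest-free unit takes the tri
        free[r] += math.ceil(pixels / pix_per_raster)
    return max(free)

print(raster_clocks([8] * 4, 4, 8))    # 1: four tiny triangles in one clock
print(raster_clocks([32], 4, 8))       # 4: one 32-pixel tri on an 8-wide unit
print(raster_clocks([32] * 4, 4, 8))   # 4: four "quarter-triangles" per clock
print(raster_clocks([8] * 4, 1, 32))   # 4: a single 32-wide unit, 1 tri/clock
```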
 
It's unclear if GF100 can achieve full rate without tessellation, but it should be easy to test with a custom app. The potential limitation is that there is a single index buffer per draw call, so they need to parallelize processing of the index buffer to make a single draw command run faster than 1 triangle per clock. This is non-trivial.

Isn't that also the case for tessellated primitives? There is still a single index buffer of vertices/control points per draw call. If each rasterizer is indeed separate and independent then Rys' theory doesn't work, and it would mean that Nvidia has to sustain close to 4 triangles per clock in order to avoid being rasterization limited, even on a fully stocked GF100.
 
Isn't that also the case for tessellated primitives? There is still a single index buffer of vertices/control points per draw call. If each rasterizer is indeed separate and independent then Rys' theory doesn't work, and it would mean that Nvidia has to sustain close to 4 triangles per clock in order to avoid being rasterization limited, even on a fully stocked GF100.
Even if the index buffer is processed at 1 prim/clock, amplification in the tessellation stage can keep the setup/rasterizers busy. That's what I was trying to say. Not sure I understood your question.

Just noting that I'm not saying GF100 can't always run at 4 prims/clock. I don't know.
 
Well if I understand you correctly you're saying it's the amplification during tessellation that generates work for multiple raster units. But that doesn't explain how the 16 different tessellation units get fed in the first place - your draw call still has to be parcelled out.

Basically, if that's not the case then in non-tessellation scenarios GF100 would only have a rasterization rate of 8 pixels per clock, which would be a bit ridiculous (i.e. way too low).
 
It would take time to feed the 16 tessellators, but if each tessellator is amplifying eventually the input will catch up to the output.

Consider Cypress: it can set up 1 prim/clock and rasterizes 2 triangles, each at 16 pixels/clock. It can still run faster than 16 pixels/clock because many triangles cover 32 pixels or more, which gives the setup engine time to fill up the input buffers of the rasterizers. If it then hits a group of small triangles that have been buffered up, having 2 rasterizers allows it to finish the small triangles quicker than if it had a single 32-pixel rasterizer.

The same principle applies to GF100.
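The buffering argument can be sketched with a toy pipeline simulation. Everything here is an idealization I'm assuming for illustration: setup emits exactly 1 triangle/clock, buffering between setup and raster is unbounded, and a triangle is never split across rasterizers:

```python
import math

def frame_clocks(tri_pixels, n_rasters, pix_per_raster):
    """Toy pipeline: setup emits triangle i at clock i+1 (1 prim/clock);
    each rasterizer pulls the next buffered triangle when free and spends
    ceil(pixels / pix_per_raster) clocks scanning it. All rates and the
    unbounded buffer are assumptions for illustration only."""
    free = [0] * n_rasters                  # clock at which each unit is free
    finish = 0
    for ready, pixels in enumerate(tri_pixels, start=1):
        r = free.index(min(free))           # earliest-free rasterizer
        start = max(free[r], ready)         # can't start before setup emits it
        free[r] = start + math.ceil(pixels / pix_per_raster)
        finish = max(finish, free[r])
    return finish

# Eight large (64-px) triangles followed by eight tiny (8-px) ones:
# the big triangles let setup run ahead, so two 16-wide rasterizers
# chew through the buffered small triangles faster than one 32-wide unit.
batch = [64] * 8 + [8] * 8
print(frame_clocks(batch, 2, 16))   # 22 clocks with two 16-wide rasterizers
print(frame_clocks(batch, 1, 32))   # 25 clocks with one 32-wide rasterizer
```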
 
Yeah the buffering between the tessellator and rasterizer is clear but I thought we were talking about how draw calls are parcelled out to the tessellators in the first place?
 
I forgot the obvious thing, which is that patches are specified as lists, so it should be easy to split these draws between the tessellators. Or you can send separate objects (draws) to each cluster.
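To illustrate why lists are the easy case: a triangle (or patch) list can be cut on primitive boundaries, since no indices are shared across the cut. The splitting scheme below is hypothetical, just to show the idea; strips and fans can't be split this trivially, and in real hardware primitive order would still have to be restored downstream:

```python
def split_triangle_list(indices, n_units):
    """Hypothetical sketch: split a triangle-list index buffer on 3-index
    boundaries so n_units front ends can process chunks in parallel.
    Round-robin distribution; order restoration is ignored here."""
    assert len(indices) % 3 == 0, "triangle list must be a multiple of 3"
    tris = [indices[i:i + 3] for i in range(0, len(indices), 3)]
    chunks = [[] for _ in range(n_units)]
    for t, tri in enumerate(tris):
        chunks[t % n_units].extend(tri)
    return chunks

print(split_triangle_list(list(range(12)), 2))
# [[0, 1, 2, 6, 7, 8], [3, 4, 5, 9, 10, 11]]
```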
 
Consider Cypress, it can setup 1 prim/clock and rasterizes 2 triangles, each at 16 pixels/clock.
I don't think that's right. It rasterizes 1 per clock, and according to the B3D article, the two rasterizers shown in the diagrams were just an error.

It can still run faster than 16 pixels/clock because many triangles are 32 or greater pixels so it gives the setup engine time to fill up the input buffers of the rasterizer. If it hits a group of small triangles that have been buffered up having 2 rasterizers allows it to finish the small triangles quicker than if it had a single 32 pixel rasterizer.
Having more rasterizers than setup units does nothing for you, because you can't feed the rasterizers any quicker than 1 per clock. Buffering gobs of triangles between setup and rasterization is going to be very costly and not very effective.

GF100 has 16 Polymorph engines and 4 rasterizers because it can then cull triangles at up to 16 per clock. This is a very useful ability. This is also where unified shader architectures made the biggest impact, because long vertex shaders creating off-screen or backfacing triangles would be processed at one per clock instead of one every 5 clocks or whatever, depending on the vertex shader. Fermi will now blast through them an order of magnitude faster.

R100/GF3 did a similar thing with pixels. Blast through the invisible ones as fast as possible, because they're going to come in clumps too large to buffer and they'll hold back the rest of the pipeline.
 
I don't think that's right. It rasterizes 1 per clock, and according to the B3D article the two rasterizers in the diagrams was just an error.
In this case it's the B3D article that is in error.

Having more rasterizers than setup units does nothing for you, because you can't feed the rasterizers any quicker than 1 per clock.
Works fine on Cypress though ;)
 
I think it's purely because setup splits triangles into screen-space tiles, and it's those that are rasterised, not triangles per se (unless, of course, the triangle entirely fits within a single tile).

Jawed
 
GF100 has 16 Polymorph engines and 4 rasterizers because it can then cull triangles at up to 16 per clock.

It can only cull 4 per clock.

In the edge setup stage, vertex positions are fetched and triangle edge equations are computed. Triangles not facing the screen are removed via back-face culling. Each edge setup unit processes up to one point, line, or triangle per clock.

 