NVIDIA Fermi: Architecture discussion

I don't see DS as relevant here. It's merely a vertex shader that consumes naked vertices and paints them with attributes (position, normal, colour...). That's how DS was implemented all these years in ATI's pre-D3D11 tessellation pipeline.

My interpretation of there being 4 PolyMorph engines, rather than 1, per GPC is that this enables each SM to take vertices from VS all the way through HS, TS, DS, GS, SO and pre-setup triangle operations to produce a completed triangle. By localising the processing of primitives like this, keeping them private to an SM, NVidia has minimised the amount of communication outside of each SM. This is important because communication is expensive and slow.
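
As a sketch of the pipeline that would stay local to each SM under this interpretation (the stage list and ordering are standard D3D11; mapping all of it to one SM is the reading above, not a confirmed implementation detail):

[code]
# Illustrative only: the D3D11 stages an SM plus its PolyMorph engine would
# run end-to-end before a finished triangle leaves the SM. The per-SM mapping
# is the interpretation above, not a documented detail of the hardware.
PER_SM_STAGES = [
    "VS",     # vertex shader: transforms vertices / control points
    "HS",     # hull shader: emits patch constants and tessellation factors
    "TS",     # fixed-function tessellator: generates domain points
    "DS",     # domain shader: evaluates the surface at each domain point
    "GS",     # geometry shader (optional amplification)
    "SO",     # stream out (optional)
    "setup",  # pre-setup triangle operations
]

# Only the finished triangle crosses the SM boundary, to the GPC's rasteriser:
print(" -> ".join(PER_SM_STAGES + ["raster (per GPC)"]))
[/code]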

Jawed

I think making tessellation 4 times faster is a much bigger gain. I mean, why would you keep the whole thing near a single SM and then use only 1/4 of the SMs? Would it be slower to use one PolyMorph engine for 4 SMs than a single one for 1 SM? Of course not!
The 4 PolyMorph engines will surely work in parallel, or else it doesn't make much sense.
 
I'm saying that 4 slow ones, close to each SM, are better than 1 big one shared by the GPC - despite the duplication in functionality, the lack of data movement across the SMs for the intricacies of moving work back-and-forth between the PolyMorph Engine and the ALUs is a win.
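
A toy cost model of that trade-off (numbers invented purely for illustration; the point is only that a shared unit pays a per-triangle transfer cost that local units avoid):

[code]
# Toy model: 4 slow per-SM geometry units vs 1 unit 4x as fast shared by a GPC.
# All numbers are made up for illustration.
TRIS = 1000

def local_units(tris, units=4, rate=1.0):
    # Work splits across per-SM units; intermediate data never leaves an SM.
    return (tris / units) / rate

def shared_unit(tris, rate=4.0, transfer=0.2):
    # Same aggregate rate, but every triangle's intermediate data must move
    # back and forth between the shared unit and the SM running its shaders.
    return tris / rate + tris * transfer

print(f"4 local units: {local_units(TRIS):.0f} time units")
print(f"1 shared unit: {shared_unit(TRIS):.0f} time units")
# Any nonzero transfer cost makes the shared design lose at equal raw rate.
[/code]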

Additionally geometry performance scales with SM count in a fine-grained fashion, while rasterisation scales more coarsely with GPC count. Rasterisation is most tightly coupled to ROP configuration and memory performance - things like shadow buffer rendering don't consume much pixel shading yet are a frequent bottleneck in overall frame render time.

So, overall, it makes sense to have these two levels of control over performance scaling for different sectors of the market. Additionally, NVidia has maximised die area efficiency by constraining the high-bandwidth intermediate data of geometry processing to a single SM for each triangle.

We'll have to wait and see what the instantaneous rates for geometry (e.g. tessellated vertex rate) are...

Jawed
 
I'm saying that 4 slow ones, close to each SM, are better than 1 big one shared by the GPC - despite the duplication in functionality, the lack of data movement across the SMs for the intricacies of moving work back-and-forth between the PolyMorph Engine and the ALUs is a win.

Absolutely. Intra-SM communication is via L1 and can be much faster than across SMs, which have to go via L2.

It is intriguing nevertheless to note that nv saw tess as a bigger bottleneck than the setup/raster stage. I wonder why?
 
It is intriguing nevertheless to note that nv saw tess as a bigger bottleneck than the setup/raster stage. I wonder why?
Both are increased in Fermi and both scale, so they appear to have equal weight in NVidia's architectural intentions.

Bandwidth is the most limited, so rasterisation will tend only to scale as fast as on-die efficiency gains occur (such as big caches for ROPs or multiple triangles per hardware thread). And those tend to occur on architectural inflections - nevertheless it all looks scalable to me.

Jawed
 
True, but what surprises me is: why not have 1 setup/raster unit per SM instead of per 4 SMs, when you have 1 tessellator per SM?

You can easily reach 1M triangles from an 8k model with higher levels of tessellation.
Maybe the new HS and DS stages add enough compute complexity to keep the SM busy for a long time (at least with high levels of tessellation), and keeping it in the SM is actually much faster with intra-SM communication.
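
A rough sanity check on that triangle count (assuming uniform integer tessellation of triangle patches, where a tessellation factor of t yields roughly t² triangles per patch - a simplification of the real D3D11 tessellator):

[code]
# Rough amplification arithmetic for the claim above. Assumes each of the
# 8k patches is uniformly tessellated with integer factor t, producing
# roughly t*t triangles per patch (a simplification).
base_patches = 8_000

for t in (4, 8, 11, 16):
    out_tris = base_patches * t * t
    print(f"tess factor {t:2d}: ~{out_tris / 1e6:.2f}M triangles")

# Factor 11 already lands near 1M triangles from an 8k-patch model,
# well inside D3D11's maximum tessellation factor of 64.
[/code]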
 
Cypress can keep up quite well despite tessellating through all 20 SIMDs. If you turn on tessellation it is clearly an additional burden on the rendering pipeline and your framerates will drop even on Fermi. The question is how much.
It's quite suspicious that they didn't show fps in the Heaven benchmark with and without tessellation, as no one would care about Cypress in the graph when Hemlock would beat GF100.
 
True, but what surprises me is: why not have 1 setup/raster unit per SM instead of per 4 SMs, when you have 1 tessellator per SM?
Relative density of computation versus communication - and not forgetting that rasterisation is fundamentally bandwidth limited (no point going faster) and setup is fundamentally rasterisation limited (no point going faster).

You're arguing for 16 quad rasterisers (doesn't make sense to go smaller than a quad). But that's optimising rasterisation for a scenario where the rest of the GPU can't, and wouldn't need to, keep up.

You could make these 16 quad rasterisers 1/4 rate, but any saving you've gained on the per-rasteriser implementation is now lost on the huge increase in interconnectedness required to ensure that triangles are rendered in the correct order. That's a 16x16-way crossbar just to ensure setup and early-Z function correctly.

ATI massively simplified here by only parallelising rasterisation and making setup broadcast triangles to all rasterisers. This way there's no ordering headache and communication is trivial.
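
A toy illustration of that ordering headache (hypothetical scheme with sequence numbers and a reorder step - not how either vendor's hardware actually works):

[code]
# Toy illustration of why fully parallel rasterisers need ordering machinery.
# Hypothetical scheme: setup tags triangles with sequence numbers, deals them
# round-robin to 4 rasterisers, and a merge step restores API order so that
# early-Z and blending stay correct. Not a description of real hardware.
triangles = [f"tri{i}" for i in range(8)]

rasterisers = 4
buckets = [[] for _ in range(rasterisers)]
for seq, tri in enumerate(triangles):
    buckets[seq % rasterisers].append((seq, tri))

# Completion order interleaves across rasterisers, i.e. not API order:
finished = [item for bucket in buckets for item in bucket]

# The reorder step the interconnect has to pay for:
for seq, tri in sorted(finished):
    print(seq, tri)
[/code]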

Jawed
 
You only need a massive crossbar to serialize in a single cycle ... it's cheaper with a bit more latency. The hierarchical Z cache, without using NVIDIA's tile coalescing approach, would be a bigger problem I think.
 
Relative density of computation versus communication - and not forgetting that rasterisation is fundamentally bandwidth limited (no point going faster) and setup is fundamentally rasterisation limited (no point going faster).

Could you explain this?
 
ATI massively simplified here by only parallelising rasterisation and making setup broadcast triangles to all rasterisers. This way there's no ordering headache and communication is trivial.

Jawed

The 5000 series looks more like a 4000 refresh, so I don't think they simplified anything, but they have had real cards out for several months now. The flaws of tessellation will only come out when real games ship from real developers; in-house demos don't mean too much.
I would rather question whether GF100's forward-looking architecture will be used in the near future, or whether you will need another generation until the rest of the chip (bandwidth, number of cores) catches up with the rasterisers.
With multiple displays and 3 times the resolution you can bet that rasterisation won't be your main problem.
 
I would rather question whether GF100's forward-looking architecture will be used in the near future, or whether you will need another generation until the rest of the chip (bandwidth, number of cores) catches up with the rasterisers.

Don't Dirt2 and Stalker already put a bit of a smackdown on Cypress running DX11? If GF100 is faster at DX11 bits it should be readily apparent even on current and upcoming games.
 
Don't Dirt2 and Stalker already put a bit of a smackdown on Cypress running DX11? If GF100 is faster at DX11 bits it should be readily apparent even on current and upcoming games.

Stalker DX11 is faster than Stalker DX10. You turn on tessellation and it's slower.
 
Sure, but tessellation is an integral part of DX11 so I'm not sure why you're making the distinction. After all we are discussing GF100's geometry performance. The increased performance with tessellation off is due to replacing the pixel shader SSAO algorithm with a compute shader. Who knows, maybe GF100's cache architecture helps in that scenario as well.
 
In Dirt 2 DX11 you have full floating-point HDR lighting and high-definition ambient occlusion, and I think the biggest performance eater is the full-screen-resolution post processing (in DX9 it runs at quarter size).
Tessellation has only minor usage in Dirt 2 (the majority of the benchmark has no tessellation on screen at all) and still your framerate shows a massive difference. In Dirt 2 DX11 the shader usage is a little excessive. So GF100 could show the same fps drop between DX9 and DX11 in Dirt 2.


The biggest advantage of GF100 in the near future will still be the 512 CUDA cores and the new cache.
I just want to say that expecting the same speed with tessellation on as off is quite absurd. The cards still use a unified shader architecture, so if you have additional pipeline stages (the new tessellation stages) and you use them with the same number of cores as before, you will get fewer fps. That should be quite logical.
 
Stalker games run far slower than they should for what they are rendering, IMO.
I really don't see anything in these games that requires highest-end hardware, even at 1080p + AA. We need properly coded games, not enormous amounts of power thrown at bad code. I totally agree with "canuck" and zalbard.
 
The biggest advantage of GF100 in the near future will still be the 512 CUDA cores and the new cache.
I just want to say that expecting the same speed with tessellation on as off is quite absurd. The cards still use a unified shader architecture, so if you have additional pipeline stages (the new tessellation stages) and you use them with the same number of cores as before, you will get fewer fps. That should be quite logical.

Cypress has twice the ALUs of Juniper but only one setup engine and one tessellation unit. That results in a bigger performance drop in the Unigine benchmark with tessellation: without tessellation the 5870 is 81% faster than Juniper, with it only 58% faster.
http://www.hardocp.com/article/2009/11/06/unigine_heaven_benchmark_dx11_tessellation/

No, nVidia's advantage in tessellation comes from the higher geometry performance and the four triangles/clock. I think it's possible that we will see no performance hit with tessellation in Stalker and a low one in the Unigine benchmark.
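
An Amdahl-style back-of-the-envelope on those numbers (my own toy model, not anything from the article: frame time = a parallel shader part that scales with ALU count plus a serial setup/tessellation part that doesn't):

[code]
# Back-of-the-envelope Amdahl model for the Heaven numbers above.
# Toy assumption: Cypress = Juniper with 2x the parallel (ALU) resources
# and an identical serial setup/tessellation path.
# speedup = 1 / ((1 - s)/p + s), with s = serial fraction, p = 2.

def serial_fraction(speedup, p=2.0):
    # Invert Amdahl's law to recover the serial fraction s.
    inv = 1.0 / speedup
    return (inv - 1.0 / p) / (1.0 - 1.0 / p)

print(f"tess off (1.81x): serial ~{serial_fraction(1.81):.0%} of frame time")
print(f"tess on  (1.58x): serial ~{serial_fraction(1.58):.0%} of frame time")
# Serial share grows from roughly 10% to 27% - consistent with one shared
# setup/tessellation path becoming the bottleneck once tessellation is on.
[/code]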
 
Sure, but tessellation is an integral part of DX11 so I'm not sure why you're making the distinction. After all we are discussing GF100's geometry performance. The increased performance with tessellation off is due to replacing the pixel shader SSAO algorithm with a compute shader. Who knows, maybe GF100's cache architecture helps in that scenario as well.

Just going by what I remember of the benchmarks. Who knows if Stalker is a good candidate for tessellation anyway. I've seen pics on various forums and I don't see any visual differences.

It also scales very well with video RAM.
 
Stalker games run far slower than they should for what they are rendering, IMO.
I really don't see anything in these games that requires highest-end hardware, even at 1080p + AA. We need properly coded games, not enormous amounts of power thrown at bad code. I totally agree with "canuck" and zalbard.
[off]I think you're confusing terms. STALKER's X-Ray engine is not so bad; it's a bit old, which shows in the absence of geometry LODs, but in pixel shading it's still a best-in-class engine. It's fully deferred, with many of the newest effects like parallax occlusion, HD ambient occlusion, sun shafts and soft penumbra shadows, and it's still fast on top cards. There is also some old content in the game that comes from the dark age of DX8 (Stalker content development started in 2000!), but it's nicely hidden by detail textures and normal maps, and in DX11 by tessellation. Also very intriguing now is the second top Ukrainian game, Metro 2033: it uses tessellation on models, PhysX and some cute effects on compute shaders like OIT, cinematic depth of field and MLAA.[/off]
 
Cypress has twice the ALUs of Juniper but only one setup engine and one tessellation unit. That results in a bigger performance drop in the Unigine benchmark with tessellation: without tessellation the 5870 is 81% faster than Juniper, with it only 58% faster.
http://www.hardocp.com/article/2009/11/06/unigine_heaven_benchmark_dx11_tessellation/

No, nVidia's advantage in tessellation comes from the higher geometry performance and the four triangles/clock. I think it's possible that we will see no performance hit with tessellation in Stalker and a low one in the Unigine benchmark.

You mean right now. We have no idea what's causing performance issues in Unigine. It could even be drivers for all we know.

Best bet is to wait till the game is out and we have a few real-world benchmarks out there to test with.
 