NVIDIA Fermi: Architecture discussion

ShaidarHaran · Jan 20, 2010

Re: triangle setup rate

are we 100% sure Fermi sets up 4x tri/clk (assuming 4 GPCs) in all situations?

I'm curious because this could finally be the answer to increased performance in MSFS.

MfA · Jan 20, 2010

Meh, I'm no journalist ... I wouldn't even know who to ask for official statements on this, and even if I did I don't think "to satisfy my personal curiosity" will convince them to spend time on me. So I just threw the conspiracy theory out there and hoped someone else would find out whether this is just a public document error or if it runs deeper. Someone who does run a site perhaps. It would be fine and dandy too if some of the AMD/IMG lurkers on this board could just say something like "Microsoft told us about this, the public documents are just a mess" (nudge nudge).

CarstenS · Jan 20, 2010

ShaidarHaran said:
Re: triangle setup rate

are we 100% sure Fermi sets up 4x tri/clk (assuming 4 GPCs) in all situations?

I'm curious because this could finally be the answer to increased performance in MSFS.

Official docs say "up to" - as always. My question wrt exactly this hasn't been answered yet. Makes me wonder…

KimB · Jan 20, 2010

CarstenS said:
Official docs say "up to" - as always. My question wrt exactly this hasn't been answered yet. Makes me wonder…

My understanding of this is that they have four parallel units, but those units may at times be stalled waiting for the results of other units. They've put a lot of work into making sure that these four parallel geometry units are used as optimally as possible, but in reality we can expect a 4x increase in hardly any aspect of geometry performance.

Ailuros · Jan 20, 2010

Razor1 said:
I'm not sure, just a guess.

Of course claimed values are usually peak values, but I can't figure out at the moment (since I'm way too tired) why you couldn't process at least 2 tris/clock to feed four raster units.

DavidGraham · Jan 20, 2010

ShaidarHaran said:
Ok, so we're still stuck at 1 tri/clk unless tessellating.

Too bad.

I must say that left me severely disappointed too! It means the chance of GF100 improving its overall performance thanks to the enhanced geometry design is low, in normal cases of course!

Add that to the not-so-huge texture improvements and possibly low clock speeds, and you get the picture of GF100 gaining about 80% (best case) over GTX 285's performance, and hence an even lower advantage over HD5870 (possibly 20%). However, in tessellation it will trounce GTX 285 by a huge margin.

Unless that is wrong, and there are 4 tri/clk in normal rendering.

ShaidarHaran · Jan 20, 2010

Hmm, perhaps I should rephrase my question then.

Can Fermi set up more than 1 non-tessellated tri/clk?

edit: I have a feeling this discussion should be in the other thread.

KimB · Jan 20, 2010

ShaidarHaran said:
Hmm, perhaps I should rephrase my question then.

Can Fermi set up more than 1 non-tessellated tri/clk?

From the architecture layouts, it seems it should be able to do better than 1/clock. Because it has four parallel geometry units as opposed to one monolithic unit, it seems extremely doubtful that there could be any hardwired limitation to one triangle per clock.

That said, other limitations, such as bandwidth or ability to parallelize non-tessellated triangles between the units, may prevent much performance improvement from having the additional geometry units.

CRoland · Jan 20, 2010

MfA said:
Meh, I'm no journalist ... I wouldn't even know who to ask for official statements on this, and even if I did I don't think "to satisfy my personal curiosity" will convince them to spend time on me. So I just threw the conspiracy theory out there and hoped someone else would find out whether this is just a public document error or if it runs deeper. Someone who does run a site perhaps. It would be fine and dandy too if some of the AMD/IMG lurkers on this board could just say something like "Microsoft told us about this, the public documents are just a mess" (nudge nudge).

Am I missing something or does it seem at least as likely that:
a) it was an honest omission and
b) the omission could hinder its use and actually hurt GF100's potential edge?

DavidGraham · Jan 20, 2010

Hey guys, could someone explain to me whether GF100 could output more than 1 tri/clk in non-tessellated situations or not, and why?

This question is in the other thread too.

Rys · Jan 20, 2010

ShaidarHaran said:
are we 100% sure Fermi sets up 4x tri/clk (assuming 4 GPCs) in all situations?

For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.

3dcgi · Jan 20, 2010

DemoCoder said:
Well, because if each clock exactly one (u,v) coordinate is sent for domain shading, then the amplification amounts to bandwidth saving only, and tessellation into many small triangles won't be able to keep 1600 ALUs busy, since they'll all be waiting for rasterization of the next small triangle, which is going to take a few clocks to pop out. What you'd want is for the (u,v) to be sent to 64 different groups of domain shading ALUs, so that you can parallelize the tessellation as much as possible and not have those ALUs sitting idle.

On Fermi, you can be working on 16 different (u,v) values, separately domain shading, setting up the triangles, and rasterizing. So even if the polymorph engine can only tessellate one set of coordinates per clock, it can do 16 of them, as well as set up 4 outputs from domain shaders each clock. It's less bottlenecked.

I agree it's less bottlenecked with more (u,v) generation, but that's really a latency issue. If all the ALUs are needed to balance the DS and setup, the one-(u,v)-per-cluster option would take a little longer to reach steady state, but it would still reach it. How much of a performance impact this has will be determined by the amount of work done before switching to a workload that requires a different balance. This is still on the assumption that there's close to one new (u,v) per setup primitive.

Obviously more (u,v)'s per clock is better, but determining how much better is the tricky part.

ShaidarHaran said:
Re: triangle setup rate

are we 100% sure Fermi sets up 4x tri/clk (assuming 4 GPCs) in all situations?

I'm curious because this could finally be the answer to increased performance in MSFS.

It's unclear if GF100 can achieve full rate without tessellation, but it should be easy to test with a custom app. The potential limitation is that there is a single index buffer per draw call, so they need to parallelize processing of the index buffer to make a single draw command run faster than 1x. This is non-trivial.
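
As a rough illustration of why this is awkward, here is a Python sketch that deals fixed-size batches of a triangle-list index buffer to several units. It is not a description of what GF100 actually does; `split_index_buffer`, `extra_vertex_fetches`, the batch size, and the sample mesh are all hypothetical. Each batch keeps its sequence number because results would still have to come out in API order, and vertices shared across batches end up being fetched by more than one unit.

```python
def split_index_buffer(indices, num_units, tris_per_batch):
    """Deal contiguous batches of a triangle-list index buffer round-robin to units."""
    if len(indices) % 3 != 0:
        raise ValueError("a triangle list needs a multiple of 3 indices")
    step = 3 * tris_per_batch
    work = [[] for _ in range(num_units)]
    for batch_id, start in enumerate(range(0, len(indices), step)):
        # Keep batch_id so a later stage could restore the original draw order.
        work[batch_id % num_units].append((batch_id, indices[start:start + step]))
    return work


def extra_vertex_fetches(indices, work):
    """Fetches beyond one per unique vertex, assuming each unit caches its own
    vertices perfectly but shares nothing with the other units."""
    per_unit = sum(len({i for _, batch in unit for i in batch}) for unit in work)
    return per_unit - len(set(indices))


# Four triangles sharing edges (hypothetical data), split across two units.
mesh = [0, 1, 2, 2, 1, 3, 2, 3, 4, 4, 3, 5]
work = split_index_buffer(mesh, num_units=2, tris_per_batch=1)
extra = extra_vertex_fetches(mesh, work)
print(work)
print("extra vertex fetches:", extra)
```

Larger batches cut the duplicated fetches but make load balancing coarser, which is one way to see why a single draw call doesn't trivially scale past 1x.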

Rys · Jan 20, 2010

It doesn't matter if it's tessellated by hardware or not, since you can draw tiny triangles all by yourself if you so wish.

MfA · Jan 20, 2010

CRoland said:
Am I missing something or does it seem at least as likely that:
a) it was an honest omission and

From the public docs ... sure. But if the other IHVs weren't made aware of it, then in combination with the fact that it's only at HLSL level (not the assembly level, where nothing can be hidden because it's used for drivers), it becomes rather hard to believe.

This is what I basically said in the first post about this ... still as valid as then. Stop making me repeat myself ... are you people just tag teaming to make me dig myself in ever deeper or what?

b) the omission could hinder its use and actually hurt GF100's potential edge?

I don't think DX11 engines are far enough in development for it to be an obstacle, especially if NVIDIA volunteers the work/code necessary to integrate it.

DavidGraham · Jan 20, 2010

Rys said:
It doesn't matter if it's tessellated by hardware or not, since you can draw tiny triangles all by yourself if you so wish.

Thanks Mr. Rys, but I have to wonder: you guys said that the reason nobody bothered to double the number of hardware rasterizers is that you have to figure out what to do when triangles overlap or share vertices. How is that different in GF100's situation? How did Nvidia overcome this seemingly difficult obstacle?

KimB · Jan 20, 2010

DavidGraham said:
Thanks Mr. Rys, but I have to wonder: you guys said that the reason nobody bothered to double the number of hardware rasterizers is that you have to figure out what to do when triangles overlap or share vertices. How is that different in GF100's situation? How did Nvidia overcome this seemingly difficult obstacle?

Basically it's a problem of out-of-order execution. Anand goes into it a little bit here:
http://www.anandtech.com/video/showdoc.aspx?i=3721&p=2

Though I must say that I was mistaken. The GF100 has 16 geometry units, not 4. So I think we can definitely expect faster geometry throughput all around. That said, the triangle setup is in the raster engine, of which there are four, so we should expect, in ideal conditions, that the GF100 can do 4 triangles/clock (I don't think the raster engine has the same out-of-order execution problems as the PolyMorph engine).

ShaidarHaran · Jan 20, 2010

Rys said:
For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.

I don't follow you here. To me this sounds like you are saying there is no benefit to this implementation.

Alexko · Jan 20, 2010

Rys said:
For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.

Does this mean that the aggregate rasterisation area will be smaller than GT200's on mainstream derivatives?

CarstenS · Jan 20, 2010

It's currently 8 pixels per clock per raster unit. If triangles are larger than 32 pixels you don't necessarily benefit; you only move the bottleneck from tri setup to the rasters. Mainstream parts will be affected based on Nvidia's choice of implementation, i.e. their number of GPCs.
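
To put rough numbers on that trade-off, here is a minimal Python sketch of a two-stage throughput model. It assumes, going only by figures quoted in this thread, 1 tri/clk of setup feeding 32 pixels/clk of rasterisation on prior hardware, versus 4 tri/clk feeding 4 raster units at 8 pixels/clk each on GF100. The function name `tris_per_clock` and the `PRIOR`/`GF100` parameter sets are made up for illustration, and the model ignores culling, overlap, and load imbalance between units.

```python
def tris_per_clock(tri_area_px, setup_per_clk, raster_px_per_clk):
    """Sustained triangles/clock when setup and raster run as a pipeline:
    the slower of the two stages sets the rate."""
    return min(setup_per_clk, raster_px_per_clk / tri_area_px)


# Assumed figures taken from the discussion, not from official documentation.
PRIOR = dict(setup_per_clk=1, raster_px_per_clk=32)
GF100 = dict(setup_per_clk=4, raster_px_per_clk=4 * 8)

for area in (4, 8, 16, 32, 64):
    old = tris_per_clock(area, **PRIOR)
    new = tris_per_clock(area, **GF100)
    print(f"{area:3d} px/tri: prior {old:.2f}  GF100 {new:.2f}  speedup {new / old:.1f}x")
```

In this toy model the gain shrinks from 4x at 8 pixels or fewer per triangle to nothing at 32 pixels and above, which is the point where the rasters rather than setup become the bottleneck on both designs.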

Psycho · Jan 20, 2010

Rys said:
For GF100, yes, but the aggregate rasterisation area is no bigger than prior hardware could rasterise in a clock.

So only half that of Cypress (per clock)?
Strange that they can think so differently about the balance.
