NVIDIA Fermi: Architecture discussion

Why would the lesser card start choking? It's going to be doing less work in every other area too. Unless we have real data on it, there's no way to say whether it would choke or whether it just scales with the rest of the workload.

What I'm saying is, and trying to confirm, if you disable 1, 2 and 3 GPCs doesn't the "normal" 32-pixel triangle rate drop to 525 Mtri/sec, 350 Mtri/sec, 175 Mtri/sec, respectively, using the suggested clocks? 175 Mtri/sec is way back at GeForce 3 levels, isn't it?
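For what it's worth, those numbers fall out of a simple back-of-the-envelope model. Here's a minimal sketch, assuming one raster engine per GPC covering 8 pixels/clock and a raster clock around 700 MHz; both figures are my assumptions for illustration, not confirmed specs:

```python
# Back-of-the-envelope check of the Mtri/sec figures quoted above.
# Assumed (not confirmed): one raster engine per GPC at 8 pixels/clock,
# raster domain running at ~700 MHz.
RASTER_CLOCK_MHZ = 700
PIXELS_PER_CLOCK_PER_GPC = 8
TRIANGLE_SIZE_PIXELS = 32  # the "normal" triangle size in the question

def mtri_per_sec(active_gpcs):
    """Peak 32-pixel-triangle rate with the given number of GPCs enabled."""
    clocks_per_triangle = TRIANGLE_SIZE_PIXELS / PIXELS_PER_CLOCK_PER_GPC
    return (RASTER_CLOCK_MHZ / clocks_per_triangle) * active_gpcs  # Mtri/sec

for gpcs in (4, 3, 2, 1):
    print(gpcs, "GPC(s):", mtri_per_sec(gpcs), "Mtri/sec")
# 4 -> 700, 3 -> 525, 2 -> 350, 1 -> 175 Mtri/sec
```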
 
What I'm saying is, and trying to confirm, if you disable 1, 2 and 3 GPCs doesn't the "normal" 32-pixel triangle rate drop to 525 Mtri/sec, 350 Mtri/sec, 175 Mtri/sec, respectively, using the suggested clocks? 175 Mtri/sec is way back at GeForce 3 levels, isn't it?
GF3 couldn't push out 32 pixels/clock, so if you're talking 32-pixel triangles its actual prim rate is probably 1/8th of that.
 
Why do you think that will matter much? With reuse the ratio of (u,v) to prims should be pretty close to one.

Well, because if exactly one (u,v) coordinate is sent for domain shading each clock, then the amplification amounts to a bandwidth saving only, and tessellation into many small triangles won't be able to keep 1600 ALUs busy, since they'll all be waiting for rasterization of the next small triangle, which is going to take a few clocks to pop out. What you'd want is for the (u,v)s to be sent to 64 different groups of domain-shading ALUs, so that you can parallelize the tessellation as much as possible and not have those ALUs sitting idle.

On Fermi, you can be working on 16 different (u,v) values, separately domain shading, setting up the triangles, and rasterizing. So even if a single polymorph engine can only tessellate one set of coordinates per clock, the chip can have 16 of them going, as well as set up 4 outputs from domain shaders each clock. It's less bottlenecked.
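To put that pipelining argument in very rough terms, here's a toy model of where the bottleneck sits. The 16 tessellating units and 4 setups per clock come from the post above; treating everything as an ideal steady-state rate is my simplification:

```python
# Idealised steady-state model: sustained small-triangle throughput is
# capped by the slower of (u,v) emission from the tessellators and
# triangle setup. Numbers are illustrative, not measured.
def small_tris_per_clock(tess_units, uv_per_clock_per_unit, setups_per_clock):
    return min(tess_units * uv_per_clock_per_unit, setups_per_clock)

# A single tessellator feeding everything: setup and the ALUs starve.
print(small_tris_per_clock(tess_units=1, uv_per_clock_per_unit=1, setups_per_clock=4))   # 1
# Sixteen polymorph engines feeding four setup units: setup becomes the limit.
print(small_tris_per_clock(tess_units=16, uv_per_clock_per_unit=1, setups_per_clock=4))  # 4
```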
 
What I'm saying is, and trying to confirm, if you disable 1, 2 and 3 GPCs doesn't the "normal" 32-pixel triangle rate drop to 525 Mtri/sec, 350 Mtri/sec, 175 Mtri/sec, respectively, using the suggested clocks? 175 Mtri/sec is way back at GeForce 3 levels, isn't it?

If you're rendering 32+ pixel triangles, the bottleneck won't be the rasterization or triangle rate.
 
Regarding the 'new accelerated jittered sampling' - is there anything really new in it compared to fetch4/gather4, which was already in DX10.1?
 
Regarding the 'new accelerated jittered sampling' - is there anything really new in it compared to fetch4/gather4, which was already in DX10.1?
In the CSAA modes, the extra samples are now also used for alpha-to-coverage, which basically means that you can have much better anti-aliasing of alpha textures (e.g. fences, grass).
 
Regarding the 'new accelerated jittered sampling' - is there anything really new in it compared to fetch4/gather4, which was already in DX10.1?
As MfA and OpenGL guy uncovered in the other thread, there's a four-offset overloaded version of Gather in DX11, which wasn't commonly understood before. That's the jitter, with the acceleration looking like a Fermi sampler being able to fetch two samples with discrete offsets per clock, rather than one.
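As a purely CPU-side sketch of the difference being described here (this is just the semantics, not the actual HLSL intrinsics or any hardware path), assuming a single-channel texture such as a shadow map:

```python
# Plain gather returns the fixed 2x2 footprint around a coordinate; the
# four-offset variant lets each of the four fetches use its own
# programmer-chosen (jittered) integer offset.
def gather_2x2(tex, x, y):
    """Fixed 2x2 footprint, one value per texel."""
    return [tex[y][x], tex[y][x + 1], tex[y + 1][x], tex[y + 1][x + 1]]

def gather_offsets(tex, x, y, offsets):
    """Four fetches, each displaced by its own (dx, dy) from (x, y)."""
    return [tex[y + dy][x + dx] for (dx, dy) in offsets]

shadow_map = [[(ix + iy) % 7 for ix in range(8)] for iy in range(8)]
print(gather_2x2(shadow_map, 3, 3))
# A sparse, jittered footprint of the sort used for softer PCF shadows:
print(gather_offsets(shadow_map, 3, 3, [(-2, -1), (1, -2), (-1, 2), (2, 1)]))
```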
 
NVidia only did double rate, not quad rate here - so the pain appears to have been enough to have made them compromise.

Sparse sampling is hardly news to AMD, so they'll attack it in their own sweet time. I'm sure NVidia will get the nicest possible looking shadows into games, which is cool.

As to the mechanics of the texture caches, I suppose a patent rummage would suffice. Last time I studied ATI's was for older architectures that only have L1, years ago. In that scheme each filtering unit (i.e. x 4) has a private L1, so texels are duplicated across the L1s in the same quad-TMU. Effectively those L1s have the texels pre-aligned for the filtering units to use, so filtering never does unaligned cache fetches.
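To illustrate the "pre-aligned" idea with a generic scheme (this is not the specific design in ATI's patent, just the general principle): if texels are distributed, or duplicated, across four partitions by coordinate parity, then any 2x2 bilinear footprint touches each partition exactly once, so a filter unit never has to do a straddling fetch.

```python
# Generic parity-partition illustration, not ATI's actual cache design.
def partition(x, y):
    return (x & 1) | ((y & 1) << 1)  # 4 partitions keyed on (x, y) parity

def footprint_partitions(x, y):
    """Partitions touched by the 2x2 bilinear footprint anchored at (x, y)."""
    return sorted(partition(x + dx, y + dy) for dy in (0, 1) for dx in (0, 1))

# Wherever the footprint lands, it covers partitions {0, 1, 2, 3} once each.
assert all(footprint_partitions(x, y) == [0, 1, 2, 3]
           for x in range(16) for y in range(16))
print("every 2x2 footprint hits each partition exactly once")
```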

Jawed
 
Well, if we were to put aside for a moment the prevailing assumption that Nvidia's engineers are morons, I suspect they won't produce Fermi variants that are terribly unbalanced.
Well, you and I agree that scaling down by deleting SMs from GPCs is viable - so there's no consensus of a presumption of moronism in the scaling question ;)

I'm losing track in all the spin though (not from you specifically, Jawed). On one hand geometry processing isn't an issue even on high-end cards, yet at the same time it would be a travesty if Nvidia scales back geometry throughput on downmarket parts. And both angles are coming from the same people. So which one is it, 'cause it sure can't be both.
Post-reveal architectural confusion is fun, ain't it?

I don't think NVidia will scale back geometry for downmarket parts to the ridiculous extent that Silus is suggesting. Though I wouldn't bet against the barrel-scraping GF108 or whatever the hell the crappiest part is called being super shit in this respect - performance is not an option.

Jawed
 
Yep, you could do that, but it's yet another knock against Charlie's spin. Supporting different levels of tessellation performance is the least of the developer's problems these days.
An adaptive tessellation algorithm, i.e. screen resolution based, automatically scales anyway.
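A minimal sketch of the sort of screen-space adaptive factor meant here: derive the tessellation factor from the projected edge length, so triangle density tracks resolution automatically. The pinhole projection and the target of ~8 pixels per edge segment are arbitrary choices for illustration.

```python
# Hypothetical screen-space adaptive tessellation factor.
def edge_tess_factor(edge_len_world, distance, fov_scale, screen_height_px,
                     target_px_per_segment=8.0, max_factor=64.0):
    # Rough projected edge length in pixels (simple pinhole approximation).
    projected_px = edge_len_world / max(distance, 1e-6) * fov_scale * screen_height_px
    factor = projected_px / target_px_per_segment
    return min(max(factor, 1.0), max_factor)

# The same edge at 720p vs 1440p: the factor scales with resolution.
print(edge_tess_factor(1.0, 10.0, 1.0, 720))   # ~9
print(edge_tess_factor(1.0, 10.0, 1.0, 1440))  # ~18
```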

Jawed
 
Count me in the minority because I for one am excited for GF100 to be productized and released. Gotta update my Folding farm :D

One question re: the higher CSAA mode
does this reduce specular and texture aliasing as well, or only aliasing of alpha textures?
Just alpha textures. Basically the previous techniques would only allow a small number of gradients (typically 4) between total transparency and total opacity, giving a dithered look to the anti-aliasing of alpha textures. By offering up to 32 gradients (supposedly at a very small performance hit), this problem is largely solved.
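A quick sketch of why the sample count maps to gradient count: with alpha-to-coverage, an alpha value becomes "k of N samples covered", so N samples only give N+1 distinguishable levels between fully transparent and fully opaque. The contiguous mask below is a simplification; real hardware uses dithered mask patterns.

```python
# Simplified alpha-to-coverage: quantise alpha into a sample-coverage mask.
def alpha_to_coverage_mask(alpha, num_samples):
    covered = round(alpha * num_samples)   # how many samples get covered
    return (1 << covered) - 1              # simple contiguous mask, no dither

for n in (4, 32):
    levels = {bin(alpha_to_coverage_mask(a / 100, n)).count("1") for a in range(101)}
    print(f"{n} samples -> {len(levels)} gradient levels")
# 4 samples -> 5 levels (the banded look); 32 samples -> 33 levels.
```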
 
Sparse sampling is hardly news to AMD, so they'll attack it in their own sweet time.
I'm sure they will, but fuck these pre-emptive excuses ... it wouldn't have been such a "ruckus" without them. The presence of API features which make it more elegant for the developer to make use of sparse sampling is not irrelevant to how you design your future hardware, and to implicitly suggest it is, is just plain being disingenuous :(
Effectively those L1s have the texels pre-aligned for the filtering units to use, so filtering never does unaligned cache fetches.
How do they do that without simply using four 32-bit ports? (For uncompressed textures.) I guess you could use some complex addressing scheme which natively works in 2D and is specialized for quads, but then you can't really speak of alignment any more.

I just don't see a way to do ATI/NVIDIA's current texture caches without coalescing/multicast stages if they are really shared ... unless it's just 16-ported (in which case you might as well not share the caches at all).
 
I'm sure they will, but fuck these pre-emptive excuses ... it wouldn't have been such a "ruckus" without them. The presence of API features which makes it more elegant for the developer to make use of sparse sampling is not irrelevant to how you design your future hardware and to implicitly suggest it is is just plain being disingenuous :(
I hope you're not suggesting I'm being disingenuous :???:

How do they do that without simply using 4 32 bit ports? (For uncompressed textures.) I guess you could use some complex addressing scheme which natively works in 2D and is specialized for quads, but then you can't really speak of alignment any more.
Whoops, sorry, I misremembered the patent I was thinking of :cry: It's actually the 2-level design that I was thinking of:

http://v3.espacenet.com/publication...=A1&FT=D&date=20051013&DB=EPODOC&locale=en_V3


Jawed
 
And I agree too. I was just speculating on the option of disabling GPCs for other chips, and its effects, given the architectural improvements that we know of. That may or may not be NVIDIA's path for this, because they may be able to just leave some bits and pieces of other GPCs enabled in the chip, instead of disabling them completely.
There's a gulf between disabling and deleting. You speculated specifically on a 1 GPC chip to compete with Juniper.

Jawed
 
Just alpha textures. Basically the previous techniques would only allow a small number of gradients (typically 4) between total transparency and total opacity, giving a dithered look to the anti-aliasing of alpha textures. By offering up to 32 gradients (supposedly at a very small performance hit), this problem is largely solved.

I'd say that AA settings for GF100 should be a superset of those on current solutions. I don't expect 16x CSAA to have vanished, and in that case, for the majority of cases, 4x MSAA + 12x CSAA + 16x TAA should be more than enough and won't use as much memory/bandwidth as 32x CSAA. The latter sounds better suited to rather extreme alpha-test aliasing.
 
I hope you're not suggesting I'm being disingenuous :???:
"The presence of API features which makes it more elegant for the developer to make use of sparse sampling is not irrelevant to how you design your future hardware and to implicitly suggest it is is just plain being disingenuous"

Maybe I'm just misreading you.
Whoops, sorry, I misremembered the patent I was thinking of :cry: It's actually the 2-level design that I was thinking of:

http://v3.espacenet.com/publication...=A1&FT=D&date=20051013&DB=EPODOC&locale=en_V3
Gotta digest that.

PS. knowing your competitor is doing something is not relevant to the decision of spending effort in that area yourself??? No relevance whatsoever??? Aargh no sorry, it just doesn't compute for me ... do you really want me to believe you believe that? Ick ... disingenuous or just plain silly, take your pick.

PPS. brief descriptions which don't actually describe anything should not be allowed (oh wait, technically they aren't ... sigh, patent office).

PPPS. I don't think it's too relevant any more; there seems to be complete replication in all the L1s and the port is extremely narrow.
 
I'm not saying it's irrelevant - I'm saying they'll decide when it's relevant. It's potentially seriously costly, architecturally. It's halving ALU:TEX rate for a specific scenario.

The ATI compiler handles this case already, so they know what the performance is like. If we're gonna talk wanton crimes of omission then 4xZ per clock in R600 is the crime of the decade.

Jawed
 
I'm not saying it's irrelevant - I'm saying they'll decide when it's relevant.
If they didn't have the knowledge, say, 6 months ago, but decide now that it would have been relevant, then they will need a TARDIS first. Otherwise the decision was made for them.

I'm old and have a failing memory ... what part of the DirectX API was obscured for the 4xZ in the R600?
 
OK, we're going round in circles now, where is the evidence that anything was obscured?

:???:

If this was obscured, then I'd be similarly bemused/disappointed/disgusted.

Jawed
 
That's the thing, we're operating under the assumption that Microsoft and Nvidia were playing footsie behind AMD's back. Has anyone from AMD even confirmed that this was a surprise to them? Granted, it wouldn't be in their best interest to squash yet another Nvidia conspiracy theory :)
 