You have to rasterise down to the hierarchical-Z resolution to cull in the first place ... so there's a part of the pipeline, sitting behind setup proper, that does rasterisation ... let's just call it the rasteriser, okay?
(Hierarchical fragment rejection is nice, even better with a fast path for small triangles, but it doesn't make sense to count it as part of setup.)
The output of setup is coarse rasterisation: screen-space, tile-resolution rasterisation (plus triangle data, of course).
So as long as the early-Z system works at that coarse granularity, you can easily and conservatively reject dozens of small, tessellation-generated triangles that all fit within a single screen-space tile.
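To make the idea concrete, here's a purely illustrative sketch of that tile-resolution rejection test in C. All the names, the tile size and the depth convention (smaller z = nearer) are my assumptions, not anybody's actual hardware:

```c
#include <stdbool.h>

#define TILE_SIZE 8  /* assumed screen-space tile edge in pixels */

typedef struct {
    float min_x, min_y, max_x, max_y;  /* screen-space bounding box */
    float min_z;                       /* nearest depth of the triangle */
} TriBounds;

/* hz_max[ty * tiles_x + tx] holds the farthest depth stored in each tile.
 * A triangle is conservatively hidden if its nearest point is farther
 * than the farthest depth already stored for every tile it touches. */
bool hiz_reject(const TriBounds *t, const float *hz_max, int tiles_x)
{
    int tx0 = (int)t->min_x / TILE_SIZE;
    int ty0 = (int)t->min_y / TILE_SIZE;
    int tx1 = (int)t->max_x / TILE_SIZE;
    int ty1 = (int)t->max_y / TILE_SIZE;

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            /* Nearer than the tile's farthest stored depth:
             * potentially visible here, so we can't cull. */
            if (t->min_z < hz_max[ty * tiles_x + tx])
                return false;
    return true;  /* hidden in every tile it touches: cull */
}
```

The point is that the test is per-tile, not per-pixel, so a whole cluster of tessellated micro-triangles sharing one tile can be rejected with a single comparison each.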
GS is the most obvious place to do this kind of pre-setup culling, because it's the first time that a post-tessellation triangle comes into existence.
I'm wondering if it's possible to move all the non-position attribute calculation out of DS into GS (e.g. normal or colour per vertex). GS can decide whether to cull the triangle, so that it never reaches setup. If GS does emit the triangle, it just makes sure that all the attributes are generated. This is normal stuff for GS. Manipulating the shaders like this is something the driver can do.
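The per-triangle decision a GS would make is cheap. Here's a sketch in C of one such test, signed screen-space area for back-face and degenerate culling; the names and the epsilon threshold are illustrative assumptions on my part:

```c
#include <stdbool.h>

typedef struct { float x, y; } Vec2;  /* projected screen position */

/* Twice the signed area of the triangle (counter-clockwise positive). */
static float signed_area2(Vec2 a, Vec2 b, Vec2 c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

/* Returns true if the GS should emit the triangle (and only then go on
 * to compute the non-position attributes), false if it can be culled
 * before setup ever sees it. */
bool gs_should_emit(Vec2 a, Vec2 b, Vec2 c)
{
    const float eps = 1e-6f;  /* assumed degenerate-area threshold */
    return signed_area2(a, b, c) > eps;  /* culls back-facing and zero-area */
}
```

Only triangles that pass this test pay for attribute evaluation, which is exactly why moving the attribute work out of DS and behind the cull could be a win.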
Trying to cull stuff pre-tessellation is another ballgame, as that paper you referred to earlier indicates. All I'm suggesting is that even without that kind of technique, there are opportunities for NVidia to improve the performance of setup - either by culling triangles before they get there, or culling them before they're exported.
Regardless, I'm hopeful that NVidia's implemented setup as a kernel, making setup scalable. Though I'd still like to see evidence that tessellation is likely to make GPUs setup-limited in games (not synthetics). Being rasterisation-, fillrate- or shader-limited will still very much be the norm, and tessellation only increases pressure on those. It's really a question of whether setup becomes a bottleneck due to tessellation.
Oh, and I suppose it's worth asking: is setup at 1 triangle per clock (in GPUs that work that way) because the early-Z system can't run any faster? Is early-Z the real bottleneck? If so, perhaps that's the heart of NVidia's improvements in Fermi.
Jawed