NVIDIA Fermi: Architecture discussion

Since Evergreen seems to do much worse with tessellation than people expected (that is, only one third the vertex throughput of the non-tessellated case), it raises the question of whether all that distributed geometry processing in Fermi is really necessary, or whether a non-distributed setup would be sufficient if you could still reach 1 tri/clock?
 
Since Evergreen seems to do much worse with tessellation than people expected (that is, only one third the vertex throughput of the non-tessellated case), it raises the question of whether all that distributed geometry processing in Fermi is really necessary, or whether a non-distributed setup would be sufficient if you could still reach 1 tri/clock?
Fermi is still benefiting in non-tessellated games where Evergreen is 1 tri/clock. Did you notice how several reviews mentioned that Fermi has a bigger advantage at lower resolutions (unless memory comes into play)? That's because the pixel load goes down but the vertex load stays the same, so proportionally the vertex load increases and becomes a bigger factor in the FPS you get.

Imagine game X at high resolution where Cypress spends 3ms on geometry heavy areas while Fermi spends 1ms on it, and both spend 15 ms on pixel-heavy work. That's 56 fps vs 63 fps, or 13% diff. At low resolution the geometry load stays the same but pixel load halves, making it 95 fps vs 118 fps, or 24% diff.
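For anyone who wants to poke at the numbers, here's a trivial back-of-the-envelope script (the frame times are the made-up ones above, not measurements):
Code:
# Back-of-the-envelope frame-rate maths: total frame time = geometry time
# + pixel time (ms), fps = 1000 / total. All frame times are hypothetical.
def fps(geometry_ms, pixel_ms):
    return 1000.0 / (geometry_ms + pixel_ms)

for label, pixel_ms in (("high res", 15.0), ("low res", 7.5)):
    cypress = fps(3.0, pixel_ms)   # assumed 3 ms of geometry work
    fermi = fps(1.0, pixel_ms)     # assumed 1 ms of geometry work
    print(f"{label}: {cypress:.1f} vs {fermi:.1f} fps "
          f"({fermi / cypress - 1:.1%} difference)")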

But yeah, it looks like even 1 tri/clk for tessellation would help ATI a lot.
 
That makes the 1/3 factor for Cypress even more disappointing. Also, remember that each vertex generated by the tessellator creates two triangles, so Cypress is only generating 1 vertex every six clocks.
Normally the triangles are in a mesh, so you're really talking about one extra triangle per extra vertex in general.

It could be due to a data flow bottleneck. Damien mentions that reading multiple vertices per clock slows down non-Fermi GPUs. Not sure if he's talking about domain shaders or vertex shaders, though they're basically the same thing. Does Evergreen still use a separate vertex cache?
I think it's a single cache for texels and vertices now.

I really doubt it. Remember that tessellation factors are floating point numbers, allowing smooth transitions. All vertices are defined by the same formula, so there's no need for iteration.

The factor is floating point, but the number is encoded:
  • the tessellation factor is determined from the integer part: 1, 3, 5, 7, etc.
  • the interpolation factor is the fractional part, spanning a range of 0...1.999999
So there are distinct steps in the count of vertices generated by tessellation. The "popping" of these discretely appearing vertices is controlled by the fact that an interpolation factor close to "0" places the new vertex very close to the "zero" vertex it's based upon.
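Here's that encoding in toy script form (just the idea as described above, not the exact D3D11 fractional partitioning rules):
Code:
# Toy decode of a floating-point tessellation factor as described above:
# the effective level is the largest odd integer at or below the factor,
# and the remainder is the 0...1.999999 "interpolation factor" that keeps
# a newly appearing vertex close to the vertex it split off from.
def decode_factor(f):
    base = int(f)
    level = max(base if base % 2 else base - 1, 1)   # 1, 3, 5, 7, ...
    return level, f - level                          # (level, blend)

for f in (1.0, 2.0, 2.99, 3.0, 4.5, 5.5):
    level, blend = decode_factor(f)
    print(f"factor {f}: level {level}, interpolation {blend:.2f}")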

I'm thinking it's iterative simply because very high tessellation factors should be rare. This saves on parallelism in implementing the tessellator, i.e. fewer interpolations per clock. It also reduces the bus widths.

The tessellator doesn't even calculate the positions of the vertices. All it does is create room for the vertex in the pipeline and give 16-bit (0..1) barycentric coordinates to the domain shader. I'd be shocked if ATI didn't put in the maybe 10 million transistors needed to do that math quickly.
Well, this argument isn't much different from "hey why not build an uber-rasteriser, with a throughput of 512 fragments per cycle". Rasterisers are cheap by today's standards.

Given that the B3D article on Cypress said that a lot of shader time was spent in the domain shader, it could be data contention for the patch's control points, stalling the domain shader to the point of allowing only one control point to be read by one thread (vertex) every two cycles. That would suck...
I think AMD took a view on a balance of throughputs here - with the caveat that Evergreen's rasteriser/fragment-to-hardware-thread-allocation seems like it's on its last legs. There really isn't much point making tessellation faster when, apparently, the gains are limited.

(Also: a relevant part of Evergreen might have been lost in the feature-eviction ahead of the 40nm wreck.)

This is pretty simple geometry with a low resolution displacement map (in terms of features wrt resolution), though. You may not be able to be so adaptive in the real world.
Screen adaptive (e.g. z-depth-based) is the first step and should be trivial. Silhouettes, back-facing and all the rest, agreed, that's more troublesome.

Some nice stuff on tessellation here:

http://developer.amd.com/documentation/presentations/Pages/default.aspx

Jawed
 
Normally the triangles are in a mesh, so you're really talking about one extra triangle per extra vertex in general.
Think harder :p

It is possible that ATI didn't bother with caching the verts and just duplicated them, but that would be odd...
I think it's a single cache for texels and vertices now.
Well that sort of kills my theory. But why do you think Damien found that if you need to access multiple vertices in a shader the speed goes down?

The factor is floating point, but the number is encoded:
<snip>
I know, but the point I was making is that an iterative tessellation algorithm doesn't make sense. Even if it was used, it doesn't even explain 1 vert every 6 clocks, because you get lots of verts with each pass.

I'm thinking it's iterative simply because very high tessellation factors should be rare. This saves on parallelism in implementing the tessellator, i.e. fewer interpolations per clock. It also reduces the bus widths.
I don't see the savings, nor do I see any explanation for 1 vert every 3 or 6 clocks.

Well, this argument isn't much different from "hey why not build an uber-rasteriser, with a throughput of 512 fragments per cycle". Rasterisers are cheap by today's standards.
It's a very different argument when you consider scale. 512 fragments/clk including all the compressed encoding/decoding, Z loading, HiZ, etc. is way, way more expensive than generating a pair of [0..1] numbers every two clocks.

I think AMD took a view on a balance of throughputs here - with the caveat that Evergreen's rasteriser/fragment-to-hardware-thread-allocation seems like it's on its last legs. There really isn't much point making tessellation faster when, apparently, the gains are limited.
I can see how increasing setup/culling speed is a bit of a pain due to the implications throughout the pipeline, but this is just laziness. Tessellation is a self-contained fixed function unit. Factors go in, coordinates come out.

Screen adaptive (e.g. z-depth-based) is the first step and should be trivial. Silhouettes, back-facing and all the rest, agreed, that's more troublesome.
No, that's not what I'm talking about. I'm saying that the number of triangles output by this SDK sample is a lot lower than a game would need. The 35% and 11% numbers aren't very useful.
 
This is a continuation of the discussion about GF100 being more suitable to scale down to lower end variants in an efficient and cost-effective manner compared to GT200:

And I disagree. Splitting up four intricately communicating parts is more difficult than scaling down a part with a common base and varying numbers of independent SIMDs. There are a lot more loose ends to take care of in Fermi.

If that were truly the case, then how come NVIDIA has had so many problems in the past bringing lower end products to market in a timely and cost-effective manner after the high end product has already been released?

This is what Hardware Canucks had to say about GF100 architectural scaling:

Hardware Canucks said:
By now you should all remember that the Graphics Processing Cluster is the heart of the GF100. It encompasses a quartet of Streaming Multiprocessors and a dedicated Raster Engine. Each of the SMs consists of 32 CUDA cores, four texture units, dedicated cache and a PolyMorph Engine for fixed function calculations. This means each GPC houses 128 cores and 16 texture units. According to NVIDIA, they have the option to eliminate these GPCs as needed to create other products but they are also able to do additional fine tuning as we outline below.

Within the GPC are four Streaming Multiprocessors and these too can be eliminated one by one to decrease the die size and create products at lower price points. As you eliminate each SM, 32 cores and 4 texture units are removed as well. It is also worth mentioning that due to the load balancing architecture used in the GF100, it’s possible to eliminate multiple SMs from a single GPC without impacting the Raster Engine’s parallel communication with the other engines. So in theory, one GPC can have one to four SMs while all the other GPCs have their full amount without impacting performance one bit.

Focusing in on the ROP, Memory and Cache array we can see that while placed relatively far apart on the block diagram, they are closely related and as such they must be scaled together. In its fullest form, the GF100 has 48 ROP units grouped into six groups of eight and each of these groups is served by 128KB of L2 cache for a total of 768KB. In addition, every ROP group has a dedicated 64-bit GDDR5 memory controller. This all translates into a pretty straightforward solution: once you eliminate a ROP group, you also have to eliminate a memory controller and 128KB of L2 cache.

Scaling of these three items happens in a linear fashion, as you can see in the chart above, since in the GF100 architecture you can’t have ROPs without an associated amount of L2 cache or memory interface and vice versa. One way or another, the architecture can scale all the way down to 8 ROPs with a 64-bit memory interface.



So sure, one can argue that losing functional units will result in lower performance, but so what? For lower end cards, by far the most important thing is to keep costs as low as possible, and to keep die size as low (and yield as high) as possible, rather than keeping the same number of functional units as higher end cards just to maintain the same level of geometry throughput.
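Putting the per-group numbers from that quote together, the back end scales in lock-step like this (just the arithmetic from the quote, not a list of actual SKUs):
Code:
# Each ROP group, per the Hardware Canucks description, brings 8 ROPs,
# 128 KB of L2 and a 64-bit GDDR5 memory controller.
for groups in range(6, 0, -1):
    print(f"{groups} groups: {groups * 8:2d} ROPs, "
          f"{groups * 128:3d} KB L2, {groups * 64:3d}-bit bus")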
 
Looks like the SM to GPC ratio won't change, as it's 4:1 by default.

re: other thread where fillrates were discussed.
The 128-bit, 16ROP variant with one third the fillrate will be choking in what, 1680*1050? At this point geometry is probably the least of their concerns - texture and pixel fill comes in much earlier.

G92/GT21x probably stands a better chance with the same unitcount/spec when scaled down unless the fillrates are just a temporary driver gunk-up.
 
IIRC, it took NVIDIA quite some time to release lower end derivatives after their high end 8800 card was released. So while the performance of G92 proved to be very good for the money, I don't think that time to market was very stellar compared to the release of the high end product. Now, regarding GF100 derivatives, it's hard to say how good the performance will be since the cards are not yet available, but the pixel/texture fillrate will be heavily dependent not just on the sheer number of functional units, but on clock speeds too.
 
One of the nice things about tessellation is that it's scalable, so that the performance of the high-end cards can be utilized by games, while the low-end cards can still perform adequately at lower settings. Therefore I don't think we should be overly concerned that nVidia's lower-end parts won't have the geometry power of the higher-end ones.
 
Think harder :p
Yeah, the limit for maximum tessellation is 2 triangles per vertex, but lower tessellation levels (sporadic use or adaptive) make that lower. I suppose it's fair to say that in fact it will tend to be higher, because patches that are to be tessellated are likely to be parts of continuous meshes, such as characters or terrain or buildings, rather than bitty things dotted about the frame.
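To put numbers on that, a regular grid is a decent stand-in for a continuous mesh (hypothetical mesh, purely to show where the ratio ends up):
Code:
# Triangle-to-vertex ratio for an n x n grid of quads, each split into two
# triangles: (n+1)^2 vertices, 2*n^2 triangles, approaching the 2 tris per
# vertex limit as the mesh gets denser.
for n in (1, 2, 4, 16, 64):
    verts = (n + 1) ** 2
    tris = 2 * n * n
    print(f"{n:2d}x{n:<2d} grid: {verts:5d} verts, {tris:5d} tris, "
          f"ratio {tris / verts:.2f}")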

Well that sort of kills my theory. But why do you think Damien found that if you need to access multiple vertices in a shader the speed goes down?
Depends how the data's packed I suppose. I've never really spent any time looking at VS and my efforts to find representative HS and DS shaders have so far come to naught. I should try the DX SDK :p

I know, but the point I was making is that an iterative tessellation algorithm doesn't make sense. Even if it was used, it doesn't even explain 1 vert every 6 clocks, because you get lots of verts with each pass.
I think TS has a vertex-centric view with the triangles coming out in the wash, because tessellation factors are per edge, not per triangle.

I don't see the savings, nor do I see any explanation for 1 vert every 3 or 6 clocks.
I don't really understand the mechanics of the Xenos/R600 tessellator, and why the D3D11 tessellator is "materially" different to the extent that Evergreen has two distinct tessellators, one for each type of work.

Perhaps the Xenos tessellator is a per-triangle tessellator, and the limit case of 2 triangles per vertex results in the halving of throughput (since vertex rate determines triangle setup rate, rather than vice-versa)? But the D3D11 tessellator, which has to support much higher amplification, reverts to a slower algorithm that's edge-centric? :???:

It's a very different argument when you consider scale. 512 fragments/clk including all the compressed encoding/decoding, Z loading, HiZ, etc. is way, way more expensive than generating a pair of [0..1] numbers every two clocks.
Exactly. Generating 10 million visible triangles per frame whose average area is <1 pixel requires vastly more hardware support in the rasterisation/fragment shading/back-end part of the GPU, and is therefore pointless.

I can see how increasing setup/culling speed is a bit of a pain due to the implications throughout the pipeline, but this is just laziness. Tessellation is a self-contained fixed function unit. Factors go in, coordinates come out.
And most of a GPU's work is due to an order of magnitude of amplification derived from those resulting triangles.

I really think we'll find that extreme tessellation (sub-pixel triangles) on ATI is slow because hardware threads are basically empty and the rasteriser is mostly shooting blanks. I also raise a question on PTVC, and I suspect a limit of 512 hardware threads in flight also plays a significant part.

If you look back at old presentations about ATI tessellation (R600 era) you'll find hundreds of fps being quoted for 1 or more million triangles. Those numbers simply don't jibe with Heaven, and I think it's down to the sub-pixel triangles (whose pixel shaders are long-winded), not the absolute count of them - since we know that Heaven's absolute triangle counts are low (i.e. < 2 million - though obviously some multiple of that for extreme mode).

No, that's not what I'm talking about. I'm saying that the number of triangles output by this SDK sample is a lot lower than a game would need. The 35% and 11% numbers aren't very useful.
It's not triangle count that matters, it's area per triangle. Any kind of adaptive algorithm is going to stop before it paints multiple triangles per fragment. Otherwise just delete the MSAA hardware and totally re-model the fragment shading part of the fixed-function GPU architecture.
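Something like this is all I mean by "adaptive" here - a hypothetical screen-space heuristic (the target size and clamp are made up) that stops well before sub-pixel triangles:
Code:
# Hypothetical screen-space heuristic: pick a per-edge tessellation factor
# so tessellated edges come out around target_px pixels long, clamped so we
# never go sub-pixel and never exceed the API maximum of 64.
def edge_tess_factor(edge_length_px, target_px=8.0):
    return max(1.0, min(edge_length_px / target_px, 64.0))

for edge_px in (4, 16, 64, 256, 1024):
    print(f"{edge_px:4d} px edge -> factor {edge_tess_factor(edge_px):5.1f}")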

Which, incidentally, makes me wonder about GTX480's rasterisers:

http://forum.beyond3d.com/showpost.php?p=1414371&postcount=5449

If MSAA is killing GTX480's performance (or is it the AF?...), is there some interaction there between Z rasterisation and tessellation?

Though I remain cautious about the hardwareluxx graph until someone reproduces that HD5870 result, as the results/graphs here:

http://www.pcgameshardware.com/aid,...Fermi-performance-benchmarks/Reviews/?page=16

don't tally.

Jawed
 
That's not right, either.

Naked vertices produced by the tessellator only have u,v.
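(To be concrete about the split: the TS only hands over (u,v); it's the domain shader that fetches the patch's control points and does the position maths, roughly like this toy flat-triangle example.)
Code:
# Toy domain-shader-style evaluation: the tessellator supplies only (u, v);
# the shader reads the patch's control points and computes the position.
# Flat triangle patch, purely illustrative.
def domain_shader(u, v, control_points):
    w = 1.0 - u - v                      # third barycentric coordinate
    p0, p1, p2 = control_points
    return tuple(w * a + u * b + v * c for a, b, c in zip(p0, p1, p2))

patch = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(domain_shader(0.25, 0.25, patch))  # -> (0.25, 0.25, 0.0)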

So, still no good idea why it's apparently 3x slowed-down.

Jawed
Artificial driver limitation? Like they did/do with vertex work? Remember the phenomenal throughput R600 achieved with early drivers, when the same shaders were declared as either pixel or vertex shaders in AMD's shader test? That has since been removed, and even the HD 5870 doesn't reach those perf levels anymore except for INT stuff (nor does Fermi).

Though I remain cautious about the hardwareluxx graph until someone reproduces that HD5870 result, as the results/graphs here:

http://www.pcgameshardware.com/aid,...Fermi-performance-benchmarks/Reviews/?page=16

don't tally.

Maybe they just mixed up the color-code? Their normal Fps-graphs show a different result here:
http://www.hardwareluxx.de/index.ph...tx-480-im-test-nvidias-comeback.html?start=25
 
Artificial driver limitation? Like they did/do with vertex work? Remember the phenomenal throughput R600 achieved with early drivers, when the same shaders were declared as either pixel or vertex shaders in AMD's shader test? That has since been removed, and even the HD 5870 doesn't reach those perf levels anymore except for INT stuff (nor does Fermi).
That was a shader load-balancing change, I suppose. Never really got a decent answer on that, did we?

I suppose there's a chance that the same load-balancing mind-set would apply to tessellation - "no point in clogging-up the GPU with yet more triangles, since pixel shading can't keep up".

Dunno, good thinking though...

Maybe they just mixed up the color-code? Their normal Fps-graphs show a different result here:
http://www.hardwareluxx.de/index.ph...tx-480-im-test-nvidias-comeback.html?start=25
I'm going to treat it as suspect for the time being until someone verifies it.

The shape of the lines (particularly the way one line crosses the other for peaks) is completely different from your results.

Jawed
 
That was a shader load-balancing change, I suppose. Never really got a decent answer on that, did we?
Nope, never. Understandably, since the high VS rates were used as a marketing tool at R600 launch.

I suppose there's a chance that the same load-balancing mind-set would apply to tessellation - "no point in clogging-up the GPU with yet more triangles, since pixel shading can't keep up".
Not pixel-shading but tri-setup mainly, I'd think. Damien should really get a 5770 through this test!

I'm going to treat it as suspect for the time being until someone verifies it.

The shape of the lines (particularly the way one line crosses the other for peaks) is completely different from your results.
I'm pretty confident in our results, but never say never. :)
 
I wouldn't say "insufficient" but Nvidia's advantage will decrease.

Sure but that will be compensated by the fact that geometry processing becomes a larger part of the workload at the resolutions those cards will be running.

HD 5450 scores 86 Mtri/s in the same test.

Hmmm, why would that be? Bandwidth limitation?
 
Cedar doesn't have the same prim rates as the rest of the stack. Cypress, Juniper and Redwood are the same as each other.
 
Sure but that will be compensated by the fact that geometry processing becomes a larger part of the workload at the resolutions those cards will be running.
Not really, because the cards process pixels at half the speed of the higher end cards, so the time spent is the same.

Look at my example earlier in the thread:
Imagine game X at high resolution where Cypress spends 3ms on geometry heavy areas while Fermi spends 1ms on it, and both spend 15 ms on pixel-heavy work. That's 56 fps vs 63 fps, or 13% diff. At low resolution the geometry load stays the same but pixel load halves, making it 95 fps vs 118 fps, or 24% diff.
So Juniper would spend 3ms on geometry and 15ms on pixels at the low resolution for the same 56fps that Cypress got at high res, and a half-Fermi would spend 2ms on geometry and 15 ms on pixels for 59fps. The 13% is now reduced to 6%.
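Same back-of-the-envelope maths as before, with made-up frame times again:
Code:
# Juniper vs a hypothetical half-Fermi at the lower resolution, using the
# invented frame times from the example quoted above.
juniper = 1000.0 / (3.0 + 15.0)      # 3 ms geometry + 15 ms pixels
half_fermi = 1000.0 / (2.0 + 15.0)   # 2 ms geometry + 15 ms pixels
print(f"{juniper:.0f} vs {half_fermi:.0f} fps ({half_fermi / juniper - 1:.0%})")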

There is no half-Fermi, though, and Juniper would be going up against a 128-bit card.
 