NVIDIA Fermi: Architecture discussion

They could also use a slower but bigger SRAM L3 cache as a tile cache. At 28nm, a 6MB SRAM cache shouldn't be a big problem.
 
Huh what? In what way is tessellation in any way related to deferred rendering?

It is not related per se, but it is a push towards deferred rendering: it eliminates shading and lighting work done for fragments that are eventually discarded.

If you are measuring overdraw as the ratio of pixel shader invocations to the number of pixels actually drawn, then it has an overdraw of 1.
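To make that metric concrete, here is a minimal sketch of the ratio being described; the numbers are purely illustrative, not taken from the thread:

```python
# Overdraw measured as pixel-shader invocations per pixel finally drawn.
def overdraw(ps_invocations, pixels_drawn):
    return ps_invocations / pixels_drawn

# With perfect early rejection (e.g. a Z pre-pass), the shader runs
# exactly once per visible pixel:
print(overdraw(1_000_000, 1_000_000))  # 1.0

# Without early rejection, hidden fragments get shaded too:
print(overdraw(2_500_000, 1_000_000))  # 2.5
```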
 
It is not related per se, but it is a push towards deferred rendering: it eliminates shading and lighting work done for fragments that are eventually discarded.
I'm not sure how. I'm pretty sure the z-checking is still done in the usual way.

If you are measuring overdraw as the ratio of pixel shader invocations to the number of pixels actually drawn, then it has an overdraw of 1.
How?
 
I think the question is how does a tessellated scene benefit more from deferred rendering than a non-tessellated one.
 
TBH, tessellation seems like another big push (apart from the relatively small increase in bandwidth from RV770->RV870) towards deferred rendering to me, with MSAA now possible with deferred rendering as well.
Deferred is certainly relatively happy with high triangle counts (and the doubled fillrate is the key thing in Evergreen). But a Z pre-pass (where the developer pre-populates Z with unshaded triangles) is a relatively simple, major step in that direction. Marco's concept of a Z pre-pass before the main G-buffer write pass theoretically also applies :LOL:

I'm not sure how tessellation affects the relative cost of the Z pre-pass and the shading pass - and there's a question of the degree to which the Z pre-pass is setup bound, or becomes setup bound with tessellation.

And while your rendering has been deferred, why not render in tiles to increase bandwidth efficiency as well. LRB FTW :LOL:

If only an IHV would take the initiative to include hw acceleration for sorting fragments into tiles and expose it in OCL. ;)
The D3D pipeline is flexible enough already in theory. Xenos's pipeline supports explicit markup for tile IDs on vertex buffer content (done by the CPU I think). Though there's a difference between tiling across 2 to 4 tiles like you might do with Xenos and tiling with typically hundreds and as many as thousands (2560x1600 with 32x32 tiles is 4000 tiles).
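The tile count quoted above checks out; at those dimensions the screen divides into tiles exactly:

```python
# 2560x1600 screen divided into 32x32 tiles (both dimensions divide evenly).
width, height, tile = 2560, 1600, 32
tiles = (width // tile) * (height // tile)
print(tiles)  # 4000
```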

Tessellation amplifies triangle data - if that data is binned in memory then you've kinda lost one of the benefits of tessellation: memory space and bandwidth savings (though these benefits apply jointly to the main RAM/disk side as well as the GPU side, so the host side gains aren't lost). G-buffers can also be fairly expensive in memory consumption.

So then you have the question of whether the cost of binned triangles in memory is offset by the pixel shading coherency gain, bearing in mind that these traditional GPUs are designed around incoherent pixel shading (while they tile fragments during rasterisation, those tiles are only fleeting, so a given tile is visited repeatedly during a single frame).

The Fermi architecture theoretically allows these fleeting tiles to live longer in L2 before being evicted (because L2 is huge compared with the caches attached to ROPs in older architectures), increasing the chance that even in forward rendering these tiles will accelerate blending or MSAA a step beyond the older GPUs.

Absent that, I guess they could try slapping a 256MB DRAM module on top of the GPU die using PoP tech.
Some kind of die stacking seems inevitable to me. Larrabee should be here before die-stacking on this scale occurs for desktop GPUs...

Jawed
 
Does this link and the references contained in it help?

http://en.wikipedia.org/wiki/Deferred_shading
I am entirely familiar with deferred shading, but I still don't see how you think tessellation has anything at all to do with deferred shading.

Deferred shading is a way of storing attributes for the entire scene (or at least a significant portion of it) in screen-space buffers, so that expensive shading is only done for a pixel once the visible triangle at that pixel is known.

Tessellation is a way of dynamically subdividing and displacing triangles to improve detail.

What do the two have to do with one another?
 
What do the two have to do with one another?
They affect the trade-offs of using one vs. the other. For instance, tessellation often implies smaller triangles, which means forward rendering shading rates become somewhat less efficient (due to only partially filled SIMD lanes), while deferred rendering can often recover this cost.

I agree though that tessellation itself isn't really a "push towards deferred rendering"... more that using it starts to provide additional benefits to using deferred rendering as well.
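A rough model of the cost being described: pixel shaders run on 2x2 quads (4 lanes per quad), so a small triangle wastes helper lanes. The numbers below are illustrative:

```python
def quad_efficiency(pixels_covered, quads_touched):
    """Fraction of pixel-shader lanes doing useful work when shading
    runs on 2x2 quads (4 lanes per quad)."""
    return pixels_covered / (quads_touched * 4)

# A large triangle covering 64 pixels in 16 fully filled quads:
print(quad_efficiency(64, 16))  # 1.0

# A tessellated micro-triangle covering 3 pixels can still touch 3 quads,
# leaving 9 of the 12 launched lanes as wasted helper lanes:
print(quad_efficiency(3, 3))    # 0.25
```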
 
They affect the trade-offs of using one vs. the other. For instance, tessellation often implies smaller triangles, which means forward rendering shading rates become somewhat less efficient (due to only partially filled SIMD lanes), while deferred rendering can often recover this cost.
I don't see how using SIMD lanes is a design decision that is necessarily tied to traditional rasterization as opposed to deferred rendering.
 
I don't see how using SIMD lanes is a design decision that is necessarily tied to traditional rasterization as opposed to deferred rendering.

I think what he means is that in a deferred renderer it's easier to pack SIMD lanes with pixels from different triangles (because the necessary parameters for each pixel have already been computed), compared to a traditional renderer, which may not be able to pack multiple small triangles into SIMD lanes efficiently.
 
Yup. Once you go deferred, you split your previously forward-rendering shaders into two passes, and you end up paying the inefficiencies related to small primitives only for the first pass.
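A back-of-the-envelope cost model of that two-pass split (all constants illustrative, not measured): in forward rendering the full shader pays the quad inefficiency; once deferred, only the cheap G-buffer write does, and the expensive lighting pass runs once per visible pixel at full SIMD occupancy.

```python
def forward_cost(fragments, quad_eff, shade_cost):
    # Full shading cost paid at the (possibly poor) quad efficiency.
    return fragments * shade_cost / quad_eff

def deferred_cost(fragments, quad_eff, gbuffer_cost, pixels, light_cost):
    # Only the cheap G-buffer pass pays the quad inefficiency;
    # lighting runs once per visible pixel at full occupancy.
    return fragments * gbuffer_cost / quad_eff + pixels * light_cost

# Illustrative workload: 2M fragments, 1M visible pixels, 25% quad
# efficiency (tiny triangles), shading split as 10 G-buffer ops
# plus 90 lighting ops per pixel.
fwd = forward_cost(2e6, 0.25, 100)
dfr = deferred_cost(2e6, 0.25, 10, 1e6, 90)
print(fwd, dfr)  # the deferred split comes out well ahead here
```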
 
I think what he means is that in a deferred renderer it's easier to pack SIMD lanes with pixels from different triangles (because the necessary parameters for each pixel have already been computed), compared to a traditional renderer, which may not be able to pack multiple small triangles into SIMD lanes efficiently.
Couldn't you do the same with a more powerful triangle setup engine and a little bit of caching? Should be reasonably efficient for tessellated triangles, at least, as neighboring triangles should be near one another in screen space.
 
But if you have a deferred renderer with so many (generated) triangles, don't your display lists grow to gigantic sizes?
 
Couldn't you do the same with a more powerful triangle setup engine and a little bit of caching? Should be reasonably efficient for tessellated triangles, at least, as neighboring triangles should be near one another in screen space.
The user can do a far better job of it than the graphics pipeline implementation (which has to be pretty conservative about when it can combine SIMD warps, share interpolants, or quantize derivatives and LODs to 2x2 quads).

But if you have a deferred renderer with so many (generated) triangles, don't your display lists grow to gigantic sizes?
Not sure what you mean... the geometry pass is pretty inexpensive in a deferred renderer. As Marco notes, one big advantage is getting the complicated (and often suboptimal) triangle renderer/rasterization scheduling decoupled from the bulk of the computation, which can often be more efficiently rescheduled in a deferred pass.

If you're talking about a tile-based or "binning" renderer, that's a different story. In that case yes, you get very large triangle bins for each tile unless you can do something clever (bin bounding boxes and tessellate on the fly or something). That said, this is more about a binning/tile-based renderer than deferred renderers... the latter operate similarly under both immediate-mode and tile-based renderers with respect to triangle counts - they just provide some additional benefits to tile-based renderers as well.
 
Apparently it's about 23% in Unigine with tessellation off on HD5870 according to your follow-up post.
No, my post was calculating setup (or rather, anything that runs at the same speed as on the HD5770, as opposed to 2x as fast).

But other bottlenecks are in play, i.e. HD5870 isn't ALU limited for geometry here - it could be vertex bandwidth limited or fillrate (Z rate) limited.
Those should still be twice as fast on the 5870, so I'm not quite sure what you're getting at.

http://forum.beyond3d.com/showpost.php?p=1383133&postcount=1004

AvP without tessellation is 96fps, but in wireframe is ~500fps and that's without the geometry workload of shadows. 500fps could be setup limited rather than VS limited, or it could be vertex fetch limited.

Also, I don't have a decent idea of how expensive wireframe rendering, itself, is.
Yeah, it's hard to say. Wireframe may not be saving much rendering load when tessellated, because a breakdown of rendering times using that data would give unreasonable numbers.

Does that take account of multi-pass geometry, shadow buffer rendering passes, overdraw, transparency passes and shrinking triangles?
I don't know what you're asking.

I don't understand what you mean by portions of the scene.
GPUs don't have an infinite buffer between the VS and PS. By and large a GPU will be near 100% utilization in either geometry/setup or pixel shading, and rarely pushing both simultaneously beyond 50% (which is the threshold at which my assumption breaks down). This is not due to lack of capability; it's just the nature of the workload.

"Portions of the scene" (maybe I should have said 'frame' instead of 'scene'?) refer to a sequence of polys that are limited one way or another. I then classify these portions as either pixel or geometry and add up their processing time. A small percent will be neither, but that's it.

I'm assuming this is the scenario you're painting: VS is setup limited, at 4 triangles per SM clock or 2 vertices per SM clock. There are 1024 ALU instructions per SM clock (hot clock is 2x SM clock), so 512 ALU instructions per vertex is the limit for 100% VS usage of the GPU. I wasn't thinking of 100% usage; I was contemplating VS becoming the dominant shading workload (>50%) after tessellation (a 4x multiplier of triangles), which is 64 instructions per vertex.
Your math is assuming a setup rate of 16 tessellated triangles per clock, which simply isn't true. Four tris per clock is the maximum usable output of the geometry engine, tessellated or not. Taking into account that it's closer to 3/clk and that the tri:vertex ratio is under 2, vertex shading never has to happen faster than 2 per clock.
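The arithmetic in the exchange above can be checked directly (these are the posters' assumed GF100 figures, not measured data):

```python
# 512 ALUs at hot clock = 2x SM clock -> 1024 instruction slots per SM clock.
alu_per_sm_clock = 512 * 2
verts_per_clock = 2                # 4 tris/clk at a tri:vertex ratio of 2
full_vs_limit = alu_per_sm_clock / verts_per_clock
print(full_vs_limit)               # 512 ALU instructions per vertex for 100% VS

# VS becoming the dominant (>50%) workload after a 4x tessellation
# amplification of triangle count:
dominant_vs = full_vs_limit / 2 / 4
print(dominant_vs)                 # 64 instructions per vertex
```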
 
That was my point. There's a suggestion that tessellation/setup rate limit is what's occurring here, but at worst this is not a continuous bottleneck.
That's not much of a point. :p The fact remains that it's a big bottleneck keeping the shader engines idle over one third of the time.

How are you calculating this?
Does it matter? Add up the numbers, invert for framerate, and you'll match the source data perfectly. Again, this is how I broke it down:

Part A - parts of the workload that run twice as fast on the 5870 than on the 5770 (hence the times being twice as long for the latter with this part of the workload)
Part B - parts of the workload that run at the same speed on both the 5870 and the 5770

For the most part, pixel/geometry is pretty apt for A/B. Pixel includes shader, BW, etc., because all of that is needed for each pixel.
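The A/B breakdown above can be sketched as a simple model; the times here are illustrative placeholders, not the actual Unigine numbers:

```python
# Part A scales 2x between HD5770 and HD5870; Part B runs at the same
# speed on both cards.
def frame_times(a_5870_ms, b_ms):
    t_5870 = a_5870_ms + b_ms
    t_5770 = 2 * a_5870_ms + b_ms  # Part A takes twice as long on the 5770
    return t_5870, t_5770

# Example: 10 ms of doubled-rate work and 3 ms of fixed-rate work.
t_5870, t_5770 = frame_times(10.0, 3.0)
print(1000 / t_5870, 1000 / t_5770)  # invert the times for framerates
```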
I agree, the resulting framerates are troublingly low. Even without tessellation they're troublingly low. But then, that's what synthetic benchmarks do.
But at least with low amounts of geometry you can lower the resolution (something that I think is overrated) to increase framerate. With high amounts of geometry, it doesn't help.

Which is why I'm dubious of the pixel/geometry balance you've derived above, as well as other factors.
Why? Did you not notice that when tessellation was enabled I got larger numbers for the pixel load, too? The breakdown is very reasonable, and there's nothing dubious about it.

Additionally, it seems to me that GF100's substantial read/write L1/L2 cache can't help but be a significant factor in performance here. NVidia describes L2 as replacing various FIFOs, and it's certainly a key part of geometry processing.
That's probably jumping to conclusions. If it processes the geometry-limited parts 3.5x as fast and the pixel-limited parts 15% faster due to 50% more BW, GF100 will get 1.6x the performance (with tessellation, 12.4ms pixel, 3ms geometry).
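One reading of those numbers reproduces the quoted ~1.6x; note the interpretation that 12.4 ms / 3 ms are GF100's *projected* pixel and geometry times (back-derived to HD5870 times of ~14.3 ms and 10.5 ms) is an assumption, not stated outright:

```python
# Assumed GF100 projected frame-time components, per the reading above.
gf100_pixel_ms = 12.4
gf100_geom_ms = 3.0

# Back out the HD5870 times from the stated speedup factors:
hd5870_pixel_ms = gf100_pixel_ms * 1.15  # pixel-limited parts 15% faster
hd5870_geom_ms = gf100_geom_ms * 3.5     # geometry-limited parts 3.5x faster

speedup = (hd5870_pixel_ms + hd5870_geom_ms) / (gf100_pixel_ms + gf100_geom_ms)
print(round(speedup, 2))  # ~1.61, consistent with the quoted ~1.6x
```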
 
The user can do a far better job of it than the graphics pipeline implementation (which has to be pretty conservative about when it can combine SIMD warps, share interpolants, or quantize derivatives and LODs to 2x2 quads).
Well, I suppose I'd have to look more into the difficulties here, but I would tend to expect that with the limitations placed by tessellation in the first place, there's at least a possibility that this could be extended to tessellated triangles.

However, as a small addendum, let me add that tessellation might actually reduce the problem of having too many very small triangles, because it also allows for lower level of detail for far-away objects. You'd only have more of a problem with small triangles if you use a very high level of detail.
 
Couldn't you do the same with a more powerful triangle setup engine and a little bit of caching? Should be reasonably efficient for tessellated triangles, at least, as neighboring triangles should be near one another in screen space.

When primitives are rasterized, the PS always runs on screen-aligned quads. With a huge number of small primitives aligned in all sorts of odd ways, the PS cost/waste adds up pretty quickly. Not to mention that curved surfaces increase the percentage of occluded fragments.

Deferred rendering does away with all of it.
 
When primitives are rasterized, the PS always runs on screen-aligned quads. With a huge number of small primitives aligned in all sorts of odd ways, the PS cost/waste adds up pretty quickly. Not to mention that curved surfaces increase the percentage of occluded fragments.

Deferred rendering does away with all of it.
Your post doesn't address my comment.
 