Huh, what? In what way is tessellation related to deferred rendering?
[quote]It is not related per se, but is a push towards deferred rendering. It eliminates shading and lighting work done for fragments that are eventually discarded.[/quote]
I'm not sure how. I'm pretty sure the z-checking is still done in the usual way.
[quote]If you are measuring overdraw as the ratio of pixel shader invocations to the number of pixels actually drawn, then it has an overdraw of 1.[/quote]
How?
[quote]TBH, tessellation seems like another big push (apart from the relatively small increase in bandwidth from rv770->rv870) towards deferred rendering to me. With MSAA now possible with deferred rendering as well.[/quote]
Deferred is certainly relatively happy with high triangle counts (and the doubled fillrate is the key thing in Evergreen). But a Z pre-pass (where the developer pre-populates Z with unshaded triangles) is a relatively simple major step in that direction. Marco's concept of a Z pre-pass before the main G-buffer write pass theoretically also applies.
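Since the overdraw point above is easy to sanity-check, here's a quick toy Python sketch (mine, with a made-up scene and an assumed depth complexity of 4) counting pixel-shader runs for plain forward rendering, forward with a Z pre-pass, and a deferred shading pass:
[code]
import random

random.seed(0)
W, H, LAYERS = 64, 64, 4   # toy framebuffer and assumed depth complexity

# Four full-screen "surfaces" with random per-pixel depths, submitted in an
# arbitrary order (the unfavourable case for plain forward rendering).
surfaces = [[random.random() for _ in range(W * H)] for _ in range(LAYERS)]

# Plain forward: shade whenever an incoming fragment passes the depth test.
depth = [float("inf")] * (W * H)
forward_shaded = 0
for surf in surfaces:
    for p, z in enumerate(surf):
        if z < depth[p]:
            depth[p] = z
            forward_shaded += 1        # shaded now, possibly overwritten later

# Z pre-pass: depth-only pass first, then shade only the surviving fragments.
final_depth = [min(s[p] for s in surfaces) for p in range(W * H)]
prepass_shaded = sum(1 for surf in surfaces
                     for p, z in enumerate(surf) if z == final_depth[p])

# Deferred: the lighting pass runs exactly once per covered pixel.
deferred_shaded = W * H

visible = W * H
for name, n in [("forward", forward_shaded),
                ("z pre-pass", prepass_shaded),
                ("deferred", deferred_shaded)]:
    print(f"{name:>10}: {n:5d} pixel-shader runs, overdraw {n / visible:.2f}")
[/code]
With four randomly ordered layers the forward count lands around twice the visible pixel count, while both the Z pre-pass and the deferred pass shade each visible pixel exactly once (the sketch ignores the cost of the depth-only and G-buffer passes themselves).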
[quote]And while your rendering has been deferred, why not render in tiles to increase bandwidth efficiency as well? LRB FTW[/quote]
The D3D pipeline is flexible enough already in theory. Xenos's pipeline supports explicit markup for tile IDs on vertex buffer content (done by the CPU, I think). Though there's a difference between tiling across 2 to 4 tiles like you might do with Xenos and tiling with typically hundreds and as many as thousands of tiles (2560x1600 with 32x32 tiles is 4000 tiles).
If only an IHV would take the initiative to include hardware acceleration for sorting fragments into tiles and expose it in OpenCL.
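For what it's worth, the kind of tile sorting being wished for here looks roughly like this on the CPU (a plain Python stand-in added purely for illustration; the fragment counts are arbitrary, and it reproduces the 2560x1600 / 32x32 = 4000 tile count from the post above):
[code]
from collections import defaultdict
import random

WIDTH, HEIGHT, TILE = 2560, 1600, 32
TILES_X, TILES_Y = WIDTH // TILE, HEIGHT // TILE      # 80 x 50 = 4000 tiles

def tile_id(x, y):
    """Map a pixel coordinate to a linear tile index."""
    return (y // TILE) * TILES_X + (x // TILE)

def bin_fragments(fragments):
    """fragments: iterable of (x, y, payload) tuples -> {tile_id: [payload, ...]}."""
    bins = defaultdict(list)
    for x, y, payload in fragments:
        bins[tile_id(x, y)].append(payload)
    return bins

random.seed(1)
frags = [(random.randrange(WIDTH), random.randrange(HEIGHT), i) for i in range(100_000)]
bins = bin_fragments(frags)
print(f"{TILES_X * TILES_Y} tiles total")    # 4000, as in the post above
print(f"{len(bins)} tiles touched, ~{len(frags) // (TILES_X * TILES_Y)} fragments per tile on average")
[/code]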
[quote]Absent that, I guess they could try slapping a 256MB DRAM module on top of the GPU die using PoP tech.[/quote]
Some kind of die stacking seems inevitable to me. Larrabee should be here before die-stacking on this scale occurs for desktop GPUs...
[quote]Does this link and the references contained in it help?
http://en.wikipedia.org/wiki/Deferred_shading[/quote]
I am entirely familiar with deferred shading, but I still don't see how you think tessellation has anything at all to do with deferred shading.
[quote]What do the two have to do with one another?[/quote]
They affect the trade-offs of using one vs. another. For instance, tessellation often implies smaller triangles, which means that forward rendering shading rates become somewhat less efficient (due to only partially filled SIMD lanes) while deferred rendering can often recover this cost.
[quote]They affect the trade-offs of using one vs. another. For instance, tessellation often implies smaller triangles, which means that forward rendering shading rates become somewhat less efficient (due to only partially filled SIMD lanes) while deferred rendering can often recover this cost.[/quote]
I don't see how using SIMD lanes is a design decision that is necessarily tied to traditional rasterization as opposed to deferred rendering.
[quote]I think what he means is that in a deferred renderer it's easier to pack SIMD lanes from different triangles (because the necessary parameters for each pixel are already computed before), compared to a traditional renderer, which may not be able to pack multiple small triangles into SIMD lanes efficiently.[/quote]
Couldn't you do the same with a more powerful triangle setup engine and a little bit of caching? It should be reasonably efficient for tessellated triangles, at least, as neighboring triangles should be near one another in screen space.
[quote]Couldn't you do the same with a more powerful triangle setup engine and a little bit of caching? It should be reasonably efficient for tessellated triangles, at least, as neighboring triangles should be near one another in screen space.[/quote]
The user can do a way better job of it than the graphics pipeline implementation (which has to be pretty conservative about when it can combine SIMD warps or interpolants, or quantize derivatives or LODs to 2x2 quads).
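To put rough numbers on the packing argument (my own back-of-the-envelope figures; the 1.5 covered pixels per quad is an assumption, not a measurement):
[code]
# Back-of-the-envelope lane occupancy (illustration only; the 1.5 covered
# pixels per quad is an assumed figure for heavily tessellated geometry).
QUAD = 4                          # pixel shaders run on 2x2 quads
covered_per_quad = 1.5            # assumed live pixels per quad for tiny triangles

forward_occupancy = covered_per_quad / QUAD    # helper lanes do throwaway work
deferred_occupancy = 1.0          # deferred pass: one thread per screen pixel

print(f"forward PS lane occupancy : {forward_occupancy:.0%}")
print(f"deferred pass occupancy   : {deferred_occupancy:.0%}")
print(f"potential shading-rate win: {deferred_occupancy / forward_occupancy:.1f}x")
[/code]
The point isn't the exact ratio, just that the forward cost comes from helper lanes in partially covered quads, which a screen-space deferred shading pass doesn't pay during lighting.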
[quote]But if you have a deferred renderer with so many (generated) triangles, don't your display lists grow to gigantic sizes?[/quote]
Not sure what you mean... the geometry pass is pretty inexpensive in a deferred renderer. As Marco notes, one big advantage is getting the complicated (and often suboptimal) triangle renderer/rasterization scheduling decoupled from the bulk of the computation, which can often be more efficiently rescheduled in a deferred pass.
[quote]Apparently it's about 23% in Unigine with tessellation off on HD5870, according to your follow-up post.[/quote]
No, my post was calculating setup (or rather, anything that runs at the same speed as on the HD5770, as opposed to 2x).
[quote]But other bottlenecks are in play, i.e. HD5870 isn't ALU limited for geometry here - it could be vertex bandwidth limited or fillrate (Z rate) limited.[/quote]
Those should still be twice as fast on the 5870, so I'm not quite sure what you're getting at.
[quote]http://forum.beyond3d.com/showpost.php?p=1383133&postcount=1004
AvP without tessellation is 96fps, but in wireframe is ~500fps, and that's without the geometry workload of shadows. 500fps could be setup limited rather than VS limited, or it could be vertex fetch limited.
Also, I don't have a decent idea of how expensive wireframe rendering itself is.[/quote]
Yeah, it's hard to say. Wireframe may not be saving a lot of rendering load when tessellated, because a breakdown of rendering times using that data would give unreasonable numbers.
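For reference, the naive breakdown being doubted would look like this (my arithmetic on the quoted numbers; it assumes the wireframe frame time is pure geometry work, which is exactly the assumption in question):
[code]
# Naive frame-time decomposition from the quoted AvP numbers (illustrative
# only; assumes wireframe time == geometry-only time, which the reply doubts).
fps_normal, fps_wireframe = 96.0, 500.0

frame_ms    = 1000.0 / fps_normal      # ~10.4 ms total
geometry_ms = 1000.0 / fps_wireframe   # ~2.0 ms if wireframe were geometry-only
pixel_ms    = frame_ms - geometry_ms   # ~8.4 ms attributed to pixel work

print(f"total    : {frame_ms:5.1f} ms")
print(f"geometry : {geometry_ms:5.1f} ms ({geometry_ms / frame_ms:.0%})")
print(f"pixel    : {pixel_ms:5.1f} ms ({pixel_ms / frame_ms:.0%})")
[/code]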
[quote]Does that take account of multi-pass geometry, shadow buffer rendering passes, overdraw, transparency passes and shrinking triangles?[/quote]
I don't know what you're asking.
[quote]I don't understand what you mean by portions of the scene.[/quote]
GPUs don't have an infinite buffer between the VS and PS. By and large, a GPU will be near 100% utilization in either geometry/setup or pixel shading, and rarely pushing both simultaneously beyond 50% (which is the threshold at which my assumption breaks down). This is not due to lack of capability; it's just the nature of the workload.
[quote]I'm assuming this is the scenario you're painting: that VS is setup limited, that's 4 triangles per SM clock or 2 vertices per SM clock. There are 1024 ALU instructions per SM clock (hot clock is 2x SM clock), so 512 ALU instructions per vertex per clock is the limit for 100% VS usage of the GPU. I wasn't thinking of 100% usage; I was contemplating VS becoming the dominant shading workload (>50%) after tessellation (4x multiplier of triangles), which is 64 instructions per vertex.[/quote]
Your math is assuming a setup rate of 16 tessellated triangles per clock, which simply isn't true. Four tris per clock is the maximum usable output of the geometry engine, whether tessellated or not. Take into account that it's closer to 3/clk and that the tri:vertex ratio is under 2, and it means vertex shading never has to happen faster than 2 per clock.
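Spelling out the arithmetic in the quoted post (just restating its stated assumptions, which the reply above disputes):
[code]
# Restating the quoted post's budget math using its own assumptions.
alu_instr_per_sm_clock = 1024      # quoted: hot clock is 2x SM clock
verts_per_sm_clock     = 2         # quoted setup-limited scenario (4 tris ~ 2 verts)

# ALU budget per vertex if the whole GPU did nothing but vertex shading.
limit_100pct = alu_instr_per_sm_clock / verts_per_sm_clock          # 512

# Threshold for VS being >50% of the shading work after a 4x triangle multiplier
# from tessellation, keeping the same per-clock setup assumption (which is what
# the reply objects to: it would imply 16 tessellated triangles per clock).
tess_multiplier = 4
limit_dominant = limit_100pct * 0.5 / tess_multiplier                # 64

print(f"{limit_100pct:.0f} ALU instructions per vertex for 100% VS usage")
print(f"{limit_dominant:.0f} ALU instructions per vertex for VS > 50% after {tess_multiplier}x tessellation")
[/code]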
[quote]That was my point. There's a suggestion that tessellation/setup rate limit is what's occurring here, but at worst this is not a continuous bottleneck.[/quote]
That's not much of a point. The fact remains that it's a big bottleneck keeping the shader engines idle over one third of the time.
[quote]How are you calculating this?[/quote]
Does it matter? Add up the numbers, invert for framerate, and you'll match the source data perfectly. Again, this is how I broke it down:
[quote]I agree, the resulting framerates are troublingly low. Even without tessellation they're troublingly low. But then, that's what synthetic benchmarks do.[/quote]
But at least with low amounts of geometry you can lower the resolution (something that I think is overrated) to increase framerate. With high amounts of geometry, it doesn't help.
[quote]Which is why I'm dubious of the pixel/geometry balance you've derived above, as well as other factors.[/quote]
Why? Did you not notice that when tessellation was enabled I got larger numbers for the pixel load, too? The breakdown is very reasonable, and there's nothing dubious about it.
[quote]Additionally, it seems to me that GF100's substantial read/write L1/L2 cache can't help but be a significant factor in performance here. NVidia describes L2 as replacing various FIFOs, and it's certainly a key part of geometry processing.[/quote]
That's probably jumping to conclusions. If it processes the geometry-limited parts 3.5x as fast and the pixel-limited parts 15% faster due to 50% more BW, GF100 will get 1.6x the performance (with tessellation: 12.4ms pixel, 3ms geometry).
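Reading the parenthetical as GF100's projected frame-time split, the 1.6x figure reconstructs as follows (my back-calculation; the implied HD5870 baseline is inferred by undoing the stated speedups, not quoted anywhere):
[code]
# Back-calculation of the 1.6x claim. Only the GF100 split and the speedups are
# quoted; the baseline split below is inferred by me from those numbers.
gf100_pixel_ms, gf100_geom_ms = 12.4, 3.0      # quoted projection, with tessellation
pixel_speedup, geom_speedup   = 1.15, 3.5      # 15% faster pixel, 3.5x geometry

base_pixel_ms = gf100_pixel_ms * pixel_speedup     # ~14.3 ms
base_geom_ms  = gf100_geom_ms * geom_speedup       # ~10.5 ms

base_total  = base_pixel_ms + base_geom_ms          # ~24.8 ms
gf100_total = gf100_pixel_ms + gf100_geom_ms        #  15.4 ms

print(f"implied baseline : {base_pixel_ms:.1f} ms pixel + {base_geom_ms:.1f} ms geometry = {base_total:.1f} ms")
print(f"GF100 projection : {gf100_total:.1f} ms")
print(f"speedup          : {base_total / gf100_total:.2f}x")
print(f"baseline geometry share: {base_geom_ms / base_total:.0%}")
[/code]
That implied baseline also puts geometry at roughly 42% of the frame, consistent with the "idle over one third of the time" remark earlier.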
[quote]The user can do a way better job of it than the graphics pipeline implementation (which has to be pretty conservative about when it can combine SIMD warps or interpolants, or quantize derivatives or LODs to 2x2 quads).[/quote]
Well, I suppose I'd have to look more into the difficulties here, but I would tend to expect that, with the limitations placed by tessellation in the first place, there's at least a possibility that this could be extended to tessellated triangles.
[quote]When primitives are rasterized, the PS always runs on screen-aligned quads. With a huge number of small primitives aligned in all sorts of odd ways, the PS cost/waste adds up pretty quickly. Not to mention that curved surfaces increase the percentage of occluded fragments.
Deferred rendering does away with all of it.[/quote]
Your post doesn't address my comment.
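Pulling the two effects in that quote together (quad padding plus occluded fragments), a rough count with assumed round numbers looks like this; in a deferred renderer only the comparatively cheap G-buffer write pays these costs, while the expensive lighting runs once per screen pixel:
[code]
# Rough overshading estimate for heavy forward shading vs. a deferred lighting
# pass. All inputs are assumed round numbers, for illustration only.
visible_pixels = 1920 * 1080
quad_padding   = 4 / 1.5     # ~2.7x: tiny triangles light up ~1.5 of each 2x2 quad
overdraw       = 2.0         # assumed average depth complexity after early-Z

forward_heavy_ps = visible_pixels * quad_padding * overdraw
deferred_light   = visible_pixels            # lighting runs once per screen pixel

print(f"forward heavy-PS invocations : {forward_heavy_ps / 1e6:.1f} M")
print(f"deferred lighting invocations: {deferred_light / 1e6:.1f} M")
print(f"ratio                        : {forward_heavy_ps / deferred_light:.1f}x")
[/code]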