HSR vs Tile-Based rendering?

I'm curious: why are Nvidia and ATI doing optimizations like bri-/tri-linear filtering? How about angle-dependent AF? Are those fillrate or bandwidth optimizations?

Chalnoth said:
Sure, it would be valuable, if it were possible to halve or quarter the fillrate requirements. It's not.

Just curious, but where on earth are you getting that from?! That's pure nonsense: if you have 2x overdraw your performance is halved on an IMR, and if it's 4x it's quartered. A TBDR does not suffer from this.

With a modern GPU, it's very possible to not render a single hidden pixel.

Yes, but that requires the developer to perform extra work: either sorting front-to-back (why not let the GPU do it instead of the CPU?) or doing a z-pass first. Either of those is going to cost performance, especially when you have an alternative technique that does it for a small cost in transistors for logic and cache.
 
Killer-Kris said:
Chalnoth said:
Sure, it would be valuable, if it were possible to halve or quarter the fillrate requirements. It's not.
Just curious, but where on earth are you getting that from?! That's pure nonsense: if you have 2x overdraw your performance is halved on an IMR, and if it's 4x it's quartered. A TBDR does not suffer from this.
It's not that simple anymore. For a long time IMRs have had techniques to reduce overdraw, and the overdraw can, in some cases, be exactly one on an IMR.

In a general gaming environment, simple front-to-back ordering of rendered objects can dramatically reduce effective overdraw on modern IMRs.

Yes, but that requires the developer to perform extra work: either sorting front-to-back (why not let the GPU do it instead of the CPU?) or doing a z-pass first. Either of those is going to cost performance, especially when you have an alternative technique that does it for a small cost in transistors for logic and cache.
Front to back sorting is pretty easy if done coarsely, and even a coarse front to back sort can dramatically reduce overdraw. But this really isn't the source of my statement.

The source of my statement is the z-pass first. Performing an initial z-pass is both cheap and required for some algorithms.
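For reference, a z-only pre-pass looks roughly like this from the application side. This is a minimal sketch using standard OpenGL state calls; drawOpaqueGeometry() is a hypothetical stand-in for whatever scene traversal the engine already does:

```cpp
#include <GL/gl.h>

// Hypothetical stand-in for the engine's own opaque-geometry traversal.
static void drawOpaqueGeometry() { /* issue the scene's draw calls here */ }

void renderWithZPrepass()
{
    glEnable(GL_DEPTH_TEST);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    // Pass 1: lay down depth only. Colour writes are off, so there is no
    // texturing or shading -- just Z.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    drawOpaqueGeometry();

    // Pass 2: full shading, but only fragments matching the depth buffer
    // survive, so each visible pixel is shaded exactly once.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_EQUAL);
    drawOpaqueGeometry();
}
```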
 
Chalnoth said:
And yes, the ratio of pixel fillrate to memory bandwidth is what determines whether the limiting factor is memory bandwidth or pixel fillrate. Rendering hidden pixels doesn't affect this ratio.

You've lost me completely. I'm not sure why I care about the ratio of this to that. Texturing, and the various z schemes used to avoid texturing, access memory rather randomly, I believe. There are savings on texture memory bandwidth with a TBR.
 
Chalnoth said:
The source of my statement is the z-pass first. Performing an initial z-pass is both cheap and required for some algorithms.

High-speed z checking is the basis of a tiler; these algorithms (like stencil shadows) may be inherently quick due to a TBR's design.

Sorting the scene or doing a z pass (the latter being a good idea, since it's a form of scene capturing that works for the infinite triangle case) requires developer intervention. The same developer could employ a different bag of tricks to get around any buffer overflow limit on a TBR.

I'd still have my doubts that you will ever pump that many triangles across the system bus for it to matter.
 
Jerry Cornelius said:
High-speed z checking is the basis of a tiler; these algorithms (like stencil shadows) may be inherently quick due to a TBR's design.
Fine, but does it really matter? Seriously, as shaders approach hundreds of instructions (which should happen in about two years), what percentage of the rendering time would be taken up by an initial z-pass, even with no optimizations? It'd be pretty minimal.

Sorting the scene or doing a z pass (the latter being a good idea, since it's a form of scene capturing that works for the infinite triangle case) requires developer intervention. The same developer could employ a different bag of tricks to get around any buffer overflow limit on a TBR.
No, the buffer overflow problem is a pretty hardware-specific problem. The hardware's either going to have to have a large portion of its video memory unused and unavailable to prevent overflows, or it's going to overflow. There's just no way around that, and there's no easy way for a developer to know for certain how many triangles are on the screen at once.

I'd still have my doubts that you will ever pump that many triangles across the system bus for it to matter.
This is what vertex buffers are for.
 
Something to remember is that Early Z, even with an initial depth pass, is far from ideal... it's all zone/area-based, hence in a lot of cases you still need to go all the way down to the final depth buffer to figure out if a single pixel is actually visible or not. Also, relatively small polygons are poor occluders under the area-based scheme (it's very difficult to update an Early Z area of 16x16 with a single depth value using polygons that are only 4x4 in size). Also, most initial depth passes for stencil are not just depth passes; you usually still end up with a fairly complicated shader, and remember the increase in polygon throughput this causes... Just out of curiosity, how many currently shipping games use an initial depth pass?
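To make the area problem concrete, here is a toy model of a coarse, area-based Z buffer. The 16x16 area and 4x4 polygon figures come from the post above; the data layout and the conservative-update rule are simplifying assumptions for illustration, not a description of any shipping chip:

```cpp
#include <algorithm>

// Toy coarse-Z: one "farthest depth" value per 16x16-pixel area, with
// enough areas here for a 1024x1024 target. Purely illustrative.
constexpr int kArea = 16;

struct CoarseZ {
    float areaMaxDepth[64][64];

    // Trivial reject: the incoming block of pixels lies entirely behind
    // everything already drawn in this area.
    bool reject(int ax, int ay, float blockMinDepth) const {
        return blockMinDepth > areaMaxDepth[ay][ax];
    }

    // Conservative update: only allowed when a polygon covers the whole
    // area. A 4x4 polygon inside a 16x16 area leaves pixels it never
    // touched, so lowering areaMaxDepth would be wrong -- the small
    // polygon cannot act as an occluder here.
    void update(int ax, int ay, int coveredW, int coveredH, float polyMaxDepth) {
        if (coveredW == kArea && coveredH == kArea)
            areaMaxDepth[ay][ax] = std::min(areaMaxDepth[ay][ax], polyMaxDepth);
    }
};
```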
 
No, the buffer overflow problem is a pretty hardware-specific problem. The hardware's either going to have to have a large portion of its video memory unused and unavailable to prevent overflows, or it's going to overflow. There's just no way around that, and there's no easy way for a developer to know for certain how many triangles are on the screen at once.

It is a commonly known HW-specific headache, but that doesn't mean solutions aren't possible either. In a worst-case scenario, without any specialized algorithms addressing that problem, how large would that portion of the video memory actually be? 10%? 20%? How much exactly? If I compare that to an IMR's buffer consumption with even a mediocre number of MSAA samples (and that would be just one example), I don't see any serious disadvantage on balance. In comparative terms, if binning space is going to be "large", then AA buffer consumption could be "huge" on an IMR, and not necessarily with an extravagant number of AA samples.

Accelerators work as a whole, and irrespective of approach there are always going to be both advantages and disadvantages; what matters to me is what comes out at the other end. And no, there hasn't been any comparable hardware out there for the PC for ages.

Unless of course we go into the PDA/mobile sector; there I'd first want to see a chip that can reach 2.75M polys/sec @ 120MHz with the same die size, power consumption, minuscule memory footprint, and performance penalty for FSAA and whatnot.

Hopefully we'll be looking for those supposed framerate nosedives soon enough on PC accelerators.
 
Well, if you want to know more specifics of how PowerVR works, read my article here:

http://www.powervr.org.uk/newsitepowervr.htm

Or the official one here:
http://www.pvrdev.com/pub/PC/doc/h/PowerVR Tile based rendering.htm



But I'll summarise....


Tile-based renderers store the whole scene to be rendered before rendering it. This is called scene capturing. The advantages start here: when you have all the polygons in the scene transformed and ready to rasterise, some extra things can be done to them. For example, the translucent objects can be sorted in hardware - this is impossible on an IMR. Also, generalised modifier volumes can manipulate objects in the scene, the most effective use being shadow volumes - the only performance hit incurred being the transformation of the modifier volumes' geometry. Both of these features were left out of the KYRO due to lack of developer interest (gits); they were in the Neon250 (and hence Dreamcast) though.
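As a rough illustration of what "scene capturing" amounts to, each transformed triangle is binned into the screen tiles its bounding box touches before any rasterisation happens. A minimal sketch (the 32-pixel tile size and the data layout are assumptions made purely for illustration):

```cpp
#include <algorithm>
#include <vector>

struct Vec2 { float x, y; };
struct Triangle { Vec2 v[3]; /* plus depth, render state, texture ids, ... */ };

constexpr int kTileSize = 32;               // assumed tile size, for illustration
constexpr int kTilesX = 1280 / kTileSize;   // e.g. a 1280x1024 render target
constexpr int kTilesY = 1024 / kTileSize;

// One bin of triangle indices per screen tile -- this is the "scene buffer".
std::vector<int> bins[kTilesY][kTilesX];

void binTriangle(const std::vector<Triangle>& scene, int idx)
{
    const Triangle& t = scene[idx];
    float minX = std::min({t.v[0].x, t.v[1].x, t.v[2].x});
    float maxX = std::max({t.v[0].x, t.v[1].x, t.v[2].x});
    float minY = std::min({t.v[0].y, t.v[1].y, t.v[2].y});
    float maxY = std::max({t.v[0].y, t.v[1].y, t.v[2].y});

    int tx0 = std::clamp(int(minX) / kTileSize, 0, kTilesX - 1);
    int tx1 = std::clamp(int(maxX) / kTileSize, 0, kTilesX - 1);
    int ty0 = std::clamp(int(minY) / kTileSize, 0, kTilesY - 1);
    int ty1 = std::clamp(int(maxY) / kTileSize, 0, kTilesY - 1);

    // The triangle is recorded in every tile its bounds touch; nothing is
    // rasterised until the whole frame's geometry has been captured.
    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            bins[ty][tx].push_back(idx);
}
```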

There is the issue of the memory required to store this scene buffer - I can assure you it is a non-issue. As soon as you turn 4xFSAA on at a high resolution (such as 1280x1024), your z-buffer increases to 20MB, and your upsampled framebuffer is also 20MB. That is without multiple buffering. On the KYRO the scene buffer has a fixed size of 6MB and can hold almost enough polys for the 3DMark2001 high polygon test (the scene manager allows it to render properly anyway). It is also not accessed as frequently as an IMR's frame or z-buffer. Also, we live in the day and age of 128/256MB graphics cards - even if the scene buffer were 30MB, big wow.
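The 20MB figures are easy to check, assuming 32-bit depth and 32-bit colour per sample (my assumption of the formats being implied):

```cpp
#include <cstdio>

int main()
{
    const double width = 1280, height = 1024;
    const double samples = 4;          // 4x FSAA (supersampled buffers on an IMR)
    const double bytesPerZ = 4;        // assumed 32-bit depth
    const double bytesPerColour = 4;   // assumed 32-bit colour

    double zBuffer        = width * height * samples * bytesPerZ;
    double upsampledFrame = width * height * samples * bytesPerColour;

    std::printf("z-buffer:              %.1f MB\n", zBuffer / (1024.0 * 1024.0));        // ~20 MB
    std::printf("upsampled framebuffer: %.1f MB\n", upsampledFrame / (1024.0 * 1024.0)); // ~20 MB
    return 0;
}
```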

Once you get to rasterisation, hidden surface removal can take place. This is pixel-perfect, and no IMR can claim the same: 100% of the pixels PowerVR renders are visible in the final image, 100% of the time. It is also a pipelined operation with buffering, so it will never stall the rendering (especially as it performs HSR on 32 pixels per clock). Also, as each pixel is shaded (in much the same way an IMR does it), values are written to the tile buffer, not the framebuffer. Multi-pass effects that write multiple times to the framebuffer, such as lots of transparencies, many texture layers, etc., are only written once on PowerVR, as the multiple reads/writes are done in on-chip cache.
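Conceptually, the per-tile work then looks something like the sketch below. This is a software caricature of the idea (visibility first, shading only for surviving pixels, one write-out per tile), not PowerVR's actual hardware pipeline; the Fragment/Tile types and the shade() helper are inventions for illustration, and translucent surfaces would be handled separately, still within the tile:

```cpp
#include <cstdint>
#include <vector>

constexpr int kTileSize = 32;   // same illustrative tile size as above

// A rasterised sample queued for one tile (toy representation).
struct Fragment { int x, y; float depth; int material; };

// Toy stand-in for running a pixel shader -- the expensive part.
static uint32_t shade(int material) { return 0xFF000000u | uint32_t(material); }

struct Tile {
    float    depth[kTileSize][kTileSize];
    int      winner[kTileSize][kTileSize];   // which surface owns each pixel
    uint32_t colour[kTileSize][kTileSize];
};

Tile resolveAndShadeTile(const std::vector<Fragment>& binned)
{
    Tile t = {};
    for (int y = 0; y < kTileSize; ++y)
        for (int x = 0; x < kTileSize; ++x) {
            t.depth[y][x] = 1.0f;            // far plane
            t.winner[y][x] = -1;
        }

    // Phase 1: hidden-surface removal against the on-chip tile depth buffer.
    // Nothing has been shaded yet, so overdraw costs no shader work and no
    // external bandwidth.
    for (const Fragment& f : binned)
        if (f.depth < t.depth[f.y][f.x]) {
            t.depth[f.y][f.x]  = f.depth;
            t.winner[f.y][f.x] = f.material;
        }

    // Phase 2: shade only the pixels that survived; in hardware the finished
    // tile is then written to external memory exactly once.
    for (int y = 0; y < kTileSize; ++y)
        for (int x = 0; x < kTileSize; ++x)
            if (t.winner[y][x] >= 0)
                t.colour[y][x] = shade(t.winner[y][x]);
    return t;
}
```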

Assuming a scene with 4x overdraw (compared to an IMR without overdraw-saving techniques):

PowerVR will reduce the texture bandwidth requirements to 25% of what they were.

PVR will also reduce the number of pixel shader operations to 25% (on average).

PVR will also reduce framebuffer bandwidth to 25% of what it was, or less if there is transparency in the game.

Also, one of the not very often touted benefits of PVR tech is that not only is less bandwidth used, but when PVR does use its external memory it can use it much more efficiently. It never writes to the framebuffer in a non-contiguous manner; the only time it reads erratically is when doing texture accesses. I have no idea what prefetch scheme PVR implementations use, but seeing as the textures for a particular tile can be known long before it is rendered, I'd say prefetching the textures for each tile into cache would be pretty easy.
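A prefetch along these lines is at least conceivable, since each tile's display list - and therefore its texture working set - is known before the tile is rasterised. This is pure speculation about how it could be done; prefetchTexture() and the one-tile lookahead are invented for the sketch:

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct DrawRecord { int textureId; /* geometry, state, ... */ };

// Hypothetical cache-priming and rasterisation calls; the real hardware
// mechanism is unknown to me.
static void prefetchTexture(int /*textureId*/)                  { /* prime texture cache */ }
static void rasteriseTile(const std::vector<DrawRecord>& /*t*/) { /* render the tile */ }

void renderTiles(const std::vector<std::vector<DrawRecord>>& tiles)
{
    for (std::size_t i = 0; i < tiles.size(); ++i) {
        // While tile i is being rendered, the texture working set of tile
        // i+1 is already known from its display list, so it can be pulled
        // into cache ahead of time.
        if (i + 1 < tiles.size()) {
            std::set<int> nextTextures;
            for (const DrawRecord& d : tiles[i + 1])
                nextTextures.insert(d.textureId);
            for (int tex : nextTextures)
                prefetchTexture(tex);
        }
        rasteriseTile(tiles[i]);
    }
}
```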

The main reason TBRs never took off (PVR in particular) was compatibility issues (which no longer exist since the KYRO) and the *cough* time to market of many of their products *COUGH* ;)

If at some point IMGTEC were to release a PowerVR implementation that was feature-complete and actually faster than other cards on the market, then they could do really, really well.


Oh yes, did I forget to mention that by using multi-sampling, AA (like NV's Quincunx but without the blurry filter) can be done with virtually no performance cost on a tiler? 4x MSAA increases your frame and z-buffer access requirements 4x, but if those buffers are on chip, like in PowerVR, your total extra bandwidth cost is... zilch.
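A back-of-the-envelope comparison of that external frame/z-buffer traffic claim; the 2x overdraw figure and the byte counts are assumptions chosen only to make the numbers concrete, and the Z traffic model is heavily simplified:

```cpp
#include <cstdio>

// External colour + Z traffic per frame, in bytes (very simplified model:
// one colour write and one Z read+write per covered sample, with an
// assumed overdraw factor; these numbers are illustrative only).
static double imrTraffic(double pixels, double samples, double overdraw)
{
    const double bytesColour = 4.0, bytesZRW = 8.0;   // assumed formats
    return pixels * samples * overdraw * (bytesColour + bytesZRW);
}

static double tbdrTraffic(double pixels)
{
    // Colour and Z samples stay on chip; only resolved pixels go out, once.
    return pixels * 4.0;
}

int main()
{
    const double pixels = 1280.0 * 1024.0, overdraw = 2.0;   // assumptions
    std::printf("IMR,  no AA:   %.0f MB\n", imrTraffic(pixels, 1, overdraw) / (1 << 20));
    std::printf("IMR,  4x MSAA: %.0f MB\n", imrTraffic(pixels, 4, overdraw) / (1 << 20));
    std::printf("TBDR, no AA:   %.0f MB\n", tbdrTraffic(pixels) / (1 << 20));
    std::printf("TBDR, 4x MSAA: %.0f MB\n", tbdrTraffic(pixels) / (1 << 20));  // unchanged
    return 0;
}
```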
 
Amusing how these religious debates persist. ;) This thread mirrors the one on ray-tracing v. rasterisation.

When it comes down to it, an architecture that you can actually buy is better than one that you can't, regardless of the technical details!
 
Kristof said:
Something to remember is that Early Z, even with an initial depth pass, is far from ideal... it's all zone/area-based, hence in a lot of cases you still need to go all the way down to the final depth buffer to figure out if a single pixel is actually visible or not.
Which won't really matter in a shader-heavy scenario. At most, an early-z check will waste one clock, while rendering a pixel could take 50 or more in future games (and we must be talking about future games if we're talking about hypothetical TBDRs).
 
That's only testing the z write speeds, not rejection. IIRC, rejecting costs more cycles because it has to purge the pipeline of the pixel it was calculating. I believe initial hardware didn't do early Z-reject/testing because the latencies involved in rejecting the pixel(s) were costlier than the bandwidth of writing the pixel; however, as pixel operation counts went up, it became more profitable to do early rejects (and they have probably gotten much better at flushing the pipeline as well).
 
Chalnoth said:
Kristof said:
Something to remember is that Early Z, even with an initial depth pass, is far from ideal... it's all zone/area-based, hence in a lot of cases you still need to go all the way down to the final depth buffer to figure out if a single pixel is actually visible or not.
Which won't really matter in a shader-heavy scenario. At most, an early-z check will waste one clock, while rendering a pixel could take 50 or more in future games (and we must be talking about future games if we're talking about hypothetical TBDRs).

An early Z check that fails the early part, as with older architectures, produces the whole pixel and only does the final full-detail Z check to decide on the update... this is "considerably" more than a single clock. Since in real games most up-front objects are detailed objects with lots of relatively small polygons, this results in poor occluding info and lots of pixels which go the whole way and do NOT update the early Z buffer in a sensible way. A benchmark like VillageMark, on the other hand, with few but large polygons, is ideal for occluding and updating the early Z buffer. Other benchmarks I have seen also use large, potentially even full-screen polygons which do not highlight this issue.

Early Z systems are "area"-based, and as long as you cannot write info that is valid for the whole Z area you cannot do anything; hence small polygons cover only part of the Z area and can NOT be used for early removal.

Point is in the real world early z checking is considerably less efficient than you might all think at first sight...

K-
 
Chalnoth said:
DaveBaumann said:
AFAIK pixel level Z rejects generally waste more than a single clock.
Given this benchmark, I would generally expect typical Z rejects to waste less than one clock per pixel:
http://www.xbitlabs.com/articles/video/display/nv40_18.html
(Look at the bottom graph, and note it's above the theoretical maximum)

Given the comment in the text, either the test is flawed or, as said, it uses massively large full-screen opaque polygons... since it's a fillrate test they probably used very large polygons to avoid triangle throughput limits. If that happens you can do a quick polygon reject using a high level of your hierarchical Z-buffer. So yes, potentially they are rejecting a whole polygon per clock, or fairly large chunks of polygons per clock... however, this situation will (almost) never occur in a real-world case: large polygons hiding equally large polygons, which is ideal for the area-based testing and occluding... the real world consists mainly of medium to small polygons which will cover only part of the Z-test areas. The only large polygons are the background, sky and potentially the ground, which occlude... well, not much...

K-
 
Early Z systems are "area"-based, and as long as you cannot write info that is valid for the whole Z area you cannot do anything; hence small polygons cover only part of the Z area and can NOT be used for early removal.

Are you sure about this? I'm not sure that the region-based early systems (NVIDIA's Z-Cull and ATI's Hier-Z) are considered a (pixel-level) early z check. I'd say the pixel-level early z check would more than likely be done just as the quad is assigned to a render pipeline - before any pixel instructions are executed, the dispatch engine would request Z information from the ROP and discard those pixels deemed to be occluded according to the full-resolution Z-buffer.

It would still require more than one clock to do the z check, pixel discard and instruction flush, though.
 
I am talking about updating the Early Z info, thus not the trivial accept or reject of quads or areas of polygons. Small polygons cannot occlude but can be occluded. Having things to occlude behind is essential: effectively, even though the monster is close by, it has such small polygons that it's difficult to update the Early Z (area-based) info, and hence the monster fails to occlude. For the monster to occlude you'd have to merge polygons until they are big enough to cover an Early Z area.

Another thing, since we are talking about an initial depth pass: AFAIK one of the only apps that does this is FableMark... which should give you a hint on how architectures deal with it. Most of these initial-depth-pass techniques depend on blending in light effects; this blending is framebuffer blending, which on a TBDR occurs on chip, while on an IMR it occurs off chip, hence the bandwidth increase for the read/modify/write.

K-
 
But even with a full-resolution z-buffer check you can avoid potentially unnecessary pixel operations, depending on where the z check is done.
 
DaveBaumann said:
Are you sure about this? I'm not sure that the region-based early systems (NVIDIA's Z-Cull and ATI's Hier-Z) are considered a (pixel-level) early z check. I'd say the pixel-level early z check would more than likely be done just as the quad is assigned to a render pipeline - before any pixel instructions are executed, the dispatch engine would request Z information from the ROP and discard those pixels deemed to be occluded according to the full-resolution Z-buffer.

It would still require more than one clock to do the z check, pixel discard and instruction flush, though.

This is the back-end check you're talking about. AFAIK this should work as it has always worked: most likely in parallel with the execution of the pixel. You could try to break the execution of the shader once the depth check has completed, but that would be messy, and flushing the pipeline info would take quite a bit of time. It would be possible, but I wonder about the complexity of going into the whole batch of pixels (there is probably more than a single quad in flight in the pipelines) and picking out those that can be stopped and flushed. Since it's a SIMD process you might end up breaking the whole thing? I suspect you need to wait it out and discard at the back end of the pipeline rather than try to selectively create bubbles in the pipeline. Although technically it should not be impossible: the most basic shader has, say, one op, so you need to be able to Z-check at that rate; if your shader is multiple ops, your Z-check should complete prior to the shader (obviously assuming that the PS does not update the depth info), hence you could try to break the operation of the shader if the Z-check has failed. It would probably be a management nightmare, though, trying to kill some of the pixels in flight... but not impossible, I'd guess.

K-
 
Chalnoth said:
Fine, but does it really matter? Seriously, as shaders approach hundreds of instructions (which should happen in about two years), what percentage of the rendering time would be taken up by an initial z-pass, even with no optimizations? It'd be pretty minimal.
Hmmm... I'm not sure. This is inventing a scenario to solve a problem. Once rendering gets intense enough, IMR hardware that reduces overdraw to near zero in a fraction of the time it takes to render a screen full of pixels should work well enough for the common case. I think IMRs are a long way from this. Certainly current offerings are disappointing in this area.

Also, the z pass may never be as trivial as you think, since the game has to shuffle through its whole scene twice. Stencil buffering is going to be a biggie in the coming years, and I don't think it is ever going to be cheap on an IMR, even using these sorts of techniques.

The deferred renderer will always have the advantage where z checking and similar requirements are concerned. For simply drawing polygons to the screen with very complex pixels, IMRs may evolve to minimize the advantage of deferred renderers. There are going to be complexity increases in other areas at the same time, so we'll have to wait and see what the future brings.

Now is the time for these sorts of optimizations on an IMR, and we aren't seeing them, so they may be harder to implement than they sound, or not as rewarding.

Chalnoth said:
No, the buffer overflow problem is a pretty hardware-specific problem. The hardware's either going to have to have a large portion of its video memory unused and unavailable to prevent overflows, or it's going to overflow. There's just no way around that, and there's no easy way for a developer to know for certain how many triangles are on the screen at once.
We don't know that. If the application feeds the TBDR triangles in a localized order, memory access would be contiguous and non-repetitive (more or less) in an overflow situation. I don't really know all the ins and outs of what happens in this case, but there are definitely going to be application optimizations to work around the problem. Simply saying "the card will have to swap, game over" isn't accurate.

Chalnoth said:
This is what vertex buffers are for.
I don't know what these are, but any sort of hardware triangle support can be tailored to suit a tiler's needs. As long as it knows the scope of the operation, it doesn't have to spin up the triangles until it needs them. It comes down to whether you can store as big a scene as the application can send across the bus. To my knowledge the current and future answer is yes.


I think I see where you are going with this. In a perfect world where framebuffer access represents 99% of all memory traffic, all you need is enough horsepower in the pipeline of an IMR to keep up regardless of overdraw. In this same perfect world, a deferred renderer would have vast amounts of storage space and all the advantages of scene capturing, which include really fast AA. In a perfect world none of it really matters; applications would be rendering their own scenes on the mainboard.

Scene capturing carries with it the ultimate limit of not being able to render a scene with an infinite number of triangles. This is the trade-off made for being able to perform the task at hand much more efficiently. This is optimization 101: design around the scope of the task to get it done as effectively as possible.
 