I don't see why alpha blending need be handled any differently from texture fetches, in terms of latency hiding. That is, I would expect that an architecture could simply use the latency hiding that is used for texture fetches to also hide the latency for alpha blends.
Before you start throwing around the latest buzzwords like "dynamically allocated shader pipes": this only allows you to push shader performance where it's required; you still need pretty much the same buffering to cover all cases.
Not really. If you design the pipelines in an optimal fashion, then there's no need for much of any vertex->pixel cache. The main problem is that you'd want the pipelines to be able to switch quickly between vertex and pixel processing. For example: have a pipeline do all vertex processing for a single triangle, then do all pixel shading for that triangle, then start on the next triangle. If an architecture like this could be designed to handle two states at once efficiently, there would be no problem with load balancing.
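To make the idea concrete, here's a toy sketch of such a dual-state pipe: it does the vertex work for one triangle, switches state to shade that triangle's pixels, then moves on to the next triangle. All names (`unified_pipe`, `rasterize`, the `xform`/`shade` labels) are hypothetical stand-ins, not any real hardware's terminology.

```python
# Toy model of a dual-state pipeline: per triangle, run vertex work first,
# then switch state and shade that triangle's pixels, then start the next
# triangle. Purely illustrative.

def rasterize(tri):
    # Stand-in for triangle setup: pretend each triangle covers three pixels.
    return [f"{tri}:px{i}" for i in range(3)]

def unified_pipe(triangles):
    results = []
    for tri in triangles:
        vertex_out = f"xform({tri})"       # state 1: vertex processing
        for px in rasterize(vertex_out):   # state 2: pixel shading
            results.append(f"shade({px})")
    return results
```

The point of the sketch is just the control flow: one pipe, two states, no vertex->pixel cache beyond the triangle currently in flight.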
There are a couple of issues, of course. The first is branching: you'd have to handle branches efficiently with very long pipelines designed for texture access latency hiding. One way around this might be to have only one "idle" stage that is designed to hide most of the latency, with data that doesn't need texture accesses being promoted by some amount within the "idle" stage (which would largely act like a FIFO buffer). With such an architecture, you'd obviously need tags in the data in the buffer to indicate what to do with that data, as there'd be no realistic way to keep it all straight in other ways. Since we're talking about a dual-state system, that tag may just be one bit.
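The "idle" stage above can be modeled as a tagged FIFO. In this toy version (all names hypothetical), each entry carries the one-bit vertex/pixel tag plus a flag saying whether it's waiting on a texture fetch; entries that don't need a fetch get promoted past the ones that are still waiting.

```python
from collections import deque

# Toy model of the single "idle" stage: a FIFO whose entries carry a one-bit
# tag (0 = vertex, 1 = pixel) and a needs_fetch flag. Entries not waiting on
# a texture fetch are promoted ahead of fetch-bound entries. Illustrative only.

def drain_idle_stage(entries):
    """entries: list of (tag_bit, needs_fetch, payload) in FIFO order."""
    fifo = deque(entries)
    ready, waiting = [], []
    while fifo:
        tag, needs_fetch, payload = fifo.popleft()
        if needs_fetch:
            waiting.append((tag, payload))   # sits in the idle stage
        else:
            ready.append((tag, payload))     # promoted past waiting work
    # Promoted entries retire first; fetch-bound ones retire as data arrives.
    return ready + waiting
```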
Anyway, I'm not going to go any further into the problems. I'm sure you can think of a number of other problems with this solution, but the questions remain: Can this become more efficient than dedicated vertex and pixel pipelines? Is the additional transistor cost worth it? What additional programming possibilities could this add for developers, and would they make the change more worthwhile?
There are, of course, obvious benefits such an architecture could attain: absolutely zero stalling between vertex and pixel data with only a tiny buffer (you'd essentially only need to store a few incoming pieces of vertex/triangle data, a couple of triangles for the triangle setup engine to work on, and a couple of pixels output from the triangle setup engine...pixels would get priority over vertex data, with vertex data executing only when the triangle setup engine has no more pixel data to give to the pixel pipelines), and perfect load balancing between vertex and pixel work.
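That buffering and priority scheme can be sketched as a toy scheduler: a small vertex-input queue, a small pixel queue fed by triangle setup, and pixels always winning over vertices. Queue sizes, names, and the two-pixels-per-triangle setup output are all made up for illustration.

```python
from collections import deque

# Toy scheduler for the tiny-buffer scheme: pixels get priority; vertex work
# runs only when triangle setup has no more pixels to hand out. Each scheduled
# vertex (triangle) feeds new pixels into the setup queue. Illustrative only.

def schedule(vertex_queue, setup):
    """vertex_queue: deque of triangles; setup: deque of pending pixels."""
    trace = []
    while setup or vertex_queue:
        if setup:                        # pixels first, always
            trace.append(("pixel", setup.popleft()))
        else:                            # no pixels left -> do vertex work
            tri = vertex_queue.popleft()
            trace.append(("vertex", tri))
            setup.extend(f"{tri}:px{i}" for i in range(2))  # setup emits pixels
    return trace
```

Note how the trace interleaves: a vertex is only scheduled at the moments the pixel queue runs dry, which is exactly the load-balancing behavior described above.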