HSR vs Tile-Based rendering?

Raqia

I'd like to know in detail what the difference is between the PowerVR architecture's tile rendering scheme and the HSR employed by more popular architectures today.

Why, for instance, are tiles used instead of treating the whole screen at once? A good explanation or a link to a detailed FAQ would be appreciated; I googled to no avail.
 
1. Tile-based rendering is not specific to PowerVR architectures. Architectures frequently break things up into tiles for rendering because it makes memory accesses more efficient.

2. The "HSR" of current architectures is essentially a way of keeping hidden objects from being rendered. Modern architectures have essentially optimized the case when a z-buffer fail happens, keeping z-buffer fails from reducing performance as much as possible (an older architecture, upon a z-buffer fail, would still render the pixel and simply not write it to the screen; a newer architecture can skip that rendering altogether through a number of techniques).

3. The tile-based deferred rendering of PowerVR's architectures is pretty simple: Wait until the software sends all of the commands to draw one frame, while storing every command that comes in. Sort these commands by tile, then render one tile at a time.

By rendering one tile at a time, with full knowledge of every command that will execute within that tile, the hardware can completely sort all polygons before rendering, so only visible pixels are ever shaded, allowing for maximal rendering efficiency.

There is also a memory bandwidth benefit, as the entire tile is written to the framebuffer once, and no external z-buffer is needed.
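
To make that flow concrete, here is a minimal, runnable sketch of the bin-then-render scheme; the tile size, data layout, and names are all illustrative, not anything PowerVR actually does in silicon, and primitives arrive pre-rasterized as fragments to keep the sketch short:

```python
# A tiny, runnable sketch of tile-based deferred rendering. Primitives arrive
# pre-rasterized as (x, y, z, color) fragments; real hardware bins triangles
# and rasterizes them per tile, in fixed-function logic.

TILE = 4  # tile size in pixels; real chips use something like 16x16 or 32x32

def render_frame(fragments):
    # Phase 1 (binning): capture the entire scene, sorted by tile.
    bins = {}
    for f in fragments:
        key = (f["y"] // TILE, f["x"] // TILE)
        bins.setdefault(key, []).append(f)

    # Phase 2 (deferred rendering): resolve each tile entirely on-chip.
    framebuffer = {}
    for key, frags in bins.items():
        depth = [[float("inf")] * TILE for _ in range(TILE)]  # on-chip z
        color = [["bg"] * TILE for _ in range(TILE)]          # on-chip color
        for f in frags:                       # visibility is settled first,
            ty, tx = f["y"] % TILE, f["x"] % TILE
            if f["z"] < depth[ty][tx]:        # so only the front-most fragment
                depth[ty][tx] = f["z"]        # at each pixel is ever shaded
                color[ty][tx] = f["color"]
        framebuffer[key] = color  # one external write per tile; z never leaves chip
    return framebuffer

# Two fragments covering the same pixel: only the nearer one survives.
scene = [{"x": 1, "y": 1, "z": 5.0, "color": "red"},
         {"x": 1, "y": 1, "z": 2.0, "color": "blue"}]
print(render_frame(scene)[(0, 0)][1][1])  # -> blue
```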

The problem with this deferred rendering approach is the fact that the entire scene must be cached before rendering. This "scene buffer" effectively takes the place of the z-buffer. The objection I have to this approach is that the size of the scene buffer is unknown before rendering, and so there are always going to be possible buffer overflow issues which could drastically impact performance. By contrast, the z-buffer's size is well-known prior to rendering. Immediate-mode renderers (pretty much any GPU available today) also have a number of techniques that improve efficiency, so I don't see a reason to go "all the way" to deferred rendering to solve problems that, in my mind, don't really need solving given today's rendering performance.
 
Chalnoth said:
1. Tile-based rendering is not specific to PowerVR architectures. Architectures frequently break things up into tiles for rendering because it makes memory accesses more efficient.

2. The "HSR" of current architectures is essentially a way of keeping hidden objects from being rendered. Modern architectures have essentially optimized the case when a z-buffer fail happens, keeping z-buffer fails from reducing performance as much as possible (an older architecture, upon a z-buffer fail, would still render the pixel and simply not write it to the screen; a newer architecture can skip that rendering altogether through a number of techniques).

3. The tile-based deferred rendering of PowerVR's architectures is pretty simple: Wait until the software sends all of the commands to draw one frame, while storing every command that comes in. Sort these commands by tile, then render one tile at a time.

By rendering one tile at a time, with full knowledge of every command that will execute within that tile, the hardware can completely sort all polygons before rendering, so only visible pixels are ever shaded, allowing for maximal rendering efficiency.

There is also a memory bandwidth benefit, as the entire tile is written to the framebuffer once, and no external z-buffer is needed.

The problem with this deferred rendering approach is the fact that the entire scene must be cached before rendering. This "scene buffer" effectively takes the place of the z-buffer. The objection I have to this approach is that the size of the scene buffer is unknown before rendering, and so there are always going to be possible buffer overflow issues which could drastically impact performance. By contrast, the z-buffer's size is well-known prior to rendering. Immediate-mode renderers (pretty much any GPU available today) also have a number of techniques that improve efficiency, so I don't see a reason to go "all the way" to deferred rendering to solve problems that, in my mind, don't really need solving given today's rendering performance.

So are all the purported bandwidth savings from reducing the number of texture reads, or does this caching also provide savings at the triangle setup stage?
 
The bandwidth savings come from outputting all pixels in a tile at once, and only once.

An immediate-mode renderer may have to overwrite some values in the framebuffer, and may not be able to write many pixels at once (for small triangles).
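
As a back-of-envelope comparison (all numbers assumed; caches and framebuffer compression ignored):

```python
# Back-of-envelope framebuffer traffic per frame; all numbers are assumed.
W, H     = 1024, 768   # resolution
COLOR_B  = 4           # bytes per color write
Z_B      = 4           # bytes per depth value
OVERDRAW = 3           # average number of fragments per screen pixel

pixels = W * H

# IMR (no compression, every fragment passing): each fragment costs a z read,
# a z write, and a color write to external memory.
imr_bytes = pixels * OVERDRAW * (Z_B + Z_B + COLOR_B)

# TBDR: z and intermediate colors stay on-chip; each final pixel's color is
# written to external memory exactly once.
tbdr_bytes = pixels * COLOR_B

print(imr_bytes / 2**20, "MB vs", tbdr_bytes / 2**20, "MB")  # ~27 MB vs 3 MB
```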

But I still think that TBDR solves a problem that doesn't need solving right now (memory bandwidth), while at the same time creating new problems that we otherwise wouldn't have to worry about.
 
Chalnoth said:
The bandwidth savings come from outputting all pixels in a tile at once, and only once.

An immediate-mode renderer may have to overwrite some values in the framebuffer, and may not be able to write many pixels at once (for small triangles).

But I still think that TBDR solves a problem that doesn't need solving right now (memory bandwidth), while at the same time creating new problems that we otherwise wouldn't have to worry about.

Not entirely true; there are all the texture reads that you would otherwise have had to do for the overdrawn pixels.

And it's not just about memory bandwidth but also execution bandwidth: if you have to run a long shader on every pixel, you don't want an overdraw of 3-4x. Avoiding that overdraw saves you execution time and lets you spend it elsewhere. It's not all just about memory bandwidth.

As such it makes the most of the two most important resources, limited memory bandwidth and execution time, something I think all developers would be for.
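
Putting rough, assumed numbers on the execution side, and giving the IMR no early-z rejection so the overdraw cost stands out:

```python
# Rough effect of overdraw on shader execution time; every figure is assumed.
pixels         = 1024 * 768
cycles_per_pix = 40       # a "long" pixel shader
overdraw       = 3.5      # average depth complexity
pipes          = 16       # parallel pixel pipelines
clock_hz       = 400e6

imr_cycles  = pixels * overdraw * cycles_per_pix  # shades hidden pixels too
tbdr_cycles = pixels * cycles_per_pix             # shades visible pixels only

print(imr_cycles / pipes / clock_hz * 1000, "ms")   # ~17 ms of shading
print(tbdr_cycles / pipes / clock_hz * 1000, "ms")  # ~5 ms of shading
```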

As far as creating new problems goes, please do enlighten me. Sure, you have a scene buffer, but you no longer have a z-buffer, so memory usage will be similar and sometimes less. Other than that, there aren't too many other problems I'm aware of.

Then, if you want to get into future advancements, a deferred renderer is actually a step closer to being able to do ray tracing than an IMR, since you already have to collect the whole scene before you can begin rendering.
 
Raqia said:
Why for instance are tiles used instead of treating the whole screen at once? A good explanation or a link to a detailed faq would be appreciated, I googled to no avail.

The only explanation I can think of off the top of my head is that a tiled approach allows you to use on-chip caches.
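
The sizing argument is easy to see with some illustrative numbers:

```python
# Storage for one tile versus the whole screen, with illustrative sizes.
tile_w = tile_h = 32
bytes_per_pixel = 4 + 4 + 1   # color + depth + stencil

tile_bytes   = tile_w * tile_h * bytes_per_pixel  # 9216 B: fits in on-chip SRAM
screen_bytes = 1024 * 768 * bytes_per_pixel       # ~6.8 MB: clearly doesn't

print(tile_bytes, screen_bytes)
```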
 
Chalnoth said:
But I still think that TBDR solves a problem that doesn't need solving right now (memory bandwidth), while at the same time creating new problems that we otherwise wouldn't have to worry about.

Ironic that the NV40 and R420 both seem to be pretty much bandwidth-limited. Also, it seems funny that so much effort has been put into bandwidth-saving tricks for IMRs and they are still bandwidth-limited, yet almost no effort has been put into solving the problem you mentioned... and it still hasn't caused actual *problems*
 
For old games, sure. But advanced shaders themselves will tend to reduce memory bandwidth limitations. This is probably the #1 reason the R420 pulls ahead of the NV40 in some of the more shader-heavy titles: Both cards have the same memory bandwidth, but the R420 has more fillrate. In shader-heavy titles, it's the extra fillrate that makes a difference.

Personally, I don't think anybody should really be caring much about improving performance in old games. It's plenty high already. It's new games that are important, and for new games memory bandwidth isn't the main limiting factor in current architectures.
 
Chalnoth said:
For old games, sure. But advanced shaders themselves will tend to reduce memory bandwidth limitations. This is probably the #1 reason the R420 pulls ahead of the NV40 in some of the more shader-heavy titles: Both cards have the same memory bandwidth, but the R420 has more fillrate. In shader-heavy titles, it's the extra fillrate that makes a difference.

Personally, I don't think anybody should really be caring much about improving performance in old games. It's plenty high already. It's new games that are important, and for new games memory bandwidth isn't the main limiting factor in current architectures.

You have a point. Fill-rate is pretty important, especially in new games, which makes halving-to-quartering your fill-rate requirements for a given scene just that much more valuable ;)
 
Chalnoth said:
For old games, sure. But advanced shaders themselves will tend to reduce memory bandwidth limitations. This is probably the #1 reason the R420 pulls ahead of the NV40 in some of the more shader-heavy titles: Both cards have the same memory bandwidth, but the R420 has more fillrate. In shader-heavy titles, it's the extra fillrate that makes a difference.

Personally, I don't think anybody should really be caring much about improving performance in old games. It's plenty high already. It's new games that are important, and for new games memory bandwidth isn't the main limiting factor in current architectures.

About memory bandwidth not being a major limiting factor anymore... I really doubt developers are going to move away from using lots and lots of textures just because they have shaders now. They're going to use everything that is available to them, whether that happens to be regular textures, seeds for procedurally generated textures, normal maps, or the different types of render targets (especially the FP ones). Memory bandwidth usage is and always will be a major factor, unless the disparity between processing power and bandwidth becomes too great (think CPUs). At that point developers may think about using strictly procedural content, but then bandwidth improvements will be needed more than ever.

Then, to address new and future games: developers are continually creating more and more intricate and detailed worlds, and that means more overdraw, which once again puts a large requirement on both bandwidth and execution. So I hardly see how a TBDR is solving problems that don't need to be solved; past, present, and future games all seem to be catered to very well.

Though an issue I would like to know more about is how much logic a TBDR costs, versus an IMR with HSR, or an IMR with that logic devoted to execution instead.

Edit: By the way, I apologize that you're not getting answers to your HSR and TBDR implementation question. I keep hoping that Kristof or Simon will pop in here and give you a good answer, but so far no luck on that.
 
Let's not forget full scene AA with almost zero (extra) memory footprint, and, I quote, "stupidly fast stencil buffering". Basically it does everything better using half the energy, generating half the heat.
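
A quick sketch of why the AA footprint stays small (all sizes assumed):

```python
# Why 4x FSAA costs a TBDR almost no extra external memory; numbers assumed.
tile_pixels  = 32 * 32
samples      = 4
color_b, z_b = 4, 4

# The 4x-sampled tile lives in on-chip SRAM and is downsampled to one sample
# per pixel before it is ever written out:
on_chip_bytes  = tile_pixels * samples * (color_b + z_b)  # 32 KB of SRAM
external_bytes = 1024 * 768 * color_b                     # still a 1x buffer

# An IMR doing 4x multisampling keeps full-size 4x color + z buffers external:
imr_external_bytes = 1024 * 768 * samples * (color_b + z_b)  # ~24 MB

print(on_chip_bytes, external_bytes, imr_external_bytes)
```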

Unlike an immediate mode card, you can't expect linear response all the way up to an infinite number of triangles per scene. If you're doing really open-ended stuff (like what, I don't know) there are some tasks an immediate mode renderer is better suited for. I doubt that includes any real-time rendering where 30 fps or more is expected on a consumer-class PC.

In the real world it's head and shoulders above the competition in design. The problem is there's no public buy-in, and the company holding all the workable patents doesn't give a rat's ass about making it in the PC market.
 
Sage said:
You have a point. Fill-rate is pretty important, especially in new games, which makes halving-to-quartering your fill-rate requirements for a given scene just that much more valuable ;)
Sure, it would be valuable if it were possible to halve or quarter the fillrate requirements. It's not.

With a modern GPU, it's very possible to not render a single hidden pixel.
 
Killer-Kris said:
About memory bandwidth not being a major limiting factor anymore... I really doubt developers are going to move away from using lots and lots of textures just because they have shaders now.
1. TBDRs don't really save on texture memory bandwidth. They save on framebuffer bandwidth.
2. Just because lots of textures are used doesn't mean that there aren't also lots of math operations.

Yes, of course we'll need more bandwidth going into the future, but that problem is being handled quite adequately by current architectures, and with the way the 3D industry is headed, a TBDR's advantages over an IMR in the memory bandwidth realm become less and less significant.
 
Raqia said:
I'd like to know in detail what the difference is between the PowerVR architecture's tile rendering scheme and the HSR employed by more popular architectures today.

Why, for instance, are tiles used instead of treating the whole screen at once? A good explanation or a link to a detailed FAQ would be appreciated; I googled to no avail.

Tiles are used so the chip can process (haphazard guess) 16 by 16 pixels at once. All the z processing and rendering is done against a small embedded chunk of high-speed memory; when a tile is done, the chip flushes it to the framebuffer and moves on to the next. You could probably build something that did the whole screen at once, but it would cost $10,000,000.
 
Jerry Cornelius said:
Let's not forget full scene AA with almost zero (extra) memory footprint, and, I quote, "stupidly fast stencil buffering". Basically it does everything better using half the energy, generating half the heat.
Which is great, but there's potentially a massive problem if there is ever a scene buffer overflow. When a scene buffer overflow happens, the hardware suddenly needs to use a z-buffer and make the external framebuffer full-size. That's a massive difference in memory bandwidth usage, and it would absolutely slaughter performance.
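
One plausible accounting of that cost, with assumed numbers:

```python
# Assumed numbers for an overflow: the partial render's z (and color) must be
# spilled to external memory and read back so rendering can resume with the
# rest of the scene.
W, H = 1024, 768
Z_B, COLOR_B = 4, 4

spill  = W * H * (Z_B + COLOR_B)  # write intermediate z + color out: ~6 MB
reload = W * H * Z_B              # read z back for the next pass:   ~3 MB

print((spill + reload) / 2**20, "MB of extra traffic per overflow")
```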
 
Chalnoth said:
1. TBDR's don't really save on texture memory bandwidth. They save on framebuffer bandwidth.

Ya, actually they do, since you aren't rendering hidden pixels. Even with advanced immediate mode z tricks, there's no guarantee that the triangles are being processed front to back. And since you are rendering one triangle at a time, there will always be some overdraw unless the calling game/application has gone to the trouble of cropping every overlapping triangle.
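
With some assumed numbers (and ignoring whatever the texture cache absorbs):

```python
# Assumed numbers for texture reads spent on overdrawn pixels.
pixels   = 1024 * 768
overdraw = 3     # average fragments per pixel
fetches  = 4     # texture fetches per shaded fragment
texel_b  = 4     # bytes per fetched texel, before the texture cache helps

shade_everything = pixels * overdraw * fetches * texel_b  # ~36 MB per frame
visible_only     = pixels * fetches * texel_b             # ~12 MB per frame

print(shade_everything / 2**20, visible_only / 2**20)
```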
 
Jerry Cornelius said:
Ya, actually they do, since you aren't rendering hidden pixels. Even with advanced immediate mode z tricks, there's no guarantee that the triangles are being processed front to back.
1. Rendering hidden pixels doesn't change the memory bandwidth/fillrate ratios (which is the important factor in whether or not memory bandwidth is the primary limitation).
2. Rendering hidden pixels is completely unnecessary with an initial z-pass (something that is useful for shadow rendering anyway).
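
For what it's worth, point 1 can be restated with assumed numbers: overdraw consumes fillrate and framebuffer bandwidth in proportion, so it doesn't change which of the two is the bottleneck:

```python
# What "the ratio doesn't change" means, with assumed numbers.
fill_pix_s   = 8 * 400e6   # 8 pipes at 400 MHz -> 3.2 Gpixels/s
bw_avail     = 20e9        # 20 GB/s of memory bandwidth
bytes_per_px = 12          # z read + z write + color write per fragment

# Bandwidth needed to keep every pipe busy:
bw_needed = fill_pix_s * bytes_per_px  # 38.4 GB/s vs 20 GB/s available

# Overdraw multiplies the fragment count, consuming fillrate and framebuffer
# bandwidth together; the comparison below is unchanged, so the bottleneck
# stays the same even though frames take longer.
print(bw_needed / 1e9, "GB/s needed vs", bw_avail / 1e9, "GB/s available")
```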
 
Chalnoth said:
Jerry Cornelius said:
Ya, actually they do, since you aren't rendering hidden pixels. Even with advanced immediate mode z tricks, there's no guarantee that the triangles are being processed front to back.
1. Rendering hidden pixels doesn't change the memory bandwidth/fillrate ratios.
2. Rendering hidden pixels is completely unnecessary with an initial z-pass (something that is useful for shadow rendering anyway).

1. I'm not sure what you're saying here, but effective fillrate is the only kind that matters, and the ratio is surely going to change. Besides, it's not about ratios, it's about bandwidth per scene. Less texture mapping means less bandwidth.
2. You're saying that immediate mode renderers don't render any hidden pixels? A lot's changed in the year since I was reading up on this. Even so, an initial z-pass is required, and so is the z-test, and the z-test still scales with the raw pixel count regardless of any hierarchical-z scheme. So it's not a texture read, but it's still a memory access made to avoid a texture read.
 
Hierarchical-z isn't the only method that is used. Basically, a modern architecture does its z-tests early. If there's a z-fail, it doesn't need to render the pixel.

Additionally, an initial z-pass is very cheap, and is even required for stencil shadowing, a technique that's going to be used in a number of games after DOOM3.

And yes, the ratio of pixel fillrate to memory bandwidth is what determines whether the limiting factor is memory bandwidth or pixel fillrate. Rendering hidden pixels doesn't affect this ratio.
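
Here's a runnable toy of the initial z-pass plus early-z idea; the fragments are bare (x, y, z) tuples and "shading" is just counted, so this is purely illustrative rather than how any particular chip wires it up:

```python
# Toy depth pre-pass followed by an early-z shading pass.
from collections import defaultdict

def depth_prepass(frags, depth):
    # Pass 1: depth only -- cheap, no shading, fills the z-buffer.
    for x, y, z in frags:
        if z < depth[(x, y)]:
            depth[(x, y)] = z

def shading_pass(frags, depth):
    # Pass 2: the early z-test rejects hidden fragments before shading.
    shaded = 0
    for x, y, z in frags:
        if z <= depth[(x, y)]:   # only the visible surface passes
            shaded += 1          # (stands in for running the pixel shader)
    return shaded

frags = [(0, 0, 5.0), (0, 0, 2.0), (0, 0, 9.0)]  # 3x overdraw at one pixel
depth = defaultdict(lambda: float("inf"))
depth_prepass(frags, depth)
print(shading_pass(frags, depth))  # -> 1: only one fragment is ever shaded
```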
 