HSR vs Tile-Based rendering?

Chalnoth said:
No, but nVidia and ATI have such teams, and have not gone for deferred rendering. nVidia even owns some deferred rendering IP. There's a reason for this.

Ya, they don't want to be associated with a company that's never produced a high end offering in the PC market, or throw away all their years of investment in IMR, or risk infringing on a zillion patents (which sucks).

Chalnoth said:
If you have an idea of how to get around the massive performance drop that would be incurred from a buffer overflow, please, post it. Otherwise...
We don't know it's massive. Any application that stresses a TBDR this much will likely be stressing any card you can put in your PC. All that's needed is appropriate performance in the worst case. I don't know what the bandwidth requirements are for scene and list data once they've been transformed, but you could probably get away with another bank of low-cost memory in a high end card to really have it covered. There are going to be limits to what the system can throw at the card.
 
Chalnoth said:
If you have an idea of how to get around the massive performance drop that would be incurred from a buffer overflow, please, post it. Otherwise...

If you have an idea on how to build a stable quantum computer, please, post it. Otherwise.... I'm going to assume you don't have a PhD in the field. Doesn't mean other people don't. It's pretty vain to assume that just because you can't come up with a solution, no one else can.
 
alexsok said:
the only advantage of quantum computers over conventional hardware (being the feature touted by most scientists and specialists these days) is the drastic speed-up for complicated tasks, which can be executed and computed more efficiently than on current-generation hardware.
The idea of quantum computing is that it can, for some algorithms, produce a result in order N time where a silicon-based architecture would take order N^2 time. That's a drastic difference, but it will require a totally different programming paradigm, and thus it will be a huge challenge to leverage this potentially massive processing capability.

Imagine, for instance, if this could be applied to graphics. Resolution scales as order N^2: if you double the X-Y resolution, you quadruple the number of pixels. If one could possibly build a quantum computer that could compute a frame in order N time, then you could, for example, do 4x supersampling FSAA while only halving performance (as opposed to cutting it to a quarter, as happens in a silicon-based architecture). When the next architecture comes along that doubles performance, instead of moving from 800x600 to 1024x768, you'll be jumping all the way to 1600x1200. Granted, this just won't happen because this isn't a true order N^2 system, but hopefully you get the idea.
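Purely as a back-of-the-envelope illustration of that scaling argument (an invented comparison: conventional cost taken as proportional to pixel count, the hypothetical order-N machine's cost as proportional to its square root):
[code]
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Illustrative only: relative frame cost versus an 800x600 baseline. */
    int widths[]  = { 800, 1024, 1600 };
    int heights[] = { 600,  768, 1200 };
    double base   = 800.0 * 600.0;

    for (int i = 0; i < 3; i++) {
        double pixels = (double)widths[i] * heights[i];
        printf("%4dx%-4d  conventional: x%.2f   hypothetical order-N: x%.2f\n",
               widths[i], heights[i],
               pixels / base,          /* cost grows with pixel count       */
               sqrt(pixels / base));   /* cost grows with sqrt(pixel count) */
    }
    return 0;
}
[/code]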

As far as I can tell, other processing technologies will all act similarly to silicon-based designs, they'll just be different in implementation, and may thus be smaller, or faster, or produce less heat, etc.

Anyway, I really don't exactly know whether or not quantum computing will take off. I think that there are a lot of bad ideas, though. For example, I'm really not sure that quantum entanglement can really be adequately put to use in a quantum computer. The stuff I've read on it so far in regard to quantum computers just seems plain wrong.
 
Raqia said:
I'd like to know in detail what the difference is between the PowerVR architecture's tile rendering scheme and the HSR employed by more popular architectures today.
ARGHH! (/me makes note to write this up and put it on my web site for once and for all!!!)

"Hidden Surface Removal" just means "Don't show the objects that are hidden". It probably should be called "Visible Surface Determination" but the name evolved from the days of line graphics and "Hidden Line Removal" algorithms and so the term has stuck.

Z buffering on its own is already an HSR algorithm. So are scanline, painter's, and ray tracing/casting algorithms.
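As a minimal illustration of that (a toy software sketch, nothing to do with how any particular chip wires it up): every incoming fragment is depth-compared against what's stored for its pixel, and the hidden ones simply never reach the screen.
[code]
/* Minimal software z-buffer sketch: depth-test one fragment.
   Purely illustrative; real hardware does this for many pixels in parallel. */
typedef struct {
    float    depth;  /* current nearest depth for this pixel   */
    unsigned color;  /* current visible colour for this pixel  */
} Pixel;

/* Returns 1 if the fragment was visible and written, 0 if it was hidden. */
int shade_fragment(Pixel *fb, int width, int x, int y,
                   float frag_depth, unsigned frag_color)
{
    Pixel *p = &fb[y * width + x];
    if (frag_depth >= p->depth)  /* farther than what is already there */
        return 0;                /* hidden surface: rejected           */
    p->depth = frag_depth;       /* visible: keep its depth and colour */
    p->color = frag_color;
    return 1;
}
[/code]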

If someone specifically tells you it is something more specialised, then they are ignorant of 3D graphics and should go and read a proper graphics book and not rely on ramblings on the web. (Apart from mine, of course :) )


Why, for instance, are tiles used instead of treating the whole screen at once? A good explanation or a link to a detailed FAQ would be appreciated; I googled to no avail.
Locality, Locality, Locality. Did you look on the PowerVR website? There's a great explanation (although it doesn't describe some of the newer features) of why it is done.
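As a very rough sketch of what that locality buys (my own simplified pseudo-structure, not PowerVR's actual implementation): the screen is carved into small tiles, every triangle is binned into the lists of the tiles it touches, and each tile is then rendered entirely out of fast on-chip memory before being written out once.
[code]
/* Simplified tile-binning sketch; illustrative only. */
#define TILE_SIZE 32

typedef struct { float x0, y0, x1, y1; } Bounds;  /* triangle screen bounds */

void bin_triangle(Bounds b, int tiles_x, int tiles_y, int tri_id,
                  void (*add_to_tile)(int, int, int))
{
    int tx0 = (int)(b.x0 / TILE_SIZE), ty0 = (int)(b.y0 / TILE_SIZE);
    int tx1 = (int)(b.x1 / TILE_SIZE), ty1 = (int)(b.y1 / TILE_SIZE);

    if (tx0 < 0) tx0 = 0;
    if (ty0 < 0) ty0 = 0;

    for (int ty = ty0; ty <= ty1 && ty < tiles_y; ty++)
        for (int tx = tx0; tx <= tx1 && tx < tiles_x; tx++)
            add_to_tile(tx, ty, tri_id);  /* append triangle to this tile's list */
}

/* Each tile's list is later rasterised against on-chip z/colour storage,
   so the per-pixel traffic never leaves the chip until the tile is done. */
[/code]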

Chalnoth said:
3. The tile-based deferred rendering of PowerVR's architectures is pretty simple: ...
Simple? It might seem that way, but there's complicated things going on in the background.
The problem with this deferred rendering approach is the fact that the entire scene must be cached before rendering.
That was only true of Series 1 and 2. Kyro could handle any sized scene.
Chalnoth said:
But I still think that TBDR solves a problem that doesn't need solving right now (memory bandwidth), while at the same time creating new problems that we don't have to worry about.
[Sarcasm] Of course, and that's why memory busses have been getting smaller and smaller over the years[/sarcasm]

Killer-Kris said:
Edit: By the way, I apologize that you're not getting answers to your HSR and TBDR implementation question. I keep hoping that Kristof or Simon will pop in here and give you a good answer, but so far no luck on that.
Errr it's the weekend in UK. What do you expect? :rolleyes:

Chalnoth said:
1. TBDR's don't really save on texture memory bandwidth. They save on framebuffer bandwidth.
Cough!! Choke! Splutter!
/Me makes note not to have a mouthful of coffee when reading the forums.
Chalnoth said:
Which is great, but is potentially a massive problem if there is ever a scene buffer overflow. When there's a scene buffer overflow, the hardware will suddenly need to use a z-buffer, and make the external framebuffer full-size. That's a massive difference in memory bandwidth usage, and would absolutely slaughter performance.
No, that's a difference in memory usage not bandwidth.
Killer-Kris said:
I'm curious, but why are Nvidia and Ati doing optimizations like bri/tri-linear filtering?
Probably a combination of the fact that (a) a bilinear calc on most HW takes one cycle rather than two, and (b) it sometimes saves accessing another MIP map level... except that someone has told us that texture access is not an issue :devilish:
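For illustration only, here's roughly the shape of a "brilinear" trick (my own single-channel sketch with an assumed bilinear_sample() helper, not any vendor's actual algorithm): only the narrow band around a mip transition pays for the second bilinear fetch; everywhere else a single fetch from one mip level suffices.
[code]
/* Sketch of brilinear-style filtering: blend between two mip levels only
   near the transition, so most pixels need one bilinear fetch (one cycle,
   one mip level) instead of two. Illustrative only. */

extern float bilinear_sample(int mip, float u, float v);  /* assumed helper */

float brilinear_sample(float lod, float u, float v, float band /* e.g. 0.25f */)
{
    int   mip  = (int)lod;
    float frac = lod - (float)mip;

    if (frac < 0.5f - band)                    /* well inside the lower mip */
        return bilinear_sample(mip, u, v);     /* one bilinear fetch        */
    if (frac > 0.5f + band)                    /* well inside the upper mip */
        return bilinear_sample(mip + 1, u, v); /* one bilinear fetch        */

    /* Narrow transition band: do the full trilinear blend (two fetches). */
    float t = (frac - (0.5f - band)) / (2.0f * band);
    float a = bilinear_sample(mip, u, v);
    float b = bilinear_sample(mip + 1, u, v);
    return a + t * (b - a);
}
[/code]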
 
Simon F said:
....Locality, Locality, Locality. Did you look on the PowerVT website? There's a great explanation (although it doesn't describe some of the newer features) of why it is done....
PowerVT? :)
 
Chalnoth said:
DeanoC said:
In practice, the scene-capture side of TBR is irrelevant; ALL 3D cards capture the complete scene.

All high-performance rendering consists of the draw command placing data in a command buffer. When finished, the command buffer is flushed and the scene is rendered. If you ever overfill the command buffer, an expensive operation must occur (either more memory must be allocated, or the command buffer is processed and the draw command stalls until the GPU has finished using it).
No, IMR's render as commands are being sent. Deferred renderers are the only ones that wait, hence the term, "deferred rendering."
Gosh it's funny seeing Chalnoth telling a developer how the graphics cards he uses work :)

Chalnoth, if you think that commands are operated on "immediately" you are being very naive. As an exercise, have a think about what would happen if you sent 1 big wall polygon followed by 100 small polygons to an IMR that really did render things immediately.
 
Simon F said:
Chalnoth said:
3. The tile-based deferred rendering of PowerVR's architectures is pretty simple: ...
Simple? It might seem that way, but there's complicated things going on in the background.
There's always something complicated going on in the background. That doesn't mean it can't be understood in a relatively simple manner.

The problem with this deferred rendering approach is the fact that the entire scene must be cached before rendering.
That was only true of Series 1 and 2. Kyro could handle any sized scene.
Um, not without penalties.

Chalnoth said:
But I still think that TBDR solves a problem that doesn't need solving right now (memory bandwidth), while at the same time creating new problems that we don't have to worry about.
[Sarcasm] Of course, and that's why memory busses have been getting smaller and smaller over the years[/sarcasm]
The point was that current architectures aren't that memory bandwidth bound when you consider more advanced scenarios.

Chalnoth said:
1. TBDR's don't really save on texture memory bandwidth. They save on framebuffer bandwidth.
Cough!! Choke! Splutter!
/Me makes note not to have a mouthful of coffee when reading the forums.
What was your objection to this one? I know that PowerVR had virtual texturing, which would save on AGP texture bandwidth, but other than that, they're still doing the same basic rendering work.

Chalnoth said:
Which is great, but is potentially a massive problem if there is ever a scene buffer overflow. When there's a scene buffer overflow, the hardware will suddenly need to use a z-buffer, and make the external framebuffer full-size. That's a massive difference in memory bandwidth usage, and would absolutely slaughter performance.
No, that's a difference in memory usage not bandwidth.
Um, you're going to be outputting and inputting much more data for any frames that require external z and frame buffers (external frame buffer in this case meaning full resolution). Last I checked, that takes bandwidth.
Killer-Kris said:
I'm curious, but why are Nvidia and Ati doing optimizations like bri/tri-linear filtering?
Probably a combination of the fact that (a) a bilinear calc on most HW takes one cycle rather than two, and (b) it sometimes saves accessing another MIP map level... except that someone has told us that texture access is not an issue :devilish:
I'd say it has much more to do with the fillrate hit. If you're going to take an extra clock to do texture filtering, that's an extra clock during which you don't get a z-buffer or frame-buffer access.
 
Killer-Kris wrote:
I'm curious, but why are Nvidia and Ati doing optimizations like bri/tri-linear filtering?

take a look at this fillrate graph (edit: y-axis: pixel fillrate, x-axis: texel/pixel ratio)

[image: fillrate graph]


taken from this article
 
Chalnoth said:
Simon F said:
Chalnoth said:
3. The tile-based deferred rendering of PowerVR's architectures is pretty simple: ...
Simple? It might seem that way, but there's complicated things going on in the background.
There's always something complicated going on in the background. That doesn't mean it can't be understood in a relatively simple manner.
Well, that is true but I felt that you simplified a step too far when you implied that the whole scene had to be collected, then tiled, then rendered. All of these steps are going on in parallel.

The problem with this deferred rendering approach is the fact that the entire scene must be cached before rendering.
That was only true of Series 1 and 2. Kyro could handle any sized scene.
Um, not without penalties.
Which I'm saying are not that significant. The memory usage does go up by a jump (i.e. the allocation of a Z buffer) but the bandwidth usage does not.

Chalnoth said:
But I still think that TBDR solves a problem that doesn't need solving right now (memory bandwidth), while at the same time creating new problems that we don't have to worry about.
[Sarcasm] Of course, and that's why memory busses have been getting smaller and smaller over the years[/sarcasm]
The point was that current architectures aren't that memory bandwidth bound when you consider more advanced scenarios.
What are these "more advanced scenarios"? Is it making your pixel shader so busy on maths (with little texture access) that it becomes the main bottleneck? Could it be, then, that executing these pixels whenever they are obscured might yield even more performance benefits?

Chalnoth said:
1. TBDR's don't really save on texture memory bandwidth. They save on framebuffer bandwidth.
Cough!! Choke! Splutter!
/Me makes note not to have a mouthful of coffee when reading the forums.
What was your objection to this one? I know that PowerVR had virtual texturing, which would save on AGP texture bandwidth, but other than that, they're still doing the same basic rendering work.
What's virtual texturing got to do with it? A pixel that is rejected prior to texturing is one that reduces texture memory bandwidth. Not only does it save the initial external memory accesses, it also means that the texture cache is more efficient because things are less likely to be thrown out.
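A crude way to picture the bandwidth half of that, with completely made-up depth complexity and texel counts, and assuming the IMR gets no early-Z rejection at all (which flatters the TBDR):
[code]
#include <stdio.h>

int main(void)
{
    /* Invented numbers, purely illustrative. */
    double pixels           = 1024.0 * 768.0;
    double overdraw         = 3.0;   /* average depth complexity            */
    double bytes_per_texel  = 4.0;
    double texels_per_pixel = 8.0;   /* external fetches per textured pixel */

    /* IMR with no early-Z: every covered fragment may get textured.        */
    double imr_bytes  = pixels * overdraw * texels_per_pixel * bytes_per_texel;
    /* Deferred renderer: only the front-most (opaque) fragment is textured. */
    double tbdr_bytes = pixels * 1.0      * texels_per_pixel * bytes_per_texel;

    printf("IMR  texture traffic: %.1f MB/frame\n", imr_bytes  / 1e6);
    printf("TBDR texture traffic: %.1f MB/frame\n", tbdr_bytes / 1e6);
    return 0;
}
[/code]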

Chalnoth said:
Which is great, but is potentially a massive problem if there is ever a scene buffer overflow. When there's a scene buffer overflow, the hardware will suddenly need to use a z-buffer, and make the external framebuffer full-size. That's a massive difference in memory bandwidth usage, and would absolutely slaughter performance.
No, that's a difference in memory usage not bandwidth.
Um, you're going to be outputting and inputting much more data for any frames that require external z and frame buffers (external frame buffer in this case meaning full resolution). Last I checked, that takes bandwidth.
Well I think you should think about it a bit more. As a hint, consider the distribution of polygons across an image.

Killer-Kris said:
I'm curious, but why are Nvidia and Ati doing optimizations like bri/tri-linear filtering?
Probably a combination of the fact that (a) a bilinear calc on most HW takes one cycle rather than two, and (b) it sometimes saves accessing another MIP map level... except that someone has told us that texture access is not an issue :devilish:
I'd say it has much more to do with the fillrate hit.
Which is what I just said.
If you're going to take an extra clock to do texture filtering, that's an extra clock during which you don't get a z-buffer or frame-buffer access.
You're thinking from a 'software renderer' perspective. It's not like that in hardware.
 
Simon F said:
Which I'm saying are not that significant. The memory usage does go up by a jump (i.e. the allocation of a Z buffer) but the bandwidth usage does not.
That's just impossible. Instead of not outputting a z-buffer, you're outputting one. I mean, come on. That requires bandwidth.

What are these "more advanced scenarios"? Is it making your pixel shader so busy on maths (with little texture access) that it becomes the main bottleneck? Could it be, then, that executing these pixels whenever they are obscured might yield even more performance benefits?
Once again, there are ways around rendering hidden pixels even on IMR's. This isn't necessarily specific to deferred rendering.

What's virtual texturing got to do with it? A pixel that is rejected prior to texturing is one that reduces texture memory bandwidth. Not only does it save the initial external memory accesses, it also means that the texture cache is more efficient because things are less likely to be thrown out.
1. Hidden pixels don't necessarily need to be rendered on an IMR.
2. Even if the hidden pixels are rendered, this rendering doesn't change the memory bandwidth/fillrate ratios at all.
3. There's no reason that a deferred renderer is inherently more efficient with its texture cache.

Well I think you should think about it a bit more. As a hint, consider the distribution of polygons across an image.
Nope, still going to take more bandwidth if you want to output/input more data.
 
Chalnoth said:
Once again, there are ways around rendering hidden pixels even on IMR's. This isn't necessarily specific to deferred rendering.

Do these methods remove all overdraw?
 
Chalnoth said:
Simon F said:
Which I'm saying are not that significant. The memory usage does go up by a jump (i.e. the allocation of a Z buffer) but the bandwidth usage does not.
That's just impossible. Instead of not outputting a z-buffer, you're outputting one. I mean, come on. That requires bandwidth.
Does not go up by a jump. (I thought that was clear from my text.) Go and think about it.

What are these "more advanced scenarios"? Is it making your pixel shader so busy on maths (with little texture access) that it becomes the main bottleneck? Could it be, then, that executing these pixels whenever they are obscured might yield even more performance benefits?
Once again, there are ways around rendering hidden pixels even on IMR's. This isn't necessarily specific to deferred rendering.
However, if you had read Kristof's post you would see that TBDRs are still many times more efficient at rejecting useless pixels.

What's virtual texturing got to do with it? A pixel that is rejected prior to texturing is one that reduces texture memory bandwidth. Not only does it save the initial external memory accesses, it also means that the texture cache is more efficient because things are less likely to be thrown out.
1. Hidden pixels don't necessarily need to be rendered on an IMR.
But by a much smaller margin...
2. Even if the hidden pixels are rendered, this rendering doesn't change the memory bandwidth/fillrate ratios at all.
Yes it can. Hypothetically, assume there is an N kB cache and exactly N kB of texture data is needed to render the visible pixels in the scene.
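To spell that hypothetical out with a toy, entirely invented model: a direct-mapped cache that exactly fits the visible pixels' working set starts missing on every access once hidden fragments drag in a second, conflicting working set.
[code]
/* Toy cache-thrash illustration (all numbers invented): a direct-mapped
   texture cache that exactly fits the visible pixels' working set.        */
#include <stdio.h>

#define CACHE_LINES 256              /* cache holds exactly the visible set */

int main(void)
{
    int tag[CACHE_LINES];
    for (int i = 0; i < CACHE_LINES; i++) tag[i] = -1;

    int misses = 0;
    /* Interleave accesses from visible texture A (lines 0..255) and an
       overlapping hidden texture B (lines 256..511), as overdraw would.
       The 4 passes stand in for neighbouring pixels reusing the same lines. */
    for (int pass = 0; pass < 4; pass++) {
        for (int line = 0; line < CACHE_LINES; line++) {
            int a = line, b = line + CACHE_LINES;
            if (tag[a % CACHE_LINES] != a) { tag[a % CACHE_LINES] = a; misses++; }
            if (tag[b % CACHE_LINES] != b) { tag[b % CACHE_LINES] = b; misses++; }
        }
    }
    printf("misses with overdraw: %d (vs %d if only visible pixels were shaded)\n",
           misses, CACHE_LINES);
    return 0;
}
[/code]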

Well I think you should think about it a bit more. As a hint, consider the distribution of polygons across an image.
Nope, still going to take more bandwidth if you want to output/input more data.
I didn't say it didn't take more, I said it wasn't significant.
 
Chalnoth, I think you're fundamentally misunderstanding what's expensive at the hardware level.

Computing something is cheap; you can 'just' add more circuits (more pipelining, etc.). What isn't cheap is retrieving/storing things off chip.

Memory is slow AND it doesn't like random access patterns (I misplaced one of my favorite quotes, which basically says "RAM is the worst-named thing ever"). What TBRs do is localise the memory access: by 'doing' a bit of the screen at each moment, they can use very expensive fast RAM.

Now of course they have to hit 'slow' RAM sometimes, but they can spend some gates to a) reduce this to a minimum (i.e. deferring the texture sampling to the front fragment only) and b) linearise the access (i.e. outputting an entire z/colour tile in one go, not jumping around like triangle rendering implies).
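As a sketch of the 'linearising the access' point (illustrative only, not any specific hardware's scheme): when a tile is finished, its on-chip contents go out as one contiguous burst per row, instead of the scattered writes that rasterising triangles straight into external memory implies.
[code]
/* Illustrative tile write-back: one contiguous copy per tile row instead
   of scattered per-triangle writes. Not any specific hardware's scheme.  */
#include <string.h>

#define TILE 32

void flush_tile(const unsigned on_chip_color[TILE][TILE],
                unsigned *framebuffer, int fb_width,
                int tile_x, int tile_y)
{
    for (int row = 0; row < TILE; row++) {
        unsigned *dst = framebuffer
                      + (tile_y * TILE + row) * fb_width
                      + tile_x * TILE;
        memcpy(dst, on_chip_color[row], TILE * sizeof(unsigned));
    }
}
[/code]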

Even IMRs these days use a vast number of gates to do similar work (caches, hierarchical z-buffers, etc.).

What the real discussion should be about (and it was originally, with Kristof's very interesting posts with regard to z-prepass) is whether the IMR techniques (which are basically faking deferment) are better than or as good as the TBDR approach.

I'd say in the embedded/cheap space, the argument has already been won in favour of tile systems (I won't say deferred because there are quite a few tile architectures that aren't deferred). Sony, MBX, Intel, Nintendo, SEGA and MS are all using tile ideas (in Sony's and Nintendo's case they don't really support framebuffers bigger than a single tile (a big tile, mind :) ), but they can be coerced).

However, in the PC space conventional IMRs are currently the favourite. That's why we all want to benchmark a Series 5, a modern TBDR, against an ATI/NVIDIA-style IMR.

Edit: Tidied up a few things (brackets and the odd missing word)
 
Chalnoth said:
Simon F said:
Which I'm saying are not that significant. The memory usage does go up by a jump (i.e. the allocation of a Z buffer) but the bandwidth usage does not.
That's just impossible. Instead of not outputting a z-buffer, you're outputting one. I mean, come on. That requires bandwidth.

May I point you to the Macrotiling patent?

http://l2.espacenet.com/espacenet/viewer?PN=EP1287494&CY=gb&LG=en&DB=EPD

One of the things to understand is that even if a TBDR needs to go and render, it still does only a fraction of the external Z read/write ops that an IMR does... as SimonF said: think about it.
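To put some invented numbers on that 'fraction' (the IMR figure below assumes no early-Z or hierarchical-Z help at all, so treat it as an upper bound rather than a fair benchmark):
[code]
#include <stdio.h>

int main(void)
{
    /* Invented numbers, just to illustrate the ratio. */
    double width = 1024.0, height = 768.0, bpp = 4.0;
    double pixels           = width * height;
    double depth_complexity = 3.0;    /* average overdraw                  */
    double flushed_fraction = 0.25;   /* share of tiles that overflowed    */

    /* IMR: external z read + write per rasterised fragment (no early-Z).  */
    double imr_z  = pixels * depth_complexity * 2.0 * bpp;
    /* TBDR flush: z written out and read back once, only for flushed tiles. */
    double tbdr_z = pixels * flushed_fraction * 2.0 * bpp;

    printf("IMR  external z traffic: %.1f MB/frame\n", imr_z  / 1e6);
    printf("TBDR external z traffic: %.1f MB/frame\n", tbdr_z / 1e6);
    return 0;
}
[/code]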

K-
 
Chalnoth said:
Simon F said:
Which I'm saying are not that significant. The memory usage does go up by a jump (i.e. the allocation of a Z buffer) but the bandwidth usage does not.
That's just impossible. Instead of not outputting a z-buffer, you're outputting one. I mean, come on. That requires bandwidth.
If the scene buffer overflows, you can probably either process it completely, or process a certain number of tiles (or you have virtual memory and just spill over to system RAM). In case of complete processing (worst case), you need to store the whole multisample frame-/Z-buffer in video memory, and read it back to on-chip memory in the second pass. So if you wanted to render 1600x1200@60fps with 4xMSAA, you'd need 1600 * 1200 * 4 (32-bit) * 4 (samples) * 2 (z+color) * 2 (write+read) * 60 (fps) = ~6.9 GiB/s.

With compression you most likely need less than half of that (let's assume 3 GiB/s), and all reads and writes are completely linear and predictable. So how massive a hit is 3 GiB/s of additional bandwidth requirement? The resolution/performance target is probably only suitable for a high-end model currently, and even from a TBDR I expect no less than 12 GiB/s of real, usable bandwidth at the high end. So if you've been bandwidth-limited before, performance will probably dip by 20%. But an intelligent driver will let this happen only a few times and then appropriately enlarge the scene buffer. Or it will even know the applications with high geometry requirements beforehand.
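The same sum written out, with the figures from above (the compression ratio is just the assumption already made; a straight 2:1 lands near the ~3 GiB/s used here):
[code]
#include <stdio.h>

int main(void)
{
    /* Figures from above: 1600x1200 @ 60 fps, 4x MSAA, 32-bit z + colour,
       everything written out once and read back once in the worst case.  */
    double bytes_per_sec = 1600.0 * 1200.0   /* pixels                 */
                         * 4.0               /* bytes per 32-bit value */
                         * 4.0               /* MSAA samples           */
                         * 2.0               /* z + colour             */
                         * 2.0               /* write + read           */
                         * 60.0;             /* frames per second      */
    double gib = 1024.0 * 1024.0 * 1024.0;

    printf("uncompressed worst case: %.1f GiB/s\n", bytes_per_sec / gib);
    printf("with ~2:1 compression  : %.1f GiB/s\n", bytes_per_sec / (2.0 * gib));
    return 0;
}
[/code]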
 
And if the memory isn't allocated for the full-size frame and z buffers beforehand?

And, of course, with a future architecture, 4-sample MSAA should be considered small. I would expect closer to 8-sample MSAA.

Furthermore, if the architecture isn't optimized for the case where you have overflows, can you count on there being framebuffer compression?
 
Chalnoth said:
if the architecture isn't optimized for the case where you have overflows, can you count on there being framebuffer compression?

Yeah, well, *IF* IMRs aren't optimized for that kind of early z-rejection they're pretty slow too... but I'm pretty sure most people designing IMRs are familiar with the need for it ;)
 