HSR vs Tile-Based rendering?

Chalnoth said:
Additionally, an initial z-pass is very cheap

Cheap comparatively speaking, I guess you mean. It still requires all the Z-buffer traffic that would have taken place during regular rendering, PLUS maybe several read-backs of Z when the actual textured rendering takes place, PLUS doing all the geometry transforms twice. So it's really mostly shifting the load onto other parts of the GPU that are hopefully not as strained as the memory subsystem.

I also fail to see how a buffer overflow - which hasn't been demonstrated as being an actual problem with a TBDR - would "slaughter" performance, as all the actual extra memory traffic would be one block write of the internal Z-buffer to memory, which could be done as very efficient linear bursts, and then a read-back in the next pass in the same efficient manner. This is nothing compared to what an IM renderer requires in the way of Z-transactions, and it's like half an order of magnitude more efficient... ;)

and is even required for stencil shadowing

Why would it be required? Afaik, DOOM3 renders shadows AFTER the scene's been textured already, so the Z-buffer would be filled at that point anyway even if the initial Z-pass had not taken place. It seems to me the initial Z-pass is there merely to cut down on the shading necessary to render the light-sourced textured pixels, not to facilitate rendering of stencil shadows.

And yes, the ratio of pixel fillrate to memory bandwidth is what determines whether the limiting factor is memory bandwidth or pixel fillrate. Rendering hidden pixels doesn't affect this ratio.

Isn't that being a bit selective, you think? Naturally the amount of overdraw varies with the composition of the scene in question, but shouldn't Z traffic be factored into the formula as well? Just forgetting about it doesn't seem entirely honest...
 
Kristof said:
An early Z check that fails the early part, as with older architectures, produces the whole pixel and only does the final full-detail Z check to decide on the update... this is "considerably" more than a single clock.
It doesn't matter how long it takes. It only matters if that latency can or cannot be hidden. I see no reason why it can't.
 
Guden Oden said:
and is even required for stencil shadowing

Why would it be required? Afaik, DOOM3 renders shadows AFTER the scene's been textured already, so the Z-buffer would be filled at that point anyway even if the initial Z-pass had not taken place. It seems to me the initial Z-pass is there merely to cut down on the shading necessary to render the light-sourced textured pixels, not to facilitate rendering of stencil shadows.
No, the scene is not textured first, and it's more correct to say that the engine renders light, not shadows. Unlike most other games, Doom3 works by adding light to places not in shadow (which is a more correct way of doing it), instead of subtracting light from places in shadow:

1. Start with a black screen; fill in the z-buffer.
2. Draw the shadow volume to the stencil buffer.
3. Add light to everything not in shadow.

Repeat 2 and 3 until done.

Step 2 cannot be done without step 1 first. In step 3 the actual base texture lookup is done; it's modulated with the light and added to the framebuffer.
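
To make the pass structure concrete, here's a minimal sketch in classic OpenGL (illustrative only: drawScene, drawShadowVolume and drawLitSurfaces are hypothetical helpers, and details like two-sided stencil and depth-fail volumes are left out):

```cpp
#include <GL/gl.h>
#include <vector>

struct Light { /* position, radius, colour, ... */ };

// Hypothetical helpers standing in for the engine's actual draw routines.
void drawScene();                        // geometry only, no texturing
void drawShadowVolume(const Light&);     // extruded shadow volume geometry
void drawLitSurfaces(const Light&);      // base texture modulated by this light

void renderFrame(const std::vector<Light>& lights)
{
    // Step 1: black screen, depth-only pre-pass to fill the z-buffer.
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    drawScene();

    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);               // depth is read-only from here on
    glDepthFunc(GL_EQUAL);               // later passes only touch visible pixels

    for (const Light& light : lights) {
        // Step 2: mark pixels shadowed from this light in the stencil buffer.
        glClear(GL_STENCIL_BUFFER_BIT);
        glEnable(GL_STENCIL_TEST);
        glStencilFunc(GL_ALWAYS, 0, ~0u);
        glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);  // simplified one-sided volumes
        drawShadowVolume(light);

        // Step 3: additively blend the light where stencil == 0 (not in shadow).
        glStencilFunc(GL_EQUAL, 0, ~0u);
        glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
        glEnable(GL_BLEND);
        glBlendFunc(GL_ONE, GL_ONE);             // texture * light added to framebuffer
        drawLitSurfaces(light);
        glDisable(GL_BLEND);
        glDisable(GL_STENCIL_TEST);
    }
}
```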
 
Guden Oden said:
Why would it be required? Afaik, DOOM3 renders shadows AFTER the scene's been textured already, so the Z-buffer would be filled at that point anyway even if the initial Z-pass had not taken place. It seems to me the initial Z-pass is there merely to cut down on the shading necessary to render the light-sourced textured pixels, not to facilitate rendering of stencil shadows.
Well, it's required if one wants shadows that occlude instead of being "painted on." No, the DOOM3 technique first lays down a z-buffer, then does the shadows, then the actual rendering (well, final rendering may be interleaved between shadow passes, but that's an optimization issue).
 
Thowllly said:
No, the scene is not textured first, and it's more correct to say that the engine renders light, not shadows. Unlike most other games, Doom3 works by adding light to places not in shadow (which is a more correct way of doing it), instead of subtracting light from places in shadow,

1. Start with a black screen; fill in the z-buffer.

Usually you put down the ambient light in the first pass which is textured. Pure dark shadows are pretty boring and unrealistic... unless you have a situation like in Doom :)

K-
 
Is the limited-size scene buffer for TBRs that much of a limitation? For good performance on current GPUs the scene must be stored in video memory anyway. So with performance in mind, it looks like both architectures need the same amount of memory for the scene data.

OT: Are the cache lines for texture caches 2D blocks? I've been wondering about this; it looks like a texture cache has to store 2D chunks of the texture (3D for 3D textures) for the cache to improve performance. Also, can anyone give an estimate of the texture cache size?
 
Kristof said:
Usually you put down the ambient light in the first pass which is textured. Pure dark shadows are pretty boring and unrealistic... unless you have a situation like in Doom :)

K-
Unless you do all lighting in a final pass, using MRTs to store shadowing information and whatnot. And besides, ambient lighting is the easiest to lay down.
 
krychek said:
Is the limited-size scene buffer for TBRs that much of a limitation? For good performance on current GPUs the scene must be stored in video memory anyway. So with performance in mind, it looks like both architectures need the same amount of memory for the scene data.
No. While both IMRs and TBDRs require pre-transform vertex data to be kept onboard for optimal T&L performance, TBDRs need to keep post-transform data for at least 1 full frame as well. Also, these data do not map 1:1 to each other - if you transform a vertex array N times in the same frame, the TBDR will need to store N instances of post-transform data where an IMR doesn't need to store any data at all.
OT: Are the cache lines for texture caches 2D blocks? I've been wondering about this; it looks like a texture cache has to store 2D chunks of the texture (3D for 3D textures) for the cache to improve performance. Also, can anyone give an estimate of the texture cache size?
Swizzled memory layouts for textures, so that a chunk of linear memory maps to a square 2D/3D patch of the texture, are a rather familiar and old trick in the 3D hardware design world. The actual sizes of texture caches are usually well-kept secrets - the numbers I have heard tend to range from 2KB to 16KB. Given their intended use (often just to avoid re-fetching texels from one scanline to the next) and their access patterns (very little texel reuse other than between adjacent pixels), they tend to exhibit dramatically diminishing returns from increased sizes.
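
To illustrate what "swizzled" means here, a minimal sketch of a Morton-order (Z-order) address computation, which is one common way of making a run of linear memory cover a square patch of texels (purely illustrative; real hardware layouts differ per vendor):

```cpp
#include <cstdint>

// Interleave the bits of x and y (Morton / Z-order) so that texels which are
// close together in 2D end up close together in linear memory. A single cache
// line then holds a small square block of texels instead of a thin sliver of
// one scanline, which is what bilinear/trilinear filtering wants.
uint32_t mortonTexelIndex(uint32_t x, uint32_t y)
{
    uint32_t index = 0;
    for (uint32_t bit = 0; bit < 16; ++bit) {
        index |= ((x >> bit) & 1u) << (2 * bit);      // even bits come from x
        index |= ((y >> bit) & 1u) << (2 * bit + 1);  // odd bits come from y
    }
    return index;  // multiply by texel size to get a byte offset
}
```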
 
arjan de lumens said:
krychek said:
Is the limited-size scene buffer for TBRs that much of a limitation? For good performance on current GPUs the scene must be stored in video memory anyway. So with performance in mind, it looks like both architectures need the same amount of memory for the scene data.
No. While both IMRs and TBDRs require pre-transform vertex data to be kept onboard for optimal T&L performance, TBDRs need to keep post-transform data for at least 1 full frame as well. Also, these data do not map 1:1 to each other - if you transform a vertex array N times in the same frame, the TBDR will need to store N instances of post-transform data where an IMR doesn't need to store any data at all.

Ah, Silly me :oops:. That makes perfect sense.
 
In practice the scene-capture side of TBR is irrelevant; ALL 3D cards capture the complete scene.

All high-performance rendering consists of draw commands placing data in a command buffer. When finished, the command buffer is flushed and the scene is rendered. If you ever overfill the command buffer, an expensive operation must occur (either more memory must be allocated, or the command buffer is processed and the draw command stalls until the GPU has finished using it).

At least one non-TBR uses this fact to accelerate the z-prepass style of rendering even more. The z-prepass can gather 'other' information that speeds things up further when the command buffer is resubmitted.
 
arjan de lumens said:
No. While both IMRs and TBDRs require pre-transform vertex data to be kept onboard for optimal T&L performance, TBDRs need to keep post-transform data for at least 1 full frame as well. Also, these data do not map 1:1 to each other - if you transform a vertex array N times in the same frame, the TBDR will need to store N instances of post-transform data where an IMR doesn't need to store any data at all.
Maybe some 'tilers' use post-transform buffers in the same way that some IMRs use post-transform vertex caches. But it's certainly not required, and I know one 'tiler' that doesn't do that. It simply transforms the vertices multiple times.

It's a case of how big a tile is: if the tile size means re-transforming 20-30 times, then a post-transform buffer will save a lot of time, BUT if the tile size means you only need to re-transform 3-4 times, you can just have more vertex shader ALUs to compensate (or even use a combination of both).
 
Thowllly said:
it's more correct to say that the engine renders light, not shadows.

Um, the stencil volumes are definitely shadows... :) I know the game renders textured pixels as light rather than darkness though, but that wasn't what I meant.

Anyway, thanks for the clarification. :)
 
Hmm. Does that command buffer contain actual vertex data or just pointers to vertex arrays? And why would it not be possible for the (IMR or TBDR) GPU to start fetching commands from the command buffer before the app is finished writing to the buffer?

As for re-transforming vertex data (as opposed to storing post-transform data), how well will that work with vertex shaders? If the polygons in a tile in sum use N different shader programs on average, you get something like N vertex shader program loads per tile, which doesn't sound very nice.

And you still get the issue that you need to build, for each tile, a list of polygons touching it, which takes potentially unbounded memory.
 
DeanoC said:
In practice the scene-capture side of TBR is irrelevant; ALL 3D cards capture the complete scene.

All high-performance rendering consists of draw commands placing data in a command buffer. When finished, the command buffer is flushed and the scene is rendered. If you ever overfill the command buffer, an expensive operation must occur (either more memory must be allocated, or the command buffer is processed and the draw command stalls until the GPU has finished using it).
No, IMRs render as commands are being sent. Deferred renderers are the only ones that wait, hence the term "deferred rendering."

At least one non-TBR uses this fact to accelerate the z-prepass style of rendering even more. The z-prepass can gather 'other' information that speeds things up further when the command buffer is resubmitted.
That's a special case, and I doubt it leads to a speedup in most scenarios. The case you're talking about, I believe, is the one where a 3D company (I think it was S3?) makes a copy of all data being sent to the graphics card, in order to do a z-only pass first, then sends the data a second time now that the z-buffer is filled. That's a completely different scenario as the copying is done in system memory, where overflows are vastly cheaper.
 
arjan de lumens said:
Hmm. Does that command buffer contain actual vertex data or just pointers to vertex arrays?
Pointers, usually (but not for some dynamic data), though some architectures have to copy indices.

arjan de lumens said:
And why would it not be possible for the (IMR or TBDR) GPU to start fetching commands from the command buffer before the app is finished writing to the buffer?
You can do either, but staying in sync with the GPU is wasteful (something is always waiting: either the CPU so it can write new data, or the GPU waiting to read new data). It's better to build the next frame completely while the GPU consumes this frame.

arjan de lumens said:
As for re-transforming vertex data (as opposed to storing post-transform data), how well will that work with vertex shaders? If the polygons in a tile in sum use N different shader programs on average, you get something like N vertex shader program loads per tile, which doesn't sound very nice.
State changes (of which program loads are just one type) are always a problem; most hardware has ways of reducing the actual cost. Changing shaders doesn't require a memory reload in most cases (with a bit of luck it's already in on-chip memory).

arjan de lumens said:
And you still get the issue that you need to build, for each tile, a list of polygons touching it, which takes potentially unbounded memory.

No you don't; one way would be for the first tile pass to calculate which tile each vertex is in (1 byte per vertex would be enough), and other tile passes would fast-reject based on this. This also eliminates much of the vertex shader cost: it's then just 2 transforms per vertex (once for the first tile and once for the tile it's actually in). Of course there is the cost of the 1-byte-per-vertex write and the extra 1-byte read per tile.

It's a variation on the z-prepass scheme.
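
A rough sketch of that tagging idea, assuming a hypothetical tile grid and screen size (illustrative only; a triangle can still touch tiles that contain none of its vertices, so those cases would need to be flagged "always process"):

```cpp
#include <cstdint>
#include <vector>

struct ScreenPos { float x, y; };          // post-transform vertex position

const int kTilesX = 8, kTilesY = 8;        // hypothetical tile grid
const int kScreenW = 1024, kScreenH = 1024;

// One byte per vertex: which tile the vertex landed in.
uint8_t tileOf(const ScreenPos& p)
{
    int tx = static_cast<int>(p.x) * kTilesX / kScreenW;
    int ty = static_cast<int>(p.y) * kTilesY / kScreenH;
    if (tx >= kTilesX) tx = kTilesX - 1;   // clamp vertices on the right/bottom edge
    if (ty >= kTilesY) ty = kTilesY - 1;
    return static_cast<uint8_t>(ty * kTilesX + tx);
}

// First tile pass: transform every vertex once and record its tag.
void tagVertices(const std::vector<ScreenPos>& pos, std::vector<uint8_t>& tag)
{
    tag.resize(pos.size());
    for (size_t i = 0; i < pos.size(); ++i)
        tag[i] = tileOf(pos[i]);
}

// Later tile passes: cheap reject before re-running the vertex shader.
bool mayTouchTile(uint8_t t0, uint8_t t1, uint8_t t2, uint8_t tile)
{
    // Caveat: large triangles can cover tiles containing none of their
    // vertices, so a real scheme needs an "always process" escape hatch.
    return t0 == tile || t1 == tile || t2 == tile;
}
```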
 
Chalnoth said:
No, IMRs render as commands are being sent. Deferred renderers are the only ones that wait, hence the term "deferred rendering."
No they don't.
Have you actually programmed a modern GPU at the hardware level? Most don't even have a direct mode.

GPUs use a ring/command/DMA buffer (NVIDIA/ATI/Sony terminology respectively). A draw command consists of a packet of information that is written into this buffer; hopefully the GPU is reading from a part of the buffer far away (else you get stalls).

The only communication you will ever make is through this buffer, and if the GPU is ever waiting for you to write data, you're in trouble from a performance point of view.

The only real difference in modern designs is whether you use a double-buffered DMA buffer (the CPU writes one while the GPU reads the other) or a cyclic buffer where the GPU and CPU are chasing each other forever (you're still writing up to 1 frame ahead; it's just that you're reusing the RAM once the GPU has finished).
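
As a toy illustration of the cyclic variant, here's a sketch of the producer/consumer bookkeeping (names and sizes are made up; in reality the driver is the writer and the GPU's command processor is the reader, not two pieces of CPU code):

```cpp
#include <cstdint>
#include <cstddef>

// Toy model of a cyclic command buffer: the CPU (driver) appends draw packets
// at 'write', the GPU command processor consumes them at 'read'. If the writer
// catches up with the reader, it has to stall or flush - the expensive case.
struct CommandRing {
    static constexpr size_t kSize = 1 << 20; // e.g. 1 MB of command memory
    uint8_t  data[kSize];
    volatile size_t read = 0;                // advanced by the consumer (GPU)
    size_t   write = 0;                      // advanced by the producer (CPU)

    bool submit(const uint8_t* packet, size_t bytes)
    {
        size_t freeSpace = (read + kSize - write - 1) % kSize;
        if (bytes > freeSpace)
            return false;                    // buffer full: caller must wait for the GPU
        for (size_t i = 0; i < bytes; ++i)   // copy the packet, wrapping at the end
            data[(write + i) % kSize] = packet[i];
        write = (write + bytes) % kSize;
        return true;
    }
};
```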

Chalnoth said:
At least one non-TBR uses this fact to accelerate the z-prepass style of rendering even more. The z-prepass can gather 'other' information that speeds things up further when the command buffer is resubmitted.
That's a special case, and I doubt it leads to a speedup in most scenarios. The case you're talking about, I believe, is the one where a 3D company (I think it was S3?) makes a copy of all data being sent to the graphics card, in order to do a z-only pass first, then sends the data a second time now that the z-buffer is filled. That's a completely different scenario as the copying is done in system memory, where overflows are vastly cheaper.

As I said, all PC cards write into the command buffer. I know nothing about S3's technique (i.e. I wasn't talking about S3), but they probably just don't use a ring buffer (so they don't overwrite themselves) and just resubmit.
 
That's still a driver-level optimization technique. It has nothing to do with the scene buffer, and has completely different performance characteristics due to the fact that it's stored in system memory, and has no overflow problems (overflow in this case will typically just mean that something's processing too quickly, so a stall there won't affect much of anything).

Anyway, I'd be highly surprised if this buffer really was larger than a few dozen draw calls in size.
 
DeanoC said:
arjan de lumens said:
And why would it not be possible for the (IMR or TBDR) GPU to start fetching commands from the command buffer before the app is finished writing to the buffer?
You can do either, but staying in sync with the GPU is wasteful (something is always waiting: either the CPU so it can write new data, or the GPU waiting to read new data). It's better to build the next frame completely while the GPU consumes this frame.
The CPU will stall if the buffer is full, the GPU if the buffer is empty - if the buffer is, say, 50% full, there is no reason for either to wait. Is it usually the case that the buffer is nearly always full/empty?

And how big is it - I seem to remember having seen a number around 1-2MBytes?

arjan de lumens said:
And you still get the issue that you need to build, for each tile, a list of polygons touching it, which takes potentially unbounded memory.

No you don't; one way would be for the first tile pass to calculate which tile each vertex is in (1 byte per vertex would be enough), and other tile passes would fast-reject based on this. This also eliminates much of the vertex shader cost: it's then just 2 transforms per vertex (once for the first tile and once for the tile it's actually in). Of course there is the cost of the 1-byte-per-vertex write and the extra 1-byte read per tile.

Its a variation on the z-prepass scheme.
I don't see how this would work - a polygon can cover tiles other than the ones that its vertices are in? (think polygons with length/height > 2 tiles)
 
arjan de lumens said:
And how big is it - I seem to remember having seen a number around 1-2MBytes?
It's enough to hold a frame or two; that of course varies. I've seen Xbox and PS2 ones well past a couple of MBytes.

I don't see how this would work - a polygon can cover tiles other than the ones that its vertices are in? (think polygons with length/height > 2 tiles)

Good point, forgot about them. It's still not that bad; you could just mark them to be always processed, and with small polygons it shouldn't happen too often. Most architectures that use any kind of tile (including PS2 VRAM pages) don't handle large polygons well.
 
arjan de lumens said:
The CPU will stall if the buffer is full, the GPU if the buffer is empty - if the buffer is, say, 50% full, there is no reason for either to wait. Is it usually the common case that the buffer is nearly-always-full/empty?
Nope, you're right; you try to balance it so there is always lots of room between GPU and CPU. But that was my point: that's a LOT of data, and there's a huge deferment between submitting a render packet and it actually happening (I'm generalising of course; sometimes it's fairly short. A classic example used to be rendering the sky early so the GPU would be busy while you got a chance to get ahead again, back when cards didn't have a backbuffer-flip DMA packet).
 