Kristof's comments on render to texture

OpenGL guy said:
fresh said:
Have you ever tried this? It's very, very fast to generate mipmaps in hardware. A complete mipmap chain is 1.333 times as big as the top mip level, so a 512x512 texture only takes an additional ~87k pixels for the mipmaps. With 2 Gpix/s of fillrate, it's no big deal.
I don't think Kristof was concerned about the fillrate. You'd have to send down the same geometry multiple times in order to generate the miplevels, and that could affect performance.

Why would you need to send down the geometry multiple times? You can just down-sample the top mip level a few times. Hell, if we can do this on the PS2 I'm sure the R300 can do it :D. Humus already pointed out the relevant OGL extension, and DX9 supports mipmap generation via downsampling as well. I'm sure you know all this; I'm just trying to figure out why Kristof thinks it's such a performance hit.
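For what it's worth, here's a minimal sketch of what driver-side generation looks like in OpenGL, assuming the extension in question is SGIS_generate_mipmap (the texture handle and sizes are illustrative):

```cpp
// Hedged sketch, assuming SGIS_generate_mipmap is the extension meant.
// Once enabled, the driver rebuilds the lower mip levels whenever the
// base level changes; no geometry is resubmitted.
glBindTexture(GL_TEXTURE_2D, renderTexture);   // illustrative handle
glTexParameteri(GL_TEXTURE_2D, GL_GENERATE_MIPMAP_SGIS, GL_TRUE);

// After rendering the scene, copy the framebuffer into the base level;
// the rest of the mip chain is regenerated as a side effect.
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, 512, 512);
```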
 
Well, remember this RTT stuff originally came up because of 3DMark, a DirectX 8 program. DX8 doesn't have any method to automatically generate the lower mip levels; you pretty much have to render each level individually.

There are 'other' methods that could be used, e.g.
1) using 2 textures and manually downsampling one into the other (sketched below),
or
2) if you want to do something that might seriously break on different hardware, setting the render target to be a lower mip level of the current texture source. This would be a really 'bad' idea :)
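As a rough illustration of option 1, here's a hedged DX8-style sketch: two single-level render-target textures, with a bilinear-filtered quad drawn from the larger into the smaller. The device, textures, and vertex setup are illustrative assumptions, not code from this thread:

```cpp
// Hedged DX8 sketch of option 1: downsample srcTex into dstTex by
// drawing one bilinear-filtered quad. All names here are illustrative.
struct QuadVertex { float x, y, z, rhw, u, v; };
const DWORD kQuadFVF = D3DFVF_XYZRHW | D3DFVF_TEX1;

void Downsample(IDirect3DDevice8* dev,
                IDirect3DTexture8* srcTex,  // e.g. 512x512, one level
                IDirect3DTexture8* dstTex,  // e.g. 256x256, one level
                float dstW, float dstH)
{
    IDirect3DSurface8* dstSurf = 0;
    dstTex->GetSurfaceLevel(0, &dstSurf);
    dev->SetRenderTarget(dstSurf, NULL);    // no depth buffer needed

    dev->SetRenderState(D3DRS_ZENABLE, FALSE);
    dev->SetRenderState(D3DRS_CULLMODE, D3DCULL_NONE);
    dev->SetTexture(0, srcTex);
    dev->SetTextureStageState(0, D3DTSS_MINFILTER, D3DTEXF_LINEAR);
    dev->SetTextureStageState(0, D3DTSS_MAGFILTER, D3DTEXF_LINEAR);
    dev->SetVertexShader(kQuadFVF);

    // Full-target quad; the -0.5 offset lines pixel centres up with texel
    // centres so each bilinear fetch averages a 2x2 block of the source.
    QuadVertex quad[4] = {
        { -0.5f,       -0.5f,       0.0f, 1.0f, 0.0f, 0.0f },
        { dstW - 0.5f, -0.5f,       0.0f, 1.0f, 1.0f, 0.0f },
        { -0.5f,       dstH - 0.5f, 0.0f, 1.0f, 0.0f, 1.0f },
        { dstW - 0.5f, dstH - 0.5f, 0.0f, 1.0f, 1.0f, 1.0f },
    };
    dev->DrawPrimitiveUP(D3DPT_TRIANGLESTRIP, 2, quad, sizeof(QuadVertex));

    dstSurf->Release();
}
```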

-Colourless
 
Lots of little ones make one big... you need to read in the top level, write out a lower level, re-read that level, write out an even lower level, re-read that level, etc... for each render texture in the scene and for each frame... it all adds up to something. Not to mention that render target changes come at a cost as well.

Point mainly is that it's not "free" and will have some performance impact, unless you aren't bottlenecked yet.

But at least we all learned something more about render to texture and MipMap generation, which is a good thing (tm) :)

K-
 
Kristof said:
Lots of little ones make one big... you need to read in the top level, write out a lower level, re-read that level, write out an even lower level, re-read that level, etc... for each render texture in the scene and for each frame... it all adds up to something. Not to mention that render target changes come at a cost as well.

Point mainly is that it's not "free" and will have some performance impact, unless you aren't bottlenecked yet.

But at least we all learned something more about render to texture and MipMap generation, which is a good thing (tm) :)

K-

Not every texture needs this. Only textures which are rendered to (like dynamic cube maps, shadow maps, etc). Generating mipmaps is just a texturing operation, which modern cards are freakin' fast at, especially when you access the textures in a linear fashion. Nope, it's not free, but it's very very very fast and I'd be shocked if runtime mipmap generation showed up on any kind of profiler. The actual rendering of the top level is what would take the most time. One Of These Days When I Have Time (tm) I'll whip up a small app to test this :)

Also want to point out that some cards can draw to a swizzled render target.
 
The actual rendering of the top level is what would take the most time. One Of These Days When I Have Time (tm) I'll whip up a small app to test this

We did this in our app as well, and while it's certainly measurable (time-wise), it's not a major performance penalty compared to rendering the original offscreen image.
We measured the first mipmap generation at around 70-80% of the time to generate the entire mip chain, and in our case generating the mip chain took 10-15% of the time that generating the original offscreen image did.

There are a number of issues with render target format selection on PC cards, most notably whether the target memory is tiled, swizzled, or simply linear. Which works out best usually depends on the number of tris you render into the offscreen image vs. the number of tris you use the rendered texture on.
 
This is really funny because I've been having problems doing this exact thing on my Radeon 8500 (in Direct3D) for about a month now, and tried to resolve it on the Rage3D 3D Coding forum. It seems I can't render to anything but the top level of a mip-map chain, and someone else there confirmed the problem. I'm just drawing simple, non-textured polygons, too, and they won't draw to anything but the top level.

Anyway, generating smaller mip-maps is just a single texture operation on a quad. You can set the LOD bias to -1, strategically align the source texture, and use the normal bilinear texture filter to do the downsampling.
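A hedged DX8-style sketch of that state setup (the device and texture names are assumptions, and as noted above, rendering into a lower level of the same texture may not work on all hardware):

```cpp
// Hedged sketch of the LOD-bias trick: while filling mip level N of the
// texture, bias sampling by -1 so bilinear fetches come from the larger
// level N-1. 'dev' and 'tex' are assumed names, as elsewhere.
float lodBias = -1.0f;
dev->SetTexture(0, tex);
dev->SetTextureStageState(0, D3DTSS_MIPMAPLODBIAS, *(DWORD*)&lodBias);
dev->SetTextureStageState(0, D3DTSS_MINFILTER, D3DTEXF_LINEAR);
dev->SetTextureStageState(0, D3DTSS_MIPFILTER, D3DTEXF_POINT);
// Then set the render target to level N's surface (GetSurfaceLevel(N, ...))
// and draw one aligned quad, repeating down the chain.
```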

As for the performance hit, it will be memory bandwidth limited (four 32-bit texture reads and one 32-bit colour write per pixel), but you have no overdraw at all. 99% of the time it will still draw much faster than if you instead drew the original scene. Renderstate changes are very minimal - all you do is change the render target, and LOD will take care of the texture source.

Since the subsequent mip-maps are smaller in resolution (only 33% more pixels total, as fresh pointed out), and drawing them is faster (per output pixel) than the original scene, the performance overhead is probably going to be around 10% instead of 33%, per render-to-texture.

Given that all the render-to-textures only consume a part of the overall frame time, say 30%, we're talking about a 3% performance hit overall. And this neglects the savings in texture thrashing.
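The same estimate as plain arithmetic, with the assumed inputs marked:

```cpp
// Back-of-envelope version of the estimate above. The 0.3 cost ratio and
// the 30% RTT share of frame time are assumptions, not measurements.
const double extraPixels  = 1.0 / 3.0; // mip chain adds ~33% more pixels
const double pixelCost    = 0.3;       // downsample pixel vs. scene pixel cost
const double perRTT       = extraPixels * pixelCost; // ~0.10 -> ~10% per RTT
const double rttShare     = 0.30;      // assumed RTT portion of the frame
const double frameOverall = perRTT * rttShare;       // ~0.03 -> ~3% overall
```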

Very minimal performance hit indeed. :D

And Kristof, OpenGL guy is right about the tiling. You render to texture in a way so that it winds up exactly the same as a normal texture, and there's no performance hit associated with that. It's been done like that for some time now (several generations), AFAIK. The only performance limitation is lack of compression, and that's it.
 
And Kristof, OpenGL guy is right about the tiling. You render to texture in a way so that it winds up exactly the same as a normal texture, and there's no performance hit associated with that. It's been done like that for some time now (several generations), AFAIK. The only performance limitation is lack of compression, and that's it

This isn't exactly true, at least on NVidia cards.
From fastest to slowest, render targets rank as follows:

Linear in Tiled memory
Linear
Swizzled
Swizzled+Tiled

As a non-compressed source texture, fastest to slowest, they rank:
Swizzled
Linear
Linear in Tiled memory
Swizzled+Tiled

I'm not sure if you can specify all or any of these render target formats in D3D on the PC, but you can on the Xbox version.
The first item in each of the above lists is significantly faster than the other options. So if you render a lot more pixels to the map than from it, you're better off using a linear texture in tiled memory; in the reverse case you'd be better off with a swizzled render target.
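As a purely hypothetical illustration of that rule of thumb (the enum and function are made up, not any real API):

```cpp
// Illustrative sketch of ERP's heuristic; rankings are per the two lists
// above. Write-heavy targets favour linear-in-tiled, read-heavy favour
// swizzled.
enum RenderTargetLayout { kLinearTiled, kLinear, kSwizzled, kSwizzledTiled };

RenderTargetLayout ChooseLayout(unsigned pixelsWrittenToMap,
                                unsigned pixelsReadFromMap)
{
    if (pixelsWrittenToMap > pixelsReadFromMap)
        return kLinearTiled;  // fastest to render into (first list)
    return kSwizzled;         // fastest to sample from (second list)
}
```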
 
ERP said:
I'm not sure if you can specify all or any of these render target formats in D3D on the PC, but you can on the Xbox version.
No, the driver determines the memory layout that gives best performance. If the application does a lock on the surface (which should be rare because it's bad for performance) you can give them an aperture that looks like linear memory so the app (or API) doesn't need to know anything about your tiling.

Anyway, your list of fast-to-slow uncompressed texture formats doesn't make sense. Swizzled + Tiled should be the fastest texture format. Memory tiling was initially done for textures so that bilinear filtering would be faster, i.e. the texels are closer together in memory.
 
Anyway, your list of fast-to-slow uncompressed texture formats doesn't make sense. Swizzled + Tiled should be the fastest texture format. Memory tiling was initially done for textures so that bilinear filtering would be faster, i.e. the texels are closer together in memory

While this is intuitively true, it's certainly not true if you benchmark it.
My understanding is that the tiling and the swizzling work against each other, providing worse performance.
 
ERP said:
While this is intuitively true, it's certainly not true if you benchmark it.
My understanding is that the tiling and the swizzling work against each other, providing worse performance.
Maybe what we are calling swizzled + tiled has different meanings.
 
ERP said:
While this is intuitively true, it's certainly not true if you benchmark it.
My understanding is that the tiling and the swizzling work against each other, providing worse performance.

I suppose it depends on how your benchmark works, i.e. on what you are actually benching. Perhaps there are limitations in their renderer. A swizzle is just a reordering of data channels. Perhaps the blocks are getting too small, alignment requirements come into play, and you end up generating a few texels in software. Perhaps there is a padding and/or replication requirement. Perhaps the swizzling isn't a true swizzle and requires a higher bit depth internal storage format, or an additional HW operation to perform the swizzle.

I gather it depends on usage. Lots of render-to-textures may mean that the slow path in the driver becomes the bottleneck, outstripping the benefits of tiled data.

Swizzling data components and tiling are independent of each other.
 
I understand that tiling and swizzling are independent, i.e. tiling is a function of the memory subsystem and swizzling basically just does a bitwise remapping to provide better coherency.

Tiling already-swizzled memory isn't necessarily going to increase texel locality, though. Perhaps part of the problem is the definition of tiling; I'd be surprised if ATI and NVidia picked the exact same implementation, since a lot of the performance characteristics are most likely dependent on the characteristics of the memory controller. I would imagine that the texture cache behaviour probably has some impact on how they work together as well.

One of the nice things about the Xbox devkit is that the driver is rarely the limiting factor. Stepping into the API calls, it's pretty apparent that the driver does little more than copy the appropriate commands into the pushbuffer.

My tests were inside an app, so the results are going to be biased one way or another based on the app's usage, but I benchmarked a number of different configurations. I tested rendering to and using as a source texture separately, and also together, to get a picture of what was happening. Interestingly, once I started testing rendering and use together, most of the available options benchmarked incredibly close to each other; that even included rendering to tiled linear memory, doing a copy to swizzled non-tiled memory, and using that as the source. The final selection came down to picking the easiest to implement, in our case a linear tiled render target.
 
Gunhead said:
"Twiddled"? "Swizzled"? These technical terms kill me :p
"Twiddled" (as in twiddling your thumbs, i.e. moving them around) is the term that was coined at IMG/PowerVR for the arrangement of texels (actually refering to the movement of the 'addressing' bits) used since pre PCX1.

It effectively implements Morton Order storage (but we hadn't heard of that when we first started to use the technique). The idea was inspired by Jim Blinn's report on the NASA "planets flyby" that he rendered, in which he tiled the textures to reduce virtual memory thrashing. Twiddling took this to the next level by recursively tiling the data, primarily to reduce SDRAM page breaks but also for faster texture filtering.

Simon
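For reference, a minimal sketch of Morton-order ("twiddled") addressing for a square power-of-two texture (the function itself is illustrative):

```cpp
// Interleave the bits of u and v so that texels that are close in 2D stay
// close in the linear address: fewer page breaks and faster bilinear
// filtering, as described above.
unsigned MortonIndex(unsigned u, unsigned v)
{
    unsigned index = 0;
    for (unsigned bit = 0; bit < 16; ++bit) {
        index |= ((u >> bit) & 1u) << (2 * bit);      // u bits -> even positions
        index |= ((v >> bit) & 1u) << (2 * bit + 1);  // v bits -> odd positions
    }
    return index;
}
```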
 
darkblu said:
now that was intriguing. thanks, Simon.
Linking this in with another thread, I remember explaining how we stored textures to RenderMorphics and they were initially, not surprisingly, rather confused by it. :)
 
OpenGL guy said:
ERP said:
While this is intuitively true, it's certainly not true if you benchmark it.
My understanding is that the tiling and the swizzling work against each other, providing worse performance.
Maybe what we are calling swizzled + tiled has different meanings.
NVidia texture swizzling is substantially different to ATI's. I can believe that swizzled + tiled is a negative in their hardware, although I'd be surprised if it was slower than straight-linear.
 
NVidia texture swizzling is substantially different to ATI's. I can believe that swizzled + tiled is a negative in their hardware, although I'd be surprised if it was slower than straight-linear

Interesting; FWIW I know nothing about ATI's hardware.
As I said, in the limited tests I did, straight linear was indeed faster than swizzled in tiled memory. Having said that, there was only about a 20-30% performance variance across all the formats.
 