Why isn't R2VB discussed and used more?

sebbbi

Veteran
Last week I was searching for a good alternative to vertex texturing for Radeon cards in our terrain engine's DX9 renderer. I already knew about the Radeon R2VB functionality, but I had never tried to use it before, so I thought it was just a dirty hack usable only in limited scenarios and on a limited set of cards. When I found out it was usable on all DX9 Radeons starting from the old R300 (9500+), and I figured out in my own testing that I was able to do many of the same algorithms planned for our DX10 renderer with similar efficiency on DX9 using R2VB, I really wondered why this feature has not been discussed, marketed and used more.

Many of the games sold currently, and especially over the next few years, are ported to DX9 from DX10 or Xbox 360. Using an algorithm suitable for R2VB is often the most efficient way to implement many of the new rendering techniques seen on these platforms. This makes R2VB more useful now than it ever was (and this is also the reason I only started using it now).

It's notable that vertex texturing capability is very limited on DX9-class cards. DX9 Geforces support only the R32F and ARGB32F texture formats, and DX9 Radeons do not support vertex texturing at all. DX9 cards also have filtering and performance issues with vertex texturing. Storing large structures (such as a height map) in 32-bit floating point formats consumes a huge amount of graphics memory. With R2VB the developer can use any texture format supported by the pixel shader (including compressed formats). In our terrain renderer we can also quickly calculate normal vectors for the heightmap on the fly, which allows us to store only the height data (I use fetch4 to get all the data needed for the normal vector calculation with just one tex instruction). My results from porting our terrain rendering system from vertex texturing to R2VB have been very positive. It runs faster and uses considerably less memory than the vertex texturing based alternative. I now also plan to port our DX10 particle system to DX9 using R2VB (all animation in shaders, particle mesh creation in shaders, etc).
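The normal-from-heightmap trick mentioned above boils down to central differences over neighboring height samples. A minimal sketch of the math (the `grid_spacing` and `height_scale` parameters are illustrative names; on the GPU the four samples would come from a single fetch4/gather instruction rather than separate fetches, and fetch4 returns a 2x2 footprint rather than the four axis neighbors used here for simplicity):

```python
import math

def terrain_normal(h_left, h_right, h_down, h_up, grid_spacing, height_scale):
    """Central-difference normal from four neighboring height samples."""
    # Height slopes along x and z, converted to world units.
    dx = (h_right - h_left) * height_scale / (2.0 * grid_spacing)
    dz = (h_up - h_down) * height_scale / (2.0 * grid_spacing)
    # Unnormalized normal of the height field y = h(x, z) is (-dh/dx, 1, -dh/dz).
    nx, ny, nz = -dx, 1.0, -dz
    inv_len = 1.0 / math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx * inv_len, ny * inv_len, nz * inv_len)

print(terrain_normal(0.0, 0.0, 0.0, 0.0, 1.0, 1.0))  # flat terrain: normal points straight up
```

Doing this in the pixel shader (as the R2VB path allows) means only the raw heights need to live in memory; the three extra normal channels never have to be stored.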

I know at least that Valve's Source engine uses R2VB, but are there any other major engines using it for their DX9 renderer?
 
Nope, R2VB is an ATI-only implementation and was enabled for D3D9 a while ago in their drivers (Cat 5.9).
 
There is no official DX9 R2VB support for NVidia cards. However, I found some notes about it on the web. It seems to have been supported in at least some beta drivers.

http://www.gamedev.net/community/forums/topic.asp?topic_id=422354

Quote:

"I've discovered the R2VB FourCC code in the DirectDraw caps of my GeForce 6600 AGP under Windows Vista RC2 with nVidia's beta driver for Vista RC1. So I ran a few of ATI's R2VB demos (from the ATI SDK) and they were working well! (in particular those which do not use other ATI-specific features like the ATI1N texture format).

Demos are:
R2VB animation
R2VB Water
R2VB IK
R2VB N-Patches
R2VB Shadow volumes
R2VB Terrain morph

Others don't start.
I've also tried to check R2VB under Windows XP, but even on ForceWare beta 96.89 the R2VB FourCC was not reported."


This post is dated 11/3/2006, so it could be that Windows XP is supported now as well. I'll have to check on our Geforce + XP test setups (we have 6000, 7000 and 8000 series test setups with XP) when I get to work, and I'll report my findings. Geforce support for R2VB would be a great thing to have: it would allow us to drop the vertex texturing code completely and port many of our DX10-specific shaders to the DX9 renderer.
 
I tested DX9 R2VB on various Geforce cards at work, here are the results:
XP + Geforce 6800 = works
XP + Geforce 7800 = works
Vista + Geforce 6800 = works
Vista + Geforce 7800 = works
Vista + Geforce 8800 = does not work (no support for 'R2VB' fourcc format)

This is very good news. Despite the lack of official support, most Geforces seem to support R2VB just fine. As R2VB is usable on all DX9 Radeons (9500+), it's a very good choice for many DX10-style rendering techniques.

This feature is called "Vertex Texture Fetch" by Nvidia: http://developer.nvidia.com/object/using_vertex_textures.html

Yes, that's the marketing name. But in reality it's just standard texture sampling inside the vertex shader. You can use the same sampling instructions that are usable in the pixel shader. There are limitations, such as fewer available texture formats, and the hardware is not capable of calculating mip levels automatically (without polygon slope info that is of course impossible). It's also a bit slower than texture sampling in the pixel shader, because the chip cannot use the texture cache as efficiently (vertices don't form screen-space quads).

The biggest downside of texture sampling inside the vertex shader is the very limited texture format support of DX9-class hardware (only 32-bit floating point textures are supported), and you are also limited to point sampling. With R2VB you can use all texture formats (including DXT compressed formats) and all hardware filtering methods (bilinear, trilinear, anisotropic).
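To put the format point in numbers, here is a back-of-the-envelope sketch of what a single mip level of a 2048x2048 heightmap costs in the R32F format that vertex texturing forces, versus a 16-bit format or a 4-bits-per-texel compressed format like DXT1 that R2VB makes available (the 2048x2048 size is just an example):

```python
def heightmap_bytes(width, height, bits_per_texel):
    """Storage for one mip level of a texture at the given bits per texel."""
    return width * height * bits_per_texel // 8

size = 2048
for name, bpt in (("R32F", 32), ("L16", 16), ("DXT1", 4)):
    mib = heightmap_bytes(size, size, bpt) / (1024 * 1024)
    print(f"{name}: {mib:.0f} MiB")  # R32F: 16 MiB, L16: 8 MiB, DXT1: 2 MiB
```

An 8x difference per map adds up quickly once several such buffers are resident at once.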
 
Does R2VB cause an in device memory copy in ATI or NVidia PC cards (it does on 360 obviously)?

I suppose if the framebuffer is not in a swizzled format internally, then R2VB would not require a copy, but reading from a framebuffer would require a swizzling copy.

IMO R2VB would have been much more useful if the vertex stream frequency (instancing) feature had been just as portable...
 
I really wondered why this feature has not been discussed, marketed and used more.

I'm not sure how many games use it, but I think it has been fairly well marketed, at least to developers. ATI released an SDK back in those days with a large focus on R2VB, and IIRC more than half of the new samples used R2VB.

I tested DX9 R2VB on various Geforce cards at work, here are the results:
XP + Geforce 6800 = works
XP + Geforce 7800 = works
Vista + Geforce 6800 = works
Vista + Geforce 7800 = works
Vista + Geforce 8800 = does not work (no support for 'R2VB' fourcc format)

Interesting. Didn't know they had implemented it, but it makes sense. It was just a lot faster than VTF. For DX10 hardware though VTF is fast enough so there's no longer any direct need for R2VB. In fact, you'll probably see somewhat better performance with VTF if there's any difference at all.

Sebbi, where can I get those demos?

Grab the March 2006 Radeon SDK from here:
http://ati.amd.com/developer/SDK/PreviousRadeonSDK.html

Does R2VB cause an in device memory copy in ATI or NVidia PC cards (it does on 360 obviously)?

Don't know about Nvidia, but on ATI there's no copy involved. I would think it's the same for Nvidia. When you create a render target for R2VB you pass a flag identifying it as such so it'll be created as linear in memory.
 
Does R2VB cause an in device memory copy in ATI or NVidia PC cards (it does on 360 obviously)?

I suppose if the framebuffer is not in a swizzled format internally, then R2VB would not require a copy, but reading from a framebuffer would require a swizzling copy.

You use the D3DUSAGE_DMAP flag when you create the render target texture. This makes the hardware (Radeons and the 360 at least) create the buffer in a linear format instead of a swizzled one. I am pretty sure NVidia cards also support linear buffers in hardware, so most likely no internal buffer copy is needed on them either.

I am not sure why the hardware manufacturers force us to use a non-swizzled format for R2VB. If I knew how the surface was swizzled, I could easily build the index buffer to negate the swizzling (and likely gain a minor performance boost, since the swizzled layout is the most optimal one). I guess this is because the swizzling differs between card generations, and they want to keep the API as clean as possible. It's just funny that we console developers notice things like this first when we learn a new API :)
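For illustration, GPU texture swizzles are typically variations on Morton (Z-order) bit interleaving. A sketch of the kind of index remapping described here, assuming a plain Morton layout (real hardware layouts are vendor- and generation-specific, which is exactly the portability problem):

```python
def morton_index(x, y):
    """Interleave the bits of x and y (Morton / Z-order), a common GPU
    texture swizzle. Supports coordinates up to 16 bits each."""
    index = 0
    for bit in range(16):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index

def unswizzle_indices(width, height):
    """Index buffer that reads a Morton-swizzled surface in linear order:
    entry i gives where linear texel i lives in the swizzled layout."""
    return [morton_index(i % width, i // width) for i in range(width * height)]
```

With such a table baked into a static index buffer, the vertex fetches would walk the swizzled surface in its native order, which is the "minor performance boost" being speculated about.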

Interesting. Didn't know they had implemented it, but it makes sense. It was just a lot faster than VTF. For DX10 hardware though VTF is fast enough so there's no longer any direct need for R2VB. In fact, you'll probably see somewhat better performance with VTF if there's any difference at all.

Agreed, R2VB is much more flexible and faster than VTF in DX9. But R2VB is not dead in DX10, it still remains more flexible than VTF (most important thing being that you can reuse data generated with R2VB), however the margin has been greatly narrowed as the filtering and texture format problems have been solved.
 
Humus, I meant are they available as standalone exes?
I haven't looked, but I'm guessing they are there as source in the SDK and I would need a compiler + DirectX SDK etc.
 
I am not sure why the hardware manufacturers force us to use a non-swizzled format for R2VB. If I knew how the surface was swizzled, I could easily build the index buffer to negate the swizzling (and likely gain a minor performance boost, since the swizzled layout is the most optimal one). I guess this is because the swizzling differs between card generations, and they want to keep the API as clean as possible. It's just funny that we console developers notice things like this first when we learn a new API :)
The "swizzling" can change depending on memory configuration as well... not something you'd have to handle on a console. Would you really want to have different versions for every different card and memory configuration? Imagine the amount of testing you would have to do.
 
Humus, I meant are they available as standalone exes?
I haven't looked, but I'm guessing they are there as source in the SDK and I would need a compiler + DirectX SDK etc.

Yes, the demos are compiled and the exe files are included in their directories. You can run them without compiling just fine. Some of the R2VB demos do not work on Geforce cards, as they use ATI-specific texture formats (ATI1 and ATI2).

The "swizzling" can change depending on memory configuration as well... not something you'd have to handle on a console. Would you really want to have different versions for every different card and memory configuration? Imagine the amount of testing you would have to do.

Sadly true. Everything is so much easier in console development: one rendering code path and one test platform. I doubt the performance difference would be that large either way, as most R2VB algorithms aren't very cache friendly.
 
Agreed, R2VB is much more flexible and faster than VTF in DX9. But R2VB is not dead in DX10, it still remains more flexible than VTF (most important thing being that you can reuse data generated with R2VB)
How so? Data written to a texture is obviously reusable, too.
 
How so? Data written to a texture is obviously reusable, too.

Yes of course, but if you use VTF you have to sample every texture you use in your vertex shader every frame, and sampling in the vertex shader is slower than in the pixel shader. Also, many textures sampled in the vertex shader are in floating point format, and for any advanced algorithm, sampling just one 4-channel texture is not enough. When you use R2VB, you do the sampling once and write the result to a static vertex buffer. You do not have to sample anything anymore when you render the vertices. R2VB is faster if you only update the data periodically.
 
Why should it be slower? That just sounds like a particular implementation that has biased the view.

The pixel shader renders in quads, and the texture cache is used more optimally. Of course, if we are talking about a theoretical architecture with no texture cache and no other texturing-specific optimizations (possible because of the much more controlled texture access patterns in pixel shaders), then vertex and pixel texturing should behave similarly performance-wise. However, this is not the case with current DX9 and DX10 chips. The scenario here is roughly comparable to random memory reads versus similarly sized memory block transfers.
 
Another point we should probably bring up here is that with DX10/GL2.1 (+ NVidia's extensions) you can stream out. However, from what I've seen in GL, stream out only works with floats. Even in this case R2VB still has a possible advantage in that you could write compressed output (i.e. FP16, or RGBA INT8 for colors), which could be a considerable bandwidth saving overall.

However, with stream out you can output, I believe, up to 16 floats per vertex, into which you could manually pack/compress data. And stream out does have the advantage of doing data expansion for free, resulting in the ability to render indexed triangles with one interleaved vertex buffer; all vertex attribute fetches could then have perfect memory alignment and granularity so that no bandwidth is wasted.
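The manual packing idea can be sketched on the CPU side. An illustrative Python version of compressing per-vertex data before it goes into a buffer (the function names are made up for the example; a shader would do the equivalent bit packing into its float outputs):

```python
import struct

def pack_rgba8(r, g, b, a):
    """Pack four 0..1 floats into one 32-bit RGBA8 word: 4 bytes instead of
    the 16 bytes the same color takes as raw FP32 stream-out."""
    def to_byte(v):
        return max(0, min(255, int(round(v * 255.0))))
    return struct.pack("<4B", to_byte(r), to_byte(g), to_byte(b), to_byte(a))

def pack_half2(x, y):
    """Pack two floats as IEEE 754 half precision: 4 bytes instead of 8."""
    return struct.pack("<2e", x, y)

packed = pack_rgba8(1.0, 0.5, 0.25, 1.0) + pack_half2(0.5, 1.0)
print(len(packed))  # 8 bytes, versus 24 bytes as six raw FP32 values
```

The bandwidth saving is exactly this ratio per vertex, which is what makes compressed output attractive for large R2VB or stream-out buffers.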

In the case of VTF,

Also, just guessing from the CUDA docs, I believe the texture cache line on GeForce 8 and 9 series is about 32 bytes (to match the memory granularity), which would be 8 texels (guessing a 4x2 rect) for R FP32, 4 texels (2x2 quad) for RGBA FP16 or RG FP32, and just 2 texels (2x1 rect) for RGBA FP32. So if you do have incoherent texture fetches from the vertex pipe, it might be best to fetch RGBA FP32. Obviously you will also want to double-check performance with software-swizzled texture fetch addresses versus linear.
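Those texel counts are just the cache line size divided by the texel size. A quick sanity check, taking the 32-byte line size guessed above as an assumption:

```python
def texels_per_cache_line(cache_line_bytes, channels, bytes_per_channel):
    """How many whole texels fit in one texture cache line."""
    return cache_line_bytes // (channels * bytes_per_channel)

line = 32  # assumed GeForce 8/9 cache line size, per the post
print(texels_per_cache_line(line, 1, 4))  # R FP32    -> 8
print(texels_per_cache_line(line, 4, 2))  # RGBA FP16 -> 4
print(texels_per_cache_line(line, 4, 4))  # RGBA FP32 -> 2
```

The fatter the texel, the more of each cache line an incoherent fetch actually uses, which is the argument for RGBA FP32 in that case.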
 