Why isn't R2VB discussed and used more?

Do you really expect that accesses to texture data from the vertices are going to be incoherent?

The texture access pattern from the vertex shader (VTF) is much more random than from the pixel shader, if we just think about naively porting existing algorithms to VTF without reordering our vertices to compensate. However, vertex reordering for VTF is not that straightforward, as you'd usually want to order your vertices for the best vertex cache performance. Reaching both goals at once is not trivial, and a perfect reordering is not possible for the majority of algorithms and data sets.

Of course R2VB suffers from similar problems if your vertex buffer generation accesses textures very randomly. However, this affects only the vertex buffer creation. If you are using the generated buffer hundreds of times in a frame and/or updating the buffer infrequently, there can be a significant performance gap. I am not trying to prove here that VTF is not a useful feature. It's very useful in many ways, but there are cases where R2VB is just more efficient (even on DX10 hardware).

Let me use our terrain renderer as an example: our terrain is stored in graphics card memory as a huge one-component 2D heightmap. For rendering we have a 2D vertex grid that is 1x1 units at close range, 2x2 a bit further away (and 4x4, 8x8 and 16x16 units further away still). We round the player position to the nearest 16x16 position in the grid. This way there is no grid wobbling (as there would be if we used a view space grid). If the player moves at full speed, the grid position changes around once a second. This is when we use R2VB to update the terrain vertex buffer. We use this generated terrain vertex buffer twice every frame (once for water reflections). The terrain mesh has around 500,000 vertices. For our VTF path the terrain rendering requires one million texture fetches per frame; for our R2VB path, only 8333 (assuming 60 fps). The R2VB-generated vertex buffer is slightly larger than its VTF counterpart, as it includes ground height (f16) and normal vector (i8) data.
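Just to make the fetch-count comparison above concrete, here is a quick back-of-the-envelope calculation using the numbers from the post (500,000 vertices, two draws per frame, 60 fps, and a buffer rebuild roughly once per second):

```python
# Fetch-count arithmetic for the terrain example; all constants come
# straight from the post above.
VERTICES = 500_000
DRAWS_PER_FRAME = 2          # main view + water reflection
FPS = 60
REBUILDS_PER_SECOND = 1      # grid snaps to a new 16x16 position ~once/s

# VTF: every vertex samples the heightmap on every draw, every frame.
vtf_fetches_per_frame = VERTICES * DRAWS_PER_FRAME       # 1,000,000

# R2VB: the heightmap is sampled only when the buffer is regenerated,
# amortized over the frames rendered in that second.
r2vb_fetches_per_frame = VERTICES * REBUILDS_PER_SECOND / FPS  # ~8,333
```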
 
Yes, of course, but if you use VTF you have to sample every texture you use in your vertex shader every frame, and sampling in the vertex shader is slower than in the pixel shader. Also, many textures sampled in the vertex shader are in floating point formats, and for any advanced algorithm just sampling one four-channel texture is not enough. When you use R2VB, you do the sampling once and write the result into a static vertex buffer. You do not have to sample anything anymore when you render the vertices. R2VB is faster if you update the data only periodically.
Well, you can obviously render to a texture then sample it in a vertex shader. At that point the only difference that remains between VTF and R2VB is the way data is fetched by the vertex shader: either automatically from vertex buffers based on the vertex index, or explicitly from a texture using texture coordinates. In that regard VTF is more flexible than R2VB.
 

Flexible yes, but not that fast. Seems that we have just started to argue about semantics :)

On DX10 hardware both VTF and R2VB allow you to implement the same rendering techniques. Both are equally flexible. R2VB offers better performance in some cases and VTF in others. And you can even mix the techniques if the situation demands it (R2VB for the infrequently changing data in your vertex buffer and VTF for the remaining highly dynamic data).

On DX9 hardware R2VB offers both superior performance and flexibility (VTF supports only 32-bit floating point textures, and the cards have a very limited number of dedicated VTF texturing units).
 
Everything is so much easier in the console development.

I disagree. ;)

The pixel shader renders in quads, and the texture cache is used more optimally. Of course, if we are talking about a theoretical architecture with no texture cache and no other texturing-specific optimizations (possible because of the much more controlled texture access patterns in pixel shaders), then both vertex and pixel texturing should behave similarly performance-wise. However, this is not the case with current DX9 and DX10 chips.

When I last did any tests on VTF performance, ATI cards showed performance comparable to the pixel shader. Oddly though, the R600 outperformed the G80 despite having less texturing power.

In my InfiniteTerrainII demo I compared VTF to R2VB on R600 and VTF was about 10-20% faster than R2VB IIRC. The reason is that when you use R2VB all data is fetched with vertex fetch, whereas with VTF some data is fetched with vertex fetch and the rest with the texture units. Theoretically you could double the fetch rate by fetching half the data as textures, assuming you're not limited by bandwidth or texturing in the pixel shader.

Another point we should probably bring up here is that with DX10/GL2.1 (+NVidia's extensions) you can stream out. However, from what I've seen in GL, stream out only works with floats. Even in this case R2VB still has a possible advantage in that you could write compressed output (i.e. FP16 or RGBA INT8 for colors), which could be a considerable bandwidth saving overall.

I've never been too excited about StreamOut, to be honest. I'd solve most problems with VTF/R2VB instead. It has some nifty bits, but also limitations. For instance, you lose the ability to index the output; instead you get complete primitives streamed out. And of course the float limitation. Although you can of course do your own packing and return integers cast with asfloat() in HLSL, then cast back with asuint().
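The asfloat()/asuint() trick above is just a bit reinterpretation, not a numeric conversion. Here is a CPU-side sketch of the round-trip (the helper names mirror the HLSL intrinsics but are my own stand-ins, and the RGBA packing layout is only an example):

```python
import struct

def asfloat(u):
    """Reinterpret a uint32 bit pattern as a float (like HLSL asfloat)."""
    return struct.unpack('<f', struct.pack('<I', u & 0xFFFFFFFF))[0]

def asuint(f):
    """Reinterpret a float's bits as a uint32 (like HLSL asuint)."""
    return struct.unpack('<I', struct.pack('<f', f))[0]

# Pack an RGBA8 color into one uint, smuggle it through a float-only
# stream-out slot, and unpack it again on the consuming side.
r, g, b, a = 255, 128, 64, 32
packed = (a << 24) | (b << 16) | (g << 8) | r
smuggled = asfloat(packed)    # what the float-only stream out would store
restored = asuint(smuggled)   # what the consuming shader reads back
```

One caveat of this approach: some integer bit patterns alias floating point NaNs, so the hardware path carrying the value must preserve bits exactly for the round-trip to be safe.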

Speaking of compressed output, one interesting idea in DX10.1 is to compress the output to BC4, like for dynamic displacement maps, and use VTF to fetch the data. Doesn't get more compact than that. :)
 

My testing has been conducted using a deferred renderer. We are already bottlenecked by texture sampling in many cases (four G-buffers, and our per-pixel material system uses four DXT5 textures and parallax mapping on every surface). I didn't even think about hitting vertex buffer read bottlenecks, as we have never hit them. Balance is of course the key to avoiding bottlenecks. If you have spare texture sampling power to use, the combination of a static buffer + VTF is going to give the best performance (in DX10).


It's very much like the ATI1 FOURCC format in DX9. ATI1 is basically a standalone DXT5 alpha channel (a one-channel format). BC5 is similar to the ATI2 format (marketed as 3Dc normal map compression). Both formats store two 8-bit reference values per channel (min and max) and an 8-value (3-bit) interpolation between them for each texel in a 4x4 block. BC4 offers one channel and BC5 offers two. I have always been a bit disappointed that there is no BC6 (or ATI4) offering four channels compressed like this. It would be a perfect fit for our material system.

http://msdn.microsoft.com/en-us/library/bb694531(VS.85).aspx
 
In your data, is there no correlation between the four channels then?
 
Oh, yes, I remember that thread now.

Yes, our current material system is an evolved version of the one I talked about half a year ago. For most materials, 3 x DXT5 textures are enough to store all the channels we need. We have done some additional channel reordering to get better quality and to tie correlated channels together better (in the DXT5 RGB color parts). For the most complex materials 4 x DXT5 is needed (or 3 x DXT5 + ATI2). That is 3-4 bytes per pixel for the whole material. I am pleased with the result, as the materials look fantastic.
 
I have been porting R2VB and VTF to all our supported cards lately. Everything works now and the performance is very much acceptable on each card. However, the driver support for these features on DX9 cards is a bit shaky, to say the least.


Geforces:

Geforces seem to have lots of small problems in DX9 R2VB. It's quite understandable, as it's originally an ATI DX9 API hack and has only recently been added to Geforce drivers.

Geforce 8000 series:
No support for R2VB at all. Understandable, as the 8000 series has very fast VTF units (the same texture units are usable in both pixel and vertex shaders). However, for some reason not all texture formats are supported in the DX9 VTF API. In the DX10 API the 8000 series supports a much wider selection of texture formats in vertex shaders. This seems to be a driver issue, as sampling some "unsupported" texture formats in the vertex shader works just fine (even though the driver doesn't list these formats as supported).

Geforce 6000 and 7000 series:
These cards support USAGE_DMAP only for a very limited set of texture formats (the driver validates the call parameters and returns a null texture). As this is the only way to instruct the card to store the texture scanlines linearly, there is no correct way to use R2VB with other texture formats. You could swizzle the index buffer vertex indices manually, but that's very risky, as there is no documentation on how the swizzling is done on each Geforce model/memory configuration. A small driver "fix" would solve this issue easily: the API could return a valid texture pointer, like all ATI cards do, even if USAGE_DMAP is not supported for the created texture format.

This issue limits the DX9 R2VB formats on the Geforce 6000 and 7000 series to:
A16B16G16R16
G16R16F
A16B16G16R16F
R32F
G32R32F
A32B32G32R32F

This is still considerably better than the 6000 and 7000 series VTF support (only A32B32G32R32F and R32F). R2VB is also much faster on these cards, as they have only a very limited number of dedicated texture samplers for VTF.


Radeons:

Unsurprisingly, all Radeons support DX9 R2VB better. All texture formats that have an equivalent vertex format are usable. Only the HD 2000 and 3000 series support VTF, but both support a very generous set of formats in DX9 (basically all the formats that can be sampled in pixel shaders can also be sampled in vertex shaders).

But there are minor problems in the Radeon drivers as well. All Radeons seem to have a similar driver bug in both their VTF and R2VB implementations.

VTF: If you change the VTF texture (SetTexture) but do not change the stream source (SetStreamSource), the Radeon HD drivers do not notice the texture change at all, and the card continues to sample the old texture in the vertex shader. However, if you use the same texture in the pixel shader as well, the change is detected correctly, and the updated texture pointer is also used in the vertex shader.

R2VB: A similar bug also affects R2VB. If you change the D3DDMAPSAMPLER texture, the driver does not notice the change until you change the stream source associated with the same stream id as the R2VB_VSMP_OVR_DMAP.

This bug only affects scenarios where you render the same static vertex buffer several times in a row and only change the vertex texture (or R2VB stream). The easiest way around the issue is to alternate between two identical vertex buffers, although that wastes some memory. Geforce drivers do not have this issue in their VTF or R2VB implementations.
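The two-buffer workaround boils down to simple ping-ponging so that the driver always sees a SetStreamSource change alongside the texture change. A minimal sketch of the bookkeeping (the class and names are mine, with strings standing in for the actual vertex buffer objects):

```python
class PingPongVB:
    """Alternate between two identical vertex buffers so the driver sees a
    stream source change whenever the vertex texture / R2VB stream is
    swapped (workaround for the Radeon driver bug described above)."""

    def __init__(self, vb_a, vb_b):
        self._vbs = [vb_a, vb_b]
        self._current = 0

    def next(self):
        # Flip to the other copy; bind this one via SetStreamSource.
        self._current ^= 1
        return self._vbs[self._current]
```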

I used the most recent WHQL certified Geforce and Radeon drivers for my testing.
 
Sebbbi, thanks for the info!!

If R2VB doesn't work at all on NV 8/9 series cards, sounds like R2VB isn't going to be a universal solution (which is too bad).

One thing to consider on future (and some current) hardware is that ROP/OM might be going to a special output path and would require a copy to memory to be used again as a VB or a texture. While stream out might simply go directly back to linear memory without requiring an extra copy.
 

Yes, I was quite surprised about this issue as well, and even more surprised about the lack of supported DX9 VTF formats on the Geforce 8800 (compared to ATI DX10 and DX10.1 cards). These problems will most likely be fixed by new driver releases, as DX9 VTF and DX9 R2VB are being used more all the time (more and more games are being ported from Xbox 360 and Vista/DX10 to DX9/WinXP, and no longer the other way around like it used to be).

Currently we are using the VTF path only on the Geforce 8000/9000 series. We use the same "R2VB vertex buffer textures" for the VTF path as well, sampling 1:1, one pixel per vertex (using point sampling). We do this because the creation of these buffers is pretty expensive (we calculate shadows, normal vectors, etc. during buffer creation), and the buffers are only updated around once per second (one buffer per frame). Calculating normal vectors and shadows every frame would slow down the terrain vertex shader, and the random accesses to a larger terrain heightmap would be considerably less texture cache friendly than sampling a small texture scanline by scanline. One limitation in DX9 is that the API doesn't offer a primitive ID to the vertex shader like DX10 does, so I can't use one to calculate the texture coordinate for the vertex. In our terrain renderer this was not a big issue, as we could just modify the static vertex buffer XZ positions to be direct texture coordinates and compensate for the change in the object's world matrix.
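The XZ-as-texcoord trick can be illustrated with a toy example: the grid vertex stores its position in texel units, the vertex shader derives the VTF texture coordinate directly from it, and the world matrix scales/translates the same value back into world space. All constants below are made up for illustration, not taken from the actual renderer:

```python
# Toy illustration of reusing the static grid XZ as both texcoord and
# (via the world matrix) world position. Assumed, illustrative numbers.
TEXTURE_SIZE = 1024     # assumed heightmap dimension in texels
GRID_SPACING = 1.0      # assumed world units per texel at the finest LOD

def vertex_texcoord(vertex_xz):
    """Texcoord comes straight from the stored XZ (half-texel centered)."""
    x, z = vertex_xz
    return ((x + 0.5) / TEXTURE_SIZE, (z + 0.5) / TEXTURE_SIZE)

def world_position(vertex_xz, grid_origin):
    """The world matrix compensates: scale by spacing, translate to the
    grid origin that was snapped to the nearest 16x16 position."""
    x, z = vertex_xz
    return (grid_origin[0] + x * GRID_SPACING,
            grid_origin[1] + z * GRID_SPACING)
```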
 
Update:
We tested with the new NVidia WHQL drivers on Vista 64-bit, and the 8000 series cards still lack DX9 R2VB support and support for integer texture format VTF in DX9 (including the standard 8888 texture format). However, the integer texture formats actually work just fine; only the support caps are missing.
 
Sorry for dragging this back up, but I was wondering if anything has changed with newer nVidia drivers? I want to use R2VB for an effect on PS3 where VTF isn't an option (latency/texture format restrictions), but need to support D3D as well and would prefer not to have to support and maintain both methods.
 