Optimizing vertex inputs

sebbbi

Veteran
Lately I have been optimizing the vertex and texture inputs of our deferred renderer's geometry pass. In geometry buffer rendering, chip bandwidth is usually the limiting factor while the ALUs sit almost idle. Because of this, additional ALU work is nearly free, and it pays to spend ALU cycles decompressing the vertex data as much as possible.

This subject has not been discussed much in publications yet, so I thought it would be best to ask for some educated advice. I have around ten years of experience in 3D graphics programming, but this area is kind of new to me.

This is my current geometry vertex format in our DX9 render path:

Position:
- I started optimizing from a 32 bit float 3d vector (12 bytes).
- Now: One 16 bit float 4d vector (8 bytes).
- Compression ratio: 1.5x
- Notes: Something needs to be stored in the w channel!
- Quality: All our object vertices are centered around the object's center point, and all objects are of manageable size (object-based culling also needs this to perform at its best). Because of this, the 16 bit float precision has not shown any visible artifacts in our testing.

Texture coordinates:
- I started optimizing from a 32 bit float 2d vector (8 bytes).
- Now: One 16 bit float 2d vector (4 bytes).
- Compression ratio: 2.0x
- Quality: No quality degradation can be seen with our content.

Normal, binormal, tangent:
- I started optimizing from 3x 32 bit float 3d vectors (36 bytes).
- Now: One signed 16 bit normalized 4d vector (SHORT4N) (8 bytes). I store normal x,y in the x,y channels and tangent x,y in the z,w channels, and reserve one bit of the x, z and w components for the cross product signs.
- I reconstruct normal.z and tangent.z as sqrt(1 - x^2 - y^2) and multiply them by the sign bits (stored in one bit of the x and z channels).
- I calculate the binormal with a cross product of the normal and tangent and multiply it by the sign bit (stored in one bit of the w channel).
- Compression ratio: 4.5x
- Quality: No quality degradation can be seen. After decompression you basically have all 3 vectors reconstructed in 15-15-15 bit precision. The precision is actually too good...

Total:
Original vertex with a single texture coordinate / tangent set: 56 bytes
Compressed vertex with a single texture coordinate / tangent set: 20 bytes
Current compression ratio: 2.8x
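
For reference, a minimal sketch of what the matching vertex input could look like in DX9 HLSL. The semantic assignments are my own assumptions, and the runtime expands all three declaration types to floats before the shader sees them:

```hlsl
// Hedged sketch of a vertex input matching the layout above (DX9 HLSL).
// Semantic names and stream assignments are assumptions.
struct CompressedVertex
{
    float4 position : POSITION;  // FLOAT16_4: xyz = position, w = spare
    float2 texcoord : TEXCOORD0; // FLOAT16_2: uv
    float4 frame    : TEXCOORD1; // SHORT4N: normal.xy, tangent.xy + sign bits
};
```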

Extra instructions for decompression:
- No extra calculations for the position and texture coordinates.
- For the normal/binormal/tangent I have to separate the sign bits, calculate the normal and tangent z-components (sqrt(1 - x^2 - y^2) * sign), and calculate the binormal (cross(normal, tangent) * sign), as sketched below.
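
In shader terms the decode could look roughly like this. A minimal HLSL sketch, assuming the sign flags sit in the least significant bit of the quantized x, z and w channels (the exact bit placement is an assumption):

```hlsl
// Hedged sketch: rebuild normal/binormal/tangent from one SHORT4N input.
// enc.xy = normal.xy, enc.zw = tangent.xy (each ~15 bits plus 1 flag bit).
void DecodeTangentFrame(float4 enc, out float3 n, out float3 b, out float3 t)
{
    // Back from the [-1,1] SHORT4N values to the quantized integer range.
    float4 q = enc * 32767.0;

    // Pull the least significant bits of x, z and w out as sign flags.
    float3 flags = frac(q.xzw * 0.5) * 2.0;  // 0 or 1 (works for negative values too)
    float3 s = flags * 2.0 - 1.0;            // map to -1 or +1

    // Strip the flag bits, leaving ~15 bit values back in [-1,1].
    float4 v = (q - float4(flags.x, 0.0, flags.y, flags.z)) / 32767.0;

    n.xy = v.xy;
    t.xy = v.zw;

    // z = sqrt(1 - x^2 - y^2), with the stored sign restored.
    n.z = sqrt(saturate(1.0 - dot(n.xy, n.xy))) * s.x;
    t.z = sqrt(saturate(1.0 - dot(t.xy, t.xy))) * s.y;

    // Binormal from a cross product, with its own stored sign.
    b = cross(n, t) * s.z;
}
```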

Currently I am pretty happy with the normal/tangent/binormal compression, but the texture coordinate compression and especially the position compression still need some work. What kind of vertex compression have you used in your projects, or have uncompressed floats served your projects just fine?
 
In all my projects I'm using uncompressed floats and the bottleneck is always pixel shader fillrate. However, geometry compression is a very interesting subject.

I've never had enough time to run tests, but I've got a lot of questions.

1. You say the quality holds up well. What scale/metric are you using to judge the degradation?

2. Regarding performance, did you notice a big improvement?

3. Are you using a separate geometry vertex format for dynamic objects (decals, particles)?

4. Did you notice z-buffer problems at far distances (800 meters)?

Thanks in advance for your time,

Sergi
 
1. We are mainly comparing the output images (various close-ups of difficult cases). But of course I am also tracking how many bits we lose, and what the expected results are for each optimization. I also know the limits of our g-buffer targets (our g-buffers are very tightly packed): there is no need to calculate anything at higher precision than the g-buffers can store.

2. With our current non-finalized content, the performance boost is minimal. Our graphics artists have been careful with vertex counts, because our previous technology didn't handle large vertex counts that well. And now we also have parallax mapping and per-pixel displacement mapping to improve the surface look without extra polygons, so it's hard to convince them to add more polygon detail :). Currently the main gain from the compression is that the vertex buffers are much smaller, saving us lots of valuable graphics memory. After all the console projects I have done, I have learned that saving memory with no performance degradation = very good. This will become noticeable at the end of development, when the graphics artists start to finalize the levels; then we will also see more performance improvements from the compression.

3. We have a flexible material definition language: you can customize the vertex format for every material. The format I was referring to is the most used format in our materials. For UI sprites and particles we mainly use uncompressed floats, as the vertex counts there are smaller and the compression would require a float32->float16 or float32->int CPU conversion for every vertex, every frame. For all rendered geometry we use vertex shaders to deform objects; we do not use dynamic vertex buffers, as editing vertex buffers with the CPU on the fly really kills performance.

4. There is nothing in the compression I used that degrades z-buffer resolution. The vertex shader calculates everything at the same 32 bit float precision: once the input vertex has been transformed from object space it's already in full 32 bit float precision, and all our matrices (and other constant inputs) are naturally in full 32 bit float precision.
 
It obviously depends on how much precision your texcoords and positions need. With higher tessellation you need more precision for both. In general, fixed point is better than float here, as you typically want a uniform distribution across a limited range. The values should be normalized to the range required for each dimension. To give you a sense of scale, if you had a model that spans 10 m on one axis, a normalized 16-bit fixed point value gives you a precision of about 153 µm, while with 10 bits you get a step size of about 9.77 mm. If you have a model with uneven extents (e.g. tall but narrow) you can use fewer bits for the axes that cover a smaller range.
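
(To spell out the arithmetic behind those numbers: step size = range / 2^bits, so 10 m / 2^16 ≈ 152.6 µm and 10 m / 2^10 ≈ 9.77 mm.)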

As the texture wraps around the model it may cover a longer distance, so the texcoords should probably be more precise than the positions. But the required precision for texture coordinates also depends on the texture content and size.

It also depends on which formats your target hardware supports, and on whether it's a unified shader architecture (which gives you many more ALU cycles to spend in the VS).

As you want to encode 5 values, you could try combining one of the SHORT2 formats and DEC3. 3x DEC3 for all your 9 values is probably not precise enough. But using 2x SHORT4N and grabbing a few bits off some values to construct the 9th should be enough in your case.
 
It obviously depends on how much precision your texcoords and positions need. With higher tessellation you need more precision for both. In general, fixed point is better than float here, as you typically want a uniform distribution across a limited range. The values should be normalized to the range required for each dimension. To give you a sense of scale, if you had a model that spans 10 m on one axis, a normalized 16-bit fixed point value gives you a precision of about 153 µm, while with 10 bits you get a step size of about 9.77 mm. If you have a model with uneven extents (e.g. tall but narrow) you can use fewer bits for the axes that cover a smaller range.

Agreed on these points completely. Currently the only reason the positions and texture coordinates are 16 bit floats instead of 16 bit integers is that I wanted to use the same shaders on all targets if possible (all floating point formats are interchangeable and need no extra shader decoding). 16 bit float is not supported on the earlier ATI DX9 cards, and the earlier NVIDIA cards have different shortcomings. However, the SHORT4 and SHORT2 formats are supported by all our targets, so we never have to fall back to FLOAT3, and the need for duplicate shader versions is nullified.

All dimensions (x, y, z) are naturally compressed to their full ranges separately. We already implemented this for our older Nintendo DS renderer: the object world matrix is simply multiplied by one extra xyz scale matrix before it's sent to the shader. However, we then need to send one extra 3x3 matrix to rotate the normals.
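
As a rough HLSL sketch of that approach (the constant name is my own, and the dequantization scale is assumed to be folded into the matrix on the CPU side):

```hlsl
// Hedged sketch: per-axis dequantization folded into the transform chain.
// worldViewProj is assumed to be composed on the CPU as
// dequantScale * world * view * proj, so the shader pays nothing extra.
float4x4 worldViewProj;

float4 TransformQuantizedPosition(float4 quantizedPos)
{
    // quantizedPos.xyz arrives normalized (e.g. in [-1,1]); the folded-in
    // scale expands it back to object-space extents during the transform.
    return mul(float4(quantizedPos.xyz, 1.0), worldViewProj);
}
```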

As you want to encode 5 values, you could try combining one of the SHORT2 formats and DEC3. 3x DEC3 for all your 9 values is probably not precise enough. But using 2x SHORT4N and grabbing a few bits off some values to construct the 9th should be enough in your case.

First I compressed the normal/tangent/binormal with DEC3N: the normal was stored in one DEC3N and the tangent in another. In the tangent vector, one bit of x was used to store the binormal cross product sign (for some odd reason the DEC3N format does not make the 2 alpha bits available). This way the normal/tangent/binormal vectors were also compressed to a total of 8 bytes, and I actually needed fewer operations to decode them than I currently do with SHORT4N. However, the precision was only 10-10-10 for the normal and 9-10-10 for the tangent and binormal, and I needed 2 vertex inputs instead of one (2x DEC3N vs 1x SHORT4N).
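
For reference, a hedged sketch of the binormal reconstruction for that variant, assuming DEC3N decodes as value/511 and that the flag sits in the least significant bit of the quantized tangent x (both are assumptions):

```hlsl
// Hedged sketch of the 2x DEC3N variant: normal and tangent arrive complete,
// only the binormal is rebuilt from a cross product and a stored sign flag.
float3 DecodeBinormal(float3 normal, float3 tangent)
{
    float qx = tangent.x * 511.0;        // back to the signed 10 bit range (assumed /511 encoding)
    float flag = frac(qx * 0.5) * 2.0;   // least significant bit: 0 or 1
    float s = flag * 2.0 - 1.0;          // map to -1 or +1
    return cross(normal, tangent) * s;
}
```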

The reason I started developing a replacement for the DEC3N normal/binormal/tangent compression was the lack of support for it (the precision was good enough and never a problem in testing). The GeForce 6000 and 7000 series do not support UDEC3/DEC3N; the 8000 series is the first NVIDIA card family that supports them. ATI supports the formats in all DX9 cards (starting from the old Radeon 9500/9700). I still haven't run performance tests on the older (non unified shader) ATI cards to check whether the DEC3N compression runs faster on them than the new one. All the unified architecture cards use the new SHORT4N compression (as during the geometry pass we have plenty of free ALU power to spend).

Because only half of our target cards support them, I am no longer using the DEC3 formats as our main optimization target (for the DX9 renderer). But as DEC3N and SHORT4N are interchangeable in xyz (from the shader's point of view), I will most likely make some clever optimizations for the position vectors on ATI cards. When packing the vertex position streams it's easy to calculate the maximum and average errors and choose the format based on that information. Our architecture is already flexible enough to support completely different formats for each buffer... as long as the shader gets the data it needs.


Update:

SHORT4 + 2 x DEC3N would require 16 bytes, and the precision is good enough to store everything I need.

- Position xyz stored in the SHORT4 xyz.
- Normal xy stored in the first DEC3 xy.
- Tangent xy stored in the second DEC3 xy.
- Texture coordinate x stored in the SHORT4 w.
- Texture coordinate y: the 10 most significant bits stored in the first DEC3 z, the 6 least significant bits in the last DEC3 z.
- 4 bits of the last DEC3 z used for signs (actually only 3 bits are used; one is unused and could be used to increase the texture coordinate precision by one bit).

This way I get:
- Separate 16 bit fixed point precision for all xyz position channels.
- 16 bit fixed point precision for the x texture coordinate.
- 16 bit fixed point precision for the y texture coordinate (10+6 bits; see the sketch below).
- 10-10-10 bit precision for the reconstructed normal/tangent/binormal vectors.
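
A minimal sketch of the texture coordinate y reassembly, assuming the two z channels have already been remapped back to raw 10 bit integers (inverting the hardware's exact DEC3N decode rule is left out) and that the sign flags occupy the bits above the 6 payload bits:

```hlsl
// Hedged sketch: rebuild the 16 bit fixed point texcoord y from the two
// DEC3 z channels, given here as raw 10 bit integers (0..1023) in floats.
float DecodeTexcoordY(float hiBits10, float loBits10)
{
    // Keep only the 6 payload LSBs; the 4 sign flags are assumed to sit
    // in the bits above them.
    float low6 = fmod(loBits10, 64.0);
    // 10 MSBs + 6 LSBs -> 16 bit value, normalized to [0, 1].
    return (hiBits10 * 64.0 + low6) / 65535.0;
}
```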

This is the best I can come up with for all ATI DX9 cards and the GeForce 8000 series. For the GeForce 6000 and 7000 series I have to use something else.
 