Drawing many moving objects with OpenGL

If you have time could you sort back to front and tell me your performance differential if any?
front to back
back to front
and random
 
In that example it seems quad-overshading could be a concern, and sadly there's nothing you can really do about it. You could measure it precisely, though.

It looks like LOD would help a bit, I'll take a closer look at this later, thank you.

If you have time could you sort back to front and tell me your performance differential if any?
front to back
back to front
and random

I tried both orders (front to back and back to front) just to make sure, and neither made a measurable difference compared to the initial, random order.
 
Would you ever share this code? Starting to get interested in playing around with it.
 
If you have time could you sort back to front and tell me your performance differential if any?
front to back
back to front
and random
Also you can sort by screen space position. This improves ROP cache utilization.

However it seems that your case is fully geometry bound. Adding LOD / tessellation is the best way to proceed.

If your sphere is symmetrical (for example based on an icosahedron), you can treat each of the 20 icosahedron triangles as an instance and backface cull these instances (each sphere consists of 20 instances). I recommend using a compute shader pass for the backface culling (output visible subobjects to an append buffer) and performing a single DrawIndexedInstancedIndirect to draw all the spheres at once. This should roughly double your performance, as it halves the geometry processing cost (by removing the backfacing parts of the icosahedron). It is trivial to perform viewport culling in the same compute shader.
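
In OpenGL terms that would be a GL 4.3+ compute shader appending visible subobject ids to an SSBO with an atomic counter, feeding a glDrawElementsIndirect (or glMultiDrawElementsIndirect) call. A rough GLSL sketch of the idea, not code from this thread; the buffer layouts, bias value and uniform names are assumptions:

#version 430
layout(local_size_x = 64) in;

// One record per icosahedron face instance (20 per sphere); layout is an assumption.
struct Subobject { vec4 centerRadius; vec4 faceNormal; };

layout(std430, binding = 0) readonly buffer Subobjects { Subobject subobjects[]; };
layout(std430, binding = 1) writeonly buffer Visible   { uint visibleIds[]; };
layout(std430, binding = 2) buffer Counter             { uint visibleCount; };

uniform vec3 cameraPos;

void main() {
    uint id = gl_GlobalInvocationID.x;
    if (id >= uint(subobjects.length())) return;

    Subobject s = subobjects[id];
    vec3 toCamera = normalize(cameraPos - s.centerRadius.xyz);
    // Keep front-facing faces plus a small margin near the silhouette;
    // a viewport/frustum test could be added to the same branch.
    if (dot(s.faceNormal.xyz, toCamera) > -0.2) {
        uint slot = atomicAdd(visibleCount, 1u);
        visibleIds[slot] = id;
    }
}

The visible count and id list would then be written into the indirect draw argument buffer before issuing the draw.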

If you choose to do tessellation, I recommend using the 20-triangle icosahedron as the base mesh. When the sphere comes closer to the camera, you increase the tessellation factors. With continuous tessellation you get seamless LOD. The easiest way to get the added vertices onto the sphere surface is to linearly interpolate with the barycentrics, then renormalize the vector (from sphere centre to the surface) and multiply by the sphere radius.
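
A minimal tessellation evaluation shader sketch of that (GL 4.0+); sphereCenter, sphereRadius and viewProj are assumed uniforms:

#version 400
layout(triangles, equal_spacing, ccw) in;

in vec3 tcPosition[];            // patch corners passed through by the control shader

uniform vec3  sphereCenter;
uniform float sphereRadius;
uniform mat4  viewProj;

void main() {
    // Linear interpolation with the barycentrics...
    vec3 p = gl_TessCoord.x * tcPosition[0]
           + gl_TessCoord.y * tcPosition[1]
           + gl_TessCoord.z * tcPosition[2];

    // ...then renormalize the centre-to-surface vector and multiply by the radius.
    vec3 onSphere = sphereCenter + normalize(p - sphereCenter) * sphereRadius;
    gl_Position = viewProj * vec4(onSphere, 1.0);
}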

You can also do backface culling optimization directly in the hull shader when you use tessellation. A patch is culled away when you output a negative tessellation factor. Calculate a (biased) dot product between the camera front vector and the icosahedron face normal. Output a negative tessellation factor when the dot product is positive. Add a small constant bias to the dot product test to ensure the silhouette is perfect at all distances.
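
In GLSL terms (the hull shader is the tessellation control shader) that test could look roughly like this sketch; cameraFront, tessFactor and the 0.2 bias are assumptions for illustration:

#version 400
layout(vertices = 3) out;

in  vec3 vPosition[];
out vec3 tcPosition[];

uniform vec3  cameraFront;     // camera forward vector, same space as the positions
uniform vec3  sphereCenter;
uniform float tessFactor;      // distance-based LOD factor

void main() {
    tcPosition[gl_InvocationID] = vPosition[gl_InvocationID];

    if (gl_InvocationID == 0) {
        // Face normal of this icosahedron patch points out of the sphere.
        vec3 faceCenter = (vPosition[0] + vPosition[1] + vPosition[2]) / 3.0;
        vec3 faceNormal = normalize(faceCenter - sphereCenter);

        // Biased test: only cull patches that are clearly backfacing, so the
        // silhouette stays round. A non-positive factor discards the whole patch.
        float factor = (dot(cameraFront, faceNormal) > 0.2) ? -1.0 : tessFactor;

        gl_TessLevelOuter[0] = factor;
        gl_TessLevelOuter[1] = factor;
        gl_TessLevelOuter[2] = factor;
        gl_TessLevelInner[0] = factor;
    }
}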

With ExecuteIndirect you can do some additional tricks, since you can change the start index and primitive count per instance (standard geometry instancing uses the same start index and primitive count for every instance). This allows you to extend subobject culling to any geometry (not just spheres).
 
@sebbbi He's on OpenGL, not DirectX :) It might drive him a little crazy to achieve what you suggest. Additional note: a tessellation factor of 0 is also good for culling.
 
More than that, my software will have to run on machines limited to OpenGL 3.3, so no tessellation for me. But I can still implement a cruder version of LOD. And maybe those machines will be upgraded in time for me to try implementing all this.

Once again, thank you all for your advice, I think that with a bit of LOD thrown in I'll have acceptable performance, at least for the time being.
 
@sebbbi He's on OpenGL, not DirectX :) It might drive him a little crazy to achieve what you suggest. Additional note: a tessellation factor of 0 is also good for culling.
OpenGL 4.4 has tessellation, indirect draw, compute shaders and even MultiDrawIndirect (to replace ExecuteIndirect). If he needs to stick with OpenGL 3.3, then these improvements are not possible.

In this case I would recommend a basic LOD system. Do one instanced draw per LOD level. Draw near spheres (LOD 0) first (with a single instanced draw), then draw spheres that are a little bit further away (LOD 1) with another instanced draw call, etc, until you have rendered all the LODs. This minimizes your draw call count and roughly sorts the objects front to back (reducing pixel shader cost).

LODs will give you a big performance gain.
 
Funny you mention backface culling, I remember doing that on the CPU in the early days of 3D accelerators ^^
 
You could render viewer-facing hemispheres to avoid processing (most) backfaces. If the mesh is meant to be dense enough to get smooth silhouettes, then you probably won't notice the rotation, either.
 
You could render viewer-facing hemispheres to avoid processing (most) backfaces. If the mesh is meant to be dense enough to get smooth silhouettes, then you probably won't notice the rotation, either.
This is a good idea if you only need spheres. Simple to implement and saves half of the work.

You could also render screen space bounded quads and calculate sphere intersection analytically. Use discard (clip) when the pixel misses the sphere. Normal vector of a sphere's surface pixel = (pixel.position - sphere.position). If you want spheres to intersect properly with other scene geometry, you should output depth from the pixel shader. This method should be fully pixel shader (or fill) bound. You don't need to implement LODs either.
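
A GL 3.3 fragment shader sketch of that idea; the vertex shader that expands the screen-space quad and feeds rayDir is omitted, and the uniforms and shading are illustrative:

#version 330 core
// Ray/sphere test in view space: the eye is at the origin and rayDir is the
// view-space ray through this pixel, interpolated from the quad's vertices.
in vec3 rayDir;

uniform vec3  spherePos;       // view-space sphere centre
uniform float sphereRadius;
uniform mat4  proj;            // projection matrix, needed to output a correct depth

out vec4 fragColor;

void main() {
    vec3  d    = normalize(rayDir);
    float b    = dot(d, spherePos);
    float c    = dot(spherePos, spherePos) - sphereRadius * sphereRadius;
    float disc = b * b - c;
    if (disc < 0.0) discard;                     // pixel misses the sphere

    float t      = b - sqrt(disc);               // nearest intersection
    vec3  hit    = d * t;
    vec3  normal = normalize(hit - spherePos);   // sphere normal = hit - centre

    // Write real depth so the impostor intersects other scene geometry correctly.
    vec4 clip = proj * vec4(hit, 1.0);
    gl_FragDepth = (clip.z / clip.w) * 0.5 + 0.5;

    // Trivial lighting, just to show the normal is usable.
    float ndotl = max(dot(normal, normalize(vec3(0.3, 0.7, 0.6))), 0.0);
    fragColor = vec4(vec3(ndotl), 1.0);
}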

Distance field raytracing also solves the case of spheres (and other primitives) nicely, if you want to do this differently :)
 
You could also render screen space bounded quads and calculate sphere intersection analytically. Use discard (clip) when the pixel misses the sphere. Normal vector of a sphere's surface pixel = (pixel.position - sphere.position). If you want spheres to intersect properly with other scene geometry, you should output depth from the pixel shader. This method should be fully pixel shader (or fill) bound. You don't need to implement LODs either.
Oh, I did that in 2008 with a fragment shader, that was fun ^^
Sorry, I forgot about it and didn't mention it!
 
After a somewhat crude implementation of LOD I can handle about 4300 spheres at 60 FPS, instead of about 3500 spheres without LOD. That seems like a smaller gain than I would have expected, since over 80% of my spheres are displayed with lower details (80 triangles instead of 320). I'm not sure why. Perhaps making two instanced draw calls per frame instead of one is costly. That said, in fullscreen mode, I can go up to 5000 spheres on my crappy GPU, so I think once deployed on the machines that will run the final software, I should be just fine.

Once again, thank you all for your help.
 
If you want me to test on a GTX 970 to see how many spheres it can handle, just let me know.
 
I just remembered one last thing you can try. Make sure your vertex size won't cross a cache line boundary by making it 16, 32 or 64 bytes per vertex per stream. Add padding to reach that size. It can increase memory consumption, but that's not much of an issue for your current use case. Also, are you using 16-bit indexes?
 
If you want me to test on a GTX 970 to see how many spheres it can handle, just let me know.

I appreciate the offer, but it's tied to a whole bunch of stuff and kind of a pain to compile and even execute, which is why I haven't tried it on faster hardware—yet. But I'll be deploying it on machines that I believe are GT200-based fairly soon, and I also have a Radeon HD 6950 at my disposal.

I just remembered one last thing you can try. Make sure your vertex size won't cross a cache line boundary by making it 16, 32 or 64 bytes per vertex per stream. Add padding to reach that size. It can increase memory consumption, but that's not much of an issue for your current use case. Also, are you using 16-bit indexes?

Thanks, I'll look into that. Yes, I'm using 16-bit indexes.
 
Definitely optimize your vertex data size if you have more than 32 bytes per vertex. This can give you surprisingly big gains.

I would recommend storing the vertex position as 16 bit normalized values (instead of 32 bit floats). This halves the position data size, with no noticeable quality loss. Find the min/max xyz bounds of the vertices and scale the vertex coordinates to [0, 1] range (as an offline processing step). When rendering, adjust the object scale matrix accordingly to compensate (multiply by mesh scale). This way you do not need any extra shader math compared to the 32 bit float version.
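
On the OpenGL side this is just glVertexAttribPointer with GL_UNSIGNED_SHORT and normalized = GL_TRUE; the vertex shader is unchanged apart from the matrix. A trivial sketch, names are illustrative:

#version 330 core
// The attribute is stored as unorm16 but arrives here as floats in [0, 1];
// the mesh min/extent are folded into the matrix on the CPU, so there is no
// extra shader math compared to 32-bit float positions.
layout(location = 0) in vec3 position;

uniform mat4 modelViewProj;   // model part already includes the bounds scale/offset

void main() {
    gl_Position = modelViewProj * vec4(position, 1.0);
}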

Vertex normal, tangent and bitangent can be stored efficiently by various means. The simplest way is to reduce the data from nine floats to two 10-10-10-2 normalized integers (a 4.5x data reduction). Store normal.xyz and tangent.xyz in the 10-bit (xyz) channels. Calculate the bitangent by a cross product of the normal and tangent, and multiply the bitangent by the sign stored in one of the spare 2-bit (w) channels. Alternatively you can use normalized quaternions, but that needs more math and some advanced trickery.
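
In GLSL the packed attributes would be uploaded as GL_INT_2_10_10_10_REV with normalized = GL_TRUE and decoded roughly like this (names and locations are assumptions):

#version 330 core
layout(location = 0) in vec3 position;
layout(location = 1) in vec4 packedNormal;    // xyz = normal, w unused
layout(location = 2) in vec4 packedTangent;   // xyz = tangent, w = bitangent sign (+1 / -1)

uniform mat4 modelViewProj;

out vec3 vNormal;
out vec3 vTangent;
out vec3 vBitangent;

void main() {
    vNormal  = normalize(packedNormal.xyz);
    vTangent = normalize(packedTangent.xyz);
    // Bitangent = cross product of normal and tangent, flipped by the stored sign.
    vBitangent = cross(vNormal, vTangent) * sign(packedTangent.w);

    gl_Position = modelViewProj * vec4(position, 1.0);
}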

How do you store the instance data (object matrix)? Optimizing this data is also important.
 
You should listen to sebbbi's size advice before mine... mine was unconfirmed from about 15 years ago, so I don't know if it's still applicable. Oh, and do me a favor and check the front-to-back rendering again on the new machines; I'm curious if it's the card in question (if you have time).
 
You should listen to sebbbi's size advice before mine... mine was unconfirmed from about 15 years ago, so I don't know if it's still applicable. Oh, and do me a favor and check the front-to-back rendering again on the new machines; I'm curious if it's the card in question (if you have time).
Your advice is correct. 16 bytes and 32 bytes are perfect targets for vertex size optimizations (and even 64, if you really need more data and your vertex shader is complex enough to hide the memory latency). Cache line aligning your vertex data is always a good idea.

A good example of vertex size cost on GCN: I wrote a prototype terrain renderer with 32-bit float inputs. Position.xyz, normal.xyz, tangent.xyz, uv.xy, height (12 floats = 48 bytes). It was slow. I optimized the vertex format heavily and almost tripled the performance.

Terrain is a grid, so you can construct it using the (system value) vertex id. Divide it by the grid width and you get uv.y; modulo gets you uv.x. Now read a constant buffer value that stores the terrain min.xy and scale.xy. Multiply-add the uv.xy with this and you get the position.xy. Now use the uv to sample a heightmap texture (16 bits per pixel, normalized). To calculate the normal, tangent and bitangent, sample one heightmap texel neighbor from the x side and one from the y side. These samples are almost guaranteed to come from the L1 cache, since the neighboring vertex threads sample the same data. The result is that the bandwidth drops to 2 bytes per terrain vertex (from 48 bytes), a 24x reduction in bandwidth usage. This is important as the terrain has a million+ vertices. The extra ALU cost is not a problem since GCN has a high ALU:BW ratio. The same is true for your NVIDIA card.
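
In GLSL the same idea looks roughly like this vertex shader (gl_VertexID is available in GL 3.3 with no vertex buffer bound at all); the uniform names and the normal reconstruction details are illustrative:

#version 330 core
uniform sampler2D heightmap;      // GL_R16 normalized heightmap, 2 bytes read per vertex
uniform vec4  terrainMinScale;    // xy = terrain min, zw = terrain scale
uniform int   gridWidth;          // vertices per terrain row
uniform float heightScale;
uniform mat4  viewProj;

out vec3 vNormal;

void main() {
    // uv from the vertex id: modulo gives one axis, division the other.
    ivec2 gridPos = ivec2(gl_VertexID % gridWidth, gl_VertexID / gridWidth);
    vec2  uv      = vec2(gridPos) / float(gridWidth - 1);

    // position.xz = multiply-add of uv with the terrain min/scale constants.
    vec2  posXZ = terrainMinScale.xy + uv * terrainMinScale.zw;
    float h     = textureLod(heightmap, uv, 0.0).r * heightScale;

    // One neighbour sample per axis to build the normal; these mostly hit the cache.
    vec2  texel = 1.0 / vec2(textureSize(heightmap, 0));
    float hx = textureLod(heightmap, uv + vec2(texel.x, 0.0), 0.0).r * heightScale;
    float hz = textureLod(heightmap, uv + vec2(0.0, texel.y), 0.0).r * heightScale;

    vec3 dx = vec3(terrainMinScale.z * texel.x, hx - h, 0.0);
    vec3 dz = vec3(0.0, hz - h, terrainMinScale.w * texel.y);
    vNormal = normalize(cross(dz, dx));

    gl_Position = viewProj * vec4(posXZ.x, h, posXZ.y, 1.0);
}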

You can do the same tricks with sphere rendering. System value vertex id is available even in DirectX 9. You don't need the tessellator (or the geometry shader) to generate geometry on GPU.

Mobile cards tend to be more bandwidth starved. It's important to notice that you can be BW starved even if you don't use all your BW at all times. Vertex shader execution tends to be bursty, and you can see a BW bottleneck during this burst, even if the memory controller is almost idling when the pixel shader outputs 32 bpp (untextured) data to the render target. Add some pixel shader load to utilize your BW better (and to provide the VS some latency hiding). Your test case seems to show a bottleneck that is not common in real games.
 