Drawing many moving objects with OpenGL

You just need to multiply the vertex position by the scale factor and add the sphere's world position. Hardly an expensive operation, 3 MADs.
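In shader terms, that's just something like this (a sketch; u_scale and sphereCenter are illustrative names):

Code:
// scale then translate: three MADs per vertex
vec3 worldPos = in_position * u_scale + sphereCenter;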


Do you have the correct buffer (not texture) bound to GL_TEXTURE_BUFFER at the time?

I didn't! Nice catch, thank you. So here's the updated version:

Code:
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_BUFFER, TBO_TEX_ID);
glBindBuffer(GL_TEXTURE_BUFFER, TBO_ID);
glBufferData(GL_TEXTURE_BUFFER, sphere_array_size, spheres, GL_DYNAMIC_DRAW);
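// note: the size argument is in bytes, not the element count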
// spheres is an array of floats, containing the positions of the spheres, laid out as such:
// sphere0x, sphere0y, sphere0z, sphere1x, sphere1y, sphere1z, sphere2x, sphere2y, sphere2z, etc.
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGB32F, TBO_ID);
glUniform1i(u_TBO_ID, 0);

The error is gone. However, my TBO still appears to contain only zeros, as read in the shader.

Edit: never mind, problem fixed, it was a stupid error related to the size of my array of data.
 
Here's an update:

I've figured out how to use a Texture Buffer Object instead of a uniform float array to send the sphere positions to the shader, which had no effect on performance but allowed me to display a lot more objects.
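For reference, here's roughly what the shader side of that can look like — a sketch assuming instanced drawing and a samplerBuffer uniform bound to unit 0 as in the code above, with illustrative names:

Code:
uniform samplerBuffer u_spheres;  // the buffer texture set up on the C side

void main() {
    // one GL_RGB32F texel per sphere; gl_InstanceID selects the current one
    vec3 sphereCenter = texelFetch(u_spheres, gl_InstanceID).xyz;
    // ... transform the vertex using sphereCenter ...
}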

Then I moved matrix creation to the CPU, and I now use the TBO to send the matrices instead of the positions. This means I'm no longer computing one M and MVP matrix per vertex, just per sphere. That's a lot more sensible, but it has had no effect on performance. This presumably means I'm not bound by shaders.
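A minimal sketch of that CPU-side path, assuming GLM for the math and a GL_RGBA32F buffer texture (a mat4 then occupies 4 consecutive texels); VP, positions and sphere_count are illustrative names:

Code:
#include <vector>
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

std::vector<glm::mat4> mvps(sphere_count);
for (size_t i = 0; i < sphere_count; ++i)
    mvps[i] = VP * glm::translate(glm::mat4(1.0f), positions[i]); // one MVP per sphere

glBindBuffer(GL_TEXTURE_BUFFER, TBO_ID);
glBufferData(GL_TEXTURE_BUFFER, mvps.size() * sizeof(glm::mat4),
             mvps.data(), GL_DYNAMIC_DRAW);
// shader side: mat4 MVP = mat4(texelFetch(tbo, i*4 + 0), texelFetch(tbo, i*4 + 1),
//                              texelFetch(tbo, i*4 + 2), texelFetch(tbo, i*4 + 3));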

I fixed the mesh to use indexing properly, which reduced my number of vertices by a factor of about 5.92, and I did get a 2-fold performance boost from this. That is, instead of being able to display 1000 spheres at 60 FPS, I can now handle 2000. After that, performance steadily decreases. My intuition would be that I'm now bound by triangles. Does that sound right to you?
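For anyone following along, "indexing properly" is the standard element-buffer path; a sketch with illustrative names:

Code:
// shared vertices are stored once and referenced by index, so the
// post-transform cache can reuse shaded results for repeated indices
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, IBO_ID);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, index_count * sizeof(GLuint),
             indices, GL_STATIC_DRAW);
glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, 0);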

Should vertex cache optimization be the next step?

Thanks!
 
Should vertex cache optimization be the next step?
Yes, there's some code from Tom Forsyth floating around, I think, or derivative work with source code available that's rather good.
You might check the AMD & NV websites too, as they may have tools for that. (Although I remember that at one point in time they weren't as good as TomF's code or one of its derivatives.)
 
Yes, there's some code from Tom Forsyth floating around, I think, or derivative work with source code available that's rather good.
You might check the AMD & NV websites too, as they may have tools for that. (Although I remember that at one point in time they weren't as good as TomF's code or one of its derivatives.)

Thank you. I presume you're talking about this: https://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html

There's a fairly complete description of the technique, plus some links to implementations, so I should be OK. Provided I get a decent performance boost from this, I ought to have something fast enough, at least for now.

Still, I have to say I'm a bit surprised that just drawing a bunch of spheres is that complicated, although Homerdog has a point about having the world's slowest GPU.
 
Then I moved matrix creation to the CPU, and I now use the TBO to send the matrices instead of the positions. This means I'm no longer computing one M and MVP matrix per vertex, just per sphere. That's a lot more sensible, but it has had no effect on performance. This presumably means I'm not bound by shaders.

I fixed the mesh to use indexing properly, which reduced my number of vertices by a factor of about 5.92, and I did get a 2-fold performance boost from this. That is, instead of being able to display 1000 spheres at 60 FPS, I can now handle 2000. After that, performance steadily decreases. My intuition would be that I'm now bound by triangles. Does that sound right to you?
Indexed primitives enable the post-transform vertex cache, thus saving you shader invocations, so you were shader bound. I am rather curious as to why you didn't get a performance boost from the first fix.

I think you should quickly do a front-to-back distance sort on your spheres next, while working on better vertex cache utilization after that. Eliminating overdraw will reduce the bandwidth requirement of drawing a lot, which should help, especially on that card. After that, LOD.
 
Indexed primitives enable the post-transform vertex cache, thus saving you shader invocations, so you were shader bound. I am rather curious as to why you didn't get a performance boost from the first fix.

I think you should quickly do a front-to-back distance sort on your spheres next, while working on better vertex cache utilization after that. Eliminating overdraw will reduce the bandwidth requirement of drawing a lot, which should help, especially on that card. After that, LOD.

That makes sense.

I'm not exactly sure why I didn't get a boost from removing unnecessary matrix computations from the vertex shader, but the latter still does some stuff. Perhaps the number of times it's called is more important than precisely how much work each instance does, due to some sort of overhead. In the end, all I've really removed from the shader is creating the model matrix (which was just a mat4 initialization) and computing the MVP.

I really don't know what's going on. I'd never done any serious graphics programming before (if you can call what I'm doing serious) and I have to say it feels a bit like an occult science.


I guess I'll start with the sort; after all, it's pretty easy to do. Thanks again.
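A minimal sketch of that sort, assuming GLM and a hypothetical Sphere struct with a pos member:

Code:
#include <algorithm>
#include <vector>
#include <glm/glm.hpp>

struct Sphere { glm::vec3 pos; /* ... */ };

void sortFrontToBack(std::vector<Sphere>& spheres, const glm::vec3& camPos) {
    std::sort(spheres.begin(), spheres.end(),
              [&](const Sphere& a, const Sphere& b) {
                  glm::vec3 da = a.pos - camPos, db = b.pos - camPos;
                  return glm::dot(da, da) < glm::dot(db, db); // compare squared distances
              });
}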
 
You might be memory bound in your vertex shader if your ALU optimizations do not help at all. How many bytes are you reading per vertex?
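For a rough sense of scale (an assumed layout, not necessarily the one in use here): a vec3 position plus a vec3 normal from the vertex buffer is 12 + 12 = 24 bytes, and a mat4 fetched from the TBO adds another 64 bytes, so roughly 88 bytes per vertex invocation, dominated by the matrix fetch.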
 
Yes, there's some code from Tom Forsyth floating around, I think, or derivative work with source code available that's rather good.
You might check the AMD & NV websites too, as they may have tools for that. (Although I remember that at one point in time they weren't as good as TomF's code or one of its derivatives.)

The code is only popular because it's public domain. It's very basic, and its assumptions don't hold that well with new architectures. If you want something really advanced, then use Tootle 2. It's on the AMD dev pages.
 
http://gfx.cs.princeton.edu/pubs/Sander_2007_TR/tipsy.pdf

This algorithm is fast enough to run at mesh load time (linear time), so you can optimize specifically for the current GPU's vertex cache size. It achieves close to perfect vertex cache utilization and at the same time orders the mesh triangles in a way that reduces the average overdraw. For spheres this obviously is not helpful (since a sphere is convex and cannot cause overdraw with itself). In general I prefer to use an overdraw-sensitive algorithm instead of a pure vertex cache optimizer.
 
It's the algorithm used in Tootle.
That's nice. We have used a custom tipsify implementation for two released games already. I recommend it. If I remember correctly, it improved our g-buffer rendering performance by almost 10% compared to the best vertex cache optimizer without overdraw reduction.

However, now we use something completely different, because our engine does a coarse runtime sort of object triangles (as a side benefit). This provides similar depth rejection performance as a depth prepass (but without the need to render the geometry twice).
 
Good news, everyone!

Using an implementation of Tom Forsyth's mesh optimization technique, I can display 3000 spheres at 60 FPS (with occasional dips to 54, for some reason) as opposed to 2000 without it.

However, I've noticed something strange: if I crank it up to 4000 spheres, my frame rate drops to 30. I find it odd that adding 33% more spheres should cut performance in half. It gets weirder:

Code:
3000   spheres: 60 FPS
3500   spheres: 40 FPS
4000   spheres: 30 FPS
6000   spheres: 30 FPS (?!)
12000  spheres: 18 FPS

This looks like some sort of threshold effect, the sort of thing that might happen when you run out of space in a cache. According to this, my GPU only has 128kB of L2 cache. Could this be it?

I guess I'll try sorting my spheres now.
 
I tried turning VSync off: I get about 55 FPS with 4000 spheres instead of 30! I didn't think VSync would be this costly. I'm controlling it with glfwSwapInterval, by the way.

Without VSync:

Code:
2000 spheres:   95 FPS
3000 spheres:   70 FPS
3500 spheres:   60 FPS
4000 spheres:   55 FPS
6000 spheres:   40 FPS
12000 spheres:  22 FPS

I guess I'd better leave it off!
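For reference, the GLFW call in question (it applies to the current context):

Code:
glfwSwapInterval(1);  // VSync on: buffer swaps wait for the vertical retrace
glfwSwapInterval(0);  // VSync off: swap immediately (may tear)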
 
Well, oddly enough, sorting the spheres doesn't have a measurable impact on performance, even with a very large number of them and a great deal of occlusion.

Does this mean that I'm not fragment shader-bound at all?


I guess I should try LOD. I'm not sure if there's a textbook way to do it; I've searched around and haven't been able to find much. Should I just sort my spheres by distance to the camera and render the close ones with a fine mesh and the distant ones with a coarser one? Is there a more elegant way? I know that's the basic principle; what I'm asking is whether there are OpenGL commands for this or whether it all has to be done manually.
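(There's no dedicated OpenGL command for discrete geometry LOD; you pick the mesh per object, or per distance bucket, on the CPU. A minimal sketch, with hypothetical names for the index buffers, counts, and threshold:)

Code:
glm::vec3 d = sphere.pos - camPos;
bool close = glm::dot(d, d) < LOD_SWITCH_DIST * LOD_SWITCH_DIST; // squared distance test
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, close ? fineIBO : coarseIBO);
glDrawElements(GL_TRIANGLES, close ? fineCount : coarseCount, GL_UNSIGNED_INT, 0);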
 
Use a long fragment shader, or a constant-colour-output one, and see how it changes.
Any idea how many pixels/fragments you cover per triangle on average?
 
Use a long fragment shader, or a constant-colour-output one, and see how it changes.
Any idea how many pixels/fragments you cover per triangle on average?

I added a stupidly long loop to the fragment shader, and it has no perceivable effect on performance, even without sorting:

Code:
float foo = 0.1f;
for(int i=0; i<16384; ++i) {
  foo *= 1.01f;
}

I can't precisely quantify how many fragments I have per triangle, but I have 320 triangles per sphere (of which I suppose at most 160 are visible) and fewer pixels than that on average:

[screenshot]

The above is an example with 12,000 spheres.
 
I added a stupidly long loop to the fragment shader, and it has no perceivable effect on performance, even without sorting:

Code:
float foo = 0.1f;
for(int i=0; i<16384; ++i) {
  foo *= 1.01f;
}

If the compiler can figure out foo is dead it'll optimize it away thus yielding no impact. Try to write foo out, or use it in a computation that yields observable results, to ensure this does not happen.
 
If the compiler can figure out foo is dead it'll optimize it away thus yielding no impact. Try to write foo out, or use it in a computation that yields observable results, to ensure this does not happen.

Yes, I figured as much and used it to compute the final color. Of course the compiler could also figure out that foo is in fact 0.1×1.01^16384, but I kind of doubt that.
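Something along these lines keeps the loop live (a sketch; out_color is an illustrative output variable):

Code:
float foo = 0.1f;
for (int i = 0; i < 16384; ++i)
    foo *= 1.01f;
// any dependence on foo in the output defeats dead-code elimination;
// min() keeps the visible contribution negligible even though foo overflows
out_color.rgb += vec3(min(foo, 1e-6));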

Edit: Oh, and if I use a very coarse sphere mesh (icosphere with just 2 subdivisions) I get a bit over 60 FPS with 12,000 spheres instead of about 22 FPS. So I guess that I'm mostly primitive or vertex shader-bound and I might get +50~100% from LOD.

Second edit: Another way to look at it is that I can display 12,000 spheres at 60 FPS instead of 3,500, which is about 3.42× as many spheres. And that is from using a sphere with 80 triangles instead of 320, or a reduction by a factor of 4. That seems like a pretty clear case of primitive/vertex-shader bottleneck to me.
 
I can't precisely quantify how many fragments I have per triangle, but I have 320 triangles per sphere (of which I suppose at most 160 are visible) and fewer pixels than that on average:

[screenshot]

The above is an example with 12,000 spheres.

In that example it seems quad overshading could be a concern (GPUs shade fragments in 2×2 quads, so small triangles waste a large fraction of fragment shader invocations on partially covered quads), and sadly there's nothing you can really do about it. You could measure it precisely, though.
 