Drawing many moving objects with OpenGL

Alexko

Hi,

I'm writing a little piece of software for which I need to draw a lot of moving spheres (at least hundreds). They move independently of one another and may have different colors as well. I know very little about OpenGL, so I'm really just winging it here. I thought of three different possible ways to do this:

Option 1, Giant unified mesh, CPU-managed: The idea here would be to have a single Vertex Buffer Object containing all the vertices of all the spheres, and to move each vertex individually in the C++ code. Then I would just have to render a single object. This seemed too CPU-heavy and not very much in the spirit of OpenGL, so I didn't actually do this. Should I have?

Option 2, Naive rendering of each sphere separately: Here I just upload the mesh of my sphere, compute the Model-View-Projection matrix + color vector of the first sphere in C++, upload them to the GPU, and draw. I upload the MVP+color of the second sphere, draw, upload the MVP+color of the third sphere, and so on. This works, and I get 60 FPS with up to 500 spheres, but after that performance starts to decline pretty sharply.

Using two quad-core 2.4GHz Haswell Xeons and a crappy NVIDIA NVS 310 (GF119, so Fermi with 48 Shaders) I get about 30 FPS with 1 000 spheres and a 4 FPS slideshow with 12 000 spheres. I guess that's probably normal given the number of draw calls. I'm rendering in a 1000×1000 window with AA 4× (by simply using glfwWindowHint(GLFW_SAMPLES, 4);).

Option 3, Instanced rendering: OK, so then I heard about instanced rendering, so I figured I ought to do that. I now do almost nothing on the CPU, I just upload the mesh once, plus an array with all the positions and an array with all the colors of each sphere. Then, in the shader, I get the correct position based on gl_InstanceID, from which I can build the MVP matrix in the shader itself; I get the color in a similar fashion and I can render the sphere. Thus, all the heavy-lifting is done on the GPU.

Problems: while this works, I seem to be limited to about 1000 (probably closer to 1024) elements in my array of sphere positions. If I try to go higher, I get an error telling me OpenGL couldn't locate a suitable resource to bind my variable. Worse yet, this limit appears to be global to all uniform variables, because when I added an array of colors, I had to go down to 500 spheres.

And, on top of this, there appears to be no measurable performance benefit over option 2, which really surprised me. As far as I can tell I only have one draw call per frame now, although I can't really be sure of what the driver is actually doing.

Apparently, I could get over the size limitation by using a one-dimensional texture, so perhaps I ought to try that. But still, shouldn't this be faster?


Thoughts? Did I overlook something to cause such lackluster performance? Is instanced rendering even the right option in this case? Have I even done anything that wasn't completely stupid?

Thanks!
 
Not OpenGL-specific, but since you aren't texturing the spheres and each has a uniform color, you will get a perf boost from sorting the spheres front to back, eliminating overdraw and taking advantage of the early/Hi-Z features of the video card. If the spheres move, sort per frame. Use LOD to reduce the triangle count per mesh based on distance. That's about all I can think of.

Oh and in this thread a link was posted to this document which might interest you.

edit- and I do think instancing is the right way to go.
 
Perhaps Option 3 is slow because you have the world's slowest GPU. Even a super old Quadro 2000 is way faster.
 
Then, in the shader, I get the correct position based on gl_InstanceID, from which I can build the MVP matrix
If I'm remembering my instancing stuff correctly, doing that per-vertex is the wrong approach and may have performance implications, more so on such a slow card.

Oh, and since each sphere is one solid color, if you can figure out how not to rotate the sphere you could do away with the back faces of the model. If not, I had once thought about (hadn't figured out the math, so not sure if it's possible) taking groups of triangles to create a "range" (can't be too big) of normals and then trying to back-face cull them all at once, then using multidraw/multidraw-indirect to select only the batches that contain one or more front-facing triangles for a draw call. Don't know if it will help your case, just something you might want to look into.
 
Thank you all for your help!

Not OpenGL-specific, but since you aren't texturing the spheres and each has a uniform color, you will get a perf boost from sorting the spheres front to back, eliminating overdraw and taking advantage of the early/Hi-Z features of the video card. If the spheres move, sort per frame. Use LOD to reduce the triangle count per mesh based on distance. That's about all I can think of.

Oh and in this thread a link was posted to this document which might interest you.

edit- and I do think instancing is the right way to go.

What's the connection between the fact that I'm not texturing the spheres and early/hi-z rejection?

I hadn't considered sorting the spheres, and it seems costly (especially with >10 000 of them) but then again, it may well be less costly than rendering things I can't see. And I suppose it's the only way to do distance-based LOD anyway, isn't it?

How many triangles per sphere? If you have too many, you might be primitive bound.

I had 720 triangles per sphere. I figured I should be fine on the primitive side, because with 1 000 spheres, that's only 720 000 triangles per frame, or about 43 million per second at 60 FPS. I thought modern GPUs could comfortably handle much more than that. My GF119 should be able to do 1 triangle per cycle (maybe 0.5 if it's really gimped) and I don't know exactly how fast it's running, but even at 500MHz it should be able to do 250 to 500 million triangles per second.
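For reference, that back-of-the-envelope arithmetic can be written out as a trivial sketch (the function name is made up for illustration):

```cpp
// Rough triangle-throughput estimate: tris/sphere * spheres * FPS.
// With 720 triangles, 1,000 spheres and 60 FPS this gives 43,200,000
// triangles per second -- the "about 43 million" figure quoted above.
long long trianglesPerSecond(long long trisPerSphere,
                             long long spheres, long long fps) {
    return trisPerSphere * spheres * fps;
}
```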

And yet, per your suggestion I tried it with a coarser mesh (320 triangles per sphere) and I can now handle 1 000 spheres at 60 FPS, vs. 30 FPS before. So I guess I must be running into some sort of primitive-related bottleneck at some point. I can't really afford to go lower for the spheres that are close to the camera, but it certainly is overkill for the very distant ones.

Perhaps Option 3 is slow because you have the world's slowest GPU. Even a super old Quadro 2000 is way faster.

That sure isn't helping. Thankfully it's just a development platform, and the actual application will run on somewhat faster hardware.

If I'm remembering my instancing stuff correctly, if you're doing that per-vertex, you're doing it wrong and might have performance implications more so on such a slow card.

Oh and since you're one solid color If you can figure out how not to rotate the sphere you could do away with the back-faces of the model. If not I had once thought about (hadn't figured out the math... so not sure if possible) taking groups of triangles to create a "range" (can't be to big) of normals and then try to back-face cull them at once. Then use multidraw/multidraw-indirect to select only batches that contain one or more front facing for a draw call. Don't know if it will help your case, just something you might want to look into.

Come to think of it, yeah, what I'm doing may not be very clever, since I compute the same MVP matrices for each vertex of a given sphere. I should probably compute the matrices on the CPU and upload them to the GPU. The downside would be sending more data to the GPU, and being even more limited by the maximum size of uniform arrays.

Which means I should do it through a texture, I guess. Or should I just call glDrawElementsInstanced on batches of a few hundred spheres at a time and update the uniforms for each batch? What makes the most sense?

As for back faces, strictly speaking I'm not rotating the spheres right now so I could indeed remove the corresponding vertices, but at some point I will be doing more complex things with cameras, so things might be tricky. But I'll certainly make a note of it and think about making it work if I still need more performance when the time comes.

I'm not sure I really understand the rest of your post, but I'll make a note of it too.

I'm still pretty sure that I must be doing something pretty stupid that I should fix before doing anything this fancy.

Thanks again!
 
I hadn't considered sorting the spheres, and it seems costly (especially with >10 000 of them) but then again, it may well be less costly than rendering things I can't see.

Sorting a list of 100k points by (squared) distance to the camera should be pretty fast on any computer made in the last decade, especially since you can just run a single bubble-sort pass per frame and get acceptable results. We are talking a few MB/s and a few megaflops per frame.
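A minimal CPU-side sketch of that front-to-back sort (hypothetical `Vec3` and `sortFrontToBack` names; a full `std::sort` rather than the incremental bubble pass, for clarity):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Squared distance: skips the sqrt, which doesn't change the ordering.
inline float distSq(const Vec3& a, const Vec3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Return sphere indices sorted front to back relative to the camera.
// Sorting indices rather than the sphere data keeps per-frame copies small.
std::vector<std::size_t> sortFrontToBack(const std::vector<Vec3>& centers,
                                         const Vec3& camera) {
    std::vector<std::size_t> order(centers.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) {
                  return distSq(centers[a], camera) < distSq(centers[b], camera);
              });
    return order;
}
```

The resulting index order can drive both the draw order and per-sphere LOD selection.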
 
What's the connection between the fact that I'm not texturing the spheres and early/hi-z rejection?
If you were texturing, the common advice you'd be given by some is to sort by shader, then texture, then distance... proper draw order/technique is scene/view dependent and can be hard to get right. Also, instancing requires the same state per call, so texturing breaks the batching (if you use more than one texture).

I hadn't considered sorting the spheres, and it seems costly (especially with >10 000 of them) but then again, it may well be less costly than rendering things I can't see. And I suppose it's the only way to do distance-based LOD anyway, isn't it?
Sorting is already pretty fast, especially since 10,000 distances and IDs will fit entirely in the cache. But if you still have performance concerns, since things aren't teleporting (I'm assuming), use spatial subdivision to create buckets and sort the smaller buckets (sorting four 2,500-element lists is faster than sorting one 10,000-element list), then either mergesort them or merge on sending from your draw loop. And no, you don't need to sort to do LOD. Sorting becomes more complicated when you go past spheres (think long walls).

Come to think of it, yeah, what I'm doing may not be very clever, since I compute the same MVP matrices for each vertex of a given sphere. I should probably compute the matrices on the CPU and upload them to the GPU. The downside would be sending more data to the GPU, and being even more limited by the maximum size of uniform arrays.
Uploading matrices and setting up the MVP matrices on the CPU really shouldn't be a problem, but you can do that on the GPU as well if necessary. (I don't know about OpenGL, but DirectX has stream-out.) If you look at the PDF I linked, they point out uniform buffer objects... what version of OpenGL are you targeting?

and I don't know exactly how fast it's running, but even at 500MHz it should be able to do 250 to 500 million triangles per second.
You're not taking fill rate and fill bandwidth into account.

Come to think of it, yeah, what I'm doing may not be very clever, since I compute the same MVP matrices for each vertex of a given sphere. I should probably compute the matrices on the CPU and upload them to the GPU. The downside would be sending more data to the GPU, and being even more limited by the maximum size of uniform arrays.
Yeah, a per-vertex MVP matrix is slowing you down, and again, doing it on the CPU really shouldn't be a problem. But like you said, draw calls might be slowing you down at 10,000 calls per frame.

Which means I should do it through a texture, I guess. Or should I just call glDrawElementsInstanced on batches of a few hundred spheres at a time and update the uniforms for each batch? What makes the most sense?
I don't really know OpenGL... hopefully someone who does can chime in... and I only really started getting back into DX recently.

I'm still pretty sure that I must be doing something pretty stupid that I should fix before doing anything this fancy.
Definitely fix the per-vertex MVP first. After that, LOD. After that, sort by distance. All the while, find out the best way to reduce draw calls.
 
Thanks. I'll try all that. This has somewhat suddenly been taken down a notch in my list of priorities, but I'll get back to it in short order and keep this thread updated.
 
I had 720 triangles per sphere. I figured I should be fine on the primitive side, because with 1 000 spheres, that's only 720 000 triangles per frame, or about 43 million per second at 60 FPS. I thought modern GPUs could comfortably handle much more than that. My GF119 should be able to do 1 triangle per cycle (maybe 0.5 if it's really gimped) and I don't know exactly how fast it's running, but even at 500MHz it should be able to do 250 to 500 million triangles per second.

And yet, per your suggestion I tried it with a coarser mesh (320 triangles per sphere) and I can now handle 1 000 spheres at 60 FPS, vs. 30 FPS before. So I guess I must be running into some sort of primitive-related bottleneck at some point. I can't really afford to go lower for the spheres that are close to the camera, but it certainly is overkill for the very distant ones.
What kind of primitive topology are you using? Are you using index buffers? Do you optimize your mesh for the vertex cache? Index buffers give a big boost compared to storing three unique vertices per triangle. Vertex cache optimization also gives a big boost. With these two you should see around a 2x-3x perf boost (if you are purely primitive/vertex bound).

The next optimization is to use tessellation and/or multiple LODs. This way you can have perfect geometry density no matter how far away from the camera the sphere is. A 10x+ perf boost is easy to achieve if you are purely primitive/vertex bound.

Depth sorting gives a big boost for pixel shader bound cases. A fast (CPU) radix sorter takes less than 1 ms to sort 10k objects.
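A sketch of such a radix sort on non-negative float keys (e.g. squared camera distances); the names here are made up for illustration:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// LSD radix sort on non-negative float keys. For non-negative IEEE-754
// floats the raw bit pattern sorts in the same order as the value, so we
// can sort the reinterpreted uint32s byte by byte: four counting-sort
// passes, O(n) overall, no comparisons.
std::vector<float> radixSort(const std::vector<float>& keys) {
    std::vector<std::uint32_t> a(keys.size()), b(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i)
        std::memcpy(&a[i], &keys[i], sizeof(float));
    for (int shift = 0; shift < 32; shift += 8) {
        std::uint32_t count[257] = {0};                  // histogram, offset by 1
        for (std::uint32_t v : a) ++count[((v >> shift) & 0xFF) + 1];
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];  // prefix sum
        for (std::uint32_t v : a) b[count[(v >> shift) & 0xFF]++] = v;
        a.swap(b);                                       // stable pass done
    }
    std::vector<float> out(keys.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        std::memcpy(&out[i], &a[i], sizeof(float));
    return out;
}
```

In practice you would sort (key, sphere ID) pairs rather than bare keys, but the pass structure is the same.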

And if you have a matrix per vertex, then that's a big no-no. One matrix per sphere is enough. Fixing this should be your highest priority. Other optimizations (except for tessellation / LOD) will not likely matter if you have super fat vertices with matrices on each.
 
You could also think of other approaches that don't need triangles.
1) Ray tracing: ray-sphere intersection is easy and fast, and some binning or a hierarchical bounding volume may help reduce the number of spheres to test per ray. Should be plenty fast.
2) Using a precomputed color + depth texture for one sphere. After that it's basically just simple bit blitting. Also very fast.

Both should be faster than using triangle geometry.
 
Thanks. I'll try all that. This has somewhat suddenly been taken down a notch in my list of priorities, but I'll get back to it in short order and keep this thread updated.
Good Luck, I'll check back in on this thread.

What kind of primitive topology are you using? Are you using index buffers? Do you optimize your mesh for the vertex cache?
I knew I was forgetting something... vertex caching.
 
Uploading matrices and setting up the MVP matrices on the CPU really shouldn't be a problem, but you can do that on the GPU as well if necessary.
Yeah, I'm bugging out... I don't know what I was thinking when I wrote that. You could also use a per-instance "world" matrix and one "view-projection" matrix per instanced call, and transform the vertex by the world matrix then the VP matrix in the shader. Sorry about that.
 
Given they're spheres you only really need world position (and maybe a uniform scale factor), not a full matrix per instance.
 
What kind of primitive topology are you using? Are you using index buffers? Do you optimize your mesh for the vertex cache? Index buffers give a big boost compared to storing three unique vertices per triangle. Vertex cache optimization also gives a big boost. With these two you should see around a 2x-3x perf boost (if you are purely primitive/vertex bound).

The next optimization is to use tessellation and/or multiple LODs. This way you can have perfect geometry density no matter how far away from the camera the sphere is. A 10x+ perf boost is easy to achieve if you are purely primitive/vertex bound.

Depth sorting gives a big boost for pixel shader bound cases. A fast (CPU) radix sorter takes less than 1 ms to sort 10k objects.

And if you have a matrix per vertex, then that's a big no-no. One matrix per sphere is enough. Fixing this should be your highest priority. Other optimizations (except for tessellation / LOD) will not likely matter if you have super fat vertices with matrices on each.

What do you mean by primitive topology? I'm using triangles, if that's your question. And yes, I have an index buffer, but I've just noticed that the .obj loader I use, which I borrowed from an OpenGL tutorial, duplicates all the vertices in order to have three unique vertices per triangle. I end up with over 5 times as many vertices as necessary, so I'll fix that ASAP.
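A minimal sketch of that fix, welding the duplicated vertices back into an indexed mesh (hypothetical `Vertex` type, position-only for simplicity; a real loader would also compare normals and UVs):

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct Vertex { float x, y, z; };

// Rebuild an index buffer from a "flat" triangle list in which every
// triangle stores three unique vertices. Identical vertices are merged,
// and each triangle corner becomes an index into the deduplicated array.
void buildIndexBuffer(const std::vector<Vertex>& flat,
                      std::vector<Vertex>& unique,
                      std::vector<std::uint32_t>& indices) {
    std::map<std::tuple<float, float, float>, std::uint32_t> seen;
    unique.clear();
    indices.clear();
    for (const Vertex& v : flat) {
        auto key = std::make_tuple(v.x, v.y, v.z);
        auto it = seen.find(key);
        if (it == seen.end()) {
            std::uint32_t idx = static_cast<std::uint32_t>(unique.size());
            seen.emplace(key, idx);
            unique.push_back(v);
            indices.push_back(idx);
        } else {
            indices.push_back(it->second);  // reuse the existing vertex
        }
    }
}
```

The `unique` array goes in the VBO and `indices` in the index buffer for `glDrawElements`-style calls.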

So as you may have guessed I don't optimize my mesh for vertex cache, but then again I have no idea how to do that. I'll look it up.

I don't do a great deal with the matrices, I basically just create the Model matrix from the sphere's position, but I do compute one MVP per vertex, and that is indeed quite silly. That's the first thing I intend to fix.

You could also think of other approaches that don't need triangles.
1) Ray tracing: ray-sphere intersection is easy and fast, and some binning or a hierarchical bounding volume may help reduce the number of spheres to test per ray. Should be plenty fast.
2) Using a precomputed color + depth texture for one sphere. After that it's basically just simple bit blitting. Also very fast.

Both should be faster than using triangle geometry.

Yes, I could do all that, but ultimately I may have to render something other than spheres, including objects of arbitrary shapes that would likely require meshes. Plus, I look at this as a learning experience for "traditional" rendering with meshes.

Forgot to ask, but you are frustum culling, right?

All my spheres are always in the frustum, so I don't have to.

Given they're spheres you only really need world position (and maybe a uniform scale factor), not a full matrix per instance.

I currently create the model matrix in the shader:

Code:
// sFactor is the scaling factor
// trans is the translation vector, i.e. the position of the sphere
// Note: the GLSL mat4 constructor is column-major, so the last
// column below holds the translation.
mat4 newM = mat4(
    sFactor, 0.0,     0.0,     0.0,
    0.0,     sFactor, 0.0,     0.0,
    0.0,     0.0,     sFactor, 0.0,
    trans.x, trans.y, trans.z, 1.0
);

I should be doing this in the C++ application, not in the shader, but I don't see how I could do any less than that.
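For what it's worth, the CPU-side equivalent is just as small (a sketch; `makeModelMatrix` is a made-up name, and the array uses OpenGL's column-major layout, matching the GLSL constructor above):

```cpp
#include <array>

// Build the same scale + translate model matrix on the CPU, in
// OpenGL's column-major layout: elements 0..3 are the first column,
// elements 12..14 are the translation.
std::array<float, 16> makeModelMatrix(float s, float tx, float ty, float tz) {
    std::array<float, 16> m = {
        s,  0,  0,  0,
        0,  s,  0,  0,
        0,  0,  s,  0,
        tx, ty, tz, 1
    };
    return m;
}
```

Whether it runs on the CPU or in the shader, this construction is just moves of existing scalars, which is why it costs essentially nothing either way.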



Edit: never mind all that Texture Buffer Object nonsense, I've fixed my problem.
 
I should be doing this in the C++ application, not in the shader, but I don't see how I could do any less than that.
Why? Constructing the matrix like that in a shader is a no-op on scalar architectures.

However, I have a couple of issues. First, the official OpenGL documentation states that GL_MAX_TEXTURE_BUFFER_SIZE must be at least 65536, yet cout << GL_MAX_TEXTURE_BUFFER_SIZE; yields 35883. It's not the end of the world, but what's up with that?
You have to query the value at runtime using glGetIntegerv(GL_MAX_TEXTURE_BUFFER_SIZE, ...)
GL_MAX_TEXTURE_BUFFER_SIZE just identifies the value to get, it isn't the value itself.
 
Why? Constructing the matrix like that in a shader is a no-op on scalar architectures.

Yes but then I use it to compute the MVP, which I therefore do for each vertex. That's probably not wise.

You have to query the value at runtime using glGetIntegerv(GL_MAX_TEXTURE_BUFFER_SIZE, ...)
GL_MAX_TEXTURE_BUFFER_SIZE just identifies the value to get, it isn't the value itself.

Ah, silly me, thank you!
Edit: indeed, I get over 100 million with glGetIntegerv, which ought to be plenty! If only I could get it to work, that is.
 
OK, the problem seems related to the following line:

Code:
glBufferData(GL_TEXTURE_BUFFER, sizeof(spheres), spheres, GL_DYNAMIC_DRAW);

It generates an INVALID_OPERATION (1282) error, but I fail to see why. It's pretty much identical to the example here: https://gist.github.com/roxlu/5090067
 
Yes but then I use it to compute the MVP, which I therefore do for each vertex. That's probably not wise.
You just need to multiply the vertex position by the scale factor and add the sphere's world position. Hardly an expensive operation: 3 MADs.
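In code, the per-vertex work reduces to this (hypothetical names, a sketch of the 3 multiply-adds):

```cpp
struct Vec3 { float x, y, z; };

// Per-instance sphere transform: no matrix needed, just 3 multiply-adds.
// worldPos = unitSphereVertex * scale + sphereCenter
inline Vec3 transformSphereVertex(const Vec3& v, float scale, const Vec3& center) {
    return { v.x * scale + center.x,
             v.y * scale + center.y,
             v.z * scale + center.z };
}
```

The same three MADs apply whether this runs in the vertex shader (with scale and center fetched per instance) or on the CPU.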

OK, the problem seems related to the following line:
Code:
glBufferData(GL_TEXTURE_BUFFER, sizeof(spheres), spheres, GL_DYNAMIC_DRAW);
It generates an INVALID_OPERATION (1282) error, but I fail to see why. It's pretty much identical to the example here: https://gist.github.com/roxlu/5090067
Do you have the correct buffer (not texture) bound to GL_TEXTURE_BUFFER at the time?
 