ati's drivers... and vertex buffer objects (opengl)

AlNom

Moderator
Legend
Hey, there's this wicked engine mod for quake 2 -> quake 2 evolved

Anyhoo...one of the recent engine additions the coder there has done is r_arb_vertex_buffer_object. Thing is.... with the extension enabled, everyone who has ati cards (on the forums, including myself, at least) has experienced decreased frame rates.

Anybody know what I'm talking about?


edit: whoops... wrong forum.... could a mod please move it to the 3D graphics and drivers forum? :)
 
I've taken a look at his code, and the way he's using VBOs most of the time is very strange. He's using them exactly as though they were standard vertex arrays. They're not designed for this.

The assumption we make with VBOs is that the number of calls to glBindBuffer is significantly greater than the number of calls to glBufferData, as that's the whole raison d'etre of VBOs. That's not the case with this mod, and that's why it's slower. 13 calls to BindBuffer, 12 of which are followed by BufferData, is unlikely to be a win.

(The logic behind this is that for BufferData, etc. we manage memory assuming the data will be retained for a long period of time, while for standard vertex arrays we have other ways of doing it).
 
:oops:

okie... I'll just um.. link Berserk (the coder) to this thread...

thanks for the reply :)


edit: personally, I'm wondering why this seems to have affected the ATI users more profoundly than the nVidia users.

:?:
 
Holy crap! :oops:

Just took a look at the code, too. He seems to be buffering the data each and every frame! That totally defeats the entire point of using VBOs. You're supposed to load it into individual buffers at the scene's initialization and store the index of each buffer with the corresponding mesh. Each frame you just bind the appropriate buffer and render it. The way he's doing things results in the game being entirely AGP bandwidth limited. Once he fixes this I expect a huge performance boost for all cards using VBO. . .
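
A minimal sketch of the upload-once, bind-every-frame pattern described above. The functions `bind_buffer`, `buffer_data` and `draw_elements` are hypothetical stand-ins for glBindBufferARB, glBufferDataARB and glDrawElements that only count calls, so the sketch runs without a GL context; `mesh_t`, `mesh_load` and `mesh_draw` are likewise made-up names, not anything from the Quake 2 Evolved source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for glBindBufferARB / glBufferDataARB /
   glDrawElements. They only count calls, nothing is sent to a GPU. */
static unsigned g_binds, g_uploads, g_draws;
static size_t   g_bytes_uploaded;

static void bind_buffer(unsigned id) { (void)id; g_binds++; }

static void buffer_data(size_t bytes, const void *src)
{
    (void)src;
    g_uploads++;
    g_bytes_uploaded += bytes;
}

static void draw_elements(unsigned count) { (void)count; g_draws++; }

typedef struct {
    unsigned vbo;         /* buffer id stored with the mesh */
    unsigned num_vertex;  /* vertex count, needed at draw time */
} mesh_t;

/* Scene initialisation: exactly one upload per mesh. */
static void mesh_load(mesh_t *m, unsigned id, const float *xyz, unsigned n)
{
    assert(m != NULL && xyz != NULL && n > 0);
    m->vbo = id;
    m->num_vertex = n;
    bind_buffer(m->vbo);
    buffer_data((size_t)n * 3 * sizeof(float), xyz);
}

/* Per frame: bind and draw only, no upload. */
static void mesh_draw(const mesh_t *m)
{
    bind_buffer(m->vbo);
    draw_elements(m->num_vertex);
}
```

Calling `mesh_load` once and `mesh_draw` on every frame keeps the upload count at one per mesh while the bind count grows with the number of frames, which is the bind-to-upload ratio the buffer object path is built around.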
 
but it's weird that the ati cards are crazily affected.

one of the biggest problems though is that he's working on a geforce2mx and it's difficult to get something better at the moment...

one sec, I'll post his comments.

edit: by "his", I mean yours Ostsol on their forum ;)

err... actually I'll just notify him to check this thread when there's new stuff.
 
I registered and posted some stuff. It's odd. . . My test program doesn't show any of the issues he has. Using VBOs in the way he is turns out to be much faster than standard vertex arrays. Unfortunately, I can't test his project since I don't have Quake 2. :? I really should finally buy the whole set, some time. . . It ought to be really cheap, by now.
 
ATI and NV do things differently. For NV, they take the hit when you do the glVertexPointer etc calls. That mod would be more limited by the CPU than the AGP bus I think. While he's still using DMA transfers, it's a lot of silly calls.

(Haven't looked at the code, so don't know what he's doing)
 
AndrewM said:
ATI and NV do things differently. For NV, they take the hit when you do the glVertexPointer etc calls. That mod would be more limited by the CPU than the AGP bus I think. While he's still using DMA transfers, it's a lot of silly calls.

(Haven't looked at the code, so don't know what he's doing)
Here's a bit that should describe things quite well:

Code:
static void RB_RenderShaderARB (void){

	shaderStage_t	*stage;
	stageBundle_t	*bundle;
	int				i, j;

	if (r_logFile->integer)
		QGL_LogPrintf("--- RB_RenderShaderARB( %s ) ---\n", rb_shader->name);

	qglBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, rb_vbo.indexBuffer);
	qglBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, numIndex * sizeof(unsigned), indexArray, GL_STREAM_DRAW_ARB);

	RB_SetShaderState();

	RB_DeformVertexes();

	qglBindBufferARB(GL_ARRAY_BUFFER_ARB, rb_vbo.vertexBuffer);
	qglBufferDataARB(GL_ARRAY_BUFFER_ARB, numVertex * sizeof(vec3_t), vertexArray, GL_STREAM_DRAW_ARB);
	qglEnableClientState(GL_VERTEX_ARRAY);
	qglVertexPointer(3, GL_FLOAT, 0, VBO_OFFSET(0));

	qglBindBufferARB(GL_ARRAY_BUFFER_ARB, rb_vbo.normalBuffer);
	qglBufferDataARB(GL_ARRAY_BUFFER_ARB, numVertex * sizeof(vec3_t), normalArray, GL_STREAM_DRAW_ARB);
	qglEnableClientState(GL_NORMAL_ARRAY);
	qglNormalPointer(GL_FLOAT, 0, VBO_OFFSET(0));

	for (i = 0; i < rb_shader->numStages; i++){
		stage = rb_shader->stages[i];

		RB_SetShaderStageState(stage);

		RB_CalcVertexColors(stage);

		qglBindBufferARB(GL_ARRAY_BUFFER_ARB, rb_vbo.colorBuffer);
		qglBufferDataARB(GL_ARRAY_BUFFER_ARB, numVertex * sizeof(color_t), colorArray, GL_STREAM_DRAW_ARB);
		qglEnableClientState(GL_COLOR_ARRAY);
		qglColorPointer(4, GL_UNSIGNED_BYTE, 0, VBO_OFFSET(0));

		for (j = 0; j < stage->numBundles; j++){
			bundle = stage->bundles[j];

			RB_SetupTextureUnit(bundle, j);

			RB_CalcTextureCoords(bundle, j);

			qglBindBufferARB(GL_ARRAY_BUFFER_ARB, rb_vbo.texCoordBuffer[j]);
			qglBufferDataARB(GL_ARRAY_BUFFER_ARB, numVertex * sizeof(vec3_t), texCoordArray[j], GL_STREAM_DRAW_ARB);
			qglEnableClientState(GL_TEXTURE_COORD_ARRAY);
			qglTexCoordPointer(3, GL_FLOAT, 0, VBO_OFFSET(0));
		}

		if (glConfig.drawRangeElements)
			qglDrawRangeElementsEXT(GL_TRIANGLES, 0, numVertex, numIndex, GL_UNSIGNED_INT, VBO_OFFSET(0));
		else
			qglDrawElements(GL_TRIANGLES, numIndex, GL_UNSIGNED_INT, VBO_OFFSET(0));

		for (j = stage->numBundles - 1; j >= 0; j--){
			bundle = stage->bundles[j];

			RB_CleanupTextureUnit(bundle, j);

			qglDisableClientState(GL_TEXTURE_COORD_ARRAY);
		}
	}

	if (r_logFile->integer)
		QGL_LogPrintf("-----------------------------\n");
}
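
To put a rough number on what the function above re-uploads each time it runs, here is a small sketch that adds up the sizes passed to qglBufferDataARB. It assumes vec3_t is three floats (12 bytes), color_t is four bytes (which matches the 4 x GL_UNSIGNED_BYTE colour pointer) and indices are 4-byte unsigned ints; `upload_bytes` and its parameters are hypothetical names, not part of the engine.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed element sizes; the real typedefs live in the engine source. */
enum { VEC3_BYTES = 12, COLOR_BYTES = 4, INDEX_BYTES = 4 };

/* Bytes handed to BufferData by one call of the posted function:
   indices, positions and normals once, then colours once per stage
   and one texcoord array per bundle of each stage. */
static size_t upload_bytes(size_t num_vertex, size_t num_index,
                           const int *bundles_per_stage, int num_stages)
{
    size_t total;
    int i;

    assert(num_stages >= 0);
    assert(num_stages == 0 || bundles_per_stage != NULL);

    total  = num_index * INDEX_BYTES;        /* index buffer */
    total += 2 * num_vertex * VEC3_BYTES;    /* positions + normals */
    for (i = 0; i < num_stages; i++) {
        total += num_vertex * COLOR_BYTES;   /* colours for this stage */
        total += (size_t)bundles_per_stage[i] * num_vertex * VEC3_BYTES;
    }
    return total;
}
```

For a batch of 1000 vertices and 3000 indices with a single two-bundle stage this comes to 64000 bytes per call, and the same data is sent again the next time the shader is rendered even if nothing changed.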
 
hmm, yes. How does that rb_vbo structure get filled? That looks like another big issue, which causes this one.. sorta using a rotating buffer for static geometry. tsk tsk.

:)
 
Oh yes, ours and nvidia's implementations of vertex buffer objects have very different performance characteristics. Unfortunately, I know that developers are getting slightly different advice about how to use them because of it.

Our logic is simple: VBO's are for static objects and we will do everything in our power to make that VBO then render as quickly as possible. A consequence is that they aren't quite as fast as standard vertex arrays for doing the job of the standard vertex array - but that's what standard vertex arrays are for.

We advise updating buffers (using BufferSubData) rather than reallocating them (using BufferData with a different size to how it was originally).
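
A toy model of that advice, under a loudly stated assumption: the "driver" here allocates a new data store only when BufferData is called with a size different from the current one, and BufferSubData writes into the existing store. That is an illustration of the reasoning, not a description of how any real driver manages memory. All names (`model_buffer_t`, `model_buffer_data`, `model_buffer_sub_data`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *store;        /* the buffer's data store */
    size_t         size;         /* current size in bytes */
    unsigned       allocations;  /* how often the store was (re)allocated */
} model_buffer_t;

/* Models BufferData: a size change forces a new allocation (assumption). */
static int model_buffer_data(model_buffer_t *b, size_t size, const void *src)
{
    assert(b != NULL && size > 0);
    if (size != b->size) {
        unsigned char *p = (unsigned char *)realloc(b->store, size);
        if (p == NULL)
            return -1;
        b->store = p;
        b->size = size;
        b->allocations++;
    }
    if (src != NULL)
        memcpy(b->store, src, size);
    return 0;
}

/* Models BufferSubData: overwrite part of the existing store in place.
   Returns -1 if the range does not fit, where real GL raises an error. */
static int model_buffer_sub_data(model_buffer_t *b, size_t offset,
                                 size_t size, const void *src)
{
    assert(b != NULL && src != NULL);
    if (offset > b->size || size > b->size - offset)
        return -1;
    memcpy(b->store + offset, src, size);
    return 0;
}
```

Sizing the buffer once with a NULL source and then refreshing it through the sub-data call leaves the allocation count at one, while every BufferData call with a new size bumps it.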
 
Dio said:
Our logic is simple: VBO's are for static objects ...

You must be kidding. What about the <usage> tag?
When I use these, I don't expect you to assume I want STATIC behaviour...
STREAM_DRAW_ARB, STREAM_READ_ARB, STREAM_COPY_ARB
DYNAMIC_DRAW_ARB, DYNAMIC_READ_ARB, DYNAMIC_COPY_ARB

however when I use these I do :
STATIC_DRAW_ARB, STATIC_READ_ARB, STATIC_COPY_ARB

Or maybe you're only commenting "STATIC_*" <usage> behavior ?

Everyone should read the docs carefully anyway, code snippet shows a complete lack of understanding of the VBO nature, or a quick conversion from VA w/o code redesign...
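
For reference, the nine <usage> tokens form a 3 x 3 grid of update frequency (STREAM, STATIC, DYNAMIC) by access type (DRAW, READ, COPY). The sketch below builds a token from the two choices. The enum and function names are hypothetical; the hex values are the ones glext.h gives for the *_ARB tokens as far as I know (GL_STREAM_DRAW_ARB is 0x88E0), so check them against your own header before relying on the arithmetic.

```c
#include <assert.h>

/* How often the data store is respecified. */
typedef enum { FREQ_STREAM = 0, FREQ_STATIC = 1, FREQ_DYNAMIC = 2 } vbo_freq_t;

/* Who writes and who reads the data store. */
typedef enum { ACCESS_DRAW = 0, ACCESS_READ = 1, ACCESS_COPY = 2 } vbo_access_t;

/* Compose a usage token, e.g. (FREQ_STATIC, ACCESS_DRAW) gives the value
   of GL_STATIC_DRAW_ARB. Relies on the tokens being laid out in groups
   of four starting at 0x88E0. */
static unsigned vbo_usage(vbo_freq_t f, vbo_access_t a)
{
    assert((int)f >= 0 && (int)f <= 2);
    assert((int)a >= 0 && (int)a <= 2);
    return 0x88E0u + 4u * (unsigned)f + (unsigned)a;
}
```

In practice you would just pass the named constant; the point of the grid is that frequency and access are independent hints, so STREAM and DYNAMIC buffers are part of the extension alongside STATIC ones.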
 
Dio said:
Oh yes, ours and nvidia's implementations of vertex buffer objects have very different performance characteristics. Unfortunately, I know that developers are getting slightly different advice about how to use them because of it.

Our logic is simple: VBO's are for static objects and we will do everything in our power to make that VBO then render as quickly as possible. A consequence is that they aren't quite as fast as standard vertex arrays for doing the job of the standard vertex array - but that's what standard vertex arrays are for.

We advise updating buffers (using BufferSubData) rather than reallocating them (using BufferData with a different size to how it was originally).
If that is your stance and current performance profile, wouldn't it be wise to put STREAM_* VBOs in system memory, so that they, effectively, become regular vertex arrays?
 
update:

Originally posted by Berserk
I don't really mind if streaming VBOs are faster than VAs as long as they are not slower (and they have the potential to be somewhat faster, but we'll have to wait for even more mature/optimized drivers).
I just don't get what's causing the slowdowns on ATI hardware. I was told that Catalyst 4.7 gives a large boost compared to 4.6, but they are still slower than VAs. I don't have an ATI card myself to try other approaches, but I can assure you that I tried every possible combination of VBO parameters/functions with dynamic vertex data and this one turned out to be the fastest by a considerable margin (again, for my specific system at least).
VBOs in NVidia drivers prior to 56.72 were buggy and slow too, and that's why I'm almost sure this is ATI's fault.
 
I don't know what the driver does, but logically VBOs should never be slower than regular vertex arrays, as long as the API is used properly. If you're misusing it though you can expect serious slowdowns.

Updating a VBO then drawing should be faster than using a regular vertex array to draw if you're not reusing your VBO for different vertex arrays (which can cause all sorts of performance problems since the buffers may have to be resized and/or moved); not sure if that's what this app does.

Using the same VBO for the same dynamic data with matching sizes should however potentially be faster since you can avoid resizing buffers. Passing it as a regular vertex array may have the driver resizing a buffer it manages itself for the data.
 
Originally posted by Berserk

I just don't get what's causing the slowdowns on ATI hardware.

I don't have an ATI card myself to try other approaches

Hmmmm...think these 2 quotes have anything to do with his problems?........ :rolleyes:
 
Ingenu said:
Dio said:
Our logic is simple: VBO's are for static objects ...

You must be kidding. What about the <usage> tag?
When I use these, I don't expect you to assume I want STATIC behaviour...
STREAM_DRAW_ARB, STREAM_READ_ARB, STREAM_COPY_ARB
DYNAMIC_DRAW_ARB, DYNAMIC_READ_ARB, DYNAMIC_COPY_ARB

however when I use these I do :
STATIC_DRAW_ARB, STATIC_READ_ARB, STATIC_COPY_ARB

Or maybe you're only commenting "STATIC_*" <usage> behavior ?
Good point. I wasn't clear originally. I should have said "VBO's are for data that will be reused".

I would say that if something is specified DYNAMIC, but each upload is only used once, then there's no point in it being in a VBO in the first place.
 
zeckensack said:
If that is your stance and current performance profile, wouldn't it be wise to put STREAM_* VBOs in system memory, so that they, effectively, become regular vertex arrays?
That's not the case. Using BufferData in this manner would introduce an extra copy operation, because BufferData instantiates the data pointed to immediately, implying some kind of copy, unlike VertexPointer et al. that instantiate only at an array reference such as DrawElements.
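
The difference in when the data is read can be shown with a toy model (no GL involved). The "buffer object" copies the source at the BufferData-style call, while the "vertex array" only stores the pointer and reads through it at draw time. All types and functions here are hypothetical illustrations.

```c
#include <assert.h>
#include <string.h>

enum { MAX_FLOATS = 64 };

/* Models a buffer object: owns its own copy of the data. */
typedef struct { float data[MAX_FLOATS]; size_t count; } snap_vbo_t;

/* Models a client vertex array: remembers only where the data lives. */
typedef struct { const float *ptr; size_t count; } snap_va_t;

/* Like BufferData: the source is read (copied) right now. */
static void snap_buffer_data(snap_vbo_t *b, const float *src, size_t n)
{
    assert(b != NULL && src != NULL && n > 0 && n <= MAX_FLOATS);
    memcpy(b->data, src, n * sizeof(float));
    b->count = n;
}

/* Like VertexPointer: nothing is read yet, only the address is kept. */
static void snap_vertex_pointer(snap_va_t *va, const float *src, size_t n)
{
    assert(va != NULL && src != NULL && n > 0);
    va->ptr = src;
    va->count = n;
}

/* "Draw": return the first value the hardware would be given. */
static float snap_draw_vbo(const snap_vbo_t *b) { return b->data[0]; }
static float snap_draw_va(const snap_va_t *va)  { return va->ptr[0]; }
```

If the application changes its array between the setup call and the draw, the buffer object still holds the old values while the vertex array path picks up the new ones. That immediate snapshot is the extra copy a once-per-upload VBO pays for.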
 
Dio said:
Using BufferData in this manner would introduce an extra copy operation, because BufferData instantiates the data pointed to immediately, implying some kind of copy, unlike VertexPointer et al. that instantiate only at an array reference such as DrawElements.

The way the API is designed, a copy is indeed required, but that doesn't mean it isn't what one might want. The STREAM_* flag could lead to an area of memory in AGP space, which would give better results than plain system RAM, and that is a reason why this flag exists.
 
Of course. AGP, however, has few advantages over VRAM - basically, all it does is avoid CPU writes contending for memory resources. Both have similar access characteristics (assuming the VPU memory bus is mostly idle, but CPU access is a high priority client anyway).

There's no such thing as too much semantic information about buffers, though (as long as it's accurate).

I believe PCI Express memory may be better than AGP (it's snooped and so can be cached) but I don't know that for sure.
 
Originally posted by Berserk
Because of the way the engine works it would be too much hassle to completely redesign the backend for this. So for the moment everything is treated as dynamic geometry, but I do have plans for putting certain static surfaces in GL_STATIC_DRAW_ARB buffers. I will not be rewriting the whole thing to put, say, colors in a static buffer and vertices in a streaming buffer, mainly because I doubt it will get a considerable speed boost so it's not really worth the effort. Like I said, there are not many of these "special cases". I think entirely static meshes have more potential, and they will also be easier to implement. Again, I haven't worked on this just yet, but looking at the performance of streaming buffers is, well... not very encouraging to be honest.

If ATI's stance about dynamic buffers is like Dio said, then I'm sorry, but they aren't really doing things like they should (I repeat, VBOs are not only designed for static data, but also dynamic... or at least that's what I understand by reading the specs).
Of course my opinion isn't even taken into account by vendors because this engine doesn't help selling their products, I understand that, but it's still not my fault if they don't do things like they should.

I would like to hear the reasons why VBOs can't and/or shouldn't be used for dynamic geometry.
If not, then I'd like to know what path would be the fastest for Radeon hardware to directly compare it with my current approach that (one more time) is on par with CVAs performance wise on NVidia hardware.

In any case, Q2E users can disable VBOs if they want (they even have an option in the menu).
 