OpenGL Geometry shaders: poor performance on NV GF9800 GT

const

Newcomer
I'm writing a simple application which renders a torus using a geometry shader. Specifically, I wrote three shaders:
  • vertex shader does just gl_Position = gl_Vertex
  • geometry shader (POINTS in, TRIANGLE_STRIP out) takes a vertex and generates a tessellated torus (the center is the given vertex; the radii are in uniform variables)
  • fragment shader performs per-pixel lighting
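For reference, a minimal sketch of what such a geometry shader might look like under GL_EXT_geometry_shader4 (the uniform names u_R/u_r and the tessellation counts are hypothetical; the actual shader is not shown in the thread):

```glsl
#version 120
#extension GL_EXT_geometry_shader4 : enable
// Input: one point (the torus center); output: triangle strips.
// Note: GL_GEOMETRY_VERTICES_OUT_EXT must be set with
// glProgramParameteriEXT before linking, and the driver caps it.

uniform float u_R;       // major radius (hypothetical name)
uniform float u_r;       // minor radius (hypothetical name)

const int RINGS = 8;     // arbitrary tessellation level
const int SIDES = 6;     // total emits: RINGS * (SIDES+1) * 2 = 112

vec3 torusPoint(float phi, float theta) {
    // Point on a torus centered at the input vertex.
    vec3 c = gl_PositionIn[0].xyz;
    return c + vec3((u_R + u_r * cos(theta)) * cos(phi),
                    (u_R + u_r * cos(theta)) * sin(phi),
                     u_r * sin(theta));
}

void main() {
    const float TWO_PI = 6.2831853;
    for (int i = 0; i < RINGS; ++i) {
        float phi0 = TWO_PI * float(i)     / float(RINGS);
        float phi1 = TWO_PI * float(i + 1) / float(RINGS);
        // One strip per ring segment of the major circle.
        for (int j = 0; j <= SIDES; ++j) {
            float theta = TWO_PI * float(j) / float(SIDES);
            gl_Position = gl_ModelViewProjectionMatrix
                        * vec4(torusPoint(phi0, theta), 1.0);
            EmitVertex();
            gl_Position = gl_ModelViewProjectionMatrix
                        * vec4(torusPoint(phi1, theta), 1.0);
            EmitVertex();
        }
        EndPrimitive();
    }
}
```

Since the vertex shader only passes gl_Vertex through, the model-view-projection transform is applied here, once per emitted vertex.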

The program works correctly on an Intel GPU (X3000, I believe) under Mac OS X 10.5. Very slow, but it's Intel ;)

On an NVIDIA GeForce 9800 GT under Linux it runs even SLOWER and consumes a lot of CPU time (Core 2 Duo E8500, 3.16 GHz).
Moreover, GL_MAX_GEOMETRY_OUTPUT_VERTICES_EXT says that the GS may generate 1024 vertices, but when I try to do that, only the first 128 vertices are rendered, without any error being reported.
A similar problem occurs with a GeForce 8700M GT.

I tried moving the color calculation from the fragment shader to the geometry shader and completely disabling the fragment shader. The program works significantly better but is still very slow, although it renders just 1024 (or 128 :?: ) triangles.

Other applications and games work very well, reaching 60 FPS even with 'maximum quality' settings. But I suppose they are not using the GS.

Is it a driver issue or a GPU issue? Can I get better performance by generating geometry on the GPU instead of on the CPU with my video card at all?
 
Maybe you are exceeding the GS max number of varyings. Also, Nvidia cards are not good at outputting large numbers of vertices from the GS. Some Nvidia employees recommend using as few GS outputs as possible (under 20 vertices).
 
Maybe you are exceeding the GS max number of varyings.
Five vectors are passed from the GS to the FS. When the FS is disabled, there are no varyings.
Also, Nvidia cards are not good at outputting large numbers of vertices from the GS. Some Nvidia employees recommend using as few GS outputs as possible (under 20 vertices).
A 16-vertex 'torus' reaches 1 FPS in a 500x500 window :LOL:
AFAIK, Nvidia proposed the GS, whose key feature is geometry generation. What do Nvidia employees recommend using instead of the GS? :???:
 
The geometry shader is generally slow on all hardware out there, but in different ways on ATI and Nvidia hardware. In any case, if you can use something else, such as instancing, to achieve the same result, that's generally preferable for performance.
 
AFAIK, Nvidia proposed the GS, whose key feature is geometry generation. What do Nvidia employees recommend using instead of the GS? :???:

The geometry shader was not designed to do large-scale geometry amplification efficiently; if you wish to use it for that purpose, you will need to split the geometry amplification into multiple passes. For your torus example, one way to split the geometry amplification would be as follows:
  • 1st pass: use the geometry shader to expand the vertex into a line strip that loops back on itself; use transform feedback to dump this line strip to a vertex array.
  • 2nd pass: expand each line produced in the first pass into a loop of triangles, then pass to fragment shader.
Depending on the degree of tessellation you want, this may or may not be adequate; if it is not, you will have to find ways to split the work into even more passes (or use alternative approaches, such as the instancing suggested by Humus).
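The first of those passes might be sketched as follows (an assumption on my part, using GL_EXT_geometry_shader4 together with EXT_transform_feedback; the uniform name u_R and the segment count are hypothetical):

```glsl
#version 120
#extension GL_EXT_geometry_shader4 : enable
// Pass 1: expand the input point into a closed line strip around the
// torus's major circle. The output is captured into a vertex buffer
// with transform feedback; rasterization can be skipped for this pass
// (e.g. glEnable(GL_RASTERIZER_DISCARD_EXT)). The output type is set
// to GL_LINE_STRIP via glProgramParameteriEXT before linking.
// Pass 2 then expands each captured segment into a ring of triangles.

uniform float u_R;          // major radius (hypothetical name)
const int SEGMENTS = 32;    // arbitrary tessellation level

void main() {
    const float TWO_PI = 6.2831853;
    for (int i = 0; i <= SEGMENTS; ++i) {   // <= closes the loop
        float phi = TWO_PI * float(i) / float(SEGMENTS);
        gl_Position = gl_PositionIn[0]
                    + vec4(u_R * cos(phi), u_R * sin(phi), 0.0, 0.0);
        EmitVertex();
    }
    EndPrimitive();
}
```

This keeps each GS invocation's output small: pass 1 emits ~33 vertices, and each pass-2 invocation only expands one segment, so the amplification is spread across many parallel invocations instead of one.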
 
The Nvidia G80 GS apparently wasn't designed for high amplification. AMD hardware was designed for it.

const, I'm not familiar with the OpenGL extension, but DX10 requires the GS to support 1024 dwords of output. So if you have 5 dwords/vertex, you can only have 204 emits.
 
The Nvidia G80 GS apparently wasn't designed for high amplification. AMD hardware was designed for it.

const, I'm not familiar with the OpenGL extension, but DX10 requires the GS to support 1024 dwords of output. So if you have 5 dwords/vertex, you can only have 204 emits.

Do you qualify a scene with a single point expanded into 16 triangles as 'high amplification'??
Of course, generating a whole large scene from a single point is a bad idea, simply because a single invocation of the GS cannot be parallelized among GPU cores.

The extension specification speaks of a limit on the number of emitted vertices (this must be at least 256!). On the Intel GPU this limit is 1024 and is properly honored. If I try to emit 1200 vertices, the last 176 vertices are silently discarded.
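One plausible explanation for a 128-vertex cap (an assumption on my part, not something the driver reports): EXT_geometry_shader4 also defines a total output limit in components, GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS_EXT, commonly 1024, and the budget is counted per component rather than per vertex:

```glsl
#version 120
#extension GL_EXT_geometry_shader4 : enable
// Hypothetical budget, assuming a 1024-component total output limit
// (GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS_EXT):
//   gl_Position          -> 4 components
//   one vec4 varying     -> 4 components
//   8 components/vertex  -> 1024 / 8 = 128 vertices before the
//                           remaining emits would be dropped

varying out vec4 v_color;   // hypothetical varying

void main() {
    for (int i = 0; i < 1024; ++i) {  // asks for far more than fits
        gl_Position = gl_PositionIn[0];
        v_color = vec4(1.0);
        EmitVertex();                 // emits past the component budget
                                      // may be silently discarded
    }
    EndPrimitive();
}
```

Under that reading, GL_MAX_GEOMETRY_OUTPUT_VERTICES_EXT is only reachable when each vertex carries very few components.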

So what's the GS good for?

Very good question. Vertex shaders may be used to transform vertices, and the GS may be used to transform a whole single primitive, I guess. With a small hack we can transform primitives in the vertex shader, but what we cannot do without the GS is generate new vertices. So the question is still open.
 
So what's the GS good for?

The use cases that were promoted when the geometry shader was added to DirectX 10 in the first place were things like shadow-volume generation, single-pass cube-map rendering, and point-sprite expansion, all of which can be done with very limited geometry amplification (~1-10 output polygons per input polygon).

While it looks kinda obvious that you should be able to perform general-purpose tessellation in the geometry shader, that's not what it's there for, and as const has experienced, it indeed does not perform that task well at all. The programming model of the geometry shader dictates that its outputs are produced serially, which makes any kind of parallelism essentially impossible unless the GPU arbitrarily hard-limits the amount of output each invocation can produce.
 
Do you qualify a scene with a single point expanded into 16 triangles as 'high amplification'??
Of course, generating a whole large scene from a single point is a bad idea, simply because a single invocation of the GS cannot be parallelized among GPU cores.
I consider that moderate amplification. 16x initially seems high, but 16 verts is far from the max and thus not high. The GS can be parallelized; it just has serialization points before and after, so your peak rate is still 1 vert/clock on current hardware. How much it can be parallelized depends on how much on-chip memory you have, or on whether you have the ability to spill to off-chip memory. If you spill off-chip, more bandwidth is used, but in many cases this isn't a big deal.
 
So what's the GS good for?

Small-scale amplification. Doing per-primitive work. Distributing rendering to many render targets (single-pass render to cubemap, etc.).

Do you qualify a scene with a single point expanded into 16 triangles as 'high amplification'??

Last time I tried the GS on Nvidia, which was about a year ago and on an 8800 GTX, performance dropped off by roughly a factor of 3-4 for every doubling of the amplification. So amplifying by 16x would reduce performance to something like 1/100 of the performance with no amplification.

Does OpenGL have an equivalent of Render to Vertex Buffer?

Yes. You can use PBOs.
 
Yes. You can use PBOs.
Render to VBO is essentially used to transform or generate primitives on the GPU that are then used in subsequent rendering passes. Thus, you can bypass the rasterization and shading stages by using transform feedback buffers rather than a PBO. In general, the "render to VBO" functionality is only a "hack" that addressed the lack of transform feedback buffers.
 
Render to VBO is essentially used to transform or generate primitives on the GPU that are then used in subsequent rendering passes. Thus, you can bypass the rasterization and shading stages by using transform feedback buffers rather than a PBO. In general, the "render to VBO" functionality is only a "hack" that addressed the lack of transform feedback buffers.

I'd profile both/all before you decide which is a hack. I think someone on the OpenGL boards had done this and tested all the ways one could feed back data (CUDA on Nvidia cards, render to vertex buffer, transform feedback, vertex texture fetch, and the geometry shader). I believe that render to vertex buffer (R2VB) was the fastest, with the exception that one had to use vertex texture fetch on newer Nvidia cards (8 series and up) because R2VB wasn't supported. Of course, my memory could be wrong; I'd suggest doing some research for yourself...

There are some advantages to rendering to a texture/VB as well, in that the ROP units can do free format conversion, which can reduce the bandwidth required for the entire operation, from output to feedback in another rendering pass.
 
TimothyFarrar said:
I'd profile both/all before you decide which is a hack.
In fact, I was talking about the mechanism. Of course, you can tweak the algorithm and test all the techniques before judging which is better in terms of efficiency. Nevertheless, when you want to re-use GPU-transformed/generated primitives, conceptually it seems natural to bypass the rasterization, shading, and ROP stages. Using the shading stage to generate/transform primitives may be more efficient, but that was not its goal. Of course, if you have to read back the shading data, using a PBO is the best way (again, in terms of concept). But in the same way, maybe it is more efficient to use a TFB if the amount of output shading data is not fixed.
 
They are separate. The GS comes after the tessellation stages (VS->HS->Tessellator->DS->GS->PS).
 
It seems that the GS is an Nvidia-oriented approach to triangle amplification... is this why DX10 and DX11 tessellation are so different...? Was Nvidia planning on supplanting AMD's tessellation approach? Did Nvidia "Win" to some degree?
 