What's the architectural difference between a vertex and geometry shader?

I've read through this thread:
http://www.beyond3d.com/forum/showthread.php?t=25760
which describes the differences from a programming perspective, but I'm curious about them from an architectural viewpoint (not that I really grasp one more than the other, if at all...).

In simple terms, a complete triangle is being manipulated rather than a single vertex, yes?

The fact we don't have any hardware available makes it a little difficult, but would it be fair to assume a geometry shader would be optimally built with several SIMD ALUs, as opposed to a vertex shader with a single vector and scalar ALU? Or would the same result be achieved by dynamically stringing multiple vertex shaders together (or looping through the same VS)? How would this translate into unified shader architectures?

(Or do I simply have no clue...)
 
The GS is still working on adjacent per-vertex data (it comes after the VS in the pipe), but collected in the form of triangle lists/strips, line lists, etc. It can then stream back out to memory so output data is fed back into the VS for another trip and more processing, and it can generate new vertices based on what it's fed.

That streamout then is configurable (turn it on or off, rasteriser gets different data patterns, or stream out to multiple buffers at once), and the GS can also texture, just like the VS hardware in SM3.0.

So the difference is largely in its streamout caps, compared to the VS hardware, and the fact it can generate brand new geometry which the VS can't.
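
To put that in code terms, here's a minimal, made-up HLSL sketch of the shape a GS takes in D3D10 (the struct, names and the normal offset are all just illustrative, not from any real sample):

Code:
// Hypothetical per-vertex format arriving from the VS stage
struct VS_OUT
{
    float4 Pos  : SV_POSITION;
    float3 Norm : NORMAL;
};

// Invoked once per input triangle; it can pass the triangle through
// and also emit brand new vertices the VS never produced
[maxvertexcount(6)]
void GS_Example( triangle VS_OUT In[3], inout TriangleStream<VS_OUT> Out )
{
    // pass the original triangle through unchanged
    for( int v = 0; v < 3; ++v )
        Out.Append( In[v] );
    Out.RestartStrip();

    // emit a second triangle, offset along the vertex normals
    // (an arbitrary operation, purely for illustration)
    for( int w = 0; w < 3; ++w )
    {
        VS_OUT o = In[w];
        o.Pos.xyz += o.Norm * 0.1f;
        Out.Append( o );
    }
    Out.RestartStrip();
}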

As for the ALU arrangement of a GS, I'd expect it to be just the same as the VS hardware. In fact, you can probably start to build a GS using suitable VS silicon as a basis, I'd imagine.
 
I just read through that thread as well and was kind of curious about a few things. I'm not fully up to date on how Xenos works etc., but I have, IMHO, a rough understanding of how shaders work.

Are they potentially adding a "triangle" datatype to the mix? From what I can figure, the GS works on triangles and tessellates them etc. So how would you determine just how many triangles you plan on outputting? Ideally you could output as many as needed.

So for instance would you:
1) Take a triangle, divide it into x triangles and then output those
2) Use a 'while/for' loop and add in a function like "streamout(triangle out)" that, each time it's called, passes a triangle on to the pixel shaders.

This also brings up the question: if it's streaming, does the output go back into memory and wait for a pixel shader to start up on a batch, or does it go directly out to a pixel shader to start processing?


Speculation:
Also, what is a vertex shader used for if a geometry shader is present? Would it not make sense to only use a GS, which can already see all 3 verts, and perform the desired vertex-shader-based operations on each vert? Then proceed with any tessellation if desired, with the shader repeating the same actions on any additional verts/triangles it wished to complete.

I'm guessing that with ATI going unified they're taking the pixel shaders (the most complete pipelines for unified hardware) and using a bunch of those for the unified hardware shaders. Secondly, would their choice of 3 ALUs per pipeline possibly be tied to geometry shaders, which would work on 3 verts at once? It's the only situation I can think of off the top of my head where you have 3 parallel execution paths, with each partially dependent on the others. Considering they have a programmable memory controller, would it be possible to program it to take just the pixel shaders on a 1800/1900 and have them run DX10-based code straight up? Without the additional format capabilities etc., of course. It seems like the 1900 got all the nice improvements that have obvious benefits towards a unified/DX10-style system. I'm just wondering if R580 liking multiples of 3 isn't a coincidence here, because it seems like they would be really efficient at processing 3 vertices in parallel on their pixel shading units.

On top of this you'd be really close to being able to have one massive shader that took in 3 verts and ended up drawing pixels to the screen by the time it was done. Tessellation could be a matter of dynamic branching if this were the case. Also, could this possibly cut down on having to queue up tasks for the following shaders, if you had a streamout() type of function?

Also, assuming you had vert data related to B-splines, could you start looping in the GS part of the shader until you had triangles that fit into individual pixels? So theoretically, if you supplied verts with the correct data, you could turn 4 coplanar points into a perfectly formed sphere? This would also open up the possibility for some really whacked-out shaders. Who says you can't tessellate a point sprite, or turn a single vertex into a perfectly rounded sphere? Or extrude a single face in both directions? Or, if you really want to kill performance, render a localized particle effect with what would technically be completely unrelated geometry that started as a single 3D point.

Of course I'm not an expert, so it's always possible I don't have the slightest clue what I'm talking about.
 
"Anarchist4000", the words "tesselation" and "geometry shader" are not the same. I'd recommend researching what "tesselation" means (or what it has become to be known), rather than what "geometry shader" means. And perhaps what a "vertex shader" means, too.

Also, what is a vertex shader used for if a geometry shader is present?
One affects the vertices (which "Rys" explained quite well, IMO, in the context of this thread) while the other affects geometry. You start with vertices (and not "geometry") when you want to use pixel shaders for certain effects. Once you grasp this, it follows that the hardware's VS unit is not the same as its GS unit; I assume you meant this in a hardware sense (a hardware's VS unit vs. that same hardware's GS unit) rather than a software sense (a "vertex shader" compared to a "geometry shader").

I assume your next question would be an example of a geometry shader in use. "Humus" will tell you to wait :)
 
ATI will most likely have a unified shader core, hence it will consist of a high number of multi-purpose ALUs. Xenos' ALUs are already capable of both pixel and vertex shading, so adding another stage for geometry shading shouldn't be too complicated.
 
I've always understood tessellation to mean breaking geometry up into additional, more detailed geometry. Whether this is done before or after the VS doesn't seem like it would make a difference. Although if the GS has all the capabilities of a VS, and then some, why exactly could it not be a replacement for it? And having both tessellation before the VS and a GS after the VS seems like it would be redundant. From what I've seen, the GS gets all the same data a VS would get, x3, with adjacency and other data, and still has the capability to transform/light those verts as well as create additional verts and do the same with them. And adjacency, I'd imagine, would be determined by the draw call from the application, so the VS would have no impact on this.

I've always worked on the assumption that the process started with triangles (DrawTriStrip), with each vert being handled independently by the VS, then the triangle textured by the PS, then blended with whatever is in the render target.

So using streaming you could take two coplanar triangles (a simple square), send them into the GS, and then extrude just about any shape you want out of it, creating all the added triangles/geometry you need to make it look good. You wouldn't necessarily have to keep the original verts that were used. You could use the same concept to turn a simple point sprite into a sphere.

When a geometry shader is active, it is invoked once for every primitive passed down or generated earlier in the pipeline. Each invocation of the geometry shader sees as input the data for the invoking primitive, whether that is a single point, a single line, or a single triangle. A triangle strip from earlier in the pipeline would result in an invocation of the geometry shader for each individual triangle in the strip (as if the strip were expanded out into a triangle list). All the input data for each vertex in the individual primitive is available (i.e. 3 vertices for a triangle), plus adjacent vertex data if applicable/available.
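
To make that concrete, the primitive type and vertex count are declared right in the GS signature - a hedged sketch (VS_OUT here is just a placeholder, and if I recall the adjacency ordering right, the even-indexed entries are the triangle itself):

Code:
struct VS_OUT                  // placeholder per-vertex format
{
    float4 Pos : SV_POSITION;
};

// Input sizes by primitive type: point In[1], line In[2], triangle In[3],
// lineadj In[4], triangleadj In[6]
[maxvertexcount(3)]
void GS_PassThrough( triangleadj VS_OUT In[6],
                     inout TriangleStream<VS_OUT> Out )
{
    // In[0], In[2], In[4] should be the triangle itself,
    // In[1], In[3], In[5] the adjacent vertices
    Out.Append( In[0] );
    Out.Append( In[2] );
    Out.Append( In[4] );
    Out.RestartStrip();
}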

One of the main features of the GS being that it breaks triangles up into additional geometry, tessellation seems like it would still apply here. So you'd feed it a triangle (a group of 3 verts) and it'd make more triangles. What would tessellation be used for in the previous sense if it didn't involve adding/creating additional geometry? I've never played around with tessellation in previous DX versions, so I could be overlooking something here.
 
Ghost of D3D is right. And you can use the GS as a tessellator if you like, sure. But it's capable of (much) more than that.
 
Doing vertex shading operations in the geometry shading stage is in general possible, but forces every vertex to be vertex-shaded once for each polygon it belongs to - in an ordinary triangle mesh, each vertex is shared between about 6 polygons, so that you effectively end up with 6 times the vertex shader load. This 6x efficiency gap is the main reason why you would still want to have a separate vertex shader (at least in the programming model) even if the geometry shader is present.
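
A rough back-of-the-envelope behind that 6x figure, assuming a large closed triangle mesh (where triangles ≈ 2 × vertices):

VS path, with an effective post-transform cache: ~V vertex transforms in total
GS-only path: 3 transforms per triangle × 2V triangles = 6V vertex transforms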

As for using the geometry shader for higher-order-surface tessellation, it CAN be done, but it doesn't seem to have been designed to do that efficiently. For a fast tessellator, you would probably want some sort of systolic array rather than a general-purpose programmable execution unit - the systolic array would be able to reach much higher tessellation performance for a given transistor or power budget, while losing some flexibility or efficiency for other geometry-shader-style tasks. Also, the geometry shader has a programming model that is excessively serial for plain tessellation purposes: if you wish to output 10 triangles from one input triangle, the programming model forces you to emit the 10 triangles serially instead of running them in parallel.
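
To illustrate the serial part, a naive 1:4 midpoint subdivision written as a GS would look something like this sketch (VS_OUT is a placeholder and positions are just linearly interpolated) - every output vertex goes through Append() one after another:

Code:
struct VS_OUT
{
    float4 Pos : SV_POSITION;
};

VS_OUT Midpoint( VS_OUT a, VS_OUT b )
{
    VS_OUT m;
    m.Pos = 0.5f * ( a.Pos + b.Pos );
    return m;
}

// Split one triangle into four; the 12 output vertices are emitted
// strictly one after another - there is no "emit these in parallel"
[maxvertexcount(12)]
void GS_Subdivide( triangle VS_OUT In[3], inout TriangleStream<VS_OUT> Out )
{
    VS_OUT m01 = Midpoint( In[0], In[1] );
    VS_OUT m12 = Midpoint( In[1], In[2] );
    VS_OUT m20 = Midpoint( In[2], In[0] );

    // three corner triangles
    Out.Append( In[0] ); Out.Append( m01 );   Out.Append( m20 );   Out.RestartStrip();
    Out.Append( m01 );   Out.Append( In[1] ); Out.Append( m12 );   Out.RestartStrip();
    Out.Append( m20 );   Out.Append( m12 );   Out.Append( In[2] ); Out.RestartStrip();
    // centre triangle
    Out.Append( m01 );   Out.Append( m12 );   Out.Append( m20 );   Out.RestartStrip();
}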
 
Anarchist4000 said:
So for instance would you:
1) Take a triangle, divide it into x triangles and then output those
2) Use a 'while/for' loop and add in a function like "streamout(triangle out)" that, each time it's called, passes a triangle on to the pixel shaders.
There are 3 datatypes for output streams: PointStream, LineStream and TriangleStream. They have Append() and RestartStrip() methods. So it's 2).
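
For illustration, a hedged sketch of option 2) in those terms (the struct is a placeholder) - each Append() plays the role of the hypothetical streamout(), and the output primitive type doesn't have to match the input:

Code:
struct V { float4 Pos : SV_POSITION; };   // placeholder vertex format

// Echo each input triangle's outline as a closed line strip
[maxvertexcount(4)]
void GS_Wireframe( triangle V In[3], inout LineStream<V> Out )
{
    Out.Append( In[0] );
    Out.Append( In[1] );
    Out.Append( In[2] );
    Out.Append( In[0] );   // close the loop
    Out.RestartStrip();
}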

This also brings up the question: if it's streaming, does the output go back into memory and wait for a pixel shader to start up on a batch, or does it go directly out to a pixel shader to start processing?
That depends on the architecture. One of the difficulties with the GS is that you don't know the length of the output beforehand, while you still have to keep the triangle order. This makes parallel execution difficult, which is why the output per GS invocation is limited to 1024 32-bit values.
So calculating the GS for multiple triangles in parallel could be done by having each invocation write to its own 4 KiB area in memory, plus storing the number of output elements of each somewhere.
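
One visible consequence of that limit, if I'm reading it right, is that the declared [maxvertexcount] times the size of the output vertex has to fit inside the 1024-scalar budget. A made-up example:

Code:
struct FAT_OUT                    // hypothetical 16-scalar output vertex
{
    float4 Pos : SV_POSITION;
    float4 C0  : COLOR0;
    float4 C1  : COLOR1;
    float4 Tex : TEXCOORD0;
};

// 16 scalars per vertex * 64 vertices = 1024 32-bit values, right at the cap;
// declaring 65 here should be rejected by the compiler
[maxvertexcount(64)]
void GS_Fill( point FAT_OUT In[1], inout PointStream<FAT_OUT> Out )
{
    for( int i = 0; i < 64; ++i )
        Out.Append( In[0] );
}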


Speculation:
Also, what is a vertex shader used for if a geometry shader is present? Would it not make sense to only use a GS, which can already see all 3 verts, and perform the desired vertex-shader-based operations on each vert? Then proceed with any tessellation if desired, with the shader repeating the same actions on any additional verts/triangles it wished to complete.
Why do vertex calculations multiple times if you can do them once and store them in the post-transform vertex cache?
A vertex is usually part of multiple triangles, so doing vertex transformations (and other per-vertex stuff) in the GS would mean lots of redundant work.


I'm guessing that with ATI going unified they're taking the pixel shaders (the most complete pipelines for unified hardware) and using a bunch of those for the unified hardware shaders. Secondly, would their choice of 3 ALUs per pipeline possibly be tied to geometry shaders, which would work on 3 verts at once? It's the only situation I can think of off the top of my head where you have 3 parallel execution paths, with each partially dependent on the others.
Geometry shaders do not work on "3 verts at once"; they work on one primitive at a time. They are not 3 programs running in parallel, one on each of the vertices (that would be equivalent to vertex shading); they are a single sequence of operations working on a single primitive.

And I think you should read Dave's article on Xenos.

Considering they have a programmable memory controller, would it be possible to program it to take just the pixel shaders on a 1800/1900 and have them run DX10-based code straight up? Without the additional format capabilities etc., of course. It seems like the 1900 got all the nice improvements that have obvious benefits towards a unified/DX10-style system. I'm just wondering if R580 liking multiples of 3 isn't a coincidence here, because it seems like they would be really efficient at processing 3 vertices in parallel on their pixel shading units.
Xenos is much closer to D3D10 than R580, and it's still not there. R580 pixel shaders do not process 3 elements (whether that be pixels or vertices) in parallel; its thread size is 48 elements/12 quads.

On top of this you'd be really close to being able to have one massive shader that took in 3 verts and ended up drawing pixels to the screen by the time it was done. Tessellation could be a matter of dynamic branching if this were the case. Also, could this possibly cut down on having to queue up tasks for the following shaders, if you had a streamout() type of function?

Also, assuming you had vert data related to B-splines, could you start looping in the GS part of the shader until you had triangles that fit into individual pixels? So theoretically, if you supplied verts with the correct data, you could turn 4 coplanar points into a perfectly formed sphere? This would also open up the possibility for some really whacked-out shaders. Who says you can't tessellate a point sprite, or turn a single vertex into a perfectly rounded sphere? Or extrude a single face in both directions? Or, if you really want to kill performance, render a localized particle effect with what would technically be completely unrelated geometry that started as a single 3D point.
You could do something like that, but you're limited to 1024 32-bit output values per GS invocation.
And single-pixel triangles are horribly inefficient. For that you would really need a new architecture that is made with these requirements in mind.
 
Geometry shaders do not work on "3 verts at once"; they work on one primitive at a time. They are not 3 programs running in parallel, one on each of the vertices (that would be equivalent to vertex shading); they are a single sequence of operations working on a single primitive.

I was basing that on a primitive typically being a triangle in most circumstances. And since R580 had 3 ALUs per pipeline (assuming each ALU works on an entire vector), it could technically run identical instructions on each of the 3 points of the primitive at the same time. You couldn't necessarily code the shader to run certain parts in parallel, but if you had ALUs for 3 separate vector calculations that could all run at the same time, and you simply wanted to create a new vertex halfway along each side of a triangle, could they not all execute at the same time, since they wouldn't be dependent on each other? It just seems like it would make sense from an efficiency standpoint.

As for using the GS to transform the data instead of the VS, I was thinking more about a displacement-map sort of scenario, or spots where there is dynamically driven geometry with little adjacency outside of a few triangles (typically pairs), which would be broken up into smaller pieces where the detail is needed. Particle-effect-like stuff, to a degree. Things where at certain points you would want to increase the geometry they use, but only rarely would it occur. Or if you were doing particles that were basic point sprites, where you fed it one point that was the center of the system and it created more points based on time, velocities, etc. For straight-up models, like what's used a lot of the time, this probably wouldn't be a very good approach.

For the datatypes, how would you define a triangle exactly? I've gone through the DX10 SDK and all it said was that they were a templated datatype. I'm guessing that would be whatever vertex format the triangles used, in an array of 3?

I know the GS would be capable of much more than just transforming verts, but in some cases would it not make sense to kill 2 birds with one stone, so to speak? Or use it to create your geometry and then hit the vertex shader on a second pass to put it on the screen.
 
Anarchist4000 said:
I was basing that on a primitive typically being a triangle in most circumstances. And since R580 had 3 ALUs per pipeline (assuming each ALU works on an entire vector), it could technically run identical instructions on each of the 3 points of the primitive at the same time. You couldn't necessarily code the shader to run certain parts in parallel, but if you had ALUs for 3 separate vector calculations that could all run at the same time, and you simply wanted to create a new vertex halfway along each side of a triangle, could they not all execute at the same time, since they wouldn't be dependent on each other? It just seems like it would make sense from an efficiency standpoint.
How does that help efficiency?

And it's not the number of ALUs that's important for this, but the processing granularity, the number of elements per thread. R580 performs one "operation" (which can consist of multiple sub-operations) on 48 pixels, per "pipeline", over 4 clocks.

As for using the GS to transform the data instead of the VS, I was thinking more about a displacement-map sort of scenario, or spots where there is dynamically driven geometry with little adjacency outside of a few triangles (typically pairs), which would be broken up into smaller pieces where the detail is needed. Particle-effect-like stuff, to a degree. Things where at certain points you would want to increase the geometry they use, but only rarely would it occur. Or if you were doing particles that were basic point sprites, where you fed it one point that was the center of the system and it created more points based on time, velocities, etc. For straight-up models, like what's used a lot of the time, this probably wouldn't be a very good approach.
What's to be gained by doing everything in the GS, instead of using the VS for per-vertex calculations (on the input vertices) and the GS for per-primitive calculations? It's a programming model. Hardware implementations will most likely share most resources between both.
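
As a sketch of that split (all names and constants here are made up): the VS does the per-vertex transform once per unique vertex, and the GS only does the genuinely per-primitive work, e.g. a face normal:

Code:
struct VS_IN  { float3 Pos : POSITION; };
struct VS_OUT { float4 Pos : SV_POSITION; float3 WPos : TEXCOORD0; };
struct GS_OUT { float4 Pos : SV_POSITION; float3 FaceNormal : TEXCOORD0; };

float4x4 g_WorldViewProj;   // hypothetical constants
float4x4 g_World;

// Per-vertex work: runs once per (cached) input vertex
VS_OUT VS_Main( VS_IN v )
{
    VS_OUT o;
    o.Pos  = mul( float4( v.Pos, 1.0f ), g_WorldViewProj );
    o.WPos = mul( float4( v.Pos, 1.0f ), g_World ).xyz;
    return o;
}

// Per-primitive work: runs once per triangle and sees all three vertices
[maxvertexcount(3)]
void GS_Main( triangle VS_OUT In[3], inout TriangleStream<GS_OUT> Out )
{
    float3 n = normalize( cross( In[1].WPos - In[0].WPos,
                                 In[2].WPos - In[0].WPos ) );
    for( int i = 0; i < 3; ++i )
    {
        GS_OUT o;
        o.Pos        = In[i].Pos;
        o.FaceNormal = n;
        Out.Append( o );
    }
    Out.RestartStrip();
}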

For the datatypes, how would you define a triangle exactly? I've gone through the DX10 SDK and all it said was that they were a templated datatype. I'm guessing that would be whatever vertex format the triangles used, in an array of 3?
The vertex format of the output stream is independent of the vertex format of the input vertices.

Here's the GS code from an SDK example:
Code:
struct GS_OUTPUT_CUBEMAP
{
    float4 Pos : SV_POSITION;     // Projection coord
    float2 Tex : TEXCOORD0;       // Texture coord
    uint RTIndex : SV_RenderTargetArrayIndex;
};

[maxvertexcount(18)]
void GS_CubeMap( triangle VS_OUTPUT_CUBEMAP In[3], inout TriangleStream<GS_OUTPUT_CUBEMAP> CubeMapStream )
{
    for( int f = 0; f < 6; ++f )
    {
        // Compute screen coordinates
        GS_OUTPUT_CUBEMAP Out;
        Out.RTIndex = f;
        for( int v = 0; v < 3; v++ )
        {
            Out.Pos = mul( In[v].Pos, g_mViewCM[f] );
            Out.Pos = mul( Out.Pos, mProj );
            Out.Tex = In[v].Tex;
            CubeMapStream.Append( Out );
        }
        CubeMapStream.RestartStrip();
    }
}
 
BTW, I suspect this sample is a good example of how *not* to use a GS. Simply because it does transform each vertex multiple times. Doing six passes, one for each cube face, will likely be more efficient.
 
ET said:
BTW, I suspect this sample is a good example of how *not* to use a GS. Simply because it does transform each vertex multiple times. Doing six passes, one for each cube face, will likely be more efficient.
When rendering cube maps with multiple passes you're still transforming each vertex multiple times. The SDK method just performs all of the transforms in a single pass. Theoretically it saves PCI-E bandwidth and CPU overhead.
 
Anarchist4000 said:
I know the GS would be capable of much more than just transforming verts, but in some cases would it not make sense to kill 2 birds with one stone, so to speak? Or use it to create your geometry and then hit the vertex shader on a second pass to put it on the screen.
Your second sentence actually answers your (misunderstood) first, although this doesn't necessarily have to do with "passes". "Passes" cost performance, for one -- hardware with dedicated/individual GS and VS units will allow programmers to forget one potential performance-related problem. Read the post by "arjan de lumens".

I think what's relevant to enlightening you on this topic is this: what graphical effect are you trying to do/approximate? If you can provide an example, it will probably be much easier to explain the differences between having dedicated GS and VS units versus having "just" a GS unit and forsaking a VS unit (singular units just to simplify things -- we (will) almost certainly have more than one of each). Which is what I assume you're basically asking, right? Or are you asking what a hardware's VS unit(s) can't do compared to its GS unit(s), and vice versa?
 
I've not had the time to read all the replies so far, so apologies if this has already been said!

If you're talking about replacing a VS with a GS, and effectively asking what's the point of having a VS anymore: it should still be a useful optimization.

I'm pretty sure the vertex cache should still be in play - if two triangles share one or more vertices (quite likely with an indexed list), then the vertex shader only needs to be executed once per shared vertex and the results duplicated for both triangles passed on to the GS. If you just threw the raw data at the GS, you'd have to transform those shared vertices twice each - 4 transforms instead of 2 for a shared edge.

You'd also end up with lots of confusion and complexity around the 1-ring adjacencies if the vertices weren't transformed before they hit the GS.

From the general noises I've been picking up, as well as my own interpretation of the specs and samples, the GS isn't going to be the transformation monster that the VS is. At least not initially. Rather, the GS is a good way of duplicating geometry and redirecting it to multiple render targets, and of computing extra data to expose more information to the PS...

hth
Jack
 