Anarchist4000 said:
So for instance would you:
1) Take a triangle, divide it into x triangles and then output those
2) Use a 'while/for' loop and add a function like "streamout(triangle out)" that, each time it was called, passed a triangle on to the pixel shaders.
There are 3 datatypes for output streams: PointStream, LineStream and TriangleStream. They have Append() and RestartStrip() methods. So it's 2).
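For concreteness, here is a minimal sketch of option 2 in HLSL. The [maxvertexcount] attribute, TriangleStream, Append() and RestartStrip() are the actual D3D10 constructs; the GSInput layout and shader name are just invented for the example:

```hlsl
// Hypothetical vertex layout; any struct with valid semantics works here.
struct GSInput
{
    float4 pos : SV_Position;
    float2 uv  : TEXCOORD0;
};

// Upper bound on vertices emitted per invocation; required by D3D10.
[maxvertexcount(3)]
void PassThroughGS(triangle GSInput input[3],
                   inout TriangleStream<GSInput> stream)
{
    // Each Append() hands one vertex to the strip; three of them
    // make one triangle that continues on toward rasterization.
    for (int i = 0; i < 3; ++i)
        stream.Append(input[i]);

    // End the current strip; the next Append() starts a new one.
    stream.RestartStrip();
}
```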
This also brings up the question: if it's streaming, does the output go back into memory and wait for a pixel shader to start up on a batch, or does it go directly out to a pixel shader to start processing?
That depends on the architecture. One of the difficulties with the GS is that you don't know the length of the output beforehand, yet you still have to preserve triangle order. This makes parallel execution difficult, which is why the output stream per GS pass is limited to 1024 32-bit values.
So calculating the GS for multiple triangles in parallel could be done by having each pass write to its own 4 KiB area in memory, plus storing the number of data elements in each somewhere.
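That 1024-value budget is also what caps [maxvertexcount]: the declared vertex count times the size of the output vertex in 32-bit scalars has to fit within 1024. A rough illustration (the FatVertex layout and shader name are made up for the arithmetic):

```hlsl
struct FatVertex
{
    float4 pos   : SV_Position;   // 4 scalars
    float4 color : COLOR0;        // 4 scalars -> 8 per vertex total
};

// 1024 scalars / 8 scalars per vertex = 128 vertices max.
[maxvertexcount(128)]
void BudgetGS(point FatVertex pt[1],
              inout PointStream<FatVertex> stream)
{
    for (int i = 0; i < 128; ++i)
    {
        FatVertex v = pt[0];
        v.pos.x += i * 0.01f;   // arbitrary offset, just to emit distinct points
        stream.Append(v);
    }
}
```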
Speculation:
Also, what is a vertex shader used for if a geometry shader is present? Would it not make sense to use only a GS, which can already see all 3 verts: perform the desired vertex-shader operations on each vert, then proceed with any tessellation if desired, with the shader repeating the same actions on any additional verts/triangles it generates.
Why do vertex calculations multiple times if you can do them once and store them in the post-transform vertex cache?
A vertex is usually part of multiple triangles, so doing vertex transformations (and other per-vertex stuff) in the GS would mean lots of redundant work.
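So the usual split keeps per-vertex work in the VS, where the post-transform cache can reuse it, and leaves the GS with per-primitive work only. A hedged sketch of that division of labor (the constant buffer and names are invented):

```hlsl
cbuffer Transforms
{
    float4x4 worldViewProj;   // assumed layout, invented for the example
};

struct VSOut
{
    float4 pos : SV_Position;
};

// Runs once per unique vertex; for indexed meshes the result is
// reused from the post-transform cache by every triangle sharing it.
VSOut TransformVS(float3 posOS : POSITION)
{
    VSOut o;
    o.pos = mul(float4(posOS, 1.0f), worldViewProj);
    return o;
}

// The GS then sees already-transformed vertices and only does
// per-primitive work; no matrix multiply is repeated per triangle.
[maxvertexcount(3)]
void PerPrimitiveGS(triangle VSOut tri[3],
                    inout TriangleStream<VSOut> stream)
{
    stream.Append(tri[0]);
    stream.Append(tri[1]);
    stream.Append(tri[2]);
    stream.RestartStrip();
}
```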
I'm guessing that with ATI going unified, they're taking the pixel shaders (the most complete pipelines for unified hardware) and using a bunch of those as unified shaders. Secondly, could their choice of 3 ALUs per pipeline be tied to geometry shaders, which would work on 3 verts at once? It's the only situation I can think of off the top of my head where you have 3 parallel execution paths, each partially dependent on the others.
Geometry shaders do not work on "3 verts at once". They work on one primitive at once. They are not 3 programs running in parallel on one of the vertices each (that would be equivalent to vertex shading), they are a single sequence of operations working on a single primitive.
And I think you should read Dave's article on Xenos.
Considering they have a programmable memory controller, would it be possible to program it to take just the pixel shaders on an X1800/X1900 and have them run DX10-based code straight up? Without the additional format capabilities etc., of course. It seems like the X1900 got all the nice improvements that have obvious benefits for a unified/DX10-style system. I'm just wondering if R580 liking multiples of 3 isn't a coincidence here, because it seems like they would be really efficient at processing 3 vertices in parallel on their pixel shading units.
Xenos is much closer to D3D10 than R580, and it's still not all the way there. R580's pixel shaders do not process 3 elements (whether pixels or vertices) in parallel; its thread size is 48 elements/12 quads.
On top of this, you'd be really close to being able to have one massive shader that took in 3 verts and ended up drawing pixels to the screen by the time it was done. Tessellation could be a matter of dynamic branching if this were the case. Also, could this cut down on having to queue up tasks for the following shaders, if you had a streamout()-type function?
Also, assuming you had vert data describing b-splines, could you keep looping in the GS part of the shader until you had triangles that fit into individual pixels? So theoretically, if you supplied verts with the correct data, you could turn 4 coplanar points into a perfectly formed sphere? This would also open up the possibility for some really whacked-out shaders. Because who says you can't tessellate a point sprite, or turn a single vertex into a perfectly rounded sphere? Or extrude a single face in both directions? Or, if you really want to kill performance, render a localized particle effect from what would technically be completely unrelated geometry that started as a single 3D point.
You could do something like that, but you're limited to 1024 32-bit output values per GS pass.
And single-pixel triangles are horribly inefficient: rasterizers shade 2x2 quads, so a triangle that covers one pixel still pays for four. For that you would really need a new architecture designed with these requirements in mind.
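To make that limit concrete, here is a hedged sketch of a single level of midpoint subdivision in a GS: one triangle in, four out (names invented; this only splits a flat triangle and adds no curvature). Each further level multiplies the output by 4, so looping "until triangles fit into individual pixels" blows through the 1024-scalar budget after just a few levels:

```hlsl
struct Vert
{
    float4 pos : SV_Position;   // 4 scalars per vertex
};

Vert Midpoint(Vert a, Vert b)
{
    Vert m;
    m.pos = (a.pos + b.pos) * 0.5f;
    return m;
}

// One level: 1 triangle in, 4 out. 12 verts * 4 scalars = 48 of the
// 1024-scalar budget; each further level multiplies the output by 4.
[maxvertexcount(12)]
void SubdivideGS(triangle Vert t[3],
                 inout TriangleStream<Vert> stream)
{
    Vert m01 = Midpoint(t[0], t[1]);
    Vert m12 = Midpoint(t[1], t[2]);
    Vert m20 = Midpoint(t[2], t[0]);

    // Three corner triangles, each emitted as its own strip.
    stream.Append(t[0]); stream.Append(m01); stream.Append(m20);
    stream.RestartStrip();
    stream.Append(m01); stream.Append(t[1]); stream.Append(m12);
    stream.RestartStrip();
    stream.Append(m20); stream.Append(m12); stream.Append(t[2]);
    stream.RestartStrip();

    // Center triangle.
    stream.Append(m01); stream.Append(m12); stream.Append(m20);
    stream.RestartStrip();
}
```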