Tridam said: Ok, thanks.
It will also see adjacent vertices (-> up to 6 vertices), won't it?
Yes, when you have selected a primitive topology that includes adjacency.
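For reference, the input vertex counts can be tabulated per topology (a Python sketch just to pin down the numbers; the names follow the D3D10_PRIMITIVE_TOPOLOGY enumeration):

```python
# Number of input vertices a D3D10 geometry shader sees per invocation,
# by input primitive topology. The *_ADJ topologies carry adjacency data.
GS_INPUT_VERTICES = {
    "POINTLIST": 1,
    "LINELIST": 2,
    "LINELIST_ADJ": 4,       # line plus its two adjacent vertices
    "TRIANGLELIST": 3,
    "TRIANGLELIST_ADJ": 6,   # triangle plus its three adjacent vertices
}

print(GS_INPUT_VERTICES["TRIANGLELIST_ADJ"])  # 6 - the case discussed above
```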
MDolenc said: ...And is able to output up to 1024 32-bit values.
P.S.: Say your GS outputs just a float4 position; then you'll be able to generate up to 256 vertices. If you add, say, a 2D texture coordinate, then you can only generate 170 vertices.
Streamout's bandwidth is effectively half the GS's output bandwidth, and half of the Input Assembler's bandwidth too. Ouch. It sort of makes sense if you think of stream out and vertex fetch running concurrently (different buffers for each).
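MDolenc's arithmetic is easy to check: divide the 1024-scalar output budget by the number of 32-bit scalars in each output vertex (a quick Python sketch, nothing D3D-specific):

```python
GS_MAX_OUTPUT_SCALARS = 1024  # D3D10 cap on 32-bit values per GS invocation

def max_gs_vertices(scalars_per_vertex: int) -> int:
    """Maximum vertex count a GS can emit, given its output vertex size
    in 32-bit scalars."""
    return GS_MAX_OUTPUT_SCALARS // scalars_per_vertex

print(max_gs_vertices(4))      # float4 position only -> 256
print(max_gs_vertices(4 + 2))  # float4 position + float2 texcoord -> 170
```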
Demirug said: The samples in the SDK lead me to believe that if you want to generate a larger number of vertices per GS run, you should stream them out and render them in a second pass.
Jawed said: each primitive can output 1024 32b values (e.g. 256 Vec4 FP32s), which is 4KB - the diagram doesn't agree with the text indicating only 128 32b values (32x4x32b) for some reason

32x4x32b is probably referring to 32 parameters per vertex, with each parameter consisting of 4 components.
Jawed said: You could deliberately size the PGSC so that it can only absorb, say, 128 32b values (instead of 1024) per primitive. The command processor would then issue "smaller" batches than normal if the PGSC can't hold all the data produced by the GS program (e.g. when it wants to output the full 1024 32b values). I dare say the command processor doesn't know in advance how much data that's going to be, so it might be forced to simply junk the GS output of some pipes and re-submit those primitives in a following batch. So the overall throughput of primitives is cut (as though some pipes are running empty) as the amount of per-primitive data produced by the GS increases.

The shader compiler knows ahead of time how much data is going to be output, because the shader writer tells it: every GS program is required to declare a maxvertexcount, and the GS may then emit any number of vertices up to that limit. So from the maxvertexcount and the number of parameters per vertex, the compiler knows the worst-case memory requirement per GS primitive.
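That worst-case bookkeeping can be sketched in a few lines (illustrative Python; `gs_worst_case_bytes` is a made-up helper, not any real API):

```python
def gs_worst_case_bytes(maxvertexcount: int, scalars_per_vertex: int) -> int:
    """Worst-case GS output per input primitive, as the compiler can bound it:
    the declared maxvertexcount times the output vertex size (32-bit scalars)."""
    scalars = maxvertexcount * scalars_per_vertex
    assert scalars <= 1024, "exceeds the D3D10 per-invocation output limit"
    return scalars * 4  # 4 bytes per 32-bit scalar

# A GS declared with maxvertexcount(256) emitting only float4 positions:
print(gs_worst_case_bytes(256, 4))  # 1024 scalars -> 4096 bytes (the 4KB above)
```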
3dcgi said: Vertices can be stored anywhere and even go straight to setup as you suggest. There's a tradeoff between having a lot of memory and processing a lot of primitives in parallel vs. having a small amount of output memory and possibly serializing GS execution at points.

You can still have a lot of parallelism even with a small amount of output memory, as long as the cost of switching between the GS and PS threads is small.
As I woke up this morning I thought: "how is GS-amplified data serialised?"
3dcgi said: 32x4x32b is probably referring to 32 parameters per vertex with each parameter consisting of 4 components.

Yes, I think you're right.

Jawed

Input Assembler (IA) gathers 1D vertex data from up to 8 input streams attached to vertex buffers and converts data items to a canonical format (e.g., float32). Each stream specifies an independent vertex structure containing up to 16 fields (called elements). An element is a homogenous tuple of 1 to 4 data items (e.g., float32s).
Jawed said: As I woke up this morning I thought: "how is GS-amplified data serialised?"
What's the correct ordering of vertices/primitives, when you have an algorithm that depends upon the ordering of vertices/primitives? How can GS, and/or the post-GS stages before rasterisation, sort this out?

Well, I think that if you allow all pixels to finish for one GS before moving on to the next, it simply becomes a matter of enforcing triangle ordering at the triangle setup level, which would seem to be something that already has to be done.
Jawed said: As I woke up this morning I thought: "how is GS-amplified data serialised?"
What's the correct ordering of vertices/primitives, when you have an algorithm that depends upon the ordering of vertices/primitives? How can GS, and/or the post-GS stages before rasterisation, sort this out?
Is amplification of vertices mutually exclusive to algorithms that require ordering?
Jawed

The GS is invoked once per input primitive, and input primitives are ordered. Each GS invocation may output a stream of primitives, which is ordered as well. So the order of primitives coming from the GS stage is well-defined.
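That ordering guarantee can be modelled in a few lines (a Python sketch; `run_gs` and the doubling GS are hypothetical, purely to illustrate that concatenating the ordered per-invocation streams in input order yields a well-defined overall order):

```python
def run_gs(primitives, gs):
    """Order-preserving GS model: each input primitive (they arrive in order)
    yields an ordered stream of output primitives; the streams are simply
    concatenated, so the overall output order is well-defined."""
    out = []
    for prim in primitives:   # input primitives are ordered
        out.extend(gs(prim))  # each invocation's output stream is ordered
    return out

# Hypothetical amplifying GS: replicate each input triangle twice.
double = lambda tri: [tri + "_a", tri + "_b"]
print(run_gs(["t0", "t1"], double))  # ['t0_a', 't0_b', 't1_a', 't1_b']
```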
Xmas said: Unfortunately, there seems to be no way the application can tell the driver that triangle order does not matter.
There are actually only a few cases where it does matter: framebuffer blending, fragments at exactly the same depth, disabled depth test, and stencil test and writes at the same time (and maybe I forgot one or two). Most of the time there is no difference (or in the case of fragments at the same depth it is an arbitrary decision which one comes out on top).
Demirug said: A smart driver could detect this and use it for higher overall GS output. With the reduced number of state objects it should be very easy to do this in advance. Another reason why I would prefer a D3D10 version with caps for older hardware (and Windows XP).

Yes, only the "fragments at same depth" case might cause slight problems, because you cannot detect it from the state objects. But it should only make a minor difference at polygon intersection edges and in cases where you should use a depth bias anyway.
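The order-dependent cases Xmas lists can be collapsed into a single predicate (an illustrative Python sketch; the flag names are made up, not D3D10 state):

```python
def triangle_order_matters(blending, depth_test_enabled,
                           stencil_test_and_write, equal_depth_possible):
    """Per the cases listed above: framebuffer blending, a disabled depth
    test, simultaneous stencil test and write, or fragments landing at
    exactly the same depth all make submission order observable."""
    return (blending
            or not depth_test_enabled
            or stencil_test_and_write
            or equal_depth_possible)

# Plain opaque rendering with depth test on: order is free to vary.
print(triangle_order_matters(False, True, False, False))  # False
# Alpha blending enabled: order is observable.
print(triangle_order_matters(True, True, False, False))   # True
```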