The People Behind DirectX 10: Part 3 - NVIDIA's Tony Tamasi

MDolenc said:
...And is able to output up to 1024 32-bit values.

P.S.: Say your GS outputs just a float4 position; then you'll be able to generate up to 256 vertices. If you add, say, a 2D texture coordinate, you can only generate 170 vertices.

The samples in the SDK lead me to believe that if you want to generate a higher number of vertices per GS run you should stream them out and render them with a second pass.
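A quick check of the numbers in the quote above (a minimal C++ sketch; the 1024-value cap and the two example vertex layouts come from the quote, everything else is illustrative):

```cpp
// Sketch: how many vertices fit into the GS output limit of 1024 32-bit values.
#include <cstdio>

int main() {
    const unsigned kMaxScalars = 1024;          // GS output limit (32-bit values)

    const unsigned posOnly   = 4;               // float4 position
    const unsigned posPlusUV = 4 + 2;           // float4 position + float2 texcoord

    std::printf("position only      : %u vertices\n", kMaxScalars / posOnly);    // 256
    std::printf("position + texcoord: %u vertices\n", kMaxScalars / posPlusUV);  // 170 (integer floor)
    return 0;
}
```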
 
Demirug said:
The samples in the SDK lead me to believe that if you want to generate a higher number of vertices per GS run you should stream them out and render them with a second pass.
Stream out's bandwidth is effectively half of the GS output bandwidth, and the Input Assembler's vertex fetch bandwidth is half of GS output too. Ouch. Sorta makes sense if you think of stream out and vertex fetch running concurrently (different buffers for each).

Clearly streamout amplification is going to gobble GDDR bandwidth. Presumably a stream buffer has to be completely written before it can be used as input by the Input Assembler. But, presumably, a stream buffer can be as small as the data produced for a single input primitive by a single invocation of the GS program.

Thanks for clearing up the 1024x32b thing guys.

Jawed
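For reference, a minimal sketch of the two-pass approach the SDK samples seem to point at, written against the public Direct3D 10 API. The device, the stream-out-enabled geometry shader and the buffers are assumed to exist already; names like pDevice and pStreamOutVB are illustrative. Every amplified vertex gets written out to memory in pass 1 and fetched back by the Input Assembler in pass 2, which is exactly the extra GDDR traffic being discussed.

```cpp
// Sketch of GS amplification via stream out + second pass (Direct3D 10 API).
// Assumes pAmplifyGS was created with CreateGeometryShaderWithStreamOutput and
// pStreamOutVB with D3D10_BIND_VERTEX_BUFFER | D3D10_BIND_STREAM_OUTPUT;
// the input vertex buffer and second-pass shaders are bound elsewhere.
#include <d3d10.h>

void DrawAmplified(ID3D10Device* pDevice,
                   ID3D10GeometryShader* pAmplifyGS,
                   ID3D10Buffer* pStreamOutVB,
                   UINT inputVertexCount,
                   UINT outputStride)
{
    // Pass 1: run the GS and capture its output into pStreamOutVB.
    UINT soOffset = 0;
    pDevice->GSSetShader(pAmplifyGS);
    pDevice->SOSetTargets(1, &pStreamOutVB, &soOffset);
    pDevice->Draw(inputVertexCount, 0);

    // Unbind the stream-out target so it can be read as a vertex buffer.
    ID3D10Buffer* nullSO = nullptr;
    pDevice->SOSetTargets(1, &nullSO, &soOffset);

    // Pass 2: feed the captured vertices back through the Input Assembler.
    UINT vbOffset = 0;
    pDevice->GSSetShader(nullptr);                    // no GS (or a pass-through) this time
    pDevice->IASetVertexBuffers(0, 1, &pStreamOutVB, &outputStride, &vbOffset);
    pDevice->DrawAuto();                              // vertex count comes from the SO statistics
}
```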
 
Demirug said:
The samples in the SDK lead me to believe that if you want to generate a higher number of vertices per GS run you should stream them out and render them with a second pass.

That doesn't sound very promising for the first D3D10 GPUs :rolleyes:
 
Jawed said:
Each primitive can output 1024 32b values (e.g. 256 Vec4 FP32s), which is 4KB. The diagram doesn't agree with the text :???: it indicates only 128 32b values (32x4x32b) for some reason :???:
32x4x32b is probably referring to 32 parameters per vertex with each parameter consisting of 4 components.

Jawed said:
You could deliberately size the PGSC so that it can only absorb, say, 128 32b values (instead of 1024) per primitive. The command processor would then issue "smaller" batches than normal if the PGSC can't hold all the data produced by the GS program (e.g. when it wants to output the full 1024 32b values). I dare say the command processor doesn't know in advance how much data that's going to be, so it might be forced to simply junk the GS-output of some pipes and re-submit those primitives in a following batch. So the overall throughput of primitives is cut (as though some pipes are running empty) as the amount of per-primitive data produced by the GS increases.
The shader compiler has an idea ahead of time of how much data is going to be output, because the shader writer tells it. Along with each GS program you're required to specify a maxvertexcount; the GS is then able to output any amount up to that. So by looking at the maxvertexcount and the number of parameters per vertex, the shader compiler knows the worst-case memory requirement per GS primitive.
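In other words, the worst case is just maxvertexcount times the declared per-vertex output size. A small sketch of that calculation (the HLSL [maxvertexcount()] attribute is real; the particular numbers below are made up for illustration):

```cpp
// Worst-case GS output size the compiler can derive up front.
// Hypothetical example: a GS declared in HLSL as [maxvertexcount(18)]
// emitting vertices of float4 position + float2 texcoord.
#include <cstdio>

int main() {
    const unsigned maxVertexCount   = 18;         // from [maxvertexcount(18)]
    const unsigned scalarsPerVertex = 4 + 2;      // float4 + float2 = 6 x 32-bit values
    const unsigned worstCaseScalars = maxVertexCount * scalarsPerVertex;  // 108
    const unsigned worstCaseBytes   = worstCaseScalars * 4;               // 432 bytes

    std::printf("worst case: %u 32-bit values (%u bytes) per GS invocation\n",
                worstCaseScalars, worstCaseBytes);
    return 0;
}
```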
 
Do all vertices need to be stored in local memory, though? Can't the geometry shader just output the triangles one at a time to the triangle setup unit? The GS thread would obviously have to remain idle while each triangle is processed, but this way you'd only have to store the local GS thread info, which would presumably be less data than storing all of the individual triangles.
 
Vertices can be stored anywhere and even go straight to setup as you suggest. There's a tradeoff between having a lot of memory and processing a lot of primitives in parallel vs. having a small amount of output memory and possibly serializing GS execution at points.
 
3dcgi said:
Vertices can be stored anywhere and even go straight to setup as you suggest. There's a tradeoff between having a lot of memory and processing a lot of primitives in parallel vs. having a small amount of output memory and possibly serializing GS execution at points.
You can still have a lot of parallelism even with a small amount of output memory, as long as the cost for switching between the GS and PS threads is small.
 
3dcgi said:
Vertices can be stored anywhere and even go straight to setup as you suggest. There's a tradeoff between having a lot of memory and processing a lot of primitives in parallel vs. having a small amount of output memory and possibly serializing GS execution at points.
As I woke up this morning I thought: "how is GS-amplified data serialised?"

What's the correct ordering of vertices/primitives, when you have an algorithm that depends upon the ordering of vertices/primitives? How can GS, and/or the post-GS stages before rasterisation, sort this out?

Is amplification of vertices mutually exclusive to algorithms that require ordering?

Jawed
 
3dcgi said:
32x4x32b is probably referring to 32 parameters per vertex with each parameter consisting of 4 components.
Yes I think you're right.

The reason is the symmetry of the streamout (which has half the capacity of GS-out) and IA-vertex-fetch bandwidths (per vertex) indicated on the diagram: they are both 16x4x32b.

Also,

Input Assembler (IA) gathers 1D vertex data from up to 8 input streams attached to vertex buffers and converts data items to a canonical format (e.g., float32). Each stream specifies an independent vertex structure containing up to 16 fields (called elements). An element is a homogeneous tuple of 1 to 4 data items (e.g., float32s).
Jawed
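To illustrate that quoted paragraph, here is a sketch of what such a vertex structure looks like to the D3D10 Input Assembler: three elements out of the allowed 16, each a 1-4 component tuple. The semantics, formats and offsets are purely an example; CreateInputLayout would additionally need the vertex shader's input signature blob.

```cpp
// Sketch: one input stream (slot 0) with three elements, each a 1-4 component
// tuple that the IA converts to a canonical format. Illustrative layout only.
#include <d3d10.h>

static const D3D10_INPUT_ELEMENT_DESC kLayout[] =
{
    // semantic   idx  format (tuple of 1-4 items)   slot  byte offset  class                        step
    { "POSITION", 0,   DXGI_FORMAT_R32G32B32_FLOAT,  0,    0,           D3D10_INPUT_PER_VERTEX_DATA, 0 },
    { "NORMAL",   0,   DXGI_FORMAT_R32G32B32_FLOAT,  0,    12,          D3D10_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 0,   DXGI_FORMAT_R32G32_FLOAT,     0,    24,          D3D10_INPUT_PER_VERTEX_DATA, 0 },
};
// Per the quote, a vertex structure can contain up to 16 such elements, and
// everything is converted to a canonical format (e.g. float32) before the
// vertex shader sees it.
```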
 
Jawed said:
As I woke up this morning I thought: "how is GS-amplified data serialised?"

What's the correct ordering of vertices/primitives, when you have an algorithm that depends upon the ordering of vertices/primitives? How can GS, and/or the post-GS stages before rasterisation, sort this out?
Well, I think that if you allow all pixels to finish for one GS before moving on to the next, it simply becomes a matter of enforcing triangle ordering at the triangle setup level, which would seem to be something that already has to be done.
 
Jawed said:
As I woke up this morning I thought: "how is GS-amplified data serialised?"

If we ever start a thread somewhere with the topic "Things you'll only read at B3D", I'm nominating this one! :LOL:
 
Jawed said:
As I woke up this morning I thought: "how is GS-amplified data serialised?"

What's the correct ordering of vertices/primitives, when you have an algorithm that depends upon the ordering of vertices/primitives? How can GS, and/or the post-GS stages before rasterisation, sort this out?

Is amplification of vertices mutually exclusive to algorithms that require ordering?

Jawed
The GS is invoked once per input primitive, and input primitives are ordered. Each GS invocation may output a stream of primitives, which is ordered as well. So the order of primitives coming out of the GS stage is well-defined.
Maintaining this order with multiple GS instances running in parallel might be done by having an on-chip buffer where each GS instance writes to an area large enough to hold its maximum output. Linked lists are also an option.

Unfortunately, there seems to be no way the application can tell the driver that triangle order does not matter.
There are actually only a few cases where it does matter: framebuffer blending, fragments at exactly the same depth, disabled depth test, and stencil test and writes at the same time (and maybe I forgot one or two). Most of the time there is no difference (or in the case of fragments at the same depth it is an arbitrary decision which one comes out on top).
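A toy software model of the fixed-slot scheme described above (purely hypothetical, not any particular GPU): each GS invocation writes into its own worst-case-sized slot, and a consumer drains the slots strictly in input-primitive order, so parallel execution never changes the output order.

```cpp
// Toy model of ordered GS output reassembly with per-invocation slots.
// Hypothetical illustration of the scheme described above, not real hardware.
#include <cstdio>
#include <vector>

struct Slot {
    std::vector<int> verts;   // whatever the GS emitted for this input primitive
    bool done = false;        // set when the GS invocation finishes
};

int main() {
    const int numPrimitives = 4;
    std::vector<Slot> slots(numPrimitives);   // one worst-case-sized slot per input primitive

    // Pretend the invocations finish out of order (2 first, then 0, 3, 1).
    const int finishOrder[] = { 2, 0, 3, 1 };
    for (int p : finishOrder) {
        slots[p].verts = { p * 10, p * 10 + 1, p * 10 + 2 };  // fake amplified output
        slots[p].done = true;
    }

    // The consumer drains slots strictly in input-primitive order, so the
    // downstream stages always see 0, 1, 2, 3 regardless of completion order.
    for (int p = 0; p < numPrimitives; ++p) {
        if (!slots[p].done) break;            // real hardware would stall here instead
        for (int v : slots[p].verts)
            std::printf("prim %d emits %d\n", p, v);
    }
    return 0;
}
```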
 
Xmas said:
Unfortunately, there seems to be no way the application can tell the driver that triangle order does not matter.
There are actually only a few cases where it does matter: framebuffer blending, fragments at exactly the same depth, disabled depth test, and stencil test and writes at the same time (and maybe I forgot one or two). Most of the time there is no difference (or in the case of fragments at the same depth it is an arbitrary decision which one comes out on top).

A smart driver could detect this and use it for higher overall GS output. With the reduced number of state objects it should be very easy to do this in advance. Another reason why I would prefer a D3D10 version with caps for older hardware (and Windows XP).
 
Demirug said:
A smart driver could detect this and use it for higher overall GS output. With the reduced number of state objects it should be very easy to do this in advance. Another reason why I would prefer a D3D10 version with caps for older hardware (and Windows XP).
Yes, only the "fragments at same depth" case might cause slight problems because you cannot detect it from state objects. But it should only make a minor difference at polygon intersection edges and in cases where you should use a depth bias anyway.
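A sketch of the kind of state-object check being described (a hypothetical driver-side heuristic written against the public D3D10 state descriptions; as noted above, the equal-depth case cannot be caught this way):

```cpp
// Hypothetical heuristic: decide from the bound state descriptions whether
// primitive order could be relaxed. Mirrors the cases listed above; the
// "fragments at exactly the same depth" case is invisible to this check.
#include <d3d10.h>

bool OrderLikelyIrrelevant(const D3D10_BLEND_DESC& blend,
                           const D3D10_DEPTH_STENCIL_DESC& ds)
{
    // Framebuffer blending makes the result depend on draw order.
    for (int i = 0; i < 8; ++i)
        if (blend.BlendEnable[i])
            return false;

    // With the depth test disabled, the last fragment written wins.
    if (!ds.DepthEnable)
        return false;

    // Stencil test combined with stencil writes is order-dependent too.
    if (ds.StencilEnable && ds.StencilWriteMask != 0)
        return false;

    return true;   // none of the detectable order-dependent cases apply
}
```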
 
Xmas said:
Unfortunately, there seems to be no way the application can tell the driver that triangle order does not matter.
There are actually only a few cases where it does matter: framebuffer blending, fragments at exactly the same depth, disabled depth test, and stencil test and writes at the same time (and maybe I forgot one or two). Most of the time there is no difference (or in the case of fragments at the same depth it is an arbitrary decision which one comes out on top).

It's by design. Determinism (including across different pipeline configurations) may be considered essential for building verifiable hardware and software. After all, you could make the same argument for rasterization and fragment processing: why not a render state to not preserve order there, too?
 