The People Behind DirectX 10: Part 3 - NVIDIA's Tony Tamasi

MDolenc said:
...And is able to output up to 1024 32-bit values.

P.S.: Say your GS outputs just a float4 position; then you'll be able to generate up to 256 vertices. If you add, say, a 2D texture coordinate, you can only generate 170 vertices.

The samples in the SDK lead me to believe that if you want to generate a higher number of vertices per GS run you should stream them out and render them with a second pass.
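A quick check of the numbers in the quote above (a minimal C++ sketch; the 1024-value cap and the two example vertex layouts come from the quote, everything else is illustrative):

```cpp
// Sketch: how many vertices fit into the GS output limit of 1024 32-bit values.
#include <cstdio>

int main() {
    const unsigned kMaxScalars = 1024;          // GS output limit (32-bit values)

    const unsigned posOnly   = 4;               // float4 position
    const unsigned posPlusUV = 4 + 2;           // float4 position + float2 texcoord

    std::printf("position only      : %u vertices\n", kMaxScalars / posOnly);    // 256
    std::printf("position + texcoord: %u vertices\n", kMaxScalars / posPlusUV);  // 170 (integer floor)
    return 0;
}
```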
 
Demirug said:
The samples in the SDK lead me to believe that if you want to generate a higher number of vertices per GS run you should stream them out and render them with a second pass.
Stream out's bandwidth is effectively half of the GS output bandwidth, and the Input Assembler's vertex fetch bandwidth is half of GS output too. Ouch. Sorta makes sense if you think of stream out and vertex fetch running concurrently (different buffers for each).

Clearly streamout amplification is going to gobble GDDR bandwidth. Presumably a stream buffer has to be completely written before it can be used as input by the Input Assembler. But, presumably, a stream buffer can be as small as the data produced for a single input primitive by a single invocation of the GS program.

Thanks for clearing up the 1024x32b thing guys.

Jawed
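For reference, a minimal sketch of the two-pass approach the SDK samples seem to point at, written against the public Direct3D 10 API. The device, the stream-out-enabled geometry shader and the buffers are assumed to exist already; names like pDevice and pStreamOutVB are illustrative. Every amplified vertex gets written out to memory in pass 1 and fetched back by the Input Assembler in pass 2, which is exactly the extra GDDR traffic being discussed.

```cpp
// Sketch of GS amplification via stream out + second pass (Direct3D 10 API).
// Assumes pAmplifyGS was created with CreateGeometryShaderWithStreamOutput and
// pStreamOutVB with D3D10_BIND_VERTEX_BUFFER | D3D10_BIND_STREAM_OUTPUT;
// the input vertex buffer and second-pass shaders are bound elsewhere.
#include <d3d10.h>

void DrawAmplified(ID3D10Device* pDevice,
                   ID3D10GeometryShader* pAmplifyGS,
                   ID3D10Buffer* pStreamOutVB,
                   UINT inputVertexCount,
                   UINT outputStride)
{
    // Pass 1: run the GS and capture its output into pStreamOutVB.
    UINT soOffset = 0;
    pDevice->GSSetShader(pAmplifyGS);
    pDevice->SOSetTargets(1, &pStreamOutVB, &soOffset);
    pDevice->Draw(inputVertexCount, 0);

    // Unbind the stream-out target so it can be read as a vertex buffer.
    ID3D10Buffer* nullSO = nullptr;
    pDevice->SOSetTargets(1, &nullSO, &soOffset);

    // Pass 2: feed the captured vertices back through the Input Assembler.
    UINT vbOffset = 0;
    pDevice->GSSetShader(nullptr);                    // no GS (or a pass-through) this time
    pDevice->IASetVertexBuffers(0, 1, &pStreamOutVB, &outputStride, &vbOffset);
    pDevice->DrawAuto();                              // vertex count comes from the SO statistics
}
```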
 
Demirug said:
The samples in the SDK lead me to believe that if you want to generate a higher number of vertices per GS run you should stream them out and render them with a second pass.

That doesn't sound very promising for the first D3D10 GPUs :rolleyes:
 
Jawed said:
Each primitive can output 1024 32b values (e.g. 256 Vec4 FP32s), which is 4KB. The diagram doesn't agree with the text :???: it indicates only 128 32b values (32x4x32b) for some reason :???:
32x4x32b is probably referring to 32 parameters per vertex with each parameter consisting of 4 components.

Jawed said:
You could deliberately size the PGSC so that it can only absorb, say, 128 32b values (instead of 1024) per primitive. The command processor would then issue "smaller" batches than normal if the PGSC can't hold all the data produced by the GS program (e.g. when it wants to output the full 1024 32b values). I dare say the command processor doesn't know in advance how much data that's going to be, so it might be forced to simply junk the GS-output of some pipes and re-submit those primitives in a following batch. So the overall throughput of primitives is cut (as though some pipes are running empty) as the amount of per-primitive data produced by the GS increases.
The shader compiler has an idea ahead of time of how much data is going to be output, because the shader writer tells it. Along with each GS program you're required to specify a maxvertexcount; the GS is then able to output any amount up to that. So by looking at the maxvertexcount and the number of parameters per vertex, the shader compiler knows the worst-case memory requirement per GS primitive.
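In other words, the worst case is just maxvertexcount times the declared per-vertex output size. A small sketch of that calculation (the HLSL [maxvertexcount()] attribute is real; the particular numbers below are made up for illustration):

```cpp
// Worst-case GS output size the compiler can derive up front.
// Hypothetical example: a GS declared in HLSL as [maxvertexcount(18)]
// emitting vertices of float4 position + float2 texcoord.
#include <cstdio>

int main() {
    const unsigned maxVertexCount   = 18;         // from [maxvertexcount(18)]
    const unsigned scalarsPerVertex = 4 + 2;      // float4 + float2 = 6 x 32-bit values
    const unsigned worstCaseScalars = maxVertexCount * scalarsPerVertex;  // 108
    const unsigned worstCaseBytes   = worstCaseScalars * 4;               // 432 bytes

    std::printf("worst case: %u 32-bit values (%u bytes) per GS invocation\n",
                worstCaseScalars, worstCaseBytes);
    return 0;
}
```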
 
Do all vertices need to be stored in local memory, though? Can't the geometry shader just output the triangles one at a time to the triangle setup unit? The GS thread would obviously have to remain idle while each triangle is processed, but this way you'd only have to store the local GS thread info, which would presumably be less data than storing all of the individual triangles.
 
Vertices can be stored anywhere and even go straight to setup as you suggest. There's a tradeoff between having a lot of memory and processing a lot of primitives in parallel vs. having a small amount of output memory and possibly serializing GS execution at points.
 
3dcgi said:
Vertices can be stored anywhere and even go straight to setup as you suggest. There's a tradeoff between having a lot of memory and processing a lot of primitives in parallel vs. having a small amount of output memory and possibly serializing GS execution at points.
You can still have a lot of parallelism even with a small amount of output memory, as long as the cost for switching between the GS and PS threads is small.
 
3dcgi said:
Vertices can be stored anywhere and even go straight to setup as you suggest. There's a tradeoff between having a lot of memory and processing a lot of primitives in parallel vs. having a small amount of output memory and possibly serializing GS execution at points.
As I woke up this morning I thought: "how is GS-amplified data serialised?"

What's the correct ordering of vertices/primitives, when you have an algorithm that depends upon the ordering of vertices/primitives? How can GS, and/or the post-GS stages before rasterisation, sort this out?

Is amplification of vertices mutually exclusive to algorithms that require ordering?

Jawed
 
3dcgi said:
32x4x32b is probably referring to 32 parameters per vertex with each parameter consisting of 4 components.
Yes I think you're right.

The reason is the symmetry of the streamout (which has half the capacity of GS-out) and IA-vertex-fetch bandwidths (per vertex) indicated on the diagram: they are both 16x4x32b.

Also,

Input Assembler (IA) gathers 1D vertex data from up to 8 input streams attached to vertex buffers and converts data items to a canonical format (e.g., float32). Each stream specifies an independent vertex structure containing up to 16 fields (called elements). An element is a homogeneous tuple of 1 to 4 data items (e.g., float32s).
Jawed
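To illustrate that quoted paragraph, here is a sketch of what such a vertex structure looks like to the D3D10 Input Assembler: three elements out of the allowed 16, each a 1-4 component tuple. The semantics, formats and offsets are purely an example; CreateInputLayout would additionally need the vertex shader's input signature blob.

```cpp
// Sketch: one input stream (slot 0) with three elements, each a 1-4 component
// tuple that the IA converts to a canonical format. Illustrative layout only.
#include <d3d10.h>

static const D3D10_INPUT_ELEMENT_DESC kLayout[] =
{
    // semantic   idx  format (tuple of 1-4 items)   slot  byte offset  class                        step
    { "POSITION", 0,   DXGI_FORMAT_R32G32B32_FLOAT,  0,    0,           D3D10_INPUT_PER_VERTEX_DATA, 0 },
    { "NORMAL",   0,   DXGI_FORMAT_R32G32B32_FLOAT,  0,    12,          D3D10_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 0,   DXGI_FORMAT_R32G32_FLOAT,     0,    24,          D3D10_INPUT_PER_VERTEX_DATA, 0 },
};
// Per the quote, a vertex structure can contain up to 16 such elements, and
// everything is converted to a canonical format (e.g. float32) before the
// vertex shader sees it.
```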
 
Jawed said:
As I woke up this morning I thought: "how is GS-amplified data serialised?"

What's the correct ordering of vertices/primitives, when you have an algorithm that depends upon the ordering of vertices/primitives? How can GS, and/or the post-GS stages before rasterisation, sort this out?
Well, I think that if you allow all pixels to finish for one GS before moving on to the next, it simply becomes a matter of enforcing triangle ordering at the triangle setup level, which would seem to be something that already has to be done.
 
Jawed said:
As I woke up this morning I thought: "how is GS-amplified data serialised?"

If we ever start a thread somewhere with the topic "Things you'll only read at B3D", I'm nominating this one! :LOL:
 
Jawed said:
As I woke up this morning I thought: "how is GS-amplified data serialised?"

What's the correct ordering of vertices/primitives, when you have an algorithm that depends upon the ordering of vertices/primitives? How can GS, and/or the post-GS stages before rasterisation, sort this out?

Is amplification of vertices mutually exclusive to algorithms that require ordering?

Jawed
The GS is invoked once per input primitive, and input primitives are ordered. Each GS invocation may output a stream of primitives, which is ordered as well. So the order of primitives coming out of the GS stage is well-defined.
Maintaining this order with multiple GS instances running in parallel might be done by having an on-chip buffer where each GS instance writes to an area large enough to hold its maximum output. Linked lists are also an option.

Unfortunately, there seems to be no way the application can tell the driver that triangle order does not matter.
There are actually only a few cases where it does matter: framebuffer blending, fragments at exactly the same depth, disabled depth test, and stencil test and writes at the same time (and maybe I forgot one or two). Most of the time there is no difference (or in the case of fragments at the same depth it is an arbitrary decision which one comes out on top).
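A toy software model of the fixed-slot scheme described above (purely hypothetical, not any particular GPU): each GS invocation writes into its own worst-case-sized slot, and a consumer drains the slots strictly in input-primitive order, so parallel execution never changes the output order.

```cpp
// Toy model of ordered GS output reassembly with per-invocation slots.
// Hypothetical illustration of the scheme described above, not real hardware.
#include <cstdio>
#include <vector>

struct Slot {
    std::vector<int> verts;   // whatever the GS emitted for this input primitive
    bool done = false;        // set when the GS invocation finishes
};

int main() {
    const int numPrimitives = 4;
    std::vector<Slot> slots(numPrimitives);   // one worst-case-sized slot per input primitive

    // Pretend the invocations finish out of order (2 first, then 0, 3, 1).
    const int finishOrder[] = { 2, 0, 3, 1 };
    for (int p : finishOrder) {
        slots[p].verts = { p * 10, p * 10 + 1, p * 10 + 2 };  // fake amplified output
        slots[p].done = true;
    }

    // The consumer drains slots strictly in input-primitive order, so the
    // downstream stages always see 0, 1, 2, 3 regardless of completion order.
    for (int p = 0; p < numPrimitives; ++p) {
        if (!slots[p].done) break;            // real hardware would stall here instead
        for (int v : slots[p].verts)
            std::printf("prim %d emits %d\n", p, v);
    }
    return 0;
}
```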
 
Xmas said:
Unfortunately, there seems to be no way the application can tell the driver that triangle order does not matter.
There are actually only a few cases where it does matter: framebuffer blending, fragments at exactly the same depth, disabled depth test, and stencil test and writes at the same time (and maybe I forgot one or two). Most of the time there is no difference (or in the case of fragments at the same depth it is an arbitrary decision which one comes out on top).

A smart driver could detect this and use it for higher overall GS output. With the reduced number of state objects it should be very easy to do this in advance. Another reason why I would prefer a D3D10 version with caps for older hardware (and Windows XP).
 
Demirug said:
A smart driver could detect this and use it for higher overall GS output. With the reduced number of state objects it should be very easy to do this in advance. Another reason why I would prefer a D3D10 version with caps for older hardware (and Windows XP).
Yes, only the "fragments at same depth" case might cause slight problems because you cannot detect it from state objects. But it should only make a minor difference at polygon intersection edges and in cases where you should use a depth bias anyway.
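A sketch of the kind of state-object check being described (a hypothetical driver-side heuristic written against the public D3D10 state descriptions; as noted above, the equal-depth case cannot be caught this way):

```cpp
// Hypothetical heuristic: decide from the bound state descriptions whether
// primitive order could be relaxed. Mirrors the cases listed above; the
// "fragments at exactly the same depth" case is invisible to this check.
#include <d3d10.h>

bool OrderLikelyIrrelevant(const D3D10_BLEND_DESC& blend,
                           const D3D10_DEPTH_STENCIL_DESC& ds)
{
    // Framebuffer blending makes the result depend on draw order.
    for (int i = 0; i < 8; ++i)
        if (blend.BlendEnable[i])
            return false;

    // With the depth test disabled, the last fragment written wins.
    if (!ds.DepthEnable)
        return false;

    // Stencil test combined with stencil writes is order-dependent too.
    if (ds.StencilEnable && ds.StencilWriteMask != 0)
        return false;

    return true;   // none of the detectable order-dependent cases apply
}
```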
 
Xmas said:
Unfortunately, there seems to be no way the application can tell the driver that triangle order does not matter.
There are actually only a few cases where it does matter: framebuffer blending, fragments at exactly the same depth, disabled depth test, and stencil test and writes at the same time (and maybe I forgot one or two). Most of the time there is no difference (or in the case of fragments at the same depth it is an arbitrary decision which one comes out on top).

It's by design. Determinism (including across different pipeline configurations) may be considered essential for building verifiable hardware and software. After all, you could make the same argument for rasterization and fragment processing: why not a render state to not preserve order there, too?
 