DX10 GS shader

nobond · Jul 21, 2007

The shader output will be put into a buffer, which will be fed back to the vertex shader.

I assume there must be a limit on the size of the GS buffer? Are there any real numbers for current hardware?

Humus · Jul 21, 2007

I don't know the actual hardware limit, but the biggest buffer you can create that's guaranteed to work is 128MB. If you attempt to create a larger buffer than that, you get this surprisingly verbose response from the runtime:

D3D10: INFO: ID3D10Device::CreateBuffer: Note that the resource allocation (135266304 bytes plus overhead) would use more than 128 MB of application usable memory. This is fine; D3D10 allows attempts to make allocations above 128 MB in the event that they may happen to succeed, however this usage is subject to hardware specific failure. D3D10 only guarantees that allocations within 128 MB are supported by all D3D10 hardware. Here, failure may only happen if the system runs out of resources. Allocations above 128 MB may fail for a couple of reasons, not only because the system is overextended, but also if the particular hardware being used does not support it. There is intentionally no supported way to report individual hardware limits on allocation sizes above 128 MB. [ STATE_CREATION INFO #73: CREATEBUFFER_LARGEALLOCATION ]
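For what it's worth, the arithmetic behind that message can be sketched in a few lines of plain C++. This is only an illustration, not part of the D3D10 API: the helper names `exceedsGuaranteedSize` and `bytesOverGuarantee` are made up here, and the 128 MB figure is taken from the runtime message above, read as 128 * 1024 * 1024 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Allocation size the runtime message says every D3D10 device must support.
// Assumption: "128 MB" means 128 * 1024 * 1024 bytes.
constexpr std::uint64_t kGuaranteedBytes = 128ull * 1024 * 1024;

// Hypothetical helper: true when a requested buffer size is above the
// guaranteed range, so creation may fail on some hardware.
constexpr bool exceedsGuaranteedSize(std::uint64_t byteWidth) {
    return byteWidth > kGuaranteedBytes;
}

// Hypothetical helper: how far an oversized request overshoots the guarantee.
// Only meaningful for requests that actually exceed it.
inline std::uint64_t bytesOverGuarantee(std::uint64_t byteWidth) {
    assert(exceedsGuaranteedSize(byteWidth)); // caller should check first
    return byteWidth - kGuaranteedBytes;
}
```

With the 135266304-byte request from the message, `exceedsGuaranteedSize` returns true and `bytesOverGuarantee` gives 1048576, so the request is exactly 1 MB past the guaranteed size.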

JHoxley · Jul 26, 2007

If you're referring to using Stream Out ('SO') then I'm pretty sure the limit is 4096 bytes - that is, 1024 float elements. It's also limited to 16 float4 components per vertex.

I had a look in the specs, but couldn't find it on a quick search - so I could be wrong!

hth
Jack

Humus · Jul 26, 2007

I think you're talking about the GS, which can output at most 1024 floats per shader invocation, but the StreamOut buffer can most certainly be larger than that. StreamOut would be useless if you couldn't output more than 4KB of data.
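To make the per-invocation budget concrete, here is a rough sketch in plain C++ of how the 1024-float limit translates into a vertex count. The helper name `maxVerticesPerInvocation` is hypothetical, and the sketch assumes the limit applies to the total number of 32-bit values emitted by one invocation (vertex count times values per vertex).

```cpp
#include <cassert>
#include <cstdint>

// Per-invocation GS output limit quoted in this thread, in 32-bit values.
constexpr std::uint32_t kMaxGsOutputFloats = 1024;

// The same limit in bytes, assuming 4-byte floats: 1024 * 4 = 4096.
constexpr std::uint32_t kMaxGsOutputBytes = kMaxGsOutputFloats * 4u;

// Hypothetical helper: the most vertices one invocation could emit when
// each vertex carries floatsPerVertex 32-bit values.
inline std::uint32_t maxVerticesPerInvocation(std::uint32_t floatsPerVertex) {
    assert(floatsPerVertex > 0 && floatsPerVertex <= kMaxGsOutputFloats);
    return kMaxGsOutputFloats / floatsPerVertex;
}
```

Under that assumption, a vertex that is just a float4 position allows 256 vertices per invocation, while a vertex using 16 float4 values (64 floats) allows only 16. The byte figure also shows where the 4KB number comes from.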

Zengar · Jul 27, 2007

The API limits are set to 1024 floats for one shader invocation, as JHoxley and Humus mentioned. On Nvidia hardware, one should keep the GS output as small as possible for optimal performance. I am not sure about ATI but I think their hardware can handle larger GS outputs better.

JHoxley · Jul 27, 2007

Humus said:
I think you're talking about the GS, which can output at most 1024 floats per shader invocation, but the StreamOut buffer can most certainly be larger than that. StreamOut would be useless if you couldn't output more than 4KB of data.

Yes, I definitely agree that they're useless with such limits... but I know for definite that I saw some information that scrapped one of my applications from the outset. I wanted to do some GPU-accelerated mesh pre-processing, but something about dividing it up to fit into the output buffer sizes meant it just wasn't worth the effort. For the life of me I can't remember where I saw these details - I was sure it was in the 10.0 spec but I can't find it anymore.

It also does feed into a fact I've heard SamG repeat several times - the GS is more an "informational" unit than a tessellator. It's supposed to give the programmable pipeline topological level information with a bit of amplification/culling capability as a secondary feature.

Jack

3dcgi · Jul 28, 2007

JHoxley said:
It also does feed into a fact I've heard SamG repeat several times - the GS is more an "informational" unit than a tessellator. It's supposed to give the programmable pipeline topological level information with a bit of amplification/culling capability as a secondary feature.

If Microsoft only limited the number of verts capable of being emitted and allowed each to have the full number of parameters it'd probably be a fine tessellator.

Humus · Jul 28, 2007

Zengar said:
The API limits are set to 1024 floats for one shader invocation, as JHoxley and Humus mentioned. On Nvidia hardware, one should keep the GS output as small as possible for optimal performance. I am not sure about ATI but I think their hardware can handle larger GS outputs better.

Yes, we don't have any exponential performance falloff. It scales linearly. Double the output essentially means double the processing time. If you can use amplification to reduce the number of passes then it's recommended that you do so. It's better to do, for instance, 4 passes amplifying 16x than 16 passes amplifying 4x.

JHoxley said:
Yes, I definitely agree that they're useless with such limits... but I know for definite that I saw some information that scrapped one of my applications from the outset. I wanted to do some GPU-accelerated mesh pre-processing, but something about dividing it up to fit into the output buffer sizes meant it just wasn't worth the effort. For the life of me I can't remember where I saw these details - I was sure it was in the 10.0 spec but I can't find it anymore.

Well, each shader invocation can only output 1024 floats. This is a limitation that I've bumped into myself too. But you can of course render loads of vertices where each outputs 1024 floats, so the total output can be huge.
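As a back-of-the-envelope sketch of how the total adds up, the following plain C++ multiplies the per-invocation limit by the number of invocations. The function names are hypothetical, and the sketch assumes every invocation writes its full 1024 floats (4 bytes each) to the stream-out buffer.

```cpp
#include <cassert>
#include <cstdint>

// Bytes written by one invocation that emits the full 1024 floats.
constexpr std::uint64_t kBytesPerFullInvocation = 1024ull * 4;

// Buffer size guaranteed to be creatable on all D3D10 hardware (128 MB).
constexpr std::uint64_t kGuaranteedBufferBytes = 128ull * 1024 * 1024;

// Hypothetical helper: total stream-out bytes for a number of invocations.
inline std::uint64_t totalOutputBytes(std::uint64_t invocations) {
    assert(invocations <= UINT64_MAX / kBytesPerFullInvocation); // no overflow
    return invocations * kBytesPerFullInvocation;
}

// Hypothetical helper: how many full-size invocations fit in a buffer.
inline std::uint64_t fullInvocationsThatFit(std::uint64_t bufferBytes) {
    return bufferBytes / kBytesPerFullInvocation;
}
```

Under these assumptions a single draw with 32768 input primitives, each emitting the maximum, would already fill a 128 MB buffer, so the 4KB figure bounds one invocation rather than the buffer.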

JHoxley said:
It also does feed into a fact I've heard SamG repeat several times - the GS is more an "informational" unit than a tessellator. It's supposed to give the programmable pipeline topological level information with a bit of amplification/culling capability as a secondary feature.

I'd say amplification/culling is the primary feature, and the topological information secondary. But that's of course subjective. There are certainly good uses of both.
