Geometry Shader - what's the difference from VS?

Doesn't Xenos reuse its vertex/pixel shader ALUs for HOS, tessellation and such? If so, it seems geometry shading, or at least a subset, is possible given a flexible enough architecture with a pool of generalized ALUs.
Firstly, I'm not on the XB360 program (I wish!) but it's worth considering that the hardware and the software must have been finalised quite some time ago (standards-wise, even if the actual hardware wasn't there). Possibly even years. Given everything I've heard about the 360, it's just a turbo-charged DX9/SM3 part - some fancy stuff, but fundamentally still D3D9.

I interpret the GS in the SM4 pipeline as a multi-pass operation
I don't see why it *must* be a multi-pass operation... Sure, some awesomely cool stuff can be done by a form of feedback via stream output, but it's not required. Given how GPUs are essentially massively parallel stream processors, it'll be interesting to see if stream output cripples first-generation DX10 hardware.
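For what it's worth, here's a minimal sketch (D3D10-style HLSL; the names and the VSOut layout are my own, purely illustrative) of a GS that does no feedback at all - one triangle in, one triangle out, data flowing straight on to the rasteriser:

Code:
struct VSOut
{
    float4 pos : SV_Position;
    float4 col : COLOR0;
};

// Pass-through GS: one triangle in, one triangle out. No stream
// output, no multi-pass - the data just flows forward to the PS.
[maxvertexcount(3)]
void PassThroughGS(triangle VSOut input[3],
                   inout TriangleStream<VSOut> stream)
{
    for (int i = 0; i < 3; i++)
        stream.Append(input[i]);
    stream.RestartStrip();
}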

there are other diagrams around which show GS before VS
They are incorrect. Simple as that. The official/public information shows the basic pipeline overview... and the VS sets up vertices before the GS.

when I first saw the geometry shader mentioned I got the impression that this is more-or-less a fixed-function tessellator.
At a simple level, the GS could just be a tessellator for HOS; but in practice that's not necessarily likely... I'd imagine more complex and dynamic animations are much more likely. We've seen what happened with HOS when it was an "amendment" to the FFP, and how successful that was - we could write shaders to work around its limitations, but that's just the beginning.
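To make the tessellator idea concrete, here's a rough sketch (again D3D10-style HLSL, invented by me for illustration) of the crudest possible GS "tessellation" - one level of midpoint subdivision, each triangle becoming four. A real HOS scheme would evaluate an actual surface rather than lerping midpoints:

Code:
struct VSOut
{
    float4 pos : SV_Position;
};

VSOut midpoint(VSOut a, VSOut b)
{
    VSOut m;
    m.pos = 0.5 * (a.pos + b.pos);
    return m;
}

// One level of subdivision: each input triangle becomes four.
[maxvertexcount(12)]
void SubdivideGS(triangle VSOut v[3],
                 inout TriangleStream<VSOut> stream)
{
    VSOut m01 = midpoint(v[0], v[1]);
    VSOut m12 = midpoint(v[1], v[2]);
    VSOut m20 = midpoint(v[2], v[0]);

    // Three corner triangles...
    stream.Append(v[0]); stream.Append(m01); stream.Append(m20);
    stream.RestartStrip();
    stream.Append(m01); stream.Append(v[1]); stream.Append(m12);
    stream.RestartStrip();
    stream.Append(m20); stream.Append(m12); stream.Append(v[2]);
    stream.RestartStrip();
    // ...plus the centre triangle.
    stream.Append(m01); stream.Append(m12); stream.Append(m20);
    stream.RestartStrip();
}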

Also, I don't understand how occlusion queries fit in here - I think they are linked to the GS (working as virtual viewports, I think) but I dunno.
You need to look into predicated rendering. I believe that's mentioned in the 'Direct3D part 2' PDC slides.

I dare say geometry instancing is the closest SM3 came to having a GS.
What is (was?) geometry instancing has been consumed entirely by the IA stage of the pipeline. Look at the PDC slides for more details and think about it. The GS gets fed whole triangles with the 1-ring vertices as a form of adjacency, which implies that the triangle(s) have been built *before* the GS stage.
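In (illustrative, D3D10-style) HLSL the adjacency declaration might look like this - six vertices arrive per primitive, the triangle itself at the even indices and its 1-ring neighbours at the odd ones, already assembled before the GS runs:

Code:
struct VSOut
{
    float4 pos : SV_Position;
};

[maxvertexcount(3)]
void AdjacencyGS(triangleadj VSOut v[6],
                 inout TriangleStream<VSOut> stream)
{
    // v[0], v[2], v[4] are the triangle's own vertices;
    // v[1], v[3], v[5] are the adjacent (1-ring) vertices.
    stream.Append(v[0]);
    stream.Append(v[2]);
    stream.Append(v[4]);
    stream.RestartStrip();
}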

Is memexport equivalent to streamout? Scatter writes are not stream writes. memexport doesn't appear in any of the publicly talked-about WGF2.0 features.
I'm curious as to what this "memexport" is.

hth
Jack
 
Jawed said:
memexport can only be used during vertex shading - though I can't remember where I read this.

The base address per vertex is, effectively, generated by a "malloc()" type operation in the first instruction in that sample (erm, that's what it looks like to me - some random address). Quite how that's communicated forwards into another pass, for reading back in, I dunno. Lots of missing detail...

Jawed
I would be very interested if you could dig up any references on this. I'm confused!

Also, as a general comment... Why are you dragging XB360 into this? Historically consoles have taken a "base" level and then customized it. Take the original XBox - it's primarily SM1 with a few extra trimmings. Related to the comment in my previous post, D3D10 is relatively new as far as the XB360 timeline goes... such that there might be parallels, but just because D3D10 is close and XB360 is released doesn't necessitate a connection :)

Cheers,
Jack
 
The thing that bugs me is why is Stream Output a component of the DX10 pipeline if not to support multi-pass vertex/primitive shading?

When (where) else is data that's streamed out going to be used?

I know that ATI talks about Xenos being capable of helping do physics, and stream out being a key component of that. But I suppose I interpret that as merely an overloading of the basic stream output concept.
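The obvious other consumer is the GPU itself on a later pass - iterative simulation being the textbook case. A hedged sketch (D3D10-style HLSL, everything named by me): a GS that integrates particle motion and streams the result out, with the stream-output buffer rebound by the application as the vertex buffer for the next iteration - a GPU-side loop with no CPU round-trip and no rasterisation at all:

Code:
struct Particle
{
    float3 pos : POSITION;
    float3 vel : VELOCITY;
};

// "Physics" pass: update each particle and stream it back out. The
// application binds the GS output to a stream-output buffer, which
// becomes the vertex buffer for the next iteration.
[maxvertexcount(1)]
void ParticleGS(point Particle p[1],
                inout PointStream<Particle> stream)
{
    const float dt = 1.0 / 60.0;                  // assumed fixed timestep
    Particle o;
    o.vel = p[0].vel + float3(0, -9.8f, 0) * dt;  // gravity only
    o.pos = p[0].pos + o.vel * dt;
    stream.Append(o);
}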

Jawed
 
JHoxley said:
I would be very interested if you could dig up any references on this. I'm confused!
Erm, it's on my computer, an XFest document - just can't remember which one

Also, as a general comment... Why are you dragging XB360 into this?
See my other post (just now) about the meaning of "stream output". Dave has indicated (rather vaguely) that memexport isn't stream output. I only bring it up because they seem awfully similar to each other, even if they're not the same.

Historically consoles have taken a "base" level and then customized it. Take the original XBox - it's primarily SM1 with a few extra trimmings. Related to the comment in my previous post, D3D10 is relatively new as far as the XB360 timeline goes... such that there might be parallels, but just because D3D10 is close and XB360 is released doesn't necessitate a connection :)
Generally I'd agree, but I suppose the rumours of R400 (unified architecture) dating back years and being the spiritual father of Xenos make me think ATI has been thinking about geometry shading (implemented on top of the unified shader architecture) for quite a long time.

Me just jumping to conclusions :neutral: sometimes I find it helps to generate some frisson in the people who really know the answers, to egg them into actually revealing what they understand...

Jawed

EDIT2: "stream out", not "stream output", but I'm too lazy to change them
 
Jawed said:
The thing that bugs me is why is Stream Output a component of the DX10 pipeline if not to support multi-pass vertex/primitive shading?
Yes, you would be correct. Stream output is there for exactly the purpose you describe. What I picked up on was the "exclusive" multi-pass nature. Might just be a misunderstanding though... :) The GS can do a feedback or it can pass data on.

I lurk in these forums more than I post, so I know you guys are pretty smart... Thus I think you'll appreciate that a stream processor (which is what I imagine the majority, if not all, of current GPUs are) might find it hard to do a feedback loop - requiring a revolution rather than an evolution in current hardware design.

hth
Jack
 
Jawed said:
I suppose the rumours of R400 (unified architecture) dating back years and being the spiritual father of Xenos make me think ATI has been thinking about geometry shading (implemented on top of the unified shader architecture) for quite a long time.
Yup, you wouldn't surprise me there... I'd be amazed if either NV's or ATI's R&D departments aren't several years ahead of what we're discussing here. Simple as that!

Jawed said:
Me just jumping to conclusions :neutral: sometimes I find it helps to generate some frisson in the people who really know the answers, to egg them into actually revealing what they understand...
Haha, nice try ;-) but in all fairness we're not going to break NDA for a public forum. Might happen every now and then... but for the most part I don't think I (or others) fancy being hunted down by the Microsoft legal department :)

Jack
 
Meanwhile I'm swimming in XFest docs that I don't understand very much of :cry:

Just discovering that with Sequencer programming, devs can control the way Xenos switches batches (sigh, I should get into the habit of calling them threads) to hide texture or vertex fetch latencies. In other words it seems that devs can configure their own pre-fetching to fine-tune the performance of a shader :cool:

But that's OT :LOL:

Jawed
 
JHoxley said:
Thus I think you'll appreciate that a stream processor (which is what I imagine the majority, if not all, of current GPUs are) might find it hard to do a feedback loop - requiring a revolution rather than an evolution in current hardware design.
That's why I like Xenos so much.

Predicated tiling is the perfect example of a feedback loop:

http://www.beyond3d.com/forum/showpost.php?p=627728&postcount=202

So I think the revolution is well and truly under way.

I dare say the extents-data export, generated by the initial pass for predicated tiling (to a buffer in main memory) is much like memexport.

Jawed
 
Making the most of streaming would require a lot more than Xenos. Ideally, before you are finished writing the stream, you want the next consumer of the stream to start dequeuing in a pipelined fashion.

The "evolutionary" path is to write out the entire stream to memory. Then, start reading from the beginning of the stream when you loop back after the last data item is written in the previous pass. But this is inefficient, since the next stage in your renderer (say, PS) can't do anything until all of the recursive passes on the stream is done. The more recursion, the longer the later stages are stalled.

The "revolutionary" path is that as data is put into the stream, it is available for use by other shaders that are waiting for it. But this would probably require a much more complex design than the current WGF2.0 parts.
 
DemoCoder said:
The "evolutionary" path is to write out the entire stream to memory. Then, start reading from the beginning of the stream when you loop back after the last data item is written in the previous pass. But this is inefficient, since the next stage in your renderer (say, PS) can't do anything until all of the recursive passes on the stream is done. The more recursion, the longer the later stages are stalled.
True - except that while the PS stage is "stalled", the majority of the GPU resources that the PS stage would be using (i.e. the shader pipes) are being used for vertex shading anyway. But yes, the texture pipes are potentially idle, and PS-specific functional blocks such as the rasteriser are definitely going to be idle.

But it's not uncommon, right now, for some PS hardware to lie idle while a GPU does stencil shadowing, for example.

The "revolutionary" path is that as data is put into the stream, it is available for use by other shaders that are waiting for it. But this would probably require a much more complex design than the current WGF2.0 parts.
Logically this is a question of load-balancing - how many threads can the GPU maintain at any one time - and the general principle of thread FIFO in the Xenos sequencer.

If, for example, the tessellated data set is small, then logically Xenos should indeed be able to start eating the buffer before it's finished writing it. But Xenos can only maintain 31 vertex threads (each of 64 vertices - roughly 2,000 vertices in flight), so I guess that window is vanishingly small.

Having said all that, it seems it'll be normal for some of the multiple rendering passes in 3D engines to leave the PS-specific/focussed hardware idle. So calling out a technicality of load-balanced streaming over a limitation that would exist even without a feedback loop isn't particularly useful.

In the end a GPU isn't a monolithic streaming processor, but a group of individually specialised streaming processors that mostly stream forwards. The fact that there are buffers/caches/registers between many of the streaming processors in the GPU pipeline indicates the stream is anything but continuous or smooth. Indeed, a unified shader architecture relies upon that fact.

Jawed
 
Jawed said:
True - except that while the PS stage is "stalled", the majority of the GPU resources that the PS stage would be using (i.e. the shader pipes) are being used for vertex shading anyway. But yes, the texture pipes are potentially idle, and PS-specific functional blocks such as the rasteriser are definitely going to be idle.

The stall isn't just during the processing, but also the bubble left between the time you write the last byte to the stream and the time the next stage can start reading. This could be several hundred cycles if the stream is stored in video memory.


Logically this is a question of load-balancing - how many threads can the GPU maintain at any one time - and the general principle of thread FIFO in the Xenos sequencer.

It's also a question of what's stalling you. You want extra threads to hide blocked/waiting I/O or other processing. The problem with a streamout approach that writes the entire stream to memory before commencing the processing of that stream is not only that some of your other units could be starved and idle, but also that as your schedule switches from production to consumption, there is a huge delay. This is especially true of today's multipass techniques, where you must go back to the CPU before the next pass.
 
DemoCoder said:
The problem with a streamout approach that writes the entire stream to memory before commencing the processing of that stream is not only that some of your other units could be starved and idle, but also that as your schedule switches from production to consumption, there is a huge delay. This is especially true of today's multipass techniques, where you must go back to the CPU before the next pass.
I think batching in the vertex/primitive stream makes this argument redundant.

Also, there's no need to have the GPU halt between loops. If the output is a stream, then there's nothing logically preventing the GPU feeding off the head while the tail's being written.

Why should the entire stream have to be finished before it can be read? What am I missing?

Jawed
 
I didn't say there is anything logically to prevent it, but I doubt first gen WGF2.0 parts will support this. That was my "revolutionary vs evolutionary" point.

Streamout is not an on-chip FIFO, Jawed, it is writing to external memory. That means the GPU, for each stream, must keep track of how many bytes have been fetched so far and how many have been written. If it tries to read past the tail, it needs to block. This information already exists in some form, but using it - coupled with an advanced memory controller that needs to prefetch data, and possibly reordering and combining memory writes - is more complex than you think.

That's why I think the first implementations of streamout will be just like multipass today (finish writing stream before consuming), except that rather than the CPU having to be involved, it will be driven by the GS.
 
DemoCoder said:
I didn't say there is anything logically to prevent it, but I doubt first gen WGF2.0 parts will support this. That was my "revolutionary vs evolutionary" point.
OK

Streamout is not an on-chip FIFO, Jawed, it is writing to external memory.
I'm not suggesting it is :oops:

That means the GPU, for each stream, must keep track of how many bytes have been fetched so far and how many have been written. If it tries to read past the tail, it needs to block. This information already exists in some form, but using it - coupled with an advanced memory controller that needs to prefetch data, and possibly reordering and combining memory writes - is more complex than you think.

Having just scratched the surface of the Sequencer in Xenos, I kinda suspect all this stuff is in scope. Anyway, it's very early days in the documentation I've got. I've found one slide on tessellation so far:

Code:
float4 interpolate(float4 a0, float4 a1, float4 a2, float3 b)
{
    // Weight the three parent-triangle attributes by the barycentric coords
    return a0 * b.z + a1 * b.y + a2 * b.x;
}

float4 main(int3 index : INDEX, float3 b : BARYCENTRIC) : POSITION
{
    float4 pos0, pos1, pos2;
    // Manually fetch the three parent-triangle positions by index
    asm {
        vfetch pos0, index.x, position
        vfetch pos1, index.y, position
        vfetch pos2, index.z, position
    };
    return interpolate(pos0, pos1, pos2, b);
}

Also, the GPU reads the CPU generated vertex stream via a fenced ring buffer (i.e. with head and tail "pointers" for each batch in the stream) - so it doesn't seem like a stretch to expect that stream out would use the same mechanism, with the ring-buffer limiting the ultimate size of stream batches, or the total amount of streamed-out data before the GPU starts consuming it again.

But, obviously, I'm far out on my own here.

Jawed
 
JHoxley said:
Yup, you wouldn't surprise me there... I'd be amazed if either NV's or ATI's R&D departments aren't several years ahead of what we're discussing here. Simple as that!


Oh, I very much agree. There is no doubt that ATI is R&Ding the R800 generation and beyond, including the GPU for the Xbox360 successor.

Likewise, Nvidia is R&Ding at least all the way out to NV70, as well as working with Sony on R&D for the PS4 GPU.

I hope both companies are looking into providing something that will do what raytracing does but at far less computational cost, as well as global illumination and all of those presently-impossible rendering techniques.
 
DemoCoder said:
Making the most of streaming would require a lot more than Xenos. Ideally, before you are finished writing the stream, you want the next consumer of the stream to start dequeuing in a pipelined fashion.

The "evolutionary" path is to write out the entire stream to memory. Then, start reading from the beginning of the stream when you loop back after the last data item is written in the previous pass. But this is inefficient, since the next stage in your renderer (say, PS) can't do anything until all of the recursive passes on the stream is done. The more recursion, the longer the later stages are stalled.

The "revolutionary" path is that as data is put into the stream, it is available for use by other shaders that are waiting for it. But this would probably require a much more complex design than the current WGF2.0 parts.

Depends on the exact programming model. If the requirement is there that the stream structure must be pre-allocated, it is perfectly possible to pre-allocate it with a null value. When the secondary stream program reads the null value, it can temporarily suspend until the value is written. This can be done on evolutionary hardware without much work. In general, the fault on the read of the null value looks pretty much like a long texture read - the micro-thread running the secondary stream would simply suspend for a while and then try again.

Aaron Spink
speaking for myself inc.
 