Dany - Shader 3.0 and Branching

With PS3.0, vertex shaders already need to cope with texture accesses, and as shader lengths go up, texture accesses are reduced in comparison to other ALU operations - the reduced importance of pure texturing is already being shown in architectural design.

And, again, we also have the case where you are likely to be able to dedicate more die to a branching unit in a unified shader than having two discrete ones.
 
DaveBaumann said:
With PS3.0, vertex shaders already need to cope with texture accesses, and as shader lengths go up, texture accesses are reduced in comparison to other ALU operations - the reduced importance of pure texturing is already being shown in architectural design.

Vertex shader lengths were already long, even in SM2.0 (256 instructions in VS2.0). The introduction of texturing into vertex shaders has the *opposite* architectural effect: replacing long chains of procedural ALU ops with texture lookups. Texture lookups still have much higher throughput than procedurally generated data.
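
For example (a rough, hypothetical vs_3_0 sketch of my own - the heightmap displacement setup and every name in it are made up), one fetch does the work of a long procedural sequence:

Code:
// Hypothetical vs_3_0 displacement mapping: one texture fetch replaces a long
// chain of procedural ALU instructions (e.g. evaluating a wave/noise function
// per vertex). All names here are illustrative only.
float4x4  WorldViewProj;
sampler2D HeightMap;            // vertex texture (vs_3_0 allows sampling here)

struct VS_IN  { float4 pos : POSITION; float3 normal : NORMAL; float2 uv : TEXCOORD0; };
struct VS_OUT { float4 pos : POSITION; float2 uv : TEXCOORD0; };

VS_OUT main(VS_IN v)
{
    VS_OUT o;
    // Vertex texture fetches need an explicit LOD, hence tex2Dlod.
    float height = tex2Dlod(HeightMap, float4(v.uv, 0, 0)).r;

    // Displace along the normal and transform - the interesting per-vertex
    // data now comes from a lookup rather than from procedural generation.
    float4 displaced = v.pos + float4(v.normal * height, 0);
    o.pos = mul(displaced, WorldViewProj);
    o.uv  = v.uv;
    return o;
}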

And, again, we also have the case where you are likely to be able to dedicate more die to a branching unit in a unified shader than having two discrete ones.

But now you have the much more complex problem of allocating shared units between an incoming vertex stream and an outgoing fragment stream. Let's say you have a pool of 32 unified units. How many do you allocate to the vertices, and for how long? What if there are branches and texture fetches happening in the vertex and fragment programs? Due to the vastly different frequencies of I/O in the two pipelines, you have a huge problem: efficiently allocating that pool of units to the current pipeline state so as to maximize ALU throughput and bandwidth utilization.

Given that highly parallel CPU machines, whose OSs and compilers dedicate lots of runtime to efficiently scheduling identical units to handle subprograms of a given datastream, have not solved this problem sufficiently (their average throughput is still way below the theoretical throughput), I do not have confidence that a silicon scheduler, bound by gate limits and real-time limits, is going to do better.

In SM4, the unified shader units look to become generalized stream processors. They take input, can do ALU ops and texture fetches, and write output to a stream, which can then be read as a stream by the next stage or looped back. I simply do not see an easy way that the HW can be reconfigured to handle all these generalized cases efficiently, especially since a stall in an earlier unified shader can block the rest of the pipeline that is waiting for its output. Unified shaders add more pathological cases that can hurt performance, not remove them.

Not that I am against unified shaders, but more restrictive programming models offer more opportunities for optimization. The more general purpose the model, the less can be determined statically by IHV designers, by compilers, by the APIs, and the drivers.
 
DemoCoder said:
DaveBaumann said:
With PS3.0, vertex shaders already need to cope with texture accesses, and as shader lengths go up, texture accesses are reduced in comparison to other ALU operations - the reduced importance of pure texturing is already being shown in architectural design.

Vertex shader lengths were already long, even in SM2.0 (256 instructions in VS2.0). The introduction of texturing into vertex shaders has the *opposite* architectural effect: replacing long chains of procedural ALU ops with texture lookups. Texture lookups still have much higher throughput than procedurally generated data.

Note, that should have read: "With VS3.0, vertex shaders already need to cope with texture accesses; and as pixel shader lengths go up, texture accesses are reduced in comparison to other ALU operations". So, what you are essentially saying is what I’m saying – PS and VS are coming closer together in terms of usage.

How many do you allocate to the vertices, for how long? What if there are branches and texture fetches happening in the vertex and fragment programs? Due to the vastly different frequencies of I/O in the two pipelines, you have a huge problem to efficiently allocate those pool of units to the current pipeline state to maximize ALU throughput and bandwidth utilization.

And these are the same issues you are faced with in a discrete pipeline, except that there either one end or the other can be entirely stalled by it – pixel or vertex shading may be stalled in a unified system because the resources are all being used, but at least they are all still being used. However, we’ll have to wait and see how these scenarios will be handled and what is put in place to ensure they don’t occur – I’d very much doubt we’ll see a case where all the ALUs are being dedicated to either VS or PS entirely (well, I can probably see cases where PS is taking the majority of the ALUs).

They take input, can do ALU ops and texture fetches, and write output to a stream, which can then be read as a stream by the next stage/loop back.

Why wouldn’t a branch that is determined to need calculating just be set up as another stream and be scheduled again?
 
I agree Dave, you still have stalling issues with today's pipelines, but the access patterns are still more regular because of how they are locked together. The main variable is how many pixels get generated by a vertex, e.g. lots of small polys vs a few big polys. But designers can work with legitimate expected value ranges. Moreover, at least prior to dependent texture fetches, pipelines could reasonably do some prefetching of textures.

What I see happening with SM4.0 is that each SM stage creates a new "output stream". Think of the piping on DOS/UNIX command line shells. Each stage operates on one "kernel" of varying data at a time as input (e.g. similar to the V0-Vn inputs to the vertex shaders today) plus constants, texture coordinate iterators, and bound textures.

It then writes a new set of constant registers, texture coordinates, and generates N different output datums to a new stream for the next stage to process. In the current VS/PS model, one vertex can generate N invocations of a PS shader depending on how many visible pixels get rasterized.

In the new model, I don't expect the VS to do rasterization in the shader. I expect these pixel quads will still be generated by the HW. The difference is, I expect that the VS can instance multiple quad streams (e.g. generate one triangle, plus generate all of its shadow volumes). It can send a stream of data to the PS which is the combined inputs for all of the quads it generated (e.g. registers 0-4 are for the model triangle, registers 8-16 contain parameters for the auxiliary tris that go with it, etc.).
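
Something along these lines is what I'm picturing (completely speculative, made-up syntax loosely modelled on HLSL - none of these names or semantics come from any real API):

Code:
// Purely speculative sketch of "one input element generates N output elements
// into a new stream". Every name and semantic here is hypothetical.
struct V2G { float4 pos : POSITION; float3 extrudeDir : TEXCOORD0; };
struct G2P { float4 pos : POSITION; };

[maxvertexcount(6)]
void StreamStage(triangle V2G tri[3], inout TriangleStream<G2P> outStream)
{
    G2P o;

    // Emit the original triangle to the output stream...
    for (int i = 0; i < 3; i++) { o.pos = tri[i].pos; outStream.Append(o); }
    outStream.RestartStrip();

    // ...plus a second, extruded copy (think shadow-volume geometry) into the
    // same stream, to be consumed by the next stage.
    for (int j = 0; j < 3; j++)
    {
        o.pos = tri[j].pos + float4(tri[j].extrudeDir, 0);
        outStream.Append(o);
    }
}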

And of course, the output of the "PS" can then be fed back into a "VS" (I use quotes because I'm speaking of a unified shader unit just running a different shader), which takes that input and creates more quads.

As I said, my only concern is how all of these streams are buffered up properly and processed so that stalls are minimized. Could you imagine a situation like TESSELLATOR | VS | PS | VS | PS, with all the differing stream frequencies between each, and how to allocate those poor unified shaders which are doing multiple duties between them?
 
DaveBaumann said:
With PS3.0 vertex shaders already need to cope with texture accesses
True, of course, but if you read nVidia's documents, for example, they suggest that you, "Try to cover texture fetch latency with other nondependent instructions." This says to me that whereas in the pixel shader the pipelines are deep enough and made in such a way that a texture can be read and used in the very next instruction, this is not true with the vertex shader, where it is better to try not to use any vertex texture data until a few instructions down the pipeline.
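
In other words, the advice amounts to something like this (a hypothetical vs_3_0 snippet with names I just invented; the only point is how far the first dependent use sits from the fetch):

Code:
// Hypothetical vs_3_0 sketch of "cover texture fetch latency with other
// non-dependent instructions". All names are made up.
float4x4  World;
float4x4  ViewProj;
float4    AmbientColor;
sampler2D HeightMap;

struct VS_IN  { float4 pos : POSITION; float3 normal : NORMAL; float2 uv : TEXCOORD0; };
struct VS_OUT { float4 pos : POSITION; float2 uv : TEXCOORD0; float4 color : COLOR0; };

VS_OUT main(VS_IN v)
{
    VS_OUT o;
    float height = tex2Dlod(HeightMap, float4(v.uv, 0, 0)).r;  // fetch issued here

    // Independent ALU work that never touches 'height' and can hide some of
    // the fetch latency:
    float3 n    = normalize(v.normal);
    float4 wpos = mul(v.pos, World);
    o.color     = AmbientColor;
    o.uv        = v.uv;

    // First dependent use of the fetched value, several instructions later:
    o.pos = mul(wpos + float4(n * height, 0), ViewProj);
    return o;
}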

I claim that this directly allows the vertex shader pipelines to be much shorter than pixel shader pipelines, and thus results in better branching performance. If the pipelines are unified, you will either shorten the pixel pipelines, meaning that texture fetch latency will no longer be automatically hidden, or you will lengthen the vertex shader pipelines, meaning that branches will no longer be so cheap.
 
DemoCoder said:
As I said, my only concern is how all of these streams are buffered up properly and processed so that stalls are minimized. Could you imagine a situation like TESSELLATOR | VS | PS | VS | PS, with all the differing stream frequencies between each, and how to allocate those poor unified shaders which are doing multiple duties between them?

I suspect that the pipeline will still need various traditional separators – i.e. between executing VS and PS you are still going to need to do setup and fast-cull non-visible pixels (occlusion query / Hier-Z), so the buffering that we have in place now is still likely to exist.

Chalnoth said:
True, of course, but if you read nVidia's documents, for example, they suggest that you, "Try to cover texture fetch latency with other nondependent instructions." This says to me that whereas in the pixel shader the pipelines are deep enough and made in such a way that a texture can be read and used in the very next instruction, this is not true with the vertex shader, where it is better to try not to use any vertex texture data until a few instructions down the pipeline.

You are talking about a particular limitation of a particular design (one that is probably not expected to do many texture lookups in the VS in its lifetime). These types of issues will likely be removed the more vertex texturing gets used in any architecture.

Chalnoth said:
I claim that this directly allows the vertex shader pipelines to be much shorter than pixel shader pipelines, and thus results in better branching performance.

I would wait until you see the pipeline first. We also don’t know how the “pipelines” would be organised in a unified structure – we only have the term “pool”.
 
Well, it is a fact that shorter pipelines make for better branching performance, while longer pipelines allow hiding of latency (i.e. for texture fetches). IHV's may attempt to circumvent these problems in innovative ways, but all I'm saying is that unification won't solve them. The only thing it will do is force IHV's to focus on solving both problems at the same time, which seems to be a nontrivial problem.

Particularly in the first generation of unified GPU's, I would expect that IHV's would be less inclined to spend too much time, effort, and transistors on simultaneously solving these problems, and would instead hope that the added efficiency of the unified architecture covers them adequately.
 
sireric said:
The main 1st use of dynamic branching will be to reduce shader count complexity (this is what the GDC papers were pushing with "dynamic lighting"). It will not be a performance advantage at all; quite the opposite. Later, when shaders are almost pure ALU, there will be a performance advantage (you need 10s of ALU ops per texture, or perhaps 100s, depending on the architecture).

Hmmm, given the difference in enthusiasm between ISV (Dany Lepage) and IHV (Eric) I'm wondering whether dynamic branching's main advantage is easier (less complex) shader code to write for the software/programmers? Kinda like: performance advantage = faster development? ;)

To be honest I'm a little surprised that Dany and Eric would look at it so differently, even given that ATI apparently won't support it in PS while nVidia will. What am I missing here?
 
The "dynamic lighting" demo only uses static branching IMHO. They set the loop counter through constant register.
 
sireric said:
The primary use right now is for shader combinatorial reduction; this will not bring a performance improvement, just a code reduction.

Ack, this was the quote that actually contained my point. :oops:

Code reduction: For the benefit of whom is the question at hand!
 
Wouldn't combinatorial shader reduction allow one to reduce state changes? Seems to me that could be a performance win.
 
Bjorn said:
LeStoffer said:
Code reduction: For the benefit of whom is the question at hand!

The developer ?

Exactly! :arrow: So the next question is: do we implement it because they are lazy folks by nature who hate spending time optimizing code for a less-than-flexible pipeline, or because they just dream about writing code as close as possible to the known ways and syntaxes of doing it, e.g. like with C++?

Probably a little of this and that, but no matter what, it will hopefully - at least in the future! - free up some creative power for the developers.
 
DemoCoder said:
Wouldn't combinatorial shader reduction allow one to reduce state changes? Seems to me that could be a performance win.
One would hope so. But if changing a constant causes a state change, it could be as bad as, or worse than, having multiple shaders. I guess it just depends on how well the hardware is designed to avoid state changes with static branching.
 
Does changing the number of loops executed imply a state change?

Or more specifically, what kind of change to the shader causes a state change? Does the shader have to stay unchanged after unrolling any nested loop before the pipeline is flushed?
 
991060 said:
Does changing the number of loops executed imply a state change?
I'd say that would depend on the hardware.

Or more specifically, what kind of change to the shader causes a state change? Does the shader have to stay unchanged after unrolling any nested loop before the pipeline is flushed?
I think the point of supporting branching is so that you don't have to do any unrolling. Changing the shader is what static branching would attempt to avoid.

Anyway, to answer these questions what we need is a benchmark that compares multiple shaders vs. a combined shader with branching (e.g. a benchmark that rapidly changes the number of lights rotating around a teapot, similar to nVidia's multiple lights demo, and compares performance between branching and no branching. To single out the performance impact of using static branching, the benchmark should attempt to do as many shader changes as possible).
 
To me, it boils down to whether SetPixelShader() is slower than SetPixelShaderConstant*(). I'm inclined to believe that unless the driver does "on the fly" non-cached inlining/branch erasure, the former will be slower. Even if we posit that there is an on-chip memory that can contain N words of compiled shader code and that theoretically SetPixelShader() would involve nothing more than moving a program counter register on the HW, it seems to me that the SetPixelShaderConstant*() method would be more cache friendly by allowing more programs to reside on chip, lowering the overhead of subsequent SetPixelShader() calls. Seems like a win-win.
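
Concretely, I'm picturing something like this at the shader level (a hypothetical ps_3_0 sketch; the b0 register and sampler names are made up): instead of compiling shadowed and unshadowed permutations and flipping between them with SetPixelShader(), you keep one combined shader resident and just toggle the bool constant with SetPixelShaderConstantB().

Code:
// Hypothetical ps_3_0 "uber" shader: the shadowed/unshadowed permutations are
// folded into one program, so flipping the effect means setting a bool
// constant rather than binding a different shader. Register choices are
// illustrative only.
bool      UseShadow : register(b0);
sampler2D Diffuse   : register(s0);
sampler2D ShadowMap : register(s1);

float4 main(float2 uv       : TEXCOORD0,
            float4 shadowUV : TEXCOORD1) : COLOR
{
    float4 color = tex2D(Diffuse, uv);
    if (UseShadow)                       // static (constant-based) branch
        color *= tex2Dproj(ShadowMap, shadowUV);
    return color;
}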

Of course, the big question mark is what impact a static loop or static branch will have, and whether an extra cycle or two gobbled up by the branch will be worse than having a large unrolled loop or switching between two big chunks of shaders.

We'll have to wait a couple of months for benchmarks to arrive and for drivers to mature before a definitive answer can be announced.
 