Nvidia's unified compiler technology

Humus said:
<snip>

You instead do this:

Code:
const float da;
const float ds;

float4 main(){
   float diffuse = bla bla ..
   float specular = bla bla ..
   return da * diffuse * base + ds * specular ...
}

No quality loss, fewer resources, quicker compile times, and likely some speed gained.

That's not what I mean by 'high quality' shaders; I mean the artists having total control over this level of shader (and we're using models more complex than this already).

Code:
Pseudo code (else this would be several hundred lines long)

Colour = <0,0,0>;
if( diffusemap == 0 )
   Colour += Diffuse( diffusecol, dcoefficient, bias, power );
else
   Colour += DiffuseMap( diffusemap, dcoefficient, bias, power );

if( HalfSpec == true )
{
   if( specularmap == 0 )
      Colour += HSpecular( specularcol, scoefficient, power, metallic );
   else
      Colour += HSpecularMap( specularmap, scoefficient, power, metallic );
}
else
{
   if( specularmap == 0 )
      Colour += RSpecular( specularcol, scoefficient, power, metallic );
   else
      Colour += RSpecularMap( specularmap, scoefficient, power, metallic );
}

Colour += EnvironmentMap( ecoefficient, fresnel, emap );
// and on and on

That consists of 7 fragments and loads of parameters. In reality any parameter could be replaced by a texture map (I've only shown a couple with constant or texture map parameters).
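To make the combinatorics concrete, here's a rough sketch (hypothetical C++-style names, nothing like our actual tool code) of how those artist choices end up selecting one precompiled straight-line shader per combination:

Code:
#include <cstdint>
#include <map>
#include <string>

// One field per artist-visible choice; every used combination becomes
// a distinct straight-line shader that has to be compiled somewhere.
struct MaterialKey {
    bool diffuseFromMap;    // Diffuse() vs DiffuseMap()
    bool specularFromMap;   // *Specular() vs *SpecularMap()
    bool halfAngleSpecular; // HSpecular*() vs RSpecular*()
    bool environmentMap;    // EnvironmentMap() term present or not

    std::uint32_t Pack() const {
        return (diffuseFromMap    ? 1u : 0u)
             | (specularFromMap   ? 2u : 0u)
             | (halfAngleSpecular ? 4u : 0u)
             | (environmentMap    ? 8u : 0u);
    }
};

// Placeholder: stands in for generating/compiling the shader text for
// exactly this combination of fragments.
std::string CompileVariant(const MaterialKey& key) {
    return "shader_variant_" + std::to_string(key.Pack());
}

// Cache of compiled shaders, one entry per combination actually used.
std::map<std::uint32_t, std::string> g_compiledShaders;

const std::string& ShaderForMaterial(const MaterialKey& key) {
    auto it = g_compiledShaders.find(key.Pack());
    if (it == g_compiledShaders.end())
        it = g_compiledShaders.emplace(key.Pack(), CompileVariant(key)).first;
    return it->second;
}

With just the four binary choices shown that's already up to 16 distinct programs, and every extra "constant or texture map?" parameter doubles it again, which is exactly where the compile-time worry comes from.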

We get all that data from the artists and have to implement efficient shaders. It's not a simple matter of the artist picking 'shiny' and twiddling two parameters until it's shiny enough.

Games are not tech demos but large artistic works; that means we try to give as much power as possible to the artists (they have a habit of doing amazing things). Sometimes that means we give ourselves problems, but that's our job to solve.

Which is why I can get a little annoyed (and I apologise for that) when you simply dismiss a perfectly valid way of doing things just because it doesn't fit your view of what a shader is.
 
DeanoC said:
We get all that data from the artists and have to implement efficient shaders. It's not a simple matter of the artist picking 'shiny' and twiddling two parameters until it's shiny enough.

Games are not tech demos but large artistic works; that means we try to give as much power as possible to the artists (they have a habit of doing amazing things). Sometimes that means we give ourselves problems, but that's our job to solve.
But branching shaders don't give the artists any less power than hundreds of small shaders do. The artists pick parameters, whether simple or complex; they don't write code.
You just trade one performance issue (branching, less optimization opportunity because of that) for another performance issue (shader upload and creation, and less optimization opportunity because of the requirement of small compile times). And I think that ultimately, branching will "win". Not only because of performance, but also because it saves work.

DeanoC said:
Which is why I can get a little annoyed (and I apologise for that) when you simply dismiss a perfectly valid way of doing things just because it doesn't fit your view of what a shader is.
It's only perfectly valid if the reachable performance level is suitable for your game. And I think the API should support the (IMO) more common case of a few dozen or a few hundred shaders in an optimized fashion, i.e. driver-side compilation that may take any amount of time to produce the best possible shader code. If the API has an interface to pass it object code, so you can use even thousands of shaders without worrying about compile time, all the better. But I don't think it's right to use this as an argument against the driver taking its time to optimize.
 
Xmas said:
<snip>
And I think that ultimately, branching will "win". Not only because of performance, but also because it saves work.
Long term I agree, but for the next 4-5 years, IMO, large complex branching shaders will have a high execution cost that we will seek to reduce. If the only thing stopping us from grabbing some of that speed back is compile times, shouldn't we at least be concerned about them?

Xmas said:
<snip>
And I think the API should support the (IMO) more common case of a few dozen or a few hundred shaders in an optimized fashion, i.e. driver-side compilation that may take any amount of time to produce the best possible shader code. If the API has an interface to pass it object code, so you can use even thousands of shaders without worrying about compile time, all the better. But I don't think it's right to use this as an argument against the driver taking its time to optimize.
Common in whose view? Lots of devs are using one branching shader as a fallback with lots of small shaders for speed (PS2 VU). The lots-of-short-shaders approach has been shown to be a valid and worthwhile speed-up on a branching shader architecture. The only difference between PC branching and PS2 branching is that PS2 code is pre-compiled off-line.

I don't want the API/driver to penalise your architecture; I'm just asking for the API/driver not to penalise my choice. Especially given that the short-shader system is (or will be) used on consoles, and surely we want to reduce development costs (the true limit on what we can do).

I'd be happy with a flag to indicate whether unlimited compile times are acceptable or not. Then both approaches are valid and we can fight it out over who gets the best visuals.
 
DeanoC said:
Xmas said:
<snip>
And I think that ultimately, branching will "win". Not only because of performance, but also because it saves work.
Long term I agree, but for the next 4-5 years, IMO, large complex branching shaders will have a high execution cost that we will seek to reduce. If the only thing stopping us from grabbing some of that speed back is compile times, shouldn't we at least be concerned about them?

It's not just compile times, it's state change overhead and AGP bus/VRAM resource usage too. I think branches will be a performance win, not a performance loss, but we'll see.

I think you're looking at a strawman. Let's take an example: Toy Story using Renderman. The total scene content for this movie was in the hundreds of gigabytes. The total number of shaders written for the entire production was 1300. Total number of textures: 2000

Keep in mind, Toy Story shaders do way more than just vertex/pixel shaders. We're talking displacement shaders and particle shaders as well, to handle animation. When your game can best Toy Story, we'll talk about how many shaders Finding Nemo uses.

With the exception of shaders designed to compute animation, particles, and displacements, which are usually unique to objects, you do not need 10,000+ material/lighting shaders.
 
DemoCoder said:
<snip>
It's not just compile times, it's state change overhead and AGP bus/VRAM resource usage too. I think branches will be a performance win, not a performance loss, but we'll see.
There's almost no difference in state change overhead between the branching and non-branching cases, assuming you can keep the programs resident in both. Usually stream processors can double-buffer program uploads, so program upload isn't that bad if done correctly. The PS2 VU can upload code thousands of times per frame; indeed, putting a small VU program in front of every object is perfectly reasonable (if a little overkill).
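Roughly what I mean by double buffering the uploads, as a hedged sketch (the upload/execute calls are made-up names, nothing platform-specific): while the shader unit runs the program in one slot, the next object's program streams into the other slot, so the transfer mostly hides behind execution.

Code:
#include <cstddef>
#include <vector>

// Hypothetical double-buffered program upload. The Begin/Wait/Execute
// calls are stand-ins for whatever the real DMA/kick mechanism is.
struct ProgramSlot {
    std::vector<unsigned char> code;    // microcode for one small shader
};

struct DrawItem {
    std::vector<unsigned char> program; // per-object straight-line shader
    // ... geometry, constants, etc. omitted
};

void BeginAsyncUpload(ProgramSlot& slot, const std::vector<unsigned char>& code) {
    slot.code = code;                   // pretend this kicks an async DMA transfer
}
void WaitUploadDone(ProgramSlot&) {}    // pretend this blocks until the transfer finishes
void Execute(const ProgramSlot&, const DrawItem&) {} // run the program over the object

void DrawAll(const std::vector<DrawItem>& items) {
    if (items.empty())
        return;

    ProgramSlot slots[2];
    BeginAsyncUpload(slots[0], items[0].program);

    for (std::size_t i = 0; i < items.size(); ++i) {
        ProgramSlot& current = slots[i & 1];
        WaitUploadDone(current);
        // Start streaming the next program while this one executes.
        if (i + 1 < items.size())
            BeginAsyncUpload(slots[(i + 1) & 1], items[i + 1].program);
        Execute(current, items[i]);
    }
}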

DemoCoder said:
I think you're looking at a strawman. Let's take an example: Toy Story using Renderman. The total scene content for this movie was in the hundreds of gigabytes. The total number of shaders written for the entire production was 1300. Total number of textures: 2000

Keep in mind, Toy Story shaders do way more than just vertex/pixel shaders. We're talking displacement shaders and particle shaders as well, to handle animation. When your game can best Toy Story, we'll talk about how many shaders Finding Nemo uses.

With the exception of shaders designed to compute animation, particles, and displacements, which are usually unique to objects, you do not need 10,000+ material/lighting shaders.

Toy Story wasn't a game. Renderman shading is very different from realtime game rendering. Check the Siggraph papers on Final Fantasy for how different CGI is from games (50 shadow maps for one character in one scene!, fake objects, etc.). I wish I could control the player the same way a film can control the camera. And we have 20ms to render the scene, not 2 hours...

Maybe you're right, but I've seen several different projects with relatively simple lighting models (the basic OpenGL lighting model, for example) on dynamic branching hardware, and that was significantly slower than breaking them into smaller straight-path shaders.

We do have lots of shader permutations; whether we use branching or program changes is a choice every platform and engine will be making for the foreseeable future. At the moment my experience shows lots of short shaders being a win (on consoles at least), but of course I may be wrong (it would certainly make life less complicated).

Either way things sure will be pretty :)
 
Xmas said:
And I think that ultimately, branching will "win". Not only because of performance, but also because it saves work.
I don't think you can make sweeping generalisations like that. For example, if a particular shader (or combination of 'branching' constants + shader) gets used on, say, >5% of the vertices or pixels in a scene, then the overhead of management of an optimised 'branch-reduced' version probably becomes insignificant while the extra execution cycles may be costly.
 
Simon F said:
I don't think you can make sweeping generalisations like that. For example, if a particular shader (or combination of 'branching' constants + shader) gets used on, say, >5% of the vertices or pixels in a scene, then the overhead of management of an optimised 'branch-reduced' version probably becomes insignificant while the extra execution cycles may be costly.
Whether or not static branching is faster will be up to the hardware. And in the end, static branching will be all about performance.

After all, it's more than possible for a driver to unroll a static-branched shader, and just send what will be executed for that round, if that is preferable for performance. Alternatively, a driver could also pack multiple shaders into one big static-branched shader (albeit not with the same efficiency in terms of memory usage as if it's done by the programmer), and always work with just one shader, if that is preferable for performance.

It will just depend on what the hardware works best with. Which is better in the end? Well, with the static-branched shader, video memory can be saved, making that potentially preferable for hardware optimized for static branching.

Given this, developers should probably write their shaders using static branching when it is convenient, but not bother when shaders do not share much code.
 
Simon F said:
Xmas said:
And I think that ultimately, branching will "win". Not only because of performance, but also because it saves work.
I don't think you can make sweeping generalisations like that. For example, if a particular shader (or combination of 'branching' constants + shader) gets used on, say, >5% of the vertices or pixels in a scene, then the overhead of management of an optimised 'branch-reduced' version probably becomes insignificant while the extra execution cycles may be costly.


I doubt it. Static branching can be implemented by the driver via preprocessing the shader, in which case, it's virtually equivalent to a pre-processed version being uploaded.

But even in HW, it seems the cost should be minimal. Since the "branch" is static, it can be decoded and replaced with a JUMP, so ultimately the cost comes down to moving the PC register, which is roughly the cost of the pre-fetched instructions after the jump that have to be flushed from the pipeline.

Seems to me that switching shaders might be more expensive than evaluating a branch in HW.
 
Both of you are missing the point of my original comment. I was stating that you cannot make an all-encompassing statement that an "omni-shader" will always be faster than a set of tuned routines.

Let's take the argument to the absurd extreme; assume you do have such a shader that can execute all possible materials in your application but, for a particular frame, 100% of the vertices/pixels have exactly the same material. In this case it must be faster to use an optimised piece of code**.

Simon

** If you can make all branches completely 100% free then I suggest you get down to your local patent office now - you could make a fortune flogging it to CPU manufacturers.
 
I think, however, that it's pretty much only in very extreme cases like that where there might be an advantage to not using static branching, and even then the advantage would still be small. Branches can be made essentially for free. Even CPUs can do zero cycle branching after a number of iterations as the branch prediction mechanism figures the pattern out.
 
But you can never remove the penalty for less optimized code caused by branches....

Code:
Branch code
R = V0
if( C0 == X )
   R = R + 1
R = R * R

Cost : 4 Cycles

Non Branch code
[C0 == Y version]
R = V0 * V0

Cost : 1 Cycle

[C0 == X version]
R = V0 + 1
R = R * R

Cost : 2 Cycles

It even takes the same amount of memory in this case (but of course, in general, the non-branching version will use more memory). It's the classic optimization trade-off: speed vs memory.

Branching is hard on optimisers and cache predictors. Nothing will completely remove that cost; the question is "Is the cost acceptable?". Long term, ease of development will win, but short term...
 
DeanoC said:
But you can never remove the penalty for less optimized code caused by branches....

Code:
Branch code
R = V0
if( C0 == X )
   R = R + 1
R = R * R

Cost : 4 Cycles
With optimized branch hardware, this should only take 2-3 cycles. But, in this case, it would be better to simply do this:
Code:
if( C0 == X )
{
  R = V0 + 1
  R = R * R
}
else
  R = V0 * V0
Anyway, remember that my argument was based upon static branching. Static branching is completely optimizable, as the program flow is completely known prior to actually running the program. So, it is possible for hardware to be just as fast with static branching as without. It is a bit more challenging, sure, but with smart compilers and a smart instruction set, it should not be terribly difficult to pull off.

It is dynamic branching that will always cause problems. But again, dynamic branching should only be used for performance reasons. One can always build an equivalent shader that will run with static branching. If dynamic branching is available, it is only useful to use it if the static version is terribly inefficient.

And finally, yes, there will always be a tradeoff between code length and performance. Since final performance is all that matters, I think it should be ultimately up to an IHV's shader compiler to make these decisions when they become important in the next few years.
 
DeanoC said:
But you can never remove the penalty for less optimized code caused by branches....

Code:
Branch code
R = V0
if( C0 == X )
   R = R + 1
R = R * R


Wrong. The driver will remove the static branch at runtime, reducing the code to:

R = V0
R = R*R

and then perform copy propagation

R = V0 * V0

That's the benefit of having the driver do JIT compilation. This is so simple, it can be done via a peephole optimizer right at the moment shader constants are altered. I suspect that DX9's static branch was designed so that it could be removed by the driver. 
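For illustration only, here's a toy sketch of exactly those two steps over a made-up three-address IR (not any real driver's code): the static branch is resolved the moment the constant is set, then a trivial copy-propagation pass removes the mov.

Code:
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Toy three-address IR, made up purely for illustration: op dst, a, b.
struct Instr { std::string op, dst, a, b; };

// Step 1: the static branch is resolved when the boolean constant is
// bound, so only one side of the "if" is ever emitted to the GPU.
std::vector<Instr> ResolveStaticBranch(bool c0EqualsX) {
    std::vector<Instr> code;
    code.push_back({"mov", "R", "V0", ""});
    if (c0EqualsX)                       // decided here, not per fragment
        code.push_back({"add", "R", "R", "1"});
    code.push_back({"mul", "R", "R", "R"});
    return code;
}

// Step 2: trivial copy propagation. For "mov dst, src", rewrite later
// uses of dst to src until dst is redefined; the mov is then dead.
std::vector<Instr> CopyPropagate(std::vector<Instr> code) {
    std::vector<Instr> out;
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (code[i].op == "mov") {
            const std::string dst = code[i].dst, src = code[i].a;
            bool redefined = false;
            for (std::size_t j = i + 1; j < code.size() && !redefined; ++j) {
                if (code[j].a == dst) code[j].a = src;
                if (code[j].b == dst) code[j].b = src;
                if (code[j].dst == dst) redefined = true;
            }
            if (redefined)
                continue;                // drop the now-dead mov
        }
        out.push_back(code[i]);
    }
    return out;
}

int main() {
    // With C0 != X this prints just "mul R, V0, V0", i.e. R = V0 * V0.
    for (const Instr& ins : CopyPropagate(ResolveStaticBranch(false)))
        std::cout << ins.op << ' ' << ins.dst << ", " << ins.a
                  << (ins.b.empty() ? std::string() : ", " + ins.b) << '\n';
    return 0;
}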

DeanoC said:
Branching is hard on optimisers and cache predictors. Nothing will completely remove that cost; the question is "Is the cost acceptable?". Long term, ease of development will win, but short term...

Dynamic branching is hard on optimizers, static branching isn't, because static branches get erased during compilation. It is no different than saying #ifdef interferes with the optimizer. I explained in another thread how to use speculative compilation to even remove dynamic branches at runtime.

We're not talking about uber-omni-shaders here. We're talking about parameterizing materials and lighting equations. And for the record, Deano, movies like Toy Story have way, way more shaders than games, because they use shaders to do everything, even create geometry. So if Toy Story only needs 2,000 shaders to describe every possible surface material in their whole movie, I can't imagine a game needing more. Yes, movies have "control" of their scenes, but these movie scenes are also set in environments that have way more objects in the scene, with way more detail. You think Final Fantasy Spirits Within has fewer unique surfaces than Final Fantasy X on the PS2?
 
DemoCoder said:
Dynamic branching is hard on optimizers, static branching isn't, because static branches get erased during compilation.
I think the point is that static branching doesn't have to be erased during compilation. Static branching could be used by the hardware as an optimization technique, either to remove the program switching penalty, or to reduce memory requirements (which may be unimportant now, but will be important in the future).
 
DemoCoder said:
Yeah, it could, but a static branch is just a JMP instruction once it has been evaluated.
Yes, and then the only possible problem is a cache miss if the next instruction is not in the instruction cache. But, given the deterministic nature, this can be avoided.
 
Humus said:
Even CPUs can do zero cycle branching after a number of iterations as the branch prediction mechanism figures the pattern out.
What about space?
 
Simon F said:
Humus said:
Even CPUs can do zero cycle branching after a number of iterations as the branch prediction mechanism figures the pattern out.
What about space?
Static branching would only be useful when it saves space. If not, then instructions should be unrolled by the compiler.
 
If you are taking the "ultra-shader" code, removing all the "constant Boolean based" branching, and re-optimising the code, then how is this in any way different to just storing each of the optimised shaders in the first place? <shrug>

Simon
 