Nvidia's unified compiler technology

DemoCoder said:
I still don't see where these oodles of shaders come from. There are a finite number of material types and lighting equations; I simply don't see the combinatorial explosion.

Are there any games today that use 10,000 textures?

The ideal is perfectly optimised unique shading, each surface executes the shortest program that can generate the correct fragments.
We don't do that (for resource generation issues, more than any other reason). We do have a long list of visual fragments, i.e. diffuse lighting, specular lighting, bump map, fog, env map, etc. Each fragment has a list of input parameters; the artists fiddle, decide what looks 'right' and then we compute the required shader code. With say 10 fragments and 5 parameters each you get 50 fragment variations, probably 2 shaders per fragment (1 vertex and 1 pixel shader) plus per-object data (bones, tweened, vertex compression, different combinations of lights etc.). So 50 shaders with 20 different object variations gives you 1000 'short' shaders. Or you have one shader with all fragments, using branches (or 0's to remove a component), which with 20 different object variations gives you 20 'long' shaders. As shaders get more powerful, the number of fragments will increase quickly; even with exclusivity between certain pieces, it's still going to be loads of shaders.
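Purely as illustration (the flag names below are invented, not DeanoC's actual code), the bookkeeping described above tends to boil down to a permutation key per fragment/option combination, with one generated shader cached per key:

Code:
// Hypothetical sketch of fragment-permutation bookkeeping. Each visual
// fragment and per-object option becomes a bit in a key; one shader is
// generated and cached per unique key actually used by the art.
#include <cstdint>
#include <map>
#include <string>

enum FragmentBits : uint32_t {
    FRAG_DIFFUSE   = 1u << 0,
    FRAG_SPECULAR  = 1u << 1,
    FRAG_BUMPMAP   = 1u << 2,
    FRAG_FOG       = 1u << 3,
    FRAG_ENVMAP    = 1u << 4,
    // per-object options
    OBJ_SKINNED    = 1u << 16,
    OBJ_TWEENED    = 1u << 17,
    OBJ_COMPRESSED = 1u << 18,
};

static std::map<uint32_t, std::string> g_shaderCache;

// Look up (or generate) the shader source for a given permutation key.
const std::string& GetShader(uint32_t key)
{
    auto it = g_shaderCache.find(key);
    if (it == g_shaderCache.end()) {
        std::string src;
        if (key & FRAG_DIFFUSE)  src += "/* diffuse lighting code */\n";
        if (key & FRAG_SPECULAR) src += "/* specular lighting code */\n";
        if (key & FRAG_BUMPMAP)  src += "/* bump map code */\n";
        // ... remaining fragments and per-object variations ...
        it = g_shaderCache.emplace(key, src).first;
    }
    return it->second;
}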

Now of course, not all are in use at one time BUT potentially over a big game, lots could be used.

Even today in CGI, nobody applies a single Renderman shader to each surface; they layer multiple shaders together (one layer metal, next layer rust, add a reflectivity layer (due to water etc.), then several shadow map layers). We're doing basically the same thing, but with performance being our number 1 priority.

The visual effect is the same; the small fragment method may reduce run-time, but it may also increase run-time due to less batching. But then we don't batch anywhere near as much as IHVs would like, due to using real art and not benchmarks, so as long as the small shader method doesn't reduce batching below the level of the 'long' shaders, small shaders should be a win (ignoring driver issues).

We have over 30,000 resources at the moment (a resource being a texture, model, sound etc.). We recently stopped using filenames to access them, as the names were taking several megabytes on their own (now a 4-byte ID refers to each resource).
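As a rough sketch of what that filename-to-ID move might look like (the hash choice and names here are purely illustrative, not the actual pipeline):

Code:
// Illustrative only: one way to turn resource filenames into 4-byte IDs at
// build time (FNV-1a hash), so the runtime never stores the strings.
#include <cstdint>

constexpr uint32_t ResourceId(const char* name)
{
    uint32_t h = 2166136261u;                // FNV-1a offset basis
    for (; *name; ++name) {
        h ^= static_cast<uint8_t>(*name);
        h *= 16777619u;                      // FNV prime
    }
    return h;
}

// The tool chain bakes IDs like this into the data; the game only ever
// sees the 4-byte value, e.g.:
constexpr uint32_t kRustTexture = ResourceId("textures/rust_01.dds");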
 
Dio said:
This is a stickier part of the discussion. I can see some legs of the trousers of time in which the ability to see very high level code inside the driver is not just a win, but a massive win, because some hardware magician has come up with something incredibly smart that is very much out of left field.

There are plenty of hardware magicians inside ATI, I can't rule out wanting high-level code to compile at some future point.

At the same time I completely understand the performance issue. We'll make sure we find an acceptable compromise.
I can see having high-level code can help... guess you'll be writing lexical analysers in ASM then :devilish:

It's a tricky tightrope: we need a standard format on PC, but want the speed low-level optimised code gives us without the run-time expense of a decent optimiser. Catch-22.
 
Dio said:
I would expect so. Anyone can unpack the q3 texture set and see there are about 2000 textures in there.

1086 to be exact, and the real number is lower; the Q3 textures are split up to keep them low-res.

I'm certain some games designers are well over 5x this now - game asset complexity is matching or outstripping Moore's Law, which would therefore suggest at least 20k textures.

I am pretty sure you're wrong; individual textures have increased in size and complexity since Q3, but I don't think the number of textures has increased very much, if at all.

Whether we have yet hit 10k textures per level I don't know. Probably not, but not far off.

Very far off. 10K textures at 1Kx1K resolution would use 40GB of space (uncompressed). Even with low-resolution textures and compression, the size of 10K textures would be a problem.

But the reasons we haven't are nothing to do with technology and everything to do with time and money.

Yeah, right. We don't have the technology to use 10K textures at any reasonable resolution.
 
DeanoC said:
I can see having high-level code can help... guess you'll be writing lexical analysers in ASM then
Porting lex and yacc to generate asm instead probably wouldn't be too hard :)
 
Tim said:
Very far off 10K textures at 1Kx1K resolution would use 40GB of space (uncompressed). Even with low resolution textures and compression the size of 10K textures would be a problem.
Note that I am considering 'per level', not 'per frame'. As Deano points out, levels are increasingly based on streaming architectures that support seamless registering and unregistering of resources.

Anyone who's using uncompressed 1Kx1K textures without having specifically analysed it to check the image quality cost of compression needs their head examining. DXTC is suitable for 90% of textures, and more so for high resolution textures, because as the texture size goes up the importance of any single pixel is generally somewhat reduced.

For a single scene, 1,000 256x256 DXTC textures - 256x256 is still a very common texture size for many applications - is easily achievable right now even on low-end cards (48M of texture, including mipmaps); and 10x that scene complexity for a whole level seems reasonable to me and perfectly achievable on top-end hardware (1GB system RAM, 256M video RAM).
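A quick back-of-the-envelope check of those figures, assuming DXT1 (8 bytes per 4x4 block) and a full mip chain; the exact numbers depend on the formats actually used:

Code:
// Rough memory check for the figures above. Assumes DXT1 (0.5 bytes/texel)
// and ~4/3 overhead for a full mip chain - order of magnitude only.
#include <cstdio>

int main()
{
    const double dxt1BytesPerTexel = 0.5;          // 8 bytes / 16 texels
    const double mipOverhead       = 4.0 / 3.0;    // full mip chain
    const double perTexture = 256.0 * 256.0 * dxt1BytesPerTexel * mipOverhead;

    std::printf("1,000 256x256 DXT1 textures:  %.0f MB\n",
                1000  * perTexture / (1024 * 1024));
    std::printf("10,000 256x256 DXT1 textures: %.0f MB\n",
                10000 * perTexture / (1024 * 1024));
    // For comparison, 10,000 uncompressed 1Kx1K RGBA8 textures (no mips):
    std::printf("10,000 1Kx1K RGBA8 textures:  %.0f GB\n",
                10000.0 * 1024 * 1024 * 4 / (1024.0 * 1024 * 1024));
    return 0;
}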
 
DeanoC said:
The ideal is perfectly optimised unique shading, each surface executes the shortest program that can generate the correct fragments.
That's what flow control is for. Granted, it's not really there yet (on the pixel shader side), but neither are driver-side compilers or the long shaders that would take so long to compile.

You can have static flow control at zero performance cost in hardware, i.e. SFC instructions don't have to take vital ALU cycles. It will cost some die area, but OTOH SFC saves you from wasting performance on shader changes, greatly improves shader caching efficiency and you don't need to write code that generates thousands of shaders from small pieces.

Hyp-X said:
Let's take a realistic example - you have to support legacy lighting capabilities through VS1.1. Do that optimized, avoiding dead code.

Let's support point, directional and spot lights. FF supports 8 of them, but let's support only 4. (I think that's 35x combinations.)
You can have diffuse texture or not (2x), specular texture or not (2x), environment map or not (2x), self-illumination texture or not (2x), bump map or not (2x).
You can have no-deformation, 2 source tweening, 3 source tweening, 1 bone skinning, 2 bone skinning at least (5x).
You can have fog enabled or not (2x).

That's 35*2*2*2*2*2*5*2 = 11200 shaders.
Yes, legacy support is a problem, but it's far less of a problem than you make it out to be. You shouldn't multiply the number of vertex shaders by the number of pixel shaders, because you can, for the most part, freely combine them. Some PS require input from the VS, but that still doesn't result in multiplying the numbers. In many cases you can do the extra work in the VS without noticeably sacrificing performance.
As for several light sources, PS1.x hardware will hardly be able to handle more than one per pass, while PS2.0 hardware also means VS2.0, i.e. SFC in the vertex pipeline, so the situation is entirely different again.
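For what it's worth, Hyp-X's arithmetic itself checks out; a minimal sketch reproducing the counts (the 35 light combinations are all ways of picking up to 4 lights from the three types):

Code:
// Reproduces Hyp-X's count: light combinations (up to 4 lights drawn from
// point/directional/spot) times the on/off texture features, deformation
// modes and fog. Purely illustrative arithmetic.
#include <cstdio>

int main()
{
    int lightCombos = 0;
    for (int p = 0; p <= 4; ++p)
        for (int d = 0; d + p <= 4; ++d)
            for (int s = 0; s + d + p <= 4; ++s)
                ++lightCombos;                      // 35

    const int textureToggles = 2 * 2 * 2 * 2 * 2;   // diffuse/specular/env/self-illum/bump
    const int deformModes    = 5;                   // none, 2/3-src tween, 1/2-bone skin
    const int fog            = 2;

    std::printf("%d combinations\n",
                lightCombos * textureToggles * deformModes * fog);  // 11200
    return 0;
}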
 
Xmas said:
You can have static flow control at zero performance cost in hardware, i.e. SFC instructions don't have to take vital ALU cycles. It will cost some die area, but OTOH SFC saves you from wasting performance on shader changes, greatly improves shader caching efficiency and you don't need to write code that generates thousands of shaders from small pieces.

It's not free: static flow control affects the ability of the optimiser. Any flow control limits the extent to which you can move instructions, reuse registers, etc. So if an earlier step (e.g. the HLSL compiler, given literals) can remove branches which you know are constant, you should. You will get better performance.
 
Dio said:
DeanoC said:
I can see having high level code can help... guess your be writing lexical analysers in ASM then
Porting lex and yacc to generate asm instead probably wouldn't be too hard :)
I thought he was implying writing the parser in asm! Yuck!
 
DeanoC said:
But long branching shaders are slow.

I thought I would have been pretty clear by now, but apparently not. I'm not suggesting that you write shaders like this:

Code:
void main(){

  if (material == rubber){
     // do rubber stuff
     ...
  } else if (material == iron){
     // do iron stuff
     ...
  } else if (material == plastic){
     // do plastic stuff
     ...
  } else ....
  ...
}

What I'm proposing is that instead of doing this:
Code:
void main(){
   float diffuse = bla bla ..
   float specular = bla bla ..
   return 0.3 * diffuse * base + 0.5 * specular ...
}

plus

void main(){
   float diffuse = bla bla ..
   float specular = bla bla ..
   return 0.2 * diffuse * base + 0.7 * specular ...
}

you instead do this:

Code:
const float da;
const float ds;

void main(){
   float diffuse = bla bla ..
   float specular = bla bla ..
   return da * diffuse * base + ds * specular ...
}

No quality loss, fewer resources, quicker compile time and likely some speed gained.
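For illustration, the application side of this approach might look roughly like the following under D3D9 (the register index and material values are assumptions, not anyone's actual code):

Code:
// Illustrative D3D9 host code: one shader, per-material constants instead of
// per-material shaders. Register c0 and the weight values are made up.
#include <d3d9.h>

void SetMaterialWeights(IDirect3DDevice9* device, float da, float ds)
{
    const float weights[4] = { da, ds, 0.0f, 0.0f };
    // Feeds constant register c0; the single pixel shader reads da/ds from it.
    device->SetPixelShaderConstantF(0, weights, 1);
}

// e.g. SetMaterialWeights(dev, 0.3f, 0.5f); before drawing one material,
//      SetMaterialWeights(dev, 0.2f, 0.7f); before drawing the next.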
 
zeckensack said:
Humus,
I understand precisely what you mean and you're right, but I still can't help wondering what you're arguing against. Is it about JIT or about 10k shaders per frame? The former doesn't automatically imply the latter.

I'm for driver-side compilers and against the idea that developers design engines around non-existent guarantees and then, as that becomes problematic, demand that the driver change instead of the app, potentially causing problems for well-written applications. If someone uses the 10,000 shader example as an argument to place compilers outside the driver, then other well-written applications will have to suffer, because they won't achieve the full performance they could just because some other developers artificially forced the driver into guaranteeing short compile times.

As I have said, I'm not against the idea of passing compiler hints and such.
 
DeanoC said:
Xmas said:
You can have static flow control at zero performance cost in hardware, i.e. SFC instructions don't have to take vital ALU cycles. It will cost some die area, but OTOH SFC saves you from wasting performance on shader changes, greatly improves shader caching efficiency and you don't need to write code that generates thousands of shaders from small pieces.

It's not free: static flow control affects the ability of the optimiser. Any flow control limits the extent to which you can move instructions, reuse registers, etc. So if an earlier step (e.g. the HLSL compiler, given literals) can remove branches which you know are constant, you should. You will get better performance.

What will affect the optimizer far more is breaking up larger HLSL programs with branches into hundreds of smaller shaders.

Part of what we're talking about is constant parameters being passed, and part is constant branches. Either way, I am bound to believe that the number of cycles wasted doing a static branch is going to be far lower than the number of cycles wasted changing and uploading shaders.

No one is suggesting one big "uber shader" that has hundreds of branches and contains all your small shaders. There is a middle ground, where parameterization using constants makes sense.

Anyway, let's say the average shader is 20 lines of HLSL source; if you are suggesting 10,000 shaders, that means 200,000 lines of HLSL code, preferably debugged and tested, which is very large.

Humus, with DX9 you can't inline constants anyway. They have to be declared as constant registers and passed by the CPU, so if someone is generating HLSL shaders with different inlined constants, they are effectively ending up with identical duplicate shaders!
 
DeanoC said:
We'll have to agree to disagree on whether 10,000 shaders sounds like a lot for the future. We will also have to agree to disagree about driver-side compilers.

Even if you need a fast path, that's still not a good argument against driver-side compilers. I'd like to view it the way texture compression is handled in OpenGL. You can feed it a normal RGBA8 texture if you like and tell the driver to compress it for you. There are no guarantees that this will be fast. If you want it fast, you call glHint() with the right parameter. There are still no guarantees, but the driver knows your intention and may adjust accordingly. If you don't need max quality, then the driver can do a quick job that's "good enough". In case this still isn't satisfying, you have the choice to pass a precompressed DXTC texture directly to the API. The precompressed DXTC texture can be read back directly from the API in a previous call, stored and reused later on.

This is how I want things to be handled with shaders too, and it should satisfy everyone. If you don't need short compile times or anything like that, then just pass the high-level shader to the API and you're done. The compiler can then spend any time it wants optimizing it for the hardware, and maximum performance can be had. If this is a little too slow and you don't need maximum runtime performance, then you can opt for passing something like a GL_minimize_compile_time flag to the API as you compile. And finally, if a fast path really is needed, then you can simply get a compiled binary version back, write it to disk and reuse it from then on.
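For reference, the OpenGL texture-compression path described above looks roughly like this (constants are the standard GL 1.3 / S3TC ones; extension-loading and error handling omitted):

Code:
// The OpenGL texture-compression path, roughly:
// 1) let the driver compress, optionally hinting "fast over good",
// 2) read the compressed blob back, 3) reuse it directly on later runs.
#include <GL/gl.h>
#include <GL/glext.h>
#include <vector>

GLuint UploadAndCacheTexture(const void* rgba, int w, int h,
                             std::vector<char>& cachedBlob, GLint& cachedSize)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);

    // "Don't spend too long compressing" - analogous to a compile-time hint.
    glHint(GL_TEXTURE_COMPRESSION_HINT, GL_FASTEST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA_S3TC_DXT5_EXT,
                 w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, rgba);

    // Read the driver-compressed result back so it can be stored on disk
    // and fed straight to glCompressedTexImage2D next time.
    glGetTexLevelParameteriv(GL_TEXTURE_2D, 0,
                             GL_TEXTURE_COMPRESSED_IMAGE_SIZE, &cachedSize);
    cachedBlob.resize(cachedSize);
    glGetCompressedTexImage(GL_TEXTURE_2D, 0, cachedBlob.data());
    return tex;
}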
 
DemoCoder said:
Humus, with DX9 you can't inline constants anyway. They have to be declared as constant registers and passed by the CPU, so if someone is generating HLSL shaders with different inlined constants, they are effectively ending up with identical duplicate shaders!

Ehm, not sure if you mean something else :? ... but how about for instance:

def c0, 1.0, 0.0, 0.0, 0.0
 
Humus said:
I thought I would have been pretty clear by now, but apparently not. I'm not suggesting that write shaders like this:

...

But that's an extremely limited, and far from right, way of doing lighting.

In case people didn't get it then, what I was suggesting in the other thread (which, btw, had every point made in this thread and some more) was:
Code:
float4 BRDF( /*...*/ ) { //parameter list is up to debate
     // ...
     //return the BRDF for each channel via texlookup(s)
}

float4 main() {
     //...
     return BRDF( lightVector, eyeVector, /*...*/ )*dot(normal,lightVector)*lightIntensity;
}

All the lighting is in the BRDF, completely separate from the shader. So you can use 1 shader for all your lighting and just change the BRDF. And the BRDF would likely be automatically generated either from microfacet geometry, a gonioreflectometer, or a set of tweakable parameters in the editor. More flexibility, far greater graphical realism, faster, AND easier (on both artists and programmers).

The problem with how everyone does lighting now (evaluating a lighting equation for every pixel, every frame) is that they're constantly recalculating and programming stuff that is essentially just data. Worse, data that doesn't change. Calculate it once, store the results, and just look them up. Storage is a problem, but this stuff is generally highly compressible.
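As a toy illustration of the "calculate once, look up later" idea (the parameterization and the Blinn-Phong-style lobe here are stand-ins, not Ilfirin's actual method):

Code:
// Toy example: bake a simple Blinn-Phong-style lobe into a 2D table indexed
// by (N.L, N.H). A real 4D BRDF needs a richer parameterization; this only
// shows the precompute-then-lookup flavour.
#include <cmath>
#include <vector>

std::vector<float> BakeBrdfTable(int size, float shininess)
{
    std::vector<float> table(size * size);
    for (int j = 0; j < size; ++j) {          // N.H axis
        for (int i = 0; i < size; ++i) {      // N.L axis
            float ndotl = i / float(size - 1);
            float ndoth = j / float(size - 1);
            table[j * size + i] = ndotl + 0.5f * std::pow(ndoth, shininess);
        }
    }
    return table;   // upload as a float texture, sample it in the shader
}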
 
Ilfirin said:
But that's an extremely limited, and far from right, way of doing lighting.

It was just an example, and whether diffuse + specular is wrong or limited is quite irrelevant.
 
Humus said:
It was just an example, and whether diffuse + specular is wrong or limited is quite irrelevant.

Not completely irrelevant. The 3 solutions reflect 3 different design philosophies, all aimed at the same problem. There's hard-coding everything, which leads to massive amounts of code and is generally inflexible, inefficient, and counterproductive. Then there's parameterization, where you take one method and make it as configurable and tweakable as possible. And then there's totally data-driven design, where everything's in this abstract "data" that the shaders process.
 
Ilfirin,

that is a nice way of doing lighting, but... I think the problem currently is that storing a 4D BRDF takes up giant amounts of video card RAM, which effectively limits you to having just a few materials anyway. It was my impression that by hard-coding a few different BRDF functions (and tweaking constants) you can get more scene variation...

OT: Something I don't really get is how you filter a BRDF - for example, let's say your surface is pretty far away... that means that one pixel might contain a fair amount of variation in the normal (i.e. a normal-mapped surface, or a complicated object very far away). This translates to potentially large variations in the light and view vectors in the coordinate system in which the BRDF lookup textures are defined. How does one take this into account?
 
Humus said:
DemoCoder said:
Humus, with DX9 you can't inline constants anyway. They have to be declared as constant registers and passed by the CPU, so if someone is generating HLSL shaders with different inlined constants, they are effectively ending up with identical duplicate shaders!

Ehm, not sure if you mean something else :? ... but how about for instance:

def c0, 1.0, 0.0, 0.0, 0.0

Maybe I'm mistaken, but I thought "def" was just syntactic sugar, and all this does is cause the driver to preload the registers. What I meant to say was, DX9 LLSL doesn't contain "load immediate" instructions like typical assembly instruction sets, a la

move r0, #1234

Anytime you deal with constants, you must always use a register, so the shader is parameterized by register, since no matter what, you get code like

mul r0, r1, c0

instead of

mul r0, r1, #1234

Even if you have DEFed some registers, can't they be overwritten with an API call, or does "DEF" imply "const"ness which can't be affected by the DX9 API calls?

Two HLSL code snippets

x = y * 10

and

x = y * 20

will compile to identical DX9 assembly, with only the DEF headers being different.
 
psurge said:
Ilfirin,

that is a nice way of doing lighting, but... I think the problem currently is that storing a 4D BRDF takes up giant amounts of video card RAM, which effectively limits you to having just a few materials anyway. It was my impression that by hard-coding a few different BRDF functions (and tweaking constants) you can get more scene variation...

The amount of memory consumed by the BRDF is largely dependent on how you implement it. Like I said in the other thread, you can usually find a factorization/decomposition for a given BRDF (with a given parameterization) that is close enough to the real thing that no one would ever notice that it was just an approximation unless there were side-by-side comparisons between the real-time and the ray-traced versions (and even then it's often difficult to discern the differences). But even with the exact 4D BRDF, there's generally lots of redundant information and/or blank space that can be losslessly compressed and packed away, and then you can generally get away with a good deal of lossy compression too. Of course, all this starts complicating the run-time process..
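To put rough numbers on that, here is an order-of-magnitude comparison of a brute-force 4D table against a factored pair of 2D maps (resolutions chosen arbitrarily for illustration):

Code:
// Rough storage comparison (arbitrary resolutions, RGBA8 samples):
// a brute-force 4D BRDF table versus a factored pair of 2D maps.
#include <cstdio>

int main()
{
    const double bytesPerSample = 4.0;                             // RGBA8
    const double full4D   = 32.0 * 32 * 32 * 32 * bytesPerSample;  // 32 samples per axis
    const double factored = 2.0 * 128 * 128 * bytesPerSample;      // two 128x128 maps

    std::printf("full 4D table : %.1f MB\n", full4D   / (1024 * 1024));  // ~4 MB
    std::printf("factored 2x2D : %.1f KB\n", factored / 1024);           // ~128 KB
    return 0;
}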


OT: Something I don't really get is how you filter a BRDF - for example, let's say your surface is pretty far away... that means that one pixel might contain a fair amount of variation in the normal (i.e. a normal-mapped surface, or a complicated object very far away). This translates to potentially large variations in the light and view vectors in the coordinate system in which the BRDF lookup textures are defined. How does one take this into account?

Not sure I understand the question, but if you mean how do you handle the cases when the incident light vector and/or eye vector are between samples in the BRDF then, again, it depends on how you did your BRDF implementation. For the 2D array of 2D spherical functions (as mentioned in the other thread), one way you could do it is to turn on bilinear filtering (for the reflection vector samples) on the BRDF texture, and do 4 different BRDF lookups (for the 4 surrounding light vector samples) and do manual bilinear filtering between them in the pixel shader. Though some form of non-linear interpolation might lend itself to better results..
 
Ilfirin, OK I need to look into this factorization of BRDFs before commenting further (any links?)

As to my other question - I can see how interpolation between samples might be done... sorry if my terminology was bad. What I mean is: how does one handle the case where the light and/or view vectors cannot accurately be represented as a single vector? (Extreme minification of a normal-mapped surface, say.)

Regards,
Serge
 