HLSL 'Compiler Hints' - Fragmenting DX9?

Bjorn said:
The closest thing to ICD programming offered would be the course in reactive programming, where we wrote a driver for a simple AD/DA converter to control some odd pendulum device.

Been there, done that :)

Not sure if I've asked this before - I have a faint memory that I might have, but I don't remember the answer.
Are you studying at Luth now, or are you done? What did you study / are you studying?
 
Yeah, it sounds like the shaders aren't being "parameterized". That's what constant registers and static branching are for.

The artist should be using shaders written by real programmers bound into his art package so that the artist can manipulate only the parameters the developer wanted, not generate custom shaders with hardcoded constants.

Compiling shaders for each and every HW profile, and for each and every #ifdef, could yield a combinatorial explosion of generated code, which I think is a bad idea.
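To make that concrete, here's a minimal sketch (hypothetical names, nothing from any real engine) of a shader parameterized purely through constant registers; the artist only ever touches the constants:

// Material parameters live in constant registers; the artist edits these
// values in the art package, never the code below.
float4 materialTint;     // base colour tint
float  specularPower;    // shininess knob

float4 psMaterial(float3 N : TEXCOORD0, float3 L : TEXCOORD1,
                  float3 H : TEXCOORD2) : COLOR
{
    float diff = saturate(dot(normalize(N), normalize(L)));
    float spec = pow(saturate(dot(normalize(N), normalize(H))), specularPower);
    return materialTint * diff + spec;
}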
 
demalion said:
The fact that new profiles have to be consistently created...

No, the fact is that a new profile had to be created for the NV3x.

You have two scenarios. Either new profiles constantly need to be provided for new hardware, or IHVs just suck it up and design hardware around assembly shader specs instead of around their visions. Neither way is desirable.

demalion said:
Have you added anything to the discussion besides "well, maybe glslang won't do as advertised, I think, perhaps, but have no information to offer"
Really.

Why are so many of your questions rhetorical and insulting? What does this "add to the discussion", then?

Frankly, I share DC's view on your posts. No offense intended, but they tend to be more quantity than quality. Lots of words, little actual information, and a lot of speculation.

demalion said:
Writing a compiler from scratch is easier than writing an ICD. Adding a compiler into an ICD isn't that complex, since, as I stated, the compiler doesn't need to "interact" with anything.

So there is no concern with any other state updates influencing shader output at all on the GPU's end? Perhaps in relation to things like fog, AA, and texture handling?

Nope.
 
Humus said:
Are you going to complain too if a C++ compiler takes half an hour to compile a million lines of autogenerated code? Or that it would generate an executable of tens or hundreds of megabytes? Should you conclude that the C++ specification is flawed?

This is a flawed analogy. You're not recompiling your application every time you execute it. Besides, half-hour compile times are the norm for large-scale software construction, and in many cases that's considered quite fast. We're talking about JIT compiling, for GPUs, and how with increasing shader lengths, and increasing numbers of shaders, it will become increasingly less practical. It's been impractical for CPUs for a while now.

And as shader complexity increases the cost will only become higher
Yes, but that cost will increase significantly slower than GPU and CPU speeds, making that cost more and more insignificant. Besides, we're not talking about switching between a thousand shaders while rendering the scene. We're talking about switching between a small handful, for just the visible objects.


Having many shaders doesn't equate to better quality, if there's any such line of thought in your post. Just because the artist has the full ability to change the appearance of objects as he likes doesn't mean it needs to generate different shaders for each.

If your aim is realistic lighting (big if), and you haven't adopted a BRDF model yet, then yes, the # of shaders can equate to better quality. Plastic requires a different lighting equation than steel, requires a different lighting equation than aluminum, requires a different lighting equation than iron, requires a different lighting equation than glass, requires a different lighting equation than velvet, etc etc.

If you're not using a BRDF lighting system (and there's still an extremely alarming number of game developers out there that have no idea what a BRDF is), all those different lighting equations require their own shader, and that means virtually every object will have its own shader.

Even with the BRDF system (leaving the lighting part virtually solved), you're still left to combat the increasing need for more unique objects, and the increasing problem of development times and costs. As a result, many are turning to procedural shaders to add extra 'uniqueness' to objects - add some dust to the objects in this room, some mildew to the objects in that room, rust to the metals, mold to the soda in a glass, etc. All that results in more and more shaders.

Granted, you can work around that a little bit and do a single abstract "procedurally blend a texture" shader that will blend in a texture based upon parameters. Then you could just select a 'dust texture' (for example) and tweak the parameters to your liking to get a decent approximation of what you would be doing in the paragraph above. Obviously, this can't always be done. And as time goes on there's going to be more and more reasons to add new shaders to your project, and fewer and fewer ways of getting around it.
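A rough sketch of that "procedurally blend a texture" idea (texture and parameter names are made up here, just to show how little shader code it takes):

// One generic detail-layer shader; which texture is bound and how strongly
// it blends is entirely data-driven.
sampler baseMap;
sampler detailMap;       // dust, rust, mildew... picked by the artist
float   blendAmount;     // 0 = clean, 1 = fully covered
float2  detailScale;     // tiling of the detail layer

float4 psDetailBlend(float2 uv : TEXCOORD0) : COLOR
{
    float4 base   = tex2D(baseMap, uv);
    float4 detail = tex2D(detailMap, uv * detailScale);
    return lerp(base, detail, blendAmount);
}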

It's only a matter of time before the complexity, length, and number of shaders outgrows what is practical for JIT compilation.
 
Ilfirin said:
This is a flawed analogy. You're not recompiling your application every time you execute it. Besides, half-hour compile times are the norm for large-scale software construction, and in many cases that's considered quite fast. We're talking about JIT compiling, for GPUs, and how with increasing shader lengths, and increasing numbers of shaders, it will become increasingly less practical. It's been impractical for CPUs for a while now.
And if you define the API interface carefully, you don't have to compile whole shader libraries every time you start a game. That is, if JIT compile time will ever become an issue with shaders (which I don't believe).

And as shader complexity increases the cost will only become higher
Yes, but that cost will increase significantly slower than GPU and CPU speeds, making that cost more and more insignificant. Besides, we're not talking about switching between a thousand shaders while rendering the scene. We're talking about switching between a small handful, for just the visible objects.
A small handful? If only a small handful of different shaders is used per frame, how big/detailed does that world have to be to really make use of several thousand shaders? Don't you think an engine could compile only those shaders that are used in the surroundings of the player, reloading/compiling them if necessary?
The number of different visible objects is increasing, too, so the cost of shader switches will remain relevant.


If your aim is realistic lighting (big if), and you haven't adopted a BRDF model yet, then yes, the # of shaders can equate to better quality. Plastic requires a different lighting equation than steel, requires a different lighting equation than aluminum, requires a different lighting equation than iron, requires a different lighting equation than glass, requires a different lighting equation than velvet, etc etc.
You can realize this with static flow control. No need to create an endless number of shaders for every possible permutation of surface characteristics.
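For example (a hedged sketch, with invented constant names), one shader can cover a couple of material classes by branching on a boolean constant that the app sets per material; the branch is resolved per draw call, not per pixel:

// Static flow control: a boolean constant register picks the lighting term.
bool useMetalTerm;       // set by the application per material class

float4 psLight(float3 N : TEXCOORD0, float3 L : TEXCOORD1,
               float3 H : TEXCOORD2) : COLOR
{
    float diff = saturate(dot(N, L));
    float spec;
    if (useMetalTerm)
        spec = pow(saturate(dot(N, H)), 64) * diff;   // tight, masked highlight
    else
        spec = pow(saturate(dot(N, H)), 8);           // broad, plastic-like highlight
    float c = diff + spec;
    return float4(c, c, c, 1);
}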

Even with the BRDF system (leaving the lighting part virtually solved), you're still left to combat the increasing need for more unique objects, and the increasing problem of development times and costs. As a result, many are turning to procedural shaders to add extra 'uniqueness' to objects - add some dust to the objects in this room, some mildew to the objects in that room, rust to the metals, mold to the soda in a glass, etc. All that results in more and more shaders.

Granted, you can work around that a little bit and do a single abstract "procedurally blend a texture" shader that will blend in a texture based upon parameters. Then you could just select a 'dust texture' (for example) and tweak the parameters to your liking to get a decent approximation of what you would be doing in the paragraph above. Obviously, this can't always be done. And as time goes on there's going to be more and more reasons to add new shaders to your project, and fewer and fewer ways of getting around it.
If you want to add several rare features to certain objects, you don't have to create the full range of all possible permutations. The oft-used elements could be paths selected through SFC.

That's not in any way different to how you write applications for CPUs. You'd never create several dozen executables if they share a lot of code but differ in some aspects. You'd write a configurable one.

It's only a matter of time before the complexity, length, and number of shaders outgrows what is practical for JIT compilation.
I don't think so. What you mentioned before, regarding increasing CPU speeds, is also true for JIT compilation.
You also seem to suggest building shaders from different parts (no one is going to actually write several thousand different shaders). You could achieve this by compiling the parts and adding a linker step that puts them together, which is much faster than compiling all combinations of code elements.

I agree with DemoCoder that artists should not write shaders, but configure them.
 
Regarding multiple BRDFs...

Is it still impractical to approximate your BRDFs using
spherical harmonics or wavelets? That would leave you with a single shader for BRDFs, taking input from a per material BRDF "texture".

Regards,
Serge
 
Xmas said:
And if you define the API interface carefully, you don't have to compile whole shader libraries every time you start a game. That is, if JIT compile time will ever become an issue with shaders (which I don't believe).

You have to compile them eventually, if they're even going to be used (you'd obviously filter out every redundant or unused shader), even if you spread them out and only compile the shaders for one level when that level loads. If your compile times are 30 minutes and you have 10 levels, that's an extra 3-minute loading time for each level. Any loading times more than a couple of seconds and I won't play a game.. and even those load times piss me off.

A small handful? If only a small handful of different shaders is used per frame, how big/detailed does that world have to be to really make use of several thousand shaders? Don't you think an engine could compile only those shaders that are used in the surroundings of the player, reloading/compiling them if necessary?
The number of different visible objects is increasing, too, so the cost of shader switches will remain relevant.

A handful of additional shaders, not counting the ones that are already applied to the whole scene (such as tone mapping, digital grading, HDR blooms, etc), is still orders of magnitude more than you see in today's games. A given indoor FPS game scene generally only has a dozen or so objects in it, then there are maybe 50 or so scenes per level, and 10-20 levels. 5*50*10 = 2500. *Note: I'm not advocating the creation of 2500 shaders here.. just saying that you can easily have thousands of different shaders throughout a game without bogging down performance from all the shader changes (5 changes per frame is nothing).

You can realize this with static flow control. No need to create an endless number of shaders for every possible permutation of surface characteristics.

A) No one's talking about permutations of surface characteristics. We're talking about lighting equations (you know, Blinn, Phong, Torrance-Sparrow, Minnaert, Ward, etc). That is, how much light is reflected towards the eye for a given incident light vector, normal, and eye vector. This varies drastically from one surface (or angle, for that matter) to the next, hence the creation of BRDFs and BSSRDFs (which can be seen as 'data-driven' lighting models of sorts, since entire lighting models can be turned into data that can simply be looked up in the shader). You can hack this up with texture, specular, gloss, and environment maps using the Phong model. But that doesn't even begin to compare to the real, physically correct thing.

B) How would packing a bunch of shaders into one long switch statement solve anything? You'd just have one really long shader instead of a bunch of small ones, which is inefficient for the programmer, the compiler, and the computer that has to execute it.

That's not in any way different to how you write applications for CPUs.

I don't know about you, but my CPU apps never turn into one long switch statement..

You'd never create several dozen executables if they share a lot of code but differ in some aspects. You'd write a configurable one.

Very bad analogy. We're not talking about different configurations of the same routine, we're talking about a bunch of different routines (as different as a bloom shader is from a phong lighting shader). Doing what you're suggesting is analogous to taking all your RayTriangleIntersect(), SphereSphereIntersect(), etc routines and dropping them all in one Intersect( <function id> ) routine, that's just one big switch statement.

You also seem to suggest building shaders from different parts

No, I said that's what a lot of people are doing/did in their engine design, and thus now have a massive amount of shaders (and thus too much work to go back and change their minds now) that take a very long amount of time to compile. One of the companies who I've heard complain (through private correspondences) about shader compilation times in their project just recently delayed their highly anticipated sequel (hmm), though I'm not sure if their shaders are artist created or not.

(no one is going to actually write several thousand different shaders)

They will if it's required. That's like saying no one would ever write several thousand different routines for CPUs, when, in fact, everyone does now. And shaders are generally a lot easier to write (and shorter) than CPU routines, IMO. We have a while to go before games are anywhere near photorealistic.. we’re not going to get there without a lot of work.

I agree with DemoCoder that artists should not write shaders, but configure them.

As do I, which is why I've stated that in every post I've made in this thread. The last post assumed a programmer, not an artist, was writing the shaders (would you really want an artist devising an analytical solution to light transport and reflection through, say, a tree leaf, and then expect them to code it up efficiently?).

Sorry for the long post.. I hate them as much as everyone else.
 
Ilfirin said:
You have to compile them eventually, if they're even going to be used (you'd obviously filter out every redundant or unused shader), even if you spread them out and only compile the shaders for one level when that level loads. If your compile times are 30 minutes and you have 10 levels, that's an extra 3-minute loading time for each level. Any loading times more than a couple of seconds and I won't play a game.. and even those load times piss me off.
I seriously doubt compile times for shaders will ever get that long. Shaders are very short.
 
There's nothing new here. This problem already exists in every other aspect of the rendering process. You can have VBO broken on nVidia, occlusion query broken on ATI, separate blend broken on XGI, stencil test broken on Trident, volumetric texturing broken on 3dlabs etc. .... nothing new under the sun.

I have yet to hear a good argument why high level compilers would be so inherently more prone to being buggy than other parts of the rendering pipeline. And it's not like such bugs can't be fixed. In shaders you at least have #defines, with which you can probably work around the problem in most cases for the particular driver that is problematic.

Even if the compiler presents absolutely no more difficulty to maintain and has few bugs, it's still added complexity on top of all of the other parts of the rendering path. I don't know enough about compiler programming to give any informed opinion on them, but they don't seem to be the most simple thing in the world either, especially when they're supposed to be optimizing code as well. That was the reason to put the compiler into the driver, right? To optimize shader code at a high level? That has to be harder than just compiling it so it'll run, right? Also, am I wrong in assuming that an error in the compiler code would have much more far-reaching effects than a broken function in the driver?

In no way am I saying that having the compiler in the driver is unworkable as a solution, or in any way inferior (it may well be superior) to the HLSL method of doing things; merely that there are definite difficulties in proceeding that way as well. Whether they can be overcome or not is utterly beyond my ken, I leave that to the ARB and the IHVs.
 
psurge said:
Regarding multiple BRDFs...

Is it still impractical to approximate your BRDFs using
spherical harmonics or wavelets? That would leave you with a single shader for BRDFs, taking input from a per material BRDF "texture".

Regards,
Serge

Wavelets are impractical in general (though I haven't yet invested enough time in the wavelet BRDF solutions to say yes/no). You can do it with SH with a lot of work and sacrifices, but really you don't need either. In most cases you can find a separable approximation that's so close to the real thing that no one will notice a difference.

You don't even need that, either. You could also take a BRDF and turn it into a 2D array of 2D spherical functions (might look something like this for the Phong BRDF). You then look up the 'reflectance spherical function' corresponding to your incident light vector by converting the light vector to spherical coordinates (though you'd probably want a modified set of spherical coords in the range of [0,pi) for theta and phi so you have a square texture), then use your eye vector as a lookup into this smaller sub-texture to retrieve the BRDF. But then you're consuming around 4MB for a BRDF with only 256 samples to it, so it's better to go with the other techniques anyway.
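For what it's worth, a rough HLSL sketch of that kind of lookup (the tiling layout and the coordinate mapping here are my own assumptions, not necessarily the exact scheme described above, and L and V are assumed normalized and in the same local frame):

// Tabulated BRDF: a big 2D texture holds a grid of small 2D reflectance
// functions. The light direction picks the tile, the eye direction picks
// the texel inside it.
sampler brdfMap;
float2  tileCount;       // e.g. 16 x 16 tiles

float2 dirToUV(float3 d)
{
    // spherical coords remapped to [0,1) x [0,1)
    return float2(acos(d.z) / 3.14159,
                  atan2(d.y, d.x) / 6.28318 + 0.5);
}

float4 psBRDF(float3 L : TEXCOORD0, float3 V : TEXCOORD1) : COLOR
{
    float2 tile = floor(dirToUV(L) * tileCount) / tileCount;  // which sub-function
    float2 uv   = tile + dirToUV(V) / tileCount;              // where inside it
    return tex2D(brdfMap, uv);
}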

"That would leave you with a single shader for BRDFs, taking input from a per material BRDF "texture"."

Yes, and this is what I've been saying for 2 years now, only to get a lot of "what's a BRDF?" type responses :)
 
Ilfirin said:
If your aim is realistic lighting (big if), and you haven't adopted a BRDF model yet, then yes, the # of shaders can equate to better quality. Plastic requires a different lighting equation than steel, requires a different lighting equation than aluminum, requires a different lighting equation than iron, requires a different lighting equation than glass, requires a different lighting equation than velvet, etc etc.

This means one shader for each class of materials, not one for every object. You typically don't have one iron object, one velvet, one plastic, one wood etc. You usually have many of each of them. The same iron shader should be reusable for each iron object. The small individual differences between objects of the same class can be (and should be) handled through parameters to the shader.

Ilfirin said:
It's only a matter of time before the complexity, length, and number of shaders outgrows what is practical for JIT compilation.

I don't think so at all. Now imagine that you were indeed right, and JIT compilation turned problematic for some large projects. What would happen? The ARB would just proceed with issue 24 from the spec:

24) Do we need a way to get object code back, just like the model of C
on host processors?

DISCUSSION: Lots in email on the arb-gl2 mailing list. This is about
lowest-level, machine specific code that may not even be portable within
a family of cards from the same vendor. One main goal is to save
compilation time. There seems to be general consensus that this has
merit.

RESOLUTION: This is an interesting capability to have, but will be
deferred to the next release or could be added as a separate extension.
http://oss.sgi.com/projects/ogl-sample/registry/ARB/shader_objects.txt
 
Ilfirin said:
How would packing a bunch of shaders into one long switch statement solve anything? You'd just have one really long shader instead of a bunch of small ones, which is inefficient for the programmer, the compiler, and the computer that has to execute it.

I'm not saying that this is the case, though I wouldn't be surprised if a static switch statement were more or less free, or at least cheaper than switching the shader.
 
And overriding principles (shortest register count, shortest instruction count) are too simplistic to capture everything that needs to be done, because they are competing goals.
No, they're not competing goals if you don't have a significant performance penalty simply from increasing temporary register utilization beyond a very low number. Perhaps you mean from a hardware design standpoint? I agree, but then I'm not saying glslang doesn't have a theoretical advantage; I'm saying it is obviously in an IHV's interest to avoid certain mistakes as high priorities with a given goal, if possible.

They are competing goals on some architectures. Eliminating registers bloats code by forcing recalculations to occur. Let's say you are using 4 registers, and using a 5th register drops your performance by 25%. To avoid this, you eliminate the register by redoing some calculations, but in doing so, you added 25% more code. For example, maybe you normalized a register (3 instructions) and saved the result for later reuse in two other expressions. But you now have to eliminate this extra register, so you just do the normalize twice, instead of eliminating subexpressions.
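In source terms that trade-off looks something like this (a toy example, not taken from any real shader):

// Version A: normalize once and keep the result - shorter code, one more
// live temporary register.
float4 psA(float3 N : TEXCOORD0, float3 L1 : TEXCOORD1,
           float3 L2 : TEXCOORD2) : COLOR
{
    float3 n = normalize(N);
    float  d = saturate(dot(n, L1)) + saturate(dot(n, L2));
    return float4(d, d, d, 1);
}

// Version B: recompute the normalize - more math, one fewer register.
// Which version is faster depends entirely on the hardware.
float4 psB(float3 N : TEXCOORD0, float3 L1 : TEXCOORD1,
           float3 L2 : TEXCOORD2) : COLOR
{
    float d = saturate(dot(normalize(N), L1))
            + saturate(dot(normalize(N), L2));
    return float4(d, d, d, 1);
}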


The most optimal code is not necessarily at the extremes (shortest actual shader, or shader with fewest registers used), but is somewhere in between, and finding the global minimum is extremely hard.


Outside of the NV3x, what type of performance yield are you proposing from this compared to what can be done with the LLSL?

I'm just telling you that the issue is a lot more complex than just "shortest shader or shortest register count". Not all instructions have single-cycle throughput, and they certainly have differing latencies, so instruction selection is different for each piece of HW. For example, LERP is way more expensive than MIN/MAX on NVidia hardware. RSQ is expensive on NVidia, so sometimes using a Newton-Raphson approximation is better. Also, symbolic high-level manipulation can yield improvements; here's an example:

X = normalize(L) dot normalize(N)

Today, this gets compiled into something like

a = L dot L
b = rsq a
c = L * b

d = N dot N
e = rsq d
f = N * e

X = c dot f

But algebraically, this can be manipulated into

X = (N . L) / (|N| |L|)
X = (N . L) / sqrt((N . N) * (L . L))

which is

a = N . L
b = N . N
c = L . L
d = b * c
e = rsq d
X = a * e

You've shaved off one rsq and one instruction, but traded them for an extra dot product. Depending on the HW, this may or may not be a win, since the RSQs might execute in a different unit and might be able to run in parallel, whereas the extra dot product has to run serially. Who knows.

Microsoft's profiles have to know about more than just general goals like "shortest X or Y", they must also know about the individual timings and latencies of the VLIW instructions that will be used to implement the LLSL operations.

Ditto for predicate vs branch vs LERP vs CMP vs MIN/MAX

In fact, FXC isn't able to eliminate the extra RSQ as shown above, so NVidia takes a hit, because their RSQ is expensive. In fact, FXC doesn't even generate DX9 normalize macros in the shader, which makes it even harder for the driver to rewrite the expression.

There are loads of other DX9 HLSL library functions which might be directly accelerated on future hardware like faceforward, smoothstep, and transpose. With LLSL, the semantics of these operations are lost because they are replaced with a code expansion. It becomes very difficult for the driver to recognize what is happening and substitute alternatives using algebraic identities after that.
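A trivial example of what gets lost (nothing here is specific to any driver):

// At the HLSL level this is one recognizable intrinsic...
float fogStart;
float fogEnd;

float4 psFog(float dist : TEXCOORD0) : COLOR
{
    float f = smoothstep(fogStart, fogEnd, dist);
    return float4(f, f, f, 1);
}
// ...but in the LLSL it comes out as the expanded form, roughly
//   t = saturate((dist - fogStart) / (fogEnd - fogStart));  f = t*t*(3 - 2*t);
// so a driver for hardware with a native smoothstep-style instruction would
// have to pattern-match the expansion to get the intrinsic back.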


Finally, with regards to JIT compilation and dynamic optimization: this does not incur significant overhead, and has been used for years on some systems (Smalltalk, Java).

The way it would work is this: The driver keeps a small table of statistics for the "most active" shaders used, it can do this in "debug mode" or in retail mode, it doesn't matter. For the most active shaders, the driver further records which of the runtime constants passed to the constant registers don't change very much.

After the driver has collected this profile information, the compiler can then use it to generate "speculative" compiles of hot shaders. A speculative compile is one where you ASSUME you know the values of those constants which you found not to change very often.

This can lead to constant propagation, algebraic and strength reduction opportunities, along with removing branches, min/max/lerps, etc.

You also compile a version of the shader that is based on not knowing the value of runtime constants.

Now the driver, armed with these two shader versions uploaded to the card, can choose which one really gets bound (when asked for) by looking at what constants were fed via the API. If the constants fed match up with the profile statistics, it chooses the "known constant" shader, if not, fallback to the "unknown" one.
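In shader terms the idea might look like this (entirely hypothetical driver behaviour, sketched as source code for clarity):

// The shader as the application submitted it: fogDensity is a runtime constant.
float fogDensity;

float4 psGeneric(float dist : TEXCOORD0, float4 base : COLOR0) : COLOR
{
    float f = exp(-fogDensity * dist);                // always evaluated
    return lerp(float4(0.5, 0.5, 0.5, 1), base, f);
}

// The speculative version the driver could build after its statistics show
// fogDensity is almost always 0: the whole fog term folds away.
float4 psSpecialized(float dist : TEXCOORD0, float4 base : COLOR0) : COLOR
{
    return base;      // exp(0) == 1, and lerp(grey, base, 1) == base
}
// At bind time the driver picks psSpecialized when the constant it was fed
// matches the profiled value, and falls back to psGeneric otherwise.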

This technique is used in C and C++ compilers to overcome performance problems related to dynamic dispatch and polymorphism. Over the years, as programmers use languages that offer more dynamic method invocation (pointers to functions, et al), compilers have had a tougher time figuring out how to do global analysis and method inlining.

With speculative compilation, the compiler can use profiling data collected from real application runs to generate code that looks like this

void foo(Object* b)
{
    if (b == &B)                   // profiling showed b almost always points at B
    {
        // inlined body of B.BAR()
    }
    else
    {
        b->BAR();                  // rare case: normal virtual dispatch
    }
}

It does this, because perhaps B.BAR() is called 90% of the time, but there is a rare chance that the 'b' pointer points to a different object.

With shaders, the compiler could speculatively propagate constants, and determine if the result yields an improvement based on some heuristic (e.g. don't do it if it only shaves off n cycles and those particular constant values only appear 70% of the time, since some cycles are lost because of state changes).

There are, in fact, a boatload of compilation techniques available to GPU compiler authors, and OpenGL gives us a platform to explore this; DirectX9 does not.
 
Chalnoth said:
I seriously doubt compile times for shaders will ever get that long. Shaders are very short.

I think a lot of the stuff I've been hearing from people saying their shader libraries take that long to compile is exaggerated a bit, but even if it's not, I'm betting 90% of that time is spent loading and saving the shaders and their compiled results to disk. Split up the compilations enough that everything can be reasonably held in RAM at once and I'd expect that number to drop dramatically. But I am still a bit hesitant about doing more stuff at load time.. I really do hate load times so much, and do everything humanly possible to remove them completely. Add in shader compilation and it's just one more thing you have to try and balance, a little more complexity to a problem that is already incredibly complex.
 
There's also the fact that FXC can only compile one shader at a time. If you have 2500 shaders, you must exec FXC.EXE 2500 times.

That's 2500 processes created, 2500 loads and inits of a Windows app, etc. Way more expensive than a memory-resident compiler, or something like FXC *.hlsl (all 2500 compiled at once), which FXC doesn't support today.

Imagine if C compilers could only compile one .c file per invocation.
 
Eolirin said:
Even if the compiler presents absolutely no more difficulty to maintain and has few bugs, it's still added complexity on top of all of the other parts of the rendering path.

Still nothing new under the sun. Every time a new generation of hardware enters the market, a whole bunch of new extensions and capabilities are exposed. This adds lots of complexity on top of whatever was before it. OpenGL 1.4 is additional complexity on top of OpenGL 1.3. A shader compiler just happens to be a larger thing to do than an average extension. But it's not like it can be compared to the effort on the rest of the ICD. It's not 50% compiler + 50% rest of ICD. Rather something like 10% compiler, which is still a large chunk, but nowhere near as dramatic a change as has been implied in this debate.

Eolirin said:
I don't know enough about compiler programming to give any informed opinion on them, but they don't seem to be the most simple thing in the world either, especially when they're supposed to be optimizing code as well. That was the reason to put the compiler into the driver, right? To optimize shader code at a high level? That has to be harder than just compiling it so it'll run, right?

Sure it's harder. And sure we'll see bugs in the beginning, just like there are plenty of bugs in the DX9 HLSL compiler right now. But compilers will mature, just like any other compiler. I can't even recall ever having had a problem with any compiler for CPUs where it generated incorrect code. In a few cases I have experienced MSVC just failing to compile and reporting internal errors; often just making a minor change to the code makes it work again. GPU compilers will probably reach that level eventually. Not instantly, but I'm sure they will become more than good enough within months.

Eolirin said:
Also, am I wrong in assuming that an error in the compiler code would have much more far-reaching effects than a broken function in the driver?

Yes, you're wrong on that. An error in the compiler would typically just fail to compile and return an error, or possibly just generate incorrect code, which would simply result in incorrect rendering. But I doubt the compiler would ever cause an instant reboot or anything of that sort.
 
It is less likely that optimizer bugs will show up on the GPU. On the CPU, you can get obscure C compiler optimization bugs to do with multithreading, over-eager dead code elimination (timing loops), stack unwinding, and memory alignment. Most of these won't be an issue for the GPU.

The GPU's statelessness (no ability to modify shared data between pixels or vertices) makes a lot of bugs that have to do with stale register values go away, not to mention the lack of memory and a stack as well.

On future "more CPU-like" architectures, it could be an issue. Something like the PS2's "Scratchpad ram" could be dangerous to use correctly in the context of shaders and optimizers due to concurrency and ordering issues.
 
Humus said:
The same iron shader should be reusable for each iron object.

Yes, but really, how many different iron objects do you expect in a game not set on the Titanic? A few dozen throughout the game, maybe. That's the amount of "uniqueness" I meant. That, and the procedural thing.

I don't think so at all. Now imagine that you were indeed right, and JIT compilation turned problematic for some large projects. What would happen? The ARB would just proceed with issue 24 from the spec:

That would put most of my concerns to rest. It's not that I'm dead set against JIT shader compilation (you can find posts by me in the past supporting it); it's that I'm against being forced to do JIT. You don't always want your shader compiled on the fly (a lot of people don't do it with HLSL, even though they have the option), and even when you do, you'd always want a compiled version of the shader that runs well on everyone's hardware and WORKS, as a backup for when the inevitable driver release comes that cuts your game's framerate in half, or leaves it so it just doesn't work anymore. Putting what's basically your entire company's future* in the hands of driver writers is a big commitment.

*No one is going to buy or play your game if it's constantly broken by new driver releases, or suddenly starts running suboptimally, regardless of whether it's an IHV's fault or not. And these days you don't have the option of a failed release if you plan on doing another.
 