The Experts Speak... "Automatic Shader Optimizer"

http://www20.tomshardware.com/graphic/20030912/half-life-01.html
In addition to the developer efforts, our driver team has developed a next-generation automatic shader optimizer that vastly improves GeForce FX pixel shader performance across the board
Developers like Valve and others are already writing specific shaders for Nvidia with partial precision as well as specially ordered instructions to better suit their hardware.

My question:

What could they be doing that is "automatically optimizing shaders", beyond what the developers are already doing?

Just looking for some completely open comments from knowledgeable people.
 
Perhaps it's just a new cool name for shader replacement:
their drivers automatically replace the shaders when they detect the .exe or something ;)
 
Selectively reducing precision, perhaps? Or perhaps not... This has been suggested by others on this forum before, and a point that was brought up was that it is impossible for the driver to always be able to predict (without any application-specific coaching) whether the shader needs the high dynamic range available through floating-point precision or whether only the mantissa is necessary (making fixed-point precision viable).
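To make the "does it need the range?" problem concrete, here is a tiny numpy sketch (the distance and texture size are made up, not taken from any real shader); numpy's float16 has the same 10-bit-mantissa / 5-bit-exponent layout as the FP16 these cards use:

import numpy as np

# Dynamic range problem: a squared distance used for attenuation overflows FP16.
d = np.float32(300.0)                     # hypothetical world-space distance to a light
print(np.float16(d * d))                  # inf, since FP16 tops out at 65504

# Precision problem: addressing a big texture loses fractions of a texel.
coord32 = np.float32(4094.5 / 4096.0)     # centre of texel 4094 on a 4096-wide texture
coord16 = np.float16(coord32)
print(coord32 * 4096, np.float32(coord16) * 4096)   # 4094.5 vs 4094.0

A driver looking only at the shader code has no way to know whether values like these ever show up at runtime, which is exactly why blind demotion is unsafe.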

The only thing that might be able to be made 100% reliable is instruction reordering to utilize any instruction co-issuing capabilities a video card might have. Or it's possible that arithmetic normalization could be detected and replaced with a normalization cube-map lookup. The latter could get problematic, though, if the vector was generated in the fragment shader, since that would make the lookup a dependent texture lookup.
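The cube-map idea is easy to sketch outside a driver. Here's a toy Python/numpy version (the 64-texel face size and the function name are just my assumptions) that quantizes a direction to a texel centre, the way a normalization cube map would, and compares the result against true arithmetic normalization:

import numpy as np

FACE_RES = 64  # texels per cube-map face edge (assumed for this sketch)

def cubemap_normalize(v):
    """Quantize direction v to a texel centre on its dominant-axis face, then
    return the unit vector that texel would have stored."""
    ax = int(np.argmax(np.abs(v)))            # dominant axis picks the face
    others = [i for i in range(3) if i != ax]
    m = abs(v[ax])
    # project onto the face and snap to a texel centre (the source of the error)
    u = (np.floor(0.5 * (v[others[0]] / m + 1.0) * FACE_RES) + 0.5) / FACE_RES
    w = (np.floor(0.5 * (v[others[1]] / m + 1.0) * FACE_RES) + 0.5) / FACE_RES
    d = np.empty(3)
    d[ax] = np.sign(v[ax])
    d[others[0]] = 2.0 * u - 1.0
    d[others[1]] = 2.0 * w - 1.0
    return d / np.linalg.norm(d)              # this is what the texel stores

rng = np.random.default_rng(0)
dirs = rng.normal(size=(2000, 3))
worst = max(np.linalg.norm(cubemap_normalize(v) - v / np.linalg.norm(v)) for v in dirs)
print(worst)   # small but nonzero: close enough for lighting vectors, not an exact replacement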
 
PS/VS 2.0 shaders (or their LLSL OpenGL equivalents) are referred to as being written in an "assembly" language, but that's not really true. A better analogy is Java (or CIL for .NET) bytecode, which is then compiled just-in-time to machine code to be run on the target machine.

In the case of graphics shaders, the JIT compiler is part of the driver runtime. There are numerous indications that neither Nvidia's nor ATI's driver compiler is anywhere near optimal in terms of doing the many perfectly legitimate optimizations that compilers do. More indications w.r.t. Nvidia than ATI, but then again this could be because the compilers that compile from HLSL to PS 2.0 LLSL ("assembly", except again that's a misleading term) do so in a way that better matches ATI's underlying hardware than Nvidia's. Of course a lot of this has to do with the very unusual limitations that Nvidia's hardware presents, particularly, it would seem, in the form of severe performance penalties for using more than 2 FP32 or 4 FP16 temp registers. Most modern architectures don't have anywhere near such severe register pressure, and so the well-studied optimizations in compilers haven't tended to focus on solving that problem. (In fact the opposite: they very often increase performance at the expense of using extra registers.)
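To make the register-pressure point concrete, here's a toy sketch of my own (the four-texture blend and the instruction format are made up, and it is not meant to represent what either driver actually does): the same dataflow scheduled two ways, with a trivial liveness scan counting how many intermediate values must be held at once.

def peak_live(instrs):
    """instrs: list of (dest, sources). Returns the peak number of values that
    are live across an instruction boundary (defined earlier, still needed later)."""
    first_def, last_use = {}, {}
    for i, (dest, srcs) in enumerate(instrs):
        first_def.setdefault(dest, i)
        for s in srcs:
            last_use[s] = i
    peak = 0
    for boundary in range(1, len(instrs)):
        live = [v for v in first_def
                if first_def[v] < boundary and last_use.get(v, -1) >= boundary]
        peak = max(peak, len(live))
    return peak

# all texture fetches issued up front (one plausible ordering)...
upfront = [("t0", ["uv0"]), ("t1", ["uv1"]), ("t2", ["uv2"]), ("t3", ["uv3"]),
           ("s0", ["t0", "t1"]), ("s1", ["s0", "t2"]), ("out", ["s1", "t3"])]

# ...versus each fetch interleaved with the arithmetic that consumes it
interleaved = [("t0", ["uv0"]), ("t1", ["uv1"]), ("s0", ["t0", "t1"]),
               ("t2", ["uv2"]), ("s1", ["s0", "t2"]),
               ("t3", ["uv3"]), ("out", ["s1", "t3"])]

print(peak_live(upfront), peak_live(interleaved))   # 4 vs 2 live temporaries

On hardware where going past 2 FP32 temporaries is expensive, that kind of reordering is exactly the sort of legitimate optimization a better compiler could find.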

The point is that there is likely a lot of room for Nvidia to improve in terms of perfectly legitimate optimizations in the shader compiler, but there may be very significant difficulties, even insurmountable ones, in realizing those optimizations. Plus, Nvidia's recent history makes one suspect that at least most of the "optimizations" being discussed here are in fact illegitimate: special-casing, dropping precision (without automatically determining that there will be no effect on final output, something very difficult to do even in principle), changing the actual shader output, etc.
 
What I want to know is what would happen if they hand-wrote the shaders for the 9800 Pro. I wonder if it would increase performance.
 
The only thing that might be able to be made 100% reliable is instruction reordering to utilize any instruction co-issuing capabilities a video card might have
But in this case wouldn't they need to prescan an entire block of instructions, or the entire shader routine? Then somehow cache it on the fly and reorder the instructions without changing the final output? And do all of this in a timely manner?

Let alone the driver somehow understanding what the shader's goal is to begin with.
 
Shaders have to be registered before use. The recommendation is that this occurs at the start of the program / level, because compilation of the shader can take arbitrary time. As a result, there is plenty of time to perform even relatively complex analysis.
 
I have been wondering for a while why they don't try something like this. If it was correct at predicting even 50% of the time that fp16 could be used instead of fp32, that would offer significant performance increases. And sheesh, 50% is like flipping a coin. Of course, I suppose that if developers make use of fp24 completely then fp16 will never be enough. Who knows.

btw hellbinder, I thought you lived in Canada, but you said you live in Coeur d'Alene now. I was just wondering because my bro is going to school in Moscow at UI. (And I lived there (Pullman) a long time ago too, not that it's relevant.)
 
I don't think the performance gains from using FP16 will ever be that "significant"; the performance increase comes purely from the lower register overhead.
 
Here's an example I thought up. I put it in words so non-programmers can follow it. It's the simplest example I could think of that would show differences in how the shader compilers in Nvidia's and ATI's drivers might work.

store 0.8,0,0 in constant0
1) Texture lookup for unit0 and put results in constant1
2) Texture lookup for unit1 and put results in constant2
3) multiply constant0 by constant1 and store in constant0
4) add constant0 to constant2 and store in constant0
5) output constant0 as pixel color

nvidia would probably want something like this on their hardware:
store 0.8,0,0 in constant0
1) Texture lookup for unit0 and put results in constant1
2) multiply constant0 by constant1 and store in constant0
3) Texture lookup for unit1 and put results in constant1
4) add constant0 to constant1 and store in constant0
5) output constant0 as pixel color

ATI would probably want something like this on their hardware:
store 0.8,0,0 in constant0
1) Texture lookup for unit0 and put results in constant1
2) Texture lookup for unit1 and put results in constant2
3) multiply constant0 and constant1 then add constant2 and store in constant0 (mad instruction)
4) output constant0 as pixel color

If this were part of a longer shader, on ATI's hardware we know the scalar (i.e. alpha) instruction has its own hardware. Since the constant is 0.8,0,0, we know that only the red component is important, so that instruction might get swizzled over to the alpha channel if there are more RGB instructions than alpha instructions. Even this is probably a simple compiler optimization for ATI. I'm guessing the trick optimizations deal with more hardware-specific tricks that the general public will never know about. A different variation for Nvidia's driver would be if the developer coded with a mad; then Nvidia's compiler would have to break that mad up into a separate multiply and add so it could reduce register usage.
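And just to check the claim that this kind of reshuffling doesn't change the final product, here's a quick Python sketch with made-up texture values for the example above:

import numpy as np

c0   = np.array([0.8, 0.0, 0.0])
tex0 = np.array([0.5, 0.25, 1.0])   # stand-ins for the two texture lookup results
tex1 = np.array([0.1, 0.6, 0.3])

# first version: both lookups, then multiply, then add
r0, r1, r2 = c0, tex0, tex1
original = r0 * r1 + r2

# "nvidia-style": interleave the second lookup so the same register can be reused
r0, r1 = c0, tex0
r0 = r0 * r1
r1 = tex1                           # second lookup lands in the now-free register
reordered = r0 + r1

# "ati-style": fold the multiply and add into a single mad
mad = c0 * tex0 + tex1

print(np.allclose(original, reordered), np.allclose(original, mad))   # True True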

For those wondering why I didn't include the storing of a constant in the instruction count, it's because the hardware should only need to load the constant when the shader is first loaded because it doesn't change.

I truly hope Nvidia doesn't try to dynamically lower precision with their compiler, because such an optimization will hurt quality.
 
If it was correct at predicting even 50% of the time that fp16 could be used instead of fp32, that would offer significant performance increases
The _PP hint for FP16 is already being tacked on to nearly every single line of shader code currently being written. FP16 use at this point is a foregone conclusion. There is simply nothing to be gained, as far as I can see, from dynamically assigning FP16. It's already done.

What is of more interest to me is this, from Nvidia's official response to the HL2 numbers:
Part of this is understanding that in many cases promoting PS 1.4 (DirectX 8) to PS 2.0 (DirectX 9) provides no image quality benefit. Sometimes this involves converting 32-bit floating point precision shader operations into 16-bit floating point precision shaders in order to obtain the performance benefit of this mode with no image quality degradation. Our goal is to provide our consumers the best experience possible, and that means games must both look and run great.
It would seem that what they are really doing with this "Automatic Shader Optimizer" is converting as much PS 2.0 as they can to DX8 or PS 1.4.

Is this a reasonable assumption?
 
Many shaders will contain a Taylor series expansion or something similar. If I were in the business of sacrificing image quality for speed, one of the things I'd do is shorten up these expansions.
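For instance (a toy sketch, not taken from any real shader), sin from its Taylor series with progressively fewer terms:

import math

def sin_taylor(x, terms):
    """sin(x) ~ x - x^3/3! + x^5/5! - ..., keeping the given number of terms."""
    total, sign = 0.0, 1.0
    for n in range(terms):
        k = 2 * n + 1
        total += sign * x ** k / math.factorial(k)
        sign = -sign
    return total

x = math.pi / 3                      # an arbitrary test angle
for terms in (5, 3, 2):
    print(terms, "terms, error:", abs(sin_taylor(x, terms) - math.sin(x)))
# dropping from 5 terms to 2 costs about 1% at this angle: the quality-for-speed trade being described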
 
In ShaderMark 2.0 (beta?), partial precision gives only an increase of around 36% overall. The new HLSL path, PS2.0_a, which was specially made for the NV3x cards, gives an overall performance increase of 0.5% (!!)

See the posting from tb, the author of ShaderMark, and the posting from the "Gast" (guest) which gives the overall performance increases with _PP and PS2.0_a:
3DCenter.de Forum

According to Demirug, Nvidia says that the new HLSL path PS2.0_a will not give better results without the new 51.75 drivers, so the results could be better with those drivers; but I'm still wondering why the drivers should have such a big impact when you already use the NV3x-optimised rendering path. ???
 
Because there could be some valid optimizations nVidia could implement based on the LLSL code structure being expressed in a form more easily useful for their LLSL->assembly compiler/scheduler. It is just that all the other garbage nVidia is doing, because the R3xx still outperforms their top card, serves to obscure this possibility (fairly completely, at the moment, I think). "Big impact" is open to interpretation.

This doesn't exactly seem unfair when it makes so much sense considering their behavior, but I am more interested in some useful investigation of the shader behavior, including contrasts of the "first and special" Det 5x drivers with the ones that end up being fully exposed to public scrutiny to more clearly separate optimizations from "optimizations".
 
jvd said:
What I want to know is what would happen if they hand-wrote the shaders for the 9800 Pro. I wonder if it would increase performance.

There was a thread on here somewhere where hand written and HLSL shaders were compared and I seem to remember the HLSL shaders were actually faster on ATI.
 
Enbar said:
Many shaders will contain a Taylor series expansion or something similar. If I were in the business of sacrificing image quality for speed, one of the things I'd do is shorten up these expansions.

Interestingly, this is what ATI does for sin and cos instructions, while nVidia actually has native hardware to perform these calculations at the same rate as other instructions. So if there's room for performance improvement in this respect, it's on the ATI side rather than nVidia's.
 
Dio said:
Shaders have to be registered before use. The recommendation is that this occurs at the start of the program / level, because compilation of the shader can take arbitrary time. As a result, there is plenty of time to perform even relatively complex analysis.

I would imagine they could take an MD5 checksum of the shader and then the driver could replace the shader whenever the checksum was detected. You could store precompiled results in the driver or compile on the fly.
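Something along those lines is only a few lines of logic. Here's a rough sketch (the bytecode strings and the replacement table are placeholders I made up, with Python's hashlib standing in for whatever the driver would actually use):

import hashlib

APP_SHADER   = b"ps_2_0 ...the bytecode the game registers..."   # placeholder
TUNED_SHADER = b"ps_2_0 ...a hand-tuned replacement..."          # placeholder

# hypothetical table baked into the driver, keyed by MD5 of the application's shader
REPLACEMENTS = {hashlib.md5(APP_SHADER).hexdigest(): TUNED_SHADER}

def create_shader(bytecode):
    """Runs once at registration time, so even a slow hash-and-lookup is affordable."""
    return REPLACEMENTS.get(hashlib.md5(bytecode).hexdigest(), bytecode)

print(create_shader(APP_SHADER) is TUNED_SHADER)   # True: recognized and swapped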
 
Humus said:
Interestingly, this is what ATI does for sin and cos instructions, while nVidia actually has native hardware to perform these calculations at the same rate as other instructions. So if there's room for performance improvement in this respect, it's on the ATI side rather than nVidia's.

Yes, I read somewhere that's how ATI does trig functions. I didn't know Nvidia was claiming native support for these, and I highly doubt they would do them at the same rate as a mad. The best I can imagine is that they have implemented these as macros in the hardware. In that case they could probably shortcut the macro with a less precise version. All this is guessing, though. However, if Nvidia can do a cos as fast as they can do a mad, I'll eat my socks. It just wouldn't make any sense to spend all the necessary transistors on that.
 