And overriding principles (shortest register count, shortest instruction count) are too simplistic to capture everything that needs to be done, because they are competing goals.
No, they're not competing goals if there isn't a significant performance penalty simply from pushing temporary register usage beyond a very low number. Perhaps you mean from a hardware design standpoint? I agree, but then I'm not saying glslang doesn't have a theoretical advantage; I'm saying it is obviously in an IHV's interest to make avoiding certain mistakes a high priority for a given goal, if possible.
They are competing goals on some architectures. Eliminating registers bloats code by forcing recalculations. Let's say you are using 4 registers, and using a 5th register drops your performance by 25%. To avoid this, you eliminate the register by redoing some calculations, but in doing so you add 25% more code. For example, maybe you normalized a vector (3 instructions) and saved the result in a register for reuse in two other expressions. To eliminate that extra register, you just do the normalize twice instead of sharing the subexpression.
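A rough sketch of that trade-off, in the same pseudo-assembly style as the example further down (H and V are just stand-ins for whatever two expressions consume the normalized vector). With the extra register, the normalize is computed once and shared:
a = L dot L
b = rsq a
n = L * b        (n = normalize(L), held in its own register)
r0 = n dot H
r1 = n dot V
Recomputing it to free that register means repeating the three normalize instructions in front of each use, which is exactly the code bloat being traded against the register.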
The optimal code is not necessarily at either extreme (shortest shader, or fewest registers used) but somewhere in between, and finding that global minimum is extremely hard.
Outside of the NV3x, what type of performance yield are you proposing from this compared to what can be done with the LLSL?
I'm just telling you that the issue is a lot more complex than "shortest shader or shortest register count". Not all instructions have single-cycle throughput, and they certainly have differing latencies, so instruction selection is different for each piece of HW. For example, LERP is way more expensive than MIN/MAX on NVidia hardware. RSQ is expensive on NVidia, so sometimes using a Newton-Raphson approximation is better. Also, symbolic high-level manipulation can yield improvements. Here's an example:
X = normalize(L) dot normalize(N)
Today, this gets compiled into something like
a = L dot L
b = rsq a
c = L * b        (c = normalize(L))
d = N dot N
e = rsq d
f = N * e        (f = normalize(N))
X = c dot f
But algebraically, this can be manipulated into
X = (N . L) / (|N| |L|)
X = (N . L) / sqrt((N . N) * (L . L))
which is
a = N . L
b = N . N
c = L . L
d = b * c
e = rsq d
X = a * e
You've traded two RSQs for one, saving an instruction overall, at the cost of a longer serial dependency chain. Depending on the HW, this may or may not be a win, since the two RSQs in the first version are independent, might execute in a different unit, and might be able to run in parallel, whereas here everything has to funnel serially through the single chain. Who knows.
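As an aside on the RSQ point above: the Newton-Raphson alternative is just a cheap initial guess plus one refinement step. A sketch, assuming y0 is some rough approximation of rsq(x):
y1 = y0 * (1.5 - 0.5 * x * y0 * y0)
Each step roughly doubles the number of correct bits, so whether a handful of MUL/MAD operations like this beats the native RSQ is, again, a per-hardware question.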
Microsoft's profiles have to know about more than just general goals like "shortest X or Y"; they must also know about the individual timings and latencies of the VLIW instructions that will be used to implement the LLSL operations.
Ditto for predicate vs branch vs LERP vs CMP vs MIN/MAX
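To illustrate with a single select, r4 = (r2.x >= 0) ? r0 : r1, written three ways in roughly ps_2_x-style syntax from memory (with c0.x assumed to hold 0.0, and r3 a 0/1 mask derived earlier from r2.x):
cmp r4, r2.x, r0, r1
lrp r4, r3, r0, r1
setp_ge p0, r2.x, c0.x
(p0) mov r4, r0
(!p0) mov r4, r1
Same semantics, three very different costs depending on the chip.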
In fact, FXC isn't able to eliminate the extra RSQ as I showed above, so NVidia takes a hit, because their RSQ is expensive. FXC doesn't even generate the DX9 NRM (normalize) macro in the shader, which makes it even harder for the driver to rewrite the expression.
There are loads of other DX9 HLSL library functions which might be directly accelerated on future hardware, like faceforward, smoothstep, and transpose. In LLSL, the semantics of these operations are lost because they are replaced with a code expansion, and it becomes very difficult for the driver to recognize what is happening and substitute alternatives using algebraic identities after that.
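smoothstep is a good concrete case of the lost semantics. In HLSL it is a single call, smoothstep(e0, e1, x), defined as t = saturate((x - e0) / (e1 - e0)) followed by t*t*(3 - 2*t). Once that gets expanded into a handful of LLSL arithmetic instructions and scheduled in among everything else, a driver for hardware with a native smoothstep would have to pattern-match the whole expansion back out of the instruction stream to recover the single operation.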
Finally, with regard to JIT compilation and dynamic optimization: this does not incur significant overhead, and it has been used for years on some systems (Smalltalk, Java).
The way it would work is this: the driver keeps a small table of statistics for the "most active" shaders; it can do this in "debug mode" or in retail mode, it doesn't matter. For those hot shaders, the driver further records which of the runtime constants passed to the constant registers don't change very much.
After the driver has collected this profile information, the compiler can then use it to generate "speculative" compiles of hot shaders. A speculative compile is one where you ASSUME you know the values of those constants which you found not to change very often.
This can lead to constant propagation, algebraic simplification, and strength-reduction opportunities, along with removing branches, min/max/lerps, etc.
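A tiny illustration with made-up register numbers: if the profile says c0.x is almost always 0.0, then in the speculative compile
lrp r0, c0.x, r1, r2
collapses to
mov r0, r2
and anything that was computed only to feed r1 dies with it.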
You also compile a version of the shader that is based on not knowing the value of runtime constants.
Now the driver, armed with these two shader versions uploaded to the card, can choose which one really gets bound (when asked for) by looking at what constants were fed via the API. If the constants match up with the profile statistics, it chooses the "known constant" shader; if not, it falls back to the "unknown" one.
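A minimal sketch of that bind-time choice in C++-flavored pseudo-driver code (GpuProgram, ShaderVersions, and selectVersion are invented names for illustration, not any real driver interface):

#include <array>
#include <map>

struct GpuProgram;    // opaque handle to a compiled shader already uploaded to the card

struct ShaderVersions {
    GpuProgram* generic;                            // compiled with no assumptions about constants
    GpuProgram* specialized;                        // compiled assuming the profiled constant values
    std::map<int, std::array<float, 4>> assumed;    // constant register -> value baked into 'specialized'
};

// At bind time: if every constant we speculated on still holds its profiled
// value, use the specialized compile; otherwise fall back to the generic one.
GpuProgram* selectVersion(const ShaderVersions& v,
                          const std::array<float, 4>* currentConstants)
{
    for (const auto& kv : v.assumed) {
        if (currentConstants[kv.first] != kv.second)
            return v.generic;        // speculation broken for this draw call
    }
    return v.specialized;            // all profiled constants still match
}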
This technique is used in C and C++ compilers to overcome the performance problems of dynamic dispatch and polymorphism. Over the years, as programmers have used languages that offer more dynamic method invocation (pointers to functions, et al.), compilers have had a tougher time figuring out how to do global analysis and method inlining.
With speculative compilation, the compiler can use profiling data collected from real application runs to generate code that looks like this:
void foo(Object *b)
{
    if (typeid(*b) == typeid(B))     // the type the profile says is overwhelmingly common
        { /* inlined body of B::BAR() */ }
    else
        b->BAR();                    // rare case: normal virtual dispatch
}
It does this because perhaps B::BAR() is the target 90% of the time, but there is a small chance that the 'b' pointer points to an object of some other class.
With shaders, the compiler could speculatively propagate constants and determine whether the result yields an improvement based on some heuristic (e.g. don't do it if it only shaves off n cycles and those particular constant values only appear 70% of the time, since some cycles are lost to the extra state changes).
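To put made-up numbers on that heuristic: if the specialized shader saves 4 cycles per pixel, the profiled constants match on 70% of binds, and swapping between the two compiled versions costs on the order of a thousand cycles of state-change overhead, then a batch needs to cover a few hundred pixels before the speculation breaks even; below that, the driver should just leave the generic version bound.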
There are, in fact, a boatload of compilation techniques available to GPU compiler authors, and OpenGL gives us a platform to explore them; DirectX9 does not.