DemoCoder said:
And overriding principles (shortest register count, shortest instruction count) are too simplistic to capture everything that needs to be done, because they are competing goals.
No, they're not competing goals if you don't have a significant performance penalty simply from increasing temporary register utilization beyond a very low number. Perhaps you mean from a hardware design standpoint? I agree, but then I'm not saying glslang doesn't have a theoretical advantage, I'm saying it is obviously in an IHV's interest to treat avoiding certain mistakes as a high priority for a given goal, if possible.
They are competing goals on some architectures. Eliminating registers bloats code by forcing recalculations to occur. Let's say you are using 4 registers, and using a 5th register drops your performance by 25%. To avoid this, you eliminate the register by redoing some calculations, but in doing so, you added 25% more code. For example, maybe you normalized a register (3 instructions) and saved the result for later reuse in two other expressions. But you now have to eliminate this extra register, so you just do the normalize twice, instead of eliminating subexpressions.
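(To make the tradeoff being described concrete, a rough HLSL-style sketch, with made-up function and parameter names:)

// Version A - reuse the normalized vector: fewest instructions, one extra live temp
void shadeA(float3 v, float3 tintA, float3 tintB, out float3 a, out float3 b)
{
    float3 n = normalize(v);    // expands to dp3 / rsq / mul once
    a = n * tintA;
    b = n * tintB;
}

// Version B - recompute it: one less live temp, roughly three extra instructions
void shadeB(float3 v, float3 tintA, float3 tintB, out float3 a, out float3 b)
{
    a = normalize(v) * tintA;   // dp3 / rsq / mul again...
    b = normalize(v) * tintB;   // ...and again
}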
That seems a waste of a calculation unit for that clock cycle, purely due to a design limitation, unless this limitation saved you a lot of transistors or allowed you to do something significantly unique and useful. If it didn't, this design would seem to be a mistake, and a mistake that IHVs would be trying to avoid.
Now, does DX HLSL allow you to work around this for hardware that faces this problem? Yes, by a few methods currently: a new profile (as done for NV3x, the architecture that has required this so far), or the "LLSL"->GPU opcode compiler for an existing profile (if it allows the issue to be addressed effectively). Now, have I said there aren't issues with these, or have I said that the significance of the issues depends on the IHVs failing to avoid the problem in the first place, and then comes down to how cumbersome the solutions are for all involved?
As far as disagreeing with you goes, it seems to me we depart fundamentally on a few key points: where you say users will have to download patches for every application every month; my questions about the apparent errors in your N*M discussion, at least AFAICS; and your insistence on replacing this discussion with commentaries on MCD/ICD and Pentium 4/SSE that don't speak to what we are disagreeing on.
Now, does glslang allow you to work around this? Yes, by an IHV writing a compiler, and taking on the additional challenges that come with that. These challenges are new work an IHV has to do.
Insurmountable challenges? I certainly hope not, nor do I see why they should be...for the purposes of my disagreement with what you stated, I'm proposing the challenges are more than the "10%" left over from the "90%" already done for DX HLSL. For the purposes of proposing my initial commentary, I'm also proposing that when these challenges are overcome, when overcoming them is actually necessary, and whether an IHV gets more out of them for their hardware, are all relevant to which approach delivers on its advantages.
...
Please feel free to discuss what you believe is my error in any and all of these, if you think I am in error (like this post seems to do). Please don't feel free to insist on discussing analogies that obscure these issues instead, while throwing insults at me, no matter what you think. I don't think such expectations are unreasonable at all.
The most optimal code is not necessarily at the extremes (shortest actual shader, or shader with fewest registers used), but is somewhere in between, and finding the global minimum is possibly extremely hard.
Outside of the NV3x, what type of performance yield are you proposing from this compared to what can be done with the LLSL?
I'm just telling you that the issue is a lot more complex than just "shortest shader or shortest register count". Not all instructions have single-cycle throughput, and they certainly have differing latencies, so instruction selection is different for each piece of hardware. For example, LERP is way more expensive than MIN/MAX on NVidia hardware. RSQ is expensive on NVidia, so sometimes using a Newton-Raphson approximation is better.
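(As a sketch of that last trade, assuming the hardware can supply some cheap low-precision starting estimate x0 for rsq(a) — one Newton-Raphson refinement is just a handful of mul/mad work:)

// One Newton-Raphson refinement step toward 1/sqrt(a),
// starting from a rough estimate x0:
float refineRsq(float a, float x0)
{
    return x0 * (1.5 - 0.5 * a * x0 * x0);
}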
My premise was simply that: 1) expressing the functionality in the most compact form is an approach that seems widely applicable...the LLSL->opcode compiler does have the ability to address the type of issue you specify; 2) it takes specific issues like the NV3x register problem to prevent that solution from being applicable (i.e., to require a new profile). Your characterizations (that I was addressing then) seem to depend on there being an abundance of such issues, and on their being addressed one at a time each month, only by patching applications/the HLSL compiler, and I was pointing out that there seems to be a significant set of issues of that type that can be addressed otherwise, if they appear.
Also, symbolic high-level manipulation can yield improvements; here's an example.
Well, the LLSL has a "nrm" macro, but the HLSL compiler does have issues with properly implementing some macros. I presume this is one of them, then?
This is an issue with MS's DX HLSL implementation, not the overall approach (except as they are failing to meet their challenges...strengths and weaknesses, as I mentioned). If they don't address this in a certain window of application patching for affected applications, it will manifest as a direct advantage opportunity for glslang. This is not because I'm making excuses for MS, it is because of how that actually compares to how glslang is addressing it at the moment (not at all).
you've shaved off one rsq and one instruction, but traded for a dot product. Depending on the HW, this may or may not be a win, since the RSQs might execute in a different unit, and might be able to be run in parallel, whereas the extra dot product has to run serially. Who knows.
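(Purely for illustration — not necessarily the exact example being referenced — one rewrite with roughly that shape:)

// Naive: 3 dp3, 2 rsq, 2 mul
float cosAngleNaive(float3 a, float3 b)
{
    return dot(normalize(a), normalize(b));
}

// Rewritten: 3 dp3, 1 rsq, 2 mul - one rsq and one instruction fewer,
// but the remaining rsq now depends serially on the dot products
float cosAngleRewritten(float3 a, float3 b)
{
    return dot(a, b) * rsqrt(dot(a, a) * dot(b, b));
}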
Microsoft's profiles have to know about more than just general goals like "shortest X or Y"; they must also know about the individual timings and latencies of the VLIW instructions that will be used to implement the LLSL operations.
If they use the macros, they won't. If they don't use the macros, they will, and glslang will have to overcome less to deliver on more of its advantage. Our disagreement here consists of me pointing out that the LLSL has that macro...I understand the premise you've mentioned.
Ditto for predicate vs branch vs LERP vs CMP vs MIN/MAX
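(For illustration, the same source-level select can reasonably come out as a cmp, a predicated pair of moves, or the arithmetic form below, and which is cheapest differs per chip — hypothetical HLSL:)

// One high-level select...
float3 pick(float x, float3 a, float3 b)
{
    return (x >= 0) ? a : b;
}

// ...and one arithmetic encoding of it, using step/lerp instead of a branch
float3 pickArith(float x, float3 a, float3 b)
{
    return lerp(b, a, step(0, x));
}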
Well, I've covered what the LLSL specification allows...it is a matter of what MS and IHVs achieve in adapting each approach, and when.
In fact, FXC isn't able to eliminate the extra RSQ as I showed above; therefore, NVidia takes a hit, because their RSQ is expensive. In fact, FXC doesn't even generate DX9 normalize macros in the shader, which makes it even harder for the driver to rewrite the expression.
These are the same issue and problem, introduced in a confusing way here. Please clarify: if "FXC" generated the normalize macro (and others), then it would solve all of these problems you list, because the IHV would then be in direct control of these decisions, right? You seem to say that not expressing this using the normalize macro is a new problem in addition to the discussion before it...?
There are loads of other DX9 HLSL library functions, like faceforward, smoothstep, and transpose, which might be directly accelerated on future hardware.
No relation between the face register and faceforward function, or just some limitation that cannot be resolved by the drivers?
In any case, implementing these efficiently on hardware that could benefit would require an update and patch. Was your N*M discussion meant for hardware upgrades over time?
With LLSL, the semantics of these operations are lost because they are replaced with a code expansion.
It becomes very difficult for the driver to recognize what is happening and substitute alternatives using algebraic identities after that.
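(To illustrate with one of the library functions named above, assuming the standard smoothstep definition — the one-line call is what the author writes, the expansion is roughly what the driver is left to pattern-match:)

// What the shader author writes:
float viaLibrary(float edge0, float edge1, float x)
{
    return smoothstep(edge0, edge1, x);
}

// Roughly what survives after expansion to low-level code:
float viaExpansion(float edge0, float edge1, float x)
{
    float t = saturate((x - edge0) / (edge1 - edge0));
    return t * t * (3 - 2 * t);
}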
With the current LLSL, yes. If MS isn't looking forward at least a year or a bit more, this would be a pretty significant issue, because the LLSL needs changes beyond just the compiler and application needing (a minimum of) one patch to better target the LLSL specifications already made. The current hardware issues you brought up seem to be addressed in the LLSL, though.
Moving on to something new in the discussion, though:
Finally, with regards to JIT compilation and dynamic optimization, this does not incur significant overhead, and has been used for years on some systems (Smalltalk, Java).
The way it would work is this: The driver keeps a small table of statistics for the "most active" shaders used, it can do this in "debug mode" or in retail mode, it doesn't matter. For the most active shaders, the driver further records which of the runtime constants passed to the constant registers don't change very much.
After the driver has collected this profile information, the compiler can then use it to generate "speculative" compiles of hot shaders. A speculative compile is one where you ASSUME you know the values of those constants which you found not to change very often.
This can lead to constant propagation, algebraic and strength reduction opportunities, along with removing branches, min/max/lerps, etc.
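(A toy HLSL-style sketch of that, assuming the profile shows a made-up fogAmount constant sitting at 0 nearly every frame:)

float fogAmount;    // runtime constant, fed through the API each frame

// Generic version: fogAmount is unknown at compile time
float4 shadeGeneric(float4 baseColor, float4 fogColor)
{
    return lerp(baseColor, fogColor, fogAmount);
}

// Speculative version, compiled assuming fogAmount == 0:
// the lerp and the fogColor read fold away entirely
float4 shadeSpeculative(float4 baseColor, float4 fogColor)
{
    return baseColor;
}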
Yes, I see how that is significant, I just don't see how it is something that reduces the burden of implementing glslang. However, in this case, I can see how a shared compiler that provided such tools could achieve that for all IHVs in a "solve once" fashion...much less of a hurdle than the bandwidth balancing proposition. Where is some information on what the glslang compiler baseline will do? This seems like a believable opportunity for capitalizing significantly on its advantages, maybe even at launch.
A few things, relating to the current discussion: How much CPU overhead will this add? How does that scale with repetition and constant usage for larger numbers of short shader execution dispatches? It seems this will be tested for every instruction using constants, and will cascade (to some limited level) more work on top of what else is going on...how will that percentage of performance loss compare to the performance gained from this, given the CPU workload of a game? It seems likely that this won't be an issue for the "high end" at the time, but it will be for whatever manner of CPU usage occurs around the "baseline" the developers targeted for their game.
Also, as far as I understand, the HLSL->LLSL compiler simply can't do this, but the LLSL->GPU opcode compiler can (a clear miss of a higher-level optimization opportunity)...confirm that if you would.
Also, will there be any type of impact on "shader caching" on the GPU from having the host manage this? With higher-bandwidth interfacing this might be less of a problem, and it seems a problem that might be solved for other reasons. Or is there never any type of "shader caching" (crude description perhaps) for GPUs?
You also compile a version of the shader that is based on not knowing the value of runtime constants.
Now the driver, armed with these two shader versions uploaded to the card, can choose which one really gets bound (when asked for) by looking at what constants were fed via the API. If the constants fed match up with the profile statistics, it chooses the "known constant" shader; if not, it falls back to the "unknown" one.
Hmm...elegant, though I'm not clear on the amount of benefit and how frequently it applies within the limitations of execution for real-time shaders in the next few years. However, the longer the shader, the more worthwhile this can be, and the less CPU overhead matters...though there is also more CPU overhead, because more constants are used and more analysis is needed to manage the tracking. Any thoughts on how this will pan out? Some factor I missed?
This technique is used in C and C++ compilers to overcome performance problems related to dynamic dispatch and polymorphism. Over the years, as programmers have used languages that offer more dynamic method invocation (pointers to functions, et al), compilers have had a tougher time figuring out how to do global analysis and method inlining.
With speculative compilation, the compiler can use profiling data collected from real application runs to generate code that looks like this:
void foo(Object* b)
{
    if (b == X)                 // the profiled "hot" target
    {
        // inlined body of B::BAR() goes here
    }
    else
        b->BAR();               // rare case: fall back to the virtual call
}
It does this, because perhaps B.BAR() is called 90% of the time, but there is a rare chance that the 'b' pointer points to a different object.
With shaders, the compiler could speculatively propagate constants, and determine if the result yields an improvement based on some heuristic (e.g. don't do it if it only shaves off n cycles and those particular constant values only appear 70% of the time, since some cycles are lost because of state changes).
There are, in fact, a boatload of compilation techniques that are available to GPU compiler authors, and OpenGL gives us a platform to explore this; DirectX9 does not.
Hmm...so you mean the LLSL->GPU opcode scheduler can't do this at all? I'm not sure why, though this discussion seems familiar...was this specified somewhere else before? AFAICS, it isn't "something or nothing", it is "higher level of abstraction or something not as abstracted". However, I also think that if this is done in the baseline glslang compiler, it would actually be a further advantage to glslang from that standpoint, as even if my understanding is correct, it seems this is something DX would absolutely have to require each IHV to do individually.