DemoCoder said:
demalion said:
I take it you mean the failure of one single static compiler that doesn't evolve when a VLIW architecture does? The problem with your example is that you are proposing that what HP and Intel are doing for the Itanium is proof with regard to GPUs, with a widely divergent workload, and where you haven't bothered to establish that the VLIW evolution will be the same for GPUs.
Demalion, your escape clause in all these arguments is the constant refrain "but it's a GPU".
Actually, my "escape clause" is that GPU workload characteristics and CPU workload characteristics are quite different, and referring to the CPU without correlating the applicability is not valid. It's not an escape clause, because I'm only asking to discuss the correlation instead.
Compiler theory was developed against the background of abstract automata and idealized computing models. If you pick up most compiler books, they won't even mention any actual CPUs in the real world. The problems specific to VLIW scalability with static compilation are inherent and not architecture specific.
Yes, the problem is not architecture specific, but is it universal to every implementation of VLIW? The demands of the workload being addressed seem to indicate a focus on synchronized and parallel execution and scaling with that in mind, and designing for the ability to facilitate functional unit utilization for direct mathematical calculation.
It seems to me that there is more similarity between the main part of GPU workloads and SSE than with the general-purpose computing that something like the EPIC VLIW approach targets, and my current understanding is that the Athlon excels at leveraging ALU utilization both for SIMD in things like SSE and for general ALU usage. Why wouldn't GPU designers be trying to do the same thing?
That's a pretty direct and open avenue of discussion for an "escape clause". I don't know how I can more clearly state that I'm asking you to simply do other than circumvent this discussion, though I think I'm just repeating the same request again.
If you create an FXC compiler profile specifically attuned to, say, a pipeline with 3 functional units, it will produce code that is not optimal for a pipeline with 5 functional units.
And how is the LLSL->GPU compiler completely unable to handle this? The LLSL characteristics are determined by profile.
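To illustrate what I mean by "determined by profile", here is a minimal sketch (the names are made up, and I'm not claiming this is fxc's literal output):

float4 main(float4 col : COLOR0, float4 thresh : TEXCOORD0) : COLOR
{
    float4 result;
    // Compiled with the ps_2_0 profile, this conditional has to be flattened into
    // arithmetic/select instructions (cmp and friends), because that profile's LLSL
    // has no dynamic branching. Compiled with ps_2_x or ps_3_0, the same HLSL can
    // come out the other side as an actual if/else block, which the LLSL->GPU
    // compiler is then free to keep or flatten as the hardware prefers.
    if (col.x < thresh.x)
        result = col * 2;
    else
        result = col * 0.5;
    return result;
}

Either way, the driver's compiler still gets a chance at the result; the question is only which representation it starts from.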
Every time the architecture is modified, the static compiler will have to be updated and all games will have to be recompiled. In fact, GPUs exacerbate the highlighted problems by being even more aggressively parallel in their scaling than conventional CPUs.
Doesn't SIMD processing have something to do with how the architecture is modified? I'd think a focus on this and the discrete nature of each component would dictate the evolution, as well as there being a VLIW approach. Also, the aggressive parallelism for GPUs is for discrete pixel outputs...isn't the parallelism for the Itanium an issue of branch prediction discards?
Yes, I said this before too. If it is wrong in some particular, nothing is stopping you from simply tackling the task of pointing out why.
That's why the driver has to do instruction selection. The only issue is whether or not there is enough information in the DX9 assembly to make such decisions.
Yes, I've tried to discuss this.
...Compilers are ultimately nothing more than pattern matchers, and refactoring the pattern can alter what you "recognize". Even in mathematics, you might be able to prove that two number systems are isomorphic and identical, but the types of things you can recognize and do in each space are radically different (e.g. time vs frequency domain)
I recognize the relevance of this, but I do not recognize that you've established when the LLSL is insufficient to the scope of shader execution. You're repeating hypotheticals by maintaining that the distinctions between CPUs and GPUs do not matter. I'm simply asking you, again, to engage in a discussion of what you are maintaining instead of just repeating the statements that it is so. Really, do you not see this as being the case without me quoting myself to demonstrate it?
DX9 assembly is not necessarily the representation most conducive to allowing the driver to do the best optimizations.
Right, but it is not necessarily a representation that prevents it. You're proposing that it is a fact that it is, right now, and I'm pointing out that there are indications to the contrary of that assertion. Why is your response simply to require me to say the same thing again?
The result will be that fewer optimizations are found, and more difficult work has to be done by the driver developers to find them.
You went from the possibility that the DX9 LLSL is not sufficient, and the verifiable observation that it seems to succeed fairly well within its requirement of floating point processing, to stating with certainty that it is insufficient.
You seem to ignore that failing to compile is part of the original expression, and that this goes hand in hand with instruction count limitations that do not parallel CPU compilation.
Why would the compiler fail to compile deliberately if it could produce a transformation which fits within resource limits?
Well, when that transformation results in something that cannot fit into the resource limits (which includes real time execution). This directly influences what shaders will be implemented and when. If we're talking about other than real time situations, I'm thinking the driver having the compiler becomes less of an issue.
Seems absurd. If I have a function call nesting > 4 levels deep on ps3.0 or 1 level deep on ps_2 extended, the compiler will have to inline any further calls.
Well, ps 2 extended is listed as supporting up to 4 on the MSDN website. Is that inaccurate?
f(g(h(i(j(x))))) should compile fine on ps3.0 and even ps2.0 even though DX9 can't support it in the assembly via "call" instructions.
Well, why wouldn't that output change depending on the result of the caps query? Is the compiler incapable of doing this because it is targeting the LLSL?
Only if the compiler is incapable of expressing them in the LLSL in a form useful to the LLSL->assembly compiler (not a given, details above).
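For concreteness, a throwaway sketch of the nesting you describe (the functions are placeholders of my own, not anything from your code):

float j(float v) { return v + 1; }
float i(float v) { return v * 2; }
float h(float v) { return v - 3; }
float g(float v) { return v * v; }
float f(float v) { return v * 0.5; }

float4 main(float4 x : TEXCOORD0) : COLOR
{
    // Five levels of nesting in the source...
    float r = f(g(h(i(j(x.x)))));
    // ...but nothing obliges the LLSL to contain five "call" instructions; where the
    // profile's call depth (or absence of calls) won't cover it, the compiler can
    // inline the whole chain into straight-line arithmetic before the driver sees it.
    return float4(r, r, r, 1);
}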
Why do you think I prefer the term "LLSL" instead of "assembly", anyways? I mention this to illustrate why I'm excising your explanation of this.
I just showed you an example. And please, it is not a low level shading language and it is not an intermediate representation.
Well, it neatly avoids the need to demonstrate its deficiencies as an intermediate representation if it isn't one.
Also, your quoting structure is a rather MAJOR distortion of my statements, such that you're not really addressing me at all...the first sentence was a response to your saying "and therefore, if your underlying hardware has superior than DX9 capabilities in some of the pipeline, it will be unable to realize some optimizations." You'd have to support that statement to address it, but instead you attack the second as if it was part of the first statement.
In actuality, the second was a nested quote of you quoting me on a completely separate line of discussion, where I was pointing out you weren't making sense by attacking it when it was intended to answer your question of whether I "understood" that the DX LLSL was compiled by the drivers already. Removing the nesting and representing it as a continuous statement does not help matters; it just looks like you found it more convenient to misrepresent me, because your assertion depends on that misrepresentation to read as a coherent response to me.
No, I don't enjoy having to tackle nonsensical obstacles like this in every reply. It really would be okay to stop placing them there.
It is an assembly language based on DX8 legacy, from before HLSL was invented.
Well, that answers everything I've said, doesn't it?
Actually, the removal of extensive modifiers and literal coissue expression seems to argue for movement away from exactly what you maintain.
No, I'm suggesting that MS could have the compiler change behavior based on predicate and branch support reported, and that the approaches that the profiles represent need not be unique to every new hardware.
So you expect Microsoft will have to maintain dozens if not more compiler switches for every conceivable combination of architectural support themselves and that they will have to frequently update this uber compiler with patches and send it to developers so that they can RECOMPILE their GAMES and distribute patches to all the gamers out there?
Well, the compiler switches are already there, and as far as what you've brought up with your unrolling and branching decisions, this seems pretty simple. I feel I've covered the issue of your practice of gross misrepresentation of disagreement, so I won't do it again for now. I will point out that there are fewer than "dozens" of issues established at the moment.
my example said:
if(a < x) { b = 10; }
else { b = 20; }
Well, for the predicate, a setp_lt,if_pred,mov,else,mov,endif sequence (sticking to one component, and using branching, at least in LLSL expression) or setp_lt,mov,(p)mov (allowing per-component execution of your idea, if necessary) seems possible.
For branching (without predicate), if_lt,mov,else,mov,endif works, correct?
For lrp, how would you best set the interpolation register to each end of the range (0,1 I'm assuming)?
Why not count up the instruction slots used by your solution, compared to an slt/lrp or sub/cmp/lrp?
I considered a sub/cmp type approach, but why wouldn't output targeted at ps 2.0 extended capabilities (or a vs 2.0 or higher profile, as I realize now that you are including) use its specified branching instructions, and let the LLSL->GPU compiler implement the specifics as sub/cmp if necessary?
As for slt, I simply didn't consider vertex shaders. Looks a bit like my thoughts on "if only a predicate register could be used for lrp interpolation control and boolean true was defined as 1". I still don't see how the LLSL->GPU compiler couldn't handle converting an if to slt/lrp, or why the compiler would have to fit your description to be able to provide slt/lrp optimizations in one profile and not another (assuming it does express this literally for the vs profile at present?).
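To make the alternatives concrete, here is roughly how I see the candidate LLSL spellings, assuming "a" is in r0.x, "x" is in r1.x, and c0.x/c1.x hold 10 and 20 (illustrative only, not claimed to be fxc's literal output):

def c0, 10.0, 0.0, 0.0, 0.0
def c1, 20.0, 0.0, 0.0, 0.0

; 1) Predication (ps_2_x with the predication cap, or ps_3_0)
setp_lt p0.x, r0.x, r1.x        ; p0.x = (a < x)
mov r2.x, c1.x                  ; b = 20
(p0.x) mov r2.x, c0.x           ; b = 10 where the predicate holds

; 2) Comparison branch, no predicate register
if_lt r0.x, r1.x
  mov r2.x, c0.x                ; b = 10
else
  mov r2.x, c1.x                ; b = 20
endif

; 3) Branch-free vertex shader spelling (the slt/lrp form)
slt r3.x, r0.x, r1.x            ; r3.x = (a < x) ? 1.0 : 0.0
lrp r2.x, r3.x, c0.x, c1.x      ; b = r3.x*10 + (1 - r3.x)*20

; 4) Branch-free pixel shader spelling (the sub/cmp form)
sub r3.x, r0.x, r1.x            ; r3.x = a - x
cmp r2.x, r3.x, c1.x, c0.x      ; b = (a - x >= 0) ? 20 : 10

Which of these the HLSL compiler emits is exactly the sort of thing a profile (or a caps query) could select, and which of them a driver can most easily map back onto its hardware is the part I keep asking you to address.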
As the author of an optimizer, which are you going to choose, and how are you going to code it?
And this is nothing like saying the compiler can only behave in one way? I'd choose based on the capabilities of the target. The target, in this case, is a LLSL that a driver is going to compile again. Ignoring capability queries, one profile that targets based on instruction slot conservation and another that targets based on literal expression with the LLSL instructions would go a long way towards being a solution for multiple GPUs and their LLSL compilers.
Now, AFAICS, you're saying the compiler would pick one approach (that may resemble or completely fail to resemble my outlines), and that this one, in some particular case (more complex than your example), would be beyond the scope of the driver compiler for the LLSL to implement efficiently in the architecture.
I'm saying that if the compiler chooses slt/lrp, it will be more difficult for the driver to reconstruct all the basic blocks that existed in the original source code. Likewise for predicates.
If this is other than what I proposed, I fail to see it at the moment.
The only solution is to force the compiler to always turn any conditional into a branch to preserve the original control flow graph, but again, because of resource limits, it is not always able to do this.
Why can't it do this in different profiles? Now, I think we're on the same page here, and I'm happy that this branch of the discussion is resulting in that, if I'm right. Can you just directly answer the question? You know, the technical part of your reply could consist of just tackling this answer, and I would not "complain" as long as we continue making some progress and avoid repetition at least by some measure (which this branch of discussion is doing at the moment).
What application is this, and why is patching it unacceptable? This case is not something that would happen to every single application, even if you wouldn't be patching them with new hardware.
Let's say there are 100 games out there using HLSL.
Let's say there are 5 publishers for those 100 games. Or, maybe, there are 5 unique game engines.
And every month, like clockwork, ATI, NVidia, 3DLabs, IMGTEC, and other companies find new ways to squeeze performance optimizations out of their designs, or release new HW.
Well, ATI is the only one who promised monthly updates, though I think nVidia should be as capable from a resources standpoint. But I'll grant it for now.
Just like driver updates, there will be a steady stream of compiler updates (as there is with GCC). Zealous gamers who always like to have maximum performance will eagerly download what?
Updates to a shader compilation application written like an auto-update tool, from each publisher, capable of handling all their games. Or maybe a tool for each engine. Actually, this could be a growth direction for the inter-publisher software vendors or the engine designers. Now, if you find a flaw in this proposal, just finally say why.
Continuously updated patches for every game in their library, assuming that all game publishers will rigorously remain in sync with IHVs and publish patches every time an update comes out?
No, though your description's dependence on monthly compiler updates seems a rather ridiculous exaggeration in the first place.
Tell me demalion, would you argue for a "grand unified driver model" in which Microsoft ships all drivers for every card in one big uber-library file that developers statically link-in at compile time?
You did this before. I pointed out why it was silly. It would be helpful if you gave those reasons any thought at all.
A statically linked driver would in fact deliver higher performance, but at a massive cost to maintainability.
What if car bodies always wore out within a year of regular use? A welded and bolted car body would in fact deliver good structural strength and low rattle noise while in motion, but at a massive cost to maintainability.
That is the construction of your argument here. Hopefully the full scope of the parallel is evident without me specifying every particular.
So sure, it is possible to continuously ship updates for a single static compiler that supports N profiles, but it is a bad idea for two reasons:
#1: lots of overhead in distributing the improvements to users and developers.
The problem is that you are depending on hyperbole to support this as being established.
#2 "profiles" do not neccessarily represent the best medium for device drivers to work off of.
Say, how many times do I have to ask you to say why? We're making progress, though, as at least you recognized their existence.