DemoCoder said:
Demalion, your entire response is predicated on the assumption that I am talking about and supporting Cg, which is incorrect.
Well, the problem here is that you were addressing this quote:
Doomtrooper said:
Hardly, HLSL is outperforming CG on a 5600, so how can that above statement be correct. A developer has a much better compiler to use that has already been optimized for all hardware, vs. one.
With commentary that seems clearly related to comparing Cg to DX 9 HLSL ("one compiler" is bad, "multiple backend optimizers" are needed and not present in the case Doom was advocating, the DX HLSL, etc.), exactly how did my response along those lines go wrong?
The wording of my response is indeed predicated (with reason, it still seems to me) on believing that this is what you are addressing, but the particular points of support I used are independent of that. If you meant your words as criticisms of Cg as well, I'm confused as to why HLSL was evident in your criticism and Cg was not, when the quote you were addressing discussed both. Does this make this a matter of my misunderstanding instead of your miscommunication?
Thus, many of your responses refer to things that are not even in my messages (I have never said Cg solves any of the problems. To critique DX9 HLSL is not to endorse Cg)
This is a problem of obfuscating semantics on your part, and it does not make a good start in sticking to relevant issues. What you said was that multiple back ends were necessary, and that the DX HLSL did not have them. You also made several assertions about what the compiler could not do, proposed as flaws in HLSL's use of an LLSL, which you failed to support in this reply, and which are things under discussion that Cg seemed to try to do. For example: "All Microsoft's compiler can do is perform generic optimizations like common subexpression elimination, copy propagation, and dead code elimination. It cannot schedule instructions (reorder them to take advantage of each card's architecture) nor alter its register allocation strategy based on card architecture." My discussion of 2_a was pointing out that it actually did the first for the NV3x, and my discussion of Cg was based on the idea that the second is what it tries to do. When you state that the HLSL cannot do this, that is an endorsement of Cg (AFAIK, one based on a false premise, which is what I tried to discuss in much of my text).
To point out that you didn't say the word "Cg" is semantically correct, but it does not change the fact that this assertion is present, nor does it invalidate my address of the assertion (at least, as far as you've demonstrated) with regard to the issues I presented. The obfuscation is in declaring my commentary on that assertion irrelevant, in whole, instead of addressing the discussion that still remains relevant.
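To make the "generic optimizations" point concrete, here is a minimal sketch (Python, purely illustrative; the three-address IR and register names are invented) of the kind of common subexpression elimination a hardware-agnostic compiler can perform without knowing anything about the target card:

    # Minimal common subexpression elimination over a toy three-address IR.
    # Each instruction is (dest, op, src1, src2); the IR is hypothetical and
    # only meant to illustrate a hardware-independent optimization.

    def eliminate_common_subexpressions(instructions):
        available = {}   # (op, src1, src2) -> register already holding that value
        rename = {}      # eliminated register -> register it was replaced by
        optimized = []
        for dest, op, src1, src2 in instructions:
            # Substitute operands that were renamed by earlier eliminations.
            src1 = rename.get(src1, src1)
            src2 = rename.get(src2, src2)
            key = (op, src1, src2)
            if key in available:
                # Value already computed: reuse it instead of recomputing.
                rename[dest] = available[key]
            else:
                available[key] = dest
                optimized.append((dest, op, src1, src2))
        return optimized

    shader = [
        ("r0", "mul", "v0", "c0"),
        ("r1", "mul", "v0", "c0"),   # same computation as r0
        ("r2", "add", "r0", "r1"),
    ]
    print(eliminate_common_subexpressions(shader))
    # [('r0', 'mul', 'v0', 'c0'), ('r2', 'add', 'r0', 'r0')]

The point being that this transformation is valid on any hardware; it is the scheduling and register allocation passes after it that would need per-card knowledge.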
demalion said:
This statement (though not necessarily disagreement with the isolated statement it quotes, IMO) seems slightly facetious, and largely based on a body of assumptions that you did not establish as valid...you can have "one" compiler behave differently, so there being only "one" MS compiler for DX HLSL does not validate your statement.
You cannot ship a single compiler that will statically compile optimized code for all graphics cards, both future and present. It would have to be a HUGE monolithic compiler, and every time a new vendor shipped a graphics card, or someone updated their part of the compiler, Microsoft would have to reship it, and all developers would have to reship their games with the recompiled shaders.
This response indicates that you ignored a large part of the provided rationale, namely that you have not established support for the idea that each shader hardware architecture will require compiler re-invention, though you continue to propose, as fact, that every single architecture will be so completely unique as to be ill-served by the DX HLSL. You continue to assume that the NV3x problems inherently prove this, ignoring that the NV3x could simply represent a uniquely deficient architecture that other IHVs would have reason to avoid if they can. I acknowledged that it is possible that they would fail in doing so; it is just that your commentary depends on ignoring that they might succeed.
Some of the text of mine you excised already addresses this.
If you want to play semantics, yes, you could theoretically have one large uber compiler (like GCC), but then no one will get any benefit in their games if new optimizations are discovered later, nor will people with future graphics cards benefit on older games.
Two things: 1) again, you fail to validate the assertion that compilers need the level of complexity of gcc, which maps across significantly divergent hardware architectures with more complex execution stream demands; 2) gcc, with its complexity and "uber"-ness, is not what I was proposing; that is your exaggeration of my viewpoint in lieu of addressing what I actually stated.
If you want to discuss when the LLSL paradigm is likely to be made unsuitable outside of the context of Cg and HLSL, I recommend a different thread, where Cg and HLSL are not the specific topic whose irrelevance would first have to be established, or simply clarifying, in full, which relevant examples under discussion your reply refers to.
The optimizations have to happen dynamically at runtime if you want them to be optimal per card and up to date.
Please note that I support this idea; I just don't think the SL issues discussed in this thread support your proposal that it is a significant factor right now with regard to the issues you mention, nor am I aware of any other demonstration that establishes that it is. If you know of one, why didn't you just mention it?
Aside from that, and focusing only on the "ps_2_0" profile, the idea of "shorter shader is better" is an idea with independent merit for shaders, and given the simplicity of expression with regard to graphics shading (very specific math and data-fetching operations being the majority of the workload), it is not the given you propose that "you don't want a single compiler for all video cards".
Your criticism falls flat on its face the minute you vary the number of executable units in the GPU, since then the instruction stream would have to be reordered to take advantage of maximal parallelism.
This commentary is a bit puzzling, as my text began "aside from that" and continued with "focusing only on the ps_2_0 profile". Also, part of what you snipped included text like, for example, "the "single compiler" hypothetical (i.e., as a label that does not fit what it appears HLSL will be) is indeed undesirable". Taken altogether with my referencing the ps_2_a profile (which, btw, has at least the characteristic of reordering texture and "arithmetic" ops into the "tex/aop/aop" order you discussed above), this makes it appear that you are criticizing something other than what I proposed.
As it is, my criticism (of your painting HLSL as deficient in areas where you have not provided support for the assertion) still stands at this point, because what you propose was discussed elsewhere; here I am talking about the ps_2_0 profile and what it represents, and trying to point out that other IHVs would probably have reason to design hardware suitable for it rather than do things that require a "monolithic" compiler.
Add in different pipeline depths, branches,...
I tend to think this will indeed be an issue if branching implementations vary wildly between IHVs, but am also proposing: 1) this doesn't seem to be the case yet, 2) it is not a given that divergent implementations will require unique compilers, 3) if it is, the profile system in HLSL and the LLSL specifications seems like it will be able to expand to handle it, before it occurs (assuming MS isn't shutting out IHVs...are they?), and seems to have every reason to do so.
What you're arguing against is any standardized LLSL usage at all, since that is specifically what causes the issue, and the DX HLSL is a means of solving the development and management problems of shaders above that layer.
...and texture cache sizes,
I'm separating this out because it seems to have come out of left field. How would the compiler be handling this? I can see some limited opportunities for cache handling behavior changing when dependent reads are/are not occurring, but that's all I see at first blush. If this wasn't a mistake, and you have more in mind, please explain further.
and there is no way you are going to be optimal unless it is done in the driver.
Why simply repeat this assertion while leaving commentary that might indicate otherwise unaddressed?
This is precisely why Intel and HP are researching runtime recompilation of Itanium, taking a JIT-like approach to native code, because varying the number of functional units makes a huge difference in how you schedule instructions, and static compilation will ensure poor performance on future chipsets.
But GPUs are not the Itanium...obviously. Let me restate: if they "are the Itanium", could you even begin to illustrate how? They seem like dramatically different beasts to me, with the Itanium taking dependence on branching/scheduling/cache utilization to extremes beyond common desktop CPUs, where common desktop CPUs already take it to levels quite divergent from GPUs. For GPUs, parallelism for unique replication is still the focus (as opposed to parallelism for single-stream execution, multiple threading of single-stream execution, or branch prediction, where each parallel processing instance is not an independent workload). I do see us heading for "parallel CPU-like scheduling" in individual GPU "pipeline" concepts, inevitably (at least "in effect"), but your discussion seems to depend on fallacies to propose we are there already.
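Since scheduling keeps coming up, let me at least pin down what I understand by it with a minimal sketch (Python; the dependencies, unit counts, and single-cycle latencies are all invented for illustration) of greedy list scheduling, where the same instruction stream fills a different number of cycles depending on how many functional units the hardware has:

    # Greedy list scheduling for a toy machine with a configurable number of
    # functional units. Dependencies and unit counts are invented for
    # illustration; every op is assumed to complete in one cycle.

    def schedule(instructions, deps, units):
        """instructions: list of names; deps: {instr: set of instrs it needs};
        units: how many instructions can issue per cycle. Returns cycles used."""
        done, cycles = set(), 0
        remaining = list(instructions)
        while remaining:
            # An op is ready once everything it depends on has completed.
            ready = [i for i in remaining if deps.get(i, set()) <= done]
            issue = ready[:units]          # issue as many ready ops as units allow
            done.update(issue)
            remaining = [i for i in remaining if i not in issue]
            cycles += 1
        return cycles

    ops = ["a", "b", "c", "d"]
    deps = {"c": {"a"}, "d": {"b"}}        # c needs a, d needs b; a and b independent
    print(schedule(ops, deps, units=1))    # 4 cycles on a single-unit design
    print(schedule(ops, deps, units=2))    # 2 cycles when two units run in parallel

That much I grant in principle; my disagreement is over whether current GPU "pipelines" actually vary in this way enough to defeat the LLSL.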
Also, the assertion that "shorter is better" is not always necessarily true.
Nor did I claim that it was, by either inference or statement. I proposed that paradigm as the ps_2_0 model, and further proposed that other IHVs seem to have every reason to work towards making their hardware execute that model well. Since the NV3x specifically not exhibiting this characteristic was a prominent factor in my discussion and reasoning, I'm not sure why the assertion that I was unaware of this became the basis of your reply to me.
The classic example is loop unrolling. Given a loop (say, with an iteration count of 10) for a vertex shader, which do you predict will be faster: unrolling the loop into 10 copies of the body (longer), or the smaller code with the branch? The answer depends on how well the GPU handles branches, whether the unrolled code will fit into the vertex shader cache or not, and a host of other issues that will vary from GPU to GPU.
Umm...yes. I do realize that. But I was talking about ps_2_0 output in a specific context. In any case, concerning your new example, I actually thought this vertex processing problem was already handled by the LLSL "optimizer" where necessary (which would mean you're arguing against your own criticism of the LLSL, slightly), but if it is not, that would indeed be an indication that this issue is present right now (so if you have such information, you will have moved your case forward in a way that I would not tend to criticize, in case the nature of my disagreement is still not clear).
Your naive "short is better" optimizer might end up a lot slower if branches are expensive.
OK, you threw in a branching example that doesn't seem applicable to the ps_2_0 profile I was discussing, without clarifying its relevance. The "short is better" paradigm represents ps_2_0 and (currently only) the R3xx. It is a generic paradigm that other IHVs seem likely to aim for, while your given assumption seems to be that other IHVs will implement unique paradigms with significant limitations and demands like nVidia did. I discussed in the post that they may, but that it seems more likely that they would try to avoid that if at all possible, and that it is indeed possible that they would fail. Again, if you want to discuss my reasons for thinking your assumption is not a reasonable expectation (though reality can be stranger than fiction, and the released specs for various architectures could simply be misleading), we can expand upon that if you agree not to simply dismiss it as resolved.
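For the record, here is how I understand the trade-off in your unrolling example reducing to a per-GPU cost comparison; a toy sketch (Python) where every cost and capacity figure is invented purely for illustration:

    # Deciding whether to unroll a loop: a toy cost model. All numbers are
    # hypothetical; the point is only that the answer varies per GPU.

    def should_unroll(body_len, trip_count, branch_cost, cache_capacity):
        unrolled_size = body_len * trip_count
        if unrolled_size > cache_capacity:
            return False                       # unrolled code falls out of the cache
        looped_cost = trip_count * (body_len + branch_cost)
        return unrolled_size < looped_cost     # unrolling wins if it saves cycles

    # A GPU with cheap branches and a small shader cache keeps the loop...
    print(should_unroll(body_len=8, trip_count=10, branch_cost=1, cache_capacity=64))
    # ...while one with costly branches and a large cache prefers unrolling.
    print(should_unroll(body_len=8, trip_count=10, branch_cost=4, cache_capacity=256))

Which, again, I don't dispute as a principle; I dispute that it applies to the ps_2_0 profile under discussion, which has no branching to begin with.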
Well, if you're talking about the LLSL, you're describing what MS is doing.
Where is MS providing an API for pluggable runtime compiler backends?
Eh? I quote again: "You want the compiler to be part of the device driver, and you want each vendor to ship their own optimizer that optimizes it at runtime." That is what I said MS was doing with the LLSL, not that they were "providing an API for pluggable runtime compiler backends". Please note that what immediately followed the sentence you decided to isolate and question was my own question as to whether you meant to address what HLSL was lacking, or to talk about the LLSL as I asserted in what you quoted, and that responding with a question instead of simply answering mine seems counter-productive. Also, note that "pluggable" is discussed as part of the "control" issue I discussed.
(lots elided, irrelevant, based on a misunderstanding. My message is talking about runtime compilation, not static command-line compilers with a fixed number of backends)
From this, I presume you are maintaining, after all, either that "compile on installation" is explicitly not part of what you are considering, or that the DX HLSL compiler can only be used by running "FXC.exe" from the command line.
Since when is that the case? Perhaps I misunderstand your choice of wording, but with regard to "pluggable"...
"Pluggable" is concerning issues of : 1) Patching without patching all affected games individuatlly (a valid logistic issue, but one that does not worsen the situation from current); 2) Actual deficiencies in the shipped compiler, which depends on your assumptions about other IHV hardware and specific case by case issues (valid application detection is an already prevalent alternative, but a less appealing one...driver compilation really is the best long term solution, I agree); 3) New issues with future hardware when dealing with the utilized DX paradigm, which I discussed, again, above (and in that text you termed "irrelevant").
The only point I disagree with here is your assurance that the observed deficiencies for the NV3x are deficiencies in the HLSL->LLSL compiler's absolute capability, and not simply a deficiency in the hardware that the compiler does not yet take into account (in a public release).
How is a static command-line compiler that is infrequently updated going to address the issue of new hardware coming out (which is not taken into account)?
OK, please read the "irrelevant" text again, in full. The discussion of this is readily evident in the "paradigm" idea you did not address. Please more carefully consider the list of possibilities presented in the entire previous post, and realize that it is the assumption that this issue is only represented by your choice of characterization and presented expectations about future hardware and the timeliness of DX evolution that I'm criticizing, and that both of these assumptions might simply not be correct.
Hmm...actually, it would be more interesting if they were correct; OpenGL could use the boost.
Let's say FXC takes the NV35 into account. What happens when the NV40 and R400 arrives?
They run the 2.0, 2.0 extended, or 3.0 shader output, depending on the demands of the shader, the capabilities of the hardware, and the capabilities of the compiler shipping with the game. Please note the following: your characterization does not upgrade 2.0 shaders to 3.0 such that the game needs a new compiler just because the hardware became more capable. We could have a useful discussion of ways this might or might not happen, but it is not useful to skip the discussion and propose the conclusion you believe to be right in its place. If you'll read that "irrelevant" text again, please note that recognition of both was part of what I was discussing.
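To illustrate what I mean, a toy sketch (Python; the selection logic is my invention, not how DX actually dispatches, though ps_2_0/ps_2_a/ps_3_0 are real profile targets) of a game picking the best profile that its shader, the hardware, and its shipped compiler all support:

    # Toy sketch of profile selection. A game picks the highest shader profile
    # that the shader's demands, the hardware, and the compiler it shipped with
    # all support. The selection logic here is hypothetical.

    PROFILE_ORDER = ["ps_2_0", "ps_2_a", "ps_3_0"]   # lowest to highest

    def pick_profile(shader_needs, hardware_supports, compiler_supports):
        candidates = [p for p in PROFILE_ORDER
                      if p in hardware_supports and p in compiler_supports]
        # Choose the highest candidate that still meets the shader's demands.
        for profile in reversed(candidates):
            if PROFILE_ORDER.index(profile) >= PROFILE_ORDER.index(shader_needs):
                return profile
        return None   # neither hardware nor shipped compiler can run this shader

    # A new card supporting ps_3_0 still runs the old game's ps_2_0 shaders:
    print(pick_profile("ps_2_0", {"ps_2_0", "ps_2_a", "ps_3_0"}, {"ps_2_0"}))

Note the last line: a ps_3_0-capable card still gets the ps_2_0 output when the shipped compiler only knows ps_2_0. That is the "no free upgrade" point, not a failure to run.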
Games will be shipping with old versions of FXC and compiled shaders and games won't get the advantage immediately of any compiler updates until they download game patches!
Yes (though I thought it was a library call, not the FXC executable, and the label "run time" seems a bit of a misnomer when criticizing DX, though "pluggable" does not). Are you under the impression that I view the DX HLSL as perfect? I'm pointing out that you are proposing predictions as fact and ignoring factors of counter-indication as the basis for your criticism of DX HLSL (well, the part I'm disagreeing with). I thought you were proposing Cg as a way to overcome the mentioned issues, and if that is incorrect, the rest of my discussion retains relevance.
Whereas a backend runtime compiler will instantly affect the performance of all games.
Yes, but at costs that warrant more than simple dismissal. Costs like: IHVs having to actually develop a compiler, if their architecture is divergent, and to do a better job than the "naive" short-shader implementation in DX (again, nVidia isn't making a good showing of this, even with their funding and resource commitment); giving up full and standardized featureset exposure in a LLSL, which it seems too early to do; and introducing a wide avenue for fault manifestation, at both the driver level and the application level...the LLSL specification paradigm does a great deal to prevent that.
Of course, aside from these issues, I think the pluggable backend compiler is definitely superior, but the reasons you are proposing for stating that the benefits manifest right now as problems with DX HLSL seem fallacious (reasons stated above, multiple times now).
Basically, you maintain that the ps_2_a profile isn't a profile that accomplishes what you say atleast as well Cg does, and I'm pointing out that I don't see why you are sure that this
I maintain nothing. I have said nothing about ps_2_a and Cg.
What you did say concerns ps_2_a and Cg, with regard to the specific examples of what you maintain DX does not do and the hardware and Cg/HLSL examples already under discussion, and my discussion of your given assumptions concerning future hardware follows from that.
Where are you pulling this stuff from? I am talking about a general improvement to the architecture of Microsoft's compiler.
Umm...things you said seemed to ignore the ps_2_a profile. Since the ps_2_a profile is part of HLSL, it seemed relevant to me.
I have not once said Cg "solves" the problem I am discussing better than anything else.
Well, I assume you mean "recently"? Question: so you're saying it does not, then? I had thought you implied that it did, so please simply state it flat out if you believe the opposite.
Hmm...you again seem to maintain that the capabilities represented by HLSL profiles are not applicable to the problem you propose.
It is not the profiles that are the problem, and I never said Cg solved it either. I am talking about compiler architecture here. All compilers have an intermediate representation used for optimization. Instead of passing assembly language to the device drivers for further optimization, I am talking about a simple change in architecture that allows IHVs to plug into this IR data structure and take over the generation of microcode or assembly themselves in the final phase of the compiler.
This has potential benefits and drawbacks. The "potential" benefits aren't automatically realized, and the drawbacks are not discussed at all. To avoid Cg/HLSL confusion, it would be handy to start a new thread, perhaps "GLslang/HLSL".
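So that we are at least criticizing the same proposal, here is how I read it: a minimal sketch (Python; every class and method name is hypothetical, and nothing here reflects Microsoft's actual API) where the front end lowers to an IR and a vendor backend may take over final code generation:

    # Sketch of a pluggable backend: the front end produces an IR, and either a
    # default LLSL emitter or a vendor-supplied backend turns it into final code.
    # All names are hypothetical; nothing here reflects Microsoft's actual API.

    class Backend:
        def emit(self, ir):
            raise NotImplementedError

    class DefaultLLSLBackend(Backend):
        def emit(self, ir):
            # Generic path: emit standardized LLSL for the driver to consume.
            return [f"llsl_{op} {dst}, {', '.join(srcs)}" for op, dst, srcs in ir]

    class VendorBackend(Backend):
        def emit(self, ir):
            # Vendor path: free to schedule, allocate registers, and emit
            # microcode however its hardware prefers.
            return [f"native_{op} {dst} <- {'/'.join(srcs)}" for op, dst, srcs in ir]

    def compile_shader(ir, backend: Backend):
        return backend.emit(ir)   # final code generation is delegated entirely

    ir = [("mul", "r0", ("v0", "c0")), ("add", "r1", ("r0", "c1"))]
    print(compile_shader(ir, DefaultLLSLBackend()))
    print(compile_shader(ir, VendorBackend()))

If that is the proposal, then the disagreement is over the cost side (IHV compiler quality, standardization, fault surface), per the above.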
Either that, or upgrade the DX9 assembly so that it is rich enough to recover the full semantics of the original source code.
Hmm...in the above post, you took the time to make these statements, and I feel they warrant reply and relevant criticism. There are opportunities in it to point out (if you believe it to be the case) where I specifically misinterpreted you despite adequate indication to interpret otherwise (I don't see them now, and being shown that they are there would improve my ability to understand you on this topic in the future), or where I am basing my thoughts on something that is false (the same, but with regard to facts instead of communication).
Now, outside of that, treat the following as a "new" conversation for the moment.
I say this because simply introducing this (quoted) possibility shows that our actual disagreement is more about our takes on the current situation with shaders, and what our respective posts said in this discussion, than about any real difference on what the compiler architectures achieve technically (if we can agree on a scope for "technical"). To summarize outside of the disagreement on the above matters, I ask: "what shader is the DX 9 assembly not rich enough for, and why, including consideration of 'flavors' of 2.0 extended and 3.0?" and "why do you think the assembly won't evolve in time for the usage of said shader?" Not asking for specifics, just a general indication separate from a back and forth about what we're disagreeing about.
This is a simple computer science compiler theory issue; why does this have to turn into some inference that I am endorsing Cg?
The question is why you chose to respond to a post that began "This statement (though not necessarily disagreement with the isolated statement it quotes, IMO) seems slightly facetious, and largely based on a body of assumptions that you did not establish as valid" in the way you did, and then labeled the large body of my discussion of the compiler issue as irrelevant.
Anyways, I've provided my reasons for responding in the context of Cg and HLSL above.