How Cg favors NVIDIA products (at the expense of others)

DemoCoder · Jul 18, 2003

That's history though. The fruits of Intel's research compiler are now incorporated into today's production compilers. The compiler IMHO served it's purpose, which was to shake up the status quo.

Entropy · Jul 18, 2003

DemoCoder said:
That's history though. The fruits of Intel's research compiler are now incorporated into today's production compilers. The compiler IMHO served it's purpose, which was to shake up the status quo.

I took a slightly more dim view of it at the time, but I fully agree that Intels compiler did help push the state-of-x86-compiler-art forward, and that this ultimately has been a very good thing. I would have preferred to see Intel simply put their x86 compilers in the public domain (or some acceptable variation thereof), and let tool vendors use them as a base rather than having to reinvent wheels.

And sure, this is history now, although there are parallells to various current discussions about gfx benchmarking mainly connected to the transferability of results both between benchmarks and applications, and between solutions from different IHVs.

Entropy

DemoCoder · Jul 18, 2003

Didn't Intel release a hacked version of GCC with their improvements a long time ago (before PGCC)? It seems recently they have changed their tune by cooperating more with open source, however the latest Intel compiler techniques are still proprietary and hasn't made it into GCC yet.
Atleast they are production quality now tho.

demalion · Jul 18, 2003

DemoCoder said:
Demalion, your entire response is predicated on the assumption that I am talking about and supporting Cg, which is incorrect.

Well, the problem here is that you were addressing this quote:

Doomtrooper said:
Hardly, HLSL is outperforming CG on a 5600, so how can that above statement be correct. A developer has a much better compiler to use that has already been optimized for all hardware, vs. one.

With commentary that seems clearly related to comparing Cg to DX 9 HLSL ("one compiler" is bad, "multiple backend optimizers" are needed and not present in the case Doom was advocating, the DX HLSL, etc.), exactly how did my response along those lines go wrong?
The wording of my response is indeed predicated (with reason, it still seems to me) on believing that this is what you are addressing, but the particulars of support I use are independent of that. If you meant your words as criticisms of Cg as well, I'm confused as to why HLSL was evident in your criticism and Cg was not when the quote you were addressing discussed both. Does this make this a matter of my misunderstanding instead of your miscommunication?

Thus, many of your responses refer to things that are not even in my messages (I have never said Cg solves any of the problems. To critique DX9 HLSL is not to endorse Cg)

This is a problem of obfuscating semantics on your part that does not make a good start in sticking to relevant issues. What you said was that multiple back ends were necessary, and the DX HLSL did not have this. You also made several assertions about what the compiler could not do, proposed as flaws in the HLSL using a LLSL, which you failed to support in this reply, and which are things that were under discussion that Cg seemed to try to do. For example: "All Microsoft's compiler can do is perform generic optimizations like common subexpression elimination, copy propagation, and dead code elimination. It cannot schedule instructions (reorder them to take advantage of each card's architecture) nor alter it's register allocation strategy based on card architecture." My discussion of 2_a was pointing out that it actually did the first for the NV3x, and my discussion of Cg was based on the idea that the second is what it tries to do. When you state the HLSL cannot do this that is an endorsement of Cg (AFAIK, one based on a false premise, which is what I tried to discuss in much of my text).

To point out that you didn't say the word "Cg" is semantically correct, but does not change that this assertion is present, nor that my address of the assertion is not invalidated (atleast, as far you've demonstrated) with regard to the issues I presented. The obfuscation is mentioning that assertion about my commentary being irrelevant, in whole, instead of addressing the discussion that still remains relevant.

demalion said:
demalion said:

This statement (though not necessarily disagreement with the isolated statement it quotes, IMO) seems slightly facetious, and largely based on a body of assumptions that you did not establish as valid...you can have "one" compiler behave differently, so there being only "one" MS compiler for DX HLSL does not validate your statement.

Click to expand...

You cannot ship a single compiler that will statically compile optimized code for all graphic cards both future and present. It would have to be a HUGE monolithic compiler, and everytime a new vendor shipped a graphics card, or someone updated their part of the compiler, Microsoft would have to reship it, and all developers would have to reship their games with the recompiled shaders.

This response indicates that you ignored a large part of provided rationale, namely that you have not established support for the idea that each shader hardware architecture will require compiler re-invention, though you contine to propose it is factual to such a degree that every single architecture will be completely unique and ill-served by the DX HLSL. You continue to simply make the assumption that the NV3x problems inherently prove this by simply ignoring that it could simply represent a uniquely deficient architecture that other IHVs would have reason to avoid if they can. However, I acknowledged that it is possible that they would fail in doing so, it is just that your commentary depends on ignoring that they might succeed.

Some of the text of mine you excised already addresses this.

If you want to play semantics, yes, you could theoretically, have one large uber compiler (like GCC), but how no one will get any benefit in their games if new optimizations are discovered later, nor will people with future graphics cards benefit on older games.

Two things: 1) again, you fail to validate the assertion thatr compilers need the level of complexity of gcc, which maps across significantly divergent hardware architectures with more complex execution stream demands, 2) gcc and its complexity and "uber"ness is not what I was proposing, that is your exaggeration of my viewpoint in lieu of addressing what I actually stated.
If you want to discuss the idea of when the LLSL paradigm is likely to be made unsuitable outside of the context of Cg and HLSL, I recommend a different thread where the examples of Cg and HLSL are not the specific topic such that their irrelevance needs to be established for discussion if that is the presumption, or simply clarifying which relevant examples under discussion your reply refers to, in full.

The optimizations have to happen at dynamically at runtime if you want them to be optimal per card and up to date.

Please note that I support this idea, I just don't think the SLissues discussed in this thread support your proposal that the idea is a significant factor right now with regard to the issues you mention, or that there is any other demonstration of which I'm aware that establish that it is. If you know of one, why didn't you just mention it?

Aside from that, and focusing only on the "ps_2_0" proifle, the idea of "shorter shader is better" is an idea with independent merit for shaders, and with the simplicity of expression with regard to graphics shading (very specific math and data fetching operations being the majority of workload), it is not the given you propose that "you don't want a single compiler for all video cards"

Click to expand...

Your criticism falls flat on its face the minute you vary the number of executable units in the GPU, since then the instruction stream would have to be reordered to take advantage of maximal parallelism.

This commentary is a bit puzzling, as my text began "aside from that" and continued with "focusing only on the ps_2_0" profile. Also, part of what you snipped includedtext like, for example, "the "single compiler" hypothetical (i.e., as a label that does not fit what it appears HLSL will be) is indeed undesirable". Taken altogether with my referencing the ps_2_a profile (which, btw, has at least the characteristics of reording texture and "arithmetic" ops in "tex/aop/aop" order that you discussed above) seems to make it appear that your criticizing something other than what I proposed.
As it is, my criticism (of your painting HLSL as deficient in areas where you have not provided support for the assertion) at this point is standing fine, because what you propose was discussed elsewhere than here when I'm talking about the ps_2_0 profile and what it represents, and where I'm trying to point out that other IHVs would probably have reason to design the hardware suitable for it and not do things that require a "monolothic" compiler.

Add in difference pipeline depths, branches,...

I tend to think this will indeed be an issue if branching implementations vary wildly between IHVs, but am also proposing: 1) this doesn't seem to be the case yet, 2) it is not a given that divergent implementations will require unique compilers, 3) if it is, the profile system in HLSL and the LLSL specifications seems like it will be able to expand to handle it, before it occurs (assuming MS isn't shutting out IHVs...are they?), and seems to have every reason to do so.
What you're arguing against is any standardized LLSL usage at all, since that is specifically what causes the issue, and the DX HLSL is a means of solving the development and management problems of shaders above that layer.

...and texture cache sizes,

I'm seperating this because this seems to have come out of left field. How would the compiler be handling this? I can see some limited opportunities for cache handling behavior changing when dependent reads are/are not occuring, but that's all I see on first blush. If this wasn't a mistake, and you have more in mind, please explain further.

and there is no way you are going to be optimal unless it is done in the driver.

Why simply repeat this assertion while leaving commentary that might indicate otherwise unaddressed?

This is precisely why Intel and HP are researching runtime recompilation of Itanium, taking a JIT-like approach to native code, because varying the number of functional units makes a huge difference in how you schedule instructions, and static compilation will ensure poor performance on future chipsets.

But GPUs are not the Itanium...obviously. Let me restate: if they "are the Itanium", could you simply even begin to illustrate how? They seem like dramatically different beasts to me, with the Itanium taking dependence on branching/scheduling/cache utilization to extremes beyond common desktop CPUs, and where common desktop CPUs already take it to levels quite divergent from GPUs. For GPUs, parallelism for unique replication is still the focus (as opposed to parallelism for single stream execution, multiple threading of single stream execution, or branch prediction, where each parellel processing instance is not an independent workload). I do see us heading for "parallel CPU-like scheduling" in individual GPUs "pipeline" concepts, inevitably (atleast, "in effect"), but your discussion seems to depend on fallacies to propose we are there already.

Also the assertion that "shorter is better" is not always neccessarily true.

Nor did I claim that it was, by either inference or statement. I proposed that paradigm as the ps_2_0 model, and further proposed that other IHVs seem to have every reason to work towards making their hardware execute that model well. Since the NV3x specifcally not exhibiting this characteristic was a prominent factor in my discussion and reasoning, I'm not sure why asserting that this is not something I was aware of became the basis of your reply to me.

The classic example is loop unrolling. Given a loop (say, iterational count 10) for a vertex shader, which do you predict will be faster: unrolling the loop into 10 copies of the body (longer), or the smaller code with the branch. The answer depends on how well the GPU handles branches, whether the unrolled code will fit into the vertex shader cache or not, and a host of other issues that will vary from GPU to GPU.

Umm...yes. I do realize that. But I was talking about ps_2_0 output in a specific context. In any case, concerning your new example, I actually thought this vertex processing problem was already handled by the LLSL "optimizer" (which would mean you're arguing against your criticism of LLSL, slightly) where necessary, but if it is not that would indeed be indication that this issue is present right now (so if you have such information, you will have moved your case forward in a way that I would not tend to criticize, in case the nature of my disagreement is still not clear).

You're naive "short is better" optimizer might end up alot slower if branches are expensive.

OK, you threw in branching example that doesn't seem applicable to the ps_2_0 profile which I was discussing, without clarifying relevance. The "short is better" paradigm represents the ps_2_0 and (currently only) the R3xx. It is a generic paradigm, that it seems likely that other IHVs will aim for, while your given assumption seems to be that other IHVs will implement unique paradigms with significant limitations and demands like nVidia did. I discussed in the post that they may, but that it seems more likely that they would try to avoid that if at all possible, and that it was indeed possible that they would. Again, if you want to discuss my reasons for thinking your assumption not a reasonable expectation (though reality can be stranger than fiction, and the released specs for various architectures could simply be misleading), we can expand upon that if you agree not to simply dismiss it as resolved.

Well, if you're talking about the LLSL, you're describing what MS is doing.

Click to expand...

Where is MS providing an API for pluggable runtime ccompiler backends?

Eh? I quote again: "You want the compiler to be part of the device driver, and you want each vendor to ship their own optimizer that optimizes it at runtime." That is what I said MS was doing with the LLSL, not that they were "providing an API for pluggable runtime compiler backends". Please note that what immediately followed the sentence you decided to isolate and question, was my own question as to whether you meant to address what HLSL was lacking or talking about the LLSL as I asserted in what you quoted, and that responding with a question instead of simply answering mine seems counter-productive. Also, note that "pluggable" is discussed as part of the "contro"l issue I discussed.

(lots elided, irrelevent, based on a misunderstanding. My message is talking about runtime compilation, not static command line compilers with a fixed number of backends)

From this, I presume you are maintaining, after all, either that "compile on installation" is explicitly not part of what you are considering, or that the DX HLSL compiler can only be used by running "FXC.exe" from the command line. :?:

Since when is that the case? Perhaps I misunderstand your choice of wording, but with regard to "pluggable"...

"Pluggable" is concerning issues of : 1) Patching without patching all affected games individuatlly (a valid logistic issue, but one that does not worsen the situation from current); 2) Actual deficiencies in the shipped compiler, which depends on your assumptions about other IHV hardware and specific case by case issues (valid application detection is an already prevalent alternative, but a less appealing one...driver compilation really is the best long term solution, I agree); 3) New issues with future hardware when dealing with the utilized DX paradigm, which I discussed, again, above (and in that text you termed "irrelevant").

The only point I disagree with here is your assurance that the observed deficiencies for the NV3x are deficiencies in the HLSL->LLSL compiler's absolute capability, and not simply a deficiency in the hardware that the compiler does not yet take into account (in a public release).

Click to expand...

How is a static command line compiler that is infrequently updated going to address the issue of new hardware coming out (which is not taken into account).

OK, please read the "irrelevant" text again, in full. The discussion of this is readily evident in the "paradigm" idea you did not address. Please more carefully consider the list of possibilities presented in the entire previous post, and realize that it is the assumption that this issue is only represented by your choice of characterization and presented expectations about future hardware and the timeliness of DX evolution that I'm criticizing, and that both of these assumptions might simply not be correct.

Hmm...actually, would be more interesting if they were correct, OpenGL could use the boost.

Let's say FXC takes the NV35 into account. What happens when the NV40 and R400 arrives?

They run the 2.0, 2.0 extended, or 3.0 shader output, depending on the demands of the shader, their capabilities of the hardware, and the capabilities of the compiler shipping with the game. Please note the following: your characterization does not upgrade 2.0 shaders to 3.0 such that the game needs a new compiler just because the hardware became more capable. We could have a useful discussion of ways this might or might not happen, but it is not useful to skip the discussion and propose the conclusion you believe to be right in its place. If you'll read that "irrelevant" text again, please note that recognition of both was part of what I was discussing.

Games will be shipping with old versions of FXC and compiled shaders and games won't get the advantage immediately of any compiler updates until they download game patches!

Yes (though I thought it was a library call, not the FXC executable, and the label "run time" seems a bit of a misnomer when criticizing DX, though "pluggable" does not). Are you under the impression that I view the DX HLSL as perfect? I'm pointing out that you are proposing predictions as fact and ignoring factors of counter-indication as the basis for your criticism of DX HLSL (well, the part I'm disagreeing with). I thought you were proposing Cg as a way to overcome the mentioned issues, and if that is incorrect, the rest of my discussion retains relevance.

Whereas a backend runtime compiler will instantly affect the performance of all games.

Yes, but at costs that warrant more than simple dismissal. Costs like: IHVs having to actually develop a compiler, if their architecture is divergent, and doing a better job than the "naive" short shader implementation in DX (again, nVidia isn't making a good showing of this, even with their funding and resource commitment); giving up full and standardized featureset exposure in a LLSL, which it seems too early to do; introducing a wide avenue for fault manifestion, both at the driver level and application level...the LLSL specification paradigm does a great deal to prevent that.

Of course, aside from these issues, I think the pluggable backend compiler is definitely superior, but the reasons you are proposing for stating that the benefits manifest right now as problems with DX HLSL seem fallacious (reasons stated above, multiple times now).

Basically, you maintain that the ps_2_a profile isn't a profile that accomplishes what you say atleast as well Cg does, and I'm pointing out that I don't see why you are sure that this

Click to expand...

I maintain nothing. I have said nothing about ps_2_a and Cg.

What you did say concerns ps_2_a and Cg with regard to the specific examples of what you maintain DX does not do and the hardware and CG/HLSL examples already under discussion, and my discussion of your given assumptions concerning future hardware is because of that.

Where are you pulling this stuff from? I am talking about a general improvement to the architecture of Microsoft's compiler.

Umm...things you said seemed to ignore the ps_2_a profile. Since the ps_2_a profile is part of HLSL, it seemed relevant to me.

I have not once said Cg "solves" the problem I am discussing better than anything else.

Well, I assume you mean "recently"? Question: So you're saying it does not, then? I had thought you made the implication that it was, so please simply flat out state it if you believe the opposite.

Hmm...you again seem to maintain that the capabilities represented by HLSL profiles are not applicable to the problem you propose.

Click to expand...

It is not the profiles that are the problem, and I never said Cg solved it either. I am talking about compiler architecture here. All compilers have an intermediate representation used for optimization. Instead of passing assembly language to the device drivers for further optimization, I am talking about a simple change in architecture that allows IHVs to plug into this IR data structure and take over the generation of microcode or assembly themselves in the final phase of the compiler.

Has potential benefits and drawbacks. The "potential" benefits aren't automatically realized, and the drawbacks are not discussed at all. To avoid Cg/HLSL confusion, it would be handy to start a new thread, perhaps "GLslang/HLSL".

Either that, or upgrade the DX9 assembly so that it is rich enough to recover the full semantics of the original source code.

Hmm...in the above post, you took the time to make the statements and I feel they warrant reply and relevant criticism, and there are opportunities in it to point out (if you believe it to be the case) where I specifically misinterpreted you where I had adequate indication to interpret otherwise (I don't see them now, and being shown that they are there will improve my ability to understand you on this topic in the future) or where I am basing my thoughts on something that is false (same, but with regard to facts instead of communication).

Now, outside of that, treat the following as a "new" conversation for the moment.

I say this because simply introducing this (quoted) possibility shows that our actual disagreement is more about our takes on the current situation with shaders and what our respective posts said in this discussion, rather than any real difference on what the compiler architectures achieve technically (if we can agree to a scope for "technical"). To summarize outside of the disagreement on the above matters, I ask: "what shader is the DX 9 assembly not rich enough for, and why, including consideration of 'flavors' of 2.0 extended and 3.0?" and "why don't you think the assembly won't evolve in time for the usage of said shader?" Not asking for specifics, just general indication separate from a back and forth about what we're disagreeing about.

This is a simple computer science compiler theory issue, why does this have to turn into some inference that I am endorsing Cg.

The question is why did you choose to respond to a post that began "This statement (though not necessarily disagreement with the isolated statement it quotes, IMO) seems slightly facetious, and largely based on a body of assumptions that you did not establish as valid" in the way you did, and then label the large body of my discussion of the compiler issue as irrelevant?

Anyways, I've provided my reasons for responding in the context of Cg and HLSL above.

DemoCoder · Jul 18, 2003

demalion, you are too verbose and you should be more direct and to the point.

First of all, you called my statement facetious. I was not. It was deadly serious. There is no satire, irony, or joke in it. It's not even an extreme statement, as anyone who is familiar with compiler theory would see that it is perfectly reasonable. In fact, not only can a compiler not be optimal for all architectures, it can not even be optimal for a SINGLE architecture. This is know as the "full employment theorem for compiler writers" and is mathematically provable by reduction to the Halting Problem.

Secondly, I cannot "prove" to you that a compiler written specifically for each architecture would be more optimal than a generic one without writing such a compiler, I can only point to the large body of historical evidence (40 years of compilers) that demonstrate this to be the case, not to mention the fact that just trying to write an optimal compiler for a SINGLE piece of hardware is hard.

Third,

To summarize outside of the disagreement on the above matters, I ask: "what shader is the DX 9 assembly not rich enough for, and why, including consideration of 'flavors' of 2.0 extended and 3.0?"

Shows your misunderstanding of the issue at hand. I am not claiming that that DX9 assembly lacks the expressiveness to some set of shaders (that is a separate issue). I am claiming that DX9 assembly does not fully represent the original semantic information of the HLSL source code, and therefore, the device driver doesn't have as much information as it could to perform driver level optimizations.

You do understand today that device drivers in DX9 are doing compilation right? These cards are not running DX9 byte codes directly, but they are being translated to low level instructions. During this time, the device driver can also perform optimizations. However, much of the original source semantics can be erased by the time the driver sees it, leading to lost optimization opportunities.

Let's say you compile a given shader for ps_2_a or ps_3 statically. The compiler might choose to use predication instead of branch instructions. If it does, when the DX9 driver intercepts the assembly, it will not be able to transform predicates back into branches if branches would be more efficient on that hardware.

Or, if a loop is unrolled by the compiler, a driver won't be able to "un unroll it" back into a real loop, if loops happen to be more efficient on that architecture, because the driver will never see the loop, that information has been removed by the compiler.

Moreover, depending on the architecture, sometimes you might even choose different instructions altogether, for example, to minimize register usage, or to minimize length, or to minimize resource hazards.

Fourth, the fact is, in this discussion, I never claimed Cg solved this problem. I am criticizing the idea that there is one "optimal" DX9 assembly for a given algorithm on all architectures. You would like to think that IHVs should look at the DX9 shader spec and implement it "optimally".

But since all architectures have different tradeoffs, they will all react to different shaders differently, and what might be fast on one architecture might be slower on another and vice versa.

I simply want to give device driver manufacturers the maximum amount of information neccessary to do the best job.

Finally, the approach I am suggesting is precisely what OpenGL2.0 is doing, bypassing the intermediate step of generating assembly source, and compiling directly in the driver.

DemoCoder · Jul 18, 2003

demalion said:
Yes, but at costs that warrant more than simple dismissal. Costs like: IHVs having to actually develop a compiler

Wrong, the IHV's don't have to develop the compiler. Just like today, if I create a new CPU, I need only to create a new architecture description configuration file for GCC, and viola, it can compile to the new target.

Compilers have an architecture like

1. Parsing -> 2. Semantic Analysis -> 3. Internal Representation 4. -> Generic Optimizations -> 5. Specific Optimizations -> 6. Instruction Selection

It is only the last two steps that the IHVs have to write and they are already doing 5. and 6. to some extent when they translate DX9 opcodes into their native format. Microsoft will provide steps 1,2,3, and 4 and all cards will get the generic set of optimizations that FXC is doing today (cse, copy prop, dead code removal, etc)

All that needs to be done is dump the internal compiler representation (like MS does with .NET compilation), and hand it off.

Slides · Jul 18, 2003

DemoCoder said:
demalion, you are too verbose and you should be more direct and to the point.

Amen!

demalion · Jul 18, 2003

DemoCoder said:
demalion, you are too verbose and you should be more direct and to the point.

Actually, you are only interested in the point that is convenient for you to address. There were several points, including ones you maintain irrelevant though I've provided support for them not being so. If you wish me to be less verbose, I suggest you spend your time addressing that support as an alternative to providing posturing for me to reply to, as it seems you will insist on doing.

I know you consider all of this text as pointless, and all of your own text you spend here simply labelling my viewpoint instead of discussing it, as not, but what I am doing is simply addressing each and every thing you say. Your not considering disagreement with yourself as "getting to the point" is just a matter of ego, not fact. If you stuck to one point, my reply would consist of replying to that one point, and if you multiply the amount of things under discussion, it is only your selective persistence in considering my opinions irrelevant while yours are not that justifies that this occurrence is not without reasons rooted in your own actions.
Notice that my criticism does not spend time only labelling yuur viewpoint and proposing the labelling as a fact, and therefore gaining length by ending up saying nothing, but spends time and words addressing your comments in detail. Your dedication to simply ignoring and dismissing inconvenient disagreement does not change that.

BTW, all of the above words, have several "points". Points that seems obvious to me, and I would be perfectly happy in leaving unstated, but that you seem to disagree with. I'm simply not responding as you like by capitulating and abandoning those points. Sorry, that is not something that changes about me, though if you have direct examples of how I could be more brief while doing other than modelling what I maintain for your convenience over what I believe, please provide them in a PM.

First of all, you called my statement facetious. I was not. It was deadly serious. There is no satire, irony, or joke in it.

For instance, take this example. You establish as the axis of your response that you will ignore an address of what I stated by the pretext of disputing my usage of the word facetious two posts ago, instead of the technical disagreement. This is not a side issue, this is the kernel of your dismissal of such discussion being necessary. In order to get you to engage in that discussion, it seems I have to address this, whereas if you'd constructed your response without building your statement upon this, or actually addressed my statements, I would not. (Yes, there is a unique point in that text, though it seems to be expected that you have no interest in doing anything other than multiplying the amount of points you refuse to discuss or recognize I've supported until I give up maintaining those that are inconvenient to you :-?

).

To address: "You wouldn't want a single device driver for all video cards and require level compatability" is something you said. If you are "deadly serious" in maintaining that Domtrooper proposed that, I'll simply disagree. It looked to me like this was an exaggeration by proposing a ridiculous extreme of his viewpoint. Hence my usage of "facetious". I do think this was pretty evident, and that addressing this was a waste of time, and that you are free to stop causing such whenever you wish to change some aspects of how you address disagreement (this does not necessitate that you cease disagreeing...please note that distinction).

It's not even an extreme statement, as anyone who is familiar with compiler theory would see that it is perfectly reasonable.

Again, something addressed by my previous posts. You maintain that because the HLSL has "one" name, that it has one characteristic of behavior and code output. You even contradict your discussion of gcc, which you proposed as an extreme example of what my viewpoint necessitated to satisfy your goals, and simply ignored any of my mention of HLSL profiles and the differences in code produced by them.

In fact, not only can a compiler not be optimal for all architectures, it can not even be optimal for a SINGLE architecture. This is know as the "full employment theorem for compiler writers" and is mathematically provable by reduction to the Halting Problem.

How does this apply to HLSL more than multiple backends that you propose? It seems it does not, and that your mentioning it was simply the practice of trying to portray that I am proposing that HLSL is perfect (when I did no such thing) and attacking that to support the idea that what I actually did say (something different, which was actually clearly demonstrated before) is dismissible.

Secondly, I cannot "prove" to you that a compiler written specifically for each architecture would be more optimal than a generic one without writing such a compiler, I can only point to the large body of historical evidence (40 years of compilers) that demonstrate this to be the case, not to mention the fact that just trying to write an optimal compiler for a SINGLE piece of hardware is hard.

Is this a reference to my reaction as to how your Itanium example seemed ridiculous to me, or to my asking you for an example of how the actual HLSL compiler is incapable? Neither seems addressed by this, because neither necessitates that you do what you are saying you cannot do. Seems rather irrelevant except as the mechanism I've described before, especially as your stated vertex shader example could have begun to do what I did ask, and I specifically tried to facilitate your usage of it to support your stance. Why did you reject that alternative? My current presumption is that you rejected it because facts turned out to be inconvenient, but, as I said, I don't know...maybe you forgot and you can now actually provide a relevant counter to what I propose now that you are reminded?

Third,

To summarize outside of the disagreement on the above matters, I ask: "what shader is the DX 9 assembly not rich enough for, and why, including consideration of 'flavors' of 2.0 extended and 3.0?"

Click to expand...

Shows your misunderstanding of the issue at hand.

A constant theme. Your usage of it does not seem to address my statements so far.

I am not claiming that that DX9 assembly lacks the expressiveness to some set of shaders (that is a separate issue).

Actually, I did mean a HLSL shader, including the abstract functionality that you propose the LLSL cannot express. I'm puzzled...did my "new conversation" statement cause you to think I'd switched topics from whether DX9 LLSL could implement optimally for multiple hardware to whether DX9 LLSL could implement a shader at all? I did not think it necessary to specify that I was still talking about HLSL abstraction when referring to your comments discussing the same, but since my wording is ambiguous without explicitly stating "HLSL", I guess it is at least partially my fault.

BTW, the "new conversation" comment was to try and re-start conversation about the compiler aside from your stated, but unsupported, assertions of factuality about current shading, not to change the context of discussing the LLSL. The >2.0 LLSL specs mentioned were provided to broaden discussion, and please note that I even tried to state that just a "general indication" would be a start for the "new conversation" to facilitate this occurrence. :-?

I am claiming that DX9 assembly does not fully represent the original semantic information of the HLSL source code, and therefore, the device driver doesn't have as much information as it could to perform driver level optimizations.

There is an inherent assumption about the disparity between the HLSL abstraction, the LLSL abstraction, and what the drivers are capable of processing for the hardware. I'm stating that your assumptions of the disparity between the LLSL abstraction and what the drivers can utilize for the hardware are not facts, but opinions that I've spent many paragraphs providing counter indications for. If you'll look back, I actually also said I think your argument is more relevant for "3.0" type shaders, or atleast that I don't see clear indications arguing against you in that case (my Cg and HLSL discussion seem to me to to argue against you for "2.0" type shaders, possibly including "extended" and the "2_a" proifle, though I still invite you to correct my understanding of your vertex shader example if I am incorrect in what I asked you about).

You do understand today that device drivers in DX9 are doing compilation right?

You quoted me discussing exactly that with regard to LLSL...why do you ask? Refresher:

demalion said:
DemoCoder said:

You want the compiler to be part of the device driver, and you want each vendor to ship their own optimizer that optimizes it at runtime.

Click to expand...

Well, if you're talking about the LLSL, you're describing what MS is doing.

Why do you think I prefer the term "LLSL" instead of "assembly", anyways? I mention this to illustrate why I'm excising your explanation of this.

Let's say you compile a given shader for ps_2_a or ps_3 statically.

Hmm...after all the unconstructive text, have we hit gold? This is the "new conversation" I was trying to start, which I mention because I consider all the posturing I addressed above as separate, and I want the context of my reply to be clear. I.e., you would be in error to inherently presume sarcasm instead of simply providing an answer to following questions.

The compiler might choose to use predication instead of branch instructions. If it does, when the DX9 driver intercepts the assembly, it will not be able to transform predicates back into branches if branches would be more efficient on that hardware.

OK, why would it choose to use predication instead of branch instructions, when the 3.0 and 2.0 "extended" specifications list branch instructions?

Or, if a loop is unrolled by the compiler, a driver won't be able to "un unroll it" back into a real loop, if loops happen to be more efficient on that architecture, because the driver will never see the loop, that information has been removed by the compiler.

Why are you assuming the compiler would unroll a loop for a 3.0 profile at all, or even for the ps_2_0 profile? I can see your point if the 2.0 based vs profiles do such, but, as I said, I thought that was done by the driver?

Moreover, depending on the architecture, sometimes you might even choose different instructions altogether, for example, to minimize register usage, or to minimize length, or to minimize resource hazards.

I discussed all of these particulars specifically before (in the "irrelevant text" of my first recent reply to you), and why your simply making this statement does not establish that the LLSL compiler model cannot meet these requirements, even if it only has "one" name. In lieu of quoting myself and adding even more length, I point you towards that first post and the "irrelevant" text, as I did before.

Fourth, the fact is, in this discussion, I never claimed Cg solved this problem.

You said that before, I merely asked you to state unequivocally that Cg did not solve this problem so we could drop dispute about Cg from being a factor of our particular discussion. Is that so hard?

I am criticizing the idea that there is one "optimal" DX9 assembly for a given algorithm on all architectures. You would like to think that IHVs should look at the DX9 shader spec and implement it "optimally".

Doesn't there being a LLSL compiler already change the wording of your argument significantly? What you're arguing against is the idea that the LLSL can be compiled to an optimal interpretation by the drivers, but you're using arguments that ignore that the HLSL compiler can already create different "assembly" for different architectures in the first place, and as an alternative to discussing my proposal that your branching examples do not seem to be true at the moment, and that it is not simply a given that they will be.

...various things I've responded to already...if you don't think I have, please just ask me to show you in PMs...

demalion · Jul 18, 2003

DemoCoder said:
demalion said:

Yes, but at costs that warrant more than simple dismissal. Costs like: IHVs having to actually develop a compiler

Click to expand...

Wrong, the IHV's don't have to develop the compiler. Just like today, if I create a new CPU, I need only to create a new architecture description configuration file for GCC, and viola, it can compile to the new target.

Umm...then why is nVidia having such trouble? What exactly are they doing with Cg if not writing a compiler? If other architectures are as divergent as you say, why will the Cg compiler example nVidia decides to release give the associated IHVs a head start? BTW, where is there a compiler as developed as gcc for shader output?

1. Parsing -> 2. Semantic Analysis -> 3. Internal Representation 4. -> Generic Optimizations -> 5. Specific Optimizations -> 6. Instruction Selection

It is only the last two steps that the IHVs have to write and they are already doing 5. and 6. to some extent when they translate DX9 opcodes into their native format.

Yes, but aren't you requring that they take control of 4 as well, and maintaining that 3 needs to be sufficiently abstracted? Also, Microsoft does some of 5 with profiles...really, 4 and 5 end up intermixed. My point is that you maintain GPUs will have to be like "PowerPC" versus "MIPS" versus "x86", when they could be more like "Duron" versus "Athlon XP" (atleast as far as differences in each pipeline).

Microsoft will provide steps 1,2,3, and 4 and all cards will get the generic set of optimizations that FXC is doing today (cse, copy prop, dead code removal, etc)

Why do you make statements dependent on the HLSL profile system not existing, and dependent on "FXC" from the command line being the only method of compiling to the LLSL? :-?

Actually, using something like "FXC" would be pluggable...come to think of it, I don't see why a developer couldn't provide a (small and universal for their title or for an engine) shader management tool that was constantly updated to provide some of the benefits we discussed before. Maybe HL 2 will do something like this, as I think it will be the first extensive HLSL implementation (AFAIK) in a shipping title.

All that needs to be done is dump the internal compiler representation (like MS does with .NET compilation), and hand it off.

Or for the LLSL to be used effectively as an "internal compiler representation", as I thought we'd established earlier. I do agree that it certainly might not be at some point, however, which is why I maintain that a party without vested interest control the specifics and evolution of that representation.

DemoCoder · Jul 18, 2003

demalion said:
DemoCoder said:

demalion, you are too verbose and you should be more direct and to the point.

Click to expand...

First of all, you called my statement facetious. I was not. It was deadly serious. There is no satire, irony, or joke in it.

Click to expand...

(five irrelevent paragraphs of verbiage that have nothing to do with the HLSL issue)

Again, something addressed by my previous posts. You maintain that because the HLSL has "one" name, that it has one characteristic of behavior and code output.

I said nothing of the sort. I said GCC is an example of an uber-cross-compiler, but GCC

#1 not as good as optimizing as Intel C and MSVC (does "ok" on oodles of CPU architectures, it is not the best on any one of them, but that's my point)

#2 Static compilation for multiple targets means that when the compiler is upgraded or the hardware is upgraded, no consumers get any immediate benefits, unlike today with device drivers where a driver ugprade can deliver significant performance boosts.

Is this a reference to my reaction as to how your Itanium example seemed ridiculous to me

Well, I made the mistake of assuming you knew something about compilers. Both the Itanium and GPUs are VLIW architectures. The failure of static compilation for VLIW is well known due to these architectures' varying the number of functional units. It only seems ridiculous if you are ignorant of history.

There is an inherent assumption about the disparity between the HLSL abstraction, the LLSL abstraction

It is a simple fact that they are not isomorphisms and that the optimization of HLSL into LLSL can lead to local minima, not global minima. ATI and 3DLabs are towing this line and specifically argue having the compiler inside the driver leads to better optimization (in this GDC presentation).

It is trivial to realize that DX9 places resource limits on the compiler output (nesting levels, # of registers, # of read ports on registers,etc) which cause the compiler to omit code which obfuscates the semantics of the original expression, and therefore, if your underlying hardware has superior than DX9 capabilities in some of the pipeline, it will be unable to realize some optimizations.

Why do you think I prefer the term "LLSL" instead of "assembly", anyways? I mention this to illustrate why I'm excising your explanation of this.

Doesn't matter what you call it, it's still assembly. I could call x86 assembly LLIL (low level intel language). Gee, big difference.

OK, why would it choose to use predication instead of branch instructions, when the 3.0 and 2.0 "extended" specifications list branch instructions?

Would you suggest that it use branches 100% of the time? Why even have predicates at all if you're never going to use them?

if(a < x) { b = 10; }
else { b =20; }

Why don't you, Mr know-it-all tell me the heuristic to decide which to generate? In some cases, the compiler can even generate a LRP instruction instead of a predicate or branch. Now you have three choices.

Why are you assuming the compiler would unroll a loop for a 3.0 profile at all, or even for the ps_2_0 profile? I can see your point if the 2.0 based vs profiles do such, but, as I said, I thought that was done by the driver?

Because there is a nesting depth limit for loops and/or subroutine calls, therefore, it has to inline or unroll.

You said that before, I merely asked you to state unequivocally that Cg did not solve this problem so we could drop dispute about Cg from being a factor of our particular discussion. Is that so hard?

My original statements to Doomtrooper never mentioned Cg. I never in any way said Cg did anything to solve this problem as my entire discussion was focused on run-time compilation. If you actually bothered to carefully read my messages, I wouldn't have to repeat myself a dozen times to explain what is obvious assumptions of bias on your part.

This is a rather pointless discussion. OpenGL is already going the route I mentioned specifically for the reasons I mentioned. I have given you enough reasons to doubt the perfection of LLSL as a medium for optimization, yet you continue, and will continue until I tire of responding so you won't have to admit you're wrong. (you have a history of this, as I recall your ignorance of the Amiga exec.library when I stated it has a bug that prevented preemptive multitasking from working properly in our famous Windows vs Amiga/BeOS "quality" thread)

DemoCoder · Jul 18, 2003

demalion said:
Umm...then why is nVidia having such trouble? What exactly are they doing with Cg if not writing a compiler?

What does Cg have to do with this? NVidia is having trouble because they have to write the whole compiler from scratch. If FXC was built into the driver architecture of DX with pluggable backends, the IHVs avoid 90% of the work.

Yes, but aren't you requring that they take control of 4 as well, and maintaining that 3 needs to be sufficiently abstracted?

No, the IHVs will not have to create their own IR nor do generic optimizations. For example, dead code removal erases information that is safe to erase. That means there is no need for IHVs to write code to do SSA form, or liveness analysis, and perform dead code removal.

Compilers use an incredible amount of complex datastructures, and house keeping code and maintenance of these structures is where most of the work is.

My point is that you maintain GPUs will have to be like "PowerPC" versus "MIPS" versus "x86", when they could be more like "Duron" versus "Athlon XP" (atleast as far as differences in each pipeline).

The existence of the radically different NV3x vs R300 proves you wrong already. Let's not forget the 3DLabs architecture either. Moreover, as I explained, even Itanium vs Itanian-N are very different as far as the compiler is concerned, even though the only difference is in the number of functional units.

Part of the reason for abstraction is to allow more freedom in the underlying implementation. Intel and AMD are hamstrung by their need to maintain x86 compatibility. If we reach the stage of Duron vs Athlon, then it is a sad day for the future evolution of GPUs.

Actually, using something like "FXC" would be pluggable...come to think of it, I don't see why a developer couldn't provide a (small and universal for their title or for an engine) shader management tool that was constantly updated to provide some of the benefits we discussed before.

FXC is not pluggable currently, and how would the developer maintain N different optimizers? Is he going to write them himself, without any knowledge of the proprietary pipeline architecture, or is ATI and NVidia going to waste their time writing plugins for this sole developer's tool? This is no better than the DOS days when developers had to maintain and link in umteem different drivers for video and audio cards.

Sorry, it has to be part of the OS.

Or for the LLSL to be used effectively as an "internal compiler representation", as I thought we'd established earlier.

And as I've explained for many difference reasons, it is inadequate. Perhaps after you go through the exercise of having to write a compiler and are forced to "shop around" for internal representation structures, you'll have more insight.

I do agree that it certainly might not be at some point, however, which is why I maintain that a party without vested interest control the specifics and evolution of that representation.

And who has suggested otherwise? Again, you fall back on the tired old Cg subject. Can it be any clearer that I am talking about MICROSOFT providing this infrastructure for IHVs?

archie4oz · Jul 19, 2003

OK, why would it choose to use predication instead of branch instructions, when the 3.0 and 2.0 "extended" specifications list branch instructions?

Because a particular IHV's hardware maybe be more efficient at dealing with predication. (e.g. spend more realestate and resources on execution hardware, data routing, etc. rather than prediction hardware). With the API vendor specifying the functionality via an assembly intermediary you're creating artificial barriers not only to hardware specific optimizations, but also to innovations in hardware implimentations.

I can see your point if the 2.0 based vs profiles do such, but, as I said, I thought that was done by the driver?

Passing on an internal representation (not like an SSA or anything, maybe along the lines of GCC's RTL) would provide a much more informative descriptor for the driver to compile as opposed trying to figure it out from a stream of assembly instructions. (It'd be like taking the assembly output of GCC and expcting ICC to optimize it better than if it had a peek at the internal structure (or the full source))...

but you're using arguments that ignore that the HLSL compiler can already create different "assembly" for different architectures in the first place,

No it creates different assemby outputs for different feature ranges, not architectures. There *IS* a difference.

My point is that you maintain GPUs will have to be like "PowerPC" versus "MIPS" versus "x86", when they could be more like "Duron" versus "Athlon XP" (atleast as far as differences in each pipeline).

I don't think he's stated that the have to be that way, but simply that they may be that way already and if not, then certainly headed that way. Besides even if you're looking at the same ISA you can still run into instances of similar feature sets that have require different approaches to optimization (e.g. Opteron vs. P4)

Or for the LLSL to be used effectively as an "internal compiler representation", as I thought we'd established earlier.

That for example would be like using x86 assembly as an internal representation for the compiler, so um no... I'd like more flexability in that regard. I wouldn't want to be hampered by being limited to a 2 operand, accumulator structure and have to engineer hardware resources into my design to allow me to explore more various architectural approaches, and a driver/compiler to have to deal with that.

EDIT: Aw crap... I responded too late!

DemoCoder · Jul 19, 2003

archie4oz said:
Passing on an internal representation (not like an SSA or anything, maybe along the lines of GCC's RTL) would provide a much more informative descriptor for the driver to compile as opposed trying to figure it out from a stream of assembly instructions. (It'd be like taking the assembly output of GCC and expcting ICC to optimize it better than if it had a peek at the internal structure (or the full source))...

Yeah, RTL would be better, although I've heard that the way GCC uses RTL still has some target-specific generation in the compiler front-end (e.g. if it knows that the target doesn't have a multiply instruction, it won't generate one in RTL)

There is much debate on whether tree representation, three/four address code, and p-code (stack machine) based IR is the best. Some optimizations are much easier to do on an AST, and some optimizations are much easier to do on three address code. GCC 3.5 will split the difference by incorporating a tree optimizer as well. Using Tree SSA, they will be able to do alot more optimizations more effectively like auto-vectorization, conditional constant propagation, and partial redundancy elimination. See http://gcc.gnu.org/projects/tree-ssa/ for interesting info.

archie4oz · Jul 19, 2003

Using Tree SSA, they will be able to do alot more optimizations more effectively like auto-vectorization, conditional constant propagation, and partial redundancy elimination. See http://gcc.gnu.org/projects/tree-ssa/ for interesting info.

Yeah I noticed several list messages that the IBM guys are really fired up about doing some work on GCC in this area (it'd sure be nice if GCC could generate code performing on the same level as XLC (moreso with autovectorization))

demalion · Jul 19, 2003

Let me point out something to you.

DemoCoder said:
demalion said:

DemoCoder said:

demalion, you are too verbose and you should be more direct and to the point.

Click to expand...

First of all, you called my statement facetious. I was not. It was deadly serious. There is no satire, irony, or joke in it.

Click to expand...

Click to expand...

Here we have a snide dig, constructed by quoting yourself out of context, (in quotes I already addressed, and, to my mind, debunked, in context), after my telling you that I spend the larget part of my time dealing with the snide (what a coincidence) commentary you throw in front of discussion.

(five irrelevent paragraphs of verbiage that have nothing to do with the HLSL issue)

And here we have you criticizing me for addressing such snide commentary, as if it is my lot in life to suffer it without complaint while you continue to propose it with impunity.

You're not going to convince me it is my lot in life, so how about simply skipping the snide commentary? If you don't want a discussion not directly related to the HLSL issues, stop making the discussion necessary by acting as you are.

Just an idea. Would save me some typing.

BTW, that bold statement isn't quite accurate. It was 5 paragraphs of text discussing your excuse for skipping my discussion about the HLSL issue. I.e., it was directly relevant to our discussion of the HLSL issue, thanks to your conduct.

Again, something addressed by my previous posts. You maintain that because the HLSL has "one" name, that it has one characteristic of behavior and code output.

Click to expand...

I said nothing of the sort.

DemoCoder said:
Doomtrooper, there is no way to be optimal for all hardware. You wouldn't want a single device driver for all video cards and require register level compatability to make it work, and you don't want a single compiler for all video cards.

The correlation between the statement and my description seems greater than "nothing".
I could be more verbose in my support, but I already discussed this in detail. It seems wasteful to predicate support for your argument on dismissing my statements out of hand and requiring me to dig up past discussions to be able to progress in what I had stated already.

I said GCC is an example of an uber-cross-compiler,

An example you stated the HLSL would have to follow in scale to succeed at handling multiple GPU designs, regardless of the difference in scope of their workloads and targets.

but GCC

#1 not as good as optimizing as Intel C and MSVC (does "ok" on oodles of CPU architectures, it is not the best on any one of them, but that's my point)

And here you seem to forget about the LLSL->assembly compilers the drivers already have, again.

#2 Static compilation for multiple targets means that when the compiler is upgraded or the hardware is upgraded, no consumers get any immediate benefits, unlike today with device drivers where a driver ugprade can deliver significant performance boosts.

Actually, the consumer does get benefit when the compiler is upgraded, it is just that for the HLSL model they have to patch an application. As I said, this does make the driver provided backend more convenient if the LLSL->assembly compiler limitations manifest, but that is not an issue of what the compiler is capable of targetting, but of delivery.
Details above (if you want me to quote such details when I make this statement, say so...I'm skipping that for at least some brevity).

Is this a reference to my reaction as to how your Itanium example seemed ridiculous to me

Click to expand...

Well, I made the mistake of assuming you knew something about compilers.

Do you get tired of the "demean then restate my viewpoint while ignoring yours" style of response?

Both the Itanium and GPUs are VLIW architectures. The failure of static compilation for VLIW is well known due to these architectures' varying the number of functional units.

I take it you mean the failure of one single static compiler that doesn't evolve when a VLIW architecture does? The problem with your example is that you are proposing that what HP and Intel is doing for the Itanium is a proof with regard to GPUs, with a widely divergent workload, and where you haven't bothered to establish that the VLIW evolution will be the same for GPUs. I guess this is as close as I can expect to an answer to "Let me restate: if they "are the Itanium", could you simply even begin to illustrate how?".

It only seems ridiculous if you are ignorant of history.

Hmm...demeaning at the end, too. Nice variation.

There is an inherent assumption about the disparity between the HLSL abstraction, the LLSL abstraction

Click to expand...

It is a simple fact that they are not isomorphisms and that the optimization of HLSL into LLSL can lead to local minima, not global minima.

First, my whole sentence: "There is an inherent assumption about the disparity between the HLSL abstraction, the LLSL abstraction, and what the drivers are capable of processing for the hardware."

Second, why do you ignore the paragraph of direct and specific discussion that goes with it to propose a possibility?

demalion said:
I'm stating that your assumptions of the disparity between the LLSL abstraction and what the drivers can utilize for the hardware are not facts, but opinions that I've spent many paragraphs providing counter indications for. If you'll look back, I actually also said I think your argument is more relevant for "3.0" type shaders, or atleast that I don't see clear indications arguing against you in that case (my Cg and HLSL discussion seem to me to to argue against you for "2.0" type shaders, possibly including "extended" and the "2_a" proifle, though I still invite you to correct my understanding of your vertex shader example if I am incorrect in what I asked you about).

ATI and 3DLabs are towing this line and specifically argue having the compiler inside the driver leads to better optimization (in this GDC presentation).

Where's the link, as maybe I can get some specifics there? A...slightly more specific summary of their statement would have been helpful.

It is trivial to realize that DX9 places resource limits on the compiler output (nesting levels, # of registers, # of read ports on registers,etc)

Yes.

which cause the compiler to omit code which obfuscates the semantics of the original expression,

That depends on the original expression and the full set of the resource limits. Actually, the resource limits have more to do with the capabilities of the hardware. You seem to ignore that failing to compile is part of the original expression, and that this goes hand in hand with instruction count limitations that do not parallel CPU compilation.

and therefore, if your underlying hardware has superior than DX9 capabilities in some of the pipeline, it will be unable to realize some optimizations.

Only if the compiler is incapable of expressing them in the LLSL in a form useful to the LLSL->assembly compiler (not a given, details above).

Why do you think I prefer the term "LLSL" instead of "assembly", anyways? I mention this to illustrate why I'm excising your explanation of this.

Click to expand...

Doesn't matter what you call it, it's still assembly. I could call x86 assembly LLIL (low level intel language). Gee, big difference.

Is the posturing the only point for you? The sequence leading up to this is completely nonsensical.

You ask: "You do understand today that device drivers in DX9 are doing compilation right?"

I answer with a quote of me saying that the LLSL is compiled by the driver, follwed by this rhetorical question about LLSL.

I might think that the case would be closed with the question answered, but then you quote only the final statement and then take a pot shot at me for drawing a distinction when you were asking if I saw the distinction of the first place!

It seems obvious that my usage of assembly was presented in the context of final output for the hardware to process, and my preference for the distinct term was purely an answer to your own specific question. Could you clarify how your derision begins to have any relevance to the quote?

OK, why would it choose to use predication instead of branch instructions, when the 3.0 and 2.0 "extended" specifications list branch instructions?

Click to expand...

Would you suggest that it use branches 100% of the time?

No, I'm suggesting that MS could have the compiler change behavior based on predicate and branch support reported, and that the approaches that the profiles represent need not be unique to every new hardware.

Why even have predicates at all if you're never going to use them?

Your stuck on this one specific behavior for the compiler idea, while saying there is a compiler in the driver already but maintaining that it is and always will be in applicable to solving the issue you propose in conjunction with those profiles.

if(a < x) { b = 10; }
else { b =20; }

Why don't you, Mr know-it-all tell me the heuristic to decide which to generate?

I wonder, are you at all capable of containing juvenile impulses when handlng questions and disagreement with someone who doesn't roll over and take your posturing with a smile?

Let's see what I can learn from what seems to be (but perhaps may not be, hopefully) my having to pretend my writing the compiler determines whether MS can do so or determines to what degree hardware designs will diverge with regard to this issue.

Of course, correct me where I go wrong, maybe even without what I just mentioned. I realize that the swizzling details are important.

Well, for the predicate, a setp_lt,if_pred,mov,else,mov,endif (sticking to one component, and using branching, atleast in LLSL expression) or setp_lt,mov,(p)mov (allowing per component execution of your idea, if necessary) seem possible.

For branching (without predicate), if_lt,mov,else,mov,endif works, correct?

For lrp, how would you best set the interpolation register to each end of the range (0,1 I'm assuming)? Can you use the predicate register for the interpolation control...and if so, is "true" defined usefully in any case? It looks like it would start with zeroing a register,if_lt,mov,endif or setp_lt,(p)mov, followed by a lrp.

Now, AFAICS, you're saying the compiler would pick one approach (that may resemble or completely fail to resemble my outlines), and that this one, in some particular case (more complex than your example), would be beyond the scope of the driver compiler for the LLSL to implement efficiently in the architecture.

I'm expecting that this request is you simply seeking to ridicule (again) and proclaim I'm unqualified to reason, nevermind the actual reasoning presented elsewhere. However, maybe I'm wrong, and correcting the details of the above nearly-random conjecture in the context of the more complex example will move us forward with something relevant. I really would prefer the latter.

In some cases, the compiler can even generate a LRP instruction instead of a predicate or branch. Now you have three choices.

Could you provide an example of the lrp output from the compiler for such a case, since you seem to have one in mind?

quote]Why are you assuming the compiler would unroll a loop for a 3.0 profile at all, or even for the ps_2_0 profile? I can see your point if the 2.0 based vs profiles do such, but, as I said, I thought that was done by the driver?

Because there is a nesting depth limit for loops and/or subroutine calls, therefore, it has to inline or unroll.[/quote]

Oh, you're predicating this on the shader exceeding the stated limitation. I'd propose that someone ignoring the limitation is planning on shipping when a compiler profile supports the limit they need, or they will be expecting the loop to unroll. However, if the HLSL compiler unrolls in the LLSL instead of the driver LLSL compiler having the responsibility, you're right that the shipped compiler, for that shader, would be specifically inefficient if it can't query output code supporting necessary loop depth.

What application is this, and why is patching it unacceptable? This case is not something that would happen to every single application, even if you wouldn't be patching them with new hardware.

You said that before, I merely asked you to state unequivocally that Cg did not solve this problem so we could drop dispute about Cg from being a factor of our particular discussion. Is that so hard?

Click to expand...

My original statements to Doomtrooper never mentioned Cg.

We went over this, what, 3 times now? Doomtrooper discussed Cg and HLSL, you discussed only HLSL and criticized it, and didn't mention Cg, and made statements that seem to resemble possible references to Cg in association to what HLSL was lacking. Was the discussion unclear? Miscommunication, misunderstanding, whatever, what is amazing is how you manage to avoid a clear answer when it is clearly established what would be required to simply drop the matter. :-?

I never in any way said Cg did anything to solve this problem as my entire discussion was focused on run-time compilation.

OK, I'm going to assume you're saying "when I say that I never said Cg did solve the problem, and stated HLSL did not solve the problem in a discussion about Cg and HLSL, I of course also meant that Cg did not solve the problem any better, though I just neglected to mention it".

Of course, this does is not logically equivalent to what you say, but you'd have to be playing a pretty silly game for this not to be what you actually mean. If this is correct, our discussion of Cg along the lines of my initial reply is over for now.

If you actually bothered to carefully read my messages, I wouldn't have to repeat myself a dozen times to explain what is obvious assumptions of bias on your part.

I discussed in rather specific detail what I noted about what I read, which you have not bothered to dispute. All I required was a pretty simple answer to a pretty clear question to avoid having to continue discussion. Is your choice not to provide it my fault in some way? I don't know how the question above could have been clearer.

This is a rather pointless discussion. OpenGL is already going the route I mentioned specifically for the reasons I mentioned. I have given you enough reasons to doubt the perfection of LLSL as a medium for optimization, yet you continue, and will continue until I tire of responding so you won't have to admit you're wrong.

It appears that when I said my problem was with your supplied reasoning, and asked for better, I was proposing that LLSL was perfect, even when I listed flaws for it, and said things like "Do you think I think the HLSL model is perfect?". Thanks for the news flash.

and now for something completely different...

(you have a history of this, as I recall your ignorance of the Amiga exec.library when I stated it has a bug that prevented preemptive multitasking from working properly in our famous Windows vs Amiga/BeOS "quality" thread)

You stated "Actually, the Amiga did not have preemptive multitasking in 1985." in response to my stating that it could pre-emptively mutlitask in 1985, in a discussion about how the time difference between Windows and the Amiga offering this might illustrate something undesirable about monopolistic influences on OS evolution.

I then asked you to say what was broken specifically, because it seemed to me that you were just trying to ignore what was inconvenient to you to avoid recognizing my point until the discussion had moved past it.

This was the proof:

Kickstarts before 2.0 contained two task switching bugs. After an interrupt, a task could lose the CPU to another equal priority task, even if the first taskâ€™s time was not up. The second bug allowed a task whose time was up to hold on to the CPU either forever, or until a higher priority task was scheduled. Two busy-waiting tasks at high priority would never share the CPU. Because the input.device runs at priority 20, usually the effect of these bugs was masked out for low priority tasks. The xecBase->Quantum field had little effect because of the bugs.

Now, it looks to me like low priority tasks had the issues masked because the input.device prevented deadlock. One issue was early pre-emption, and the other prevented pre-emption except by higher priority tasks. I do not see how this rationally supports that the Amiga did not do pre-emptive multi-tasking as you stated, and are apparently stating again, because the description seems to describe pre-emption as definitively occuring.
In fact, you stated " Yes, preemptive multitasking worked in some scenarios, but it didn't work in others" and "The bug didn't prevent pre-emption, it was just merely an inefficiency as high numbers of tasks on the ready-to-run queue would be suboptimal" immediately after.

Personally, I have no idea what that has to do with anything remotely related to this discussion, except as your own ego is involved and you are seeing this discussion primarily in terms of it while failing to notice your self-contradiction. It is strange that you continue to be surprised when I respond to things like this, and blame me for not thinking you're actions are making sense.

demalion · Jul 19, 2003

archie4oz said:
OK, why would it choose to use predication instead of branch instructions, when the 3.0 and 2.0 "extended" specifications list branch instructions?

Click to expand...

Because a particular IHV's hardware maybe be more efficient at dealing with predication. (e.g. spend more realestate and resources on execution hardware, data routing, etc. rather than prediction hardware).

Yes, and a particular hardware may not be more efficient at dealing with predication. The outlined plan for compiler behavior for the extended capability flags and profiles handles this pretty well, it seems. Why is this being simply ignored? If I am in error, discuss the extended capability flags and profile behavior and to show me where.

With the API vendor specifying the functionality via an assembly intermediary you're creating artificial barriers not only to hardware specific optimizations, but also to innovations in hardware implimentations.

OK, why is "assembly" being used as a justification that there are significant artificial barriers? I'm seeing a repetition of "it can be called assembly, assembly for CPUs express limitations, therefore this 'assembly' for GPUs is obviously doing the same ". As I said before, there is an entire discussion being omitted there.

I can see your point if the 2.0 based vs profiles do such, but, as I said, I thought that was done by the driver?

Click to expand...

Passing on an internal representation (not like an SSA or anything, maybe along the lines of GCC's RTL) would provide a much more informative descriptor for the driver to compile as opposed trying to figure it out from a stream of assembly instructions. (It'd be like taking the assembly output of GCC and expcting ICC to optimize it better than if it had a peek at the internal structure (or the full source))...

This example is flawed, because the LLSL is not targetted at being the assembly of a GPU, it is targetted at being a standardized intermediate representation. By simply stating it is assembly, and therefore by that label has the same limitations as the more complex, and direct, assembly languages when used for the more complex execution streams for CPUs, is not a supported statement, it just a statement that is being repeated.

However, if we instead proceed by simply discussing the particulars of the intermediate specification that the LLSL is lacking, we can then discuss how they relate to shader workloads and potential (and existing) hardware divergences, and actually end up discussing the issue. That is the missing discussion I keep asking for. Calling it "assembly" does not suffice.

but you're using arguments that ignore that the HLSL compiler can already create different "assembly" for different architectures in the first place,

Click to expand...

No, it creates different assemby outputs for different feature ranges, not architectures. There *IS* a difference.

Click to expand...

No, it is not just creating different assembly outputs for different feature ranges, and MS has said specifically demonstrated that it creates assembly for a specific architecture, and the LLSL specification do not seem to preclude that being more extensively represented if necessary.

For example, as I mentioned before, if you have access to the GDC slides and the discussion of HLSL, please take a look at the discussion included about HLSL for the "ps_2_a" profile. Here is one characteristic listed for that profile, paraphrased: "generating output that seeks opportunities to have texture operations interleaved with two 'arithmetic' ops when possible". That is not an issue of feature range, that is an architecture characteristic (well, now that the NV35 is released and has presumable benefit from this when processing floating point values).

My point is that you maintain GPUs will have to be like "PowerPC" versus "MIPS" versus "x86", when they could be more like "Duron" versus "Athlon XP" (atleast as far as differences in each pipeline).

Click to expand...

I don't think he's stated that the have to be that way, but simply that they may be that way already and if not, then certainly headed that way. Besides even if you're looking at the same ISA you can still run into instances of similar feature sets that have require different approaches to optimization (e.g. Opteron vs. P4)

Click to expand...

Oh, I'm all for discussing "may be", what I'm against is entirely skipping over discussion of "may not be" without trying to establsish accurate and consistent reasons for doing so.

Or for the LLSL to be used effectively as an "internal compiler representation", as I thought we'd established earlier.

Click to expand...

That for example would be like using x86 assembly as an internal representation for the compiler, so um no...

Click to expand...

Ack! The LLSL representation is not as low level as x86, and your entire basis for stating that is seems to be the label "assembly". This "assembly" offers various ranges of instructions to be used ("feature range" are not mutually exclusive with "architecture") and allows expressing multipled paradigms, and all I'm seeing to support that is the (demonstrably false) assertion that the compiler is only capable of using those expressions in one particular way, and for one specific architecture.

I'd like more flexability in that regard. I wouldn't want to be hampered by being limited to a 2 operand, accumulator structure and have to engineer hardware resources into my design to allow me to explore more various architectural approaches, and a driver/compiler to have to deal with that.

Click to expand...

Hmm...I still don't get how these extreme examples are useful, when simple statements I've said several times are simply ignored. Hey, I'm just asking you to address things like the ps_2_a discussion on your way to disagreeing with me, as your statements as presented seem to having significant inconsistencies.

demalion · Jul 19, 2003

DemoCoder said:
demalion said:

Umm...then why is nVidia having such trouble? What exactly are they doing with Cg if not writing a compiler?

Click to expand...

What does Cg have to do with this?

DemoCoder said:
Wrong, the IHV's don't have to develop the compiler. Just like today, if I create a new CPU, I need only to create a new architecture description configuration file for GCC, and viola, it can compile to the new target.

And, I brought up an IHV writing a compiler.

NVidia is having trouble because they have to write the whole compiler from scratch.

OK, and what if it isn't good enough? Are the other IHVs going to start somewhere better than where nVidia has been since after the initial Cg release? Note that my discussion maintains that this is a drawback being ignored, not that the drawback is unsurmountable (though there were others in the original discussion).

If FXC was built into the driver architecture of DX with pluggable backends, the IHVs avoid 90% of the work.

Hmm...but I don't see Microsoft giving that away to all the IHVs for free. You're right, however, that if Microsoft did that, this back end drawback would be largely eliminated.

Yes, but aren't you requring that they take control of 4 as well, and maintaining that 3 needs to be sufficiently abstracted?

Click to expand...

No, the IHVs will not have to create their own IR nor do generic optimizations. For example, dead code removal erases information that is safe to erase. That means there is no need for IHVs to write code to do SSA form, or liveness analysis, and perform dead code removal.

I see. I'd have thought register analysis would be a primary factor in dead code elimination for shaders (atleast in a literal interpretation of the phrase), and that seemed likely to be one of the things some IHVs would need to take up. However, there is nothing preventing IHVs from addressing it again after suitable general optimizations.

Compilers use an incredible amount of complex datastructures, and house keeping code and maintenance of these structures is where most of the work is.

My point is that you maintain GPUs will have to be like "PowerPC" versus "MIPS" versus "x86", when they could be more like "Duron" versus "Athlon XP" (atleast as far as differences in each pipeline).

Click to expand...

The existence of the radically different NV3x vs R300 proves you wrong already.

I did spend a lot of text discussing this specific example multiple times already. Since my discussion points out that the HLSL compiler has a profile for the NV3x (and pointed you to a thread discussing that specific in detail), and maintains that the NV3x simply does not prove that other IHVs will diverge as much as nVidia did, both indications and a desire to be able to compete seems to indicate they wouldn't go off on a completely new path, I'm not sure what you are maintaining is proven wrong.

Let's not forget the 3DLabs architecture either.

The licensing of the NV register combiner extension was an interesting development, though it does not indicate much about future architecture of the "DX 9" level.

Moreover, as I explained, even Itanium vs Itanian-N are very different as far as the compiler is concerned, even though the only difference is in the number of functional units.

No, you stated this, you did not explain the relevance to shaders and expectations of hardware design under discussion. Saying VLIW architectures can evolve in this way does not indicate this is the way they will evolve for GPUs.

Part of the reason for abstraction is to allow more freedom in the underlying implementation. Intel and AMD are hamstrung by their need to maintain x86 compatibility. If we reach the stage of Duron vs Athlon, then it is a sad day for the future evolution of GPUs.

But Intel and AMD don't have an upgradeable compiler between x86 and their internal architectures, and they still manage to provide effective evolution. Also, clock speed and replication provide significant avenue for innovation more significantly than for CPUs.

Actually, using something like "FXC" would be pluggable...come to think of it, I don't see why a developer couldn't provide a (small and universal for their title or for an engine) shader management tool that was constantly updated to provide some of the benefits we discussed before.

Click to expand...

FXC is not pluggable currently, and how would the developer maintain N different optimizers?

"FXC.exe" is what I think of when I see "FXC". An .exe file is indeed "pluggable" if you'd use it to compile your application's shaders. That thought can be termed "rambling".

And you misunderstood the rest of my statement. N=number of engines that can be served by a shader manager application. Doesn't even have to be collected by engine alone, could just be every title fromt he publisher that uses shaders addressed by one app. Like an update utility.

For example, Steam and HL 2 seem rather well suited for this idea. I know plug-in shader updates were mentioned, but the mechanism was not clarified AFAIR.

Is he going to write them himself, without any knowledge of the proprietary pipeline architecture, or is ATI and NVidia going to waste their time writing plugins for this sole developer's tool?

No...I meant a utility that was linked with the latest compiler that handled shader compiling for more than one game, that could be updated without patching each and every game for the publisher. I.e., a solution to one of the convenience issues with DX HLSL. The centralized application for multiple games idea for HL (and presumably HL2) would be served by similar, except that usual patching frequency requirements would seem to make it redundant to some extent.

This is no better than the DOS days when developers had to maintain and link in umteem different drivers for video and audio cards.

I still am puzzled as to why you go out on such tangents. Your conjecture does not resembly my proposal.

Sorry, it has to be part of the OS.

It probably will be when the complexity has arrived. The real DX 10, for example. Of course, it is theorized that this might be late to the table for hardware. Then OpenGL will have a chance to demonstrate how well it tackled the issues I mentioned. Go competition.

Or for the LLSL to be used effectively as an "internal compiler representation", as I thought we'd established earlier.

Click to expand...

And as I've explained for many difference reasons, it is inadequate.

Well, actually, as I referred to, you said "Either that, or upgrade the DX9 assembly so that it is rich enough to recover the full semantics of the original source code." You have not demonstrated that it is inadequate, you've proposed that it could be, if it doesn't evolve, and the parallels to the Itanium manifest. How can you confuse these actions? Or are you still not willing to discuss the ps_2_a profile at all?

Perhaps after you go through the exercise of having to write a compiler and are forced to "shop around" for internal representation structures, you'll have more insight.

Of course, by saying I'm not qualified to discuss it, and refusing to simply state the insight you have in mind.

I do agree that it certainly might not be at some point, however, which is why I maintain that a party without vested interest control the specifics and evolution of that representation.

Click to expand...

And who has suggested otherwise?

Well, you did, actually, since you ask, but I assume you mean recently, and assuming that I provided an accurate interpretation of your reply earlier, that would not be you now.

However, this was not an accusation directed at you, it was a fault in Cg relevant to the issue we just brought up, and what would be a fault in the DX HLSL if vested interests on MS's part start to play a role in it that disadvantages IHVs. Both were actually occurring to me when I made the statement.

Again, you fall back on the tired old Cg subject.

Actually, "bit players" being inadequately addressed by HLSL because of politics and economics was what was going through my mind along with Cg. I don't think that's a tired subject, actually, especially with "Santa wearing lipstick" working on bringing us a present relevant to this discussion. Presumably one with some unique characteristics.

Can it be any clearer that I am talking about MICROSOFT providing this infrastructure for IHVs?

Heh, see the other post. If the question I asked is resolved, it is resolved, but my bringing up something associated with Cg does not mean that I'm saying you were talking about it here (in fact, there is no reason I see to presume you were at all, because of the absence of the reasons I discussed earlier). Look at the thread title...it's allowed to be mentioned without being an accusation in your direction.

Colourless · Jul 19, 2003

Most developers will use D3DXCompileShader or D3DXCompileShaderFromFile instead of using FXC.exe

Those 2 functions are statically linked into the applcation and can not be upgraded without relinking the app.

DemoCoder · Jul 19, 2003

demalion said:
BTW, that bold statement isn't quite accurate. It was 5 paragraphs of text discussing your excuse for skipping my discussion about the HLSL issue.

Demalion, why is it that unique to you, and no other person on this board, you frequently accuse people of skipping or ignoring your discussion? Could it be that your discussions are often a) not relevent b) too verbose and roundabout or c) poorly written

I know of no other person on this board who complains as frequently as you do about people eliding parts of your messages. It seems whenever someone catches you in a quandry, you fall back on verbose passages about how they misquoted you.

DemoCoder · Jul 19, 2003

demalion said:
I take it you mean the failure of one single static compiler that doesn't evolve when a VLIW architecture does? The problem with your example is that you are proposing that what HP and Intel is doing for the Itanium is a proof with regard to GPUs, with a widely divergent workload, and where you haven't bothered to establish that the VLIW evolution will be the same for GPUs.

Demalion, your escape clause in all these arguments is the constant refrain "but it's a GPU". Compiler theory was developed against the background of abstract computing models on automata and idealized computing models. If you pick up most compiler books, it won't even mention any actual CPUs in the real world. The problems specific to VLIW scalability with static compilation are inherent and not architecture specific. If you create an FXC compiler profile specific attuned to say, a pipeline with 3 functional units, it will produce code that is not optimal for a pipeline with 5 functional units. Everytime new architecture is modified, the static compiler will have to be updated and all games will have to be recompiled. In fact, GPUs exacerbate the highlighted problems but being even more aggressively parallel in their scaling than convention CPUs.

That's why the the driver has to do instruction selection. The only issue is whether or not there is enough information in the DX9 assembly to make such decisions. As a human being, we can always look at an instruction stream and reverse engineer much of its original meaning, but compilers are a different story. The mere altering of a datastructure completely alters the kinds of optimizations that can be easily found by optimization algorithms.

Compilers are ultimately nothing more than pattern matchers, and refactoring the pattern can alter what you "recognize". Even in mathematics, you might be able to prove that two number systems are isomorphic and identical, but the types of things you can recognize and do in each space are radically different (e.g. time vs frequency domain)

DX9 assembly is not neccessarily the representation most conducive to allowing the driver to do the best optimizations. The result will be that fewer optimizations are found, and more difficult work has to be done by the driver developers to find them.

You seem to ignore that failing to compile is part of the original expression, and that this goes hand in hand with instruction count limitations that do not parallel CPU compilation.

Why would the compiler fail to compile deliberately if it could produce a transformation which fits within resource limits? Seems absurd. If I have a function call nesting > 4 levels deep on ps3.0 or 1 level deep on ps_2 extended, the compiler will have to inline any further calls.

f(g(h(i(j(x))))) should compile fine on ps3.0 and even ps2.0 even though DX9 can't support it in the assembly via "call" instructions.

Only if the compiler is incapable of expressing them in the LLSL in a form useful to the LLSL->assembly compiler (not a given, details above).

Why do you think I prefer the term "LLSL" instead of "assembly", anyways? I mention this to illustrate why I'm excising your explanation of this.

I just showed you an example. And please, it is not a low level shading language and it is not an intermediate representation. It is a assembly language based off of DX8 legacy and before HLSL was invented.

No, I'm suggesting that MS could have the compiler change behavior based on predicate and branch support reported, and that the approaches that the profiles represent need not be unique to every new hardware.

So you expect Microsoft will have to maintain dozens if not more compiler switches for every conceivable combination of architectural support themselves and that they will have to frequently update this uber compiler with patches and send it to developers so that they can RECOMPILE their GAMES and distribute patches to all the gamers out there?

my example said:
if(a < x) { b = 10; }
else { b =20; }

Well, for the predicate, a setp_lt,if_pred,mov,else,mov,endif (sticking to one component, and using branching, atleast in LLSL expression) or setp_lt,mov,(p)mov (allowing per component execution of your idea, if necessary) seem possible.

For branching (without predicate), if_lt,mov,else,mov,endif works, correct?

For lrp, how would you best set the interpolation register to each end of the range (0,1 I'm assuming)?

Why not count up the instruction slots used by your solution, compared to an slt/lrp or sub/cmp/lrp. As the author of an optimizer, which are you going to choose, and how are you going to code it

Now, AFAICS, you're saying the compiler would pick one approach (that may resemble or completely fail to resemble my outlines), and that this one, in some particular case (more complex than your example), would be beyond the scope of the driver compiler for the LLSL to implement efficiently in the architecture.

I'm saying that if the compiler chooses slt/lrp, it will be more difficult for the driver to reconstruct all the basic blocks that existed in the original source code. Likewise for predicates. The only solution is to force the compiler to always turn any conditional into a branch to preserve the original control flow graph, but again, because of resource limits, it is not always able to do this.

What application is this, and why is patching it unacceptable? This case is not something that would happen to every single application, even if you wouldn't be patching them with new hardware.

Let's say there are 100 games out there using HLSL. And every month, like clockwork, ATI, NVidia, 3DLabs, IMGTEC, and other companies find new ways to squeeze performance optimizations out of their designs, or release new HW. Just like driver updates, there will be a steady stream of compiler updates (as there is with GCC). Zealous gamers who always like to have maximum performance will eagerly download what? Continuously updated patches for every game in the their library, assuming that all game publishers will rigorously remain in sync with IHVs and publish patches everytime an update comes out?

Tell me demalion, would you argue for a "grand unified driver model" in which Microsoft ships all drivers for every card in one big uber-library file that developers statically link-in at compile time? A statically linked driver would in fact deliver higher performance, but at a massive cost to maintainability.

So sure, it is possible to continuously ship updates for a single static compiler that supports N profiles, but it is a bad idea for two reasons

#1 lots of overhead in distributing the improvements to users and developers
#2 "profiles" do not neccessarily represent the best medium for device drivers to work off of.

How Cg favors NVIDIA products (at the expense of others)

DemoCoder

Entropy

DemoCoder

demalion

DemoCoder

DemoCoder

Slides

demalion

demalion

DemoCoder

DemoCoder

archie4oz

ea_spouse is H4WT!

DemoCoder

archie4oz

ea_spouse is H4WT!

demalion

demalion

demalion

Colourless

Monochrome wench

DemoCoder

DemoCoder

Similar threads