HLSL 'Compiler Hints' - Fragmenting DX9?

DemoCoder said:
The way I look at it, this issue is similar to the issue of ICD vs MCD OpenGL drivers.

Judging from your parallel, it seems to me that this similarity is rather superficial and inaccurate. One casualty of that superficiality: the DX HLSL compiler is not accurately characterized as "one-size-fits-all", as per the thread topic.

As you use it, your parallel seems to boil down to "one set of hurdles has been overcome in the past, so talking about any other set of hurdles can be skipped over". Neither this proposition nor your usage of the MCD and static linking analogy seems to make sense.

...But the fact of the matter is, IHVs already have to include compilers and optimizers in their DX9 drivers that deal with "DX9 intermediate representation" (if you subscribe to demalion's view), or my view, which is DX9 assembly which must be translated and optimized at run time.

The MCD discussion you provide does no more to address my commentary than, for example, saying "it didn't rain yesterday" somehow argues against the idea of taking an umbrella out today, without regard to investigating the forecast.

Why do you refer to static linking as somehow automatically precluding optimization opportunities? Doesn't that depend on the compiler that is actually statically linked? Don't you even refer specifically to the compiler in the drivers that IHVs are working on? "That's why MCDs were bad, and it's why static compilation/linking is bad" is your summary of why you propose that the DX HLSL optimization will be lacking, and the entire idea of the driver optimizer contradicts that...especially as you discuss it as representing "90%" of the work necessary for a glslang implementation.

Moving past the MCD/ICD static linking parallel, and it seeming contradicted by your "90% already done" proposal, in the latter you seem to be arguing that IHVs have to write (and have written) "a compiler" anyway, so they might as well write a more complex compiler and an additional higher level front end offering advantages over DX HLSL. My problem with that is that it seems completely predicated on any and all challenges for this being trivial, and on there being a built-in guarantee of reaching optimization levels DX HLSL, profiles, and the "LLSL" compilers will not.

Either way, they already have to do 90% of the work of a full compiler: register allocation, instruction scheduling, instruction selection, etc. TODAY.

This proposition about how "TODAY" they are doing "90%" seems to overlook a few things mentioned:

First...IIRC (?), the last time we discussed this you suggested using a common HL/intermediate front end among all vendors, which would save IHVs time. This seems an advantage over DX HLSL if: 1) the intermediate is more optimal for varying architectures than the DX HLSL "assembly"/profile methodology as it evolves, 2) it accomplishes optimization at least as well as what the HLSL compiler does (which you characterize as "10%" of the work?) on a similar release time frame.

Second...explaining away any advantage to IHVs of the DX approach as down to a separation of "good IHVs" and "crappy IHVs" seems to propose that more IHVs actually having a good starting baseline with the HLSL optimizations is somehow not desirable, simply because DX HLSL "good performance" 'obviously' can't possibly be as good as what glslang "WILL" offer (as DX HLSL is 'obviously' a "one-size-fits-all" compiler that doesn't optimize for unique architectures).

This seems bass ackwards to start with: more IHVs starting at a known, already developed optimized representation is not bad for consumers, IHVs, or developers (other than, perversely, from a more "Console" oriented perspective as you mentioned, where one IHV would be ideal). An established and significant performance advantage achieved by glslang implementations would indeed be a way to show an advantage over DX HLSL in accomplishing that, but you propose that advantage as a given, sight unseen, based on what appears to be a distorted representation of DX HLSL.

The front-end of the compiler: parsing, intermediate optimizations, etc can and WILL likely be provided as reusable open source in DDKs.

Sure, but what will that compiler deliver? "When" is also a pretty important question, and I'll point out that it matters to consumers and to the success of glslang as well; therefore so do the issues affecting it along the way, and what they mean to IHVs, their consumers, and what glslang delivers, regardless of whatever other hurdles have been overcome before.

3DLabs is already shipping free code people can use to write the frontend part of the compiler. Nvidia ships a Cg frontend which can deal with DX9 HLSL syntax too.

So, how do you think the intermediate optimizations of these compare to HLSL? That would be an interesting discussion that would work towards answering the concerns I propose...some actually pertinent reference to the details of glslang status instead of some parallel to MCD/ICD and static linking.

One comment that stood out to me:

Personally, if OpenGL2.0 ICDs are harder to write, but yield more optimization opportunities, well, tough luck to IHVs without good driver dev teams.

"tough luck to IHVs"? What about "tough luck to glslang", or is an alternative to success impossible? I really hope (and expect) that the ARB is viewing things a bit differently.

About your "static linking" discussion:

And the idea that you have to RELINK applications to get performance from driver/compiler updates is ludicrous and an unworkable enduser distribution scheme.
...

There seems to be an abundance of things that are (still) wrong with this discussion, AFAICS:

As mentioned earlier: you propose there is "90%" of compiler work that IHVs have already done, and that work, the driver back end for DX HLSL, ships with the driver and is not statically linked. Even though I don't "see" the percentage you quote, I do consider whatever "percentage" there is sufficient to make your discussion here fail to resemble DX HLSL from the beginning.

"Front end", or "HLSL->intermediate", compiler updates are not required to happen every month, just because you decree static linking as ludicrous and unworkable..

Updating the "back end" does not necessitate updating the "front end" (relating to why I characterize the DX "assembly" as a LLSL or intermediate), and your entire "everytime a new Detonator of Catalyst driver came out, you'd have to download patches for all of your games to any improvement" seems like a rather "ludicrous" hyperbole.

The fact that a future upgrade to a driver (or compiler) can break some game's pixel shaders is an issue for the IHV's QA department.
...
And an issue for developers, depending on the luck of the "blame roulette" the consumer plays. And an issue for the ARB and OpenGL if steps aren't taken to minimize how much work IHVs have to do for this. You seem to decree that only IHVs have to worry about it, and that how OpenGL compares to DirectX, or how one game compares to another, as a result of such issues, are factors that simply will not matter to anyone at all.

...They have to deal with regression testing ANYWAY since even simple changes to an OGL driver can break older games. This should not hold back runtime compilation in the driver.

Any new shared code can exhibit regressions. This is not an argument against shared code.

Your reasoning in support of this seems to consist of simply stating that "only IHVs have to worry about" any extra difficulties: developers will ignore all the "crappy" IHVs who have issues when deciding on an API (they're crappy, so sales to their customers don't count?), and therefore it won't represent a challenge the ARB will have to spend any effort addressing in order to realize the potential of glslang and further the success of OpenGL.

I don't understand the premise from the start, as IHVs sit on the ARB...I'm pretty sure this would be a focus of the language development effort, and not just something that can be considered the problem of "IHV QA departments", and irrelevant as a hurdle for language implementation outside of dividing IHVs between "crappy" and "good".

...

I really don't understand the apparent resistance to the idea of recognizing that DX HLSL isn't a "pushover" for glslang. That isn't the same as saying it can't be "toppled" (which, in fact, is exactly what I hope glslang does); rather, it seems to me to be looking directly at the issues involved with achieving that, instead of simply looking away. Hopefully, someday soon, this will be from the perspective of marveling at what was achieved to deliver a robust and efficient glslang implementation.
 
While I mostly agree with DemoCoder, demalion has one very important point:

While the glslang approach in all likelihood will offer better optimization opportunities, the question is "when", i.e. how long it will take for the IHVs to get usable compilers into their drivers. If this takes too long, then a lot of the impact of the gl shading language will be lost as developers move over to Direct3D HLSL for their shading needs.

However, I personally think that PS 2.0 is too low level for an intermediate representation; the register limitations and the fact that the float vector is the only data type really are pretty low level and tied to the two current mainstream architectures.
 
Demalion, there are two separate issues. One is the fact that FXC should be a COM component invoked at runtime, not build time. This solves the problem of MS DirectX updates requiring all developers in the universe to ship relinked versions of their apps.

The second issue is writing the compiler itself. If we accept your philosophy that DX9 assembly IS an intermediate representation, then it is clear that the drivers MUST have compiler backends in them which can convert intermediate representation into low level code (VLIW, etc), unless of course you propose IHVs create hardware which runs DX9 bytecodes directly.
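To make "compiler backend" concrete, here is a toy sketch of one such pass: co-issuing an independent vec3 op and a scalar op into a single slot, the way hardware with split vec3/scalar units allows. The Op/Slot types and the greedy adjacent-pairing rule are invented for illustration, not any vendor's actual backend.

Code:
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Op {
    std::string text;                  // e.g. "mul r0.xyz, v0, c0"
    bool scalarOnly;                   // writes a single component only
    std::set<std::string> reads, writes;
};

struct Slot { const Op* vec = nullptr; const Op* sca = nullptr; };

// Two ops may share a slot only if neither touches the other's outputs.
static bool independent(const Op& a, const Op& b) {
    for (const std::string& w : a.writes)
        if (b.reads.count(w) || b.writes.count(w)) return false;
    for (const std::string& w : b.writes)
        if (a.reads.count(w)) return false;
    return true;
}

static std::vector<Slot> coIssue(const std::vector<Op>& ops) {
    std::vector<Slot> slots;
    for (size_t i = 0; i < ops.size(); ++i) {
        Slot s;
        if (ops[i].scalarOnly) s.sca = &ops[i]; else s.vec = &ops[i];
        // Greedily fold in the next op if it fills the other unit and is independent.
        if (i + 1 < ops.size() && ops[i + 1].scalarOnly != ops[i].scalarOnly &&
            independent(ops[i], ops[i + 1])) {
            if (ops[i + 1].scalarOnly) s.sca = &ops[i + 1]; else s.vec = &ops[i + 1];
            ++i;
        }
        slots.push_back(s);
    }
    return slots;
}

int main() {
    std::vector<Op> ops = {
        { "mul r0.xyz, v0, c0", false, { "v0", "c0" }, { "r0" } },
        { "rcp r1.w, c1.w",     true,  { "c1" },       { "r1" } },
        { "add r2.xyz, r0, r1", false, { "r0", "r1" }, { "r2" } },
    };
    std::vector<Slot> slots = coIssue(ops);
    std::cout << ops.size() << " ops scheduled into " << slots.size() << " slots\n";
}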

IHVs are already writing compiler backends, and one aspect of this is that they haven't done a good job, hence NVidia's inability to "deoptimize" compiler generated temporaries and minimize register usage.

The compiler frontend, the part that does parsing, semantic analysis, conversion to IR, and IR based optimizations, only needs to be written once as a DDK and can then be customized by IHVs. It will exist, it only takes time. OpenGL2.0 is just emerging from the draft board for christsakes.

Today it doesn't exist, but this is exactly the situation years ago when ICDs had to be written from scratch. Remember how long it took for the first full ICDs to come out?

OpenGL is doing things the right way, as they have always done. It is easier to program, easier to comprehend, but requires more work by IHVs for the developers and endusers to benefit.

That is the power of abstraction. Sometimes it makes the implementation harder, but it makes life easier for the users of the API.

OpenGL2.0 has already decided this is how it will be done. The people who made this decision (full compiler inside driver), are the very people who have to bear the brunt of the increased driver development cost: The IHVs on ARB.

Doesn't it strike you as strange that the people who have to develop the drivers disagree with you?
 
GameCat,

So your prior post was directed at mine? Your discussion of "PS 2.0" is confusing to me, and it seemed to be directed at something else.

GameCat said:
...
However, I personally think that PS 2.0 is too low level for an intermediate representation; the register limitations and the fact that the float vector is the only data type really are pretty low level and tied to the two current mainstream architectures.

There are several aspects of this that seem erroneous or a contradiction to me, and perhaps you can clarify:
  • "PS 2.0" isn't all of DX HLSL. The profile system is as well, as is PS 2.0 extended and PS 3.0 as the DX HLSL evolves for hardware in the near future.
  • Do you mean "PS 2.0" register limitations? What are you proposing is insufficient? What it seems to me is hardware register limitations are still the primary issue for shaders, so I don't see how "PS 2.0" and higher register limitations are an issue. You perhaps mean some limitation besides count? If so, it would be helpful if you specified what in particular.
  • Your commentary on "float vectors is the only data type" seems backwards as a reason to propose it being "tied" to low level, as datatypes with less ranges/precisions and components "fit into" them. Adding more specification for less components and less precision are less abstract, and more low level. What PS 2.0 is "tied to" is a higher minimum...contrast it to PS 1.4 and lower.

EDIT: clarification
 
DemoCoder said:
Demalion, there are two separate issues. One is the fact that FXC should be a COM component invoked at runtime, not build time. This solves the problem of MS DirectX updates requiring all developers in the universe to ship relinked versions of their apps.

It is actually better than the "problem" we have right now for implementing game performance improvements, because targeting the "LLSL" and the "90% of the work of glslang" you propose for IHVs is not statically linked. Your self contradiction is mind boggling, as is this bogeyman of "DX HLSL will make you have to download patches every month!".

But we agree there are two different issues...the problem is you seem to have depended on the ridiculous aspects of this bogeyman to propose the validity of the second issue.

The second issue is writing the compiler itself. If we accept your philosophy that DX9 assembly IS an intermediate representation, then it is clear that the drivers MUST have compiler backends in them which can convert intermediate representation into low level code (VLIW, etc), unless of course you propose IHVs create hardware which runs DX9 bytecodes directly.

Sure, that's exactly what I propose! Obviously, when I challenged your characterization of "90%", I could only be proposing "0%", so tackling my viewpoint as if that is what I mean is obviously not a waste of time. :oops:
Please consider "Even though I don't 'see' the percentage you quote, I do consider whatever 'percentage' sufficient to make your discussion here fail to resemble DX HLSL from the beginning" again, and note that proposing the "percentage" as "sufficient" does not lend itself to interpreting my comments as proposing "0%".

IHVs are already writing compiler backends, and one aspect of this is that they haven't done a good job,

"Not optimal" and "not a good job" are not synonymous.

hence NVidia's inability to "deoptimize" compiler generated temporaries and minimize register usage.

That's "not a good job" because their hardware is deficient and requires lots of work to run well at all. Please note, however, HLSL's evolution towards addressing that problem on its end, which you seem to insist on maintaining it cannot do.

In any case, this "not a good job" doesn't necessarily speak to other IHVs, though generally proposing them as "not optimal" in a naive implementation seems pretty reasonable.

The compiler frontend, the part that does parsing, semantic analysis, conversion to IR, and IR based optimizations, only needs to be written once as a DDK and can then be customized by IHVs.

And, I ask again, what is the baseline of optimization, conflicts, and bugs that this will establish? It is not a complicated question, it's just an unanswered one. If you have an answer, please share.

It will exist, it only takes time. OpenGL2.0 is just emerging from the draft board for christsakes.

I'm still trying to ponder how you think this argues against what I'm stating? Why is OpenGL 2.0 "just emerging" from the draft board? How is your "only takes time" different than the time I'm proposing is cause for concern? Will DX HLSL and the profile system stop evolving? Or is it simply that you propose it will need to evolve "every month" for some reason?

Now on to an actual discussion about the MCD/ICD parallel:

Today it doesn't exist, but this is exactly the situation years ago when ICDs had to be written from scratch. Remember how long it took for the first full ICDs to come out?

The set of advantages and challenges in the respective comparisons are not the same, and the end result is not pre-ordained by substituting one set for the other.

OpenGL is doing things the right way, as they have always done.

I think it can be said that OpenGL is doing things the potentially better (IMO) way, but taking longer. The problem is that the goal OpenGL has in mind isn't all there is to "the right way"; "the right way" also includes actually offering the capability, the issues that come with it, and when it is offered.

It is easier to program, easier to comprehend, but requires more work by IHVs for the developers and endusers to benefit.

In isolation, I think this comment has merit, and I think I agree with it. However, I don't think it is a universal view, or one that validly precludes viewing DX, in its current incarnation, as easier. Also, neither does it speak to what the DX HLSL situation will look like when glslang is offering its benefits to endusers and developers.

That is the power of abstraction. Sometimes it makes the implementation harder, but it makes life easier for the users of the API.

Do I understand you correctly to be proposing somehow that DX HLSL is simply not an "abstraction", so is completely denied this "power of abstraction"?

OpenGL2.0 has already decided this is how it will be done. The people who made this decision (full compiler inside driver), are the very people who have to bear the brunt of the increased driver development cost: The IHVs on ARB.

You seem to be proposing that once you decide on something, achieving success with it, compared to what someone else decided to work toward, is a given?
This seems a pretty ludicrous precept, given that Microsoft has met the requirement of "deciding" as well, and, for an example from the "IHVs" we're discussing, that one company decided on the NV30 while another decided on the R300. What the people deciding these things are actually capable of achieving, and when, matters as well. Hopefully, these particular "people" have it well in hand for a relatively problem free release in the near future.

Doesn't it strike you as strange that the people who have to develop the drivers disagree with you?

Hmm...your using that precept as the linchpin of how you're showing how I'm wrong is what seems "strange" to me.

Trying to make sense of your question:
No, I don't think IHVs disagree with me about the challenges they face for glslang, and I don't think I disagree with them about the possible advantages. This should indicate why I'm not "struck by the strangeness of it".
 
demalion said:
It is actually better than the "problem" we have right now for implementing game performance improvements, because targeting the "LLSL" and the "90% of the work of glslang" you propose for IHVs is not statically linked. Your self contradiction is mind boggling, as is this bogeyman of "DX HLSL will make you have to download patches every month!".

Sorry, I don't see the contradiction. I propose making 100% of the compiler dynamic and under the control of the IHV. The fact that you can achieve partial dynamism today is not a self-contradiction. In fact, you can achieve 100% dynamism if you ship FXC with your game and invoke it via D3DXCompileShader*
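To be concrete, a minimal sketch of that (error handling trimmed, and the shader itself is a placeholder; in practice you would obviously cache the result rather than recompile every frame):

Code:
// Run-time HLSL compilation through D3DX instead of shipping prebuilt bytecode.
#include <cstring>
#include <d3d9.h>
#include <d3dx9.h>

static const char* g_hlsl =
    "sampler2D diffuseMap;\n"
    "float4 main(float2 uv : TEXCOORD0) : COLOR\n"
    "{\n"
    "    return tex2D(diffuseMap, uv) * 0.5f;\n"
    "}\n";

IDirect3DPixelShader9* CompileAtRuntime(IDirect3DDevice9* device, const char* profile)
{
    LPD3DXBUFFER code = NULL, errors = NULL;
    HRESULT hr = D3DXCompileShader(g_hlsl, (UINT)strlen(g_hlsl),
                                   NULL, NULL,       // no macros, no include handler
                                   "main", profile,  // entry point, target profile
                                   0, &code, &errors, NULL);
    if (FAILED(hr)) {
        if (errors) { OutputDebugStringA((const char*)errors->GetBufferPointer()); errors->Release(); }
        return NULL;
    }
    IDirect3DPixelShader9* ps = NULL;
    device->CreatePixelShader((const DWORD*)code->GetBufferPointer(), &ps);
    code->Release();
    return ps;
}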

But of course, it still prevents the poor IHV from making decisions early in the compile phase, and despite what you think, and as I have explained in the past, compiler front-ends share a lot of code in common, but that is not in contradiction to the fact that early phase optimizations are still a win. The code that implements the high level optimizations may be the same, but the heuristic which determines whether the optimization gets performed or not is architecture specific.

Go look up how inlining and cache collision optimization algorithms work, for example. (oh please, spare me the "oh, but they are not the same, those are CPU algorithms" anti-analogy Demalion response. Go look up register allocators then)

For example, all the code for parsing, type checking, semantic analysis, and even optimizations can be shared. However, which optimizations are switched on and how they are chosen is not always a decision which can be made STATICALLY and GLOBALLY based on some idea of a "profile" that applies to an entire category of hardware.

(sorry, can't even parse what you are trying to say)

That's "not a good job" because their hardware is deficient and requires lots of work to run well at all.

No. Just because they created an architecture that is register usage sensitive doesn't change the fact that a compiler could be written to take advantage of its strengths. You may as well complain that since standard compiled C or Fortran doesn't run well on massively parallel supercomputers, it is a fault of the hardware, and not the fault of the standard compiler not being aware of how to parallelize.

Likewise, it was Intel's fault that they designed "deficient" HW since taking advantage of SSE2 automatically requires a lot of work to run well.



And, I ask again, what is the baseline of optimization, conflicts, and bugs that this will establish? It is not a complicated question, it's just an unanswered one. If you have an answer, please share.

I don't know, but shared source is all over the place, in applications, compilers, MS's operating system, etc. Either you'll be stung by the bugs in MS's FXC (shared by everyone), or you'll be stung by bugs in some open source DDK for OGL2.0 driver writers. FXC offers no advantages, unless you somehow think that a CLOSED SOURCE compiler written by Microsoft will have fewer bugs than an open-source DDK that has thousands of developers examining it and building their own compilers.

Will DX HLSL and the profile system stop evolving? Or is it simply that you propose it will need to evolve "every month" for some reason?

So do you propose that profiles be created for each and every chipset in existence, and have them all maintained by Microsoft and shipped in DX9.0 updates (which is how new FXC versions get distributed)? So if Nvidia, ATI, 3dLabs, et al want to ship a new FXC with an updated profile to correct some optimizations, they will have to wait until Microsoft wants to ship another FXC update?


The set of advantages and challenges in the respective comparisons are not the same, and the end result is not pre-ordained by substituting one set for the other.

Classic Demalion. The response to any analogy is to claim "but they're not exactly the same." Carmack and 50+ other developers signed a petition to have MS ship the MCD devkit, because they were afraid that OpenGL drivers were too complex to implement.

Carmack continued, "The best thing Microsoft could do for the game community would be to release the Win95 OpenGL MCD framework to allow hardware vendors to easily build robust, full featured drivers. Without Microsoft's help, there will be several partial implementations to satisfy specific requirements, resulting in version problems and incompatibilities. The strengths of OpenGL are important enough that it is going to be used one way or another, but life would be so much better if Microsoft cooperated."

This is exactly the argument that others have posted in this forum: OpenGL2.0 drivers that have a full compiler embedded in them will be more complex, more buggy, more incomplete, etc. Hence the analogy to the cries that ICDs were too difficult.

Now, you can cry about how writing a compiler and writing an ICD are two entirely different things, but that's just you playing critic about something you have no idea about. Personally, as someone who has implemented many compilers, I find compilers MUCH easier to write than a large device driver like an OpenGL ICD.

Compiler writing has a large body of study behind it. There are thousands of papers and hundreds of books on it. It is taught to undergraduate computer science majors. It is a well understood task, and you can find many people who know how to do it.

On the other hand, writing device drivers is not an academic exercise. It is an exercise in engineering based on specific experience, a black art that few people have experience in. There are far more people who can write compilers than people who can do OpenGL guy's job. He's got it much tougher.

The problem with OpenGL2.0 isn't the compiler, it's all the other stuff they added IMHO.


In isolation, I think this comment has merit, and I think I agree with it. However, I don't think it is a universal view, or one that validly precludes viewing DX, in its current incarnation, as easier.

OpenGL is preferred by the majority of game developers. It is easier to use, easier to understand, and self-consistent. DirectX is a mishmash of crappy Windowisms, and lots of legacy holdovers from 9 different versions of the API. While DirectX9 fixed a lot of stuff, it still takes way fewer lines of OpenGL code to accomplish something, and the code is easier to understand. The only reason DX still exists and wasn't killed is because of MS. Just a few years ago, game developers were petitioning MS to drop DX and throw all effort into OGL.



Do I understand you correctly to be proposing somehow that DX HLSL is simply not an "abstraction", so is completely denied this "power of abstraction"?

DX is an abstraction, but it still presents a view of the HW as pointers to HW memory structures, even if that is not really the case, it's initialization heavy, and is based on passing around C-Structures. OpenGL treats the gfx HW as an abstract state machine. OpenGL objects are referenced by handles instead of memory pointers. OpenGL is procedural, and you build up the "scene" by streaming instructions to the hardware. DirectX is based on creating objects and structures, setting up a bunch of struct fields, and then calling methods to send these structs to the HW. It's just much harder and more annoying to use. Hundreds of lines of initialization code.



This seems a pretty ludicrous precept, given that Microsoft has met the requirement of "deciding" as well

Microsoft is pragmatic. Rather than spend time architecting the most elegant solution, they pick the low hanging fruit and ship it early. This is the case for all their software. IE1.0/2.0. Windows 1.0/2.0/3.0. DirectX1.0/2.0/3.0..../... Windows CE 1.0/2.0/3.0/... (PocketPC, etc)

That's why Microsoft software takes on average 4-5 revisions before people acknowledge it as stable and of useful quality.

If you look at the discussions that go on at the ARB in the meeting notes, and compare them to the internal DX9 beta program, it is clear that Microsoft's prerogatives are not to design a clean, elegant solution, but to get something out as quickly as possible to get the early market lead.

ARB members are far more concerned about getting things right and NOT DAMAGING THE API with hacks.
 
DemoCoder said:
You mean like going with an interim release and not waiting for 2.0? Elaborate if different.

The problems started well before the 1.5/2.0 talks.

First of all, they release a common vertex program api, which is great. You can target ati and nvidia vertex programming in one api, as the hardware is basically the same. Then they release a fragment program api, which is great, but too high level for some hardware (gf3/8500), and not quite right for others (geforcefx - considering the api is basically ATI_fragment_program). I haven't looked at all the implications of the new shading language and apis, but there are a few issues in there (nothing major, but still).

I understand that the fragment hardware was not the same (as each other), and therefore could not have been as well supported as the vertex functionality, but there could have been a good solution (D3D found one). Backwards compatibility from current generation hardware (nv3x, r3xx) is quite a task.

I'm quite happy with opengl 1.5 - no it did not need to be 2.0.
 
DemoCoder said:
Sorry, I don't see the contradiction. I propose making 100% of the compiler dynamic and under the control of the IHV. The fact that you can achieve partial dynamism today is not a self-contradiction. In fact, you can achieve 100% dynamism if you ship FXC with your game and invoke it via D3DXCompileShader*
Which still wouldn't help, as when Microsoft releases a new compiler target, you'd still need to patch the game to get any benefit. What DX9 HLSL needs, at the very least, is a "default profile" option that is meant to be used for runtime-compiled shaders, and is set by the drivers. This way simply updating the video drivers may update the optimization.

At the same time, I feel it's just silly for runtime compiling, which I feel is vastly superior to what DX9 HLSL offers, to even bother with a standard assembly language.

Anyway, my main argument for giving IHVs full access to the compiler is that it's always going to be easier to optimize if one starts from a higher level.
 
There are several aspects of this that seem erroneous or a contradiction to me, and perhaps you can clarify:

* "PS 2.0" isn't all of DX HLSL. The profile system is as well, as is PS 2.0 extended and PS 3.0 as the DX HLSL evolves for hardware in the near future.
* Do you mean "PS 2.0" register limitations? What are you proposing is insufficient? What it seems to me is hardware register limitations are still the primary issue for shaders, so I don't see how "PS 2.0" and higher register limitations are an issue. You perhaps mean some limitation besides count? If so, it would be helpful if you specified what in particular.
* Your commentary on "float vectors is the only data type" seems backwards as a reason to propose it being "tied" to low level, as datatypes with less ranges/precisions and components "fit into" them. Adding more specification for less components and less precision are less abstract, and more low level. What PS 2.0 is "tied to" is a higher minimum...contrast it to PS 1.4 and lower.

I was talking about ps 2.0 as a low level intermediate language to send to the driver. The points I made are AFAIK valid for ps 3.0 as well.

Registers: I find it pretty silly for the DirectX assembly to expose "registers" which obviously don't map very well to actual hardware registers. It just makes the driver people's job more difficult. They should have just had an infinite number of registers like in ARB_fragment_program and allowed the driver to barf on the shader if the register limit was exceeded. Register coloring isn't *that* hard.
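For reference, the core of a linear-scan style allocator really is only about a screenful of code. A simplified sketch (live intervals are assumed precomputed, and the spill choice is deliberately naive):

Code:
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Interval { int vreg, start, end; int assigned = -1; bool spilled = false; };

// Map an unbounded set of virtual registers onto numPhysRegs hardware temporaries.
void linearScan(std::vector<Interval>& intervals, int numPhysRegs) {
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });
    std::vector<Interval*> active;                    // intervals currently holding a register
    std::vector<int> freeRegs;
    for (int r = 0; r < numPhysRegs; ++r) freeRegs.push_back(r);

    for (Interval& iv : intervals) {
        // Release registers whose intervals ended before this one starts.
        for (auto it = active.begin(); it != active.end();) {
            if ((*it)->end < iv.start) { freeRegs.push_back((*it)->assigned); it = active.erase(it); }
            else ++it;
        }
        if (!freeRegs.empty()) {
            iv.assigned = freeRegs.back(); freeRegs.pop_back();
            active.push_back(&iv);
        } else {
            iv.spilled = true;                        // out of registers: spill (naively)
        }
    }
}

int main() {
    // Four virtual registers competing for two hardware temporaries.
    std::vector<Interval> ivs = { {0, 0, 3}, {1, 1, 2}, {2, 2, 5}, {3, 4, 6} };
    linearScan(ivs, 2);
    for (const Interval& iv : ivs)
        std::cout << "v" << iv.vreg << " -> "
                  << (iv.spilled ? std::string("spill") : "r" + std::to_string(iv.assigned)) << "\n";
}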

Data types: Enforcing four component float vectors is kind of bad because it doesn't really match what the hardware does. The r300 has a separate scalar and vec3 unit for example, which might be easier to optimize for if scalars were supported by the intermediate language. Similar optimization opportunities exist for booleans and ints, that's why they're in the HLSL. It would be more future proof if they were in the intermediate representation as well, to avoid specifying a new one every time new hardware comes around.

Of course, there were lots of people smarter than me that worked on HLSL and I'm sure they are aware of all of the above, they just thought getting implementations and drivers out fast was more important.
 
DemoCoder said:
demalion said:
It is actually better than the "problem" we have right now for implementing game performance improvements, because targeting the "LLSL" and the "90% of the work of glslang" you propose for IHVs is not statically linked. Your self contradiction is mind boggling, as is this bogeyman of "DX HLSL will make you have to download patches every month!".

Sorry, I don't see the contradiction. I propose making 100% of the compiler dynamic and under the control of the IHV. The fact that you can achieve partial dynamism today is not a self-contradiction.

On the one hand, you state that HLSL requires statically linked upgrades to "all your programs" to get improvements when you download a new driver.

On the other hand, you state that IHVs have already done "90%" of the work necessary for implementing a glslang compiler by implementing the driver side optimizer for DX "LLSL".

If they've done "90%" of the work for their driver side optimizer, why do you have to "download patches for all your programs" to get the performance improvements from the driver. I have the same question with other numbers that are less than "90%", but not "0%".

In fact, you can achieve 100% dynamism if you ship FXC with your game and invoke it via D3DXCompileShader*

Yes, I remember; I observed that in the last discussion, among other things. Are you proposing this explains something you're trying to say, or just bringing it up? What does that asterisk mean?

But of course, it still prevents the poor IHV from making decisions early in the compile phase, and despite what you think, and as I have explained in the past, compiler front-ends share a lot of code in common, but that is not in contradiction to the fact that early phase optimizations are still a win. The code that implements the high level optimizations may be the same, but the heuristic which determines whether the optimization gets performed or not is architecture specific.

Hey, wouldn't it be neat if you had different "thingies" for changing compiler behavior, based on introducing such architecture specific heuristics as necessary? You could call them "profiles" or something, and then the question becomes whether the "poor IHVs" or MS would find optimizations in a more timely fashion. Pity it took so long to discuss that, as this was pointed out in my initial post listing the factors. :-?
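For instance, if I remember the D3DX helpers correctly, a game compiling at run time doesn't even have to hardcode the target: it can ask for the best profile the installed device supports (a sketch; naturally this only covers profiles that exist at the time):

Code:
#include <d3d9.h>
#include <d3dx9.h>

// Pick the compile target at run time instead of baking one into the build.
// D3DXGetPixelShaderProfile returns the highest pixel shader profile the
// device supports (e.g. "ps_1_4", "ps_2_0", or a vendor-oriented "ps_2_a").
const char* ChoosePixelProfile(IDirect3DDevice9* device)
{
    const char* profile = D3DXGetPixelShaderProfile(device);
    return profile ? profile : "ps_2_0";   // fall back to the baseline target
}

Feed that string into D3DXCompileShader, and whether the output improves over time is then down to MS shipping updated FXC/profiles rather than to patching the game.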

Go look up how inlining and cache collision optimization algorithms work, for example. (oh please, spare me the "oh, but they are not the same, those are CPU algorithms" anti-analogy Demalion response. Go look up register allocators then)

Cache collision reminds me of some of the principles discussed in Nick's vertex buffer thread, which would seem to be most directly relevant to texture load type instructions if I understand you correctly. We discussed inlining and register allocation in our last discussion, though at the time you said you didn't feel like continuing; I found it interesting...has that changed?

OK, so how do these things answer any of the points I brought up?

For example, all the code for parsing, type checking, semantic analysis, and even optimizations can be shared. However, which optimizations are switched on and how they are chosen is not always a decision which can be made STATICALLY and GLOBALLY based on some idea of a "profile" that applies to an entire category of hardware.

What is the connection between the above and this conclusion? The problem with your analogies is they fail to establish it.
"Not always" consists of "sometimes, and sometimes not". What is the "sometimes not" that "WILL" cause HLSL performance to be lacking?

(sorry, can't even parse what you are trying to say)

Do you mean "statically linked"? For my usage, "build time" inclusion is "static linking" and "run time" inclusion is the result of "dynamic linking".

That's "not a good job" because their hardware is deficient and requires lots of work to run well at all.

No. Just because they created an architecture that is register usage sensitive doesn't change the fact that a compiler could be written to take advantage of its strengths.

DC, stuff like this just doesn't make sense. You proposed that IHVs did a "bad job". I provided a reason why it seems a valid characterization for one particular IHV's compiler: because their hardware needed it badly. This relates to Cg as well, which was a compiler written to take advantage of the strengths of particular hardware, and the IHV failed at it in comparison to HLSL.

The problem was your insisting that all IHVs did a bad job with HLSL as a premise for proposing that IHVs will be universally better served by glslang if they simply weren't "crappy". It doesn't make sense to then go off on a rant about how I'm being ludicrous for accusing nVidia of doing a bad job when I'm just pointing out the difference between nVidia and other IHVs in the context of your assertion, and that the distinction is associated with demonstrably bad decisions.

You may as well complain that since standard compiled C or Fortran doesn't run well on massively parallel supercomputers, it is a fault of the hardware, and not the fault of the standard compiler not being aware of how to parallelize.

Likewise, it was Intel's fault that they designed "deficient" HW since taking advantage of SSE2 automatically requires a lot of work to run well.

Hmm...well, let's see why I have such a problem with certain analogies: If a massively parallel supercomputer offered less performance than a single CPU when running "standard compiled C", and we ignored the "LLSL" compilation issue completely for the moment, your analogy would have some relation to correcting my characterization of "deficient hardware". However, I daresay that this would rather clearly still be "deficient", even if called a "supercomputer", if code optimized for it still compared unfavorably for performance to the alternative. Which is the situation we were actually talking about when I mentioned "deficient" hardware.

Or it could be that I'm against all analogies, and not just ones used to change the conversation direction at points inconvenient for someone who doesn't want to discuss problems with what they said. I don't happen to think so, though.

BTW, as far as your second analogy, it would be more like comparing 3dnow! to SSE, I think, at least as far as getting rid of the obvious problems.

And, I ask again, what is the baseline of optimization, conflicts, and bugs that this will establish? It is not a complicated question, it's just an unanswered one. If you have an answer, please share.

I don't know, but shared source is all over the place, in applications, compilers, MS's operating system, etc. Either you'll be stung by the bugs in MS's FXC (shared by everyone), or you'll be stung by bugs in some open source DDK for OGL2.0 driver writers.

This looks familiar. Almost like I discussed something similar before. If I ask you to read my initial discussion of the comparison to the approaches again, or otherwise point out that this was already a topic of my conversation before your analogies steered away from it, will this illustrate the problem with them more clearly?

FXC offers no advantages, unless you somehow think that a CLOSED SOURCE compiler written by Microsoft will have fewer bugs than an open-source DDK that has thousands of developers examining it and building their own compilers.

I invite you to address the elements of my initial post, where I itemize and list something other than "no advantages", in some way besides proposing an MCD/ICD parallel as a substitute for actually discussing them. That way, when you go on to propose there are "no advantages" as a premise of your argument again, analogies like those you've used might be something other than simply a mechanism for proposing your viewpoint in a vacuum of dissenting discussion.

This concept isn't complex, just wholly ignored and sacrificed on the altar of your arguing from the standpoint of "I'm right, why should I bother to actually pay attention to what you say?".

Will DX HLSL and the profile system stop evolving? Or is it simply that you propose it will need to evolve "every month" for some reason?

So do you propose that profiles be created for each and every chipset in existence, and have them all maintained by Microsoft and shipped in DX9.0 updates (which is how new FXC versions get distributed)?

No, and though I'm disappointed that you've forgotten the answers I gave back in July when we discussed this last, I am relatively glad you asked here (because it isn't a flawed analogy, and actually deals with the content of my commentary...can we keep this up, please?).
I propose that more profiles be created as necessary to reflect a LLSL expression suitable for "intermediate->GPU opcode" optimisation by IHVs. How many will that be? Well, that depends on the suitability of the LLSL and profile characteristics itemized at that time.

Your alternative proposes that a new and unique type of compilation paradigm will be 1) introduced every month, and 2) unable to be resolved by a back end optimizer working from the "shorter instruction count" principle of the intermediary, or whatever other characteristics prior profiles have.

1) seems absolutely ludicrous, and I think I've specified why.

2) depends on how many different approaches are needed for what IHVs do. We have 2 at the moment, one being "shorter operation itemization count, and executing within the temp register requirement without significant slow down". I think this is a pretty generally useful intermediate guideline that more than one IHV should be able to utilize. You seem to propose that every single IHV product introduction will be an exception to that such that a new set of guidelines will be required in accordance with 1).

So if Nvidia, ATI, 3dLabs, et al want to ship a new FXC with an updated profile to correct some optimizations, they will have to wait until Microsoft wants to ship another FXC update?

Maybe if you stop going on about "every month" we can discuss that possibility in a reasonable fashion. If you can do that, please note that I provided an answer when I discussed the advantages and disadvantages initially. For future reference, please don't have a discussion with someone who thinks HLSL is perfect and call them by my handle. :-?

Along the lines of my prior discussion, I propose that IHVs should already be communicating with MS about taking these matters into account, and their profile decisions should reflect such considerations. If they don't, glslang will have an opportunity to display a significant competitive advantage, if it overcomes the challenges it faces while DX HLSL is at a disadvantage, and IHVs find implementing a custom compiler less problematic than communicating with MS.
What about that was unclear?

The set of advantages and challenges in the respective comparisons are not the same, and the end result is not pre-ordained by substituting one set for the other.

Classic Demalion.

Please stop arguing against my handle/name, and argue against my discussion.

The response to any analogy is to claim "but they're not exactly the same." Carmack and 50+ other developers signed a petition to have MS ship the MCD devkit, because they were afraid that OpenGL drivers were too complex to implement.

Unless the petition involved the details of the issues about the challenges for compilation, replacing a discussion with this analogy still seems problematic for actually evaluating the current situation. If you weren't busy throwing around "typical Demalion", perhaps you'd have noticed that that is what I'd just said the first time.

...petition...

This is exactly the argument that others have posted in this forum: OpenGL2.0 drivers that have a full compiler embedded in them will be more complex, more buggy, more incomplete, etc. Hence the analogy to the cries that ICDs were too difficult.

Yes, and that is a superficial analogy, because whether people are arguing about one approach being too complex or prone to bugs compared to another is not the issue; the issue is whether it actually will be, and why. That's why I deride your insistence on it, and try to encourage discussions that don't depend on its substitution. How is my initial observation about the analogy incorrect?

Now, you can cry about how writing a compiler and writing an ICD are two entirely different things, but that's just you playing critic about something you have no idea about.

So writing an ICD and writing a compiler are the same thing, then? I'm missing something...how does accusing me again change the relationship between the two things?

Personally, as someone who has implemented many compilers, I find compilers MUCH easier to write than writing a large device driver like OpenGL ICD.

How about writing a compiler that has to interact directly with all the elements of a large device driver like the OpenGL ICD?

Compiler writing has a large body of study behind it. There are thousands of papers and hundreds of books on it. It is taught to undergraduate computer science majors. It is a well understood task, and you can find many people who know how to do it.

On the other hand, writing device drivers is not an academic exercise. It is an exercise in engineering based on specific experience, a black art that few people have experience in. There are far more people who can write compilers than people who can do OpenGL guy's job. He's got it much tougher.

What are the challenges in this "black art"? I do think maintaining API compliance with complex interaction is part of it. Which actually seems to point out that a compiler in the driver increases the complexity of the driver writing challenge, and doesn't simply replace it with the "ease" of the challenge of writing a compiler, and actually seems to support the things you accuse people of "crying" about.

However, I understand the separate relevance to the proposition that writing a compiler as optimal as Microsoft's should be easy, at least as far as a standard compiler starting point goes. First, this does not seem to have been demonstrated to be true so far, but we seem to have agreed that this was a "bad job" by nVidia. What I continue to not see, as I specified initially, is how a compiler can both resolve such conflict issues, and be freely and easily changed to suit individual architectures more than HLSL can, without re-introducing them as challenges. Hey, maybe we can actually discuss that now?

The problem with OpenGL2.0 isn't the compiler, it's all the other stuff they added IMHO.

Err...maybe "all the other stuff" is related to what I'm talking about? How is it that ARB decisions are self-evidently right when you agree with them, and irrelevant when you view them as a problem? :oops:

In isolation, I think this comment has merit, and I think I agree with it. However, I don't think it is a universal view, or one that validly precludes viewing DX, in its current incarnation, as easier.

OpenGL is preferred by the majority of game developers.

Hmm...OK, this doesn't seem to be demonstrated at the moment. It reminds me more of comments about DX 7 and earlier than about DX 9. However, I'm glad if this is actually so, as OpenGL needs factors like this to counter the effect of trailing DX HLSL as much as it has.

It is easier to use, easier to understand, and self-consistent. DirectX is a mishmash of crappy Windowisms, and lots of legacy holdovers from 9 different versions of the API. While DirectX9 fixed a lot of stuff, it still takes way fewer lines of OpenGL code to accomplish something, and the code is easier to understand. The only reason DX still exists and wasn't killed is because of MS.

Yeah, monopolies suck that way, eh...?
I think MS has evolved DX in response to OpenGL's strengths. I really hope glslang makes a strong showing, as it is important that such an alternative continue to apply pressure in this regard if DX evolution isn't just going to stagnate and ossify. Similarly, I believe OpenGL's focus on one shader expression instead of just continuing the extension system was hastened to respond to the centralized DX implementation methodology and the advantages it offers for developers and ease of implementation for various hardware.

Just a few years ago, game developers were petitioning MS to drop DX and throw all effort into OGL.

Err...a lot changes in a "few years", so I don't quite get the relevance of this. However, if we're still talking about the first part of my statement, where I agreed, and not the second, that doesn't matter much.

Do I understand you correctly to be proposing somehow that DX HLSL is simply not an "abstraction", so is completely denied this "power of abstraction"?

DX is an abstraction, but it still presents a view of the HW as pointers to HW memory structures, even if that is not really the case, it's initialization heavy, and is based on passing around C-Structures. OpenGL treats the gfx HW as an abstract state machine. OpenGL objects are referenced by handles instead of memory pointers. OpenGL is procedural, and you build up the "scene" by streaming instructions to the hardware. DirectX is based on creating objects and structures, setting up a bunch of struct fields, and then calling methods to send these structs to the HW. It's just much harder and more annoying to use. Hundreds of lines of initialization code.

From discussions of exactly this that I recall here, I don't think this view is quite universal, though I do tend to agree with the gist of your opinion.

This seems a pretty ludicrous precept, given that Microsoft has met the requirement of "deciding" as well

Microsoft is pragmatic. Rather than spend time architecting the most elegant solution, they pick the low hanging fruit and ship it early. This is the case for all their software. IE1.0/2.0. Windows 1.0/2.0/3.0. DirectX1.0/2.0/3.0..../... Windows CE 1.0/2.0/3.0/... (PocketPC, etc)

Yes, but you're answering why MS has an advantage, not addressing what I actually had a problem with: "You seem to be proposing that once you decide on something, achieving success with it, compared to what someone else decided to work toward, is a given?"

That's why Microsoft software takes on average 4-5 revisions before people acknowledge it as stable and of useful quality.

Well, what if the competition doesn't even show up until after that occurs? Hopefully the programming advantages you mention will work to counter as much of that eventuality as manifests.

If you look at the discussions that go on at the ARB in the meeting notes, and compare them to the internal DX9 beta program, it is clear that Microsoft's prerogatives are not to design a clean, elegant solution, but to get something out as quickly as possible to get the early market lead.

Not sure things are quite that simple, as it would seem DX has evolved to be cleaner as it has progressed...so it would seem to be part of what MS is considering. Yay competition.

ARB members are far more concerned about getting things right and NOT DAMAGING THE API with hacks.

Well, I don't think that such concern automatically means they don't do that, unfortunately, and that still leaves me wondering whether you considered the rest of my discussion around the comment you're addressing.
 
GameCat said:
There are several aspects of this that seem erroneous or a contradiction to me, and perhaps you can clarify:

* "PS 2.0" isn't all of DX HLSL. The profile system is as well, as is PS 2.0 extended and PS 3.0 as the DX HLSL evolves for hardware in the near future.
* Do you mean "PS 2.0" register limitations? What are you proposing is insufficient? What it seems to me is hardware register limitations are still the primary issue for shaders, so I don't see how "PS 2.0" and higher register limitations are an issue. You perhaps mean some limitation besides count? If so, it would be helpful if you specified what in particular.
* Your commentary on "float vectors is the only data type" seems backwards as a reason to propose it being "tied" to low level, as datatypes with less ranges/precisions and components "fit into" them. Adding more specification for less components and less precision are less abstract, and more low level. What PS 2.0 is "tied to" is a higher minimum...contrast it to PS 1.4 and lower.

I was talking about ps 2.0 as a low level intermediate language to send to the driver. The points I made are AFAIK valid for ps 3.0 as well.

Registers: I find it pretty silly for the DirectX assembly to expose "registers" which obviously don't map very well to actual hardware registers.

How so? The NV3x is what seems to have the problems, and that seems to be because its register limitation is very low. As I said, if you mean other than count, please clarify.

It just makes the driver people's job more difficult. They should have just had an infinite number of registers like in ARB_fragment_program and allowed the driver to barf on the shader if the register limit was exceeded. Register coloring isn't *that* hard.

I think I understand your concern with register itemization...do you mean that they should just retain the abstraction of variables in the LLSL and allocate an infinite set of registers to them for compilation? Perhaps along the lines of this discussion?

I think this relates to glslang being intended as the API final representation, whereas DX intends to continue to expose the "assembly" to continued evolution as well. I don't think this hurdle is insurmountable within a combination of the profile and LLSL specification systems.

Data types: Enforcing four component float vectors is kind of bad because it doesn't really match what the hardware does. The r300 has a separate scalar and vec3 unit for example, which might be easier to optimize for if scalars were supported by the intermediate language.

But generating the intermediate language representation isn't the only place optimizations can occur, and even if it were, there is component masking, and perhaps swizzling.

Similar optimization opportunities exist for booleans and ints, that's why they're in the HLSL. It would be more future proof if they were in the intermediate representation as well, to avoid specifying a new one every time new hardware comes around.

Modern CPUs have more use for booleans and bytes than GPUs at the moment, yet AFAIK they don't actually have boolean and byte registers and calculation units. The NV30 transistor count and performance, in comparison to both the NV35 and R300, seem to indicate to me that GPUs aren't likely to have a future in that either, though if you have some reasons for disagreement I'm interested in hearing them. As it is, that idea looks to me like it is useful for something like hinting for predication, which can be expressed in the LLSL specification already.

Of course, there were lots of people smarter than me that worked on HLSL and I'm sure they are aware of all of the above, they just thought getting implementations and drivers out fast was more important.

I don't think it was quite that simple, though I do think it was one factor. I'd be interested in your thoughts on what I mentioned. HLSL specifications for convenience of reference.
 
demalion said:
If they've done "90%" of the work for their driver side optimizer, why do you have to "download patches for all your programs" to get the performance improvements from the driver. I have the same question with other numbers that are less than "90%", but not "0%".

Quite obviously, because there is a limit to how much can be done to optimize DX9 "LLSL" in the driver, and therefore improvements have to be made in the compiler itself. If this statement were not true, then why would the FX profile even need to be created? The fact is, NVidia's driver could only do so much on the PS2.0 FXC output, and therefore needed a special hack to the compiler. The fact that new profiles have to be consistently created is an admission that LLSL does not have the representational power to allow the driver to do the best optimizations.


Now, there are three questions. #1: Does NVidia have access to the FXC source and FX profile heuristics, or does Microsoft have to maintain this code? #2: How often can NVidia ship updates to the compiler? And #3: If the compiler is updated, how will the improvements be realized in games?



What does that asterisk mean?
Because there is more than one function that begins with that prefix.



Hey, wouldn't it be neat if you had different "thingies" for changing compiler behavior, based on introducing such architecture specific heuristics as necessary?

Yeah, and wouldn't it be nice if these compiler options were constant and didn't have to change based on runtime state? Wouldn't it be nice if LLSL worked as well for optimization in the driver as you continue to assert, even though you have no apparent experience writing either compilers or drivers, and can offer no explanation as to why LLSL + profiles remains the best solution for extracting maximum driver performance?


OK, so how do these things answer any of the points I brought up?

Because some heuristics can be best decided at runtime or after profiling. Based on driver state, some drivers might even perform a kind of "just in time" compilation, recompiling shaders based on other factors, such as bandwidth usage. OpenGL drivers do this today by deciding whether to allocate memory in video ram or AGP memory based on utilization statistics. A "heuristic" hardcoded into the driver with no regard to actual runtime factors would not be as optimal.

An example would be a library call to normalize(), which may evaluate to a cube map lookup, a slow but accurate macro, or a fast but error-prone macro, based on what other pipeline state exists.
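To make that concrete, here is a minimal C++ sketch of the kind of decision a driver-side compiler could make when lowering normalize(). Everything here (PipelineState, chooseNormalize, the thresholds) is invented for illustration, not taken from any real driver:

    // Hypothetical sketch: pick a lowering for normalize() based on runtime state.
    enum class NormalizeLowering { CubeMapLookup, FastApprox, FullPrecision };

    struct PipelineState {
        bool  highPrecisionHint;  // the app asked for full-precision results
        bool  samplerFree;        // a spare texture unit could hold a normalization cube map
        float aluLoad;            // estimated ALU utilization, 0..1
    };

    NormalizeLowering chooseNormalize(const PipelineState& s) {
        if (s.highPrecisionHint)
            return NormalizeLowering::FullPrecision;   // exact rsq + mul sequence
        if (s.samplerFree && s.aluLoad > 0.8f)
            return NormalizeLowering::CubeMapLookup;   // trade a texture fetch for ALU work
        return NormalizeLowering::FastApprox;          // quick macro, small error
    }

The point isn't the particular thresholds; it's that the best choice depends on state the driver only knows at runtime.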

DC, stuff like this just doesn't make sense. You proposed that IHVs did a "bad job". I provided a reason why it seems a valid characterization for one particular IHV's compiler: because their hardware needed it badly.

Well, it sounded to me like you were saying that the blame for NVidia's poorly optimizing drivers lies with their hardware sucking, rather than with their driver developers having a difficult task.

My point was, when SSE was first introduced, compilers did a very poor job at autovectorization. It took a long time before we saw software automagically benefit (without being hand-coded for SSE). You might claim Intel's Pentium 4 was "deficient" compared to the Athlon (which was beating it at the time), but now, with better compilers, P4s routinely see very healthy increases from SSE utilization.

Future GPUs might have bizarre architectures that don't fit what FXC/PS2.0 was designed around today. OpenGL2.0 provides more semantic awareness to the driver, and more freedom for optimizations, something which might allow "deficient" HW which runs old software poorly to later, as compiler technology advances, reach its full potential.




Or it could be that I'm against all analogies, and not just ones used to change the conversation direction at points inconvenient for someone who doesn't want to discuss problems with what they said.

Or it could be that you couch every statement in terms of doubt and "maybes" and half-assertions, and you never state anything of importance. What can anyone learn from any of your discussions of LLSL and GPU compilers? Have you added anything to the discussion besides "well, maybe glslang won't do as advertised, I think, perhaps, but I have no information to offer"?

Really.



If I ask you to read my initial discussion of the comparison to the approaches again, or otherwise point out that this was already a topic of my conversation before your analogies steered away from it, will this illustrate the problem with them more clearly?

Why don't you succinctly state it here?



I propose that more profiles be created as necessary to reflect an LLSL expression suitable for "intermediate->GPU opcode" optimisation by IHVs. How many will that be? Well, that depends on the suitability of the LLSL and profile characteristics itemized at that time.

Well, why don't you tell us how you would evaluate the suitability of LLSL? Many people, not just me, have already told you the shortcomings. And further, I have just told you that runtime or profile-based optimization (wherein the developer runs the game, collects profile info from the driver, and then reruns the compilation, feeding back the profile information) could do even better.

And just as ATI and NVidia do driver-based game detection, no doubt they would also be able to do compiler-heuristics tweaking on a per-game basis as well. Point being, your proposal that developers and end users remain hostage to a centrally managed uber-compiler from Microsoft has serious problems, and you don't seem to even partway get it.



Your alternative proposes that a new and unique type of compilation paradigm will be 1) introduced every month

No, my position is that N developers will ship M driver updates per year, yielding N*M updates to the game community, today. I expect M compiler updates from each of those N developers as well, since IHVs will be discovering tweaks all the time which yield improved shader performance.

It's not a "new and unique" compilation paradigm. It's this: HL2 gets released, and the following month ATI and NVidia discover simple heuristic tweaks which yield, say, a 5-10% performance boost. Every time a new game comes out, they might need to issue an update. Or 3dMark2004 comes out, etc.


2) unable to be resolved by a back-end optimizer working from the "shorter instruction count" principle intermediary, or whatever other characteristics there are in prior profiles.

2) depends on how many different approaches are needed for what IHVs do. We have 2 at the moment, one being "shorter operation itemization count, and executing within the temp register requirement without significant slowdown". I think this is a pretty generally useful intermediate guideline that more than one IHV should be able to utilize.

You've been given umpteen examples of why LLSL as it exists today is not a good representation format for these optimizations. NVidia couldn't resolve the issue in their driver alone because of this. And overriding principles (shortest register count, shortest instruction count) are too simplistic to capture everything that needs to be done, because they are competing goals.

The most optimal code is not necessarily at the extremes (shortest actual shader, or shader with the fewest registers used), but somewhere in between, and finding the global minimum is extremely hard.
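A toy cost function makes the tension obvious. This C++ sketch is purely illustrative; the weights and the register budget are made up:

    // Toy cost model: fewest instructions and fewest live registers compete.
    struct ShaderCandidate {
        int instructionCount;
        int liveRegisters;
    };

    double estimatedCost(const ShaderCandidate& c, int registerBudget) {
        double cost = c.instructionCount;          // base: one unit per op
        if (c.liveRegisters > registerBudget)      // exceeding the budget stalls the pipe
            cost *= 1.0 + 0.5 * (c.liveRegisters - registerBudget);
        return cost;
    }

Re-rolling a computation to save a register adds instructions; unrolling to save instructions adds live registers. The minimum usually sits between the two extremes, and searching for it is the hard part.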



Yes, and that is a superficial analogy, because whether people are arguing about one approach being too complex or prone to bugs compared to another is not the issue; the issue is whether it actually will be, and why. That's why I deride your insistence on it, and try to encourage discussions that don't depend on its substitution. How is my initial observation about the analogy incorrect?

Because your comments add nothing and history is relevant. The people claiming that implementation difficulty is a problem lost the argument. Likewise, those claiming compiler implementation problems will be proven wrong as well.

You have provided not a single argument as to why ATI or NVidia will have a harder time implementing a compiler on their own hardware, which they have intimate knowledge of, than a compiler implementation by proxy, in which NVidia and ATI have to spill confidential IP to MS and hope that, through many exchanges, MS developers, who are trying to maintain a compiler for N different architectures, will a) pay attention to them and b) implement it properly as an "outsourced" activity.

You can appeal to authority and say "MS compiler developers are just that good", but it's a very weak argument. No one knows who NVidia and ATI and 3dLabs have hired to work on OGL2.0, and frankly, expertise in Visual C++ compilers doesn't necessarily translate to GPU compilers, to apply some classic demalion analogy-doubting.




How about writing a compiler that has to interact directly with all the elements of a large device driver like the OpenGL ICD?

Well, if it's a static compiler, it doesn't have to "interact" with squat; it's just a DLL that the driver delegates shader compilation to. The only "interaction" is in how you bind the registers to OGL state, and how you upload the code to the GPU. But these are the same issues that ARB_{vertex|fragment}_program have to deal with.

If it's dynamic runtime compilation, it's more difficult, but if you think this is what driver developers have to worry about, then you are tantamount to admitting that FXC's static approach, which ignores runtime state, is insufficient.


What are the challenges in this "black art"? I do think maintaining API compliance with complex interaction is part of it.

Writing a compiler from scratch is easier than writing an ICD. Adding a compiler into an ICD isn't that complex, since, as I stated, the compiler doesn't need to "interact" with anything. You call Compile(source), the compiler in the driver gets invoked, compiles to native code, and then the driver must upload it to VRAM and bind OGL variables to it. That is the only "interaction", and it is in fact the same as what needs to be done today.

The compiler only needs to "interact" with OGL during compilation if it is doing some kind of runtime optimization, which, as we all know from your defense of LLSL as sufficient, you claim isn't necessary.


OpenGL is preferred by the majority of game developers.

Hmm...OK, this doesn't seem to be demonstrated at the moment.

Why not take a poll on B3D and ask which API developers here think is easier to use, cleaner, and which one they would use in ideal circumstances if Microsoft's market power didn't dictate DX's use.
 
demalion said:
As it is, that idea looks to me to be useful for something like hinting for predication, which can already be expressed in the LLSL specification.

No, ints and booleans are useful for other reasons, for example looping and indexing. Some architectures have special loop registers, others do not. Loop counters only need to be integers in the vast majority of cases, which means fewer transistors need to be used, and there's less latency on loop math.

How many loop registers? Architectures will differ. OpenGL2.0 makes no assumptions. DirectX assumes one or zero registers. DX assumes loops can't be nested, etc.

As a result, the DX compiler is FORCED to inline loops that perhaps shouldn't be inlined. It might be forced to "spill" the loop register so it can be reused. OGL2.0 makes no assumptions on loop nesting, register availability, or register count. It also doesn't assume what "type" a loop register is.
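As a rough C++ sketch of what that forcing looks like inside a compiler back end (the Instruction type and the helper are hypothetical, not FXC's actual internals):

    #include <vector>

    struct Instruction { /* opcode, operands, ... */ };

    // If the target model has no usable loop register, the front end must
    // replicate the body at compile time, growing the shader and possibly
    // blowing instruction limits.
    std::vector<Instruction> lowerLoop(const std::vector<Instruction>& body,
                                       int tripCount,
                                       bool targetHasLoopRegister)
    {
        std::vector<Instruction> out;
        if (targetHasLoopRegister) {
            out = body;   // would be wrapped in REP/ENDREP (or LOOP/ENDLOOP) here
        } else {
            for (int i = 0; i < tripCount; ++i)               // forced full unroll
                out.insert(out.end(), body.begin(), body.end());
        }
        return out;
    }

A compiler that targets the hardware directly can leave that decision to the driver, which knows what the chip actually supports.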

As a result, it is future-proof. DX isn't, and even after VS/PS3.0 it will still need evolution. OGL2.0's direct compilation (no LLSL intermediary) means the shaders will automatically work on future hardware to the best degree possible.

With DX, future-generation hardware drivers may be hobbled by being fed shaders compiled for devices which had no looping, or one level of nesting, or one register, etc. Two years from now, someone playing HL2 will have their drivers being fed PS2.0 or PS3.0 shaders, even if their underlying hardware is a PS5.0 equivalent with uber flexibility. OpenGL2.0 frees IHVs to expose features in their hardware without requiring assembly language extensions or API updates. MS will continue to require updates to their LLSL past 3.0.
 
DemoCoder said:
My point was, when SSE was first introduced, compilers did a very poor job at autovectorization. It took a long time before we saw software automagically benefit (without being hand-coded for SSE). You might claim Intel's Pentium 4 was "deficient" compared to the Athlon (which was beating it at the time), but now, with better compilers, P4s routinely see very healthy increases from SSE utilization.

Future GPUs might have bizarre architectures that don't fit what FXC/PS2.0 was designed around today. OpenGL2.0 provides more semantic awareness to the driver, and more freedom for optimizations, something which might allow "deficient" HW which runs old software poorly to later, as compiler technology advances, reach its full potential.
No, my position is that N developers will ship M driver updates per year, yielding N*M updates to the game community, today. I expect M compiler updates from each of those N developers as well, since IHVs will be discovering tweaks all the time which yield improved shader performance.
No one expects a compiler to achieve 100% efficiency. A compiler's goal is to achieve as much efficiency as possible across a broad range of possible programs.
It's not a "new and unique" compilation paradigm. It's this: HL2 gets released, and the following month ATI and NVidia discover simple heuristic tweaks which yield, say, a 5-10% performance boost. Every time a new game comes out, they might need to issue an update. Or 3dMark2004 comes out, etc.
Personally, I find this idea absurd. If you already have an optimizing compiler, chances are you won't get such dramatic improvements. You said that Pentium 4s saw large improvements once compilers got better at using SSE. Well that doesn't apply here because the compiler should have been "SSE" aware all along.
You have provided not a single argument as to why ATI or NVidia will have a harder time implementing a compiler on their own hardware, which they have intimate knowledge of, than a compiler implementation by proxy, in which NVidia and ATI have to spill confidential IP to MS and hope that, through many exchanges, MS developers, who are trying to maintain a compiler for N different architectures, will a) pay attention to them and b) implement it properly as an "outsourced" activity.
So now you want to require each IHV to provide their own HLSL compiler? IHVs already have to create such a beast for OpenGL 2.0, because the API resides in the driver itself. In this case, the API is not inside the driver. Why should a new API require an IHV to commit more people to driver development? I think this is a step backwards. More IHVs with more compilers will mean more bugs. Application A works around bug B in an IHV's compiler, then IHV C has to work around the application's incorrect behavior. Or the developer is completely unaware that bug B exists and thinks this is proper behavior. I see this already with assembly-level code because of broken HW or drivers, and I don't think it will get better. I also don't think this is good, because having the HLSL's behavior potentially change on each driver release will cause confusion among ISVs.
Writing a compiler from scratch is easier than writing an ICD.
What are you basing this on? Have you written an optimizing compiler from scratch? Have you written an ICD?
Adding a compiler into an ICD isn't that complex, since, as I stated, the compiler doesn't need to "interact" with anything.
So that means it's not complex? No, that doesn't follow at all.
Why not take a poll on B3D and ask which API developers here think is easier to use, cleaner, and which one they would use in ideal circumstances if Microsoft's market power didn't dictate DX's use.
How much of that response would be good, old-fashioned bias, whichever way the poll's outcome went?
 
Ignoring all the pros and cons of both sides for a moment, think for a minute about the practicality of JIT shader compilation.

It isn't too absurd, nowadays, to hear of full shader tree recompiles taking 30 minutes to an hour. Is that really something you'd want to have to go through every time you launch an app? Even if that could be drastically reduced, is it worth all that effort for a few percent of performance increase?
 
OpenGL guy said:
Personally, I find this idea absurd. If you already have an optimizing compiler, chances are you won't get such dramatic improvements. You said that Pentium 4s saw large improvements once compilers got better at using SSE. Well that doesn't apply here because the compiler should have been "SSE" aware all along.

CPU compilers don't spend as much time on optimizations as they could, because compile times are an issue, and in a typical CPU application 90% of the execution time is spent in 10% of the code, so there is a limit to how much difference generalized improvements can make.

With GPUs, shaders are hotspots, critical sections. Optimizations are far more important. With CPUs, a 5% generalized improvement won't necessarily affect the runtime unless it happens to hit the hotspots. With GPUs, shaving 5% off shaders is more likely to affect the overall game speed.

Couple that with the fact that IHVs will likely be adding in "detection" for some games, e.g. "if game == HL2 and shader == water, then set subexpression_elimination_threshold = 0.8" to tweak performance. C compilers have been consistently updated for the last decade.
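As a hypothetical C++ sketch of that kind of per-application tuning (the executable names and the tuning knobs are invented for illustration only):

    #include <string>

    struct CompilerHeuristics {
        double subexpressionEliminationThreshold = 0.5;
        bool   aggressiveUnrolling               = false;
    };

    CompilerHeuristics tuneForApplication(const std::string& exeName) {
        CompilerHeuristics h;                            // sensible defaults
        if (exeName == "hl2.exe") {
            h.subexpressionEliminationThreshold = 0.8;   // e.g. favors the water shaders
        } else if (exeName == "3dmark.exe") {
            h.aggressiveUnrolling = true;
        }
        return h;
    }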

The "full employment theorem for compiler writers" is true.

I also don't think this is good, because having the HLSL's behavior potentially change on each driver release will cause confusion among ISVs.
But it will, just as drivers change behavior on each release, sometimes with regressions. Just look at NVidia's.

Writing a compiler from scratch is easier than writing an ICD.

What are you basing this on? Have you written an optimizing compiler from scratch? Have you written an ICD?

Yes, I have written several compilers. Compiler theory is a field of academic study with 30 years of research behind it, which is taught in colleges around the world and has numerous published papers on how to do it. Any books, college courses, or PhD papers on ICD authoring?

Despite the fact that I have actual experience writing compilers, if I did not, it would not be difficult to learn. Tools exist to assist novices in generating parsers from specifications, and even in generating ASTs. From ASTs, there exist well-known algorithms for generating IR and performing optimizations on the graph. And from IR, there are well-known algorithms published for generating the final machine code. In fact, there are "step by step" compiler implementation books available for neophytes.
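For a sense of scale, here is one of those textbook transformations (constant folding over a tiny expression IR) sketched in C++. It's a teaching toy built on invented types, not anyone's production pass:

    #include <memory>

    struct Expr {
        enum Kind { Const, Add, Mul } kind;
        double value = 0;                       // valid when kind == Const
        std::unique_ptr<Expr> lhs, rhs;         // valid for Add / Mul nodes
    };

    // Classic bottom-up constant folding: fold children first, then collapse
    // an operator node whose operands are both constants.
    void foldConstants(Expr& e) {
        if (e.kind == Expr::Const) return;
        foldConstants(*e.lhs);
        foldConstants(*e.rhs);
        if (e.lhs->kind == Expr::Const && e.rhs->kind == Expr::Const) {
            e.value = (e.kind == Expr::Add) ? e.lhs->value + e.rhs->value
                                            : e.lhs->value * e.rhs->value;
            e.kind = Expr::Const;
            e.lhs.reset();
            e.rhs.reset();
        }
    }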

If you have never learned how to write a Fibonacci heap, doing so would simply be a matter of consulting the literature. If you have never programmed an ICD, the best you can currently do, at least from my research, is to download something like MesaGL and look at the internals. Basically, you have the OpenGL spec itself, and you have sample code, and that appears to be it.

Fact is, ICD writing, at least from my point of view, is a matter of on-the-job experience. Compiler writing is a blend of public knowledge and job experience. Where can I go to learn how to write an ICD?


Adding a compiler into an ICD isn't that complex, since, as I stated, the compiler doesn't need to "interact" with anything.
So that means it's not complex? No, that doesn't follow at all.

Fine. Add it as an external DLL, like GLUT, and have it generate ARB_fragment_programs. Explain to me how this is a difficult job for you. Seems to me that it could be accomplished with hardly any changes to the actual driver at all. I could overlay an OpenGL2.0 wrapper on your driver today which delegates compilation to an external lib.

Of course, this is the inelegant "hack" implementation, but perhaps you can explain why it is complex.
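A rough C++ sketch of that hack, assuming a hypothetical compileToARBfp() entry point in the external compiler library, and assuming the ARB_fragment_program entry points are already resolved (e.g. via GLEW):

    #include <string>
    #include <GL/glew.h>

    // Hypothetical external compiler: high-level shader source in,
    // ARB_fragment_program text out.
    extern std::string compileToARBfp(const std::string& highLevelSource);

    GLuint loadFragmentShader(const std::string& highLevelSource) {
        std::string arbfp = compileToARBfp(highLevelSource);  // delegate to the external lib

        GLuint prog = 0;
        glGenProgramsARB(1, &prog);
        glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, prog);
        glProgramStringARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                           (GLsizei)arbfp.size(), arbfp.c_str());
        return prog;   // the driver just sees an ordinary ARB program upload
    }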
 
Ilfirin said:
Ignoring all the pros and cons of both sides for a moment, think for a minute about the practicality of JIT shader compilation.

It isn't too absurd, nowadays, to hear of full shader tree recompiles taking 30 minutes to an hour. Is that really something you'd want to have to go through every time you launch an app? Even if that could be drastically reduced, is it worth all that effort for a few percent of performance increase?

I find the idea of hour-long shader compiles kind of absurd given the size and volume of shaders. Even the worst C compilers with full optimizations dish out tens of thousands of lines per second. It sounds like something is very wrong with the compilers used or the way the build is being done.

Even so, JIT compilation can be cached or done at build time. Shaders only need to be recompiled if the compiler itself is updated, or some other bit of dependent state changes. Even if you don't think it is a good idea, it still doesn't invalidate the fact that future hardware still needs the shaders recompiled, and OpenGL2.0 has the compiler built in.
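A minimal C++ sketch of that caching, keying compiled shaders on the source text plus the compiler version so recompilation only happens when one of them actually changes (the types and names are hypothetical):

    #include <functional>
    #include <string>
    #include <unordered_map>

    struct CompiledShader { std::string nativeCode; };

    class ShaderCache {
    public:
        const CompiledShader& get(const std::string& source,
                                  const std::string& compilerVersion,
                                  const std::function<CompiledShader(const std::string&)>& compile)
        {
            std::string key = compilerVersion + '\0' + source;   // new compiler => new key
            auto it = cache_.find(key);
            if (it == cache_.end())
                it = cache_.emplace(key, compile(source)).first; // compile once, reuse after
            return it->second;
        }
    private:
        std::unordered_map<std::string, CompiledShader> cache_; // could be persisted to disk
    };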
 