Which API is better?

Which API is Better?

  • DirectX9 is more elegant, easier to program
  • Both about the same
  • I use DirectX mainly because of market size and MS is behind it

  Total voters: 329
Joe DeFuria said:
DemoCoder said:
Are you listening, Joe DeFuria?

Yes I am.

I already said that the consequence of the DX model is not being able to get an "as optimal" translation vs. working from full source.

It's not just a question of "as optimal"; it's a question of the MS compiler DEOPTIMIZING code (e.g. CSE) for your platform.

How about the fact that if I ship drivers that don't do all that extra work needed to implement a DX asm optimizer, I will look bad in the marketplace and no one will buy my HW?

People selling the "ease" of writing DX9 drivers need to take a step back and look at the issues involved in handling DX9 input.
 
DemoCoder said:
How about the fact that if I ship drivers that don't do all that extra work needed to implement a DX asm optimizer, I will look bad in the marketplace and no one will buy my HW?

And this is different from GL how? (Except you're not doing work for DX ASM, but GL HLSL directly?)

How many times am I going to repeat that I agree you can get more optimal results with the GL model?

Optimal performance aside...are you going to argue that you believe it takes more work to get a DX9 HLSL driver up and running from scratch, vs. getting a GL 2.0 GLSLang driver up and running from scratch?

People selling the "ease" of writing DX9 drivers need to take a step back and look at the issues involved in handling DX9 input.

People selling the "superior optimizability" of GL drivers need to take a step back and look at the track record of the pace of development of the robustness of said drivers.
 
IMO, all this talk about vendors other than the top two is moot. The other vendors (if they even get hardware out in the first place) are generally going to suffer wrt performance and stability anyway. There haven't been any really serious hardware solutions in my mind other than nvidia or ati in the past few years that are truly acceptable. Why would you go with anyone else?
 
Uttar said:
Which is precisely why a developer thinking stuff like "on the NV3x, register usage has got to be minimal, so I'm ready to increase instruction count quite a bit" would most likely just be wasting his time writing a shader that's slower, or at least no faster, than the original one...

Yes, a good argument for why you should leave this to the driver to decide; not to the developer himself coding in assembly, and not to a third-party compiler such as in DX.
 
Joe DeFuria said:
And this is different from GL how? (Except you're not doing work for DX ASM, but GL HLSL directly?)

It's not. So the DX model isn't a gain in the long term; it's actually a loss. You need to write your optimizer anyway, but with GLSL you get more potential to take full advantage of your hardware.

Joe DeFuria said:
Optimal performance aside...are you going to argue that you believe it takes more work to get a DX9 HLSL driver up and running from scratch, vs. getting a GL 2.0 GLSLang driver up and running from scratch?

It may be faster to get a naive first implementation up and running. But what matters is the shape of the implementation by the time you release new hardware; anything before that doesn't matter. After the first base implementation of GLSL has been completed, I'm not so sure the DX way even gains you much when you release new hardware.

People selling the "superior optimizability" of GL drivers need to take a step back and look at the track record of the pace of development of the robustness of said drivers.

It has to collect a track record first before we can take a step back and judge it.
 
Humus said:
It's not. So the DX model isn't a gain in the long term; it's actually a loss.

So, the GL model is a loss in the short term, and potentially the DX model is a loss in the long term.

It may be faster to get a naive first implementation up and running.

GOOD. STOP THERE.

It has to collect a track record first before we can take a step back and judge it.

I'm talking about the analogous OpenGL ICD track record. The same arguments were made for it: it can be more optimal for the hardware, because vendors have more control over it.

The track record as I see it is that ICDs took a much longer time to "mature", and for the longest time ICDs were coded almost exclusively for Quake-based engines' performance and features, while the rest of the software developers had pot luck.
 
Joe DeFuria said:
It may be faster to get a naive first implementation up and running.

GOOD. STOP THERE.

Why the heck, why? It's a state that the consumers never see.

I'm talking about the analogous OpenGL ICD track record. The same arguments were made for it: it can be more optimal for the hardware, because vendors have more control over it.

The track record as I see it is that ICDs took a much longer time to "mature", and for the longest time ICDs were coded almost exclusively for Quake-based engines' performance and features, while the rest of the software developers had pot luck.

Well, I still disagree. It took a long time to get mature DX drivers and many API revisions. We have had this discussion already, and I don't see a point to bring it up again.
 
Humus said:
Why the heck, why? It's a state that the consumers never see.

First release of ATI's DX9 drivers?

Well, I still disagree. It took a long time to get mature DX drivers and many API revisions. We have had this discussion already, and I don't see a point to bring it up again.

So which is it....it took a long time for mature DX drivers, or we never see "immature / naive" drivers?
 
DemoCoder said:
John said:
Err, but when you get those instructions down at the driver, they're all part of the intermediate format; no information is lost there (well, other than the annoying bug with sincos).

A HLSL branch, depending on the code, can be written as MIN/MAX, CMP, LRP, IF, and IF_PRED. On some HW, it is more efficient to use CMP, on some, LRP/MIN/MAX, on others, branching, and perhaps others, predicates. The FXC compiler is forced to pick one of these, so let's say it picks CMP to implement HLSL if(cond).
A conditional should be left as a conditional, no argument there, although many of the combinations you list are _easy_ to convert back into something different (generally all but LRP).
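To make that concrete, here is a sketch of the same HLSL select landing as different ps_2_0-level sequences (register assignments illustrative, not actual fxc output):
Code:
// HLSL: r = (x >= 0) ? a : b;
// via CMP (dst = src0 >= 0 ? src1 : src2):
cmp r0, r1, r2, r3    // r1 = x, r2 = a, r3 = b
// via LRP (dst = src0*(src1-src2) + src2), given a 0/1 mask in r1:
lrp r0, r1, r2, r3    // mask 1 picks a, mask 0 picks b
The LRP form is the one that's hardest to recognize as a conditional after the fact, which is why it's the exception above.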

Well, CMP performs badly on NV30, so you are now asking their driver to take low level assembly, which has been inlined and reordered, and reverse engineer loop branches out of it.
Loop branches? Well, if you're talking about VS2.0, loops get left as loops; if you're talking about PS, then the NV30 doesn't support loops, so what is there to reverse engineer?

And IHVs are supposed to have an easier time developing DX9 drivers because of this? Why don't they just put a DECOMPILER back to SOURCE in there while they're at it?
If your HW is only really capable of supporting profile X, and that's all you attempt to expose, then it's a whole lot easier. Problems only start to arise when someone gets it wrong and then attempts to claim that something is something that it isn't.

Your answer to everything is "profiles, profiles, and more profiles!" Despite the fact that the number of profiles will have to grow quite large, Microsoft will have to maintain all of them, and it still doesn't remove the burden from the IHVs to write compilers to deal with DX9 assembly.
Profiles are a solution to a problem that you have repeatedly failed to address.
Actually, it's not that heavy duty to spot these types of optimisations, but yes, it is more work than if you had higher-level information available to you.
It's not that heavy duty if they remain relatively intact like I showed you above, but it will be hellishly difficult if the instructions get reordered through a scheduler, registers get packed, and some of those intermediate results are reused by the compiler.
The driver still has to create a DAG to do its own scheduling; this will often reveal the majority of opportunities that were not obvious from simple examination. The scheduling performed by the DX9 HLSL compiler shouldn't attempt to take account of latencies (afaik it doesn't) as it needs to be neutral; instead it only tries to make things fit wrt things like dependent-read chain limits as defined by the profile, and before you shout about it, there is a profile that doesn't include those limits.
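As an illustration (made-up register names, not any particular driver's internals): once the DAG is built, a value with two consumers shows up as a node with two out-edges, and a register-pressure-sensitive back end can choose to rematerialize it:
Code:
// token stream as received (fxc has already shared the subexpression):
add r0, rX, rY    // t = X + Y
mul r2, r0, rZ    // A = t * Z
mul r3, r0, rW    // B = t * W  <- second consumer of r0
// rewrite from the DAG, trading one extra ADD for one fewer live temp:
add r2, rX, rY
mul r2, r2, rZ    // A
add r3, rX, rY    // recompute X + Y rather than holding r0 live
mul r3, r3, rW    // B
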
Isn't that a bug with the HLSL compiler? It has no impact on the IM format.

Yes, but I am listing the inadequacies of the whole platform. And right now, Microsoft has one poorly optimizing compiler to rule them all.
The issue is actually that the MS HLSL compiler shouldn't be attempting many optimisations, as they can/should be performed inside the driver's optimiser. This is an issue that needs addressing.

MS expands any and all DX9 macros. MS fubars constant folding. And when will this be fixed?
All = some. When will things be fixed? Ask MS. When can I start writing GLSlang shaders that are guaranteed to work on all HW with some kind of undefined level of shader support (as required by the specification)?

And when people are playing HL2, will their SINCOS units be sitting idle?
Yes, probably quite idle, irrespective of their presence or the compiler used; I think you'll find that the majority of shaders in Half-Life 2 don't use sincos.

Basically none of these issues need lead to less efficient code, but yes, it does lead to some effort in writing the driver-based compiler, I won't deny it.
^^^^^^^^^^^^^^^^^^^^^^

Are you listening, Joe DeFuria?
Hmm, yes.
Of course, you're completely ignoring the HUGE difficulties created by having to support all features in GLSlang if you expose the extensions. This will only go away when HW that fully supports it is available, and then you're still left with what to do about the huge pile of legacy HW out there.

John
 
arjan de lumens said:
JohnH said:
What common sub-expressions are you talking about exactly?
Common subexpression elimination is a fairly common and well-known compiler optimization. Consider e.g. the following two statements:
Code:
A = B+C+D+5;
X = B+C+D+7;
In this example, the sub-expression B+C+D is common to the two statements, so you or the compiler can optimize the code by evaluating B+C+D only once:
Code:
temp = B+C+D;
A = temp+5;
X = temp+7;
which saves a couple of instructions on most platforms and is done by most popular compilers. The problem is that in this case you need to keep 3 variables (A, X, temp) instead of just 2 (A, X) to hold intermediate results, so this optimization generally will increase the number of temp variables or registers needed. So you have a choice: do it or not? Doing it may penalize the NV3x architectures hard because of the extra registers needed; not doing it will penalize the R3xx architectures hard because they are forced to execute redundant instructions. Doing it on an assembly-like intermediate representation is harder and more error-prone than doing it on an HLL parse tree, and is as such not a very good option.
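At the assembly level the trade-off looks roughly like this (a sketch; register numbering is illustrative, and the constants 5 and 7 are assumed to sit in constant registers c5 and c7):
Code:
// with CSE: 4 instructions, but the temp stays live across both results
add r2, rB, rC
add r2, r2, rD    // temp = B+C+D
add r0, r2, c5    // A = temp + 5
add r1, r2, c7    // X = temp + 7
// without CSE: 6 instructions, but one less temporary alive at a time
add r0, rB, rC
add r0, r0, rD
add r0, r0, c5    // A = B+C+D+5
add r1, rB, rC
add r1, r1, rD
add r1, r1, c7    // X = B+C+D+7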

Humus said:
JohnH said:
What common sub-expressions are you talking about exactly?

Uhm, pretty much any kind, unless the sub-expression is very large. Like this:

A = (X + Y) * Z;
B = (X + Y) * W;

On R300 this would optimally be done like this:

temp = X + Y;
A = temp * Z;
B = temp * W;

This adds an extra register, however, which is suboptimal on NV30. So instead it would be preferable to do:

A = X + Y;
A *= Z;
B = X + Y;
B *= W;

One instruction more, but less register usage. With the MS compiler making these decisions rather than the driver, one GPU will be at a disadvantage.

Edit: arjan de lumens beat me to it.

Arr. The easiest answer to that would be to just say don't develop HW with such bizarre issues, but that would be too easy, and not entirely fair.

To be honest, I can't disagree with you on this one (shame I left my cheque book at home), although it's partly down to the HLSL. My view is that the HLSL compiler should leave expressions within the original code as intact as the target profile allows. This would fix the problem, except the HLSL would then potentially start rejecting shaders that could perhaps be run, due to it thinking instruction limits have been reached; there are a couple of approaches that would fix this (that don't involve GLSlang or full driver-side compilation).

John.
 
Joe DeFuria said:
First release of ATI's DX9 drivers?

A naive implementation is what you've got when your first working driver is up and running, typically several months before you release the card.

Joe DeFuria said:
So which is it....it took a long time for mature DX drivers, or we never see "immature / naive" drivers?

That didn't make any sense. Naive != buggy.
 
Humus said:
A naive implementation is what you've got when your first working driver is up and running, typically several months before you release the card.

So, when exactly is the "magic line" crossed from naive to "non-naive"?

And does the time going from nothing to naive drivers not impact the total time going from nothing to non-naive?

Joe DeFuria said:
So which is it....it took a long time for mature DX drivers, or we never see "immature / naive" drivers?

That didn't make any sense. Naive != buggy.

As you can see, I didn't say "buggy", I said "immature."
 
JohnH said:
A conditional should be left as a conditional, no argument there, although many of the combinations you list are _easy_ to convert back into something different (generally all but LRP).

Well, case on case on case... eventually you'll find that all the cases provided so far in this thread where an IM format is inadequate lead us closer and closer to driver-side HLSL. There's not much left that the MS compiler can do that is universally useful. It will end up doing nothing really, except making the driver programmer's life more miserable trying to sort out what it screwed up.

JohnH said:
If your HW is only really capable of supporting profile X, and that's all you attempt to expose, then it's a whole lot easier. Problems only start to arise when someone gets it wrong and then attempts to claim that something is something that it isn't.

Except that you also need to support Y, Z, W, U, G, E and A, which were released in previous API revisions.

Profiles are a solution to a problem that you have repeatedly failed to address.

You have failed to explain in what way profiles solve anything.

Yes, probably quite idle, irrespective of their presence or the compiler used; I think you'll find that the majority of shaders in Half-Life 2 don't use sincos.

That's beside the point. What about tomorrow? What hardware will be left unused in the future because the MS compiler didn't support it? How about this: I've read (but don't really know if it's true) that ATI has a MAD2X instruction in the R300 that does 2 * a * b + c. It's not exposed through any profile, thus left unused. It would be useful, for instance, for accelerating the reflect() function, cutting the instruction count from 3 to 2. But no, MS will force an unnecessary MUL instruction in there.
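For illustration, assuming the rumored MAD2X really computes 2*a*b + c (register names illustrative), the reflect(I, N) = I - 2*dot(N, I)*N expansion would shrink from three instructions:
Code:
dp3 r0.w, rN, rI        // dot(N, I)
add r0.w, r0.w, r0.w    // 2 * dot(N, I)
mad r1, -r0.w, rN, rI   // I - 2*dot(N,I)*N
// ...to two, with the scale-by-2 fused in (hypothetical instruction):
dp3 r0.w, rN, rI
mad2x r1, -r0.w, rN, rI // I - 2*dot(N,I)*N in one op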

Of course, you're completely ignoring the HUGE difficulties created by having to support all features in GLSlang if you expose the extensions. This will only go away when HW that fully supports it is available, and then you're still left with what to do about the huge pile of legacy HW out there.

No, that's easily fixed: the software renderer takes over, like it has always done in OpenGL, and so far that has worked just fine.
 
Joe DeFuria said:
So, when exactly is the "magic line" crossed from naive to "non-naive"?

When you begin trying to go beyond the first-best approach to something more complex.

Joe DeFuria said:
And does the time going from nothing to naive drivers not impact the total time going from nothing to non-naive?

Sure, but why focus on that particular part? It's the end quality that reaches the end user that matters. If it consists of both X and Y, then saying X0 > X1 doesn't tell us much about whether X0+Y0 > X1+Y1.

Joe DeFuria said:
As you can see, I didn't say "buggy", I said "immature."

Well, "mature" in my post was meant to mean in terms of stability of buglessness. Which has nothing to do with naiveness of the driver.
 
Humus said:
When you begin trying to go beyond the first-best approach to something more complex.

I'm quite sure it's not that cut and dry. I would consider "naive" drivers to be along the lines of "an implementation which is focused mainly on providing correct functionality, and not on performance".

This implies that certain aspects of the driver can be more "naive" than others. I'd wager that "official release 1" drivers are typically more naive than not. (See every nVidia new-product driver release + follow-up "amazing Detonators".)

Sure, but why focus on that particular part?

Because you aren't. ;)

It's the end quality that reaches the end user that matters. If it consists of both X and Y, then saying X0 > X1 doesn't tell us much about whether X0+Y0 > X1+Y1.

True. But then, I believe there is more X0 and X1 in "initial release drivers" than there is Y0 and Y1. (That is, initial drivers tend to be more naive than not.)

Joe DeFuria said:
Well, "mature" in my post was meant to mean in terms of stability of buglessness. Which has nothing to do with naiveness of the driver.

Actually, I disagree to an extent. I'd tend to think that the less "naive" drivers become, the more potential there is for different kinds of quirky bugs. (This depends on the specific nature of the optimizations being applied, of course.) I know that in my own code, as a general rule, the more code you have (and the more special cases you try to "optimize"), the greater the risk of unintended consequences. The goal, of course, is for the end result to be overall better. (Maybe introduce some more obscure bugs, but squash large ones that affect more people, while at the same time providing performance benefits for everyone.)
 
Naive, in the context Humus seems to be using, appears to mean not even a release driver, but rather a "first draft" of a driver. This driver is made to just work, and optimization isn't even an issue; it just has to be able to render the scene. Once you have an initial, correct implementation, performance improvements can be made.

You can be pretty certain that every release driver we've seen for some time has had a number of performance optimizations already, simply due to the fact that we've seen some unbelievably bad-performing drivers from some of the smaller companies. Anybody remember Matrox' original G400 OpenGL drivers? Those were slow (I don't think this had anything inherently to do with OpenGL, but rather had more to do with the initial focus Matrox applied to implementing OpenGL).
 
Chalnoth said:
Naive, in the context Humus seems to be using, appears to mean not even a release driver, but rather a "first draft" of a driver.

That's my context as well.

My point is that it seems to me that in many cases, the first "release" driver has more "naiveness" to it than optimizations. You don't release a driver with "global optimizations" throughout; certain parts of the driver are "optimized" in one release, others are not, etc.

You can be pretty certain that every release driver we've seen for some time has had a number of performance optimizations already, simply due to the fact that we've seen some unbelievably bad-performing drivers from some of the smaller companies.

And from the bigger companies as well. Again, I agree that release drivers are not "100% naive". That was never my contention. But it would not surprise me to learn that first release drivers are much closer to "100% naive" than they are to a theoretical 100% optimized.

Anybody remember Matrox' original G400 OpenGL drivers? Those were slow (I don't think this had anything inherently to do with OpenGL, but rather had more to do with the initial focus Matrox applied to implementing OpenGL).

Well, I happen to think that the initial slowness, and the pace at which those drivers took to "not be slow", does have to do with GL's inherent structure. ;) Do you think those initial G400 (G200?) GL drivers were somewhat naive as well?
 
Humus said:
JohnH said:
What common sub-expressions are you talking about exactly?

Uhm, pretty much any kind, unless the sub-expression is very large. Like this:

A = (X + Y) * Z;
B = (X + Y) * W;

On R300 this would optimally be done like this:

temp = X + Y;
A = temp * Z;
B = temp * W;

This adds an extra register, however, which is suboptimal on NV30. So instead it would be preferable to do:

A = X + Y;
A *= Z;
B = X + Y;
B *= W;

One instruction more, but less register usage. With the MS compiler making these decisions rather than the driver, one GPU will be at a disadvantage.
Errr.... That seems like a relatively trivial thing for an NV back-end optimiser (assuming one actually exists) to detect. It'd be simple to spot the re-use of a register, and if its contents are cheaper to compute than reserving the register, just recompute the contents.

It seems to me that common sub-expression elimination is far more difficult than reversing the operation!
 
What you say is true, Simon (you can use inlining to undo CSE), but that wasn't the point. The point is that MS's compiler generates code which places an additional workload on the driver developer to "undo" it, thus falsifying the premise that MS's compiler makes IHV drivers easier to develop. The original premise was that MS does the parsing and optimization for you, and as an IHV you just have to do a naive assembly pass to transform DX9 instruction tokens into native code. The reality is far from this; in fact, MS's compiler does not alleviate the need to include an optimizing compiler in the driver. All it does is alleviate parsing; whoop-de-doo, the easiest part.

That said, it is true that basic CSE can be erased by inlining, but it gets harder when you have nested eliminations, or partial redundancy elimination in the presence of branches, plus instruction reordering. The driver has to go through the work of building up a DAG and liveness data structures in order to find reuse and evaluate cost. Trivial is not what I'd call it. Straightforward maybe, but not trivial. Trivial, for me, = a peephole optimizer: something for which you need only examine instructions a few at a time, a la a stream filter.
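To put rough bounds on "trivial" (an illustrative sketch, not any shipping driver's code): a peephole pass only ever looks at a small window, e.g.
Code:
// peephole: rewrite a local pattern inside a two-instruction window
mov r1, r0
mul r2, r1, c0
// -> mul r2, r0, c0   (valid only if r1 has no later readers)
// undoing CSE is not window-local: replacing a shared temp with
// recomputation requires knowing all of its uses, which is exactly
// what the DAG and liveness information have to provide.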
 