How Cg favors NVIDIA products (at the expense of others)

demalion said:
I see. I'd have thought register analysis would be a primary factor in dead code elimination for shaders (atleast in a literal interpretation of the phrase), and that seemed likely to be one of the things some IHVs would need to take up. However, there is nothing preventing IHVs from addressing it again after suitable general optimizations.

Register allocation happens late in the compiler phase. Most optimizations happen before registers are selected. Let's take an example.

Code:
int main()
{
  int x = 10;
  int y = f(x) +x + 1; 
  x = y + 3;
  return x;
}
int f(int arg) { if(arg > 5) { return arg; } else { return 0; } }

The compiler, if it were using three-address intermediate representation in SSA form, would generate an internal structure like (let's inline the function for simplicity)

Code:
1.  x := 10
2.  t1 := x + 1
3.  if x <= 5 goto 6
4.  t2 := x
5. goto 7
6.  t3 := 0
7.  y := phi(t2,t3) + t1
8. x2 := y+3 
9. return x2

SSA form means no variable can be assigned to more than once, therefore, new variables are created for each new definition needed. phi() is a function which detects in step 7 whether the statement was reached from step 5 or from step 6 and chooses which variable to assign to y.

Now the compiler does constant folding and constant propagation. First, because of the nature of SSA form, a copy assignment to a variable (x = y) can be replaced with its value substituted wherever it occurs (let's replace x with 10).

Code:
2.  t1 := 10 + 1
3.  if 10 <= 5 goto 6
4.  t2 := 10
5. goto 7
6.  t3 := 0
7.  y := phi(t2,t3) + t1
8. x2 := y+3 
9. return x2

(For purposes of demonstration, I will do this iteratively.) Fold constants

Code:
2.  t1 := 11
3.  if 10 <= 5 goto 6
4.  t2 := 10
5. goto 7
6.  t3 := 0
7.  y := phi(t2,t3) + t1
8. x2 := y+3 
9. return x2

copy propagation of t1, t2, and t3

Code:
3.  if 10 <= 5 goto 6
4.  
5. goto 7
6. 
7.  y := phi(10,0) + 11
8. x2 := y+3 
9. return x2

constant conditional replacement. 10 <= 5 is always false, so the branch is never taken; remove lines 5 and 6 as dead code (line 6 is unreachable, and line 5's goto just falls through)

Code:
7.  y := phi(10) + 11
8. x2 := y+3 
9. return x2

constant folding (phi(10) is constant)

Code:
7. y := 21
8. x2 := y+3
9. return x2

propagate y and fold

Code:
8. x2 := 24
9. return x2

propagate

Code:
9. return 24

Now comes register allocation. First of all, we have a constant, so we put it somewhere.

Code:
def c0, 24, 0, 0, 0

Next, instruction selection. If this is a pixel or vertex shader, then the final "return" is actually a write to a predefined destination register. This uses a special "pre-allocated" semantic in the compiler to choose the right register.

Code:
mov destReg, c0

As you can see 9 instructions were reduced to 1, before any registers were allocated. Live variable analysis happens on a tree or directed acyclic graph of abstract variable assignments, not on actual registers.
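If you want something concrete to play with, the whole pipeline above (constant folding plus reachability-based dead code elimination over an SSA-style IR) fits in a few dozen lines of C. This is only a toy sketch; the names (Value, fold, mark) are invented for illustration, and the branch has already been resolved the way the walkthrough resolves it, but it reproduces the same final 24.

Code:
/* Toy SSA-style IR: every Value is defined once and refers to earlier
 * values by index.  Illustrative sketch only; not from any real compiler. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    char op;        /* 'k' = constant, '+' = add, 'r' = return */
    int  lhs, rhs;  /* operand indices, or -1 if unused        */
    int  konst;     /* payload when op == 'k'                  */
    bool live;      /* filled in by the liveness mark pass     */
} Value;

/* v0 := 10        (x)
 * v1 := 1
 * v2 := v0 + v1   (t1 := x + 1)
 * v3 := v0 + v2   (y  := x + t1, after the phi resolves to x)
 * v4 := 3
 * v5 := v3 + v4   (x2 := y + 3)
 * return v5                                                   */
Value prog[] = {
    {'k', -1, -1, 10, false},
    {'k', -1, -1,  1, false},
    {'+',  0,  1,  0, false},
    {'+',  0,  2,  0, false},
    {'k', -1, -1,  3, false},
    {'+',  3,  4,  0, false},
    {'r',  5, -1,  0, false},
};
enum { N = sizeof prog / sizeof prog[0] };

/* Constant folding: an add of two constants becomes a constant and
 * drops its operand references.  Repeat until nothing changes.      */
static void fold(void)
{
    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = 0; i < N; i++) {
            Value *v = &prog[i];
            if (v->op == '+' &&
                prog[v->lhs].op == 'k' && prog[v->rhs].op == 'k') {
                v->konst = prog[v->lhs].konst + prog[v->rhs].konst;
                v->op  = 'k';
                v->lhs = v->rhs = -1;
                changed = true;
            }
        }
    }
}

/* Dead-code elimination: only values reachable from the return stay. */
static void mark(int i)
{
    if (i < 0 || prog[i].live)
        return;
    prog[i].live = true;
    mark(prog[i].lhs);
    mark(prog[i].rhs);
}

int main(void)
{
    fold();
    mark(N - 1);   /* the return is the only root */
    for (int i = 0; i < N; i++)
        if (prog[i].live && prog[i].op == 'k')
            printf("v%d := %d\n", i, prog[i].konst);
    printf("return v%d  (= %d)\n",
           prog[N - 1].lhs, prog[prog[N - 1].lhs].konst);
    return 0;
}

Running it prints "v5 := 24" followed by "return v5  (= 24)": everything except the folded constant and the return has been discarded, and no register was ever mentioned.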
 
DemoCoder said:
demalion said:
BTW, that bold statement isn't quite accurate. It was 5 paragraphs of text discussing your excuse for skipping my discussion about the HLSL issue.

Demalion, why is it that unique to you, and no other person on this board, you frequently accuse people of skipping or ignoring your discussion?
*sigh* Straw man. The uniqueness is offered as support for your argument, presented as an absolute and pulled out of thin air, and you attack the uniqueness instead of the truthfulness of the assertion as it applies to you, when that truthfulness directly answers your question.
My uniqueness is more related to bothering to point out the practice, and to continuing to converse with someone who displays an obstinate dedication to the behavior, with an illustration of how they are doing so, instead of using a simple put-down and stopping there.

Could it be that your discussions are often a) not relevant b) too verbose and roundabout or c) poorly written

or d) my assertions are true because of the people with whom I converse. I think I've actually provided some support for this possibility, so it is "funny" how you missed mentioning it. Now, you can perhaps blame me for conversing with you, but doesn't that say something about you?

I know of no other person on this board who complains as frequently as you do about people eliding parts of your messages.

Another straw man. I don't complain about people editing my messages, I complain about people ignoring large portions of my commentary that already addresses something they go on to propose. The editing tactics are just the mechanism...they can be used to do otherwise. I "complain" about it when they aren't used otherwise, and I even try to show the support for why I state it. That last detail doesn't seem to matter, though?

It seems whenever someone catches you in a quandary,

Now you maintain that I didn't answer your assertions and just complained without details, which is directly contradictory to your own complaints about "5 paragraphs" that do exactly that. Argumentation by total disregard to self-consistency is not useful.

you fall back on verbose passages about how they misquoted you.

It is so nice that you continue to propose arguments completely separate from what you maintain is the only thing relevant...well, except when you are discussing these things, of course. You seem to consider my agreeing with your statements when you do this as the only acceptable response I can make. Astounding.

I thought I said all of this before, but saying that would be "falling back because you caught me in a quandary", even if I directly quoted myself already doing so and asked you why you didn't address it, right?

:rolleyes:
 
DemoCoder said:
demalion said:
I take it you mean the failure of one single static compiler that doesn't evolve when a VLIW architecture does? The problem with your example is that you are proposing that what HP and Intel is doing for the Itanium is a proof with regard to GPUs, with a widely divergent workload, and where you haven't bothered to establish that the VLIW evolution will be the same for GPUs.

Demalion, your escape clause in all these arguments is the constant refrain "but it's a GPU".

Actually, my "escape clause" is that GPU workload characteristics and CPU workload characteristics are quite different, and referring to the CPU without correlating the applicability is not valid. It's not an escape clause, because I'm only asking to discuss the correlation instead.

Compiler theory was developed against the background of abstract computing models on automata and idealized computing models. If you pick up most compiler books, it won't even mention any actual CPUs in the real world. The problems specific to VLIW scalability with static compilation are inherent and not architecture specific.

Yes, the problem is not architecture specific, but is it universal to every implementation of VLIW? The demands of the workload being addressed seem to indicate a focus on synchronized and parallel execution, scaling with that in mind, and designing to facilitate functional unit utilization for direct mathematical calculation.

It seems to me that there is more similarity with the main part of GPU workloads and SSE than with the general computing something like the EPIC VLIW approach targets, and my current understanding is that the Athlon excels at leveraging ALU utilization for both SIMD in things like SSE and their general ALU usage. Why won't GPU designers be trying to do the same thing?

That's a pretty direct and open avenue of discussion for an "escape clause". I don't know how I can more clearly state that I'm asking you to simply do other than circumvent this discussion, though I think I'm just repeating the same request again.

If you create an FXC compiler profile specifically attuned to, say, a pipeline with 3 functional units, it will produce code that is not optimal for a pipeline with 5 functional units.

And how is the LLSL->GPU compiler completely unable to handle this? The LLSL characteristics are determined by the profile.

Every time the architecture is modified, the static compiler will have to be updated and all games will have to be recompiled. In fact, GPUs exacerbate the highlighted problems by being even more aggressively parallel in their scaling than conventional CPUs.

Doesn't SIMD processing have something to do with how the architecture is modified? I'd think a focus on this and the discrete nature of each component would dictate the evolution as much as a VLIW approach would. Also, the aggressive parallelism for GPUs is for discrete pixel outputs...isn't the parallelism for the Itanium an issue of branch prediction discards?

Yes, I said this before too. If it is wrong in some particular, nothing is stopping you from simply tackling the task of pointing out why.

That's why the driver has to do instruction selection. The only issue is whether or not there is enough information in the DX9 assembly to make such decisions.

Yes, I've tried to discuss this.

...Compilers are ultimately nothing more than pattern matchers, and refactoring the pattern can alter what you "recognize". Even in mathematics, you might be able to prove that two number systems are isomorphic and identical, but the types of things you can recognize and do in each space are radically different (e.g. time vs frequency domain)

I recognize the relevance of this, but I do not recognize that you've established when LLSL is insufficient to the scope of shader execution. You're repeating hypotheticals by maintaining that the distinction between CPUs and GPUs does not matter. I'm simply asking you, again, to engage in a discussion of what you are maintaining instead of just repeating the statements that it is so. Really, do you not see this as being the case without me quoting myself to demonstrate it?

DX9 assembly is not necessarily the representation most conducive to allowing the driver to do the best optimizations.

Right, but it is not necessarily a representation that prevents it. You're proposing that it is a fact that it is, right now, and I'm pointing out that there are indications to the contrary of that assertion. Why is your response simply to require me to say the same thing again?

The result will be that fewer optimizations are found, and more difficult work has to be done by the driver developers to find them.

You went from the possibility that DX 9 LLSL is not sufficient, and the verifiable observation that it seems to succeed fairly well within its requirement of floating point processing, to stating the certainty that it is insufficient.

You seem to ignore that failing to compile is part of the original expression, and that this goes hand in hand with instruction count limitations that do not parallel CPU compilation.

Why would the compiler fail to compile deliberately if it could produce a transformation which fits within resource limits?

Well, when that transformation results in something that cannot fit into the resource limits (which includes real time execution). This directly influences what shaders will be implemented and when. If we're talking about other than real time situations, I'm thinking the driver having the compiler becomes less of an issue.

Seems absurd. If I have a function call nesting > 4 levels deep on ps3.0 or 1 level deep on ps_2 extended, the compiler will have to inline any further calls.

Well, ps 2 extended is listed as supporting up to 4 on the MSDN website. Is that inaccurate?

f(g(h(i(j(x))))) should compile fine on ps3.0 and even ps2.0 even though DX9 can't support it in the assembly via "call" instructions.

Well, why wouldn't that output change depending on the result of the caps query? Is the compiler incapable of doing this because it is targeting the LLSL?

Only if the compiler is incapable of expressing them in the LLSL in a form useful to the LLSL->assembly compiler (not a given, details above).

Why do you think I prefer the term "LLSL" instead of "assembly", anyways? I mention this to illustrate why I'm excising your explanation of this.

I just showed you an example. And please, it is not a low level shading language and it is not an intermediate representation.

Well, it neatly avoids the need to demonstrate its deficiencies as an intermediate representation if it isn't one.
Also, your quoting structure is a rather MAJOR distortion of my statements, such that you're not really addressing me at all...the first sentence was a response to your saying "and therefore, if your underlying hardware has superior than DX9 capabilities in some of the pipeline, it will be unable to realize some optimizations." You'd have to support that statement to address it, but instead you attack the second as if it was part of the first statement.

In actuality, the second was a nested quote of you quoting me on a completely separate line of discussion, where I was pointing out you weren't making sense by attacking it when it was intended to answer your question of whether I "understood" that the DX LLSL was compiled by the drivers already. Removing the nesting and representing it as a continuous statement does not help matters, it just looks like you found it more convenient to misrepresent me to make your assertion because your statement depends on that misrepresentation to be a coherent address to me.

No, I don't enjoy having to tackle nonsensical obstacles like this in every reply. It really would be okay to stop placing them there.

It is an assembly language based on the DX8 legacy, from before HLSL was invented.

Well, that answers everything I've said, doesn't it? :oops: Actually, the removal of extensive modifiers and literal coissue expression seems to argue for movement away from exactly what you maintain.

No, I'm suggesting that MS could have the compiler change behavior based on predicate and branch support reported, and that the approaches that the profiles represent need not be unique to every new hardware.

So you expect Microsoft will have to maintain dozens if not more compiler switches for every conceivable combination of architectural support themselves and that they will have to frequently update this uber compiler with patches and send it to developers so that they can RECOMPILE their GAMES and distribute patches to all the gamers out there?

Well, the compiler switches are already there, and as far as what you've brought up with your unrolling and branching decisions this seems pretty simple. I feel I've covered the issue of your practice of gross misrepresentation of disagreement, so I won't do it again for now. I will point out that there are less than "dozens" of issues established at the moment. :-?

my example said:
if(a < x) { b = 10; }
else { b =20; }

Well, for the predicate, a setp_lt,if_pred,mov,else,mov,endif (sticking to one component, and using branching, at least in LLSL expression) or setp_lt,mov,(p)mov (allowing per component execution of your idea, if necessary) seem possible.

For branching (without predicate), if_lt,mov,else,mov,endif works, correct?

For lrp, how would you best set the interpolation register to each end of the range (0,1 I'm assuming)?

Why not count up the instruction slots used by your solution, compared to an slt/lrp or sub/cmp/lrp.

I considered a sub/cmp type approach, but why wouldn't output targeted at ps 2.0 extended capabilities (or a vs 2.0 or higher profile, as I realize now that you are including) use its specified branching instructions, and let the LLSL->GPU compiler implement the specifics as sub/cmp if necessary?

As for slt, I simply didn't consider vertex shaders. Looks a bit like my thoughts on "if only a predicate register could be used for lrp interpolation control and boolean true was defined as 1". I still don't see how the LLSL->GPU compiler couldn't handle converting an if to slt/lrp, or why the compiler would have to fit your description to be able to provide slt/lrp optimizations in one profile and not another (assuming it does express this literally for the vs profile currently?).
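For concreteness, here is what the branch-free lowering of DemoCoder's example amounts to, sketched in C rather than LLSL (so the function names and values here are purely illustrative): an slt-style compare produces a 0/1 selector, and an lrp-style interpolation picks between the two constants. Either form is recoverable from the other, which is the point under discussion.

Code:
#include <stdio.h>

/* Branching form of DemoCoder's example: if (a < x) b = 10; else b = 20; */
static float branching(float a, float x)
{
    float b;
    if (a < x) b = 10.0f;
    else       b = 20.0f;
    return b;
}

/* Branch-free form: an slt-style compare yields 0.0 or 1.0, and an
 * lrp-style interpolation selects between the two constants.         */
static float branch_free(float a, float x)
{
    float t = (a < x) ? 1.0f : 0.0f;     /* slt t, a, x          */
    return 20.0f + t * (10.0f - 20.0f);  /* lrp b, t, 10, 20     */
}

int main(void)
{
    printf("%g %g\n", branching(1.0f, 2.0f), branch_free(1.0f, 2.0f)); /* 10 10 */
    printf("%g %g\n", branching(3.0f, 2.0f), branch_free(3.0f, 2.0f)); /* 20 20 */
    return 0;
}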

As the author of an optimizer, which are you going to choose, and how are you going to code it

And this is nothing like saying the compiler can only behave in one way? I'd choose based on the capabilities of the target. The target, in this case, is a LLSL that a driver is going to compile again. Ignoring capability queries, one profile that targets based on instruction slot conservation and another that targets based on literal expression with the LLSL instructions would go a long way towards being a solution for multiple GPUs and their LLSL compilers.

Now, AFAICS, you're saying the compiler would pick one approach (that may resemble or completely fail to resemble my outlines), and that this one, in some particular case (more complex than your example), would be beyond the scope of the driver compiler for the LLSL to implement efficiently in the architecture.

I'm saying that if the compiler chooses slt/lrp, it will be more difficult for the driver to reconstruct all the basic blocks that existed in the original source code. Likewise for predicates.

If this is other than what I proposed, I fail to see it at the moment.

The only solution is to force the compiler to always turn any conditional into a branch to preserve the original control flow graph, but again, because of resource limits, it is not always able to do this.

Why can't it do this in different profiles? Now, I think we're on the same page here, and I'm happy that this branch of the discussion is resulting in that, if I'm right. Can you just directly answer the question? You know, the technical part of your reply could consist of just tackling this answer, and I would not "complain" as long as we continue making some progress and avoid repetition at least by some measure (which this branch of discussion is doing at the moment).

What application is this, and why is patching it unacceptable? This case is not something that would happen to every single application, even if you wouldn't be patching them with new hardware.

Let's say there are 100 games out there using HLSL.

Let's say there are 5 publishers for those 100 games. Or, maybe, there are 5 unique game engines.

And every month, like clockwork, ATI, NVidia, 3DLabs, IMGTEC, and other companies find new ways to squeeze performance optimizations out of their designs, or release new HW.

Well, ATI is the only one who promised monthly updates, though I think nVidia should be as capable from a resources standpoint. But I'll grant it for now.

Just like driver updates, there will be a steady stream of compiler updates (as there is with GCC). Zealous gamers who always like to have maximum performance will eagerly download what?

Updates to a shader compilation application written like an auto-update tool, from each publisher, capable of handling all their games. Or maybe a tool for each engine. Actually, this could be a growth direction for the inter-publisher software vendors or the engine designers. Now, if you find a flaw in this proposal, just finally say why.

Continuously updated patches for every game in their library, assuming that all game publishers will rigorously remain in sync with IHVs and publish patches every time an update comes out?

No, though your description's dependence on monthly compiler updates seems a rather ridiculous exaggeration in the first place.

Tell me demalion, would you argue for a "grand unified driver model" in which Microsoft ships all drivers for every card in one big uber-library file that developers statically link-in at compile time?

You did this before. I pointed out why it was silly. It would be helpful if you gave those reasons any thought at all.

A statically linked driver would in fact deliver higher performance, but at a massive cost to maintainability.

What if car bodies always wore out within a year of regular use? A welded and bolted car body would in fact deliver good structural strength and low rattle noise while in motion, but at a massive cost to maintainability.

That is the construction of your argument here. Hopefully the full scope of the parallel is evident without me specifying every particular.

So sure, it is possible to continuously ship updates for a single static compiler that supports N profiles, but it is a bad idea for two reasons

#1 lots of overhead in distributing the improvements to users and developers
The problem is that you are depending on hyperbole to support this as being established.
#2 "profiles" do not neccessarily represent the best medium for device drivers to work off of.
Say, how many times do I have to ask you to say why? We're making progress, though, as at least you recognized their existence.
 
Best for last

<only relating to a different post>
Hmm, you completely isolated one part of my statements by editing. Note that I'm not complaining about it. For comparison and contrast to an entirely different post.
</only relating to a different post>

DemoCoder said:
demalion said:
I see. I'd have thought register analysis would be a primary factor in dead code elimination for shaders (atleast in a literal interpretation of the phrase), and that seemed likely to be one of the things some IHVs would need to take up. However, there is nothing preventing IHVs from addressing it again after suitable general optimizations.

Register allocation happens late in the compiler phase. Most optimizations happen before registers are selected. Let's take an example.

...excellent example...

As you can see 9 instructions were reduced to 1, before any registers were allocated. Live variable analysis happens on a tree or directed acyclic graph of abstract variable assignments, not on actual registers.

Your example represents something that can be reduced to a single component constant, and specifying where it is stored. What about multiple component characteristics (like how per component variations in evaluation would be represented) and something like a texture sampled register that cannot be resolved in this way?

Your example was very clear, nothing stated in it needs to be repeated at length to reply to this query, just trying to cover optimizations related to non-constant registers and whether a different approach to multi-component variables and constants is required.
 
Your example represents something that can be reduced to a single component constant, and specifying where it is stored. What about multiple component characteristics (like how per component variations in evaluation would be represented) and something like a texture sampled register that cannot be resolved in this way?

The high level optimizations still operate prior to register allocation; of course, not being able to infer the value of conditional statements lowers the amount of unreachable code you can remove.

The first part of your question is an interesting one, since it begs the question as to what kind of intermediate representation we should be optimizing on. There are many schools of thought.

For example, if I write in HLSL

Code:
 a = b * c

My compiler might choose to use the following IR

Code:
 1. a := b * c
(vectors treated as opaque)

But it could also generate the following

Code:
1. a1 := b1 * c1
2. a2 := b2 * c2
3. a3 := b3 * c3
4. a4 := b4 * c4

That is, it could use a much lower level representation of what is going on.

The high level representation is useful for detecting algebraic identities in vector math, the stuff we learn in high school physics, like (A cross B) dot A = 0 and (A cross B) dot B = 0, since A cross B is perpendicular to both A and B.

The low level representation makes it easier to detect things like how to pack operations together. (For example, if I have vec2 a, b, c, and d, and I have a*b and c*d, I can perform both in a single MUL by packing a and c into a register and b and d into a register.) It could also be used to discover that certain components of vectors are dead (not used). (Now that I think about it, the low level case would also discover the dot-product/cross-product optimization, but through a completely different mechanism - the way we prove it in math by writing down a huge list of terms and cancelling them all to zero.)
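To make the packing case concrete, here is a toy sketch in C (array indices standing in for register components; not real shader code): two vec2 multiplies share one 4-wide multiply once their operands are packed.

Code:
#include <stdio.h>

/* vec2 a*b and vec2 c*d packed into one 4-wide multiply:
 * pack (a.x, a.y, c.x, c.y) against (b.x, b.y, d.x, d.y),
 * and one component-wise mul produces both results at once. */
int main(void)
{
    float a[2] = {1, 2}, b[2] = {3, 4};   /* a*b = (3, 8)   */
    float c[2] = {5, 6}, d[2] = {7, 8};   /* c*d = (35, 48) */

    float lhs[4] = {a[0], a[1], c[0], c[1]};
    float rhs[4] = {b[0], b[1], d[0], d[1]};
    float out[4];

    for (int i = 0; i < 4; i++)           /* stands in for a single MUL */
        out[i] = lhs[i] * rhs[i];

    printf("a*b = (%g, %g), c*d = (%g, %g)\n", out[0], out[1], out[2], out[3]);
    return 0;
}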


The two representations yield different kinds of optimizations that are easy to detect. Obviously, if you have a stream of randomly ordered low level (scalar) instructions, it will be harder to detect things like normals being dot-producted together without more robust "theorem proving" capability. I'll be honest, I have gone through the trouble of speccing out an HLSL compiler, and my initial research shows that I have to convert back and forth between multiple representations to get the best optimizations. Not only that, but I must backtrack, because sometimes something you did in one representation obscures an optimization in another.

The people who wrote GCC discovered this too. Initially, all of their optimizations happened on a very low level IR called RTL. Now, however, they also include a high level representation and associated optimizations.



The reason why you don't do the generic optimizations after register selection is because these optimizations only need to work on which statements are reachable and which variables are used by which statements. Having a unique name for every value (as in Static Single Assignment (SSA) form) makes it much easier. After register selection, you could have a single register, say r0, being used in 80% of your program, and it would be harder to track down the interdependencies.


If you have a bit of code that looks like this:

Code:
x = texture_lookup(coord)
y = x dot L
z = pow(y.x, exponent)
w = y.y * z
return z

You can work backwards quite easily. At the end of the block, z is live, since it is being returned. Therefore, any statement defining z is live. z uses y.x, so anything defining y.x is live. y uses x, so anything defining x is live.

w was never marked live, therefore it is dead code. This is a less conservative approach (assuming code is dead until you find evidence that it isn't). Most compilers use a conservative approach (assume code is live, unless you can prove it isn't). But the principle is the same.
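For what it's worth, that backward sweep is easy to sketch in code. Here it is in C over the same four statements (the names are invented for the example, and swizzles are dropped since only whole-value dependencies matter for the illustration):

Code:
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

/* The block from above: one def per statement, plus the names it uses. */
typedef struct {
    const char *def;
    const char *uses[3];
} Stmt;

static Stmt block[] = {
    {"x", {"coord", NULL}},          /* x = texture_lookup(coord) */
    {"y", {"x", "L", NULL}},         /* y = x dot L               */
    {"z", {"y", "exponent", NULL}},  /* z = pow(y.x, exponent)    */
    {"w", {"y", "z", NULL}},         /* w = y.y * z               */
};

int main(void)
{
    const char *live[16] = {"z"};    /* the return makes z live   */
    int nlive = 1;

    /* Walk backwards: a statement is live iff its def is live;
     * if it is, everything it uses becomes live too.             */
    for (int i = 3; i >= 0; i--) {
        bool is_live = false;
        for (int j = 0; j < nlive; j++)
            if (strcmp(block[i].def, live[j]) == 0) is_live = true;
        printf("%s is %s\n", block[i].def, is_live ? "live" : "dead");
        if (is_live)
            for (int u = 0; block[i].uses[u]; u++)
                live[nlive++] = block[i].uses[u];
    }
    return 0;
}

It reports w as dead and x, y, z as live, which is exactly the hand reasoning above; the point is that every value has a unique name, so the lookup is trivial.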


Now imagine the following

Code:
r0 = texture_lookup(coord)
r0 = r0 dot L
r1 = pow(r0.x, exponent)
r2 = r0.y * r1
return r1

It is now slightly harder (in the definition of r2, which r0 is r0.y referring to?). Visually, it looks simple to you. Now imagine conditional branches, so that when r2 is evaluated it is not clear whether and in what order those r0-defining instructions were executed. It becomes a lot harder.

And I left out a lot of register shuffling which could happen, to make it even less clear which statements are referring to values defined by other statements, and when each is "live" and for how long.

I don't really want to get into going through the whole optimizing compiler exercise, since I could spend weeks typing stuff up, but this is just a taste. It's one reason why hand-tuned code still beats the best compilers on small problems, because it is frankly hard to code up pattern matching algorithms that know when to substitute or reorder instructions.

Sometimes, something which locally looks like a good optimization, ends up making the rest of the code slow.
 
Sxotty said:
...
I think there is this thing called the Xbox, and Microsoft has to get some company to make the 3D chipset; if you don't think this affects their stance towards Nvidia and ATI, you are incorrect.

The question is whether or not the xBox relationship has any bearing on the formation of M$'s D3D/HLSL development. As the complaint most often heard of late is that DX9/M$ HLSL "unfairly" favors ATi, there would appear to be little objective evidence to support the idea that the xBox relationship skews M$'s initiatives.

Rather, I think the example might provide a basis for stating that the xBox relationship is not being allowed to color M$'s PC/OS-centered objectives. That's only sound philosophy it seems to me. M$ is wise to completely separate the two and not to become overly dependent on any single IHV. The company built itself by working with the broad spectrum of IHVs in the formulation of software standards designed to allow a wide latitude of IHV competition, and I am sure they want to keep all of their options open, including which IHVs to use for any future xBox products they might like to do. In short, M$ is not about to allow itself to be used as a stepping stone to prominence by any one particular IHV, and I think this is what lies at the heart of whatever current xBox problems M$ is having with nVidia at the moment. M$'s major strength is the company's identity as a software company and the company is smart enough not to want to screw around with that identity (whereas a company like Apple suffers because it is constantly struggling to decide whether its interests lie in being a hardware company or a software company.)
 