Rightmark3D

First, sorry demalion for picking some points to reply to and ignoring others, but I wrote that post at 5am and wanted to just write some answers I could think of in my tired state before going to bed :)

demalion said:
Xmas, you are losing me here. We're talking about Cg...Cg doesn't compile to ATI specific extensions, nor is it likely to as long as ATI is committed to the standards Cg bypasses and instead chooses to do things like design their hardware for more computational efficiency (Pick some shader benchmarks comparing the cards, analyze the results, and tell me if you'd dispute even that).
I was talking about RightMark having a render path that prefers "native" extensions as opposed to ARB ones. This wasn't Cg specific, this wasn't even shader-specific. What about vertex buffer extensions?

RightMark supports PS1.4 in some tests. So it could as well support ATI's fragment shader extension in OpenGL.
And...

demalion said:
Hmm... OK, let's go over this again:

There are "synthetic" tests. I only noticed files fo DirectX at the moment, but I presume they intend to implement them for OpenGL. When/if they do, of course it is not bad if they support all OpenGL extensions, but that is not at all what we are discussing. That follows...

There are "game" tests. They are implemented using Cg, period. The only optimizations this delivers are nVidia's prioritized compilation, and for those targets nVidia has defined. So, in addition to the restriction of standards to only the ones nVidia has an interest in exposing (which excludes the ATI fragment extension for OpenGL and PS 1.4), the only vendor specific optimizations offered are the ones provided by nVidia, for nVidia hardware.
... I still can't see how having the clearly-labeled choice of supporting vendor-specific extensions (which I don't think are NVidia only) can possibly be a bad thing.

One thing to note: The option is labeled "Preferred extensions". For R200 PS1.4 there's only an ATI extension AFAIK. For R300 PS2.0 there's only an ARB extension. For NV30 PS2.0 there are ARB and NV extensions.


demalion said:
Solutions, covered again:

The authors of Rightmark 3D support DX 9 HLSL (ignoring for the moment that the shaders in the .cg files seem short and able to be ported to the ATI fragment extension for OpenGL and the 8500/9000). This is what I referred to earlier as being as much as could be reasonably expected with limited resources (i.e., flawed, but potentially useful).

Other IHVs circumvent GLslang and DX 9 HLSL, whose specifications they happen to have a known say in, and support Cg on nVidia's terms. There does appear to be some reason for them to consider this undesirable, do you disagree?
I don't think they need "solutions", but supporting DX9 HLSL in parallel would certainly be nice to see.



demalion said:
demalion said:
To the DX 9 PS 2.0 spec, as I specifically stated.
Sorry, but the DX9 PS 2.0 spec only says what kinds of shaders are valid shaders, not what kinds of shaders are good shaders. So when writing a shader you have lots of options, but no indication of whether you're on the right track or not.
Yeah, and APIs only specify what kinds of programs are valid, not what kinds are good programs. Your statement is misleading, since you are proposing that there is no relation between the specification of the instructions and modifiers and what is, subsequently, a good method of executing them. That is simply not true.

For instance, the DX 9 PS 2.0 specification does not specify that two non texture op instructions from a restricted set occur for each PS 2.0 instruction or texture op. This is a possible occurrence, but depending on it to occur for optimal performance is not a good method of execution. However, that is a characteristic of Cg's known target, and it is the ability to control this that nVidia is using Cg to allow them to do for shader content creation, during development and at run time, before the low level optimizer has to worry about it. They promote that precompiled code be generated using Cg, even if there is another alternative, for that reason...but, you maintain this does not matter despite evidence and analysis proposed to the contrary. :-?

demalion said:
You continue to maintain that simplification, despite my having tried to point you to examples of that not being the case. Again, looking at some benchmarks, notice instruction count parity (including when targeting the same assembly) and yet performance differences even for the same datatype. Also notice the recommendations to aid Cg compilation at the end, which the Cg compiler itself could (and likely will) do, even though the concerns are unique to the nv30, not all architectures.
And you continue to talk about different targets while I am only talking about PS 2.0 assembly, and not about anything to aid high level compilers.
We are talking about a high level compiler, Xmas. What do you think Cg is using to output to PS 2.0? You are circumventing the simple observation that compilers produce different code, by saying you are only talking about the code.

Why didn't you address my instruction example at all? If it has flaws, it would be useful to discuss where I went wrong. If it doesn't, it seems more than slightly applicable. :?:

You can write bad code using PS 2.0, you can hand code bad code, you can compile bad code in a compiler. The problem with Cg is that the bad code that the compiler will be improved to avoid is defined by nVidia. The benefit of DX 9 HLSL, as long as MS doesn't have a conflict of interest, is that all IHVs have input into that definition (without having to write a custom compiler at nVidia's request).
A good set of base principles can be reached and targeted without conflict, except for designs that are a bad match for the general case base specifications of the LLSL. It so happens that the nv30 is a bad design for all specifications I know of except nVidia's customized OpenGL extension and their formerly (?) bastardized PS 2.0, yet they want other IHVs to have to also create another HLSL compiler ...for nVidia's toolset :!:...or allow nVidia to dictate the code their low level optimizers have to be engineered to address effectively.

You maintain there is nothing wrong with things occurring along the lines of this goal.

I have to answer that as a whole.
The problem I'm trying to get at is, as long as you don't have any applicable metric for quality of assembly shader code except "shorter is better", how can you say one shader is better than another one?

I am particularly considering that different compilers produce different code. The point is, is there a way to tell which one is better?

Is "NV30-optimized" code that takes certain characteristics into account worse than code without those optimizations?
Can we take performance as a metric? We can. But in which cases?
If HLSL code runs faster than equivalent-length Cg code with a driver that is optimized for "DX9 HLSL compiler style" and for nothing else, that is self-fulfilling. Of course this is also true the other way round.

What if a Cg shader runs faster on hardware A and the equivalent HLSL shader runs faster on hardware B?


demalion said:
btw, arbitrary swizzle is an optional feature of "PS2.x" and if a compiler would output code that uses it, it would not work on PS2.0 cards at all.

Is this in reference to the sentence after my example shader code? Thanks for the correction, but what about the rest of what I proposed?
I'll reply to that in a separate post.


demalion said:
Well, that seems to fit my second description, quoted alone for brevity. I already provided an answer:
demalion said:
I think the second is wrong, though it does seem to be what nVidia is intent on accomplishing.

And expanded upon it (why didn't you address this the first time?):

demalion said:
(btw, I don't know why you mention pocketmoon's Cg recommendations here, as I'm not talking about how you should write a HLSL shader)
Well, if you look at the recommendations, consider that they are nv30 specific recommendations that nVidia has an interest in having the Cg compiler handle, regardless of the impact on low level optimization workload for other IHVs. Then consider that some might not impact other IHV hardware negatively...that there is a conceivable difference between those that do and do not, despite what you have proposed. Then apply that thinking to some of the established unique attributes of the nv30.

You might then try doing the same with the R300, and realize how little divergence from general case specification expression its performance opportunities require.

I call that the result of a "better design".
I would also call R300 the better design. However I do not think a design that I would consider even better would have to be equally "general case compatible" and not sensitive to optimizations.

But that's a reason why I think the "integrated" concept of GLslang is much better suited to shader programming.


demalion said:
The difference is that an 8x1 is full speed when handling any number of textures, and a 4x4 architecture is only full speed when handling multiples of 4. I get that out of your text because code optimized for the nv30 is exactly what you are saying, for example, the R300 should be able to handle, rather than the nv30 having to handle code generated to provide general case opportunities (I've even provided a list for discussion of optimizations I propose are common).

Anyways, pardon the emphasis, as we've been circumventing this idea by various paths in our discussion a few times now:
The base specification that allows applying 1 or more textures to a pixel is not optimized for 1 TMU per pipe, 1 TMU per pipe is optimized for efficiency for the base specification.

Your assertion that the DX 9 HLSL output is just as unfair as Cg's output is predicated on a flawed juxtaposition of that statement that parallels the situation with nVidia's hardware (try replacing "1" with "4"...that reflects what you'd have to say to support a similar assertion about Cg and DX 9 HLSL, and it does not make sense because that statement is exclusive of less than 4 textures, while this statement is not exclusive of 4 or more textures...).

This TMU example pretty closely parallels several aspects of the difference between the shader architectures in question. You continue to propose that because the specification allows expressing code in such a way that it is optimized for the one that has more specific requirements, despite discussion of how such optimization can hinder the performance of other architectures, it is fine to optimize code for the specific case instead of the general case. The cornerstone of this belief seems to be that this is fine because the hardware designed for the general case should be able to handle any code thrown at it at all (why bother to design something that performs well in the general case anyways?), rather than the hardware designed for the specialized case having to seek opportunities to address its own special requirements.

You proclaim that the idea of HLSL optimization for DX 9 is not significant, so it doesn't matter what optimization is used, and that continues to make no sense to me.
I certainly do not think HLSL optimization is not significant. However I also don't think there is such a thing as "the general case". There are shaders that run well on R300, others run well on both R300 and NV30, and others run best on NV30.


More later, sorry. Got to go to a birthday party :D
 
Ichneumon said:
Ante P said:
Fine, use Cg for a benchmark, but then they shouldn't claim to be so independent, the whole site sorta contradicts the fact that they are using a vendor specific compiler which nVidia themselves says outputs nV optimized code.

And has no support for PS 1.4 at all.

Why would it support something their own cards don't........

Ichy you know the deal here bro.

More bs vendor specific........
 
demalion said:
Here are some observations:
For some operations, the nv30 can execute two 2 component fp ops in the same time as it can execute one 4 component op. It can also arbitrarily swizzle. It should also be able to use its FX12 units legally in specific instances (like when adding a valid FX12 constant to values just read from a 12 bit or less integer texture format :-?).

This might be optimized uniquely for the nv30 when two of four constants are valid FX12 constants such that (this is for an integer texture format, and please forgive the amateurish nature):

for everyone:
Code:
textureload (r,g,b,a)->xtex
add xtex, (float32,float32,int12,int12)->simpleeffect
2 cycles for a card that can perform well in the general case
nv30 executing this unoptimized for fp32:
Code:
*-textureload (r,g,b,a)->xtex
**addf xtex, (float,float,int,int)->simpleeffect
3 clock cycles
Now I'm pretty puzzled. A 4-component fp addition takes 2 clock cycles on NV30? This contradicts everything I've heard so far about NV30 FP ALUs. In fact, results from thepkrl seem to indicate that it always takes one clock cycle.

demalion said:
Swizzling furthers the optimization opportunities that could be explicitly stated in the LLSL without adversely affecting the nv30, but hindering the performance for others. So do modifiers.
Arbitrary swizzling is an optional PS2.x feature. A driver that does not support it would not accept code with arbitrary swizzling.
What about modifiers? There are only 5 modifiers in PS2.0 (instruction modifiers: centroid, pp, sat; source modifiers: -, abs), and they should be "free" on any hardware.
 
Xmas said:
First, sorry demalion for picking some points to reply to and ignoring others, but I wrote that post at 5am and wanted to just write some answers I could think of in my tired state before going to bed :)

It's okay, it just leaves some important avenues of discussion dead in the water. At least now I know why, though.

demalion said:
Xmas, you are losing me here. We're talking about Cg...Cg doesn't compile to ATI specific extensions, nor is it likely to as long as ATI is committed to the standards Cg bypasses and instead chooses to do things like design their hardware for more computational efficiency (Pick some shader benchmarks comparing the cards, analyze the results, and tell me if you'd dispute even that).
I was talking about RightMark having a render path that prefers "native" extensions as opposed to ARB ones. This wasn't Cg specific, this wasn't even shader-specific. What about vertex buffer extensions?

RightMark supports PS1.4 in some tests. So it could as well support ATI's fragment shader extension in OpenGL.

The problem isn't what it could do, it is what it does (and does not) do. This seems self-evident to me, as everything I've mentioned that it should do, which included mentioning PS 1.4 and the OpenGL ATI fragment extension at least once, is something it could do. :?:

From this, I really think you missed some things I said. Please recall the beginning of this post (ending with a :-?).

That is the context of the OpenGL support.

Further:

demalion said:
Hmm... OK, let's go over this again:

There are "synthetic" tests. I only noticed files fo DirectX at the moment, but I presume they intend to implement them for OpenGL. When/if they do, of course it is not bad if they support all OpenGL extensions, but that is not at all what we are discussing. That follows...

There are "game" tests. They are implemented using Cg, period. The only optimizations this delivers are nVidia's prioritized compilation, and for those targets nVidia has defined. So, in addition to the restriction of standards to only the ones nVidia has an interest in exposing (which excludes the ATI fragment extension for OpenGL and PS 1.4), the only vendor specific optimizations offered are the ones provided by nVidia, for nVidia hardware.
... I still can't see how having the clearly-labeled choice of supporting vendor-specific extensions (which I don't think are NVidia only) can possibly be a bad thing.
I was at a loss here, as I've abundantly pointed out to you that there is no choice, and just a short while ago you seemed to realize that there wasn't one, but then I realized you said "(which I don't think are NVidia only)", and must not have read the quoted text carefully.

You really need to read my quoted text more than glancingly. OpenGL in Rightmark3D at the moment = Cg. Cg does not support any OpenGL extensions except those nVidia's hardware supports, hence the benchmark consisting of a solid black window when I run it, and the existence of only the .cg files for the game benchmarks as I already mentioned to you....

If Cg did, it would (again, repetition) either be as nVidia dictated, or by forcing ATI (and therefore other IHVs) to write a back end and circumvent the other HLSLs (which other IHVs collectively have a say in). nVidia's response to this rather obvious drawback is that Cg = DX 9 HLSL. The only guarantee we have of that is nVidia's assurances...and the indication of the opposite that we have is various clear examples of conflict of interest with the way nVidia maintains Cg, and the already observed code output differences between the two.

OpenGL in Rightmark3D COULD be something else, like Cg for nVidia, and custom extensions for anyone else capable of the tests it is using. It looks to me like this includes the R200 at least, and I actually wouldn't be surprised if it included some other hardware as well (I don't know the precise functionality exposed in OpenGL by the P10 and Parhelia).

But, it isn't. I've covered this more than a few times already.

One thing to note: The option is labeled "Preferred extensions". For R200 PS1.4 there's only an ATI extension AFAIK. For R300 PS2.0 there's only an ARB extension. For NV30 PS2.0 there are ARB and NV extensions.

I know all this, and this is directly covered repeatedly in my prior text. Ack! Please read at least the text you quoted above again; it answers everything you are stating.

demalion said:
Solutions, covered again:

The authors of Rightmark 3D support DX 9 HLSL (ignoring for the moment that the shaders in the .cg files seem short and able to be ported to the ATI fragment extension for OpenGL and the 8500/9000). This is what I referred to earlier as being as much as could be reasonably expected with limited resources (i.e., flawed, but potentially useful).

Other IHVs circumvent GLslang and DX 9 HLSL, whose specifications they happen to have a known say in, and support Cg on nVidia's terms. There does appear to be some reason for them to consider this undesirable, do you disagree?
I don't think they need "solutions", but supporting DX9 HLSL in parallel would certainly be nice to see.

Hmm...well your above comments seem to indicate you are thinking things that are simply not true at all, as far as I understand you.


...stuff I think you should read again...
I have to answer that as a whole.
The problem I'm trying to get at is, as long as you don't have any applicable metric for quality of assembly shader code except "shorter is better", how can you say one shader is better than another one?

Xmas, that initial stipulation seems completely fictional to me, and the text you quoted went to great lengths to try and illustrate that. When you have time, please read it again.
The metric for the quality of "assembly shader code" is performance.

To that end:
The metric for the quality of "DX 9 LLSL" is the ability to allow each IHV to achieve performance without precluding it for others for which the code is applicable.
The metric for the quality of "Cg's DX 9 LLSL" output is the performance of the nv30.

What you persist in maintaining is that the last goal and the prior goals are interchangeable.

The simplification of the goal to "shorter is better" is a fallacy because it ignores that it is possible for one sequence of instructions to be more efficient than another. Your statement illustrates a flaw in what you propose as the concept of "assembly": as you present it, it implies that each instruction represents the quickest unit of operation (which inherently lends itself to "shorter is better"), but that is plain wrong (even for modern CPUs, I think...again, why I think "LLSL" makes sense to use to describe the PS 2.0 shader specification). What each instruction represents is an execution concept, commonly agreed upon, that allows for the most efficient low level optimizations by IHVs.

Included in what I've pointed out is an illustration of a longer "operation count" performing as fast or faster than a shorter one, and the assertion that it is easily possible for longer "operation count" code, even expressed in the common LLSL, to execute faster for a low level optimizer for the hardware to which it is suited. Examples and specific benchmark discussion are provided to facilitate that, and I also think you should make the time to look at them.

I am particularly considering that different compilers produce different code. The point is, is there a way to tell which one is better?

Hmm...Yes? You had just quoted a long piece of text attempting to illustrate that, and it isn't productive to ask the question again without addressing the support presented there.

Is "NV30-optimized" code that takes certain characteristics into account worse than code without those optimizations?
Yes, when the nv30 performance characteristics deviate so widely from the base spec. That was the point of the TMU illustration. Of course, it isn't worse for the nv30, which is why Cg is not suitable as the sole means of producing benchmark shader code.
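A rough sketch of the arithmetic behind that illustration (my numbers are only illustrative):

Code:
3 textures per pixel:
  8x1 -> 1 TMU per pipe: 3 cycles per pixel, every TMU busy on every cycle
         (and the same holds for 1, 2, 5, or any other texture count)
  4x4 -> 4 TMUs per pipe: 1 cycle per pixel, but only 3 of the 4 TMUs do work;
         full utilization only at texture counts that are multiples of 4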

Can we take performance as a metric? We can. But in which cases?
If HLSL code runs faster than equivalent-length Cg code with a driver that is optimized for "DX9 HLSL compiler style" and for nothing else, that is self-fulfilling. Of course this is also true the other way round.

That is pretty circular, as you are again saying the assumptions made in generating the code don't matter since one architecture performs better. Again, simplifying to vendor versus vendor instead of special case versus general case. Why do you use the label "self-fulfilling"? What your example illustrates is that there is a possible performance difference between executing "Cg style" and "DX 9 HLSL" and so Cg code cannot be arbitrarily substituted, even if it is the same length. That is not self-fulfilling, that is a logical progression.

OK, let me repeat some optimization assumptions I recall from my searching of Microsoft's pages or something I read here: "Well, I suspect recommendations common to IHVs will be to utilize write masking, utilize specified macros when possible, and optimize to the most concise usage of modifiers and instructions as possible."

Keeping these in mind:
  • write masking:
    R300 benefits from write masking, so does NV30. Example case: you need to calculate a floating point scalar value. General recommendation would have you write mask it to one component (see the sketch just after this list). This allows any hardware that can increase performance for this case to do so...for the R300 it allows it to seek an opportunity to execute it concurrently with a vec3, for the NV30 it allows it to not spend multiple clock cycles writing to registers for all 4 components.
    what Cg would have reason to do: splitting a vec4 into what has been analyzed to be two integer components and two float components is an nv30 specific optimization that hinders any architecture that doesn't depend on integer processing for increased performance.
  • utilize specified macros when possible:
    SINCOS: PS 2.0 has it, so does NV30. Optimized general case DX 9 LLSL should use SINCOS, defined as a macro to let the nv30 utilize its advantage. Luckily, because the R300 was designed well, it can offer this functionality at a decent speed using a texture lookup.
  • optimize to the most concise usage of modifiers and instructions as possible:
    You maintain that this is the only criterion for Cg, and I'm stating you are demonstrably wrong by example and benchmark results. Because of the commonality of functionality expression, it is indeed a primary criterion for the DX 9 HLSL compiler (but not the only one; witness the PS_2_a target instruction count and relative performance to which I already referred you). Cg has a specific interest in violating this criterion for the sake of the actual metric the compiler is concerned with: performance. The problem is that the author of the compiler is an IHV with a specific piece of hardware to target. The more complex the language expression, and the more special case the execution characteristics of the architecture, the less correlation is possible between shorter instruction count and execution speed.
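To make the write masking recommendation concrete (the sketch promised above), here is a minimal PS 2.0 fragment; the luminance constants are purely illustrative, not anything a particular compiler emits:

Code:
ps_2_0
def c0, 0.30, 0.59, 0.11, 0.0   ; illustrative luminance weights
dcl t0.xy
dcl_2d s0
texld r0, t0, s0
dp3 r1.x, r0, c0        ; scalar result write-masked to a single component...
mul r2, r0, r1.x        ; ...then fed back through a replicate swizzle
mov oC0, r2

The write mask tells any hardware reading the LLSL that only one component matters: the R300 can look for a chance to co-issue it alongside a vec3, and the nv30 doesn't have to spend register bandwidth on the other three components. The same goes for SINCOS: emitting the macro leaves each IHV free to expand it natively or via a texture lookup.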

What if a Cg shader runs faster on hardware A and the equivalent HLSL shader runs faster on hardware B?

Use both shaders? Do you recall my initial post on the topic, and my continued statement that offering both DX 9 HLSL and Cg would be useful?
The problem isn't when Cg is faster and for what, but when it is not faster. Neither compiler is perfect, but nVidia's interest in Cg is execution speed for their hardware. That is just a plainly bad idea to use exclusively in a benchmark that purports to be cross vendor, and your counterargument to that is requiring low level optimizers to be able to handle any code at all, which does not make any sense to me.

To hopefully prevent misrepresentation:

Xmas said:
I think that any driver must be capable of optimizing any assembly shader code it gets, regardless if it was generated by Cg, DX9 HLSL compiler, any other HLSL compiler or coded in assembly.

demalion said:
btw, arbitrary swizzle is an optional feature of "PS2.x" and if a compiler would output code that uses it, it would not work on PS2.0 cards at all.

Is this in reference to the sentence after my example shader code? Thanks for the correction, but what about the rest of what I proposed?
I'll reply to that in a separate post.

Well, for that post: The possibilities for swizzle expression in the base PS 2.0 LLSL that are easily optimizable for arbitrary swizzle, and not easily optimized for less capable swizzling (but that could be optimized if expressed differently), still concern me. As you said, however, the DX 9 HLSL compiler recognizes the specific case as an output profile, which by your sometimes-applied expectations of a low level optimizer it should not bother to do. :-?
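A minimal sketch of the sort of thing I mean (the instruction choices are purely illustrative):

Code:
; an arbitrary source swizzle needs the optional ps_2_x cap;
; a plain PS 2.0 driver has to reject this outright
mad r1, r0.yxwz, c0, c1

; the PS 2.0-safe expression of the same intent goes through
; replicate swizzles and write masks instead
mov r2.x, r0.y
mov r2.y, r0.x
mov r2.z, r0.w
mov r2.w, r0.z
mad r1, r2, c0, c1

Hardware with arbitrary swizzle (the nv30, per the observations earlier) can collapse the second form back into a single instruction; hardware without it depends on the LLSL arriving already expressed in the form it prefers.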

A whole palette of undesirable possibilities are opened up by using nVidia's compiler, and the idea that IHVs should have to spend time engineering their low level optimizer for these cases when the DX 9 HLSL would already be dealing with it for each hardware does not seem valid.

...
I would also call R300 the better design. However I do not think a design that I would consider even better would have to be equally "general case compatible" and not sensitive to optimizations.

You are making absolutely no sense to me with that statement. :(

You seem to be attempting to say that handling the general case well is not necessary for a good design. As I said before, it isn't if you define the general case as the one suited for your design. That, by the way, is what self-fulfilling looks like IMO...there is no logical progression from "perform general case well" to "an even better design does not have to do that". :-?

A good design handles the general case well, and offers opportunities for increased performance readily available in the general case.

For example: Except for clock speed requirements, transistor count, and the attendant cooling solutions (i.e., efficiency), an nv30 with fp24 register combiners would be such a design. In that case, it could execute both the general case and the specific case well. The difference then would be that there wouldn't even be a reason to use Cg for PS 2.0 generation for the nv30, assuming nVidia still felt compelled to offer it as they do now.

But that's not the situation we're dealing with.

But that's a reason why I think the "integrated" concept of GLslang is much better suited to shader programming.
The distinction is that GLslang is more actively attempting to replace the low level expression of shader functionality. As I said, OpenGL does not have a strong standardized low level expression legacy to recognize (ARB fragment seems to be the intermediate step, not directly associated with GLslang evolution).

I think GLslang is a good thing too, which is why I'd look forward to GLslang versus DX ? HLSL benchmarks.

demalion said:
...more stuff I think you should read again...

I certainly do not think HLSL optimization is not significant.


You need to make up your mind...what do you think the assertion of yours that I labelled self-fulfilling proposes? I think I've provided notation of the other comments I propose state that as well.

However I also don't think there is such a thing as "the general case".

The problem is that you think this without beginning to actually address my discussion to the contrary. You did quote it, though...I hope you do so again, and answer the statements I've made in it and the expansions in this post and in some others we've referenced.

There are shaders that run well on R300, others run well on both R300 and NV30, and others run best on NV30.

This is arbitrary equivalence ignoring everything stated in what you quoted. Maybe my discussion here as well as the prior pseudo-code example will be useful.

More later, sorry. Got to go to a birthday party :D

Well, have a good time, and reply at your leisure, but Ack! please address things more productively next time.

EDIT: fixed broken list
 
Xmas said:
....
Now I'm pretty puzzled. A 4-component fp addition takes 2 clock cycles on NV30? This contradicts everything I've heard so far about NV30 FP ALUs. In fact, results from thepkrl seem to indicate that it always takes one clock cycle.

Look more closely...the delay is in outputting the 4 components, so it would perform as outlined. Also, by splitting in this way, the FX12 computation usage could be exploited for some operations.

demalion said:
Swizzling furthers the optimization opportunities that could be explicitly stated in the LLSL without adversely affecting the nv30, but hindering the performance for others. So do modifiers.
Arbitrary swizzling is an optional PS2.x feature. A driver that does not support it would not accept code with arbitrary swizzling.

Hmm...clarified in my other reply, sorry for the delay. If it makes you feel better, I made a decent curry chicken stew for the first time during the delay. :p

What about modifiers? There are only 5 modifiers in PS2.0 (instruction modifiers: centroid, pp, sat; source modifiers: -, abs), and they should be "free" on any hardware.

Yep, but each piece of hardware has specific modifiers it could use for actual low level optimization, and the differences there mean that code generated uniquely with one in mind (that would be Cg and its set of nVidia-controlled priorities) can prevent such opportunities from being easily visible.

For example, from here, we have the following list for the R300: "negate, invert, bias, scale, bias and scale, and channel replication, and instruction modifiers, such as _x2, _x4, _x8, _d2, _d4, _d8, and _sat", some of which the R300 low level optimizer might be able to use for optimization with DX 9 HLSL generated code, and which Cg generated code might easily preclude by instruction ordering changes that are valid within the spec. It is not the reordering to the advantage of the nv30 that I object to for Cg, but using that for other hardware and blithely stating no difference in optimization opportunities could result.
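A small sketch of what I mean, with made-up values (the _x2 form is PS 1.4, since these modifiers don't exist in PS 2.0):

Code:
; ps_1_4: the doubling rides along on the instruction for free
mul_x2 r0, r0, r1

; PS 2.0 LLSL: the same math has to arrive as explicit instructions,
; e.g. an extra multiply by a constant holding 2.0
def c2, 2.0, 2.0, 2.0, 2.0
mul r0, r0, r1
mul r0, r0, c2

Whether the R300's optimizer gets to fold that extra multiply back into a free scale depends on the surrounding instruction ordering, which is exactly what a compiler tuned for another architecture can scramble.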
 
flick556 said:
...

a.) nVidia is moving too fast for everyone else and even the brand spanking new DirectX 9.0 does not take advantage of some of their advanced features.

I really don't think the nv30 release schedule qualifies as "moving too fast for everyone else", and I'm not sure to what you refer about DX 9.0, but I will point out that the ps_2_a profile target for DX 9 HLSL does seem to show performance increases for the nv30 that allow it to perform better than it normally does in shader execution.

b.) I hate to keep hearing people talk about how nVidia is lowering the specs. fp32/fp18 is better than just fp24, plain and simple.

For accuracy's sake, the list of specifications with different performance characteristics are FX12/FP16/FP32.

Also, I'll point out that your FP16, FP24, and FP32 discussion is self-contradicting.

For your C, and D, there are some things I would term significant inaccuracies, but you are entitled to your opinion.

...In the end I just want to see the geforce fx do everything it says it can do and then compare it to the Radeon doing its best job.

That does seem pretty reasonable, I think.

I would be in favor of exclusive games for ati and geforce respectively just to stop them from slowing each other down and it looks like this is where things are headed.
Well, just wanted to state that I don't agree with this for reasons too lengthy to discuss in this thread right now. Prior discussions of this are available in the forum if you wish to do some searching.
 
IMO exclusive games are an indication of idiocy....... gimme a break here with that crap.

Should we have each Developer coding for separate cards? Ya that'll be real profitable and reasonably quick..... :rolleyes:
 
muzz said:
IMO exclusive games are an indication of idiocy....... gimme a break here with that crap.

Should we have each Developer coding for separate cards? Ya that'll be real profitable and reasonably quick..... :rolleyes:

If the games are better as a result I think your opinion would change.
These standards have forced nVidia and ATI cards to be too similar, I think they should be radically different and completely incompatible with each other. That would initiate real competition and not this oligopoly cartel that exists today being refereed by big bad Microsoft.
 
demalion said:
For accuracy's sake, the list of specifications with different performance characteristics are FX12/FP16/FP32.

Also, I'll point out that your FP16, FP24, and FP32 discussion is self-contradicting.

For your C, and D, there are some things I would term significant inaccuracies, but you are entitled to your opinion.

All I'm trying to say is nVidia shaders written specifically for nVidia hardware will be better than shaders written for everyone. And though I might have the abbreviations wrong, the truth is the FX supports both higher and lower floating point than ATI. And ATI only supports one middle-range FP that does not match that of nVidia or the movie industry.
 
flick556 said:
These standards have forced nVidia and ATI cards to be too similar, I think they should be radically different and completely incompatible with each other. That would initiate real competition and not this oligopoly cartel that exists today being refereed by big bad Microsoft.

I think you'd find that developers would really hate it if we went out and made our cards completely incompatible with each other. The presence of APIs that are generally supported by all cards is what makes the PC a viable gaming platform. Nothing in the standards is forcing us to go down similar routes in implementation - you can see that with R300 and NV30 our routes have actually been very different. As to which of us took the better route...

The presence of the 'referee' and standards organisations such as the OpenGL ARB protects competition, and theoretically gives smaller players than 'the big two' an opportunity to get involved in the market as well. Without the presence of standardised APIs you can certainly say goodbye to anyone else getting involved - ask yourself how difficult it would have been for 3DFX's early near-monopoly to be broken if OpenGL and Direct3D hadn't existed/been created.

nVidia is moving too fast for everyone else and even the brand spanking new DirectX 9.0 does not take advantage of some of their advanced features.

Which advanced features of nVidia's are not exposed in DX9 that you would particularly like to see?

I hate to keep hearing people talk about how nVidia is lowering the specs. fp32/fp18 is better than just fp24, plain and simple.
fp32 allows better quality and fp18 allows better speed.

If nVidia at any point in any PS2.x shader are using fixed point (FX12) processing then they are lowering the spec. If at any point they are using FP16 without the partial precision modifier being explicitly used then they are lowering the spec. If they don't do either of these things then everything is fine - if they use FP32 everywhere then they are actually exceeding the requirements of the spec - good for them.
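To illustrate with a trivial sketch (not taken from any real shader):

Code:
texld_pp r0, t0, s0     ; partial precision explicitly requested - FP16 is legal here
mad r1, r0, c0, c1      ; no _pp, so the spec requires full precision (FP24 minimum)

Dropping to FX12 anywhere, or to FP16 where no _pp is present, is where "lowering the spec" starts.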

Part of the whole FX push is being able to use the same fp32 as movies like Shrek and Toy Story. The goal is to be able to render these movies in real time (and I'm almost 100% sure they already demoed a scene from Toy Story).

AFAIK nVidia have not yet demoed any scenes from Toy Story in real time.

fp18 is great, and those screenshots floating around the net don't represent the difference between fp18 and fp24. I'm sure there was something far more dramatic occurring, like a bug, since the difference between these two specs is not something a human eye can detect. And even these issues were fixed with the newest drivers while still maintaining the high frame rates, so the performance boost is coming from somewhere else besides downgrading fp.

I'm inclined to agree with you that something more dramatic is occurring in the screenshots rather than just FP16 vs. FP24, but I'm not particularly sure that it is a bug (in the classical definition of such - an unintentional error).

As to your other point, FP16 is ok for some things, but you cannot generally write entire shader programs with it - at the very least because of errors in texture sampling. FP16 does not have enough mantissa bits (10) to allow accurate sampling once textures become fairly large. Hence any path in shaders that uses texture lookups really has to be at higher precision. FP24 was chosen at least in part because it should have sufficient range and precision for all general shader operations and can hence be used universally.
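A rough sketch of the arithmetic (approximate figures):

Code:
FP16 (s10e5): 10 mantissa bits + implicit leading 1 -> ~11 bits of precision
FP24 (s16e7): 16 mantissa bits + implicit leading 1 -> ~17 bits of precision
2048-texel texture, coordinates in [0,1): texel spacing = 1/2048 = 2^-11
  -> FP16 can barely separate adjacent texel centres, with nothing left over
     for sub-texel filtering offsets
  -> FP24 still has roughly 6 spare bits at the same texture size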

Your point about the difference between FP16 and FP24 seems to make little sense when linked with your comment about how FP32 then allows better quality. If the difference between FP16 and FP24 is "not something a human eye can detect", as you declare, then the difference between FP24 and FP32 could be regarded as even smaller in absolute terms (relatively there are the same number of bits of difference between FP16 and FP24 as between FP24 and FP32, but the errors between FP24 and FP32 will generally occur at a smaller scale). By extension of your argument we could therefore just have FP16. Clearly the designers of NV30 thought otherwise.

As to FP16 allowing greater speed, it appears that it is quite possible to make it run at a slower speed than FP24 if you design badly enough, or alternatively you can make an FP24 implementation run faster if you design well enough. Take your pick.

m$ chose not to implement all their features into DirectX 9.0, now that was a very mean thing to do.

Again - which of their features did Microsoft not include in DX9, and what evidence do you have that Microsoft did it because they were 'mean' rather than having good reasons? There is support for mixed floating point and fixed point shaders (PS1.0->1.4), and floating point shaders with both full and partial precision, including extensions for the NV30 architecture (PS2.0 and above). What omission to this spec do you find most objectionable?

They are key developers in the creation of OpenGL, they now are creating their own very capable compiler, and they're active in all types of development tools. I like the FX specifications a lot, and I can't wait for games to fully utilize them. Without Cg, most likely exporting to OpenGL, since m$ won't play nicely, these features would never get used, and those are the games I'm buying. Who in the world gave M$ or ATI the right to tell nVidia the best way to display graphics?

I could argue that ATI are at least as key to the creation of OpenGL as nVidia since ATI are a permanent member of the ARB, but I prefer to think that both companies try to lend useful experience to the advance of the OpenGL standard. It's certainly also true that nVidia are a key player in creating lots of vendor-specific extensions for developers, and I certainly don't doubt that they can create a very capable compiler that takes best advantage of their architecture.

Anyway - ATI do not tell nVidia the best way to display graphics. I like to think that at the moment we're just trying to show them the best way... ;)
 
andypski said:
Which advanced features of nVidia's are not exposed in DX9 that you would particularly like to see?

Correct me if I'm wrong, but I thought the FX's dynamic shader instructions were not supported by DirectX 9.0
 
flick556 said:
And ATI only supports one middle-range FP that does not match that of nVidia or the movie industry.

You need to move away from the term ‘supports’, to ‘calculates at’.

First off, although ‘only ATI’ supports FP24, what’s important here is that this is the base specification for DirectX, so while both ATI and MS support it, others are likely to follow: S3’s Delta Chrome has already been said to calculate at FP24 precision, as, I’ll bet, SiS’s DX9 part does.

As for the movie industry, they are more concerned with input and output formats, just so long as there aren’t errors produced from lower accuracy internally – according to NV’s docs, and some subsequent press releases we’ve seen, they use FP16 inputs and outputs, so FP24 precision is likely to be enough internally. However, as these are inputs and outputs, it’s what texture sources and targets you support that is important. Presently ATI supports a wide array of FP sources and targets under DX, including IEEE formats; as we’ve established here, with current DX drivers NVIDIA support none, and we don’t know which ones are supported under OpenGL.
 
flick556 said:
andypski said:
Which advanced features of nVidia's are not exposed in DX9 that you would particularly like to see?

Correct me if I'm wrong, but I thought the FX's dynamic shader instructions were not supported by DirectX 9.0
What isn't NV30 able to expose under DX9:
-only half the instruction count (512, rather than 1024) is possible
-no pack/unpack instructions
-each fp32 register can serve as 2 fp16 registers (providing 64 fp16 registers)
-vPos in ps_3_0 model only defines x,y components while NV30 is a ps_2_0 part and defines x,y,z,w components (and ends up unexposed in ps_2_0)
 
MDolenc said:
andypski said:
Which advanced features of nVidia's are not exposed in DX9 that you would particularly like to see?

What isn't NV30 able to expose under DX9:
-only half the instruction count (512, rather than 1024) is possible
-no pack/unpack instructions
-each fp32 register can serve as 2 fp16 registers (providing 64 fp16 registers)
-vPos in ps_3_0 model only defines x,y components while NV30 is a ps_2_0 part and defines x,y,z,w components (and ends up unexposed in ps_2_0)
-Both 512/1024 instruction count and register count are more likely design issues than features.
-packing/unpacking happens under DX9 "automagically" when reading from textures and writing to backbuffer/MET/MRT
-vPos.zw data can be passed to pixel shader by vertex shader via texture coordinates IMHO
 
demalion said:
Look more closely...the delay is in outputting the 4 components, so it would perform as outlined
I don't understand. Which delay? And where did you get that from? That's news to me.


demalion said:
What about modifiers? There are only 5 modifiers in PS2.0 (instruction modifiers: centroid, pp, sat; source modifiers: -, abs), and they should be "free" on any hardware.

Yep, but each piece of hardware has specific modifiers it could use for actual low level optimization, and the differences there mean that code generated uniquely with one in mind (that would be Cg and its set of nVidia-controlled priorities) can prevent such opportunities from being easily visible.

For example, from here, we have the following list for the R300: "negate, invert, bias, scale, bias and scale, and channel replication, and instruction modifiers, such as _x2, _x4, _x8, _d2, _d4, _d8, and _sat", some of which the R300 low level optimizer might be able to use for optimization with DX 9 HLSL generated code, and which Cg generated code might easily preclude by instruction ordering changes that are valid within the spec. It is not the reordering to the advantage of the nv30 that I object to for Cg, but using that for other hardware and blithely stating no difference in optimization opportunities could result.
Sorry, my mistake. There are no hardware-specific modifiers. Those modifiers you list for R300 are all part of the DX9 spec, but they are only listed in the PS1.x reference in the documentation, so I wasn't sure if they are available in PS2.0 too.
 
Xmas said:
demalion said:
Look more closely...the delay is in outputting the 4 components, so it would perform as outlined
I don't understand. Which delay? And where did you get that from? That's news to me.

Well, it's an example, not the whole focus of discussion, but to address it...I'm thinking you read this:

thepkrl said:
The FLOAT/TEXTURE unit can handle any instruction with any format of input or output. All instructions execute in one cycle, except for LRP,RSQ,LIT,POW which take 2 and RFL which takes 4.

But missed this:

thepkrl said:
Registers and performance:

Number of registers used affects performance. For maximum performance, it seems you can only use 2 FP32-registers. Every two new registers slow down things:
examples of the slowdowns incurred for an instruction sequence...

These are in the first post in the thread you linked.

And later, there is observation in the context of 4 component calculation:

thepkrl said:
If input regs are used in the unit they are connected to, using them is free. If they are used for FLOAT/TEXTURE coords, an extra round is needed to first store them into a temp register. For example "ADD R0,f[TEX0],f[TEX0]" takes two rounds.
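To tie that back to my example, a sketch in PS 2.0 terms (the instruction mix is made up; the point is only the number of live temporaries):

Code:
ps_2_0
dcl t0.xy
dcl_2d s0
texld r0, t0, s0
mul r1, r0, c0          ; two live FP32 temporaries: full speed per thepkrl's numbers
mad r2, r0, c1, r1      ; a third live temporary...
mad r3, r1, c2, r2      ; ...and a fourth: each extra pair costs cycles
mov oC0, r3

So a sequence that keeps several full FP32 temporaries live runs into exactly the kind of delay I outlined, even though each individual instruction is nominally one cycle.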



demalion said:
What about modifiers? There are only 5 modifiers in PS2.0 (instruction modifiers: centroid, pp, sat; source modifiers: -, abs), and they should be "free" on any hardware.

Yep, but each piece of hardware has specific modifiers it could use for actual low level optimization, and the differences there mean that code generated uniquely with one in mind (that would be Cg and its set of nVidia-controlled priorities) can prevent such opportunities from being easily visible.

For example, from here, we have the following list for the R300: "negate, invert, bias, scale, bias and scale, and channel replication, and instruction modifiers, such as _x2, _x4, _x8, _d2, _d4, _d8, and _sat", some of which the R300 low level optimizer might be able to use for optimization with DX 9 HLSL generated code, and which Cg generated code might easily preclude by instruction ordering changes that are valid within the spec. It is not the reordering to the advantage of the nv30 that I object to for Cg, but using that for other hardware and blithely stating no difference in optimization opportunities could result.
Sorry, my mistake. There are no hardware-specific modifiers.

In the PS 2.0 spec, you mean?

Those modifiers you list for R300 are all part of the DX9 spec, but they are only listed in the PS1.x reference in the documentation, so I wasn't sure if they are available in PS2.0 too.
Well, from what I understand, they are not. This list agrees with what you stated, except that it lists 2 of your items (abs, centroid) as not being for PS 2.x.

I mention this as an example of an optimization opportunity that might be hidden from the R300 by code generated specifically for the nv30.

Related to this, I found this PowerPoint viewer, and in the aforementioned GDC slides there is a discussion of the ps_2_a profile (for the DX 9 HLSL) that emphasizes exactly this type of criterion: purposefully breaking up arithmetic operation sequences with texture ops in the generated LLSL. For the nv30, this specific optimization makes sense for the ps 2.0 profile for Cg as well, at nVidia's convenience. EDIT: Forgot to finish this thought: it would make sense to facilitate the nv30 finding its own opportunities, like I outlined in the example pseudo-code, as well as to facilitate the hypothetical floating point register combiner nv30-alike design I've mentioned a few times, which would deviate from the general case (as per my 4x4 / 8x1 parallel).
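Roughly, the kind of difference in emitted LLSL I'm talking about (illustrative instructions only, declarations omitted):

Code:
; texture fetches interleaved with arithmetic, as the ps_2_a-style profile favors
texld r0, t0, s0
mad r2, r0, c0, c1
texld r1, t1, s1
mad r2, r1, c2, r2

; the "fetches first" ordering is equally valid PS 2.0, but leaves the
; interleaving for each driver's low level optimizer to rediscover
texld r0, t0, s0
texld r1, t1, s1
mad r2, r0, c0, c1
mad r2, r1, c2, r2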

As an example, the power-of-2 multiplication/division modifiers, if the R300 can do them in the same clock cycle as the R200 seems to be able to, might have such an opportunity hidden from the scope of the low level optimizer by such code.

This is the lose-lose situation of either having nVidia making such decisions or requiring IHVs to circumvent non-nVidia (i.e., cross-vendor standardized) HLSLs, to which I keep referring.
 
demalion said:
The problem isn't what it could do, it is what it does (and does not) do. This seems self-evident to me, as everything I've mentioned that it should do, which included mentioning PS 1.4 and the OpenGL ATI fragment extension at least once, is something it could do. :?:

From this, I really think you missed some things I said. Please recall the beginning of this post (ending with a :-?).
The reason RightMark uses Cg is that the developers wanted to use a HLSL and they wanted to target both OpenGL and D3D. Cg is the only available product fulfilling those requirements. While I don't think it's an optimal choice for a benchmark, I still think it's a viable decision.
In the light of using Cg to save a lot of work, implementing support for DX9 HLSL, given the similarity to Cg, seems to be reasonably easy to do. So it lends itself to being the next step when they get some time on their hands.
I do not expect them to leave their HLSL path, actually I think they should stick to it (HLSL in general, not necessarily Cg). Consequently, there will be no support for ATI's "PS1.4" OpenGL extension. But that's not NVidia's fault. And there will be no other compiler besides Cg until OpenGL2.0 finally arrives. But that's, again, not NVidia's fault.

I hope they will support GLslang when it's available, and that will put an end to this discussion as GLslang doesn't target any intermediate assembly stage. But that's something for the next version of RightMark.


demalion said:
demalion said:
Hmm... OK, let's go over this again:

There are "synthetic" tests. I only noticed files fo DirectX at the moment, but I presume they intend to implement them for OpenGL. When/if they do, of course it is not bad if they support all OpenGL extensions, but that is not at all what we are discussing. That follows...

There are "game" tests. They are implemented using Cg, period. The only optimizations this delivers are nVidia's prioritized compilation, and for those targets nVidia has defined. So, in addition to the restriction of standards to only the ones nVidia has an interest in exposing (which excludes the ATI fragment extension for OpenGL and PS 1.4), the only vendor specific optimizations offered are the ones provided by nVidia, for nVidia hardware.
... I still can't see how having the clearly-labeled choice of supporting vendor-specific extensions (which I don't think are NVidia only) can possibly be a bad thing.
I was at a loss here, as I've abundantly pointed out to you that there is no choice, and just a short while ago you seemed to realize that there wasn't one, but then I realized you said "(which I don't think are NVidia only)", and must not have read the quoted text carefully.

You really need to read my quoted text more than glancingly. OpenGL in Rightmark3D at the moment = Cg. Cg does not support any OpenGL extensions except those nVidia's hardware supports, hence the benchmark consisting of a solid black window when I run it, and the existence of only the .cg files for the game benchmarks as I already mentioned to you....

If Cg did, it would (again, repetition) either be as nVidia dictated, or by forcing ATI (and therefore other IHVs) to write a back end and circumvent the other HLSLs (which other IHVs collectively have a say in). nVidia's response to this rather obvious drawback is that Cg = DX 9 HLSL. The only guarantee we have of that is nVidia's assurances...and the indication of the opposite that we have is various clear examples of conflict of interest with the way nVidia maintains Cg, and the already observed code output differences between the two.

OpenGL in Rightmark3D COULD be something else, like Cg for nVidia, and custom extensions for anyone else capable of the tests it is using. It looks to me like this includes the R200 at least, and I actually wouldn't be surprised if it included some other hardware as well (I don't know the precise functionality exposed in OpenGL by the P10 and Parhelia).

But, it isn't. I've covered this more than a few times already.
The situation for OpenGL is: there is no HLSL other than Cg. And as the only cards suited for Cg compiled shaders in OpenGL are those supporting PS2.0/ARB fragment shader level, we can probably conclude that those game tests using Cg require that level of hardware, like the Mother Nature scene does. So there would be no point in trying to support other extensions here.

When you want to benchmark that kind of functionality, supporting hardware/extensions with reduced functionality is not an option. You would only put a lot of work into something which would not be equivalent to "the real thing" and give you skewed results.

As for the other tests that don't use Cg, do we have any information on which extensions/render path they use in OpenGL?

There is a switch in Rightmark labelled "preferred extensions" which provides the choice between "ARB" and "native". Now wouldn't it be sensible to use that switch not only to decide whether to use either NV30 or ARB fragment shader extensions for Cg, but also to decide which extensions to use for those tasks that do not involve "shading", or in all the other tests? Be it the fragment and vertex paths specific to ATI, NVidia, Matrox, or any other vendor. Or the different vertex object/vertex buffer extensions, for example.

You seem to rule out that this switch does more than set the Cg profile to either ARBFP or NV30FP. I think it does more. But even if it does not, I would still be interested in the different results. That's why I think having such an option is always good. Not because it means better results for one chip or another, but because it gives additional data for technical performance analysis of a chip.

As for the solid black window - well, this is a beta version. Seeing a black window isn't really surprising, nor does it give any conclusive proof about extension usage. Actually you're lucky, I can't get this beta to run any test at all :D And it screws up my gamma settings :devilish:


demalion said:
I don't think they need "solutions", but supporting DX9 HLSL in parallel would certainly be nice to see.

Hmm...well your above comments seem to indicate you are thinking things that are simply not true at all, as far as I understand you.
What things do you mean?


...stuff I think you should read again...
Ok, I'll do...
demalion said:
demalion said:
To the DX 9 PS 2.0 spec, as I specifically stated.
Sorry, but the DX9 PS 2.0 spec only says what kinds of shaders are valid shaders, not what kinds of shaders are good shaders. So when writing a shader you have lots of options, but no indication of whether you're on the right track or not.
Yeah, and APIs only specify what kinds of programs are valid, not what kinds are good programs. Your statement is misleading, since you are proposing that there is no relation between the specification of the instructions and modifiers and what is, subsequently, a good method of executing them. That is simply not true.

For instance, the DX 9 PS 2.0 specification does not specify that two non texture op instructions from a restricted set occur for each PS 2.0 instruction or texture op. This is a possible occurrence, but depending on it to occur for optimal performance is not a good method of execution.
Yep. That's the point. It does not specify it.
Xmas said:
the DX9 PS 2.0 spec only says what kinds of shaders are valid shaders, not what kinds of shaders are good shaders.
I was not talking about what is a good way to execute it. I was specifically talking about whether you can determine if a shader is "good" or "not so good" without executing it, only with the help of the specification. If you cannot, you also cannot state that it is off the base spec.
If you can, you have found a valid metric for shader code quality.

demalion said:
However, that is a characteristic of Cg's known target, and it is the ability to control this that nVidia is using Cg to allow them to do for shader content creation, during development and at run time, before the low level optimizer has to worry about it. They promote that precompiled code be generated using Cg, even if there is another alternative, for that reason...but, you maintain this does not matter despite evidence and analysis proposed to the contrary. :-?
I don't say it does not matter. But I think it's acceptable if the generated shader is either "good" or "indeterminable" according to what I said above.

demalion said:
We are talking about a high level compiler, Xmas. What do you think Cg is using to output to PS 2.0? You are circumventing the simple observation that compilers produce different code, by saying you are only talking about the code.
I am talking about code regardless where it came from. Simply because where it came from doesn't make it better or worse. Whether I write it in assembly or a compiler generates it, if it's the same code, it's the same code. Compilers produce different code, I'm aware of this.


Why didn't you address my instruction example at all? If it has flaws, it would be useful to discuss where I went wrong. If it doesn't, it seems more than slightly applicable. :?:
It would be good if you could clear up the issue on the add taking 2 cycles. I doubt that, but maybe you can show me proof of the opposite.

To state it clearly: I don't doubt that there is code which gives NV30 an advantage over R300 compared to other code that performs better on R300. I don't doubt you could find code with exactly the same characteristics as in your example.

But this is a two-sided argument, it also means I don't doubt there is code which gives R300 an advantage over NV30 compared to other code that performs better on NV30.

Given we have such shaders, if we cannot determine, according to the PS2.0 assembly spec, which one is "better", then it is ok to use either of them IMO.

If we can determine the "better" shader, that one should be used and the other one is only useful for specific optimizations. This might be applicable to your example.
 