Rightmark3D

demalion said:
Well, it's an example, not the whole focus of discussion, but to address it...I'm thinking you read this:

thepkrl said:
The FLOAT/TEXTURE unit can handle any instruction with any format of input or output. All instructions execute in one cycle, except for LRP,RSQ,LIT,POW which take 2 and RFL which takes 4.

But missed this:

thepkrl said:
Registers and performance:

Number of registers used affects performance. For maximum performance, it seems you can only use 2 FP32-registers. Every two new registers slow down things:
examples of the slowdowns incurred for an instruction sequence...

These are in the first post in the thread you linked.

And later, there is an observation in the context of 4-component calculation:

thepkrl said:
If input regs are used in the unit they are connected to, using them is free. If they are used for FLOAT/TEXTURE coords, an extra round is needed to first store them into a temp register. For example "ADD R0,f[TEX0],f[TEX0]" takes two rounds.

I think this part, just before what you quoted, answers this:
thepkrl said:
When I mentioned register limitations, I meant temporary registers (R1,R2,...) which you use to store intermediate results. In addition there are input and output registers (colors, texture coordinates, result), using which doesn't add extra slowdown. However, there seem to be limitations on where input registers can be efficiently used. Using a texcoord or color in fp-calculation costs an extra cycle.
So the "two FP32 registers without performance hit" is only referring to temporary registers. You're only using one temp register in your example (xtex). Also your example doesn't access input registers the way described, it only adds a constant.


demalion said:
Xmas said:
my mistake. There are no hardware-specific modifiers.
In the PS 2.0 spec, you mean?
I don't know whether _x8, _d4 and _d8 are natively supported on NV30. Those seem to be the only PS1.4 specific ones. DX will not accept shaders with modifiers it does not know.

demalion said:
Those modifiers you list for R300 are all part of the DX9 spec, but they are only listed in the PS1.x reference in the documentation, so I wasn't sure if they are available in PS2.0 too.
Well, from what I understand, they are not. This list agrees with what you stated, except that it lists 2 of your items (abs, centroid) as not being for PS 2.x.

I mention this as an example of an optimization opportunity that might be hidden from the R300 by code generated specifically for the nv30.
You're right about abs and centroid. I didn't look at the version table. However I don't think there are any modifiers the R300 supports and NV30 does not support. That said, it might even be the case that R300 (and, partly, NV30) does not natively support (meaning for free) some of the PS1.x modifiers because they are suited for integer types and more complex on float types (x2, x4, x8, d2, d4 are shift operations for integers, but for floats that's adding or subtracting a number from the exponent)

demalion said:
Related to this, I found this PowerPoint viewer, and in the aforementioned GDC slides there is a discussion of the ps_2_a profile (for the DX 9 HLSL) that emphasizes exactly this kind of criterion: purposefully breaking up arithmetic operation sequences with texture ops in the generated LLSL. For the nv30, this specific optimization makes sense for the ps 2.0 profile for Cg as well, at nVidia's convenience. EDIT: Forgot to finish this thought: it would make sense to facilitate the nv30 finding its own opportunities, like I outlined in the example pseudo-code, and also to facilitate the hypothetical floating point register combiner nv30-alike design I've mentioned a few times, which would deviate from the general case (as per my 4x4 / 8x1 parallel).
As long as that kind of optimization doesn't slow down other cards, i think anyone can be happy with it.
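For illustration, the reordering described above might look something like this in LLSL (a hypothetical sketch, not taken from the GDC slides or from any compiler's output). Both orderings are valid ps_2_0 and compute the same result; they differ only in whether the texture fetches are clumped together or interleaved with the arithmetic.

    ps_2_0
    dcl t0.xy
    dcl t1.xy
    dcl_2d s0
    dcl_2d s1

    // Clumped: both fetches first, then the arithmetic
    texld r0, t0, s0
    texld r1, t1, s1
    mul   r2, r0, r0
    mad   r2, r1, r1, r2
    mov   oC0, r2

    // Interleaved alternative (same result), of the kind the
    // ps_2_a discussion describes:
    // texld r0, t0, s0
    // mul   r2, r0, r0
    // texld r1, t1, s1
    // mad   r2, r1, r1, r2
    // mov   oC0, r2

Whether the interleaved form helps, hurts, or makes no difference on a given chip is what the rest of this exchange argues about; the sketch only shows that the choice exists at the LLSL level.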

As an example, the power of 2 multiplication/division modifiers, if the R300 can do them in the same clock cycle as the R200 seems to be able to, might have such an opportunity hidden from the scope of the low level optimizer by such code.
If that modifier is not supported in PS2.0 (as it seems), then how would there be a way for R300 to use it?
 
Xmas said:
demalion said:
The problem isn't what it could do, it is what it does (and does not) do. This seems self-evident to me, as everything I've mentioned that it should do, which included mentioning PS 1.4 and the OpenGL ATI fragment extension once at the very least, is something it could do. :?:

From this, I really think you missed some things I said. Please recall the beginning of this post (ending with a :-?).
The reason RightMark uses Cg is that the developers wanted to use a HLSL and they wanted to target both OpenGL and D3D. Cg is the only available product fulfilling those requirements.

I think you need to re-read this post and this post from me.

I think you should because you are replying as if my own statements do not include such in their considerations...an example:

demalion said:
If it is a Cg benchmark, it is a Cg benchmark, but that's what it should be called because that really is not representative of DX 9's tools for achieving the same thing at this time. If it is a DX 9 HLSL and Cg benchmark, that's even a good thing I think. I actually think this is quite possible, since I seem to remember PS 1.4 support in the prior releases.

Xmas said:
While I don't think it's an optimal choice for a benchmark, I still think it's a viable decision.

I think your statement here contradicts those you made which prompted me to address your comments, as evident in several places in our discussion, including the above excerpt from a reply of mine disagreeing with you...for example:

Xmas said (http://www.beyond3d.com/forum/viewtopic.php?p=111014#111014):
Ratchet said:
I'm not going to claim that I know even a fraction of what most of the people here know about video hardware and all that... but a benchmark that uses Cg? Seems like that would invalidate the whole purpose of a benchmark in the first place, would it not?
Why? If they use the generic profiles, I don't see a problem here. It's not much different from using the DX9 HLSL compiler.

This sentiment has been consistently present in your discussion, including your argument that the low level compiler should be able to handle any compiler's code equally well (hence the branch of our discussion indicating how that is counter to what is indicated to be the case in reality). Your conclusion has remained the same, but you are stating the premise differently here, while your argument still follows the statement you made in this quote.

In the light of using Cg to save a lot of work, implementing support for DX9 HLSL, given the similarity to Cg, seems to be reasonably easy to do. So it lends itself to being the next step when they got some time at their hands.

I again refer you to my prior statements addressing this. I'll quote some pertinent highlights shortly.

I do not expect them to leave their HLSL path, actually I think they should stick to it (HLSL in general, not necessarily Cg).
Well, then it has the flaw of not representing the technical abilities of all cards, as I've also stated prior.
Consequently, there will be no support for ATI's "PS1.4" OpenGL extension. But that's not NVidia's fault.
That Rightmark has this flaw as a general benchmark, as opposed to an nVidia card benchmark, is not nVidia's fault? That's true. But it does represent a fault in Cg that invalidates its use for representative cross-vendor benchmarking.

As for the authors of Rightmark, note a prior comment on my part:
demalion said (http://www.beyond3d.com/forum/viewtopic.php?p=111449#111449):
It lists "ARB" and "Native", so yes, and there are shaders for fixed, partial, and full precision (this is referring to the synthetic tests). Don't know what the scenes (game scenes, in the context of Cg as being discussed in the post) look like, I don't have hardware that supports any of Cg's targets. If they supported DX 9 HLSL they would have done as much as could be reasonably expected for the time being with limited resources.

Restatement of the above commentary in the context of the discussion in which it was made:

demalion said (http://www.beyond3d.com/forum/viewtopic.php?p=111653#111653):
The authors of Rightmark 3D support DX 9 HLSL (ignoring for the moment that the shaders in the .cg files seem short and able to be ported to the ATI fragment extension for OpenGL and the 8500/900). This is what I referred to earlier as being as much as could be reasonably expected with limited resources (i.e., flawed, but potentially useful).

And there will be no other compiler besides Cg until OpenGL2.0 finally arrives. But that's, again, not NVidia's fault.

True, but the entire context of convenience of the Rightmark3D developers and representative benchmarking was already addressed. Please take a look at the shader files in question and note that they are not necessarily onerous to implement at low level, and that DX9 HLSL usage to generate a PS 1.4-alike template might make it simpler still. Again, they are short enough that I don't think even that would be required except for comparison.

I hope they will support GLslang when it's available, and that will put an end to this discussion as GLslang doesn't target any intermediate assembly stage. But that's something for the next version of RightMark.

I'll point out that the beginning of this discussion is due to Rightmark existing in this state now. Note the pertinence of commentary in my prior discussion.

Also, GLslang will not put an end to this discussion, just remove the reason of convenience you propose for Rightmark 3D to use Cg (once it arrives, the tools to express the functionality in OpenGL will exist beyond Cg).
Actually, as far as just ATI and nVidia go: if a text file can be used to specify the R200 OpenGL extension, providing support for implementing that for the game tests (and doing this and the equivalent for the applicable nVidia OpenGL extensions for the synthetic tests) would be a significant step. Heck, within the context of being an "Open Source benchmark", just adding support for reading in user-viewable and modifiable files representing low level instruction shaders to be applied to the tests, for both OpenGL and DirectX (already done for the synthetics, still needed for the game tests), for all the cards for which it claims to be applicable, would remove the flaw I mention, once those files were actually included and reviewed.

As I said earlier, that's why I'm not on a "Rightmark3D bashing" bandwagon, but on a "Rightmark 3D has issues that need to be addressed" bandwagon.

The situation for OpenGL is: there is no HLSL other than Cg. And as the only cards suited for Cg compiled shaders in OpenGL are those supporting PS2.0/ARB fragment shader level, we can probably conclude that those game tests using Cg require that level of hardware, like the Mother Nature scene does.
Why are you talking as if I hadn't addressed this exact point already? Ack!! I've reached my quote cutoff for the night, heh, you can go find it yourself...it really isn't too hard to spot, and is probably represented at least once in what I link to above.

So there would be no point in trying to support other extensions here.

You keep on making arbitrary stipulations and then proposing a conclusion based on them, even after I've provided direct opportunity to discuss why I think the stipulation is mistaken, or have provided evidence and analysis specifically indicating the opposite of the stipulation. :-?

When you want to benchmark that kind of functionality, supporting hardware/extensions with reduced functionality is not an option.

You then go on to base a definitive statement on your stipulation as if you've established it as factual, also circumventing discussion, as repeated just above, of what has actually been observed. In this case, I have only my own observation, but the actual shader files are available to you, and anyone, to dispute my interpretation...yet you do this instead. Even so, again for this particular part of the discussion, I have a whole line of other discussion that I've been repeating several times, which remains applicable even if my observation is in question, yet we never get to discuss it because you propose statements without basis instead.

You would only put a lot of work into something which would not be equivalent to "the real thing" and give you skewed results.

By this you seem to state that they are justified in using Cg for convenience because the results can't be skewed, which again contradicts what you purported to recognize about HLSL at the beginning of your post.

As for the other tests that don't use Cg, do we have any information on which extensions/render path they use in OpenGL?

They don't, as far as I know and have been able to ascertain by running it and by looking at the files in its directory. As I already outlined, I only saw files with names indicating DirectX shader functionality.

That's the type of comment that causes me to refer you to previous posts.

I understand from your prior comment about waiting for Rightmark3D to download that you might not have it available to view for yourself yet, but that shouldn't stop you from recognizing my observation in the meantime.

There is a switch in Rightmark labelled "preferred extensions" which provides the choice between "ARB" and "native". Now wouldn't it be sensible to use that switch not only to decide whether to use the NV30 or ARB fragment shader extensions for Cg, but also to decide which extensions to use for those tasks that do not involve "shading", or in all the other tests? Be it the fragment and vertex paths specific to ATI, NVidia, Matrox, or any other vendor. Or the different vertex object/vertex buffer extensions, for example.

You keep asking questions to which I think my answer has been obviously stated, i.e., YES. This, I feel, was completely covered in our prior "The problem is not what it could do, but what it does do" discussion you just quoted!

You seem to preclude that this switch does more than setting the Cg profile to either ARBFP or NV30FP.

And that is based on what I actually observed...Ack! The list for the synthetic tests includes PS 1.4 and VS 1.1, which can be expressed using existing OpenGL extensions for the 8500, and yet there is no option listed to target OpenGL. Also, the game tests seem to be only expressed using Cg, and to be the only tests for which the OpenGL tab applies (following from the above). Therefore, the switch does nothing more "than setting the Cg profile to either ARBFP or NV30FP".

This is a repeat of information directly stated in my prior posts. :!:

I think it does more.

This fits the template of what you asked me to clarify later, and is not unique to this post.

But even if it does not, I would still be interested in the different results.

That's fine for a nv30 evaluation.

That's why I think having such an option is always good. Not because it means better results for one chip or another, but because it gives additional data for technical performance analysis of a chip.

Yes, as long as it doesn't purport to represent results between different chips that are not represented as equivalently as possible for a given standard, or set of standards. OpenGL and DirectX are such standards, Cg is not, except for the nVidia hardware, and it is wrong for Rightmark 3D to use it in place of such and represent itself as a benchmark for them. Again, this is a restatement of my prior posts.

As for the solid black window - well, this is a beta version. Seeing a black window isn't really surprising, nor does it give any conclusive proof about extension usage.

That's an observed result that confirms my conclusion, which was in turn based on observations already repeatedly outlined, and you can evaluate that for yourself, just please recognize that it exists.
It is not presented as "conclusive proof".

Actually you're lucky, I don't get this beta to run any test at all :D And it screws my gamma settings :devilish:

So you do have it? Tell me if you find indication of OpenGL being used for the synthetic tests, and then we can discuss this particular issue in relation to the synthetic tests with some hope of not being circular. My line of reasoning is above, as well as my answer to what I think of it if you do indeed find such indication.

demalion said:
I don't think they need "solutions", but supporting DX9 HLSL in parallel would certainly be nice to see.

Hmm...well your above comments seem to indicate you are thinking things that are simply not true at all, as far as I understand you.
What things do you mean?

One example is highlighted above, which applies to what you have presented before. Your assertions relating to good assembly and low level optimizers being able to handle any code thrown at them are also things I believe I have given good indication were simply not true, by example and analysis, among other things we've tried to discuss.

demalion said:
demalion said:
To the DX 9 PS 2.0 spec, as I specifically stated.
Sorry, but the DX9 PS 2.0 spec only says what kinds of shaders are valid shaders, not what kinds of shaders are good shaders. So when writing a shader you have lots of options, but no indication of whether you're on the right track or not.
Yeah, and APIs only specify what kind of programs are valid, not what kind are good programs. Your statement is misleading, since you are proposing that there is no relation between the specification of the instructions and modifiers and what is subsequently a good method of executing them. That is simply not true.

For instance, the DX 9 PS 2.0 specification does not specify that two non texture op instructions from a restricted set occur for each PS 2.0 instruction or texture op. This is a possible occurrence, but depending on it to occur for optimal performance is not a good method of execution.
Yep. That's the point. It does not specify it.
By this you state that because the LLSL does not specify that you can NOT do this, it doesn't matter that Cg has unique reason to do this for the nv30, and that this is not significant. Again, your actual argument contradicts the statement you initially proposed as defining (in this post, at least) your stance on using Cg in this way.

This is the type of thing that prompts me to ask you to make up your mind, and I don't think I ask that unfairly.


Xmas said:
the DX9 PS 2.0 spec only says what kinds of shaders are valid shaders, not what kinds of shaders are good shaders.
I was not talking about what is a good way to execute it. I was specifically talking about whether you can determine if a shader is "good" or "not so good" without executing it, only with the help of the specification.

:oops:

Your comment seems to be based on the premise that you can't determine whether a specific shader is a good way of executing the function without actually executing it, as that is the only case I can see in which those two sentences do not completely contradict each other, assuming compilers can actually optimize. :-?

If you can not, you also can not state that it is off the base spec.
If you can, you have found a valid metric for shader code quality.

I'll throw in a highlight here because I think this illustrates the same issue of ignoring my discussion and things that seem to me are already readily evident.

This is the type of comment that prompts me to refer to my 4x4 and 8x1 by parallel which I consider particularly illustrative of the fallacy of what you propose is the case among those two.

I don't say it does not matter. But I think it's acceptable if the generated shader is either "good" or "indeterminable" according to what I said above.

Then you are saying the HLSL compiler used does not matter for your definition of "acceptable".
There are just too many conflicts in your statements left unresolved.

demalion said:
We are talking about a high level compiler, Xmas. What do you think Cg is using to output to PS 2.0? You are circumventing the simple observation that compilers produce different code, by saying you are only talking about the code.
I am talking about code regardless of where it came from, simply because where it came from doesn't make it better or worse.

If I had an Exorcist emote (without the overt hostility of something disgusting like vomiting, to keep things civil :p), this would be the place it belonged.

I think my above comment about unresolved contradictions applies.

Whether I write it in assembly or a compiler generates it, if it's the same code, it's the same code. Compilers produce different code, I'm aware of this.

Please search for my comments along the lines of "you can hand code bad code, compile bad code with Cg, or compile bad code with HLSL" or something like that, as I would have to quote it for you to address this sentiment again.

Why didn't you address my instruction example at all? If it has flaws, it would be useful to discuss where I went wrong. If it doesn't, it seems more than slightly applicable. :?:
It would be good if you could clear up the issue on the add taking 2 cycles. I doubt that, but maybe you can show me proof of the opposite.

No highlight is suitable now, because we're actually discussing this usefully now, albeit it took a while to start doing so. Response in the other post (didn't take me long to reply, but I'm posting this first for chronological consistency).

To state it clearly: I don't doubt that there is code which gives NV30 an advantage over R300 compared to other code that performs better on R300. I don't doubt you could find code with exactly the same characteristics as in your example.

But this is a two-sided argument, it also means I don't doubt there is code which gives R300 an advantage over NV30 compared to other code that performs better on NV30.

OK, this is another place for the head spinning emote I wish I had. Please put 4x4 and 8x1 as example substitutes into your discussion above, and refer to my prior comments to aid in illustrating, again, the fallacy of your statement.

Given we have such shaders, if we cannot determine, according to the PS2.0 assembly spec, which one is "better", then it is ok to use either of them IMO.

This is based on the spec not explicitly stating how you should code, therefore you can't make an assumption about what is the general case or not, never mind the simple observation that "the more unique restrictions on the code structure you entail, the less general case it is".

If we can determine the "better" shader, that one should be used and the other one is only useful for specific optimizations. This might be applicable to your example.

I've already discussed this in depth, including example optimizations which you have not recognized in making this statement (which reminds me that a highlight might be helpful to address your maintaining that this is not the case).
 
Xmas said:
...
I think this part just before what you quoted answers this:
thepkrl said:
When I mentioned register limitations, I meant temporary registers (R1,R2,...) which you use to store intermediate results. In addition there are input and output registers (colors, texture coordinates, result), using which doesn't add extra slowdown. However, there seem to be limitations on where input registers can be efficiently used. Using a texcoord or color in fp-calculation costs an extra cycle.
So the "two FP32 registers without performance hit" is only referring to temporary registers.

He seems to be talking about 2 components, not writing to 2 or more registers simultaneously... :?: I'm a bit confused by what you are saying.

EDIT: Ah, I'm redefining his statement for the convenience of my example, I see what you are thinking now. This illustrates something I believe is associated with the pack/unpack functionality in the units under discussion, as what he is stating is made equivalent to what I'm stating by using such functionality for register usage, which is also illustrated by his discussion of storing two 16-bit value sets in one 32-bit register.

You're only using one temp register in your example (xtex).

With 4 components, not 2. EDIT: I see more clearly what you are saying now, it is a good thing I looked again after I finished the other post.

I quote again:

thepkrl said:
If input regs are used in the unit they are connected to, using them is free. If they are used for FLOAT/TEXTURE coords, an extra round is needed to first store them into a temp register. For example "ADD R0,f[TEX0],f[TEX0]" takes two rounds.


Xmas said:
Also your example doesn't access input registers the way described, it only adds a constant.
Yes, it does access input registers the way described (which is "free" for integer components at least)...the performance hit we are discussing is the output, as I specifically said.

EDIT: Hmm...Oh, I see the contradictions, the text is talking about input, heh.

When that result is being used as an integer (i.e., at screen output, or for the mixed mode FX12 limited shader paradigm), it allows a speed up. What my example should be stipulating, as in my mind (sorry, I did not express this at all clearly, it was just implicit in my thinking), is that you are actually going to use the floating point values for further calculation (at which point you actually access them).

My use of the word output is indeed misleading because the slowdown in my example is in using the floating point precision of the output, which I took to be a given, but I did not establish. As we both recognize, calculating at fp32 is what is stated not to be a slow down, and my evaluation of speed occurs in the context of "deciding what format to output". Both educational in needing to take care with such assumptions, and possibly illustrative of the implementation detail divergence to which I referred elsewhere. :)

I still feel like I need a head spinning emote for other things though. :p
[/EDIT]

demalion said:
Xmas said:
my mistake. There are no hardware-specific modifiers.
In the PS 2.0 spec, you mean?
I don't know whether _x8, _d4 and _d8 are natively supported on NV30. Those seem to be the only PS1.4 specific ones. DX will not accept shaders with modifiers it does not know.

It seems pretty clearly expressed to me that I'm not talking about expressing this in the LLSL, but about opportunities expressed in the actual "assembly" of the specific card that the driver low level optimizer converts it to.

demalion said:
Those modifiers you list for R300 are all part of the DX9 spec, but they are only listed in the PS1.x reference in the documentation, so I wasn't sure if they are available in PS2.0 too.
Well, from what I understand, they are not. This list agrees with what you stated, except that it lists 2 of your items (abs, centroid) as not being for PS 2.x.

I mention this as an example of an optimization opportunity that might be hidden from the R300 by code generated specifically for the nv30.
You're right about abs and centroid. I didn't look at the version table. However I don't think there are any modifiers the R300 supports and NV30 does not support.

Hmm...well, you are stipulating things you think without any provided reason. I'd say the stated PS 1.4 specification gives reason to indicate something to the contrary, as well as other clearly exhibited factors illustrating their divergent designs.

That said, it might even be the case that R300 (and, partly, NV30) does not natively support (meaning for free) some of the PS1.x modifiers because they are suited for integer types and more complex on float types (x2, x4, x8, d2, d4 are shift operations for integers, but for floats that's adding or subtracting a number from the exponent)
That is quite true for that example, but the case I mentioned for looking for integer valued calculations for the nv30 would also be applicable. It would not be as universal an opportunity as I implied by context, however.

demalion said:
Related to this, I found this PowerPoint viewer, and in the aforementioned GDC slides there is a discussion of the ps_2_a profile (for the DX 9 HLSL) that emphasizes exactly this kind of criterion: purposefully breaking up arithmetic operation sequences with texture ops in the generated LLSL. For the nv30, this specific optimization makes sense for the ps 2.0 profile for Cg as well, at nVidia's convenience. EDIT: Forgot to finish this thought: it would make sense to facilitate the nv30 finding its own opportunities, like I outlined in the example pseudo-code, and also to facilitate the hypothetical floating point register combiner nv30-alike design I've mentioned a few times, which would deviate from the general case (as per my 4x4 / 8x1 parallel).
As long as that kind of optimization doesn't slow down other cards, i think anyone can be happy with it.
This is what I mean by making up your mind: do the HLSL optimizations expressed in the LLSL matter (as per the beginning of your prior post), or do they not (as per other statements, including the one quoted in my reply to that post).

As an example, the power of 2 multiplication/division modifiers, if the R300 can do them in the same clock cycle as the R200 seems to be able to, might have such an opportunity hidden from the scope of the low level optimizer by such code.
If that modifier is not supported in PS2.0 (as it seems), then how would there be a way for R300 to use it?

By the low level optimizer. When I say "visible opportunities", it continues to mean opportunities of which the actual architecture instruction execution model is capable. That seems to be plainly stated. The statement you're making there makes no sense to me.
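For a concrete (and hedged) illustration of what "visible to the low level optimizer" can mean here, assuming, as suggested above, that the hardware really can apply a power-of-two scale for free: PS 1.4 exposes that scale as an instruction modifier, while ps_2_0 has to spell it out as a constant multiply, which a driver can only fold back in if it can still recognize the pattern. This is a hypothetical sketch, not output from any compiler.

    // PS 1.4 folds the doubling into the instruction itself:
    ps_1_4
    texld r0, t0
    texld r1, t1
    mul_x2 r0, r0, r1      // multiply, with the x2 modifier applied for free

    // A ps_2_0 version of the same math has no _x2 modifier, so the
    // scale must appear as an explicit constant multiply; a low level
    // optimizer can only recover the modifier if it can still see
    // that c0 is exactly 2.0:
    // ps_2_0
    // dcl t0.xy
    // dcl t1.xy
    // dcl_2d s0
    // dcl_2d s1
    // def c0, 2.0, 2.0, 2.0, 2.0
    // texld r0, t0, s0
    // texld r1, t1, s1
    // mul r0, r0, r1
    // mul r0, r0, c0
    // mov oC0, r0

If the generated code instead folds that 2.0 into some other constant, or reorders the math around it, the pattern is no longer apparent, which is the sense in which such an opportunity can be hidden from the optimizer.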
 
MDolenc said:
...
What isn't NV30 able to expose under DX9:
-only half (512) instructions are possible
...

Isn't this statement a bit inaccurate? I thought the reason for this was to be able to guarantee constant usage without introducing complexities like "1024 instructions with no constants used". This is based on my understanding of constants taking instruction slots for the nv30, so if anyone understands differently just correct me.
 
demalion said:
MDolenc said:
...
What isn't NV30 able to expose under DX9:
-only half (512) instructions are possible
...
Isn't this statement a bit inaccurate? I thought the reason for this was to be able to guarantee constant usage without introducing complexities like "1024 instructions with no constants used". This is based on my understanding of constants taking instruction slots for the nv30, so if anyone understands differently just correct me.
Precisely.
DX as a general API can't expose all the chip details of different HW vendors. This is why multihead is exposed at the API level only in DX9, although Matrox had DualHead in the DX6 timeframe, DX8 eliminated PS 0.5 for the original Radeons and GF1-2, etc... Let's _imagine_ that SiS introduces HW that stores code, constants _and_ some other registers in the same fixed-size pool (actually this is a really bad idea, shown here just as an example) - should another strange validation rule be introduced to DX9?
 
demalion,
I really think there is no point in discussing this (the longer one of your last two posts) any further.
You tell me to re-read some of your postings, so I didn't seem to get what you intended. And on the other hand, I have the feeling that some of your replies are not related to those quotes you are replying to, so you must have interpreted them differently than I intended.
Maybe I'm unable to express my thoughts correctly and understandably. And sometimes I have a really hard time trying to figure out the meaning of a sentence of yours.
You think that there are contradictions in my statements. At least I can't find any contradictions in my thoughts on the topic.

And if the base of the discussion is flawed, it's pointless to go further.

Oh, some things I wanted to add:
If I state something, it doesn't mean I'm contradicting what you said.
If I state something you've already discussed, it doesn't mean I haven't read what you wrote. Just that my opinion is different.
If I state something that has been stated before, in similar or different form, it doesn't mean I haven't read this, but that I use this statement to provide the context for another statement.

I do not have the time nor the will to discuss every point in a lengthy discussion like this. It's just not possible.
 
An incomplete synopsis of my view of the conversation

The problem with the discussion as I see it, is that I provided direct examples that provided a clear path for discussion, and the avenues of replies you took consistently failed to address them, snowballing into full length discussions of their own without resolving the premises that prompted them.

You first stated that it was pointless to object to Rightmark 3D using Cg as benchmark, and supported this with, for example, statements like that a low level optimizer in the driver should be able to handle any code thrown at it, so if Cg differed from DX 9 HLSL in the LLSL code it produced, that did not invalidate it for applicability to other cards no matter how that code looked.

To further support that, you went on to say that the length of code was the only metric for good code, so naturally Cg would produce short code that would be just as fast for the R300 as for the nv30, and that the structure of the code did not matter. I provided several discussions of indications to the contrary, including several direct examples of instruction count and performance metrics, analysis of some principles of the nv30 architecture that seem related to it, and supported these discussions with references to several sources that could be discussed clearly and directly, including instruction-specific benchmarking and the resultant pipeline characteristics observed, Microsoft's GDC PowerPoint presentation illustrating the difference in the code output proposed for the PS 2.0 and PS 2.a specifications, and also my understanding of John Carmack's comments comparing the nv30 to the r300.

Alongside this, you discussed that the OpenGL options (ARB and Native) represented useful functionality in Rightmark3D, which to my observation remains an erroneous statement; I informed you of my observations indicating that they simply offered different options for Cg, and did not expose any other functionality except what nVidia cards support (in direct contradiction to what you indicated was the case to support that Rightmark3D was a good benchmark right now), and after repeating those observations several times, I simply asked you to check for yourself and provide your observations of its behavior supporting your analysis.

I think there are very clearly things, similar to this last example, that seem erroneous in your comments, and I've tried to point out clear illustration of why I thought so, ranging from several examples I spent time conceiving to illustrate what I was saying, to simply asking you to look for yourself at something you could independently verify, and then confirm or deny if your observations concurred with what I was saying.
I think the conversation ended up where it did because you consistently chose not to address these things with anything besides new statements that offered what you said was your opinion without providing any support for it.
I don't consider this recap a complete list of the items that cause me to believe this.

...


What I find frustrating is that you purport that your initial statements are not incorrect or contradictory, after having failed to provide any support for them, or any reasons why my assertions of contradiction are incorrect.
If the only way to address that frustration is to end the discussion, so be it.

:-?
 
demalion said:
He seems to be talking about 2 components, not writing to 2 or more registers simultaneously... :?: I'm a bit confused by what you are saying.

EDIT: Ah, I'm redefining his statement for the convenience of my example, I see what you are thinking now. This illustrates something I believe is associated with the pack/unpack functionality in the units under discussion, as what he is stating is made equivalent to what I'm stating by using such functionality for register usage, which is also illustrated by his discussion of storing two 16-bit value sets in one 32-bit register.

You're only using one temp register in your example (xtex).

With 4 components, not 2. EDIT: I see more clearly what you are saying now, it is a good thing I looked again after I finished the other post.
I'm absolutely sure he is referring to full fp32 4-component temporary registers. Not writing to them simultaneously, but generally using them in the shader.

demalion said:
I quote again:

thepkrl said:
If input regs are used in the unit they are connected to, using them is free. If they are used for FLOAT/TEXTURE coords, an extra round is needed to first store them into a temp register. For example "ADD R0,f[TEX0],f[TEX0]" takes two rounds.


Xmas said:
Also your example doesn't access input registers the way described, it only adds a constant.
Yes, it does access input registers the way described (which is "free" for integer components at least)...the performance hit we are discussing is the output, as I specifically said.

EDIT: Hmm...Oh, I see the contradictions, the text is talking about input, heh.
The way I understand the quote:
It's about the interpolated texture coordinate registers and the interpolated color registers. If you use texture coordinates in any non-texture-sampling instruction, it takes an extra cycle. If you use interpolated color in float/tex operations, it takes an extra cycle.


demalion said:
When that result is being used as an integer (i.e., at screen output, or for the mixed mode FX12 limited shader paradigm), it allows a speed up. What my example should be stipulating, as in my mind (sorry, I did not express this at all clearly, it was just implicit in my thinking), is that you are actually going to use the floating point values for further calculation (at which point you actually access them).

My use of the word output is indeed misleading because the slowdown in my example is in using the floating point precision of the output, which I took to be a given, but I did not establish. As we both recognize, calculating at fp32 is what is stated not to be a slow down, and my evaluation of speed occurs in the context of "deciding what format to output". Both educational in needing to take care with such assumptions, and possibly illustrative of the implementation detail divergence to which I referred elsewhere. :)

I still feel like I need a head spinning emote for other things though. :p
As far as I understand it, neither using the result from the add(f) in further fp calculations nor outputting it to the color output register costs any extra cycles.

demalion said:
demalion said:
Xmas said:
my mistake. There are no hardware-specific modifiers.
In the PS 2.0 spec, you mean?
I don't know whether _x8, _d4 and _d8 are natively supported on NV30. Those seem to be the only PS1.4 specific ones. DX will not accept shaders with modifiers it does not know.
It seems pretty clearly expressed to me that I'm not talking about expressing this in the LLSL, but about opportunities expressed in the actual "assembly" of the specific card that the driver low level optimizer converts it to.
If NV30 supports these modifiers, it's equal opportunities, and not hardware-specific.

demalion said:
You're right about abs and centroid. I didn't look at the version table. However I don't think there are any modifiers the R300 supports and NV30 does not support.
Hmm...well, you are stipulating things you think without any provided reason. I'd say the stated PS 1.4 specification gives reason to indicate something to the contrary, as well as other clearly exhibited factors illustrating their divergent designs.
I'm not stipulating things. I'm saying what I think is most likely. That's simply my opinion. Their designs are divergent but still there are a lot of similarities. PS1.4 spec only indicates that three modifiers (x8, d4, d8) are not supported by the NV30 integer units, and assuming the NV30 register combiners are identical to the NV2x ones (besides precision/range), the OpenGL register combiner extension spec proves this.
But for the float units, I think it is equally likely for both to either support or not support those modifiers natively.

demalion said:
That said, it might even be the case that R300 (and, partly, NV30) does not natively support (meaning for free) some of the PS1.x modifiers because they are suited for integer types and more complex on float types (x2, x4, x8, d2, d4 are shift operations for integers, but for floats that's adding or subtracting a number from the exponent)
That is quite true for that example, but the case I mentioned for looking for integer valued calculations for the nv30 would also be applicable. It would not be as universal an opportunity as I implied by context, however.
Sorry, I'm not sure I understand what you're saying here.

demalion said:
As long as that kind of optimization doesn't slow down other cards, i think anyone can be happy with it.
This is what I mean by making up your mind: do the HLSL optimizations expressed in the LLSL matter (as per the beginning of your prior post), or do they not (as per other statements, including the one quoted in my reply to that post).
I never said they do not matter. If they did not matter, they wouldn't be there.

demalion said:
If that modifier is not supported in PS2.0 (as it seems), then how would there be a way for R300 to use it?
By the low level optimizer. When I say "visible opportunities", it continues to mean opportunities of which the actual architecture instruction execution model is capable. That seems to be plainly stated. The statement you're making there makes no sense to me.
My fault. Could you give an example of how you think such opportunities can be hidden?

btw, seeing your last post, I think it was a good decision to stop that discussion here. It shows you have not understood many things I said like I intended them to be understood. I admit I think this is mostly my fault. But the discussion didn't work this way. It's also frustrating for me.
 
I vote only one thing can be discussed at a time. My eyes roll back and my tongue lolls out when I see the point-for-point tete-a-tete that goes on ad nauseam.

(Though, I should just shut up and let you guys keep talking. ;) )
 
Progress?

Xmas said:
demalion said:
He seems to be talking about 2 components, not writing to 2 or more registers simultaneously... :?: I'm a bit confused by what you are saying.

EDIT: Ah, I'm redefining his statement for the convenience of my example, I see what you are thinking now. This illustrates something I believe is associated with the pack/unpack functionality in the units under discussion, as what he is stating is made equivalent to what I'm stating by using such functionality for register usage, which is also illustrated by his discussion of storing two 16-bit value sets in one 32-bit register.

You're only using one temp register in your example (xtex).

With 4 components, not 2. EDIT: I see more clearly what you are saying now, it is a good thing I looked again after I finished the other post.
I'm absolutely sure he is referring to full fp32 4-component temporary registers. Not writing to them simultaneously, but generally using them in the shader.

Any interpretation at all of the wording establishes directly that short shader code is not the only metric for performance for the nv30 (I consider this the self-evident criterion for evaluation here, since Rightmark3D is a benchmark that measures performance, though you continue to avoid recognizing this), which then directly establishes the problem with using the code nVidia's Cg compiler generates for a general benchmark. Since this is the point of my example in the first place, I continue discussing this here primarily to recognize the problems with the way I proposed my example, and also to further provide opportunities to try to illustrate this to you, given that you still seem to maintain that statements you have made are not erroneous.

To address my example:

You are right that in the context of thepkrl's testing results and comments presented in that thread, the only thing established for certain is that my example's performance increase would depend on being able to conserve register usage and prevent exceeding the stated limit of two 4-component fp32 registers (which would depend on assumptions about pack/unpack not directly established by the thread), and that either this, or some other factor allowed by using only 2-component fp32 that is not unquestionably established in that text, would be required to definitively establish its applicability.

My understanding of the comment I quoted about loading floating point registers with values leads me to conclude things that are not directly supported by the benchmarks, so, for brevity, I'll abandon it as a bad example illustration instead of continuing to stipulate such criteria.

Instead, I'll point you to consideration of Hyp-X's example as an alternative illustration of how code structured for the nv30's performance characteristics is not representative of the general case optimization.

My question: has it at least facilitated finally establishing how code can be longer and yet faster for the nv30, and can we then return to the discussion about Cg's applicability for benchmark usage with that fundamental item considered?

demalion said:
demalion said:
Xmas said:
my mistake. There are no hardware-specific modifiers.
In the PS 2.0 spec, you mean?
I don't know whether _x8, _d4 and _d8 are natively supported on NV30. Those seem to be the only PS1.4 specific ones. DX will not accept shaders with modifiers it does not know.
It seems pretty clearly expressed to me that I'm not talking about expressing this in the LLSL, but about opportunities expressed in the actual "assembly" of the specific card that the driver low level optimizer converts it to.
If NV30 supports these modifiers, it's equal opportunities, and not hardware-specific.
No, because the nv30 has interests in code expression that might hide such opportunities for other architectures' low level optimizers but not for itself. The problem with your stipulation here is that you see no problem in proposing an "If" as an answer, ignoring other "Ifs", like whether there are any other distinct operation modifiers that actual hardware low level instructions can express.

This returns to my 4x4 versus 8x1 example, where the general case, as I've already expressed, is served by focusing on implementation of the spec that doesn't prevent any architecture from reasonably being able to seek its own further optimization opportunities. Just because the spec can be used to express an implementation that violates that principle doesn't mean that such an implementation is equivalent to the general case, nor that it is the responsibility of other parties to adapt to such an implementation instead of the general case.

Have we established this much?

demalion said:
You're right about abs and centroid. I didn't look at the version table. However I don't think there are any modifiers the R300 supports and NV30 does not support.
Hmm...well, you are stipulating things you think without any provided reason. I'd say the stated PS 1.4 specification gives reason to indicate something to the contrary, as well as other clearly exhibited factors illustrating their divergent designs.
I'm not stipulating things. I'm saying what I think is most likely.

If you are maintaining something is "most likely" as the basis for your comments, you are stipulating it. A stipulation is establishing something as a condition for agreement...in this case, establishing that your evaluation is accurate, to the end of seeking my agreement that your line of reasoning is validated by it.

Or, I could just reword it as "well, you are maintaining things you think are most likely without any provided reason."

Since this doesn't change anything about our discussion, I'll use this objection on your part as an example of something that leads to unproductive discussion.

That's simply my opinion. Their designs are divergent but still there are a lot of similarities.

A "lot of similarities" in designs you observe to be divergent does not do anything to address the assertion in question.

PS1.4 spec only indicates that three modifiers (x8, d4, d8 ) are not supported by the NV30 integer units, and assuming the NV30 register combiners are identical to the NV2x ones (besides precision/range), the OpenGL register combiner extension spec proves this.
But for the float units, I think it is equally likely for both to either support or not support those modifiers natively.

The R300 doesn't seem to have unique integer units. If it supports these modifiers in one clock cycle, it indicates that the applicability is part of its general case if the data input (and perhaps output, as it seems to be able to freely interchange the datatype being input and output) to which it is applying it allows the operation.

Aside from that example (since it is based on the stipulation that the R200 had such functionality, and ATI should have ample reason and ability to include it in the R300), I think the existence of that functionality, even in the R200, illustrates that hardware designs indeed easily diverge in the type of low level optimization opportunities they offer.

I understand that your opinion differs, but you haven't provided any reason that I have been able to make sense of as of yet.

demalion said:
That said, it might even be the case that R300 (and, partly, NV30) does not natively support (meaning for free) some of the PS1.x modifiers because they are suited for integer types and more complex on float types (x2, x4, x8, d2, d4 are shift operations for integers, but for floats that's adding or subtracting a number from the exponent)
That is quite true for that example, but the case I mentioned for looking for integer valued calculations for the nv30 would also be applicable. It would not be as universal an opportunity as I implied by context, however.
Sorry, I'm not sure I understand what you're saying here.

I'm saying the low level optimizer could apply this to, for example, an integer texture source. I think this is a possible parallel to the reason for, again for example, the nv30 only offering integer texture support in some drivers.

demalion said:
As long as that kind of optimization doesn't slow down other cards, i think anyone can be happy with it.
This is what I mean by making up your mind: do the HLSL optimizations expressed in the LLSL matter (as per the beginning of your prior post), or do they not (as per other statements, including the one quoted in my reply to that post).
I never said they do not matter. If they would not matter, they wouldn't be there.

Well, you've avoided addressing my quote of your actual initial statement in the thread a few times now, as well as my discussions of how your "opinions" presented in other text reflect exactly that sentiment. If we can't establish it from that direction, maybe the above discussion might serve?

demalion said:
If that modifier is not supported in PS2.0 (as it seems), then how would there be a way for R300 to use it?
By the low level optimizer. When I say "visible opportunities", it continues to mean opportunities of which the actual architecture instruction execution model is capable. That seems to be plainly stated. The statement you're making there makes no sense to me.
My fault. Could you give an example of how you think such opportunities can be hidden?

I can't if you continue to propose that you "think that it is most likely" that there is no difference in the low level optimization opportunities between different architectures as an answer to such discussions. :-?

Which is why proposing such opinions without addressable support (more than stating they are your opinion) for them, in response to an opposing opinion with provided addressable support, is not a productive way to hold a discussion.

btw, seeing your last post, I think it was a good decision to stop that discussion here.

Well, I'd be interested in what in particular prompted that conclusion, but I thought your post to which I was replying had established that was your conclusion already, which is why I bothered to make the statements I did in that post instead of advancing the discussion.

It shows you have not understood many things I said like I intended them to be understood.

I understood many comments you made, I just continue to fail to understand, for some of them, your reasons for maintaining that they are true.

I admit I think this is mostly my fault. But the discussion didn't work this way. It's also frustrating for me.

I think the conversation could be useful again if the above discussion initiates progress, or will remain unproductive if it does not. I think I've tried to facilitate the first option.
 
I manage the synthetic tests (not the game tests) for RM3D. The one that was published is quite old and buggy. In the coming days (tomorrow) a beta of the latest D3D synthetic set will be released, with a completely new shell environment that supports modular tests. Anyone who wants to contribute synthetic speed/quality/precision tests, feel free to contact me at unclesam@ixbt.com with any ideas/code and so on. I will send the source code of the shell and modules back to you on request if you need it.

For example, we still don't know how to profile state changes right - it's a big question.

Let's make this test set better together 8)
 