Rightmark3D

madshi said:
As you can read here:

http://www.beyond3d.com/forum/viewtopic.php?t=5150

The NV30 is very sensitive; e.g. the more registers are used, the slower it gets, and it profits from mixing the shaders with integer stuff. The R300 has a totally different behaviour. In my opinion it's absolutely obvious that if you want to have a fair comparison you need to either write a shader without having any special hardware in mind, or adjust the shader so that it runs as fast as possible on both NV30 (by using as few registers as possible, mixing in integer stuff, etc.) and R300. If you write a shader with NV30 in mind it's still interesting to test it on R300 and vice versa. But it can't be used as a fair benchmark!
As we are talking about DX9, mixing with integer operations isn't relevant here.
Is a shader that tries to use as few registers as possible specifically optimized for NV30? Or is this just an obvious optimization that should always be done? Using few registers usually doesn't hurt R300.

If optimization in one area hurts optimization in another area, how do you find a "neutral" way of optimizing? How is it possible to write an unbiased benchmark then?

Perhaps it would be better to elaborate on to what extent Cg outputs "bad" code for R300 and other chips before discussing this topic further.

madshi said:
Xmas said:
What if a shader programmer came up with the same assembly code as the Cg compiler? The program would run executing the same shader code; to what extent would that be better?
That's highly unlikely, if you look at how careful you have to be to get halfway good performance out of the DX9 NV30 shaders.
Depends on where you put your preferences. As a shader programmer, you get performance recommendations from ATI, NVidia, and maybe other companies. If they don't contradict each other, you try to follow all of them. If they do, you have to decide what you rank higher. Now it's important to know where they contradict each other.

madshi said:
Xmas said:
If Cg outputs better code than DX9 HLSL compiler, meaning it is faster on any hardware, should a benchmark still use DX9 HLSL to be more comparable?
*If*. Do you really believe that!? :oops:
I'm pretty sure there are cases where it does.
 
Xmas said:
Is a shader that tries to use as few registers as possible specifically optimized for NV30? Or is this just an obvious optimization that should always be done? Using few registers usually doesn't hurt R300.

I think the issue was that if you end up modifying your shader (reducing quality, robustness, etc.) so that it runs satisfactorily on NV30...that's what shouldn't always be done.

The developer has three choices in these circumstances:

1) Use a single shader that is fast on both architectures,
2) Use a single shader of better quality / characteristics that may not run as well on NV30
3) Use two shaders depending on the architecture.

Each having pros and cons...
 
Joe DeFuria said:
I think the issue was that if you end up modifying your shader (reducing quality, robustness, etc.) so that it runs satisfactorily on NV30...that's what shouldn't always be done.
Do you mean modifying at HLSL or at assembly level? RightMark uses high level shaders, and I don't think they modify their shaders for NV30. I also don't think Cg output is less robust or has reduced quality compared to DX9 HLSL compiler output. I'd consider it a bug if that's not true.
 
Xmas said:
Do you mean modifying at HLSL or at assembly level? RightMark uses high level shaders, and I don't think they modify their shaders for NV30. I also don't think Cg output is less robust or has reduced quality compared to DX9 HLSL compiler output. I'd consider it a bug if that's not true.

I mean modifying at the HLSL level. In short:

1) Artist / Programmer creates a shader or set of shaders to achieve a desired effect.

2) It compiles (any given compiler - Cg or DX9) optimally in terms of register resource use.

3) It consumes register resources on NV30 hardware to the extent that it's not really viable performance-wise.

4) Performance is acceptable on R300

Now the developers have the choices that I outlined above...
 
MDolenc said:
As Xmas said: the only good guideline is: use as few instructions as possible. Rightmark is a DirectX 9.0 application, which comes with a completely standard ps_2_0 path, and this path does not include integer calculations. Cg absolutely CAN'T sneak ANY integer code into it!
That's what I thought, too, but then I read somewhere that it IS possible, despite what most of us believe(d). Unfortunately I don't remember where I read it, but I think it was in a thread on beyond3d.
MDolenc said:
Also, both DX9 HLSL and Cg do actually NEED to use as few registers as possible. What will happen when they hit an upper limit for register count? Ok, we wasted one register there so we'll remove it now? Not that simple.
Of course if you can write code which is optimally fast using only 2 registers, you shouldn't be using more registers. However, what if your shader could save some instructions by using 4 registers instead of 2? That's very well possible, don't you agree? Using 4 registers would slow down NV30, but not R300. Now you would have to check whether the slowdown is worth the shorter shader code. The Cg compiler might come to the conclusion that in the end using only 2 registers is faster for NV30, so it will output shader code with only 2 registers. Of course it will run fine on R300, too. But on R300 the shader would have been faster using 4 registers. Can you follow me?
MDolenc said:
There is an ATI profile in DX9 HLSL? Where have you seen it? :rolleyes:
I thought all the IHVs would write profiles for DX9 HLSL; is that wrong? Maybe it is.
MDolenc said:
How can you just claim that Cg outputs sucky code if you haven't seen what kind of code it outputs?
Did I claim that? I only said that Cg optimizes for NVidia hardware only, which in some cases can result in shaders which are sub-optimal for R300. Should a fair benchmark not compare Cg to DX9 HLSL?
 
Xmas said:
Is a shader that tries to use as few registers as possible specifically optimized for NV30? Or is this just an obvious optimization that should always be done? Using few registers usually doesn't hurt R300.
I've read that the doubled number of available registers on the Opteron in 64-bit mode already gives it quite a performance boost. Using few registers is fine if you can get by with few registers. But if you have complicated code it can benefit from more registers. In such a situation the optimal shader code for NV30 and for R300 probably looks different, can you agree to that?
Xmas said:
If optimization in one area hurts optimization in another area, how do you find a "neutral" way of optimizing? How is it possible to write an unbiased benchmark then?

Perhaps it would be better to elaborate on to what extent Cg outputs "bad" code for R300 and other chips before discussing this topic further.
Good comments, I agree. But until we know that Cg doesn't favour NV30, we shouldn't use it as a fair benchmark - at least not without comparing DX9 HLSL benches with it, don't you agree?
Xmas said:
I'm pretty sure there are cases where it does.
Yes. But there will probably also be cases where it does not, right? So isn't it "wrong" to use Cg for benchmarking if we are not sure whether the specific benchmark shader runs faster on any hardware than DX9 HLSL?

The most fair benchmark in my eyes would be one that compares Cg compiled shaders with comparable DX9 HLSL shaders. Then we will not only be able to compare Cg <-> DX9 HLSL on NVidia hardware, but also on other hardware. That would be interesting for us all, would it not?
 
Xmas said:
demalion said:
Here are some points for discussion on Cg:
The first prevents all R200 and RV2x0 cards from exposing anything but PS 1.3. This precludes expanded dynamic range,
You can have HDR in PS1.1, too.

Yes, Cg documentation does state it depends on the value DX returns for MaxPixelShaderValue in the DX profiles. What about the flexibility and shader length that I mentioned?

Additionally, the compiler doesn't even have to support it explicitly. Constants are limited to [-1, 1] in PS 1.x, and all other calculations' precision or range does not affect the compiler.

Hmm? You mean the compiler succeeds in targeting a profile for which it presumes the code is in error? I don't think that makes sense...but the documented behavior for Cg should prevent a datatype with the applicable range from being an error in any case.

Quite likely not pertinent, but something that concerned me: Given that badly programmed shaders might depend on the implicit clamping of early ps 1.1-1.3 shader hardware, I wanted to see what the 8500 reported for MaxPixelShaderValue (i.e., to see if actual API usage might depend on using PS 1.4 to expose range for compatibility reasons), but the one place I found a list (in an 8500 review here) had a blank spot (why'd you stop including references to that info in other reviews, Wavey? :( Or did I miss it in my search?), and I didn't find an easy way to check quickly. I'll try to dig for some example DirectX setup code to check myself, but as far as Cg exposing as much range as it can within proper API behavior, I consider the Cg documentation to indicate that it does.

The second allows the nv30 to compete.
But it's not an issue here because it only works in OpenGL.

Well, actually, Rightmark exposes OpenGL, so Cg's relevant OpenGL targets, and lack of them, are indeed relevant to this discussion. Note that it is the 8500/9000 technology that is the hardware that is disadvantaged, which, although not necessarily the largest direct revenue source, should represent the largest portion of ATI's shader-supporting market share given the time, price range, and integrated product targeting for it, past and near future.
It is not insignificant for nVidia to be in a situation (if Cg gains developer support) where they either control how ATi's products are represented in comparison to their own, or require IHVs (who wouldn't have much choice if ATI did, so let's say "ATI") to commit to Cg on nVidia's terms to prevent that control, which entails circumventing APIs in which ATI, and other IHVs, already have a say.

Anyway, as far as my particular concern for DirectX: given past driver behavior, I'm worried that nVidia has been working to exactly this end. What I think is that the driver could still indeed support "integer and/or fp16 PS 2.0", since I'm not aware that Cg is precluded from exposing that driver behavior as long as the generic PS 2.0 calls perform to the DX 9 specification in all facets.

Restated: I'm aware that nVidia says Cg uses "standard DX 9 PS 2.0 calls"; I'm just not aware that that precludes WHQL certification if driver behavior deviates only for a specific case, like Cg's initialization and setup (and 3dmark 03, for that matter). As far as the WHQL drivers and DX 9 go, this can be tested and this idea dismissed or supported for PS 2.0 and PS 2.0 extended, for the current behavior of the drivers and Cg, by an FX owner (where is pocketmoon_?).

As long as it doesn't currently, this isn't a concern for DirectX right now. If it cannot for WHQL drivers at all, then this will not be a concern for the future under DirectX either. I'm more concerned with Det "FX" and the next Cg release after its announcement, but current behavior can be established.

The third is a pervasive problem. For instance, the R300 performance is optimized for scalar/3-component vector execution with texture ops simultaneously dispatched, while this emphasis absolutely chokes the nv30, especially if floating point is needed. Under DX 9 HLSL, the output is optimized to the spec, which the IHVs can then anticipate and optimize for, but for Cg, the output is optimized to the nv30's special demands, and this can, and has, resulted in "LLSL" that is distinct. They say they aren't bypassing the API, but they do exactly that with regard to DX 9 HLSL.
What is "optimized to the spec"? There is no defined "best" way to write an assembly shader, the only obvious optimization being using as few instructions as possible (which may sometimes not be the best case for certain hardware).

Hmm...well, as has been stated, that isn't quite true. Please consider the shader length and performance results for pocketmoon's Cg and DX 9 HLSL comparisons, and the compiler assistance recommendations that were concluded (which was up yesterday, but isn't at the moment for some reason...maybe new results are forthcoming?). The only way to circumvent this particular issue is to have the standard DX 9 target output be identical to the output from the DX 9 HLSL, as I proposed. It is a failure to assure things like this that makes using Cg not only a matter of addressing the nv30's shortcomings, but of API co-option.

The drivers have to make the best out of the assembly shaders by reordering and pairing instructions, regardless of whether the shader was written in assembly, HLSL or Cg.

By controlling the output, nVidia maintains a perpetual advantage in the ability to optimize...and no, Cg does not produce equivalent code to DX 9 HLSL in all cases, unless another IHV buys into supporting nVidia's toolset instead of DX 9 HLSL, and reinventing the compiler Microsoft provided. A pretty lose-lose situation.

nVidia is circumventing the DX API by replacing a part of it: the HLSL. That is only beneficial as an avenue to allow their own hardware's deficiencies (in comparison to the API's requirements) to be overcome (it isn't useful in DirectX for any other purpose, and if it doesn't do that, it is only useful as a tool of co-option...why use it?). If you are using Cg for DX, why not use DX 9 HLSL? The performance relationship is not as you purport by any indication I've seen (except maybe for the nv30, but I've covered that much already).

If Cg is used in a benchmark, it is considered "not right".
Please address what I actually said...I said it is "not right" as other than a Cg benchmark or a Cg versus HLSL benchmark. As a Cg-replacing-DX 9-HLSL benchmark it is "not right", and also as a benchmark that purports to be representative of OpenGL performance of other than nVidia cards it is not right, for the stated reasons. This will only become worse as there is more competition that is expected to either depend on nVidia's compiler priorities and honesty, or create another high level optimizer in addition to a low level optimizer. ATi could do this pretty easily, I think, but why would they? nVidia hoped that it would get Cg adopted for OpenGL so IHVs would automatically be doing this when targeting OpenGL HLSL, but that didn't work. If they'd really have committed language evolution to ARB control in that circumstance, that would have been a good thing, but that's not what we're dealing with now.

What if a shader programmer came up with the same assembly code as the Cg compiler? The program would run executing the same shader code; to what extent would that be better?

If the programmer came up with the "same" code themselves (or using DX 9 HLSL, or any independent means), the shader code would not be worse for that instance, of course. That's recognized by all my example solutions, so I'm puzzled as to why you're asking. The problem is that this is not at all guaranteed to be the case in the future, nor is it representative of the actual cases now.

If Cg outputs better code than DX9 HLSL compiler, meaning it is faster on any hardware, should a benchmark still use DX9 HLSL to be more comparable?
You're being a bit strange here.
To do this, they would have to: support PS 1.4, produce code that was as good as or (as you, seemingly fictionally, stipulate) better than the DX 9 HLSL output for any hardware, and be committed in actuality to their stated "open" intent. That's a valid question, but I've already given a specific answer, and indicated that I saw no indication of this, and in fact saw indication of the opposite.

Are you going to state that nVidia's intent is indeed "open"? This to me is ignoring the evidence of their consistent past and current behavior, with regards to Cg as well as other factors, and that they are trying specifically to replace the DX 9 HLSL API not just for their hardware, but for all hardware, with their Cg toolset.

Perhaps you are thinking that every IHV is as self-interested? To me this is ignoring that no other IHV is projecting their self-interest as broadly as nVidia is with an initiative like Cg (again, among other things illustrating differing approaches to projecting self interest), which is the specific problem with it.
 
demalion said:
Yes, Cg documentation does state it depends on the value DX returns for MaxPixelShaderValue in the DX profiles. What about the flexibility and shader length that I mentioned?
You're right on those points. I was specifically saying that PS1.4 and HDR are not linked together except that PS 1.4 requires a range of [-8, 8] for texture coordinate calculations.

demalion said:
Additionally, the compiler doesn't even have to support it explicitly. Constants are limited to [-1, 1] in PS 1.x, and all other calculations' precision or range does not affect the compiler.

Hmm? You mean the compiler succeeds in targeting a profile for which it presumes the code is in error? I don't think that makes sense...but the documented behavior for Cg should prevent a datatype with the applicable range from being an error in any case.
This was still about HDR. The compiler doesn't have to specifically support it. It just works, provided the hardware supports it.


demalion said:
Quite likely not pertinent, but something that concerned me: Given that badly programmed shaders might depend on the implicit clamping of early ps 1.1-1.3 shader hardware, I wanted to see what the 8500 reported for MaxPixelShaderValue (i.e., to see if actual API usage might depend on using PS 1.4 to expose range for compatibility reasons), but the one place I found a list (in an 8500 review here) had a blank spot (why'd you stop including references to that info in other reviews, Wavey? :( Or did I miss it in my search?), and I didn't find an easy way to check quickly.
I'm pretty sure the R2x0 reports 8.0 as MaxPixelShaderValue, so you can use this range in any PS1.x shader. R300 reports some ridiculously high number due to FP24 support.
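For illustration, a minimal ps_1_1-style sketch (my own, not from RightMark) that only behaves as intended if the reported MaxPixelShaderValue is above 1.0:

Code:
ps_1_1
def c0, 0.5, 0.5, 0.5, 1.0   ; constants themselves stay within [-1, 1]
tex t0                        ; base lighting
tex t1                        ; second light contribution
add r0, t0, t1                ; intermediate sum can exceed 1.0 only if the
                              ; hardware doesn't clamp to [-1, 1] here
mul r0, r0, c0                ; scale back into displayable range at the end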

No shader should rely on clamping, with the exception of the saturate modifier of course. (It's very strange that the GLslang working group mentions clamping behaviour as one of the few (very weak) reasons against the implementation of half floats)
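A trivial sketch of what I mean by making the clamp explicit rather than hoping for implicit clamping:

Code:
mad_sat r0, t0, c0, c1   ; _sat clamps the result to [0, 1] by the spec,
                         ; on every card, instead of relying on the hardware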


The second allows the nv30 to compete.
But it's not an issue here because it only works in OpenGL.

Well, actually, Rightmark exposes OpenGL, so Cg's relevant OpenGL targets, and lack of them, are indeed relevant to this discussion.
Does it target NVidia specific fragment shader extensions? If not, there is no integer support.

demalion said:
Note that it is the 8500/9000 technology that is the hardware that is disadvantaged, which, although not necessarily the largest direct revenue source, should represent the largest portion of ATI's shader-supporting market share given the time, price range, and integrated product targeting for it, past and near future.
This depends on the specific shaders RightMark uses. Only if we know them can we say whether PS1.4 would really be an advantage here. Probably it is.


demalion said:
What is "optimized to the spec"? There is no defined "best" way to write an assembly shader, the only obvious optimization being using as few instructions as possible (which may sometimes not be the best case for certain hardware).

Hmm...well, as has been stated, that isn't quite true. Please consider the shader length and performance results for pocketmoon's Cg and DX 9 HLSL comparisons, and the compiler assistance recommendations that were concluded (which was up yesterday, but isn't at the moment for some reason...maybe new results are forthcoming?). The only way to circumvent this particular issue is to have the standard DX 9 target output be identical to the output from the DX 9 HLSL, as I proposed. It is a failure to assure things like this that makes using Cg not only a matter of addressing the nv30's shortcomings, but of API co-option.
Uhm, *what* is not true? I was talking about assembly shaders. If you know any other "globally valid" recommendations for writing assembly shaders besides "use as few instructions as possible", please let me know.

Maybe Cg optimizes for shader length, and additionally for register usage, while the DX9 HLSL compiler only optimizes for shader length while not worrying about register usage.
As I said, if there are two ways of optimizing that affect each other, how do you find an "unbiased" way to optimize? Is the DX9 compiler "unbiased" if it takes only one way of optimizing into account?

demalion said:
What if a shader programmer came up with the same assembly code as the Cg compiler. The program would run executing the same shader code, in how far would that be better?

If the programmer came up with the "same" code themselves (or using DX 9 HLSL, or any independent means), the shader code would not be worse for that instance, of course. That's recognized by all my example solutions, so I'm puzzled as to why you're asking. The problem is that this is not at all guaranteed to be the case in the future, nor is it representative of the actual cases now.
I am asking because a shader programmer could take exactly the same recommendations into account as the Cg compiler does when writing shader code. I think this is pretty likely. A programmer surely takes some performance recommendations into account.
 
Xmas said:
...

No shader should rely on clamping, with the exception of the saturate modifier of course. (It's very strange that the GLslang working group mentions clamping behaviour as one of the few (very weak) reasons against the implementation of half floats)

Completely separate discussion, but: what did they specifically say about clamping behavior? It seems to me it would be pertinent for preventing custom coded rescaling depending on the particular format used for specific hardware speed instead of sticking to an expected scaling range, and is a "clean code" principle along the lines of what I understand them to be targeting for "write once" HLSL shaders. I still don't know about VS/FS interaction as allowed by GLslang, but if that is planned, they may just consider fp24 and fp32 enough complexity in floating point specifications for the fragment shader. Actually, I'm thinking nv40 could have fp24 (maybe even fp32) register combiners (couldn't it?), and that it is a matter of the industry at large not being concerned with a limitation that appears unique to nVidia's vision of the future.
Further aside: I really don't see why nVidia isn't able to address this deficiency in hardware, but then again I don't really understand the nv30 transistor count...maybe some factor of the design I'm not aware of. In any case, I don't see why fp32 performance issues couldn't be addressed as soon as the nv35 (it seems it is a matter of registers, not processing), which would make fp16 support even more fruitless as an LCD.

Hmm...anyone tested whether the 5200 (I seem to recall some association with nv35 improvements) is any slower with fp32 than fp16? Don't recall, though I have a faint impression that is a failing of my memory of results and not a failing of someone to test this.

The second allows the nv30 to compete.
But it's not an issue here because it only works in OpenGL.

Well, actually, Rightmark exposes OpenGL, so Cg's relevant OpenGL targets, and lack of them, are indeed relevant to this discussion.
Does it target NVidia specific fragment shader extensions? If not, there is no integer support.

It lists "ARB" and "Native", so yes, and there are shaders for fixed, partial, and full precision. Don't know what the scenes look like, I don't have hardware that supports any of Cg's targets. If they supported DX 9 HLSL they would have done as much as could be reasonably expected for the time being with limited resources.
I took a look at some of the .cg files, which were rather short and have a fixed datatype version, so I actually think they could be done pretty easily using PS 1.4.
Since Rightmark is a work in progress, the issues I raise might be a temporary situation (the PS 1.4 in the Direct3D synthetic tests indicates to me that they intend to support both APIs for all exposed levels of functionality, and have been let down for now in the game tests by Cg's lack of DX PS 1.4 and OpenGL ATI_fragment extension support).

demalion said:
Note that it is the 8500/9000 technology that is the hardware that is disadvantaged, which, although not necessarily the largest direct revenue source, should represent the largest portion of ATI's shader-supporting market share given the time, price range, and integrated product targeting for it, past and near future.
This depends on the specific shaders RightMark uses. Only if we know them can we say whether PS1.4 would really be an advantage here. Probably it is.

I'm pretty sure the game scenes could be successfully implemented for the 8500. They look pretty simple, and disqualifying factors were not immediately evident to me (the linear interpolation function usage looked compatible with the PS 1.4 spec, for example).

demalion said:
What is "optimized to the spec"? There is no defined "best" way to write an assembly shader, the only obvious optimization being using as few instructions as possible (which may sometimes not be the best case for certain hardware).

Hmm...well, as has been stated, that isn't quite true. Please consider the shader length and performance results for pocketmoon's Cg and DX 9 HLSL comparisons, and the compiler assistance recommendations that were concluded (which was up yesterday, but isn't at the moment for some reason...maybe new results are forthcoming?). The only way to circumvent this particular issue is to have the standard DX 9 target output be identical to the output from the DX 9 HLSL, as I proposed. It is a failure to assure things like this that makes using Cg not only a matter of addressing the nv30's shortcomings, but of API co-option.
Uhm, *what* is not true?

Exactly what I quoted is what I meant by "isn't quite true" (not the same as "untrue" or "not true"). It's not just a matter of register usage, but instruction order and the visibility of optimization opportunities to the driver's optimizer (for example, it might depend on assumptions about Microsoft's high level optimizations to minimize performance overhead from optimizing for dispatching low level shader code).

I was talking about assembly shaders. If you know any other "globally valid" recommendations for writing assembly shaders besides "use as few instructions as possible", please let me know.

Well, it is possible that the Cg compiler will offer improved output over what we saw in pocketmoon's testing, but it doesn't yet as far as I've seen, nor does it appear that the DX 9 HLSL is going to be without updates provided by Microsoft to address issues from all vendors moving forward. Also, shader instruction count doesn't seem a "globally valid" goal when it is the only goal.

For an example from his testing (which is back up), the shader instruction count across targets is similar except for Shader Test 4, yet performance varies even when the datatypes used are similar.
Also, with a significantly longer "instruction count" (it is unclear to me how directly shader counts compare), the nv30-specific path is as fast as the DX 9 HLSL output performance.

The indication seems to be that the prioritization for the output is for the demands of the functionality exposure in the nv30 fragment program extension, and that even nVidia's low level DX 9 optimizer cannot always optimize for equivalent performance from this type of output (even with similar datatypes).
Two ways of addressing this are to change the DirectX target output if they can (which looking at the performance results, like the first test and some others, would need to be for other than just instruction count concerns), and to change the low level optimization.
The first is something nVidia controls as I outlined before, and the last is something that seems easier for nVidia to do, and that other IHVs would have to do additional work for, looking at the Shader 4 instruction count and performance results (as I also outlined before).

Maybe Cg optimizes for shader length, and additionally for register usage, while the DX9 HLSL compiler only optimizes for shader length while not worrying about register usage.
I don't think it is as simple as "considering register usage/not considering register usage"...to restate, I think it is about the extent of and type of optimization of register use required. The R300's performance is offered in less specialized register usage situations; therefore it performs better for the standardized API and less involved optimizations. The nv30 approach is more specialized, even without offering more functionality, and would continue to be even if the register combiners supported floating point (it is just that the raw throughput would then allow nVidia more leisure to optimize instead of necessitating Cg for them).

As I said, if there are two ways of optimizing that affect each other, how do you find an "unbiased" way to optimize? Is the DX9 compiler "unbiased" if it takes only one way of optimizing into account?
I don't think it does take "only one way of optimizing into account", I just think that the R300 offers better general case performance and requires a less dedicated set of optimizations, and that is the simple reason why it is well suited to general APIs and the nv30 does not appear to be. I think Carmack's comments indicate a similar outlook, though I can't speak for him (Duh! :p).
I just think the R300 design is smarter for general exposure of the same functionality. Sort of like a design with 4 pipelines with 4 texture units versus 8 pipelines with 1...it takes more specialized cases for the first design to show an advantage; it's not that people are giving special consideration to the second.
I do think designs similar to the nv30 can address its shortcomings (it seems to me there is a resistance to simply recognizing that the nv3x design has problems), but that discussion belongs in an nv40 speculation thread.

...
I am asking because a shader programmer could take exactly the same recommendations into account as the Cg compiler does when writing shader code. I think this is pretty likely. A programmer surely takes some performance recommendations into account.

Of course, and the full set of those "same recommendations" are suitable considerations for the nv30, as I stated in one of my example cases. The problem is using all of that set of recommendations for other cards when there are other tools vendors are supporting (and not just one vendor, but other vendors collectively) where the recommendations represent more than one party's interests.
 
madshi said:
MDolenc said:
Also, both DX9 HLSL and Cg do actually NEED to use as few registers as possible. What will happen when they hit an upper limit for register count? Ok, we wasted one register there so we'll remove it now? Not that simple.
Of course if you can write code which is optimally fast using only 2 registers, you shouldn't be using more registers. However, what if your shader could save some instructions by using 4 registers instead of 2? That's very well possible, don't you agree? Using 4 registers would slow down NV30, but not R300. Now you would have to check whether the slowdown is worth the shorter shader code. The Cg compiler might come to the conclusion that in the end using only 2 registers is faster for NV30, so it will output shader code with only 2 registers. Of course it will run fine on R300, too. But on R300 the shader would have been faster using 4 registers. Can you follow me?
As I'm more of a mathematician, give me an example where using 4 registers instead of two will result in fewer instructions. Until then I'll claim that using more registers will produce an equal or greater number of instructions.
Don't mix general CPU optimisations with GPU shader code optimisations. CPUs have to run code that would in theory need 3, 8, 10, 65740,... registers, and when you run out of register space on the CPU you'll have to put these registers in system memory. This means that as soon as you are given more registers your code will indeed run MUCH faster. GPUs on the other hand don't have to run arbitrarily complex shader code, and when a HLSL compiler or assembly shader programmer runs out of registers the shader simply won't work (you'll have to multipass - manually).
 
MDolenc said:
As I'm more of a mathematician, give me an example where using 4 registers instead of two will result in fewer instructions.

The following example shows 4 vs 3 regs, but with enough time taken it would be possible to craft a better example.
But I think it proves the point.

Take the following code (NOT actual shader code):

Code:
float foo(float a, float b, float c)
{
	return (a-b)*(a+b)*(a-c) + (a-c)*(a+c)*(a-b);
}

Yes, I know this is stupid and can be optimized by expanding the parentheses and such, but compilers are not allowed to do this! (It's always up to the programmer to do such optimizations.)

Here's the naive postfix compiler output, which is also the one with the smallest possible register usage (3):

Code:
sub  r0, a, b		; a-b
add  r1, a, b		; a+b
mul  r0, r0, r1		; (a-b)*(a+b)
sub  r1, a, c		; a-c
mul  r0, r0, r1		; (a-b)*(a+b)*(a-c)
sub  r1, a, c		; a-c
add  r2, a, c		; a+c
mul  r1, r1, r2		; (a-c)*(a+c)
sub  r2, a, b		; a-b
mad  r0, r1, r2, r0	; (a-b)*(a+b)*(a-c) + (a-c)*(a+c)*(a-b)

That's 10 instructions.
Now, while the compiler cannot rearrange the expression, it can do CSE (common subexpression elimination).
Here's the output of the CSE capable compiler:

Code:
sub  r0, a, b		; a-b
add  r1, a, b		; a+b
mul  r1, r0, r1		; (a-b)*(a+b)
sub  r2, a, c		; a-c
mul  r1, r1, r2		; (a-b)*(a+b)*(a-c)
add  r3, a, c		; a+c
mul  r2, r2, r3		; (a-c)*(a+c)
mad  r0, r2, r0, r1	; (a-b)*(a+b)*(a-c) + (a-c)*(a+c)*(a-b)

It took 8 instructions, but this code uses 4 registers.
 
demalion said:
Completely separate discussion, but: what did they specifically say about clamping behavior? It seems to me it would be pertinent for preventing custom coded rescaling depending on the particular format used for specific hardware speed instead of sticking to an expected scaling range, and is a "clean code" principle along the lines of what I understand them to be targeting for "write once" HLSL shaders.

They write (excerpt):
GLslang shader spec said:
33) Should precision hints be supported (e.g., using 16-bit floats or 32-bit floats)?
DISCUSSION: Standardizing on a single data type for computations greatly simplifies the specification of the language. Even if an implementation is allowed to silently promote a reduced precision value, a shader may exhibit different behavior if the writer had inadvertently relied on the clamping or wrapping semantics of the reduced operator. By defining a set of reduced precision types all we would end up doing is forcing the hardware to implement them to stay compatible.
That's just ridiculous as there is no defined clamping or wrapping behavior you can rely on.


It lists "ARB" and "Native", so yes, and there are shaders for fixed, partial, and full precision. Don't know what the scenes look like, I don't have hardware that supports any of Cg's targets. If they supported DX 9 HLSL they would have done as much as could be reasonably expected for the time being with limited resources.
If that means having the choice between ARB and native path, that's quite a positive feature.
I haven't finished downloading the beta yet. But do we know for sure that it only uses Cg and not both Cg and DX9 HLSL?


demalion said:
demalion said:
What is "optimized to the spec"? There is no defined "best" way to write an assembly shader, the only obvious optimization being using as few instructions as possible (which may sometimes not be the best case for certain hardware).

Hmm...well, as has been stated, that isn't quite true. Please consider the shader length and performance results for pocketmoon's Cg and DX 9 HLSL comparisons, and the compiler assistance recommendations that were concluded (which was up yesterday, but isn't at the moment for some reason...maybe new results are forthcoming?). The only way to circumvent this particular issue is to have the standard DX 9 target output be identical to the output from the DX 9 HLSL, as I proposed. It is a failure to assure things like this that makes using Cg not only a matter of addressing the nv30's shortcomings, but of API co-option.
Uhm, *what* is not true?

Exactly what I quoted is what I meant by "isn't quite true" (not the same as "untrue" or "not true"). It's not just a matter of register usage, but instruction order and the visibility of optimization opportunities to the driver's optimizer (for example, it might depend on assumptions about Microsoft's high level optimizations to minimize performance overhead from optimizing for dispatching low level shader code).
Maybe I'm not making myself clear on this, but you seem to totally miss my point. You wrote "optimized to the spec". To what spec?

AFAIK there is no "How you should write your assembly shaders.pdf". Maybe I just haven't seen it, and if it exists my point is moot.
But if there is no such thing, what guidelines can you follow when writing an assembly shader? Maybe the only rule everyone would agree to is "use as few instructions as possible" (NVidia and others would probably add "use as few registers as possible", and others add "use clever pairing").
So how can you say one assembly shader is better than another, if they are equal in length, when there are no other "generally accepted" guidelines to follow? You might say "the one which executes faster is better", but that doesn't cover the case when one shader is faster on certain hardware and the other is faster on other hardware.

A HLSL shader compiler now has the same task, producing an assembly shader. It should follow the same recommendations as a human assembly shader programmer does. But there are almost no recommendations on the final output. (btw, I don't know why you mention pocketmoon's Cg recommendations here, as I'm not talking about how you should write a HLSL shader)

The driver has to accept native assembly shaders, shaders compiled with Cg, DX9 HLSL, or any other shader compiler that might exist somewhere. It certainly should not rely on getting "DX9 HLSL compiler output style".


demalion said:
...
I am asking because a shader programmer could take exactly the same recommendations into account as the Cg compiler does when writing shader code. I think this is pretty likely. A programmer surely takes some performance recommendations into account.

Of course, and the full set of those "same recommendations" are suitable considerations for the nv30, as I stated in one of my example cases. The problem is using all of that set of recommendations for other cards when there are other tools vendors are supporting (and not just one vendor, but other vendors collectively) where the recommendations represent more than one party's interests.
But how do you know the "set of recommendations" from NVidia is not better suited to other vendors' hardware than the "set of recommendations" from ATI? Currently there are only NVidia and ATI supporting PS1.4+, and <PS1.3 is too limited to allow more than the most trivial optimizations.
What if XabreII, DeltaChrome, P20, etc. perform more like NV30 than R300?
 
Xmas said:
What if XabreII, DeltaChrome, P20, etc. perform more like NV30 than R300?

Doesn't seem very likely.
At least not the Xabre II; it seems to have a lot more in common with R3X0 than NV3X. For example, it will only support FP24, not FP16/FP32.
DeltaChrome and Xabre II are more likely to be more "simple" architectures in any case IMHO.
 
Xmas said:
...

It lists "ARB" and "Native", so yes, and there are shaders for fixed, partial, and full precision. Don't know what the scenes look like, I don't have hardware that supports any of Cg's targets. If they supported DX 9 HLSL they would have done as much as could be reasonably expected for the time being with limited resources.
If that means having the choice between ARB and native path, that's quite a positive feature.

No, it is not a positive feature except for nVidia cards. To recall the discussion that led up to that answer:

Xmas said:
demalion said:
Xmas said:
demalion said:
The second allows the nv30 to compete
But it's not an issue here because it (allowing the nv30 to compete) only works in OpenGL
Well, actually, Rightmark exposes OpenGL, so Cg's relevant OpenGL targets, and lack of them, are indeed relevant to this discussion.
Does it target NVidia specific fragment shader extensions? If not, there is no integer support.
It does target nVidia-specific fragment shader extensions, as well as provide the DX target shader output nVidia has determined. What is missing is, for example, DX 9 HLSL, whose ability to target PS 1.4 is determined by Microsoft (and not nVidia), and also the possibility of exposing ATI's 8500/9000-supporting OpenGL extension, regardless of whether the shaders in question can utilize that target or not.
That is not "quite a positive feature", nor does that comment make sense given the discussion. :-?


I haven't finished downloading the beta yet. But do we know for sure that it only uses Cg and not both Cg and DX9 HLSL?

There are no .fx files, or anything indicative of HLSL besides the .cg files.

Further:
readme.txt said:
Cg Toolkit 1.1 from NVIDIA is needed for some Gaming tests,
but redistributables included.
No DX 9 HLSL alternative is mentioned.

However:
readme.txt said:
All stuff is subject to change according to your demands.
Which is why I'm not on a "Rightmark 3D bashing" bandwagon, but on a "Rightmark 3D needs to address some issues" bandwagon.


...Maybe I'm not making myself clear on this, but you seem to totally miss my point. You wrote "optimized to the spec". To what spec?

To the DX 9 PS 2.0 spec, as I specifically stated. That's why Cg requires IHVs to provide their own high level compiler to be fairly represented, and why IHVs have no reason to, since other high level standards not maintained by their competitor exist. It is also why Cg is not a good choice as the exclusive high level functional expression in something proposed as a general benchmark.

AFAIK there is no "How you should write your assembly shaders.pdf". Maybe I just haven't seen it, and if it exists my point is moot.
Well, I suspect recommendations common to IHVs will be to utilize write masking, utilize specified macros when possible, and optimize to the most concise usage of modifiers and instructions possible.
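A hypothetical ps_2_0-style fragment (not taken from any actual document, with dcl/texld setup omitted) of the kind of thing I mean:

Code:
dp3 r0.x, r2, r3              ; write-mask only the component actually needed
m3x3 r1.xyz, r2, c0           ; documented macro instead of three hand-rolled dp3s
mad_sat r4.xyz, r1, c3, r0.x  ; modifier folds the clamp into the instruction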

If you have PowerPoint, take a look at this; I'd have searched for excerpts to discuss myself, but I don't have PowerPoint and don't have time to sift through the raw data for text at the moment (assuming it isn't compressed).

But if there is no such thing, what guidelines can you follow when writing an assembly shader? Maybe the only rule everyone would agree to is "use as few instructions as possible" (NVidia and others would probably add "use as few registers as possible", and others add "use clever pairing").

You continue to maintain that simplification, despite my having tried to point you to examples of that not being the case. Again, looking at some benchmarks, notice instruction count parity (including when targeting the same assembly) and yet performance differences even for the same datatype. Also notice the recommendations to aid Cg compilation at the end, which the Cg compiler itself could (and likely will) do, even though the concerns are unique to the nv30, not all architectures.

Here are some observations:
For some operations, the nv30 can execute two 2-component fp ops in the same time as it can execute one 4-component op. It can also arbitrarily swizzle. It should also be able to use its FX12 units legally in specific instances (like when adding a valid FX12 constant to values just read from a 12-bit-or-less integer texture format :-?).

This might be optimized uniquely for the nv30 when two of four constants are valid FX12 constants such that (this is for an integer texture format, and please forgive the amateurish nature):

for everyone:
Code:
textureload {r,g,b,a}->xtex
add xtex, (float32,float32,int12,int12)->simpleeffect
2 cycles for a card that can perform well in the general case
nv30 executing this unoptimized for fp32:
Code:
*-textureload {r,g,b,a}->xtex
**addf xtex, (float,float,int,int)->simpleeffect
3 clock cycles
"everyone" code optimized for nv30:
Code:
*-textureload {r,g,b,a}->xtex
--addx xtex, (0,0,int,int)->simpleeffect
*-addf simpleeffect,(float,float)->simpleeffect
2 clock cycles

The problem isn't the optimization, but the optimization being present in the "assembly" (which is a standardized assembly, not the hardware-specific assembly, which is why I like "LLSL"). EDIT: The "x" and "f" indicate what the nv30 would do, requiring only an instruction-isolated low level optimization; they are not DX 9 assembly op codes.

nv30-optimized code for (hypothetical outside of ATI for now, but fitting statements made so far) "everyone else":
Code:
*-textureload {r,g,b,a}->xtex
*-add xtex, (0,0,int,int)->simpleeffect
*-add simpleeffect,(float,float,0,0)->simpleeffect
3 clock cycles

Swizzling furthers the optimization opportunities that could be explicitly stated in the LLSL without adversely affecting the nv30, while hindering the performance for others. So do modifiers.
So, IHVs should hunt for this in their low level optimizer when code would only look like this to benefit the nv30? That's backwards...nVidia is the one who should have to look for opportunities like this in their low level optimizer. That's the type of difference between the "DX 9 spec" and the "Cg spec"...not because the "Cg spec" can't be the same, but because nVidia expresses no interest in assuring that it will be (unless you count lip service, which other IHVs don't seem to want to do for some reason) and because IHVs already have the "DX 9 spec".

Related to this, my understanding of the nv30 constant implementation lends itself to unique expression in the LLSL that might result in unnecessary limitations for other architectures in complex usage based on something like the above. So, for example, the nv30 could have (float,float,int,int) as a constant and encode it as two constants with no additional concern, but this would use up registers cumulatively for other cards (unless they accepted they should be adding overhead to their optimizer instead of nVidia :-?).
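To make that concrete in the same amateurish pseudocode as above (the values are purely hypothetical, with the FX12-representable ones kept inside its range):

Code:
; what the author writes, conceptually one constant:
def c0, 0.9, -0.3, 0.5, -1.0     ; (float, float, fx12, fx12)
add r0, xtex, c0

; what an nv30-centric LLSL could carry explicitly:
def c0, 0.9, -0.3, 0.0, 0.0      ; fp half
def c1, 0.0, 0.0, 0.5, -1.0      ; FX12-representable half
add r0.xy, xtex, c0              ; fp add
add r0.zw, xtex, c1              ; could map to the integer units on nv30
; other hardware now tracks two constants (and two writes) where one add would do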

So how can you say one assembly shader is better than another, if they are equal in length, when there are no other "generally accepted" guidelines to follow? You might say "the one which executes faster is better", but that doesn't cover the case when one shader is faster on certain hardware and the other is faster on other hardware.

For DX 9, yes it does. There is a standard way of coding to the assembly to which several parties have input, so they can then further optimize at a low level. The issue is that nVidia wants to force IHVs to see to representing their hardware on nVidia's terms (Cg).

A HLSL shader compiler now has the same task, producing an assembly shader. It should follow the same recommendations as a human assembly shader programmer does. But there are almost no recommendations on the final output.

Where are you getting this "there are almost no recommendations"? Because I haven't pointed to a pdf?

Here are the possible sources for the DX low level code: DX 9 HLSL, Cg, hand coding.
Correct me if I understand you incorrectly: You are stating either that nVidia cannot produce Cg code targeted to DX 9 LLSL more suited to the nv30 than other hardware, because you think they have the same goal of reduced instruction count for DX 9 (as opposed to having a goal of performance for the nv30, which is not synonymous), because the nv30 doesn't have unique performance characteristics with regard to anything besides instruction count (which is why I pointed you to pocketmoon's page, by the way, which illustrates the opposite, as well as analyses presented in other threads)...or that it doesn't matter that it does, because other IHVs should optimize for the case when hand coding is as nVidia-centric as Cg's output anyway.

I think the first is incorrect, demonstrably by benchmarking and theoretically by analyzing the nv30 architecture's divergence from fast execution of general case DX9 LLSL without reauthoring (I simply state nVidia is responsible for providing this optimization for their hardware, and that this optimization should not be inflicted on other IHVs as the standard).

I think the second is wrong, though it does seem to be what nVidia is intent on accomplishing.

(btw, I don't know why you mention pocketmoon's Cg recommendations here, as I'm not talking about how you should write a HLSL shader)

Well, if you look at the recommendations, consider that they are nv30 specific recommendations that nVidia has an interest in having the Cg compiler handling, regardless of the impact on low level optimization workload for other IHVs. Then consider that some might not impact other IHV hardware negatively...that there is a conceivable difference between those that do and do not, despite what you have proposed. Then apply that thinking to some of the established unique attributes of the nv30.

You might then try doing the same with the R300, and realize how little divergence from general case specification expression its performance opportunities require.

I call that the result of a "better design".

The driver has to accept native assembly shaders, shaders compiled with Cg, DX9 HLSL, or any other shader compiler that might exist somewhere. It certainly should not rely on getting "DX9 HLSL compiler output style".

You seem to be asking "why encourage optimizing for anything except optimizing for the nv30", by ignoring that it is the nv30 which has a design that requires extensive optimization, and optimizations that don't make sense for other architectures. I return to my 4x4 versus 8x1 example.

demalion said:
...
I am asking because a shader programmer could take exactly the same recommendations into account as the Cg compiler does when writing shader code. I think this is pretty likely. A programmer surely takes some performance recommendations into account.

Of course, and the full set of those "same recommendations" are suitable considerations for the nv30, as I stated in one of my example cases. The problem is using all of that set of recommendations for other cards when there are other tools vendors are supporting (and not just one vendor, but other vendors collectively) where the recommendations represent more than one party's interests.
But how do you know the "set of recommendations" from NVidia is not better suited to other vendors' hardware than the "set of recommendations" from ATI?

Now you are stipulating things that are contraindicated by the actual indications we've seen so far. Also note that you are first equating "ATI" with the discussion of "general recommendations" (for the ARB fragment path as well as DX 9, by the way) and then not considering that the R300 is simply a design that handles the general set of recommendations better, and also ignoring my earlier mention that Carmack's comments seem to support that observation.

Currently there are only NVidia and ATI supporting PS1.4+, and <PS1.3 is too limited to allow more than the most trivial optimizations.
That's a pretty definite statement, but I don't know where you are getting it from. Perhaps addressing my hasty and amateurish (:-?) example above will provide an opportunity to clarify.

What if XabreII, DeltaChrome, P20, etc. perform more like NV30 than R300?
What if nVidia stuck to designing their hardware to perform optimally within standards?
By the way, none of those listed architectures have given any indication in released information of doing as you state, and in fact released information seems to indicate the opposite...that doesn't mean they won't (released information didn't clearly indicate the nv30's issues for a long while, though I think that is an nVidia-specific talent), but it does seem to make it a weak pretext for saying Cg is a suitable general benchmarking tool right now.
You did, however, skip the unknown quantity supposedly coming from PowerVR, which might indeed have similar issues (I don't think so, but it is possible). Also, with limited resources, it does actually seem like something Matrox might try to achieve, but I don't know of any indication that would support that right now.
 
demalion said:
No, it is not a positive feature except for nVidia cards. To recall the discussion that led up to that answer:
Huh? Why do you think it's only "NVidia-native"? Don't you think there are also ATI specific extensions that are supported here? And of course, it's an option to use a specific path. I don't think it is negative to get another (clearly vendor-specific) result from a benchmark when it is obvious that the "standard" result should be obtained without optimizations. It's always good to have an optional feature IMO.


demalion said:
To the DX 9 PS 2.0 spec, as I specifically stated.
Sorry, but the DX9 PS 2.0 spec only says what kinds of shaders are valid shaders, not what kinds of shaders are good shaders. So when writing a shader you have lots of options, but no indication of whether you're on the right track or not.

demalion said:
You continue to maintain that simplification, despite my having tried to point you to examples of that not being the case. Again, looking at some benchmarks, notice instruction count parity (including when targeting the same assembly) and yet performance differences even for the same datatype. Also notice the recommendations to aid Cg compilation at the end, which the Cg compiler itself could (and likely will) do, even though the concerns are unique to the nv30, not all architectures.
And you continue to talk about different targets while I am only talking about PS 2.0 assembly, and not about anything to aid high level compilers.

btw, arbitrary swizzle is an optional feature of "PS2.x", and if a compiler were to output code that uses it, it would not work on PS2.0 cards at all.

demalion said:
So how can you say one assembly shader is better than another, if they are equal in length, when there are no other "generally accepted" guidelines to follow? You might say "the one which executes faster is better", but that doesn't cover the case when one shader is faster on certain hardware and the other is faster on other hardware.

For DX 9, yes it does. There is a standard way of coding to the assembly to which several parties have input, so they can then further optimize at a low level. The issue is that nVidia wants to force IHVs to see to representing their hardware on nVidia's terms (Cg).
No, there is NO standard way of coding to the assembly. I can write whatever assembly shader I want, and as long as it is valid code, any DX9 compatible driver will accept it.

demalion said:
A HLSL shader compiler now has the same task, producing an assembly shader. It should follow the same recommendations as a human assembly shader programmer does. But there are almost no recommendations on the final output.

Where are you getting this "there are almost no recommendations"? Because I haven't pointed to a pdf?

Here are the possible sources for the DX low level code: DX 9 HLSL, Cg, hand coding.
Correct me if I understand you incorrectly: You are stating either that nVidia cannot produce Cg code targeted to DX 9 LLSL more suited to the nv30 than other hardware, because you think they have the same goal of reduced instruction count for DX 9 (as opposed to having a goal of performance for the nv30, which is not synonymous), because the nv30 doesn't have unique performance characteristics with regard to anything besides instruction count (which is why I pointed you to pocketmoon's page, by the way, which illustrates the opposite, as well as analyses presented in other threads)...or that it doesn't matter that it does, because other IHVs should optimize for the case when hand coding is as nVidia-centric as Cg's output anyway.
I think that any driver must be capable of optimizing any assembly shader code it gets, regardless of whether it was generated by Cg, the DX9 HLSL compiler, any other HLSL compiler, or coded in assembly.

demalion said:
The driver has to accept native assembly shaders, shaders compiled with Cg, DX9 HLSL, or any other shader compiler that might exist somewhere. It certainly should not rely on getting "DX9 HLSL compiler output style".

You seem to be asking "why encourage optimizing for anything except optimizing for the nv30", by ignoring that it is the nv30 which has a design that requires extensive optimization, and optimizations that don't make sense for other architectures. I return to my 4x4 versus 8x1 example.
I don't know how you read that interpretation out of this quote, but regarding your example: doesn't a shader optimized for 1 TMU per pipe equally make no sense for other architectures, like a 4x4 one?

demalion said:
Now you are stipulating things that are contraindicated by the actual indications we've seen so far. Also note that you are first equating "ATI" with the discussion of "general recommendations" (for the ARB fragment path as well as DX 9, by the way) and then not considering that the R300 is simply a design that handles the general set of recommendations better, and also ignoring my earlier mention that Carmack's comments seem to support that observation.
I have not seen indications that any product from another vendor is more like R300 and not like NV30 yet. Honestly, I have not seen in-depth information on their architecture at all, be it SiS, S3, 3DLabs or any other vendor.

I am not equating ATI with the set of "general recommendations". I just think the "set of recommendations" the DX9 HLSL compiler uses is much closer to ATI's set of recommendations than it is to NVidia's set of recommendations.

demalion said:
Currently there are only NVidia and ATI supporting PS1.4+, and <PS1.3 is too limited to allow more than the most trivial optimizations.
That's a pretty definite statement, but I don't know where you are getting it from. Perhaps addressing my hasty and amateurish :)-?) example above will provide an opportunity to clarify.
I think PS1.3 with its 4 texture ops + 8 arithmetic ops, no swizzle and a handful of registers is too limited to be a useful target for HLSL code.

demalion said:
What if XabreII, DeltaChrome, P20, etc. perform more like NV30 than R300?
What if nVidia stuck to designing their hardware to perform optimally within standards?
By the way, none of those listed architectures have given any indication in released information of doing as you state, and in fact released information seems to indicate the opposite...that doesn't mean they won't (released information didn't clearly indicate the nv30's issues for a long while either, though I think that is an nVidia-specific talent), but it does seem a weak pretext for saying Cg is a suitable general benchmarking tool right now.
You did, however, skip the unknown quantity supposedly coming from PowerVR, which might indeed have similar issues (I don't think so, but it is possible). Also, with their limited resources, it does actually seem like something Matrox might try to achieve, but I don't know of any indication that would support that right now.
What is the "standard" NVidia should stick to?
I don't know how chips from other vendors will perform, but I think it is possible for them to perform similar to NV30. Also, "etc." doesn't skip anyone.
 
Xmas said:
demalion said:
No, it is not a positive feature except for nVidia cards. To recall the discussion that led up to that answer:
Huh? Why do you think it's only "NVidia-native"? Don't you think there are also ATI specific extensions that are supported here?

Xmas, you are losing me here. We're talking about Cg...Cg doesn't compile to ATI-specific extensions, nor is it likely to as long as ATI is committed to the standards Cg bypasses and instead chooses to do things like design their hardware for more computational efficiency (pick some shader benchmarks comparing the cards, analyze the results, and tell me if you'd dispute even that).

And of course, it's an option to use a specific path. I don't think it is negative to get another (clearly vendor-specific) result from a benchmark when it is obvious that the "standard" result should be obtained without optimizations. It's always good to have an optional feature IMO.

Hmm... OK, let's go over this again:

There are "synthetic" tests. I only noticed files fo DirectX at the moment, but I presume they intend to implement them for OpenGL. When/if they do, of course it is not bad if they support all OpenGL extensions, but that is not at all what we are discussing. That follows...

There are "game" tests. They are implemented using Cg, period. The only optimizations this delivers are nVidia's prioritized compilation, and for those targets nVidia has defined. So, in addition to the restriction of standards to only the ones nVidia has an interest in exposing (which excludes the ATI fragment extension for OpenGL and PS 1.4), the only vendor specific optimizations offered are the ones provided by nVidia, for nVidia hardware.

Solutions, covered again:

The authors of Rightmark 3D support DX 9 HLSL (ignoring for the moment that the shaders in the .cg files seem short and able to be ported to the ATI fragment extension for OpenGL and the 8500/9000). This is what I referred to earlier as being as much as could be reasonably expected with limited resources (i.e., flawed, but potentially useful).

Other IHVs circumvent GLslang and DX 9 HLSL, whose specifications they happen to have a known say in, and support Cg on nVidia's terms. There does appear to be some reason for them to consider this undesirable; do you disagree?

demalion said:
To the DX 9 PS 2.0 spec, as I specifically stated.
Sorry, but the DX9 PS 2.0 spec only says what kinds of shaders are valid shaders, not what kinds of shaders are good shaders. So when writing a shader you have lots of options, but no indication of whether you're on the right track or not.
Yeah, and APIs only specify what kinds of programs are valid, not which programs are good. Your statement is misleading, since you are proposing that there is no relation between the specification of the instructions and modifiers and, following from that, what is a good method of executing them. That is simply not true.

For instance, the DX 9 PS 2.0 specification does not specify that two non-texture-op instructions from a restricted set occur for each PS 2.0 instruction or texture op. This is a possible occurrence, but depending on it for optimal performance is not a good method of execution. However, that is a characteristic of Cg's known target, and it is exactly this that nVidia is using Cg to control during shader content creation, both during development and at run time, before the low level optimizer has to worry about it. They promote that precompiled code be generated using Cg, even if there is another alternative, for that reason...but you maintain this does not matter despite evidence and analysis proposed to the contrary. :-?

demalion said:
You continue to maintain that simplification, despite my having tried to point you to examples of that not being the case. Again, looking at some benchmarks, notice instruction count parity (including when targeting the same assembly) and yet performance differences even for the same datatype. Also notice the recommendations to aid Cg compilation at the end, which the Cg compiler itself could (and likely will) do, even though the concerns are unique to the nv30, not all architectures.
And you continue to talk about different targets while I am only talking about PS 2.0 assembly, and not about anything to aid high level compilers.
We are talking about a high level compiler, Xmas. What do you think Cg is using to output to PS 2.0? You are circumventing the simple observation that compilers produce different code, by saying you are only talking about the code.

Why didn't you address my instruction example at all? If it has flaws, it would be useful to discuss where I went wrong. If it doesn't, it seems more than slightly applicable. :?:

You can write bad code using PS 2.0, you can hand code bad code, you can compile bad code in a compiler. The problem with Cg is that the bad code that the compiler will be improved to avoid is defined by nVidia. The benefit of DX 9 HLSL, as long as MS doesn't have a conflict of interest, is that all IHVs have input into that definition (without having to write a custom compiler at nVidia's request).
A good set of base principles can be reached and targeted without conflict, except for designs that are a bad match for the general case base specifications of the LLSL. It so happens that the nv30 is a bad design for all specifications I know of except nVidia's customized OpenGL extension and their formerly (?) bastardized PS 2.0, yet they want other IHVs to either also create another HLSL compiler ...for nVidia's toolset :!:...or allow nVidia to dictate the code their low level optimizers have to be engineered to address effectively.

You maintain there is nothing wrong with things occurring along the lines of this goal.

btw, arbitrary swizzle is an optional feature of "PS2.x", and if a compiler were to output code that uses it, it would not work on PS2.0 cards at all.

Is this in reference to the sentence after my example shader code? Thanks for the correction, but what about the rest of what I proposed?

demalion said:
So how can you say one assembly shader is better than another, if they are equal in length, when there are no other "generally accepted" guidelines to follow? You might say "the one which executes faster is better", but that doesn't cover the case when one shader is faster on certain hardware and the other is faster on other hardware.

For DX 9, yes it does. There is a standard way of coding to the assembly to which several parties have input, so they can then further optimize at a low level. The issue is that nVidia wants to force IHVs to see to representing their hardware on nVidia's terms (Cg).
No, there is NO standard way of coding to the assembly. I can write whatever assembly shader I want, and as long as it is valid code, any DX9 compatible driver will accept it.

Again, your statement is misleading, since we aren't talking about standard as in "able to run", but standard as in "able to run quickly", which has a not-so-slight impact on which cards people buy and which effects get implemented at all. This is again avoiding the significance of the reality that different compilers can produce different code.

demalion said:
A HLSL shader compiler now has the same task, producing an assembly shader. It should follow the same recommendations as a human assembly shader programmer does. But there are almost no recommendations on the final output.

Where are you getting this "there are almost no recommendations"? Because I haven't pointed to a pdf?

Here are the possible sources for the DX low level code: DX 9 HLSL, Cg, hand coding.
Correct me if I understand you incorrectly: You are stating that... it doesn't matter that it does [generate nv30 optimized LLSL code] because other IHVs should optimize for the case when hand coding is as nVidia centric as Cg's output anyways.
I think that any driver must be capable of optimizing any assembly shader code it gets, regardless of whether it was generated by Cg, the DX9 HLSL compiler, or any other HLSL compiler, or coded in assembly.
Well, that seems to fit my second description, quoted alone for brevity. I already provided an answer:
demalion said:
I think the second is wrong, though it does seem to be what nVidia is intent on accomplishing.

And expanded upon it (why didn't you address this the first time?):

demalion said:
(btw, I don't know why you mention pocketmoon's Cg recommendations here, as I'm not talking about how you should write a HLSL shader)
Well, if you look at the recommendations, consider that they are nv30-specific recommendations that nVidia has an interest in having the Cg compiler handle, regardless of the impact on the low level optimization workload for other IHVs. Then consider that some might not impact other IHV hardware negatively...that there is a conceivable difference between those that do and those that do not, despite what you have proposed. Then apply that thinking to some of the established unique attributes of the nv30.

You might then try doing the same with the R300, and realize how little divergence from general case specification expression its performance opportunities require.

I call that the result of a "better design".

Xmas said:
demalion said:
The driver has to accept native assembly shaders, shaders compiled with Cg, DX9 HLSL, or any other shader compiler that might exist somewhere. It certainly should not rely on getting "DX9 HLSL compiler output style".

You seem to be asking "why encourage optimizing for anything except optimizing for the nv30", by ignoring that it is the nv30 which has a design that requires extensive optimization, and optimizations that don't make sense for other architectures. I return to my 4x4 versus 8x1 example.
I don't know how you read that interpretation out of this quote, but regarding your example: doesn't a shader optimized for 1 TMU per pipe equally make no sense for other architectures, like a 4x4 one?

The difference is that an 8x1 is full speed when handling any number of textures, while a 4x4 architecture is only full speed when handling multiples of 4. I get that interpretation out of your text because you are saying that code optimized for the nv30 is exactly what, for example, the R300 should be able to handle, rather than that the nv30 should be able to handle code generated to provide general case opportunities (I've even provided a list of optimizations I propose are common, for discussion).

Anyways, pardon the emphasis, as we've been circumventing this idea by various paths in our discussion a few times now:
The base specification that allows applying 1 or more textures to a pixel is not optimized for 1 TMU per pipe; 1 TMU per pipe is optimized for efficiency for the base specification.

Your assertion that the DX 9 HLSL output is just as unfair as Cg's output is predicated on a flawed juxtaposition of that statement, one that parallels the situation with nVidia's hardware (try replacing "1" with "4"...that reflects what you'd have to say to support a similar assertion about Cg and DX 9 HLSL, and it does not make sense, because that statement excludes the case of fewer than 4 textures, while this statement does not exclude the case of 4 or more textures...).

This TMU example pretty closely parallels several aspects of the difference between the shader architectures in question. You continue to propose that, because the specification allows expressing code in such a way that it is optimized for the architecture with the more specific requirements, it is fine to optimize code for the specific case instead of the general case, despite discussion of how such optimization can hinder the performance of other architectures. The cornerstone of this belief seems to be that this is fine because the hardware designed for the general case should be able to handle any code thrown at it at all (why bother to design something that performs well in the general case anyways?), rather than the hardware designed for the specialized case having to seek opportunities to address its own special requirements.

You proclaim that the idea of HLSL optimization for DX 9 is not significant, so it doesn't matter what optimization is used, and that continues to make no sense to me.

I have not seen indications that any product from another vendor is more like R300 and not like NV30 yet. Honestly, I have not seen in-depth information on their architecture at all, be it SiS, S3, 3DLabs or any other vendor.
Well, I maintain that the R300 approach is rather obviously more suited to API specifications than the nv30 in terms of design efficiency, and that the nv30 approach only makes sense given nVidia's commitment to their own legacy hardware.

As what I feel is another opinion similar to what I've been saying:
In this interview (http://www.beyond3d.com/interviews/jcnv30r300/index.php?p=2), Carmack said:
Apparently, the R300 architecture is a better target for a straightforward assembler / compiler, while the NV30 is twitchy enough to require more serious analysis and scheduling, which is why they expect significant improvements with later drivers.

Maybe he's wrong, or I misunderstand him, and maybe my prior analysis is wrong, but could you address why that is the case, instead of simply stating that this assertion is wrong in order to support the claim that using Cg is fine for a general benchmark no matter what its code output looks like?

As far as indications go, I think the P10's current architecture details do not lend themselves to a pipeline that is partially integer, where the integer units sit idle when doing floating point processing, nor does what I recall from the GLslang specifications indicate this type of focus. I do feel it is pretty safe to say that the P20 design will be focused on general case performance, which I again state the nv30's design is not (except insofar as the "general" case input to the low level optimizer is redefined as much as possible by Cg).

Also, I think this information provided by S3 indicates more similarity to the R300 than nv30, but we'll actually have to see the delivered hardware to be sure.

...
I am not equating ATI with the set of "general recommendations".
Pardon me, perhaps I should have said you are "replacing the idea of 'general recommendations' with 'recommendations suited to ATI'".
I just think the "set of recommendations" the DX9 HLSL compiler uses is much closer to ATI's set of recommendations than it is to NVidia's set of recommendations.
Yes it is, and not by accident! ATI designed the R300 that way. Smart of them, wasn't it?
The nature of your commentary is predicated on "vendor" versus "vendor", ignoring that one vendor's design is simply better suited for the general case, by substituting that vendor's name when discussing the "general case". I really don't think you have a leg to stand on for that, and I've articulated my reasons for believing so. We're not effectively communicating here, though, and I think my attempts at useful examples for discussion are being unproductively bypassed.
I think PS1.3 with its 4 texture ops + 8 arithmetic ops, no swizzle and a handful of registers is too limited to be a useful target for HLSL code.
Well, the argument that HLSL development is too good for limited shader lengths per pass is one that we can have another time. What I'll maintain for now is that it is not up to nVidia (or you) to determine that for everyone.

What is the "standard" NVidia should stick to?
Please address my example shader code directly, my discussion of pocketmoon's shader results, and my 8x1 versus 4x4 parallel that attempts to answer that very question, and then tell me why they do not apply in some way other than "HLSL optimization doesn't matter because the 'assembly' low level optimizer should be able to handle any code it receives", since that ignores the demonstrations provided that low level optimizers are limited and that HLSL compilers are not limited to "on the fly" use.
I don't know how chips from other vendors will perform, but I think it is possible for them to perform similar to NV30.
Yes, it is, but as I've stated I think it is a weak premise to use as support for your assertion that the HLSL output isn't suited to the general case, but is rather suited to "ATI". It doesn't fit my understanding of the architectures, nor does it seem related to any realistic interpretation of the relationships between ATI, Microsoft, nVidia, and all other IHVs. It does seem to arbitrarily equate "ATI" and "general case" to me, due to ATI designing with that in mind, but we covered that just above.
Also, "etc." doesn't skip anyone.
What I meant by the text (which I leave unquoted) was that you skipped listing the possibilities that I don't have solid information to doubt (PowerVR and Matrox) and listed ones that I did. I don't have any indication about the Xabre II, I guess, and I do think a case could be made for them actually being the most likely of all to produce something similar to the NV31 or NV34.
 
The name says it all, RightMark (finally we get to see 2.0 shader benchmarks written to use NV30 hardware as it's intended).
My thoughts:

a.) NVidia is moving too fast for everyone else, and even the brand spanking new DirectX 9.0 does not take advantage of some of their advanced features.

b.) I hate to keep hearing people talk about how NVidia is lowering the specs. fp32/fp16 is better than just fp24, plain and simple.
fp32 allows better quality and fp16 allows better speed. Part of the whole FX push is being able to use the same fp32 as the movies like Shrek and Toy Story. The goal is to be able to render these movies in real time (and I'm almost 100% sure they already demoed a scene from Toy Story). ATI can't do this at the same precision as the original movie. fp16 is great, and those screenshots floating around the net don't represent the difference between fp16 and fp24; I'm sure there was something far more dramatic occurring, like a bug, since the difference between these two formats is not something a human eye can detect. And even those issues were fixed with the newest drivers while still maintaining the high frame rates, so the performance boost is coming from somewhere else besides downgrading the floating point precision.

c.) Cg is already more functional from a developer's point of view than Microsoft's compiler. It's been integrated into all the major 3D modeling tools and should eventually let artists, not programmers, handle the special effects, and that alone will increase quality. Can you say "let's draw some pretty water" and not "let's program some pretty water"? :D

d.) M$ chose not to implement all their features into DirectX 9.0, now that was a very mean thing to do. But they are very capable of supporting themselves, unlike ATI. They are key developers in the creation of OpenGL, they are now creating their own very capable compiler, and they're active in all types of development tools. I like the FX specifications a lot, and I can't wait for games to fully utilize them. Without Cg (most likely exporting to OpenGL, since M$ won't play nicely) these features would never get used, and those are the games I'm buying. Who in the world gave M$ or ATI the right to tell NVidia the best way to display graphics?

In the end I just want to see the GeForce FX do everything it says it can do and then compare it to the Radeon doing its best job. I would be in favor of exclusive games for ATI and GeForce respectively, just to stop them from slowing each other down, and it looks like this is where things are headed.
 