Xmas said:
demalion said:
No, it is not a positive feature except for nVidia cards. To recall the discussion that led up to that answer:
Huh? Why do you think it's only "NVidia-native"? Don't you think there are also ATI specific extensions that are supported here?
Xmas, you are losing me here. We're talking about Cg... Cg doesn't compile to ATI-specific extensions, nor is it likely to as long as ATI is committed to the standards Cg bypasses, and instead chooses to do things like design its hardware for greater computational efficiency (pick some shader benchmarks comparing the cards, analyze the results, and tell me if you'd dispute even that).
And of course, it's an option to use a specific path. I don't think it is negative to get another (clearly vendor-specific) result from a benchmark when it is obvious that the "standard" result should be obtained without optimizations. It's always good to have an optional feature IMO.
Hmm... OK, let's go over this again:
There are "synthetic" tests. I only noticed files fo DirectX at the moment, but I presume they intend to implement them for OpenGL. When/if they do, of course it is not bad if they support all OpenGL extensions, but that is not at all what we are discussing. That follows...
There are "game" tests. They are implemented using Cg, period. The only optimizations this delivers are nVidia's prioritized compilation, and for those targets nVidia has defined. So, in addition to the restriction of standards to only the ones nVidia has an interest in exposing (which excludes the ATI fragment extension for OpenGL and PS 1.4), the only vendor specific optimizations offered are the ones provided by nVidia, for nVidia hardware.
Solutions, covered again:
The authors of Rightmark 3D support DX 9 HLSL (ignoring for the moment that the shaders in the .cg files seem short and able to be ported to the ATI fragment extension for OpenGL and the 8500/9000). This is what I referred to earlier as being as much as could be reasonably expected with limited resources (i.e., flawed, but potentially useful).
Other IHVs circumvent GLslang and DX 9 HLSL, whose specifications they happen to have a known say in, and support Cg on nVidia's terms. There does appear to be some reason for them to consider this undesirable; do you disagree?
demalion said:
To the DX 9 PS 2.0 spec, as I specifically stated.
Sorry, but the DX9 PS 2.0 spec only says what kinds of shaders are valid shaders, not what kinds of shaders are good shaders. So when writing a shader you have lots of options, but no indication of whether you're on the right track or not.
Yeah, and APIs only specify what kinds of programs are valid, not what kinds are good programs. Your statement is misleading, since you are proposing that there is no relation between the specification of the instructions and modifiers and what is subsequently a good method of executing them. That is simply not true.
For instance, the DX 9 PS 2.0 specification does not specify that two non-texture-op instructions from a restricted set occur for each PS 2.0 instruction or texture op. That is a possible occurrence, but depending on it for optimal performance is not a good method of execution. It is, however, a characteristic of Cg's known target, and it is exactly this that nVidia is using Cg to control for shader content creation, during development and at run time, before the low-level optimizer has to worry about it. They promote that precompiled code be generated using Cg, even when there is another alternative, for that reason... but you maintain this does not matter despite the evidence and analysis proposed to the contrary.
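To make that concrete with a deliberately trivial, hypothetical illustration (not taken from RightMark or from any compiler's actual output): both of the following ps_2_0 fragments are equally valid to the spec and compute the same result, and the only difference between them is whether the texture fetches are grouped together or interleaved with the arithmetic:

[code]
// ordering A: fetch both textures first, then do the arithmetic
ps_2_0
dcl t0.xy
dcl t1.xy
dcl_2d s0
dcl_2d s1
texld r0, t0, s0
texld r1, t1, s1
mul r0, r0, c0
mad r0, r1, c1, r0
mov oC0, r0

// ordering B: interleave the fetches with the arithmetic
ps_2_0
dcl t0.xy
dcl t1.xy
dcl_2d s0
dcl_2d s1
texld r0, t0, s0
mul r0, r0, c0
texld r1, t1, s1
mad r0, r1, c1, r0
mov oC0, r0
[/code]

The spec is satisfied by both; which one is "good" depends on how a given chip schedules texture and arithmetic work, and that is exactly the kind of preference a compiler tuned to one target bakes into its output.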
demalion said:
You continue to maintain that simplification, despite my having tried to point you to examples of it not being the case. Again, looking at some benchmarks, notice instruction count parity (including when targeting the same assembly) and yet performance differences even for the same datatype. Also notice the recommendations to aid Cg compilation at the end, which the Cg compiler itself could (and likely will) do, even though the concerns are unique to the nv30, not all architectures.
And you continue to talk about different targets while I am only talking about PS 2.0 assembly, and not about anything to aid high level compilers.
We are talking about a high-level compiler, Xmas. What do you think Cg uses to output PS 2.0? You are sidestepping the simple observation that compilers produce different code by saying you are only talking about the code.
Why didn't you address my instruction example at all? If it has flaws, it would be useful to discuss where I went wrong. If it doesn't, it seems more than slightly applicable.
You can write bad code using PS 2.0, you can hand-code bad code, and you can compile bad code with a compiler. The problem with Cg is that the bad code the compiler will be improved to avoid is defined by nVidia. The benefit of DX 9 HLSL, as long as MS doesn't have a conflict of interest, is that all IHVs have input into that definition (without having to write a custom compiler at nVidia's request).
A good set of base principles can be reached and targeted without conflict, except for designs that are a bad match for the general-case base specifications of the LLSL. It so happens that the nv30 is a bad design for every specification I know of except nVidia's customized OpenGL extension and their formerly (?) bastardized PS 2.0, yet they want other IHVs either to create yet another HLSL compiler for nVidia's toolset, or to allow nVidia to dictate the code their low-level optimizers have to be engineered to address effectively.
You maintain there is nothing wrong with things occurring along the lines of this goal.
btw, arbitrary swizzle is an optional feature of "PS2.x" and if a compiler would output code that uses it, it would not work on PS2.0 cards at all.
Is this in reference to the sentence after my example shader code? Thanks for the correction, but what about the rest of what I proposed?
demalion said:
So how can you say one assembly shader is better than another, if they are equal in length, when there are no other "generally accepted" guidelines to follow? You might say "the one which executes faster is better", but that doesn't cover the case when one shader is faster on certain hardware and the other is faster on other hardware.
For DX 9, yes it does. There is a standard way of coding to the assembly, to which several parties have input, so that they can then further optimize at a low level. The issue is that nVidia wants to force IHVs to represent their hardware on nVidia's terms (Cg).
No, there is NO standard way of coding to the assembly. I can write whatever assembly shader I want, and as long as it is valid code, any DX9 compatible driver will accept it.
Again, your statement is misleading, since we aren't talking about standard as in "able to run", but standard as in able to run quickly, which has a not-so-slight impact on which cards people buy and which effects get implemented at all. This again avoids the significance of the reality that different compilers can produce different code.
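A hypothetical illustration of "standard as in able to run quickly" (again, not anyone's actual compiler output): the two fragments below are both valid ps_2_0, equal in length, and produce the same result; the only difference is that one uses three temporary registers while the other recycles one. As I understand the benchmark results that have been published, the nv30 loses speed as the temporary register count rises in a way the R300 does not appear to, so only one of these is "good" code for it even though the spec sees no difference between them.

[code]
// version A: three temporaries used
ps_2_0
dcl t0.xy
dcl_2d s0
texld r0, t0, s0
mul r1, r0, c0
add r2, r1, c1
mov oC0, r2

// version B: a single temporary reused throughout
ps_2_0
dcl t0.xy
dcl_2d s0
texld r0, t0, s0
mul r0, r0, c0
add r0, r0, c1
mov oC0, r0
[/code]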
demalion said:
A HLSL shader compiler now has the same task, producing an assembly shader. It should follow the same recommendations as a human assembly shader programmer does. But there are almost no recommendations on the final output.
Where are you getting this "there are almost no recommendations"? Because I haven't pointed to a pdf?
Here are the possible sources for the DX low level code: DX 9 HLSL, Cg, hand coding.
Correct me if I understand you incorrectly: you are stating that... it doesn't matter that it does [generate nv30-optimized LLSL code], because other IHVs should optimize for hand coding that is as nVidia-centric as Cg's output anyway.
I think that any driver must be capable of optimizing any assembly shader code it gets, regardless if it was generated by Cg, DX9 HLSL compiler, any other HLSL compiler or coded in assembly.
Well, that seems to fit my second description, quoted alone for brevity. I already provided an answer:
demalion said:
I think the second is wrong, though it does seem to be what nVidia is intent on accomplishing.
And expanded upon it (why didn't you address this the first time?):
demalion said:
(btw, I don't know why you mention pocketmoon's Cg recommendations here, as I'm not talking about how you should write a HLSL shader)
Well, if you look at the recommendations, consider that they are nv30-specific recommendations that nVidia has an interest in having the Cg compiler handle, regardless of the impact on the low-level optimization workload for other IHVs. Then consider that some might not impact other IHVs' hardware negatively... that there is a conceivable difference between those that do and those that do not, despite what you have proposed. Then apply that thinking to some of the established unique attributes of the nv30.
You might then try doing the same with the R300, and realize how little divergence from a general-case expression of the specification its performance opportunities require.
I call that the result of a "better design".
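To make that concrete with one of the recommendations in question (the use of half precision, as I recall it from pocketmoon's write-up): in ps_2_0 assembly this shows up as the _pp modifier, a hint that reportedly buys the nv30 speed because of its fp16/fp32 split, while the R300 runs everything at its single fp24 precision and simply ignores it. A hypothetical sketch:

[code]
ps_2_0
dcl t0.xy
dcl_2d s0
texld r0, t0, s0
mul_pp r0, r0, c0      // _pp = partial precision hint
add_pp r0, r0, c1      // reportedly a gain on the nv30, effectively a no-op on the R300
mov oC0, r0
[/code]

That is the sort of recommendation that costs the general case nothing, as opposed to the ones that reshape the code around one chip's scheduling quirks.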
Xmas said:
demalion said:
The driver has to accept native assembly shaders, shaders compiled with Cg, DX9 HLSL, or any other shader compiler that might exist somewhere. It certainly should not rely on getting "DX9 HLSL compiler output style".
You seem to be asking "why encourage optimizing for anything except optimizing for the nv30", by ignoring that it is the nv30 which has a design that requires extensive optimization, and optimizations that don't make sense for other architectures. I return to my 4x4 versus 8x1 example.
I don't know how you read that certain interpretation out of this quote, but regarding your example: isn't a shader optimized for 1 TMU per pipe equally not making sense for other architectures like 4x4?
The difference is that an 8x1 is full speed when handling any number of textures, while a 4x4 architecture is only full speed when handling multiples of 4. I get that out of your text because you are saying that code optimized for the nv30 is exactly what, for example, the R300 should be able to handle, rather than that the nv30 should be able to handle code generated to provide general-case opportunities (I've even provided a list, for discussion, of optimizations I propose are common).
Anyways, pardon the emphasis, as we've been circumventing this idea by various paths in our discussion a few times now:
The base specification that allows applying 1 or more textures to a pixel is not optimized for 1 TMU per pipe, 1 TMU per pipe is optimized for efficiency for the base specification.
Your assertion that the DX 9 HLSL output is just as unfair as Cg's output is predicated on a flawed juxtaposition of that statement to parallel the situation with nVidia's hardware (try replacing "1" with "4"... that reflects what you'd have to say to support a similar assertion about Cg and DX 9 HLSL, and it does not make sense, because that version excludes fewer than 4 textures, while the original does not exclude 4 or more textures...).
This TMU example pretty closely parallels several aspects of the difference between the shader architectures in question. You continue to propose that, because the specification allows expressing code in a way that is optimized for the architecture with the more specific requirements, it is fine to optimize code for the specific case instead of the general case, despite discussion of how such optimization can hinder the performance of other architectures. The cornerstone of this belief seems to be that the hardware designed for the general case should be able to handle any code thrown at it (why bother to design something that performs well in the general case anyway?), rather than the hardware designed for the specialized case having to seek out opportunities to address its own special requirements.
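To put rough numbers on my hypothetical parts (8 pipes x 1 TMU versus 4 pipes x 4 TMUs, at the same clock), looking only at how busy each design keeps its own texturing hardware:
With 1 texture per pixel: the 8x1 part outputs 8 pixels per clock with all 8 of its TMUs busy; the 4x4 part outputs 4 pixels per clock with only 4 of its 16 TMUs busy.
With 3 textures per pixel: the 8x1 part outputs 8 pixels every 3 clocks, still with every TMU busy; the 4x4 part outputs 4 pixels per clock but leaves 4 of its 16 TMUs idle.
With 4 textures per pixel: only here are both designs at full utilization (8 pixels per 4 clocks versus 4 pixels per clock).
The 8x1 arrangement never wastes the texturing hardware it has; the 4x4 arrangement only avoids waste when the workload happens to be shaped for it. That is the sense in which I say one is optimized for the base specification, while the other needs the code optimized for it.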
You proclaim that the idea of HLSL optimization for DX 9 is not significant, so it doesn't matter what optimization is used, and that continues to make no sense to me.
I have not seen indications that any product from another vendor is more like R300 and not like NV30 yet. Honestly, I have not seen in-depth information on their architecture at all, be it SiS, S3, 3DLabs or any other vendor.
Well, I maintain that the R300 approach is rather obviously more suited to API specifications than the nv30 in terms of design efficiency, and that the nv30 approach only makes sense given nVidia's commitment to their own legacy hardware.
As another opinion that I feel is similar to what I've been saying:
[url=http://www.beyond3d.com/interviews/jcnv30r300/index.php?p=2]In this interview[/url], Carmack said:
Apparently, the R300 architecture is a better target for a straightforward assembler / compiler, while the NV30 is twitchy enough to require more serious analysis and scheduling, which is why they expect significant improvements with later drivers.
Maybe he's wrong, or I misunderstand him, and maybe my prior analysis is wrong, but could you address why that is the case, instead of simply stating that this assertion is wrong in order to support the claim that using Cg is fine for a general benchmark no matter what its code output looks like?
As far as indications go, I think the P10's current architecture details do not lend themselves to a pipeline that is partially integer, where the integer units sit idle when doing floating-point processing, nor does what I recall from the GLslang specifications indicate this type of focus. I do feel it is pretty safe to say that the P20 design will be focused on general-case performance, which I again state the nv30's design is not (except insofar as the "general case" input to the low-level optimizer is redefined as much as possible by Cg).
Also, I think this information provided by S3 indicates more similarity to the R300 than to the nv30, but we'll actually have to see the delivered hardware to be sure.
...
I am not equating ATI with the set of "general recommendations".
Pardon me, perhaps I should have said you are "replacing the idea of 'general recommendations' with 'recommendations suited to ATI'".
I just think the "set of recommendations" the DX9 HLSL compiler uses is much closer to ATI's set of recommendations than it is to NVidia's set of recommendations.
Yes it is, and not by accident! ATI designed the R300 that way. Smart of them, wasn't it?
The nature of your commentary is predicated on "vendor" versus "vendor", ignoring that one vendor's design is simply better suited to the general case, by substituting that vendor's name whenever the "general case" is discussed. I really don't think you have a leg to stand on there, and I think I've articulated my reasons for believing that. We're not communicating effectively here, though, and I think my attempts at useful examples for discussion are being unproductively bypassed.
I think PS1.3 with its 4 texture ops + 8 arithmetic ops, no swizzle and a handful of registers is too limited to be a useful target for HLSL code.
Well, the argument that HLSL development is too good for limited shader lengths per pass is one that we can have another time. What I'll maintain for now is that it is not up to nVidia (or you) to determine that for everyone.
What is the "standard" NVidia should stick to?
Please address my example shader code directly, my discussion of pocketmoon's shader results, and my 8x1 versus 4x4 parallel that attempts to answer that very question, and then tell me why they do not apply, in some way other than "HLSL optimization doesn't matter because the 'assembly' low-level optimizer should be able to handle any code it receives", since that ignores the demonstrations provided that low-level optimizers are limited and that HLSL compilers are not limited to "on the fly" use.
I don't know how chips from other vendors will perform, but I think it is possible for them to perform similar to NV30.
Yes, it is, but as I've stated, I think that is a weak premise to use as support for your assertion that the HLSL output isn't suited to the general case but is instead suited to "ATI". It doesn't fit my understanding of the architectures, nor does it seem related to any realistic interpretation of the relationships between ATI, Microsoft, nVidia, and all other IHVs. It does seem to me to arbitrarily equate "ATI" and the "general case", due to ATI designing with that in mind, but we covered that just above.
Also, "etc." doesn't skip anyone.
What I meant by the text (which I leave unquoted) was that you skipped listing the possibilities I don't have solid information to doubt (PowerVR and Matrox) and listed the ones I did. I don't have any indication about the Xabre II, I guess, and I do think a case could be made that they are actually the most likely of all to produce something similar to the NV31 or NV34.