DemoCoder said:
I already mentioned using texture lookup replacement. The most obvious example is normalization, or attenuation/pow() replacements.
Right, and if you sometimes depend on this for performance, in place of the computational functionality you also have, then deciding the right time to swap texture lookups for the computation is a performance decision, but not one a generic re-scheduler can make: it depends on the card's implementation and on the other bandwidth demands occurring at the same time.
Also, won't your examples result in continuity problems, as well as severe precision issues, given nVidia's texture access limitations?
AFAICS, in such a case, the performance gain isn't something that can be presumed to represent an opportunity available to re-scheduling.
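To make the example concrete, the substitution under discussion looks roughly like the sketch below (my own illustration in Cg/HLSL terms, not anything from Valve's shaders; normCube and powRamp are hypothetical pre-baked textures). The point is that the lookup path's precision and continuity are bounded by the texture's format and resolution rather than by the ALU.

[code]
// Sketch only -- my own example of the substitution, not Valve's code.
// normCube: hypothetical cubemap storing normalized directions (RGB = 0.5*N + 0.5).
// powRamp:  hypothetical 1D texture storing pow(x, 32).
struct v2f {
    float3 normal  : TEXCOORD0;   // interpolated (unnormalized) surface normal
    float3 halfVec : TEXCOORD1;   // interpolated half vector
};

float4 main(v2f IN,
            uniform samplerCUBE normCube,
            uniform sampler1D   powRamp) : COLOR
{
    // Math path: exact, but burns ALU instructions (rsq/mul for normalize,
    // lg2/mul/ex2 for pow) and holds extra registers on NV3x.
    float3 N    = normalize(IN.normal);
    float  spec = pow(saturate(dot(N, normalize(IN.halfVec))), 32.0);

    // Lookup path: same intent, approximated through textures -- less ALU work,
    // more bandwidth, and precision limited by the texture format/resolution.
    float3 N2    = texCUBE(normCube, IN.normal).rgb * 2.0 - 1.0;
    float  spec2 = tex1D(powRamp, saturate(dot(N2, IN.halfVec))).r;

    return float4(spec, spec2, 0, 1);  // dummy output; a real shader would pick one path
}
[/code]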
But what do you mean by shader code restructuring?
Point of information: That's not my statement, that's Valve's.
If by that, you mean reducing register count, moving code blocks around, substituting more optimal subexpressions, all of this can be done by compiler technology.
That's the question I agree we should look at, replacing "you" (me) with "Valve". Given that there is an NV3x path (NV35, really, though I'm interested in how the NV30 would perform with it), I think Valve and nVidia devrel came up with a compromise set of shaders that didn't try to do the same work in a different way, but instead executed a new workload that performed better while looking at least as good as the DX 8.1 shaders (and resembling the DX 9 shaders closely where possible). This is a game, so that makes sense, and it introduces an opportunity for "better than DX 8, but similar in demands" shaders if the developer has the time.
To me, this correlates strongly with a departure from "an optimally re-scheduled implementation doing the same thing as the DX 9 shaders" (which is what a driver re-scheduler would produce), for several reasons:
- image quality and workload equivalence were not the goal; achieving the necessary performance improvement on the NV35 was;
- the shaders used in this path are vendor/card specific;
- nVidia was involved in making them;
- nVidia cites an inability to distinguish PS 1.4 shaders from PS 2.0 in part of their defensive reply to Valve's statements;
- absolutely nothing (again, AFAICS) in either a gamer's interest or Valve's (once they committed to an NV3x path) demands that this new body of shaders necessarily do (exactly) the same workload as the DX 9 shaders.
This is not presented as something conclusive, as I am encouraging the same investigation of Valve's shaders that you are... it is presented as why I don't think it makes sense, at the moment, to represent the NV3x versus DX 9 HL2 performance figures as reflecting something a re-scheduler could tackle. Evaluating that should be part of the investigation we both think should occur.
IOW, "I disagree with that, and I think I have good reason...you can discuss why my reasons are perhaps good or bad, but this doesn't mean we shouldn't then go and investigate what is actually the case afterwards". I hope the context is clear?
The only "tricks" I could see related to "code restructuring" which are not generally available to compilers are replacing expressions with approximations, e.g. numerical integration substitutes, Newton-Raphson, etc., since that would require the compiler to "recognize" fuzzy mathematical concepts which might not be detectable at compile time and replace them with approximations.
Well, I think the texture lookup cases end up being approximations too, both for this hardware and for a generic implementation of the sort a re-scheduler would produce, don't they?
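To illustrate the distinction I think you're drawing (my own toy example, not compiler output): a person can decide that a cheap seed plus one Newton-Raphson step is "close enough" for a lighting term, but that is a change in the computed result, which is exactly why a compiler can't make the call on its own.

[code]
// Sketch only -- illustrating approximation substitution, my own example.

// Exact form: full 1/sqrt(dot(v, v)).
float exactInvLen(float3 v)
{
    return 1.0 / sqrt(dot(v, v));
}

// Approximate form: a cheap reciprocal-square-root seed (it could equally be
// a texture lookup) refined by a single Newton-Raphson iteration.
float approxInvLen(float3 v)
{
    float d = dot(v, v);
    float x = rsqrt(d);                    // low-precision hardware estimate
    return x * (1.5 - 0.5 * d * x * x);    // one Newton-Raphson refinement step
}
[/code]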
Another example is flat out disabling parts of the lighting equation: e.g. turn off per-pixel specular, switch to per-vertex calculations, leave off fresnel term for water, etc.
That's part of the DX version level support as Valve has listed before. I do think there is opportunity for some of this as part of the NV3x path solution as well, since it seems to be part of the same integrated structure of implementation control that was described: DX 6, 7, 8, 8.1, 9 were presented as target configs, with decisions on exactly such factors alongside the increasing shader versions being used, and the "8.2" (which I think Anand referred to) and "NV3x" configs seem to be additions to that list.
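For what it's worth, the kind of per-target "drop a term" decision being described could look something like the sketch below (my own illustration, not Valve's shaders; FULL_PATH is a hypothetical compile-time switch standing in for the target-config choice).

[code]
// Sketch only -- my example of leaving the fresnel term off on a cheaper target.
float4 waterColor(float3 viewDir, float3 normal,
                  float4 reflection, float4 refraction)
{
#ifdef FULL_PATH
    // Full path: per-pixel fresnel-style falloff, reflectivity varies with view angle.
    float fresnel = pow(1.0 - saturate(dot(normalize(viewDir), normal)), 4.0);
#else
    // Reduced path: fresnel term left off, fixed reflectivity instead.
    float fresnel = 0.2;
#endif
    return lerp(refraction, reflection, fresnel);
}
[/code]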
Like I said, I'd wait till I see Valve's actual shaders.
I agree with this. What I'm disagreeing with is the indication, as I read it, that the performance differences are naturally related to missed opportunities by a low-level driver re-scheduler, when all of these other factors are part of what the "NV3x" implementation determines as well.
...
But I can tell you that Cg's NV3x code generator is a waste-o-rama when it comes to registers, and it doesn't even generate peephole optimizations to reduce instruction count; that's why I have my doubts about the quality of the translation from PS 2.0 bytecode into NV3x internal assembly in NVidia's drivers.
I definitely think there is room for improvement. I also think, unfortunately, that nVidia is perfectly willing to throw in any method of performance "improvement" and "sell" it as this type of optimization. I'd prefer if we could spend our time focused more purely on investigating the former, but the latter issue intrudes. However, I am also saying that making assumptions about the latter shouldn't preclude investigating the former (and there, I think I'm agreeing with you).
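As a toy illustration of the register/instruction-count point (my own example, not real Cg output): this is the sort of algebraically equivalent restructuring a decent code generator should find on its own, and it's the kind of thing that's cheap to miss but expensive on an architecture where live registers cost performance.

[code]
// Naive form: two MULs and an ADD, with t0 and t1 both held until the ADD,
// unless the backend folds the last two statements into a MAD.
float3 naive(float3 a, float3 b, float3 c)
{
    float3 t0 = a * b;
    float3 t1 = c * b;
    float3 t2 = t0 + t1;
    return t2;
}

// Restructured form: same result by the distributive law, one ADD and one MUL,
// with no extra temporaries held across instructions.
float3 restructured(float3 a, float3 b, float3 c)
{
    return (a + c) * b;
}
[/code]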
NVidia has a non-trivial job, because of their overcomplicated architecture and performance "gotchas" with respect to resource usage, whereas ATI's "DX9-like" architecture and single precision through the pipeline, with no bottlenecks in register usage, make the optimization issue much easier.
This is exactly the type of thing I would find interesting to focus on in a comparison article. I think the cheating issues should be investigated, isolated, and dealt with separately, to avoid confusion between that issue and this one.