Driver based HLSL->GPU opcode compiler. Everything is done in one place, and controlled by the IHVs.
The advantage of this is that every HLSL->GPU opcode optimization opportunity that can be implemented is accessible, given the right compiler, and the IHV has complete control over releasing HLSL compilation improvements, as part of driver updates if necessary.
Reality intrusion: Finding all optimization possibilities doesn't come for free, and neither does creating bug and conflict free compilers across multiple architectures while attempting to find them, let alone updating those compilers significantly more often than they otherwise would be. Also, by virtue of being a standard, there is still "one body" imposing limits on the behavior, though the standard is the only place where the "strengths" and "weaknesses" of that body are guaranteed to be brought to bear. However, if IHVs can agree to share and develop common compiler technology that is applicable to them all, and succeed, that seems a possible way to lower this hurdle quite a bit. Hopefully, politics and economics don't interfere with this exercise.
My opinion: The goal of optimal compilation is rather clearly visible on the far side of this rather large obstacle, though there being unique benefits to this approach is not established yet. If all obstacles are overcome in a timely fashion, however, this approach should achieve the best possible result...though this observation neither guarantees actually achieving better compilation, nor negates the issues that need to be overcome.
API based HLSL->intermediate compiler, driver based intermediate->GPU opcode compiler. One body controls the first part as a standard, and the IHVs still have control over compilation for each GPU.
The advantage of this is that standardizing an intermediate specifies more reproducible compiler behavior, "nailing things down" more firmly. Also, the strengths of the "one body" can be brought directly to bear on one part of the task, depending on how much communication there is between IHVs and the "one body".
Reality intrusion: Politics and economics can interfere with communication between companies, and optimal performance is limited by how suitable the intermediate representation is for generating optimized GPU opcodes. Also, "weaknesses" in the "one body" controlling the first part might end up being universally applied.
My opinion: I think we're already seeing good performance here, and a demonstration of how the intermediate representation can be successful. However, we're also seeing indications of bugs that hinder that, and might be related to economics and politics.
...
I'm still waiting to see how things pan out in comparing the success of the two approaches...they both have advantages and disadvantages.
I'm hoping GLSLang is able to take off and overcome the associated hurdles to at least catch up with DX 9 HLSL, because its hurdles are, AFAIK, significant and uncharted. With current hardware, I have my doubts about that happening...perhaps the R420 and NV40, along with the hopefully extensive work IHVs have been doing in this regard already, will allow it to show an advantage over the current DX HLSL implementation when those products are released. The intermediate->GPU opcode compilation seems to be making significant progress, which leaves IHVs having to match more of MS's general compiler experience for the GLSLang approach to compete favorably. How that is going to be done remains to be seen.
More directly to the thread topic, I think HL 2 overall illustrates that the issues with the NV3x are independent of HLSL and LLSL, as indicated by the commentary about the shader changes required, and that the basic HLSL model (implemented properly to spec) is flexible enough to accommodate differing hardware. I think the quoted commentary relates to that.
As for whether that results in "annoyance" with HLSL profiles, someone has mentioned there is a query to return the optimal profile for the current hardware. This seems to indicate that no annoyance is being added besides executing the query and using the result, at least as far as the annoyances that are actually due to the DX 9 HLSL mechanics for supporting different hardware.
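To make the "execute the query and use the result" point concrete, here is a rough sketch of what that looks like from the application side. I believe the D3DX call in question is D3DXGetPixelShaderProfile, but take that as my assumption rather than something confirmed in this thread. Everything in the code below (DeviceCaps, best_pixel_profile, compile_for_device) is a made-up stand-in so the example is self-contained, not the real API, and the real query can return finer grained profiles than this toy mapping does.

```cpp
#include <string>

// Hypothetical stand-in for the capabilities a real profile query would inspect.
struct DeviceCaps {
    int ps_major;  // highest pixel shader major version the hardware reports
    int ps_minor;  // matching minor version
};

// Hypothetical stand-in for the "best profile for this hardware" query.
// The mapping is deliberately simplified for illustration.
std::string best_pixel_profile(const DeviceCaps& caps) {
    if (caps.ps_major >= 3) return "ps_3_0";
    if (caps.ps_major == 2) return "ps_2_0";
    if (caps.ps_major == 1 && caps.ps_minor >= 4) return "ps_1_4";
    return "ps_1_1";
}

// Hypothetical stand-in for the HLSL compile step: the application just hands
// the queried profile through and never hard-codes a target itself.
std::string compile_for_device(const std::string& hlsl_source,
                               const DeviceCaps& caps) {
    const std::string profile = best_pixel_profile(caps);
    return "[" + profile + "] " + hlsl_source;
}
```

The point of structuring it this way is that the HLSL source is the same for every card, and the only per-hardware step is one query whose result is passed straight to the compiler.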
This does seem to leave issues (for both approaches) that have yet to be addressed, and leaves how the two will compare, and when, an open question.