nVidia's response to Valve

Dio said:
That's an interesting point. It should be the compiler's job to spot that kind of thing though - not yours!

Agreed. That's why I said what I said. The HLSL compiler is neither optimal in general, nor optimal for a specific architecture. A difference of 2 to 1 in speed won't surprise me at all. Optimal GeForce FX shaders could definitely be much faster than naively coded HLSL ones, even when the algorithm itself is fine, and without losing any image quality.
 
DemoCoder said:
I already mentioned using texture lookup replacement. The most obvious example is normalization, or attenuation/pow() replacements.

Right, and if you sometimes depend on this for performance in place of computation you also have available, deciding the right time to swap texture lookups for the computation is a performance decision, but not one a generic re-scheduler can make. It depends on the card implementation and on the other bandwidth demands occurring at the same time.

Also, won't your examples result in continuity problems, as well as have severe precision issues given nVidia's texture access issues?

AFAICS, in such a case, the performance gain is not something that can be presumed to represent re-scheduling performance gain opportunities.
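
(For readers following the texture-lookup point, here is a minimal HLSL sketch of the normalization substitution being discussed; the sampler name and helper functions are my own illustration, not code from HL2 or from either poster.)

Code:
// Illustrative sketch only. "normCube" is assumed to be a pre-built
// normalization cube map (each texel stores a unit vector packed into [0,1]).
samplerCUBE normCube;

// Arithmetic version: roughly a dp3/rsq/mul sequence on the hardware.
float3 NormalizeMath(float3 v)
{
    return normalize(v);
}

// Lookup version: a single texture fetch, limited by the cube map's
// resolution and the texture unit's precision -- hence the continuity
// and precision concerns raised below.
float3 NormalizeTex(float3 v)
{
    return texCUBE(normCube, v).xyz * 2 - 1;   // unpack [0,1] back to [-1,1]
}

Which of the two is faster depends on whether the shader is arithmetic-bound or texture-bound at that point, which is exactly the card- and workload-specific judgement described above.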

But what do you mean by shader code restructuring?

Point of information: That's not my statement, that's Valve's.

If by that, you mean reducing register count, moving code blocks around, substituting more optimal subexpressions, all of this can be done by compiler technology.

That's the question I agree we should look at, replacing "you" (me) with "Valve". Given that there is an NV3x path (NV35, really, though I'm interested in how the NV30 would perform with it), I think Valve and nVidia devrel came up with a compromise set of shaders that didn't try to do the same work in a different way, but tried to execute a new workload that performed better yet looked at least as good as the DX 8.1 shaders (and resembled the DX 9 shaders closely where possible). This is a game, so that makes sense, and it introduces an opportunity for "better than DX 8, but similar in demands" shaders if the developer has the time.

This correlates strongly, to me, with a departure from resembling an "optimally re-scheduled implementation doing the same thing as the DX 9 shaders" (which is what a driver re-scheduler would produce), with regard to several factors: image quality and workload equivalence were not the goal, but rather achieving the necessary performance improvement on the NV35; the shaders used in this path are vendor/card specific; nVidia was involved in making them; nVidia refers to an inability to distinguish PS 1.4 shaders from PS 2.0 in part of their defensive reply to Valve's statements; and absolutely nothing (again, AFAICS) in either a gamer's interest or Valve's (once they committed to an NV3x path) demands that this new body of shaders be geared to doing (exactly) the same workload as the DX 9 shaders.

This is not presented as something conclusive, as I am encouraging the same investigation of Valve's shaders that you are...this is presented as why your representation of the NV3x versus DX 9 HL2 performance figures, as reflecting something that can be tackled by a re-scheduler, doesn't seem to make sense to me at the moment. Evaluating that should be part of the investigation we both think should occur.

IOW, "I disagree with that, and I think I have good reason...you can discuss why my reasons are perhaps good or bad, but this doesn't mean we shouldn't then go and investigate what is actually the case afterwards". I hope the context is clear?

The only "tricks" I could see related to "code restructuring" which are not generally available to compilers is replacing expressions with approximations, e.g. numerical integration subsitutes, Newton-Rhaphson, etc since that would require the compiler to "recognize" fuzzy mathematical concepts which might not be detectable at compiler time and replace them with approximations.

Well, I think the texture look up cases end up being approximations for this hardware and for a generic implementation, as per a re-scheduler, don't they?
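
(As an aside, a minimal sketch of the kind of Newton-Raphson substitution mentioned in the quote above; the choice of refining a reciprocal square root, and the function name, are my own illustration.)

Code:
// One Newton-Raphson step refining a rough initial estimate x0 of 1/sqrt(a).
// A compiler can't safely make this kind of swap on its own: it requires
// recognizing a mathematical idea ("this approximates rsqrt") rather than a
// syntactic pattern, and it changes the numerical result.
float RsqrtOneStep(float a, float x0)
{
    return x0 * (1.5f - 0.5f * a * x0 * x0);
}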

Another example is flat out disabling parts of the lighting equation: e.g. turn off per-pixel specular, switch to per-vertex calculations, leave off fresnel term for water, etc.

That's part of the DX version level support as Valve has listed before. I do think there is opportunity for some of this as part of the NV3x path solution as well, since it seems to be part of the same integrated structure of implementation control as was listed relating to this. DX 6, 7, 8, 8.1, 9...they were presented as target configs with decisions relating to exactly such factors, along with the increasing shader versions being used, and the "8.2" (that Anand seemed to refer to, I think) and "NV3x" seem to be additions to this list.
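
(To make the kind of cut-down DemoCoder's quote describes concrete, i.e. dropping the fresnel term and moving specular per-vertex, here is a minimal HLSL sketch; the lighting model and all names are my own illustration, not Valve's shaders.)

Code:
// "Full" DX9-style water shading: per-pixel specular plus a fresnel term.
float3 ShadeWaterFull(float3 N, float3 V, float3 L,
                      float3 baseCol, float3 specCol)
{
    float3 H    = normalize(L + V);
    float  spec = pow(saturate(dot(N, H)), 32.0f);     // per-pixel specular
    float  fres = pow(1.0f - saturate(dot(N, V)), 5.0f); // Schlick-style fresnel
    return baseCol + (fres + spec) * specCol;
}

// Cheaper path in the spirit of the quote: fresnel dropped, and specular
// assumed to have been computed per-vertex and interpolated instead.
float3 ShadeWaterCheap(float3 baseCol, float3 specFromVertex)
{
    return baseCol + specFromVertex;
}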

Like I said, I'd wait till I see Valve's actual shaders.

I agree with this. I'm disagreeing with your indication, as I read it, that the performance differences are naturally related to missed opportunities by a driver low-level re-scheduler, when all of these other factors are part of what is determined by the "NV3x" implementation as well.
...
But I can tell you that Cg's NV3x code generator is a waste-o-rama when it comes to registers, or even to generating peephole optimizations to reduce instruction count, which is why I have my doubts about the quality of the translation from PS2.0 bytecode into NV3x internal assembly in NVidia's drivers.

I definitely think there is room for improvement. I also think, unfortunately, that nVidia is perfectly willing to throw in any method of performance "improvement" and "sell" it as this type of optimization. I'd prefer if we could spend time more purely focused on investigating the former, but the latter issue intrudes. However, I am also saying that making assumptions about the latter issue shouldn't preclude investigating the former (in which, I think I'm agreeing with you).

NVidia has a non-trivial job, because of their over-complicated architecture and performance "gotchas" with respect to resource usage, whereas ATI's "DX9-like" architecture and single precision through the pipeline, with no bottlenecks in register usage, make the optimization issue much easier.

This is exactly the type of thing I would find it interesting to focus on in a comparison article. I think the cheating issues should be investigated and isolated, and dealt with separately, to facilitate avoiding confusion between that issue, and this.
 
 
ET said:
Dio said:
That's an interesting point. It should be the compiler's job to spot that kind of thing though - not yours!
Agreed. That's why I said what I said. The HLSL compiler is neither optimal in general, nor optimal for a specific architecture. A difference of 2 to 1 in speed won't surprise me at all.
There's no reason why the in-driver compiler can't do this kind of thing, although it would probably be better done in the higher-level compiler.
 
Dio said:
ET said:
Dio said:
That's an interesting point. It should be the compiler's job to spot that kind of thing though - not yours!
Agreed. That's why I said what I said. The HLSL compiler is neither optimal in general, nor optimal for a specific architecture. A difference of 2 to 1 in speed won't surprise me at all.
There's no reason why the in-driver compiler can't do this kind of thing, although it would probably be better done in the higher-level compiler.

While there's no reason why it can't, the in-driver compiler writers need to be aware of all such issues and address them. That is, they'd have to know that a certain code structure is common (possibly because the specific compiler tends to generate it), and optimise that. Which is why it can be reasonable for a new driver to improve speed considerably (something that people here would likely immediately attribute to cheating).

Also, some optimisations are hard to do when you lack the context of the high level language. If a compiler generates suboptimal code, detecting it and optimising it in the general case is difficult.
 
AMDMB has a preview of 51.xx performance here

PS2.0 gets a sizable boost. So either they are cheating, or Democoder is right that PS2 scheduling was pitiful before.

Time will tell.

Cheers
Gubbi
 
These are the cases I typically notice in poorly optimized DX9 PS2.0 code:

1) using more registers than needed
a) not detecting when a register goes "dead" (not used by any instruction following it) and hence can be reused
b) not detecting cases where multiple scalars can be "packed" into a single register, or two 2-D vectors can be packed. E.g. dot products produce scalars, but I typically see entire registers being wasted to store the result

2) inadequate copy propagation, which needlessly uses up an instruction slot, and a register as well (e.g. mov r1, r0)

3) missing peephole optimizations: CMP into LRP, MIN/MAX, etc.; it doesn't always generate a fused multiply-add where possible

4) algebraic simplification and strength reduction: lack of algebraic reasoning/substitution. E.g. if someone wrote a "stupid" cross product, would the compiler recognize it and substitute the two-instruction swizzle trick?

5) on Nvidia's architecture, common subexpression elimination has to be balanced against register usage. Strategies for avoiding recalculation of subexpressions cache subexpression results in extra registers and use them later. However, on NV3x, calculating x = y+z+2 and w = y+z+5 might be faster if you do not waste an extra register to hold y+z.

However, most compilers treat CSE as a quite reasonable optimization, and on most architectures it is, but Nvidia takes a severe performance hit on register usage; it is probably far more costly to use an extra 2 registers than to just execute an extra 2 instructions.

Unfortunately, in most compilers, CSE happens way before register allocation, which means the register allocator would have to "UN-CSE" the intermediate representation.

Now do you see why I think Nvidia is having problems dealing with the output from FXC or a typical developer's hand-coded PS2.0? Most developers do CSE naturally, because they think that recalculating an expression twice is a waste, so they naturally assign it to another local variable and reuse it where they need it. But Nvidia takes a 50% speed reduction if you use more than a certain number of registers; therefore, the drivers would actually have to "de-optimize" some of the natural optimizations that programmers and compilers perform.

That would require finding a "live register" whose value is computed by a sequence of instructions, cached there, and then used by two other register definitions later. Once it finds such a register, the driver would have to INLINE the expression in place of each use of the register, and thus expand the size of the shader.

But picking the right parts of the shader to do this to is extremely tricky, and much of the information about any common subexpressions has been erased by the time the shader has gone through MS's FXC and been turned into PS2.0 code.
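
(A tiny sketch of the "un-CSE" (rematerialization) transform described above, written in HLSL for readability even though the driver would really be doing this on the PS2.0 assembly; the names are illustrative only.)

Code:
// Form a developer or FXC naturally produces: the shared dot product is
// cached in an extra live register.
float3 WithCSE(float3 N, float3 L, float3 k, float3 t0)
{
    float nl = dot(N, L);
    return nl * k + nl * t0;
}

// Rematerialized form the driver would have to recover: the dot product is
// recomputed at each use, trading an extra instruction for one less live
// register -- a win only where register pressure is the bottleneck.
float3 WithoutCSE(float3 N, float3 L, float3 k, float3 t0)
{
    return dot(N, L) * k + dot(N, L) * t0;
}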

Consider this
Code:
x = N.L * k
y = N.L * t0
z = x+y

A compiler without algebraic reasoning would not "get" that dot product follows the distributive law. This code can be rewritten

Code:
x = k + t0
y = N * x
x = y . L
with only two temporary registers required.

More likely, the compiler will recognize N.L as a common subexpression and generate
Code:
temp = N.L
x = temp * k
y = temp * t0
temp = x + y

using 3 registers and 4 slots

I predict if Nvidia does any good optimizing at all, it will work best under OpenGL2.0 because of the way the compiler is integrated with the driver.
 
DC,
Nice post.

One thought WRT your comment on the "stupid cross product": obviously there are things like cross products built in as macros. Perhaps there need to be more of these intrinsics in the compiler/assembler so that driver writers can use the best option for their HW.
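
(For reference, the "stupid" cross product versus the swizzle trick being referred to; this is my own HLSL sketch, not code from the thread, and the MUL+MAD mapping assumes PS2.0-style source negation.)

Code:
// Naive component-by-component cross product a careless author might write.
float3 CrossNaive(float3 a, float3 b)
{
    return float3(a.y * b.z - a.z * b.y,
                  a.z * b.x - a.x * b.z,
                  a.x * b.y - a.y * b.x);
}

// Swizzled form that maps to a single MUL followed by a MAD (with a negated
// source) in PS2.0 assembly.
float3 CrossSwizzled(float3 a, float3 b)
{
    return a.yzx * b.zxy - a.zxy * b.yzx;
}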
 
Or better yet, completely drop the pseudo-assembly intermediate language, and let the driver deal with HLSL directly. A lot of semantic information is lost in the HLSL->intermediate step.

Cheers
Gubbi
 
Let me try and see what you people are rambling on about now.

HLSL -> driver level compiler -> etc...

We aren't going to be running a virtual machine at the driver level, are we? :oops:
Might as well make JAVA into a HLSL as well and use that. :?
 
KILER...

We're already using something like JAVA bytecode as the target for the compiler--it's called PS2.0 assembly.

Now:
Code:
   HLSL compiler               Driver
HLSL   ->    PS2.0 assembly    ->    native microcode/assembly
By implementing what Democoder is suggesting, we'd be removing the 'bytecode' stage and allowing the driver to compile directly to native microcode/assembly.

In theory it's good, though on the other hand, as NVIDIA seems to be showing us, it can be a big burden to put on a company.
 
DemoCoder said:
...
5) on Nvidia's architecture, common subexpression elimination has to be balanced against register usage. Strategies for avoiding recalculation of subexpressions cache subexpression results in extra registers and use them later. However, on NV3x, calculating x = y+z+2 and w = y+z+5 might be faster if you do not waste an extra register to hold y+z.

Yes, if no unique behavior in this regard is used for nVidia hardware when generating the LLSL.

However, most compilers treat CSE as a quite reasonable optimization, and on most architectures it is, but Nvidia takes a severe performance hit on register usage; it is probably far more costly to use an extra 2 registers than to just execute an extra 2 instructions.

The HLSL compiler does use registers in this way for the ps_2_0 profile (clarifying, since you said "most compilers"), but...

Unfortunately, in most compilers, CSE happens way before register allocation, which means the register allocator would have to "UN-CSE" the intermediate representation.

...the HLSL compiler has other profiles, and I understand one has been mentioned that is targeted at the NV3x, and I don't see how it would be prevented from doing other than what you are presuming. This is why I initially mentioned my curiosity about this factor in the benchmark results.

Now do you see why I think Nvidia is having problems dealing with the output from FXC or a typical developer's hand-code PS2.0?

That nVidia has trouble with these is not in contention. Where was my disagreement with this communicated? I don't know how to make it clearer that I am addressing the specific assertion that the HL2 performance change between the DX 9 path and the NV3x path is due to a poor re-scheduler, since I directly stated agreement that the Det 50 has the opportunity to show performance gains from an improved re-scheduler, and directly stated the list of other factors involved and my reasoning as to why that wasn't the sole reason (in contrast to saying it was not a reason at all). None of which is related to the NV3x not having problems with the ps_2_0 profile code and its optimizations, AFAICS.

I skip a bunch here because I do believe I understand and agree with it, but I don't see what about my commentary it is addressing.

I don't skip the following portion, because it is an interesting discussion:

Consider this
Code:
x = N.L * k
y = N.L * t0
z = x+y

A compiler without algebraic reasoning would not "get" that dot product follows the distributive law. This code can be rewritten

Code:
x = k + t0
y = N * x
x = y . L
with only two temporary registers required.

I tested the ps 2.0 compiler, and it does do CSE. Unfortunately, I don't have access to the additional profiles to test how things change for this example case you provide. As for the ps_2_0 profile...

More likely, the compiler will recognize N.L as a common subexpression and generate
Code:
temp = N.L
x = temp * k
y = temp * t0
temp = x + y

using 3 registers and 4 slots

Is MAD counted as two slots? That's what it used, and I presume you just take that as an obvious given. It also used a register to prepare for the dp3 op (instead of referencing t0 directly), to give a complete listing of the allocations.

I predict if Nvidia does any good optimizing at all, it will work best under OpenGL2.0 because of the way the compiler is integrated with the driver.

Well, it might if the Microsoft profile for the nVidia cards doesn't change behavior characteristics to take this into account, but there doesn't seem to be any specific need for things to be that way for the HLSL, at least for this example case. But that's a HLSL discussion of a different nature.
 
... which doesn't necessarily mean it's a single instruction on the hardware... although it seems terribly likely.
 
demalion, there is no disagreement, I am just showing a specific case where NVidia has a hard optimization to do and I doubt they do it well, if at all.

NVidia's architecture is weird. On almost every architecture I can think of (CPU, DSP, etc) register usage is "free". In fact, the compiler will try to maximize register usage, and minimize touching the stack or memory, since they are slow.

Thus, optimizations like CSE or partial redundancy elimination are seen as universally good, and developers are likely to do them naturally to make code clearer. They reduce code size and eliminate recalculation.

My point is, the NV3x takes a *huge* hit if you use more than, I think, 2 FP registers. Every 4 registers used yields a 50% performance hit if I recall. These certainly could make a big difference.

Of course, it's sheer speculation.


This will probably be my last message. I'm traveling for the next three weeks in Italy and France.


p.s. BTW, yes

Code:
temp = N.L 
x = temp * k 
y = temp * t0 
temp = x + y

can be rewritten

Code:
temp = N.L
x = temp * k
y = temp * t0 + x  (MAD)

Of course, I could alter the example slightly to obfuscate this fact from the compiler, but the gist is the register usage problem, not slot minimization.
 
Might it be that the NV3x only actually has two 32-bit registers and uses an internal cache to temporarily store data when more registers are needed? It'd be quite a strange design decision, of course...
 
demalion said:
...the HLSL compiler has other profiles, and I understand one has been mentioned that is targeted at the NV3x

What are you basing this on? I'm wondering because I haven't seen such profiles. I know that the summer SDK is supposed to support more profiles, but with the current one?
 
Yes, I saw the profiles mentioned earlier in the year, then the specific beta and profile mentioned at the beginning of summer or just before, and then read of a summer SDK release with this that seems to be a release candidate.

If it were in the current (to me) SDK, I would have used it to test the above case, so I was indeed referring to that and not the current one. I do presume that a company like Valve has had access to it and has evaluated and tested it, and I'm wondering about the role it is playing in the HL2 implementation and its specific body of shaders.
 
Hi guys,

OK, I've only just got back home in the UK from Shader Day and am beginning to catch up on the aftermath of Gabe's statements.

I've not read all of this thread so I don't know if this has been mentioned before, but I see that NVIDIA are suggesting using PS1.4 for many effects, claiming there will be limited visual difference. There is a problem with this, though.

Whilst there, Valve showed that they had implemented high dynamic range (HDR) rendering in the engine. HDR will be available in the game to hardware that supports its requirements; however, it is not currently implemented in the benchmark. HDR needs a higher range, and PS1.4 only has a [-8, 8] range, which is probably not enough - the HDR effects also require float buffers, which none of the current NVIDIA drivers support. There are also dramatic visual differences between HDR being enabled and disabled.

HDR is currently not in the benchmark mode because NVIDIA don't support float buffers at the moment and the benchmark was designed to render equally on all systems.
 