Discuss NV40's shader unit architecture

nobie said:
I'm still worried that the X800 XT will force them to revert to their forced partial-precision "cheating." It's an ace up their sleeve in case they need some extra juice.
Current benchmarks don't indicate that that would help. Some benchmarks show _pp hints causing slowdowns, even.
 
_PP causes slowdowns in very short shaders. Maybe it's because of the overhead of converting between fp16 and fp32 data (i.e. if the ALU can only process data in fp32 format)?
 
991060 said:
_PP causes slowdowns in very short shaders. Maybe it's because of the overhead of converting between fp16 and fp32 data (i.e. if the ALU can only process data in fp32 format)?
I'd be somewhat surprised if there was overhead to using FP16. I'd be more inclined to think it's a driver bug at the moment, that _pp instructions aren't being scheduled properly. After all, converting FP16 to FP32 format should be trivial: copy the mantissa into the FP32 mantissa, leaving the trailing bits zero, and copy the exponent into the FP32 exponent, adjusting for the larger bias. You'd think that this would be fully hardware-accelerated, if only to support the input of FP16 data.

Then again, the performance hit could be caused by both FP shader units not supporting the FP16-FP32 conversion. If launch drivers don't fix this issue, then I think that may be confirmed.

Edit: Actually, if you think about it, the _pp hint is only a hint, so there should never be a performance hit. Future drivers, I would hope, would fix this by simply ignoring the hint whenever honoring it would hurt performance (assuming that the performance hit won't always be there...).
 
I think what you're seeing is a compiler artifact. Rather than ignoring _PP hints and running everything at FP32, it's trying to schedule instructions differently, and you're ending up with different results.
 
Luminescent said:
R3x0 is able to execute instructions that require results at lower than native hardware precision with no apparent penalty.
While I think DemoCoder's probably right, what I was saying is that both SUs don't necessarily need to handle FP16 values natively in order to deal with input and output of FP16 values without a performance hit.
 
My friend ran my precision test on a NV40 and it showed that both PP and non-PP have the same precision, 23 bits. If the test is correct, I suspect that the PP hint is mostly dropped by NVIDIA, only applied to some instructions which really matter (such as nrm and other costly instructions).
 
DemoCoder said:
I think what you're seeing is a compiler artifact. Rather than ignoring _PP hints and running everything at FP32, it's trying to schedule instructions differently, and you're ending up with different results.

But why would they need to be scheduled differently anyway?
 
Yes, but that still doesn't mean the compiler isn't acting differently, even if it "drops" the PP hint (it may simply be that there aren't any "half" registers anymore, so most ops will show FP32). It could still be trying to apply FP16 optimizations in terms of register allocation and ordering which were carried over from the NV3x shader compiler code.

There are already a few definite bugs in the 60.72 driver, and a lot of the "dual issue" potential is probably being wasted right now. Autoparallelizing code is tough on compilers. It's why developers bitched about the PS2 "dual shader" multi-CPU system, why they bitch about PS3 and Xbox2, some of why Itanium had troubles, etc.

Maximizing NV40 PS potential will take a lot of work. Just look at some of those hand-coded shaders that get 4-7 ops per cycle by clever arrangement, and imagine writing a compiler to pack and schedule the instructions like that.
 
pcchen said:
My friend ran my precision test on a NV40 and it showed that both PP and non-PP have the same precision, 23 bits. If the test is correct, I suspect that the PP hint is mostly dropped by NVIDIA, only applied to some instructions which really matter (such as nrm and other costly instructions).
Given the Farcry screenshots we've seen, I doubt this is the case with the drivers the reviewers were using. That is assuming, of course, that the blocky specular highlight is due to some instructions that use _pp when they really shouldn't.

If this is true, with the reviewer's drivers, then I suppose it would mean the Farcry shots are due to something else. It could be, for example, that they use a texture lookup for the specular highlights, and that lookup uses point sampling.
 
Chalnoth said:
Given the Farcry screenshots we've seen, I doubt this is the case with the drivers the reviewers were using. That is assuming, of course, that the blocky specular highlight is due to some instructions that use _pp when they really shouldn't.

If this is true, with the reviewer's drivers, then I suppose it would mean the Farcry shots are due to something else. It could be, for example, that they use a texture lookup for the specular highlights, and that lookup uses point sampling.

I am not familiar with Farcry. However, I suspect it's possible that Farcry detects the NV40 as an NV3X, and tries to use some "workaround" designed for NV3X for better performance. I could be wrong, though.

Of course I can't be sure that the current driver does not insert _pp automatically, because my precision test shader uses only a few instructions (mostly add/sub/mad/cmp). It does not measure anything about texture addressing, either.
 