Anyone got an NV40...?

Chalnoth said:
By the way, as far as the compiler is concerned, I really wouldn't expect much benefit for short shaders like this one. The benefit will be in longer, more complex shaders where latency hiding can be done.

Of course that's what I was talking about. I've tried to hide the latency of the texldd but it isn't possible with current drivers.
 
Uh oh, I've got a bad suspicion why it might be so slow, and why it's supposed to take two cycles on NV3x... actually, it makes sense. But it also makes texldd almost completely useless...

Damien, could you please do a comparison between bilinear and trilinear filtering (no AF, sampler 0 should be point or bilinear in both cases)?

edit: Oh, and make sure the s0 texture contains all zeroes.
 
Xmas said:
Uh oh, I've got a bad suspicion why it might be so slow, and why it's supposed to take two cycles on NV3x... actually, it makes sense. But it also makes texldd almost completely useless...
Can you give us a few more clues?
 
texldd breaks quad coherence. Every pixel needs its individual sampling.

Damnit, why isn't there an instruction that takes "virtual" and "real" texture coordinates, and calculates LOD from the virtual coordinates? That would mean one LOD value per quad instead of per pixel.
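
To make the per-quad versus per-pixel point concrete, here is a rough sketch in Python. It uses the usual textbook LOD approximation (log2 of the longer screen-space gradient measured in texels); that formula and all the function names are my own assumptions for illustration, not how NV3x hardware actually does it. A normal texld can derive one gradient pair from the differences across the 2x2 quad, while texldd gets a gradient pair per pixel and so ends up with up to four different LODs.

```python
import math

def lod_from_gradients(ddx, ddy, tex_size):
    """Textbook LOD estimate (an assumption, not the hardware formula).

    ddx, ddy: (du, dv) change in normalized texture coordinates per
    screen pixel in x and y. tex_size: texture width/height in texels.
    """
    len_x = math.hypot(ddx[0] * tex_size, ddx[1] * tex_size)
    len_y = math.hypot(ddy[0] * tex_size, ddy[1] * tex_size)
    rho = max(len_x, len_y)
    # Clamp to the base level when the texture is magnified.
    return max(0.0, math.log2(rho)) if rho > 0.0 else 0.0

def lod_for_quad_texld(quad_uv, tex_size):
    """One LOD for the whole quad, from differences between neighbours.

    quad_uv: [top_left, top_right, bottom_left, bottom_right] uv pairs.
    """
    tl, tr, bl, _ = quad_uv
    ddx = (tr[0] - tl[0], tr[1] - tl[1])
    ddy = (bl[0] - tl[0], bl[1] - tl[1])
    return lod_from_gradients(ddx, ddy, tex_size)

def lods_for_quad_texldd(per_pixel_gradients, tex_size):
    """One LOD per pixel, since every pixel supplies its own gradients."""
    return [lod_from_gradients(dx, dy, tex_size)
            for dx, dy in per_pixel_gradients]
```

With texld the four pixels share a single LOD and filter setup; with texldd nothing guarantees the four results agree, so each pixel has to be sampled on its own.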
 
Xmas, I think you should post on the directxdev mailing list or email NVIDIA's developer relations directly for a faster and more accurate response.
 
Xmas said:
Uh oh, I've got a bad suspicion why it might be so slow, and why it's supposed to take two cycles on NV3x... actually, it makes sense. But it also makes texldd almost completely useless...

Damien, could you please do a comparison between bilinear and trilinear filtering (no AF, sampler 0 should be point or bilinear in both cases)?

I run the shader on a surface parallel to the screen, so trilinear is never used.

I've tried with point sampling and with bilinear. The fillrate is the same. I've also tried with FSAA. It's strange... 3% faster with FSAA 4x on...
 
I have vague memories that someone measured that texldd takes 4 cycles on NV30, not 2.
And it would make much more sense - because breaking the quad results in 4x the work being done (even if it's not really needed).

But 9 cycles? That's just too long.
 
NV30 has two TMUs per pipe, so it should be able to take four independent bilinear samples in two cycles.
 