The Experts Speak... "Automatic Shader Optimizer"

rwolf said:
I would imagine they could take an MD5 checksum of the shader and then the driver could replace the shader whenever the checksum was detected. You could store precompiled results in the driver or compile on the fly.
Shaders are referred to by handle. From the handle is derived a pointer to the driver's internal data structure that stores the details of the shader. At runtime it gets compiled and the compiled version is stored for later use.
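The checksum-and-substitute idea above can be sketched in a few lines. This is purely illustrative (the hash key and replacement bytes are made-up placeholders, not any real driver's data):

```python
import hashlib

# Hypothetical table of hand-optimized replacements, keyed by the MD5 of
# the original shader bytecode.  A real driver would store precompiled
# results here or compile the replacement on the fly.
OPTIMIZED_REPLACEMENTS = {
    # md5-of-original-bytecode -> replacement bytecode (placeholder values)
    "9e107d9d372bb6826bd81d3542a419d6": b"optimized bytecode here",
}

def select_shader(bytecode: bytes) -> bytes:
    """Return a replacement if this shader's checksum is recognized,
    otherwise pass the original through to the compiler unchanged."""
    digest = hashlib.md5(bytecode).hexdigest()
    return OPTIMIZED_REPLACEMENTS.get(digest, bytecode)
```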
 
Actually, being pedantic, Direct3D8 used handles for its shaders. Direct3D9, however, uses interfaces for its shaders.

Of course, to the coder it's not a huge difference, other than that you need to destroy the shaders in a different manner: IDirect3DDevice8::DeletePixelShader() and IDirect3DDevice8::DeleteVertexShader() vs. IDirect3DPixelShader9::Release() and IDirect3DVertexShader9::Release().
 
Colourless said:
Actually, being pedantic, Direct3D8 used handles for its shaders. Direct3D9, however, uses interfaces for its shaders.

Of course, to the coder it's not a huge difference, other than that you need to destroy the shaders in a different manner: IDirect3DDevice8::DeletePixelShader() and IDirect3DDevice8::DeleteVertexShader() vs. IDirect3DPixelShader9::Release() and IDirect3DVertexShader9::Release().
At the driver level it's still a handle :)
 
Frightening, innit. Done so much work inside drivers I don't know what it's like outside any more...
 
Enbar said:
Humus said:
Interesting is that this is what ATI does for sin and cos instructions, while nVidia actually has native hardware to perform these calculation at the same rate as other instructions. So if there's room for performance improvement in this aspect, then that's on the ATI side rather than on nVidia's.

Yes, I read somewhere that's how ATI does trig functions. I didn't know nvidia was claiming native support for these, and I highly doubt they would do these at the same rate as a mad. The best I can imagine is that they've implemented these as macros in the hardware; in that case they could probably short-cut the macro with a less precise version. All this is guessing, though. However if nvidia can do a cos as fast as they can do a mad I'll eat my socks. It just wouldn't make any sense to use all the necessary transistors for that.

Prepare to chew. ;)

Mark Kilgard said:
Note that NVIDIA's CineFX architecture (exemplified by GeForce FX) has SIN and COS instructions at both the vertex and fragment level. These are not compound instruction approximations, but a single fast instruction for each. These SIN and COS instructions are just as efficient as a DP4 or other instruction.

http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/009281.html
 
Humus said:
Prepare to chew. ;)

Mark Kilgard said:
Note that NVIDIA's CineFX architecture (exemplified by GeForce FX) has SIN and COS instructions at both the vertex and fragment level. These are not compound instruction approximations, but a single fast instruction for each. These SIN and COS instructions are just as efficient as a DP4 or other instruction.
http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/009281.html
Note that Kilgard said "DP4 or other instruction" but not that all instructions have the same efficiency.

BTW, there are some other interesting quotes from Mark Kilgard in that thread:
Mark Kilgard said:
Cg is winning as the common shading language of choice for the whole 3D industry...
It is?
The ARB really screwed itself by attempting to embed a shading language in the driver. They were warned against it, but the ARB as a whole insisted. If not for Cg, OpenGL users would really be suffering now without a standard GPU programming language for OpenGL. The fact that Cg supports the same language as Microsoft's HLSL implementation is icing on the cake since now your shaders are interoperable between major 3D APIs, a great thing.
They aren't quite interoperable from what I've seen.
The dust has pretty much settled. Cg provides a high-level GPU programming language that works for Windows, Linux, and MacOS X, for both OpenGL and Direct3D, for both the latest GPUs and older GPUs, and it's redistributable at no cost, has good documentation, good optimizations,...
Good optimizations? It really doesn't seem so.
 
Enbar said:
However if nvidia can do a cos as fast as they can do a mad I'll eat my socks. It just wouldn't make any sense to use all the necessary transistors for that.
Don't see why not. Cosine (restricted over a sensible domain) is not terribly difficult.
 
Dio said:
Frightening, innit. Done so much work inside drivers I don't know what it's like outside any more...

Could we call that "Coding Xenophobia"?
 
Trig functions are all about domains. ;) :D

Simon F said:
Enbar said:
However if nvidia can do a cos as fast as they can do a mad I'll eat my socks. It just wouldn't make any sense to use all the necessary transistors for that.
Don't see why not. Cosine (restricted over a sensible domain) is not terribly difficult.
 
Simon F said:
Dio said:
Frightening, innit. Done so much work inside drivers I don't know what it's like outside any more...

Could we call that "Coding Xenophobia"?

Surely you mean "Coding Agoraphobia" ... unless you are referring to NV having "Coding Xenophobia" fearing anything apart from their own 'species' of coding language ;)
 
To reduce a sine/cosine argument into a reasonable domain, you can multiply the input value by 1/(2pi), take the fractional part of the result, then multiply by 2pi again. There are other symmetries you can exploit as well, like cos x = -cos (x+pi), cos x = cos (-x), cos x = sin (0.5pi - x), etc., which you can use to further reduce the range to something like [0, 0.25pi]. From that point, you can use series expansions, lookup tables, or a combination, with transistor count obviously depending on the precision you need.

As fast as DP4? Not really per se, but if they have put both functions into a pipeline of fixed length, they will behave as if they are equally fast.
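The reduction and symmetry folding described above can be sketched in software (a hypothetical model, not any vendor's hardware; here the range is folded only down to [0, pi/2], and a short Taylor series is just one choice of final approximation):

```python
import math

def reduced_cos(x: float) -> float:
    # Reduce to [0, 2*pi): multiply by 1/(2*pi), take the fractional
    # part, re-multiply by 2*pi.
    t = x * (1.0 / (2.0 * math.pi))
    t -= math.floor(t)                      # fractional part
    x = t * 2.0 * math.pi
    # cos(x) = cos(2*pi - x): fold [pi, 2*pi) onto [0, pi]
    if x > math.pi:
        x = 2.0 * math.pi - x
    # cos(x) = -cos(pi - x): fold (pi/2, pi] onto [0, pi/2)
    sign = 1.0
    if x > math.pi / 2:
        x = math.pi - x
        sign = -1.0
    # Short Taylor series on [0, pi/2]; worst-case error here is about
    # 1e-3 near pi/2 -- accuracy scales with the number of terms.
    x2 = x * x
    return sign * (1 - x2 / 2 + x2 * x2 / 24 - x2 * x2 * x2 / 720)
```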
 
Or you could simply use a lookup table with linear interpolation and get pretty good results. The same table could be used for both sin and cos.
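A toy model of that shared table with linear interpolation (the table size is an arbitrary choice for illustration; hardware would pick it from the precision budget):

```python
import math

N = 256  # table entries per period; 256 gives roughly 1e-4 worst-case error
TABLE = [math.sin(2 * math.pi * i / N) for i in range(N + 1)]  # one period

def table_sin(x: float) -> float:
    # Map x into table index space and linearly interpolate
    t = (x / (2 * math.pi)) % 1.0 * N
    i = int(t)
    frac = t - i
    return TABLE[i] * (1 - frac) + TABLE[i + 1] * frac

def table_cos(x: float) -> float:
    # The same table serves cosine via the phase shift cos(x) = sin(x + pi/2)
    return table_sin(x + math.pi / 2)
```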
 
Or a simple polynomial interpolation - this can give significantly better results than a linear interpolation.

PI reduction is actually a more difficult problem than the rest of the computation.
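One way to realize the polynomial interpolation mentioned above is a quadratic through three neighbouring table entries; even with a deliberately coarse table (an arbitrary illustration, not a hardware design) it beats linear interpolation of the same size:

```python
import math

N = 64  # deliberately coarse table to show the point
TABLE = [math.sin(2 * math.pi * i / N) for i in range(N + 2)]

def quad_sin(x: float) -> float:
    t = (x / (2 * math.pi)) % 1.0 * N
    i = int(t)
    f = t - i
    y0, y1, y2 = TABLE[i], TABLE[i + 1], TABLE[i + 2]
    # Newton forward-difference quadratic through (0,y0), (1,y1), (2,y2):
    # p(f) = y0 + f*(y1-y0) + f*(f-1)/2 * (y2 - 2*y1 + y0)
    return y0 + f * (y1 - y0) + f * (f - 1) / 2 * (y2 - 2 * y1 + y0)
```

With only 64 entries this stays within a few 1e-5 of the true sine, where linear interpolation on the same table would be an order of magnitude worse.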
 
Dio said:
Or a simple polynomial interpolation - this can give significantly better results than a linear interpolation.

PI reduction is actually a more difficult problem than the rest of the computation.

Couldn't you just multiply and truncate some high bits to fit it into your table's range?

Anyway, I don't find sin and cos very useful in shaders. You can usually reduce them to a linear combination of constants (where the weighting is computed once per frame), whether the function's argument depends on space or time.

I'd think that a shader would have to be pretty exotic to absolutely need sin or cos.
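The reduction mentioned above rests on the angle-addition identity sin(a + t) = sin(a)*cos(t) + cos(a)*sin(t): when t changes only once per frame, its sine and cosine can be folded into two shader constants. A sketch (names are illustrative, not any real API):

```python
import math

def per_frame_constants(t: float):
    # Computed once per frame on the CPU and uploaded as shader constants
    return math.cos(t), math.sin(t)

def shader_sin(a: float, c0: float, c1: float) -> float:
    # Evaluated per vertex/pixel: sin(a + t) as a linear combination,
    # so only the spatial part a needs per-element trig (or a table lookup)
    return math.sin(a) * c0 + math.cos(a) * c1
```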
 
Well, I've used sin/cos in many shaders that I wouldn't call very exotic. One of my shaders for my shader competition contribution uses sin() in the fragment shader.
 
Mintmaster said:
Couldn't you just multiply and truncate some high bits to fit it into your table's range?
I must defer to the mathemagicians, but as far as I understand it, you get very limited accuracy from 'simple' PI reduction.
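One way to see that limited accuracy is to emulate the naive reduction at single precision and compare against a double-precision reference (a rough model, not any particular hardware):

```python
import math
import struct

def to_f32(x: float) -> float:
    # Round a double to the nearest IEEE-754 single, emulating 32-bit hardware
    return struct.unpack('f', struct.pack('f', x))[0]

def naive_reduce(x: float) -> float:
    # frac(x / 2pi) * 2pi, every step rounded to single precision.
    # For large x, most mantissa bits are spent on the integer number of
    # periods, so few bits are left for the fractional part that matters.
    inv = to_f32(1.0 / (2.0 * math.pi))
    t = to_f32(x * inv)
    t = to_f32(t - math.floor(t))
    return to_f32(t * to_f32(2.0 * math.pi))

x = 100000.0
approx = math.sin(naive_reduce(x))
exact = math.sin(math.fmod(x, 2.0 * math.pi))
# The error here is on the order of 1e-3 -- far worse than the ~1e-7
# one might expect from single precision on small arguments.
```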
 
Humus said:
Well, I've used sin/cos in many shaders that I wouldn't call very exotic. One of my shaders for my shader competition contribution uses sin() in the fragment shader.

A bit off-topic maybe, but I know you code on R3x0, which uses macros for sin/cos, so I wondered how expensive they are to use compared to NV3x's native support (how many clocks for the instruction)?
 
Quite expensive. The DX9 docs list the sincos macro as taking 8 instruction slots, and that doesn't include reduction to [-pi, pi], which takes another 3 instructions.
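The three-slot reduction to [-pi, pi] can be written as mad/frc/mad; a sketch in the style of ps_2_0 assembly with a Python model alongside (the instruction mapping is my reading of the slot count, not an official sequence):

```python
import math

#   mad r0, x, 1/(2*pi), 0.5      ; r0 = x/(2*pi) + 0.5
#   frc r0, r0                    ; r0 = fractional part, in [0, 1)
#   mad r0, r0, 2*pi, -pi         ; r0 = r0*2*pi - pi, in [-pi, pi)
def reduce_to_pi(x: float) -> float:
    t = x * (1.0 / (2.0 * math.pi)) + 0.5   # mad
    t = t - math.floor(t)                    # frc
    return t * (2.0 * math.pi) - math.pi     # mad
```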
 
Humus said:
Quite expensive. The DX9 docs list the sincos macro as taking 8 instruction slots, and that doesn't include reduction to [-pi, pi], which takes another 3 instructions.

Just curious, but why did you decide not to use a lookup texture? Did you need a lot of accuracy?
 