D44.03 shader code

Hellbinder[CE] said:
Yes but that does not explain why they need to be hardcoded into the Driver.

If they're emulating fixed function with shaders, where should the shader code reside?
 
Yeah, they're likely innocent, as Popnfresh said.

But simple question then: WHY are they doing this?
One possible explanation is that those are the functions their dedicated T&L lighting hardware ( which, I speculate, consists of extra FX12 units for non-transform related work ) can emulate.

So unlike the obvious, transforming T&L into VS, nVidia would be doing the opposite here: transforming part of a VS program into T&L to gain speed. Of course, that assumes they can use both at the same time, which they likely can although not through direct API support ( or maybe they only can in the NV35 or something, no idea )

Of course, that'd mean those things would only be templates the drivers base themselves on ( and thus only match them roughly, not caring about register numbers or such ) to find what they can optimize.

All just speculation, of course, and I'd be very surprised if what I said holds true; it's more likely BS. But hey, what could it be then?


Uttar
 
RussSchultz said:
Hellbinder[CE] said:
Yes but that does not explain why they need to be hardcoded into the Driver.

If they're emulating fixed function with shaders, where should the shader code reside?
Russ, as I mentioned before, it's not reasonable to have every fixed function shader stored in the driver as there are far too many combinations. More I won't say.
 
Speculation: They may be code fragments for appending to user vertex shader code.

For instance:
According to spec, the viewport transformation (normalized view space to screenspace) takes place after the vertex shader runs. In practice, this transformation is also done in the vertex shader unit: if you're using the programmable pipeline, the driver appends a few extra vertex shader instructions to the end of your program to do this.

The way the spec / nvidia's drivers handle point sprites might require something similar. I don't know the exact specification in D3D for doing point sprites so I'm just speculating. Still seems perfectly innocent to me. Except for the first one, there's absolutely nothing fancy going on in those vertex shaders.
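
To make the appending idea concrete, here's roughly what it could look like driver-side. Just a sketch in C++: the register and constant slots (R10, R11, c[62], c[63]) are made up, and I'm assuming the user program leaves its clip-space position in a temporary instead of writing o[HPOS] directly.

#include <string>

// Sketch only: append a hypothetical viewport-transform epilogue to a user
// vertex program before upload. Register and constant slots are invented;
// a real driver would pick ones the user program doesn't touch.
std::string AppendViewportEpilogue(const std::string& userProgram)
{
    const char* epilogue =
        "RCP R11.w, R10.w;\n"                                   // 1/w
        "MUL R11.xyz, R10.xyzz, R11.w;\n"                       // normalized device coords
        "MAD o[HPOS].xyz, R11.xyzz, c[62].xyzz, c[63].xyzz;\n"  // viewport scale + bias
        "MOV o[HPOS].w, R10.w;\n";
    return userProgram + epilogue;
}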
 
Uttar -> I read in many of your posts that you think nvidia could share FX12 units between the fragment part and the geometry part.

I don't think that can be the case. Flexibility is NVIDIA's marketing answer to questions about bad performance: "it's flexible, we can optimize". Flexibility is the opposite of a pipeline.

Use a fragment pipeline to work on geometry : OK
It'll be the case in the future I think but it's not in the NV3x.

Take one unit of a fragment pipeline and put it on a geometry pipeline : NO
I don't think that doing this is possible. It would kill pipeline efficiency.

The only things they could 'easily' share between geometry pipelines and fragment pipelines are registers or buffers.

What do you think about it ?
 
Frankly, I don't know whether some FX12 units are shared between Fragment and Geometry in the NV3x. I'm just saying it's *possible*, and it might be an efficient way to look at it, considering that otherwise those FX12 Geometry units are often not used.

What's fairly likely, however, is that nVidia got FX12 units dedicated for hardwired T&L functionality.

What I do think is that the NV40 might have fully shared VS/PS functionality. I'm saying might because, frankly, few facts back it up. But it would just be insane to use a unified instruction set and stuff ( as confirmed by CMKRNL several months ago ) and not do some dynamic allocation...


Uttar
 
OpenGL guy said:
RussSchultz said:
Hellbinder[CE] said:
Yes but that does not explain why they need to be hardcoded into the Driver.

If they're emulating fixed function with shaders, where should the shader code reside?
Russ, as I mentioned before, it's not reasonable to have every fixed function shader stored in the driver as there are far too many combinations. More I won't say.

Well, ok, so not every fixed function shader is there, but that really isn't the point.

The helper fragments, or whatever they are, have to reside somewhere the driver can reach them. That would be: in the driver or in the registry. Ring0 can't touch the filesystem directly (if my vague recollection of WinNT/XP architecture is on target).
 
OpenGL guy said:
RussSchultz said:
Hellbinder[CE] said:
Yes but that does not explain why they need to be hardcoded into the Driver.

If they're emulating fixed function with shaders, where should the shader code reside?
Russ, as I mentioned before, it's not reasonable to have every fixed function shader stored in the driver as there are far too many combinations. More I won't say.

First of all, these are vertex shaders, and it certainly is reasonable to do it. You can implement the entire OpenGL T&L lighting model in one shader with static branches. The OpenGL2 shading spec lists such a megashader.


For various reasons, such a megashader is not optimal, and for that reason, you would probably want to create short versions for the "common case" T&L vertex shaders, then handle the ones that don't fit with a "catch all".
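
Sketching that idea in C++ (the state fields, key packing, and shader text are all invented, just to show the shape of it):

#include <cstdint>
#include <string>
#include <unordered_map>

// Sketch of "common case + catch-all": hash the fixed function state down
// to a key, look up a hand-tuned short shader, and fall back to the big
// megashader otherwise. All names and fields here are invented.
struct FFState {
    bool     lighting;
    uint32_t numLights;  // 0..8
    bool     fog;
};

static uint32_t StateKey(const FFState& s)
{
    return (s.lighting ? 1u : 0u) | (s.numLights << 1) | (s.fog ? 1u << 5 : 0u);
}

const std::string& SelectVertexShader(const FFState& s)
{
    // Short hand-tuned programs for the handful of states apps actually use.
    static const std::unordered_map<uint32_t, std::string> commonCases = {
        { StateKey({false, 0, false}), "...transform-only program..." },
        { StateKey({true,  1, false}), "...transform + one light..."  },
        { StateKey({true,  1, true }), "...one light + fog..."        },
    };
    static const std::string megaShader = "...full T&L program, static branches...";

    auto it = commonCases.find(StateKey(s));
    return (it != commonCases.end()) ? it->second : megaShader;
}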

Unless ATI has fixed function hw for T&L, you guys are emulating also. That means your driver or video card BIOS contains low level native vertex shader snippets, not necessarily in text form.

Most of those Nvidia shaders do nothing that is application specific. They either do nothing but pass values through to the pixel shader, or simply transform texture coordinates, etc. The only one that looks out of place is:

MUL o[TEX0],v[9],c[15].xyzz;
MUL o[TEX1],v[10],c[23].xyzz;
MUL o[TEX2],v[11],c[31].xyzz;
MUL o[TEX3],v[12],c[39].xyzz;
MUL R0,v[9],c[15].xyzz;
MOV o[TEX0],R0;
MUL R1,v[10],c[23].xyzz;
MOV o[TEX1],R1;
MUL R2,v[11],c[31].xyzz;
MOV o[TEX2],R2;
MUL R3,v[12],c[39].xyzz;
MOV o[TEX3],R3;
SGE R4.x,c[63].x,c[63].x;
SLT R4.y,c[63].x,c[63].x;
SLT R5,c[63],R4.yyyy;
SGE R6,c[63],R4.xxxx;
ADD R7,R4.xxxx,-R5.xyzw;
ADD R7,R7,-R6.xyzw;
MUL R8.w,R0.y,R5.x;
MAD R8.w,R0.z,R6.x,R8.w;
MAD o[TEX0].w,R4.x,R7.x,R8.w;
MUL R8.w,R1.y,R5.y;
MAD R8.w,R1.z,R6.y,R8.w;
MAD o[TEX1].w,R4.x,R7.y,R8.w;
MUL R8.w,R2.y,R5.z;
MAD R8.w,R2.z,R6.z,R8.w;
MAD o[TEX2].w,R4.x,R7.z,R8.w;
MUL R8.w,R3.y,R5.w;
MAD R8.w,R3.z,R6.w,R8.w;
MAD o[TEX3].w,R4.x,R7.w,R8.w;

There are 4 branches in this code and a bunch of scalar math.
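
What the SGE/SLT sequence is doing, if I'm reading it right, is building 0.0/1.0 masks and blending with them instead of actually branching. Per texture coordinate, in C terms, something like this (variable names invented):

// My reading of the mask trick above, per texture coordinate. SLT writes
// 1.0 where a < b and 0.0 elsewhere; SGE writes 1.0 where a >= b.
float SelectW(float c, float yIfBelow, float zIfAbove)
{
    float below  = (c < 0.0f)  ? 1.0f : 0.0f;   // SLT R5, c[63], 0
    float above  = (c >= 1.0f) ? 1.0f : 0.0f;   // SGE R6, c[63], 1
    float inside = 1.0f - below - above;        // the two ADDs into R7
    // The MUL/MAD chain: blend the three candidates with the masks.
    return inside * 1.0f + below * yIfBelow + above * zIfAbove;
}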
 
DemoCoder said:
First of all, these are vertex shaders, and it certainly is reasonable to do it. You can implement the entire OpenGL T&L lighting model in one shader with static branches. The OpenGL2 shading spec lists such a megashader.
And how useful is this to the hardware/driver? Would it even fit in the hardware?
For various reasons, such a megashader is not optimal, and for that reason, you would probably want to create short versions for the "common case" T&L vertex shaders, then handle the ones that don't fit with a "catch all"
And just what are these "common cases"? How do you handle ones that aren't in your list? And isn't it better to treat them all the same way?
 
Uttar said:
Frankly, I don't know whether some FX12 units are shared between Fragment and Geometry in the NV3x. I'm just saying it's *possible*

Then I'm going to kick in and say it's not possible. FX12 is just too little precision for transform and simply doesn't have anywhere near the range we need for geometry. FP16 would probably have too little precision too, and definitely too small a range. FP24? Could maybe work, but I would think that all vendors are already doing things at 32 bit in the vertex pipeline.
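
To put numbers on it, assuming FX12 means 12 bit fixed point with range [-2, 2) in 1/1024 steps (s1.10 - that's my assumption of the format):

#include <cstdio>

// Assuming FX12 is 12 bit fixed point, range [-2, 2) in 1/1024 steps:
// quantizing any ordinary post-transform coordinate through it just
// clamps. Fine for colors, useless for geometry.
float ToFX12(float x)
{
    const float maxv = 2.0f - 1.0f / 1024.0f;
    if (x < -2.0f) x = -2.0f;
    if (x > maxv)  x = maxv;
    return static_cast<float>(static_cast<int>(x * 1024.0f)) / 1024.0f;
}

int main()
{
    const float vertex[3] = { 153.7f, -48.25f, 512.0f };  // nothing exotic
    for (float c : vertex)
        std::printf("%g -> %g\n", c, ToFX12(c));          // everything clamps to ~+/-2
    return 0;
}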
 
OpenGL guy said:
DemoCoder said:
First of all, these are vertex shaders, and it certainly is reasonable to do it. You can implement the entire OpenGL T&L lighting model in one shader with static branches. The OpenGL2 shading spec lists such a megashader.
And how useful is this to the hardware/driver? Would it even fit in the hardware?

Well, how would YOU do it? Let's postulate that your hardware lacks a fixed function pipeline. Tell me how you would provide this functionality without crafting vertex shaders.

The only other possibility is some sort of "dynamic" vertex shader creation where the API looks at all the pipeline state and creates a vertex shader "on the fly" to implement the pipeline state, but this is inefficient and will most likely not generate optimal shaders unless you plan to implement a peephole optimizer as well.

For example, you could have code like

if lighting
    foreach opengl light enabled
        if diffuse
            generate vertex shader fragment to do diffuse
        if specular
            generate vertex shader fragment to do specular
        ...

if fog
    gen fog

...

But you will likely waste some clock cycles in the implementation by not being clever about reuse of instructions.
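
In driver-side C++ that sketch would be something like this (state fields and fragment text invented; the naive concatenation at the end is exactly where the wasted cycles come from):

#include <string>

// Sketch of the dynamic path: walk the pipeline state, concatenate canned
// instruction fragments, upload the result. Fragments built this way don't
// share temporaries or common subexpressions with each other.
struct GLState {
    bool lighting, fog;
    bool lightOn[8], diffuse[8], specular[8];
};

std::string BuildVertexProgram(const GLState& gl)
{
    std::string prog = "!!VP1.0\n";
    prog += "# ...transform position...\n";
    if (gl.lighting) {
        for (int i = 0; i < 8; ++i) {
            if (!gl.lightOn[i]) continue;
            if (gl.diffuse[i])  prog += "# ...diffuse fragment for light i...\n";
            if (gl.specular[i]) prog += "# ...specular fragment for light i...\n";
        }
    }
    if (gl.fog)
        prog += "# ...fog fragment...\n";
    return prog + "END\n";
}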
 
BTW, there is a simple test to find out if these vertex shaders are benchmark specific.

Hex edit the drivers and turn these shaders into no-ops. Then run some fixed function T&L apps/games and see if you get artifacts/screwed up visuals. Then run a T&L heavy benchmark like Viewperf and see if it is screwed up or runs faster with the nop'ed shaders.

If hex editing screws up any and all games that use fixed function, then they are not cheats, but fixed function emulation optimizations.

If they only appear to do anything on Viewperf or say, 3dMark VS benchmarks, then you know they are benchmark specific. However, if it's good for ATI to replace shaders with optimized functionally identical equivalents, why would it be bad for Nvidia?
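
For the hex editing part, step one is finding where the program text actually sits in the binary. A throwaway C++ scanner like this would do, assuming the shaders are stored as plain text with the NV_vertex_program "!!VP" header (which is how they were found in the first place):

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Report every offset in a driver binary where vertex program text begins.
int main(int argc, char** argv)
{
    if (argc < 2) { std::cerr << "usage: findvp <driver binary>\n"; return 1; }
    std::ifstream f(argv[1], std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(f)),
                     std::istreambuf_iterator<char>());
    for (size_t pos = data.find("!!VP"); pos != std::string::npos;
         pos = data.find("!!VP", pos + 1))
        std::cout << "vertex program text at offset 0x" << std::hex << pos << "\n";
    return 0;
}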
 
DemoCoder said:
Well, how would YOU do it? Let's postulate that your hardware lacks a fixed function pipeline. Tell me how you would provide this functionality without crafting vertex shaders.
You want me to divulge our driver secrets? Give me a break.
The only other possibility is some sort of "dynamic" vertex shader creation where the API looks at all the pipeline state and creates a vertex shader "on the fly" to implement the pipeline state, but this is inefficient and will most likely not generate optimal shaders unless you plan to implement a peephole optimizer as well.
The API (D3D or OpenGL) does no such thing. How would the API even know it had to do this? If you export HW_VERTEX_PROCESSING in the D3D caps, then that means you support HW vertex processing. In other words, the driver/hardware has to handle everything if requested to by the application.
For example, you could have code like

if lighting
    foreach opengl light enabled
        if diffuse
            generate vertex shader fragment to do diffuse
        if specular
            generate vertex shader fragment to do specular
        ...

if fog
    gen fog

...

But you will likely waste some clock cycles in the implementation by not being clever about reuse of instructions.
I don't see any need to waste clock cycles at all.
 
BTW, all of those vertex programs are OpenGL vertex programs, specifically NV_vertex_program. The first one is NV_vertex_program2, though.

They have nothing to do with 3DMark03 (as that is a D3D program).
 
BTW: the first vertex shader in the text file seems to be missing some code.

It doesn't have the !!VP1.0 MAIN: header and, more importantly, doesn't write to o[HPOS] (the output vertex position). It's not a valid shader.

Also: since the (incomplete) first shader and the (simple pass-through) last shader are the only ones that DON'T do something with point sprites, I don't know why anyone is even considering that these are "replacement" shaders for benchmarks or such. NVidia's vertex shaders get compiled down to microcode -- a single instruction becomes one or more microcode instructions -- and if they were going to replace shaders, it's much more likely they'd replace them with optimised microcode versions.
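
For reference, a complete program needs all three of those pieces. Roughly the minimal transform-only example from the extension spec looks like this (shown here as a C string; c[0]..c[3] would hold the tracked modelview-projection matrix rows):

// Minimal complete NV_vertex_program: header, a write to o[HPOS],
// and the END token.
const char* kMinimalProgram =
    "!!VP1.0\n"
    "DP4 o[HPOS].x, c[0], v[OPOS];\n"
    "DP4 o[HPOS].y, c[1], v[OPOS];\n"
    "DP4 o[HPOS].z, c[2], v[OPOS];\n"
    "DP4 o[HPOS].w, c[3], v[OPOS];\n"
    "END\n";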
 
OpenGL guy said:
DemoCoder said:
Well, how would YOU do it? Let's postulate that your hardware lacks a fixed function pipeline. Tell me how you would provide this functionality without crafting vertex shaders.
You want me to divulge our driver secrets? Give me a break.

Puh-lease, how about giving me a break. Anyone with two brain cells can enumerate the possible methods of achieving fixed function emulation, since it is no great trade secret. I doubt SIGGRAPH is going to be accepting any papers on your implementation. You could say "we do it dynamically" vs. statically. Oooh, that would be a huge leak of proprietary information that would be sure to give NVidia a lot of help.

You could have simply said that you don't have anything to back up your comments in this thread. You accused Russ in the Quack thread of "not doing the legwork"; well, here you are making accusations about the purpose, or lack thereof, of these NVidia vertex shader fragments. Why not do the legwork for us and figure out what they are meant for?

I proposed that they are used somehow for the fixed function pipeline. Others proposed they are appended or prepended to existing shaders for some reason. I also proposed that perhaps they are substitutions used in Viewperf benches.

But regardless, you have the source code now, so I would like to see the opinion of a guy who supposedly works on ATI's drivers about what these short instruction shaders (which appear to do almost nothing) are used for. Do some legwork. Retreating behind "I can't talk because I don't want to divulge secrets" removes you from the discussion.



The API (D3D or OpenGL) does no such thing. How would the API even know it had to do this? If you export HW_VERTEX_PROCESSING in the D3D caps, then that means you support HW vertex processing. In other words, the driver/hardware has to handle everything if requested to by the application.

I am talking about the driver intercepting calls to draw with fixed function state and composing vertex programs on the fly and uploading them to the GPU to perform the needed fixed function processing. If you can't imagine how this is done, I won't go any further. It's a trade secret.


I don't see any need to waste clock cycles at all.

Well then you are not thinking hard enough. The best example is the NV30 architecture: each additional register lowers performance. A naive code generator would not generate optimal code. Moreover, as the vertex shader HW gets more complex and general purpose, you have the additional overhead of parallel execution scheduling, resource hazards, and the superiority of hand-tweaked algorithms.

If you suppose that a vertex code generator always generates optimal code, then #1 you violate the "full employment theorem for compiler writers" and #2 your HW is probably very simple with respect to parallelism, scheduling, and resource usage.
 
Yeah, I was told by someone earlier today that they were all OpenGL.

I guess we might want to start checking all those OpenGL games used as benchmarks now eh? ;)
 
DemoCoder said:
OpenGL guy said:
DemoCoder said:
Well, how would YOU do it? Let's postulate that your hardware lacks a fixed function pipeline. Tell me how you would provide this functionality without crafting vertex shaders.
You want me to divulge our driver secrets? Give me a break.

Puh-lease, how about giving me a break. Anyone with two brain cells can enumerate the possible methods of achieving fixed function emulation, since it is no great trade secret. I doubt SIGGRAPH is going to be accepting any papers on your implementation. You could say "we do it dynamically" vs. statically. Oooh, that would be a huge leak of proprietary information that would be sure to give NVidia a lot of help.
Puh-lease read my NDA. Puh-lease look around for what information ATI has divulged about the architecture of the R300 (and derivatives). Puh-lease take your sarcasm elsewhere.
You could have simply said that you don't have any thing to back up your comments in this thread.
Without access to the driver source code or feedback from nvidia we are all speculating. What makes my speculation any less valid?
You accused Russ in the Quack thread of "not doing the legwork", well, here you are making accusations about the purpose or no purpose of these NVidia vertex shader fragments. Why not do the legwork for us and figure out what they are meant for.
Obviously, they don't look anything like "stubs" for fixed function vertex shader code to me.
I proposed that they are used somehow for the fixed function pipeline. Others proposed they are appended or prepended to existing shaders for some reason. I also proposed that perhaps they are substitutions used in Viewperf benches.
So which one is it? I doubt you'd append them to shaders because that would just make them longer and slower and would also change the end result in many cases. Shader replacements sounds more reasonable, especially given the presence of the "VP1.0" and "END" tokens.
But regardless, you have the source code now, so I would like to see the opinion of a guy who supposedly works on ATI's drivers about what these short instruction shaders (which appear to do almost nothing) are used for. Do some legwork. Retreating behind "I can't talk because I don't want to divulge secrets" removes you from the discussion.
You asked me about how I would implement something. You did not ask me what I thought the code was for. Again, take your barbs elsewhere.
The API (D3D or OpenGL) does no such thing. How would the API even know it had to do this? If you export HW_VERTEX_PROCESSING in the D3D caps, then that means you support HW vertex processing. In other words, the driver/hardware has to handle everything if requested to by the application.
I am talking about the driver
Then say driver and not API because they are not the same thing.
intercepting calls to draw with fixed function state and composing vertex programs on the fly and uploading them to the GPU to perform the needed fixed function processing. If you can't imagine how this is done, I won't go any further. It's a trade secret.
You're right, I have no imagination.
I don't see any need to waste clock cycles at all.
Well then you are not thinking hard enough. The best example is the NV30 architecture. Each additional register lowers performance. A naive code generator would not generate optimal code. Moreover, as the vertex shader HW gets more complex and general purpose, you have the additional overhead of parallel execution scheduling, resource hazards, and superiority of handtweaked algorithms.
Then don't make a naive code generator for cryin' out loud! nvidia has plenty of resources to make smart code, right? Good grief.

Also, this problem should have occurred to people long before the product came out the door.
If you suppose that a vertex code generator always generates optimal code, then #1 you violate the "full employment theorem for compiler writers" and #2 your HW is probably very simple with respect to parallelism, scheduling, and resource usage.
There are several steps to code compilation. For example, there's conversion to machine code and then optimization. Why can't the driver optimize the code? And if you are building shaders from scratch, as you mentioned above, then you have two opportunities for optimization.

You're right, it may not always give the best results, but it sure better give the majority of the performance of hand-tuned code.
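
To illustrate with the dump earlier in the thread: the long shader does MUL R0, v[9], c[15].xyzz followed by MOV o[TEX0], R0. A trivial peephole pass folds such pairs into a single MUL writing the output directly, but only when the temp isn't read again; in that dump R0 *is* read again for the .w math, which is presumably why the MOV is there. A sketch, with an invented instruction representation that ignores write masks and swizzles:

#include <string>
#include <vector>

// Sketch of a trivial peephole: fold "MUL Rn, a, b; MOV out, Rn" into
// "MUL out, a, b" when Rn has no later readers.
struct Instr { std::string op, dst, src0, src1; };

static bool ReadLater(const std::vector<Instr>& code, size_t from, const std::string& reg)
{
    for (size_t i = from; i < code.size(); ++i)
        if (code[i].src0 == reg || code[i].src1 == reg)
            return true;
    return false;
}

void FoldMulMov(std::vector<Instr>& code)
{
    for (size_t i = 0; i + 1 < code.size(); ++i) {
        if (code[i + 1].op == "MOV" &&
            code[i + 1].src0 == code[i].dst &&
            !ReadLater(code, i + 2, code[i].dst)) {
            code[i].dst = code[i + 1].dst;          // write the output directly
            code.erase(code.begin() + i + 1);       // drop the redundant MOV
        }
    }
}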

And if your HW was not so sensitive to resource usage, then I would say it's more complex, not less.
 
OpenGL guy said:
And if your HW was not so sensitive to resource usage, then I would say it's more complex, not less.

Well, the Itanium folks would probably get all in a hissy about that.

Beyond that, what are your opinions on what these shader snippets are?
 