I perceive the three major capabilities of SM 3.0 hardware over 2.0 as:
i) elegant and effective displacement mapping in Vertex Shader 3.0 (which NV40 appears to do very well)
ii) branching in Pixel Shader 3.0 (possibly reducing memory overheads - loading shaders and constants) and
iii) the speed and resources to run longer shaders (which appear to be there to some degree).
I expect a few synthetic tests will show us how well dynamic branching in PS 3.0 turns out. Personally I think it will be acceptable in at least a few situations, given NV40 has the actual grunt to run more and/or longer shaders. If NV40 can handle 2x to 3x the shader load of NV35, you have some room to manoeuvre.
As I see it, PS 3.0 gives the programmer the significant convenience of writing more general shaders, such as handling multiple lights in one shader with a branch for each light source.
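To make the "one general shader" idea concrete, here is a minimal CPU-side sketch in Python. It is not shader code and says nothing about how NV40 executes branches. The names (Light, MAX_LIGHTS, shade_single_pass, shade_multi_pass) and the simple Lambert diffuse term are my own assumptions for illustration. It only shows that one shader with a branch per light slot produces the same result as a set of per-light shaders whose outputs are added together.

```python
from dataclasses import dataclass

MAX_LIGHTS = 8  # assumed number of light slots, matching the 8-light example


@dataclass
class Light:
    direction: tuple      # unit vector from the surface towards the light
    intensity: float
    enabled: bool = True


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lambert(normal, light):
    # Simple diffuse term, clamped so back-facing lights contribute nothing.
    return max(dot(normal, light.direction), 0.0) * light.intensity


def shade_single_pass(normal, lights):
    """One general 'shader': loop over the light slots, branch per light."""
    total = 0.0
    for light in lights[:MAX_LIGHTS]:
        if not light.enabled:   # the per-light branch
            continue
        total += lambert(normal, light)
    return total


def make_light_shader(light):
    """Build a short 'shader' specialised for exactly one light."""
    def shader(normal):
        return lambert(normal, light)
    return shader


def shade_multi_pass(normal, lights):
    """One short shader per enabled light, results blended additively."""
    passes = [make_light_shader(l) for l in lights[:MAX_LIGHTS] if l.enabled]
    total = 0.0
    for run_pass in passes:
        total += run_pass(normal)
    return total


# Hypothetical scene: four light slots, one of them switched off.
normal = (0.0, 0.0, 1.0)
lights = [
    Light((0.0, 0.0, 1.0), 1.0),
    Light((0.0, 1.0, 0.0), 0.5),
    Light((0.0, 0.0, 1.0), 0.25, enabled=False),
    Light((0.0, 0.6, 0.8), 0.5),
]
```

In the single-pass version the decision about which lights to use lives inside the shader; in the multi-pass version the application makes that decision and has to set up each short shader separately.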
I don't expect it to be a panacea for the world's problems - but I do expect that having the speed and resources to run longer shaders, combined with the generality that branching and conditionals provide, will be a very positive benefit in the future.
I would like to see speed tests of current vs next generation cards on a full load of short vs long shaders. Then I'd like to see tests running one reasonable-length, complex shader with 8 branches / conditionals for 8 lights, compared and contrasted with running the nearest equivalent set of 8 shorter, individual shaders, one for each light source. This could be a PS 3.0 vs PS 3.0 or PS 3.0 vs PS 2.0 / PS 2.+ test.
I would hope to see an almost neutral performance difference between running a shader with, say, 20 instructions looped eight times and running eight shaders of 20 instructions each. To me you have performed around 160 instructions either way. You then need to examine how, if at all, this scenario affects the efficiency of moving data around the hardware, and whether you can achieve high and beneficial utilisation levels. If the branching model means less work loading both shader code and constants, you might see a very nice performance gain!
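The arithmetic above can be put into a toy cost model. This is only a back-of-envelope sketch in Python: setup_cost (work to load shader code and constants) and branch_cost (overhead per branch) are hypothetical parameters in made-up "instruction-equivalent" units, not measurements of any real card, and the model ignores everything else a multi-pass approach might cost.

```python
def looped_cost(instr_per_light, lights, setup_cost, branch_cost):
    """One shader, loaded once, looping over all lights with a branch each."""
    return setup_cost + lights * (instr_per_light + branch_cost)


def separate_cost(instr_per_light, lights, setup_cost):
    """One short shader per light, each paying its own setup cost."""
    return lights * (setup_cost + instr_per_light)


# With no overheads at all, both approaches are 20 * 8 = 160 instructions.
print(looped_cost(20, 8, setup_cost=0, branch_cost=0))   # 160
print(separate_cost(20, 8, setup_cost=0))                # 160

# With assumed (invented) overheads the two diverge.
print(looped_cost(20, 8, setup_cost=10, branch_cost=2))  # 186
print(separate_cost(20, 8, setup_cost=10))               # 240
```

Under this model the looped shader comes out ahead whenever lights * branch_cost is less than (lights - 1) * setup_cost, which is exactly the trade-off the synthetic tests would need to measure.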