Discuss NV40's shader unit architecture

NVidia is going to really have to beef up their compiler to schedule all this, so I expect initial drivers won't demonstrate full performance unless assembly code is hand written for the NV40.

Plus, to use the free nrm_pp they'll have to go detect all dp4_pp,rsq_pp,mul_pp sequences (possibly not sequential tokens) and change them to nrm_pp due to the fact that FXC never generated NRM until 9.0c (unreleased) and most hand coders never used DX9 "macros" because they were warned not to use them.

They should provide a driver checkbox option ("force low precision normalization") and even detect dp4/rsq/mul and force it to NRM_PP, it's 3 slots into 1!
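
For what it's worth, here is a rough sketch of what that detection could look like as a peephole pass over shader text. It is only an illustration, not how any real driver does it: the function names are made up, it matches the dp3/rsq/mul form of the idiom (the one used in the shader posted later in this thread), it only handles three adjacent instructions, and it assumes the scalar temp holding the rsq result is not read afterwards, which a real pass would have to prove.

```python
def parse(line):
    """Split 'op dst, src0, src1' into (op, [operands])."""
    parts = line.split(None, 1)
    operands = [a.strip() for a in parts[1].split(",")] if len(parts) > 1 else []
    return parts[0], operands

def strip_pp(op):
    """Treat 'dp3_pp' and 'dp3' alike, since the option forces low precision."""
    return op[:-3] if op.endswith("_pp") else op

def is_normalize(a, b, c):
    """True for: dp3 s, v, v / rsq s, s / mul dst, v, s (mul sources in any order)."""
    (op_a, args_a), (op_b, args_b), (op_c, args_c) = a, b, c
    if [strip_pp(op_a), strip_pp(op_b), strip_pp(op_c)] != ["dp3", "rsq", "mul"]:
        return False
    if len(args_a) != 3 or len(args_b) != 2 or len(args_c) != 3:
        return False
    scalar, vec = args_a[0], args_a[1]
    return (args_a[2] == vec
            and args_b == [scalar, scalar]
            and sorted(args_c[1:]) == sorted([vec, scalar]))

def fold_normalize(lines):
    """Replace each adjacent dp3/rsq/mul normalize idiom with a single nrm_pp."""
    out, i = [], 0
    while i < len(lines):
        if i + 3 <= len(lines) and is_normalize(*(parse(l) for l in lines[i:i + 3])):
            dst = parse(lines[i + 2])[1][0]   # destination of the mul
            src = parse(lines[i])[1][1]       # vector being normalized
            out.append(f"nrm_pp {dst}, {src}")
            i += 3
        else:
            out.append(lines[i])
            i += 1
    return out

shader = [
    "dp3 r1.w, t1, t1",
    "rsq r1.w, r1.w",
    "mul r1.xyz, t1, r1.w",
    "add r0.xyz, c0, -t2",
]
print(fold_normalize(shader))
```

The hard part is exactly the "possibly not sequential tokens" case: once other instructions sit between the three, you need dependency and liveness information rather than a simple three-line window.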
 
DemoCoder said:
They should provide a driver checkbox option ("force low precision normalization") and even detect dp4/rsq/mul and force it to NRM_PP, it's 3 clock cycles into 1!

I'm not exactly sure, but isn't normalisation usually sensitive to precision? :?:
 
Well, given that most people used to get by using a cube-map normalizer (non-FP) I think it would be on-par. Point is, make it an end-user visible option, and if it causes artifacts in the game, the user can disable it. Or make it a CoolBits setting, or something.
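
To put a rough number on "on-par", here is a small sketch comparing the two under stated assumptions. It models FP16 by rounding after every operation of a dp3/rsq/mul sequence (real hardware may keep more bits internally), and it models the cube-map normalizer as a perfect lookup whose only error is 8-bit storage per component (ignoring face resolution and filtering). The helper names are mine, purely for illustration.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def normalize_fp16(v):
    """dp3 / rsq / mul with the result of every operation rounded to FP16."""
    x, y, z = (to_fp16(c) for c in v)
    dot = to_fp16(to_fp16(to_fp16(x * x) + to_fp16(y * y)) + to_fp16(z * z))
    rsq = to_fp16(1.0 / math.sqrt(dot))
    return tuple(to_fp16(c * rsq) for c in (x, y, z))

def normalize_cubemap8(v):
    """Exact normalize, then store each component as an unsigned 8-bit texel."""
    length = math.sqrt(sum(c * c for c in v))
    texel = [round((c / length * 0.5 + 0.5) * 255) for c in v]
    return tuple(t / 255 * 2 - 1 for t in texel)

def max_error(approx, v):
    """Largest per-component deviation from the double-precision unit vector."""
    length = math.sqrt(sum(c * c for c in v))
    return max(abs(a - c / length) for a, c in zip(approx, v))

v = (0.3, -1.7, 2.2)
print("fp16 :", max_error(normalize_fp16(v), v))
print("8-bit:", max_error(normalize_cubemap8(v), v))
```

An 8-bit texel can be off by up to 1/255 (about 0.004) per component from storage alone, while each FP16 rounding is on the order of 0.0005 relative, so in this toy model the partial-precision path has no obvious reason to look worse than the cube map did.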
 
DemoCoder said:
Point is, make it an end-user visible option, and if it causes artifacts in the game, the user can disable it.

That would be the best solution hands down... let's hope it ends up that way.
 
The first thing I thought about when I saw the Shader unit 2 "2+2" co-issue was register combiners as well. Hell, it probably does AB+CD (or RG+BA in colour terms) in a single cycle. Just like the GeForce 256 combiners, albeit with slightly more precision ;)

Obviously it's more capable than register combiners since it presumably can do full 4 component vector ops but still...
 
DemoCoder said:
NVidia is going to really have to beef up their compiler to schedule all this, so I expect initial drivers won't demonstrate full performance unless assembly code is hand written for the NV40.
Fortunately they've already had experience with this on the NV3x, so it may not take as long as you'd think for decent performance. If the FP register performance hit has been significantly reduced, the NV40 could really put out a tremendous amount of processing power, once the compilers get up to speed.

Plus, to use the free nrm_pp they'll have to go detect all dp4_pp,rsq_pp,mul_pp sequences (possibly not sequential tokens) and change them to nrm_pp due to the fact that FXC never generated NRM until 9.0c (unreleased) and most hand coders never used DX9 "macros" because they were warned not to use them.
Yeah, that's pretty scary. It's really too bad Microsoft elected to continue assembly language shaders through PS 3.0. This may require nVidia to hand-rewrite the shaders of most older PS 2.0 games to get optimal performance.
 
Looks like Nvidia is being open with this architecture instead of trying to keep everything a secret like NV30.
 
With regard to the VS unit...
See image here
...I'm a bit intrigued by the "penalty free" branching. Perhaps in many cases this may be so, but surely there will be situations where branches will have an impact.

It says that the VS in NV40 is MIMD so that each of the 6 (?) Vertex units can be running independently. What happens if one vertex requires, say, 200 instruction cycles and the others only 10 cycles? The other vertex units could potentially move on to new vertices when they finish, but surely NV40 hasn't got infinite post-VS vertex buffering. That long vertex must eventually stall the pipeline and so surely the branching can't be entirely penalty free <shrug>
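
Here is a toy model of that argument, just to make it concrete. Everything in it is an assumption for illustration: 6 independent units, vertices issued in order, results retired in order, and a hypothetical `window` parameter standing in for how many vertices can be in flight or waiting in the post-VS buffer at once. It says nothing about how NV40 actually buffers vertices.

```python
import heapq

def total_cycles(costs, units=6, window=8):
    """Cycles until the last vertex retires, with in-order retirement.

    Vertex i may only start once vertex i - window has retired, which is
    the stand-in for finite post-VS buffering.
    """
    free = [0] * units              # time at which each unit becomes free
    heapq.heapify(free)
    retire = []                     # retire[i] = cycle at which vertex i leaves
    for i, cost in enumerate(costs):
        unit_ready = heapq.heappop(free)
        window_ready = retire[i - window] if i >= window else 0
        start = max(unit_ready, window_ready)
        finish = start + cost
        heapq.heappush(free, finish)
        previous = retire[-1] if retire else 0
        retire.append(max(finish, previous))   # cannot overtake earlier vertices
    return retire[-1]

# One 200-cycle vertex followed by 59 ten-cycle vertices.
costs = [200] + [10] * 59
print("unlimited buffering:", total_cycles(costs, window=len(costs)))
print("window of 8        :", total_cycles(costs, window=8))
```

With unlimited buffering the short vertices hide completely behind the long one; with a small window the other five units run out of room and sit idle until the long vertex retires, which is the stall described above.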
 
Simon F said:
That long vertex must eventually stall the pipeline and so surely the branching can't be entirely penalty free <shrug>

OK, Simon repeat after me : MAAAAARKEEEEETTIIIIIIIIIIIIIIINNNNGGGgggg... :LOL:

K-
 
In nVIDIA's CineFX 3.0 technical brief, it says:
"The operating system or APIs can impose limits, but the hardware is not limited to shader program length." with regard to both VS and PS in NV40.

Any thoughts? Maybe it's another incarnation of F-Buffer?
 
Chalnoth said:
DemoCoder said:
NVidia is going to really have to beef up their compiler to schedule all this, so I expect initial drivers won't demonstrate full performance unless assembly code is hand written for the NV40.
Fortunately they've already had experience with this on the NV3x, so it may not take as long as you'd think for decent performance. If the FP register performance hit has been significantly reduced, the NV40 could really put out a tremendous amount of processing power, once the compilers get up to speed.

First, nVidia claim better FP32 performance (default over FP16) unless storage requirements are high, so the register problem is reduced.

Second, a new Shader 3.0 HLSL is coming this summer (in a DX9 SDK update) and I think that it is safe to assume that the primary input is from nVidia this time round! ;)
 
I found something interesting in Dave's review.
According to the fillrate test,
Code:
ps_2_0

dcl v0
dcl v1

def c0, 0.3f, 0.7f, 0.2f, 0.4f

add r0, c0, v1
add r0, r0, -v0
mov oC0, v0

needs 2 clocks to execute on NV40, and
Code:
ps_2_0

dcl v0
dcl v1

def c0, 0.3f, 0.7f, 0.2f, 0.4f
def c1, 0.9f, 0.3f, 0.8f, 0.6f

add r0, c0, v1
mad r0, c1, r0, -v0
mad r0, v1, r0, c1
mad r0, v0, c0, r0
mov oC0, r0
needs 4 clocks

Does this mean shader unit 2 is not enabled in the current driver?
 
Damnit - that's strange!
Unless there are some evil dependency restrictions. That would be surprising, but definitely not out of the question.

Uttar
 
Seems like HFR also found something interesting (translated from the French):
We therefore had to resign ourselves to testing Pixel Shader 3.0 with a few small shaders we wrote ourselves in assembly (given that the HLSL compiler does not yet handle Pixel Shader 3.0). However, we did not have time to test this point in depth. We concentrated on the cost of branching while avoiding wrecking performance in advance. For example, we performed a branch on a value that depends on the triangle. The 4 pipelines of each quad engine can thus all work at the same time. We of course avoided any use of textures. The test looked like this:
If xxxx
Red screen
Else
Green screen
Endif
This is the simplest branch we could make. It required no fewer than 9 passes through the pipeline, which is enormous. We were expecting 2 passes, while hoping that NVIDIA had designed its architecture so that it could be done in just one. This result therefore disappointed us, but it may be attributable solely to immature drivers: time will tell!
http://www.hardware.fr/articles/491/page6.html
 
And this shader needs 6400/420.1 = about 15 clocks. I've added my speculation as comments.

Code:
ps_2_0

def c0, 0.0f, 0.0f, 2.0f, 0.0f
def c1, 0.4f, 0.5f, 0.9f, 16.0f

dcl t0.xy
dcl t1.xyz
dcl t2.xyz

dcl_2d s0
dcl_2d s1

// Normalize light direction
dp3 r1.w, t1, t1            // pass 1
rsq r1.w, r1.w              // pass 1 co-issue
mul r1.xyz, t1, r1.w        // pass 2

// Calculate halfway vector
add r0.xyz, c0, -t2         // pass 3
dp3 r0.w, r0, r0            // pass 4
rsq r0.w, r0.w              // pass 4 co-issue
mad r0.xyz, r0, r0.w, r1    // pass 5
dp3 r0.w, r0, r0            // pass 6
rsq r0.w, r0.w              // pass 6 co-issue
mul r0.xyz, r0, r0.w        // pass 7

// Load and normalize normal
texld r2, t0, s0            // pass 8
dp3 r2.w, r2, r2            // pass 9
rsq r2.w, r2.w              // pass 9 co-issue
mul r2.xyz, r2, r2.w        // pass 10

// Calculate lighting
dp3 r1.w, r2, r0            // pass 11
dp3 r1.xyz, r2, r1          // pass 12
pow r1.w, r1.w, c1.w        // pass 12 co-issue
mad r1.xyz, r1, c1, r1.www  // pass 13

// Add base texture
texld r0, t0, s1            // pass 14
mul r0, r1, r0              // pass 15

mov oC0, r0
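
For anyone wondering where the 15 comes from, here is the arithmetic spelled out. I am assuming the 6400 is the theoretical peak in Mpixels/s (16 pipes at 400 MHz) and the 420.1 is the measured fillrate for this shader in the same units; the function name is just for illustration.

```python
def clocks_per_pixel(peak_mpixels, measured_mpixels):
    """Average clocks each pipe spends per pixel: peak rate / achieved rate."""
    return peak_mpixels / measured_mpixels

PIPES = 16        # assumed pixel pipelines
CORE_MHZ = 400    # assumed core clock in MHz
peak = PIPES * CORE_MHZ            # Mpixels/s if every pixel took one clock

clocks = clocks_per_pixel(peak, 420.1)
print(f"{clocks:.2f} clocks per pixel, ~{round(clocks)} passes")
```

The rounded figure is what the pass numbers in the comments above add up to, though the fractional part suggests the real schedule is not exactly one pass per clock.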
 