Discuss NV40's shader unit architecture

DemoCoder · Apr 13, 2004

NVidia is going to really have to beef up their compiler to schedule all this, so I expect initial drivers won't demonstrate full performance unless assembly code is hand written for the NV40.

Plus, to use the free nrm_pp they'll have to go detect all dp4_pp,rsq_pp,mul_pp sequences (possibly not sequential tokens) and change them to nrm_pp due to the fact that FXC never generated NRM until 9.0c (unreleased) and most hand coders never used DX9 "macros" because they were warned not to use them.

They should provide a driver checkbox option ("force low precision normalization") and even detect dp4/rsq/mul and force it to NRM_PP, it's 3 slots into 1!
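
To picture the detection step, here is a minimal Python sketch of such a peephole pass. It is not anything from NVIDIA's driver: the `Instr` token form and the `fold_normalize` and `find_tail` names are hypothetical, matching is done on register names only, and the scalar temp is assumed to be dead after the `mul` (no liveness analysis is done). The post names dp4; the sketch also accepts dp3, and which of the two is actually safe to fold is left as an assumption.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical token form: opcode, destination, source operands."""
    op: str
    dst: str
    srcs: Tuple[str, ...]


def reg(operand: str) -> str:
    """Drop negation and swizzle/write mask: '-r0.xyz' -> 'r0'."""
    return operand.lstrip("-").split(".")[0]


def find_tail(prog: List[Instr], i: int, suffix: str) -> Optional[Tuple[int, int]]:
    """Given a dot product at index i, locate the matching rsq and mul.

    Returns (rsq_index, mul_index), or None if another instruction
    touches the registers involved before the pattern completes.
    """
    scalar, vec = prog[i].dst, prog[i].srcs[0]
    rsq_at = None
    for n in range(i + 1, len(prog)):
        ins = prog[n]
        base, _, suf = ins.op.partition("_")
        if rsq_at is None:
            if base == "rsq" and suf == suffix and ins.dst == scalar and ins.srcs == (scalar,):
                rsq_at = n
                continue
        elif base == "mul" and suf == suffix and sorted(ins.srcs) == sorted((vec, scalar)):
            return rsq_at, n
        # Conservative bail-out: any write to either register, or any
        # read of the scalar temp, invalidates the match.
        if reg(ins.dst) in (reg(scalar), reg(vec)):
            return None
        if any(reg(s) == reg(scalar) for s in ins.srcs):
            return None
    return None


def fold_normalize(prog: List[Instr], dot_ops=("dp3", "dp4")) -> List[Instr]:
    """Rewrite dot/rsq/mul normalize sequences into a single nrm."""
    out = list(prog)
    i = 0
    while i < len(out):
        ins = out[i]
        base, _, suffix = ins.op.partition("_")
        if base in dot_ops and len(ins.srcs) == 2 and ins.srcs[0] == ins.srcs[1]:
            hit = find_tail(out, i, suffix)
            if hit is not None:
                rsq_at, mul_at = hit
                op = "nrm_" + suffix if suffix else "nrm"
                out[mul_at] = Instr(op, out[mul_at].dst, (ins.srcs[0],))
                del out[rsq_at]  # higher index first so i stays valid
                del out[i]
                continue
        i += 1
    return out


shader = [
    Instr("dp3_pp", "r1.w", ("t1", "t1")),
    Instr("rsq_pp", "r1.w", ("r1.w",)),
    Instr("add", "r0.xyz", ("c0", "-t2")),  # unrelated op inside the pattern
    Instr("mul_pp", "r1.xyz", ("t1", "r1.w")),
]
for ins in fold_normalize(shader):
    print(ins.op, ins.dst, ", ".join(ins.srcs))
```

The bail-out is deliberately conservative: it works on whole registers rather than individual components, so it will miss some legal folds but should not create an illegal one under the stated assumptions.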

anaqer · Apr 13, 2004

DemoCoder said:
They should provide a driver checkbox option ("force low precision normalization") and even detect dp4/rsq/mul and force it to NRM_PP, it's 3 clock cycles into 1!

I'm not exactly sure, but isn't normalisation usually sensitive to precision? :?:

DemoCoder · Apr 13, 2004

Well, given that most people used to get by using a cube-map normalizer (non-FP), I think it would be on par. Point is, make it an end-user visible option, and if it causes artifacts in the game, the user can disable it. Or make it a CoolBits setting, or something.
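
The comparison can be roughed out numerically. The Python sketch below is only an emulation under stated assumptions: half precision is modelled by rounding every intermediate of a dot/rsq/mul sequence through the standard library's IEEE 754 binary16 format, and the cube-map normalizer is assumed to store unsigned 8-bit components. Real hardware may round differently, and all function names are made up for the example.

```python
import math
import random
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def normalize_exact(v):
    inv = 1.0 / math.sqrt(sum(c * c for c in v))
    return [c * inv for c in v]


def normalize_half(v):
    """Dot / rsq / mul with every intermediate rounded to half precision."""
    x, y, z = (to_half(c) for c in v)
    dot = to_half(to_half(to_half(x * x) + to_half(y * y)) + to_half(z * z))
    inv = to_half(1.0 / math.sqrt(dot))
    return [to_half(c * inv) for c in (x, y, z)]


def normalize_8bit(v):
    """Exact normalize, then store each component in an unsigned 8-bit texel."""
    out = []
    for c in normalize_exact(v):
        texel = round((c * 0.5 + 0.5) * 255)  # map [-1, 1] onto 0..255
        out.append(texel / 255 * 2.0 - 1.0)
    return out


def worst_error(approx, samples):
    """Largest per-component deviation from the double-precision result."""
    return max(
        abs(a - e)
        for v in samples
        for a, e in zip(approx(v), normalize_exact(v))
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [[rng.uniform(-4.0, 4.0) for _ in range(3)] for _ in range(2000)]
print("fp16 worst component error :", worst_error(normalize_half, samples))
print("8-bit worst component error:", worst_error(normalize_8bit, samples))
```

The 8-bit encode alone can be off by up to 1/255 per component, which is the bar a partial-precision normalize has to clear for the "on par" argument to hold.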

anaqer · Apr 13, 2004

DemoCoder said:
Point is, make it an end-user visible option, and if it causes artifacts in the game, the user can disable it.

That would be the best solution hands down... let's hope it ends up that way.

Dave Baumann · Apr 13, 2004

"The register combiner is dead! Long live the register combiner!"

pocketmoon66 · Apr 13, 2004

DaveBaumann said:
"The register combiner is dead! Long live the register combiner!"

As long as the HLSL/Cg/GLSL compiler hides it, I won't worry.

Luminescent · Apr 13, 2004

Please tell me there are no register combiners in NV40.

GameCat · Apr 14, 2004

The first thing I thought of when I saw the Shader Unit 2 "2+2" co-issue was register combiners as well. Hell, it probably does AB+CD (or RG+BA in colour terms) in a single cycle, just like the GeForce 256 combiners, albeit with slightly more precision.

Obviously it's more capable than register combiners, since it can presumably do full 4-component vector ops, but still...

Bambers · Apr 14, 2004

Seems a whole load of HardOCP's review pictures have been leaked.

Not sure if this has been posted, but it hasn't been talked about in this thread.

http://mbnet.fi/elixir/NV40/10813058804U5m1JdTdm_6_1_l.jpg

That slide gives a little more info on the pipes. Both units can be 3/1 or 2/2.

KimB · Apr 14, 2004

DemoCoder said:
NVidia is going to really have to beef up their compiler to schedule all this, so I expect initial drivers won't demonstrate full performance unless assembly code is hand written for the NV40.

Fortunately they've already had experience with this on the NV3x, so it may not take as long as you'd think to get decent performance. If the FP register performance hit has been significantly reduced, the NV40 could really put out a tremendous amount of processing power once the compilers get up to speed.

Plus, to use the free nrm_pp they'll have to go detect all dp4_pp,rsq_pp,mul_pp sequences (possibly not sequential tokens) and change them to nrm_pp due to the fact that FXC never generated NRM until 9.0c (unreleased) and most hand coders never used DX9 "macros" because they were warned not to use them.

Yeah, that's pretty scary. It's really too bad Microsoft elected to continue assembly language shaders through PS 3.0. This may require nVidia to hand-rewrite the shaders of most older games that used PS 2.0 to get optimal performance.

rwolf · Apr 14, 2004

Looks like Nvidia is being open with this architecture instead of trying to keep everything a secret like NV30.

Simon F · Apr 14, 2004

With regard to the VS unit...
See image here
...I'm a bit intrigued by the "penalty free" branching. Perhaps in many cases this may be so, but surely there will be situations where branches have an impact.

It says that the VS in NV40 is MIMD so that each of the 6 (?) Vertex units can be running independently. What happens if one vertex requires, say, 200 instruction cycles and the others only 10 cycles? The other vertex units could potentially move on to new vertices when they finish, but surely NV40 hasn't got infinite post-VS vertex buffering. That long vertex must eventually stall the pipeline and so surely the branching can't be entirely penalty free <shrug>
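
The stall argument can be illustrated with a toy model. The Python sketch below is not a description of NV40: in-order issue, in-order retirement, and a fixed `window` standing in for finite post-VS buffering are all assumptions, as is the `total_cycles` name. The 6 units and the 200 and 10 cycle costs are the figures from the post.

```python
import heapq
from typing import List


def total_cycles(costs: List[int], units: int = 6, window: int = 8) -> int:
    """Cycles until the last vertex retires.

    Vertices are issued in order to whichever unit frees up first and
    must retire in order. Vertex i may not start until vertex i - window
    has retired, which stands in for a finite post-VS buffer.
    """
    free_at = [0] * units  # min-heap of times at which units go idle
    retired_at: List[int] = []
    for i, cost in enumerate(costs):
        start = heapq.heappop(free_at)
        if i >= window:
            start = max(start, retired_at[i - window])
        finish = start + cost
        heapq.heappush(free_at, finish)
        previous = retired_at[-1] if retired_at else 0
        retired_at.append(max(finish, previous))
    return retired_at[-1]


uniform = [10] * 300
one_slow = [200] + [10] * 299
for label, costs in (("uniform", uniform), ("one slow vertex", one_slow)):
    small = total_cycles(costs, window=8)
    large = total_cycles(costs, window=10_000)
    print(f"{label}: window=8 -> {small}, window=10000 -> {large}")
```

In this model the window size makes no difference when every vertex costs the same, while a single long vertex makes the small window finish later than the effectively unbounded one, which is the kind of penalty the post is describing.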

Kristof · Apr 14, 2004

Simon F said:
That long vertex must eventually stall the pipeline and so surely the branching can't be entirely penalty free <shrug>

OK, Simon repeat after me : MAAAAARKEEEEETTIIIIIIIIIIIIIIINNNNGGGgggg...

K-

991060 · Apr 14, 2004

In nVIDIA's CineFX 3.0 technical brief, it says:
"The operating system or APIs can impose limits, but the hardware is not limited to shader program length." with regard to both VS and PS in NV40.

Any thoughts? Maybe it's another incarnation of the F-Buffer?

LeStoffer · Apr 14, 2004

Chalnoth said:
DemoCoder said:

NVidia is going to really have to beef up their compiler to schedule all this, so I expect initial drivers won't demonstrate full performance unless assembly code is hand written for the NV40.

Fortunately they've already had experience with this on the NV3x, so it may not take as long as you'd think to get decent performance. If the FP register performance hit has been significantly reduced, the NV40 could really put out a tremendous amount of processing power once the compilers get up to speed.

First, nVidia claim better FP32 performance (default over FP16) unless storage requirements are high, so the register problem is reduced.

Second, a new Shader 3.0 HLSL is coming this summer (in a DX9 SDK update) and I think that it is safe to assume that the primary input is from nVidia this time round!

991060 · Apr 14, 2004

I found something interesting in Dave's review:
according to the fillrate test,

Code:

ps_2_0

dcl v0
dcl v1

def c0, 0.3f, 0.7f, 0.2f, 0.4f

add r0, c0, v1
add r0, r0, -v0
mov oC0, v0

needs 2 clocks to execute on NV40, and

Code:

ps_2_0

dcl v0
dcl v1

def c0, 0.3f, 0.7f, 0.2f, 0.4f
def c1, 0.9f, 0.3f, 0.8f, 0.6f

add r0, c0, v1
mad r0, c1, r0, -v0
mad r0, v1, r0, c1
mad r0, v0, c0, r0
mov oC0, r0

needs 4 clocks

Does this mean shader unit 2 is not enabled in the current driver?

Arun · Apr 14, 2004

Damnit - that's strange!
Unless there were some evil restrictions on dependencies - it'd be surprising, but definitely not out of the question.

Uttar

Evildeus · Apr 14, 2004

Seems like HFR also found something interesting (translated from the French):

We therefore had to resort to testing Pixel Shader 3.0 with a few small shaders that we wrote ourselves in assembly (since the HLSL compiler does not yet handle Pixel Shader 3.0). However, we did not have time to test this point in depth. We concentrated on the cost of branching while avoiding anything that would wreck performance up front. For example, we branched on a value that depends on the triangle, so the 4 pipelines of each quad engine can work at the same time. We of course avoided any use of textures. The test looked like this:
If xxxx
Red screen
Else
Green screen
Endif
This is the simplest branch we could write. It required no fewer than 9 passes through the pipeline, which is enormous. We were expecting 2 passes, and hoping that NVIDIA had designed its architecture so that it could be done in a single one. The result therefore disappointed us, but it may be attributable only to immature drivers: time will tell!

http://www.hardware.fr/articles/491/page6.html

nutball · Apr 14, 2004

Isn't there some sort of horrid race-condition in those MADs if you do 2 or more in parallel?

991060 · Apr 14, 2004

And this shader needs 6400/420.1 ≈ 15 clocks. I've added my speculation as comments.

Code:

ps_2_0

def c0, 0.0f, 0.0f, 2.0f, 0.0f
def c1, 0.4f, 0.5f, 0.9f, 16.0f

dcl t0.xy
dcl t1.xyz
dcl t2.xyz

dcl_2d s0
dcl_2d s1

// Normalize light direction
dp3 r1.w, t1, t1            // pass 1
rsq r1.w, r1.w              // pass 1, co-issue
mul r1.xyz, t1, r1.w        // pass 2

// Calculate halfway vector
add r0.xyz, c0, -t2         // pass 3
dp3 r0.w, r0, r0            // pass 4
rsq r0.w, r0.w              // pass 4, co-issue
mad r0.xyz, r0, r0.w, r1    // pass 5
dp3 r0.w, r0, r0            // pass 6
rsq r0.w, r0.w              // pass 6, co-issue
mul r0.xyz, r0, r0.w        // pass 7

// Load and normalize normal
texld r2, t0, s0            // pass 8
dp3 r2.w, r2, r2            // pass 9
rsq r2.w, r2.w              // pass 9, co-issue
mul r2.xyz, r2, r2.w        // pass 10

// Calculate lighting
dp3 r1.w, r2, r0            // pass 11
dp3 r1.xyz, r2, r1          // pass 12
pow r1.w, r1.w, c1.w        // pass 12, co-issue
mad r1.xyz, r1, c1, r1.www  // pass 13

// Add base texture
texld r0, t0, s1            // pass 14
mul r0, r1, r0              // pass 15

mov oC0, r0
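
As a quick cross-check of the bookkeeping, the pass numbers in the comments can be tallied against the 6400/420.1 figure. This is a small Python sketch; the post does not say what the two numbers measure, so treating them as peak and measured fill rate in the same unit is an assumption, and the variable names are invented for the example.

```python
# Pass number written next to each instruction in the listing, in order.
# A repeated number marks a co-issued pair; the final mov carries no note.
passes = [
    1, 1, 2,               # normalize light direction
    3, 4, 4, 5, 6, 6, 7,   # halfway vector
    8, 9, 9, 10,           # load and normalize normal
    11, 12, 12, 13,        # lighting
    14, 15,                # base texture
]

annotated_instructions = len(passes)
distinct_passes = len(set(passes))
co_issued_pairs = annotated_instructions - distinct_passes

# Figures quoted in the post; their meaning is assumed here
# (peak vs. measured fill rate in the same unit).
peak_rate = 6400.0
measured_rate = 420.1
cycles_per_pixel = peak_rate / measured_rate

print("annotated instructions:", annotated_instructions)
print("distinct passes       :", distinct_passes)
print("co-issued pairs       :", co_issued_pairs)
print("cycles per pixel      :", round(cycles_per_pixel, 2))
```

The tally matches the annotations: 20 commented instructions collapse into 15 passes through 5 co-issued pairs, which lines up with the rounded quotient only if the trailing mov costs nothing extra.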
