Discuss NV40's shader unit architecture

This one goes to 6400/626.0 ≈ 10 clocks, although I count 11 passes.
Code:
ps_2_0

def c0, 0.0f, 0.0f, 2.0f, 0.0f
def c1, 0.4f, 0.5f, 0.9f, 16.0f

dcl t0.xy
dcl t1.xyz
dcl t2.xyz

dcl_2d s0
dcl_2d s1

// Normalize light direction
dp3_pp r1.w, t1, t1  // free fp16 nrm
rsq_pp r1.w, r1.w     
mul_pp r1.xyz, t1, r1.w

// Calculate halfway vector
add_pp r0.xyz, c0, -t2 //pass 1
dp3_pp r0.w, r0, r0     //pass 2
rsq_pp r0.w, r0.w       //pass 2 c-i
mad_pp r0.xyz, r0, r0.w, r1 //pass 3
dp3_pp r0.w, r0, r0   // free fp16 nrm
rsq_pp r0.w, r0.w
mul_pp r0.xyz, r0, r0.w

// Load and normalize normal
texld_pp r2, t0, s0     //pass 4
dp3_pp r2.w, r2, r2    //pass 5
rsq_pp r2.w, r2.w      //pass 5 c-i
mul_pp r2.xyz, r2, r2.w //pass 6

// Calculate lighting
dp3_pp r1.w, r2, r0	// pass 7
dp3_pp r1.xyz, r2, r1	// pass 8
pow_pp r1.w, r1.w, c1.w     //pass 8 c-i
mad_pp r1.xyz, r1, c1, r1.www  //pass 9

// Add base texture
texld_pp r0, t0, s1             //pass 10
mul_pp r0, r1, r0               //pass 11

mov_pp oC0, r0
 
Uttar said:
Damnit - that's strange!
Unless there were some evil restrictions on dependencies - it'd be surprising, but definitely not out of the question.

Uttar

As far as I am able to test (I only had time for 6000 different shaders), it works, but the current driver does a very bad job.

And yes there are restrictions.

But I am still checking the results.
 
Code:
ps_2_0 

def c0, 0.0f, 0.0f, 2.0f, 0.0f 
def c1, 0.4f, 0.5f, 0.9f, 16.0f 

dcl t0.xy 
dcl t1.xyz 
dcl t2.xyz 

dcl_2d s0 
dcl_2d s1 

// Normalize light direction 
dp3_pp r1.w, t1, t1  // pass 1: free fp16 nrm (SU1) 
rsq_pp r1.w, r1.w      
mul_pp r1.xyz, t1, r1.w 

// Calculate halfway vector 
add_pp r0.xyz, c0, -t2 //pass 1 (SU2)
dp3_pp r0.w, r0, r0     //pass 2 (SU2) 
rsq_pp r0.w, r0.w       //pass 3 (SU1)
mad_pp r0.xyz, r0, r0.w, r1 //pass 3 (SU2)
dp3_pp r0.w, r0, r0   // pass 3 or 4*: free fp16 nrm (SU1)
rsq_pp r0.w, r0.w 
mul_pp r0.xyz, r0, r0.w 

// Load and normalize normal 
texld_pp r2, t0, s0     //pass 4 (SU1)
dp3_pp r2.w, r2, r2    //pass 4 or 5*: free fp16 nrm (SU1)
rsq_pp r2.w, r2.w      
mul_pp r2.xyz, r2, r2.w 

// Calculate lighting 
dp3_pp r1.w, r2, r0   // pass 5 (SU2) 
dp3_pp r1.xyz, r2, r1   // pass 6 (SU2) 
pow_pp r1.w, r1.w, c1.w     //pass 7 (SU1) 
mad_pp r1.xyz, r1, c1, r1.www  //pass 7 (SU2) 

// Add base texture 
texld_pp r0, t0, s1             //pass 8 (SU1) 
mul_pp r0, r1, r0               //pass 8 (SU2) 

mov_pp oC0, r0
I only get 8 passes...?

* depends on whether the NRM is parallel or serial to the other operation in SU1.
 
I didn't make it clear, I was assuming SU2 doesn't work at all.

And Xmas, are you sure about the restrictions mentioned above?
 
Just a humble note here... I was a bit bored, so I took 13 individual test results from these pixel shader tests: http://www.hexus.net/content/reviews/review.php?dXJsX3Jldmlld19JRD03NDcmdXJsX3BhZ2U9MTg=

and calculated an average ratio of 6800 / R360. There were a total of 15 tests where both cards had a result, but I removed the one with the smallest performance difference and the one with the largest. The resulting average ratio for the remaining 13 tests is 2.647.

I then scaled this result by [ (425 MHz / 400 MHz) / (16 pipes / 8 pipes) ] = 0.53125 to remove the pipe count and clock rate difference.

According to the final result, a single 6800 pipe has 1.406 times the speed of a single R360 pipe. That 1.4 ratio is roughly present throughout all individual tests, so maybe it's at least a decent approximation :)
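The normalisation above can be reproduced in a few lines. This is only a sketch of the arithmetic as described in the post: the function name and parameter names are made up for illustration, and the clocks (400 MHz for the 6800, 425 MHz for the R360) and pipe counts are the ones assumed in the post.

```python
def per_pipe_ratio(avg_ratio, clock_a_mhz, pipes_a, clock_b_mhz, pipes_b):
    """Scale a whole-card speed ratio (card A / card B) down to a
    per-pipe, per-clock ratio, using the scaling from the post."""
    # (clock_b / clock_a) removes the clock difference,
    # dividing by (pipes_a / pipes_b) removes the pipe count difference.
    scale = (clock_b_mhz / clock_a_mhz) / (pipes_a / pipes_b)
    return avg_ratio * scale

# Values assumed in the post: 6800 at 400 MHz with 16 pipes,
# R360 at 425 MHz with 8 pipes, average measured ratio 2.647.
ratio = per_pipe_ratio(2.647, 400, 16, 425, 8)
print(f"{ratio:.3f}")  # 1.406
```

Changing the R360 clock passed in changes the result, so the 1.4 figure is only as good as the clocks that go into it.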
 
eSa said:
Just a humble note here... I was a bit bored, so I took 13 individual test results from these pixel shader tests: http://www.hexus.net/content/reviews/review.php?dXJsX3Jldmlld19JRD03NDcmdXJsX3BhZ2U9MTg=

and calculated an average ratio of 6800 / R360. There were a total of 15 tests where both cards had a result, but I removed the one with the smallest performance difference and the one with the largest. The resulting average ratio for the remaining 13 tests is 2.647.

I then scaled this result by [ (425 MHz / 400 MHz) / (16 pipes / 8 pipes) ] = 0.53125 to remove the pipe count and clock rate difference.

According to the final result, a single 6800 pipe has 1.406 times the speed of a single R360 pipe. That 1.4 ratio is roughly present throughout all individual tests, so maybe it's at least a decent approximation :)

Nothing boring about that, thanks!

The major problem this community faces now is: what the heck shall we speculate about, test, discuss and bitch about, month in and month out, with the twitchy NV3x int12/FP16 pipeline gone?! :oops:

:p
 
LeStoffer said:
The major problem this community faces now is: what the heck shall we speculate about, test, discuss and bitch about, month in and month out, with the twitchy NV3x int12/FP16 pipeline gone?! :oops:
:p

Hopefully, we'll be able to focus more on actual games, since now (soon at least) we might have some real DX9 games to use as a comparison.
 
991060 said:
This one goes to 6400/626.0 ≈ 10 clocks, although I count 11 passes.
Code:
...
// Add base texture
texld_pp r0, t0, s1             //pass 10
mul_pp r0, r1, r0               //pass 11
...
It's possible the texld and mul are done in the same clock.
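The clock figure being discussed can be checked with a quick calculation. This is a rough sketch that assumes 6400 is the theoretical peak fill rate in Mpixels/s (16 pipes at 400 MHz, one pixel per pipe per clock) and that 626.0 is the measured Mpixels/s for this shader; the thread does not spell out either interpretation, and the function name is hypothetical.

```python
def clocks_per_pixel(pipes, core_mhz, measured_mpix_per_s):
    """Average clocks a pipe spends on one pixel, assuming the peak
    rate is one pixel per pipe per clock."""
    peak_mpix_per_s = pipes * core_mhz  # assumed peak: 16 * 400 = 6400
    return peak_mpix_per_s / measured_mpix_per_s

clocks = clocks_per_pixel(pipes=16, core_mhz=400, measured_mpix_per_s=626.0)
print(f"{clocks:.2f} clocks per pixel")  # 10.22 clocks per pixel
```

Under those assumptions the result sits nearer 10 than 11, which would fit one pair of instructions sharing a clock.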
 
LeStoffer said:
The major problem this community faces now is: what the heck shall we speculate about, test, discuss and bitch about, month in and month out, with the twitchy NV3x int12/FP16 pipeline gone?! :oops:
Well, by appearances, the NV4x pipelines look every bit as complex, it's just that the baseline performance appears to have been increased significantly.

The only thing I really want to know is whether or not there is any shader replacement going on for the synthetic tests. The ~2x performance in essentially every game tested (and some reviewers did select a wide variety of games....) would seem to indicate that there is nothing benchmark-specific going on in these drivers, but we can't be sure.

In other words, I guess I'm asking, "Is the compiler that good yet? Or is this only a preview of how good the compiler should get in ~6 months time?"
 
991060 said:
I found something interesting in Dave's review:
according to the fillrate test,
Code:
ps_2_0

dcl v0
dcl v1

def c0, 0.3f, 0.7f, 0.2f, 0.4f

add r0, c0, v1
add r0, r0, -v0
mov oC0, v0

needs 2 clocks to execute on NV40, and
Code:
ps_2_0

dcl v0
dcl v1

def c0, 0.3f, 0.7f, 0.2f, 0.4f
def c1, 0.9f, 0.3f, 0.8f, 0.6f

add r0, c0, v1
mad r0, c1, r0, -v0
mad r0, v1, r0, c1
mad r0, v0, c0, r0
mov oC0, r0
needs 4 clocks

Does it mean shader unit 2 is not enabled in the current driver?

AFAIKS both these examples have register dependencies that would prevent simultaneous ops.

I doubt very much that there are no restrictions on how ALU ops can be paired and I'd expect register dependencies at least to be a restriction.
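To make the dependency argument concrete, here is a small sketch that checks whether each arithmetic instruction in the two fillrate shaders reads the register written by the one before it. The rule that such a dependent pair cannot be issued together is the expectation stated above, not confirmed NV40 behaviour; the helper names are hypothetical, and write masks and swizzles are ignored (any read of the same register counts as a dependency).

```python
def parse(instr):
    """Split 'op dst, src0, src1' into (op, dst_register, [src_registers])."""
    op, operands = instr.split(None, 1)
    # Drop negation and any swizzle/write mask, keeping only the register name.
    regs = [r.strip().lstrip("-").split(".")[0] for r in operands.split(",")]
    return op, regs[0], regs[1:]

def depends_on_previous(prev, cur):
    """True if 'cur' reads the register that 'prev' writes."""
    _, prev_dst, _ = parse(prev)
    _, _, cur_srcs = parse(cur)
    return prev_dst in cur_srcs

def dependent_chain_length(instrs):
    """Length of the longest run in which each op reads the previous result."""
    if not instrs:
        return 0
    longest = run = 1
    for prev, cur in zip(instrs, instrs[1:]):
        run = run + 1 if depends_on_previous(prev, cur) else 1
        longest = max(longest, run)
    return longest

# Arithmetic ops from the two quoted shaders (the final mov is left out).
short_shader = ["add r0, c0, v1", "add r0, r0, -v0"]
long_shader = ["add r0, c0, v1", "mad r0, c1, r0, -v0",
               "mad r0, v1, r0, c1", "mad r0, v0, c0, r0"]

print(dependent_chain_length(short_shader))  # 2
print(dependent_chain_length(long_shader))   # 4
```

The chain lengths of 2 and 4 line up with the 2 and 4 clocks reported, so these two tests alone cannot show whether the second shader unit is active.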
 
Chalnoth said:
The only thing I really want to know is whether or not there is any shader replacement going on for the synthetic tests. The ~2x performance in essentially every game tested (and some reviewers did select a wide variety of games....) would seem to indicate that there is nothing benchmark-specific going on in these drivers, but we can't be sure.

In other words, I guess I'm asking, "Is the compiler that good yet? Or is this only a preview of how good the compiler should get in ~6 months time?"

The NV40 architecture seems straightforward (even if it is a bit premature to say!), so developers who have been writing 'straight DX9 code' should see good performance - although it can of course be improved if they take advantage of the specialities of the NV40 core.

Anyway, some performance goodies might come with DX 9.0c and the updated HLSL compiler for VS 3 and PS 3. I assume nVidia has targeted their compilers and their input to the HLSL compiler to work optimally under DX 9.0c, so things might well improve (like on Far Cry), methinks.
 
Well, it would definitely be nice to see improvement with future compiler/DX releases, but with performance like the NV40 is showing in current benchmarks, it really doesn't need any improvement, I don't think :)
 
Chalnoth said:
Well, it would definitely be nice, but with performance like the NV40 is showing in current benchmarks, it really doesn't need any improvement, I don't think :)

I agree! It already looks stellar indeed... 8)
 
I'm still worried that the X800 XT will force them to revert to their forced partial-precision "cheating." It's an ace up their sleeve in case they need some extra juice.
 
eSa said:
Just a humble note here... I was a bit bored, so I took 13 individual test results from these pixel shader tests: http://www.hexus.net/content/reviews/review.php?dXJsX3Jldmlld19JRD03NDcmdXJsX3BhZ2U9MTg=
There's something funny about the Hexus numbers. Over at Tech Report, most of the numbers are lower with pp on. I eyeballed the values on Tech Report's chart, and they're all 18-22% higher on Hexus (did they overclock?). The R360 scores, however, match perfectly.

I did the same thing you did with Tech Report's scores, but used a geometric average instead (that one enormously large score will make a big difference in an arithmetic average). I can't say for sure Tech Report is right, but it's more in line with what we see here at Xbit. Also, R360 is clocked at 412, isn't it?
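A quick illustration of why the choice of average matters when one ratio is much larger than the rest. The ratios below are invented purely for illustration and are not benchmark results from Hexus, Tech Report or Xbit.

```python
from statistics import geometric_mean, mean

# Hypothetical per-test speed ratios: three ties and one large outlier.
ratios = [1.0, 1.0, 1.0, 8.0]

arith = mean(ratios)           # pulled up strongly by the single outlier
geo = geometric_mean(ratios)   # fourth root of the product of the ratios

print(f"arithmetic mean: {arith:.2f}")  # arithmetic mean: 2.75
print(f"geometric mean:  {geo:.2f}")    # geometric mean:  1.68
```

The geometric mean treats each test's ratio multiplicatively, so a single very large result moves it far less than it moves the arithmetic mean.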

Anyway, the number I get is 1.13 (NV40 pipe / R360 pipe). With the Xbit shaders, I get 1.01. That's pretty much what I expected, since NVidia won't be able to do 2 shader ops per clock very often given the restrictions. Also, NV40 has about 22% less bandwidth per pipe, and while this isn't a big deal for the long shaders, it could make a difference in some of them.

In any case, we can clearly see that a 16-pipe R420, especially if clocked near 600 MHz, should definitely challenge, if not significantly outpace, NV40 in some shaders.

All in all, though, I am EXTREMELY impressed with NV40's performance. It is quite simply a developer's dream card, and NVidia can probably use this architecture for its next generation too. My only question is how much profit NVidia is earning on these cards, as that's one enormous piece of silicon cooled by what looks like an expensive HSF.

The funny thing is that in the benchmarks that really show NV40's amazing pixel shading power, NV38 looks like absolute shit!
 
Mintmaster said:
The funny thing is that in the benchmarks that really show NV40's amazing pixel shading power, NV38 looks like absolute shit!

... and about bloody time that they 'admitted' it, I might add! :devilish:
 