Discuss NV40's shader unit architecture

991060 · Apr 14, 2004

This one works out to 6400/626.0 ≈ 10 clocks, although I count 11 passes.

Code:

ps_2_0

def c0, 0.0f, 0.0f, 2.0f, 0.0f
def c1, 0.4f, 0.5f, 0.9f, 16.0f

dcl t0.xy
dcl t1.xyz
dcl t2.xyz

dcl_2d s0
dcl_2d s1

// Normalize light direction
dp3_pp r1.w, t1, t1  // free fp16 nrm
rsq_pp r1.w, r1.w     
mul_pp r1.xyz, t1, r1.w

// Calculate halfway vector
add_pp r0.xyz, c0, -t2 //pass 1
dp3_pp r0.w, r0, r0     //pass 2
rsq_pp r0.w, r0.w       //pass 2 c-i
mad_pp r0.xyz, r0, r0.w, r1 //pass 3
dp3_pp r0.w, r0, r0   // free fp16 nrm
rsq_pp r0.w, r0.w
mul_pp r0.xyz, r0, r0.w

// Load and normalize normal
texld_pp r2, t0, s0     //pass 4
dp3_pp r2.w, r2, r2    //pass 5
rsq_pp r2.w, r2.w      //pass 5 c-i
mul_pp r2.xyz, r2, r2.w //pass 6

// Calculate lighting
dp3_pp r1.w, r2, r0	// pass 7
dp3_pp r1.xyz, r2, r1	// pass 8
pow_pp r1.w, r1.w, c1.w     //pass 8 c-i
mad_pp r1.xyz, r1, c1, r1.www  //pass 9

// Add base texture
texld_pp r0, t0, s1             //pass 10
mul_pp r0, r1, r0               //pass 11

mov_pp oC0, r0
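
As a quick check of the clock figure at the top of this post, here is a minimal Python sketch of the arithmetic. It assumes that 6400 is the theoretical peak fill rate in Mpixels/s (16 pipes at 400 MHz, which the post does not state explicitly) and that 626.0 is the measured rate for this shader in the same unit. The variable names are illustrative only.

```python
# Assumption: 6400 is the theoretical peak in Mpixels/s (16 pipes * 400 MHz).
PIPES = 16
CORE_MHZ = 400
peak_mpixels = PIPES * CORE_MHZ      # 6400

# Measured rate for this shader, as quoted in the post (assumed Mpixels/s).
measured_mpixels = 626.0

# If every pipe retired one pixel per clock we would see the peak rate,
# so the ratio approximates the clocks spent per pixel.
clocks_per_pixel = peak_mpixels / measured_mpixels
print(f"{clocks_per_pixel:.2f} clocks per pixel")  # prints "10.22 clocks per pixel"
```

Under these assumptions the result is slightly above 10, so the measurement fits a 10-clock schedule better than the 11 passes counted in the listing.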

Xmas · Apr 14, 2004

SU1 cannot add (MUL or complex function). SU2 can (MAD or DP{2|3|4}, basically).

Demirug · Apr 14, 2004

Uttar said:
Damnit - that's strange!
Unless there were some evil restrictions for dependency - it'd be surprising, but definitely not out of the question.

Uttar

As far as I have been able to test (I only had time for 6000 different shaders), it works, but the current driver does a very bad job.

And yes there are restrictions.

But I am still checking the results.

991060 · Apr 14, 2004

Are we expecting better performance from more mature drivers?

Xmas · Apr 14, 2004

Code:

ps_2_0 

def c0, 0.0f, 0.0f, 2.0f, 0.0f 
def c1, 0.4f, 0.5f, 0.9f, 16.0f 

dcl t0.xy 
dcl t1.xyz 
dcl t2.xyz 

dcl_2d s0 
dcl_2d s1 

// Normalize light direction 
dp3_pp r1.w, t1, t1  // pass 1: free fp16 nrm (SU1) 
rsq_pp r1.w, r1.w      
mul_pp r1.xyz, t1, r1.w 

// Calculate halfway vector 
add_pp r0.xyz, c0, -t2 //pass 1 (SU2)
dp3_pp r0.w, r0, r0     //pass 2 (SU2) 
rsq_pp r0.w, r0.w       //pass 3 (SU1)
mad_pp r0.xyz, r0, r0.w, r1 //pass 3 (SU2)
dp3_pp r0.w, r0, r0   // pass 3 or 4*: free fp16 nrm (SU1)
rsq_pp r0.w, r0.w 
mul_pp r0.xyz, r0, r0.w 

// Load and normalize normal 
texld_pp r2, t0, s0     //pass 4 (SU1)
dp3_pp r2.w, r2, r2    //pass 4 or 5*: free fp16 nrm (SU1)
rsq_pp r2.w, r2.w      
mul_pp r2.xyz, r2, r2.w 

// Calculate lighting 
dp3_pp r1.w, r2, r0   // pass 5 (SU2) 
dp3_pp r1.xyz, r2, r1   // pass 6 (SU2) 
pow_pp r1.w, r1.w, c1.w     //pass 7 (SU1) 
mad_pp r1.xyz, r1, c1, r1.www  //pass 7 (SU2) 

// Add base texture 
texld_pp r0, t0, s1             //pass 8 (SU1) 
mul_pp r0, r1, r0               //pass 8 (SU2) 

mov_pp oC0, r0

I only get 8 passes...?

* depends on whether the NRM is parallel or serial to the other operation in SU1.

991060 · Apr 14, 2004

I didn't make it clear: I was assuming SU2 doesn't work at all.

And Xmas, are you sure about the restrictions mentioned above?

Xmas · Apr 14, 2004

991060 said:
I didn't make it clear: I was assuming SU2 doesn't work at all.

And Xmas, are you sure about the restrictions mentioned above?

http://www.3dcenter.org/artikel/nv40_technik/index2_e.php

eSa · Apr 14, 2004

Just a humble note here... I was a bit bored, so I took 13 individual test results from these pixel shader tests: http://www.hexus.net/content/reviews/review.php?dXJsX3Jldmlld19JRD03NDcmdXJsX3BhZ2U9MTg=

and calculated the average 6800 / R360 ratio. There were a total of 15 tests where both cards had a result, but I removed the one with the smallest performance difference and the one with the largest. The resulting average ratio for the remaining 13 tests is 2.647.

I then scaled this result by [ (425 MHz / 400 MHz) / (16 pipes / 8 pipes) ] = 0.53125 to remove the pipe count and clock rate differences.

According to the final result, a single 6800 pipe has 1.406 times the speed of a single R360 pipe. That 1.4 ratio is roughly present throughout all the individual tests, so maybe it's at least a decent approximation.
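
The normalization in this post can be reproduced with a short Python sketch. All figures come from the post itself; the variable names are illustrative, and the 425 MHz value is simply the one the post uses.

```python
# Figures taken from the post; names are illustrative.
avg_ratio = 2.647          # trimmed average of the 6800 / R360 results
clock_scale = 425 / 400    # clock ratio as used in the post
pipe_scale = 16 / 8        # 6800 pipes / R360 pipes

# Dividing out the pipe count and clock difference gives the scale factor.
scale = clock_scale / pipe_scale
per_pipe_ratio = avg_ratio * scale

print(scale)                      # prints 0.53125
print(round(per_pipe_ratio, 3))   # prints 1.406
```

Changing `clock_scale` shows how sensitive the per-pipe figure is to the clock rates assumed for each card.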

LeStoffer · Apr 14, 2004

eSa said:
Just a humble note here... I was a bit bored, so I took 13 individual test results from these pixel shader tests: http://www.hexus.net/content/reviews/review.php?dXJsX3Jldmlld19JRD03NDcmdXJsX3BhZ2U9MTg=

and calculated the average 6800 / R360 ratio. There were a total of 15 tests where both cards had a result, but I removed the one with the smallest performance difference and the one with the largest. The resulting average ratio for the remaining 13 tests is 2.647.

I then scaled this result by [ (425 MHz / 400 MHz) / (16 pipes / 8 pipes) ] = 0.53125 to remove the pipe count and clock rate differences.

According to the final result, a single 6800 pipe has 1.406 times the speed of a single R360 pipe. That 1.4 ratio is roughly present throughout all the individual tests, so maybe it's at least a decent approximation.

Nothing boring about that, thanks!

The major problem this community faces now is: What the heck shall we speculate, test, discuss and bitch about months in and months out with the twitchy NV3x int12/FP16 pipeline gone?!

Bjorn · Apr 14, 2004

LeStoffer said:
The major problem this community faces now is: What the heck shall we speculate, test, discuss and bitch about months in and months out with the twitchy NV3x int12/FP16 pipeline gone?!

Hopefully, we'll be able to focus more on actual games, since now (soon at least) we might have some real DX9 games to use as a comparison.

KimB · Apr 14, 2004

991060 said:
This one works out to 6400/626.0 ≈ 10 clocks, although I count 11 passes.

Code:

...
// Add base texture
texld_pp r0, t0, s1             //pass 10
mul_pp r0, r1, r0               //pass 11
...

It's possible the texld and mul are done in the same clock.

KimB · Apr 14, 2004

LeStoffer said:
The major problem this community faces now is: What the heck shall we speculate, test, discuss and bitch about months in and months out with the twitchy NV3x int12/FP16 pipeline gone?!

Well, by appearances, the NV4x pipelines look every bit as complex, it's just that the baseline performance appears to have been increased significantly.

The only thing I really want to know is whether or not there is any shader replacement going on for the synthetic tests. The ~2x performance in essentially every game tested (and some reviewers did select a wide variety of games....) would seem to indicate that there is nothing benchmark-specific going on in these drivers, but we can't be sure.

In other words, I guess I'm asking, "Is the compiler that good yet? Or is this only a preview of how good the compiler should get in ~6 months time?"

ERP · Apr 14, 2004

991060 said:
I found something interesting in Dave's review:
according to the fillrate test,

Code:

ps_2_0
dcl v0
dcl v1
def c0, 0.3f, 0.7f, 0.2f, 0.4f
add r0, c0, v1
add r0, r0, -v0
mov oC0, v0

needs 2 clocks to execute on NV40, and

Code:

ps_2_0
dcl v0
dcl v1
def c0, 0.3f, 0.7f, 0.2f, 0.4f
def c1, 0.9f, 0.3f, 0.8f, 0.6f
add r0, c0, v1
mad r0, c1, r0, -v0
mad r0, v1, r0, c1
mad r0, v0, c0, r0
mov oC0, r0

needs 4 clocks

Does this mean shader unit 2 is not enabled in the current driver?

AFAICS both of these examples have register dependencies that would prevent simultaneous ops.

I doubt very much that there are no restrictions on how ALU ops can be paired and I'd expect register dependencies at least to be a restriction.

Demirug · Apr 14, 2004

ERP, it is not a register dependency problem. The problem is that only SU2 can do ADDs.

2 ADD = 2 cycles
4 ADD = 4 cycles

http://www.3dcenter.org/artikel/nv40_technik/index2_e.php
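
The counting argument above can be sketched as a toy lower-bound model in Python. It is an illustration under stated assumptions, not a description of the real hardware or driver scheduler. Following the thread's claims, ADD, MAD and DPn need SU2, while texture loads and complex functions go to SU1. The exact opcode sets, the treatment of the final `mov` as free, and the function name `min_clocks` are assumptions made for this example, and register dependencies are ignored entirely.

```python
import math

# Assumed opcode sets, based on the claims made in this thread.
SU2_ONLY = {"add", "mad", "dp2", "dp3", "dp4"}   # ops that need an adder
SU1_ONLY = {"rsq", "rcp", "pow", "texld"}        # complex ops and texture loads
FREE = {"mov"}                                   # assumption: output mov costs nothing


def min_clocks(opcodes):
    """Lower bound on clocks per pixel, ignoring register dependencies.

    Each clock can issue at most one SU1 op and one SU2 op, so the bound is
    the largest of: SU1-only ops, SU2-only ops, and half the total op count.
    """
    ops = [op for op in opcodes if op not in FREE]
    su1 = sum(op in SU1_ONLY for op in ops)
    su2 = sum(op in SU2_ONLY for op in ops)
    return max(su1, su2, math.ceil(len(ops) / 2))


# The two fillrate-test shaders quoted earlier in the thread.
print(min_clocks(["add", "add", "mov"]))                # prints 2
print(min_clocks(["add", "mad", "mad", "mad", "mov"]))  # prints 4
```

Both shaders consist only of ops that need SU2, so in this model a working SU1 cannot shorten them, which matches the 2 and 4 clocks reported.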

LeStoffer · Apr 14, 2004

Chalnoth said:
The only thing I really want to know is whether or not there is any shader replacement going on for the synthetic tests. The ~2x performance in essentially every game tested (and some reviewers did select a wide variety of games....) would seem to indicate that there is nothing benchmark-specific going on in these drivers, but we can't be sure.

In other words, I guess I'm asking, "Is the compiler that good yet? Or is this only a preview of how good the compiler should get in ~6 months time?"

The NV40 architecture seems straightforward (even if it is a bit premature to say!), so developers who have been writing 'straight DX9 code' should see good performance, although it can of course be improved if they take advantage of the specialities of the NV40 core.

Anyway, some performance goodies might come with DX 9.0c and the updated HLSL compiler for VS 3 and PS 3. I assume nVidia has targeted their compilers and input to the HLSL to work optimally under DX 9.0c, so things might well improve (like on Far Cry), methinks.

KimB · Apr 14, 2004

Well, it would definitely be nice to see improvement with future compiler/DX releases, but with performance like the NV40 is showing in current benchmarks, I don't think it really needs any improvement.

LeStoffer · Apr 14, 2004

Chalnoth said:
Well, it would definitely be nice, but with performance like the NV40 is showing in current benchmarks, it really doesn't need any improvement, I don't think

I agree! It already looks stellar indeed... 8)

nobie · Apr 14, 2004

I'm still worried that the X800 XT will force them to revert to their forced partial-precision "cheating." It's an ace up their sleeve in case they need some extra juice.

Mintmaster · Apr 14, 2004

eSa said:
Just a humble note here... I was a bit bored, so I took 13 individual test results from these pixel shader tests: http://www.hexus.net/content/reviews/review.php?dXJsX3Jldmlld19JRD03NDcmdXJsX3BhZ2U9MTg=

There's something funny about the Hexus numbers. Over at Tech Report, most of the numbers are lower with pp on. I eyeballed the values on Tech Report's chart, and they're all 18-22% higher on Hexus (did they overclock?). The R360 scores, however, match perfectly.

I did the same thing you did with Tech Report's scores, but used a geometric average instead (that one enormously large score will make a big difference in an arithmetic average). I can't say for sure Tech Report is right, but it's more in line with what we see here at Xbit. Also, R360 is clocked at 412, isn't it?
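
To illustrate why the choice of average matters, here is a minimal Python sketch comparing the two. The ratios below are hypothetical values chosen to include one very large outlier; they are not taken from Tech Report, Hexus or Xbit.

```python
import math

# Hypothetical per-test speed ratios, NOT taken from any review.
# The last value plays the role of the "one enormously large score".
ratios = [1.0, 1.2, 1.1, 0.9, 6.0]

# Arithmetic mean: the outlier contributes in proportion to its size.
arithmetic = sum(ratios) / len(ratios)

# Geometric mean: average the logarithms, then exponentiate.
geometric = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(f"arithmetic = {arithmetic:.3f}")
print(f"geometric  = {geometric:.3f}")
```

With these made-up inputs the arithmetic mean lands near 2.0 while the geometric mean stays well below it, so a single extreme result shifts the former much more than the latter.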

Anyway, the number I get is 1.13 (NV40 pipe / R360 pipe). With the Xbit shaders, I get 1.01. That's pretty much what I expected, since NVidia won't be able to do 2 shader ops per clock very often given the restrictions. Also, NV40 has about 22% less bandwidth per pipe, and while this isn't a big deal for the long shaders, it could make a difference in some of them.

In any case, we can clearly see that a 16-pipe R420, especially if clocked near 600 MHz, should definitely challenge, if not significantly outpace, NV40 in some shaders.

All in all, though, I am EXTREMELY impressed with NV40's performance. It is quite simply a developer's dream card, and NVidia can probably use this architecture for its next generation too. My only question is how much profit NVidia is earning on these cards, as that's one enormous piece of silicon cooled by what looks like an expensive HSF.

The funny thing is that in the benchmarks that really show NV40's amazing pixel shading power, NV38 looks like absolute shit!

LeStoffer · Apr 14, 2004

Mintmaster said:
The funny thing is that in the benchmarks that really show NV40's amazing pixel shading power, NV38 looks like absolute shit!

... and about bloody time that they 'admitted' it, I might add!
