Dawn FP32 figures

Ante P

5800 Ultra:
Normal: 30 fps

UTTARS VERSIONS:
FP32: 17 fps
16IN32: 17 fps
FP16: 20 fps
FullFP16: 21 fps
FX12: 30 fps

MDOLENCS VERSION
FP32: 17 fps



5900 Ultra:
Normal: 30 fps (capped at 30 fps?)

UTTARS VERSIONS:
FP32: 19 fps
16IN32: 17 fps
FP16: 28 fps
FullFP16: 28 fps
FX12: 30 fps

MDOLENCS VERSION
FP32: 19 fps


Seems like the conclusion that the major float improvement would concern FP32 was pretty erroneous...
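
To put numbers on that, here is a minimal Python sketch that takes the Fraps figures above and works out each version's gain over FP32 on the same card, plus the 5800 Ultra to 5900 Ultra gain per version. The dictionary layout and function name are made up for illustration, and the FX12 ratios are only lower bounds if the demo really is capped at 30 fps.

```python
# Dawn frame rates (fps) copied from the figures above.
dawn_fps = {
    "5800 Ultra": {"FP32": 17, "16IN32": 17, "FP16": 20, "FullFP16": 21, "FX12": 30},
    "5900 Ultra": {"FP32": 19, "16IN32": 17, "FP16": 28, "FullFP16": 28, "FX12": 30},
}


def speedup_over_fp32(results):
    """Return each version's frame rate divided by the FP32 frame rate."""
    base = results["FP32"]
    return {version: fps / base for version, fps in results.items()}


# Gain of each precision mode over FP32, per card.
speedups = {card: speedup_over_fp32(r) for card, r in dawn_fps.items()}

# Gain of the 5900 Ultra over the 5800 Ultra, per precision mode.
gen_gain = {
    version: dawn_fps["5900 Ultra"][version] / dawn_fps["5800 Ultra"][version]
    for version in dawn_fps["5800 Ultra"]
}

for card, ratios in speedups.items():
    print(card)
    for version, ratio in ratios.items():
        print(f"  {version:9s} {ratio:.2f}x over FP32")

print("5900 Ultra vs 5800 Ultra")
for version, ratio in gen_gain.items():
    print(f"  {version:9s} {ratio:.2f}x")
```

On these figures the generation-to-generation gain is much larger for FP16 than for FP32, which is the point made above.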
 
Fillrate Tester
--------------------------
Display adapter: NVIDIA GeForce FX 5900 Ultra
Driver version: 6.14.10.4461
Display mode: 1280x1024 A8R8G8B8 85Hz
Z-Buffer format: D24S8
--------------------------

FFP - Pure fillrate - 1772.702026M pixels/sec
FFP - Z pixel rate - 3371.229492M pixels/sec
FFP - Single texture - 1707.441040M pixels/sec
FFP - Dual texture - 1583.501953M pixels/sec
FFP - Triple texture - 805.216064M pixels/sec
FFP - Quad texture - 546.234985M pixels/sec
PS 1.1 - Simple - 892.257690M pixels/sec
PS 1.4 - Simple - 565.649109M pixels/sec
PS 2.0 - Simple - 422.224335M pixels/sec
PS 2.0 PP - Simple - 561.533508M pixels/sec
PS 2.0 - Longer - 169.349930M pixels/sec
PS 2.0 PP - Longer - 338.089874M pixels/sec
PS 2.0 - Longer 4 Registers - 181.126297M pixels/sec
PS 2.0 PP - Longer 4 Registers - 421.864349M pixels/sec
PS 2.0 - Per Pixel Lighting - 83.823418M pixels/sec
PS 2.0 PP - Per Pixel Lighting - 105.561607M pixels/sec


Could anyone post 9800 results for comparison?

Edit: redid the test in 1280x1024
 
For comparison, here are the 5800 Ultra results:
Marc said:
Fillrate Tester
--------------------------
Display adapter: NVIDIA GeForce FX 5800 Ultra
Driver version: 6.14.10.4410
Display mode: 1280x1024 A8R8G8B8 85Hz
Z-Buffer format: D24S8
--------------------------

FFP - Pure fillrate - 1957.946899M pixels/sec
FFP - Z pixel rate - 3547.966064M pixels/sec
FFP - Single texture - 1650.619629M pixels/sec
FFP - Dual texture - 1435.494629M pixels/sec
FFP - Triple texture - 799.567261M pixels/sec
FFP - Quad texture - 785.587708M pixels/sec
PS 1.1 - Simple - 989.477356M pixels/sec
PS 1.4 - Simple - 624.802979M pixels/sec
PS 2.0 - Simple - 628.450745M pixels/sec
PS 2.0 PP - Simple - 628.446899M pixels/sec
PS 2.0 - Longer - 378.286591M pixels/sec
PS 2.0 PP - Longer - 378.284943M pixels/sec
PS 2.0 - Longer 4 Registers - 304.032501M pixels/sec
PS 2.0 PP - Longer 4 Registers - 377.960114M pixels/sec
PS 2.0 - Per Pixel Lighting - 60.536434M pixels/sec
PS 2.0 PP - Per Pixel Lighting - 67.032890M pixels/sec
One positive note is the 1.5x-2.0x performance increase between NV30 and NV35 (NV35 1.5-2*NV30 @fp16, clock for clock) with the per pixel lighting test. It could be that the shader is structured in a way that allows NV35's two serial units per pipeline to execute simultaneously. Otherwise, there is no reason why NV35 is so much slower than NV30 in some of the full precision tests. Either NV30 was always forcing fp16, or something is horribly skewed. The numbers for the Dawn benchmark speak a totally different language, perhaps because Dawn is OpenGL.
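
As a rough check of those ratios, the sketch below divides the 5900 Ultra numbers by the 5800 Ultra numbers for the PS 2.0 tests, both raw and scaled clock for clock. The core clocks (500 MHz for the 5800 Ultra, 450 MHz for the 5900 Ultra) are an assumption taken from the reference specifications, not from this thread, and the variable names are hypothetical.

```python
# Fillrate Tester PS 2.0 results in M pixels/sec, copied from the two listings above.
nv35 = {  # GeForce FX 5900 Ultra
    "Simple": 422.224335,
    "PP - Simple": 561.533508,
    "Longer": 169.349930,
    "PP - Longer": 338.089874,
    "Longer 4 Registers": 181.126297,
    "PP - Longer 4 Registers": 421.864349,
    "Per Pixel Lighting": 83.823418,
    "PP - Per Pixel Lighting": 105.561607,
}
nv30 = {  # GeForce FX 5800 Ultra
    "Simple": 628.450745,
    "PP - Simple": 628.446899,
    "Longer": 378.286591,
    "PP - Longer": 378.284943,
    "Longer 4 Registers": 304.032501,
    "PP - Longer 4 Registers": 377.960114,
    "Per Pixel Lighting": 60.536434,
    "PP - Per Pixel Lighting": 67.032890,
}

# ASSUMPTION: reference core clocks in MHz; they are not stated in this thread.
NV30_CLOCK_MHZ = 500.0
NV35_CLOCK_MHZ = 450.0

# Raw NV35/NV30 throughput ratio per test.
raw_ratio = {test: nv35[test] / nv30[test] for test in nv35}

# Same ratio with both chips scaled to the same clock.
clock_ratio = {
    test: ratio * NV30_CLOCK_MHZ / NV35_CLOCK_MHZ for test, ratio in raw_ratio.items()
}

for test in nv35:
    print(f"PS 2.0 {test:24s} raw {raw_ratio[test]:.2f}x  "
          f"clock for clock {clock_ratio[test]:.2f}x")
```

Under that clock assumption, both per pixel lighting variants land between 1.5x and 2.0x, while the full precision "Longer" test comes out at less than half the NV30 rate.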

We may now view the penalty NV35 pays for register usage. Fp16 in fp32 registers just slaughters NV35's performance. It seems to hold the same number of on-chip registers even though the number of fp units has doubled. With the fp16 in fp32 Dawn shader test, the computations work on 16-bit operands, but the operands are held in 32-bit registers, and the results show that the cause is not unit execution latency but a register penalty. Remember, it was already shown that register counts double for fp32, so if Nvidia originally used 4 registers for fp16, it becomes 8 for fp32, which adds an overhead of 1 cycle to all affected calculations (see thepkrl's thread), making most instructions compute in 2 cycles instead of 1. In essence, fp32 can cut performance by a factor of 2 (4 registers become 8), or 1.4 (3 registers become 6) relative to fp16. Refer to this post for a probable reason why. Here are the 9800pro's results:
Fillrate Tester
--------------------------
Display adapter: RADEON 9800 PRO
Driver version: 6.14.10.6343
Display mode: 1280x1024 A8R8G8B8 85Hz
Z-Buffer format: D24S8
--------------------------

FFP - Pure fillrate - 2647.106689M pixels/sec
FFP - Z pixel rate - 2553.387451M pixels/sec
FFP - Single texture - 2559.151367M pixels/sec
FFP - Dual texture - 1371.133057M pixels/sec
FFP - Triple texture - 739.213257M pixels/sec
FFP - Quad texture - 599.856689M pixels/sec
PS 1.1 - Simple - 1491.677002M pixels/sec
PS 1.4 - Simple - 1491.671387M pixels/sec
PS 2.0 - Simple - 1491.681396M pixels/sec
PS 2.0 PP - Simple - 1491.666992M pixels/sec
PS 2.0 - Longer - 750.331787M pixels/sec
PS 2.0 PP - Longer - 750.339600M pixels/sec
PS 2.0 - Longer 4 Registers - 750.343262M pixels/sec
PS 2.0 PP - Longer 4 Registers - 750.334106M pixels/sec
PS 2.0 - Per Pixel Lighting - 158.507538M pixels/sec
PS 2.0 PP - Per Pixel Lighting - 158.507568M pixels/sec
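
With all three result sets posted, the partial precision gain (PP rate divided by full precision rate) can be compared per card. This is only a sketch over the numbers in the listings above; the structure and names are hypothetical.

```python
# (full precision, partial precision) in M pixels/sec, from the three listings above.
results = {
    "5800 Ultra": {
        "Simple": (628.450745, 628.446899),
        "Longer": (378.286591, 378.284943),
        "Longer 4 Registers": (304.032501, 377.960114),
        "Per Pixel Lighting": (60.536434, 67.032890),
    },
    "5900 Ultra": {
        "Simple": (422.224335, 561.533508),
        "Longer": (169.349930, 338.089874),
        "Longer 4 Registers": (181.126297, 421.864349),
        "Per Pixel Lighting": (83.823418, 105.561607),
    },
    "9800 Pro": {
        "Simple": (1491.681396, 1491.666992),
        "Longer": (750.331787, 750.339600),
        "Longer 4 Registers": (750.343262, 750.334106),
        "Per Pixel Lighting": (158.507538, 158.507568),
    },
}

# PP rate divided by full precision rate; 1.0 means the _pp hint changes nothing.
pp_gain = {
    card: {test: pp / full for test, (full, pp) in tests.items()}
    for card, tests in results.items()
}

for card, gains in pp_gain.items():
    print(card)
    for test, gain in gains.items():
        print(f"  PS 2.0 {test:20s} PP/full = {gain:.2f}x")
```

The 9800 Pro stays at 1.00x everywhere, while the 5900 Ultra gains roughly 2x or more from partial precision on both "Longer" tests.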
 
This might be a silly post, but has anyone tested whether the Dawn demo actually tells you the correct number of frames? It's not like nv is beyond reproach. ;) Is anyone willing to use fraps to see whether the numbers shown by the Dawn demo are accurate?

thanks,
 
epicstruggle said:
This might be a silly post, but has anyone tested whether the Dawn demo actually tells you the correct number of frames? It's not like nv is beyond reproach. ;) Is anyone willing to use fraps to see whether the numbers shown by the Dawn demo are accurate?

thanks,

I am using fraps :)
There is no "in-demo" fps meter.
 
Another thing I've noticed.
Dawn seems capped at 30 fps, if you look at the FX12 and Normal figures.
But if you lower the quality drastically (i.e. FX12, LOD 15+, 320x240 res etc.) it will go above 30 fps, and if you run the demo on ATI's boards you get something like 40-60 fps.

But at 1024x768 with normal settings the fps meter just refuses to go beyond 30 on nvidia boards.
 
Ante P said:
I am using fraps :)
There is no "in-demo" fps meter.
I can't run the demo with my SiS 651 integrated graphics. :) So I wasn't sure whether the demo had an fps meter or not. Thanks for the quick heads up.

later,
 
Luminescent said:
I ran the demo on my 9800pro @1280x1040, and received something in the neighborhood of 33 fps (average) with the Cat 3.4s.

What kind of a messed up aspect ratio is that? =)

try it at 1024x768 (and make sure no AA/AF/Vsync is activated)
 
Sorry Ante, I meant 1280x1024. Did you use 1024x768 for your tests (with Dawn)?

At 1024x768 the average framerate for the Dawn demo with the 9800pro is 34.55769231 fps (with no AA/Aniso/Vsync).
 
I forgot to mention that the 5900pro is achieving 30fps with FX12 forced on the same units it's using for fp32 (ops are executed with same theoretical throughput). So it seems fp32 registers are very, very limiting factors.
 
Luminescent said:
I forgot to mention that the 5900pro is achieving 30fps with FX12 forced on the same units it's using for fp32 (ops are executed with same theoretical throughput). So it seems fp32 registers are very, very limiting factors.

Yup. Useful stuff for my NV35 preview. :)
 
Luminescent said:
Sorry Ante, I meant 1280x1024. Did you use 1024x768 for your tests (with Dawn)?

At 1024x768 the average framerate for the Dawn demo with the 9800pro is 34.55769231 fps (with no AA/Aniso/Vsync).

okidok, thanks
 
Here is the code for MDolenc's per pixel lighting test, which seems to execute rather well on NV35 in comparison to NV30 and even R300/350:
MDolenc said:
ps_2_0

def c0, 0.0f, 0.0f, 2.0f, 0.0f
def c1, 0.4f, 0.5f, 0.9f, 16.0f

dcl t0.xy
dcl t1.xyz
dcl t2.xyz

dcl_2d s0
dcl_2d s1

// Normalize light direction
dp3 r1.w, t1, t1
rsq r1.w, r1.w
mul r1.xyz, t1, r1.w

// Calculate halfway vector
add r0.xyz, c0, -t2
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mad r0.xyz, r0, r0.w, r1
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mul r0.xyz, r0, r0.w

// Load and normalize normal
texld r2, t0, s0
dp3 r2.w, r2, r2
rsq r2.w, r2.w
mul r2.xyz, r2, r2.w

// Calculate lighting
dp3 r1.w, r2, r0 // N.H
dp3 r1.xyz, r2, r1 // N.L
pow r1.w, r1.w, c1.w
mad r1.xyz, r1, c1, r1.www

// Add base texture
texld r0, t0, s1
mul r0, r1, r0

mov oC0, r0
It seems the instructions reuse registers often; the pixel program is more interdependent and complex.
While the code for the "Longer" PS2.0 test is:
MDolenc said:
ps_2_0

dcl v0
dcl v1

def c0, 0.3f, 0.7f, 0.2f, 0.4f
def c1, 0.9f, 0.3f, 0.8f, 0.6f

add r0, c0, v1
mad r0, c1, r0, -v0
mad r0, v1, r0, c1
mad r0, v0, c0, r0
mov oC0, r0
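
For a quick side-by-side of the two listings, the Python sketch below counts instruction lines, texture loads, and distinct temporary register names (r0, r1, ...) in each. It counts lines as written, not hardware instruction slots, and the temp count says nothing certain about how the driver actually allocates registers; the `summarize` helper is hypothetical.

```python
import re

# The two ps_2_0 listings quoted above, reduced to their instruction lines
# (declarations, constant definitions, and comments omitted).
LIGHTING = """
dp3 r1.w, t1, t1
rsq r1.w, r1.w
mul r1.xyz, t1, r1.w
add r0.xyz, c0, -t2
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mad r0.xyz, r0, r0.w, r1
dp3 r0.w, r0, r0
rsq r0.w, r0.w
mul r0.xyz, r0, r0.w
texld r2, t0, s0
dp3 r2.w, r2, r2
rsq r2.w, r2.w
mul r2.xyz, r2, r2.w
dp3 r1.w, r2, r0
dp3 r1.xyz, r2, r1
pow r1.w, r1.w, c1.w
mad r1.xyz, r1, c1, r1.www
texld r0, t0, s1
mul r0, r1, r0
mov oC0, r0
"""

LONGER = """
add r0, c0, v1
mad r0, c1, r0, -v0
mad r0, v1, r0, c1
mad r0, v0, c0, r0
mov oC0, r0
"""


def summarize(listing):
    """Count instruction lines, texld instructions, and distinct temp registers."""
    lines = [ln.strip() for ln in listing.splitlines() if ln.strip()]
    opcodes = [ln.split()[0] for ln in lines]
    # Temp registers are written r<digits>; "rsq" does not match because of \d+.
    temps = set(re.findall(r"\br\d+", listing))
    return {
        "instructions": len(lines),
        "texture_loads": opcodes.count("texld"),
        "temp_registers": len(temps),
    }


lighting_summary = summarize(LIGHTING)
longer_summary = summarize(LONGER)
print("Per Pixel Lighting:", lighting_summary)
print("Longer:            ", longer_summary)
```

The lighting shader uses three temporaries and two texture loads across 21 instruction lines, while "Longer" is five purely arithmetic lines on a single temporary.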
 
I never said it runs the code faster than R350 (besides R350 is using fp24 precision, so the comparison is unfair with either of NV35's modes); it just seems to be a little more competitive rather than blown out of the water, as on the other tests.

I am no code/architecture analyst (yet ;)), not even close, so I'm still puzzled as to why it does comparatively better on the last two tests than on the others. :?:
 
An interesting thing to note also is that FX12 is faster than FP16, even on the NV35, while there is no such thing as an FX12 register!
Registers on the NV3x are either FP16 or FP32.

Okay, so what do we have here?
- FX12 is very slightly faster than FP16, even on the NV35.
- FP16 in FP32 registers is slower than full FP32.

Now, this is rather strange. The reason for FX12 being faster than FP16 might simply be that the hardware is being "smart" and not worrying about the exponents, thus having lower latency.

But why is FP16 in FP32 slower than full FP32?
One possible reason would be relative to the free one-per-cycle "MOV" instruction in the NV3x.
In the NV30, you've got one such unit per "pipeline", and one FP unit for each of them. In the NV35, you've also got one per "pipeline", but you've got two FP units for each of them!

Now, one way to implement this "move half the bits to 0" thing is to use that MOV unit. But considering MOV always seems free on the NV30, it has to be supposed you have one unit specifically for that, and one for real MOVs.
But in the NV35, what nVidia *might* have done is still keep only one such unit per pipeline, but now they've got 2 FP32 MAD units instead of one. So, in the case you'd have to clear half of two FP32 registers to do an FP16 op, you'd have to use that MOV unit to do it by fully cleaning the register first.
And since your MOV unit is busy, you don't have that free one-per-clock MOV instruction anymore - and Dawn *does* use that in order to make texturing faster, as shown by thepkrl (sp?) - thus, you'd lose an FP instruction once in a while, resulting in a slight performance hit.

Of course, that's all VERY speculative - and very uncertain - so should it make no sense, please feel free to say it! :)


Uttar
 