Dawn FP16/FX12 VS FP32 performance - numbers inside

It's not that fp16 is significantly faster than fp32 (or fp32 being significantly slower)... The problem on NV3x is that it only has 2 fp32 registers (or 4 fp16 registers) free, and after that performance goes down. I'm sure you won't see much of a difference between code with 2 registers and 32 fp32 instructions and the same code using 32 fp16 instructions. FX12, however, is two times faster than floating point on NV30.
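To make the register point concrete, here is a toy sketch of it. The budget (2 fp32 registers, or equivalently 4 fp16 registers) is the figure from the post; treating it as a plain storage budget where one fp32 register costs two fp16 slots is an assumption for illustration, and the function names are made up. It says nothing about how much performance actually drops past the limit.

```python
# Toy model of the register limit described above. The budget figures come
# from the post; the "slot" accounting is an illustrative assumption, not a
# description of the real NV3x register file.
FP32_COST = 2   # one fp32 register takes the space of two fp16 registers
FP16_COST = 1
FREE_SLOTS = 4  # 2 fp32 registers == 4 fp16 registers


def slots_used(fp32_regs, fp16_regs):
    """Storage used by a shader's temporaries, in fp16-sized slots."""
    return fp32_regs * FP32_COST + fp16_regs * FP16_COST


def fits_at_full_speed(fp32_regs, fp16_regs):
    """True if the shader stays inside the free register budget."""
    return slots_used(fp32_regs, fp16_regs) <= FREE_SLOTS


print(fits_at_full_speed(2, 0))  # two fp32 temporaries
print(fits_at_full_speed(0, 4))  # four fp16 temporaries
print(fits_at_full_speed(3, 0))  # three fp32 temporaries
```

Under this model the same shader written with fp16 temporaries can hold twice as many live values before hitting the limit, which is where the fp16 advantage would come from rather than from the instructions themselves.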

You also won't see much (if any) difference between fp16 and fp32 if you are doing just "color operations". However, if you want to do something that's not just "color operations", you can bump into precision limitations VERY quickly and they will be VERY obvious (for example, there is no point in using textures larger than 1024x1024 for dependent reads in fp16). Try the Mandelbrot demo from Humus on a Radeon (with fp24) and on a GeForce FX 5900 (and hopefully it will be run in fp32 to illustrate my point :rolleyes:) and zoom in a bit...
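The 1024x1024 remark can be checked with a rough sketch. fp16 has 10 explicit mantissa bits, so values in [0.5, 1) are spaced 1/2048 apart. The code below assumes a normalized texture coordinate is rounded to IEEE half precision (Python's struct "e" format) before the lookup; how NV3x actually rounds and addresses texels is not modelled, and the function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def distinct_texels(width):
    """Count the texels still reachable along one axis when each
    texel-centre coordinate is stored as fp16 before the lookup."""
    hit = set()
    for i in range(width):
        u = to_fp16((i + 0.5) / width)
        # Clamp in case the coordinate rounded up to exactly 1.0.
        hit.add(min(int(u * width), width - 1))
    return len(hit)


for width in (512, 1024, 2048, 4096):
    print(width, distinct_texels(width))
```

At 1024 texels every texel centre is exactly representable, so all 1024 are reachable. At 2048 and above, neighbouring centres in the upper half of the coordinate range start rounding to the same fp16 value, so some texels can no longer be addressed at all.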

And Uttar: T&L pipeline is far from fx12... ;)
 
Actually, I don't know exactly what's going on with JakUp's scores - AnteP's are significantly different.

AnteP, 5800 @ 5800 Ultra, resolution unknown ( probably 1024x768 )
Normal: 30 fps
FP32: 17 fps

JakUp, 5800 Ultra, 1024x768 no AA/AF, Quality mode:
Normal: 19fps
FP32: 15fps

I don't quite understand the "normal" difference between both. Surely FRAPS can't be THAT inaccurate? Hehe.
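For reference, the relative FP32 hit in each set of numbers works out as below. Only the four frame rates quoted above are used; nothing else is assumed.

```python
# Frame rates quoted in this post: (normal, fp32) in fps.
scores = {
    "AnteP": (30, 17),
    "JakUp": (19, 15),
}


def fp32_hit(normal_fps, fp32_fps):
    """Fraction of the frame rate lost when switching to FP32."""
    return (normal_fps - fp32_fps) / normal_fps


for name, (normal, fp32) in scores.items():
    print(f"{name}: {fp32_hit(normal, fp32):.0%} slower in FP32")
```

That is roughly a 43% drop for AnteP against roughly 21% for JakUp, so the two runs disagree on the size of the FP32 penalty as well as on the "normal" score.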

And surely a driver version couldn't do any similar work either...
Hmm...

Strange stuff going on here! Pretty much certain the NV35 isn't much faster at FP32 either - it's faster, but not *that* much faster...

BTW, most Dawn shaders use Cg, but all the ones using FX12/FP16 use assembly language to be faster ( probably because they're the bulk of the pixels )


Uttar
 
So according to Ante P (who seems to be a pretty reliable source) the 5800 at Ultra speeds performs about the same at Dawn as does the 5900 Ultra, and receives a similar (although somewhat greater?) performance hit when switching from mixed (mostly FX12) shaders to FP32.

From this information, it looks like there is little or no actual improvement in the shader pipeline from NV30->NV35--which would seem to imply that the benchmarks in the NV35 reviews that show otherwise were cheats. It looks as though almost all, if not the entire, improvement between the NV30 and the NV35 can be ascribed to the increased bandwidth.
 
Actually, I'm not sure at all.
My guess, right now, is that there is a minimal difference between FP32 on the NV30 and on the NV35 ( maybe a 20% or 25% per-clock increase, but again, the clock is lower on the NV35 to not have a dustbuster, remember that )
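A quick sketch of what that guess means in absolute terms. The 20% and 25% per-clock figures are the guess above; the core clocks (500 MHz for the 5800 Ultra, 450 MHz for the 5900 Ultra) are the stock values and are not from this thread.

```python
# Stock core clocks in MHz (not stated in the thread).
NV30_CLOCK_MHZ = 500  # GeForce FX 5800 Ultra
NV35_CLOCK_MHZ = 450  # GeForce FX 5900 Ultra


def net_speedup(per_clock_gain):
    """Overall FP32 speedup of NV35 over NV30 for a given per-clock gain,
    e.g. per_clock_gain=0.20 for a 20% per-clock improvement."""
    return (1.0 + per_clock_gain) * NV35_CLOCK_MHZ / NV30_CLOCK_MHZ


for gain in (0.20, 0.25):
    print(f"{gain:.0%} per clock -> {net_speedup(gain):.3f}x overall")
```

So a 20% to 25% per-clock gain shrinks to roughly 8% to 12.5% once the lower clock is taken into account.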

The FP16 numbers given to me by MikeC seem to indicate the NV35 is *very* good at FP16. So could it be that the NV30 only got a performance hit from using FP32 registers, while the NV35 *also* got a performance hit from using FP32 instructions?

So, you'd get:
NV30: 4 FP32 units
NV35: 4 FP32 units + 8 FP16 units

Hmm, I'll make the FP16 .zip file public so people can check that out...


Uttar
 
I'd like to chime in on the whole FP32 and visualisation market point. Industrial Light and Magic, one of the chief proponents of Nvidia hardware, have adopted the FP16 or 'half' mode for their new OpenEXR file format, citing that FP32 is complete overkill for film use. FP32 requires too much storage per frame and the differences in quality are not readily apparent over FP16.

So if FP32 isn't useful for visualisation and is apparently completely useless for real time gaming, is it anything more than a 'we are better than the competition' bullet point on a list of specifications? Sorta like 32 bit rendering was in the TNT days?
 
Well, remember OpenEXR is a *file* format.
That means that OpenEXR is a 64-bit per pixel ( 4xFP16 ) format. Generally, the framebuffer is 32-bit per pixel ( 4xFX8 )

I really think Industrial Light and Magic was only talking about files there. Could be wrong, obviously :)
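The per-pixel sizes above, plus the FP32 case, as a small sketch. The 1920x1080 frame size is just an arbitrary example, not something ILM or the OpenEXR documentation specifies, and this counts raw uncompressed pixels only.

```python
def bytes_per_pixel(channels, bits_per_channel):
    """Raw storage for one pixel, in bytes."""
    return channels * bits_per_channel // 8


def frame_bytes(width, height, channels, bits_per_channel):
    """Raw, uncompressed storage for one frame, in bytes."""
    return width * height * bytes_per_pixel(channels, bits_per_channel)


# Example frame size, chosen only for illustration.
WIDTH, HEIGHT = 1920, 1080

for label, bits in (("4xFX8", 8), ("4xFP16", 16), ("4xFP32", 32)):
    size_mib = frame_bytes(WIDTH, HEIGHT, 4, bits) / 2**20
    print(f"{label}: {bytes_per_pixel(4, bits)} bytes/pixel, {size_mib:.1f} MiB/frame")
```

Going from FP16 to FP32 doubles the storage per frame, which matches the storage argument quoted from ILM.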


Uttar
 
Uttar said:
So, you'd get:
NV30: 4 FP32 units
NV35: 4 FP32 units + 8 FP16 units

IMO, doubtful. It's probably more like 4 FP32 or 8 FP16. I suspect that the NV30 chip just wasn't as efficient at apportioning two FP16 instructions as it could have been, and this is probably one of the major fixes with NV35.
 
Hi guys,

Would someone with a '5900 be kind enough to post a couple of screenies to show the difference between the original FP16 Dawn, and the patched FP32 version of Dawn?

I'd love to see how noticeable things really are!

Cheers :)
 
YeuEmMaiMai said:
Something tells me that the nV3X design was not really meant for FP32 to be fast but just to get their foot in the door so to say...

This is exactly the case.

On the way back to the hotel from E3 one night I cadged a lift on the NVIDIA bus. When I got on I noticed the name badge of Kieth Galocy, a name I recognised from the 3dfx days. We were talking about a number of things, such as what he's up to and the NV3x parts, and he made that exact same point himself. He said that NV30 is really a good DX8 performer, but with DX9 hardware, which is similar to the previous generations. With the number of FX12 units NV30 is a superb DX8 class performer, but not quite so hot a DX9 performer.

ATI took a slightly more generic route, not bothering with full FP32 precision but using more FP24 units that can generically cope with both DX9 and DX8 shaders, so they have ended up with a more balanced architecture for current titles and new titles - if NV stick to running DX9 shaders at full DX9 precision, half of the NV30 pipeline is wasted as it's not float.
 
Man...

Maybe the R3xx is just significantly better @ shading than the NV3x...?

I've been playing with the OGL-wrapper Dawn demo, and I'm just about locked @ 50 FPS across the board.

I've gone from 10x7 -> 16x12, and even threw 4xFSAA into the mix...Only when at 1600x1200 did it hit about 40 FPS, rather than ~ 50 FPS (used FRAPS to monitor framerate).

Am I the only one surprised by these results? Heck, what would the R3xx do if it didn't have to go through a wrapper???
 
Uttar said:
Actually, I don't know exactly what's going on with JakUp's scores - AnteP's are significantly different.

AnteP, 5800 @ 5800 Ultra, resolution unknown ( probably 1024x768 )
Normal: 30 fps
FP32: 17 fps

JakUp, 5800 Ultra, 1024x768 no AA/AF, Quality mode:
Normal: 19fps
FP32: 15fps

I don't quite understand the "normal" difference between both. Surely FRAPS can't be THAT inaccurate? Hehe.

Uttar

I'm using the 44.10 drivers and I was running at 1024x768 with no AA/AF, all driver settings at default.

How's that FX12 version coming along? ;)

Too bad my 9700 Pro is on vacation, I'd like to see some comparisons.

BTW, does anyone have an idea how to grab screenshots at identical frames in Dawn?
Can you somehow alter the animations and camera fly paths?
 
Doomtrooper said:
Cg is not a language though, Nvidia for sure has used it and compiled Dawn in Opengl using their proprietary extensions.

OK, so they used it for the shaders. That's what I figured. So doesn't that make the OGL wrapper in part a pseudo-Cg compiler for ATI cards? Since R3x0 doesn't support FP32, I don't think it could use the output of it. Or maybe I'm confused. That happens a lot. :p
 
The Baron said:
Doomtrooper said:
Cg is not a language though, Nvidia for sure has used it and compiled Dawn in Opengl using their proprietary extensions.

OK, so they used it for the shaders. That's what I figured. So doesn't that make the OGL wrapper in part a pseudo-Cg compiler for ATI cards? Since R3x0 doesn't support FP32, I don't think it could use the output of it. Or maybe I'm confused. That happens a lot. :p

You're both wrong. Cg is a language. The compiler was run at compile time and the code was compiled into NV OpenGL extension calls. The ATI wrapper thing is interpreting the OpenGL calls and replacing them with ATI and/or ARB extension calls.

By the time the executable is run, Cg is nowhere to be seen, and only the output of the Cg compiler is around--and it's indistinguishable from hand-written shader calls. (Except, as Uttar said, it leaves non-essential MOVs in there, apparently).
 
so Cg isn't compiled on-the-fly? Gotcha. Makes a lot more sense how the OGL wrapper would work then.
 
The Baron said:
so Cg isn't compiled on-the-fly? Gotcha. Makes a lot more sense how the OGL wrapper would work then.

It can be, when you use the runtime compiler, but apparently it's not with the Dawn demo.
 
The fragment shaders in the Dawn demo are finely tuned, hand-written shaders sent to the hardware via NVIDIA's proprietary extensions; Cg was not used due to performance issues. Another interesting tidbit: if you look at the original Dawn pictures that NVIDIA claimed were rendered via hardware (really done through emulated hardware), you'll notice that they are of higher quality than the real-time Dawn demo that was released. I wonder whether NVIDIA knew that their hardware wouldn't be able to render Dawn at the quality of those original images and just wanted to get people's attention, or whether they overestimated the speed of their hardware.
 
Dave, since this appears to be the case, I really feel sorry for people who bought the card thinking they were going to get some performance out of it. Either that, or nVidia drivers are going to be 20MB downloads just for the nV3x cards, since they will have to sub a lot of shader code to make up for the card's weaknesses. Sad, really: $500 for a card and people got short-changed.

DaveBaumann said:
YeuEmMaiMai said:
Something tells me that the nV3X design was not really meant for FP32 to be fast but just to get their foot in the door so to say...

This is exactly the case.

On the way back to the hotel from E3 one night I cadged a lift on the NVIDIA bus. When I got on I noticed the name badge of Kieth Galocy, a name I recognised from the 3dfx days. We were talking about a number of things, such as what he's up to and the NV3x parts, and he made that exact same point himself. He said that NV30 is really a good DX8 performer, but with DX9 hardware, which is similar to the previous generations. With the number of FX12 units NV30 is a superb DX8 class performer, but not quite so hot a DX9 performer.

ATI took a slightly more generic route, not bothering with full FP32 precision but using more FP24 units that can generically cope with both DX9 and DX8 shaders, so they have ended up with a more balanced architecture for current titles and new titles - if NV stick to running DX9 shaders at full DX9 precision, half of the NV30 pipeline is wasted as it's not float.
 