Dawn FP16/FX12 vs FP32 performance - numbers inside

Hey everyone,

I've just released a small patch for nVidia cards that makes the Dawn demo run 100% FP32 (see the nV News forums for the link; too lazy to put it here).
I asked MikeC of nV News to benchmark it (as well as a non-public FP16 version) and he agreed to give me some numbers. Thanks Mike! :)

1024x768:

Default Shaders - 29fps
FP16 - 27fps
FP32 - 18fps

1600x1200:

Default Shaders - 27fps
FP16 - 25fps
FP32 - 17fps

The original shader files are a mix of FX12, FP16 and FP32. The small, precision-critical parts, such as the eyes, are FP32.
The bulk of the work is done mostly in FX12 (guesstimate: 80%) with some FP16 (guesstimate: 20%).
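To give an idea of what that mix looks like, here's a rough sketch in NV_fragment_program-style assembly. The instructions and registers below are made up for illustration, NOT taken from the actual Dawn shaders; the X/H/R suffixes select FX12, FP16 and FP32 precision respectively.

    !!FP1.0
    # Bulk of the work in FX12 (X-suffixed instructions):
    TEX H0, f[TEX0], TEX0, 2D;   # diffuse map
    TEX H1, f[TEX1], TEX1, 2D;   # light map
    MULX H2, H0, H1;             # fixed-point multiply
    DP3X H3, H2, f[COL0];        # fixed-point dot product
    # Some FP16 (H-suffixed) where FX12's [-2,2) range isn't enough:
    MADH H4, H3, H2, H0;
    # FP32 (R-suffixed) only for precision-critical parts like the eyes:
    DP3R R0, f[TEX2], f[TEX2];   # squared length, full precision
    RSQR R0, R0.x;               # reciprocal square root
    MULH o[COLH], H4, R0;        # final color
    END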

As you can see, FP16 performance (FX12 replaced by FP16, FP32 kept where it already was) is practically identical to default performance. This would indicate the NV35 has no FX12 hardware (or maybe a very small amount, which would explain the ~5% performance hit).

FP32 performance, however, is only about 2/3 of FP16 performance. That is a big performance hit!
The difference is that everything is FP32, and thus the number of register slots used is also doubled (4 FP32 instead of 4 FP16 / 2 FP32).
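Here's a sketch of the register pressure difference, assuming the register file is sized in FP32 slots and two FP16 values pack into one slot - which is how I understand the NV3x register file works, but that's an assumption. Both hypothetical programs compute the same thing:

    !!FP1.0
    # FP16 version: four live H temporaries, packing into two FP32-sized slots
    MULH H0, f[TEX0], f[COL0];
    MULH H1, f[TEX1], f[COL0];
    MULH H2, f[TEX2], f[COL0];
    MADH H3, H0, H1, H2;
    MOVH o[COLH], H3;
    END

    !!FP1.0
    # FP32 version: the same code with R registers - four full slots, i.e.
    # double the register file usage of the FP16 version above
    MULR R0, f[TEX0], f[COL0];
    MULR R1, f[TEX1], f[COL0];
    MULR R2, f[TEX2], f[COL0];
    MADR R3, R0, R1, R2;
    MOVR o[COLR], R3;
    END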

My theory right now is that the NV3x might be able to use a small pool of FX12 units shared with T&L (maybe, but probably not), which would explain the slight performance hit of FP16.
Also, the NV35 would be 100% FP32 from top to bottom, but would take very big performance hits from using more registers (heck, maybe even bigger than the NV30's, although these tests can't show that).


Any other ideas of what those numbers could mean? Or any feedback?


Uttar
 
So basically what you are saying is that nVidia removed the INT12 stuff but still kept the FP16 hardware?

Why didn't nVidia just go all out and maximize for FP32? It would have made them look a lot better than they do now.

Something tells me that the nV3X design was not really meant for FP32 to be fast, but just to get their foot in the door, so to speak...
 
YeuEmMaiMai said:
Something tells me that the nV3X design was not really meant for FP32 to be fast, but just to get their foot in the door, so to speak...

Your endless one-sided comments are getting very tiring.
 
No.
What I'm saying is that I believe nVidia removed the FX12 hardware and replaced it with FP32 hardware. This had already been said before, but no benchmarks were able to prove it.

However, doubling the number of used registers drastically reduces performance. This problem was already present in the NV30.


Uttar
 
Umm, no they aren't. They are a plausible explanation. Whether they are true or not, I'm sure neither you nor I can determine, since you didn't develop the thing :)

Your beliefs are irrelevant, since you have found no direct proof of what you say. Your benchmarks don't show it. In actuality you are seeing a bug already present in the NV30, so what is more likely?
 
Well, care to explain the chip's crappy performance? How else can you explain it? Your lack of understanding about what the nV30 really is is tiresome. It is a poor performer, especially once you enable all of the FEATURES that are supposed to make this card the "Dawn of cinematic computing".

I am sorry if you do not understand what I said, so let me put it at a second-grade level for you.

Father: Uhh, Billy, the GF256 could do AA, but it was very slow and could not be used.
Billy: Daddy, why did they say it could do it then?
Father: They said it because they could claim that the card has that feature.

You know, the same could be said for ATi (FSAA on any card below the Radeon 9500/9700 cards): some features on their cards could not be used at a reasonable speed, but it does get their foot in the door so that they can improve it in later hardware designs.

Looking at the nV30, all, and I do mean all, of its CineFX features are very slow compared to ATi: FP32, FSAA, pixel shaders, vertex shaders, etc. I mean, really, it would be sad if the S3 DeltaChrome turned out to be faster than the GFFX, wouldn't it?

JF_Aidan_Pryde said:
YeuEmMaiMai said:
Something tells me that the nV3X design was not really meant for FP32 to be fast, but just to get their foot in the door, so to speak...

Your endless one-sided comments are getting very tiring.
 
YeuEmMaiMai: You are confusing the NV30 with the NV35. The NV35 has quite good FP32 performance, really. Still not 100% on par with the R350, but it depends on the situation; sometimes it can even be faster. The NV30's FP32 performance is complete crap, though.

Please stop stating false facts about things you don't know anything about. Thank you :p

Uttar
 
If nVidia had not been caught doing some funny stuff with the drivers, those results could be taken at face value. The NV35 may have good FP32 performance, but once again IQ is poor compared to the competition. Turn on 8xFSAA and, whoa, you will see what I mean about performance tanking big time while looking a lot worse than the R350.

Man, I hope nVidia cranks out a good product soon, as competition is good.
 
Am I missing it, or did no one run this on an NV30? Seeing default vs FP16 vs FP32 on that chip would be interesting, to verify these improvements everyone is talking about for the NV35.
 
YeuEmMaiMai said:
If nVidia had not been caught doing some funny stuff with the drivers, those results could be taken at face value. The NV35 may have good FP32 performance, but once again IQ is poor compared to the competition. Turn on 8xFSAA and, whoa, you will see what I mean about performance tanking big time while looking a lot worse than the R350.

Man, I hope nVidia cranks out a good product soon, as competition is good.

Augh! First, please don't introduce AA IQ issues into a shading-related thread.
Secondly, FP32 IQ on the NV3x is at all times higher than or equal to FP24 IQ on the R3xx. While nVidia may cheat in 3DMark 2003 by using precision they aren't allowed to use, they obviously cannot do that on a hand-made shader program!

Enbar: You aren't missing anything. I don't know anyone with an NV30, so I couldn't test it. Anyone care to test it? :D


Uttar
 
Could it be that some of the half-precision MOVH instructions are causing a performance impact with the increased FP16 usage, due to a loss of optimization opportunity?

For example, MOVH, MULX, MADX might be one clock cycle, but MOVH, MULH, MADH might be two. That example is from your posting of the skin shader.

Or maybe the answer is more apparent in some of the other shader code that was changed.
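To illustrate the idea, a hypothetical fragment with made-up cycle counts (H0, H1 and H3 are assumed to have been loaded earlier) - this is just my pairing guess, not measured behaviour:

    # Mixed version: the MOVH might co-issue with the FX12 ops - one clock?
    MOVH H2, f[TEX1];
    MULX H0, H0, H1;
    MADX H0, H0, H2, H3;

    # All-FP16 version: if that pairing is lost, the same group takes two?
    MOVH H2, f[TEX1];
    MULH H0, H0, H1;
    MADH H0, H0, H2, H3;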
 
Uttar said:
...While nVidia may cheat in 3DMark 2003 by using precision they aren't allowed to use, they obviously cannot do that on a hand-made shader program!
...

For clarity in the light of many shader benchmark confusions out there, I have to disagree with this one sentence: They can and have, by ignoring the precision requests presented by shaders. It is, IMO, very unlikely that they still are for the NV35, but without verification of the precision of output at the same time as we discuss performance, that hasn't been established yet. I'd say "cannot" is incorrect, and "do not" seems very likely true, but is not objectively established to my knowledge...in any case, I wouldn't argue against "do not".
 
demalion said:
Uttar said:
...While nVidia may cheat in 3DMark 2003 by using precision they aren't allowed to use, they obviously cannot do that on a hand-made shader program!
...

For clarity in the light of many shader benchmark confusions out there, I have to disagree with this one sentence: They can and have, by ignoring the precision requests presented by shaders. It is, IMO, very unlikely that they still are for the NV35, but without verification of the precision of output at the same time as we discuss performance, that hasn't been established yet. I'd say "cannot" is incorrect, and "do not" seems very likely true, but is not objectively established to my knowledge...in any case, I wouldn't argue against "do not".

Hehe. Well, nVidia would be completely insane if they did that:
if they automatically transformed native FP32 requests into FP16 ones or something, then they would also be doing it for professional applications, where such a difference might not be tolerated at all.

Anyway, back to the point...
I must admit I could optimize the FP32 code slightly more: nVidia used some useless MOVs which, in the case of true FP16 code, increase performance, but which do nothing for true FP32. Removing those might gain you 3 or 4% performance.

As for the FP16 performance hit...
All I really did for the FP16 version was change all the MULX, DP3X, ... into MULH, DP3H, ... - nothing else (see the sketch below).
I've got three guesses:
- The fragment pipeline can share the FX12 power of T&L.
- Some instructions, which are not run natively, might use shortcuts when they know only FX12 precision is requested, and thus run faster.
- I've done something wrong.
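To make the conversion concrete, here's the pattern of the change - not actual Dawn code, just a hypothetical excerpt:

    # Original mixed shader (FX12 bulk):
    TEX H0, f[TEX0], TEX0, 2D;
    TEX H1, f[TEX1], TEX1, 2D;
    MULX H0, H0, H1;
    DP3X H2, H0, f[COL0];

    # FP16 version: identical instructions, only the precision suffix changes:
    TEX H0, f[TEX0], TEX0, 2D;
    TEX H1, f[TEX1], TEX1, 2D;
    MULH H0, H0, H1;
    DP3H H2, H0, f[COL0];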

Hmm... :)


Uttar
 
Uttar said:
Also, the NV35 would be 100% FP32 from top to bottom, but would take very big performance hits from using more registers (heck, maybe even bigger than the NV30's, although these tests can't show that).


Any other ideas of what those numbers could mean? Or any feedback?

I can't even begin to guess until we see the effects of this patch on both NV30 and NV35 hardware.
 
Joe DeFuria said:
Uttar said:
Also, the NV35 would be 100% FP32 from top to bottom, but would take very big performance hits from using more registers (heck, maybe even bigger than the NV30's, although these tests can't show that).


Any other ideas of what those numbers could mean? Or any feedback?

I can't even begin to guess until we see the effects of this patch on both NV30 and NV35 hardware.

Fine, then go on nV News and pester everyone until a user with an NV30 agrees to benchmark it ;)


Uttar
 
JF_Aidan_Pryde said:
YeuEmMaiMai said:
Something tells me that the nV3X design was not really meant for FP32 to be fast, but just to get their foot in the door, so to speak...

Your endless one-sided comments are getting very tiring.
Some of you are jumping on YeuEmMaiMai for this statement, but I think it's a valid assumption. I don't know if he is correct, but supporting slow FP32 and making FP16 fast should give a lower transistor count than supporting FP32 at full speed. ATi obviously made a different tradeoff, which has turned out to be the better decision in hindsight. If DirectX 9 required FP32, or had a minimum requirement of FP16, things might be different.
 
YeuEmMaiMai said:
...
Something tells me that the nV3X design was not really meant for FP32 to be fast, but just to get their foot in the door, so to speak...

Whatever that "something" was is right, in my view...;)

fp32 basically gets nVidia some marketing points in the professional rendering Quadro markets (provided either applications or nVidia's drivers can be coaxed into delivering it consistently). It's not useful for 3D games for a couple of reasons:

(1) Current and projected 3D game engines will not show much of a rendering difference between fp16 and fp32, if any (most likely none).

(2) fp32 is much slower in the nv3x architecture than fp16, and than fp24 as employed in the R3xx.

fp32 is also something nVidia PR can pimp--they've already done it by flat-out saying that ATi's fp24 isn't comparable to nVidia's fp32. Of course that's nonsense--nv3x's fp16, as used in 3D games, will be inferior to ATi's fp24--a distinction nVidia PR assumes will be overlooked by most people (who they hope will inaccurately assume fp32 is being done all the time). But given the current and projected limitations of game engines in regard to rendering precision, it's interesting to note that ATi's superior rendering quality most likely comes from other areas of the R3xx architecture, and not necessarily from the greater precision of fp24 over fp16 (at least until game engines can render to fp24 levels of precision). However, fp24 precision may well help currently in things like FSAA, where the extra precision might be employed whether the game engine is capable of it or not. So whereas nVidia's PR efforts boast a 128-bit pipeline (4 components x fp32), as far as 3D gaming is concerned it is in fact a 64-bit pipeline (4 x fp16), contrasted with the R3xx's 96-bit pipeline (4 x fp24).

Really, I'm hard-pressed to think of any actual professional rendering advantages that might be apparent when contrasting nv3x's fp32 to R3xx's fp24, but that's another debate...;)
 
Doomtrooper said:
Uttar said:
Fine, then go on nV News and pester everyone until a user with an NV30 agrees to benchmark it ;)


Uttar

Why not send it to Dave B. here? He has a 5900; it would be interesting to see.

Yes, but what we need now is 5800 numbers, not 5900 :)


Uttar
 
demalion said:
For clarity in the light of many shader benchmark confusions out there, I have to disagree with this one sentence: They can and have, by ignoring the precision requests presented by shaders. It is, IMO, very unlikely that they still are for the NV35, but without verification of the precision of output at the same time as we discuss performance, that hasn't been established yet. I'd say "cannot" is incorrect, and "do not" seems very likely true, but is not objectively established to my knowledge...in any case, I wouldn't argue against "do not".

If I understand your comment correctly, then I have to say the indications are that they are doing exactly that, and yes, on the NV35. It's pretty clear that if they were not still substituting lower precision on the fly, the performance of the NV35 would be suffering to a much greater extent. This goes beyond benchmarks to applications, even now; it's just that there aren't any out there to prove it, other than the games shown at E3 on nVidia hardware, and the obvious lowering of precision being done by Carmack for all the NV3x cards. As it stands, the NV35 would lose *ALL* Doom III benchmarks if it were forced to use what they tout in all their PR statements and tell everyone makes them *better*.
 