David Kirk finally admits weak DX9 vs DX8 performance - GeFX

g__day

http://www.guru3d.com/article/article/91/2

"Through a great talk given by Chief Technology Scientist, David Kirk, NVIDIA basically claims that if 16-bit precision is good enough for Pixar and Industrial Light and Magic, for use in their movie effects, it's good enough for NVIDIA. There's not much use for 32-bit precision PS 2.0 at this time due to limitations in fabrication and cost, and most notably, games don't require it all that often. The design trade-off is that they made the GeForce FX optimized for mostly FP16. It can do FP32, when needed, but it won't perform very well. Mr. Kirk showed a slide illustrating that with the GeForce FX architecture, its DX9 components have roughly half the processing power as its DX8 components (16Gflps as opposed to 32Gflops, respectively). I know I'm simplifying, but he did point it out, very carefully. "

* * *

Finally they state publicly what the best sites figured out over 8 months ago :) - only now are they admitting it. I applaud them for coming clean - even though David Kirk didn't reflect on how they launched the NV FX series as the Dawn of 32-bit processing because 16-bit wasn't enough - and now they're publicly saying FP32 isn't realistically usable and no one does it...

except ATi can do it (fp32) on 0.15 micron technology for 9 out of 11 steps in their graphics pipeline, as Dave has informed us :).
 
As of 44.03, released some months ago, the NVIDIA drivers were re-architected such that the 3DMark03 'cheating' was taken care of. NVIDIA stated that the drivers, pre 44.03, simply realized that it didn't need to draw certain areas of the benchmark scene, and so it didn't. It wasn't a 3DMark03 specific optimization, just a general optimization that it would also do for other applications as well.

Ehhh... :rolleyes:

So their drivers "realize" that certain things did not need to be drawn ?
So it was "general" optimisation also in other applications... so people just did not find those then... sigh...

K-
 
It's not like users said we must have fp32; rather, Nvidia shouted it from the hilltops, and it has severely backfired on them.
One has to wonder whether, if they hadn't pushed so hard for it as a requirement for CineFX, the DX9 spec would have been just FP16 - and so would R3x0 products.
NV raised the stakes without having the products to back it up - unfortunately for them, ATI did.
 
Kristof said:
So their drivers "realize" that certain things did not need to be drawn ?
So it was "general" optimisation also in other applications... so people just did not find those then... sigh...
It was confirmed a while ago on the DXDEV mailing list that some drivers compute AABBs of static vertex buffers and perform frustum culling before drawing. At least they do this when fixed-function VP is used, and probably in some simpler cases of vertex shaders as well.
There were lengthy debates on this - because every well-written game does culling on its own, the drivers are doing needless work... But it helps enormously for non-culling apps (and probably benchmarks).
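
To make that concrete, here's a rough sketch of the kind of driver-side check being described - build the box once for a static buffer, then test it against the frustum planes before each draw. This is just an illustration of the general technique, not anyone's actual driver code, and it assumes the planes have already been transformed into the buffer's space:

Code:
// Rough sketch of the kind of driver-side check being described: build an
// axis-aligned box for a static vertex buffer once, then test it against the
// view-frustum planes before issuing the draw. Plane normals point inward
// and are assumed to be in the same space as the vertices.
#include <cfloat>
#include <cstddef>

struct Vec3  { float x, y, z; };
struct Plane { float a, b, c, d; };   // a*x + b*y + c*z + d >= 0 means "inside"
struct AABB  { Vec3 min, max; };

// Done once, at vertex-buffer creation time (the buffer is static).
AABB ComputeAABB(const Vec3* verts, size_t count)
{
    AABB box = { {  FLT_MAX,  FLT_MAX,  FLT_MAX },
                 { -FLT_MAX, -FLT_MAX, -FLT_MAX } };
    for (size_t i = 0; i < count; ++i) {
        const Vec3& v = verts[i];
        if (v.x < box.min.x) box.min.x = v.x;
        if (v.y < box.min.y) box.min.y = v.y;
        if (v.z < box.min.z) box.min.z = v.z;
        if (v.x > box.max.x) box.max.x = v.x;
        if (v.y > box.max.y) box.max.y = v.y;
        if (v.z > box.max.z) box.max.z = v.z;
    }
    return box;
}

// Cheap per-draw reject: if the box is entirely behind any one frustum plane,
// the whole buffer can be skipped without changing the rendered image.
bool BoxOutsideFrustum(const AABB& box, const Plane frustum[6])
{
    for (int p = 0; p < 6; ++p) {
        // pick the box corner furthest along this plane's normal
        Vec3 v;
        v.x = (frustum[p].a >= 0.0f) ? box.max.x : box.min.x;
        v.y = (frustum[p].b >= 0.0f) ? box.max.y : box.min.y;
        v.z = (frustum[p].c >= 0.0f) ? box.max.z : box.min.z;
        if (frustum[p].a * v.x + frustum[p].b * v.y +
            frustum[p].c * v.z + frustum[p].d < 0.0f)
            return true; // even the most "inside" corner is outside -> cull
    }
    return false;
}

Which is exactly why it's redundant for an app that already culls, but a "free" speedup for anything - or any benchmark - that doesn't.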
 
Kristof said:
As of 44.03, released some months ago, the NVIDIA drivers were re-architected such that the 3DMark03 'cheating' was taken care of. NVIDIA stated that the drivers, pre 44.03, simply realized that it didn't need to draw certain areas of the benchmark scene, and so it didn't. It wasn't a 3DMark03 specific optimization, just a general optimization that it would also do for other applications as well.

Ehhh... :rolleyes:

So their drivers "realize" that certain things did not need to be drawn ?
So it was "general" optimisation also in other applications... so people just did not find those then... sigh...

K-

If Nvidia's drivers were clever enough not to draw certain areas of the benchmark when those areas were not seen, why then did the drivers continue not to draw those areas when the camera was taken off its fixed path - one of the normal modes of the application?

These "general optimisations" were just static clip planes hacked into the drivers in order to cheat a specifically recognised test. Nothing general, clever, or honest about that.
 
Bouncing Zabaglione Bros. said:
Kristof said:
As of 44.03, released some months ago, the NVIDIA drivers were re-architected such that the 3DMark03 'cheating' was taken care of. NVIDIA stated that the drivers, pre 44.03, simply realized that it didn't need to draw certain areas of the benchmark scene, and so it didn't. It wasn't a 3DMark03 specific optimization, just a general optimization that it would also do for other applications as well.

Ehhh... :rolleyes:

So their drivers "realize" that certain things did not need to be drawn ?
So it was "general" optimisation also in other applications... so people just did not find those then... sigh...

K-

If Nvidia's drivers were clever enough not to draw certain areas of the benchmark when those areas were not seen, why then did the drivers continue not to draw those areas when the camera was taken off its fixed path - one of the normal modes of the application?

These "general optimisations" were just static clip planes hacked into the drivers in order to cheat a specifically recognised test. Nothing general, clever, or honest about that.

Well, people have noticed that under OpenGL, using display lists (which are static) seems to make the drivers do some sort of frustum culling.

For 3DMark, I wouldn't say they were using "clip planes", but that they just somehow stored when/where they needed to clear the buffers...
 
So at Editor's Day, 24-bit float isn't good enough, but for Dr. Kirk, more than 16-bit is overkill???

Left hand, right hand...
 
NeARAZ said:
It was confirmed a while ago on the DXDEV mailing list that some drivers compute AABBs of static vertex buffers and perform frustum culling before drawing. At least they do this when fixed-function VP is used, and probably in some simpler cases of vertex shaders as well.
There were lengthy debates on this - because every well-written game does culling on its own, the drivers are doing needless work... But it helps enormously for non-culling apps (and probably benchmarks).
Yes - but doing culling against the view frustum is very different from what was apparently going on, since the display would still be correct whether you were 'on the rails' or not.

More to the point, simple BB checks against the frustum should not be a useful optimisation at the driver level because the app should be doing this already at the higher level - you would just be unnecessarily duplicating work.

If the app is written badly enough that it doesn't perform gross-culling like this then the app writer deserves to get sucky frame rates ;)

As of 44.03, released some months ago, the NVIDIA drivers were re-architected such that the 3DMark03 'cheating' was taken care of. NVIDIA stated that the drivers, pre 44.03, simply realized that it didn't need to draw certain areas of the benchmark scene, and so it didn't. It wasn't a 3DMark03 specific optimization, just a general optimization that it would also do for other applications as well.
Of course if the 'optimisation' wasn't 3DMark specific, then when any apparent driver detection of 3DMark was discovered and disabled by Futuremark the 'optimisation' would still have been there and taking effect...

Was it?

And if the 'optimisation' was in any way general and valid then when you went off the rails it shouldn't have affected the image that was displayed since it should have then detected that it needed to draw the bits that it was previously leaving out, shouldn't it?

o_O
 
He's also not getting the point (with respect to film studios). 16-bit precision may be good enough for "color" output, but almost all internal calculations are done in fp32. I don't think any studio uses fp16 for depth-mapped shadows.
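
To put a rough number on the shadow-map point: with fp16's 10-bit mantissa, the spacing between representable depth values gets very coarse very quickly. A quick back-of-envelope check (my own illustration, assuming the s10e5 half format and a depth range normalised to [0,1]):

Code:
// Back-of-envelope: spacing (ULP) of an s10e5 half-float at a given
// normalised depth, versus the step of a 24-bit fixed-point depth buffer.
// Illustrative numbers only.
#include <cmath>
#include <cstdio>
#include <initializer_list>

// ULP of a half-float (10 mantissa bits) at a normalised value v in (0, 1].
double HalfUlp(double v)
{
    int e = (int)std::floor(std::log2(v)); // exponent of v
    return std::ldexp(1.0, e - 10);        // 2^(e - 10)
}

int main()
{
    const double fixed24 = std::ldexp(1.0, -24); // 24-bit depth buffer step
    for (double depth : { 0.1, 0.5, 0.9 }) {
        std::printf("depth %.1f: fp16 step = %.3g, 24-bit fixed step = %.3g\n",
                    depth, HalfUlp(depth), fixed24);
    }
    return 0;
}

Near the far end of the range that's roughly one part in two thousand, versus one part in sixteen million for an ordinary 24-bit depth buffer - hence fp32 for anything like depth-mapped shadows.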

Some VFX companies use the full fp32 even for color output. For example, both Matrix sequels used it.

-M
 
Mr. Blue said:
He's also not getting the point (with respect to film studios) [fp16 is "good enough" for them so it's good enough for NVIDIA].
Sad but true. It's simply not good enough for Pixar and Industrial Light and Magic.

The low precision fp16 is used at ILM for compositing, HDR lightmaps and final frame output.

Pixar traditionally used low-precision, integer-based encoding for final frame output and, until recently, didn't even do much compositing. (At the SIGGRAPH 2003 special session "Finding Nemo: Story, Art, Technology and Triage" they mentioned that they've rediscovered compositing. There was much rejoicing. Yeaaaah.)

Shading - 32 bit and higher precision in the film industry.

Reminding everyone again - in OpenGL the precision requirement is accuracy to about one part in 10^5 - this is not a new requirement either, it's been that way for 10 years. Coincidentally, it's also the precision requirement for floats in C.
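
For anyone who wants to check that figure against the formats being argued about, a quick sketch (the fp16 and fp24 layouts here are the commonly cited s10e5 and s16e7 ones, not anything from Kirk's slides):

Code:
// Worst-case relative rounding error (half an ULP at 1.0, i.e. 2^-(m+1) for
// m mantissa bits) versus the ~1e-5 figure. Assumed layouts: fp16 = s10e5,
// fp24 (R300) = s16e7, fp32 = s23e8.
#include <cfloat>
#include <cmath>
#include <cstdio>

int main()
{
    const double requirement = 1e-5; // ~OpenGL; C also requires FLT_EPSILON <= 1e-5
    const struct { const char* name; int mantissa_bits; } fmt[] = {
        { "fp16", 10 }, { "fp24", 16 }, { "fp32", 23 },
    };
    std::printf("FLT_EPSILON on this machine: %g\n", (double)FLT_EPSILON);
    for (const auto& f : fmt) {
        double err = std::ldexp(1.0, -(f.mantissa_bits + 1)); // 2^-(m+1)
        std::printf("%s: worst-case relative error ~%.2g -> %s one part in 10^5\n",
                    f.name, err, err <= requirement ? "meets" : "misses");
    }
    return 0;
}

Which lines up with the rest of this thread: fp16 misses by more than an order of magnitude, fp24 just clears it, and fp32 has plenty of headroom.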

-mr. bill
 
I have to make a correction. Guru3D has not fully understood what David Kirk said or the architecture slide about the NV35/36/38 that he showed.

David Kirk never said that the GeForce FX has 16 GFlops for DX9 and 32 GFlops for DX8. He said that the NV35/38's full DX9 units have a power of 16 GFlops and that the NV35/38's two mini-units/combiners have a power of 32 GFlops. That's different... Actually, we have to add these two numbers to get the full DX9 power, which is the same as the DX8 power on NV35/36/38 (but not on NV30/31/34).
 
Did you get to see the slides? All we have is Guru3D saying Kirk showed the slides and pointed out their relative processing power very carefully.

What was actually said by Dave Kirk if it wasn't what Guru3d reported?

If Kirk meant to say fp32 is fine, or that the power is always additive, why didn't he say that instead and show slides combining processing power, rather than what we are told he said, coupled to a slide that shows half the performance and a narrative that described weak fp32 performance?
 
Tridam said:
I have to make a correction. Guru3D has not fully understood what David Kirk said or the architecture slide about the NV35/36/38 that he showed.

David Kirk never said that the GeForce FX has 16 GFlops for DX9 and 32 GFlops for DX8. He said that the NV35/38's full DX9 units have a power of 16 GFlops and that the NV35/38's two mini-units/combiners have a power of 32 GFlops. That's different... Actually, we have to add these two numbers to get the full DX9 power, which is the same as the DX8 power on NV35/36/38 (but not on NV30/31/34).

Hmm...

We know the full units' calculation is:
clock speed * Vec4 * 2 due to MAD * 4 "pipelines"
-> 500*4*2*4 = 16000 MFlops = 16 GFlops

Looking back at: http://www.nvidia.com/content/areyouready/facts.html

NVIDIA's number was 51000 ( = 51 GFlops ) for the pixel shader alone.

Now, if we delve into that CineFX article at 3DCenter.org, a few ways remain to get to this conclusion...
We know there's one free log/exp/lit operation per cycle.
We also know we could do ( 2x FP?? RCP + 1x FP16 RSQ ) in parallel, theoretically at least.

So, that's 500*4*((2*4)+4+3) = 30 GFlops.
We also get the double ADD power in 4x2 cases instead of 1x4 ones. That's:
500*4*(4+8+4+3) = 38 GFlops

That means we've got 13 GFlops strangely missing in action.
We could add another 8 GFlops if we consider the free MOV operation...

That means we've still got 5 GFlops missing in action, or 10 ops/clock.
I would tend to believe those operations are to be found in the Triangle Setup unit.

I do not believe Triangle Setup should be counted as part of the Pixel Shader, but try explaining that to NVIDIA's marketing department, which could have benefited from a bigger number than 50 GFlops for hyping the part ;)

---

So, I think it IS possible to get the 51GFlops number NVIDIA originally claimed. Whether all of these functions are exposed, such as the 4x2 ADD technology, remains to be seen.
And no matter what, as indirectly said by David Kirk himself in that presentation, the general purpose power of the NV30 is really only 16GFlops.

Actually, his saying 16 GFlops would make me believe the 4x2 ADD technology is NOT activated, and most likely simply never will be. Otherwise, the number would have been 24 GFlops.

The theoretical total general-purpose FP16 power of the NV35 at 500MHz ( hey, mine is clocked at 497MHz, so stop complaining about it shipping at 450MHz :p ) would thus be 48 GFlops, while, if you include the special-purpose units ( and not the triangle setup ones, which simply do not fit in this category ), you get 80 GFlops.

If you do not count texturing as flops, the NV35 is 32 GFlops in FP16 ( and either 16 or 32 GFlops in FP32 ) during texturing.

ATI's calculation for the R350/R360 is roughly, excluding texturing: 400*8*2*4 = 25.6 GFlops. Considering a 10% clock difference to be fair... +- 28 GFlops compared to 16/32/48 for the NV35.
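
If anyone wants to play with these peak-rate numbers themselves, they all come out of the same clock * pipes * flops-per-pipe product. A little sketch reproducing the arithmetic above (the per-pipe op counts are the ones used in this post, not vendor-confirmed figures):

Code:
// Reproducing the peak-rate arithmetic above.
// GFlops = clock (MHz) * pipes * flops per pipe per clock / 1000.
#include <cstdio>

double PeakGflops(double clock_mhz, int pipes, int flops_per_pipe_per_clock)
{
    return clock_mhz * pipes * flops_per_pipe_per_clock / 1000.0;
}

int main()
{
    // NV35 full FP32 units: vec4 MAD = 2*4 flops/clock per pipe, 4 pipes @ 500 MHz.
    std::printf("NV35 full units : %.1f GFlops\n", PeakGflops(500, 4, 2 * 4));
    // NV35 mini-units/combiners as counted in this thread: 16 flops/clock per pipe.
    std::printf("NV35 mini units : %.1f GFlops\n", PeakGflops(500, 4, 16));
    // R350: vec4 MAD = 2*4 flops/clock per pipe, 8 pipes @ 400 MHz.
    std::printf("R350            : %.1f GFlops\n", PeakGflops(400, 8, 2 * 4));
    return 0;
}

Swap in 450MHz or different op counts and you can see how sensitive the marketing numbers are to what you decide to count.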

But then again, the NV35's 4xScalar function is either buggy or not exposed, thus giving yet another practical advantage to the R300. You could argue that's compensated for by all the NV35's special-purpose units for simplification, though.

And when you take into account register usage, a 12 register shader is 40% slower on the NV35/NV38.

So, that's +- 10/20/30 general-purpose GFlops for the NV35.
And the R300 has approximately 28 GFlops.

These calculations most certainly explain the NV35's trouble, IMO, as well as the NV30's marketing BS :)


Uttar
 
THe_KELRaTH said:
It's not like users said we must have fp32; rather, Nvidia shouted it from the hilltops, and it has severely backfired on them.
One has to wonder whether, if they hadn't pushed so hard for it as a requirement for CineFX, the DX9 spec would have been just FP16 - and so would R3x0 products.

That's not true at all. Once it was decided that R300 would be FP, FP16 never even came up as a possibility. There's no way to do correct texture addressing, in any sort of generalized way, with FP16.
 
sireric said:
That's not true at all. Once it was decided that R300 would be FP, FP16 never even came up as a possibility. There's no way to do correct texture addressing, in any sort of generalized way, with FP16.

Interesting, and I'm genuinely surprised, as I would have thought that even with its limitations it would have been the natural next step up from integer, followed by FP24 and so on.
 
g__day said:
Did you get to see the slides? All we have is Guru3D saying Kirk showed the slides and pointed out their relative processing power very carefully.

What was actually said by Dave Kirk if it wasn't what Guru3d reported?

I was at the Editor's Day and I have a paper version of these slides. David Kirk was talking about general-unit GFlops vs mini-unit/combiner GFlops. He didn't talk about DX9 GFlops vs DX8 GFlops.
 
THe_KELRaTH said:
sireric said:
That's not true at all. Once it was decided that R300 would be FP, FP16 never even came up as a possibility. There's no way to do correct texture addressing, in any sort of generalized way, with FP16.

Interesting, and I'm genuinely surprised, as I would have thought that even with its limitations it would have been the natural next step up from integer, followed by FP24 and so on.

I gave the reasoning -- you can't do proper texture addressing in FP16, or we would have had to support two formats. No, once we unified texture and color, we had to find the proper size that would fit. FP32 would have worked, but it was overkill. We analyzed to find the sweet spot. So did MS. Everybody agreed on FP24.
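
For a feel of why FP16 falls apart here: with a normalized coordinate near 1.0, the smallest representable step, expressed in texels, is width * 2^-(mantissa bits + 1). A quick sketch (assuming the usual s10e5 / s16e7 / s23e8 layouts - my illustration, not ATI's actual analysis):

Code:
// Addressing step near u = 1.0, in texel units, for a texture of the given
// width: width * 2^-(m+1), where m = mantissa bits.
// Assumed layouts: fp16 = s10e5, fp24 (R300) = s16e7, fp32 = s23e8.
#include <cmath>
#include <cstdio>

int main()
{
    const int width = 2048; // a large (for 2003) texture
    const struct { const char* name; int mantissa_bits; } fmt[] = {
        { "fp16", 10 }, { "fp24", 16 }, { "fp32", 23 },
    };
    for (const auto& f : fmt) {
        double step_texels = width * std::ldexp(1.0, -(f.mantissa_bits + 1));
        std::printf("%s: smallest step near u=1.0 is %.4g texels\n",
                    f.name, step_texels);
    }
    return 0;
}

So on a 2048-wide texture, FP16 can't even step by single texels near the right edge, let alone resolve the sub-texel fractions bilinear filtering needs, while FP24 still has about 6 sub-texel bits to spare.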
 