NV40 floating point performance?

It's curious to note that NVIDIA still advises developers to use FP16 as much as possible, even with the NV40. It doesn't sound like they've fixed their FP32 performance this go around.
 
I don't see why it should be a problem as long as the register usage performance hit is dramatically decreased. FP16 should be enough for many calculations (specifically, anything to do with color; it may be adequate for other data types as well).
 
Yeah - but I just thought that with a PS3.0 part, FP32 was going to be the norm.

Why is there a difference in performance? I mean, is NVIDIA doing something different to get FP32? Would it be possible for them to use two FP16 units to act as one FP32 unit?
 
Why do people assume that PS3.0 support = FP32? There is no increase in minimum full precision in PS3.0, nor is the _pp flag removed.

They do not have any non-FP32 ALUs, to the best of my knowledge.
 
DemoCoder said:
What makes you think the NV40 only supports FP16 in the shaders? The only step we are speculating to be FP16 is the "fixed function" framebuffer blend. NV40 definitely supports FP32 in the shaders, and runs it faster than the NV3x did.

I haven't said that. I just replied to a statement that sounded like "what is the use of FP32 in shaders if the (intermediate) output will be FP16 only anyway", and I stated it will at least be better than FP16 everywhere.

I haven't found a problem or unhappy limitation in the NV40 yet, except this one of possibly not supporting FP32 blending (there isn't one to buy yet, so we can still only guess). This is great so far.

Then again, it won't be a buy for me anyway, as I don't have much money currently, and if the thermal/power situation follows the NV30 and Prescott way, it will definitely not be that great in my XPC.

Oh, and I would use the GPU for GPGPU as well, if I got it.


I'm looking forward to a passively coolable NV40 that needs no extra power :D (well, or something that requires less than or equal to what my original Radeon 9700 Pro does, and produces less or equal heat and noise compared to my Radeon). Then, if I get some money, it could be a buy. Speed is not the primary issue. General full FP32 support, and hardware that is optimised especially for it (meaning at least rather fast FP32 support, compared to NV30 :D), is what's important. (Because otherwise we get a lot of driver issues again, and can't trust any shader anymore, hehe.)
 
The Baron said:
Why do people assume that PS3.0 support = FP32? There is no increase in minimum full precision in PS3.0

PS3.0 requires quasi-IEEE FP32 as the minimum for full precision
(still FP16 minimum with _pp; that didn't change).

See the Dx9 shader spec.
 
991060 said:
nutball said:
Does DX9 even support blending into FP render targets? ISTR reading a Microsoft presentation around NV30/R300 launch-time that said it didn't. :?:

Last time I checked the spec, NO.

Dx9 has supported floating point RT blending from the word go; however, it is a cap'd feature that is only required if you wish to expose shader model 3.0. It's NV3x/R3xx that don't support this, not Dx.

John.
 
LeGreg said:
PS3.0 requires quasi-IEEE FP32 as the minimum for full precision
(still FP16 minimum with _pp; that didn't change).

See the Dx9 shader spec.
Where? I have glanced over the DX9 shader spec a number of times and never seen this distinction. It's certainly not on the "Pixel Shader Differences" page at MSDN.
 
Chalnoth said:
Mintmaster said:
If you look at my post, I said FP16 will be fine for water simulation. I was just trying to explain to Chalnoth why blending is very useful for simulation.
I still don't see why you'd want to use greater than FP16 when your final output is going to be of lower precision anyway. At least, I don't see why you'd want to do it for blending.

Chalnoth, you don't get it. You're simulating in a texture, and the output is not going to the screen. It's used to displace vertices. Of all the arguments supporting higher precision, vertex displacement is the most important one relevant to real-time graphics. That's why nearly all the NV3x 32-bit arguments are BS.

Listen to bloodbob - he understands why blending is important. If you have a lot of little objects affecting your simulation, you have to have major renderstate changes each time just to draw a few pixels without blending. With blending you draw them all together.
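
(To make that concrete, here's a minimal D3D9-style sketch of the single blended pass. All names are hypothetical, and it assumes an FP render target whose format the hardware can blend into, plus a pre-bound vertex buffer holding one quad per disturbance:)

    #include <d3d9.h>

    // With additive blending, every disturbance is drawn into the FP
    // simulation texture in one pass; the blend state is set exactly once.
    void SplatDisturbances(IDirect3DDevice9* dev,
                           IDirect3DSurface9* simSurface,
                           UINT quadCount)
    {
        dev->SetRenderTarget(0, simSurface);
        dev->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
        dev->SetRenderState(D3DRS_BLENDOP, D3DBLENDOP_ADD);
        dev->SetRenderState(D3DRS_SRCBLEND, D3DBLEND_ONE);
        dev->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);

        // One draw call per quad, but no render target or texture rebinding
        // in between; without blending, each quad would need the current
        // contents copied out and rebound first.
        for (UINT i = 0; i < quadCount; ++i)
            dev->DrawPrimitive(D3DPT_TRIANGLESTRIP, i * 4, 2);
    }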


Chalnoth said:
Framebuffer blending is just a performance optimization, and you can do it in the pixel shader if you need that functionality.
Again, the volume fog demo is virtually impossible without blending. For each triangle, you'd have to copy the area being rendered into a temporary texture, and then use it as texture input when drawing it. With all the renderstate changes for each volume fog polygon, software rendering would probably be faster.

As for your suggestions with the volume fog, nice try (honestly), but they are rather pointless. First of all, normalizing a value for FP calculations does nothing - the whole point behind FP numbers in computers is sort of an automatic normalization of the span of the mantissa bits. Your idea of doing each object separately is also a bad one, not only because it'll require a lot of renderstate changes, but it also breaks their solution of the situation when an object is in the fog (I'm assuming the depth values used in your idea are relative to the centre of the object, or else you're back to the original problem of subtracting two large numbers and getting a small one).

Finally, there are a lot of things to like about geometric volume fog as opposed to layered, alpha-blended volume fog:
1. You don't need to sort in software, or figure out what the slices look like. You just need the volumes themselves, which are very easy to animate.
2. All you get in the end is the thickness of fog in front of each object. You can texture this or do whatever you want.
3. It likely has higher performance due to the fillrate demands of quality layered fog.
4. It has better quality wrt banding - especially when objects are located inside it - provided you have enough precision.

Again, none of these specifics really matter. All I'm saying is that FP32 blending is not pointless for the realtime graphics used in gaming, although as usual developers will take their time using it. NV40's FP16 blending is a huge step forward, but higher precision blending, even if it's just I16, would eventually be nice.

In fact, I think NV40 might have I16 blending, judging by that paper's comparison (although they maimed the I16 shot quite badly, making me think it could be a shot at R3xx's I16 format).
 
Mintmaster said:
As for your suggestions with the volume fog, nice try (honestly), but they are rather pointless. First of all, normalizing a value for FP calculations does nothing - the whole point behind FP numbers in computers is sort of an automatic normalization of the span of the mantissa bits.
When you're adding, though, for maximum accuracy you want to add numbers that have the same order of magnitude. By limiting volume size, you're maximizing the amount of accuracy at least within each volume. Of course, with this simple implementation you'll have decreasing accuracy at larger distances, which is where the logarithmic renormalization idea may help.
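
(To make the magnitude point concrete, a tiny standalone C++ example, nothing NV40-specific: the spacing between adjacent FP32 values at 16384.0 is about 0.002, so a 0.0005 increment is lost entirely, while the same increment near zero survives intact.)

    #include <cstdio>

    int main() {
        // At 16384.0f the ulp of a 32-bit float is 2^-9 = 0.001953125,
        // so adding 0.0005f rounds back to the original value.
        float big = 16384.0f;
        printf("%f\n", (big + 0.0005f) - big);  // prints 0.000000

        // The same increment added near zero is represented almost exactly.
        printf("%f\n", 0.0f + 0.0005f);         // prints 0.000500
        return 0;
    }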

Your idea of doing each object separately is also a bad one, not only because it'll require a lot of renderstate changes, but it also breaks their solution of the situation when an object is in the fog (I'm assuming the depth values used in your idea are relative to the centre of the object, or else you're back to the original problem of subtracting two large numbers and getting a small one).
Actually, I didn't assume that. I merely assumed that the most important area to have accuracy was in the near range.

Finally, there are a lot of things to like about geometric volume fog as opposed to layered, alpha-blended volume fog:
1. You don't need to sort in software, or figure out what the slices look like. You just need the volumes themselves, which are very easy to animate.
2. All you get in the end is the thickness of fog in front of each object. You can texture this or do whatever you want.
3. It likely has higher performance due to the fillrate demands of quality layered fog.
4. It has better quality wrt banding - especially when objects are located inside it - provided you have enough precision.
1. Sorting should be trivial.
2. Yes, that could be a potential benefit, but not a great one.
3. Depends on how complex your fog geometry is. That's one thing I like about alpha-blended volume fog: maximal complexity is only a function of texture size, and thus has a smaller impact on performance.
4. The only major banding issues should be from fog slices intersecting geometry. I think there may be a way to solve this by applying the fog's 3D texture to the surface of any object within the defined fog volume.

Again, none of these specifics really matter. All I'm saying is that FP32 blending is not pointless for the realtime graphics used in gaming, although as usual developers will take their time using it. NV40's FP16 blending is a huge step forward, but higher precision blending, even if it's just I16, would eventually be nice.
I never said it was useless. My point was that FP32 blending would be a performance optimization, and if it wouldn't be used very much, then the transistors would be better spent elsewhere. I see FP32 as a format that is to be primarily used for data that is not color data, and thus the standard blending functions are much less likely to be useful than they are for color data.

In fact, I think NV40 might have I16 blending, judging by that paper's comparison (although they maimed the I16 shot quite badly, making me think it could be a shot at R3xx's I16 format).
Maimed it? I'm not so sure. Notice the caption of 200,000:1 dynamic range. That's vastly above FX16's capabilities, and so would obviously look pretty bad.
 
AndrewM said:
Hey Uttar, weren't you the one that was saying a few months ago that they fixed the register issues? Now you're saying it's not fixed? :)
For those that don't know, there will likely always be issues with register usage. Just as adding more cache to a CPU will improve performance in some cases, adding more registers will improve shader performance in some cases. There is probably some point where register usage is generally not a problem, though, and the bottleneck shifts elsewhere. Worst-case shaders might be long with a lot of texture fetches. As more pixel threads fill the pipe, the registers will get used up.
 
3dcgi said:
AndrewM said:
Hey Uttar, weren't you the one that was saying a few months ago that they fixed the register issues? Now you're saying it's not fixed? :)
For those that don't know, there will likely always be issues with register usage. Just as adding more cache to a CPU will improve performance in some cases, adding more registers will improve shader performance in some cases. There is probably some point where register usage is generally not a problem, though, and the bottleneck shifts elsewhere. Worst-case shaders might be long with a lot of texture fetches. As more pixel threads fill the pipe, the registers will get used up.

Are you sure you understand the register problem with NV3x? (Or maybe it's me who doesn't understand it :) )

From what I understand, the problem with NV3x is that the more registers you use, the larger the performance hit. When you use 4 or more registers (not sure what the real number is), there's a performance drop. When you run out of registers, there's definitely going to be a drop no matter the architecture, but I don't think that's what's happening here. I haven't heard of any CPU or GPU having a problem like this.
 
They likely increased the number of simultaneously "live" registers, but probably didn't fix it entirely. E.g., maybe instead of 4, it's 16 now. PS3.0 allows up to 32. Maybe after 16 you get a slowdown. Or maybe it's 8 without performance loss.

You start to get diminishing returns after a while because most code blocks don't have that many live registers.
 
Chalnoth said:
I never said it was useless. My point was that FP32 blending would be a performance optimization, and if it wouldn't be used very much, then the transistors would be better spent elsewhere. I see FP32 as a format that is to be primarily used for data that is not color data, and thus the standard blending functions are much less likely to be useful than they are for color data.

There are two situations where FP32 blending is useful:

1) To finally have a full solution where everything works the same way. Very useful for doing much more complex renderings, too, not only realtime-related. Think of 3dsmax running all in realtime, and then you press render, and it renders at say 1 fps. For this, it needs to have everything in FP32 to get good, high quality results that are determinable, estimatable, and thus usable. For the first time, GPUs could be used to accelerate rendering everywhere.

FP32 as the general solution; FP16, FX12, or whatever, for realtime optimisations where needed.

2) As you said, non-colour data. People have for some years now been following the dream of shaders, and imagine all sorts of things that could be done with those very powerful, very efficient streaming data processors. There's just one problem: precision. We could use the hw to process geometry, audio, to raytrace, to do tons of funny things. There's just one issue: we have to tweak here and there to do this and that in a way that we don't lose too much precision. GPGPU shows quite some simple things that are doable; there is much more.

Once the pipeline is 100% FP32, it gives a full SPU (second processing unit) in parallel to the FP32-working CPU (the original SSE is FP32-only, too, so for fast math, FP32 is everywhere).
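
(For illustration, a small standalone C++ snippet of that CPU fast-math path; the original SSE instruction set operates on packed single-precision floats, four at a time:)

    #include <xmmintrin.h>
    #include <cstdio>

    int main() {
        // Four FP32 additions in one SSE instruction.
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);  // 11 22 33 44
        return 0;
    }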

I do understand all of today's gamers who just want their games fast. But actually, the games of today work fast enough on all sorts of hw. I don't think it would really hurt to lose 1-2% of performance because we spent a few more transistors in the blending unit to get FP32 blending. (Especially because we won't actually use it in today's games, so it won't hurt :D). On the other hand, it would give a great base for future developments of ALL KINDS, not only games.


But actually, it doesn't matter. Just wait and see what we get, and what we can do with it.
 
davepermen said:
There are two situations where FP32 blending is useful:

1) To finally have a full solution where everything works the same way. Very useful for doing much more complex renderings, too, not only realtime-related. Think of 3dsmax running all in realtime, and then you press render, and it renders at say 1 fps. For this, it needs to have everything in FP32 to get good, high quality results that are determinable, estimatable, and thus usable. For the first time, GPUs could be used to accelerate rendering everywhere.
I don't see why this is a problem. You can always emulate blending in the shader, as stated above. For offline rendering, the performance hit would be much less of an issue.

Support for FP blending in no way increases the flexibility of the processor. It is a performance optimization.

2) As you said, non-colour data. People have for some years now been following the dream of shaders, and imagine all sorts of things that could be done with those very powerful, very efficient streaming data processors. There's just one problem: precision. We could use the hw to process geometry, audio, to raytrace, to do tons of funny things. There's just one issue: we have to tweak here and there to do this and that in a way that we don't lose too much precision. GPGPU shows quite some simple things that are doable; there is much more.
My point was that we're talking only about blending here. Blending is a very specific mathematical operation, an operation that may not be meaningful for many other data types. Transistors would have to be expended to support FP32 blending, and if there aren't many situations under which it would be used, why would the optimization be worth it?
 
Chalnoth said:
Where? I have glanced over the DX9 shader spec a number of times and never seen this distinction. It's certainly not on the "Pixel Shader Differences" page at MSDN.

So you still need confirmation?
 
Well, it would be nice. The hardest evidence of it so far was that leaked ATI presentation. Since I didn't get it quickly enough to download it directly from ATI, I figure there is always the possibility of tampering. So it would be nice to get confirmation from another source (which would also lend some credence to the other claims in that document).
 
Chalnoth said:
Support for FP blending in no way increases the flexibility of the processor. It is a performance optimization.

You keep repeating that FP32 blending is a performance optimisation. When a "performance optimisation" makes possible something that was previously impossible, I'd say it's more of an "enabling technology" than a "performance optimisation", wouldn't you?

I just don't buy the "oh, it'll never get used, so it's a waste of transistors" idea either. If it's there, and it's fast enough, it will get used. With the current trend to doing more and more vertex processing and general processing on the GPU, FP32 blends will become an essential feature in the future IMO.

My point was that we're talking only about blending here. Blending is a very specific mathematical operation, an operation that may not be meaningful for many other data types.

It is meaningful for other data types. Bear in mind that the blending equations on the current fixed-function pipeline are going to have to be relaxed a bit; they're clamped, etc., and that just doesn't make sense with FP data.

Even as they stand (with clamping removed), blending will allow you to add or subtract values from those held in the render target, e.g.

newpos = oldpos + velocity * time

See? That's a blend!
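
(In fixed-function terms that's just the standard blend equation, result = src*srcFactor op dest*destFactor, with both factors ONE and op ADD. A small software model of it, with illustrative names, and clamping dropped as it would have to be for FP targets:)

    // src is what the shader writes (velocity * time); dest is what the FP
    // render target already holds (oldpos). With D3DBLEND_ONE factors and
    // D3DBLENDOP_ADD, the blender computes newpos = oldpos + velocity * time.
    struct Float4 { float x, y, z, w; };

    Float4 BlendAdd(Float4 src, Float4 dest) {
        return Float4{ src.x + dest.x, src.y + dest.y,
                       src.z + dest.z, src.w + dest.w };
    }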
 
nutball said:
You keep repeating that FP32 blending is a performance optimisation. When a "performance optimisation" makes possible something that was previously impossible, I'd say it's more of an "enabling technology" than a "performance optimisation", wouldn't you?
It doesn't make anything possible that was previously impossible.

In the NV40 you will be able to render to a FP32 texture. You will be able to read from a FP32 texture. You can therefore emulate any blending operation you want.

newpos = oldpos + velocity * time

See? That's a blend!
That's a step forward for your argument. But I don't think blending would be a big help in this case. That is, I would typically expect that each vertex would only be updated once per frame. Unlike the fog data example used previously, this could easily be done by using FP32 render to texture.
 
Chalnoth said:
I don't see why this is a problem. You can always emulate blending in the shader, as stated above. For offline rendering, the performance hit would be much less of an issue.

Nope, you can't emulate blending well at all in shaders. You would need to render each triangle to a texture, bind the texture, render the next to the same texture, bind the texture, render the next, and so on, going from one pass per blend to one pass per triangle per blend. The state changes will fuck up performance completely.
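
(Roughly what that looks like in D3D9 terms, with hypothetical resource names; since a texture can't be sampled while it's the current render target, every triangle needs the destination copied out and rebound first:)

    #include <d3d9.h>

    void EmulatedBlendPass(IDirect3DDevice9* dev,
                           IDirect3DSurface9* target,   // surface being rendered to
                           IDirect3DTexture9* copyTex,  // scratch copy of it
                           UINT triangleCount)
    {
        IDirect3DSurface9* copySurface = NULL;
        copyTex->GetSurfaceLevel(0, &copySurface);

        for (UINT i = 0; i < triangleCount; ++i) {
            // Copy the current contents of the target...
            dev->StretchRect(target, NULL, copySurface, NULL, D3DTEXF_NONE);
            // ...bind the copy as shader input...
            dev->SetTexture(0, copyTex);
            // ...then draw a single triangle that "blends" in the shader.
            dev->DrawPrimitive(D3DPT_TRIANGLELIST, i * 3, 1);
        }

        copySurface->Release();
    }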

And how do you explain that the offline renderer can render in a few seconds in FP16, with that funny banding, but takes tens of minutes in FP32 with great quality?

Emulating blending with shaders isn't easy. Otherwise, blending would never have been needed; it was the most requested feature for next-gen hw.
 