NV40 floating point performance?

davepermen said:
nope, you can't emulate blending well at all in shaders. you would need to render each triangle to a texture, bind the texture, render the next triangle to the same texture, bind the texture, render the next, going from one pass per blend to one pass per triangle per blend. the state changes will fuck up performance completely.
But that would primarily be for color data, where FP16 should be enough for most cases. And even then, there would be ways to get around doing one pass per triangle. All that you have to guarantee per pass is that no triangles overlap. For example, with the previous vertex buffer update example, you wouldn't want to update any vertex twice.
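
As a rough illustration of the "no overlap within a pass" idea, here is a minimal sketch that greedily groups triangles into passes so that no vertex index is touched twice in the same pass. The function name `batch_into_passes` and the tuple-of-indices layout are assumptions made for this example; nothing here comes from an actual driver or API, and a screen-space overlap test would need a different check than shared vertex indices.

```python
def batch_into_passes(triangles):
    """Greedily group triangles so that no vertex index appears twice in one pass.

    `triangles` is a list of 3-tuples of vertex indices (a hypothetical layout).
    """
    passes = []  # triangles assigned to each pass
    used = []    # set of vertex indices already touched in each pass
    for tri in triangles:
        for batch, touched in zip(passes, used):
            if touched.isdisjoint(tri):
                batch.append(tri)
                touched.update(tri)
                break
        else:
            # Every existing pass already touches one of these vertices.
            passes.append([tri])
            used.append(set(tri))
    return passes


triangles = [(0, 1, 2), (2, 3, 4), (5, 6, 7), (4, 5, 8)]
passes = batch_into_passes(triangles)
```

The number of passes then depends on how heavily the triangles share vertices rather than on the triangle count, which is the point being made above.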

Regardless, personally I think we should see firsthand what can and cannot be done with FP16 blending before calling for FP32 blending.
 
i already know quite a few things that can and cannot be done.

thing is, the specs have all been known for years; the question is just which hw will ship with which features, and when. and now there is nv40 with fp16 blending but no fp32 blending.

it will only be a precision issue, but that is an issue for certain tasks, especially if you don't blend just once but many times (tens, hundreds, thousands of times.. thinking back to offline rendering).
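
To make the repeated-blend concern concrete, here is a small sketch that emulates additive blending into a target that rounds to FP16 or FP32 after every pass. It uses Python's `struct` module (format `e` is IEEE half precision, `f` is single precision) purely as a stand-in for the hardware; the helper names, the increment of 0.001 and the 10000 passes are arbitrary assumptions, and real blend units may round differently.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to the precision of a struct format ('e' = FP16, 'f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(fmt, value, passes):
    """Additively blend `value` into a zeroed target, rounding after every pass."""
    total = 0.0
    step = round_to(fmt, value)
    for _ in range(passes):
        total = round_to(fmt, total + step)
    return total


fp16_sum = accumulate("e", 0.001, 10000)
fp32_sum = accumulate("f", 0.001, 10000)
print("ideal 10.0, fp16:", fp16_sum, "fp32:", fp32_sum)
```

In this sketch the FP16 total stops growing once the spacing between representable values exceeds twice the increment, so it ends up far below the ideal sum, while the FP32 total stays close to it.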



if i had to do fp32 blending in the end, i'd rather render into 4 fx32 (or 8 fx16) targets, each scaled properly to extract the relevant part of the fp32 value. with fx128, we have enough to store all of fp32.. and then recombine for the next use afterwards, if needed. possibly 4x fx16 is even enough (that would mean simply rendering to 4 buffers simultaneously, if we can have blending on each while rendering to all 4; not sure about that yet, i have to check the nv specs further once they're out).
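
One possible reading of the split-and-recombine idea, sketched with plain integers: quantize the value to 32 fractional bits, cut it into four 8-bit pieces, let each piece accumulate in its own 16-bit fixed-point target, and resolve the carries only when recombining. The piece width, the restriction to values in [0, 1) and all names below are assumptions for illustration, not the scheme described in the post or anything a specific GPU is known to support.

```python
LIMB_BITS = 8             # payload bits per target (assumption)
NUM_LIMBS = 4             # four targets, as in "4x fx16"
TARGET_MAX = 2**16 - 1    # largest value an fx16 target can hold


def split(value):
    """Quantize a value in [0, 1) to 32 fractional bits and cut it into four 8-bit limbs."""
    q = int(value * 2**32)
    return [(q >> (LIMB_BITS * i)) & 0xFF for i in range(NUM_LIMBS)]


def blend_add(targets, limbs):
    """Additive blend: each target accumulates its own limb, with no carry between targets."""
    out = [t + l for t, l in zip(targets, limbs)]
    assert max(out) <= TARGET_MAX, "fx16 target would saturate"
    return out


def recombine(targets):
    """Resolve the deferred carries by summing the targets at their respective scales."""
    total = sum(t << (LIMB_BITS * i) for i, t in enumerate(targets))
    return total / 2**32


values = [0.1, 0.25, 0.003, 0.7]
targets = [0] * NUM_LIMBS
for v in values:
    targets = blend_add(targets, split(v))
result = recombine(targets)
```

With 8 payload bits in a 16-bit target there is headroom for a couple of hundred additive blends before a recombine pass is needed; wider targets would buy more headroom at the cost of bandwidth.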
 
davepermen said:
i already know quite a few things that can and cannot be done.
Care to share? I haven't yet seen a convincing example of something you'd want to use FP32 blending for.
 
hm, i shouldn't..

lets see...

well, there are two directions i can give you:

geometric processing et al., meaning accumulating some geometric data that should then be re-used

non-rasterized data, especially integration of some data. there, cumulative error could hurt quite badly.. depending on which blending functions are supported, there is an endless number of possibilities (i already talked about rendering to 4 targets at the same time, if blending is enabled there.. can't picture it myself yet, as i forgot to think about that..).

i don't have any specific situation i can give you; i first have to dive into the specs to see what maps how well.. but it goes from integration through 3d volumetric data, over geometric processing of arbitrary geometry (hint: raytracing, raycasting), to tons of other stuff..

it would be nice to be sure that everything is done in fp32, so i don't have to care about the different precisions of the different parts of the pipeline, which i now have to do again to minimize error..

i'm not sure how much it hurts for realtime apps that are game-like. there, precision doesn't matter that much. but it can have a major impact on apps not designed for the mainstream. i'm thinking of one specific app, a realtime view-dependent beam-tracing app which does full global illumination. let's see how much fp16 hurts there when implementing hdr.
 
Lezmaka said:
3dcgi said:
AndrewM said:
Hey Uttar, weren't you the one that was saying a few months ago that they fixed the register issues? Now you're saying it's not fixed? :)
For those that don't know, there will likely always be issues with register usage. Just as adding more cache to a CPU will improve performance in some cases, adding more registers will improve shader performance in some cases. There is probably some point where register usage is generally not a problem, though, and the bottleneck shifts elsewhere. Worst-case shaders might be long, with a lot of texture fetches. As more pixel threads fill the pipe, the registers will get used up.

Are you sure you understand the register problem with NV3x? (Or maybe it's me who doesn't understand it :) )

From what I understand, the problem with NV3x is that the more registers you use, the larger the performance hit. When you use 4 or more registers (not sure what the real number is), there's a performance drop. When you run out of registers, there's definitely going to be a drop no matter the architecture, but I don't think this is what's happening here. I haven't heard of any other CPU or GPU having a problem like this.
Yeah, I understand the problem. I'm just trying to say that while the problem might be "fixed" for some shaders, it might still be a problem for other shaders. In the following thread sireric gives an example of how not being careful with dynamic branching can lead to a GPR shortage. Dependent fetches could also cause a shortage of GPRs because threads are kept active longer.
http://www.beyond3d.com/forum/viewtopic.php?t=9985&highlight=
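
The tradeoff being discussed can be put into a back-of-the-envelope model: a fixed register file divided by the registers each pixel thread needs gives the number of threads in flight, and those threads have to supply enough ALU work to cover a texture fetch. The sketch below is only that, a toy model; the register file size, cycle counts and function names are made-up assumptions and do not describe NV3x, NV40 or any ATI part.

```python
def threads_in_flight(register_file_entries, registers_per_thread):
    """How many pixel threads fit in a fixed-size register file."""
    return register_file_entries // registers_per_thread


def latency_hidden(threads, alu_cycles_per_thread, fetch_latency_cycles):
    """True if the other threads supply enough ALU work to cover one thread's texture fetch."""
    return (threads - 1) * alu_cycles_per_thread >= fetch_latency_cycles


REGISTER_FILE = 256   # hypothetical number of register entries
FETCH_LATENCY = 200   # hypothetical texture fetch latency in cycles

light = threads_in_flight(REGISTER_FILE, 2)   # shader using 2 registers
heavy = threads_in_flight(REGISTER_FILE, 8)   # shader using 8 registers
light_ok = latency_hidden(light, 4, FETCH_LATENCY)
heavy_ok = latency_hidden(heavy, 4, FETCH_LATENCY)
```

In this toy model a larger register file only moves the point at which a shader stops hiding fetch latency; it does not remove it, which matches the argument that the issue can be reduced but not "fixed" outright.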
 
There will probably be an update to ShaderMark, and there are forum members here who can contribute for PS/VS 3.0 benchmarks.
 
3dcgi said:
Yeah, I understand the problem. I'm just trying to say that while the problem might be "fixed" for some shaders, it might still be a problem for other shaders. In the following thread sireric gives an example of how not being careful with dynamic branching can lead to a GPR shortage. Dependent fetches could also cause a shortage of GPRs because threads are kept active longer.
http://www.beyond3d.com/forum/viewtopic.php?t=9985&highlight=

So in other words, nothing that matters for games, only really a potential issue for scientific modelling? (ie: a Quadro variant that supports FP32 blends would address the concerns you have, not relevant to the 3d gaming consumer).
 
radar1200gs said:
So in other words, nothing that matters for games, only really a potential issue for scientific modelling? (ie: a Quadro variant that supports FP32 blends would address the concerns you have, not relevant to the 3d gaming consumer).

? Didn't you read the rest of the thread?
 
radar1200gs said:
So in other words, nothing that matters for games, only really a potential issue for scientific modelling? (ie: a Quadro variant that supports FP32 blends would address the concerns you have, not relevant to the 3d gaming consumer).
I can't predict how register issues will affect games and applications, so I can't say that this is more of a modeling problem than a games problem. This isn't just an Nvidia issue; it's a design tradeoff ATI also has to make. Both Nvidia and ATI will try to make it so this is rarely an issue. My original point was only that, in my opinion, there is probably no "fixing" the register usage issue. It will always be an issue in extreme circumstances.
 