NV40 floating point performance?

nutball said:
Does DX9 even support blending into FP render targets?
It does now. I don't remember when it was put into the spec, though. I think it was at the same time antialiasing for render-target textures came in, so 9.0a or 9.0b?

From what I heard, "future hardware" should have significantly higher float performance, but half still remains faster. ;) Even on NV3x you will see that half is a bit faster, even when registers are not a problem.
 
nutball said:
Does DX9 even support blending into FP render targets? ISTR reading a Microsoft presentation around NV30/R300 launch-time that said it didn't. :?:

Last time I checked the spec, NO.

nutball said:
This might be the reason for NV (and to some degree ATI) to be enamoured of OpenGL. They can introduce extensions to expose new hardware functionality in a much simpler way than trying to cajole MS into tweaking the DX specs.

Yes, I know what you and Chalnoth suggested. But MRTs and general FP textures have been in the spec for over 18 months. Isn't it good for nVIDIA to let D3D folks know NV40 can finally deliver these two important features (I was saying "Wow, NV40 rocks!" when I saw these two features mentioned in the OpenGL paper :D )? I can only think nVIDIA may not be 100% sure that these two features can be used in D3D flawlessly.
 
They're not only delivering them, but taking the features a step further, with full FP16 filtering support. This makes the storage of HDR textures a reality, something which may help to simplify some calculations. Much more importantly, however, FP16 blending is supported. This makes HDR much, much easier to accomplish.

Full and complete support for FP16 render targets and textures could well be the most exciting part of the architecture, in my mind.
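To put it concretely, once the hardware blends FP16 natively, the blending setup for HDR accumulation is nothing more exotic than this. Rough sketch only: I'm abbreviating the render-target setup with framebuffer-style calls standing in for whatever render-to-texture path you actually use, an extension loader is assumed, and draw_light_pass() / num_lights are hypothetical placeholders.

Code:
#include <GL/glew.h>   /* assumption: GLEW (or similar) provides the entry points */

/* Hypothetical helper, not part of any real API: draws one light's
   (unclamped) contribution with whatever shader you like. */
extern void draw_light_pass(int light);

GLuint setup_hdr_accumulation(int num_lights)
{
    GLuint tex, fbo;

    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F, 1024, 1024, 0,
                 GL_RGBA, GL_HALF_FLOAT, NULL);      /* FP16 colour buffer  */

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex, 0);

    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClear(GL_COLOR_BUFFER_BIT);

    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);   /* additive: passes simply sum radiance,
                                      and values are free to exceed 1.0     */

    for (int i = 0; i < num_lights; ++i)
        draw_light_pass(i);

    return tex;                    /* the accumulated HDR image             */
}

Without FP16 blending you'd have to do that accumulation yourself in the shader, ping-ponging between targets; with it, the fixed-function blender does the work for free.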
 
991060 said:
Uttar said:
Register penalties remain the primary problem in the NV4x architecture.
Agreed.
Then why did you believe the NV40 has less per-pipeline raw power? :p
991060 said:
Uttar said:
The cost of texturing, calculated in lost arithmetic efficiency, is also increased because you lose an ALU for only ONE texturing operation;
Uttar
Not necessarily; what if NV4X's primary ALU can only do one tex fetch per clock? ;)
That's what I said:
Uttar said:
NVIDIA has however probably also reduced the number of certain key special-operation units in each of those ALUs in order to reduce their cost though.
I know that sentence wasn't too comprehensible, but eh ;)
My point is, even if they slightly reduce the cost of this big ALU (by reducing the number of RCP units in it, IIRC), you still can't do a MAD at the same time. So the cost, transistor-wise, is slightly lower, while the cost, performance-wise, is a fair bit higher.

Uttar
 
Chalnoth said:
Full and complete support for FP16 render targets and textures could well be the most exciting part of the architecture, in my mind.

Yep. For my applications it's a shame it's not FP32; I guess that will have to wait until NV5x. But FP16 is a massive step forward!

I wasn't going to buy an NV40 until I found out it had FP blending :) Now I've got to work out how to get 100+ Watts of heat out of my case quietly!
 
nutball said:
Yep. For my applications it's a shame it's not FP32; I guess that will have to wait until NV5x. But FP16 is a massive step forward!
What would you use FP32 filtering and blending for?
 
Chalnoth said:
nutball said:
Yep. For my applications it's a shame it's not FP32; I guess that will have to wait until NV5x. But FP16 is a massive step forward!
What would you use FP32 filtering and blending for?

It's the blending I'm really after, for this sort of thing:

http://www.gpgpu.org/

I have a number of applications in mind, fluid dynamics and radiative transfer being the primary examples.
 
Well, as with previous hardware, if you need to do it, you should be able to. It'll just be slower. Use two textures: in one pass you render to texture 1 and read from texture 2; in the next pass you read from texture 1 and render to texture 2, and so on. If you're not doing realtime work, this shouldn't be a problem. Personally, I'm not entirely sure that it will be important to optimize FP32 blending for games, and for anything else, if it's not realtime, who cares about a few percent of performance?
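If it helps, the two-texture scheme looks roughly like this. Sketch only: fbo[], tex[] and run_accumulation_shader() are placeholders, and I'm writing it against a render-to-texture style interface; with pbuffers you'd switch the current drawable instead.

Code:
#include <GL/glew.h>   /* assumption: GLEW (or similar) provides the entry points */

/* Placeholders: two FP32 textures, each attached to its own render target,
   plus the pass that actually does the work. */
extern GLuint tex[2], fbo[2];
extern void run_accumulation_shader(void);

void ping_pong(int num_passes)
{
    int src = 0, dst = 1;

    for (int pass = 0; pass < num_passes; ++pass) {
        glBindFramebuffer(GL_FRAMEBUFFER, fbo[dst]);  /* write into "dst"     */
        glBindTexture(GL_TEXTURE_2D, tex[src]);       /* shader samples "src" */

        /* The shader does the "blend" itself, at full FP32 precision:
           new = f(value sampled from src, incoming fragment). */
        run_accumulation_shader();

        src ^= 1;                                     /* swap roles for the   */
        dst ^= 1;                                     /* next pass            */
    }
}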
 
Chalnoth said:
They're not only delivering them, but taking the features a step further, with full FP16 filtering support. This makes the storage of HDR textures a reality, something which may help to simplify some calculations. Much more importantly, however, FP16 blending is supported. This makes HDR much, much easier to accomplish.

Full and complete support for FP16 render targets and textures could well be the most exciting part of the architecture, in my mind.

Errr, maybe I'm not stating my idea clearly enough.... :oops:

Firstly, I'm wondering why nVIDIA put less info in the D3D papers. Someone suggested that some of NV40's new features are not strictly defined in the D3D spec, so there's no need to over-hype them yet. Then I was saying that the basic parts of those features are already defined, hence it would be proper to announce them to the public at GDC. If nVIDIA didn't plan to do so, that could mean they still have problems exposing those features in drivers (just speculation, though).
 
Uttar said:
Then why did you believe the NV40 has less per-pipeline raw power? :p
The speculation about raw power is just from a picture of NV*X's pipeline I have seen.
Uttar said:
I know that sentence wasn't too comprehensible, but eh ;)
I hate reading these B3D-style complex sentences. :devilish:
Uttar said:
My point is, even if they slightly reduce the cost of this big ALU (by reducing the number of RCP units in it, IIRC), you still can't do a MAD at the same time. So the cost, transistor-wise, is slightly lower, while the cost, performance-wise, is a fair bit higher.
Uttar
I'm still confused by how the ALU is implemented in hardware. Do you guys have any introductory papers about this kind of stuff? It has been a barrier for me for a long time. :cry:
 
It comes down to the fact that the NV demo team prefers OGL and OGL can expose vendor specific features before DX. It also might be easier for them to develop the OGL ICD first before the DX driver. I've been to many developer conventions, and OGL is quite common in presentations.
 
Chalnoth said:
Well, as with previous hardware, if you need to do it, you should be able to. It'll just be slower. Use two textures: in one pass you render to texture 1 and read from texture 2; in the next pass you read from texture 1 and render to texture 2, and so on. If you're not doing realtime work, this shouldn't be a problem. Personally, I'm not entirely sure that it will be important to optimize FP32 blending for games, and for anything else, if it's not realtime, who cares about a few percent of performance?

Yeah, unfortunately I'm constrained to developing under Linux. Render-to-texture (GLX_ARB_render_texture) isn't available in the Linux drivers, and never will be according to NVIDIA, as the need for RTT is/will be provided by uber-buffers (whenever the hell they're going to arrive!).

As for whether we'll ever see FP32 blending in hardware, well we'll just have to see.

I'm rather curious how blending will be exposed in the new hardware, whether it will be a distinct pipeline stage a la the current fixed-function integer pipeline, or whether it's more flexible (e.g. frame-buffer contents available in the pixel shader). Given the FP16 limitation, I'm presuming it's the former, not the latter. Which is interesting, because naively one might assume that you need a whole lot fewer floating-point units on-chip if blending happens in the pixel shader, rather than in a distinct functional unit :?
 
Considering there is no GLX_render_texture extension, I guess that would be kinda hard. I would think that's more a function of Linux rather than nVidia's drivers.
 
ATI's "Rendering Light Shafts" GDC 2004 presentation talks about problems with 8-bit-per-channel blending but doesn't mention anything about FP blending. Does this mean that R420 won't have support for FP blending?
Nice to see that at least with NV40, FP targets have become fully useful :).
 
Chalnoth said:
Considering there is no GLX_render_texture extension, I guess that would be kinda hard. I would think that's more a function of Linux rather than nVidia's drivers.

Well, there is GLX_ATI_render_texture -- it's supported by ATI's fglrx drivers, but the spec isn't available from the extension registry. nvidia's driver supports GLX_SGIX_pbuffer, which kinda-sorta gets you there as well.
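For reference, the pbuffer route looks roughly like this. Sketch only, with no error checking: I'm using the GLX 1.3 spellings plus the NV_float_buffer attribute (the SGIX entry points are the older, near-identical variants), and dpy is whatever Display you already have open.

Code:
#include <GL/glx.h>   /* assumption: glxext.h is pulled in for GLX_FLOAT_COMPONENTS_NV */

GLXPbuffer create_float_pbuffer(Display *dpy, int width, int height)
{
    const int fb_attribs[] = {
        GLX_DRAWABLE_TYPE,       GLX_PBUFFER_BIT,
        GLX_RENDER_TYPE,         GLX_RGBA_BIT,
        GLX_FLOAT_COMPONENTS_NV, True,         /* ask for a floating-point buffer */
        GLX_RED_SIZE,   32, GLX_GREEN_SIZE, 32,
        GLX_BLUE_SIZE,  32, GLX_ALPHA_SIZE, 32,
        None
    };
    int n = 0;
    GLXFBConfig *cfg = glXChooseFBConfig(dpy, DefaultScreen(dpy), fb_attribs, &n);

    const int pb_attribs[] = {
        GLX_PBUFFER_WIDTH,  width,
        GLX_PBUFFER_HEIGHT, height,
        None
    };

    /* Render into this, then glCopyTexSubImage2D (or glReadPixels) the result
       into a texture -- the "kinda-sorta" substitute for real render-to-texture. */
    return glXCreatePbuffer(dpy, cfg[0], pb_attribs);
}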
 
Ah, I was about to ask if the pbuffer extension was supported. That should be plenty for what you need, nutball, as you don't need filtering, right?
 
Chalnoth said:
Personally, I'm not entirely sure that it will be important to optimize FP32 blending for games, and for anything else, if it's not realtime, who cares about a few percent of performance?

Well, one important application (IMO the most important capability acquired by the next-gen hardware) is using pixel shader output in the vertex shader via the vertex texturing feature, especially for simulations. Water and cloth simulations will be rendering disturbances to the texture, often using blending to superimpose them or blend them more naturally. You wouldn't want to redraw the entire texture for each small perturbation, especially if you have a lot of them; the per-frame update can be as simple as the sketch below.
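Rough sketch only: height_field_fbo, disturbance[] and draw_small_quad() are placeholders for however you actually store the height field and draw the splats.

Code:
#include <GL/glew.h>   /* assumption: GLEW (or similar) provides the entry points */

typedef struct { float x, y, radius, amplitude; } Disturbance;

/* Placeholders: the FP16 height-field target and this frame's disturbances. */
extern GLuint height_field_fbo;
extern Disturbance disturbance[];
extern void draw_small_quad(float x, float y, float radius, float amplitude);

void splat_disturbances(int num_disturbances)
{
    glBindFramebuffer(GL_FRAMEBUFFER, height_field_fbo);

    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);        /* superimpose the perturbations */

    for (int i = 0; i < num_disturbances; ++i)
        draw_small_quad(disturbance[i].x, disturbance[i].y,
                        disturbance[i].radius, disturbance[i].amplitude);
}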

I personally think FP16 will be sufficient most of the time for these applications, but I'm pretty sure you will occasionally encounter severe precision issues with only a 10-bit mantissa. We're not talking about colour here, but rather position. Simulation stability can also be upset with insufficient precision, and everything can start vibrating or "explode".
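A quick back-of-the-envelope example of why the 10-bit mantissa bites (the numbers are made up, just to illustrate the scale of the problem):

Code:
#include <math.h>
#include <stdio.h>

/* FP16 keeps 10 explicit mantissa bits, so the spacing between adjacent
   representable values at binary exponent e is 2^(e-10). */
int main(void)
{
    double height = 4.0;                 /* e.g. a stored height; exponent e = 2 */
    double ulp    = ldexp(1.0, 2 - 10);  /* 2^-8 = 1/256 ~= 0.0039               */
    double ripple = 0.001;               /* a small blended disturbance          */

    printf("spacing near %.1f is %.6f\n", height, ulp);
    if (ripple < 0.5 * ulp)
        printf("a %.3f ripple rounds away entirely -> stability problems\n",
               ripple);
    return 0;
}

In other words, a perturbation of 0.001 blended onto a value of 4.0 is less than half a step in FP16 and simply vanishes, which is exactly the kind of thing that makes a simulation drift or blow up.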

Imagine playing a game when all of a sudden, your character's clothes fly off! (Hehe, the next TR :oops: )
 