NV40 floating point performance?

KimB

I was thinking about DemoCoder's statement that, based upon the docs, he thought the NV40 was still going to be a multi-precision architecture like the NV30. I didn't thoroughly read them, but the only thing I remembered seeing was data types being multi-precision, not processing. So, I decided to go back and see what I could find.

I felt the obvious place to check whether or not the NV40 would be a multi-precision architecture was the DX9 optimization doc:
http://developer.nvidia.com/object/gdc_2004_presentations.html

If you'll note, there is no mention in the DX9 optimization document of using the _pp hint to optimize shaders, which you would think would be there if it were still necessary for optimal performance.

Of course, it is possible that nVidia decided this was old news and thus didn't need mentioning. The paper only describes a couple of optimization techniques, and certainly does not cover all of the dos and don'ts of DX9 programming.

And then there's this hint from the GLSL paper:
• Supports HLSL-style types – float, half, fixed and equivalent vector, matrix types
– half precision (fp16) is sufficient for most shading calculations (colors, unit vectors)
– faster on GeForce FX series processors
– no penalty on other hardware
Is nVidia only talking about current hardware here? Or the NV40 as well?

That said, given that the NV30 does benefit from using FP16 over FP32, and given that nVidia's past history suggests the NV40 will be more of an evolutionary step over the NV30 than a revolutionary one, it seems likely that the NV40 will still gain from using FP16. This would occur if nVidia has not nixed the FP register usage performance hit entirely, but has instead merely reduced it. It would also happen if nVidia has taken the multi-precision architecture a step further and physically increased processing power when using FP16.
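To make that register-usage argument concrete, here's a back-of-the-envelope sketch (the register file size and slot costs are made up for illustration, not from nVidia's docs; only the mechanism, fewer live registers meaning more fragments in flight to hide latency, is the commonly cited explanation of the NV3x penalty):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: suppose a pipeline's register file
     * holds the equivalent of 32 FP16 slots, an FP16 temporary costs
     * one slot, and an FP32 temporary costs two. */
    const int file_slots = 32;

    for (int temps = 1; temps <= 8; temps++) {
        int in_flight_fp32 = file_slots / (temps * 2);
        int in_flight_fp16 = file_slots / temps;
        printf("%d temps: %2d fragments in flight at FP32, %2d at FP16\n",
               temps, in_flight_fp32, in_flight_fp16);
    }
    /* More fragments in flight means more latency hiding, which is
     * why FP16 can help even if the ALUs run FP32 at full rate. */
    return 0;
}
```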

So, will the NV40 actually benefit from using FP16 in the shader?
 
I can almost say "definitely" :D
Based on what I know so far, the NV40 has less raw power per pipeline per clock compared to the NV35, but the register usage penalty is reduced a lot (though not completely removed). So I'd say that in a real-world case, the numbers (IPC per pipeline) from each architecture could be very close. The NV35 may be faster in shorter shaders, while the NV40 can outperform it when executing longer shaders.

edit: I could be totally wrong though, depending on how honest nVIDIA is to developers this time. ;) But I think a less powerful pipeline explains the transistor budget, right? :rolleyes:
 
991060 said:
I can almost say "definitely" :D
Based on what I know so far, the NV40 has less raw power per pipeline per clock compared to the NV35, but the register usage penalty is reduced a lot (though not completely removed). So I'd say that in a real-world case, the numbers (IPC per pipeline) from each architecture could be very close. The NV35 may be faster in shorter shaders, while the NV40 can outperform it when executing longer shaders.

edit: I could be totally wrong though, depending on how honest nVIDIA is to developers this time. ;) But I think a less powerful pipeline explains the transistor budget, right? :rolleyes:

No; IMHO you're damn close to reality.
 
I just reviewed nVIDIA's DX9 optimization pdf; there's nothing particularly interesting there. I mean, batching and sorting are already well known in the community, and they can be applied to any platform, not just NV3X/NV4X. nVIDIA was just repeating old stuff. My guess is that they don't want to tell you the details this time; maybe some reality would hurt sales?
 
991060 said:
edit: I could be totally wrong though, depending on how honest nVIDIA is to developers this time. ;) But I think a less powerful pipeline explains the transistor budget, right? :rolleyes:
Yeah, but the NV30/35 only had 4 pipelines. If the rumors are correct, the NV40 may have 16. If this is true, then it won't matter if each pipeline can do a bit less, particularly if the FP register performance hit is dramatically reduced.
 
991060 said:
I just reviewed nVIDIA's DX9 optimization pdf; there's nothing particularly interesting there. I mean, batching and sorting are already well known in the community, and they can be applied to any platform, not just NV3X/NV4X. nVIDIA was just repeating old stuff. My guess is that they don't want to tell you the details this time; maybe some reality would hurt sales?
Batching was only the first part of the pdf. If you'll notice, the first page's title is: "Last Year: Batch, Batch, Batch."

You might want to read a bit further down, where it gets into texture atlasing and instancing.
 
DemoCoder said:
One of the nV40 GDC papers says that instructions like NRM run much faster at FP16.
Thanks, DemoCoder, that gave me something to look for.

The statement just above that one is actually more telling, as the new OpenGL extensions doc explicitly recommends using FP16 as often as possible. So I guess that clinches it: the NV40 will run faster with FP16, likely due to a continued performance hit from using too many registers. Hopefully the performance hit from using FP registers has been reduced significantly.
 
Chalnoth said:
991060 said:
edit: I could be totally wrong though, depending on how honest nVIDIA is to developers this time. ;) But I think a less powerful pipeline explains the transistor budget, right? :rolleyes:
Yeah, but the NV30/35 only had 4 pipelines. If the rumors are correct, the NV40 may have 16. If this is true, then it won't matter if each pipeline can do a bit less, particularly if the FP register performance hit is dramatically reduced.

If it's truly 16 SIMDs vs. 4 SIMDs, then it makes more sense to assume that in that case SIMD != SIMD.

Of course the modifications, 4x the number of SIMDs, higher clock rates, or whatever else will make a difference, yet I don't expect it to be as large a difference as many would expect.
 
DemoCoder said:
One of the nV40 GDC papers says that instructions like NRM run much faster at FP16.

I'd assume this is due to the reciprocal square root calculation; they probably use Newton-Raphson to increase the precision, and FP32 needs more iterations to reach the required precision.
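For illustration, here's a minimal sketch of that refinement scheme, assuming (as ERP does) a crude hardware estimate polished by Newton-Raphson steps. Each step roughly doubles the number of correct mantissa bits, so FP16's ~11 effective mantissa bits are reachable in fewer steps than FP32's 24:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 2.0;   /* input to rsqrt */
    double y = 0.7;   /* crude initial estimate of 1/sqrt(2),
                         standing in for a hardware table lookup */
    double exact = 1.0 / sqrt(x);

    for (int i = 1; i <= 3; i++) {
        /* Newton-Raphson step for f(y) = 1/y^2 - x:
         * y' = y * (1.5 - 0.5 * x * y * y) */
        y = y * (1.5 - 0.5 * x * y * y);
        double err = fabs(y - exact) / exact;
        printf("iteration %d: ~%.0f correct bits\n",
               i, err > 0.0 ? -log2(err) : 53.0);
    }
    /* Roughly: one step already clears FP16's ~11 mantissa bits, but
     * FP32's 24 bits take an extra step or two, which would make an
     * FP16 NRM noticeably cheaper. */
    return 0;
}
```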
 
They also say to always use write masks. I can't conceive of a reason this would increase performance other than that it gives the driver's compiler the opportunity to pack multiple variables into registers, thus lowering the register count. It would also allow the driver to better figure out dependencies if they were using a "pool of scalar ALUs" architecture.
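A toy illustration of that packing idea (the 4-wide register model and mask encoding here are my own invention, not anything from nVidia's docs): if write masks guarantee that one variable only ever touches .xy and another only .zw, the compiler can back both with a single physical register:

```c
#include <stdio.h>

/* One 4-wide temporary register, as in ps_2_0. */
typedef struct { float c[4]; } Reg4;

/* Write through a component mask, e.g. mask 0x3 = .xy, 0xC = .zw. */
static void write_masked(Reg4 *r, unsigned mask, const float *src)
{
    for (int i = 0, j = 0; i < 4; i++)
        if (mask & (1u << i))
            r->c[i] = src[j++];
}

int main(void)
{
    /* Two logical 2-component variables, e.g. two texture coords. */
    float uv0[2] = { 0.25f, 0.50f };
    float uv1[2] = { 0.75f, 1.00f };

    /* Because the masks prove uv0 never touches .zw and uv1 never
     * touches .xy, both can live in the SAME physical register,
     * halving this shader's register count. */
    Reg4 r0;
    write_masked(&r0, 0x3, uv0);   /* r0.xy = uv0 */
    write_masked(&r0, 0xC, uv1);   /* r0.zw = uv1 */

    printf("r0 = %.2f %.2f %.2f %.2f\n",
           r0.c[0], r0.c[1], r0.c[2], r0.c[3]);
    return 0;
}
```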
 
ERP said:
DemoCoder said:
One of the nV40 GDC papers says that instructions like NRM run much faster at FP16.
I'd assume this is due to the reciprocal square root calculation; they probably use Newton-Raphson to increase the precision, and FP32 needs more iterations to reach the required precision.
I'm sure that's why the doc singles this instruction out as being particularly better in FP16. That it's singled out at all may be indicative of a reduced FP register performance hit, since a large register penalty would otherwise hide the added latency of the FP32 RSQ.

The doc also states that FP16 should be used as much as possible, which is the evidence I was looking for that the NV40 is more of an evolved NV30 (as one would expect) than a revolutionary design.
 
Chalnoth said:
Batching was only the first part of the pdf. If you'll notice, the first page's title is: "Last Year: Batch, Batch, Batch."

You might want to read a bit further down, where it gets into texture atlasing and instancing.

Actually, I did read the whole pdf. What I mean is that nVIDIA didn't provide any hardware-specific info in the D3D paper ;)

Now I'm wondering why nVIDIA tends to "leak" more info in OpenGL-related docs than in D3D papers. I've seen this happen again and again (take the FP16 filtering/blending and MRT support stuff in this year's OpenGL doc, for example). Is it because the OpenGL doc is less popular in the community, so slightly richer content won't hurt?
 
Hmm, maybe.

But isn't D3D a much more popular API in the game development community than OpenGL? I still think nVIDIA should tell D3D people more about their next-gen hardware, especially at such an important event.
 
991060 said:
Hmm, maybe.

But isn't D3D a much more popular API in the game development community than OpenGL? I still think nVIDIA should tell D3D people more about their next-gen hardware, especially at such an important event.

The early "future-looking" technology has traditionally been available in OpenGL first, in the form of a vendor extension, though I expect this to slow down somewhat as development becomes more shader-centric.

I don't really think it was an orchestrated effort to "give the OpenGL folks more info", though -- the talks are written by different people, so they just chose to include different bits.
 
991060 said:
But isn't D3D a much more popular API in the game development community than OpenGL? I still think nVIDIA should tell D3D people more about their next-gen hardware, especially at such an important event.
Remember that nVidia has always put out their own extensions that are much closer to the hardware than what is available in Direct3D. This means that white papers on OpenGL extensions are bound to give more insight into the hardware.

Edit:
That's quite possible, too, VV. Could just be a difference in culture between the D3D and OGL types.

Edit 2:
One good thing is that the paper doesn't say to use integer precision when possible. Remember that this is OpenGL, so integer precision is definitely possible. Fortunately, it looks like any available integer precision will be no faster than FP16.
 
Register penalties remain the primary problem in the NV4x architecture.

IMO though, and this part is speculation:
The cost of texturing, calculated in lost arithmetic efficiency, is also increased, because you lose an ALU for just ONE texturing operation; on the NV3x, it was possible to lose only one for every two. nVidia has, however, probably also reduced the number of certain key special-operation units in each of those ALUs in order to reduce their cost.

Uttar
 
991060 said:
Now I'm wondering why nVIDIA tends to "leak" more info in OpenGL-related docs than in D3D papers. I've seen this happen again and again (take the FP16 filtering/blending and MRT support stuff in this year's OpenGL doc, for example). Is it because the OpenGL doc is less popular in the community, so slightly richer content won't hurt?

Does DX9 even support blending into FP render targets? ISTR reading a Microsoft presentation around NV30/R300 launch time that said it didn't. :?:

This might be the reason for NV (and to some degree ATI) to be enamoured of OpenGL. They can introduce extensions to expose new hardware functionality in a much simpler way than trying to cajole MS into tweaking the DX specs.
 
Uttar said:
Register penalties remain the primary problem in the NV4x architecture.

Agreed.

Uttar said:
The cost of texturing, calculated in lost arithmetic efficiency, is also increased, because you lose an ALU for just ONE texturing operation;
Uttar

Not necessarily; what if the NV4X's primary ALU can only do one texture fetch per clock? ;)
 