Does NV30 execute fp32 and fp16 at ~ the same rate?

In Digit-life's GeForce FX Ultra review, the NV30 is forced to execute a pixel shader 2.0 test at both fp16 and fp32 precision. The results ( http://www.digit-life.com/articles2/gffx/gffx-ref-p4.html ) show no difference in execution rate between the two.

Does anyone have other pixel shader benchmark numbers for NV30 which may indicate a difference between single and half precision performance?
 
The following is from nv gl extensions doc:

"Should fragment programs be allowed to use multiple precisions for
operands and operations?
RESOLVED: Yes. Low-precision operands are generally adequate for
representing colors. Allowing low-precision registers also allows for
a larger number of temporary registers (at lower precision).
Low-precision operations also provide the opportunity for a higher
level of performance.

Applications are free to use only high-precision operations or mix
high- and low-precision operations as necessary.

What levels of precision are supported in arithmetic operations?
RESOLVED: Arithmetic operations can be performed at three different
precisions. 32-bit floating point precision (fp32) uses the IEEE
single-precision standard with a sign bit, 8 exponent bits, and 23
mantissa bits. 16-bit floating-point precision (fp16) uses a similar
floating-point representation, but with 5 exponent bits and 10
mantissa bits. Additionally, many arithmetic operations can also be
carried out at 12-bit fixed point precision (fx12), where values in
the range [-2,+2) are represented as signed values with 10 fraction
bits."
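
To get a feel for the three formats described in the quote, here is a minimal Python sketch (not NV30 code) that rounds a few values to each one. The helper names are made up for illustration, and clamping out-of-range fx12 values is an assumption of this sketch, not something the quoted text specifies.

```python
import struct

def to_fp32(x: float) -> float:
    """Round x to IEEE single precision (1 sign, 8 exponent, 23 mantissa bits)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_fp16(x: float) -> float:
    """Round x to half precision (1 sign, 5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fx12(x: float) -> float:
    """Quantize x to 12-bit fixed point: range [-2, 2), 10 fraction bits.

    Clamping on overflow is an assumption made for this sketch.
    """
    step = 1.0 / 1024.0                 # 10 fraction bits
    q = round(x / step)                 # nearest representable step
    q = max(-2048, min(2047, q))        # signed 12-bit integer range
    return q * step

if __name__ == "__main__":
    for value in (0.1, 1.0 / 3.0, 1.999, 100.0):
        print(value, to_fp32(value), to_fp16(value), to_fx12(value))
```

Color-like values in [0, 1] survive fx12 and fp16 reasonably well, while something like a texture coordinate of 100.0 is far outside the fx12 range.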

So I would imagine that with the NV_fragment_program extension you get a considerable speedup when using lower precision.
 
Pocketmoon_ has done the best investigation into this so far.

article
thread

His article clearly shows that going from FP32 to FP16 to fx12 fixed point usually brings a very large performance benefit when using the proprietary NV_fragment_program extension for OpenGL. The standard ARB_fragment_program OpenGL extension does not allow per-instruction precision hints, only the program-wide ARB_precision_hint_fastest and ARB_precision_hint_nicest options. DX9's PS 2.0 does allow for a partial precision hint (_pp), but both the data you point to and all other data on the subject show that with current drivers it makes absolutely no performance difference. (Incidentally, pocketmoon_ didn't try to compile for _pp PS 2.0, presumably either because Cg does not currently allow it or because we had already established it makes no difference.)

So, the NV_fragment_program data demonstrates clearly that the NV30 architecture really is capable of executing FP16 faster than FP32. Presumably this functionality will eventually be available to PS 2.0, but it isn't with current drivers. One conspiracy theory as to why this is the case is that the current drivers force FP16 in all cases; in other words, _pp "works", but standard PS 2.0--which calls for a minimum of FP24 for certain operations--does not.

Pocketmoon_'s results may actually be seen as preliminary evidence in favor of this theory, as the PS 2.0 path generally performs between the FP32 NV_fragment_program and FP16 NV_fragment_program results (when it might be expected to do worse than NV_fragment_program, due to the latter shader language being "closer to the metal" of the actual NV30 architecture). But, AFAIK, there hasn't been any actual investigation as to what precision is being used by, e.g., examining output from a Mandelbrot-generating shader (which should be highly sensitive to changes in calculation precision). It bears mentioning that any WHQL drivers would presumably have to enable FP32 as the default for PS 2.0.
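
For anyone who wants to see why a Mandelbrot shader would make a good precision probe, here is a rough CPU-side sketch in Python. It is only an emulation under stated assumptions: every intermediate result is rounded to the chosen format, which is not necessarily how NV30 behaves, and the function names and grid are invented for illustration.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round x to the struct format fmt ('<e' = fp16, '<f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def escape_iterations(cx: float, cy: float, fmt: str, max_iter: int = 64) -> int:
    """Iterate z = z*z + c, rounding each intermediate result to fmt."""
    r = lambda v: round_to(fmt, v)
    cx, cy = r(cx), r(cy)
    zx = zy = 0.0
    for i in range(max_iter):
        xx, yy, xy = r(zx * zx), r(zy * zy), r(zx * zy)
        if r(xx + yy) > 4.0:            # |z|^2 > 4 means the orbit escaped
            return i
        zx, zy = r(r(xx - yy) + cx), r(r(xy + xy) + cy)
    return max_iter

def count_mismatches(n: int = 48) -> int:
    """Count grid points whose fp16 and fp32 iteration counts differ."""
    mismatches = 0
    for j in range(n):
        for i in range(n):
            cx = -2.0 + 2.5 * i / (n - 1)     # x in [-2.0, 0.5]
            cy = -1.25 + 2.5 * j / (n - 1)    # y in [-1.25, 1.25]
            if escape_iterations(cx, cy, "<e") != escape_iterations(cx, cy, "<f"):
                mismatches += 1
    return mismatches

if __name__ == "__main__":
    print("points that differ between fp16 and fp32:", count_mismatches())
```

Rendering the per-pixel iteration count on the card and comparing it against reference images like these would, in principle, hint at which precision the driver actually used.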
 
Hmmm, according to those benchmarks, in NV30's FP30 mode partial precision yields only minor gains over full precision (fp16 was supposed to be twice as fast). This may be due to the extra scheduling the hardware has to handle for parallelism.

How does the R300/R350 compare in these benchmarks? What are the units for the benchmarks anyways, are they just numbers?
 
I've had some good discussions with various folks, some Nvidia, some not and here's the current thinking:

Halves (or PP) and Floats operate at the same speed IF register usage is the same. The gain offered by halves is that they use fewer register resources. And less register usage means more 'potential' speed.

The current NV drivers are based on the original MS shader spec, which MS has since declared incorrect (a typo?!). The spec was updated about a month ago. The result is that, in effect, the NV30 runs PS2 shaders at partial precision where it's allowed to.

The big speed gain is the use of the Fixed data type, since NV30 can execute fixed and non-fixed operations (including texture samples) in parallel. Fixed is only supported in the FP30 profiles, which are Nvidia only + OpenGL only.

The benchmarks were obtained by simply rendering a full screen textured quad at 640x480 resolution.

Units are fps.

If someone sends me an R350 I'll happily repeat the tests :) The comparison is really aimed at the various Cg profiles and HLSL. I think it's an accepted fact that ATI runs pure pixel shaders much faster than NV30, although at the end of the day it's how balanced the card is that matters.
 
pocketmoon_ said:
I've had some good discussions with various folks, some Nvidia, some not and here's the current thinking:

Halves (or PP) and Floats operate at the same speed IF register usage is the same. The gain offered by halves is that they use fewer register resources. And less register usage means more 'potential' speed.

Aaaah: according to the Beyond3D R300 vs NV30 tech comparison, the NV30 can store up to 64 values in the temp registers when using FP16, while it can store 32 values in FP32 (as R300 can). There are other interesting differences according to that comparison, however, such as NV30 not having dedicated constant registers and instead storing constants in instruction slots.
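
The 64-vs-32 figures are what you would get from a fixed-size register budget. A small Python sketch of that arithmetic follows. It assumes each temporary is a 4-component vector and that the budget is exactly what 32 fp32 temporaries occupy; both are assumptions for illustration, not documented NV30 details.

```python
COMPONENTS = 4                          # assumed: each temporary is an xyzw/rgba vector
BYTES_PER_COMPONENT = {"fp32": 4, "fp16": 2}

# Budget implied by 32 full-precision temporaries (assumption, see above).
FILE_BYTES = 32 * COMPONENTS * BYTES_PER_COMPONENT["fp32"]

def temporaries(file_bytes: int, precision: str) -> int:
    """How many temporaries of one precision fit in the budget."""
    return file_bytes // (COMPONENTS * BYTES_PER_COMPONENT[precision])

def fits(n_fp32: int, n_fp16: int, file_bytes: int = FILE_BYTES) -> bool:
    """Whether a mix of fp32 and fp16 temporaries fits in the budget."""
    used = COMPONENTS * (n_fp32 * BYTES_PER_COMPONENT["fp32"]
                         + n_fp16 * BYTES_PER_COMPONENT["fp16"])
    return used <= file_bytes

if __name__ == "__main__":
    print(temporaries(FILE_BYTES, "fp32"))   # 32
    print(temporaries(FILE_BYTES, "fp16"))   # 64
    print(fits(16, 32))                      # True: half the budget each
```

Under this model a shader that mixes precisions trades two fp16 temporaries for every fp32 one it gives up.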

pocketmoon_ said:
The big speed gain is the use of the Fixed data type, since NV30 can execute fixed and non-fixed operations (including texture samples) in parallel. Fixed is only supported in the FP30 profiles, which are Nvidia only + OpenGL only.

Despite all the but-drivers-will-improve-a-whole-lot, this is what will probably haunt the CineFX shader architecture. I maintain that this design choice is one of the primary reasons we have Cg today: developers will have to deal with the twitchiness of CineFX, and at least nVidia is giving them a decent tool to do so...
 
I do not believe that NV30 is able to execute all FP16 instructions two times faster than FP32 ones. Usually chips handle only the most complex instructions differently. For example, computing an inverse or square root is very expensive, and using partial precision helps. Some instructions take exactly 1 clock, so how could they run faster? It is very questionable that they have two completely different pipeline configurations, with the FP16 configuration able to execute two instructions in a single clock.
 
_arsil said:
I do not believe that NV30 is able to execute all FP16 instructions two times faster than FP32 ones. Usually chips handle only the most complex instructions differently. For example, computing an inverse or square root is very expensive, and using partial precision helps. Some instructions take exactly 1 clock, so how could they run faster? It is very questionable that they have two completely different pipeline configurations, with the FP16 configuration able to execute two instructions in a single clock.

there is actually some mention of fp16 performance in a released gdc03 paper (you can get it at nvidia's developer site, "d3d tutorial1_shaders")

why use fp16 precision?

- speed

- on some HW using fewer registers leads to faster performance
- fp16 takes half the register space of fp32, so can be 2x faster

http://developer.nvidia.com/view.asp?IO=presentations
gdc papers
http://developer.nvidia.com/docs/IO/4449/SUPP/D3DTutorial1_Shaders.pdf
d3d tutorial1_shaders
 
I guess the latency overhead incurred in fp32 by writing/reading to/from registers is not present for fp16 (32 vs. 64 registers).
 