Dawn FP16/FX12 VS FP32 performance - numbers inside

YeuEmMaiMai said:
Dave, since this appears to be the case, I really feel sorry for people who bought the card thinking they were going to get some performance out of it. Either that, or nVidia drivers are going to be 20MB downloads just for the nV3x cards, since they will have to sub a lot of shader code to make up for the card's weaknesses. Sad really; $500 for a card and people got short-changed.

DaveBaumann said:
YeuEmMaiMai said:
Something tells me that the nV3X design was not really meant for FP32 to be fast but just to get their foot in the door, so to speak...

This is exactly the case.

On the way back to the hotel from E3 one night I cadged a lift on the NVIDIA bus. When I got on I noticed the name badge of Keith Galocy, a name I recognised from the 3dfx days. We were talking about a number of things, such as what he's up to and the NV3x parts etc., and he made that exact same point himself. He said that NV30 is really a good DX8 performer with DX9 capability added, similar in approach to the previous generations. With the number of FX12 units, NV30 is a superb DX8-class performer, but not quite so hot a DX9 performer.

ATI took a slightly more generic route of not bothering with full FP32 precision, but using more FP24 units that can generically cope with both DX9 and DX8 shaders, so they have ended up with a more balanced architecture for current and new titles. If NV stick to running DX9 shaders at full DX9 precision, half of the NV30 pipeline is wasted as it's not float.

Kinda like getting a V8 muscle car with bicycle wheels... only they can't be changed! Not the 5900 though; that's not such a bad card really, it just looks a bit naff where FSAA is concerned.
 
Uttar said:
Well, remember OpenEXR is a *file* format.
That means that OpenEXR is a 64-bit per pixel ( 4xFP16 ) format. Generally, the framebuffer is 32-bit per pixel ( 4xFX8 )

Ok, I'm being a little dense here I suppose, but isn't each FP16 pixel made up of an RGBA value, with each channel a 16-bit floating-point value?

Or are you saying that each FP16 pixel is created from 4 values that together make a complete 16 bit floating point value? Essentially 4xFP4? Doesn't make sense to me.

If it's the former, then that's exactly what OpenEXR is, which makes sense seeing as OpenEXR was designed to take advantage of NV's FP16 mode.

The way it is rendered to the framebuffer is irrelevant; I was talking about the calculated pixel format.

The data type implemented by class half is identical to Nvidia's 16-bit floating-point format ("fp16 / half"). 16-bit data, including infinities and NANs, can be transferred between OpenEXR files and Nvidia 16-bit floating-point frame buffers without losing any bits.
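
Just to spell the sizes out (an illustration in C++, assuming a plain RGBA layout with no padding; the struct names are mine):

#include <cstdint>
#include <cstdio>

// One OpenEXR-style pixel: four FP16 (half) channels. C++ has no built-in
// half type, so each channel is stored here as a raw 16-bit pattern.
struct HalfPixel { uint16_t r, g, b, a; };   // 4 x 16 bits = 64 bits
// One conventional framebuffer pixel: four 8-bit integer channels.
struct Fx8Pixel  { uint8_t  r, g, b, a; };   // 4 x 8 bits  = 32 bits

int main() {
    std::printf("4xFP16 pixel: %zu bits\n", sizeof(HalfPixel) * 8);  // 64
    std::printf("4xFX8  pixel: %zu bits\n", sizeof(Fx8Pixel)  * 8);  // 32
}

So four half channels per pixel gives the 64 bits per pixel Uttar mentioned, against 32 bits for the usual integer framebuffer.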
 
cellarboy said:
Or are you saying that each FP16 pixel is created from 4 values that together make a complete 16 bit floating point value? Essentially 4xFP4? Doesn't make sense to me.
You're right, it doesn't make sense. FX12, FP16, and FP32 are computation formats: the precisions of the intermediate storage between computing steps. The numbers represent the number of bits per color channel.

Storage formats are different and, I believe, include an 8-bit framebuffer, a 16-bit floating point buffer for intermediate rendering, and a 32-bit floating point buffer also for intermediate rendering (that can be used as a packed format, where data can be stored in any combination of 8-bit int, 16-bit float, and 32-bit float data, up to a max of 128 bits per pixel).

Side note: One thing to take away from this is that the way the FX architecture is designed, the output is always 8-bit integer. This means that depending on the intermediate calculations, using 16-bit and 32-bit floating point formats may or may not provide any benefit (from looking at the Dawn shaders, most shaders require at least one or two FP ops).
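
To make that last point concrete, the final step amounts to something like this (a rough sketch, not the actual hardware path; the function name is just for illustration):

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Whatever precision the shader used internally (FX12, FP16 or FP32), the value
// written to a conventional framebuffer ends up quantized to 8-bit integer.
uint8_t to_framebuffer(float channel) {
    const float clamped = std::min(std::max(channel, 0.0f), 1.0f);  // clamp to [0,1]
    return static_cast<uint8_t>(clamped * 255.0f + 0.5f);           // round to 8 bits
}

int main() {
    std::printf("%d %d %d\n", to_framebuffer(0.0f), to_framebuffer(0.5f), to_framebuffer(1.0f)); // 0 128 255
}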
 
OpenEXR is an image file format that uses an FP16 pixel format.
You can load images as textures to the graphics card in FP16 using the ATI texture_float or NV half_float extensions, then process the textures with shaders and render them to FP16 render targets using the ATI pixel_format_float or NV float_buffer extensions, then save to OpenEXR again.
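
Roughly, the upload half of that looks like this (a sketch only, assuming glext.h provides the ATI_texture_float enum; the function name is made up, and the FP16 render-target setup via a float pbuffer is left out):

#include <GL/gl.h>
#include <GL/glext.h>   // for GL_RGBA_FLOAT16_ATI (ATI_texture_float)

// Upload a width x height RGBA image (decoded from an OpenEXR file into plain
// floats) as an FP16 texture. Rendering to an FP16 target would then go
// through a float pbuffer (ATI_pixel_format_float / NV_float_buffer).
GLuint upload_fp16_texture(const float* rgba, int width, int height) {
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA_FLOAT16_ATI,   // FP16 internal format
                 width, height, 0,
                 GL_RGBA, GL_FLOAT, rgba);                // FP32 source data
    return tex;
}

Passing GL_FLOAT source data and letting the driver convert to the FP16 internal format keeps the sketch simple; NV half_float would let you hand over half data directly.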
 
DaveBaumann said:
It's probably more like 4 FP32 or 8 FP16. I suspect that the NV30 chip just wasn't as efficient at apportioning two FP16 instructions as it could have been, and this is probably one of the major fixes with NV35.

Nvidia's official claim is that NV35 can do twelve 32-bit per component (128-bit) floating-point operations peak per clock cycle. I think it is a reasonable theory to blame lacking FP32 register space for the performance loss relative to FP16.
 
Uttar said:
As for the FP16 performance hit...
All I did really for the FP16 version is change all the MULX, DP3X, ... into MULH, DP3H, ... - nothing else.
I've got three guesses:
- The fragment pipeline can share the FX12 power of T&L
- Some instructions, which are not run natively, might use shortcuts when they know only FX12 precision is requested and thus run faster.
- I've done something wrong.

How about:

FX12 goes down the FP16 units in denormalized form.

The increased performance of FX12 comes from lower latency, because:

1. arguments don't need aligning (when adding)
2. and results don't need to be normalized (see the sketch below).
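
Spelled out in software (a rough sketch of the idea only, obviously not what the silicon literally does; function names are just for illustration):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <utility>

// FX12-style fixed point: the add is just an integer add, no pre/post work.
int32_t fx12_add(int32_t a, int32_t b) { return a + b; }

// Floating-point add with the two extra steps made explicit.
float fp_add(float a, float b) {
    int ea, eb;
    float ma = std::frexp(a, &ea);        // a = ma * 2^ea
    float mb = std::frexp(b, &eb);        // b = mb * 2^eb
    if (ea < eb) { std::swap(ma, mb); std::swap(ea, eb); }
    mb = std::ldexp(mb, eb - ea);         // 1. align the smaller argument
    float sum = ma + mb;                  //    the add itself
    int e;
    sum = std::frexp(sum, &e);            // 2. normalize the result
    return std::ldexp(sum, e + ea);
}

int main() {
    std::printf("%d  %g\n", fx12_add(1024, 512), fp_add(1.5f, 0.375f)); // 1536  1.875
}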

Cheers
Gubbi
 
What are the results of forced fp16 @ 1024x768 on the 5800 ultra? If the 5900 ultra scores ~27 fps and the 5800 ultra scores ~30 fps with the completely mixed format, then it can almost be confirmed that the 5900 has more raw fp performance, especially clock for clock, than the 5800 ultra; this, or the performance figures reflect NV35's higher bandwidth (even though the demo seems to be more computationally bound).

What percentage of the original ultra demo instructions were explicitly FX12, FP16, etc.?

All this info seems to lend credibility to this conclusion over here, which we came to in a Beyond3D thread.
 
Do you guys believe that these scores are the result of Nvidia drivers forcing fp16 for the 5900 ultra, along with other instruction optimizations? I believe the Futuremark pixel shader 2.0 test defaults to the highest precision available, so forcing fp16 would seem to help scores a lot. If the Dawn demo saw fp32 at 2/3 the performance of fp16 (and the Dawn demo originally specified fp32 for some objects), the Futuremark shader, which specifies the highest precision for every instruction, should yield an even greater performance delta between fp16 and fp32. As opposed to the Dawn situation, Futuremark would benefit from improved register usage performance (in the move from fp32 to fp16, if this is the case) across the board, rather than on a select number of instructions.

The performance delta between the pixel shader results in versions 320 and 330 of Futuremark ranges between 50% and 53%, where 330 scores lower than 320. In Dawn, forced fp32 is about 46% slower than fp16. The numbers seem to fit.

Given that the NV3x pays a large penalty for register usage (roughly 45% more latency for using more than 2 registers in fp32 mode), if Nvidia forced fp16 and got these results (33.1 fps vs. 14.5 @ 1024x768), it kind of confirms that NV35 has 12 fp shader units (at least more than the R350).
 
Well, my guess right now really is only that the "register combiners" ( if they still exist and aren't more general than that ) are now upgraded from FX12 to FP16 and that they're capable of doing 1 FP32 op/clock or 2 FP16 ops/clock compared to 2 FX12 ops/clock.

I don't remember nVidia *anywhere* saying that they are capable of 12 FP32 ops. If there was such a claim, I'd love to have a link to it.

What I remember however is nVidia claiming:
- The NV35 is a 12 operations/clock architecture ( see my leaked preliminary PR list )
- The NV35 got doubled FP performance over the NV30

Now, the first doesn't tell us that it's FP32.
And the second is, using my theory, correct for FP32.

So my guess is that nVidia kept the FP32/Tex unit 100% unchanged besides maybe a few minor optimizations, and then they upgraded the register combiners from FX12 to FP16, with the ability to do FP32 in two clocks or maybe by uniting the two units per "pipeline" ( since I still think the NV3x *might* not have pipelines ).
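
A quick tally of that guess, nothing more than arithmetic (assuming 4 pipelines, each with one FP32/Tex unit plus a combiner stage; the figures are the ones claimed above):

#include <cstdio>

int main() {
    // 4 pipelines, each with one FP32/Tex unit plus a register-combiner stage
    // (FX12-only on NV30, FP16-capable on NV35 per the guess above).
    const int pipes = 4;
    const int nv30_fp32_per_clock = pipes * 1;        // combiners can't do float: 4
    const int nv35_fp32_per_clock = pipes * (1 + 1);  // combiner adds 1 FP32 op: 8 (doubled)
    const int nv35_fp16_per_clock = pipes * (1 + 2);  // combiner does 2 FP16 ops: 12
    std::printf("NV30 FP32/clock: %d\nNV35 FP32/clock: %d\nNV35 FP16/clock: %d\n",
                nv30_fp32_per_clock, nv35_fp32_per_clock, nv35_fp16_per_clock);
}

Which would line up with both the "12 operations/clock" line and the doubled FP performance claim.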


Uttar
 
Here it is, quoted from its initial source and confirmed to me in a pm:
I actually got a reply on this from NVidia 2 hours ago.
It actually seems that integer logic is gone from NV35 pixel shaders. It is capable of 3 floating point (and it doesn't care that much about fp16 vs. fp32 either) instructions per pipe per clock (12 floating point instructions per clock total), or 2 floating point instructions + 2 texture look-ups per pipe per clock.
The original post also goes on to explain how the register performance impact is still very much present in NV35.

In the same thread, I posted this diagram of the NV35 pipeline (according to the confirmed information and thepkrl's research):
temporary registers (R0,R1,..,H0,H1,..)
|
FLOAT (perhaps does DDX/DDY for dependent fetches)
| \
| TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
| /
FLOAT <-- temp registers
|
FLOAT <-- temp registers
|
(loopback to temporary registers or output)
and compared it with one of NV30's supposed pipelines:
(thepkrl)
temporary registers (R0,R1,..,H0,H1,..)
|
FLOAT (perhaps does DDX/DDY for dependent fetches)
| \
| TEXTURE <-- f[TEX0],f[TEX1],.. (DDX/DDY is free)
| /
INTEGER <-- f[COL0],f[COL1]
|
INTEGER <-- f[COL0],f[COL1]
|
(loopback to temporary registers or output)
 
Uttar said:
I don't remember nVidia *anywhere* saying that they are capable of 12 FP32 ops. If there was such a claim, I'd love to have a link to it.

There is no link. The source of this information is Luciano Alibrandi from Nvidia Europe. He specified that Nvidia means 12 FP32 operations per clock.
 
Okay, alright then...

But something ain't quite normal here!
All *modified* shader programs are using four FP16 registers ( or sometimes less! )
According to thepkrl's results, the difference between 2 FP32 registers ( = 4 FP16 registers ) and 4 FP32 registers is *minimal*, and 4 FP32 registers was thus agreed to be the "sweetspot".

I'm sorry, but that's just ridiculous. Either the NV35 has a register usage problem ten times worse, or the NV35 isn't as fast, operation-wise, doing FP32.
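
( For reference, the equivalence I'm using above, in plain register-file bits, assuming 4-component registers - just arithmetic: )

#include <cstdio>

int main() {
    // The "2 FP32 registers = 4 FP16 registers" equivalence in register-file bits.
    const int fp32_reg_bits = 4 * 32;   // 128 bits per full-precision register
    const int fp16_reg_bits = 4 * 16;   //  64 bits per half-precision register
    std::printf("4 x FP16 registers: %d bits\n", 4 * fp16_reg_bits);   // 256
    std::printf("2 x FP32 registers: %d bits\n", 2 * fp32_reg_bits);   // 256
}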


Uttar

EDIT, reply to Luminescent PM ( makes no sense to keep this private ): same reply as the above.
 
Damn, I must be tired these days!

There's a ridiculously easy way to test if the problem for the NV35 is *only* register usage: use FP32 registers with FP16 instructions! AFAIK, the compiler does not optimize that :)
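
Something like this is what I have in mind ( NV_fragment_program syntax from memory, just an illustrative sketch - not tested, and the constant name is made up ):

// Hypothetical test shader: FP16-precision (H-suffixed) arithmetic forced
// through full FP32 (R) registers, so any slowdown should come from register
// usage alone, not from instruction precision. Written from memory - untested.
const char* kFp16MathFp32Regs =
    "!!FP1.0\n"
    "TEX  R0, f[TEX0], TEX0, 2D;\n"   // texture result into an FP32 register
    "MULH R0, R0, f[COL0];\n"         // half-precision multiply, FP32 storage
    "DP3H R1, R0, R0;\n"              // half-precision dot product
    "MOVR o[COLR], R1;\n"             // write the final colour
    "END\n";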

Can't do it now ( in Linux, don't have the files on this PC ), but I'll try to have it available later today.


Uttar
 
Actually, using more than 2 registers in fp32 mode causes operations to execute at 1.45 clock cycles per op (per pipeline), a substantial latency impact (roughly 45%) in comparison to the 1 clock cycle per op (per pipeline) achieved with one or two registers. Under such register conditions (3 or 4 registers vs. 1 or 2), NV35 at fp32 should run at about 2/3 the speed it manages at fp16 (1.45 cycles per op vs. 1). This performance delta is exactly the one observable in the Dawn demo.

I'll quote myself to demonstrate how I derived these performance figures:
Note: For those who are skeptical, this is how the numbers add up:
Here it says that for 16 adds (or 16 one-cycle ops, for the NV3x) and the use of 3 registers, the NV30 takes 5.8 cycles. Since the NV30 has 4 pipelines and each add instruction takes 1 cycle, the ideal figure would be 4 cycles. This means that per pipeline each instruction is taking 5.8/4 cycles, yielding 1.45 cycles (almost 50% more time for using 3 versus 2 or 1 registers).
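
The arithmetic, for anyone who wants to check it (nothing beyond the numbers in the quote above):

#include <cstdio>

int main() {
    // From the quote: 16 one-cycle ops, 3 registers, measured 5.8 cycles total.
    const double ideal_cycles    = 16.0 / 4.0;              // 4 pipelines -> 4 cycles ideally
    const double measured_cycles = 5.8;
    const double cycles_per_op   = measured_cycles / 4.0;   // per pipeline
    std::printf("cycles per op per pipeline: %.2f\n", cycles_per_op);                     // 1.45
    std::printf("extra time vs 1-2 registers: %.0f%%\n", (cycles_per_op - 1.0) * 100.0);  // 45%
    std::printf("relative speed: %.2f (~2/3)\n", ideal_cycles / measured_cycles);         // 0.69
}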
 
3dcgi said:
JF_Aidan_Pryde said:
YeuEmMaiMai said:
Something tells me that the nV3X design was not really meant for FP32 to be fast but just to get their foot in the door, so to speak...

Your endless one-sided comments are getting very tiring.
Some of you are jumping on YeuEmMaiMai for this statement, but I think it's a valid assumption. I don't know if he is correct, but supporting slow FP32 and making FP16 fast should give a lower transistor count than supporting FP32 at full speed. ATI obviously made another tradeoff, which has become the better decision in hindsight. If DirectX9 required FP32, or had a minimum requirement of FP16, things might be different.

IMO supporting full FP32 would be the way to go in the future. FP24 is mostly a stopgap thing which ATI did to reclaim their performance crown, and it worked for them at this point in time. But I don't doubt for a second that they will be going FP32 for future cards. If they don't, they will lose out.
This is sort of similar to the 16-bit vs 32-bit stuff. The first line of cards (TNT) was a bit slow doing 32-bit everywhere, but it drew attention to 32-bit and showed the advantages of using it. The later generations improved on that, and we now have 32-bit everywhere.
Doing fp16 at full speed and fp32 at half is as good a choice as ATI doing 24 bits, if not better. It may not seem so now, but it will pay off as time goes on. ;)
 