Dawn FP16/FX12 VS FP32 performance - numbers inside

Discussion in 'Architecture and Products' started by Arun, May 25, 2003.

  1. theory

    Newcomer

    Joined:
    May 25, 2003
    Messages:
    12
    Likes Received:
    0
    Location:
    Surrey, UK
    Kinda like getting a V8 muscle car with bicycle wheels... only they can't be changed! Not the 5900 though, that's not such a bad card really - just it looks a bit naff where FSAA is concerned.
     
  2. cellarboy

    Newcomer

    Joined:
    Jun 18, 2002
    Messages:
    143
    Likes Received:
    2
    Location:
    Calgary, Alberta
    Ok, I'm being a little dense here I suppose, but isn't each FP16 pixel made up of an RGBA value with each channel a 16-bit floating point value?

    Or are you saying that each FP16 pixel is created from 4 values that together make a complete 16 bit floating point value? Essentially 4xFP4? Doesn't make sense to me.

    If this is the former, then this is exactly what OpenEXR is. Which makes sense seeing as OpenEXR was designed to take advantage of NV's FP16 mode.

    The way it is rendered to the frame buffer is irrelevant, I was talking about the calculated pixel format.
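
    For reference, the first reading (four channels, each a 16-bit float, so 64 bits per pixel) can be sketched as below. This is only an illustration using Python's struct module and its "e" half-float code; the function names are made up for the example and nothing here is taken from NVIDIA's or OpenEXR's actual code.

```python
import struct

def pack_fp16_rgba(r, g, b, a):
    """Pack four channel values as four IEEE 754 half floats (little-endian)."""
    return struct.pack("<4e", r, g, b, a)

def unpack_fp16_rgba(data):
    """Inverse of pack_fp16_rgba: 8 bytes back to an (r, g, b, a) tuple."""
    return struct.unpack("<4e", data)

pixel = pack_fp16_rgba(1.0, 0.5, 0.25, 1.0)
print(len(pixel) * 8)           # bits per pixel: 4 channels x 16 bits
print(unpack_fp16_rgba(pixel))  # these values are exactly representable in FP16
```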

     
  3. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    Uttar: Can you zip these shaders and make them available online?
     
  4. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
  5. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    You're right, it doesn't make sense. FX12, FP16, and FP32 are computation formats: each one sets the precision of the intermediate values stored between computing steps. The numbers represent the number of bits per color channel.

    Storage formats are different and, I believe, include an 8-bit framebuffer, a 16-bit floating point buffer for intermediate rendering, and a 32-bit floating point buffer also for intermediate rendering (that can be used as a packed format, where data can be stored in any combination of 8-bit int, 16-bit float, and 32-bit float data, up to a max of 128 bits per pixel).

    Side note: One thing to take away from this is that the way the FX architecture is designed, the output is always 8-bit integer. This means that depending on the intermediate calculations, using 16-bit and 32-bit floating point formats may or may not provide any benefit (from looking at the Dawn shaders, most shaders require at least one or two FP ops).
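
    A rough way to see why higher intermediate precision "may or may not" matter once the output is 8-bit: the sketch below runs a toy multiply chain in each format and only quantizes to 0..255 at the end. Everything in it is an assumption for illustration. The FX12 layout (10 fractional bits, range [-2, 2)) is assumed, the toy "shader" is made up, and FP16/FP32 rounding is emulated with Python's struct module rather than measured on hardware.

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE 754 half float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round x to the nearest IEEE 754 single float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

FX12_SCALE = 1024  # assumption: 10 fractional bits, representable range [-2, 2)

def to_fx12(x):
    """Round x to the assumed 12-bit fixed-point grid and clamp to its range."""
    q = round(x * FX12_SCALE)
    q = max(-2048, min(2047, q))
    return q / FX12_SCALE

def to_8bit(x):
    """Final framebuffer write: clamp to [0, 1] and quantize to 0..255."""
    return round(max(0.0, min(1.0, x)) * 255)

def shade(quantize, albedo=0.3, light=0.7, steps=4):
    """Toy shader: repeated multiply, re-rounded to the working format each step."""
    value = quantize(albedo)
    for _ in range(steps):
        value = quantize(value * quantize(light))
    return to_8bit(value)

for name, q in (("FX12", to_fx12), ("FP16", to_fp16), ("FP32", to_fp32)):
    print(name, shade(q))
```

    Short chains like this one tend to land on the same or a neighbouring 8-bit value in every format; longer chains or larger value ranges are where the formats would be expected to diverge.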
     
  6. dominikbehr

    Newcomer

    Joined:
    Apr 19, 2002
    Messages:
    72
    Likes Received:
    0
    Location:
    Sunnyvale, CA
    OpenEXR is an image file format that uses an FP16 pixel format.
    You can load images as textures onto the graphics card in FP16 using the ATI texture float or NV half float extensions. Then you can process the textures with shaders, render them to FP16 render targets using the ATI pixel format float or NV float buffer extensions, and save to OpenEXR again.
     
  7. ram

    ram
    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    218
    Likes Received:
    0
    Location:
    Switzerland
    Nvidia's official claim is that NV35 can do twelve 32-bit per component (128-bit) floating point operations peak per clock cycle. I think it is a reasonable theory to blame a lack of FP32 register space for the performance loss over FP16.
     
  8. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    How about:

    FX12 goes down the FP16 units in denormalized form.

    The increased performance of FX12 comes from lower latency, because:

    1. arguments don't need aligning (when adding)
    2. - and results don't need to be normalized.

    Cheers
    Gubbi
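
    The two steps listed above can be sketched with a toy model. This is not how the NV3x units are actually built (nobody in the thread knows that); it only shows the extra work a floating point add does compared with a fixed point add. The (mantissa, exponent) representation, the 11-bit mantissa width and the function names are all assumptions for the example, and only positive values are handled.

```python
MANT_BITS = 11  # toy mantissa width (FP16 stores 10 bits plus an implicit one)

def fixed_add(a, b):
    """Fixed point: both operands share one scale, so this is a plain integer add."""
    return a + b

def float_add(a, b):
    """Toy float add on (mantissa, exponent) pairs, value = mantissa * 2**exponent."""
    (ma, ea), (mb, eb) = a, b
    # Step 1: align the operands by shifting the one with the smaller exponent.
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb >>= ea - eb  # low bits are truncated in this toy model
    m, e = ma + mb, ea
    # Step 2: normalize the result so the mantissa fits in MANT_BITS again.
    while m >= (1 << MANT_BITS):
        m >>= 1
        e += 1
    return m, e

print(fixed_add(512, 256))
print(float_add((1024, 1), (1024, 0)))
```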
     
  9. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    What are the results of forced fp16 @ 1024x768 on the 5800 ultra? If the 5900 ultra scores ~27 fps and the 5800 ultra scores ~30 fps with the completely mixed format, then it can almost be confirmed that the 5900 has more raw fp performance, especially clock for clock, than the 5800 ultra; this, or the performance figures reflect NV35's higher bandwidth (even though the demo seems to be more computationally bound).

    What percentage of the original ultra demo instructions were explicitly FX12, FP16, etc.?

    All this info seems to lend credibility to this conclusion over here, which we came to in a Beyond3D thread.
     
  10. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Do you guys believe that these scores are the result of Nvidia drivers forcing fp16 for the 5900 ultra, along with other instruction optimizations? I believe the Futuremark pixel shader 2.0 test defaults to the highest precision available, so defaulting to fp16 would seem to help scores a lot. If the dawn demo saw fp32 at 2/3 the performance of fp16 (and the dawn demo originally specified fp32 for some objects), the futuremark shader, which specifies the highest precision for every instruction, should yield an even greater performance delta between fp16 and fp32. As opposed to the dawn situation, futuremark would benefit from increased register usage performance (in the move from fp32 to fp16, if this is the case) across the board (rather than on a select number of instructions).

    The performance delta between the pixel shader results in versions 320 and 330 of futuremark ranges between 50% and 53%, with 330 scoring lower than 320. In dawn, forced fp32 is about 46% slower than fp16. The numbers seem to fit.

    Being that the NV3x pays a large penalty for register usage (about 50% more latency for using more than 2 registers in fp32 mode), if Nvidia forced fp16 and got these results (33.1 fps vs. 14.5 @ 1024x768), it kind of confirms the fact that NV35 has 12 fp shader units (at least more than the R350).
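
    Since "X% slower" and "X% faster" are easy to mix up when comparing deltas like these, here is a small helper for the two scores quoted above. The function names are made up for the example; the only inputs are the 33.1 and 14.5 fps figures from this post.

```python
def percent_slower(fast, slow):
    """How much lower the slow score is, as a percentage of the fast score."""
    return (1 - slow / fast) * 100

def percent_faster(fast, slow):
    """How much higher the fast score is, as a percentage of the slow score."""
    return (fast / slow - 1) * 100

higher_fps, lower_fps = 33.1, 14.5  # the two scores quoted above

print(round(percent_slower(higher_fps, lower_fps), 1))
print(round(percent_faster(higher_fps, lower_fps), 1))
```

    The same pair of scores gives two very different percentages depending on which one is taken as the baseline, so it matters which convention a quoted delta uses.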
     
  11. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Well, my guess right now really is only that the "register combiners" ( if they still exist and aren't more general than that ) are now upgraded from FX12 to FP16 and that they're capable of doing 1FP32 op/clock or 2FP16 ops/clock compared to 2FX12 ops/clock.

    I don't remember nVidia *anywhere* saying that they are capable of 12 FP32 ops. If there was such a claim, I'd love to have a link to it.

    What I remember however is nVidia claiming:
    - The NV35 is a 12 operations/clock architecture ( see my leaked preliminary PR list )
    - The NV35 got doubled FP performance over the NV30

    Now, the first doesn't tell us that it's FP32.
    And the second is, using my theory, correct for FP32.

    So my guess is that nVidia kept the FP32/Tex unit 100% unchanged besides maybe a few minor optimizations, and then they upgraded the register combiners from FX12 to FP16 with the ability of doing FP32 in two clocks or maybe by uniting the two units per "pipeline" ( since I still think the NV3x *might* not have pipelines )


    Uttar
     
  12. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Here it is, quoted from its initial source and confirmed to me in a pm:
    The original post also goes on to explain how the register performance impact is still very much present in NV35.

    In the same thread, I posted this diagram of the NV35 pipeline (according to the confirmed information and thepkrl's research):
    and compared it with one of NV30's supposed pipelines:
     
  13. ram

    ram
    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    218
    Likes Received:
    0
    Location:
    Switzerland
    There is no link. The source of this information is Luciano Alibrandi from Nvidia Europe. He specified that Nvidia means 12 FP32 operations per clock.
     
  14. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Read my previous post for a quote which is in reference to official information.
     
  15. ram

    ram
    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    218
    Likes Received:
    0
    Location:
    Switzerland
    So there we have two sources now ;)
     
  16. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Okay, alright then...

    But something ain't quite normal here!
    All *modified* shader programs are using four FP16 registers ( or sometimes less! )
    According to thepkrl's results, the difference between 2FP32 registers ( = 4FP16 registers ) and 4FP32 registers is *minimal*, and 4FP32 registers was thus agreed to be the "sweetspot"

    I'm sorry, but that's just ridiculous. Either the NV35 got a ten times more problematic register usage problem, or the NV35 isn't as fast, operation wise, doing FP32.
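
    The "2FP32 registers = 4FP16 registers" equivalence used above is just storage arithmetic, sketched below. It assumes each temporary register holds a 4-component vector; that, and the helper name, are assumptions for the example rather than anything confirmed about the NV35 register file.

```python
COMPONENTS = 4  # assumption: each temp register holds a 4-component (RGBA/XYZW) vector

def register_bits(count, bits_per_component):
    """Total storage taken by `count` registers at the given per-component width."""
    return count * COMPONENTS * bits_per_component

print(register_bits(4, 16))  # four FP16 registers
print(register_bits(2, 32))  # two FP32 registers
print(register_bits(4, 32))  # four FP32 registers
```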


    Uttar

    EDIT: the above is my reply to Luminescent's PM ( makes no sense to keep this private ).
     
  17. ram

    ram
    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    218
    Likes Received:
    0
    Location:
    Switzerland
    Maybe some further benchmarking could enlighten us ...
     
  18. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Damn, I must be tired these days!

    There's a ridiculously easy way to test if the problem for the NV35 is *only* register usage: use FP32 registers with FP16 instructions! AFAIK, the compiler does not optimize that :)

    Can't do it now ( in Linux, don't have the files on this PC ) , but I'll try to have it available later today.
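
    A minimal sketch of how such a test variant could be generated from shader text, assuming the shaders are in an NV_fragment_program-style assembly where an "R" or "H" suffix on the opcode selects fp32 or fp16 math and the R/H register names select fp32 or fp16 storage. The opcode list, the regular expression and the function name are assumptions for the example; the actual Dawn shaders may need a different rewrite.

```python
import re

# Assumption: a small subset of arithmetic mnemonics; extend as needed.
OPS = ("ADD", "MUL", "MAD", "MOV", "DP3", "DP4")

# Matches e.g. MULR, MADRC, ADDR_SAT: opcode, the "R" precision suffix,
# then an optional condition-code "C" and an optional "_SAT".
_PATTERN = re.compile(r"\b(%s)R(C?(?:_SAT)?)\b" % "|".join(OPS))

def force_half_ops(source):
    """Switch full-precision opcode suffixes to half, leaving register names alone."""
    return _PATTERN.sub(r"\1H\2", source)

print(force_half_ops("MULR R0, R1, R2;"))
```

    Only the opcodes change, so the program keeps using the same fp32 registers; comparing its speed with the all-fp32 and all-fp16 versions is what would separate register cost from arithmetic cost.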


    Uttar
     
  19. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Actually, using more than 2 registers in fp32 mode causes operations to execute at a rate of 1.45 clock cycles per op (per pipeline) - a substantial latency impact (roughly 50%) in comparison to the 1 clock cycle per op (per pipeline) of one or two registers. Under such register conditions (4 or 3 vs. 2 or 1), NV35 at fp32 should run at about 2/3 the speed it reaches at fp16 (1.45 clock cycles per op vs. 1). This performance delta is exactly the one observable in the Dawn demo.
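
    As a quick check of that ratio, reading the 1.45 figure as clock cycles per op (an assumption about what the number means), relative throughput is simply the inverse of the cycle count. The helper name is made up for the example.

```python
def relative_speed(slow_cycles_per_op, fast_cycles_per_op=1.0):
    """Throughput of the slow case as a fraction of the fast case."""
    return fast_cycles_per_op / slow_cycles_per_op

speed = relative_speed(1.45)
print(round(speed, 3))          # fraction of the 1-cycle-per-op rate
print(round(abs(speed - 2 / 3), 3))  # distance from the 2/3 estimate
```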

    I'll quote myself to demonstrate how I derived these performance figures:
     
  20. ♪Anode

    Newcomer

    Joined:
    May 9, 2003
    Messages:
    10
    Likes Received:
    0
    IMO supporting full FP32 would be the way to go in the future. FP24 is mostly a stopgap which ATI used to reclaim their performance crown, and it worked for them at this point in time. But I don't doubt for a second that they will be going FP32 for future cards. If they don't, they will lose out.
    This is sort of similar to the 16-bit vs 32-bit stuff. The first line of cards (TNT) was a bit slow to do 32-bit everywhere, but it drew attention to 32-bit and showed the advantages of using it. The later generations improved on that and we now have 32-bit everywhere.
    Doing fp16 at full speed and fp32 at half is as good a choice as ATI doing 24 bits, if not better. It may not seem so right now but it will pay off as time goes on. :wink:
     