FP16? But it's the current year!

Discussion in 'Architecture and Products' started by Markus, Oct 27, 2016.

  1. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
Yes, but an explicit half type still doesn't mean the GPU has to actually execute at fp16. Why would it, if fp32 is faster (on specific hardware)? It's also not something that's required by SM6.
    Having an explicit half type however is a good idea. You wouldn't want to run stuff at int8 by accident.
     
    Razor1 and sebbbi like this.
  2. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    17,120
    Likes Received:
    1,653
    Location:
    Winfield, IN USA
    Dumb question, but should I be waiting for fp64 soon now? Is that how it works?
     
  3. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    FP64 won't be needed for real time graphics for a long time, if ever.
     
    digitalwanderer likes this.
  4. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    7,714
    Likes Received:
    6,005
The opposite: I'm not thinking my performance will tank, but that I won't benefit from it even though I paid a hefty premium - for a feature that should have been there if this is going to be the future state.
     
  5. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
Personally I don't like 8 bit integer ALU or 10/12 bit limited range fixed point ALU. You have to be too careful and work around the limited precision and range (simultaneously) too much. Small bit depth types are however perfect for memory storage.

    For int math (ALU), 16 bit is enough for surprisingly many cases (int is lossless, so range is all that matters). Fp16 is also often enough for general purpose math (unlike 10/12 bit fixed point types which are highly situational). Range is rarely an issue for fp16 (+-65504), and precision is good enough when working with 8 bit inputs/outputs. Of course you need fp32 ALU for position math (transforms, etc) and modern lighting math, but fp16 is perfect for post processing, LDR math before lighting (colorize, etc) and for normal vector related math, etc, etc.
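These claims are easy to poke at on the CPU: Python's `struct` module can round values through IEEE 754 half precision (the `'e'` format, Python 3.6+). A minimal sketch, not anything GPU-specific:

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE 754 half precision.
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(65504.0))             # 65504.0: the largest finite fp16 value
step = to_fp16(1.0 + 2**-10) - 1.0  # spacing between fp16 values just above 1.0
print(step)                         # 2**-10 ~ 0.001, finer than the ~0.004 (1/255)
                                    # spacing of 8 bit inputs/outputs
```

So around the [0,1] range, fp16 resolves steps well below what 8 bit inputs/outputs can even express, which is why it tends to be enough for that kind of math.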
     
  6. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Is there someplace I can read up on using FP16 correctly?
     
  7. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Sure, but then you're back to losing the specific optimisations that the engine/post processing required with the FP16 operations.
Look at how physics can be broken in game engines (not the same thing, just an example - I'm not suggesting all physics for this). I guess it depends on the focus and implementation; at a minimum, as hinted by Sebbbi, it could possibly mess up AA, or cause lighting/shadow abnormalities, etc.
    Cheers
     
    #27 CSI PC, Oct 30, 2016
    Last edited: Oct 30, 2016
  8. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
Opportunity cost issue. You can't lose optimizations that never would have existed in the first place. I'm not sure we'll be seeing any effects that exist solely because of FP16; the current optimizations look to be doing the same task in less time with fewer resources. Promoting FP16 to FP32 should never mess up the rendering unless you somehow relied on lossy math for some sort of randomness. The results may look ever so slightly different, but the variation should be small. Maybe it explodes your register usage and performance tanks, but again, what was the alternative? FP16 should always provide equal or better results than the alternatives.
     
    milk likes this.
  9. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Read 32 bit float articles targeted towards scientific audience. For them fp32 is "half precision" compared to fp64. This article seems to have it all, but is way too deep for most rendering programmers: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

The most important thing in float math is to avoid catastrophic cancellation. This means subtracting two (big) values that are close to each other. This operation commonly happens when you use world space coordinates in your math and calculate vectors between two points. For example localLight->surfacePixel, camera->surfacePixel, vertex1->vertex2 (edge math in world space), etc. The solution is to avoid doing math in world space. Just don't do it. It causes problems even in fp32 if your world is big enough. The first thing you should do is subtract the camera position from all world space data (*). Absolutely no math before this. This way your floating point error is localized around the camera. Closer to the camera = less error, further away = more error. Perspective projection makes everything smaller at distance, normalizing the error and ensuring that no matter the distance, the error is always smaller than some subpixel fraction.
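A minimal CPU-side sketch of the problem, using Python's `struct` module to round values through IEEE 754 single precision (the same storage format as GPU fp32); the coordinates are made up for illustration:

```python
import struct

def to_fp32(x):
    # Round-trip a Python double through IEEE 754 single precision.
    return struct.unpack('f', struct.pack('f', x))[0]

# Two world space points ~100 km from the origin, 1 cm apart on one axis.
a = to_fp32(100000.00)
b = to_fp32(100000.01)
print(b - a)   # 0.0078125: the 1 cm offset was already lost when the positions
               # were stored, because fp32 spacing at 100 km is ~7.8 mm

# Subtract the camera position from the raw data first (at full precision),
# and the same two points, now ~2 m from the camera, keep their 1 cm offset.
cam = 99998.0
print(to_fp32(100000.01 - cam) - to_fp32(100000.00 - cam))   # ~0.00999999
```

The subtraction itself is exact; the damage happens when the large world space values are rounded into fp32 before it, which is why the camera subtract must come before any other math.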

    The previous trick is not enough for all fp16 cases. Instead of performing math in camera centered space, you'd sometimes want to perform math in surface local space. Subtract surface coordinate from the other position data. Do this subtract in fp32 (to minimize catastrophic cancellation) and rest of the operations in fp16.

In general you should avoid adding two floating point numbers together if their magnitudes differ a lot. Example: a time counter should never be floating point (fp32). The accumulated time gets large while the added time per frame stays small. The precision of the add gets lower and lower, and frame times (animations) start to judder after a few hours. Similarly, if you are adding multiple light sources together, you should not simply add them one at a time in a loop. Instead you should first add the lights pairwise, then these results pairwise, etc. This results in an equal number of add operations, but if we assume that the lights are roughly the same intensity, the adds are always performed between two numbers of roughly the same magnitude, reducing the floating point error. Or you could simply use a fp32 light accumulation counter and do the heavy math in fp16; you only perform a single fp32 add per light. However, modern GGX lighting math requires fp32 precision in some places, but again you can carefully isolate the fp32 math from the fp16 math if you know the relative magnitudes of the operands. It is tricky if you borrow a lighting formula from a paper and don't understand exactly how it works.
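The pairwise trick can be sketched on the CPU with Python's `struct` module rounding every add through fp16; the light intensities and count here are made up for illustration:

```python
import struct

def fp16(x):
    # Round a value through IEEE 754 half precision.
    return struct.unpack('e', struct.pack('e', x))[0]

def sum_sequential(vals):
    # Naive loop: the accumulator grows large while each light stays small,
    # so later adds lose more and more precision.
    acc = 0.0
    for v in vals:
        acc = fp16(acc + v)
    return acc

def sum_pairwise(vals):
    # Add lights pairwise, then those results pairwise: operands stay the
    # same magnitude, so each fp16 add wastes far less precision.
    if len(vals) == 1:
        return vals[0]
    mid = len(vals) // 2
    return fp16(sum_pairwise(vals[:mid]) + sum_pairwise(vals[mid:]))

lights = [fp16(0.9)] * 64          # 64 roughly equal lights (illustrative)
print(sum_pairwise(lights))        # 57.59375: the exact sum of the fp16 inputs
print(sum_sequential(lights))      # drifts away from 57.59375
```

Both variants perform 64 adds; only the shape of the reduction tree differs.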

Optimizing for fp16 is kind of similar to optimizing data storage. You need to know your data and do some analysis. People have been optimizing their storage (memory bandwidth) for ages. Slides 33-36 of this presentation are a good example (error analysis on a compressed normal): https://michaldrobot.files.wordpress.com/2014/05/gcn_alu_opt_digitaldragons2014.pptx

(*). Subtracting the camera position from world space positions itself causes catastrophic cancellation. But you do it only once, and before any other math. If your world is large, I recommend using uint32 for positions instead of fp32. 3x uint32 (xyz) can represent the whole earth at a few millimeter precision (including all the space inside the earth). You only need a single integer ALU instruction (subtraction) to convert world space coordinates to camera space coordinates. Follow it with a single float multiply-add to scale the coordinate accordingly. Integer subtraction is full rate on all GPUs. No catastrophic cancellation at all.
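A sketch of the integer scheme in Python; the step size is an assumption for illustration (3/1024 m per unit is exact in binary, and 2^32 such steps span ~12,583 km, roughly the earth's diameter):

```python
import struct

def to_fp32(x):
    # Round through IEEE 754 single precision (GPU fp32).
    return struct.unpack('f', struct.pack('f', x))[0]

STEP = 0.0029296875   # metres per unit = 3/1024, exact in binary (illustrative)

def world_to_camera_axis(p, c):
    # One uint32 subtraction, wrapping modulo 2**32 like GPU integer math...
    d = (p - c) & 0xFFFFFFFF
    if d >= 1 << 31:              # reinterpret the wrapped result as signed
        d -= 1 << 32
    # ...followed by one float multiply to scale into camera space metres.
    return to_fp32(d * STEP)

print(world_to_camera_axis(1_001_000, 1_000_000))   # 2.9296875 m ahead
print(world_to_camera_axis(1_000_000, 1_001_000))   # -2.9296875 m behind
```

The integer subtract is exact for any pair of positions in the world, so no precision is lost before the single scaling multiply.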
     
    #29 sebbbi, Oct 31, 2016
    Last edited: Oct 31, 2016
    Kej, Rodéric, pcchen and 7 others like this.
  10. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
Agreed. People should use int32 world space coordinates instead (bounded around the game world). Integer coordinates guarantee even precision across the game world. Float precision is only good in the middle of the game world; the further away you go, the lower the precision becomes. This is a testing nightmare. Int32 guarantees 256x better minimum precision than fp32 (regardless of the world size). This allows 256x256x256 (3d) larger game worlds than fp32 (that is 65536x more square miles of terrain), which is more than enough for most games. No need to program silly world shifting hacks to make it work.
     
    digitalwanderer, Alexko, BRiT and 4 others like this.
  11. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    Now that I didn't know. Just to be 100% clear, I was referring to precision in shaders. Your info is definitely new to me, though! Comes with being a laywoman, etc.
     
  12. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
Floating point is floating point everywhere: the same problems appear in float code on the CPU and on the GPU. Time counters are obviously mostly handled on the CPU side, but some engines have GPU side animation based on time counters as well, and/or GPU side particle simulation. Floating point (fp32) world space positions are problematic for both CPU and GPU as the world size increases. As I said earlier, shader code can sidestep this problem somewhat by using a camera centered coordinate space.

But in general fp16 ALU is more applicable to shaders (rendering), as a big part of per pixel work is done on LDR data in [0,1] range (or [-1,1] range for normal vectors). Even for HDR color processing, the final data output is 8 bits (or 10 bits for HDR10). Fp16 is sufficient for most of the math, but the developer has to be much more careful and aware of the numeric ranges. Fp16 optimization is definitely not easy to do right.

    This is also the reason why programmers working on scientific simulations prefer pure fp64. You could get the same results with mixed fp32 & fp64 code, but it is much harder to get it right. Same is true for mixed fp16 & fp32 code.
     
  13. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    Does this mean that HDR games will be able to take less advantage of FP16 in the future? That would be sadly ironic given that future hardware will enable double rate FP16 just as we're moving into the era of HDR games.
     
  14. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    FP16 has 11 significant bits so it is enough for HDR10, not enough for HDR12. But then there's also INT16.
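Treating code values as integers, as above, the 11 significant bits can be checked directly with Python's `struct` module (`'e'` = IEEE 754 half):

```python
import struct

def to_fp16(x):
    # Round-trip through IEEE 754 half precision.
    return struct.unpack('e', struct.pack('e', x))[0]

# An 11 bit significand means integers are exact only up to 2**11 = 2048.
print(to_fp16(1023.0))   # 1023.0: every 10 bit code value (0..1023) survives
print(to_fp16(2049.0))   # 2048.0: above 2048, odd integers are no longer exact
print(to_fp16(4095.0))   # 4096.0: the top 12 bit code value rounds away
```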
     
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
HDR10 and HDR12 are not linear. The HDR curve is closer to a floating point curve (logarithmic) than linear (uint16 normalized). Fp output is definitely better than normalized integer output (linear brightness). The exponent bits matter (except for the highest ones, as the HDR standard's max brightness doesn't reach 65504). The fp16 sign bit is obviously wasted. Fp16 output should be fine for HDR12.

    Fp16 math on HDR12 output is debatable. I would prefer at least 1 bit more ALU precision compared to storage/output precision (otherwise you get no rounding). Temporal supersampling (8x) + jittered rounding recover roughly 3 bits of color depth. Fp16 math before temporal pass and fp32 math after it should be fine for most cases.
     
  16. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
I suppose there may be some blog and forum literature in the context of mobile games?
     
    #36 Blazkowicz, Oct 31, 2016
    Last edited: Oct 31, 2016
  17. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    You're right. I was only thinking of the final output... Slight tangent here: what are actual display formats that HDR displays accept?
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    You create DXGI_FORMAT_R16G16B16A16_FLOAT swap chain. 1.0 is mapped to 80 nits. Driver converts to correct format (HDR10 / HDR12).

    More info:
    https://developer.nvidia.com/displaying-hdr-nuts-and-bolts
     
    Kej, I.S.T. and BRiT like this.
  19. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,049
    Likes Received:
    1,010
    This touches on something I have thought about.
While it is relatively easy to just look at the number formats and proclaim that something is sufficient or not, the underlying application here is not numerical analysis, but gaming. And it seems to me that in order for a numerical error to actually matter, a pixel error has to:
1. Be large enough to be readily detectable.
2. Be consistent over time, as errant pixels showing up in a single frame are extremely unlikely to get noticed.
3. Correlate with similar errors on its neighbours to create a larger area that is objectionably anomalous (and consistently so over time).

And to judge that, you need hands-on experience performing precision experiments with actual games. Just thinking about it, it would seem you could get away with a lot. But that's just being an armchair expert, no better than just performing the numerical analysis.
    How does it play out in reality?
     
    DavidGraham and milk like this.
  20. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    Thanks Sebbbi, I read that document a long time ago. Your other advice is appreciated.
     