Native FP16 support in GPU architectures

Discussion in 'Architecture and Products' started by xpea, Oct 17, 2014.

  1. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    If you ignore the exception of a handful of layman tech geeks like some of us, the majority doesn't and shouldn't really need to know how many FLOPs each solution has. In the given case IMG isn't marketing end products, since that's the rightful job of its licensees, but even then I don't recall a single case where Apple or anyone else ever quoted N GFLOPs of whatever for their GPUs.

    DP FLOPs and CPU cores aside, it is actually NVIDIA that started marketing its GPUs more actively than anyone else in the ULP SoC market, where all of a sudden an ALU lane became a "core" and the recent GK20A in Tegra K1 went from the initial "projected" 364 GFLOPs down to 326 GFLOPs on developer boards.

    If manufacturers were being knocked out of their socks by those raw numbers, and if the numbers actually appealed to the average consumer, they'd be standing in line by now to get K1 SoCs into their devices. In reality it seems to be doing well, but so far I haven't seen the foundations of the ULP market moving either.

    Personally, as a matter of fact, I don't even oppose the above, to be honest; marketing based on GFLOPs (I'm just borrowing it as an example here) seems far healthier than device N getting 50k points in something as worthless as AnTuTu, or a gazillion vertices/sec quoted for any other GPU.
     
  2. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    You make it sound like all the ideas come from somewhere other than engineering. I doubt that's true where you worked, but I know it's not true everywhere. Feature ideas come from everywhere, and a company that doesn't have all its elements coming up with ideas isn't healthy.
     
  3. MrMilli

    Newcomer

    Joined:
    Apr 6, 2008
    Messages:
    10
    Likes Received:
    0
    This confusion about the added FP16 units is really weird. IMG stated that the reason they added those units is power consumption. If the developer uses FP16 for objects that don't need the precision of FP32, it cuts power consumption because the more complex FP32 ALUs don't need to be fired up. Considering the small screens of devices outfitted with IMG tech, it makes perfect sense.

    I added this quote from Anand.
    On the topic of Tegra K1, FP64 runs at 1/24 the FP32 rate. Meaning it's there because the original design had it; it's not in any way useful at that performance level.

    As for the Apple A8, even the quad-core GX6450 performs nicely, and Apple is known for not pushing clock frequencies. The comparison to the K1 falls short because you're comparing a phone SoC to a tablet SoC. The real comparison will come with the A8X.
     
  4. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    True for the first, and the latter is rather irrelevant in the context I've put it in. Since my point is about the design decision to sacrifice more die area in order to save power (which was the purpose behind the dedicated FP64 units in Kepler and onwards), I'm willing to bet that even though GK20A's peak FP64 rate should be lower than that of any of the existing ARM Midgard cores: http://forum.beyond3d.com/showpost.php?p=1856222&postcount=184 the DP rate/mW is in GK20A's favor. Vivante also has double precision if memory serves, and I suspect a 4:1 SP/DP ratio from the same ALUs there as well.

    I don't disagree at all; it's just that there's a very specific reasoning behind me bringing up the FP64 units in GK20A. Even assuming they'd clock at a peak of 850MHz (which sounds very aggressive), for FP64 that's merely 13.6 GFLOPs DP at best.
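    A quick back-of-the-envelope check of that figure (a sketch assuming GK20A's 192 FP32 lanes at Kepler's 1/24 DP ratio, i.e. 8 dedicated FP64 units, each counting an FMA as 2 FLOPs per clock):

```python
# Back-of-the-envelope peak FP64 rate for GK20A.
# Assumed figures: 192 FP32 lanes, 1/24 DP ratio, FMA = 2 FLOPs/clock.
fp32_lanes = 192
dp_ratio = 1 / 24
clock_ghz = 0.85                              # the hypothetical 850MHz peak

fp64_units = fp32_lanes * dp_ratio            # 8 dedicated FP64 units
peak_dp_gflops = fp64_units * 2 * clock_ghz   # FMA counts as 2 FLOPs

print(peak_dp_gflops)  # 13.6 GFLOPs DP
```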
     
  5. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    Well, wait a second. The Series6 G6400 is supposed to have exclusively FP32 ALUs for rendering purposes. So how in the world does it come up with ~2400 mB PSNR for lower precision and ~3500 mB PSNR for higher precision in GFXBench, when mobile Kepler has a much higher ~4460 mB PSNR for both metrics?
     
  6. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    Running existing software such as Angry Birds faster is not at all challenging for the latest and greatest ultra-mobile GPUs (and if I want extra battery life, I can set a framerate target of 30fps to dramatically extend it in these games). The challenge is to bring higher-fidelity PC- and console-class graphics and games to the ultra-mobile space without dramatically sacrificing image quality, and without requiring game developers to significantly rework their code while trying to figure out when and when not to use higher-precision shaders. The PC and console game development work needs to be heavily leveraged before bringing these higher-quality games to mobile.
     
  7. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Again, go back to one of my former replies where I gave you links in which they explain their design choice of skipping improved rounding in their ALUs, which kept Rogue at DX10.0. If you'd bother to actually read what others have to say, I wouldn't have to repeat the same cheese for the third time already.

    Even the PC will increasingly turn in the direction of saving as much power as it can. Having dedicated FP16 ALUs DOES NOT stop that process; it supports it. If you had bothered to also read the two pages of the Rogue thread where you asked for someone to decipher the PSNR stuff for you, you would have noticed that IMG is pushing to get rid of lowp entirely, which is a push for higher-precision shaders from the bottom to the top.

    You still don't want to understand how the hw actually works or what the real purpose of the dedicated FP16 ALUs is, but continue to ride on the same tired yadda yadda for several pages and threads. The FP32 hw is there, and on most occasions at higher rates than competing hw.

    Last but not least: porting console or PC games to ULP mobile without any serious leverage is, on a side note, nonsense, because you'd have other headaches such as storage, download bandwidth, prices and whatnot. ULP mobile is at best about up-to-$10 games such as Infinity Blade, and yes, you'll also have stuff like Angry Birds or Farm Heroes Saga and whatnot. There's no place on those devices for big, lengthy $50 games, at least not yet, and I'm not sure there ever will be unless cloud services change the landscape radically in the future.

    Again, what sebbi said in another post also doesn't seem to have come across so far:

    While harping over and over again about supposed "console quality" amongst other things, the next best thing I expect to read is that the PS3 should not be counted as a console :roll:
     
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Lack of post processing effects in games is IMHO the biggest difference between mobile and console graphics. Mobile games tend to have zero post effects. FP16 is more than enough for post processing (DOF, bloom, motion blur, color correction, tone mapping). As FP16 makes post processing math 2x faster on Rogue (all the new iDevices), it will actually be a big thing towards enabling console quality graphics on mobile devices. Obviously FP16 is not enough alone, we also need to solve the bandwidth problem of post processing on mobiles. On chip solutions (like extending the tiling to support new things) would likely be the most power efficient answers.
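    As a rough illustration of why FP16 is enough for this class of work (a host-side numpy sketch, not actual shader code), a simple Reinhard tone-map evaluated in float16 stays within a fraction of one 8-bit display step of the float32 result:

```python
import numpy as np

# Sketch: Reinhard tone mapping x / (1 + x) in FP16 vs FP32 over an
# assumed HDR luminance range [0, 16]. The error stays well below one
# 8-bit display step (1/255 ~= 0.0039), so FP16 post effects are safe.
hdr = np.linspace(0.0, 16.0, 10001, dtype=np.float32)

tm32 = hdr / (1.0 + hdr)
h16 = hdr.astype(np.float16)
tm16 = (h16 / (np.float16(1.0) + h16)).astype(np.float32)

max_err = float(np.abs(tm32 - tm16).max())
print(max_err < 1.0 / 255.0)  # True
```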
     
    Heinrich4 likes this.
  9. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    They will, most likely, use FP32 for some pixel shader code. But the parts that need FP32 aren't actually that frequent. Some developers will want to play it safe and use FP32 wherever there is any doubt, but even then there are shader parts that are obviously fine with FP16.

    I don't think it is particularly challenging to write pixel shader code for a "console quality", "high fidelity" mobile game that is something like 70% FP16 and 30% FP32.

    There's more to a GPU than just ALUs. My guess for this specific case would be precision of varyings (post-transform vertex attributes) stored in memory. As a TBDR, Rogue writes transformed vertices to memory. It would be perfectly sensible to store mediump varyings as FP16 values.
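    If that guess is right, the cost of FP16 varyings is easy to bound (a numpy sketch under the assumption that mediump varyings hold values in [0, 1], like colors or normal components): half the memory traffic, with quantization error far below what an 8-bit framebuffer can resolve.

```python
import numpy as np

# Sketch: storing [0, 1] varyings as FP16 halves the bytes written by a
# TBDR while keeping ~2^-11 relative precision, below one 8-bit step.
rng = np.random.default_rng(0)
varyings = rng.random(1 << 16, dtype=np.float32)

stored = varyings.astype(np.float16)    # what would be written to memory
restored = stored.astype(np.float32)    # read back for shading

print(stored.nbytes * 2 == varyings.nbytes)              # True: half the traffic
print(float(np.abs(varyings - restored).max()) < 1/255)  # True: invisible at 8 bits
```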

    But that's a guess. The only information I can readily find on this specific test is this quote: "The overall score is the peak signal-to-noise ratio (PSNR) based on mean square error (MSE) compared to a pre-rendered reference image. There are also two variants of this test – one forces shaders to run with high precision, while the other does not."
    And without knowing where the reference image comes from, how it was generated, and what the scene was tuned for, I don't think I can gain any useful understanding from this test at all.
     
  10. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    I sometimes wish modern GPUs still had FP16 support. Not even because of throughput, but because of register file pressure. We have compute shaders where occupancy was a real problem.
     
  11. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    9,045
    Likes Received:
    1,119
    Location:
    WI, USA
    This thread brings forth magical NV3x memories! Though I suppose it's a little more current than that for those of you working on PS3... :)
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York

    Lol I was thinking the same thing. Funny how FP16 was taboo back in 2003.
     
  13. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,107
    Location:
    35.1415,-90.056
    I think it's better than you realize. The Core 2 Duo E8600 at 3.33GHz was basically the fastest and latest dual-core of that era. I chose this one because it has the best potential showing of IPC for that generation, compared to the quads that also existed at the time but had some interesting tradeoffs.

    Compare that E8600 to a current Haswell i5-4440S. They're both 3.3-ish GHz in single-threaded code (technically the E8600 is 33MHz faster), they both have similar TDP ratings, they both have similar last-level cache sizes, but the i5 clearly pulls away in IPC:

    The Intel ARK specification comparison: http://ark.intel.com/compare/35605,75040

    An amalgamation of a ton of benchmark scores: http://www.cpu-world.com/Compare/592/Intel_Core_2_Duo_E8600_vs_Intel_Core_i5_i5-4440S.html

    I'm guessing the disparity could be partly linked to memory latency and throughput enhancements, but even so, that's still IPC that wasn't available in the C2D world.

    Intel continues, generation after generation, to deliver IPC increases in whatever way they can. It's amazing to me that IPC has continued to increase, even if only slightly, simply because it's still a compounding gain after a few generations of single-digit-percentage gains.
     
  14. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Unless I'm understanding something wrong here: do you mean that NV3x memories are more current than anything PS3? :???:
     
  15. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    The thing that everyone's forgetting is that ALUs aren't the main driver of power consumption or die area any more - it's communication across the chip and off it that really does it, as well as complexity inside each scheduler. One of the problems is that adding FP16 makes your scheduler more complex, since it adds an extra set of instructions, as well as a big crossbar to let data get to the FP16 ALUs. This means that adding FP16 support may actually take more power, and even die space, than just using FP32 everywhere.

    Also, trying to make your register file accessible as both 16 and 32 bits would probably be a big mess. You'd have extra bank conflicts, need more ports, or perhaps just go to a second layer of SIMD (Nvidia actually does this for its multimedia instructions, though I don't know if Maxwell still has them). Overall, you'd probably lose more than you gained.

    Now, FP16 only requires half as much bandwidth on and off the chip and/or from the caches. But FP32 can take advantage of this too - simply read it as FP16 and convert to/from FP32 in the core. This is very cheap, since all you have to do is reroute bits. This is the approach Nvidia uses - others may do it too, I'm not sure.
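    A host-side sketch of that storage-vs-compute split (an analogy in numpy, not what any particular GPU literally does): data travels as FP16 to halve the bytes moved, is widened to FP32 for the arithmetic, then narrowed again on the way out.

```python
import numpy as np

# Sketch: traffic in FP16, math in FP32. The widen/narrow steps stand in
# for the cheap bit-rerouting conversion described above.
data_fp16 = np.arange(1024, dtype=np.float16)   # what crosses the bus/caches

widened = data_fp16.astype(np.float32)          # "convert on read"
result32 = np.sqrt(widened) * np.float32(0.5)   # full-precision FP32 math
result_fp16 = result32.astype(np.float16)       # narrow before writing back

print(data_fp16.nbytes * 2 == widened.nbytes)   # True: FP16 moves half the bytes
```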

    Another thing to consider is that some computations need FP32 (like vertices - half precision isn't enough to address a single pixel!), while others can get away with FP16. However, in a unified core you have to assume you need FP32 in many cases, so you really have to have both, or else move back to separate pixel and vertex shaders, which would bring programmability back into the dark ages.
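    The vertex-precision point follows directly from FP16's 10-bit mantissa: above 2048, consecutive integers stop being representable, so a half-precision screen coordinate can't distinguish adjacent pixels on a large render target. A minimal demonstration:

```python
import numpy as np

# FP16 has 11 significant bits, so integer coordinates are exact only up
# to 2048; beyond that, neighboring pixel positions collapse together.
print(np.float16(2049) == np.float16(2048))  # True: 2049 rounds to 2048
print(np.float16(1025) == np.float16(1024))  # False: still exact below 2048
```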
     
    #55 keldor314, Oct 21, 2014
    Last edited by a moderator: Oct 21, 2014
  16. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    I'm sure the people designing those chips did not "forget" this. They simply came to a different result.
     
  17. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    805
    Likes Received:
    1,635
    But keldor is right: the actual win from FP16 is nowhere close to 2x perf/watt for the whole SoC in real games; it's mostly single-digit percentages.
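    That single-digit figure is what simple power budgeting predicts (a sketch with purely illustrative, made-up fractions): if ALU math is only a modest slice of SoC power and only part of it can drop to FP16, the whole-SoC saving is small.

```python
# Amdahl-style sketch with made-up fractions: even if FP16 halves the
# energy of the math it covers, the whole-SoC saving stays single-digit.
alu_share_of_soc_power = 0.20   # assumption: ALU math is 20% of SoC power
fp16_eligible_fraction = 0.50   # assumption: half the shader math fits FP16
fp16_energy_saving = 0.50       # idealized: an FP16 op costs half an FP32 op

soc_saving = alu_share_of_soc_power * fp16_eligible_fraction * fp16_energy_saving
print(f"{soc_saving:.0%}")  # 5%
```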
     
  18. milk

    milk Like Verified
    Veteran

    Joined:
    Jun 6, 2012
    Messages:
    3,977
    Likes Received:
    4,102
    Keldor is speculating. The engineers at PowerVR ran actual tests.
     
  19. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Even that is still a win. And we're also talking about single digit % SoC area increases.

    If anyone expected "close to 2 times perf/watt for the whole SOC in real games", I don't see it in this thread.
     
  20. mc6809e

    Newcomer

    Joined:
    Jan 24, 2007
    Messages:
    46
    Likes Received:
    5
    But they don't have similar L1 caches. The i5 has twice the L1 that the Core 2 does. That makes a big difference.

    Double the L1 of the Core 2 Duo and you'd probably get near-identical performance to the i5.

    What Intel is great at is using advanced processes to pack huge numbers of transistors onto wafers, allowing for big, fast caches. It's a huge advantage, but it's mostly just fast memory. You have to give credit to Intel's materials and circuit engineers for keeping the x86 arch alive for so long.

    And look at the way AMD and NVIDIA are chafing to get to 22/20nm. Intel has a weaker GPU arch, but its access to the most advanced process still allows it to stay in the game.

    Not that I expect Intel to displace NVIDIA, but Intel is still stealing some of the low end away from them.
     