Native FP16 support in GPU architectures

Discussion in 'Architecture and Products' started by xpea, Oct 17, 2014.

  1. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    If you ignore the exception of a handful of layman tech geeks like some of us, the majority doesn't and shouldn't really need to know how many FLOPs each solution has. In the given case IMG isn't marketing end products, since that's the rightful job of its licensees, but even then I don't recall a single case where Apple or anyone else ever quoted N GFLOPs of whatever for their GPUs.

    DP FLOPs and CPU cores aside, it is actually NVIDIA that started marketing its GPUs more actively than anyone else in the ULP SoC market, where all of a sudden an ALU lane became a "core" and the recent GK20A in Tegra K1 went from the initial "projected" 364 GFLOPs down to 326 GFLOPs on developer boards.

    If manufacturers were being knocked out of their socks by those raw numbers, and if the numbers actually appealed to the average consumer, they'd be standing in line by now to get K1 SoCs into their devices. In reality it seems to be doing well, but so far I haven't seen the foundations of the ULP market moving either.

    Personally, as a matter of fact, I don't even oppose the above, to be honest; marketing based on GFLOPs (I'm just borrowing it as an example here) seems far healthier than device N getting 50k points in something as worthless as AnTuTu, or a gazillion vertices/sec quoted for any other GPU.
     
  2. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,493
    Likes Received:
    474
    You make it sound like all the ideas come from somewhere other than engineering. I doubt that's true where you worked, but I know it's not true everywhere. Feature ideas come from everywhere, and a company that doesn't have all its elements coming up with ideas isn't healthy.
     
  3. MrMilli

    Newcomer

    Joined:
    Apr 6, 2008
    Messages:
    10
    Likes Received:
    0
    This confusion about the added FP16 units is really weird. IMG stated that the reason they added those units is power consumption. If the developer uses FP16 for objects that don't need the precision of FP32, it cuts power consumption because the more complex FP32 ALUs don't need to be fired up. Considering the small screens of devices outfitted with IMG tech, it makes perfect sense.

    I added this quote from Anand.
    On the topic of Tegra K1, FP64 runs at 1/24 the FP32 rate. Meaning it's there because the original design had it; it's not in any way useful at that performance level.

    As for the Apple A8, even the quad-core GX6450 performs nicely, and Apple is known for not pushing clock frequencies. The comparison to the K1 falls short because you're comparing a phone SoC to a tablet SoC. The real comparison will come with the A8X.
     
  4. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    True for the first, and the latter is rather irrelevant in the context I've put it in. Since my point is about the design decision to sacrifice more die area in order to save power (which was the purpose behind the dedicated FP64 units in Kepler and onwards), I'm willing to bet that even though GK20A's peak FP64 rate should be lower than that of any of the existing ARM Midgard cores: http://forum.beyond3d.com/showpost.php?p=1856222&postcount=184 the DP rate/mW is in GK20A's favor. Vivante also has double precision if memory serves, and I suspect a 4:1 SP/DP ratio from the same ALUs there as well.

    I don't disagree at all; it's just that there's a very specific reasoning behind me bringing up the FP64 units in GK20A. Even assuming they'd clock at a peak of 850MHz (which sounds very aggressive), for FP64 that's merely 13.6 GFLOPs DP at best.
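    A quick back-of-the-envelope check of that figure (a sketch assuming GK20A's 192 FP32 lanes at Kepler's 1/24 DP ratio, i.e. 8 dedicated FP64 units, each counting an FMA as 2 FLOPs per clock):

```python
# Back-of-the-envelope peak FP64 rate for GK20A.
# Assumed figures: 192 FP32 lanes, 1/24 DP ratio, FMA = 2 FLOPs/clock.
fp32_lanes = 192
dp_ratio = 1 / 24
clock_ghz = 0.85                              # the hypothetical 850MHz peak

fp64_units = fp32_lanes * dp_ratio            # 8 dedicated FP64 units
peak_dp_gflops = fp64_units * 2 * clock_ghz   # FMA counts as 2 FLOPs

print(peak_dp_gflops)  # 13.6 GFLOPs DP
```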
     
  5. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    Well, wait a second. The Series6 G6400 is supposed to have exclusively FP32 ALUs for rendering purposes. So how in the world does it come up with ~2400 mB PSNR for lower precision and ~3500 mB PSNR for higher precision in GFXBench, when mobile Kepler has a much higher ~4460 mB PSNR for both metrics?
     
  6. ams

    ams
    Regular

    Joined:
    Jul 14, 2012
    Messages:
    914
    Likes Received:
    0
    Running existing software such as Angry Birds faster is not at all challenging for the latest and greatest ultra-mobile GPUs (and if I want extra battery life, I can set a framerate target of 30fps to dramatically extend it in these games). The challenge is to bring higher-fidelity PC- and console-class graphics and games to the ultra-mobile space without dramatically sacrificing image quality, and without requiring game developers to significantly rework their code while trying to figure out when and when not to use higher-precision shaders. The PC and console game development work needs to be heavily leveraged before bringing these higher-quality games to mobile.
     
  7. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Again, go back to one of my former replies where I gave you links in which they explain their design choice of skipping improved rounding in their ALUs, which kept Rogue at DX10.0. If you'd bother to actually read what others have to say, I wouldn't have to repeat the same cheese for the third time already.

    Even the PC will increasingly turn in the direction of saving as much power as it can. Having dedicated FP16 ALUs DOES NOT stop that process; it supports it. If you had bothered to also read the two pages of the Rogue thread where you asked for someone to decipher the PSNR stuff for you, you would have noticed that IMG is pushing to get rid of lowp entirely, which is a push for higher-precision shaders from the bottom to the top.

    You still don't want to understand how the hw actually works or what the real purpose of the dedicated FP16 ALUs is, but continue to ride on the same tired yadda yadda for several pages and threads. The FP32 hw is there, and on most occasions at higher rates than competing hw.

    Last but not least: porting console or PC games to ULP mobile without any serious leverage is, on a side note, nonsense, because you'd have other headaches such as storage, download bandwidth, prices and whatnot. ULP mobile is at best about up-to-$10 games such as Infinity Blade, and yes, you'll also have stuff like Angry Birds or Farm Heroes Saga and whatnot. There's no place on those devices for big, lengthy $50 games, at least not yet, and I'm not sure there ever will be unless cloud services change the landscape radically in the future.

    Again, what sebbi said in another post also doesn't seem to have come across so far:

    While harping over and over again about supposed "console quality" amongst other things, the next best thing I expect to read is that the PS3 should not be counted as a console :roll:
     
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Lack of post processing effects in games is IMHO the biggest difference between mobile and console graphics. Mobile games tend to have zero post effects. FP16 is more than enough for post processing (DOF, bloom, motion blur, color correction, tone mapping). As FP16 makes post processing math 2x faster on Rogue (all the new iDevices), it will actually be a big thing towards enabling console quality graphics on mobile devices. Obviously FP16 is not enough alone, we also need to solve the bandwidth problem of post processing on mobiles. On chip solutions (like extending the tiling to support new things) would likely be the most power efficient answers.
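    As a rough illustration of why FP16 is enough for this class of work (a host-side numpy sketch, not actual shader code), a simple Reinhard tone-map evaluated in float16 stays within a fraction of one 8-bit display step of the float32 result:

```python
import numpy as np

# Sketch: Reinhard tone mapping x / (1 + x) in FP16 vs FP32 over an
# assumed HDR luminance range [0, 16]. The error stays well below one
# 8-bit display step (1/255 ~= 0.0039), so FP16 post effects are safe.
hdr = np.linspace(0.0, 16.0, 10001, dtype=np.float32)

tm32 = hdr / (1.0 + hdr)
h16 = hdr.astype(np.float16)
tm16 = (h16 / (np.float16(1.0) + h16)).astype(np.float32)

max_err = float(np.abs(tm32 - tm16).max())
print(max_err < 1.0 / 255.0)  # True
```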
     
    Heinrich4 likes this.
  9. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    They will, most likely, use FP32 for some pixel shader code. But the parts that need FP32 aren't actually that frequent. Some developers will want to play it safe and use FP32 wherever there is any doubt, but even then there are shader parts that are obviously fine with FP16.

    I don't think it is particularly challenging to write pixel shader code for a "console quality", "high fidelity" mobile game that is something like 70% FP16 and 30% FP32.

    There's more to a GPU than just ALUs. My guess for this specific case would be precision of varyings (post-transform vertex attributes) stored in memory. As a TBDR, Rogue writes transformed vertices to memory. It would be perfectly sensible to store mediump varyings as FP16 values.
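    If that guess is right, the cost of FP16 varyings is easy to bound (a numpy sketch under the assumption that mediump varyings hold values in [0, 1], like colors or normal components): half the memory traffic, with quantization error far below what an 8-bit framebuffer can resolve.

```python
import numpy as np

# Sketch: storing [0, 1] varyings as FP16 halves the bytes written by a
# TBDR while keeping ~2^-11 relative precision, below one 8-bit step.
rng = np.random.default_rng(0)
varyings = rng.random(1 << 16, dtype=np.float32)

stored = varyings.astype(np.float16)    # what would be written to memory
restored = stored.astype(np.float32)    # read back for shading

print(stored.nbytes * 2 == varyings.nbytes)              # True: half the traffic
print(float(np.abs(varyings - restored).max()) < 1/255)  # True: invisible at 8 bits
```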

    But that's a guess. The only information I can readily find on this specific test is this quote: "The overall score is the peak signal-to-noise ratio (PSNR) based on mean square error (MSE) compared to a pre-rendered reference image. There are also two variants of this test – one forces shaders to run with high precision, while the other does not."
    And without knowing where the reference image comes from, how it was generated, and what the scene was tuned for, I don't think I can gain any useful understanding from this test at all.
     
  10. Novum

    Regular

    Joined:
    Jun 28, 2006
    Messages:
    335
    Likes Received:
    8
    Location:
    Germany
    I sometimes wish modern GPUs still had FP16 support. Not even because of throughput, but because of register file pressure. We have compute shaders where occupancy was a real problem.
     
  11. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    9,045
    Likes Received:
    1,119
    Location:
    WI, USA
    This thread brings forth magical NV3x memories! Though I suppose it's a little more current than that for those of you working on PS3... :)
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York

    Lol I was thinking the same thing. Funny how FP16 was taboo back in 2003.
     
  13. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,107
    Location:
    35.1415,-90.056
    I think it's better than you realize. The Core 2 Duo E8600 at 3.33GHz was basically the fastest and latest dual-core of that era. I chose this one because it has the best potential showing of IPC for that generation, compared to the quads that also existed at the time but had some interesting tradeoffs.

    Compare that E8600 to a current Haswell i5-4440S. They're both 3.3-ish GHz in single-threaded code (technically the E8600 is 33MHz faster), they both have similar TDP ratings, they both have similar last-level cache sizes, but the i5 clearly pulls away in IPC:

    The Intel ARK specification comparison: http://ark.intel.com/compare/35605,75040

    An amalgamation of a ton of benchmark scores: http://www.cpu-world.com/Compare/592/Intel_Core_2_Duo_E8600_vs_Intel_Core_i5_i5-4440S.html

    I'm guessing the disparity could be partly linked to memory latency and throughput enhancements, but even so, that's still IPC that wasn't available in the C2D world.

    Intel continues, generation after generation, to deliver IPC increases in whatever way they can. It's amazing to me that IPC has continued to increase, even if only slightly, simply because it's still a compounding gain after a few generations of single-digit-percentage gains.
     
  14. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Unless I'm understanding something wrong here: do you mean that NV3x memories are more current than anything PS3? :???:
     
  15. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    The thing that everyone's forgetting is that ALUs aren't the main driver of power consumption or die area any more - it's communication across the chip and off it that really does it, as well as complexity inside each scheduler. One of the problems is that adding FP16 makes your scheduler more complex, since it adds an extra set of instructions, as well as a big crossbar to let data get to the FP16 ALUs. This means that adding FP16 support may actually take more power, and even die space, than just using FP32 everywhere.

    Also, trying to make your register file accessible as both 16 and 32 bits would probably be a big mess. You'd have extra bank conflicts, need more ports, or perhaps just go to a second layer of SIMD (Nvidia actually does this for its multimedia instructions, though I don't know if Maxwell still has them). Overall, you'd probably lose more than you gained.

    Now, FP16 only requires half as much bandwidth on and off the chip and/or from the caches. But FP32 can take advantage of this too - simply read it as FP16 and convert to/from FP32 in the core. This is very cheap, since all you have to do is reroute bits. This is the approach Nvidia uses - others may do it too, I'm not sure.
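    A host-side sketch of that storage-vs-compute split (an analogy in numpy, not what any particular GPU literally does): data travels as FP16 to halve the bytes moved, is widened to FP32 for the arithmetic, then narrowed again on the way out.

```python
import numpy as np

# Sketch: traffic in FP16, math in FP32. The widen/narrow steps stand in
# for the cheap bit-rerouting conversion described above.
data_fp16 = np.arange(1024, dtype=np.float16)   # what crosses the bus/caches

widened = data_fp16.astype(np.float32)          # "convert on read"
result32 = np.sqrt(widened) * np.float32(0.5)   # full-precision FP32 math
result_fp16 = result32.astype(np.float16)       # narrow before writing back

print(data_fp16.nbytes * 2 == widened.nbytes)   # True: FP16 moves half the bytes
```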

    Another thing to consider is that some computations need FP32 (like vertices - half precision isn't enough to address a single pixel!), while others can get away with FP16. However, in a unified core you have to assume you need FP32 in many cases, so you really have to have both, or else move back to separate pixel and vertex shaders, which would bring programmability back into the dark ages.
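    The vertex-precision point follows directly from FP16's 10-bit mantissa: above 2048, consecutive integers stop being representable, so a half-precision screen coordinate can't distinguish adjacent pixels on a large render target. A minimal demonstration:

```python
import numpy as np

# FP16 has 11 significant bits, so integer coordinates are exact only up
# to 2048; beyond that, neighboring pixel positions collapse together.
print(np.float16(2049) == np.float16(2048))  # True: 2049 rounds to 2048
print(np.float16(1025) == np.float16(1024))  # False: still exact below 2048
```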
     
    #55 keldor314, Oct 21, 2014
    Last edited by a moderator: Oct 21, 2014
  16. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    I'm sure the people designing those chips did not "forget" this. They simply came to a different result.
     
  17. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    805
    Likes Received:
    1,635
    But keldor is right: the actual win from FP16 is nowhere close to 2x perf/watt for the whole SoC in real games; it's mostly single-digit percentages.
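    That single-digit figure is what simple power budgeting predicts (a sketch with purely illustrative, made-up fractions): if ALU math is only a modest slice of SoC power and only part of it can drop to FP16, the whole-SoC saving is small.

```python
# Amdahl-style sketch with made-up fractions: even if FP16 halves the
# energy of the math it covers, the whole-SoC saving stays single-digit.
alu_share_of_soc_power = 0.20   # assumption: ALU math is 20% of SoC power
fp16_eligible_fraction = 0.50   # assumption: half the shader math fits FP16
fp16_energy_saving = 0.50       # idealized: an FP16 op costs half an FP32 op

soc_saving = alu_share_of_soc_power * fp16_eligible_fraction * fp16_energy_saving
print(f"{soc_saving:.0%}")  # 5%
```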
     
  18. milk

    milk Like Verified
    Veteran

    Joined:
    Jun 6, 2012
    Messages:
    3,977
    Likes Received:
    4,102
    Keldor is speculating. The engineers at PowerVR ran actual tests.
     
  19. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Even that is still a win. And we're also talking about single digit % SoC area increases.

    If anyone expected "close to 2 times perf/watt for the whole SOC in real games", I don't see it in this thread.
     
  20. mc6809e

    Newcomer

    Joined:
    Jan 24, 2007
    Messages:
    46
    Likes Received:
    5
    But they don't have similar L1 caches. The i5 has twice the L1 that the Core 2 does. That makes a big difference.

    Double the L1 of the Core 2 Duo and you'd probably get near-identical performance to the i5.

    What Intel is great at is using advanced processes to pack huge numbers of transistors onto wafers, allowing for big, fast caches. It's a huge advantage, but it's mostly just fast memory. You have to give credit to Intel's materials and circuit engineers for keeping the x86 arch alive for so long.

    And look at the way AMD and NVIDIA are chafing to get to 22/20nm. Intel has a weaker GPU arch, but its access to the most advanced process still allows it to stay in the game.

    Not that I expect Intel to displace NVIDIA, but Intel is still stealing some of the low end away from them.
     