Native FP16 support in GPU architectures

LOL I stopped here. You have no clue what you are talking about.
I will just keep this in a corner of my bookmarks and bring it back up a few months later :cool:

I suggest you read up on IMG some more, for example.

"have you ever worked for a GPU vendor ? I did and I can tell as a fact that marketing is as much important (even sometimes more) than engineering. New features request are from management, sales and marketing teams, then engineers are in charge to make it true. not the other way around..."

Isn't that the direct opposite of what the CEO at IMG said at one of the semi-recent AGMs? It might be how the other GPU vendors work, but IMG do not work the way you describe. Engineers have to predict the features marketing and customers will want years before marketing or customers even know they will want them. IMG look 5+ years into the future, which is too far ahead for most marketing departments or customers.
 
As stated before, there is no argument to be had here against FP16, only a misunderstanding on the part of those arguing against it.

1. PowerVR had an FP32 pipeline for pixel shader ops before.
2. They saw a way to improve performance by expanding dedicated FP16 resources.
3. They did that, and overall performance did improve! Their approach was right. End of story.

If an app calls for FP32 precision in a shader op, a compliant core like PowerVR does it at FP32. Obviously.

Since this baseless argument still persisted, though, I at least tried to formulate some kind of logical objection to using FP16 that the detractors could use as their argument, so that the discussion wasn't completely pointless. The only argument to be had is that somehow Imgtec is holding the industry back by expanding dedicated FP16 performance, but that line of reasoning was easy to show as ridiculous simply by using counterexamples.
 
have you ever worked for a GPU vendor? I did, and I can tell you for a fact that marketing is as important as (and sometimes even more important than) engineering. New feature requests come from management, sales and marketing teams, and then engineers are in charge of making them real, not the other way around...
I do work for a GPU vendor and I do participate in design reviews of future cores. I have seen one marketing person in one thread in the last year. He sent out one email with one sentence: "it's great to see the process work as it should". I was actually upset that he interrupted a productive conversation with this statement, but other than that - no, there were no sales or marketing teams influencing the process. Things are measured and scheduled and we deliver to the best of our abilities. There's no non-technical person between me and the CEO, including me and the CEO. I can talk to my manager's manager's manager about the internals of our hardware and we will be on the same page most of the time.

Is there anything else you'd like to know about working for a GPU vendor that actually cares about technology?
 
The argument against FP16 ALUs was lost from the start, however, as the improvement in game performance and benchmark results from a Series 6XT part (boosted in no small part by said FP16 ALUs) versus an otherwise similar Series 6 part with the same number of FP32 ALUs proved to be very real. FP32 ops performance obviously hasn't been much of a limiting factor.

That is incorrect. When Anandtech reviewed the iPhone 6, they found that the GX6450 in the iPhone 6 was "only" ~17% faster than the G6430 in the iPhone 5s in the 3DMark Unlimited Graphics score and in BaseMark Hangar Offscreen. The more significant performance gains were in GFXBench, which must make more use of the lower-precision FP16 ALUs.
 
The reasons have been given before. All else being equal, FP16 operations take less power and require less internal (and external) bandwidth, and the hardware takes much less die area for a given level of performance, which lowers cost and improves yield, which lowers cost again. Alternatively, for a given budget of die space and power draw, FP16 yields much better performance.
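To make the bandwidth half of that argument concrete, here is a minimal C sketch (the vertex layout and counts are made up for illustration, not any vendor's actual format): the same attributes stored as 16-bit half-float bit patterns occupy half the bytes of an FP32 layout, so half as much data has to cross the memory bus per vertex.

```c
/* Illustrative only: compare the memory/bandwidth footprint of a vertex
 * stream laid out with 32-bit floats versus 16-bit (half-float) components.
 * The 16-bit fields are stored as raw uint16_t bit patterns here; the
 * half-float encode/decode itself is left to the driver/hardware. */
#include <stdint.h>
#include <stdio.h>

typedef struct { float    pos[3]; float    uv[2]; float    normal[3]; } VertexFP32;
typedef struct { uint16_t pos[3]; uint16_t uv[2]; uint16_t normal[3]; } VertexFP16;

int main(void) {
    const size_t n = 1000000;  /* one million vertices */
    printf("FP32 vertex stream: %zu bytes\n", n * sizeof(VertexFP32));
    printf("FP16 vertex stream: %zu bytes\n", n * sizeof(VertexFP16));
    return 0;
}
```

Exactly the same halving applies to FP16 render targets and intermediate buffers, which is where the power and external-bandwidth savings mentioned above come from.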

If the goal is to achieve higher-fidelity, console-quality graphics, one would really need to rely on the FP32 ALUs for pixel rendering (and the FP32 ALUs cannot be used in conjunction with the FP16 ALUs). Note that SoC die size and transistor count actually increase, because the designers are adding FP16 ALUs on top of FP32 ALUs.

If you look at ImgTech's website, they note that "FP32 [offers] improved precision for better image quality in console-quality games".

If you look at their developer guide, they note that "lower precision calculations can be performed faster, but need to be used carefully to avoid trouble with visible artefacts being introduced".

I get that adding FP16 ALUs gives them a way to improve performance and power consumption (at the expense of rendering precision), but by making the FP32 ALUs a second-class citizen in unit counts, there will always be a huge deficit in performance between the two choices. And interestingly enough, the performance deficit is growing: in Series 6 there was a 1.5:1 ratio of FP16 ALUs to FP32 ALUs, and now with Series 6XT there is a 2:1 ratio.
 
That is incorrect. When Anandtech reviewed the iPhone 6, they found that the GX6450 in the iPhone 6 was "only" ~17% faster than the G6430 in the iPhone 5s in the 3DMark Unlimited Graphics score and in BaseMark Hangar Offscreen. The more significant performance gains were in GFXBench, which must make more use of the lower-precision FP16 ALUs.

Is there any specific proof that GFXBench makes more use of FP16 ALUs, or is it just a gut feeling? How much faster is a G6430 than a G6400 Rogue at the same frequency in Manhattan or T-Rex, when the former has both FP32 and FP16 ALUs while the latter has exclusively FP32 ALUs? http://gfxbench.com/compare.jsp?ben...&os1=Android&api1=gl&D2=Apple+iPad+Air&cols=2 The 6400 is clocked slightly higher, but hey, where the HELL has all that FP16 ALU throughput magically gone for the 6430? It's actually 50% more on a clock-for-clock basis... A 6400 doesn't have framebuffer compression either, for the record, so the conclusion should be that GFXBench unusually favors framebuffer compression (and yes, that's a joke...)

If the goal is to achieve higher-fidelity, console-quality graphics, one would really need to rely on the FP32 ALUs for pixel rendering (and the FP32 ALUs cannot be used in conjunction with the FP16 ALUs). Note that SoC die size and transistor count actually increase, because the designers are adding FP16 ALUs on top of FP32 ALUs.
You don't say :LOL: Again: since Kepler, NV has been devoting a bit more die area to dedicated FP64 ALUs in order to win in terms of power consumption. Why didn't they just combine FP32 ALUs as in the past, or as AMD does for Radeons? Because the point IS to decrease power consumption at the cost of a bit more die area.

As a matter of fact GK20A, being a single-cluster Kepler grandchild, also has 8 FP64 SPs which also cost "more die area". Not that you'd need them all that often in a ULP mobile design, but when you do, you burn significantly LESS power by using those for FP64 than by combining existing FP32 ALUs to get double precision.

It's true that you cannot run FP32 and FP16 in parallel on a single Rogue ALU, but none of these cores consists of a single ALU either.

Last but not least: in your opinion, does GK20A also handle INT10 via its existing FP32 ALUs, or does it have dedicated hardware for that too?
 
I wonder more whether Arm, Intel, AMD and NV are prepared for the change in the landscape once mobile real-time ray tracing arrives. It's getting closer by the day: Unreal Engine 4.5 now supports it, Unity is adding support, and the only thing missing is the mobile GPU, which is getting closer too. Personally I believe Wizard will come out when 16nm arrives, and then we might truly see a change in the landscape. It will be kind of crazy to have mobile games with better lights and shadows than desktop games on high-end GPUs.

IMG also have a lot of tech in the pipeline that is outside Intel, AMD and NV's target markets. IMG should see substantial growth in the next 5 years. What happens if Apple do ray tracing on 16nm and make it a must-have feature? How will Arm, Intel, AMD and NV be placed in the mobile GPU market then?

I'm curious as to why ray tracing would be more impactful in mobile than in the desktop or professional markets, where it has been entrenched in a few niches for quite some time, but with limited impact on the market as a whole. Caustic was selling product to the professional market before IMGTech bought them, and it didn't revolutionize anything. What has changed since then?
 
I'm curious as to why ray tracing would be more impactful in mobile than in the desktop or professional markets
Ray tracing is inherently memory or computation intensive (or anything in between). IDK if mobile would be the right choice since it's very much constraints driven. But perhaps there's some magic dust out there that would make it great for mobile, I don't know, I don't work on ray tracing HW.
 
I'm curious as to why ray tracing would be more impactful in mobile than in the desktop or professional markets, where it has been entrenched in a few niches for quite some time, but with limited impact on the market as a whole. Caustic was selling product to the professional market before IMGTech bought them, and it didn't revolutionize anything. What has changed since then?

Frankly, no idea; I'm not even sure how the former Caustic high-end RT SKUs fit into the picture of licensing GPU IP with dedicated RT hardware for the ULP mobile space only. If they gained good enough business with the former, they should keep selling dedicated RT SKUs in the future as well.

They're definitely trying to grow interest in RT in different markets, but so far it doesn't seem to have brought anything worth mentioning. I don't think Caustic ever managed to sell any hardware while they were still independent, but I would be happy to stand corrected.
 
If the goal is to achieve higher-fidelity, console-quality graphics, one would really need to rely on the FP32 ALUs for pixel rendering
Why? FP16 is sufficient for many operations, even "console-quality" ones.
 
http://gfxbench.com/compare.jsp?ben...&os1=Android&api1=gl&D2=Apple+iPad+Air&cols=2 The 6400 is clocked slightly higher, but hey, where the HELL has all that FP16 ALU throughput magically gone for the 6430? It's actually 50% more on a clock-for-clock basis

That comparison is a mess. Totally different OS, totally different drivers and driver overhead scores, different GPU clock operating frequencies, different memory bandwidth, etc. And strangely enough, the render precision quality is virtually identical at ~ 2400 mB PSNR for lower precision and ~ 3500 mB PSNR for higher precision (care to explain how this could be possible when the Venue 8 GPU is supposed to contain only FP32 ALUs for pixel rendering?).

As a matter of fact GK20A, being a single-cluster Kepler grandchild, also has 8 FP64 SPs which also cost "more die area"

You've got it completely backward. Dedicated FP64 execution units are only required for use by the scientific or GPGPU community, and are either cut down or disabled on consumer products. The fact that a small number of these units still exist on consumer products is simply to provide a very basic level of compatibility. That is totally different than what we are discussing here, which is the doubling of FP16 execution units compared to FP32 execution units.
 
Why? FP16 is sufficient for many operations, even "console-quality" ones.

Games designed for modern-day consoles and PCs are not designed with lower-precision pixel rendering in mind. So console and PC game developers looking to bring their games to the ultra-mobile space with high visual fidelity and console quality will surely be making use of the FP32 ALUs for pixel rendering.
 
Games designed for modern-day consoles and PCs are not designed with lower-precision pixel rendering in mind. So console and PC game developers looking to bring their games to the ultra-mobile space with high visual fidelity and console quality will surely be making use of the FP32 ALUs for pixel rendering.
Sometimes it requires more work to get lower precision calculations to work (with zero image quality degradation), but so far I haven't encountered big problems in fitting my pixel shader code to FP16 (including lighting code). Console developers have a lot of FP16 pixel shader experience because of PS3. Basically all PS3 pixel shader code was running on FP16.

It is still very important to pack the data in memory as tightly as possible, as there is never bandwidth to spare. For example, 16-bit (model-space) vertex coordinates are still commonly used, material textures are still DXT compressed (barely 8-bit quality), and the new HDR texture formats (BC6H) commonly used in cube maps have significantly less precision than a 16-bit float. All of these can be processed by 16-bit ALUs in a pixel shader with no major issues. The end result will eventually be stored to an 8-bit-per-channel back buffer and displayed.
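As an illustration of the kind of packing described above, here is a hedged C sketch of 16-bit model-space vertex coordinates: each component is quantised to an unsigned 16-bit value relative to the mesh's bounding box and reconstructed later (in the vertex shader in practice, on the CPU here). The bounding-box values and function names are invented for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Map a model-space coordinate into the mesh's bounding box and store it
 * on a 16-bit grid (65536 steps across the box). */
static uint16_t quantize16(float v, float lo, float hi) {
    float t = (v - lo) / (hi - lo);             /* normalise to [0, 1] */
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return (uint16_t)(t * 65535.0f + 0.5f);     /* round to the nearest grid step */
}

/* Reconstruct the coordinate from its 16-bit index. */
static float dequantize16(uint16_t q, float lo, float hi) {
    return lo + ((float)q / 65535.0f) * (hi - lo);
}

int main(void) {
    const float lo = -2.5f, hi = 2.5f;          /* assumed bounding box on this axis */
    const float x  = 1.2345678f;                /* original 32-bit model-space coordinate */
    uint16_t    qx = quantize16(x, lo, hi);
    float       rx = dequantize16(qx, lo, hi);
    printf("original %.7f, stored as %u (2 bytes), reconstructed %.7f, error %g\n",
           x, (unsigned)qx, rx, (double)(x - rx));
    return 0;
}
```

The worst-case error is half a grid step, (hi - lo) / 65535 / 2, which for the assumed 5-unit bounding box is around 4e-5 model-space units, far below anything a pixel would ever show.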

Could you give us some examples of operations done in pixel shaders that require higher than 16 bit float processing?

EDIT:
One example where 16-bit float processing is not enough: exponential variance shadow mapping (EVSM) needs both 32-bit storage (32-bit float textures + 32-bit float filtering) and 32-bit float ALU processing.

However, EVSM is not yet universally possible on mobile platforms, as there's no standard support for 32-bit float filtering on mobile devices (OpenGL ES 3.0 just recently added support for 16-bit float filtering; 32-bit float filtering is not yet present). Obviously GPU manufacturers can expose OpenGL ES extensions to add FP32 filtering support if their GPU supports it (as most GPUs should, since this has been a required feature in DirectX since 10.0).
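A small sketch of why EVSM in particular outgrows FP16: the exponential warp exp(c * depth) with a typical positive warp exponent (c = 40 is assumed here, not taken from any particular engine) exceeds the largest finite FP16 value (65504) over most of the depth range, so the warped depth, and even more so its square used for the variance term, needs FP32 storage and math.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const double fp16_max = 65504.0;   /* largest finite value representable in binary16 */
    const double c = 40.0;             /* assumed positive EVSM warp exponent */
    for (int i = 1; i <= 10; ++i) {
        double depth  = i / 10.0;      /* normalised shadow-map depth in [0, 1] */
        double warped = exp(c * depth);
        printf("depth %.1f -> exp(c * depth) = %.3e  (%s FP16 range)\n",
               depth, warped, warped > fp16_max ? "outside" : "inside");
    }
    return 0;
}
```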
 
That comparison is a mess.

Only for you, and for obvious reasons, but let's go point by point.

Totally different OS, totally different drivers and driver overhead scores, different GPU clock operating frequencies, different memory bandwidth, etc.
There are 6200/6230 Rogues out there for further comparisons under Android, and guess what: they'll show you the exact same trend. Shouldn't you, in the meantime, come up with data that contradicts it?

Offscreen fillrates usually give away the frequency differences: http://gfxbench.com/compare.jsp?ben...1=Android&api1=gl&D2=Dell+Venue+7+3740&cols=2

And strangely enough, the render precision quality is virtually identical at ~ 2400 mB PSNR for lower precision and ~ 3500 mB PSNR for higher precision (care to explain how this could be possible when the Venue 8 GPU is supposed to contain only FP32 ALUs for pixel rendering?).
That's another challenge you've already left unanswered on another occasion. Since you "think" you know so much about the quality tests in GFXBench, what exactly the application does and how every piece of hardware behaves in it, it's rather your turn to see the flaws in your own logic, or to start delivering some facts, documentation or whatever else to "enlighten" us about what is going on. Adrenos have FP32 ALUs too, and so do Vivante GPUs, so why the heck do they all come up with similar PSNR scores? Want to rethink your "theories" once more from scratch?

Did you even bother to read the links I provided just the other day for the quality tests?

You've got it completely backward.
We were all waiting for you to set the record straight.

Dedicated FP64 execution units are only required for use by the scientific or GPGPU community, and are either cut down or disabled on consumer products. The fact that a small number of these units still exist on consumer products is simply to provide a very basic level of compatibility. That is totally different than what we are discussing here, which is the doubling of FP16 execution units compared to FP32 execution units.
It's not at all different, if you wouldn't so desperately try to dismiss the actual point here. If you bothered to understand that it's a conscious design decision, you'd also understand the reasoning behind dedicated FP16 units. Forget me completely; there is more than one challenge from true industry insiders in this thread, and we'll see if you are able to address even one of them.

On a side note, there is a high chance that GM200 has half as many FP64 ALUs as FP32 ALUs, which would mean over 50% more FP64 units than GK110. NV uses that many more FP64 ALUs because they need them for the HPC and other professional markets the chip will serve, and it's the very same necessity that drove IMG's engineers to add more FP16 in Series 6XT: higher perf/W, and yes, it's as simple as that. Z/stencil and geometry throughput also increased in Series 6XT cores, but hey, why anyone would expect an IHV to deliver higher efficiency in a piece of refresh hardware is completely beyond me (pun intended) :p

If NV had made the same design decisions as in the past (or as AMD has) and obtained double precision by merging existing FP32 ALUs (to get a 3:1, 2:1 or whatever SP/DP ratio), they would save that "bit" of extra die area, but would burn a LOT more power in those cases where FP64 is actually needed. AMD Hawaii has a 1:2 ratio; would you like to remind me what kind of DP rates it gets, at what TDP and at what DP/W ratio exactly?

Having dedicated FP16 ALUs in Rogue means that wherever FP16 is used, they burn less than half the power of running the same operation on an FP32 unit. What is common here is that in both cases (the former and this one) engineers dedicate a bit more die area in order to save power, and that is beyond any doubt a lesson learned from the ULP space.
 
Games designed for modern-day consoles and PCs are not designed with lower-precision pixel rendering in mind. So console and PC game developers looking to bring their games to the ultra-mobile space with high visual fidelity and console quality will surely be making use of the FP32 ALUs for pixel rendering.

OK, first I'll quote Sebbbi from the new Rogue architectural thread:
sebbbi said:
I personally think Rogue is a step in the right direction. With many other mobile architectures you still have to use the painful ~FP10 formats to extract the best performance. I sincerely hope that lowp vanishes soon, and the ES standard would dictate FP16 minimum for mediump.

FP16 is great for pixel shaders. You can avoid vast majority of precision issues by thinking about your numeric ranges and transforms. Doing your pixel shader math in view space or in camera centered world space goes long way to the right direction.

Obviously you need FP24/32 for vertex shaders. However most of the GPU math is done in pixel shaders as high resolutions such as 2048x1536 and 2560x1600 are common in mobile devices. The decision to increase the FP16 ALUs was a correct one.

Increased FP16 ALU counts also mean that the performance trade-off from dropping the lowp ALUs is decreased. This is certainly a good thing. I hope that other mobile GPU developers think alike and lowp is soon gone.
So there you have it from someone who is actually intimate with this problem area.

I also somehow feel that you don't quite grasp the hows and whys of numerical precision in general. That's OK, by the way; even among people who are professionals in affected fields, it is a topic that is often considered tangential (i.e. swept under the rug). It's typically not treated at all below university level, and there the initial courses typically look at worst-case error propagation, truncation vs. rounding, and classical problem areas such as derivatives and matrix inversions - useful, but a bit removed from practical problems, which typically involve other sources of error in the underlying models, far from the absolute numerical measurability of the mathematics in the course, and it is those other sources that determine how significant numerical precision really is.

Basics - when we are talking about pixels on cell phone screens these days, those are 8-bit integer values for red, green and blue respectively. In days gone by you would do your graphics calculations in 8-bit integer colour per channel, progressing to 10-bit for better precision, to FP10 for less precision but better handling of dynamic range, and via fixed-point formats (which I'll ignore for now for simplicity) to FP16 and FP32.

FP16, since it is the focus of the discussion, has one sign bit, a 5-bit exponent, and 10 stored mantissa bits, giving 11 bits worth of significand (through the implicit-leading-bit trick). For most calculations you have precision to spare. Errors accumulate slowly enough that they don't reach significant levels. However, there are problematic operations, and of course algorithms that are good in some ways but dubious in terms of numerical precision. And that is where higher-precision formats may be of aid. Or you could work around those specific problems in other ways, depending on your priorities and the hardware capabilities at hand.
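For the curious, here is a simplified C round trip through the binary16 layout just described (1 sign bit, 5 exponent bits, 10 stored mantissa bits). It is a sketch rather than a reference converter: denormals are flushed to zero and ties are rounded away from zero instead of to even, but it is enough to show where FP16 has precision to spare and where it runs out (integers above 2048 and magnitudes above 65504).

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Convert a 32-bit float to a 16-bit half (simplified: denormals flushed
 * to zero, ties rounded away from zero instead of to even). */
static uint16_t float_to_half(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t sign = (bits >> 16) & 0x8000u;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFFu) - 127 + 15;   /* rebias 8-bit exponent to 5-bit */
    uint32_t man  = bits & 0x007FFFFFu;
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u);             /* too large: becomes infinity */
    if (exp <= 0)  return (uint16_t)sign;                         /* too small: flush to zero */
    uint32_t hman = (man + 0x1000u) >> 13;                        /* keep the top 10 mantissa bits, rounded */
    if (hman == 0x400u) { hman = 0; if (++exp >= 31) return (uint16_t)(sign | 0x7C00u); }
    return (uint16_t)(sign | ((uint32_t)exp << 10) | hman);
}

/* Expand a 16-bit half back to a 32-bit float. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0)       bits = sign;                                  /* zero (denormals were flushed) */
    else if (exp == 31) bits = sign | 0x7F800000u | (man << 13);      /* infinity / NaN */
    else                bits = sign | ((exp - 15 + 127) << 23) | (man << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    const float samples[] = { 0.1f, 1.0f / 3.0f, 2048.0f, 2049.0f, 65504.0f, 70000.0f };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        float back = half_to_float(float_to_half(samples[i]));
        printf("%12.4f stored as FP16 reads back as %12.4f (abs error %g)\n",
               samples[i], back, fabsf(samples[i] - back));
    }
    return 0;
}
```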

At the end of the day, we are talking about gaming graphics on mobile devices with very high pixel densities. We are not talking about life-support systems or satellite control, after all. So if you lack previous experience, it would be a viable approach to simply default to FP16, run your code, and see if everything looks OK. If it does, you're good. If it doesn't, look at what causes the problem, and see if you can work around it, or if it is small enough and contained enough to be fixed by increasing the precision. (If you're using a strongly divergent algorithm, you're asking for problems and you might want to change approach.) Given the nature of gaming, and the devices the games run on, some errors are perfectly OK; after all, it's not as if there are no compromises going on in areas other than numerical precision that have large visual consequences. With accumulated experience in the field, you'll know where to expect issues (as in sebbbi's example above), and you'll save yourself a bit of work.

Generally, if you have limited resources, and in these devices you always do, you avoid wasting resources unnecessarily. Saving bandwidth, power, computational resources and to some extent memory is always a good idea. So using FP16 (or even lower) as the default makes sense. The interesting part is actually in those areas where precision starts being an issue to be reckoned with - can you find alternative ways to express your algorithm that are less numerically demanding? Alternatively, are there visually similar algorithms that are less demanding? If the problem arises from a particular operation in the algorithm, can you do type-casting tricks to ensure that you have enough precision right there, and then convert back? Is that profitable in terms of computational intensity, or would it actually be more efficient to brute-force it? And so on - my experience in these matters is from a different field, where brute force typically wins out because of very lax financial accountability :), but even there making poor algorithmic choices will bite you. Throwing significand bits at a numerically poorly expressed algorithm is a band-aid at best.
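A toy example of the type-casting trick mentioned above, using host C types as stand-ins (float for the narrow precision and double for the wide one, since plain C has no portable FP16 type here): the input data stays in the narrow type, and only the numerically sensitive accumulator is widened.

```c
#include <stdio.h>

int main(void) {
    const int n = 10 * 1000 * 1000;
    float  narrow_sum = 0.0f;   /* everything kept in the narrow type              */
    double wide_sum   = 0.0;    /* same data, accumulator widened locally          */
    for (int i = 0; i < n; ++i) {
        float sample = 0.1f;    /* narrow-precision input value                    */
        narrow_sum += sample;
        wide_sum   += (double)sample;  /* cast up just for the sensitive operation */
    }
    printf("expected ~%.1f\n", n * 0.1);
    printf("narrow accumulator:  %f\n", narrow_sum);
    printf("widened accumulator: %f\n", wide_sum);
    return 0;
}
```

The narrow accumulator drifts visibly once the running sum dwarfs each new sample, while widening just that one variable fixes it; the same reasoning carries over to promoting a single FP16 shader expression to FP32.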

Ouch, too wordy, and probably too vague. I'd better quit. Apologies.
Bottom line: you use the precision you need for the problem at hand. In the face of limited resources, waste does not make sense, and it weakens your competitiveness in the marketplace.
 
have you ever worked for a GPU vendor? I did, and I can tell you for a fact that marketing is as important as (and sometimes even more important than) engineering. New feature requests come from management, sales and marketing teams, and then engineers are in charge of making them real, not the other way around...

You've got a point.

The problem for engineers is that they get paid to build things that are essentially spec'd by ignorant customers.

Marketing doesn't bother worrying about this. The customer has the money and the company needs the money. Marketing is there to make this happen.

Engineers fret because they are justifiably bothered by the design detours created by the situation, but of course there's no money to do anything without sales.
 
The problem for engineers is that they get paid to build things that are essentially spec'd by ignorant customers.

Marketing doesn't bother worrying about this. The customer has the money and the company needs the money.
That's what every single company does. They try to design and sell stuff that customers want to buy. In this case, the increased FP16 execution resources allow the GPU to run the majority of existing software faster while also maintaining good battery life. Most customers like faster execution (smoother animation) and better battery life. It's definitely the right thing to give the customers what they want, assuming of course it increases customer satisfaction and makes you sell more units.
 
The problem for engineers is that they get paid to build things that are essentially spec'd by ignorant customers.
Not really. Customers can "spec" in an abstract sense. They want stuff prettier, faster, with bigger numbers (battery life, screen size). It is very far from this level of abstraction to the design of the HW and decisions that people try to discuss here. In fact quite a few seem misinformed about how stuff like fp16 affects any of the metrics mentioned.
 
Not really. Customers can "spec" in an abstract sense. They want stuff prettier, faster, with bigger numbers (battery life, screen size). It is very far from this level of abstraction to the design of the HW and decisions that people try to discuss here. In fact quite a few seem misinformed about how stuff like fp16 affects any of the metrics mentioned.

Problems do crop up, though, when customers begin to focus on what they think is a single important feature of a product, like FP ops per second. Do most customers really know that there might be a difference between 16-bit floats, 32-bit floats and 64-bit floats? Many do not.

It's that sort of ignorance I was referring to. If customers desire maximum FP ops per second, though, then limited precision is going to be the norm, as it tends to maximize the FP throughput numbers, and cards are going to be designed with that in mind. NVIDIA seems to understand that well, as the DP performance of their most recent commodity cards is dismal. AMD does have the R9 280X with very good DP performance (probably the best DP FLOPS/$ on the market), but how many customers has it won them?

In the past it was the clock rate for CPUs. In that case it led to the disaster which was the Pentium 4. Intel had actually planned a 10GHz P4 but the physics got in the way.

It took a long time for people to realize that clock rate isn't everything.

The current fad is CPU cores/chip.

Compare the single-threaded performance of a Core 2 Duo with a similarly clocked Core i7. There isn't a huge difference, except perhaps in benchmarks that are memory-latency sensitive.
 
Problems do crop up, though, when customers begin to focus on what they think is a single important feature of a product, like FP ops per second. Do most customers really know that there might be a difference between 16-bit floats, 32-bit floats and 64-bit floats? Many do not.

We have this saying in Poland: the dogs bark, but the caravan goes on. This is part of any competitive industry. People compare irrelevant, out-of-context characteristics of cars, food, whatever. The less you know about the subject, the more likely you are to have a simplistic view of it and under-appreciate the complexity that goes into making stuff happen. I used to work on an operating system component some time back and it drove me crazy when people were judging our work on some absurd grounds. But then I stopped listening to the noise and focused on what's important. This made me happier and, hopefully, the product better.

It is, in a way, a huge, huge problem with journalism these days. Tech and gaming are no different from cars and fitness. A lot of people writing (and being listened to!) have no idea about their craft, be it understanding what they write about or the writing itself. There's no Carl Sagan of GPUs who would teach and entertain you at the same time, who'd get into the nuance of how stuff works and why. We're lucky there's a modern-day Carl Sagan of astrophysics who cares and is eloquent enough to attract people. ;]

So until we get this messiah of computing, this is the reality and there's no point in getting upset over it. But I digress...

It took a long time for people to realize that clock rate isn't everything.
The current fad is CPU cores/chip.

First of all, I don't think it's a fad. Second of all, the only response I have is that you lead by example. You build something that's on paper inferior but in reality surpasses people's expectations. Some will be convinced, some won't. And that's fine by me.
 