FP16? But it's the current year!

I've heard FP16 throughput numbers quoted for both the PlayStation 4 Pro and Vega as if they were a meaningful metric.

Suddenly, in 2016, it became a feature for full-precision shaders to also be able to run in FP16 mode with twice the throughput. I don't get it.

Are we talking about some new whizz-bang compute shader application like neural networks or whatever? Are we talking about re-enabling the GeForce FX shader path?
 
It comes down to how much precision is sufficient.
Even when the transition to 32-bit registers began, it was recognized that some workloads did not need that much precision. ATI went with an intermediate 24-bit precision for pixel shaders for a time.
There are processing steps and workloads that are fine with 16 bits or fewer, such as various postprocessing steps on targets that are already lower-precision, neural network training (with inference even going lower), or algorithms that will iteratively approach a sufficiently accurate result.
Some things, such as addressing larger memory spaces, position data, or algorithms that accumulate data (and error) from multiple sources, benefit from more bits in the representation.

That would need to be weighed against the costs of having higher precision: power, hardware, storage, bandwidth.

For a period of time, the benefit in reaching the same level of precision used for other programmable architectures, larger memories, and more advanced algorithms was high enough to justify bumping everything to a consistent and sufficient internal precision. Then limited bandwidth growth, power consumption, and poorer silicon density+cost gains gave designers a reason to revisit that initial trade-off.

We may not necessarily be done at FP16, as some workloads can get away with less and there are low-power concepts with FPUs that dynamically vary their precision to shave things down further.
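For what it's worth, here is a minimal CUDA sketch of the accumulation point above (purely illustrative; it assumes a GPU of compute capability 5.3 or newer so the half-precision arithmetic intrinsics are available):

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Sum the same series in fp16 and fp32. The fp16 accumulator stops growing
// once the running total is large enough that each 0.1 increment rounds
// away, while fp32 stays close to the expected 409.6.
__global__ void accumulate(float *sum32, float *sum16)
{
    float  acc32 = 0.0f;
    __half acc16 = __float2half(0.0f);
    for (int i = 0; i < 4096; ++i) {
        acc32 += 0.1f;
        acc16 = __hadd(acc16, __float2half(0.1f));  // rounds at every step
    }
    *sum32 = acc32;
    *sum16 = __half2float(acc16);
}

int main()
{
    float *d, h[2];
    cudaMalloc((void **)&d, 2 * sizeof(float));
    accumulate<<<1, 1>>>(d, d + 1);
    cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("fp32 sum: %f   fp16 sum: %f   (exact: 409.6)\n", h[0], h[1]);
    cudaFree(d);
    return 0;
}
```

The fp16 sum stalls around 256, because past that point the format's spacing is wider than the increment being added.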
 
I've been similarly confused about this sudden clamour for 16bit :|
Then limited bandwidth growth, power consumption, and poorer silicon density+cost gains gave designers a reason to revisit that initial trade-off.
I guess that makes sense but I can't help feeling it's an odd regression.
 
I've been similarly confused about this sudden clamour for 16bit :|
I guess that makes sense but I can't help feeling it's an odd regression.
It'd be a regression if it came at the expense of 32-bit FP, but it isn't. It's granularity. Granularity is good. Quality isn't gonna suffer unless the developer wants it to, and mostly, 16 bits will be used when it makes no difference, so no loss.
Of course when GPUs went 32-bit, their vendors said it was something absolutely necessary we couldn't live without, because that's what salespeople do, but nothing is ever as black and white as marketing talk makes things out to be.
 
Why did we go from 24bit in DX9 to 32bit in DX10/11, by the way?
 
Precision. You can't use FP16 for everything. There are actually limited situations where it fits.
 
I remember 3dlabs saying they used more than FP32 for their vertex shaders. FP36, I think. Others used FP32, as MDolenc said.

In addition to unified shaders, FP32 enabled more GPGPU usage.
 
I've heard of 10- or 12-bit integer as a good compromise between 8- and 16-bit for neural networks, according to some people. NV30 supported an FX12 4-way dot product (sadly without an extra accumulate, but that'd be easy to add). Therefore, NV30 is the future. /s
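Incidentally, the closest modern analogue I can point at is CUDA's __dp4a, which is exactly that kind of packed low-precision dot product, accumulate included (a sketch only; the intrinsic needs compute capability 6.1 or newer):

```cuda
// Each 32-bit operand of __dp4a is treated as four packed signed 8-bit
// values; the intrinsic multiplies them pairwise and adds the products to a
// 32-bit accumulator, which is the usual shape of neural-network inference math.
__global__ void dot_int8(const int *a, const int *b, int *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]);  // 4x int8 multiply-add, int32 accumulate
}
```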

I think the benefit of FP16 was forgotten because of:
- NVIDIA pushing it inappropriately giving it a bad reputation.
- Possibly as a consequence of the above, Microsoft pushing FP32 more in DX10.
- Both NVIDIA and AMD aggressively pushing GPGPU which obviously required FP32.

Then FP16 was reintroduced on mobile, which made people realise the power benefit and how useful it is in general. And now neural network training also benefits from FP16, which is generating even more interest.
 
All existing games (except a few HDR games) output images at 8 bits per channel (RGB8). Input textures are also commonly 8 bits per channel (and BC compressed = lower quality than plain 8 bit).

As your input and output data has only 8-bit precision, you don't need to calculate all the intermediate math at 32 bits. Games don't store intermediate buffers as 32-bit floats either: RGBA16F is commonly used for HDR data, and RGB10 and RGBA8 for other intermediate buffers. 16-bit float processing is fine for most math in games. The results cannot be distinguished by the naked eye from a full 32-bit float pipeline, as long as the developer knows what he/she is doing, especially if temporal AA is used.

Unfortunately writing good mixed fp16/fp32 code requires good knowledge about floating point math behaviour and some basic numeric range analysis (inputs/outputs and intermediate values). It is possible to write math in a way that minimizes floating point issues, allowing you to use fp16 more often. Of course if you use fp16 in a wrong way, you get banding and other artifacts.
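To make the range-analysis point concrete, here is a toy CUDA sketch (my own illustration, not engine code; it needs compute capability 5.3+ for the half intrinsics). An fp16 gradient kept in [0,1] has more resolution than an 8-bit target needs, but the same gradient routed through an arbitrary large intermediate value (+512 here) loses exactly the bits the output needs and bands badly.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Compute the same 8-bit gradient two ways: once with the fp16 value kept in
// [0,1] (step ~0.001 near 1.0), and once bounced through a large intermediate
// (fp16 step is 0.5 near 512), which quantizes the result to a few levels.
__global__ void gradient(unsigned char *good, unsigned char *bad, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = (float)i / (float)(n - 1);               // ideal gradient in [0,1]

    __half in_range = __float2half(x);                 // friendly range
    __half offset   = __hsub(__hadd(__float2half(x),   // large intermediate:
                                    __float2half(512.0f)),
                             __float2half(512.0f));    // low bits are gone

    good[i] = (unsigned char)(__half2float(in_range) * 255.0f + 0.5f);
    bad[i]  = (unsigned char)(__half2float(offset)   * 255.0f + 0.5f);
}

int main()
{
    const int n = 256;
    unsigned char *d_good, *d_bad, h_good[256], h_bad[256];
    cudaMalloc((void **)&d_good, n);
    cudaMalloc((void **)&d_bad, n);
    gradient<<<1, n>>>(d_good, d_bad, n);
    cudaMemcpy(h_good, d_good, n, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_bad, d_bad, n, cudaMemcpyDeviceToHost);

    int levels_good = 1, levels_bad = 1;               // count distinct output steps
    for (int i = 1; i < n; ++i) {
        levels_good += (h_good[i] != h_good[i - 1]);
        levels_bad  += (h_bad[i]  != h_bad[i - 1]);
    }
    printf("distinct 8-bit levels: in-range fp16 = %d, offset fp16 = %d\n",
           levels_good, levels_bad);

    cudaFree(d_good);
    cudaFree(d_bad);
    return 0;
}
```

The in-range path keeps roughly all 256 output levels; the offset path collapses to only a handful, which is the banding being described.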
 
I remember you said once that if a programmer knows what he is doing, even lower bit-depth integers can do the job just fine for a lot of workloads. Maybe as the cheap wins in silicon become harder to achieve, architectural changes that open more doors for devs to shave bits and bytes off their code might be a great part of the performance gains of the future. All layman speculation over here, though.
 
I could see pathfinding, audio mixing, and possibly geometry culling benefiting from FP16. With the geometry levels we may be seeing with DX12/Vulkan, being able to cull at twice the speed could be significant. Accuracy should be less of an issue there as well. In the case of ASW, if near and far objects are in separate render targets, there may be some areas where lower precision is practical. Even FP64 for geometry at extreme (celestial) distances might be useful in that scenario, although stratifying the depth buffers likely addresses that concern. I know some of the space games had issues reprojecting celestials because of distance hacks.

But 24bit. Was 24bit enough?
For a lot of workloads, probably, but not all. Too much precision isn't really an issue beyond performance. 24 bits also isn't a size that packs efficiently. Supporting 24 bit over 32 helps transistor count, but that's about it. I'm not sure anyone has ever used a card with 24-bit memory channels? At least not recently. Maybe on an FPGA with 8-bit registers or something.
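A quick way to see the packing problem (a hypothetical tightly-packed layout, just for illustration): three bytes per element means no power-of-two stride, so every fetch is a byte gather and elements routinely straddle 32-bit word boundaries.

```cuda
#include <cstdint>
#include <cstdio>

// Load the i-th 24-bit value from a tightly packed byte array. The 3-byte
// stride means unaligned, byte-level access instead of a single word load.
__host__ __device__ inline uint32_t load_u24(const uint8_t *packed, int i)
{
    const uint8_t *p = packed + 3 * i;   // every other element starts mid-word
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
}

int main()
{
    // Four 24-bit values occupy 12 bytes; the second one starts mid-word.
    uint8_t packed[12] = { 0x01, 0x00, 0x00,   0x02, 0x00, 0x00,
                           0x03, 0x00, 0x00,   0x04, 0x00, 0x00 };
    for (int i = 0; i < 4; ++i)
        printf("element %d = %u\n", i, load_u24(packed, i));
    return 0;
}
```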
 
Wow, it's been a very long time since I've heard of FX12. IIRC that's in Pixel Shader 1.4 and supported by the Radeon 8500/9000/9200. Also what you likely wanted to run on a GeForce FX if a renderer code path was available.
I think Doom 3 ran at 60 fps with FP32 shaders on the FX 5800 Ultra, but that was very peculiar.
 
This conversation is making me cringe just slightly. Nvidia borked the 1070 and 1080 to be half rate on FP16, IIRC. And if future consoles are going to be full tilt on FP16... lol. Oh man, need to consider selling.
 
Why? What makes you think the driver will even expose half precision to DX? As it stands now, this is a feature for CUDA. For DX, low-precision hints are simply ignored.
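For reference, the CUDA-side exposure looks roughly like this (a sketch, assuming compute capability 5.3+ for the intrinsics; actual fp16 throughput varies a lot by chip):

```cuda
#include <cuda_fp16.h>

// CUDA packs two fp16 values into one 32-bit register (__half2), and
// intrinsics like __hfma2 do a multiply-add on both lanes per instruction.
// That packing is where the "double rate" comes from on parts that run
// fp16 at full speed. Here n is the number of packed pairs.
__global__ void fma_half2(const __half2 *a, const __half2 *b,
                          const __half2 *c, __half2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], c[i]);   // 2x fp16 multiply-add per instruction
}
```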
 
Unfortunately writing good mixed fp16/fp32 code requires good knowledge about floating point math behaviour and some basic numeric range analysis (inputs/outputs and intermediate values). It is possible to write math in a way that minimizes floating point issues, allowing you to use fp16 more often. Of course if you use fp16 in a wrong way, you get banding and other artifacts.
I am curious how this will be done effectively for multi-platform game engines, or when it comes to porting a game that is making use of a mix of FP16/FP32 at its core/post processing effects.

Cheers
 
Why? What makes you think the driver will even expose half precision to DX? As it stands now, this is a feature for CUDA. For DX, low-precision hints are simply ignored.
Explicit float16 'half' types (vs. minprec fuzziness)
Was listed on a slide for SM6.0 at GDC.

I am curious how this will be done effectively for multi-platform game engines, or when it comes to porting a game that is making use of a mix of FP16/FP32 at its core/post processing effects.

Cheers
It shouldn't be difficult for the compiler to promote to FP32. Performance is obviously lower, but the compiler would have determined that to be the better solution. Even at half rate there should still be some bandwidth and memory savings.
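A minimal CUDA-flavoured sketch of that fallback (no fast-fp16 hardware assumed; the conversion intrinsics work on any architecture): the data stays 16-bit in memory, so the bandwidth and storage savings survive, while the arithmetic itself is done in fp32 after a lossless widening.

```cuda
#include <cuda_fp16.h>

// "Promote to fp32" fallback: keep the 16-bit storage format, widen for the
// math. Every fp16 value is exactly representable in fp32, so the widening
// itself costs no precision; only the final round back to fp16 does.
__global__ void scale_promoted(const __half *in, __half *out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(in[i]);   // lossless widen
        out[i] = __float2half(x * k);    // round back to fp16 for storage
    }
}
```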
 