Common Targets
A reliable target for FP16 optimisation is the blending of colour and normal maps. These operations are typically heavy on data-parallel ALU operations. What’s more, such data frequently originates from a low-precision texture and therefore fits comfortably within FP16’s limitations. A typical game frame has a plentiful supply of these operations in gbuffer export and post-process shaders, all ripe for optimisation.
BRDFs are an attractive but difficult candidate. The portion of a BRDF that computes specular response is typically very register- and ALU-intensive. This would seem a promising target. However, caution must be exercised. BRDFs typically contain exponent and division operations. There are currently no FP16 instructions for these operations. This means that at best there will be no parallelisation of those operations; at worst it will introduce conversion overhead between FP16 and FP32.
All is not lost. There is a suitable optimization candidate in the typical BRDF equation: the large number of vectors and dot products typically present. Whilst individual dot products are more a data reduction operation than a data parallel operation, many dot products can be performed in parallel using SIMD code. These dot products often feed back into FP32 BRDF code, so care must be taken not to introduce FP16 to FP32 conversion overhead that exceeds the gains made.
Finally, TAA or checker-boarding systems offer strong potential for optimisation alongside surprising risks. These systems perform a great deal of colour processing, and ALU can indeed be the primary bottleneck. UV calculations often consume much of this ALU work. It is tempting to assume these screen-space UVs are well within the limits of FP16. Surprisingly, the combination of small pixel velocities and high resolutions such as 4K can cause artefacts when using FP16. Exercise care when optimising similar code.