Is there someplace I can read up on using FP16 correctly?
Read articles on 32-bit floats written for a scientific audience. For them, fp32 is "half precision" compared to fp64. This article covers pretty much everything, but goes far deeper than most rendering programmers need:
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
The most important thing in float math is to avoid catastrophic cancellation. This means subtracting two (big) values that are close to each other. This commonly happens when you use world-space coordinates in your math and calculate vectors between two points: for example localLight->surfacePixel, camera->surfacePixel, vertex1->vertex2 (edge math in world space), etc. The solution is to avoid doing math in world space. Just don't do it. It causes problems even in fp32 if your world is big enough. The first thing you should do is subtract the camera position from all world-space data (*). Absolutely no math before this. This way the floating point error is localized around the camera: closer to the camera = less error, further away = more error. Perspective projection makes everything smaller at a distance, normalizing the error and ensuring that no matter the distance, the error stays below some subpixel fraction.
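A minimal sketch of why this matters, using NumPy fp32 on the CPU as a stand-in for GPU math (the 100 km / 1 cm numbers are made up for illustration):

```python
import numpy as np

# fp32 can't even represent a 1 cm offset 100 km from the origin:
a = np.float32(100000.00)
b = np.float32(100000.01)
ulp = np.spacing(a)        # representable step at 100 km: ~0.0078 units
edge_world = b - a         # quantized to roughly one ULP, not 0.01

# The same two points stored camera-relative (camera near them) sit close
# to zero, where fp32 spacing is tiny, so the 1 cm offset survives:
a_rel = np.float32(0.00)
b_rel = np.float32(0.01)
edge_rel = b_rel - a_rel   # 0.01 (to within fp32 rounding of 0.01)
```

The subtraction itself is exact; the damage was already done when the nearby world-space positions were rounded to fp32, which is why the camera-relative representation has to happen before any other math.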
The previous trick is not enough for all fp16 cases. Instead of performing math in camera-centered space, you sometimes want to perform math in surface-local space. Subtract the surface coordinate from the other position data. Do this subtract in fp32 (to minimize catastrophic cancellation) and the rest of the operations in fp16.
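A sketch of the same idea in NumPy (fp16 standing in for shader half precision; the positions ~300 units from the camera are an assumed example):

```python
import numpy as np

# Camera-space positions ~300 units out. fp16 spacing at 300 is 0.25, so
# converting these positions to fp16 directly collapses a 0.1 offset:
surface = np.float32(300.00)
light   = np.float32(300.10)
naive   = np.float16(light) - np.float16(surface)  # 0.0: both round to 300

# Do the subtract in fp32 (surface-local space), then drop to fp16.
# Near zero, fp16 has plenty of precision left for the remaining math:
local = np.float16(light - surface)                # ~0.1
```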
In general you should avoid adding two floating point numbers whose magnitudes differ a lot. Example: a time counter should never be floating point (fp32). The accumulated time gets large while the added time per frame stays small. The precision of the add gets lower and lower, and frame timing (animations) starts to judder after a few hours. Similarly, if you are adding multiple light sources together, you should not simply add them one at a time in a loop. Instead you should first add the lights pairwise, then those results pairwise, and so on. This results in the same number of add operations, but if we assume the lights are roughly the same intensity, each add is performed between two numbers of roughly the same magnitude, reducing the floating point error. Or you could simply use an fp32 light accumulator and do the heavy math in fp16: you only perform a single fp32 add per light. However, modern GGX lighting math requires fp32 precision in some steps. Again, you can carefully isolate the fp32 math from the fp16 math if you know the relative magnitudes of the operands, but it is tricky if you borrow a lighting formula from a paper and don't understand exactly how it works.
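Both accumulation strategies can be sketched like this (NumPy fp16 stands in for shader half precision; 1024 equal "lights" of intensity 0.3 are a made-up example):

```python
import numpy as np

lights = [np.float16(0.3)] * 1024   # true sum: 1024 * 0.2998... = 307.0

# Naive loop: the accumulator grows large while each light stays small,
# so late adds lose most of the light's low bits.
acc = np.float16(0.0)
for v in lights:
    acc = np.float16(acc + v)
seq_sum = float(acc)                # noticeably below 307

# Pairwise: add lights in pairs, then pair the partial sums, and so on.
# Same number of adds, but operands at every level have similar magnitude.
def pairwise(vals):
    if len(vals) == 1:
        return vals[0]
    mid = len(vals) // 2
    return np.float16(pairwise(vals[:mid]) + pairwise(vals[mid:]))
pair_sum = float(pairwise(lights))  # essentially exact here

# The cheap practical alternative: fp32 accumulator, fp16 per-light math,
# one fp32 add per light.
acc32 = np.float32(0.0)
for v in lights:
    acc32 = acc32 + np.float32(v)
```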
Optimizing for fp16 is similar to optimizing data storage. You need to know your data and do some analysis. People have been optimizing their storage (memory bandwidth) for ages. Slides 33-36 of this presentation are a good example (error analysis of a compressed normal):
https://michaldrobot.files.wordpress.com/2014/05/gcn_alu_opt_digitaldragons2014.pptx
(*). Subtracting the camera position from world-space positions itself causes catastrophic cancellation. But you do it only once, before any other math. If your world is large, I recommend using uint32 for positions instead of fp32. 3x uint32 (xyz) can represent the whole earth at a few millimeters of precision (including all the space inside the earth). You only need a single integer ALU instruction (a subtraction) to convert world-space coordinates to camera-space coordinates, followed by a single float multiply-add to scale the coordinate accordingly. Integer subtraction is full rate on all GPUs. No catastrophic cancellation at all.
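The uint32 scheme might look roughly like this in NumPy (the ~3 mm grid scale and the coordinate values are assumptions for illustration; 2^32 steps of 3 mm span ~12,900 km, slightly more than the earth's diameter):

```python
import numpy as np

METERS_PER_STEP = np.float32(0.003)  # assumed grid: ~3 mm per integer step

def world_to_camera(world_u32, camera_u32):
    """uint32 world position -> fp32 camera-space meters."""
    # Single integer subtraction; uint32 wraparound plus a signed
    # reinterpret handles points on either side of the camera.
    # No floating point cancellation anywhere.
    delta = (world_u32 - camera_u32).view(np.int32)
    # Single float multiply (a multiply-add on GPU) to scale into meters.
    return delta.astype(np.float32) * METERS_PER_STEP

camera = np.array([3_000_000_000] * 3, dtype=np.uint32)
ahead  = camera + np.array([1000, 0, 0], dtype=np.uint32)  # 3 m along +x
behind = camera - np.array([0, 2000, 0], dtype=np.uint32)  # 6 m along -y
```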