The PC gaming industry moved away from mixed-precision rendering many years ago. Mixed precision was a mess for game developers, who had to maintain custom mixed-precision code paths for different architectures.
There is more than one half-float extension in OGL_ES, and they get used by everyone in the mobile space, without exception, precisely to save power. If you think that NV uses FP32 values for everything, even for INT cases, you're quite mistaken.
For the record, the idea of NVIDIA using dedicated FP64 units in desktop GPUs actually originates from the ULP SoC market, where dedicating specialized units at the cost of added die area in order to save power is quite common.
Power efficiency, power consumption, heat, noise, etc. matter in all areas now, not just mobile. GPU performance and power efficiency are good enough nowadays that there should not be an overwhelming need to render pixels at reduced precision, IMHO.
I could have sworn it was NVIDIA that claimed, up until their Aurora GPU (Tegra4), that FP20 is perfectly sufficient for pixel shader ALUs. What changed so dramatically overnight that FP20 was suddenly rendered insufficient and FP32 across everything became a necessity?
There isn't a single developer I've talked to recently (and no, not just those that deal exclusively with mobile games) who hasn't told me that FP16 is more than just commonplace in the ULP mobile space.
Again, Rogues have FP32 ALUs for wherever FP32 is a necessity, and yes, they could have saved the wee bit of extra die area they spent on FP16 ALUs, but they wouldn't have saved as much power. Last time I enquired about FP64 units, synthesis alone for one of those came to 0.025mm² @ 28LP for 1GHz. Now imagine how much lower that would be for an FP16 unit at roughly half the frequency and under 20SoC.
Also, don't take AMD's and Intel's decisions to add special FP16 instructions too lightly. It's likely that NVIDIA will follow suit in due time, and beyond that it's just a matter of how you lay things out in hardware; as long as you provide the precision each application requests, it is not an issue.
On the former generation Series5 GPU IP you could get from each ALU lane:
2*FP32
2*FP16
Vec3 or Vec4 INT8
There was no performance gain from using FP16 in that case either; what you're arguing about here is purely a matter of hardware implementation. As I said, that does not mean FP16 gets used in spots where higher precision (up to OGL_ES2.x "highp") is a requirement.
Last but not least, there's an ungodly number of Mali200/4x0 GPUs out there with a maximum pixel shader precision of FP16, and they're the lowest common denominator these days.