It comes down to how much precision is sufficient.
Even as the transition to 32-bit registers was underway, it was recognized that some workloads did not need that much precision; ATI, for instance, used an intermediate 24-bit precision for pixel shaders for a time.
There are processing steps and workloads that are fine with 16 bits or fewer: postprocessing passes on render targets that are already lower precision, neural network training (with inference going lower still), or iterative algorithms that refine their way to a sufficiently accurate result anyway.
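As a rough sketch of why 16 bits go a long way there (generic numpy, nothing vendor-specific): FP16 carries 11 significant bits, so 8-bit-per-channel color survives a round trip through it unchanged.

```python
import numpy as np

# All 256 possible 8-bit channel values, normalized to [0, 1] in fp16 and back.
channel = np.arange(256, dtype=np.uint8)
as_half = channel.astype(np.float16) / np.float16(255.0)
back = np.rint(as_half.astype(np.float32) * 255.0).astype(np.uint8)
assert np.array_equal(channel, back)  # no 8-bit value is lost at fp16
```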
Other things benefit from more bits in the representation: addressing larger memory spaces, position data, and algorithms that accumulate values (and error) from many sources.
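A minimal illustration of the accumulation point (again just numpy, not any particular pipeline): a naive fp16 running sum stalls once its rounding step outgrows the addends, while an fp32 accumulator over the same data keeps counting.

```python
import numpy as np

# 10,000 small fp16 terms; the true total is roughly 100.
terms = np.full(10_000, 0.01, dtype=np.float16)

acc16 = np.float16(0.0)
for t in terms:
    acc16 = acc16 + t  # naive fp16 accumulation, one rounding per add

acc32 = terms.astype(np.float32).sum()  # same inputs, wider accumulator

print(acc16, acc32)  # fp16 total stalls around 32; fp32 lands near 100
```

Wider accumulators (or tricks like compensated summation) are how you keep that error in check, which is exactly where extra bits pay for themselves.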
Those benefits have to be weighed against the costs of higher precision: power, hardware area, storage, and bandwidth.
For a while, the benefit of matching the precision used by other programmable architectures, addressing larger memories, and running more advanced algorithms was high enough to justify bumping everything to a consistent, sufficient internal precision. Then limited bandwidth growth, power consumption, and weaker gains in silicon density and cost gave designers a reason to revisit that trade-off.
We may not be done at FP16, either: some workloads can get away with less, and there are low-power concepts with FPUs that dynamically vary their precision to shave things down further.