This has been confirmed by Apple's GPU driver team lead on Twitter.
I have also run some benchmarks (see the posts above) that show FMA throughput on M1 and A14 GPUs.
I would guess mainly because it's a very intricate engineering puzzle. Getting the deferred rendering behavior right in hardware while handling all the edge cases sounds extremely complicated... In the end, it seems that IMG were the only ones to get it right (which kind of makes sense when you consider that they have been at it consistently since forever), and they have patented the hell out of it. Apple has inherited their tech, possibly making some improvements here and there. I don't think anything else on the market is currently a "real" TBDR — Mali and friends seem to be using a simple tiled immediate-mode renderer, often relying on hacks like vertex shader splitting to make the data management simpler.
As for the inherent limitations, well, that has been the very topic of this thread, so I would recommend reading it. Just keep in mind that there has been a lot of confusion regarding the various rendering approaches, and some posts discuss limitations of mobile hardware that do not necessarily apply to Apple GPUs.
In my book, the main limitation of TBDR is that it has to keep the results of the vertex shader stage around until the tile is fully shaded. Vertex shader outputs are stored in a buffer in device RAM, which puts additional bandwidth pressure on an already bandwidth-limited GPU. Usually this is not a problem, since TBDR will save much more bandwidth in the fragment shading stage later on, but if you have a lot of small primitives (with things like transparency etc. on top), deferred shading might end up being more expensive. This is the reason why TBDR is thought to be incompatible with some of the more recent techniques like mesh shading. The latter is built around the idea of generating many small primitives on the GPU and shading them immediately — a great fit for modern desktop GPUs, but it doesn't really work for TBDR GPUs.
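To put some purely illustrative numbers on it (these are assumptions for the sake of the argument, not measurements): say a frame has 2 million visible triangles at 3 vertices each, with ~48 bytes of position and interpolant data per vertex, ignoring vertex reuse. That's roughly 2M × 3 × 48 B ≈ 288 MB written to the buffer after binning and read back again during tile shading, so ~576 MB of traffic per frame, or about 35 GB/s at 60 fps. Against the ~68 GB/s of total memory bandwidth something like an M1 has, that is no longer a rounding error.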
What I really like about Apple's implementation is that shading ultimately occurs over well-defined tiles with strong memory ordering guarantees. Basically, pixel shading is nothing more than a compute shader invocation over a 16x16 or a 32x32 pixel grid, and this compute shader fetches the necessary primitive and texture data. Metal exposes this fact directly to the programmer. You can actually use your own compute shaders to work on the tile data and transform it in a variety of ways. Even more, you can run a sequence of pixel and compute shaders, and they will share the same fast persistent on-chip memory (local/shared memory) to communicate between invocations. This allows a lot of things to be implemented in a rather elegant fashion and makes the GPU less reliant on memory bandwidth.
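To make that concrete, here is a minimal sketch of the pattern in the Metal Shading Language, assuming the implicit imageblock layout that maps color attachment 0 into tile memory. TilePixel and tile_boost are names I made up for illustration; treat this as a sketch, not production code:

```
#include <metal_stdlib>
using namespace metal;

// The fragment stage writes color attachment 0 as usual; on an Apple GPU
// that attachment lives in on-chip imageblock (tile) memory.
struct TilePixel {
    half4 color [[color(0)]];
};

// A tile compute function that runs mid-render-pass over one tile.
// It reads and rewrites the on-chip pixels without the tile ever
// round-tripping through device memory.
kernel void tile_boost(imageblock<TilePixel, imageblock_layout_implicit> blk,
                       ushort2 tid [[thread_position_in_threadgroup]])
{
    threadgroup_imageblock TilePixel *px = blk.data(tid);
    px->color = min(px->color * 1.2h, half4(1.0h)); // toy per-pixel transform
}
```

On the host side you would create such a kernel through a MTLTileRenderPipelineDescriptor and launch it mid-render-pass with dispatchThreadsPerTile(); the interesting part is that the tile's pixels stay in on-chip memory between the fragment and tile compute phases.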