What I was saying (and what I should've worded better, apparently) was that the metric 'M tiles * N TMUs/tile in a multi-tile setup = the same number of TMUs in a non-tiled setup' is flawed. In a non-tiled setup you can have all TMUs working toward a single fragment at a time. Not so in a multi-tile setup - there the number of TMUs per tile puts a hard cap on how many TMUs can work toward a common fragment. Ergo my hapless mention of all units working on the same tile.
If we really meant to compare an MP, locale-division setup to something more canonical, we'd need to go to further lengths, looking at ALU/ROP, ALU/TMU, TMU/ROP, etc. ratios rather than just summing unit counts.
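To put toy numbers on both points (the per-fragment cap and the ratios) - these are placeholders, not meant to describe any real chip:

```python
# Two hypothetical setups whose headline TMU counts come out equal.
single_core = {"partitions": 1, "alu": 16, "tmu": 8, "rop": 8}  # per partition
mp4_style   = {"partitions": 4, "alu": 4,  "tmu": 2, "rop": 1}  # per partition

for name, cfg in (("single-core", single_core), ("MP4-style", mp4_style)):
    total_tmus = cfg["partitions"] * cfg["tmu"]
    # Same total, but a different cap on TMUs per partition and different ratios.
    print(f"{name}: {total_tmus} TMUs in total, {cfg['tmu']} per partition; "
          f"ALU/TMU = {cfg['alu'] / cfg['tmu']:.1f}, "
          f"ALU/ROP = {cfg['alu'] / cfg['rop']:.1f}, "
          f"TMU/ROP = {cfg['tmu'] / cfg['rop']:.1f}")
```

Both come out at 8 TMUs if you just sum, yet the per-partition cap and the ALU/ROP and TMU/ROP ratios differ, which is the whole point.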
You can't have all the TMUs in a chip working towards a single fragment in any other highly parallel but single-core GPU setup either. They all follow the same kind of hierarchical layout, dedicating some number of TMUs to a single SIMD stream.
But I don't really see what difference it makes in any reasonable GPU workload. Fragments with a lot of TMU dependencies will take longer to execute, but four of them will still be computable in parallel, and you should always have at least that level of parallelism in anything worth doing on a GPU. There may be coarser load-balancing granularity overall, but that's again compensated for by good thread-level load-balancing.
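Roughly the mental model, as a toy sketch - the latency and fetch counts here are purely illustrative, not modelled on any real scheduler:

```python
FETCH_LATENCY = 100   # cycles from issuing a texture fetch to getting the result
TMUS = 2              # fetches that can be issued per cycle (one per TMU)

def one_fragment(dependent_fetches):
    # Each fetch waits on the previous result: a serial chain.
    return dependent_fetches * FETCH_LATENCY

def many_fragments(fragments, dependent_fetches):
    # With enough independent fragments in flight, the TMUs keep issuing
    # while earlier fetches are outstanding, so only the final results'
    # latency is exposed on top of raw issue throughput.
    total_fetches = fragments * dependent_fetches
    return total_fetches // TMUS + FETCH_LATENCY

print(one_fragment(4))         # 400 cycles for one texture-dependency-heavy fragment
print(many_fragments(256, 4))  # ~612 cycles for 256 of them, amortized
```

A single dependency-heavy fragment is slow, but throughput over many fragments barely notices, which is why I don't think the per-tile TMU cap matters much in practice.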
The actual ratios you're looking for are 2 TMUs to 4 USSE2s, each of which has a vec4-ish FMADD ALU (it can co-issue, and if that means USSE1-style operations, it'd be vec2 FP16 or vec2 FP32 with a shared input). As far as I understand it, each USSE2 is capable of outputting a pixel per cycle, but I don't know what constitutes an ROP in this case. I think part of the blending is handled in the fragment shading and part as a fixed-function output per cycle. I don't know whether the USSE2 has dedicated resources for that or needs to take instruction issue slots and ALUs.
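Taking those numbers at face value (they're my understanding, not a datasheet, and this ignores co-issue), the per-core, per-clock back-of-envelope would be something like:

```python
USSE2_PER_CORE = 4
TMUS_PER_CORE = 2
LANES_PER_USSE2 = 4        # "vec4-ish" FMADD ALU
FLOPS_PER_FMADD_LANE = 2   # a fused multiply-add counts as two flops

flops_per_clock = USSE2_PER_CORE * LANES_PER_USSE2 * FLOPS_PER_FMADD_LANE   # 32
texels_per_clock = TMUS_PER_CORE                                            # 2
pixels_per_clock = USSE2_PER_CORE  # if each USSE2 really can output a pixel/cycle

print(flops_per_clock, texels_per_clock, pixels_per_clock)  # 32 2 4
```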
This topic actually brings an interesting consideration to mind. Each of the four USSE2s in an SGX543 should be capable of operating on a completely independent thread, and of switching among another 3 (I think?) threads with zero overhead. That means an MP4 configuration has 16 threads executing in parallel, with quick switching among a pool of 64. So there should be little overhead in running a whole bunch of separate execution paths, or even different shaders, on the GPU in parallel, compared to other GPUs. I wonder what kind of impact that'll have on general purpose compute.
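Spelling the arithmetic out (the per-USSE2 thread count is my guess from above, not a confirmed figure):

```python
CORES = 4                  # MP4
USSE2_PER_CORE = 4
THREADS_PER_USSE2 = 4      # 1 executing + 3 it can switch to with zero overhead

executing = CORES * USSE2_PER_CORE          # 16 threads running at once
resident = executing * THREADS_PER_USSE2    # 64 threads to switch among

print(executing, resident)  # 16 64
```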