It depends (TM). There are a lot of MULs, but the ALU performance "per flop" is probably not higher for 8XT than for 7XTP (especially as 7XTP had serial FMAs, unlike 7XT and 8XT where these ALUs are parallel) - the benefit is higher perf/mm2, not necessarily higher perf/flop. But for the USC overall rather than just the ALU, we also gain efficiency from more pipelines processing in parallel with the "primary" pipeline for "free", e.g. complex ops (reciprocal etc.), conditional branches, texturing, memory load/store, etc.
There are also a lot of low-level optimisations to avoid various real-world bottlenecks that can't be described just by looking at peak rates. In terms of overall performance, for a lot of mobile workloads (including the most complex benchmarks), the ALU:TEX ratio was arguably slightly too high on 7XTP, so in terms of overall perf/mm2 we benefit from having slightly fewer flops per pixel/clk - and for higher-end cores/content, we have the ability to scale up to 3 USCs per 8-wide TPU (vs 2 USCs currently).
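To make the ALU:TEX scaling point concrete, here is a rough sketch of the arithmetic. It assumes "8-wide" means 8 texture samples per clock, and the per-USC flops figures are made-up placeholders rather than real specs; the function name and parameters are purely illustrative. The only thing it shows is that going from 2 to 3 USCs per TPU raises the ALU:TEX ratio by 1.5x whatever the per-USC rate happens to be.

```python
from fractions import Fraction


def alu_tex_ratio(uscs_per_tpu, flops_per_usc_clk, tpu_width):
    """Flops available per texture sample per clock (illustrative model only)."""
    return Fraction(uscs_per_tpu * flops_per_usc_clk, tpu_width)


# Hypothetical per-USC flops/clk values, NOT real hardware figures.
# They are only here to show that the relative change does not depend on them.
PLACEHOLDER_FLOPS_PER_USC_CLK = [10, 100, 1000]

TPU_WIDTH = 8  # assumed: samples per clock for the 8-wide TPU

for flops in PLACEHOLDER_FLOPS_PER_USC_CLK:
    current = alu_tex_ratio(2, flops, TPU_WIDTH)  # 2 USCs per TPU (current)
    scaled = alu_tex_ratio(3, flops, TPU_WIDTH)   # 3 USCs per TPU (scaled up)
    print(flops, scaled / current)  # the ratio column is always 3/2
```

So the scale-up option is a way to dial the ALU:TEX balance back up by 50% for heavier content without touching the TPU.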
Also, we tend to talk a lot about ALU efficiency, but I think it's worth highlighting that the new 8-wide TPU is *really* efficient - it's a lot smaller than 2x the 7XTP 4-wide TPU at pretty close to 2x the performance. Geometry performance is also great, framebuffer compression is slightly improved, etc. All of that is less "sexy" than ALU changes so it doesn't get as much attention, but it all adds up to a very efficient and balanced architecture.