Xenon isn't really flush with execution units as I understand it. In fact, all 3 cores combined are only roughly equal to an A64 in terms of separate units (I think; it's a while since I looked).
Perhaps in unit count, but not in functionality.
Since each Xenon core has an additional int unit, there is a total of six full integer units across the three cores.
A64 has three int units and three AGUs, but the AGUs only handle address generation, which limits the range of work they can be applied to.
For scalar FP, each VMX unit can issue one math op and one memory op.
A64 can issue one ADD + MUL + MEM.
Over three cores, Xenon can manage 3 math and 3 memory ops.
A64 in one core could handle a max of 3 ops of the prescribed mix, period.
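To put numbers on the scalar FP comparison, here is a rough tally in Python. It only uses the figures quoted in this post (one math plus one memory op per Xenon core, one ADD + MUL + MEM on A64); the names `peak_ops`, `XENON_PER_CORE` and `A64_PER_CORE` are made up for the illustration, and these are peak rates, not what real code would sustain.

```python
# Peak per-clock scalar FP issue, using only the figures quoted in this thread.
# All names here are hypothetical; nothing below is measured.
XENON_CORES = 3
XENON_PER_CORE = {"math": 1, "mem": 1}          # one math op + one memory op
A64_PER_CORE = {"add": 1, "mul": 1, "mem": 1}   # fixed ADD + MUL + MEM mix


def peak_ops(per_core, cores=1):
    """Return per-kind and total peak ops per clock across `cores` cores."""
    totals = {kind: count * cores for kind, count in per_core.items()}
    totals["total"] = sum(per_core.values()) * cores
    return totals


xenon = peak_ops(XENON_PER_CORE, XENON_CORES)
a64 = peak_ops(A64_PER_CORE)

print("Xenon, 3 cores:", xenon)
print("A64, 1 core:   ", a64)
```

Note that the A64 total only holds when the code actually offers one ADD, one MUL and one memory op each clock, while the Xenon total needs all three cores kept busy.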
The load/store unit on A64 can handle two ops. I don't know about Xenon, but if each core's load/store can only handle one op, it's still more than A64.
Each Xenon core has its own L1, so from a cache perspective, things could be interesting. If Xenon is fully dual or pseudo-dual ported in its data caches, it would have three times the cache porting of an A64. Otherwise, a single-ported cache would still leave Xenon with one more port than A64. Since bank conflicts can restrict A64 to a single access, Xenon's advantage is probably greater in the single-ported and true dual-ported cases.
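The cache port scenarios can be laid out the same way. This is a sketch under the assumptions stated above: A64 gets two accesses per clock at best and one on a bank conflict, and Xenon's per-core port count is unknown, so both cases are tried. The constant and function names are hypothetical.

```python
# Aggregate L1 data cache ports per clock under the two Xenon guesses.
# Assumed figures: A64 manages 2 accesses at best, 1 on a bank conflict.
XENON_CORES = 3
A64_PORTS_BEST = 2
A64_PORTS_CONFLICT = 1


def xenon_ports(ports_per_core):
    """Total data cache ports across all Xenon cores."""
    return XENON_CORES * ports_per_core


for label, per_core in (("single-ported", 1), ("dual-ported", 2)):
    total = xenon_ports(per_core)
    print(f"{label}: Xenon {total} vs A64 {A64_PORTS_BEST} "
          f"(or {A64_PORTS_CONFLICT} on a bank conflict)")
```

A pseudo-dual ported Xenon cache would presumably suffer bank conflicts of its own, so the dual-ported row is an upper bound for that case.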
On fully threaded code, Xenon is also capable of a sustained instruction issue of six instructions per clock, while A64 can only manage three.
The instructions aren't equivalent, since reg/mem ops would count as two instructions under PowerPC, but code that tries to avoid such traffic would lead to a mix more favorable to Xenon.
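Here is a quick way to see how much the reg/mem difference matters. The sketch assumes every A64 reg/mem op counts as exactly two PowerPC instructions, as stated above, and that both chips hit their peak issue rates; `a64_in_ppc_terms` and the sample fractions are made up for illustration.

```python
# Express A64's peak issue rate in PowerPC-equivalent instructions per clock.
# Assumption: each reg/mem op counts as exactly two PowerPC instructions.
XENON_ISSUE = 6  # instructions per clock over three cores, fully threaded
A64_ISSUE = 3    # instructions per clock


def a64_in_ppc_terms(regmem_fraction):
    """Scale A64's issue rate by the share of its ops that are reg/mem."""
    if not 0.0 <= regmem_fraction <= 1.0:
        raise ValueError("regmem_fraction must be between 0 and 1")
    return A64_ISSUE * (1.0 + regmem_fraction)


for fraction in (0.0, 0.25, 0.5, 1.0):
    equivalent = a64_in_ppc_terms(fraction)
    print(f"reg/mem share {fraction:.2f}: "
          f"A64 ~{equivalent:.2f} vs Xenon {XENON_ISSUE}")
```

Under these assumptions A64 only matches Xenon's six per clock if every one of its instructions is a reg/mem op, and the gap widens as the code avoids that traffic.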
Of course, the catch is that A64 can throw all that hardware at one thread, while Xenon can't. Similarly, hiccups in execution are more easily hidden by A64.