Although this may be inappropriate for this thread, I have to ask if you have any opinion on just how common code that ran at peak FP rate was on the 360. The CPU always seemed extremely top-heavy in terms of peak FP rate vs. general code performance and overall capabilities. But then, my coding experience is from chemical science as opposed to games.
If (close to) 30% of CPU cycles were spent in tight, full-rate loops, then you are obviously forced to find other ways to achieve your results. If, on the other hand, it was typically on the order of 5%, then the lack of strong SIMD capabilities per se is not necessarily a big deal when porting.
I never really saw any utilization statistics.
Those old in-order PPC cores were pathetic at running general purpose code, such as unoptimized game logic. A memory stall cost 600 cycles, and the CPU didn't have any data prefetch hardware. You had to manually prefetch every single cache line (even in linear array accesses) or you suffered the 600 cycle penalty for every 128 bytes (the cache line size). If you didn't cache-optimize your memory access patterns and didn't write manual prefetches, that code ran at least 10x slower.

Another big bottleneck for generic code was load-hit-store (LHS) stalls. There was no store forwarding hardware: if you read a memory location that was recently written, you suffered a ~40 cycle stall. The problem is that common calling conventions write parameters to the stack (= memory), and the function then reads them back from the stack (= memory) at the beginning. This is a sure way to get an LHS stall. Practically every function call caused one, so people had to inline everything, and that killed the instruction cache.
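To make the prefetch pattern concrete, here's a minimal sketch of walking a linear array while touching cache lines well ahead of the current read. I'm using GCC's generic `__builtin_prefetch` as a stand-in (the 360 compiler exposed its own dcbt intrinsic, if I remember right), and the eight-line lookahead distance is illustrative, not a tuned value:

```c
#include <stddef.h>

#define CACHE_LINE 128                 /* Xenon cache line size in bytes */
#define LOOKAHEAD  (8 * CACHE_LINE)    /* illustrative prefetch distance */

/* Sum an array, hinting cache lines well ahead of the current read so
   the data is already in cache when the loads execute. Without the
   hint, every new 128-byte line costs a full ~600 cycle trip to memory. */
float sum_prefetched(const float *data, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        /* one hint per cache line (32 floats per 128 bytes) */
        if ((i & 31) == 0)
            __builtin_prefetch((const char *)&data[i] + LOOKAHEAD);
        sum += data[i];
    }
    return sum;
}
```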
Vector code could reach high IPC, but only if you unrolled loops heavily. There were 128 vector registers per thread (so 256 per core). As said earlier, spilling data to the stack caused 40 cycle LHS stalls, so you had to keep everything in registers. Moving data between scalar registers and vector registers also caused an LHS stall, because there was no direct path between the two register files; data had to go through memory (store -> load = stall). You couldn't directly move vector data into a scalar register to perform branching, so branching on float/vector data always caused a big stall. The vector pipeline was super deep (12-14 cycles for madd and an AoS dot product). Because of this, you had to have lots of independent instructions in flight; 2x128 registers made this possible. With no register renaming, tight loops were not possible: one unrolled loop iteration had to contain enough independent instructions, using most of the registers, to fill the pipelines.
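A hedged sketch of what such an unrolled loop looked like, written with standard AltiVec intrinsics (VMX-128 was a superset with more registers, but the pattern is identical). The kernel name and the scale/bias operation are made up for illustration; the point is the four independent vec_madd chains that keep a 12-14 cycle pipeline busy instead of serializing on one accumulator:

```c
#include <altivec.h>

/* Hypothetical kernel: out[i] = in[i] * scale + bias, 16 floats per
   iteration. Four independent vec_madd chains are in flight at once,
   so the deep pipeline never waits on a previous result. Real code
   unrolled far more aggressively -- with 128 registers per thread you
   could keep dozens of chains live. Pointers assumed 16-byte aligned,
   count a multiple of 16. */
void scale_bias4(float *out, const float *in, int count,
                 vector float scale, vector float bias)
{
    for (int i = 0; i < count; i += 16) {
        vector float a = vec_ld(0,  &in[i]);
        vector float b = vec_ld(16, &in[i]);
        vector float c = vec_ld(32, &in[i]);
        vector float d = vec_ld(48, &in[i]);
        a = vec_madd(a, scale, bias);   /* four independent chains */
        b = vec_madd(b, scale, bias);
        c = vec_madd(c, scale, bias);
        d = vec_madd(d, scale, bias);
        vec_st(a, 0,  &out[i]);
        vec_st(b, 16, &out[i]);
        vec_st(c, 32, &out[i]);
        vec_st(d, 48, &out[i]);
    }
}
```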
In practice people were forced to separate vectorized code into big unrolled loops and use only vector registers and vector instructions in that chunk of code (to avoid LHS stalls). Mixed vector + scalar code was not efficient at all. So if you wanted to vectorize something, you had to hand-write the whole processing function with pure VMX-128 intrinsics, and it had to process lots of data to hide the pipeline latency and the LHS stalls on both sides. VMX-128 programming was thus similar to offloading data to a GPGPU, but of course with much lower latency, so you didn't have to do it asynchronously (and fetch the results back next frame).
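Part of keeping a function "pure VMX" was replacing branches with vector compares and selects, since branching on vector data meant a round trip through memory. A small sketch of the idiom, again in standard AltiVec, clamping negative lanes to zero without ever leaving the vector register file:

```c
#include <altivec.h>

/* Branch-free max(x, 0) per lane: the compare and select stay entirely
   in the vector unit, so no data crosses to scalar registers and no
   LHS stall or branch is possible. */
static inline vector float clamp_to_zero(vector float x)
{
    vector float zero = (vector float){0.0f, 0.0f, 0.0f, 0.0f};
    vector bool int mask = vec_cmpgt(x, zero);   /* lanes where x > 0 */
    return vec_sel(zero, x, mask);               /* pick x or 0 per lane */
}
```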
We VMX-128 vectorized many things, and so did most studios. We had several threads (such as particle simulation) that were pretty much running pure VMX-128 code. But most of the generic game/engine code was not vectorized at all and was thus running at very low IPC. I was frequently profiling game/engine code, adding cache hints, and fixing the most crucial LHS stalls. Only a small part of the code was vectorized, but quite a big percentage of CPU time was spent in optimized VMX-128 loops. Hard to give typical numbers since every game is different.
Update:
We had 6 core-locked threads like most other studios (3 cores with SMT/hyperthreading). It was crucial to have two threads running per core, as that was the only way to hide all these long stalls. Each hardware thread had its own register file, so SMT doubled the available registers: two VMX-128 threads on the same core had 256 4-wide vector registers in total (that's a lot, even by today's standards). So it was definitely possible to keep the long pipelines fed with independent instructions.
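For what it's worth, a rough sketch of that six-thread setup, assuming the 360 XDK's XSetThreadProcessor call for pinning a thread to a hardware thread (the header name and exact signature here are from memory, so treat them as assumptions):

```c
#include <xtl.h>   /* Xbox 360 XDK header (assumed) */

static DWORD WINAPI WorkerMain(LPVOID param)
{
    /* ... job loop: particle sim, animation, audio, etc. ... */
    return 0;
}

/* Spawn six workers, one per hardware thread. Threads 0-1 share core 0,
   2-3 share core 1, 4-5 share core 2, so each core always has a second
   thread to soak up the other one's long memory stalls. */
void spawn_locked_workers(void)
{
    for (DWORD hw = 0; hw < 6; ++hw) {
        HANDLE h = CreateThread(NULL, 0, WorkerMain, NULL,
                                CREATE_SUSPENDED, NULL);
        XSetThreadProcessor(h, hw);  /* pin to hardware thread hw (assumed API) */
        ResumeThread(h);
    }
}
```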