Hmm, hold on a sec. From what I'm seeing in my VMX code, it's all scheduled cleverly enough that a lot of the latency seems to get absorbed. I'm sure some will bring up a worst-case scenario, but from where I'm standing it's looking pretty good. There are 128 registers that can be used (per core), so just batch up your loops, choose your algorithm carefully, and VMX can be quite a hoot. I'm currently doing all manner of stuff via VMX on 360: calculating predicated tiles for each member of a 40000+ crowd, visibility tests for the same 40000+ crowd, particle stuff, etc., and I'm barely scratching the surface of the power of one core. And yes, I am using the dot product instruction to a significant extent.
On paper it may not seem wise (14 cycles), but that latency can be absorbed. I've still got tons of CPU power left.
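For anyone wondering what "batch up your loops" could look like, here's a rough sketch in plain C rather than real VMX intrinsics. It is not the code described above; `Vec4`, `dot4`, and `dot4_batched` are made-up names for illustration. The idea it assumes is simply that issuing several independent dot products back to back gives the scheduler other work to do while each long-latency result is still in flight.

```c
#include <stddef.h>

/* Hypothetical 4-float vector standing in for one vector register. */
typedef struct { float x, y, z, w; } Vec4;

/* Scalar stand-in for a hardware dot product instruction. */
static float dot4(const Vec4 *a, const Vec4 *b) {
    return a->x * b->x + a->y * b->y + a->z * b->z + a->w * b->w;
}

/* Computes out[i] = dot(a[i], b[i]) for i in [0, n).
   The main loop handles four elements per iteration. The four
   results do not depend on each other, so a compiler or an
   in-order pipeline can overlap their latencies instead of
   stalling on each one in turn. */
void dot4_batched(const Vec4 *a, const Vec4 *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Issue all four independent dot products first... */
        float d0 = dot4(&a[i + 0], &b[i + 0]);
        float d1 = dot4(&a[i + 1], &b[i + 1]);
        float d2 = dot4(&a[i + 2], &b[i + 2]);
        float d3 = dot4(&a[i + 3], &b[i + 3]);
        /* ...then consume the results. */
        out[i + 0] = d0;
        out[i + 1] = d1;
        out[i + 2] = d2;
        out[i + 3] = d3;
    }
    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; ++i) {
        out[i] = dot4(&a[i], &b[i]);
    }
}
```

With a big register file you could presumably unroll further than four, as long as everything in the batch stays independent. How much of the latency actually gets hidden depends on the compiler and the hardware, so measure rather than assume.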
Cross-platform portability doesn't concern me in these cases because the heavy-hitting code is custom-written for each platform anyway. My PS3 implementation of this same code is completely different.