Seems as if we agree for the most part.
Consider a scatter algorithm, like an IIR Gaussian approximation. You'll need to do a forward and a backward pass over each row and column, and the intermediate values need to be float precision if you want extreme blurs. 1280*3*sizeof(float) = 15360B per row, and that's assuming you somehow got rid of the alpha channel.
Even if everything is gather based, you can have more data to gather in LS than in L1.
If your effect has nicely independent pixels, then of course, as I stated earlier, it's a pretty even playing field. And for something like tone mapping, VMX128 should be a good chunk faster. Again, I'm not saying VMX128 is bad (or even that VMX32 is bad), or that there aren't cases where it can be faster than an SPE.
This is more a case of the SPEs being able to run a wider class of algorithms efficiently at high utilization. I'll need to think about whether there's an interesting class of algorithms where VMX will be significantly faster for architectural reasons. The L2 cache lines are 128B, so that's a pretty DMA-able size...
If this has been your experience with writing PS3 games, then kudos to you guys for going all the way. This is not usually how it works.
I really don't think many games have significantly more than 1% of their codebase on the SPUs, but that's just a gut feeling.
In any case, I can't really talk too much about ease of development, since I've not done a whole lot of VMX128 coding. So I'll stick to commenting on chip design, where I actually might know what I'm talking about.