And all modern CPUs have full support for conditional moves. The cases in which they are effective are few and far between.
Yeah, and about zero compilers do even a halfway decent job with them. As long as the cost of the branch is less than the cost of the computation within, a conditional move is pretty useless. The chances that the branch will be worse pretty much rest on the misprediction rate and whether you need to branch outside of the i-cache. Either way, it's pretty much a matter of bad luck, but that's the only kind of luck there is.
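To make the trade-off concrete, here's a rough sketch in C of the same operation written with a branch and as a branchless select. The function names are made up for illustration, and whether a given compiler actually emits a conditional move (or a branch) for either form is entirely up to the compiler and target; this only shows the shape of the two options.

```c
#include <stdint.h>

/* Branchy form: typically compiles to a compare-and-jump, so its cost
   depends on how well the branch predicts. */
int32_t clamp_min_branch(int32_t x, int32_t lo)
{
    if (x < lo)
        return lo;
    return x;
}

/* Branchless form (hypothetical helper): turn the comparison into an
   all-ones or all-zeros mask and blend the two candidates. Both values
   must already be computed, which is the price of avoiding the branch. */
int32_t clamp_min_select(int32_t x, int32_t lo)
{
    int32_t mask = -(int32_t)(x < lo);   /* -1 (all ones) if x < lo, else 0 */
    return (lo & mask) | (x & ~mask);
}
```

Both functions return the same result for every input. The select form only pays off when evaluating both sides is cheaper than the expected cost of the branch, which is the point above.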
Also, in game code, at least in computational tasks, the amount of computation done in the body of each branch is rarely *that* big. On top of that, there are all kinds of miscellaneous little insertions that are usually meant to *avoid* computation.
The issues exist for both. The issues cost the same for both.
So you're saying larger caches and/or speculative prefetches do nothing for miss rate? Keep in mind that in the 360's case, the 1 MB is shared among 3 cores, so you either make the cache smaller (partitioning), or you endure the increase in miss rate caused by multiple threads mapping whatever they're accessing to many of the same sets. Again, bad luck, but there isn't any other kind. If you have 2 MB for 2 cores vs. 1 MB for 3 cores vs. 512 KB for 1 main core (the rest are not really cache), guess who has the highest miss rate?
If your cache miss rate is 2% and your memory latency is 800 cycles, that's still a lower mean CPI cost than a miss rate of 5% with a memory latency of 600 cycles (about 16 stall cycles per access on average vs. 30).
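The arithmetic behind that comparison, as a small C sketch. The function name is made up, and the model is deliberately crude: it assumes every miss pays the full memory latency, with nothing hidden by prefetching or overlapped with other work.

```c
/* Mean stall cycles added per memory access: the fraction of accesses
   that miss, times the latency paid on each miss. Simplified model that
   ignores any overlap of misses with other work. */
double miss_cost_per_access(double miss_rate, double miss_latency_cycles)
{
    return miss_rate * miss_latency_cycles;
}
```

With the numbers above, `miss_cost_per_access(0.02, 800.0)` comes out to about 16 cycles and `miss_cost_per_access(0.05, 600.0)` to about 30, so the lower miss rate wins despite the longer trip to memory.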
I don't see why the difference would be practically night and day.
When you get right down to it, they're pretty much made for stream computing. The most significant fraction of their computing power is there. It's a matter of how frequently madame bad luck rears her head. And the thing is that with these next-gen CPUs, the performance loss per moment of bad luck of some type is pretty linear all the way.