For example, I would not have thought that as many things as joker454 is doing would be possible on just 1 VMX128 (and barely scratching the surface at that). joker454, one of these days you're going to have to tell us which one it is. Hints are great and all, but I'm a little slow.
Well I'm sure it's been said by others before but we just can't go into specifics. I can generalize and talk smack about either console all day, but if I go as far as to say "ok this is how I implement X, this is my code, these are my data structures", then I'll get canned. Anything that reveals specific company code implementations is, shall we say, seriously frowned upon. Hence all the vagueness alas ;(
I'm interested, since he brought it up, in how this is handled with the PS3.
Strictly speaking, my predicated tiling code isn't needed on the PS3 version, so technically that part is free on PS3.
My spu code for the visibility check on said 40k+ crowd is faster than my 360 VMX code though.
Of perhaps more interest is that both sets of code do not do the same thing. The 360 vmx visibility code is not a 100% accurate check, more like a 90% or so approximation, so it ends up sending more verts to the gpu than it needs to. But that's ok because the 360's gpu is a monster so I lean on it more.
My PS3 code does a full frustum check on every guy so it's 100% accurate. This is more complicated than the 360 version but that's ok because the PS3's cpu is a monster so I lean on it more.
So even though my PS3 visibility code does more than its 360 counterpart, it still actually runs faster. It's definitely not an order of magnitude difference like was suggested somewhere earlier, but it's clearly faster.
Given my cpu experience so far with both machines, two trends seem to have emerged. "Optimized" spu code will outrun "heavily optimized" 360 vmx code. Also, "sloppy" code will fare better on spu's than it would on 360 vmx. But I'm still learning here, so I'm definitely interested in hearing other people's experiences.
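To make the "approximate vs. full check" trade-off concrete, here is a rough C++ sketch. It is not the actual code from either version (that can't be shared, as noted above); the types, function names, and the choice of "only test the two side planes" as the cheap approximation are all assumptions for illustration. Note too that a sphere-vs-plane test can still accept a sphere sitting just outside a frustum corner, so even the "full" version here is slightly conservative.

```cpp
#include <array>

// Hypothetical types: none of these names come from the original post.
struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };   // a point p is inside when dot(n, p) + d >= 0

inline float signedDistance(const Plane& pl, const Vec3& p) {
    return pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d;
}

// Full check: bounding sphere against all six frustum planes.
// Written without an early-out so the loop body contains no data-dependent branch.
inline bool visibleFull(const std::array<Plane, 6>& frustum,
                        const Vec3& centre, float radius) {
    bool inside = true;
    for (const Plane& pl : frustum) {
        inside &= (signedDistance(pl, centre) >= -radius);
    }
    return inside;
}

// Cheap approximation (assumed): only test planes 0 and 1 (say, left and right).
// It never rejects something the full check accepts, but it lets through
// objects that are above, below, behind, or beyond the frustum, so the GPU
// ends up receiving more geometry than strictly necessary.
inline bool visibleApprox(const std::array<Plane, 6>& frustum,
                          const Vec3& centre, float radius) {
    const bool insideLeft  = signedDistance(frustum[0], centre) >= -radius;
    const bool insideRight = signedDistance(frustum[1], centre) >= -radius;
    return insideLeft & insideRight;
}
```

The cheap version does a third of the plane tests per object and pays for it with false positives; which side of that trade is better depends on whether the CPU or the GPU has the headroom.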
Asher said:
The cases where Xenon can outperform SPEs rely mostly on branching code of any kind. This isn't a secret, and yes, there are ways to implement such code on SPEs as well, but the performance doesn't come close to Xenon. And yes, this is why there's a PPE in Cell as well. But there's only one PPE.
For the most part, I've banned branches in tight-running, heavy-lifting code. It seems that no matter what I do, or what I try, code on these boxes always runs faster with no branches, even if it means adding way more instructions to get around them. Sometimes it's substantially faster. From what I see, the minute you hit a branch the compiler can no longer effectively, or as effectively, schedule code to hide latency and bam, you're toast.
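As a generic illustration of "more instructions, no branch" (not anyone's shipping code), here is a small C++ sketch of the usual trick: compute both results, then pick one with a bit mask instead of an if. The function names are made up for this example, and whether a given compiler turns the branchy version into a real branch is compiler- and target-dependent.

```cpp
#include <cstdint>
#include <cstring>

// Branchy version: the straightforward way to write it.
inline float pickBranchy(bool cond, float a, float b) {
    if (cond) {
        return a;
    }
    return b;
}

// Branch-free version: both inputs are already computed, and a mask of
// all ones or all zeros selects between their bit patterns.
// More instructions than the branchy form, but no control flow.
inline float pickBranchFree(bool cond, float a, float b) {
    std::uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);   // reinterpret float bits safely
    std::memcpy(&ub, &b, sizeof ub);

    // cond == true  -> mask = 0xFFFFFFFF
    // cond == false -> mask = 0x00000000
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    const std::uint32_t ur = (ua & mask) | (ub & ~mask);

    float result;
    std::memcpy(&result, &ur, sizeof result);
    return result;
}
```

Both functions return the same value for the same inputs; the second just trades a possible branch for a few extra logical operations, which is the kind of swap described above.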
Using the visibility check as an example, it will loop thru all the crowd dudes but process them in batches of 8. So grossly simplified, it may be something like:
for( int i=0; i<40000; i+=8 )
{
    Process(i);
    Process(i+1);
    Process(i+2);
    // ...and so on through Process(i+7)
}
...where Process() is inlined. Looking at the code generated by the compiler shows that in this particular case, that seems to be the sweet spot that lets it use all registers to mask lots of latency. So it's able to tear thru it. But add one branch in there and pain results. The non-branch code is almost twice as long as the branch version, but it still smokes it. These CPUs seem to be extremely sensitive; proper instruction scheduling seems to be critically important. Quite the change from Intel chips, where you can feed them any garbage and they happily eat it.
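For anyone who wants something compilable to experiment with, here is a self-contained C++ sketch of the batches-of-8 pattern described above. Everything specific in it is an assumption: the Dude and Plane structs, the single-plane test standing in for the real visibility check, and the function names. The only things taken from the description are the unroll factor of 8, the inlined per-item function, and the absence of branches in the loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical data layout, for illustration only.
struct Dude  { float x, y, z, radius; };
struct Plane { float nx, ny, nz, d; };   // inside when dot(n, p) + d >= 0

// Stand-in for the inlined Process(): one sphere-vs-plane test that
// produces a 0/1 flag with a comparison instead of an if.
inline std::uint8_t process(const Dude& g, const Plane& p) {
    const float dist = g.x * p.nx + g.y * p.ny + g.z * p.nz + p.d;
    return static_cast<std::uint8_t>(dist >= -g.radius);
}

// Walks the crowd eight at a time. The eight calls are independent, which
// gives the compiler room to interleave their instructions and hide latency.
void cullBatched(const std::vector<Dude>& crowd, const Plane& p,
                 std::vector<std::uint8_t>& visible) {
    assert(crowd.size() % 8 == 0);       // sketch assumes a multiple of 8
    visible.resize(crowd.size());
    for (std::size_t i = 0; i < crowd.size(); i += 8) {
        visible[i + 0] = process(crowd[i + 0], p);
        visible[i + 1] = process(crowd[i + 1], p);
        visible[i + 2] = process(crowd[i + 2], p);
        visible[i + 3] = process(crowd[i + 3], p);
        visible[i + 4] = process(crowd[i + 4], p);
        visible[i + 5] = process(crowd[i + 5], p);
        visible[i + 6] = process(crowd[i + 6], p);
        visible[i + 7] = process(crowd[i + 7], p);
    }
}
```

Whether 8 is the right unroll factor on any other machine is something to check in the generated assembly, the same way it was found here; it depends on register count and instruction latencies.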