Xenon VMX units - what have we learned?

It doesn't strike me as a feat for either CPU to accomplish this. Like he said, a single core wasn't breaking a sweat even with a sub-optimal technique. Unless the feat itself was that sub-optimal code still ran well on VMX128.

What do you mean by "sub optimal code"?
 
but the performance doesn't come close to Xenon.

That *might* be true if your code doesn't spread itself across all the SPUs, but it's not true in general. In fact, there's a good chance that Cell is significantly better at branchy code than Xenon's CPU.

Let's do some back-of-the-envelope figures...

The *worst* case theoretical performance for the 7 SPUs (3.2 GHz at 17 cycles per branch) is ~1.3 billion branches per second...

So each of the 2 extra Xenon main cores has to sustain a branch rate of ~650 million branches per second to keep up "in theory". Or to put it another way, a worst-case timing of ~5 cycles per branch.
 
Doesn't that assume the PPU runs as fast as a Xenon core? I would expect that the reduced register count would prevent that. Anyway, your point still stands when looking at Cell vs. Xenon as a whole. I think Asher was mostly talking about per core performance, even if his statement didn't convey that.

I'm more curious about how easy it is to get Xenon running at its theoretical potential for simple brute-force calculations like we've seen with Cell.

Also, what sort of workloads run a lot faster per clock on a SPU than a Xenon core? Are there inherent advantages to reading/writing data to a local store instead of regular cache?
 
Are there inherent advantages to reading/writing data to a local store instead of regular cache?
It's faster, it's bigger, and no one (i.e. other threads, or your own data structures when associativity is not enough) competes for it. Moreover, you can have memory transactions using your local store (to read and/or write data) while you're doing something else on the same memory, with almost no impact on performance (the MFC reads 128 bytes per clock cycle from the local store, so it rarely stalls your code for any significant amount of time).
 
assen said:
Everything is under NDA. Don't expect anyone to come out in a public forum and say "well, we used VMX128 in such and such way, and got such and such improvement over plain VMX".

Well for one, there's no such thing as "plain VMX." No two implementations have been identical since Moto^H^H^H Freescale first started shipping processors with AltiVec, although I'd definitely say that the Waternoose implementation is by far the most dramatic departure from the first ones. Never mind the registers, which really aren't that significant (other than the instruction encoding to support them); the biggest change (which I'm personally not so keen on) was ripping out the VCIU and replacing it with another VFPU (for their dubious dot-product instructions).
 
That *might* be true if your code doesn't spread itself across all the SPUs, but it's not true in general. In fact, there's a good chance that Cell is significantly better at branchy code than Xenon's CPU.

Let's do some back-of-the-envelope figures...

The *worst* case theoretical performance for the 7 SPUs (3.2 GHz at 17 cycles per branch) is ~1.3 billion branches per second...

So each of the 2 extra Xenon main cores has to sustain a branch rate of ~650 million branches per second to keep up "in theory". Or to put it another way, a worst-case timing of ~5 cycles per branch.

How many cycles does a Conroe core take per branch?
 
Most of the time branches aren't a big issue (the data is), and people forget that although SPUs don't have branch prediction, they do have branch hinting, which allows good coders to write code that never mis-hints a branch.
If you have enough work to do between your hint and your branch to hide the hint latency, you're going to do better than any branch predictor out there, no matter how big and fancy it is.
 
What is VCIU?

Stands for Vector Complex Integer Unit. :)

Some info here (and on VFPU):
http://developer.apple.com/hardwaredrivers/ve/instruction_crossref.html
http://arstechnica.com/cpu/03q1/ppc970/ppc970-7.html

If that instruction table still applies, it looks like they got rid of:

vector multiply even integer
vector multiply odd integer
vector multiply-high and add saturate
vector multiply-low and add modulo
vector multiply-high round and add saturate
vector multiply-sum modulo
vector multiply-sum saturate
vector sum across 1/4 integer
vector sum across 1/2 signed integer
vector sum across signed integer

 
For example, I would not have thought that many of the things joker454 is doing would be possible on just one VMX128 unit (and he's barely scratching the surface at that). joker454, one of these days you're going to have to tell us which one it is. Hints are great and all, but I'm a little slow :oops:

Well I'm sure it's been said by others before but we just can't go into specifics. I can generalize and talk smack about either console all day, but if I go as far as to say "ok this is how I implement X, this is my code, these are my data structures", then I'll get canned. Anything that reveals specific company code implementations is, shall we say, seriously frowned upon. Hence all the vagueness alas ;(

I'm interested, since he brought it up, on how this is handled with the PS3.

Strictly speaking, my predicated tiling code isn't needed on the PS3 version, so technically that part is free on PS3 :) My SPU code for the visibility check on said 40k+ crowd is faster than my 360 VMX code, though.

Of perhaps more interest is that the two sets of code don't do the same thing. The 360 VMX visibility code is not a 100% accurate check, more like a 90% or so approximation, so it ends up sending more verts to the GPU than it needs to. But that's ok because the 360's GPU is a monster, so I lean on it more.

My PS3 code does a full frustum check on every guy so it's 100% accurate. This is more complicated than the 360 version but that's ok because the PS3's cpu is a monster so I lean on it more.

So even though my PS3 visibility code does more than its 360 counterpart, it still actually runs faster. It's definitely not an order of magnitude difference like was suggested somewhere earlier, but it's clearly faster.

Given my cpu experience so far with both machines, two trends seem to have emerged. "Optimized" SPU code will outrun "heavily optimized" 360 VMX code. Also, "sloppy" code will fare better on SPUs than it would on 360 VMX. But I'm still learning here, so I'm definitely interested in hearing other people's experiences.

Asher said:
The cases where Xenon can outperform SPEs rely mostly on branching code of any kind. This isn't a secret, and yes, there are ways to implement such code on SPEs as well, but the performance doesn't come close to Xenon. And yes, this is why there's a PPE in Cell as well. But there's only one PPE.

For the most part, I've banned branches in tight-running heavy-lifting code. It seems that no matter what I do or what I try, code on these boxes always runs faster with no branches, even if it means adding way more instructions to get around them. Sometimes it's substantially faster. From what I see, the minute you hit a branch the compiler can no longer schedule code as effectively to hide latency, and bam, you're toast.

Using the visibility check as an example, it will loop thru all the crowd dudes but process them in batches of 8. So grossly simplified, it may be something like:

for( int i=0; i<40000; i+=8 )
{
    Process(i);
    Process(i+1);
    Process(i+2);
    // etc...
}

...where Process() is inlined. Looking at the code generated by the compiler shows that in this particular case, that seems to be the sweet spot that lets it use all the registers to mask lots of latency. So it's able to tear through it. But add one branch in there and pain results. The non-branch code is almost twice as long as the branch version, but it still smokes it. These CPUs seem to be extremely sensitive; proper instruction scheduling is critically important. Quite the change from Intels, where you can feed them any garbage and they happily eat it.
 
Stands for Vector Complex Integer Unit. :)

Some info here (and on VFPU):
http://developer.apple.com/hardwaredrivers/ve/instruction_crossref.html
http://arstechnica.com/cpu/03q1/ppc970/ppc970-7.html

If that instruction table still applies, it looks like they got rid of:

vector multiply even integer
vector multiply odd integer
vector multiply-high and add saturate
vector multiply-low and add modulo
vector multiply-high round and add saturate
vector multiply-sum modulo
vector multiply-sum saturate
vector sum across 1/4 integer
vector sum across 1/2 signed integer
vector sum across signed integer


The VCIU is an implementation-specific unit; the fact that they haven't included a unit with that name doesn't really tell you whether they dropped certain instructions, as they could have just moved them elsewhere. That said, I know the saturate instructions were removed from the ISA when the SPUs were being designed, as you can do the same thing with FP.
 
Using the visibility check as an example, it will loop thru all the crowd dudes but process them in batches of 8. So grossly simplified, it may be something like:

for( int i=0; i<40000; i+=8 )
{
    Process(i);
    Process(i+1);
    Process(i+2);
    // etc...
}
You guys don't let the compiler do the loop unrolling for you?
I'm also curious whether the compiler can automatically (that is without __builtin_expect) insert pre-fetch instructions for relatively straightforward static branch prediction.
 
For the most part, I've banned branches in tight-running heavy-lifting code. It seems that no matter what I do or what I try, code on these boxes always runs faster with no branches, even if it means adding way more instructions to get around them. Sometimes it's substantially faster. From what I see, the minute you hit a branch the compiler can no longer schedule code as effectively to hide latency, and bam, you're toast.

<snip>

The non-branch code is almost twice as long as the branch version, but it still smokes it. These CPUs seem to be extremely sensitive; proper instruction scheduling is critically important. Quite the change from Intels, where you can feed them any garbage and they happily eat it.

Thanks joker454 for providing some insight into what "sub optimal" code would look like to these machines. Interesting stuff.
 
joker454 said:
it's all scheduled quite cleverly, to the point that a lot of the latency seems to be getting absorbed. I'm sure some will bring up a worst-case scenario, but from where I'm standing it's looking pretty good. There's 128 registers that can be used (per core), so just batch up your loops
Oh don't get me wrong - I absolutely agree that a large register file + loops yields great results, I was just doing my quarterly "why IBM's Reduced Instruction SIMDs suck" rant.
It's easy to run statistics that will show fancy SIMD offers relatively small gains over a dumb one in loop-heavy code, and the silicon-usage ratio is not favorable to that increase. But there's still lots of code where the above doesn't hold true, and in those cases a smart ISA with decent latencies can make all the difference.
But efficiency comparisons aside, IMO RISIMDs (I should probably copyright this, it's much better sounding than "horizontal SIMD") are just not compiler- or programmer-friendly.
Even hardcore VMX nuts like Archie sorta agree with me on that :p

archie4oz said:
Floating-point is for pansies...
I like how, whenever you stretch FP to its limits, the usual best solution is to switch to fixed point to solve the issue. And let's not even get started on how it fares cross-platform...
If a simplified FP standard were ever devised for game hardware, I wonder if designers would include all the platform-specific computation bugs as part of the standard.


ADEX said:
if they have dropped certain instructions
From one of the earlier SIMD discussions on this forum.
http://forum.beyond3d.com/showpost.php?p=950977&postcount=49

To be fair, there are obvious reasons why the VCIU arithmetic was dropped, as noted above. It'd be an interesting poll to see how many people like/dislike the tradeoff.
 
I'm also curious whether the compiler can automatically (that is without __builtin_expect) insert pre-fetch instructions for relatively straightforward static branch prediction.

On the X360 you don't need to insert hints for unconditional branches; the hardware can "predict" those by looking ahead. In contrast, on the SPU you have to hint everything, even unconditional branches, because by default the hardware assumes fall-through.
Having said that, branches rarely turn out to be the biggest offenders in our codebase. We suffer significantly more from L2 misses and LHS (load-hit-store) penalties.
As far as I'm concerned, LHS is the biggest flop, and both the X360 and the PS3's PPU suffer from it. Forget OOOe, I just want store-queue snooping/forwarding, that's all. Especially on a PPC architecture where all conversions go through memory, having the LHS penalty just blows; really, I don't know what they were thinking.
By the way, LHS is the number-one reason why VMX rarely shows improvement over regular floating-point code. If you want high-performance VMX you have to baby it super carefully and watch the assembly code all the time so that no float or integer conversions sneak in. And you'll be surprised how much the compiler is NOT helping.
 
That's why I love the SPU's local store, which is the juice that makes SPUs amazingly fast on optimized code (and makes static optimizations extremely practical..)
BTW, just changed my mind about fixed-point numbers; I've found a particular algorithm which needs a floating-point representation... but a special one that uses more bits to store the exponent than to store the mantissa :p
 