Xenon VMX units - what have we learned?

nAo said:
but a special one that uses more bits to store the exponent than bits used to store the mantissa
Now you're just being evil - teasing like that without any notion of what the algorithm is for :devilish:
 
Stupid question here but what if Xenon had say 2 SPEs (equivalent) for every one of its cores? Would that have been a good solution for general and FP computing?
 
Stupid question here but what if Xenon had say 2 SPEs (equivalent) for every one of its cores? Would that have been a good solution for general and FP computing?

Probably not; it would be an even more asymmetric and strange architecture to program for than Cell is. You would then have to split jobs between 3 PPUs and 6 SPUs. Also, those SPUs would not fit in there physically, so the question is purely speculative.
 
On the X360 you don't need to insert hints for unconditional branches, the hardware can "predict" those by looking ahead. In contrast on the SPU you have to predict everything, even unconditional branches, because by default the hardware assumes fall-through.
In case it wasn't clear, I was talking about SPUs only. For unconditional branches it should be very easy for compiler to insert a prefetch (hint) instruction, and for conditional ones even without microprofiling some static branch prediction should be trivial. But I guess compilers don't do those yet, based on your reply. I'm sure they will eventually.
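For the conditional case, a hedged sketch of what "static branch prediction" looks like at the source level today: GCC/Clang's `__builtin_expect` is the compiler-visible analogue of inserting an SPU branch hint ahead of a branch (the macro names below are mine, not from any particular codebase):

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang static branch prediction hints. On SPU a compiler could use
   this bias to place a branch hint (hbr) instruction; the builtin itself
   only biases code layout/hint insertion, it never changes the result. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Example: the negative-value path is statically predicted not-taken. */
int sum_positive(const int *a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; ++i) {
        if (UNLIKELY(a[i] < 0))   /* rare case: predicted fall-through */
            continue;
        s += a[i];
    }
    return s;
}
```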
Having said that, branches rarely turn out to be the biggest offenders in our codebase. We suffer significantly more from L2 misses and LHS (load-hit-store) penalties.
As far as I'm concerned, LHS is the biggest flop and both X360 and PS3's PPU suffer from it. Forget OOOe, I just want a store queue snoop/forwarding, that's all. Especially on PPC architecture where all conversions go through memory, having LHS penalty just blows, really, I don't know what they were thinking.
You are saying the load data is not forwarded from store queue, but only retrieved from cache?
If so, that's really shocking for me indeed. There is a store data queue right?
By the way, LHS is the number one reason why VMX rarely shows improvement over regular floating point code. If you want high performance VMX you have to baby it super carefully and watch the assembly code all the time so that no float or integer conversions sneak in. And you'll be surprised how much the compiler is NOT helping.
Thanks for the info.
 
Probably not; it would be an even more asymmetric and strange architecture to program for than Cell is. You would then have to split jobs between 3 PPUs and 6 SPUs. Also, those SPUs would not fit in there physically, so the question is purely speculative.

I thought I did say it was a stupid question :LOL:
 
Oh don't get me wrong - I absolutely agree that a large register file + loops yields great results, I was just doing my quarterly "why IBM's Reduced Instruction SIMDs suck" rant.

It's easy to run statistics that will show fancy SIMD offers relatively small gains over dumb SIMD in loop-heavy code, and the silicon usage ratio is not favorable to that increase. But there's still lots of code where the above doesn't hold true, and in those cases a smart ISA with decent latencies can make all the difference.

But efficiency comparisons aside, IMO RISIMDs (I should probably copyright this, it's much better sounding than "horizontal SIMD") are just not compiler or programmer friendly.
Even hardcore VMX nuts like Archie sorta agree with me on that :p

VMX was a 3-way design by Apple, Motorola and IBM, with the design itself led by Apple. It's generally very highly regarded and always has been.

I'm curious as to why you don't like it, and what you mean by "smart" / "dumb" SIMD.
 
You are saying the load data is not forwarded from store queue, but only retrieved from cache?
If so, that's really shocking for me indeed. There is a store data queue right?

There is a store data queue, but if you try to load the same data while it's still in the store queue, it incurs a LHS penalty, ie the load has to wait till the store queue writes out to cache/memory. It would have been a significant performance boost if the data could be forwarded from the store queue to the loader directly. LHS penalty can take up to 40-50 cycles, not counting any potential cache misses.
Even more unfortunate is that often you can't do anything about it. If you have to convert an integer to a float, you have to go through memory and boom - LHS.
What's worse: even if you have a 32-bit integer in memory already, the float unit can only load it as a 64-bit integer, so the compiler inserts a conversion to a 64-bit integer, puts it on the stack, and loads from there with the floating point unit - boom, LHS. The same problems exist for moving to/from VMX, which is a giant PITA to work around.
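A minimal C sketch of the conversion pattern being described (the function name is mine; the commented instruction sequence is roughly what a PPC compiler emits behind your back, since there is no direct GPR-to-FPR register move):

```c
#include <assert.h>
#include <stdint.h>

/* An innocent-looking int-to-float conversion... */
float to_float(int32_t i) {
    /* ...on these PPC cores compiles to roughly:
         extsw r4, r3       ; sign-extend to a 64-bit integer
         std   r4, -8(r1)   ; store it to the stack
         lfd   f0, -8(r1)   ; reload the same bytes with the FPU
         fcfid f0, f0       ; int64 -> double
         frsp  f0, f0       ; double -> float
       The lfd is not forwarded from the store queue, so it stalls
       until the std drains to cache: the load-hit-store penalty. */
    return (float)i;
}
```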
 
for example not having broadcasting modifiers sucks big time, it really doesn't make any sense in an ISA that doesn't support dot product instructions on the vast majority of its implementations.
It's like they say: use SOA, but fill your code with splats everywhere!
 
There is a store data queue, but if you try to load the same data while it's still in the store queue, it incurs a LHS penalty, ie the load has to wait till the store queue writes out to cache/memory.
the 'funny' thing is that you hit the same problem even when you are not loading the same data, due to aliasing issues introduced by the way the hw checks pointers.
 
the 'funny' thing is that you hit the same problem even when you are not loading the same data, due to aliasing issues introduced by the way the hw checks pointers.
Conroe for the ultimate win and gold? :) (okay, that wasn't nice for you poor console devs, sorry! Even Barcelona doesn't fix this though, sigh...)
 
Well I'm sure it's been said by others before but we just can't go into specifics. I can generalize and talk smack about either console all day, but if I go as far as to say "ok this is how I implement X, this is my code, these are my data structures", then I'll get canned. Anything that reveals specific company code implementations is, shall we say, seriously frowned upon. Hence all the vagueness alas ;(


Thanks for the insight even though there isn't much you can say. You said earlier that you were barely scratching the surface on 1 VMX128. Could you elaborate on this more?

Also, thanks for your examples on how this was coded on the PS3 as well.
 
for example not having broadcasting modifiers sucks big time, it really doesn't make any sense in an ISA that doesn't support dot product instructions on the vast majority of its implementations.
It's like they say: use SOA, but fill your code with splats everywhere!

What do you mean by "broadcast modifiers"?

You're not talking about SOA-AOS conversions are you?
 
What do you mean by "broadcast modifiers"?

You're not talking about SOA-AOS conversions are you?
I'm talking about this:

fmad r3, r2, r1, r0.xxxx

hence the ability to splat a scalar without having to explicitly use an additional instruction. It would save register space (no need to keep temporary splatted copies of subcomponents of a vector) and it would save additional splat instructions -> big win (but it would cost chip area and ISA 'area' :) )
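A scalar-C sketch of the cost nAo is describing (the `vec4` type and function names are mine; the point is that without a broadcast modifier the single `fmad r3, r2, r1, r0.xxxx` becomes a splat plus a madd, burning an extra instruction and an extra register):

```c
#include <assert.h>

typedef struct { float x, y, z, w; } vec4;

/* What VMX's vspltw does: replicate one component across all lanes. */
static vec4 splat_x(vec4 v) {
    vec4 r = { v.x, v.x, v.x, v.x };
    return r;
}

/* What VMX's vmaddfp does: per-lane multiply-add. */
static vec4 madd(vec4 a, vec4 b, vec4 c) {
    vec4 r = { a.x * b.x + c.x, a.y * b.y + c.y,
               a.z * b.z + c.z, a.w * b.w + c.w };
    return r;
}

/* fmad r3, r2, r1, r0.xxxx has to be written as TWO instructions,
   with the splatted temporary occupying a register of its own. */
vec4 fmad_broadcast_x(vec4 r2, vec4 r1, vec4 r0) {
    vec4 t = splat_x(r0);   /* extra splat instruction + temp register */
    return madd(r2, r1, t);
}
```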

Marco
 
There is a store data queue, but if you try to load the same data while it's still in the store queue, it incurs a LHS penalty, ie the load has to wait till the store queue writes out to cache/memory. It would have been a significant performance boost if the data could be forwarded from the store queue to the loader directly. LHS penalty can take up to 40-50 cycles, not counting any potential cache misses.
Even more unfortunate is that often you can't do anything about it. If you have to convert an integer to a float, you have to go through memory and boom - LHS.
What's worse: even if you have a 32-bit integer in memory already, the float unit can only load it as a 64-bit integer, so the compiler inserts a conversion to a 64-bit integer, puts it on the stack, and loads from there with the floating point unit - boom, LHS. The same problems exist for moving to/from VMX, which is a giant PITA to work around.
Interesting. I never knew that.

Still, how often do you really need this kind of conversion in a performance critical loop? The examples I'm thinking of involve a loop counter, but those can be FP. Do devs have a lot of data compressed into integer format that needs to be converted to float in perf. critical code?

If so, this could be a big advantage for the SPU.
I'm talking about this:

fmad r3, r2, r1, r0.xxxx
These CPUs don't have this ability? Woah. Seems so simple, but I guess every instruction bit counts. Anyway, you only regularly use this sort of code to load scalar constants, right? It doesn't seem like most code segments would need enough scalars to put a big dent in the register capacity, even with 4x replication.
 
Still, how often do you really need this kind of conversion in a performance critical loop? The examples I'm thinking of involve a loop counter, but those can be FP. Do devs have a lot of data compressed into integer format that needs to be converted to float in perf. critical code?

If you wrote your code and data structures from scratch with 360 vmx in mind then you shouldn't need this kinda conversion at all, so you can get around the int->float lhs hits completely. There's more free memory on the 360 as well so you can trade larger data structures for more speed.
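A hedged sketch of that trade (struct and function names are mine): the packed layout halves the memory footprint but pays an int-to-float conversion, and on these cores an LHS, per component in the hot loop; the float layout spends the 360's extra memory to keep the loop conversion-free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compressed layout: 8 bytes/vertex, but every use in a hot loop
   pays an int->float conversion (an LHS trigger on these cores). */
typedef struct { int16_t x, y, z, w; } PackedPos;

/* Expanded layout: 16 bytes/vertex, no conversion needed --
   trading larger data structures for more speed. */
typedef struct { float x, y, z, w; } Pos;

float sum_x_packed(const PackedPos *p, size_t n, float scale) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += (float)p[i].x * scale;   /* conversion inside the loop */
    return s;
}

float sum_x_float(const Pos *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;                  /* pure float path, no conversion */
    return s;
}
```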

Where you can get bit though is when porting code over from another platform, quickly VMX'ing it and thinking it will run better. Some people will be surprised when they time their VMX code only to find that it runs no faster than PPU code.

If so, this could be a big advantage for the SPU.

Yeah, they are just plain better. As someone said earlier, you really need to baby VMX code to get full gains. The SPUs mercifully don't seem to be anywhere near as sensitive.
 
Would it have been possible to employ LS in the VMX128 similar to what Sony had done with the SPEs?

From Nao's comments, this seems to be a big factor in what keeps the SPEs from having to sit idle. It's also good to hear from a multiplatform dev that in some cases the Cell isn't harder to code for in some areas.

@ Barbarian: What do you suggest the engineers could have done to eliminate or alleviate this LHS penalty?
 
Would it have been possible to employ LS in the VMX128 similar to what Sony had done with the SPEs?

From Nao's comments, this seems to be a big factor in what keeps the SPEs from having to sit idle. It's also good to hear from a multiplatform dev that in some cases the Cell isn't harder to code for in some areas.

@ Barbarian: What do you suggest the engineers could have done to eliminate or alleviate this LHS penalty?

You can't put LS "in the VMX128" as it's not a separate processor the way the SPEs are. The next best thing is a large register file, and that is what was done.

The LHS penalty is eliminated and alleviated in desktop x86 CPUs, it is a relatively well known problem with a well known solution (snooping the store queue); it was probably a transistor count vs. performance trade-off consciously made by IBM engineers.

Re: floating point values as loop counters - it's not the best thing to do, as you have a pipeline flush penalty when branching on the result of a floating-point comparison operation - gee, who would want to branch after a compare?!
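A quick sketch of the loop-counter point (function names are mine): the integer version keeps the loop's compare-and-branch in the integer pipeline, while the float version branches on the result of an FP compare every iteration, which is exactly the flush-inducing pattern described above.

```c
#include <assert.h>

/* Integer counter: the loop branch depends on an integer compare --
   cheap on these cores. */
float sum_n_int(float step, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += step;
    return s;
}

/* Float counter: the loop branch now depends on a floating-point
   compare, which costs a pipeline flush per iteration on these cores. */
float sum_n_float(float step, float n) {
    float s = 0.0f;
    for (float i = 0.0f; i < n; i += 1.0f)
        s += step;
    return s;
}
```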
 
So the VMX units are "add-on" units as flec04 stated?

Are you saying that to lessen or negate the LHS penalty it would mean spending more transistors? Could you elaborate on this, please?

Are there any other negatives to having a lot of registers other than loss of space on chip (trade-offs)?
 