nAo said: but a special one that uses more bits to store the exponent than bits used to store the mantissa

Now you're just being evil - teasing like that without any notion of what the algorithm is for.

A shade of mystery.... (hint included )
Stupid question here, but what if Xenon had, say, 2 SPEs (equivalent) for every one of its cores? Would that have been a good solution for general and FP computing?
On the X360 you don't need to insert hints for unconditional branches; the hardware can "predict" those by looking ahead. In contrast, on the SPU you have to predict everything, even unconditional branches, because by default the hardware assumes fall-through.

In case it wasn't clear, I was talking about SPUs only. For unconditional branches it should be very easy for the compiler to insert a prefetch (hint) instruction, and for conditional ones, even without microprofiling, some static branch prediction should be trivial. But I guess compilers don't do those yet, based on your reply. I'm sure they will eventually.
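For what it's worth, the static prediction nAo is asking about can already be expressed from C today: `__builtin_expect` is a real GCC builtin, and on SPU/PPU toolchains the compiler can use it to lay out the likely path as fall-through or place a branch hint ahead of the branch. The `LIKELY`/`UNLIKELY` macro names and the example function below are illustrative, not from any particular codebase - a minimal sketch:

```c
#include <assert.h>

/* GCC's __builtin_expect gives the compiler a static branch prediction
   hint. On an SPU target this kind of hint lets the compiler arrange
   the likely path as fall-through (the hardware's default assumption),
   or emit a branch-hint instruction ahead of the branch, instead of
   eating the default mispredict penalty. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int sum_valid(const int *v, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        /* the error case is rare: tell the compiler so the hot path
           falls through without a taken branch */
        if (UNLIKELY(v[i] < 0))
            continue;
        sum += v[i];
    }
    return sum;
}
```

The hint changes code layout, not semantics, so it is safe to sprinkle on hot loops even when the guess is occasionally wrong.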
Having said that, branches rarely turn out to be the biggest offenders in our codebase. We suffer significantly more from L2 misses and LHS (load-hit-store) penalties.
As far as I'm concerned, LHS is the biggest flop and both X360 and PS3's PPU suffer from it. Forget OOOe, I just want a store queue snoop/forwarding, that's all. Especially on PPC architecture where all conversions go through memory, having LHS penalty just blows, really, I don't know what they were thinking.
By the way, LHS is the number one reason why VMX rarely shows improvement over regular floating point code. If you want high performance VMX you have to baby it super carefully and watch the assembly code all the time so that no float or integer conversions sneak in. And you'll be surprised how much the compiler is NOT helping.

Thanks for the info.
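A hedged sketch of the conversion trap mentioned above: the function names are made up, and whether the compiler actually emits the stack round trip depends on target and flags, but the pattern is the one being described - on these PPC cores there is no direct GPR-to-FPR/VMX move, so an integer feeding float math is converted through memory, while carrying the value as a float from the start avoids the round trip:

```c
#include <assert.h>

/* LHS-prone pattern: an integer loop counter mixed into float math
   forces a per-iteration int->float conversion, which on the 360/PS3
   PPU compiles to a store to the stack followed by a float-unit load
   of the same address -- a guaranteed load-hit-store. */
float weighted_sum_lhs(const float *v, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += v[i] * (float)i;   /* (float)i goes through memory */
    return sum;
}

/* LHS-free pattern: carry the counter as a float as well, so no
   conversion (and no store queue round trip) is ever needed. */
float weighted_sum_float(const float *v, int n)
{
    float sum = 0.0f, fi = 0.0f;
    for (int i = 0; i < n; ++i, fi += 1.0f)
        sum += v[i] * fi;
    return sum;
}
```

Both functions compute the same result; only the second keeps the whole computation inside the float unit.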
Probably not, it would be an even more asymmetric and strange architecture to program for than Cell is. You would then have to split jobs between 3 PPUs and 6 SPUs. Also, those SPUs would not fit in there physically, so the question is purely speculative.
Oh don't get me wrong - I absolutely agree that a large register file + loops yields great results, I was just doing my quarterly "why IBM's Reduced Instruction SIMDs suck" rant.
It's easy to run statistics showing that fancy SIMD offers relatively small gains over a dumb one in loop-heavy code, and the silicon usage ratio is not favorable to that increase. But there's still lots of code where the above doesn't hold true, and in those cases a smart ISA with decent latencies can make all the difference.
But efficiency comparisons aside, IMO RISIMDs (I should probably copyright this, it sounds much better than "horizontal SIMD") are just not compiler or programmer friendly.
Even hardcore VMX nuts like Archie sorta agree with me on that
You are saying the load data is not forwarded from store queue, but only retrieved from cache?
If so, that's really shocking for me indeed. There is a store data queue right?
There is a store data queue, but if you try to load the same data while it's still in the store queue, it incurs a LHS penalty, ie the load has to wait till the store queue writes out to cache/memory.

The 'funny' thing is that you hit the same problem even when you are not loading the same data, due to aliasing issues introduced by the way the hw checks pointers.
Conroe for the ultimate win and gold? (okay, that wasn't nice for you poor console devs, sorry! Even Barcelona doesn't fix this though, sigh...)
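A speculative illustration of the aliasing false positive described above: assuming the store queue compares only some low address bits rather than full pointers (the `ALIAS_BITS` constant here is a placeholder, not the real hardware width), two unrelated buffers whose addresses agree in those bits look like the same location and stall as if it were a real load-hit-store:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed check width -- purely illustrative, the real hardware
   comparison width is not public. */
#define ALIAS_BITS 12

/* Returns 1 if two pointers would look identical to a store queue
   that compares only the low ALIAS_BITS address bits, i.e. a false
   LHS stall is possible even though the buffers don't overlap. */
int may_false_alias(const void *a, const void *b)
{
    uintptr_t mask = ((uintptr_t)1 << ALIAS_BITS) - 1;
    return ((uintptr_t)a & mask) == ((uintptr_t)b & mask);
}
```

A check like this can flag hot input/output pointers that happen to collide, which you can then fix by offsetting one of the allocations.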
Well I'm sure it's been said by others before but we just can't go into specifics. I can generalize and talk smack about either console all day, but if I go as far as to say "ok this is how I implement X, this is my code, these are my data structures", then I'll get canned. Anything that reveals specific company code implementations is, shall we say, seriously frowned upon. Hence all the vagueness alas ;(
Thanks for the insight, even though there isn't much you can say. You said earlier that you were barely scratching the surface on 1 VMX128. Could you elaborate on this more?
Also, thanks for your examples on how this was coded on the PS3 as well.
For example, not having broadcast modifiers sucks big time; it really doesn't make any sense in an ISA that doesn't support dot product instructions on the vast majority of its implementations.
It's like they say: use SOA, but fill your code with splats everywhere!
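To make the splat tax concrete, here's a portable sketch - the `vec4`, `splat`, and `madd` names are stand-ins, not VMX intrinsics. In SoA code every scalar operand must first be replicated across all lanes with an explicit splat instruction (vspltw on VMX), which a broadcast modifier like `r0.xxxx` would fold into the arithmetic instruction for free:

```c
#include <assert.h>

/* Portable stand-in for a 4-wide SIMD register. */
typedef struct { float v[4]; } vec4;

/* The instruction you pay for on VMX: replicate a scalar to all lanes. */
vec4 splat(float s)
{
    vec4 r = { { s, s, s, s } };
    return r;
}

/* r = a*b + c, lane-wise (what vmaddfp does on VMX). */
vec4 madd(vec4 a, vec4 b, vec4 c)
{
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.v[i] * b.v[i] + c.v[i];
    return r;
}

/* SoA: scale 4 x-coordinates by a scalar and add a scalar bias.
   Both scalars must be splatted first -- two extra instructions (and
   two extra registers) that a broadcast modifier would eliminate. */
vec4 scale_bias(vec4 xs, float scale, float bias)
{
    return madd(xs, splat(scale), splat(bias));
}
```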
What do you mean by "broadcast modifiers"? You're not talking about SOA-AOS conversions are you?

I'm talking about this:

fmad r3, r2, r1, r0.xxxx
There is a store data queue, but if you try to load the same data while it's still in the store queue, it incurs a LHS penalty, ie the load has to wait till the store queue writes out to cache/memory. It would have been a significant performance boost if the data could be forwarded from the store queue to the loader directly. The LHS penalty can take up to 40-50 cycles, not counting any potential cache misses.

Even more unfortunate is that often you can't do anything about it. If you have to convert an integer to a float, you have to go through memory and boom - LHS.

What's worse is that even if you have a 32-bit integer in memory already, the float unit can only load double-as-integer, so the compiler inserts a conversion to a 64-bit integer, puts it on the stack and loads from there with the floating point unit - boom, LHS. The same problems exist for moving to/from VMX, which is a giant PITA to work around.

Interesting. I never knew that.
I'm talking about this:

fmad r3, r2, r1, r0.xxxx

These CPUs don't have this ability? Woah. Seems so simple, but I guess every instruction bit counts. Anyway, you only regularly use this sort of code to load scalar constants, right? It doesn't seem like most code segments would need enough scalars to put a big dent in the register capacity, even with 4x replication.
Still, how often do you really need this kind of conversion in a performance critical loop? The examples I'm thinking of involve a loop counter, but those can be FP. Do devs have a lot of data compressed into integer format that needs to be converted to float in perf. critical code?
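One concrete answer to the question above: compressed vertex and animation data is a common case - positions quantized to 16-bit integers to save memory, expanded to float in the hot loop. On an LHS-prone PPU each cast is a trip through the stack; on the SPU the convert happens in registers. The function name and the 1/32767 scale below are illustrative, not from any shipped codebase:

```c
#include <assert.h>
#include <stdint.h>

/* Decompress quantized 16-bit positions to floats in [-1, 1].
   The (float) cast on each element is exactly the int->float
   conversion being discussed: cheap on SPU, an LHS trip through
   memory on the 360/PS3 PPU. */
void decompress_positions(const int16_t *in, float *out, int n)
{
    const float scale = 1.0f / 32767.0f;  /* map [-32767, 32767] -> [-1, 1] */
    for (int i = 0; i < n; ++i)
        out[i] = (float)in[i] * scale;
}
```

In practice this is why such decompression (like skinning and other streaming transforms) is a natural fit for SPU jobs.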
If so, this could be a big advantage for the SPU.
Would it have been possible to employ a local store (LS) with the VMX128 units, similar to what Sony did with the SPEs?
From nAo's comments, this seems to be a big factor in what keeps the SPEs from having to sit idle. It's also good to hear from a multiplatform dev that Cell isn't harder to code for in some areas.
@ Barbarian: What do you suggest the engineers could have done to eliminate or alleviate this LHS penalty?