DeanoC said:
SPE have a large register context because they need lots of loop unrolling to get decent speed (same for XeCPU, hence VMX128). If the compiler halves the register file it will seriously limit the extent to which the compiler can hide the in-orderness of it and the FLOPs rating will be reduced.
If you have some time, DeanoC, maybe you could take a look at this. I'm just trying to understand why you said it a little better.
I would think Xenon's VMX units would have enough register space to handle two threads, and I believe that is documented to be the case. However, it's doubtful to me that the VMX units would have 128 128-bit registers like an SPE does. Wouldn't that make the SPE a bit more capable of handling the XLC scheme in question? I mean, even if the register space in an SPE were cut in half, I would imagine there would still be more registers per thread than what resides in the VMX units.
Also, I thought the VSUs provided for OoOE at their level. So why would the VMX128 in a Xenon core need a large register space to hide the chip's 'in-orderness'? I thought the register space was largely due to the VMX units being made capable of serving two threads simultaneously, so as to avoid stalls there. No one has yet explained just what the PPE's VMX unit is like or how it operates. (Look Ma! I'm fishin'!) Unless the VSUs do not provide OoOE at their level, I'm guessing you meant the 'in-orderness' was being hidden at a higher level.
I get lost, though, because I don't know why you mentioned Xenon's VMX unit and not the general-purpose registers in a core.
As for what the XLC scheme is trying to pull off, I was under the impression that it was not meant to maximize throughput so much as to deal with being memory bound due to a large number of perhaps unpredictable DMA requests. So if the FLOPs suffer for doing this, I'd imagine it's relative to a more ideal situation than this one. What I mean is: in a situation where DMA requests are stalling you, this scheme seems to be a way of getting around those stalls, so if you can perform FLOPs in between DMA requests, FLOPs performance should improve (relatively). It would still be less than in an ideal situation where DMA requests aren't hanging you out to dry.
I'm trying to understand what it is you really said. Do you mean that having to unroll loops would break this 'trick'? Or are you saying this 'trick' should not be a first option, as it would adversely affect performance due to loop unrolling? The latter makes sense to me (well, maybe not completely) because this 'trick' seems to be for a specific case. The former I don't understand on my own, so I'm asking for help. I'm also curious whether the scheme has any value anywhere else.
I'm lost again (just point and laugh... everyone else does) on loop unrolling itself... wouldn't this be done at compile time? So then wouldn't unrolled loops affect the size of your code, and thus how much space is consumed in an LS or in cache, instead of the space in the core's general-purpose registers?
Lost I am. Saving I will need. Hides the truth the dark side does... clouds my judgment... or is that just my pills... nope... I probably don't know what I'm talking about.
---------------------------------------------------------------------------------
Separate questions for anybody:
Why would an SPE's iop performance be less than its flops performance? (Is version's Gints number wrong in post #70... or is this again a special-case kind of thing, like Cell being able to handle 64 threads?)
Why can't a flop be exchanged for an iop? (3D games use flops more often than iops anyway, no?)