I'm talking about this:
fmad r3, r2, r1, r0.xxxx
That gives you the ability to splat a scalar without explicitly using an additional instruction. It would save register space (no need to keep temporary splatted copies of a vector's subcomponents) and it would save the extra splat instructions -> big win (but it would cost chip area and ISA 'area').
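To make the comparison concrete, here is a minimal C sketch of the two forms. It only models the arithmetic, not any real ISA: the names vec4, splat, fmad and fmad_lane are made up for illustration, and which operand the swizzle applies to (here the multiplier) is an assumption.

```c
#include <assert.h>

/* A 4-wide float vector, standing in for one SIMD register. */
typedef struct { float f[4]; } vec4;

/* Explicit splat: copy one lane into all four lanes.
   On real hardware this costs an instruction and a temporary register. */
static vec4 splat(vec4 v, int lane) {
    assert(lane >= 0 && lane < 4);
    vec4 r = {{ v.f[lane], v.f[lane], v.f[lane], v.f[lane] }};
    return r;
}

/* Plain multiply-add, lane by lane: r = a * b + c. */
static vec4 fmad(vec4 a, vec4 b, vec4 c) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[i] + c.f[i];
    return r;
}

/* Hypothetical fused form: multiply by a single lane of b,
   with no splatted temporary in between. */
static vec4 fmad_lane(vec4 a, vec4 b, int lane, vec4 c) {
    assert(lane >= 0 && lane < 4);
    vec4 r;
    for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[lane] + c.f[i];
    return r;
}
```

fmad(a, splat(b, 0), c) and fmad_lane(a, b, 0, c) produce the same result; the second just does it in one step and never needs the splatted copy of b.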
Marco
I assume you mean multiplying, adding (etc.) a vector (or three) by a sub-component of another vector (sorry, I haven't used assembly).
That would be a complex instruction from a hardware point of view. Firstly, the instruction may not fit in the standard instruction format, which would screw everything up.
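As a rough illustration of the encoding squeeze, here is a small C sketch that counts the bits. The numbers you feed it are assumptions for the sake of argument (for example a 32-bit instruction word, 128 registers and 4-lane vectors), not figures taken from any ISA manual, and both function names are made up.

```c
/* Bits needed to name one of `count` items, i.e. ceil(log2(count)). */
static int bits_for(unsigned count) {
    int bits = 0;
    while (bits < 31 && (1u << bits) < count) ++bits;
    return bits;
}

/* Bits left over for the opcode once the operand fields are taken out.
   A negative result means the instruction does not fit in the word. */
static int opcode_bits_left(int word_bits, unsigned num_regs,
                            int reg_operands, unsigned lanes,
                            int with_lane_select) {
    int used = reg_operands * bits_for(num_regs);
    if (with_lane_select) used += bits_for(lanes); /* which lane to splat */
    return word_bits - used;
}
```

With those assumed numbers, a four-operand instruction already spends 28 of its 32 bits on register fields, leaving 4 for the opcode. Adding a 2-bit lane selector cuts that to 2, and a full per-lane swizzle would need even more.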
A simpler version would be to multiply one or two vectors by a sub-part of another. That would fit, but it would still complicate things. It would involve having a splat hardware unit in front of the multiplier, which would potentially increase latencies for everything behind it or alternatively require the clock to be lowered, so you get one instruction faster but it slows everything else down.
The whole RISC philosophy is to implement the most common instructions in the smallest, simplest hardware possible. It's cheaper to make and easier to clock higher. In the case of the XCPU and Cell, they've used the small core size to increase the number of cores. The downside of course is that the code size increases...