Gubbi said:
Do you have a concrete example? Better how?
Inquiring minds want to know.
Any details and I'd have to kill you
But taking a general tack, it's fairly easy to see how we can improve a SIMD unit for the modern CPU landscape.
Problem A: RAM speeds
A SIMD unit can use a lot of RAM (a 4x4 float matrix takes 64 bytes, i.e. 512 bits). RISC memory units are too slow (Load/Store as separate instructions); what we want is old-fashioned CISC direct to/from memory. Of course, what we actually want is a small pool of very fast RAM. Let's call that the "register pool", which saves any embarrassment from RISC fans.
So the solution to problem A is to have so many registers that they use the same amount of memory as 8-bit computers used to have. The Cell SPU has been mentioned as having 128 128-bit registers (2KB, or 16Kbit), which sounds like a good figure.
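The sizes are easy to get wrong by mixing up bits and bytes, so here is the arithmetic written out as a quick sketch. The function names are mine, purely for illustration, and I'm assuming 32-bit floats.

```python
# Back-of-envelope register pool sizes. All names are illustrative only.
FLOAT_BYTES = 4  # assumption: 32-bit single-precision floats


def matrix_bytes(rows, cols, elem_bytes=FLOAT_BYTES):
    """Storage for a dense rows x cols matrix, in bytes."""
    return rows * cols * elem_bytes


def register_file_bytes(num_regs, reg_bits):
    """Total size of a register pool, in bytes."""
    return num_regs * reg_bits // 8


def matrices_per_register_file(num_regs, reg_bits, rows=4, cols=4):
    """How many whole matrices fit in the register pool at once."""
    return register_file_bytes(num_regs, reg_bits) // matrix_bytes(rows, cols)


if __name__ == "__main__":
    print(matrix_bytes(4, 4))                    # 64 bytes (512 bits)
    print(register_file_bytes(128, 128))         # 2048 bytes (16 Kbit)
    print(matrices_per_register_file(128, 128))  # 32
```

So a pool of 128 128-bit registers holds 32 4x4 float matrices without touching memory, which is the whole point.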
Problem B: RAM speeds
O.K., even with lots of registers I have to read/write stuff sometimes. If I'm going to, it would be good to compress everything, say using a decoder like the one fitted to every vertex shader (including the PS2's) to unpack/pack data.
So the solution to problem B is to have dedicated instructions/units for packing/unpacking in the formats most likely to be encountered by GPUs and CPUs.
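To give a feel for what such a pack/unpack unit does, here is a software model of one possible format. The format is my assumption (signed 16-bit normalised values, the sort of thing used for normals); I'm not claiming any particular hardware uses exactly this, and the function names are made up.

```python
import struct

SNORM16_MAX = 32767  # largest magnitude of a signed 16-bit normalised value


def pack_snorm16(values):
    """Quantise floats in [-1, 1] to signed 16-bit ints, little-endian."""
    quantised = [
        max(-SNORM16_MAX, min(SNORM16_MAX, round(v * SNORM16_MAX)))
        for v in values
    ]
    return struct.pack("<%dh" % len(quantised), *quantised)


def unpack_snorm16(blob):
    """Expand packed signed 16-bit values back to floats in [-1, 1]."""
    count = len(blob) // 2
    return [q / SNORM16_MAX for q in struct.unpack("<%dh" % count, blob)]


if __name__ == "__main__":
    normal = [0.0, 1.0, -1.0, 0.5]
    blob = pack_snorm16(normal)
    print(len(blob))  # 8 bytes, versus 16 bytes as 32-bit floats
    print(unpack_snorm16(blob))
```

Half the memory traffic for a small, bounded loss of precision. In hardware the unpack happens on the way into the registers, so the ALUs only ever see floats.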
Problem C: RAM speeds
Still, sometimes we are going to stall due to memory latency, so if that happens let's make sure we have some thread contexts we can switch to, to see if they could be doing something useful.
So the solution to problem C is to have multiple thread contexts per core. If one thread stalls, switch to another and do some useful work.
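A toy model shows why this helps. It is a deliberately crude sketch under my own assumptions: every op takes one cycle to issue, a memory op then blocks its thread for a fixed latency, and switching contexts is free. Real cores are messier.

```python
def run(threads, latency):
    """Cycles needed for one core to finish all thread contexts.

    Each thread is a string of ops: 'c' is one cycle of compute, 'm' is a
    memory access that issues in one cycle and then blocks that thread
    for `latency` further cycles. The core issues at most one op per
    cycle, taken from the first context that is not blocked.
    """
    pc = [0] * len(threads)        # next op index for each context
    ready_at = [0] * len(threads)  # first cycle each context may issue
    cycle = 0
    while any(pc[i] < len(t) for i, t in enumerate(threads)):
        for i, t in enumerate(threads):
            if pc[i] < len(t) and ready_at[i] <= cycle:
                if t[pc[i]] == "m":
                    ready_at[i] = cycle + 1 + latency
                pc[i] += 1
                break  # one issue slot per cycle
        cycle += 1
    # A trailing memory op still has to complete.
    return max([cycle] + ready_at)


if __name__ == "__main__":
    work = "mcccc"  # one load, then four cycles of maths
    one_after_another = run([work], 4) + run([work], 4)
    two_contexts = run([work, work], 4)
    print(one_after_another, two_contexts)  # 18 13
```

With two contexts, the second thread's load is issued while the first is waiting, so part of the stall is hidden. More contexts or longer compute runs hide more of it.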
Problem D: We need to pretend that FLOPs figures are really important.
SIMD ALUs are cheap, so let's have a few. It makes the paper figures look good, even though the real problems are A, B and C.
So the solution to problem D is to have N SIMD cores.
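For what it's worth, this is roughly how the paper figure gets made, and why it says little when memory is the bottleneck. Every number below is invented for illustration and describes no real chip; the function names are mine too.

```python
def peak_flops(cores, lanes, flops_per_lane_per_cycle, clock_hz):
    """The marketing number: every lane of every core busy every cycle."""
    return cores * lanes * flops_per_lane_per_cycle * clock_hz


def attainable_flops(peak, bytes_per_sec, flops_per_byte):
    """Compute cannot run faster than memory can feed it."""
    return min(peak, bytes_per_sec * flops_per_byte)


if __name__ == "__main__":
    # Hypothetical part: 8 cores, 4 lanes, multiply-add (2 flops), 1 GHz.
    peak = peak_flops(8, 4, 2, 1e9)
    # Hypothetical workload: 10 GB/s of memory, 0.5 flops per byte moved.
    real = attainable_flops(peak, 10e9, 0.5)
    print(peak / 1e9, real / 1e9)  # 64.0 5.0 (GFLOPS)
```

Adding cores raises the first number and leaves the second one alone, unless A, B and C are dealt with as well.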
Note: I'm being overly sarcastic ;-) There are lots of good reasons why having multiple cores is a good thing. It's just that finding more than about 2 non-graphical, math-intensive tasks (physics and sound are the obvious candidates) gets real hard quickly.
A good SIMD unit will address at least 2 of these; a really good one will address all 4... The last two are really CPU architecture issues, but the SIMD units have to be integrated into the thread architecture to get good performance.