Because you are wrong about that.

Kryton said:
Perhaps, but it is not their speed that matters, it is the loading of the SSE values that takes time. Why these overwrite the FPU stack entries I will never know, many are the mysteries of x86 (and why it still exists).

SSE added 8 new 128-bit registers, which required operating system support so that the registers get saved on a context switch. SSE can be used at the same time as x87 (or MMX/3DNow) code with no switching penalty, because the registers are independent (you don't switch modes when using SSE instructions). In x86_64 there are 16 128-bit registers.
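
To make the "no switching penalty" point concrete, here is a rough sketch in C that interleaves SSE and x87 work with no state-clearing instruction in between. The helper names (sse_add4, x87_scale, mixed) are made up for illustration. It also assumes an x86 compiler such as GCC or Clang where long double arithmetic is done on the x87 stack; that is typical on Linux but not guaranteed everywhere.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86/x86-64 only */

/* Hypothetical helper: adds two 4-float vectors using the XMM registers. */
void sse_add4(const float *a, const float *b, float *out)
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);            /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* 4 additions in one instruction */
}

/* Hypothetical helper: scalar math in long double, which GCC/Clang on x86
   usually compile to x87 instructions (an assumption about the compiler). */
long double x87_scale(long double x)
{
    return x * 0.5L;
}

/* x87 work, then SSE work, then x87 again. Nothing like EMMS is needed
   between the steps because the XMM registers are separate from the
   x87 register stack. */
float mixed(const float *a, const float *b)
{
    float sum[4];
    long double acc = x87_scale(3.0L);      /* x87 */
    sse_add4(a, b, sum);                    /* SSE */
    acc += x87_scale((long double)sum[0]);  /* x87 again */
    return (float)acc;
}
```

Calling mixed() with a = {1, 2, 3, 4} and b = {1, 1, 1, 1} does half of 3 on the x87 side, a vector add on the SSE side, then half of the first lane back on the x87 side.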
The 8 64-bit MMX/3DNow registers were aliased onto the floating point stack, so no changes to operating system code were required. It was up to the program to ensure that the state of the FPU is correct, and the instructions to change and save that state are really expensive.
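
For contrast, a minimal sketch of what "up to the program" means for MMX. The function names are hypothetical, and the long double part again assumes a compiler that uses x87 for that type. The _mm_empty() intrinsic corresponds to the EMMS instruction, which marks the aliased x87 registers as free again before any floating point code runs. Note that some newer compilers on x86_64 may implement the MMX intrinsics with SSE registers internally, so treat this as an illustration of the programming rule rather than of the exact code generated.

```c
#include <mmintrin.h>   /* MMX intrinsics; x86/x86-64 only */

/* Hypothetical helper: adds four 16-bit lanes in an MMX register and
   returns lane 0 of the result. */
int mmx_add_lane0(short a, short b)
{
    __m64 va = _mm_set1_pi16(a);          /* a in all four lanes */
    __m64 vb = _mm_set1_pi16(b);          /* b in all four lanes */
    __m64 sum = _mm_add_pi16(va, vb);     /* lane-wise 16-bit add */
    int low32 = _mm_cvtsi64_si32(sum);    /* low 32 bits: lanes 0 and 1 */
    _mm_empty();                          /* EMMS: release the aliased x87 registers */
    return (short)(low32 & 0xFFFF);       /* keep lane 0 only */
}

/* MMX work followed by floating point work. This is only safe because
   mmx_add_lane0() ran _mm_empty() before returning. */
long double mmx_then_x87(short a, short b)
{
    int lane = mmx_add_lane0(a, b);
    return (long double)lane * 0.5L;
}
```

Leaving out the _mm_empty() call is the classic MMX bug: the x87 tag word still says the registers are in use, and the next floating point instruction can produce garbage.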