arjan de lumens
Veteran
The scalar SSE instructions tend to have large instruction encodings, leading to larger executables, and they are sometimes actually slower than their x87 counterparts. In particular, the P4 can schedule 1 FADD per clock cycle, but only 1 ADDSS every 2 clock cycles. I also suspect (but don't know for certain; I don't have one here for testing) that the AthlonXP actually decodes ADDSS as a vector instruction, which adds a penalty of nearly 1 cycle compared to FADD. The assumption that everyone has an SSE-capable processor is still not considered 100% safe either, even now.
psurge said:
arjan, Xmas:
I totally forgot about this x87 stuff. I would like to retract my earlier hasty comment. A serious question though: I had thought that SSE (1, 2, 3?) included scalar fp instructions (which encode the operand precision in the instruction). Are those not there yet from a compiler/HW POV?
As such, scalar SSE does not normally offer any real benefit over x87 for ordinary use (except perhaps on AMD64, where scalar SSE has access to 16 registers and is not hampered by silly scheduling limitations).
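For anyone who wants to look at the two code paths themselves, here is a rough sketch in C. It is only an illustration under some assumptions: the function names (`add_plain`, `add_sse_scalar`, `cpu_has_sse`) are made up for this post, the intrinsics come from `<xmmintrin.h>`, and the runtime check relies on the GCC/Clang builtin `__builtin_cpu_supports`, so other compilers just get the fallback. With 32-bit GCC, `add_plain` should compile to x87 code with `-mfpmath=387` and to ADDSS with `-msse -mfpmath=sse`, which makes it easy to compare the generated code size.

```c
#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_set_ss, _mm_add_ss, _mm_cvtss_f32 */
#endif

/* Plain C add: the compiler chooses x87 (FADD) or scalar SSE (ADDSS)
 * depending on the target and on flags such as -mfpmath. */
float add_plain(float a, float b)
{
    return a + b;
}

/* Explicit scalar SSE add via intrinsics when SSE is enabled at compile
 * time; otherwise falls back to the plain C expression. */
float add_sse_scalar(float a, float b)
{
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_add_ss(_mm_set_ss(a), _mm_set_ss(b)));
#else
    return a + b;
#endif
}

/* Runtime check for SSE support (hypothetical helper name).
 * Returns 1 if the CPU reports SSE, 0 otherwise or if the check
 * is not available with this compiler/target. */
int cpu_has_sse(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse") ? 1 : 0;
#else
    return 0;
#endif
}
```

Note that the runtime check only helps if the SSE path lives in a separately compiled file; if the whole program is built with `-msse`, the compiler is free to emit SSE instructions anywhere, including before the check runs.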