The biggest problem with the SSE/SSE2 units in the P4 is the throughput of scalar operations. Vector and scalar operations both have a throughput of one instruction every 2 cycles, so a scalar operation ties up the unit as long as a packed one while doing only a fraction of the work. This makes SSE/SSE2 slower than x87 for non-vectorized FP code in some cases.
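To put rough numbers on that, here is a back-of-the-envelope sketch in C. It only models issue throughput (instructions times cycles per instruction) and ignores latency, dependencies and memory. The function name `est_cycles` is made up for illustration, and the 1-cycle x87 figure used in the comments is my assumption for a simple FP add, not a measured number.

```c
#include <stdio.h>

/* Rough issue-limited estimate: every instruction occupies the FP unit for
 * `throughput` cycles and processes `lanes` elements at once.
 * Latency, dependency chains and memory traffic are deliberately ignored. */
unsigned long est_cycles(unsigned long n_elems, unsigned lanes, unsigned throughput)
{
    unsigned long instrs = (n_elems + lanes - 1) / lanes; /* round up */
    return instrs * throughput;
}

/* Prints estimates for 1000 single-precision adds under the assumptions above. */
void print_estimates(void)
{
    /* scalar SSE: 1 element per instruction, 2 cycles each (figure from the post) */
    printf("scalar SSE: %lu cycles\n", est_cycles(1000, 1, 2));
    /* packed SSE: 4 floats per instruction, 2 cycles each (figure from the post) */
    printf("packed SSE: %lu cycles\n", est_cycles(1000, 4, 2));
    /* x87: 1 element per instruction, ASSUMED 1 cycle each */
    printf("x87 (assumed): %lu cycles\n", est_cycles(1000, 1, 1));
}
```

Under these assumptions the packed version gets four results for the price of one scalar instruction, while scalar SSE ends up with twice the issue cost of x87 for the same work. Real code will differ, but it shows why the scalar throughput is the sore spot.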
It looked like Intel was going to replace x87 with SSE/SSE2, but Intel is introducing a new x87 instruction (FISTTP) in PNI, so that may not be the case. The new instruction is actually an important one. Why it wasn't introduced earlier is beyond me.
I hope Intel has fixed the throughput of scalar SSE/SSE2 operations in Prescott. That would make using SSE2 for ordinary scalar FP code worthwhile.