There are a few nasty problems with any sort of warp reformation scheme. The first is bank conflicts in the register file. With normal SIMD this is easy to solve - you probably don't even need banking, just really wide registers. Once you reform a warp, things get trickier: now a warp can contain lanes from anywhere in the SIMD, so you will almost certainly have at least one bank conflict. A bank conflict means that every instruction that touches the register file (pretty much all of them...) stalls for extra cycles to resolve it, so you get at best 50% utilization (a 2-way conflict somewhere in your reformed warp) and often worse. You could mitigate this with more banks - say, if you can mix lanes from 4 warps, then 4 times as many banks, one set per warp. However, those 4 warps are then stuck in lockstep with each other, since in order to reform they must be at the same instruction. That hurts the processor's ability to hide latency as well, since you have fewer options to switch to when you hit a stall - effectively, you now have warps 4x larger sharing the same size pool of registers. Another issue is that you now have to send one register address per lane to the register file instead of a single address for the entire SIMD, which is a large increase in operand-delivery bandwidth (assume ~16 bits to address a register - might actually be more like 12 bits... but still!).
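To make the bank conflict point concrete, here's a toy model. The banking scheme is an assumption on my part (each lane's registers living in bank `lane % NUM_BANKS`, with a k-way conflict costing k cycles), not how any particular GPU does it, but it shows why a reformed warp almost always serializes somewhere:

```python
# Toy model of register-file bank conflicts in a reformed warp.
# Assumption (illustrative, not from any real design): lane i's registers
# live in bank (i % NUM_BANKS), and a bank can serve one access per cycle,
# so a k-way conflict on a bank costs k cycles.
from collections import Counter

NUM_BANKS = 8  # hypothetical banking factor

def access_cycles(lanes):
    """Cycles to read one operand for every lane in a warp."""
    per_bank = Counter(lane % NUM_BANKS for lane in lanes)
    return max(per_bank.values())  # worst bank serializes the whole access

# A well-ordered warp: one lane per bank -> single-cycle access.
ordered = list(range(8))
# A reformed warp mixing lanes from four original warps:
reformed = [0, 8, 16, 24, 1, 9, 2, 3]  # lanes 0, 8, 16, 24 all hit bank 0

print(access_cycles(ordered))   # -> 1
print(access_cycles(reformed))  # -> 4 (a 4-way conflict on bank 0)
```

One 4-way conflict anywhere in the warp is enough to quarter the register file's effective throughput for that instruction.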
Another problem is memory coalescence. Once a warp is split and reformed, your memory transactions are much less likely to be coalesced: instead of nice contiguous accesses from consecutive lanes, you have a scattering. This strains memory bandwidth, and cache capacity too - if there would have been an opportunity for a coalesced transaction had the warps stayed well ordered, then for portions of the fetched cache line not to be wasted, that line has to sit in the L1 or perhaps L2 cache until the rest of the lanes get around to that instruction. Since many problems are memory bound rather than ALU bound, you'd actually lose performance by reforming warps.
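A quick sketch of the coalescing cost, counting how many cache lines a warp's loads touch. The line and element sizes are illustrative assumptions (128-byte lines, 4-byte elements), not tied to any specific hardware:

```python
# Toy model of memory coalescing: how many cache lines does one warp's
# load touch? Assumes 128-byte lines and 4-byte elements (illustrative).
LINE_BYTES = 128
ELEM_BYTES = 4

def lines_touched(element_indices):
    return len({(i * ELEM_BYTES) // LINE_BYTES for i in element_indices})

# Contiguous accesses from consecutive lanes: one transaction.
print(lines_touched(range(32)))          # -> 1
# A reformed warp whose lanes came from 32 different original warps,
# each reading its own warp's slice of the array:
scattered = [i * 32 for i in range(32)]
print(lines_touched(scattered))          # -> 32
```

Same 32 loads, up to 32x the memory traffic - and every one of those partially-used lines has to linger in cache if the leftover words are ever going to be consumed.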
Yet another problem is that you need to perform some sort of reduction over the lane predicates every time you branch, so that you can actually reform the warp. The hardware for this could easily be costly, mostly in terms of the latency it adds to resolving the branch. Also, in order to mitigate the second problem, it would be important to restore warps to a well-ordered state as early as possible, which I suspect is a rather nontrivial scheduling problem.
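The regrouping step above can be sketched in software (this is just the logical operation - collect each lane's predicate, then repack taken and not-taken lanes into compact groups - not a proposal for how the hardware would implement it):

```python
# Software sketch of per-branch warp reformation: partition a warp's lanes
# by their branch predicate and repack each side into a compact group.
# In hardware this ballot-and-compact step sits on the branch resolution
# path, which is where the latency cost comes from.

def reform(lanes, predicate):
    taken     = [lane for lane in lanes if predicate(lane)]
    not_taken = [lane for lane in lanes if not predicate(lane)]
    return taken, not_taken

lanes = list(range(8))
taken, not_taken = reform(lanes, lambda lane: lane % 3 == 0)
print(taken)      # -> [0, 3, 6]
print(not_taken)  # -> [1, 2, 4, 5, 7]
```

Note that both output groups are already non-contiguous, which is exactly what feeds the bank conflict and coalescing problems above.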
---
Vector+Scalar is an interesting idea. It still has the register banking problem: any time you try to schedule a warp that happens to use the lane the scalar processor is working on, you get a bank conflict. Presumably the SIMD would have priority in those cases, and/or the scheduler would try to pick warps that don't conflict. It would also put pressure on your instruction cache. The bigger problem is that the SIMDs are wide enough that executing even a small clause serially takes quite a few cycles, so it's quite likely the warp would stall waiting on the scalar unit to finish. It would also be less effective on if-then control flow, as opposed to if-then-else, since in order to continue execution the then clause has to finish first. Hyperthreading could go a long way toward mitigating this, of course, though it still gives you a whole new type of long-duration stall to deal with.
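Back-of-envelope arithmetic for the serialization cost, with made-up but plausible numbers (3-instruction clause, 5 divergent lanes):

```python
# Illustrative cost of running a divergent clause on a scalar unit versus
# executing it masked on the SIMD. All numbers here are assumptions for
# the sake of the comparison, not measurements of any real hardware.

def clause_cycles(active_lanes, instr_count, on_scalar_unit):
    if on_scalar_unit:
        # The scalar unit replays the clause once per active lane, serially.
        return active_lanes * instr_count
    # The SIMD executes all lanes at once, inactive lanes simply masked off.
    return instr_count

print(clause_cycles(5, 3, True))   # -> 15 cycles on the scalar unit
print(clause_cycles(5, 3, False))  # -> 3 cycles masked on the SIMD
```

So even a tiny clause with a handful of divergent lanes can leave the warp parked for several times longer than just eating the mask, which is why the warp would so often end up waiting on the scalar unit.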
The question is, as always, whether the extra hardware needed is worth the necessary power budget. Masked lanes don't take much power, and for most of the market, you're power bound rather than area bound. This is becoming more the case with every new process - area is scaling much better than power. That means that saving power per core lets you just toss in some more cores.