I'm still horrified that such a thread exists here; be ashamed, epic, be very ashamed.
So, in order to improve the discussion's level a bit (that is, to make it B3D-quality, and not that of other forums I won't name), I propose we take it a little deeper. First of all, it has to be considered that no matter how many CPU cores you've got, and no matter their level of SSE-like performance, the assumption is that each core still works on serial data. This implies very expensive scheduling and branch prediction machinery, in addition to expensive branch misses. The advantage, of course, is that just about any kind of code runs at a "reliable" speed.
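Just to make the branch-miss point concrete, here's a rough C sketch; the cycle figures in the comment and the branchless trick are purely illustrative, nothing vendor-specific:

[code]
/* Rough illustration of why unpredictable branches hurt a serial core.
   With random data the branch below is mispredicted roughly half the
   time, and every miss costs a pipeline flush (ballpark 10-20 cycles
   on a modern CPU). The branchless version trades that for a couple of
   extra ALU ops, which is usually the better deal. */
int sum_over_threshold(const int *v, int n, int threshold)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        if (v[i] > threshold)      /* data-dependent, hard to predict */
            sum += v[i];
    }
    return sum;
}

int sum_over_threshold_branchless(const int *v, int n, int threshold)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        int mask = -(v[i] > threshold);   /* all-ones or zero */
        sum += v[i] & mask;               /* no branch to mispredict */
    }
    return sum;
}
[/code]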
Now, consider a GPU. For the sake of argument, let us take the R600 and G80, so the first is... Err, I mean, let's take the G70 instead. The VS is a MIMD processor with relatively cheap branching and no "branch miss"-like behaviour. It can hide some latency by switching threads, but it remains quite limited in that respect, so VTF performance is less than optimal. Overall, though, that's likely to improve, because one of the ways to improve branching beyond a certain point is also to increase latency tolerance.
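To make the "hide latency by switching threads" idea concrete, here's a toy C model of a small MIMD unit; the thread count, fetch latency and the one-fetch-every-8-ops workload are all made up, only the shape of the mechanism matters:

[code]
/* Toy model of a small MIMD unit hiding vertex-texture-fetch latency
   by switching between a handful of vertex threads. All numbers are
   invented for illustration. */
#include <stdio.h>

#define NUM_THREADS   4     /* few threads in flight -> limited hiding */
#define FETCH_LATENCY 100   /* cycles until a VTF result comes back    */

typedef struct {
    int pc;           /* next instruction for this vertex          */
    int stall_until;  /* cycle at which its outstanding fetch lands */
} VertexThread;

int main(void)
{
    VertexThread t[NUM_THREADS] = {0};
    int busy = 0, idle = 0;

    for (int cycle = 0; cycle < 1000; ++cycle) {
        int issued = 0;
        for (int i = 0; i < NUM_THREADS; ++i) {
            if (t[i].stall_until <= cycle) {   /* this thread is ready */
                t[i].pc++;
                if (t[i].pc % 8 == 0)          /* every 8th op is a VTF */
                    t[i].stall_until = cycle + FETCH_LATENCY;
                issued = 1;
                break;                         /* one issue per cycle   */
            }
        }
        issued ? busy++ : idle++;
    }
    printf("utilisation: %d busy / %d idle cycles\n", busy, idle);
    return 0;
}
[/code]

With only a handful of threads, most cycles go idle the moment a few fetches are outstanding, which is exactly the "less than optimal" VTF behaviour I'm talking about.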
On the other hand, the G70 PS is a SIMD "monster" that tries to minimize scheduling overhead and maximize latency tolerance, at least in cases of low register counts (possibly in the hope that shaders with higher register requirements might have more inherent parallelism to hide the latency anyway). Branching performance just isn't there at all, because the batches are so huge that the required coherence is nowhere to be found, and it uses an all-or-nothing scheme. It's questionable whether you really need it for more than optimizations, though.
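For anyone wondering what I mean by "all-or-nothing", here's a crude C model of how a SIMD pixel batch handles a branch; the batch size and the fake two-path shader are invented, they're just there to show the mechanism:

[code]
/* Crude model of all-or-nothing branching on a SIMD pixel batch.
   The batch skips a path only if *every* pixel in it agrees;
   otherwise both sides run for everybody, with the losers masked out. */
#define BATCH_SIZE 1024   /* made-up figure, but in the right ballpark */

void shade_batch(const float *input, float *output)
{
    int take_branch[BATCH_SIZE];
    int all_taken = 1, none_taken = 1;

    for (int i = 0; i < BATCH_SIZE; ++i) {
        take_branch[i] = input[i] > 0.5f;
        all_taken  &=  take_branch[i];
        none_taken &= !take_branch[i];
    }

    if (all_taken) {                  /* whole batch agrees: cheap path  */
        for (int i = 0; i < BATCH_SIZE; ++i) output[i] = input[i] * 2.0f;
    } else if (none_taken) {
        for (int i = 0; i < BATCH_SIZE; ++i) output[i] = input[i] * 0.5f;
    } else {                          /* divergent: execute both sides   */
        for (int i = 0; i < BATCH_SIZE; ++i) {
            float a = input[i] * 2.0f;   /* "taken" path, done for all     */
            float b = input[i] * 0.5f;   /* "not taken" path, done for all */
            output[i] = take_branch[i] ? a : b;
        }
    }
}
[/code]

With a batch in the hundreds or thousands of pixels, full agreement almost never happens, so in practice you pay for both sides nearly every time.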
So, the PS is going to be nearly unbeatable by any CPU, ever. The "FLOPs/mm²" are downright insane (CELL is a weakling in comparison) and the latency hiding is downright incredible compared to what you could get on a modern CPU, because as I said in the beginning, CPUs assume things to be serial. If you wanted to have hundreds of "threads" in flight to hide latency like that on a CPU, you'd need at least 50x as many registers as are currently available. Other schemes such as a creative use of the L1 or L2 might seem attractive, but they don't quite cut it either, imo.
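To put a (very rough) number on that 50x claim, here's the back-of-the-envelope arithmetic I have in mind; every figure in it is an assumption for illustration, not a spec from any actual part:

[code]
/* Back-of-the-envelope for how much register file a CPU would need to
   hide texture-like latency the way a GPU does. All numbers assumed. */
#include <stdio.h>

int main(void)
{
    int latency_cycles   = 200;  /* assumed memory/texture latency       */
    int ops_per_fetch    = 4;    /* assumed ALU ops between fetches      */
    int live_regs_thread = 8;    /* assumed live registers per "thread"  */

    int threads_needed = latency_cycles / ops_per_fetch;    /* ~50  */
    int regs_needed    = threads_needed * live_regs_thread; /* ~400 */

    printf("~%d threads in flight, ~%d registers\n",
           threads_needed, regs_needed);
    /* ~400 is about 50x the handful of architectural registers an x86
       core actually exposes, hence the figure above */
    return 0;
}
[/code]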
So if we put the PS out of the equation, what's left? The VS, the GS and the fixed-function stuff. I think we can safely conclude, first of all, that there's no way in hell you can get a perf/mm² or perf/watt ratio within even 5-10% of that of a GPU for things like Triangle Setup or Rasterization on a CPU. It's about as extreme a case as you can imagine, and it's no accident that the first GPUs accelerated that, even before proper bilinear!
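For anyone who's never looked at what the fixed-function bits actually compute, here's the textbook half-space rasteriser in a few lines of C (integer coordinates, no fill rules, no sub-pixel precision, purely illustrative). The point isn't that it's hard; it's that every pixel's test is independent, which is exactly the kind of thing dedicated hardware does whole blocks of per clock while a serial core grinds through them one by one:

[code]
/* Textbook half-space triangle rasterisation, just to show the kind of
   work the fixed-function hardware is doing. Assumes consistent
   (counter-clockwise) winding. */
typedef struct { int x, y; } Pt;

static int edge(Pt a, Pt b, Pt p)          /* twice the signed area */
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

void rasterise(Pt v0, Pt v1, Pt v2, void (*plot)(int x, int y))
{
    int minx = v0.x, maxx = v0.x, miny = v0.y, maxy = v0.y;
    if (v1.x < minx) minx = v1.x;  if (v1.x > maxx) maxx = v1.x;
    if (v2.x < minx) minx = v2.x;  if (v2.x > maxx) maxx = v2.x;
    if (v1.y < miny) miny = v1.y;  if (v1.y > maxy) maxy = v1.y;
    if (v2.y < miny) miny = v2.y;  if (v2.y > maxy) maxy = v2.y;

    for (int y = miny; y <= maxy; ++y)
        for (int x = minx; x <= maxx; ++x) {
            Pt p = { x, y };
            /* every one of these tests is independent: hardware does
               blocks of them per clock, a CPU does them one at a time */
            if (edge(v0, v1, p) >= 0 &&
                edge(v1, v2, p) >= 0 &&
                edge(v2, v0, p) >= 0)
                plot(x, y);
        }
}
[/code]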
The final question, then, is that of the VS and GS units. As Bob and others nicely explained in another recent thread, the GS is rather icky to extract parallelism from by increasing the number of threads, due to the temporary storage requirements. So fundamentally, what you want there, on a GPU, is ILDP (Instruction Level Distributed Processing). There is an excellent patent from NVIDIA that uses instruction buffers, btw, and that'd fit nicely, although they present it as a generic solution for the PS or VS too, unified or not.
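To give an idea of why "more GS threads" translates directly into painful amounts of temporary storage, here's the naive sizing arithmetic; the amplification factor, vertex size and thread count are all assumptions for illustration:

[code]
/* Naive sizing of the GS output buffering you'd need if you tried to
   hide latency with lots of threads, the way the PS does. All numbers
   are assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    int max_out_vertices  = 64;   /* assumed worst-case amplification per invocation */
    int bytes_per_vertex  = 64;   /* assumed: position + a few attributes            */
    int threads_in_flight = 256;  /* PS-style thread count                           */

    long bytes = (long)max_out_vertices * bytes_per_vertex * threads_in_flight;
    printf("%ld KB of on-chip buffering\n", bytes / 1024);  /* ~1 MB - ouch */
    return 0;
}
[/code]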
But the scheduling cost is obviously higher, and CPUs themselves have some rather basic forms of instruction-level parallelism; that's (in part) what makes it possible for them to be pipelined and yet not work on as many threads as there are pipeline stages. So the gap is much smaller there. Still, if you needed a texture fetch-like operation in your GS, the fact that a CPU core is fundamentally serial means you'll dance on your head before you can properly hide the latency. I doubt those will be used as much as in the PS or VS, though.
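To illustrate the "dance on your head" bit: about the best you can realistically do on a CPU is batch the work up and software-prefetch, something along these lines (purely a sketch; __builtin_prefetch is the gcc/clang builtin, and the batch size and the dummy maths are arbitrary):

[code]
/* Sketch of manually hiding a texture-fetch-like lookup on a CPU:
   batch up the work, issue prefetches for all the lookups first, then
   go back and do the maths, hoping the data arrived in the meantime. */
#define BATCH 16

void process_batch(const int *indices, const float *table, float *out)
{
    /* pass 1: get all the loads in flight */
    for (int i = 0; i < BATCH; ++i)
        __builtin_prefetch(&table[indices[i]]);

    /* pass 2: do the actual work, latency hopefully already paid */
    for (int i = 0; i < BATCH; ++i)
        out[i] = table[indices[i]] * 2.0f + 1.0f;
}
[/code]

It works when the access pattern cooperates, but it's hand-tuned and fragile next to a GPU that simply switches to another thread.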
But the GS sits right in the middle of the programmable pipeline, so unless the cost of a round trip over PCI-E got noticeably lower, there's no way you could "offload" your GS to the CPU without putting your VS there too. And it's far from impossible to make VS-like operations highly efficient on a properly engineered CPU (see: CELL). I don't believe it's as efficient when it comes to perf/mm², but that's not really the discussion here, as long as it remains viable for anything but the high-end.
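A quick back-of-the-envelope on the bandwidth point, with assumed numbers for the scene and the vertex size:

[code]
/* Back-of-the-envelope: what shipping post-VS/GS vertices over PCI-E
   would cost. All numbers are assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    long verts_per_frame  = 2 * 1000 * 1000;  /* assumed scene complexity    */
    int  bytes_per_vertex = 64;               /* position + a few attributes */
    int  fps              = 60;

    double gb_per_s = (double)verts_per_frame * bytes_per_vertex * fps / 1e9;
    printf("%.1f GB/s one way\n", gb_per_s);  /* ~7.7 GB/s */
    /* versus ~4 GB/s per direction for 16-lane PCI-E today, and that's
       before textures, state changes and everything else on the bus */
    return 0;
}
[/code]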
Personally, as I'm sure a fair number of people will have noticed, I'm a big fan of ILDP-like architectures for certain kinds of workloads, and I'd tend to believe there will be a serious convergence between CPUs and GPUs because of MIMD + ILDP in the VS/GS architectures of at least one IHV. But at the same time, that remains to be seen, and CPU manufacturers don't seem to be moving in that direction anyway.
In conclusion, there could be a convergence in the coming few years that'd make IGP-level or even entry-level chips once again not require VS or GS capability to any serious degree. But for the PS, that's downright unthinkable, and any attempt would be at least a few orders of magnitude slower. And all of that depends on which direction CPU manufacturers are truly taking; personally, I don't feel even the VS/GS possibility deserves to be taken very seriously, at least not quite yet. And should they ever manage it, it might already be time for a whole new book of algorithms that wouldn't fit quite as nicely anymore...
Uttar
P.S.: The above poster's discussion about CELL is obviously a fair point, but as I explained, I don't believe CELL can be anywhere near efficient enough for proper Pixel Shading with some basic nicely-cached bilinear, let alone texture filtering with Trilinear+AF. So its rumored usage in the PS3 as a GPU, which always was bogus anyway, is irrelevant to any proper discussion imo. As for embedded platforms, besides consoles, you really only have handheld-like ones, and from a perf/watt pov CELL is irrelevant for handheld applications imo. Using CELL for graphics doesn't make sense unless you use REYES imo, and don't even get me started on that...
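P.P.S.: For reference, here's what that "basic nicely-cached bilinear" costs per sample in plain C; a texture unit does all of this (plus addressing modes, format conversion and its own cache) as a single fixed-function lookup, trilinear roughly doubles it, and AF multiplies it further. Single-channel float texture, purely illustrative:

[code]
/* Plain bilinear filtering of one sample: 4 texel reads plus 3 lerps,
   plus the address maths, per channel, per pixel. Doing this in
   software on every pixel is where a CPU/CELL approach bleeds. */
float bilinear(const float *tex, int w, int h, float u, float v)
{
    float x = u * (w - 1), y = v * (h - 1);
    int   x0 = (int)x,     y0 = (int)y;
    int   x1 = x0 + 1 < w ? x0 + 1 : x0;   /* clamp at the edge */
    int   y1 = y0 + 1 < h ? y0 + 1 : y0;
    float fx = x - x0,     fy = y - y0;

    float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
    float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];

    float top    = t00 + (t10 - t00) * fx;
    float bottom = t01 + (t11 - t01) * fx;
    return top + (bottom - top) * fy;
}
[/code]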