No, what will save AMD is Fusion. While I'm not that excited when it comes to integrating a full graphics core in the CPU, I'd gladly see a stream processing unit integrated into every single AMD x86 processor sold. Imagine what Phenom could be if it integrated the 320 stream processors from the R600.
Is it possible to answer the question of how many transistors just the stream processors from R600 use? If integrated in a Phenom, do you think it would be possible to run them at full speed (2.5+ GHz)?
I haven't seen any transistor breakdowns for the design, and it might be against AMD's best interest to give that kind of data away (not that there aren't reverse-engineering companies that do just that).
Nonetheless, compared to CPUs, GPUs are extremely densely packed with execution units.
R600 is a big chip, with ~720 million transistors.
Barcelona, with four cores, is ~463 million.
I'm just going to make a number up and say that 1/4 of R600 is devoted to the stream processors.
If they are integrated into Phenom like all the other units, as you seem to describe, then they will be wired into each core.
If it is 1 R600's worth of SIMDs per core, that would add ~720 million transistors and make a Phenom processor roughly 2.5 times its current size.
If it is 1/4 of R600's processors per core, it would bump Phenom to roughly 640 million transistors.
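For what it's worth, the back-of-the-envelope arithmetic can be written out as a small Python sketch. It only uses the numbers quoted in this post, including the made-up 1/4 share for the stream processors, so the results are no better than that guess. The function name `phenom_estimate` and the constant names are purely illustrative.

```python
# Approximate transistor counts quoted in the post.
R600_TRANSISTORS = 720_000_000
BARCELONA_TRANSISTORS = 463_000_000
CORES = 4

# ASSUMPTION (made up in the post, not an AMD figure):
# 1/4 of R600's transistors are stream processors.
STREAM_DIVISOR = 4


def phenom_estimate(share_divisor):
    """Return (total transistors, growth factor) when every core gets
    1/share_divisor of R600's stream processors bolted on."""
    stream_block = R600_TRANSISTORS // STREAM_DIVISOR
    added = (stream_block // share_divisor) * CORES
    total = BARCELONA_TRANSISTORS + added
    return total, total / BARCELONA_TRANSISTORS


if __name__ == "__main__":
    for label, divisor in (("full R600 SIMDs per core", 1),
                           ("1/4 of R600 SIMDs per core", 4)):
        total, factor = phenom_estimate(divisor)
        print(f"{label}: {total / 1e6:.0f}M transistors ({factor:.2f}x Barcelona)")
```

Under these assumptions the totals come out to about 1.18 billion transistors (about 2.6x Barcelona) for the full version and about 643 million for the quarter version. This counts only the stream processors themselves, with nothing for extra register files, decoders, or wiring.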
The full R600 per core would be massive, and there are other considerations, such as how one would hook 320 stream processors to an x86 register file that tops out at 40 integer and 120 SSE/FP registers (non-speculative plus rename) for K8 (Phenom should be the same for integer, not sure about FP).
The big version would have 4 SIMDs, each with three 16-wide read ports and one 16-wide write port.
That's 12 read ports and 4 write ports.
K8 had 5 read and 5 write ports for its FP register file.
I'm not sure what Barcelona's count is at the moment.
Regardless, the width of the ports would also be larger.
Barcelona would have 128-bit ports, while the R600 ports would be 16*32 bits wide.
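To put rough numbers on the port comparison, here is a minimal Python sketch that multiplies port counts by port widths. Since Barcelona's port count isn't known here, the sketch simply assumes it keeps K8's 5 read and 5 write ports at the 128-bit width. That is an assumption, not a confirmed figure, and the class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterFilePorts:
    read_ports: int
    write_ports: int
    port_width_bits: int

    @property
    def read_bits_per_clock(self):
        return self.read_ports * self.port_width_bits

    @property
    def write_bits_per_clock(self):
        return self.write_ports * self.port_width_bits


# "Big version": 4 SIMDs, each with 3 read ports and 1 write port,
# every port 16 lanes x 32 bits wide.
R600_PER_CORE = RegisterFilePorts(read_ports=4 * 3, write_ports=4 * 1,
                                  port_width_bits=16 * 32)

# ASSUMPTION: Barcelona keeps K8's 5 read / 5 write FP ports, widened to 128 bits.
BARCELONA_FP = RegisterFilePorts(read_ports=5, write_ports=5,
                                 port_width_bits=128)

if __name__ == "__main__":
    for name, rf in (("R600 SIMDs", R600_PER_CORE), ("Barcelona FP", BARCELONA_FP)):
        print(f"{name}: {rf.read_bits_per_clock} read bits/clk, "
              f"{rf.write_bits_per_clock} write bits/clk")
```

With those inputs the R600-style units would want 6144 read bits and 2048 write bits per clock, against 640 each for the assumed Barcelona FP file. That is roughly 9.6 times the read bandwidth.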
However, without a revamped decoder and additional instructions, the VLIW instruction packets would not exist to be decoded, and the streaming units would be saddled with a register file so small that they could likely access almost as many software-visible registers per clock as there are physical registers on the core.
The full R600 units per core would also require widening Phenom's issue width. R600's 4 SIMDs operate on their own instructions, so the 3-wide decode rate of K10 would hinder them.
The 1/4 R600 per core could be folded into the standard issue width.
To sum up, any such change would range from insane to huge, especially for AMD, which is struggling to fab a chip as large as Barcelona.