Well, lithography got a new lease on life once the wonders of photon entanglement were discovered; I believe it almost halves the previous resolution limit.
The question is: will interconnects keep up?
Right now, I believe the situation isn't really about faster-switching transistors; it's about interconnect latency and signal propagation over huge nets. I believe the P4 actually does a lot to address such things. We do have copper, and it can carry roughly 10 times the current that aluminum can at the same temperature, but I'm not sure how long that will last. Gold interconnects, here we come.
There is also the fact that -I could be mistaken here- verification tools for such fine processes are starting to lag behind a bit.
As for parallel MPUs on a die, that sounds fine, but the buffering logic could be a real pain; I don't think connecting these chips with high-speed buses that can also provide the necessary bandwidth will be that easy.
I think the hardest part will be getting coders to actually bother writing with multithreading in mind -code designed to run on multiple CPUs, not just code that makes sure the app can do other things while waiting on something in another thread; there are some subtle differences. It's hard enough to get them to bother keeping their memory access clean and using optimal solutions even when speed isn't a concern, and I'm not saying they should use assembly.
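That subtle difference can be sketched in a few lines (Python here purely for illustration; the function names are made up). The first style just parks a blocking wait on a worker so the app stays responsive; the second actually decomposes the computation so it can run on multiple CPUs:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def slow_io():
    # Stands in for a blocking wait (network, disk, user input).
    time.sleep(0.1)
    return "done"

def cpu_work(n):
    # A genuinely CPU-bound chunk: sum of squares up to n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Style 1: a worker thread so the app can do other things while waiting.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(slow_io)
        # ... main thread keeps servicing the UI here ...
        print(pending.result())

    # Style 2: the work itself is split so it runs on multiple CPUs at once.
    with ProcessPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(cpu_work, [250_000] * 4))
    print(sum(parts))
```

Style 1 gains nothing from extra CPUs -the worker is just waiting- while style 2 scales with the number of cores, which is exactly the kind of decomposition coders would have to start doing deliberately.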
RANT
I can't believe all the morons who say we've got CPU cycles to burn. We don't. If there are 10 applications built with the philosophy that you don't really have to optimize that much, and you have all of them running -IE, ICQ, Outlook, WinAMP...- all of a sudden the only thing burning is the user's desire to curse and swear at his or her slow computer.
/RANT
Now, I'm not saying many CPUs on one die is a bad idea; actually, I think the way to go is simple in-order CPU cores at a reasonably high clock speed, plus a master CPU optimized to basically run a Transmeta-style scheme, so you could be platform agnostic. It would do any necessary decoding -with a trace cache so it doesn't waste its time-, reorder instructions, assign code streams to each individual processor in the core, look ahead in the code streams and remove any branches that can be removed or pre-predict them... and so on and so forth. It might even assign one or more idle processors to work on one or more different branches, but that's an easy way to end up on the wrong side of a combinatorial explosion -never thought I'd dip into my stats vocab. Of course, the master CPU might just be a more general CPU running software to do the aforementioned things; it really doesn't matter exactly how it's implemented, but it'd be an interesting arch... just not one that's likely today, since we probably can't put down that many transistors on a die, at least for the desktop.