There is merit to what you both, pcchen and arjans, say.
It might be instructive to look at the directions that high performance computing has taken. From the Cray 1 onward, we have seen HPC move from fast scalar+vector processors to primarily massively parallel architectures, either loosely integrated (as in Beowulf clusters) or more tightly coupled. These systems typically run a small set of problems that suit them particularly well.
But there is another trend, and that is the migration of scientific computing to "PC" level hardware. And in spite of my field being one which actually does crunch supercomputer cycles, I would have to say that this is the dominant trend. There are two reasons for this, one of which is obviously cost. But the other reason is that for non-parallelizable code, you just can't get scalar processors that are much faster than current PC hardware. (Memory sizes are limiting, though.)
Scientific codes have the advantage of not being under much cost pressure - if a code is heavily used, you can count on professors having their graduate students look into ways of making it run faster.
The conclusion I would draw is that a lot of these codes simply do not parallelize well, not even to the point of benefiting from small-scale SMP. I have dealt with such code myself.
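To make that concrete, here is a minimal sketch in C of the kind of structure I mean. It's a hypothetical example rather than something from a real code, but many time-stepping scientific programs have something like this at their core: each iteration depends on the result of the previous one, so throwing more processors at it buys you nothing.

    #include <stdio.h>

    /* Hypothetical example: a simple linear recurrence. Each step
     * depends on the result of the step before it, so the loop forms
     * a serial chain that cannot be split across processors without
     * changing the algorithm itself. */
    double relax(double x0, double a, double b, long steps)
    {
        double x = x0;
        for (long i = 0; i < steps; i++)
            x = a * x + b;  /* x[i+1] needs x[i]: inherently sequential */
        return x;
    }

    int main(void)
    {
        printf("%f\n", relax(1.0, 0.999, 0.001, 1000000L));
        return 0;
    }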
So people parallelize physically, and buy multiple boxes/racks so that they can at least get away from queueing.
Algorithms and tools are certainly part of the problem - perhaps the single most hopeful development for parallel processing is the upcoming PS3. I'm totally uninterested in consoles per se, but I'll buy myself one of those if I can get hold of development tools reasonably easily, because the architecture is so promising. In PC space the Hammer is interesting, but since we are dealing with physically distinct chips (and AMD needs to find a lucrative niche for its Opterons) I have difficulty seeing their topological flexibility ever having an impact in the mainstream. I'd predict that x86 will follow its current path for the foreseeable future.
So far, for a task to be productively migrated to a GPU, it has to be both parallelizable and vectorizable, it can't require high precision, and it must fit within the local memory of a graphics card. (If it could be productively chopped up into smaller blocks that could be transferred over the AGP bus and processed locally, odds are that it could be partitioned into even smaller parts that would fit inside CPU caches, which would be faster still.)
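For contrast with the recurrence above, here is a hypothetical sketch of the kind of kernel that does meet those constraints: every element is independent, the same operation is applied across the whole array, and single precision is good enough. (The function and its parameters are illustrative, not taken from any real code.)

    #include <stddef.h>

    /* Hypothetical GPU-friendly kernel: a scaled vector addition.
     * Every iteration is independent (parallelizable), the same
     * operation runs across the whole array (vectorizable), and
     * single-precision floats suffice. The arrays would also have
     * to be small enough to fit in the graphics card's local
     * memory. */
    void saxpy(size_t n, float a, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];  /* no dependence between iterations */
    }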
I just can't see GPUs taking over the tasks of the CPU in any general sense. Particularly not on a platform controlled by Intel.
Entropy