First of all, in my humble opinion the Smith-Waterman example was a bad example in the context of games. Sure, it's still faster when it gets the whole GPU, but I want to keep 90% of that processing power for graphics! The only physics algorithms efficient enough to consume only a minor fraction of the GPU's processing power are the embarrassingly parallel ones.
There's a lot of hand waving being thrown around. I picked the Smith-Waterman example because it represents a class of problems that are not "embarrassingly parallel" yet run very efficiently on GPUs. Many dynamic programming problems are amenable to the same technique; thus, if you can get SW running fast, you can probably get BLAST, HMMR, Edit Distance, optimal polygon triangulation, and a whole bunch of other algorithms sped up.
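To make the "not embarrassingly parallel, yet GPU-friendly" point concrete, here is a minimal sketch of the anti-diagonal (wavefront) trick that GPU SW implementations exploit: every cell on the same anti-diagonal depends only on the two previous diagonals, so the whole inner loop can be scored in parallel. The scoring constants and linear gap penalty below are simplifications of my own, not any particular production implementation:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Minimal Smith-Waterman sketch (linear gap penalty, toy scoring values).
// The outer loop walks anti-diagonals; every cell on one anti-diagonal
// depends only on the two previous anti-diagonals, so the inner loop is
// the part a GPU would spread across hundreds of threads.
int smith_waterman(const std::string& a, const std::string& b,
                   int match = 2, int mismatch = -1, int gap = -1) {
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());
    std::vector<std::vector<int>> H(m + 1, std::vector<int>(n + 1, 0));
    int best = 0;

    for (int d = 2; d <= m + n; ++d) {            // anti-diagonal index i+j
        int lo = std::max(1, d - n);
        int hi = std::min(m, d - 1);
        for (int i = lo; i <= hi; ++i) {          // independent cells: parallelizable
            int j = d - i;
            int diag = H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up   = H[i - 1][j] + gap;
            int left = H[i][j - 1] + gap;
            H[i][j] = std::max({0, diag, up, left});
            best = std::max(best, H[i][j]);
        }
    }
    return best;  // best local alignment score
}

int main() {
    std::cout << smith_waterman("GGTTGACTA", "TGTTACGG") << "\n";
}
```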
The hand waving comes in when one simply says "oh, multi-core. CPU problem solved." Well, an R520 runs ClawHMMR 30x faster than a 2.8GHz P4, and 6x faster than a G5 with AltiVec optimizations. One of the startups I'm involved in sells an optimized version of hmmsearch for multicore, written in pure hand-tweaked ASM, which is 10x faster than the C version, so at best the CPU will still be roughly 3x slower than the GPU, which is in line with NVidia CUDA benchmarks for bioinformatics.
Quad-core *might* bring a CPU within striking distance of the GPU on some of these problems, if *hand-tuned* code were written for a quad-core CPU. Otherwise, you'd better buy a 16-core machine. By the time 8- and 16-core CPUs are ubiquitous in the mainstream, the G90 and R700 will probably be out and another factor of 3x or more faster. It is much more likely that devs will find middleware providers or tools to run the algorithms they are looking for on either the CPU or the GPU, so I say, what's the problem? Havok isn't going to say "We are no longer shipping a physics engine for CPUs, therefore everyone must buy a GPGPU and dedicate graphics performance to physics." In reality, users will have a choice, so I think the dichotomy being presented is a false one.
Right there lies the problem. As much as game developers and GPGPU enthusiasts would like it, not everyone has a G80. And as Thorburn already noted, not everyone will have one (or something similar) for a very long time, while multi-core is already mainstream.
Well, Core 2 Duo midrange is "mainstream". Owning 4 or 8 cores will set you back $800-$1600, and your performance still won't be up to next-gen GPUs. What if I turned your argument around and said "Not everyone will own a GPU at all! Why leave out people who don't want to buy 3D graphics cards? Mega-core CPUs are going mainstream. Let's just software-render everything"? This was Sony's dream with CELL. This is the Tim Sweeney argument. The problem is, any Moore's law improvement that helps CPUs also helps GPUs, and you can't get around the fact that CPUs, in the sense of general-purpose x86, aren't going to beat chips optimized for data-parallel TLP running data-parallel or uber-threaded code.
Games that are meant to run on systems sold today are almost guaranteed to run on a dual-core CPU. But the GPU could be anything from a G80 down to an X3000. That's anywhere from 500 down to 20 GFLOPS, compared to 10 GFLOPS for one core of a cheap Pentium D.
Right, so gamers can own anything from an X3000 to a G80, but devs never have to worry about supporting people with Celerons and low-end CPUs from a few years ago? And "guaranteed to run on a dual-core CPU" is different from "optimized to run on a dual-core CPU". The number of hardware threads you target very much dictates the architecture of your game engine; it's not like a game designed around 2 threads is going to automatically max out its performance on 8 cores. Dealing with a CPU ecosystem where some of your customers will have 8 cores, some will have 4, some will have 2, and still others will own old single-core chips presents headaches similar to having to run on a variety of GPUs.
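To put the "2 threads won't max out 8 cores" point in code terms, here's a rough sketch in present-day C++ (hypothetical subsystem names, not any real engine's API) of why a frame that hard-wires one helper thread simply parks the extra cores:

```cpp
#include <iostream>
#include <thread>

// Hypothetical subsystems: simple stand-ins, not any real engine's API.
void simulate_physics()   { /* ... */ }
void update_ai()          { /* ... */ }
void mix_audio()          { /* ... */ }
void build_render_lists() { /* ... */ }

int main() {
    // A "dual-core aware" frame, circa 2006: one hard-wired helper thread.
    // On an 8-core box this still occupies exactly two cores; nothing here
    // automatically spreads work onto the other six.
    std::thread helper([] { simulate_physics(); update_ai(); });
    mix_audio();
    build_render_lists();
    helper.join();

    std::cout << "hardware threads available: "
              << std::thread::hardware_concurrency()
              << ", threads this frame actually used: 2\n";
    // Scaling further means re-architecting the frame into independent jobs
    // that a pool sized to the core count can pick up; that's a different
    // engine design, not a recompile.
}
```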
Most devs don't write their own physics engines. They license Havok or Novodex, use ODE, etc., and tweak it to their needs. I don't envision devs writing custom code for CUDA and CTM, any more than I envision them writing tons of game code in ASM optimized for different pipeline configurations of CPU microarchitectures.
I don't believe that in the long run mainstream GPUs will increase in performance faster than CPUs. CPUs have only just started to exploit thread parallelism, and the number of cores will double when transistor density doubles. GPUs are bound by the same advances in process technology. So there's no reason to believe that game developers will be more inclined to use the GPU for physics in the future than they are today.
Both x86 CPUs and GPUs are constrained by the same process technology. Yet CPUs still have vastly inferior bandwidth, latency hiding, and parallel throughput. That isn't going to change as long as the architectures stay the same.
General-purpose CPUs have to be a jack of all trades. For server chips, some manufacturers, like Sun Microsystems and Azul, have opted to go with many simple cores running oodles of threads. Because their workloads are databases, web servers, and other data-parallel tasks, they have the freedom to design CPUs like this.
Likewise, GPU manufacturers don't have to perform well on the same applications that x86 CPUs do, so they too can spend their transistor budgets in different ways than conventional CPUs.
In the long run, GPUs will maintain their lead as long as CPUs need to be a jack of all trades. That forces compromises on the design of the pipelines, the memory bus, and so on. CPUs must deal with being fed non-threaded workloads, so they need complex OoOE logic. GPUs don't. Niagara doesn't. So when Intel can fit 16-32 x86 cores on a single chip, Nvidia and ATI will be fitting 4x-8x as many ALUs, and the G80++++ will have 1024 ALUs instead of 128. And their memory bandwidth will continue to destroy CPUs. When Intel and AMD start shipping CPUs with 512-bit buses, surface-mounted to motherboards alongside surface-mounted high-speed memory, then maybe we can talk.
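To put rough numbers on the bandwidth gap (back-of-envelope from published specs, so correct me if I'm off): an 8800 GTX's 384-bit bus at 1.8 GT/s GDDR3 works out to 48 bytes x 1.8 G = ~86 GB/s, while a desktop CPU on dual-channel DDR2-800 peaks at 2 x 8 bytes x 800 MT/s = 12.8 GB/s. That's nearly a 7x gap before you even count the GPU's latency hiding.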
Exactly. CPUs can still make a lot of architectural changes. By sacrificing cache area, simplifying out-of-order execution, and trading branch prediction for hardware thread scheduling, there's room for many more functional units, and peak performance goes up a lot. This way the CPU starts to look a lot more like a GPU, and the performance gap decreases. GPUs don't have that architectural freedom. All they can do is try to cram more functional units onto the chip when the transistor budget increases.
Ass backwards. GPUs have far more architectural freedom than CPUs. CPUs have no virtual machine, no abstract interface for programming them. Compilers *model* the low-level CPU architecture and emit instruction sequences geared to running well on that microarchitecture. GPUs can't be programmed without writing code in a data-parallel fashion on top of an abstraction.
By design, GPU code doesn't share mutable memory by default, and pointers can't be aliased -- unless developers explicitly make it so. This lends itself to a lot of flexibility. Any data-parallel task can be run either across an uber number of threads or in a single very fast serial thread. That transformation is a lot easier than the other direction: taking raw serialized ISA code and extracting TLP from it.
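A trivial illustration of what I mean (a made-up kernel, sketched as plain C++ rather than any particular GPU API): because each invocation touches only its own element, the exact same kernel can run in one tight serial loop or be fanned out across as many threads as you have. Going the other way, discovering that freedom inside already-serialized code, is the hard part.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// A data-parallel "kernel": each invocation reads and writes only element i,
// so there is no shared mutable state and no aliasing between invocations.
void kernel(std::vector<float>& data, std::size_t i) {
    data[i] = data[i] * 0.5f + 1.0f;
}

// Same kernel, one fast serial thread.
void run_serial(std::vector<float>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) kernel(data, i);
}

// Same kernel, spread across however many threads are available.
// (A GPU would simply launch one lightweight thread per element.)
void run_parallel(std::vector<float>& data) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&data, w, workers] {
            for (std::size_t i = w; i < data.size(); i += workers)
                kernel(data, i);
        });
    for (auto& t : pool) t.join();
}

int main() {
    std::vector<float> a(1 << 20, 2.0f), b = a;
    run_serial(a);
    run_parallel(b);
    std::cout << (a == b ? "identical results either way\n" : "mismatch\n");
}
```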
Intel/AMD architecture has to contend with a long legacy: gazillions of apps architected and compiled for non-threaded performance. The hypothetical "Fusion"-style CPUs, with traditional "fast" serial pipelines (OoOE, et al.) a la Core 2/K8* plus many, many "GPU"-style shader ALUs, have to contend with the fact that the two types of workloads demand different pipeline configurations, different memory buses, and so on, because the workloads and memory access patterns are different.
If you built a Core 2-style chip with a 384-bit memory bus and 4 cores plus 128 scalar ALUs, it would be a hybrid monster of a chip, expensive to produce, and neither as good as a Core 2 that spent ALL of its transistor budget on cache and extra cores, nor as good as a GPU that spent ALL of its transistor budget on GPU-style pipes.
CELL in the PS3 tries to have it both ways, by marrying a "traditional" (but without branch prediction/OoOE) CPU with several co-processing units coupled to fast on-chip memory, a very fast FlexIO memory bus, and a fast inter-unit interconnect bus. The result, however, is a chip that won't run general-purpose single-threaded C code as fast as an x86, and won't run GPU code as fast as a GPGPU, but can run many algorithms FAST *if* lots of developer effort is put into architecting the algorithms for CELL's design.
It's harder to program than both x86 and DX10 shaders on GPUs.
The need to keep specialized chips tailored to various workloads will only become clearer once Moore's law bottoms out. Then, unless we switch to a fundamentally different computing substrate (nanorod logic, RSFQ, quantum computing, etc.), we'll only be able to increase performance by horizontal scaling, and that means software will have to deal with distributed nodes anyway.