GPU Physics vs. CPU Threading

First of all, in my humble opinion the Smith-Waterman example was a bad example in the context of games. Sure, it still runs faster on the GPU as a whole, but I want to keep 90% of its processing power for graphics! The only physics algorithms efficient enough to consume only a minor fraction of that processing power are the embarrassingly parallel ones.

There's a lot of hand waving being thrown around. I picked the Smith-Waterman example because it represents a class of problems that are not "embarrassingly parallel" yet run very efficiently on GPUs. Many dynamic programming problems are amenable to the same technique, so if you can get SW running fast, you can probably speed up BLAST, HMMER, edit distance, optimal polygon triangulation, and a whole bunch of other algorithms.
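
To make that concrete, here's a minimal sketch of my own (scoring constants made up for illustration) of the standard anti-diagonal trick for Smith-Waterman: every cell on an anti-diagonal depends only on the two previous anti-diagonals, so the inner loop below has no dependencies within one diagonal and can be fanned out across threads, which is exactly the loop a GPU implementation maps onto its thread grid.

```cpp
// Anti-diagonal ("wavefront") Smith-Waterman sketch.
// Cells with i + j = d depend only on diagonals d-1 and d-2, so the inner
// loop over a single diagonal is safe to run in parallel.
#include <algorithm>
#include <string>
#include <vector>

int smith_waterman(const std::string& a, const std::string& b,
                   int match = 2, int mismatch = -1, int gap = -1) {
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());
    std::vector<int> H((m + 1) * (n + 1), 0);              // (m+1) x (n+1), zeroed
    auto h = [&](int i, int j) -> int& { return H[i * (n + 1) + j]; };

    int best = 0;
    for (int d = 2; d <= m + n; ++d) {                      // walk the anti-diagonals
        const int i_lo = std::max(1, d - n);
        const int i_hi = std::min(m, d - 1);
        #pragma omp parallel for reduction(max : best)      // independent cells
        for (int i = i_lo; i <= i_hi; ++i) {
            const int j = d - i;
            const int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            int v = std::max({0,
                              h(i - 1, j - 1) + s,          // from diagonal d-2
                              h(i - 1, j) + gap,            // from diagonal d-1
                              h(i, j - 1) + gap});          // from diagonal d-1
            h(i, j) = v;
            best = std::max(best, v);
        }
    }
    return best;                                            // best local alignment score
}
```

The same wavefront structure is what makes the dynamic programming passes in HMMER and plain edit distance GPU-friendly; that's all I mean by "the same technique".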

The hand waving comes in when one simply says "oh, multi-core, CPU problem solved." Well, an R520 runs ClawHMMER 30x faster than a 2.8 GHz P4, and 6x faster than a G5 with AltiVec optimizations. One of the startups I'm involved in sells an optimized version of hmmsearch for multi-core, written in pure hand-tweaked ASM, which is 10x faster than the C version, so at best the CPU will still be 10x slower than the GPU, which is in line with NVIDIA CUDA benchmarks for bioinformatics.

Quad-core *might* bring a CPU within striking distance of the GPU on some of these problems, if *hand-tuned* code were written for a quad-core CPU. Otherwise, you'd better buy a 16-core machine. By the time 8- and 16-core CPUs are ubiquitous in the mainstream, the G90 and R700 will probably be out and another factor of 3x or more faster. It is much more likely that devs will find middleware providers or tools to run the algorithms they are looking for on either the CPU or the GPU, so I say, what's the problem? Havok isn't going to say "We are no longer shipping a physics engine for CPUs, therefore everyone must buy a GPGPU and dedicate graphics performance to physics." In reality, users will have a choice, so I think the dichotomy presented is a false one.

Right there lies the problem. As much as game developers and GPGPU enthusiasts would like it, not everyone has a G80. And as Thorburn already noted, not everyone will have one (or something similar) for a very long time, but multi-core is already mainstream.

Well, Core 2 Duo midrange is "mainstream". Owning 4 or 8 cores will set you back $800-$1600. And yet, your performance still won't be up to next-gen GPUs. What if I turned your argument around and said, "Not everyone will own a GPU at all! Why leave out people who don't want to buy 3D graphics cards? Mega-core CPUs are going mainstream. Let's just software-render everything"? This was Sony's dream with CELL. This is the Tim Sweeney argument. The problem is, any Moore's Law improvement that helps CPUs helps GPUs, and you can't get around the fact that CPUs, in the sense of general-purpose x86, aren't going to beat chips optimized for data-parallel TLP running data-parallel or uber-threaded code.

Games that are meant to run on systems sold today are almost guaranteed to run on a dual-core CPU. But the GPU could be anything from a G80 to an X3000. That's anywhere from 500 down to 20 GFLOPS, compared to 10 GFLOPS for one core of a cheap Pentium D.

Right, so gamers can own anything from an X3000 to a G80, but devs never have to worry about supporting people with Celerons and low-end CPUs from a few years ago? And "guaranteed to run on a dual-core CPU" is different from "optimized to run on a dual-core CPU". The number of hardware threads you use very much dictates the architecture of your game engine; it's not like a game designed with 2 threads is going to automatically max its performance on 8 cores. Dealing with a CPU ecosystem where some of your customers will have 8 cores, some will have 4, some will have 2, and still others will still own old single cores presents similar headaches to having to run on a variety of GPUs.

Most devs don't write their own physics engines. They license Havok or Novodex, use ODE, etc., and tweak it to their needs. I don't envision devs writing custom code for CUDA and CTM, any more than I envision them writing tons of game code in ASM optimized for different pipeline configurations of CPU microarchitectures.
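
For what it's worth, the level most devs actually work at looks like the sketch below. It's a minimal ODE example I threw together to make the point (drop a ball, step the world, no collision setup); Havok and Novodex expose similarly high-level APIs. Whether the solver underneath is plain C, hand-tuned SSE, or some day a GPU backend is the middleware vendor's problem, not the game programmer's.

```cpp
// Minimal ODE (Open Dynamics Engine) sketch: drop one rigid body and step it.
// The point is the level of abstraction: no SSE, no CUDA, no CTM in sight.
#include <ode/ode.h>
#include <cstdio>

int main() {
    dInitODE();
    dWorldID world = dWorldCreate();
    dWorldSetGravity(world, 0, 0, -9.81);

    dBodyID ball = dBodyCreate(world);
    dMass m;
    dMassSetSphere(&m, /*density*/ 1.0, /*radius*/ 0.5);
    dBodySetMass(ball, &m);
    dBodySetPosition(ball, 0, 0, 10);

    for (int step = 0; step < 60; ++step) {
        dWorldQuickStep(world, 1.0f / 60.0f);        // one 60 Hz physics tick
        const dReal* p = dBodyGetPosition(ball);
        std::printf("z = %.3f\n", (double)p[2]);     // watch it fall
    }

    dWorldDestroy(world);
    dCloseODE();
    return 0;
}
```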

I don't believe that in the long run mainstream GPUs will increase in performance faster than CPUs. CPUs have only just started to exploit thread parallelism, and the number of cores will double when transistor density doubles. GPUs are bound by the same advances in technology. So there's no reason to believe that game developers will be more inclined to use the GPU for physics in the future than they are today.

Both x86 CPUs and GPUs are constrained by the same process technology. Yet CPUs still have vastly inferior bandwidth, latency hiding, and parallel throughput. That isn't going to change as long as the architectures stay the same.

General-purpose CPUs have to be a jack of all trades. For server chips, some manufacturers, like Sun Microsystems and Azul, have opted to go with many simple cores carrying oodles of threads. Because their workload is databases, web servers, and other data-parallel tasks, they have the freedom to design CPUs like this.

Likewise, GPU manufacturers don't have to perform well on the same applications that x86 CPUs do, so they too can spend their transistor budgets in different ways than conventional CPUs.

In the long run, GPUs will maintain their lead as long as CPUs need to be a jack of all trades. This forces compromises on the design of the pipelines, the memory bus, and so on. CPUs must deal with being fed non-threaded workloads, so they need complex OoOE logic. GPUs don't. Niagara doesn't. So when Intel can fit 16-32 x86 cores on a single chip, NVIDIA and ATI will be fitting 4x-8x as many ALUs, and the G80++++ will have 1024 ALUs instead of 128. And their memory bandwidth will continue to destroy CPUs. When Intel and AMD start shipping CPUs with 512-bit buses, surface-mounted to motherboards with surface-mounted high-speed memory, then maybe we can talk.

Exactly. CPUs can still make a lot of architectural changes. By sacrificing cache area, simplifying out-of-order execution and trading branch prediction for hardware thread scheduling, there can be more functional units and peak performance goes up a lot. This way the CPU will look a lot more like a GPU, and the performance gap decreases. GPUs don't have that architectural freedom. All they can do is try to cram more functional units on the chip when the transistor budget increases.

Ass backwards. GPUs have far more architectural freedom than CPUs. CPUs have no virtual machine, no abstract interface for programming them. Compilers *model* the low-level CPU architecture and emit instruction sequences geared to running well on that microarchitecture. GPUs can't be programmed without writing code in a data-parallel fashion on top of an abstraction.

By design, GPU code doesn't share mutable memory by default, and pointers can't be aliased -- unless developers explicitly make it so. This lends itself to a lot of flexibility. Any data-parallel task can be run either across a huge number of threads or in a single very fast serial thread. That transformation is a lot easier than the other direction: taking raw, serialized ISA code and extracting TLP from it.
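
A toy illustration of that asymmetry (my own sketch, nothing vendor-specific): write the work as a pure per-element kernel and the *same* function can be driven by a single serial loop or fanned out over however many worker threads the machine has. Recovering that structure from already-serialized, pointer-aliasing machine code is the hard direction.

```cpp
// A pure per-element "kernel": no shared mutable state, no aliasing between
// elements. The same function runs serially or across N worker threads.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// The kernel: out[i] depends only on in[i].
inline float damp(float v) { return v * std::exp(-0.1f * v * v); }

// Serial driver: one fast thread.
void run_serial(const std::vector<float>& in, std::vector<float>& out) {
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = damp(in[i]);
}

// Parallel driver: same kernel, fanned out over `workers` threads,
// each owning a disjoint slice of the output (no locks needed).
void run_parallel(const std::vector<float>& in, std::vector<float>& out,
                  unsigned workers) {
    std::vector<std::thread> pool;
    const std::size_t chunk = (in.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t lo = w * chunk;
        const std::size_t hi = std::min(in.size(), lo + chunk);
        pool.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) out[i] = damp(in[i]);
        });
    }
    for (auto& t : pool) t.join();
}
```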

Intel/AMD architecture has to contend with a long legacy: gazillions of apps architected and compiled for non-threaded performance. Hypothetical "Fusion"-style CPUs, with traditional "fast" serial pipelines (OoOE and so on) à la Core 2/K8 plus many "GPU"-style shader ALUs, have to contend with the fact that the two types of workloads demand different pipeline configurations, different memory buses, and so on, because the workloads and memory access patterns are different.

If you built a Core 2-style chip with a 384-bit memory bus and 4 cores plus 128 scalar ALUs, it would be a hybrid monster of a chip, expensive to produce, and neither as good as a Core 2 that spent ALL of its transistor budget on cache and extra cores, nor as good as a GPU that spent ALL of its transistor budget on GPU-style pipes.

CELL in the PS3 tries to have it both ways, by marrying a "traditional" CPU (but without branch prediction/OoOE) with several co-processing units coupled to fast on-chip memory, a very fast FlexIO memory bus, and a fast inter-unit interconnect bus. The result, however, is a chip that won't run general-purpose single-threaded C code as fast as an x86, and won't run GPU code as fast as a GPGPU, but can run many algorithms FAST *if* lots of developer effort is put into architecting the algorithms for CELL's design.

It's harder to program than either x86 or DX10 shaders on a GPU.

The need for specialized chips tailored to various workloads will only become clearer once Moore's Law bottoms out. Then, unless we switch to a fundamentally different computing substrate (nanorod logic, RSFQ, quantum computing, etc.), we'll only be able to increase performance by horizontal scaling, and that means software will have to deal with distributed nodes anyway.
 
I picked the Smith-Waterman example because it represents a class of problems that are not "embarrassingly parallel" yet run very efficiently on GPUs.
With "very efficiently", do you mean it scales with GFLOPS, or do you mean it just runs a couple of times faster on a G80 than on a CPU?
Well, an R520 runs ClawHMMER 30x faster than a 2.8 GHz P4, and 6x faster than a G5 with AltiVec optimizations.
The first thing this tells me is that the P4 version is not SSE-optimized. Secondly, it wouldn't be so impressive on an X1300. And last but not least, nobody wants to sacrifice framerate if a multi-core CPU can do the job.
One of the startups I'm involved in sells an optimized version of hmmsearch for multi-core, written in pure hand-tweaked ASM, which is 10x faster than the C version, so at best the CPU will still be 10x slower than the GPU, which is in line with NVIDIA CUDA benchmarks for bioinformatics.
That's awesome, but we're talking about games here, where dual-core has just become mainstream but G80s are unaffordable. Also, 10x is only barely enough to run things on the GPU without sacrificing framerate.
Quad-core *might* bring a CPU within striking distance of the GPU on some of these problems, if *hand-tuned* code were written for a quad-core CPU. Otherwise, you'd better buy a 16-core machine.
What's wrong with hand-tuned code?
By the time 8- and 16-core CPUs are ubiquitous in the mainstream, the G90 and R700 will probably be out and another factor of 3x or more faster.
Definitely, but while budget systems will get octa-cores, they won't get a G90/R700, not even a G80/R600. So for a game developer it's a much safer bet to just run physics on the CPU.
It is much more likely that devs will find middleware providers or tools to run the algorithms they are looking for on either the CPU or the GPU, so I say, what's the problem? Havok isn't going to say "We are no longer shipping a physics engine for CPUs, therefore everyone must buy a GPGPU and dedicate graphics performance to physics." In reality, users will have a choice, so I think the dichotomy presented is a false one.
Oh, I'm not saying there's any problem. In whatever direction this evolves, there will be an adequate solution. My prediction is just that the CPU will prevail.
Well, Core 2 Duo midrange is "mainstream". Owning 4 or 8 cores will set you back $800-$1600.
Today, yes. But interestingly, CPUs have very non-linear price/performance, and the prices drop fast. A little over a year ago I paid over 500 € for an Athlon 64 X2; now I can buy a Pentium D for 100 €. That doesn't happen with GPUs. A GeForce 8800 GTS will not drop to 100 $. What I get for 100 € is an X1300. So compared to one year ago, for 200 € I get a high-end CPU and a low-end GPU. With a budget of 1000 € split evenly I'd get a slightly higher-performing CPU but a monstrous GPU.

So in the not-so-distant future, 200 € will get you an octa-core but still a not-so-hot GPU.
And yet, your performance still won't be up to next-gen GPUs. What if I turned your argument around and said, "Not everyone will own a GPU at all! Why leave out people who don't want to buy 3D graphics cards? Mega-core CPUs are going mainstream. Let's just software-render everything"? This was Sony's dream with CELL. This is the Tim Sweeney argument. The problem is, any Moore's Law improvement that helps CPUs helps GPUs, and you can't get around the fact that CPUs, in the sense of general-purpose x86, aren't going to beat chips optimized for data-parallel TLP running data-parallel or uber-threaded code.
Software rendering and physics are worlds apart. A MUL on a GPU is the same as a MUL on a CPU, for both physics and graphics. But a TEX that takes one clock cycle on a GPU takes tens of clock cycles on a CPU. So you really can't compare them this way. CPU rendering is no match for an X1300, but it could be on par with it for physics.
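To put a rough number on that TEX claim, here's a quick sketch of my own (single-channel texels, clamp-to-edge addressing, ignoring the format conversion and texture caching a real sampler also gives you for free) of what one bilinear fetch expands into in software:

```cpp
// What one bilinear TEX roughly costs in software: address math, four loads
// (likely scattered in memory), and a handful of MUL/ADDs per channel, versus
// a single instruction issued to a dedicated sampler unit on the GPU.
#include <cstddef>
#include <vector>

struct Texture {
    std::size_t width, height;
    std::vector<float> texels;   // single-channel, row-major, for brevity
};

// Assumes u and v are in [0, 1].
float sample_bilinear(const Texture& t, float u, float v) {
    // Map normalized coordinates to texel space.
    float x = u * (t.width - 1);
    float y = v * (t.height - 1);
    std::size_t x0 = static_cast<std::size_t>(x);
    std::size_t y0 = static_cast<std::size_t>(y);
    std::size_t x1 = (x0 + 1 < t.width) ? x0 + 1 : x0;    // clamp to edge
    std::size_t y1 = (y0 + 1 < t.height) ? y0 + 1 : y0;
    float fx = x - x0, fy = y - y0;

    // Four loads plus lerps; a GPU hides all of this behind one TEX.
    float c00 = t.texels[y0 * t.width + x0];
    float c10 = t.texels[y0 * t.width + x1];
    float c01 = t.texels[y1 * t.width + x0];
    float c11 = t.texels[y1 * t.width + x1];
    float top = c00 + fx * (c10 - c00);
    float bot = c01 + fx * (c11 - c01);
    return top + fy * (bot - top);
}
```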
Right, so gamers can own anything from an X3000 to a G80, but devs never have to worry about supporting people with Celerons and low-end CPUs from a few years ago?
The worst that can happen is that it's about two times slower. But when you expect a G80 for physics and you get an X3000 instead, that's a whole different story.

Like I already noted, there's only a small difference in performance across the range of CPUs, but a huge difference between the GPUs. For graphics, that problem is solved quite adequately by reducing the detail (resolution, anti-aliasing, effects, etc.). But dialing down physics isn't that easy. By just restricting physics calculations to the CPU, the whole problem is avoided. And for a few moving leaves and a couple of explosions it's really adequate.
And "guaranteed to run on dual-core CPU" is different than "optimized to run on dual-core GPU" The number of hardware threads you use very much dictates the architecture of your game engine, it's not like a game designed with 2 threads is going to automatically max its performance on 8-cores. Dealing with a CPU ecosystem where some of your customers will have 8-cores, some will have 4, some will have 2, and still others will stil own old single cores, will present similar headaches as having to run on a variety of GPUs.
Very true, unfortunately. But luckily, with some effort, there are ways to design your application to run well on an arbitrary number of cores. Once it's done, it won't need a redesign for years. Games are definitely fine for running on 1 to 8 cores. Starting from around 16 cores I believe a solution is necessary for fine-grained threading, likely in the form of hardware scheduling. But we'll see that when we get there. What's also a good thing is that single-core seems to be getting phased out really fast. A couple of years from now you'll hardly have to deal with single-core CPUs, but you'll still have a whole range of very low-end to very high-end graphics solutions.
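The usual way to get there, in case anyone wonders what I mean, is to stop thinking in "one thread per subsystem" and instead push small independent jobs into a pool sized from whatever core count the machine reports at runtime. This is only a bare-bones sketch of my own; real engine schedulers are far more elaborate:

```cpp
// Bare-bones task pool: the game submits small independent jobs and the pool
// is sized from the core count at runtime, so the same build scales from a
// single-core Celeron to an octa-core without a redesign.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TaskPool {
public:
    explicit TaskPool(unsigned workers = std::thread::hardware_concurrency()) {
        if (workers == 0) workers = 1;       // hardware_concurrency may report 0
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { worker_loop(); });
    }
    ~TaskPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void worker_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();                           // e.g. one island of the physics step
        }
    }
    std::vector<std::thread> threads_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```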
Both x86 CPUs and GPUs are constrained by the same process technology. Yet CPUs still have vastly inferior bandwidth, latency hiding, and parallel throughput. That isn't going to change as long as the architectures stay the same.
Multi-core is pushing bandwidth needs, and bandwidth is definitely evolving. There's just no need for GPU-like bandwidth yet. Latency can be hidden with hyper-threading. And parallel throughput is clearly improving as well. So it's not like there's a fundamental obstacle to improving the CPU's architecture. With the current competition between Intel and AMD, the future looks very interesting. The revolution has only just started, while the future of GPUs looks a little less exciting. But please educate me if there's anything to look forward to...
In the long run, GPUs will maintain their lead as long as CPUs need to be a jack of all trades. This forces compromises on the design of the pipelines, the memory bus, and so on. CPUs must deal with being fed non-threaded workloads, so they need complex OoOE logic. GPUs don't. Niagara doesn't. So when Intel can fit 16-32 x86 cores on a single chip, NVIDIA and ATI will be fitting 4x-8x as many ALUs, and the G80++++ will have 1024 ALUs instead of 128. And their memory bandwidth will continue to destroy CPUs. When Intel and AMD start shipping CPUs with 512-bit buses, surface-mounted to motherboards with surface-mounted high-speed memory, then maybe we can talk.
They can't continue to diverge. CPU peak performance will at least double when transistor count doubles, while GPU performance can at most double when transistor count doubles (clock increases aside). There will obviously remain a gap, but for physics in particular I don't believe it will be more interesting to run it on a GPU in the future than today.

Also, much of the time today's CPUs are running parallelizable workloads. So even for seemingly serial applications, multi-core is the right evolution. It doesn't make them any less 'jack of all trades'. The reason they've used every other trick before going multi-core is the software complexity. It's a very big paradigm shift, and a new compiler alone doesn't make it much easier. But programming-language-assisted concurrency holds great promise, and we just have to rethink the way we program. When graphics cards had only just entered the consumer market, it also took a while to educate developers and to keep things running concurrently. While it's sometimes still challenging, it's much better understood and accepted now.
Ass backwards. [...]
What I meant is that even with the current transistor budget the CPU has a lot of architectural freedom for improving performance. A dozen Katmai cores fit in the area of one Prescott core, just to name one approach (not practical, but still). Ironically, a Prescott's IPC is lower than a Katmai's, so all the extra transistors were spent on reaching 3+ GHz, just to avoid multi-threaded software issues. Now that they've hit a limit with that approach and embraced multi-core, they can rebudget the transistors to do some very interesting things. A GPU, on the other hand, is immediately bound by the transistor budget. You can rearrange some things, but you really just want the maximum number of execution units, kept fed with data.
[...] Then, unless we switch to a fundamentally different computing substrate (nanorod logic, RSFQ, quantum computing, etc.), we'll only be able to increase performance by horizontal scaling, and that means software will have to deal with distributed nodes anyway.
Stack 'em. :D
 