GPU vs CPU Architecture Evolution

So as we can see, the recent history of CPUs shows them growing FLOPs by a staggering ~2x, while GPUs scaled at a mere 4x on an inconsequential order of magnitude difference in base capability.

Are you being sarcastic? The P4 was released in 2000, Nehalem in 2008. So by your metric, it took them 8 years to double it. Meanwhile, in your GPU comparison, it took AMD/ATI only 2 years to quadruple it. At 3 GHz, that means they went from 12 GFLOPS to 96 GFLOPS in 8 years, for an 8x increase. I don't even want to bother comparing the R200 to the RV770. If you look at the bandwidth available to CPUs vs GPUs in 2000 vs 2008, it's a similar story.
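
As a sanity check on those figures, here is the back-of-the-envelope arithmetic; the per-core FLOPs-per-cycle values are my assumptions for peak single-precision SSE throughput, not numbers taken from anyone's post:

# Rough peak single-precision GFLOPS: clock * cores * FLOPs per cycle per core.
# Assumed per-core rates: P4 ~4 SP FLOPs/cycle, Nehalem ~8 SP FLOPs/cycle (add + mul).
def peak_gflops(clock_ghz, cores, flops_per_cycle_per_core):
    return clock_ghz * cores * flops_per_cycle_per_core

p4      = peak_gflops(3.0, 1, 4)   # ~12 GFLOPS
nehalem = peak_gflops(3.0, 4, 8)   # ~96 GFLOPS
print(p4, nehalem, nehalem / p4)   # 12.0 96.0 8.0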

The fact that CPUs must operate with commodity DIMMs is not really relevant; it's purely a manufacturing decision by the PC industry that hamstrings performance for the sake of flexibility. The vast majority of consumers don't really care how the RAM is wired up. Lots of people buy iPhones and Macs with non-removable batteries or hard-to-change RAM; people buy consumer electronics goods and consoles built with hardwired manufacturing techniques. Laptops seem to be extremely popular these days, and people often just buy a whole new machine rather than try to upgrade it. There is an assumption that what people love about PCs is how interchangeable the parts are, but I think that assumption really only applies to a niche market, and that the broad market for computers could be very vertically integrated (like Apple has done) without most people giving a shit.

The reality is, an ALU is cheaper than a CPU core, and if your problem is embarrassingly parallel, then packing in the ALUs is better than throwing in more general purpose cores. For most of the workloads modern PCs do (outside games and multimedia), there is very little gain from additional CPU-core level parallelism.

Moreover, with the move to cloud-based computing and the web browser dominating the CPU time of most PCs out there, the purely single-threaded nature of Javascript and the browser core means a lot of power is simply going to waste. You could put 64 Nehalem cores on a chip, and it wouldn't speed up the subjective latency of most applications that people use on a daily basis, but scaling GPGPUs certainly does lead to a very measurable difference in games.

Thus, I would say, scaling CPU cores these days matters more for server environments, like Google's data centers; on the client side, at this point, no one will notice much of a difference except in the few applications that tax a system, e.g. games, multimedia, content-creation apps.
 
Are you being sarcastic?
Yes. I was debating about adding a smiley.

Although the 65nm Cedar Mill was a 2006 chip, cutting the timeframe down.
Penryn would have a better ratio, as Nehalem's uncore reduced the amount of die devoted to CPU cores.
If I started with Penryn, however, CPUs would have regressed.


The fact that CPUs must operate with commodity DIMMs is not really relevant; it's purely a manufacturing decision by the PC industry that hamstrings performance for the sake of flexibility.
I think servers might share more of the blame. They gobble up very large amounts of RAM, and they are the market that pays the fat margins for CPUs. I don't think laptops have eclipsed this segment yet.
 
Yes. I was debating about adding a smiley.

I'm glad; usually your analysis is super rock-solid, so when I saw this I was like "oh, wait, he's obviously joking" given the adjectives used, but I doubted myself and thought maybe there was actually a valid argument I wasn't seeing and that the CPUs really were making the more impressive gains.
 
Intel recently integrated the memory controller on die, which helps with IPC, unless your definition of IPC counts only the logic dedicated to processing data internally in the "logic" units of the CPU.

Integrating memory controllers and PCI-E controllers, adding more cache, QPI/HT 3.0: all of these help with IPC in given situations.

Those are all effectively NNC (net no change) things. The logic was going to be somewhere; by bringing it on-chip you reduce the number of external chips you need and lower overall total power requirements while increasing performance in some cases.
 
GPUs do have a trick up their sleeve though (not sure if they'll go down that path, for various reasons): increasing clock speed. RV770 runs at 750 MHz, GT200 runs at ~1.5 GHz. They can still scale along that axis.

They can, but with their design styles and experience it is likely to be a net negative, as it will increase their power and their design cycles in addition to increasing their design complexity.
 
Going to cross-pollinate this thread with something I found here (purely coincidentally).

Really amazing discussion on the level of IPC theoretically possible in a perfect Universe and impossible in the real Universe. At least for now. :)

Linus Torvalds said:
You do realize that Core 2 (and especially Nehalem) gets
almost twice the IPC of P4 on a lot of real-world loads?

hobold said:
A classical simulation experiment (a superscalar CPU with unlimited resources and perfect branch prediction) showed that this IPC limit is far away as well (theoretical IPC was around 2000), but apparently one can only creep towards it slowly like a logarithm function (simply speaking, doubling the resources of a more realistic CPU increases practical IPC by one).
 
AVX should double throughput numbers if it is implemented at full width in hardware.
FMAD would come in a later revision, which should provide twice the FLOPs at some additional fraction of die space. I think the expansion is something like 50% over a single unit.
Without seeing an actual physical implementation, however, it's too easy to fudge in either direction, depending on how things like the number of operands in the micro-ops and full-width versus split-issue execution are handled.
The recent split over SSE5, where AMD kept the 3-operand non-destructive MAD while AVX ditched it, is an indicator that some compromises are being made.
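
To make the "double the throughput, then double it again with FMA" expectation concrete, here is a minimal sketch; the baseline issue assumptions (a separate full-width add pipe and mul pipe, both becoming FMA pipes later) are mine, purely for illustration:

# Peak single-precision FLOPs per cycle per core under different vector configurations.
def flops_per_cycle(vector_bits, fma=False):
    lanes = vector_bits // 32            # 32-bit single-precision lanes
    if fma:
        return 2 * lanes * 2             # assumed: two FMA pipes, 2 FLOPs per lane each
    return lanes + lanes                 # assumed: separate add + mul pipes, 1 FLOP per lane

print(flops_per_cycle(128))              # 8  : SSE baseline
print(flops_per_cycle(256))              # 16 : full-width AVX, 2x
print(flops_per_cycle(256, fma=True))    # 32 : AVX + FMA, 2x again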


Barring a total lack of growth for GPUs, I wouldn't expect a dearth of peak FLOPs to be the downfall of more specialized FP throughput engines.
The core snapshot of Nehalem is pretty brutal in that regard, when you see how little of it is dedicated to computation, and that is not counting the swath of space dedicated to I/O and the uncore.

Larrabee is interesting in that it looks like it might be 5 times denser FLOP-wise than Nehalem, which puts it midway between the two extremes.
The compromises in that design, and the look of the next generation of mainstream processors, hint to me that Intel and AMD do not want to compromise so fully.
 
Agreed, GPUs are severely reticle-size and power limited, but I don't see why you think CPUs have a lot of headroom for growth. CPUs must serve the ultra-low-cost market too, which on the GPU side is being eaten up by IGPs.
They have headroom in computational density. They can slowly evolve all the way to an architecture much like Larrabee, while keeping single-threaded performance steady. Current CPU architectures still carry a lot of weight from the MHz-race days, but they can gradually get rid of that (keeping the good bits, of course).

This evolution has already started. They're no longer spending transistors on IPC unless it offers better than linear scaling. For instance, Nehalem has fewer transistors than Yorkfield, primarily by reducing cache size per core, and spent a little of it on Hyper-Threading, which yields 30% higher effective IPC. We'll see many more tradeoffs like this in the future.
Yeah, we have octa-cores at 45 nm, but at what price? My claim was qualified with "...at the same price point". I don't think we had quad cores at ~$150 in 2006. The mainstreaming of quad cores has taken longer than the shrink time of 2 years.
According to Pricewatch, a mid-range 3.0 GHz Core 2 Duo still costs $160 today. So how you come to expect mainstream quad-cores to cost $150 by 2006 is a bit beyond me. You can, however, buy a 2.5 GHz Core 2 Quad today for $155, which in 2005 would have only gotten you a single-core.
How do you figure AVX and FMA can be done without a non-trivial increase in transistors? :rolleyes: Can I have some explanation? Unless you are speaking of a 128-bit-wide AVX unit, just like we had a 64-bit-wide SSE unit prior to Conroe, in which case there is absolutely no perf increase at all. And IIRC, AVX won't have FMA.
Looking at a rough diagram of one of Nehalem's cores, and the entire die, I estimate that at most 5% of the area is taken up by the SSE units. So it would only take another 5% max to make them 256-bit wide, and far less than that to add FMA. Worst case, we're looking at a 3.6-fold increase in computational density for MAD operations.
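
One way to land on that ~3.6x worst-case figure, using my own assumed area numbers rather than any die measurement:

# Computational density = peak MAD throughput per unit of die area.
# Worst-case assumption: +5% area for the 256-bit widening and (generously)
# another +5% for FMA, while peak MAD throughput quadruples (2x width, 2x FMA).
base_throughput, base_area = 1.0, 1.0
new_throughput = base_throughput * 2 * 2    # 2x from width, 2x from FMA
new_area       = base_area + 0.05 + 0.05    # worst-case area growth
print(new_throughput / new_area)            # ~3.6x density increase for MAD operations
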
And they will remain lousy simply because they need to care about single-threaded IPC, which is still what matters (maybe not the most). GPUs don't even look at that market and hence can run waaay faster on workloads designed for them.
They no longer have to increase single-threaded performance at the same pace, or at all. For a single thread, Core i7 is practically identical to Core 2, except that it has an 8 MB L3 cache versus a 6 MB L2 cache, which sometimes yields a small benefit, sometimes not. People do accept that overall performance can no longer be improved significantly without some developer effort. It's just an ongoing transition. Soon the applications for which it matters will all be multi-threaded (else the competition will show how it's done).
GPUs do have a trick up their sleeve though (not sure if they'll go down that path, for various reasons): increasing clock speed. RV770 runs at 750 MHz, GT200 runs at ~1.5 GHz. They can still scale along that axis.
Uhm, no. They'll hit that concrete wall again. Also, high clock speed doesn't come for free; it requires additional pipeline stages and the logic to keep them fed. Just compare die sizes and ALU counts for RV770 and GT200. Long story short: GPUs are bound by exactly the same physical laws as CPUs. Clock frequencies still go up, but only at the same pace as process improvements allow.
And last but not least, don't underestimate the burden of backward compatibility. For GPUs, all code generation is dynamic, at runtime, so they can aggressively drop old and useless crap.
True, but you shouldn't overestimate it either. Intel's reckless number of SSE extensions hasn't exactly cost them their position as dominant CPU manufacturer.

It's also almost ironic how some of the older instructions can be implemented using instructions that supersede them. Take for instance pshufb. It can be used to replace any of the other shuffle or unpack instructions. So once you have the logic to support pshufb, implementing the others becomes a simple translation from the instruction format to a vector representing the byte reordering. So the RISC core itself doesn't suffer from the CISC ISA.
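
As a rough illustration of that "translate the old instruction into a byte-reordering vector" idea, here is a toy model of pshufb in Python; decomposing punpcklbw into two permutations plus an OR is my own simplification for the two-operand case, not a claim about the actual hardware:

# pshufb as a general byte permutation: a mask byte with its high bit set yields zero.
def pshufb(src, mask):
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]

# punpcklbw-style interleave of the low 8 bytes of two vectors, expressed as
# two pshufb permutations combined with an OR (illustrative decomposition).
def punpcklbw(a, b):
    mask_a = [0, 0x80, 1, 0x80, 2, 0x80, 3, 0x80, 4, 0x80, 5, 0x80, 6, 0x80, 7, 0x80]
    mask_b = [0x80, 0, 0x80, 1, 0x80, 2, 0x80, 3, 0x80, 4, 0x80, 5, 0x80, 6, 0x80, 7]
    return [x | y for x, y in zip(pshufb(a, mask_a), pshufb(b, mask_b))]

a = list(range(16))        # bytes 0..15
b = list(range(16, 32))    # bytes 16..31
print(punpcklbw(a, b))     # [0, 16, 1, 17, 2, 18, ...] -- the two low halves interleaved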

Also, even though GPUs can get rid of instructions that have become useless, overall the number of instructions is still increasing. And I bet a lot of instructions already have to stay because driver developers won't accept having to rewrite everything every generation, and to offer backward compatibility with things like CUDA and other things that are close to the metal. So I really wouldn't say they can "aggressively" drop things. Lastly, because there's a need to evolve toward more and smaller independent cores, the GPU will also suffer from an increase in control logic.
 
So as we can see, the recent history of CPUs shows them growing FLOPs by a staggering ~2x, while GPUs scaled at a mere 4x on an inconsequential order of magnitude difference in base capability.
Past trends don't say a darn thing about the future. On the contrary, this is exactly why the CPU has so much headroom left, while GPUs can no longer repeat that kind of feat.

Also, I believe it's incorrect to factor out the die size. CPUs have long been held back from using more silicon to achieve higher performance, but since the arrival of multi-core architectures they benefit from it just like a GPU. It may be another one-off, but it's something you have to add into the equation.
 
Nick:
GPUs care about one task, one of the most parallel tasks around. Every new process node allows for large gains in parallelism via the introduction of additional SIMDs. Why you believe they will stop producing huge performance gains at any point in the near future is beyond me.
 
GPUs care about one task, one of the most parallel tasks around. Every new process node allows for large gains in parallelism via the introduction of additional SIMDs. Why you believe they will stop producing huge performance gains at any point in the near future is beyond me.
I never said they'll stop increasing in performance. But it won't be solely from additional cores or wider ALUs.

If you could swap your current GPU for one that has a billion ALUs running at 1 MHz, would you do it? You'd quickly come to the unfortunate conclusion that task dependencies and slow-running threads are killing performance. It might not even finish rendering a frame in 1/60th of a second, and no amount of additional ALUs can help you.
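
To put toy numbers on that (both the chain length and the clock are assumptions for illustration, not measurements of any real workload):

# However many ALUs you have, a chain of dependent operations advances one step per clock.
clock_hz       = 1_000_000     # the hypothetical 1 MHz GPU
dependent_ops  = 100_000       # assumed longest dependency chain in one frame
frame_budget_s = 1.0 / 60.0

critical_path_s = dependent_ops / clock_hz
print(critical_path_s, critical_path_s > frame_budget_s)   # 0.1 True -> frame missed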

So while graphics might be one of the most parallel tasks around, it's not infinitely parallel. According to OTOY, you can run ten instances of Crysis on a GPU before reaching full utilization. Unless some radical architectural changes are made (over many years), that will only get worse and GPUs will only be powerful on paper.

Graphics workloads are getting more diverse every day, and increasingly GPUs are tasked with completely different things than graphics as well. To support such a wide variety of workloads, they can't just keep adding cores or wider ALUs and expect everything to run well. Larrabee might not have the highest throughput of all chips in 2010, but it will be superior at running certain workloads and quite possibly even achieve higher utilization for graphics.
 
That argument's pretty handwavey.
How is it any more handwavey than ShaidarHaran's argument? "This is how things have been before and this is how it will be forever". That's not even an argument. At least I'm summing up the effects that prevent infinite performance scaling from additional cores and wider ALUs. So what are you missing, cold hard numbers, dates?
 
Well, simply citing Amdahl's law tells you nothing of the relative scaling. So you can't scale infinitely, so what? It can still turn out that the multiplier you can scale by is higher than the scalability of the serialized section. For example, if 95% of your problem is parallelizable, you can still get a 20x improvement by scaling ALUs, whereas the 5% that isn't is bounded by CPU improvements in serial speed, which are effectively giving diminishing returns. So it makes sense to plow resources into the ALUs until you reach this point, and it is easy to see that, given the shader workloads that ARE highly parallel, we aren't there yet.
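
For reference, those numbers fall straight out of Amdahl's law; a quick sketch using the 95% figure assumed above (the 20x improvement is the asymptotic ceiling 1/(1-f)):

# Amdahl's law: overall speedup with parallel fraction f and N-fold parallel resources.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

f = 0.95
for n in (4, 20, 100, 10**6):
    print(n, round(amdahl_speedup(f, n), 1))
# 4 -> 3.5, 20 -> 10.3, 100 -> 16.8, 1000000 -> 20.0 (the 1/(1-f) ceiling)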

Furthermore, when GPUs do run up against Amdahl's law, scaling via multiple CPU cores is still worse because you're bound by single-thread blocks at that point.

Your argument boils down to an assertion that workloads will favor CPUs, but that is an unsupportable assertion. It also assumes that the problem areas that would favor CPUs are inherently un-streamable, when we have seen in the past that rather clever algorithmic decomposition techniques have been used to turn what were thought to be serial, CPU-oriented tasks into stream-computing tasks.

You don't want to use induction from past data? Well, fine, but the data is all we have at the moment, and I'd say it's better than plain ole speculation about hypothetical GPU-killing workloads.
 
They have headroom in computational density. They can slowly evolve all the way to an architecture much like Larrabee, while keeping single-threaded performance steady. Current CPU architectures still carry a lot of weight from the MHz-race days, but they can gradually get rid of that (keeping the good bits, of course).
So a 16-core OoO CPU with a 512-bit-wide vector unit, yeah, sorta looks like LRB. But it will be behind specialized compute processors, like a GPU or LRB.

According to Pricewatch, a mid-range 3.0 GHz Core 2 Duo still costs $160 today. So how you come to expect mainstream quad-cores to cost $150 by 2006 is a bit beyond me. You can, however, buy a 2.5 GHz Core 2 Quad today for $155, which in 2005 would have only gotten you a single-core.
Dude, I meant that dual cores were mainstream in 2005-2006, and quad cores are/will be mainstream in 2009/2010. So mainstream core counts are doubling every 4 years, not every 2 years.

Looking at a rough diagram of one of Nehalem's cores, and the entire die, I estimate that at most 5% of the area is taken up by the SSE units. So it would only take another 5% max to make them 256-bit wide, and far less than that to add FMA. Worst case, we're looking at a 3.6-fold increase in computational density for MAD operations.

Ok, now I see your point. Since the SSE units already take up so little area, AVX+FMA won't add much to the overall transistor count. I was thinking purely about the vector units.

True, but you shouldn't overestimate it either. Intel's reckless number of SSE extensions hasn't exactly cost them their position as dominant CPU manufacturer.

True, but that has a lot to do with that market's dynamics too. Architecturally, it doesn't make them a sound design either.

Also, even though GPUs can get rid of instructions that have become useless, overall the number of instructions is still increasing. And I bet a lot of instructions already have to stay because driver developers won't accept having to rewrite everything every generation, and to offer backward compatibility with things like CUDA and other things that are close to the metal. So I really wouldn't say they can "aggressively" drop things. Lastly, because there's a need to evolve toward more and smaller independent cores, the GPU will also suffer from an increase in control logic.

Ok, so they'll drop them incrementally. But they will be dropped.
 
Well, simply citing Amdahl's law tells you nothing of the relative scaling. So you can't scale infinitely, so what? It can still turn out that the multiplier you can scale by is higher than the scalability of the serialized section. For example, if 95% of your problem is parallelizable, you can still get a 20x improvement by scaling ALUs, whereas the 5% that isn't is bounded by CPU improvements in serial speed, which are effectively giving diminishing returns. So it makes sense to plow resources into the ALUs until you reach this point, and it is easy to see that, given the shader workloads that ARE highly parallel, we aren't there yet.
Fair enough, I'll attempt to argue that we're actually not that far from the tipping point. It was getting late for me yesterday...

Indeed there will always be workloads that do scale well by merely increasing core count or ALU width. But the thing is, those are getting more rare. Case in point: look at how raytracing fares on a GPU versus a CPU, or Larrabee. It's a highly parallel workload, but why does every GPU-rendered scene include only cubes or spheres? Because rays can scatter in any direction, and you need coherence for today's GPU architectures to perform well. It's a workload that cries out for independently running threads, so CPUs and Larrabee are better equipped for it. Also, performance doesn't collapse on a CPU or Larrabee under high register pressure. And while I can't really put this into numbers, all workloads are moving away from being fully coherent, including graphics.

Why is it that while GT200 doubled the number of transistors compared to G92, framerates for Crysis only went up by 50% at high resolutions? Texture units, ROPs, bandwidth? No, the performance improvement is less than 50% at low resolutions too. So where does this put us on Amdahl's curves?

Let's for a second go with your numbers: a 95% parallelizable workload. Optimistic as that may appear, increasing the number of theoretical FLOPS by 20 times will happen before the next decade is over. So unless they start to care a lot more about that 5%, scaling performance by simply adding more resources in parallel will be a dead end by then.
 
Fair enough, I'll attempt to argue that we're actually not that far from the tipping point. It was getting late for me yesterday...

Indeed there will always be workloads that do scale well by merely increasing core count or ALU width. But the thing is, those are getting more rare. Case in point: look at how raytracing fares on a GPU versus a CPU, or Larrabee. It's a highly parallel workload, but why does every GPU-rendered scene include only cubes or spheres? Because rays can scatter in any direction, and you need coherence for today's GPU architectures to perform well. It's a workload that cries out for independently running threads, so CPUs and Larrabee are better equipped for it. Also, performance doesn't collapse on a CPU or Larrabee under high register pressure. And while I can't really put this into numbers, all workloads are moving away from being fully coherent, including graphics.

I agree that we are moving away from fully coherent workloads.

Let's for a second go with your numbers: a 95% parallelizable workload. Optimistic as that may appear, increasing the number of theoretical FLOPS by 20 times will happen before the next decade is over. So unless they start to care a lot more about that 5%, scaling performance by simply adding more resources in parallel will be a dead end by then.
Remember the twin of Amdahl's Law, Gustafson's Law, which basically says that the serial/parallel split can be changed by changing the amount of work. In ray tracing, for instance, doubling the number of pixels can increase the fraction of parallel work.
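
A minimal sketch of that contrast, reusing the 5% serial share assumed earlier in the thread; the fixed-time (Gustafson) view grows the problem with the hardware instead of holding it constant:

# Gustafson's law (scaled speedup) versus Amdahl's law (fixed problem size).
def gustafson_speedup(serial_fraction, n):
    return serial_fraction + (1.0 - serial_fraction) * n

def amdahl_speedup(serial_fraction, n):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

s, n = 0.05, 20
print(gustafson_speedup(s, n))   # 19.05 -> near-linear when the workload grows (more pixels, more rays)
print(amdahl_speedup(s, n))      # ~10.3 -> same hardware, fixed-size workload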
 