GPU vs CPU Architecture Evolution

Hi there

I was wondering why it takes so long to design a new CPU, yet GPU designers are cranking out new models at a very quick pace (typically 18 months).

Intel have their tick-tock strategy and AMD is not able to compete, but even looking at Intel's roadmap, the latest major revision was the Core 2, and the i7 has been an evolution of that design.

With GPUs we get what seem like major revolutions a lot quicker. I understand that GPU architects are still adding new features as well as performance, whereas CPUs have the simpler task of just adding more performance.

Why does GPU design take less time than CPU design from a purely technical point of view? Is it the fabs, the validation, or what the CPU and GPU are tasked with?

Any comments would be appreciated.
 
I'd say there's a lot less validation that goes into a GPU than a CPU; since GPUs are specialized processors, they don't need to be tested against eleventy billion applications. Also, GPUs can produce errors and still be "good enough", whereas CPUs need to return the correct result for any given calculation 100% of the time.

There's a lot more reasons than that but with all the engineers we have around here I think they'd be better suited to fill in the other details.
 
Don't really know but I'd guess that CPUs are a mature technology while GPUs are still on a steep IP growth curve. As ShaidarHaran noted, CPUs are also significantly constrained by backwards compatibility and that requires them to behave in certain ways. The interface to GPUs, on the other hand, is more abstract and the required behavior is far less constrained. They're basically large array processors. Force them to work on random memory with tons of branching and I bet their performance would suffer significantly.
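
To make that last point concrete, here is a toy CUDA sketch (my own illustration, not from the thread; the kernel names and sizes are invented) contrasting the contiguous, branch-free access pattern GPUs are built for with the random-gather, divergent-branch pattern described above:

Code:
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// The friendly case: contiguous (coalesced) loads, no divergence.
__global__ void streaming(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f + 1.0f;
}

// The case the post warns about: gathers from effectively random addresses
// plus a data-dependent branch, so threads within a warp diverge.
__global__ void scattered(const float* in, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[idx[i]];
        if (idx[i] % 2 == 0)
            out[i] = v * v;
        else
            out[i] = sqrtf(fabsf(v)) + 1.0f;
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_in(n);
    std::vector<int>   h_idx(n);
    for (int i = 0; i < n; ++i) {
        h_in[i]  = static_cast<float>(i);
        h_idx[i] = rand() % n;              // scramble the access pattern
    }

    float *d_in, *d_out; int *d_idx;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemcpy(d_in,  h_in.data(),  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int),   cudaMemcpyHostToDevice);

    dim3 block(256), grid((n + block.x - 1) / block.x);
    streaming<<<grid, block>>>(d_in, d_out, n);
    scattered<<<grid, block>>>(d_in, d_idx, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_idx);
    return 0;
}

Timing the two kernels (e.g. with cudaEvent timers) would be the way to quantify it, but the expectation is that the scattered version loses a large fraction of the streaming version's throughput to uncoalesced memory traffic and warp divergence, which is exactly the "random memory with tons of branching" scenario above.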
 
I'm not so sure the appearance matches the reality as much as people think.

First, I think it is appropriate to question the actual rate at which "major" revisions to GPUs are done. GPUs are currently on a development cycle much closer to CPUs than most people realize. In the old days this wasn't as true, but these days they are primarily constrained by fab advances just like everyone else.

Second, GPU development cycles have slowed down due to the increase in verification required. As they become more and more programmable, the amount of time required for verification will only go up.

Third, in the past GPUs could get away with being fairly buggy and relying on software to cover up the issues, which is no longer possible now that they are increasingly programmable.


Also, a lot of the so-called "major" revisions are actually fairly minor. If you look for the last introduction of genuinely new GPU designs from both Nvidia and ATI, you pretty much have to go back to ~G80 and ~RV670 to find any significant difference.

In addition, I would wager that the levels of physical optimization in GPUs and CPUs are totally different, along with various other aspects like DPM (defects per million), etc. There will always be different overhead when you know you will be shipping in the hundreds of millions vs the millions for a given design. There are even significant differences between mainstream CPUs and non-mainstream CPUs. I'd wager that both Intel and AMD have lower DPM targets than something like Power5/6/7, where the marginal parts can be caught during machine burn-in rather than in the fab on the tester.

Let's put it this way: I haven't ever received a dead CPU, but I have received several marginal/dead graphics cards, and have had several graphics cards die while the rest of the system was fine.
 
CPUs are more constrained by power, and sockets can't change every year. They all have to be suitable for OEMs, be it consumer or server parts, and have to be available for a longer time.

A GPU lives on a card in an expansion slot; it only has to conform to PCIe specifications, and beyond that the board can be anything. They get away with multiple additional power connectors, and there's a much bigger power range from low end to high end. They do whatever they want.

There's backwards compatibility on GPUs too, but it's small stuff (VGA, CGA, text mode, VESA) in the 2D engine and firmware; the rest (DirectX 5, 6, 7 etc., OpenGL) is done through software (drivers and their implementations).
 
Maybe that's why Larrabee is taking so long??

I learnt a little while ago that Larrabee is basically a GPU, so that's not the reason. ;) The real reason LRB is taking so long, I think, is that it's not just a hardware innovation but much more a software one. They basically have to write something like Catalyst for it, completely from scratch.

Apart from that, the demands on a GPU are first of all simply higher, so the drive for improvement is larger, but GPUs are also more specialised towards vector-style calculations, which allows for more specialised in-order designs that are much easier to turn into multi-core configurations. We're slowly getting there now in the desktop space, but Windows (and desktop software development in general) has so far not done a great job of providing an environment that can easily benefit from multiple cores. Apple's work on Snow Leopard (Grand Central Dispatch) is interesting in that regard.
 
CPUs are more constrained by power, and sockets can't change every year. They all have to be suitable for OEMs, be it consumer or server parts, and have to be available for a longer time.

I disagree. Intel changes its platform almost every year. They are the only makers of their chipsets now, so Intel sure as hell can change sockets every year. Power is the fundamental limitation in VLSI circuits today, and it applies equally to CPUs and GPUs. GPU server farms/clusters aren't popular today, so GPUs can get away with much larger power budgets. As it is, they are already stretching the PCIe specs.

A GPU lives on a card in an expansion slot; it only has to conform to PCIe specifications, and beyond that the board can be anything. They get away with multiple additional power connectors, and there's a much bigger power range from low end to high end. They do whatever they want.

Mobos, on the other hand, only have to conform to the form factor. Outside of that, they are pretty free to guzzle as much power as they want.
 
I think some issues have been mentioned here, but I'll add a few:

1. Abstraction layers

GPUs have a complete software abstraction layer (DX, OGL, PTX) that includes a compiler. CPUs have the x86 ISA, which is constantly changing, is not well codified, and whose compilers you don't control.

2. Validation

Validation is easier because of 1; it's also easier since correctness isn't quite so important and you can fix many things in your compiler.

3. Legacy code

GPU legacy code is almost all emulated/JIT'd. CPUs don't have that option.

4. Control logic

CPUs have way, way more control logic than GPUs. Control logic is where all the complexity lives; datapaths are pretty easy.

5. Component vs. system

GPU vendors design systems; CPUs are components. CPUs need to provide way more visibility into their operation (e.g. TDP, power reqs), whereas GPUs don't always need to. For instance, NV fully specifies all their high-end cards.

This also feeds back into the platform stability that was mentioned earlier.

6. Full custom vs. semi custom design

GPUs have to be rapidly ported to new processes (half nodes) on a regular basis. They tend to use less custom design than a CPU, which means lower design time. Note that GPUs are descended from ASICs and at one point were mostly synthesized, while CPUs used to have lots of dynamic logic that required huge amounts of manual effort.

7. RAS

CPUs have lots of reliability, availability and serviceability features. GPUs don't.

Anyway, this is just a list of a few items.

DK
 
Thanks for the replies everyone.

Recently Jen-Hsun Huang mentioned that GPU power was set to increase 570x whereas CPU power would increase 3x over the same six-year time frame.

Marketing drivel and exaggeration aside, there is some basis to what Jen-Hsun Huang was saying, insofar as GPU performance is increasing faster than CPU performance over a given time frame and GPUs will become more useful in other applications.

If so, is it still not possible to design a Crusoe-type CPU that relies on software emulation for the rather mundane task of backward compatibility, and then perhaps increase performance?

Also, what is to stop AMD/Intel designing chipsets and memory controllers that take advantage of the latest DRAM technologies? I have heard the next step in DRAM for mainboards is DDR4, but we know GDDR4 was not very successful for GPUs while GDDR5 is. Why are AMD/Intel not able to take advantage of new memory technologies faster? The incubation time from concept to reality seems incredibly long, and thus much more expensive, compared to GPU manufacture. I believe DDR5 availability for mainboards would follow relatively quickly if AMD/Intel built their CPUs around it.
 
CPUs and GPUs have very different design targets as far as memory support is concerned.

CPUs must support expandable DIMM-based memory pools of commodity RAM.
DDR3 is the latest technology for that.

GDDR5 on GPUs is soldered to the PCB, offers no expandability, does not offer ECC, and has very limited capacity.
As it stands, the current form of GDDR5 is wholly inappropriate for CPUs.
 
Recently Jen-Hsun Huang mentioned that GPU power was set to increase 570x whereas CPU power would increase 3x over the same six-year time frame.
He only said that GPU processing power would increase 1.5 times per year over the next six years (i.e. 50% faster each year), or about 11.4 times as powerful after six years.
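
For reference, the arithmetic behind both readings (my own back-of-the-envelope, not from either post):

\[ 1.5^6 \approx 11.4, \qquad 570^{1/6} \approx 2.9 \]

So the 570x figure would require roughly a 2.9x improvement every year for six years, while compounding the 1.5x per year actually claimed gives about 11.4x.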
 
Recently Jen-Hsun Huang mentioned that GPU power was set to increase 570x whereas CPU power would increase 3x over the same six-year time frame.

Marketing drivel and exaggeration aside, there is some basis to what Jen-Hsun Huang was saying, insofar as GPU performance is increasing faster than CPU performance over a given time frame and GPUs will become more useful in other applications.
That's nonsense. GPU performance is hitting a concrete wall: power consumption.

Sure, we've seen some spectacular performance increases from GPUs in the last decade. But let's not forget we went from passively cooled chips to multi-GPU systems that require a 1000-watt power supply. At the same power consumption, things are really not that impressive.

Actually, GPUs are hitting a second wall as well: die size. Even if you manage to keep power consumption in check, you can't keep increasing die size the way they have before as a means of increasing performance. And while multi-die solutions can improve yields, they increase packaging cost, require inter-die communication, and don't help wafer cost. So die size growth slows too.

Also, GPUs have to spend an increasing amount of die space on control logic, registers and caches to become more programmable and flexible. They have to invest transistors in things that CPUs can already take for granted.

Last but not least: contrary to GPUs, the CPU actually has headroom for improvement beyond Moore's Law. Back in the Pentium 4 days, a doubling of the number of transistors did not nearly double the performance. Nowadays the focus isn't on achieving the highest possible clock frequency any more, but on actually optimizing performance per watt. Given the starting point, there has been a spectacular increase in computational density and there's still potential for a lot more. AVX and FMA alone would increase performance fourfold with only a minor increase in transistor count!
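
To spell out the arithmetic behind that fourfold figure (my reading of the per-core numbers, not something stated explicitly in the post): 128-bit SSE provides 4 single-precision lanes doing one operation per clock, 256-bit AVX doubles that to 8 lanes, and FMA counts as two flops per lane, so

\[ \frac{8 \text{ lanes} \times 2 \text{ flops (FMA)}}{4 \text{ lanes} \times 1 \text{ flop}} = 4\times \]

assuming the execution units are actually widened to 256 bits rather than double-pumped.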

What Jen-Hsun probably referred to was double-precision performance. But from a technology standpoint a 570-fold increase doesn't mean anything. They could have had a single double-precision ALU at 1 Hz in their current chip and claim a gazillion-fold increase in their next chip. Meh.

Once you start comparing apples to apples, CPUs are gaining performance faster than GPUs are, and will continue to do so for at least the entire next decade. In the end they're bound by the same physical laws.
If so, is it still not possible to design a Crusoe-type CPU that relies on software emulation for the rather mundane task of backward compatibility, and then perhaps increase performance?
Why? Current x86 architectures are already RISC-like internally; decoding x86 instructions into internal micro-ops is not that costly. The real bottleneck is ILP. Just look at Itanium: even though it's a massive chip, it's getting serious competition from Intel's own multi-core x86 processors.
 
Once you start comparing apples to apples, CPUs are gaining performance faster than GPUs are, and will continue to do so for at least the entire next decade. In the end they're bound by the same physical laws.

While I agree with many of your points, I think CPUs have much less room for perf growth than GPUs. GPUs still have a lot of fixed-function hardware, which will go away and be replaced by ALUs. CPUs are still being designed to increase their single-threaded IPC more than to add cores. Core count is doubling once every 4 years (at a given price point) while transistor counts double every two years.

FMA is a one-off thing. You can scale vector widths, but there is a severe performance cliff there too. LRB's 16-wide float vectors are the result of taking a lot of such factors into consideration. And so while we could have 32-wide float AVX some day, I doubt it will happen.
 
I think CPUs have much less room for perf growth than GPUs. GPUs still have a lot of fixed-function hardware, which will go away and be replaced by ALUs.
There's only so much logic you can replace with ALUs. While I firmly believe that, for instance, texture units will eventually go away, they'll be largely replaced by generic gather units. In raw numbers, performance will go down for the same transistor count; only in practice can performance potentially go up, thanks to unification leading to higher utilization. But this thread is about the architecture, and thus mainly about the raw numbers. And that's where CPUs still have lots of headroom and GPUs do not.
CPUs are still being designed to increase their single-threaded IPC more than to add cores. Core count is doubling once every 4 years (at a given price point) while transistor counts double every two years.
We've had dual-cores at 90 / 65 / 45 nm, and we'll have quad-core at 65 / 45 / 32 nm. Octa-core starts at 45 nm. So they seem pretty much on schedule to me to double the number of cores every time the transistor density doubles.

Either way, sure, they haven't forgotten about single-threaded performance. Nor do I see why they should. Clock frequency and IPC are still important factors, and if they can increase performance more by spending transistors on those things instead of on more cores, then that's the more optimal design. NVIDIA also placed its bets on fewer shader cores that are clocked higher, and RV790 was all about keeping a small chip size while cranking up the frequency.
FMA is a one-off thing.
It's all a one-off thing. In the end it's about increasing computational density. So when we look at AVX+FMA, it represents a phenomenal increase in computational density. Achieving four times higher throughput or more with hardly any extra transistors is not something GPUs can still do.

Yes, that's because CPUs are lousy to begin with. But that's what the headroom argument is about. GPUs have the same kind of headroom for double-precision operations, but that matters much less (except when doing scientific calculations). For single-precision operations GPUs no longer have any "one-off" up their sleeve. In fact, by making each ALU capable of double-precision operations they lower computational density for single-precision...
 
Just so we are absolutely clear here, you consider Larrabee a dead end?
Not at all. You have to look at all the software these devices aim to run in their lifetime. Larrabee tries to be the ultimate GPGPU. It aims to efficiently run any parallel workload you can throw at it, and double as a GPU. So naturally it has lots of cores and limited per-thread performance.

CPUs still have to cater for single-threaded software, which means that simultaneous improvements to effective IPC and core count are the best strategy. But that's just today's situation. In the future, people will care less about performance improvements for ancient single-threaded software, and focus on the performance of the multi-threaded software they run. Besides, it's not like Intel or AMD have invested a lot in IPC lately. Between Core 2 Duo and Core i7, and between K9 and K10, they spent only a couple of percent of the transistors on IPC, while doubling the number of cores took double the entire transistor budget. Seen from that perspective, things have changed in revolutionary ways since the Pentium 4 / K8.

It's not unlikely for Larrabee's successor to follow the same route to some extent. If they see an opportunity to increase effective utilization by a significant amount with few extra transistors, then that's obviously better than spending them on additional badly utilized cores. In fact, I expect that to become more important over time, since even the most embarrassingly parallel workload doesn't scale perfectly. So at some point it's more efficient to focus on making threads run faster than to add more cores to run more threads. We're already seeing this effect with graphics on the GPU. An increase in core count lowers relative utilization, even when everything else scales by the same factor.
 
Besides, it's not like Intel or AMD have invested a lot in IPC lately.

Intel recently integrated the memory controller on die, which helps with IPC, unless your definition of IPC covers only the logic dedicated to processing data internally in the "logic" units of the CPU.

Integrating memory controllers and PCI-E controllers, adding more cache, QPI/HT 3.0: all of these help with IPC in given situations.
 
There's only so much logic you can replace with ALUs. While I firmly believe that, for instance, texture units will eventually go away, they'll be largely replaced by generic gather units. In raw numbers, performance will go down for the same transistor count; only in practice can performance potentially go up, thanks to unification leading to higher utilization. But this thread is about the architecture, and thus mainly about the raw numbers. And that's where CPUs still have lots of headroom and GPUs do not.

Agreed, GPUs are severely reticle-size and power limited, but I do not see why you think CPUs have a lot of headroom for growth. CPUs must serve the ultra-low-cost market too, which on the GPU side is being eaten up by IGPs.

We've had dual-cores at 90 / 65 / 45 nm, and we'll have quad-core at 65 / 45 / 32 nm. Octa-core starts at 45 nm. So they seem pretty much on schedule to me to double the number of cores every time the transistor density doubles.

Yeah, we have octa-cores at 45 nm, but at what price? My claim was qualified with "..at the same price point". I don't think we had quad cores at ~$150 in 2006. The mainstreaming of quad cores has taken longer than the two-year shrink time.

It's all a one-off thing. In the end it's about increasing computational density. So when we look at AVX+FMA, it represents a phenomenal increase in computational density. Achieving four times higher throughput or more with hardly any extra transistors is not something GPUs can still do.

How do you figure AVX and FMA can be done without a non-trivial increase in transistor count? :rolleyes: Can I have some explanation? Unless you are speaking of a 128-bit-wide AVX unit, just like we had a 64-bit-wide SSE unit prior to Conroe. In which case, there is absolutely no perf increase at all. And IIRC, AVX won't have FMA.
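
A worked version of that caveat (my arithmetic, assuming a 256-bit AVX instruction is cracked into two passes through a 128-bit execution unit):

\[ \frac{8 \text{ lanes}}{2 \text{ passes}} = 4 \text{ SP ops per clock} \]

which is the same peak rate as full-speed 128-bit SSE, hence no throughput gain from the wider ISA alone.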

Yes, that's because CPUs are lousy to begin with. But that's what the headroom argument is about. GPUs have the same kind of headroom for double-precision operations, but that matters much less (except when doing scientific calculations). For single-precision operations GPUs no longer have any "one-off" up their sleeve. In fact, by making each ALU capable of double-precision operations they lower computational density for single-precision...

And they will remain lousy simply because they need to care about single-threaded IPC, which is still what matters (maybe not the most). GPUs don't even look at that market and hence can run waaay faster at workloads designed for them.

GPUs do have a trick up their sleeve though (not sure if they'll follow that trail, for various reasons): increasing clock speed. RV770 runs at 750 MHz, while GT200's shaders run at ~1.5 GHz. They can still scale along that axis.

And last but not least, don't underestimate the burden of backward compatibility. For GPUs, all code gen is dynamic, done at runtime, so they can aggressively drop old and useless crap.
 
To give an example, the 65nm single-core Pentium 4 was capable of 4 SP ops per clock in an area of 80 mm2.
The 45nm Nehalem, with 4 cores, is capable of 8 SP ops per core per clock, for a total of 32 per clock in an area of 263 mm2.

The P4 had a paltry op density of 0.05 ops per mm2.
Nehalem has 0.12 ops per mm2, more than double.

For reference, the 80nm R600 sported 320 units on a 430 mm2 die.
The 55nm RV770 increased this count to 800 on a 256 mm2 die.
This gives us 0.74 versus 3.1 units per mm2, respectively, before counting FMAD as double.
The clocks are somewhat lower for the later GPU, while the CPUs could be found in the same speed grades.

So, as we can see, the recent history of CPUs shows them growing FLOPs per mm2 by a staggering ~2x, while GPUs scaled by a mere 4x on top of an inconsequential order-of-magnitude difference in base capability.
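
Putting the post's own numbers side by side (same figures as above, just with the ratios worked out):

\[ \frac{0.12}{0.05} \approx 2.4\times \text{ (P4 to Nehalem)}, \qquad \frac{3.1}{0.74} \approx 4.2\times \text{ (R600 to RV770)} \]

with the GPUs starting from a base density that is already more than an order of magnitude higher (0.74 vs. 0.05 units per mm2).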
 