GPU vs CPU Architecture Evolution

Past trends don't say a darn thing about the future. On the contrary, this is exactly why the CPU has so much headroom left, while GPUs can no longer repeat that kind of feat.
Your claims are that we've already seen an explosion in throughput, not just that we're going to have one.
The growth is comparatively modest, when actual numbers for GPUs and CPUs are used and we move beyond assertions.

Also, I believe it's incorrect to factor out the die size. CPUs have long been held back from using more silicon to achieve higher performance.
But since the arrival of multi-core architectures they benefit from it just like a GPU. It may be another one-off, but it's something you have to add into the equation.
I didn't factor it out.
It's factored into the density numbers.
As far as Nehalem and RV770 are concerned, they are about equal in size, which would have allowed me to factor out die size anyway.
 
Your claims are that we've already seen an explosion in throughput, not just that we're going to have one.
If I ever said such a thing about the past it was about absolute chip throughput, not density. Eight times higher performance in several years' time is nothing to get sarcastic about. And they'll be able to do that again but this time primarily thanks to increasing computational density.

GPUs have been able to increase performance in the past by increasing die size, and recently they still aggressively increased FLOPS density, but they've pretty much played all their aces now. They cannot possibly increase FLOPS density by a factor of three or four without increasing die size or waiting on process improvements. CPUs can.

Note that by no means am I saying GPUs will therefore be outrun by CPUs. Yesterday's CPUs just sucked big time, today's CPUs still suck a little, and tomorrow's CPUs will actually perform quite nicely for throughput-oriented workloads. GPUs already rock at throughput but they suck at running incoherent workloads. They too will become better. It's all about the convergence. They're both restricted by the same laws of physics but they're unrestricted in architecture. Anything the GPU can do, the CPU can eventually do too, and vice versa, and at some point in the future we won't have to name them separately any more.

Right now though, the ball is in the court of the CPUs, when it comes to throughput improvement.
 
If I ever said such a thing about the past it was about absolute chip throughput, not density. Eight times higher performance in several years' time is nothing to get sarcastic about.
Absolute chip throughput, not counting MCMs, between Conroe and Nehalem is a factor of 2 improvement, due to multicore.
Conroe was released in mid-2006.

Counting MCMs that came out in late 2006, the gain in peak is down to the gain in clockspeeds, which is on the order of tens of percent.

To get a significant improvement, we would have to do as I have done and go all the way back to the last Netburst chips, which Intel hadn't significantly refreshed since 2004.
Northwood hit 3.0 GHz in 2003.
Nehalem's 8x gain took six years, close to seven if we content ourselves with some of the slower speed grades.
That was after 3 process node transitions.

And they'll be able to do that again but this time primarily thanks to increasing computational density.
The peak gains we see march pretty well with process node transitions.
8 times the throughput after three transistor budget doublings.
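
For concreteness, the arithmetic I have in mind for that 8x (the clock and issue-width figures are my own rough picks), in Python:

# Peak SP GFLOPS ~= cores * SP flops per clock per core * clock in GHz
def peak_sp_gflops(cores, flops_per_clock, ghz):
    return cores * flops_per_clock * ghz

northwood = peak_sp_gflops(1, 4, 3.0)   # ~12 GFLOPS: SSE mul + add on 64-bit-wide pipes
nehalem   = peak_sp_gflops(4, 8, 3.0)   # ~96 GFLOPS: 4-wide SSE mul + add, four cores
print(nehalem / northwood)              # ~8x, across the 130nm -> 90nm -> 65nm -> 45nm transitions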

GPUs have been able to increase performance in the past by increasing die size, and recently they still aggressively increased FLOPS density, but they've pretty much played all their aces now. They cannot possibly increase FLOPS density by a factor of three or four without increasing die size or waiting on process improvements. CPUs can.
x86 CPUs can scale beyond process improvement when CPU designers allocate more than 1/5 of the core to ALUs, after which maybe only 1/2 of the die is cores.

AVX's doubling of throughput conveniently comes in on a new process node, which doesn't bode well for that fraction changing too much.
Not sure when the FMAC will show up, but it might lag a half-generation.
 
Even if you count MCMs, the Core 2 Quad and the Core i7 have the same throughput (per clock), and that is after 3 years. AVX is just making up for a bit of lost time there.
 
So while graphics might be one of the most parallel tasks around, it's not infinitely parallel. According to OTOY, you can run ten instances of Crysis on a GPU before reaching full utilization. Unless some radical architectural changes are made (over many years), that will only get worse and GPUs will only be powerful on paper.

Graphics workloads are getting more diverse every day. And increasingly they are tasked with completely different things than graphics as well. To support such a wide variety of workloads they can't just keep adding cores or wider ALUs and expect it to run well. Larrabee might not have the highest throughput of all chips in 2010, but it will be superior at running certain workloads and quite possibly even achieve higher utilization for graphics.

That's a great argument for GPU mutualization. A single GPU in the house, serving real-time rendering (both through network and with attached screens) and other computations. I believe it's the way of the future.
 
Even if you count MCMs, the Core 2 Quad and the Core i7 have the same throughput (per clock), and that is after 3 years. AVX is just making up for a bit of lost time there.

What are you smoking? The IPC for Nehalem and Harpertown are not even remotely close for most workloads.

DK
 
What are you smoking? The IPC for Nehalem and Harpertown are not even remotely close for most workloads.
I think he's referring to theoretical throughput. Apart from widening the SSE execution units and adding more cores, indeed not much has happened in the last few years. Effective IPC went up quite a bit since the Pentium 4 though, but that's not what was being discussed so far.

My point is that theoretical throughput of a CPU still has significant headroom, which can be achieved by increasing computational density. That's far greater potential than any IPC improvement. AVX and FMA are good for almost a fourfold increase in ALU density, and eventually they'll also be able to cram more cores into the same area by making simpler cores. In a couple of decades, a core of a desktop CPU could look a lot like an Atom core of that time period (possibly heterogeneous like Cell, but with the advantage of ISA compatibility). At the same time, GPU architectures have to invest more transistors in control logic and things like caches for a stack, so I don't expect their ALU density to go up by any significant amount, or at all. We've seen similar transitions in the past:

G71: 211 GFLOPS / 278 Mtrans
G80: 346 GFLOPS / 686 Mtrans
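
Restated as raw density (just dividing the two figures above), in Python:

# GFLOPS per million transistors, from the two figures above
g71 = 211 / 278.0   # ~0.76
g80 = 346 / 686.0   # ~0.50
print(g71, g80)     # per-transistor ALU density actually dropped across that transition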

On the other hand the effective utilization improves. But CPUs can also achieve much higher efficiency with scatter/gather support. It's these things that are the hardest to quantify...
 
To get a significant improvement, we would have to do as I have done and go all the way back to the last Netburst chips, which Intel hadn't significantly refreshed since 2004. Northwood hit 3.0 GHz in 2003.
It's a (logarithmic) hockeystick curve. In the Pentium 4 days, performance improvements had become almost linear. Which isn't impressive at all given Moore's exponential law. This is changing dramatically with the introduction of multi-core and a focus on performance per Watt (~computational density). It's still early days in the transition though, and it's a tough one because it involves a lot of 'activation energy' from the software side to get on the steep part of the curve. So there are still few data points to prove that we're in this revolutionary change right now.

What you're doing is going back to 2003 on the flat part of the curve, and connecting that with today. Indeed that's fairly unimpressive compared to how GPUs have been doing (though better than the flat part itself), but it doesn't mean that this is the (logarithmic) slope at which things will continue to evolve.

Actually it's not a (logarithmic) hockeystick either. The slope of the steep part is greater than Moore's law. CPUs will be able to catch up with the unused potential in the flat part of the curve. Eventually they'll run out of that though, and be limited entirely by process improvements. So the curve gets flatter again in a couple of decades. GPUs have already increased ALU density to a maximum though, and are already at the point of being limited by process improvements. So their spectacularly steep curve is about to flatten. What you're getting is something like a hysteresis curve (except that it's not a loop). CPUs are on the bottom curve while GPUs are on the top curve. That doesn't have to mean that at the convergence point CPUs and GPUs will be one and the same. They can still spend their transistors on accelerating operations that benefit a specific range of workloads. There will be overlap though...
 
I think he's referring to theoretical throughput.

Yeah, that's what I meant.

My point is that theoretical throughput of a CPU still has significant headroom, which can be achieved by increasing computational density. That's far greater potential than any IPC improvement. AVX and FMA are good for almost a fourfold increase in ALU density, and eventually they'll also be able to cram more cores into the same area by making simpler cores. In a couple of decades, a core of a desktop CPU could look a lot like an Atom core of that time period (possibly heterogeneous like Cell, but with the advantage of ISA compatibility). At the same time, GPU architectures have to invest more transistors in control logic and things like caches for a stack, so I don't expect their ALU density to go up by any significant amount, or at all. We've seen similar transitions in the past:

I agree that CPUs give far less area to ALUs now, but that is unlikely to change. Here's why.

x86 CPUs can scale beyond process improvement when CPU designers allocate more than 1/5 of the core to ALUs, after which maybe only 1/2 of the die is cores.

Memory controllers, now PCIe, and in future IGPs too will compete for the same die area. As for AVX, it comes in 2010 with Sandy Bridge and FMA after that, so don't hold your breath. And since you can already dual-issue an add and a mul in the same clock cycle with SSE, I am not sure implementing FMA will bring any throughput benefits (unless they are planning to implement dual issue of FMA, which I severely doubt).
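
To put rough numbers on that last point, here's how the peaks compare under different issue assumptions (the widths and issue rules below are my own figures, not anything from Intel's roadmaps), in Python:

# Peak SP flops per clock per core under different issue assumptions
def mul_plus_add(width): return 2 * width   # dual-issued mul and add pipes
def one_fma(width):      return 2 * width   # a single FMA pipe, 2 flops per lane
def dual_fma(width):     return 4 * width   # FMA on both pipes

print(mul_plus_add(4), one_fma(4))    # 8 vs 8:  a single SSE-width FMA adds nothing to peak
print(mul_plus_add(8), dual_fma(8))   # 16 vs 32: only dual-issued FMA would double the AVX peak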

So all in all, CPUs have stood still for 4 years while GPUs haven't stood still at all. While it is true we'll likely see convergent evolution (even hybrids à la Fusion), I doubt we'll be unable to tell them apart.

CPUs and GPUs are undoubtedly constrained by the same laws of physics, but that doesn't mean they are in identical situations. CPUs are regularly used in 4P and even 8P mobos, so per-socket thermals/power are lower for CPUs than for GPUs (which will soon break the 300W barrier).
 
Really, mine struggles with one

Same here. Can anyone explain to me what is going on here? How the hell can the OTOY people run 10 Crysis instances on one GPU without having to bother about framerate?



Actually, the i7 performs about 10% faster per clock when you switch off Hyper-Threading (generally).

Even then, it cannot issue more than 8 SP FP ops per clock cycle. What you are referring to is what it actually issues, which is different.
 
It's a (logarithmic) hockeystick curve. In the Pentium 4 days, performance improvements had become almost linear.
That was the case for clock speeds, after Northwood.
There was no hockey stick when it came to FP throughput.
The P3 to P4 transition was the same kind of vector bump as what happened between P4 and Core2. Given the significant clock bump Netburst had over the last P3s, the gain for vector throughput was more than double.

The general lack of peak scaling with Prescott (that is, until they went dual-core) is very much the same kind of lack of scaling between Penryn and Nehalem.
It's actually worse with Nehalem, since there are quad-core MCM Penryns, whereas the first Pentium D was a Prescott derivative.

The curve in FP throughput so far tracks with Moore's law, with stops and starts over 3-4 years.

Intel is basically due for another doubling of vector width, and that is what AVX offers.
In keeping with the trend the P4 established, we can expect core counts to scale later, although the lengths they will go with a symmetric solution beyond 4-6 cores for desktop look somewhat limited, going by the roadmaps.
FMAC comes in later, which will add a bump depending on how it is implemented.
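
As a rough per-clock tally of the whole-chip peaks I have in mind (the core counts and the last entry are my own assumptions, and clock gains are not included), in Python:

# Peak SP flops per clock for the whole chip: cores * flops per clock per core
timeline = {
    "Northwood, 2003": 1 * 4,    # one core, SSE mul + add on 64-bit-wide pipes
    "Conroe, 2006":    2 * 8,    # two cores, full-width 128-bit SSE mul + add
    "Nehalem, 2009":   4 * 8,    # four cores, same per-core width
    "AVX part, ~2010": 4 * 16,   # assumed: four cores, 256-bit mul + add
}
for chip, flops in timeline.items():
    print(chip, flops)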

What you're doing is going back to 2003 on the flat part of the curve, and connecting that with today. Indeed that's fairly unimpressive compared to how GPUs have been doing (though better than the flat part itself), but it doesn't mean that this is the (logarithmic) slope at which things will continue to evolve.

There's a good chance we can see this happen with AMD in 2011 with Llano.
Going by the speculative diagrams, we can expect an FP unit per core with an output of 8 64-bit results, which I'm interpreting to also allow 16 32-bit results per clock.
Half of those come from an FMAC unit.
If, and this is not quite certain from the diagram, the FADD and FMAC pipes can issue at the same time, we could see the quad-core Llano yield 96 SP ops a cycle.
Otherwise, it's still a nice 64 a cycle.

The latter is in keeping with Moore's law, the former is somewhat better, which is good growth for CPU cores.

Going by the roadmap and assuming good AMD execution, I think the Llano chip may potentially double or triple that FLOP count.
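
Spelling that arithmetic out (every figure here is read off the speculative diagram, so treat it as an assumption), in Python:

# SP results per core per clock, counting each FMA as two ops
fmac_sp = 8   # SP results from the FMAC pipe
fadd_sp = 8   # SP results from the FADD pipe
cores   = 4

co_issue  = cores * (fmac_sp * 2 + fadd_sp)   # 4 * 24 = 96 SP ops/cycle if both pipes issue together
fmac_only = cores * (fmac_sp * 2)             # 4 * 16 = 64 SP ops/cycle otherwise
print(co_issue, fmac_only)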
 
There's a good chance we can see this happen with AMD in 2011 with Llano.
Going by the speculative diagrams, we can expect an FP unit per core with an output of 8 64-bit results, which I'm interpreting to also allow 16 32-bit results per clock.
Half of those come from an FMAC unit.
If, and this is not quite certain from the diagram, the FADD and FMAC pipes can issue at the same time, we could see the quad-core Llano yield 96 SP ops a cycle.
Otherwise, it's still a nice 64 a cycle.

The latter is in keeping with Moore's law, the former is somewhat better, which is good growth for CPU cores.

Going by the roadmap and assuming good AMD execution, I think the Llano chip may potentially double or triple that FLOP count.

Llano is based on the K10.5 core, not Bulldozer, as they've mentioned on some AMD blog. I doubt the FLOP count will improve by much. There are huge potential improvements in bandwidth though, with better interconnects between CPU and GPU, but all in all it sounds like a consumer part rather than a server one.
 
Whoops, I saw the same color box in the roadmap and thought it was the same design generation.
Rather it just indicated process node.

In that case, skip the FPU stuff.
The FLOP count should still have a big boost, at least for single-precision.
 
That was the case for clock speeds, after Northwood.
There was no hockey stick when it came to FP throughput.
The P3 to P4 transition was the same kind of vector bump as what happened between P4 and Core2. Given the significant clock bump Netburst had over the last P3s, the gain for vector throughput was more than double.

I'm not sure which specific parts you're referring to here. When you say P4 do you mean the parts at the end of the life cycle (Prescott @ 3.73/3.8GHz) or do you mean the initial 1.3-1.5GHz Willamette parts? If the former I would agree with your statement. If the latter I'd have to disagree, as Willamette was generally only on par with a top-bin Coppermine P3.
 
I was thinking the later Northwood and then Prescott chips.
However, I had overlooked that the PIII could perform half of an SSE MUL and half of an SSE ADD, so it could do 4 SP ops, just not from the same instruction.

With Northwood's clocks finally hitting 3.0 GHz+, the 2x threshold was crossed, but not due to per-clock throughput.
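
Roughly, with my own picks for the endpoint clocks (a sketch, not exact launch bins), in Python:

# Both chips peak at 4 SP flops per clock; the gain is all clock speed
p3_tualatin  = 4 * 1.4   # ~5.6 GFLOPS, the last P3s around 1.4 GHz
p4_northwood = 4 * 3.0   # ~12 GFLOPS at 3.0 GHz
print(p4_northwood / p3_tualatin)   # ~2.1x per chip, with no per-clock gain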
 
Although I agree with many of the reasons other members have given here (the AIB being a closed system for GPU designs compared with CPU-based platforms, the testing time needed, etc.),

The main reason for me is the nature of the GPU.

GPUs are parallel designs.
NV & ATI can easily scale the core count, take into account the bandwidth requirements (memory speed doubles every 3 years, maybe every 2.5 years for the graphics sector, plus the memory controller bandwidth increases of the past years), and increase performance at a much faster pace than the CPU guys can!

The ratio of fixed-function parts to programmable parts in the design plays a role in the performance scaling capability, but not in the past, not now, and not in the next 4 years of DX11/DX12!

It will play a role maybe with DX13 and above (Intel with Larrabee will push MS for 2-year cycles between DX versions from now on...)

Also, I don't think ATI & NV will hit a power wall in the next 4 years.

After 2014 it is very difficult to predict... (at least for me, I don't have a technology background...)

Also, since the CPU guys have longer design cycles than the GPU guys, a CPU design that is not good can make or break the successful course of a CPU (or at least its ASP), so they plan with care!
 
GPUs are about as power-limited as CPUs are.

The highest-end chips might have a more generous TDP but the thermal budget is not boundless.
There is also the limited amount of power allowed under the PCIe specification.

We've already seen some instances of Nvidia and AMD cards exceeding this maximum.

Lower-end boards have their own set of power limits. The question of whether a card needs an additional power connector, fan noise, or whether a board can be cooled passively, leads to other lower power ceilings for different products.

I am also curious for substantiation on the difference in lengths of CPU and GPU design cycles.
It doesn't seem to be that large, as a lot of products are obviously based on a previous-gen product.
 
GPUs are about as power-limited as CPUs are.

The highest-end chips might have a more generous TDP but the thermal budget is not boundless.

Yes, I agree!
But the thermal budget, I think, is 150W for CPUs and 300W for GPUs!

Usually higher-end CPUs are 125W TDP (with a 150W limit), and higher-end GPUs are usually around 200W, or often much less (with a 300W limit)!
So GPUs have more headroom... (I mean before hitting the 300W wall.)

(I am not going to take extreme high-end products like the GTX 295 to make a valid thermal budget analysis; SLI/Crossfire for me has many performance problems, and we will talk again when we see completely shared memory usage...)




There is also the limited amount of power allowed under the PCIe specification.

If you take the PCIe specification as the budget, GPUs hit the wall years ago!

We've already seen some instances of Nvidia and AMD cards exceeding this maximum.

Nearly half or more of the product line today (9600 GT and up, and 4730 and up, excluding the 4750), all of the $80+ product line, is above 75W!

Lower-end boards have their own set of power limits. The question of whether a card needs an additional power connector, fan noise, or whether a board can be cooled passively, leads to other lower power ceilings for different products.

I agree again!

I am also curious for substantiation on the difference in lengths of CPU and GPU design cycles.
It doesn't seem to be that large, as a lot of products are obviously based on a previous-gen product.

Yes, the differences IMO are not as large as many people think, but that also depends on your definition of a design cycle!
I just meant something like this:

Within a cycle, if for example Intel has a Pentium 4 C, it can next do something like a Pentium 4 D, and then something like a Pentium 4 E (or substitute AMD's Phenom progression for the above), all of these minimal differences in performance coming within 2 years.

ATI, for example, can go from HD 2000 tech, to HD 3000 tech, to HD 4000 tech, and all of these IMO big differences in performance come within 1 year and 1 quarter.

If you take as design cycles big events like on-chip geometry transformation, shader-enabled GPUs, or GPGPU-enabled GPUs, you can stretch the notion of the GPU design cycle, but even then, with this logic (which is correct, nothing wrong there), the CPUs' equivalent cycles will again be longer!

Guys, this is my 3rd post and I am new to the forum; I just want to make conversation. Does the above writing style come across (English is not my native language) as if I want to argue? (Because I don't want that!)

Please advise so I can change it!

Thanks in advance!
 
If you take the PCIe specification as the budget, GPUs hit the wall years ago!
3dilettante said:
We've already seen some instances of Nvidia and AMD cards exceeding this maximum.
Nearly half or more of the product line today (9600 GT and up, and 4730 and up, excluding the 4750), all of the $80+ product line, is above 75W!
75W is the maximum you can draw through the PCIe slot; that's why boards that draw more power have additional power connectors. The power connectors are also part of the PCIe spec, and you are allowed to draw 75W from a 6-pin and 150W from an 8-pin connector.
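
Adding those up gives the board power ceilings in play (the connector combinations below are just the common ones, my own examples), in Python:

# Board power ceilings implied by the PCIe limits above
slot, six_pin, eight_pin = 75, 75, 150   # watts
print(slot)                              # 75W:  slot only (no extra connector)
print(slot + six_pin)                    # 150W: slot + one 6-pin
print(slot + 2 * six_pin)                # 225W: slot + two 6-pin
print(slot + six_pin + eight_pin)        # 300W: slot + 6-pin + 8-pin, the 300W ceiling mentioned earlier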
 