GPU vs CPU Architecture Evolution

3dilettante · Sep 4, 2009

elsence said:
Yes, i agree!
But the thermal budget i think is 150W for CPUs and 300W for GPUs!

That would be the power budget for the card, based on how many power connectors there are.
The GPU itself can't take up that much.
Some margin must be set aside for RAM, on-board chips, and inefficiencies in the power circuitry. Then some extra margin is put aside to keep the device from spiking too high and not being compliant with the PCIe specifications.
A GPU that draws as much as the maximum would not have a board to run it on.

Furmark does show that certain instances can make GPUs spike very high above average consumption.

The majority of graphics boards sold are not going to be the top-end cards with two connectors.
Obviously, the dual-GPU cards that have 300W can't have both GPUs drawing 150W.
This means that most GPU chips occupy at most the high-end of the power band allowed for enthusiast CPUs.

GT200 boards are somewhere around 230 W.
This may allow the GPU itself to pull in a generous amount of wattage in the worst case, but I don't know how much other board components take away from the maximum.
It might only be something like 200 W for the chip itself.
It does not appear the manufacturers are eager to push much higher, and this level is not too much higher than the highest CPU TDPs.

If you take the PCIe specification as the badget, GPUs hit the wall years ago!

The specification would be the slot and then the additional specifications for power connectors.
GPUs have hit all three limits.

Within a cycle, If for example Intel has a Pentium 4 C, it can do next something like Pentium 4 D, and then something like Pentium 4 E, or replace the above, with AMD phenoms progression..., all these minimal differencies in perf. within 2 years)

ATI for example can go from HD2000 tech, to HD3000 tech, to HD4000 tech and all these imo big differencies in perf. within 1 year and 1 quarter)

If you take as design cycles big ivents like Geometry on chip transformation or Shaders enabled GPUs or GPGPU enabled GPUs you can stress the notion of the GPU design cycle, but even then with this logic (which is correct, nothing wrong there) the CPUs equivalent cycles will be again larger!

G80 was introduced in 2006. Its derivatives have been around ever since.
G300 may be a real revamp, but there are not that many details.

R600's legacy is very strong in the succeeding chips.
Evergreen sounds like there are some significant changes, but it sounds like the family resemblance remains.

There have only been a handful of real big transitions for both CPUs and GPUs in the last decade.

The behind-the-scenes articles are also indicating that GPU design times are between 3-5 years (counting delays, especially), which isn't too far from what it takes for a CPU to come to market.

Guys, this is my 3rd post and i am new to the forum staff, i just want to make conversation, is the above writing style percieved (English not native lan.) that i want to argue? (because i don't want that!)

Please advise in order to change it!

Thanks in advance!

It would be easier for me to quote you if your responses weren't inside the quote box.
As far as perception, using a lot of exclamation marks makes it seem like you're very excitable or hyperactive. That punctuation has more force than a period and implies that there is more emotion behind the sentence or that the point is very important. It can seem forceful if used a lot.

elsence · Sep 4, 2009

Yes, I know about the 75W PCI-express, AIB/GPU difference and the power connectors...

The GPU belongs to a closed system (AIB: add in board) that can change/designed/tested much more easily than CPU platform imo.

I wrote 200W, because 205W is for GT285 (according to NV's site) the typical highest No (i perceived that number as a design specification for AIBs manufactures that NV is giving to help them design AIB solutions based in this GPU
and as a good indicator of what highest power consumption this GPU has, i guess there may be some games/aplications that use higher than 205W, but that, are few and not asking continiously more power, nor the system will crash if they ask for 10-20W more... example use 4770 without power connector...)
(in a side note, i know what T in the TDP means and how the whole situation works...)

About the RAM, on-board chips, and inefficiencies in the power circuitry, i thought that the amount is too small in relation with GPU consumption and 300W limit! (or 150W/75W limit if you like...)

But anyway this doesn't matter for the future power scaling that we discuss here (i said for the next 4 years...),
since all those were existed in the graphics board before 5 years ago (6800Ultra)for example, now and will exist in the future!
I don't see the exclusive power need for RAM, on-board chips, and inefficiencies in the power circuitry, to increase much the next 4 years (maybe just a tad...)

About design cycles lets take NV's:

Q4 1999 DX7 GeForce 256 (Geometry transformation on chip)
Q1 2003 DX9 GeForce FX (boooo...) (I skiped DX8, lets say meaningful shaders and per pixel effect...)
Q4 2006 DX10 Geforce 8800GTX (GPGPU...)

3,5 years average!

But with the same logic (deep changes...) we must examine the last decade the CPU sector lets take Intel desktop CPU history:

Q3 2001 Pentium 4
Q3 2006 Core 2

Or AMD desktop CPU history:

Q3 2003 A64
2011 bulldozer (if AMD execute...)
Phenom architecture is similar with A64 imo (at least i can't count it as a vast difference like the GPU differencies or Intel core2 difference...)

So 5 years (i didn't count AMD because they screwed their whole CPU situation, look where they are now with their architecture, when the new nehalems launch in a few days they will be forced to sell the 3,4GHz 965 at around 200$..., i hope Bulldozer to be something good because competition is good for us and for the industry...)

So with this logic we have 1,5 year cycle difference between CPU & GPU...

But anyway, I see your points about power consumption and design cycles but i have a slightly different angle (i don't mean better, just different)

Thanks for your suggestion about writing style.
I hope i didn't make a mistake, i can't edit...

rpg.314 · Sep 5, 2009

You'll be able to edit after you have had 10 posts.

GPU's have shown an explosive growth simply by abandoning all restraints (nv in particular, ati was complicit in this too, but has now come to it's senses.) on die sizes and power budgets.

GT200 officially maxed out what could be done by the foundry (infact, it was over budget to begin with) and GTX295 is v. close to busting the PCIe power spec.

As far as design cycles go, GPU's are still adding new things (+shrinks) every year or so. But CPU's are a bit (not by much though) slower.

And yes, power scaling matters. Cooling gtx295 is harder than cooling an 6800 ultra. In fact, today, for all chips, power and I/O are far bigger constraints than raw compute.

elsence · Sep 5, 2009

rpg.314 said:
You'll be able to edit after you have had 10 posts.

GPU's have shown an explosive growth simply by abandoning all restraints (nv in particular, ati was complicit in this too, but has now come to it's senses.) on die sizes and power budgets.

GT200 officially maxed out what could be done by the foundry (infact, it was over budget to begin with) and GTX295 is v. close to busting the PCIe power spec.

As far as design cycles go, GPU's are still adding new things (+shrinks) every year or so. But CPU's are a bit (not by much though) slower.

And yes, power scaling matters. Cooling gtx295 is harder than cooling an 6800 ultra. In fact, today, for all chips, power and I/O are far bigger constraints than raw compute.

I agree with you that power scaling matters (i would say is a must, i mean Intel learned the hard way with P4 tech that, and in a IDF said that performance increasements / power consumption increasements ratio is extremely important for the future...)

The only thing i said, is what headroom a CPU has (e.g. PII X4 140W->150W) in relation with a high-end single GPU board (e.g. GT285 205W->300W) (i already explained why i didn't count GTX295 in my original post...)

And certainly, i didn't say cooling GTX295 is the same with cooling a 6800,
(of cource it is more difficult, i mean you are comparing a 2 X 576mm2 GPU board with a 288mm2 GPU board, isn't this natural?)

Oh, and thanks for the info about edit!

Nick · Sep 9, 2009

rpg.314 said:
And since you can already dual issue an add and a mul in the same clock cycle for SSE, I am not sure if implementation of FMA will bring any throughput benefits, (unless they are planning to implement dual issue of FMA, which I severely doubt.)

Why doubt it? The whole point of FMA is to increase throughput. They're not adding it for shows.

In theory they might only make one of the units capable of handling FMA instructions, but this seems unlikely as well. If an application uses only FMA instructions it won't benefit from a single unit versus having separate ADD and MUL units.

So it only makes sense to have two AVX units fully capable of FMA.

Nick · Sep 9, 2009

rpg.314 said:
Can anyone explain to me what is going on here? How the hell can OTOY people run 10 Crysis instances on one GPU without having to bother about framerate?

Of course they have to bother about framerates. But you can run ten instances of Crysis at medium settings without each instance having only 1/10'th the framerate.

That's because a single instance is at any point in time bottlenecked by something. Shaders can be bottlenecked by arithmetic units, or texture units, or bandwidth, or ROPs, or primitive setup, etc. Also, sometimes the CPU has to wait for the GPU, or the other way around. You can also be PCI-e bandwidth limited. These bottlenecks can alter in rapid succession.

So even the most balanced game fails to fully utilize the system. And you can't draw the end of the frame before the begin, so there's limited opportunity to even things out. With multiple instances however you keep everything highly utilized and you reach the maximum combined performance.

dkanter · Sep 9, 2009

Nick said:
Why doubt it? The whole point of FMA is to increase throughput. They're not adding it for shows.

In theory they might only make one of the units capable of handling FMA instructions, but this seems unlikely as well. If an application uses only FMA instructions it won't benefit from a single unit versus having separate ADD and MUL units.

So it only makes sense to have two AVX units fully capable of FMA.

Why do you say that? An FMA unit can do intermediate rounding if properly implemented. Everything else is just bookkeeping which the OOO can handle.

David

Nick · Sep 10, 2009

dkanter said:
Why do you say that?

Say what exactly?

An FMA unit can do intermediate rounding if properly implemented.

Sure.

Everything else is just bookkeeping which the OOO can handle.

What do you mean with "everything else"?

dkanter · Sep 10, 2009

Nick said:
Say what exactly?

Sure.

What do you mean with "everything else"?

I'm assuming you mean a 3x256b input and 1x256 output FMA, since it's operating on AVX regs (and the output may overwrite an input register).

You said there's only a benefit with 2 FMA units. If you do macro-op fusion like techniques to merge add and mul instructions (you may want to do it in the rename logic by renaming adds and muls that are dependent, rather than front-end since the add and mul may be separated in the Istream), you can save ROB and RS entries. The cost of a FP mul unit is only a little less than a FMA unit...so you could also save silicon, power and control logic.

Besides, having two 256b FMA units isn't helpful...there's no real way to feed them.

DK

Nick · Sep 10, 2009

dkanter said:
You said there's only a benefit with 2 FMA units. If you do macro-op fusion like techniques to merge add and mul instructions (you may want to do it in the rename logic by renaming adds and muls that are dependent, rather than front-end since the add and mul may be separated in the Istream), you can save ROB and RS entries. The cost of a FP mul unit is only a little less than a FMA unit...so you could also save silicon, power and control logic.

Ok, I see how macro-op fusion could theoretically make a single FMA unit useful, but I seriously doubt that's a worthwhile approach. It adds a lot of complexity to the decoders and saves you little in the execution units, while having low throughput gains. Having two full-fledged FMA units is really the simplest thing to do here.

Besides, having two 256b FMA units isn't helpful...there's no real way to feed them.

Why? Nehalem already supports 256-bit loads for movups. Also, the tiny Larrabee cores support 512-bit FMA operations. So surely they can support 2x256-bit for the tick or tock after Sandy Bridge.

silent_guy · Sep 10, 2009

Nick said:
With multiple instances however you keep everything highly utilized and you reach the maximum combined performance.

You're extrapolating the concept of CPU hyperthreading to a GPU. I think that's a mistake. Hyperthreading is possible because switching between contexts is very cheap (a relatively small amount of state and you're limited to only two contexts) and because the pipeline is very well balanced in that each section of the pipeline has roughly the same thoughput.

You don't have such luxuries on a GPU.

The fact that ROPs or TEX units may be underutilized during a long math based shader doesn't mean that you can assign those resources to some other context. After all, they're slave units to the shader.

Also, if you're going to run 10 instances of the same game concurrently, chances are very high that all of them will be running very similar shaders and thus experience the same choking point.

I think those initial claims were made after smoking something really good.

rpg.314 · Sep 10, 2009

Nick said:
Ok, I see how macro-op fusion could theoretically make a single FMA unit useful, but I seriously doubt that's a worthwhile approach. It adds a lot of complexity to the decoders and saves you little in the execution units, while having low throughput gains. Having two full-fledged FMA units is really the simplest thing to do here.

Why? Nehalem already supports 256-bit loads for movups. Also, the tiny Larrabee cores support 512-bit FMA operations. So surely they can support 2x256-bit for the tick or tock after Sandy Bridge.

Sandy bridge will not have FMA, so I think we'll just have to make do with twice as large vector issue per core in 2010.

dkanter · Sep 10, 2009

Nick said:
Ok, I see how macro-op fusion could theoretically make a single FMA unit useful, but I seriously doubt that's a worthwhile approach. It adds a lot of complexity to the decoders and saves you little in the execution units, while having low throughput gains. Having two full-fledged FMA units is really the simplest thing to do here.

Hint: it doesn't need to be in decode.

Why? Nehalem already supports 256-bit loads for movups. Also, the tiny Larrabee cores support 512-bit FMA operations. So surely they can support 2x256-bit for the tick or tock after Sandy Bridge.

I think you should study 256b loads in Nehalem more carefully. Particularly the latency for movups.

DK

Nick · Sep 10, 2009

silent_guy said:
The fact that ROPs or TEX units may be underutilized during a long math based shader doesn't mean that you can assign those resources to some other context. After all, they're slave units to the shader.

That's absurd. A texture lookup only requires the coordinates and the sampler index. This sampler index can be extended to samplers from multiple shaders. In other words, if a texture unit can sample different textures within the same shader it can also sample textures from different shaders. Right?

Also, if you're going to run 10 instances of the same game concurrently, chances are very high that all of them will be running very similar shaders and thus experience the same choking point.

Games have dozens of shaders with many different characteristics. Some perform a Gaussian blur and are completely TEX limited, while for instance vertex shaders are typically purely arithmetic, and particle shaders are ROP limited. There's a bit of overlap, but like I said you can't render the end of the frame before the begin. When you're running ten instances however, there's always one issuing arithmetic heavy work, one issuing texture lookup heavy work, one issuing ROP heavy work, one doing something CPU heavy, one using up all PCI-e bandwidth, etc. Combined they utilize the hardware way better than one could do on its own.

I think those initial claims were made after smoking something really good.

Cuban?

Nick · Sep 10, 2009

rpg.314 said:
Sandy bridge will not have FMA...

That's what I wrote.

Nick · Sep 10, 2009

dkanter said:
Hint: it doesn't need to be in decode.

Yes, I read your suggestion about doing it during register rename. However, that's far from trivial. It's basically as complex as adding another execution port. So what you're suggesting is pretty much equivalent to having three SIMD units. That's not going to happen any time soon. Supporting FMA on two units is far simpler and offers higher throughput to boot.

I think you should study 256b loads in Nehalem more carefully. Particularly the latency for movups.

http://www.intel.com/Assets/PDF/manual/248966.pdf - Section 2.2.5.1, Table 2-8.

What exactly should I be reading more carefully?

pcchen · Sep 10, 2009

Nick said:
That's absurd. A texture lookup only requires the coordinates and the sampler index. This sampler index can be extended to samplers from multiple shaders. In other words, if a texture unit can sample different textures within the same shader it can also sample textures from different shaders. Right?

They are probably not physically connected. In many GPU, a TMU is connected to a certain number of shader units, and it can only accept data from those units, not others. It's very expensive to make a full FIFO-queue-like system. Generally only the memory subsystem works this way.

Games have dozens of shaders with many different characteristics. Some perform a Gaussian blur and are completely TEX limited, while for instance vertex shaders are typically purely arithmetic, and particle shaders are ROP limited. There's a bit of overlap, but like I said you can't render the end of the frame before the begin. When you're running ten instances however, there's always one issuing arithmetic heavy work, one issuing texture lookup heavy work, one issuing ROP heavy work, one doing something CPU heavy, one using up all PCI-e bandwidth, etc. Combined they utilize the hardware way better than one could do on its own.

This is only possible if you can do cheap context switch. That is, if your context is small enough. On CPU it's easier because the states of a thread is relatively small (basically just the contents of these registers). On a GPU, a context can be pretty huge and it can be difficult to do a efficient context switch for this to be beneficial.

A possible way is to let each units to perform different works (e.g. if you have 10 TMU, assign each one for each context), but this will be meaningless because there's no way for them to balance between workloads. i.e. if some units stay idle, they can't accept works from another context because they are not physically connected.

Nick · Sep 10, 2009

pcchen said:
They are probably not physically connected. In many GPU, a TMU is connected to a certain number of shader units, and it can only accept data from those units, not others.

Sure, but every cluster of shader units runs multiple threads that each can have a different shader. So shaders that are texture lookup heavy can be interleaved with shaders that are arithmetic heavy or ROP heavy. When running multiple instances it's much more likely to get a good blend of various shader characteristics.

This is only possible if you can do cheap context switch. That is, if your context is small enough.

Thread contexts can be switched cheaply.

rpg.314 · Sep 10, 2009

The otoy people don't run into gpu ram (only 512 mb on 4850) till 8 instances of crysis, do they? I mean are the drivers smart enough to share that too?

pcchen · Sep 10, 2009

Nick said:
Sure, but every cluster of shader units runs multiple threads that each can have a different shader. So shaders that are texture lookup heavy can be interleaved with shaders that are arithmetic heavy or ROP heavy. When running multiple instances it's much more likely to get a good blend of various shader characteristics.

Thread contexts can be switched cheaply.

Only if the context is small... consider this, the MP in G8X/G9X has 16KB worth of registers and 16KB shared memory. How do you switch that context cheaply?

GPU vs CPU Architecture Evolution

3dilettante

elsence

rpg.314

elsence

Nick

Nick

dkanter

Nick

dkanter

Nick

silent_guy

rpg.314

dkanter

Nick

Nick

Nick

pcchen

Moderator

Nick

rpg.314

pcchen

Moderator

Similar threads