AMD: R9xx Speculation

How high are Larrabee's core clocks supposed to be? The Radeon 5870 just broke the record for a stable core clock overclock at 1525 MHz... why can't GPUs handle core clocks as high as CPUs? Intel CPUs are in the 1.5-billion-transistor range; GPUs are not THAT far away and still can't compete in core clock numbers...
 
How high are Larrabee's core clocks supposed to be? The Radeon 5870 just broke the record for a stable core clock overclock at 1525 MHz... why can't GPUs handle core clocks as high as CPUs? Intel CPUs are in the 1.5-billion-transistor range; GPUs are not THAT far away and still can't compete in core clock numbers...

Because GPUs don't need high clock speeds to get their jobs done.

Probably only five Cypress dies in the world can hit 1.5 GHz. It doesn't mean anything.
 
Because GPUs don't need high clock speeds to get their jobs done.

Probably only five Cypress dies in the world can hit 1.5 GHz. It doesn't mean anything.

It's not supposed to mean we are getting there or anything like that; it's supposed to mean that high clocks help:
http://www.fudzilla.com/content/view/18462/1/
Look at the GPU score; my old system with 2x GTX 280 SC couldn't get that.

Also, Anand did a good review of core clock/shader clock scaling:
http://www.anandtech.com/show/2175/4

Raising the ALU clocks does help get the job done; you can see that NVIDIA runs its shader domain at higher clocks, and it helps.

What I was interested to know is what stops a GPU from reaching frequencies closer to those of CPUs, since some CPUs can run above 4.0 GHz on air nowadays...

High-end CPUs are now at 130 W with 1.5B transistors, while GT200 with its 1.4B transistors can't even compete in core/shader clock speed... maybe it's something about the transistor gates, bulk vs. SOI wafers, I don't know.
 
What I was interested to know is what stops a GPU from reaching frequencies closer to those of CPUs, since some CPUs can run above 4.0 GHz on air nowadays...
Simple answer: (circuit) design and power consumption.
In contrast to a CPU, it just makes more sense to pack many more, but slower, units onto a GPU to do the job it is meant for.
 
Simple answer: (circuit) design and power consumption.
In contrast to a CPU, it just makes more sense to pack many more, but slower, units onto a GPU to do the job it is meant for.

Hmm, thanks... so the microarchitecture matters a lot more for power consumption than the number of transistors does...
 
Hmm, thanks... so the microarchitecture matters a lot more for power consumption than the number of transistors does...

You have to consider that in a chip of any kind, some components draw more power than others.

I'm guessing that by "high-end CPUs" you're referring to Nehalem EX, and you have to consider that about half of it is cache, which draws little power. GT200 is mostly made up of power-hungry components such as the stream processors, TMUs, etc.

Then of course, there's the manufacturing process. Intel's 45nm is miles ahead of TSMC's 55nm, let alone their 65nm.
 
It's not supposed to mean we are getting there or anything like that; it's supposed to mean that high clocks help:
http://www.fudzilla.com/content/view/18462/1/
Look at the GPU score; my old system with 2x GTX 280 SC couldn't get that.

Also, Anand did a good review of core clock/shader clock scaling:
http://www.anandtech.com/show/2175/4

Raising the ALU clocks does help get the job done; you can see that NVIDIA runs its shader domain at higher clocks, and it helps.

What I was interested to know is what stops a GPU from reaching frequencies closer to those of CPUs, since some CPUs can run above 4.0 GHz on air nowadays...

High-end CPUs are now at 130 W with 1.5B transistors, while GT200 with its 1.4B transistors can't even compete in core/shader clock speed... maybe it's something about the transistor gates, bulk vs. SOI wafers, I don't know.

Logic blocks designed for high clock speeds take more area than slower ones, even if they are functionally identical. The need for high clocks in CPUs arises from the nature of the code they run.
 
Maybe SI is NI@40nm and the "true" NI is 28nm? ... and there is no "hybrid" of NI and Cypress called "SI"? Maybe SI is just a big "real" NI?

There is no "Southern Islands", there's only "Manhattan"... :D
"EG BROADWAY" = ati2mtag_Manhattan_PXAI, PCI\VEN_1002&DEV_68B0&SUBSYS_68B01002
"EG LEXINGTON" = ati2mtag_Manhattan_PXAA, PCI\VEN_1002&DEV_6880&SUBSYS_01341002
"EG LEXINGTON " = ati2mtag_Manhattan_PXAI, PCI\VEN_1002&DEV_6880&SUBSYS_68801002
"EG LEXINGTON " = ati2mtag_Manhattan_PXAA, PCI\VEN_1002&DEV_6880&SUBSYS_01241002
"EG MADISON" = ati2mtag_Manhattan_PXAA, PCI\VEN_1002&DEV_68C0&SUBSYS_01341002
"EG MADISON " = ati2mtag_Manhattan_PXAA, PCI\VEN_1002&DEV_68C1&SUBSYS_01341002
"EG PARK" = ati2mtag_Manhattan_PXAA, PCI\VEN_1002&DEV_68E1&SUBSYS_01341002
"EG PARK " = ati2mtag_Manhattan_PXAA, PCI\VEN_1002&DEV_68E1&SUBSYS_01241002

"ROBSON CE" = ati2mtag_Manhattan, PCI\VEN_1002&DEV_68E4
"ROBSON LE" = ati2mtag_Manhattan, PCI\VEN_1002&DEV_68E5


http://forums.guru3d.com/showthread.php?t=319691
 
What I was interested to know is what stops a GPU from reaching frequencies closer to those of CPUs, since some CPUs can run above 4.0 GHz on air nowadays...

Two words: pipeline depth. CPUs have longer pipelines, which allow higher clock speeds. It's the main reason CPU speeds are what they are today, and the same reason there hasn't been much improvement in the last couple of years. There's a practical limit to how many stages there can be, which the CPU makers have already reached, and adding more doesn't come for free in terms of die area.
 
You have to consider that in a chip of any kind, some components draw more power than others.

I'm guessing that by "high-end CPUs" you're referring to Nehalem EX, and you have to consider that about half of it is cache, which draws little power. GT200 is mostly made up of power-hungry components such as the stream processors, TMUs, etc.

Then of course, there's the manufacturing process. Intel's 45nm is miles ahead of TSMC's 55nm, let alone their 65nm.

CPUs need high clock speeds because they run sequential code where there are very often dependencies between successive instructions.

In order to make this kind of code fast, you have to make each operation run as quickly as possible. In practice this means low latency in cycles (1 cycle for adds and other basic ops) and making those cycles as fast as possible (high clock speed). To achieve this, more complex, fast arithmetic units are used, and the chip is made highly pipelined. We may have to spend maybe twice the chip area for a 20% performance increase when developing a new, even more aggressive CPU.
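
To make the dependency point concrete, here is a minimal Python sketch (the latencies, unit counts, and the "CPU"/"GPU" labels are made-up illustrative numbers, not figures for any real chip):

# Hypothetical sketch: why per-operation latency dominates sequential code.
# In a serial dependency chain each op needs the previous result, so ops
# cannot overlap no matter how many execution units exist.
import math

def dependent_chain_ns(num_ops, op_latency_ns):
    # Strictly serial: total time = chain length x per-op latency.
    return num_ops * op_latency_ns

def independent_ops_ns(num_ops, op_latency_ns, units):
    # Fully independent: ops spread across all units and overlap freely.
    return math.ceil(num_ops / units) * op_latency_ns

# 1000 dependent adds, 1 cycle each: only a faster cycle helps.
print(dependent_chain_ns(1000, 0.33))       # ~3 GHz "CPU": ~330 ns
print(dependent_chain_ns(1000, 1.35))       # ~740 MHz "GPU": ~1350 ns

# 1000 independent adds: wide and slow wins easily.
print(independent_ops_ns(1000, 0.33, 4))    # 4 fast units: ~82.5 ns
print(independent_ops_ns(1000, 1.35, 320))  # 320 slow units: ~5.4 ns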

GPUs, on the other hand, handle highly parallel data. The latency of single operations is not important.
So it's much more practical to use a huge number of simpler, higher-latency processors, which have slower, smaller arithmetic units and shorter pipelines, leading to lower clock speeds. This kind of simple, slow processor clocking at 40% of a CPU's clock speed may only consume something like 20% of the chip area and 10% of the power that the high-speed CPU core would consume.
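
Plugging those rough figures into some back-of-the-envelope Python (the 40%/20%/10% numbers are just the illustration above, not measurements):

# Back-of-the-envelope arithmetic with the rough figures above: a small
# core at 40% of the big core's throughput, in 20% of its area and 10%
# of its power (throughput taken as proportional to clock for parallel work).
big_perf, big_area, big_power = 1.0, 1.0, 1.0
small_perf, small_area, small_power = 0.4, 0.2, 0.1

print(small_perf / small_area)   # 2.0x the perf/area of the big core
print(small_perf / small_power)  # 4.0x the perf/watt of the big core

# In the big core's area budget you can fit 5 small cores:
cores = big_area / small_area                   # 5 cores
print(cores * small_perf, cores * small_power)  # 2x throughput at 0.5x power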


This was somewhat of a simplification; the SIMT way of executing threads also complicates the issue.
 
Mass-producing such a large chip on a completely new process would be very risky, even for Intel. Besides, I think that according to Larrabee's original schedule, it was supposed to be released before the 32nm process would be ready.

Well, in a GPU-related thread I still don't see where Intel's manufacturing process advantage could mean anything for hypothetical GPUs. Whenever Intel makes a comeback with their next iteration of LRB, I have severe doubts it'll end up on the same process as their CPUs of the time. In that regard, comparisons with TSMC's future process roadmap are somewhat off base.

That raises another question: whether future TSMC processes will have problems similar to 40G's, and (since this is an AMD speculation thread) if and where GlobalFoundries will show any advantages, mostly for bulk processes (probably in the less foreseeable future).
 
Well, in a GPU-related thread I still don't see where Intel's manufacturing process advantage could mean anything for hypothetical GPUs. Whenever Intel makes a comeback with their next iteration of LRB, I have severe doubts it'll end up on the same process as their CPUs of the time. In that regard, comparisons with TSMC's future process roadmap are somewhat off base.

That raises another question: whether future TSMC processes will have problems similar to 40G's, and (since this is an AMD speculation thread) if and where GlobalFoundries will show any advantages, mostly for bulk processes (probably in the less foreseeable future).

Um, it should be obvious. Intel's process technology is *MUCH* faster than TSMC's at the same level of power. Their contacted gate pitch is tighter, etc. I wrote an article where I compared some of the numbers:
http://www.realworldtech.com/page.cfm?ArticleID=RWT072109003617&p=11

All things being equal, using Intel's process would result in a faster chip at the same power, or lower power at the same speed.
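
A rough way to see that trade is the classic dynamic-power relation P ≈ C·V²·f; the numbers below are purely illustrative, assuming a faster process lets you hold frequency at a lower voltage:

def dynamic_power(c, v, f):
    # Classic switching-power approximation: P ~ C * V^2 * f.
    return c * v * v * f

baseline = dynamic_power(1.0, 1.00, 1.0)
print(dynamic_power(1.0, 0.90, 1.0) / baseline)  # same speed, 10% lower V:
                                                 # ~0.81x the power
print(dynamic_power(1.0, 0.95, 1.1) / baseline)  # or ~10% faster at roughly
                                                 # the same power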


With respect to a GF bulk process, if you look at the IBM 32nm bulk process, it's in the same league as TSMC's 28nm...and Intel's 45nm (for transistor performance).

DK
 
Two words: pipeline depth. CPUs have longer pipelines, which allow higher clock speeds.
That fact has always puzzled me: why do deep pipelines necessitate higher clocks?

For example, the P4 had a 30-stage pipeline, and they said it facilitated reaching 3.0+ GHz. However, Core 2 had fewer stages, and it reached the same clock levels.
 
Um, it should be obvious. Intel's process technology is *MUCH* faster than TSMC's at the same level of power. Their contacted gate pitch is tighter, etc. I wrote an article where I compared some of the numbers:
http://www.realworldtech.com/page.cfm?ArticleID=RWT072109003617&p=11

All things being equal, using Intel's process would result in a faster chip at the same power, or lower power at the same speed.

I have severe doubts that either AMD or NVIDIA would want to manufacture any GPUs at Intel. As for Intel themselves, I have severe doubts it would have saved LRB's day in the end in terms of perf/W.

With respect to a GF bulk process, if you look at the IBM 32nm bulk process, it's in the same league as TSMC's 28nm...and Intel's 45nm (for transistor performance).

DK

Which GF bulk process would that be? 28nm?
 
That fact has always puzzled me: why do deep pipelines necessitate higher clocks?
They don't necessitate, they *allow*. Big difference.

But yeah, I would like a detailed explanation myself.

For example, the P4 had a 30-stage pipeline, and they said it facilitated reaching 3.0+ GHz. However, Core 2 had fewer stages, and it reached the same clock levels.

Core 2 was made on a smaller process. *Generally*, smaller processes allow the same design to reach higher clocks.
 
That fact has always puzzled me: why do deep pipelines necessitate higher clocks?

For example, the P4 had a 30-stage pipeline, and they said it facilitated reaching 3.0+ GHz. However, Core 2 had fewer stages, and it reached the same clock levels.

Well, deeper pipelines facilitate higher clocks, all other things being equal. The P4 and Core 2 are on completely different manufacturing processes, so they can't be compared on pipeline depth alone.

Clock speed is determined by the slowest stage in the pipeline. You can reduce that bottleneck by further breaking that stage up into more intermediate stages, each of which does less work and is therefore faster. The result is a higher maximum clock but a deeper pipeline.
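
As a toy illustration of that (hypothetical stage delays, just to show the relation):

# Toy sketch: the clock period is set by the slowest pipeline stage.
# Stage delays are made-up numbers, in nanoseconds.
def max_clock_ghz(stage_delays_ns):
    return 1.0 / max(stage_delays_ns)   # slowest stage sets the cycle time

before = [0.5, 0.5, 1.0, 0.5]           # one slow 1.0 ns stage
after  = [0.5, 0.5, 0.5, 0.5, 0.5]      # slow stage split into two 0.5 ns stages

print(max_clock_ghz(before))  # 1.0 GHz, limited by the 1.0 ns stage
print(max_clock_ghz(after))   # 2.0 GHz: deeper pipeline, higher clock
# Note the total logic delay is ~2.5 ns either way; only the rate at
# which new instructions can enter the pipeline improved.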
 
Well, deeper pipelines facilitate higher clocks, all other things being equal. The P4 and Core 2 are on completely different manufacturing processes, so they can't be compared on pipeline depth alone.

Clock speed is determined by the slowest stage in the pipeline. You can reduce that bottleneck by further breaking that stage up into more intermediate stages, each of which does less work and is therefore faster. The result is a higher maximum clock but a deeper pipeline.

Indeed. And actually, the P4 is still the fastest x86 CPU ever in terms of clock speed. The fastest commercial version was clocked at 3.8 GHz, and if I'm not mistaken it still holds the overclocking record at above 8 GHz. And that was on 65nm...
 
That fact has always puzzled me: why do deep pipelines necessitate higher clocks?

For example, the P4 had a 30-stage pipeline, and they said it facilitated reaching 3.0+ GHz. However, Core 2 had fewer stages, and it reached the same clock levels.

Core 2 was newer, and if it weren't for power limits, the P4 could have hit speeds Core 2 can now only hit with some kind of sub-ambient cooling like phase change or LN2. It was not timing-limited.

The general reason is that a pipeline can only be clocked as fast as its longest stage allows.

An example with exaggerated numbers follows:
If 9 stages of a 10-stage pipeline take 1 ns but the 10th takes 5 ns, the cycle time must be 5 ns.
Let's say that slow stage is split into five ~1 ns stages.
The new cycle time is 1 ns.
The overall time it takes to execute an instruction doesn't necessarily change, and may even worsen.
However, the number of instructions in flight can increase significantly: in the same 5 ns period the slow pipeline needed for a single cycle, 4 additional instructions could have been started.
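
Running those numbers as a quick sketch:

# Quick sketch of the example above: 9 stages of 1 ns plus one 5 ns stage,
# versus the slow stage split into five ~1 ns stages.
slow_pipe = [1.0] * 9 + [5.0]           # 10 stages, cycle time 5 ns
fast_pipe = [1.0] * 14                  # 14 stages, cycle time 1 ns

for pipe in (slow_pipe, fast_pipe):
    cycle = max(pipe)                   # clock can tick no faster than this
    print(cycle, 5.0 / cycle)           # instructions started per 5 ns window

# slow pipe: 1 instruction per 5 ns window; fast pipe: 5 per window, i.e.
# the 4 additional in-flight instructions mentioned above. The logic an
# instruction passes through still sums to ~14 ns in both cases, and the
# extra latch overhead per added stage would make it slightly worse in
# practice.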

Of course, if the clock does not scale for whatever reason, the overhead can make things worse, and unexpected events like a branch mispredict will strip away the throughput advantage with a long stream of useless work.
 