G80 Shader Core @ 1.35GHz, How'd They Do That?

^eMpTy^

Newcomer
Just as the title says, what did nVidia do differently to suddenly hit 1.35GHz on a major portion of G80?

From my limited knowledge of computer architecture, I understand that clock speed increases come from three things: die shrinks, process maturity, or longer pipelines (i.e. shorter pipeline stages). Does G80 have dramatically shorter pipeline stages?

Anyone have any idea what changed with G80 such that these very high clock speeds are suddenly possible on a 90nm process that nobody has ever gotten more than 700MHz out of before?
 
To answer this question, you should ask yourself why AMD and Intel reached that clock frequency years ago on their CPUs, while GPUs were still in the couple-hundred-MHz range on the same process.

Put simply, it's all about design.
 
To answer this question, you should ask yourself why AMD and Intel reached that clock frequency years ago on their CPUs, while GPUs were still in the couple-hundred-MHz range on the same process.

Put simply, it's all about design.

Well, I know the reason the P4 hit 3.8GHz forever ago while the Core2 and Athlon64 are still nowhere close is that the P4 had 22 very short pipeline stages. Shorter stages = less time per stage to execute = faster clock speed.

But where exactly is the design change? And if clock speeds can be doubled with a design change, will ATi do the same thing? Is this design change specific to a unified architecture? Should we expect everyone that makes a DX10 chip to hit the same speeds?
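
If it helps to see the arithmetic behind "shorter stages = less time per stage = faster clock", here's a minimal back-of-the-envelope sketch in Python. The delay figures are made-up assumptions for illustration, not measured G80 or P4 numbers:

Code:
# Rough model: a fixed amount of combinational logic delay gets split across
# N pipeline stages, but every stage also pays a fixed flip-flop/latch overhead
# that does not shrink with deeper pipelining.
LOGIC_DELAY_PS = 6000    # assumed total logic delay for one operation (illustrative)
STAGE_OVERHEAD_PS = 90   # assumed register overhead per stage (illustrative)

for stages in (4, 10, 20, 30):
    cycle_time_ps = LOGIC_DELAY_PS / stages + STAGE_OVERHEAD_PS
    fmax_ghz = 1000.0 / cycle_time_ps    # period in ps -> frequency in GHz
    print(f"{stages:2d} stages -> {cycle_time_ps:6.1f} ps/cycle -> ~{fmax_ghz:.2f} GHz max clock")

The per-stage overhead is why doubling the stage count doesn't double the clock, and why ever-deeper pipelines eventually cost more in area and power than they pay back.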
 
Just as the title says, what did nVidia do differently to suddenly hit 1.35GHz on a major portion of G80?

"Major" functionally, of course. But what percentage of the silicon area of the chip do you suppose that represents?
 
Well, I know the reason the P4 hit 3.8GHz forever ago while the Core2 and Athlon64 are still nowhere close is that the P4 had 22 very short pipeline stages. Shorter stages = less time per stage to execute = faster clock speed.

But where exactly is the design change? And if clock speeds can be doubled with a design change, will ATi do the same thing? Is this design change specific to a unified architecture? Should we expect everyone that makes a DX10 chip to hit the same speeds?

It's specific to scalar-only ALUs, that's for sure. A large number of pipeline stages works well here because GPUs are much less sensitive to latency than a deeply pipelined CPU like the Pentium 4.
The downside is that scalar ALUs do less work per clock cycle than traditional vector shader ALUs, hence the need to compensate with high clocks.

The real question is how they did it while keeping thermal properties under control, unlike the dreaded Intel "Prescott".
But I guess x86 architectures are always less efficient than dedicated, floating-point-oriented designs like GPUs.
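
To put some rough numbers on "less work per clock, compensated with high clocks", here's a minimal sketch comparing a hypothetical vec4-MAD design at a traditional GPU clock against a hypothetical scalar-MAD design at a G80-style clock. The unit counts and clocks are illustrative assumptions, not an actual G7x/G80 spec comparison:

Code:
# A MAD (multiply-add) counts as 2 floating-point ops per component per clock.
def gflops(num_alus, components_per_alu, clock_ghz):
    return num_alus * components_per_alu * 2 * clock_ghz

# Hypothetical "traditional" design: 24 vec4 MAD ALUs at 650MHz.
vec4_design = gflops(num_alus=24, components_per_alu=4, clock_ghz=0.65)

# Hypothetical "scalar" design: 128 scalar MAD ALUs at 1.35GHz.
scalar_design = gflops(num_alus=128, components_per_alu=1, clock_ghz=1.35)

print(f"vec4 design  : {vec4_design:6.1f} GFLOPS (more work per clock, lower clock)")
print(f"scalar design: {scalar_design:6.1f} GFLOPS (less work per ALU per clock, higher clock)")

Peak numbers aside, scalar issue also avoids idle vector lanes whenever shader code isn't naturally vec4, which is a separate efficiency argument from the clock speed one.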
 
Just as the title says, what did nVidia do differently to suddenly hit 1.35GHz on a major portion of G80?

From my limited knowledge of computer architecture, I understand that clock speed increases come from three things: die shrinks, process maturity, or longer pipelines (i.e. shorter pipeline stages). Does G80 have dramatically shorter pipeline stages?

Anyone have any idea what changed with G80 such that these very high clock speeds are suddenly possible on a 90nm process that nobody has ever gotten more than 700MHz out of before?
10 stages doesn't sound very deep to me. I've been asking this question as well, and asking about the pipeline depth of previous designs (G7x, R5xx), but so far no good answers.

However, going from semi-custom to full-custom design should be able to get you a 2x performance (clock speed, in this case) boost without having to lengthen the pipeline. And IIRC the G80 shader core is full custom?
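
As a sanity check on what a 10-stage ALU pipeline at 1.35GHz would imply, here's a trivial bit of arithmetic (the 10-stage and 1.35GHz figures are the ones quoted in this thread; the rest is just unit conversion):

Code:
CLOCK_GHZ = 1.35
STAGES = 10                        # shader ALU pipeline depth, per the figure quoted above

cycle_ns = 1.0 / CLOCK_GHZ         # time budget per stage
latency_ns = cycle_ns * STAGES     # time for one op to traverse the whole ALU pipeline
print(f"per-stage budget : {cycle_ns * 1000:.0f} ps")
print(f"full ALU latency : {latency_ns:.1f} ns ({STAGES} cycles)")
# Throughput is still one result per ALU per clock once the pipeline is full;
# the latency gets hidden by the many threads a GPU keeps in flight.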
 
In the end, clock speed is a design choice. Higher clock speeds take more area and consume more power. More functional units also take up more area and consume more power. But more functional units are frequently also less efficient (graphics processing scales exceedingly well with the number of pipelines, but GPGPU will benefit hugely from higher clocks and fewer pipes).

So, with each new process, and with each new iteration of GPU technology, exactly what clock speed is most efficient varies.

Personally, I think the choice of high-speed ALUs on the G80 is going to be the biggest differentiator between the G80 and R600. More than anything else, I think that design choice is going to make the biggest difference (though others will clearly have their own benefits and drawbacks as well).
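
One way to see why "which clock speed is most efficient" keeps moving is to combine the throughput equation with a crude dynamic-power model (power roughly proportional to f·V², with voltage usually having to rise along with frequency). Every number below is an illustrative assumption, not a real chip comparison:

Code:
# Two hypothetical ways to reach the same peak throughput:
# many ALUs at a modest clock, or fewer ALUs pushed to a high clock (and voltage).
def peak_gflops(alus, clock_ghz):
    return alus * 2 * clock_ghz              # 2 FLOPs per MAD per clock

def relative_power(alus, clock_ghz, volts):
    return alus * clock_ghz * volts ** 2     # dynamic power ~ f * V^2 per ALU (arbitrary units)

wide_slow = dict(alus=256, clock_ghz=0.675, volts=1.0)    # illustrative
narrow_fast = dict(alus=128, clock_ghz=1.35, volts=1.15)  # assume higher V needed for 2x clock

for name, d in (("wide & slow", wide_slow), ("narrow & fast", narrow_fast)):
    print(f"{name:13s}: {peak_gflops(d['alus'], d['clock_ghz']):5.0f} GFLOPS, "
          f"relative power {relative_power(**d):6.1f}")

Same peak rate either way: the narrow, fast design pays in voltage and power, the wide, slow design pays in area, and which trade wins depends on the process and on how well the workload scales across units.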
 
Well, I know the reason the P4 hit 3.8GHz forever ago while the Core2 and Athlon64 are still nowhere close is that the P4 had 22 very short pipeline stages. Shorter stages = less time per stage to execute = faster clock speed.

Don't forget that, similarly to the shader cores on the G80, Netburst's ALUs were 'double pumped', running at twice the core frequency.
This means at 180nm they were running at up to 4GHz, 6.4GHz on 130nm, and finally 7.6GHz in 90 and 65nm chips.
LN2 cooling saw core speeds exceeding 7GHz with benchmarking possible, at which point the ALUs would be running at 14GHz :oops:
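
The "double pumped" arithmetic is just a 2x multiplier on the core clock; a few lines reproduce the ALU figures quoted above (the implied core clocks are taken from that post, not independently checked against SKU lists):

Code:
# Netburst's fast ALUs ran at twice the core clock.
for process, core_ghz in (("180nm", 2.0), ("130nm", 3.2), ("90/65nm", 3.8), ("LN2 overclock", 7.0)):
    print(f"{process:13s}: core {core_ghz:.1f} GHz -> double-pumped ALU {core_ghz * 2:.1f} GHz")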
 
Don't forget that, similarly to the shader cores on the G80, Netburst's ALUs were 'double pumped', running at twice the core frequency.
This means at 180nm they were running at up to 4GHz, 6.4GHz on 130nm, and finally 7.6GHz in 90 and 65nm chips.
LN2 cooling saw core speeds exceeding 7GHz with benchmarking possible, at which point the ALUs would be running at 14GHz :oops:

I think the double-pumped ALUs weren't present on the 90 and 65nm cores, as they weren't laid out by hand. Not really sure about that, but I think I read it somewhere.

Zvekan
 
Well, I know the reason the P4 hit 3.8GHz forever ago while the Core2 and Athlon64 are still nowhere close is that the P4 had 22 very short pipeline stages. Shorter stages = less time per stage to execute = faster clock speed.

But where exactly is the design change? And if clock speeds can be doubled with a design change, will ATi do the same thing? Is this design change specific to a unified architecture? Should we expect everyone that makes a DX10 chip to hit the same speeds?

While Netburst had very short stages, all of those stages added up to a very long pipeline, which hurt it severely in the end.
 
Just as the title says, what did nVidia do differently to suddenly hit 1.35GHz on a major portion of G80?
Are you sure G80 is so special? Maybe it's not...
Look at G70: its pixel shader ALUs approach 600MHz, and they're composed of 2 MADD units connected serially, so that you can do 2 dependent ops per clock cycle.
On G80 each ALU works in parallel, there's no serial connection AFAIK, so I'm not really that surprised that they can suddenly clock a single ALU at a much higher speed.
The rest is probably 'just' the by-product of some clever redesign of their ALUs, imho.
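
A toy timing model makes the serial-versus-parallel point concrete: if two dependent MADs have to finish inside one clock, the cycle time is capped by two MAD delays; if each clock only has to cover one MAD, the ceiling roughly doubles. The picosecond figures below are illustrative assumptions, not G70/G80 measurements:

Code:
MAD_DELAY_PS = 760     # assumed combinational delay of one MAD (illustrative)
OVERHEAD_PS = 80       # assumed latch/clocking overhead per cycle (illustrative)

chained_cycle_ps = 2 * MAD_DELAY_PS + OVERHEAD_PS   # two dependent ops per clock (G70-style)
single_cycle_ps = MAD_DELAY_PS + OVERHEAD_PS        # one op per clock (G80-style)

print(f"two chained MADs per clock: ~{1e6 / chained_cycle_ps:.0f} MHz ceiling")
print(f"one MAD per clock         : ~{1e6 / single_cycle_ps:.0f} MHz ceiling")

Of course this ignores the deeper pipelining and custom layout discussed elsewhere in the thread, which is presumably where the rest of the clock headroom came from.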
 
A primary design goal of the multifunction interpolator (which also provides special functions: SIN, RSQ etc.) was compactness, even at the expense of pipeline stage count.

I think the patent applications relating to the various aspects of the multifunction ALUs also specifically note the design's ability to hit a peak of 1500MHz in 10 pipeline stages (though it's unclear whether that result came from simulation or actual silicon). Again, compactness was a key aim of the design.

Jawed
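
For anyone unfamiliar with how a multifunction interpolator evaluates things like SIN or RSQ, the general technique is piecewise quadratic approximation: the top bits of the argument select a small table of coefficients and the low bits feed a tiny quadratic evaluation. Below is a generic software sketch of that idea; the segment count, coefficient construction and accuracy are my own assumptions, not the G80's actual tables:

Code:
import math

SEGMENTS = 64  # assumed number of table entries over [0, pi/2)
WIDTH = (math.pi / 2) / SEGMENTS

# Per-segment quadratic coefficients from a Taylor expansion of sin() at each
# segment midpoint: sin(m + d) ~ sin(m) + cos(m)*d - sin(m)*d^2/2.
TABLE = [(math.sin((i + 0.5) * WIDTH),
          math.cos((i + 0.5) * WIDTH),
          -0.5 * math.sin((i + 0.5) * WIDTH)) for i in range(SEGMENTS)]

def sin_approx(x):
    """Approximate sin(x) for x in [0, pi/2) with a table lookup plus a quadratic."""
    i = min(int(x / WIDTH), SEGMENTS - 1)   # "top bits" -> segment index
    d = x - (i + 0.5) * WIDTH               # "low bits" -> offset from segment midpoint
    c0, c1, c2 = TABLE[i]
    return c0 + c1 * d + c2 * d * d

# Quick accuracy check against the library sin().
worst = max(abs(sin_approx(j * (math.pi / 2) / 10000) - math.sin(j * (math.pi / 2) / 10000))
            for j in range(10000))
print(f"max error over [0, pi/2): {worst:.2e}")

The hardware appeal is that one small multiply-add array plus a shared coefficient ROM can serve several functions, which is presumably where the compactness mentioned above comes from.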
 
Hmm, I'm not sure if the G80's pipeline stage count has increased. The G70 has somewhere around 300 stages I think; the G80 should have cut that down considerably. Are you talking about cycles or stages?
 
It's specific to scalar-only ALUs, that's for sure.
I doubt being scalar had anything to do with it.

10 stages doesn't sound very deep to me. I've been asking this question as well, and asking about the pipeline depth of previous designs (G7x, R5xx), but so far no good answers.

However, going from semi-custom to full-custom design should be able to get you a 2x performance (clock speed, in this case) boost without having to lengthen the pipeline. And IIRC the G80 shader core is full custom?
10 stages might not sound very deep, but it's all math, and the Pentium 4's ~30 stages included instruction fetch, decode, etc. I suspect full custom wasn't necessary to hit 1.35 GHz. It might be required to double that, though. Not that anyone's known to be trying.

Hmm, I'm not sure if the G80's pipeline stage count has increased. The G70 has somewhere around 300 stages I think; the G80 should have cut that down considerably. Are you talking about cycles or stages?
You're thinking about the entire pipeline and not just the MADD ALUs.
 
Don't forget that, similarly to the shader cores on the G80, Netburst's ALUs were 'double pumped', running at twice the core frequency.
This means at 180nm they were running at up to 4GHz, 6.4GHz on 130nm, and finally 7.6GHz in 90 and 65nm chips.
LN2 cooling saw core speeds exceeding 7GHz with benchmarking possible, at which point the ALUs would be running at 14GHz :oops:
Hey, my old 486 was double pumped too.
It could calculate x = a + b + c in a single cycle. You had to kinda cheat, though (use the lea instruction).

Then came Pentium 4, and extended it to x = a OP1 b OP2 c, where OP1 and OP2 are one of: + - & | ^.

Funny how it ended up branded as doubling the clock frequency. You know, as if it wasn't inflated already.
 