G80 Shader Core @ 1.35GHz, How'd They Do That?

Another answer would be that the need to engineer fast-running ALUs hasn't really been there until recently. ALUs are fairly cheap in terms of transistors. You're more likely to be bottlenecked trying to keep them fed than by having too much to do. Latency is fairly high on GPUs, so you need to hide it. If you run up the clock speeds, that's just that much more register state you have to stick somewhere to hide the latency. Only recently have we started moving towards more math operations than texture operations being performed, which is why faster or larger ALUs are beneficial. If all you're doing is a texture lookup for a pixel, the speed of the shaders isn't going to do you a whole lot of good.
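To put rough numbers on that register-pressure point (all figures assumed, purely for illustration):

[code]
# Back-of-envelope sketch, all numbers assumed for illustration: the same
# memory latency costs more cycles at a higher clock, so more in-flight
# thread state (registers) is needed to hide it.
latency_ns = 200.0      # assumed memory latency
regs_per_thread = 16    # assumed register footprint per thread

for clock_ghz in (0.575, 1.35):
    cycles = latency_ns * clock_ghz    # latency measured in cycles
    threads = cycles / 4               # assume a thread issues every 4 cycles
    print(f"{clock_ghz} GHz: {cycles:.0f} cycles to hide, "
          f"~{threads:.0f} threads, ~{threads * regs_per_thread:.0f} registers")
[/code]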
 
There's also the process factor. GPUs thus far have been fabbed on foundry processes that must cover a wide range of customers and customer designs. Timings and drive currents are inferior to the more extreme engineering involved in high-end CPU processes.

Timings could be 2-3 times worse for a foundry process than they are for an Intel or AMD fab at the same geometry.


So, as an aside: with AMD hooking up with Chartered Semiconductor, will such things be a factor there as well? Or will Chartered have some equipment identical to AMD's?
 
So, as an aside: with AMD hooking up with Chartered Semiconductor, will such things be a factor there as well? Or will Chartered have some equipment identical to AMD's?

AMD has shared process tech with Chartered, so there would be more commonality.
It's also unlikely that the top CPU product lines would come from the outside fab.
 
You may have a point with the hand tuning, but I still don't believe it is the main reason.

The increased transistor count (increased power, more difficult to cool) and the increased complexity (more complex interconnects, which leads to signal interference, signal attenuation, and signal-delay issues) of GPUs are the main reasons.

These reasons mean that even the best "hand tuning" isn't capable of delivering a substantial increase in clock speed.
Hand tuning does give substantial improvements. You can get something like twice the speed and half the size from a full-custom design compared to a semi-custom design. Probably lower power too, if you wanted it. But the problem is that it just takes a long, long time compared to the place-and-route tool.

And yes, the manufacturing process plays a big role as well. I'm almost certain Intel and AMD are ahead of TSMC in this respect. I've heard 2x faster for Intel at a given process node from someone who is very knowledgeable about this stuff, but I can't be sure.
 
I believe NVIDIA was able to raise the clock so much mainly because the ALUs got simpler: 32-bit scalar instead of 128-bit vec4.

Erm, no. Unless there's some interaction between the individual components of a vector, a vec4 ALU is probably designed as 4 vec1 ALUs... If you design that one ALU for high speed, all 4 will run at high speed.
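As a behavioral picture of that point (a Python stand-in, obviously not how the hardware is designed):

[code]
# Behavioral sketch: a vec4 MAD with no cross-component interaction is just
# four independent scalar MADs, so a fast scalar lane means a fast vec4 unit.
def scalar_mad(a, b, c):
    return a * b + c

def vec4_mad(A, B, C):
    return [scalar_mad(a, b, c) for a, b, c in zip(A, B, C)]  # 4 independent lanes

print(vec4_mad([1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1]))  # [6, 13, 22, 33]
[/code]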

CPUs can achieve higher clocks because they have fewer transistors, so for the same frequency there's less heat dissipation and therefore less power demand.

Nope. The transistors at position A are not aware that there are many more transistors at positions B and C and D and ...
The speed within a clock domain is determined by the critical path between a source and a destination flip-flop. Which, simplified, is determined by the number of transistors between them and the length of the wires between them. Once again, those are local parameters that are not correlated with the overall size of the chip.
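To make that concrete, here's a minimal sketch with assumed flop timings; the achievable clock falls out of the worst register-to-register path alone:

[code]
# Minimal sketch with assumed timings: the cycle time of a clock domain is set
# by its worst flop-to-flop path, not by the chip's total transistor count.
t_clk_to_q = 0.10   # ns, assumed flop clock-to-output delay
t_setup = 0.05      # ns, assumed flop setup time

def f_max_ghz(worst_logic_delay_ns):
    return 1.0 / (t_clk_to_q + worst_logic_delay_ns + t_setup)

# Same worst path -> same f_max, whether the chip has 50M or 700M transistors.
print(f"{f_max_ghz(0.60):.2f} GHz")  # ~1.33 GHz for a 0.60 ns critical path
[/code]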

Frequency depends on the number of transistors: you can have few transistors with a high clock speed, or many transistors with a low clock speed.
Nope. Not at all. There are some global factors, such as the ability to supply enough power and the ability to remove the heat, but those are not primary factors in the ability to design a high frequency chip.

It depends on what you want to do. If you want more pipelines, each more complex (the GPU approach), then you need more transistors, hence you cannot raise the frequency as much. If you want few pipelines, each not so complex (the CPU approach), then the transistor count is lower, hence you can raise the frequency much more.

You're really simplifying it to a point that it doesn't make sense at all. It just doesn't work that way.

If I'm not mistaken, a significant part of G80 runs at the highest frequencies that have ever been used in GPUs, yet it also has the highest transistor count. See?

The fact that CPUs can run even higher is mostly because much more effort is spent getting them that way. I'm not talking about the 3 or 4 or 5 years to get the design ready, but about the amount of manpower that's thrown at the problem.
I have absolutely no numbers to back any of this up, but you can be sure that the number of design man-hours per transistor for a CPU is still way higher than that of a GPU. (Not including the memory transistors of the caches etc.)
 
The increased transistor count (increased power, more difficult to cool) and the increased complexity(more complex interconnects which leads to signal interference, signal attenuation and signal delay times issues) of the GPUs are the main reasons.

I doubt it and I doubt you can back it up with data.

Signal integrity has been a long-solved problem that's typically thrown over the wall to the backend guys, who deal with the issue with a bunch of really expensive tools.

Signal attenuation doesn't even enter the radar. Well, it's a factor in signal delay times, I guess, but that's been nicely modeled by a distributed RC network for years. And solved by buffering up the wire.
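To illustrate why buffering works, here's a rough sketch with assumed per-mm wire figures: distributed-RC delay grows with the square of the wire length, so splitting the wire into buffered segments wins quickly.

[code]
# Rough sketch, per-mm figures assumed: distributed-RC (Elmore) delay grows
# quadratically with wire length, so buffering a long wire into N segments
# trades a few buffer delays for a big reduction in total wire delay.
r = 200.0       # ohm per mm, assumed
c = 0.2e-12     # F per mm, assumed
t_buf = 30e-12  # s per inserted buffer, assumed

def wire_delay_ps(length_mm, n_segments):
    seg = length_mm / n_segments
    elmore = 0.5 * (r * seg) * (c * seg)              # per-segment RC delay
    total = n_segments * elmore + (n_segments - 1) * t_buf
    return total * 1e12

for n in (1, 2, 4, 8):
    print(f"{n} segment(s): {wire_delay_ps(10.0, n):.0f} ps")  # 2000 -> ~460 ps
[/code]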

A CPU has to deal with just the same issues (and at higher frequencies). SI is again a localized problem: it doesn't really matter that there are tons of complex buses on the other side of the chip; it's an effect that only works within a range of a um or so.
 
SilentGuy, your arguments are all based on some sort of weird claim that "the chip may have a large number of transistors and interconnects globally, but this doesn't affect the operation of transistors and interconnects locally". It just makes no sense to me. For example, you say
SI is again a localized problem: it doesn't really matter that there are tons of complex buses on the other side of the chip; it's an effect that only works within a range of a um or so.
Now it doesn't really matter if there are tons of complex buses on the other side of the chip? Signals just have to pass through these complex buses.
And another one along the same lines:
The transistors at position A are not aware that there are many more transistors at positions B and C and D and ...
The speed within a clock domain is determined by the critical path between a source and a destination flip-flop. Which, simplified, is determined by the number of transistors between them and the length of the wires between them. Once again, those are local parameters that are not correlated with the overall size of the chip.
Since there are more transistors and more interconnects between them globally, we should simply expect to find more transistors and interconnects in any given local area.

And hand tuning can't do any better, because it just can't overcome the laws of physics. For a given integration scale, if you're going to make a transistor work at a higher frequency, you're going to need higher power. If you make the transistor run at 4 times the frequency, you're going to need 4 times the power. So if a G80 were to double the frequency of its shader cores via hand tuning and quadruple the frequency of the rest of the chip (ROPs + TMUs + memory controller), it would need at least 3 times the power, assuming the shader ALUs and the rest of the chip are equal in transistor count; though if I remember correctly, the shader ALUs aren't more than 30% of the total transistors. Now, 3 times the power makes for about 300W+ for just the graphics card. What cooling do you need for that?
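Spelling out that arithmetic (assuming dynamic power scales linearly with frequency at fixed voltage; the transistor-count split is just a knob here):

[code]
# Spelling out the arithmetic, assuming dynamic power ~ C * V^2 * f, i.e.
# power scales linearly with frequency at fixed voltage.
def power_multiplier(shader_fraction, shader_clock_mult, rest_clock_mult):
    return (shader_fraction * shader_clock_mult
            + (1.0 - shader_fraction) * rest_clock_mult)

print(power_multiplier(0.5, 2, 4))  # shaders = half the chip   -> 3.0x power
print(power_multiplier(0.3, 2, 4))  # shaders = ~30% of the chip -> 3.4x power
[/code]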
 
Hand tuning does give substantial improvements. You can get something like twice the speed and half the size from a full-custom design compared to a semi-custom design. Probably lower power too, if you wanted it. But the problem is that it just takes a long, long time compared to the place-and-route tool.

And yes, the manufacturing process plays a big role as well. I'm almost certain Intel and AMD are ahead of TSMC in this respect. I've heard 2x faster for Intel at a given process node from someone who is very knowledgeable about this stuff, but I can't be sure.


Well, I doubt hand tuning can get you half the size and twice the speed (improvements of up to 25% less size and 25% more speed, maybe), but even so, you implicitly agree that transistor count and interconnect complexity are the major factors limiting the frequency.
If a hand tune doesn't reduce the transistor count and doesn't simplify the interconnects in number, shape, and size, then what on earth does it do to increase the speed?
 
Well, I doubt hand tuning can get you half the size and twice the speed (improvements of up to 25% less size and 25% more speed, maybe), but even so, you implicitly agree that transistor count and interconnect complexity are the major factors limiting the frequency.
If a hand tune doesn't reduce the transistor count and doesn't simplify the interconnects in number, shape, and size, then what on earth does it do to increase the speed?
No, I don't agree. It depends on how those transistors are used, which of course depends on the design targets. Generally, to get more speed (frequency), you have to spend more transistors. You do more parallel computation to reduce the logic depth per stage. This is especially true for ALUs. Or you pipeline deeper. Or a combination of both. This is assuming the same design methodology is used (not semi-custom vs. full custom). I don't think interconnects are the limiting factor.
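A toy version of that tradeoff, with made-up numbers:

[code]
# Toy sketch, made-up numbers: pipelining the same logic more deeply raises
# f_max, at the cost of extra flops (transistors) and per-stage flop overhead.
t_logic = 4.0  # ns of total combinational logic, assumed
t_flop = 0.2   # ns of clk-to-q + setup overhead per stage, assumed

for stages in (1, 2, 4, 8):
    t_cycle = t_logic / stages + t_flop
    print(f"{stages} stage(s): f_max ~ {1.0 / t_cycle:.2f} GHz")
[/code]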

Full-custom design allows you to make custom gates for a specific unit, say the ALU, which may be faster than what a standard-cell library gate can do. You can also really mess with the transistor sizing to optimize delays on the critical paths (logical-effort optimization really helps; it can give you 2x the speed using the same number of transistors) and really optimize your layouts to be denser than what a place-and-route tool can do. I don't know exactly what a current state-of-the-art P&R tool can do, but I'm sure it's nowhere near as good as a really good person doing the layouts. The P&R tools I've messed with are not good at all compared to a custom design. But like I said, they're a lot faster than a person.
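For the curious, a back-of-envelope logical-effort calculation looks something like this (the textbook Sutherland/Sproull delay model; all path parameters here are assumed):

[code]
# Back-of-envelope logical-effort sketch (textbook model, parameters assumed):
# path delay D = N * F^(1/N) + P, with path effort F = G * B * H. Choosing N
# and sizing the gates along the path is the tuning full custom allows.
G = 1.0 * (4.0 / 3.0) * (5.0 / 3.0)  # logical effort: inverter, NAND2, NOR2
B = 1.0                              # branching effort, assumed none
H = 64.0                             # electrical effort: Cout/Cin of the path
F = G * B * H

for N in (3, 4, 5, 6):               # stage counts to try (extra inverters)
    P = 5.0 + (N - 3) * 1.0          # parasitic delay, ~1 per added inverter
    delay = N * F ** (1.0 / N) + P   # in units of tau
    print(f"{N} stages: delay ~ {delay:.1f} tau")   # minimum lands at N = 4
[/code]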
 
No, I don't agree. It depends on how those transistors are used, which of course depends on the design targets. Generally, to get more speed (frequency), you have to spend more transistors. You do more parallel computation to reduce the logic depth per stage. This is especially true for ALUs. Or you pipeline deeper. Or a combination of both. This is assuming the same design methodology is used (not semi-custom vs. full custom). I don't think interconnects are the limiting factor.

Full-custom design allows you to make custom gates for a specific unit, say the ALU, which may be faster than what a standard-cell library gate can do. You can also really mess with the transistor sizing to optimize delays on the critical paths (logical-effort optimization really helps; it can give you 2x the speed using the same number of transistors) and really optimize your layouts to be denser than what a place-and-route tool can do. I don't know exactly what a current state-of-the-art P&R tool can do, but I'm sure it's nowhere near as good as a really good person doing the layouts. The P&R tools I've messed with are not good at all compared to a custom design. But like I said, they're a lot faster than a person.

To get more frequency you need to make better transistors (smaller, with less current draw and lower leakage currents) and better interconnects. Interconnects are a limiting factor due to signal delay (from their length) and their capacitance (set by their geometry (length, width) and the distance between them), which affects transistor switching times. Custom gates maybe faster due to but not neccesarilly with fewer transistors and interconnects.
 
To get more frequency you need to make better transistors (smaller, with less current draw and lower leakage currents) and better interconnects. Interconnects are a limiting factor due to signal delay (from their length) and their capacitance (set by their geometry (length, width) and the distance between them), which affects transistor switching times. Custom gates maybe faster due to but not neccesarilly with fewer transistors and interconnects.

Sure, interconnects aren't negligible for cross-chip wires, but how many of those are on the critical paths of modules? For local wiring, it's not a problem. I'm wondering why you think it's such a big issue? It is an issue for long wires, but I don't think it affects clock speed, because it can be dealt with quite well.
 
SilentGuy, your arguments are all based on some sort of weird claim that "the chip may have a large number of transistors and interconnects globally, but this doesn't affect the operation of transistors and interconnects locally".
Yes, that's exactly what I'm saying, because it's my practical experience. (It could be, of course, that design and the laws of physics for GPUs are entirely different, in which case I have to pass.)

Now it doesn't really matter if there are tons of complex buses on the other side of the chip?
Yes, that's entirely correct. We don't give a shit about those buses interfering with other signals. It's a back-end problem:
The presence of a lot of buses in a design can give you major headaches wrt routing congestion: your logic density decreases in congested areas, which increases the area of the chip, which increases cost. Tough. But that's about it, and you plan for it anyway during the architecture phase by using lower fill-factor estimates.

Signal integrity problems are largely orthogonal to these kinds of issues, because they happen on very small scales: an aggressor net of a chip (= a net with a very big driver) can impact victim nets that surround it, but the distance over which this happens is really small: it's typically only the nets that are placed right next to it, and even then the only problem nets are those that are far away from their driver AND have critical timing at the same time.

In the past, the problem was finding those nets, but that's standard practice now. Fixing them was and is easy: you insert another buffer on the victim net. Or you reduce the driver of the aggressor net and add a driver in the middle. Or you do a small routing tweak where you flip 2 wires, so the victim net gets farther away from the aggressor. 99% of those fixes are done automatically by the tools.

Signals just have to pass through these complex buses.
Exactly. Do you have a problem with that?

Edit: I have this little feeling that somehow you're under the naive impression that buses are treated differently from other signals during the back-end phase. Typically, they're not. Per macro block, the back-end gets a gigantic, often flattened, blob of nets and cells, and that's about it. Most of the time, they don't even know what the blob is doing, but if they do, they don't care. At all.
For top-level routing purposes, it can happen that certain channels are reserved for buses to cross a macro block. That's fine. It just means that cell placement in those channels will be at lower density, but even then bus signals are not treated differently from other signals.

And hand tuning can't do any better, because it just can't overcome the laws of physics. For a given integration scale, if you're going to make a transistor work at a higher frequency, you're going to need higher power. If you make the transistor run at 4 times the frequency, you're going to need 4 times the power. So if a G80 were to double the frequency of its shader cores via hand tuning and quadruple the frequency of the rest of the chip (ROPs + TMUs + memory controller), it would need at least 3 times the power, assuming the shader ALUs and the rest of the chip are equal in transistor count; though if I remember correctly, the shader ALUs aren't more than 30% of the total transistors. Now, 3 times the power makes for about 300W+ for just the graphics card. What cooling do you need for that?

Duh. So, basically, all you were initially saying is that GPUs have to take power into account? Why didn't you just write "GPUs have to take power into account" and leave it at that???

This has nothing to do with the implication that high frequency is not possible because the chip has a lot of transistors.

High frequencies by themselves are not a problem, even on large area designs, as long as you only use it where it makes sense.
 
To get more frequency you need to make better transistors (smaller, with less current draw and lower leakage currents) and better interconnects.

Faster transistors have higher leakage, not lower. Could you please describe how a better transistor has less current draw and how that impacts the frequency of a design? And while we're at it, how exactly do you define a 'better interconnect'?

Maximum frequency is determined by the largest delay between any 2 FFs within the same clock domain. Period. If you want higher speed on the same process, you decrease the delay: by pipelining, by using certain pre-calc techniques, by reducing fanout, etc. No need for 'better' transistors, really. And if you really hit the wall, you move to full custom, which is always better (and often smaller in area).

Custom gates maybe faster due to but not neccesarilly with fewer transistors and interconnects.

(Unable to decode the meaning of this sentence. Sorry.)
 
Faster transistors have higher leakage, not lower. Could you please describe how a better transistor has less current draw and how that impacts the frequency of a design? And while we're at it, how exactly do you define a 'better interconnect'?
Simply by going to smaller integration scales. At the same frequency you have smaller currents, or you can pump up the frequency and get the same currents. That's the way it has been going for years now; I wonder why you even asked. Better interconnects are ones with lower resistance and lower capacitance, with their geometry designed to minimize interference and length.
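For reference, the idealized constant-field (Dennard) scaling arithmetic behind that claim goes roughly like this (shrink factor assumed for illustration; leakage eventually broke this model in practice):

[code]
# Idealized constant-field (Dennard) scaling, the arithmetic this argument
# leans on. Leakage eventually broke this model in practice.
k = 0.7  # assumed linear shrink per process node

voltage = k                        # V scales down with feature size
current = k                        # I ~ (W/L) * (V - Vt) -> scales down too
gate_delay = k                     # delay ~ C*V/I -> scales down
power_per_tr = voltage * current   # ~k^2 per transistor at fixed frequency

print(f"same frequency: {power_per_tr:.2f}x current/power per transistor")
print(f"same power density: up to {1.0 / gate_delay:.2f}x frequency")
[/code]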

Maximum frequency is determined by the largest delay between any 2 FFs within the same clock domain. Period. If you want higher speed on the same process, you decrease the delay: by pipelining, by using certain pre-calc techniques, by reducing fanout, etc. No need for 'better' transistors, really. And if you really hit the wall, you move to full custom, which is always better (and often smaller in area).
Yes, for the same integration process you decrease the delay. One way to decrease the delay is to make shorter interconnects, if possible. Reducing fanout is all about reducing interconnects, isn't it? At the very start of the thread, the guy said that shorter pipelines can work at a higher frequency; isn't the length of the interconnects critical for a shorter pipeline?



(Unable to decode the meaning of this sentence. Sorry.)
Hehe, I was a bit angry with my sister and mother when I was writing that one. Well, custom gates may be faster not in frequency but in the way they implement a function. They can implement the same function with fewer transistors, and make it work faster by omitting some intermediate results.

And still about interconnects: there are the so-called global interconnects, which are generally large in length because many transistor gates throughout the chip connect to them. You just can't minimize these interconnects if your chip has a large transistor count and a large die area.

And again about interconnects: 2-3 years ago we saw that AMD couldn't raise the frequency as much as Intel's P4 Northwood & Extreme. This was generally attributed to the more complex design of the AMD chips. When we say complex design, what else do we mean if not many large interconnects that interfere with each other?
 
Simply by going to smaller integration scales. At the same frequency you have smaller currents, or you can pump up the frequency and get the same currents. That's the way it has been going for years now; I wonder why you even asked.
A P4 at 180nm runs at 2.8GHz. Maybe the transistors play a role (of course they do), but not such a big one?

Better interconnects are ones with lower resistance and lower capacitance, with their geometry designed to minimize interference and length.
Well, there's a problem then, right? Metal wire characteristics have been getting worse with every process step.

Yes, for the same integration process you decrease the delay. One way to decrease the delay is to make shorter interconnects, if possible. Reducing fanout is all about reducing interconnects, isn't it?
When you reduce fanout, you try to reduce the number of gates connected to the driver. That means shorter wires too, but that's not what you're designing for.

At the very start of the thread, the guy said that shorter pipelines can work at a higher frequency; isn't the length of the interconnects critical for a shorter pipeline?
No, not at all. It means fewer levels of logic.
Yes, that reduces wire lengths too, which is a nice consequence. But when you need to design a new high-speed block, the first question will be this: "how many levels of logic are we allowed to use?" A question about interconnect is never asked.

Well, custom gates may be faster not in frequency but in the way they implement a function. They can implement the same function with fewer transistors, and make it work faster by omitting some intermediate results.
On the contrary: instead of omitting intermediate results, they hold on to them longer.
One of the common techniques in custom design is to keep intermediate results in one-hot format, which is a redundant form and less area- and wire-efficient. But it also drastically reduces the decoding complexity at the receiving end, so the number of logic levels between flops goes down. It also often results in a more regular layout structure, which is easier to manually place and route.
The number of transistors is, indeed, usually smaller for a given function, because they can use pass gates and other funky tricks.
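A behavioral illustration of the one-hot trick (Python as a stand-in for gates; the mux and select widths are made up):

[code]
# Behavioral sketch of the one-hot trick: a binary select needs a decode level
# before the mux; one-hot select bits gate the inputs directly, so there are
# fewer logic levels between flops on the receiving end.
def mux_binary(inputs, sel_bits):
    idx = sum(bit << i for i, bit in enumerate(sel_bits))
    enables = [i == idx for i in range(len(inputs))]  # extra decode level
    return any(e and x for e, x in zip(enables, inputs))

def mux_onehot(inputs, sel_onehot):
    return any(s and x for s, x in zip(sel_onehot, inputs))  # no decode level

data = [0, 1, 1, 0]
print(mux_binary(data, [0, 1]))        # select index 2 -> True
print(mux_onehot(data, [0, 0, 1, 0]))  # same select, one-hot -> True
[/code]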

And still about interconnects: there are the so-called global interconnects, which are generally large in length because many transistor gates throughout the chip connect to them. You just can't minimize these interconnects if your chip has a large transistor count and a large die area.
They are not large in length because many gates connect to them. They are large because they have to travel large distances. And the obvious thing to do is to just pipeline them and buffer them up. Like Fahran tried to get across: no big deal.

And again about interconnects: 2-3 years ago we saw that AMD couldn't raise the frequency as much as Intel's P4 Northwood & Extreme. This was generally attributed to the more complex design of the AMD chips.
'Complex' can mean anything, so it basically means nothing.
You're confusing 'complex' with 'different architecture'. The P4 architecture was entirely designed for crazy speeds and was pipelined to a fault. The AMD architecture had a shorter pipeline. The lower frequency of the AMD chips was a design decision right from the start, not something that just happened.

When we say complex design, what else do we mean if not many large interconnects that interfere with each other?
Well, that's the first time ever I've heard 'complex' defined primarily as a problem of interfering interconnects. I don't know where you got that idea, but I'm afraid you won't find a lot of supporters.

BTW, when I think of complex, I think of lots of non-trivial logical functions that are hard to get right during design and to verify for correctness. The 'problem' of interfering interconnects doesn't even cross my mind. Seriously. Maybe it's just me...
 
A P4 at 180nm runs at 2.8GHz. Maybe the transistors play a role (of course they do), but not such a big one?

Actually, the fastest 0.18u ("P4 Willamette") NetBurst CPU released to the retail market was clocked at 2.0GHz. The fastest 0.13u ("P4 Northwood") ... was 3.2GHz. The fastest 90nm ("P4 Prescott") ... was 3.73GHz, and it consumed more leakage power than the 130nm part at the same frequency.

Intel publishes a bunch of papers on their CPU architectures and circuit-level implementations. The P4 utilized a bunch of asynchronous logic (self-resetting domino, or something like that), which hurts my mind even to say it. Definitely not your run-of-the-mill CMOS digital-gate approach. Though ironically, the Pentium 4 was also the first Intel CPU in which >50% of the logic area was implemented through an automated cell-based methodology (i.e. something like a 'standard-cell' flow), using the usual (traditional) suite of automated logic-synthesis, place-and-route, and timing-closure techniques.

Signal integrity has been a long-solved problem that's typically thrown over the wall to the backend guys, who deal with the issue with a bunch of really expensive tools.

WOW, I'm an idiot working with other idiots then! :) Once in a while, these backend SI issues trickle upstream to the synthesis guys, who poke the RTL designers, who then have to make seemingly contradictory, nonsensical mods to the RTL source files. I once tried to convince people to use 'signed' Verilog-2001 arithmetic instead of hand-coding 2's-complement math functions, but that got me branded a heretic. Something about coding in Verilog for so many years made those designers forget that HDLs (*cough* VHDL) can, in fact, perform unsigned and signed arithmetic with equal naturalness, without awkward manually-written boolean equations for the carry-out bit.

Unreal said:
Well, I doubt hand tuning can get you half the size and twice the speed (improvements of up to 25% less size and 25% more speed, maybe), but even so, you implicitly agree that transistor count and interconnect complexity are the major factors limiting the frequency.

Like Farhan said, the quoted improvements (half the area, twice the speed) are possible, depending on the aggressiveness of the boolean-logic and transistor-level optimizations. Certain high-level structures (like a register file or a barrel shifter) can be shrunk even more, because standard-cell logic synthesis/P&R does a terrible job of mapping them. (A 3rd-year VLSI EE student could do a much better layout than the output of an automated gate-synthesis tool.)

15-20 years ago, doing custom spins on completed ASICs was a typical "cost-reduction" step for many ASIC designs. This was back when product cycles were longer and design rules were less complex/convoluted. That has decreased over the years, partly due to cost, but probably also because front-end digital design and design verification/simulation have exploded in scope and complexity... Thanks to Moore's Law, designer productivity can hardly keep up with ever-increasing design sizes and gate counts. Even with the latest highly automated synthesis/P&R production flow, just getting a working chip out to tapeout, on time, is a struggle. Adding budget and manpower for custom design, on an already hectic schedule, might cause the whole team to implode.

One of the common techniques in custom design is to keep intermediate results in one-hot format, which is a redundant form and less area- and wire-efficient. But it also drastically reduces the decoding complexity at the receiving end, so the number of logic levels between flops goes down. It also often results in a more regular layout structure, which is easier to manually place and route.

Most modern front-end RTL synthesis tools can recognize and extract finite state machines and transform them to a one-hot representation... Now, I read somewhere that full-custom circuits often violate design rules if it can be formally proven that a forbidden (i.e. catastrophic) state can never happen. (You know, like using internal tristates to shrink a big fan-in mux.) Is that so?
 
Now, I read somewhere that full-custom circuits often violate design rules if it can be formally proven that a forbidden (i.e. catastrophic) state can never happen. (You know, like using internal tristates to shrink a big fan-in mux.) Is that so?
This is true. Especially for dense array-type structures like SRAM, design rules are often violated to get higher density. And yes, the tri-state mux is very handy :)
 