R500 will Rock

Except that with a 4-component dot product, you only need three adds, and two of them can be in parallel. So, just from comparing the speed of the adders, you would expect a GPU to be half the frequency (assuming the longest basic work unit is the dot product).

Even if you consider that a full dot product includes four multiplies and three adds, all four of those multiplies are in parallel, so you only need three serial calculations. So it just doesn't add up.
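The serial-depth argument above can be sketched as a toy model (purely illustrative; it counts each multiply or add as one logic level and ignores real gate timings):

```python
import math

# Serial depth of an n-component dot product when every independent
# operation runs in parallel: one level of multiplies, then an
# add-reduction tree of depth ceil(log2(n)).
def dot_serial_depth(n):
    return 1 + math.ceil(math.log2(n))

print(dot_serial_depth(4))  # 3: four parallel multiplies, then two add levels
```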
 
Uh, multiplies are ~ 4x longer than adds in time to execute (logic cascades).

An adder can be really fast because most of its work can be done in parallel.
A multiply is significantly more complicated.

Just look at the latency of instructions for FP code in most CPUs. Adds are typically 1 cycle, multiplies 3 to 5 cycles.

Dot product limiting rate is an add plus a multiply in serial. Interesting how 4 + 1 = 5.

500 MHz × 5 = 2.5 GHz.

It adds up just fine.

If the 8 mults and three adds were all in serial in one pipe stage it would run < 100Mhz...

There are other factors besides the operations themselves, like the number of inputs and outputs, how much intermediate data there is, and the complexity of that intermediate data. For instance, a cross product has more inputs, outputs, and complicated data dependencies that would add to its time.

Also, executing all of this in one stage is faster than pipelining it the way a CPU does, so the execution-latency ratio is less than 5x. But I think UMC/TSMC isn't quite at the perf level of AMD/Intel's fabs.

I already thought through several of the other "typical units of GPU work," which of course can be partly parallelized when broken down (like a bilinear filter). These seem to line up so far with about 5 adds' worth of serial execution work.
 
Scott C said:
Uh, multiplies are ~ 4x longer than adds in time to execute (logic cascades).

An adder can be really fast because most of its work can be done in parallel.
A multiply is significantly more complicated.

Just look at the latency of instructions for FP code in most CPUs. Adds are typically 1 cycle, multiplies 3 to 5 cycles.

Dot product limiting rate is an add plus a multiply in serial. Interesting how 4 + 1 = 5.

500 MHz × 5 = 2.5 GHz.

It adds up just fine.
No, it doesn't. If you think more about your math, you just said that the dot product in a GPU will only take about two cycles longer than a multiply on a CPU, meaning that with this argument, a GPU would clock at least around 60% that of a CPU. This is clearly not the case.

Anyway, I'm not so sure I believe FP adds are really that much lower in latency than FP multiplies, due to the barrel shifter needed for the exponent. I'm sure integer adds are much lower latency than integer muls, though.

If the 8 mults and three adds were all in serial in one pipe stage it would run < 100Mhz...
That's four muls and three adds.

Edit: Oh, and if you were thinking about cross products, the cross product is only defined in three dimensions, and is six muls (all independent) and three adds (all independent). The cross product has fewer data dependencies than the dot product.

For instance, a cross product has more inputs, outputs, and complicated data dependencies that would add to its time.
And I don't believe a cross product is in the instruction set of GPUs.

Edit 2:
Oh, but I would imagine that a cross product in a GPU would typically be implemented via what essentially amounts to a 3x3 matrix multiplied by a 3-vector. This could be implemented with three swizzled dot3's. But that's beside the point.
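As a sketch of that idea (the function names and layout here are my own illustration, not any GPU's actual instruction sequence), a cross product can be written as three dot3's against swizzled rows built from the first operand:

```python
def dot3(u, v):
    # 3-component dot product: three multiplies, two adds
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def cross_via_dot3(a, b):
    # Rows of the skew-symmetric matrix [a]x, i.e. swizzled/negated
    # components of a; a x b is then that matrix applied to b.
    rows = ((0.0, -a[2], a[1]),
            (a[2], 0.0, -a[0]),
            (-a[1], a[0], 0.0))
    return tuple(dot3(r, b) for r in rows)

print(cross_via_dot3((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))  # (0.0, 0.0, 1.0)
```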
 
Scott C said:
Also, executing all of this in one stage is faster than pipelining it the way a CPU does, so the execution-latency ratio is less than 5x. But I think UMC/TSMC isn't quite at the perf level of AMD/Intel's fabs.

Intrinsity has a math CPU that goes 2.5GHz on a chip from a TSMC fab.

I think you can choose different fabrication methods depending on your design.
 
Scott C said:
Just look at the latency of instructions for FP code in most CPUs. Adds are typically 1 cycle, multiplies 3 to 5 cycles.
False. The numbers you specify are generally true for INTEGER operations where a multiply is indeed ~3-4x slower than an add. For Floating-Point operations, adds are usually not that much faster than multiplies at all. For example, for Athlon series processors, the latencies of the relevant instructions are as follows:
* Integer Add: 1 cycle
* Integer Multiply: 4 cycles
* Floating-point Add: 4 cycles
* Floating-point Multiply: 4 cycles
 
False. The numbers you specify are generally true for INTEGER operations where a multiply is indeed ~3-4x slower than an add. For Floating-Point operations, adds are usually not that much faster than multiplies at all. For example, for Athlon series processors, the latencies of the relevant instructions are as follows:

* Integer Add: 1 cycle
* Integer Multiply: 4 cycles
* Floating-point Add: 4 cycles
* Floating-point Multiply: 4 cycles
There's a difference, though: the integer adder is a 32+32 -> 32 adder. The floating-point adder needs a 64+64 -> 65-bit integer adder to compute the mantissa for an 80-bit float (the internal precision of the FPU). So they're not really comparable in the first place.

Edit: I'm ignoring the extra bits in the FP adder used for rounding.
 
Bob said:
There's a difference, though: the integer adder is a 32+32 -> 32 adder. The floating-point adder needs a 64+64 -> 65-bit integer adder to compute the mantissa for an 80-bit float (the internal precision of the FPU). So they're not really comparable in the first place.

Edit: I'm ignoring the extra bits in the FP adder used for rounding.
Sure, but I think this was more about comparing a floating point add to a floating point multiply, not comparing an int add to a floating point add.
 
Sure, but I think this was more about comparing a floating point add to a floating point multiply, not comparing an int add to a floating point add.
I did not mean to imply otherwise.

Something else to consider: The FP adder may or may not be fully optimized for latency. It may also be implemented as part of a fused MAC to save hardware. This would make the latencies for FP add and mul be the same simply because the data has to travel the same path.

My point is that you can't really compare GPUs to CPUs.
 
The reason a floating point add takes so long is the pre-shift and the post-normalize. One of the input terms has to be shifted by up to N-1 bits (for an N-bit mantissa) to align them before adding. Then after the add you have to shift the result by up to N-1 bits to normalize the result, in case the two numbers were of similar magnitude but opposite sign, so that they mostly cancel each other out. That post-shift has to happen after the ripple-carry add completes, so it doesn't pipeline very well.

Floating point multiply doesn't have those problems. No pre-shift is needed and no post-normalize, either (at least if we ignore denorms). The multiply requires a lot of adders, but they can be done with carry-save logic, so that there is only one ripple-carry at the end. That means that the whole operation takes a lot of gates, but it pipelines well. Even with the shifting required by denorms, there is some parallelism that isn't available with the shifts required for floating point adds.
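The align/add/normalize sequence described above can be sketched as a toy model (deliberately simplified: unsigned operands only, an arbitrary 8-bit mantissa, no rounding, no denorms, and no left-shift normalization since there is no subtraction here):

```python
MANT_BITS = 8  # hypothetical mantissa width; real FPUs use 24/53/64 bits

def fp_add(m1, e1, m2, e2):
    """Add two normalized (mantissa, exponent) pairs representing m * 2**e."""
    # Pre-shift: align the smaller operand, shifting by up to MANT_BITS-1 bits
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)
    # The add itself (the ripple/lookahead-carry part)
    m, e = m1 + m2, e1
    # Post-normalize: shift the result back into [2**(MANT_BITS-1), 2**MANT_BITS)
    while m >= (1 << MANT_BITS):
        m >>= 1
        e += 1
    return m, e

# 1.5 * 2**0 + 1.5 * 2**0 = 1.5 * 2**1 (mantissas scaled by 2**7)
print(fp_add(0b11000000, 0, 0b11000000, 0))  # (192, 1)
```

The point of the sketch is that the pre-shift must finish before the add can start, and the post-normalize cannot start until the add completes, which is exactly the serial chain aranfell describes.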
 
Bob said:
False. The numbers you specify are generally true for INTEGER operations where a multiply is indeed ~3-4x slower than an add. For Floating-Point operations, adds are usually not that much faster than multiplies at all. For example, for Athlon series processors, the latencies of the relevant instructions are as follows:

* Integer Add: 1 cycle
* Integer Multiply: 4 cycles
* Floating-point Add: 4 cycles
* Floating-point Multiply: 4 cycles
There's a difference, though: the integer adder is a 32+32 -> 32 adder. The floating-point adder needs a 64+64 -> 65-bit integer adder to compute the mantissa for an 80-bit float (the internal precision of the FPU). So they're not really comparable in the first place.

Edit: I'm ignoring the extra bits in the FP adder used for rounding.
A 64-bit addition isn't that much slower than a 32-bit addition - with e.g. a Kogge-Stone adder, the difference is like 5-10% even if the adder appears in a critical path. As an example, in the Athlon64, the integer add instruction is perfectly capable of doing a 64+64->65bit add (the 65th bit is the output carry bit) in 1 clock cycle, while the FP add still takes 4 cycles.
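To illustrate the Kogge-Stone point, here is a bit-level simulation (illustrative only, not production RTL): all carries are resolved in ceil(log2(n)) parallel prefix stages, so a 64-bit add needs just one more prefix stage than a 32-bit one (six instead of five).

```python
def kogge_stone_add(a, b, n=64):
    # Per-bit generate and propagate signals, all computed in parallel
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    # ceil(log2(n)) parallel prefix stages resolve every carry
    d = 1
    while d < n:
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
    # After the prefix, g[i] is the carry out of bit i (with carry-in 0)
    s = 0
    for i in range(n):
        carry_in = g[i - 1] if i else 0
        s |= ((((a >> i) ^ (b >> i)) & 1) ^ carry_in) << i
    return s, g[n - 1]  # sum and the carry-out (the "65th bit")

print(kogge_stone_add(2**64 - 1, 1))  # (0, 1): sum wraps, carry-out set
```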
Something else to consider: The FP adder may or may not be fully optimized for latency. It may also be implemented as part of a fused MAC to save hardware. This would make the latencies for FP add and mul be the same simply because the data has to travel the same path.
True for some processors, but not for the Athlon in my example: it can sustain a throughput of 1 FADD + 1 FMUL per clock cycle, so there cannot be any resource sharing between the FADD and FMUL pipelines. And given how large the FP adder in the Athlon is, it would surprise me if it isn't heavily latency-optimized.

I see Aranfell has given a rather good explanation of what the problem with fast FP adds is.

Of course, both integer and floating-point units are likely to be quite different in a GPU than in a CPU. In a CPU, the most critical factor affecting performance is often latency, so you want to spend lots of transistors on a small number of uber-fast execution units. In a GPU, what matters is throughput rather than latency, so you want to build as many execution units as possible from your transistors, even if the resulting units individually aren't nearly as fast as those of CPUs.
 
No, it doesn't. If you think more about your math, you just said that the dot product in a GPU will only take about two cycles longer than a multiply on a CPU, meaning that with this argument, a GPU would clock at least around 60% that of a CPU. This is clearly not the case.

No, because a CPU is limited by its integer add speed, not its multiply speed.

So I was saying a dot product takes 5x longer (4 cycles longer) than an add, or 20% clockability if in one stage.

That's four muls and three adds.

Yeah, my mistake. 8 inputs, 4 mults, 3 adds.

And about cross products, the point was illustrative. I was not saying a GPU had them, just that it would be more complicated than a dot product for several reasons. And these reasons change clockability even if the number of operations were the same (they aren't).

Intrinsity has a math CPU that goes 2.5GHz on a chip from a TSMC fab.
What process? 130nm? 90nm? What other features (SOI, strained Si, etc.)?

Intel's and AMD's profit margins depend a good deal on clockability; I'd wager that they typically would clock similar designs faster, though probably only 10% or 15%.

Either way, it's opinion; we don't have much evidence, and you can't just fab any old design anywhere optimally, for many reasons.

* Integer Add: 1 cycle
* Integer Multiply: 4 cycles
* Floating-point Add: 4 cycles
* Floating-point Multiply: 4 cycles

Absolutely. I screwed up there.

However, this makes my central point even more clear:
A CPU builds its pipelines to operate at 1-cycle latency for an int add. This means a full int add in one stage of a pipeline.
If a GPU targeted one stage for just a single FP multiply, its clockability against a CPU would be off by at least a factor of two, and perhaps near a factor of four.

To make the claim that a GPU could clock like a CPU (the start of this thread hijacking), you would have to say that GPU pipe stages are smaller than an FP add. I find that highly unlikely, especially after Aranfell's explanation of implementing a pipelined FP add: messy and difficult, best avoided if you have to have tons of the things on your chip, as with a GPU.

Make the limiting pipe stage something like a dot product (serially equivalent to an add and a multiply in FP), and we're near the factor of 5 if you assume that there are latency gains for a single stage like this to go from 4+4=8 to around 5. There will be such relative gains, since it is a single stage rather than 8 stages in a pipe, and the pipelining itself adds to latency (measured in time, not clocks).

There's a difference, though: the integer adder is a 32+32 -> 32 adder. The floating-point adder needs a 64+64 -> 65-bit integer adder to compute the mantissa for an 80-bit float (the internal precision of the FPU). So they're not really comparable in the first place.

The exponent takes most of the time in an FP add. Integer adding time is barely dependent on the number of bits with optimized adders.

Sure, but I think this was more about comparing a floating point add to a floating point multiply, not comparing an int add to a floating point add.

For clockability we're comparing integer add limited (CPU) to Floating point operation (my claim was dot product) limited.


I still stand by my central point regardless of some of the errors I made:

Clockability of GPUs is limited by pipe stages, not thermal issues or design-effort (resource) considerations.
GPU pipe-stage clockability is determined by how much work a typical stage does in serial, and GPUs do more such work in a stage than CPUs.
CPUs have finer-grained operations, and the speed of the simple ones, like an integer add, is critical.
GPUs have larger-granularity operations, and the speed of more complicated operations (geometry transform, texturing, floating-point ops) is the critical determinant of performance.
To clock higher, a GPU would have to break up many of these operations into several pipeline stages, which has little to no gain given the power/die-size tradeoff that would have to be made.
 
Scott C said:
However, this makes my central point even more clear:
A CPU builds its pipelines to operate at 1-cycle latency for an int add. This means a full int add in one stage of a pipeline.
If a GPU targeted one stage for just a single FP multiply, its clockability against a CPU would be off by at least a factor of two, and perhaps near a factor of four.
"If" is the key here. I see no reason why GPUs should do this, considering they want to hide a whole heck of a lot of latency anyway.

To make the claim that a GPU could clock like a CPU (the start of this thread hijacking), you would have to say that GPU pipe stages are smaller than an FP add. I find that highly unlikely, especially after Aranfell's explanation of implementing a pipelined FP add: messy and difficult, best avoided if you have to have tons of the things on your chip, as with a GPU.
I find it highly unlikely that they could save a huge number of transistors this way, enough to offset a potential multiplication of clock rate. What's more, it wasn't that long ago that there were no FP operations at all being done on the GPU. If GPUs were limited by adds, then you'd need a completely different argument to explain why the GeForce 2 GTS clocked similarly to 3dfx's VSA-100 (actually, I think it clocked higher, despite the use of hardware T&L, the only part that should have required floating point at that time).
 
The exponent takes most of the time in an FP add. Integer adding time is barely dependent on the number of bits with optimized adders.
You can compute the exponent in parallel with the mantissa. The critical path of an FP adder is usually the mantissa rounding. This is doubly true if you need to implement all those wacky IEEE 754 rounding rules.

<edit: I meant - the critical path goes through the mantissa rounding. It is not only the mantissa rounding.>

A 64-bit addition isn't that much slower than a 32-bit addition - with e.g. a Kogge-Stone adder, the difference is like 5-10% even if the adder appears in a critical path. As an example, in the Athlon64, the integer add instruction is perfectly capable of doing a 64+64->65bit add (the 65th bit is the output carry bit) in 1 clock cycle, while the FP add still takes 4 cycles.
Obviously, if you throw more transistors at the problem, you'll get better results. Obviously, you can compute 64+64 bits in twice the time of 32+32 bits for (roughly) the same transistor cost (assuming you already have a forwarding path). Or you can use 3-4 times the transistors and get the same latency and throughput. Or you can use (roughly) twice the transistors for the same throughput but twice the latency.

You can almost always trade area for latency or throughput. Granted, I'm not being very convincing by not pointing out explicit numbers (which I do not know, and if you don't work for both AMD and Nvidia, you likely don't know either).

Another thing to consider is that CPUs have a lot more area to dedicate to each adder (and thus more speed or lower latency, or both). How many adders does an Athlon have? Likely far fewer than a GeForce 6800.

The constraints are different, and thus the design decisions are different.
 
"If" is the key here. I see no reason why GPU's should do this, considering they want to hide a whole heck of a lot of latency anyway.

Where did I say they _should_ do it that way? And you provide no reason that they should go any smaller than this (latency is not a reason; I'll explain below).

IF is the key point. If they targeted an FP add as the gating factor, it would be slower. They absolutely don't make a pipeline's unit of work smaller than that; it would cost a ton of die space.

Lengthening pipelines and splitting work costs die space, especially for small operations like a single FP add. Not so much for large composite operations with clear "halfway" points.

You are making the claim that somehow they would need to mask latency WITHIN A SINGLE ADD? Honestly, WTF? What could they possibly have to wait for between the beginning and end of a single add? You have two numbers, and now you have to add them and get a result. Do you need any extra info in the middle of that calculation? No!
And since you are talking about masking latency, you do realize that adding extra stages and breaking a single stage into two (faster-clocking) ones is an utter waste, since latency is measured in TIME, not clocks. If they want to mask latency, they can just add a cycle of waiting in the pipeline.
In simple terms, if at the end of the add you have to wait for something else, add a dummy pipe stage that does nothing: voilà, masked latency! Don't split the add in two; it's way more work!
Splitting up a stage that is not clock-limiting is pointless, costly work.


I'd like to hear one single good case for them making a pipe stage smaller than an FP add. I still think the stages are larger than that, but even if not, what would it gain them? You have 1/500 MHz of time, and a single add fits in that time. Why bother making it take two cycles? To mask latency? LOL, just wait or do some other work. The "mask latency" argument for splitting work up into multiple stages is bogus.
You split work up into multiple stages if the work is too much to fit in the target clock period, and that's it. If you target 2.5 GHz, then yes, you'll pipeline that FP add or make it complete in more than one clock at a throughput of less than one per clock.
If you raise your target clock, then all stages everywhere must do less work, but the result is increased overall area and power use.

As I said, the cost would be high in area and power use, and those are fixed resources, so a higher-clocking chip (say, 2x faster) would probably need 3x to 5x the area, and thus have less than half the pixel and vertex pipes and be even slower in aggregate throughput (though each individual pipe would be faster). Three pipes at 1 GHz or eight at 500 MHz: take your pick.
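The aggregate-throughput arithmetic behind "take your pick" (using the hypothetical pipe counts and clocks from the post, not real chips):

```python
# Aggregate results per second = number of pipes * clock rate,
# assuming each pipe retires one result per clock.
narrow_fast = 3 * 1.0e9  # 3 pipes at 1 GHz
wide_slow = 8 * 0.5e9    # 8 pipes at 500 MHz

print(narrow_fast, wide_slow)  # 3000000000.0 4000000000.0
```

So the wider, slower chip wins on aggregate throughput even though each of its pipes is half as fast.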

The relationship between the area needed and the clock is highly nonlinear. At CPU speeds it is extremely steep: slight clockability improvements lead to very large growth in area and power consumption. At GPU speeds it is less so. The reason is that in CPUs you are breaking many small things into multiple parts, such as an FP add or multiply. These are small operations that can be hard to parallelize and break apart, and they require a lot of area and complexity to work at that speed.
Clearly, breaking apart something larger, like a matrix transform, is easier, since there are several obvious boundaries in the process at which to break it into stages. Clear and simple boundaries mean hardly any increase in area (complexity).

Clear and simple boundaries break down into more complicated situations when you cross from the fundamental units of work into sub-units.

The constraints are different, and thus the design decisions are different.

Exactly.

GPUs clock lower because they have pipeline stages that do more work.
This is so because the design decisions are different.
Such decisions include the area/power tradeoffs and the fundamental units of work.

A circular relationship, of course: the design influences the targeted clock, and the achieved clock is a result of the design.


In a GPU you can add pixel and vertex pipelines; each new one is an approximately linear addition of area, and you get linear improvements in performance. Thus, being able to cut the area used in half is JUST AS VALUABLE, performance-wise, as increasing clockability by a factor of two.

In a CPU your limiting factor is a single execution stream, and extra pipelines can only parallelize this stream a little. Each extra pipeline (superscalar pipe) requires nonlinear extra area and power, because the pipelines must coordinate on a common set of registers and a single instruction stream, and the logic to organize and coordinate this activity is expensive (it is most of the logic in a CPU, actually).
Cutting the area of the execution resources (which are now < 50% of the chip) is hardly valuable. Increasing clock is extremely valuable, so long as the instruction execution efficiency (IPC) doesn't get too bad.
Thus, the optimal designs limit the number of pipes and go for high clock speeds at good IPC, with the smallest common instructions (int add, boolean, logical, and other simple ops) as the clock-limiting factors. This has been the primary CPU design philosophy since circa 1990, give or take a few years depending on the ISA, once transistor budgets allowed for pipelining at all (more proof that pipelining an action and splitting it up takes more transistors/area/power).

For a CPU that expects to have many simultaneously parallelizable instruction streams, the design constraints are different, and lo and behold, Sun's Niagara is designed to have many simple, parallel, independent pipelines that are only moderately clock-tuned. With throughput the goal, it would make little sense to quadruple each mini-CPU pipeline's size for more clock, since they're cramming lots of them into one core.
 
I'd like to hear one single good case for them making a pipe stage smaller than an FP add.
I'll bite. Let's take your example: a 500 MHz FP adder in one stage. Let's also assume that forwarding logic and MUXes are free, and that wire delay is 0.

If I can make an adder that's half the size, but only runs at 250 MHz, then it's worth splitting up this pipeline stage into two 500 MHz stages. Then, I have an adder that has 2 clocks of latency (at 500 MHz) instead of 1 clock at 500 MHz. But this adder now uses half the area.

You could then put two of such adders in parallel, to double your add rate, yet use no additional area!
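A toy area/throughput model of this argument (all figures are the hypothetical ones from the post, with forwarding, MUXes, and wire delay assumed free, as stated):

```python
# Baseline: one full-area adder, 1-cycle latency, one result per clock
baseline_area, baseline_rate = 1.0, 1 * 500e6

# Alternative: the half-area adder split into two 500 MHz stages has
# 2-cycle latency but still retires one add per clock; two such adders
# in parallel double the rate at the same total area.
alt_area, alt_rate = 2 * 0.5, 2 * 500e6

print(alt_area == baseline_area, alt_rate / baseline_rate)  # True 2.0
```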
 
Here is an ALU patent from Micron.


Hybrid ALU

Methods and apparatus for improving the efficiency of an arithmetic logic unit (ALU) are provided. The ALU of the invention combines the operation of a single-cycle ALU with the processing speed of a pipelined ALU. Arithmetic operations are performed in two stages: a first stage that produces separate sum and carry results in a first cycle, and a second stage that produces a final result in one or more immediately subsequent cycles. While this produces final results in two or more clock cycles, useable partial results are produced each cycle, thus maintaining a one operation per clock cycle throughput.
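The two-stage scheme in the abstract is essentially carry-save arithmetic. A minimal sketch (my own illustration, not the patented circuit): the first stage compresses the operands into separate sum and carry vectors with no carry ripple at all, and only the second stage performs the one slow conventional add.

```python
def carry_save(a, b, c):
    # Stage 1: an array of independent full adders; every bit position
    # works in parallel, so there is no carry chain here.
    sum_vec = a ^ b ^ c
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1
    return sum_vec, carry_vec

def resolve(sum_vec, carry_vec):
    # Stage 2: the single ripple/lookahead-carry add that produces the
    # final result (the slow part, done once).
    return sum_vec + carry_vec

s, c = carry_save(5, 6, 7)
print(resolve(s, c))  # 18, same as 5 + 6 + 7
```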



Also, it seems Micron has a patent along the lines of Fast14. I'd assume a lot of semiconductor companies have patents on dynamic logic.

Pseudo CMOS dynamic logic with delayed clocks

[0052] Pseudo-CMOS dynamic logic gates with delayed clocks is a new CMOS logic family with potential for extremely fast switching speeds. Unlike static CMOS it has no series connections of logic devices, it requires fewer transistors than either static CMOS or OPL. And, like OPL, only about one half the outputs of the logic gates are required to make a transition of the full power supply voltage during the evaluation of any input to the chain. Like all dynamic circuit families the present invention has the potential for high switching speed and low power consumption.
 