R500 will Rock

Are you saying some parts of the GPU are running in the gigahertz range, but because of how deep the pipelines are, the chip outside the pipeline runs in the hundreds of megahertz to counter, say, the propagation time for data to get all the way through a 10-stage, 1,000 MHz pipeline?

I'm saying a single transistor can switch in a length of time equivalent to hundreds of GHz. That is, 500 GHz = 1 / 2 ps.

A transistor can take 2 to 3 ps to switch (a 333 to 500 GHz rate). However, a single pipeline stage takes a lot longer than the switching time of a single transistor; it is made up of a series of transistors and wires. The signal delay in the wires between the transistors matters too, and it is the total sum of switching and wire delays through the single pipeline stage that must be LESS than the clock period (one over the processor frequency) or else it won't work.
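To put rough numbers on that budget, here is a minimal Python sketch. Every delay and level count in it is an assumed, illustrative value, not data for any real process or chip.

```python
# Back-of-the-envelope timing budget: why 2-3 ps transistors don't give
# multi-hundred-GHz chips. Every number here is an illustrative assumption.

RAW_SWITCH_PS = 2.0       # bare transistor switching time (the "500 GHz" figure)
LEVEL_DELAY_PS = 25.0     # assumed delay of one loaded gate level plus its wiring

def max_clock_mhz(levels, level_delay_ps=LEVEL_DELAY_PS):
    """Max clock for a stage whose critical path crosses `levels` logic levels."""
    period_ps = levels * level_delay_ps
    return 1e6 / period_ps            # f[MHz] = 1e6 / T[ps]

print(f"raw device rate              : {1e6 / RAW_SWITCH_PS / 1e3:.0f} GHz")  # 500 GHz
print(f"CPU-like stage (~20 levels)  : {max_clock_mhz(20):.0f} MHz")          # 2000 MHz
print(f"GPU-like stage (~100 levels) : {max_clock_mhz(100):.0f} MHz")         # 400 MHz
```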


I'm not sure that's true, as it determines how quickly a signal can propagate through the chip. Given infinite cooling, the time it takes for a signal to propagate through the longest pathway in the chip determines the maximum clockspeed of that chip. I claim that this is one reason why, even with extreme cooling, more modern processors clock higher than older ones.

If by "pathway" you mean a series of transistors with wire delays that make up a single pipeline stage, then yes. That determines the fastest frequency.
And the wire delays and transistor switching speeds both get faster with each generation of manufacturing process. However, transistor speed improves at a greater rate than wire delay, and so wire delay is a larger component of the total than in the past.


I really doubt this is true, for the simple reason that GPUs need to hide many cycles of latency to keep texture reads moving at a good clip. Also consider that the performance hit for state changes (which is, essentially, flushing the pipelines) on a GPU is huge (on the order of hundreds of cycles).

You doubt that a single pipeline in a GPU can do a bilinear filter per clock? I never said it did the WHOLE thing -- fetching the data and writing it out. I was talking about one of the pipeline stages in that deep pipeline you talk about. One of the stages in a GPU does work on the order of a bilinear filter per clock. Roughly speaking, this represents the granularity of pipeline stage operation in a GPU. In a CPU the pipeline stage granularity is about one 64-bit add per clock (though there are many pipeline stages and the latency is much more than one clock, the throughput of a pipeline is one add per clock). On a GPU the throughput per stage is much more than that; pipeline stages are larger.

Thus, with longer pipeline stages (and many many total stages in a pipeline) GPUs are lower frequency.

If power density were the only reason, you could slap liquid nitrogen cooling on a GPU and run it at 2 GHz.
But there are limits to clock speed that are not heat related and GPUs run into those much earlier because they are doing more per clock.

It is as simple as that.


Note that the clock in a chip is a signal that kicks off the start of processing on a pipeline stage. It does not specify how fast each transistor works in a stage... each stage is essentially an asynchronous domain that is triggered by the clock and that must finish its work and put the resulting data onto the inputs of the next stage before the next clock signal comes around.
The length of time between these signals is longer on a GPU because more work is expected to be done (more transistors and wires between pipeline start and finish) per clock than on a CPU.
 
Scott C said:
Thus, with longer pipeline stages (and many many total stages in a pipeline) GPUs are lower frequency.

If power density were the only reason, you could slap on liquid nitrogen cooling on a GPU and run it at 2Ghz.
But there are limits to clock speed that are not heat related and GPUs run into those much earlier because they are doing more per clock.

It is as simple as that.

The fact that GPUs currently do more work per clock than CPUs does limit them to relatively low frequencies. However, what exactly precludes a next-gen GPU designer from splitting these "monster" stages into many smaller stages?

IMO the fact that frequencies of GPUs are currently low has a lot more to do with memory bandwidth/latency than any of the reasons you mention.

Increasing the number of and/or frequency of vertex/pixel pipes makes sense if you want to do more work per external memory word, which currently seems to be the case.
 
psurge said:
The fact that GPUs currently do more work per clock than CPUs does limit them to relatively low frequencies. However, what exactly precludes a next-gen GPU designer from splitting these "monster" stages into many smaller stages?

Basically you get the GPU equivalent of a Pentium 4: a chip that doesn't do much work on each cycle but that you can crank up to high speeds. Apparently there is some optimal amount of work per cycle, and a long pipeline whose stages each do very little work isn't it.
 
Scott C said:
You doubt that a single pipeline in a GPU can do a bilinear filter per clock? I never said it did the WHOLE thing -- fetching the data and writing it out.
I'm not saying it's not possible. I'm saying I don't think it is done. These operations are usually pretty low in precision, so I don't think there'd be a big problem doing them rather quickly. But these GPU's need to have lots and lots of pipeline stages anyway in order to hide texture
fetch latency (particularly for dependent reads).

So, what I'm saying is that even with the large number of operations that must be done each pixel, and can be done each clock, you still have a whole lot of latency that needs hiding. I have a hard time believing that any GPU manufacturer would sacrifice the potential benefits of many pipeline stages, as otherwise they'd just be using FIFO buffers to hide latency, and those don't add anything to performance (granted, since there are many different portions of the chip besides pixel processing, one of the other parts may actually be the primary limitation on clock rate, so FIFO buffers may make sense if they need more latency hiding in the pixel shader than you'd realistically want in the other pipelines...but even then, I'm still willing to bet that GPU's have more pipeline stages than CPU's).

In a CPU the pipeline stage granularity is about 1 64 bit add per clock (though there are many pipeline stages and the latency is much more than one clock, the throughput of a pipeline is 1 add per clock). On a GPU the throughput is much more than that, pipeline stages are larger.

Thus, with longer pipeline stages (and many many total stages in a pipeline) GPUs are lower frequency.
If I remember correctly, for current CPU's the latency is usually about 10-20 clocks. I'm saying that, from what I've read in connection with these message boards, I believe the total number of pipeline stages in a GPU is closer to a few hundred. Even if you split this up relatively evenly between the vertex shader, the pixel shader, and the vertex->pixel processing, and then between the different functional units of the various parts, I still say that they probably have more pipeline stages. And since GPU's deal with lower-precision data, the latency required for the calculations is typically going to be less, so I really don't think that long pipeline stages is a problem for GPU's.

That is to say, I don't doubt that GPU's do more work total per clock, but I claim that they also have more pipeline stages, so I don't believe they do more work per pipeline stage.

If power density were the only reason, you could slap on liquid nitrogen cooling on a GPU and run it at 2Ghz.
But there are limits to clock speed that are not heat related and GPUs run into those much earlier because they are doing more per clock.

It is as simple as that.
Right, and I'm suggesting that the reason they don't clock so high is more related to issues of not being able to spend enough engineering time on the designs.
 
Keep in mind that you can have a 500 MHz chip with 1 GHz dynamic-logic math units. The dynamic-logic design can be mixed with a CMOS circuit layout.
 
g__day said:
I think that is closer to the mark, but the only way I could see that you could run faster and produce less heat is to have either:

1) less current leakage == better materials and fab process

2) less current == a profoundly simpler design that achieves your end with fewer transistors doing exactly the same workload, and therefore less current to switch them.

Point 2 is a holy grail of circuit design. It means your layout needs a revelation in designing silicon to do more with less. Whilst possible, it requires a eureka moment or two. I have seen that happen once or twice with silicon design a long time ago, but today we are processing with a multitude of complex parts that I presume a lot of really smart people are trying to optimise. I expect a step change will occur only if someone has a breakaway insight.

The Fast-14 technology is a new implementation of an old technology. Dynamic logic was replaced years ago by static CMOS because static CMOS was simpler to design. At the time it was noticed that while the chips were easier to design, they were much slower. Fast-14 is a technique for using dynamic logic on the same processes, but with key innovations that solve some of the technical challenges and design complexity. A side benefit is that the implementation doesn't require as many transistors, and the very nature of dynamic logic reduces power consumption.
 
The GeForce3 has approximately 600-800 stages, if I remember a lecture at Chalmers correctly. I don't have any clue how that has changed since, but I think it is a good indicator of the number of pipeline stages present.
 
Quick question - how much does static logic limit you in terms of work you can do per clock? That is - does a move to dynamic logic allow you to significantly increase clock rate without having to break up existing pipe stages?

(Aaron :D)?
 
The initial post had the link. The technology has a few tenets; the primary one seems to be a switch of base or gain technology (materials) to speed charge and discharge times greatly, but this can produce complex signal timings and you need to be much more careful in designing your chips to ensure everything syncs from a timing-signal perspective - and I guess they say complex CADCAM tools can now allow for this.

* * *

What key factors most significantly limit GPU core speed is an interesting question that an authoritative source should be able to answer easily; it can only be one of a few factors:

1) current draw
2) capacitance / signal timings given routing lengths between chip components vs buffer placement

These are simply a function of chip size, peak active transistor utilisation, circuit and routing complexity.

Does anyone have an engineering contact at TSMC, IBM, AMD, INTEL, NVidia or ATi who could respond? I doubt this high-level info is sensitive; they could nail it for us in 2 minutes!
 
In my opinion the biggest factor in clock speeds these days is routing/trace length, although there are many other factors. I believe current draw is more of a factor in limiting the number of transistors rather than how fast the transistor can switch.
 
Chalnoth said:
I'm not saying it's not possible. I'm saying I don't think it is done. These operations are usually pretty low in precision, so I don't think there'd be a big problem doing them rather quickly. But these GPU's need to have lots and lots of pipeline stages anyway in order to hide texture
fetch latency (particularly for dependent reads).

So, what I'm saying is that even with the large number of operations that must be done each pixel, and can be done each clock, you still have a whole lot of latency that needs hiding.

You are clearly either not understanding, or not paying attention to the difference between a
Pipeline Stage
and a
Pipeline

and thus entirely misinterpreting my post.

Note the very critical distinction:
A Pipeline STAGE does a bilinear filter per clock (input is four texels, output one color).
Another pipeline STAGE does perspective correct texture interpolation (before the filter, obviously).

versus:
A Pipeline does a bilinear filter per clock.

The above are all true statements, but the last one has nothing to do with max frequency.

I am NOT saying a single texture stage does the whole bilinear filtering process, which includes the memory access, the perspective correction, the z-check, the actual filtering, the writing/blending of the filter result, and a ton of other stuff that includes a lot of pipeline STAGES to mask latencies, etc.

The key for frequency scaling is the amount of work in an individual pipeline STAGE.
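As a rough illustration of how much arithmetic such a single STAGE represents compared with a single add, here is a sketch in Python; the data layout and precision are assumptions for illustration only, not how any real GPU implements it.

```python
# A sketch of the arithmetic a "bilinear filter" stage performs per clock:
# four texels in, two fractional weights in, one color out. Data layout and
# precision here are assumptions purely for illustration.

def bilinear(t00, t10, t01, t11, wx, wy):
    """Blend four RGBA texels (tuples of 4 floats) with weights wx, wy in [0,1]."""
    def lerp(a, b, w):
        return tuple(ai + (bi - ai) * w for ai, bi in zip(a, b))
    top    = lerp(t00, t10, wx)     # horizontal blend of the upper pair
    bottom = lerp(t01, t11, wx)     # horizontal blend of the lower pair
    return lerp(top, bottom, wy)    # vertical blend of the two results

# Per output color: 3 lerps x 4 channels = 12 multiply-add steps -- a much
# longer logic chain than the single integer add a CPU stage is sized around.
print(bilinear((1, 0, 0, 1), (0, 1, 0, 1), (0, 0, 1, 1), (1, 1, 1, 1), 0.5, 0.5))
```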



And since GPU's deal with lower-precision data, the latency required for the calculations is typically going to be less, so I really don't think that long pipeline stages is a problem for GPU's.

That is to say, I don't doubt that GPU's do more work total per clock, but I claim that they also have more pipeline stages, so I don't believe they do more work per pipeline stage.

Ugh, completely misinterpreting.

It isn't (to be clear) a longer pipeline stage count in a pipeline that I was talking about. I was talking about a longer single pipeline stage.

GPU's pipeline STAGES are longer than cpu pipeline STAGES. A single stage on a CPU has to complete the work of a single add (adds of any bit length are similar in time to compute because it is always the same number of steps, unlike a multiply or divide).
A single add is NOT the whole operation of executing the instruction, and retiring it and writing the results to registers in this case, it is just the task of the single pipeline STAGE to take two numbers as input and expose the result on an output.
On a GPU a single stage might do a bilinear filter, taking four colors and two weights as an input and blending them to a single color output.

That is more work per stage. I am not talking about more work total per clock (of the whole chip). I am not even talking about total work per pipeline. Only per pipeline STAGE.

Right, and I'm suggesting that the reason they don't clock so high is more related to issues of not being able to spend enough engineering time on the designs.


No chance. If they could merely spend more engineering effort and clock faster they could get equal performance from fewer pipelines, and have lower manufacturing costs. If this was the case, the economic reasons for spending more engineering effort to clock higher would be very large --- millions of $ could be saved in manufacturing so millions could be put towards faster clocks.


GPU's very clearly have larger pipeline stages (pipeline granularity, if you will). Manufacturing and design issues for clockability might account for half of the discrepancy with CPUs, but no more.



psurge said:
The fact that GPUs currently do more work per clock than CPUs does limit them to relatively low frequencies. However, what exactly precludes a next-gen GPU designer from splitting these "monster" stages into many smaller stages?

IMO the fact that frequencies of GPUs are currently low has a lot more to do with memory bandwidth/latency than any of the reasons you mention.

Yes, the memory situation is related. But not really for latency (GPUs easily hide that). And for bandwidth? Well, why not make a quad-pipe 1 GHz part instead of an 8-pipe 500 MHz one and use the same bandwidth?

It's not either of those, but it is related.

The pipeline stage length (not number of pipes, but length of individual stages) is best optimized with respect to your inputs and outputs.

A CPU typically reads two values, does something to those values, then writes the result back.

A GPU's most simple operation is far more complex than that. Breaking up a bilinear filter into four pipeline stages has many costs (larger die, more fine-tuning of individual stages, extra logic for more precise clocking between stages), and the clock speed won't increase by a factor of four -- more like a factor of 2.5 at best.
In the end, lengthening a pipeline increases your clock, and thus your throughput (performance) but costs in power and transistors and complexity. These costs mean you have to have fewer total pipelines to fit in the same die size and power budget. Look at the Pentium 4 ... increasing the pipeline length by a bit more than 50% between Northwood and Prescott caused a rough DOUBLING of the number of non-cache transistors in the core. That is a lot of logic to coordinate the clocking and data transfer between pipeline stages. Clockability is sublinear with respect to pipeline stages and power consumption is nearly quadratic. Another reason not to do it.
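A toy model of that sublinear scaling; both delay figures below are made up purely to show the shape of the curve, not taken from any real design.

```python
# Toy model of why splitting one long stage into N shorter ones gives a
# sublinear clock gain: every new stage boundary adds latch/flop overhead.
# Both delay figures are illustrative assumptions.

LOGIC_PS = 2000.0    # assumed logic delay of the original long stage
FLOP_PS  = 400.0     # assumed clk->Q + setup + skew margin per stage boundary

def clock_gain(n_stages):
    one_stage = LOGIC_PS + FLOP_PS
    per_stage = LOGIC_PS / n_stages + FLOP_PS
    return one_stage / per_stage

for n in (1, 2, 4, 8):
    print(f"split into {n} stage(s) -> clock x{clock_gain(n):.2f}")
# Splitting 4 ways buys roughly x2.7 here, not x4 -- and the extra flops,
# clock distribution and in-flight state cost transistors and power on top.
```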

But a root cause is that the definition of "tasks" and the granularity with which you want to break them down are very different in a GPU. There is very little reason to want to break a bilinear filter up into its components, or break a triangle setup or transform into too many little parts. Breaking up operations into parts that are smaller than the smallest single task required to produce an output and feed into another task does little or no good.

Besides, I'm sure it is tough enough as it is to have the current GPU pipeline as long as it is. Longer pipelines require more state information because to keep them filled you have to have more concurrent operations in flight. More state information and more logic to track and respond to it means more transistors per pipeline.

There is a relationship between pipeline stage size (number of logic cascades per stage) and the maximum frequency. There is another relationship for power consumption. And another for number of stages and stage size with respect to supporting circuitry.

The fundamental unit of work for a CPU is register to register adds, shifts, boolean operations, and such small operations.

The fundamental unit of work for a GPU is things like a vertex transform, a texture blend, a fog operation, a perspective correction, a z write, or a pixel write.

This leads to different optimal places on the frequency curve as dictated by optimal pipeline stage length.
 
Scott C said:
GPU's pipeline STAGES are longer than cpu pipeline STAGES.
And I'm saying they are not. Try re-reading my post and see if you can see what I was trying to say, as I certainly did not misunderstand in the way you felt I did.

A single add is NOT the whole operation of executing the instruction, and retiring it and writing the results to registers in this case, it is just the task of the single pipeline STAGE to take two numbers as input and expose the result on an output.
On a GPU a single stage might do a bilinear filter, taking four colors and two weights as an input and blending them to a single color output.
And I'm saying, how do you know these things are done in a single stage? I'm saying that they're almost certainly not. I'm saying that the high latency requirements of memory access have made it very beneficial to have very long pipelines with many stages.

No chance. If they could merely spend more engineering effort and clock faster they could get equal performance from fewer pipelines, and have lower manufacturing costs.
No. Most of the design is copied anyway. Pipeline X in the chip is going to be exactly the same as pipeline Y, and so on. Cutting out a number of pipelines does not suddenly give the engineering teams more time to optimize those pipelines for higher clock speeds/lower power requirements.

If this was the case, the economic reasons for spending more engineering effort to clock higher would be very large --- millions of $ could be saved in manufacturing so millions could be put towards faster clocks.
The problem is time. It doesn't just take more engineering effort, it takes more time. Product cycles for these chips are so short that doing this is just unfeasible for GPU's. It may happen, of course, in a few years when the design of GPU's stabilizes, but not now.

A GPU's most simple operation is far more complex than that. Breaking up a bilinear filter into four pipeline stages has many costs (larger die, more fine-tuning of individual stages, extra logic for more precise clocking between stages), and the clock speed won't increase by a factor of four -- more like a factor of 2.5 at best.
Ah, you finally got to the heart of the issue. What logic is added in a most simple implementation of splitting a pipeline stage into multiple parts? Well, let's say that, to first order, you don't even worry about how long each stage is. You just look at the algorithm and see where it can be split apart. Then you split the pipe and add a small buffer in between.

Now, what is the alternative? The latency has to be hidden anyway, so if this were not done, what you'd need is a FIFO buffer somewhere. As far as I can see, this is little more than taking the FIFO buffer and interlacing it with the pipelines (I'm sure it takes a few more transistors, but it can't be that many).

Do this, with a first approximation to equal path length and you'll probably get a 1.5x performance boost while quadrupling the number of stages. Refine your architecture and you'll do better (which is probably one of a number of reasons why current architectures clock much higher than older ones).

I claim that the flaw in your logic is that in CPU's, there is no reason to have long pipelines other than the fact that you want to get high clock speeds. In GPU's, you're already spending transistors on FIFO buffers. So it should just make good sense to interlace those FIFO buffers with your logic.

And, finally:
The fundamental unit of work for a CPU is register to register adds, shifts, boolean operations, and such small operations.

The fundamental unit of work for a GPU is things like a vertex transform, a texture blend, a fog operation, a perspective correction, a z write, or a pixel write.
...which you can all put in the form of register to register adds, shifts, etc., and thus I see no fundamental difference (most of the GPU operations are, fundamentally, just CPU operations done on many different values at once, and thus are very similar).
 
ScottC - if GPU stages are quite long as you say, then the transistor cost of pipelining them further is going to be a much smaller percentage increase than it would be for increasing pipeline depth in an already multi-GHz CPU. Also, Fast14 claims to automate dynamic logic design. It appears to me that this type of tool makes higher clocks a viable target for GPU designers... and as for power efficiency, they claim their dynamic logic style is more power efficient than static logic at clock speeds above 500 MHz.

Chalnoth - I'm not an expert, but... increased throughput comes at the expense of increased latency in cycles. I'm assuming the average FIFO runs at core clock, so hiding a roughly identical latency in real time (e.g. a dependent texture access hitting external memory) requires a corresponding increase in the number of pixels that must be in flight.
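A quick sketch of that scaling; the 200 ns round trip and one pixel per clock per pipe are assumed figures for illustration only.

```python
# Rough arithmetic behind the FIFO-depth point: hiding a fixed real-time
# memory latency needs (latency x clock) fragments in flight.

def in_flight(latency_ns, core_clock_mhz, pixels_per_clock=1):
    cycles = latency_ns * core_clock_mhz / 1e3    # ns x MHz / 1000 = cycles
    return cycles * pixels_per_clock

for clk in (400, 500, 1000, 2000):
    print(f"{clk:4d} MHz core, 200 ns miss -> {in_flight(200, clk):4.0f} pixels in flight")
# Doubling the core clock doubles the buffering (FIFO entries or pipeline
# registers) needed to cover the same dependent-texture fetch.
```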
 
Chalnoth said:
Scott C said:
GPU's pipeline STAGES are longer than cpu pipeline STAGES.
No chance. If they could merely spend more engineering effort and clock faster they could get equal performance from fewer pipelines, and have lower manufacturing costs.
No. Most of the design is copied anyway. Pipeline X in the chip is going to be exactly the same as pipeline Y, and so on. Cutting out a number of pipelines does not suddenly give the engineering teams more time to optimize those pipelines for higher clock speeds/lower power requirements.

Chip design is unfortunately more complicated than that. Given a group of identical instances (pipeline #0 .. 15), the *preferred* design flow is to treat every instance like a unique object, then run the synthesis/placement/routing algorithms as if you were dealing with one large "blob of gates." This seems counterintuitive, but you must consider that the ideal placement of a gate is a function of all its neighbors. Therefore, in the *ideal* flow, an auto-layout tool will generate a layout based on the process's physical cost metrics -- the fact that the 16 instances are actually identical doesn't factor into the tool's decision.

But today's (multi-million gate) chips are too large for this traditional flow. So the layout tools now support a 'hierarchical' approach, which is pretty much a "divide and conquer" strategy. The design is broken into several big chunks, and each chunk is processed one after the other. This reduces the runtime of the layout process because the tool is crunching smaller data sets instead of one ridiculously large database (think of the classic sorting problem: would you rather sort 1,000,000 objects once, or 1000 sets of 1000 objects?). The downside, obviously, is that the tool's view/focus is limited to one partition per iteration, so overall efficiency (versus a "flat" all-at-once flow) decreases.

As Chalnoth has pointed out, none of us work at NVidia/ATI, so we don't really know how they tackle these problems. My intuition tells me both companies will mix and match. For example, perhaps the pipelines near the center of the die are all identical (meaning that the first instance was run through the place&route tool, then the remaining pipelines were just cut/pasted from the first instance), but the pipelines near the die edge aren't, because of their proximity to the MIU/BIU (memory/bus interfaces) -- the MIU/BIU burn more power, so the die-edge pipelines are laid out to reduce power density (I'm just making this up as a hypothetical...)

If ATI/NVidia treated ALL pipelines like "hard macros" (i.e. cut/paste objects from a single master template), then you would be correct -- the major design effort is budgeted to the optimum layout of the 'master template', with only inconsequential work expended on the cut/paste replication process. I'm trying to suggest the equation isn't so simple. Some pipelines are likely clones, and others are unique instances (different from a layout perspective).

The problem is time. It doesn't just take more engineering effort, it takes more time. Product cycles for these chips are so short that doing this is just unfeasible for GPU's. It may happen, of course, in a few years when the design of GPU's stabilizes, but not now.

Agreed. Unlike CPUs, GPU architecture changes radically from generation to generation -- there's little time in the development cycle for micro-layout optimizations (as with CPUs.)

The fundamental unit of work for a CPU is register to register adds, shifts, boolean operations, and such small operations.

The fundamental unit of work for a GPU is things like a vertex transform, a texture blend, a fog operation, a perspective correction, a z write, or a pixel write.
...which you can all put in the form of register to register adds, shifts, etc., and thus I see no fundamental difference (most of the GPU operations are, fundamentally, just CPU operations done on many different values at once, and thus are very similar).

That's true, but you're ignoring *quantity* of hardware-units.

Engineering minutiae aside, the layman's rule of thumb is "for a given process technology and design style, area/speed/power are related to a constant." The exact relationship is quite complex (non-linear), but suffice it to say, I can make something run faster if I'm willing to burn more power (or increase area). Conversely, I can reduce gate area if I'm willing to trade power consumption and/or speed.

At one extreme, I recall my friend's senior lab project: design a serial (1-bit calc per clock cycle) 64-bit multiplier. His implementation required <5000 gates (written in VHDL), had a latency and throughput of ~64x64x64 cycles (260k clocks), and took him about one academic quarter (10 weeks). Much of that time was spent learning the basics of VHDL and finding "the magic textbook", or so he told me.

At the other *extreme* are the ALU/FPU cores in the Athlon64 and Pentium 4. In terms of FPU multiplier power, they have what, a *single* 64-bit "Wallace tree" (micro-architecture) multiplier? They burn lots of power, but run at phenomenal sustained throughput (>2 gig 64-bit fp multiplies/second). That *single* structure occupies the majority of the FPU's die area (and dynamic power consumption). And the layout probably took a dozen PhD gurus the equivalent of several *man-years* to get right.

Even ignoring design time, and allowing for an "infinite development cycle", there is a balance point between a math unit's area and max clock speed. Ignoring storage (sequential) elements, a bit-serial multiplier is easily <500 gates, as any FPGA designer can attest, and calculation precision scales linearly with size (entirely due to the storage shift registers). (*Bit-serial means 1-bit throughput per cycle, not 1-bit precision, hehe!) That's obviously unsuitable for modern real-time 3D graphics. I don't know what a 32-bit multiplier @ 2 GHz would look like, but my guess is several times bigger than a 500 MHz unit.
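For contrast, here is a behavioral sketch of the generic textbook shift-add scheme a bit-serial multiplier uses (not the specific student design described above), which is why it needs so few gates and so many cycles.

```python
# Behavioral sketch of a shift-add (serial) multiplier: one partial-product
# step per "clock", trading throughput for a tiny amount of hardware.

def serial_multiply(a, b, width=64):
    """Multiply two unsigned `width`-bit ints, one bit of `b` per cycle."""
    acc, cycles = 0, 0
    for i in range(width):                 # one iteration ~= one clock cycle
        if (b >> i) & 1:
            acc += a << i                  # add the shifted partial product
        cycles += 1
    return acc & ((1 << (2 * width)) - 1), cycles

product, cycles = serial_multiply(123456789, 987654321)
print(product, "in", cycles, "cycles")     # vs. one result/cycle from a Wallace tree
assert product == 123456789 * 987654321
```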

Given infinite cooling, the time it takes for a signal to propagate through the longest pathway in the chip determines the maximum clockspeed of that chip.

In the early days of digital logic design, that was an accurate generalization. In modern multi-million gate ASICs, the clock ceiling is typically limited by the clock "skew" in the distribution tree. That's really more off-topic than I want to get into, and it's not really a device-physics limitation -- rather an issue between the layout, clock-tree synthesis, and validation tools.
 
asicnewbie said:
Chip design is unfortunately more complicated than that. Given a group of identical instances (pipeline #0 .. 15), the *preferred* design flow is to treat every instance like a unique object, then run the synthesis/placement/routing algorithms as if you were dealing with one large "blob of gates." This seems counterintuitive, but you must consider that the ideal placement of a gate is a function of all its neighbors. Therefore, in the *ideal* flow, an auto-layout tool will generate a layout based on the process's physical cost metrics -- the fact that the 16 instances are actually identical doesn't factor into the tool's decision.
This is true, but given that these companies are building most of their architectures to be scalable, it makes much more sense to instead have the large, complex parts of the chip remain essentially the same, varying only very slightly with the number of units implemented.

As Chalnoth has pointed out, none of us work at NVidia/ATI, so we don't really know how they tackle these problems. My intuition tells me both companies will mix and match. For example, perhaps the pipelines near the center of the die are all identical (meaning that the first instance was run through the place&route tool, then the remaining pipelines were just cut/pasted from the first instance), but the pipelines near the die edge aren't, because of their proximity to the MIU/BIU (memory/bus interfaces) -- the MIU/BIU burn more power, so the die-edge pipelines are laid out to reduce power density (I'm just making this up as a hypothetical...)
I would be willing to bet this is the case, due to the existence of many implementations of the same architecture at different pipeline counts. I hope I didn't make anybody think I felt the entire design of these different versions of the same architecture was just a "cut and paste" product. If it were, it'd take no time to release secondary products. nVidia and ATI probably also strategically plant the input/output interfaces in such a way that they can be made malleable, to be re-computed for every iteration of the architecture at minimal cost (in other words, these "pipelines near the die edge" would mostly consist of the memory controller and AGP/PCIe interfaces, as well as, if they don't take up enough space, other portions of the processing that don't necessarily need to be linked to the various pipelines, such as texture compression/decompression, framebuffer/z-buffer compression, the video processor, etc.).

Edit: I don't think I actually pointed that out....
 
Chalnoth said:
Scott C said:
GPU's pipeline STAGES are longer than cpu pipeline STAGES.
And I'm saying they are not. Try re-reading my post and see if you can see what I was trying to say, as I certainly did not misunderstand in the way you felt I did.

Perhaps, but you are still arguing in what I see as completely orthogonal to my main points. I'll try to be clear and less wordy this time.

Chalnoth said:
A single add is NOT the whole operation of executing the instruction, and retiring it and writing the results to registers in this case, it is just the task of the single pipeline STAGE to take two numbers as input and expose the result on an output.
On a GPU a single stage might do a bilinear filter, taking four colors and two weights as an input and blending them to a single color output.
And I'm saying, how do you know these things are done in a single stage? I'm saying that they're almost certainly not. I'm saying that the high latency requirements of memory access have made it very beneficial to have very long pipelines with many stages.
Most certainly I agree that the latency has driven pipeline length. The root question is clockability however, which has not driven pipeline length.
Because there is so much work to do in a GPU in a single pipeline, it is fairly easy to start the read for a texture lookup many many pipeline stages before the hardware that blends the results needs the data. In many cases, FIFOs aren't even required to avoid a pipeline stall.

You want to know why I think the pipeline stages do more work? I'll get to that but the basic reason is that the chips don't clock higher, even with super cooling. If the stages were more granular they would. And in chip design you try to target all the stages to have the same approximate number of logic cascades so that you don't end up with 200 fast stages waiting on 4 slow ones all the time.

The empirical evidence is that the stages do more work in a GPU than a CPU.

Chalnoth said:
No chance. If they could merely spend more engineering effort and clock faster they could get equal performance from fewer pipelines, and have lower manufacturing costs.
No. Most of the design is copied anyway. Pipeline X in the chip is going to be exactly the same as pipeline Y, and so on. Cutting out a number of pipelines does not suddenly give the engineering teams more time to optimize those pipelines for higher clock speeds/lower power requirements.

That wasn't my point. Let me clarify. There is no chance that extra engineering time by the GPU makers is the prime reason that they don't clock at CPU speeds.
Could they clock faster? Yes. But it isn't just a matter of willpower and time; it is also a tradeoff in complexity, die size, and power. In the end it might be twice the MHz and half the pipelines, but use more power and an even larger die size! If clocking faster didn't have these costs, it would make sense to spend the extra effort and time and save money on the manufacturing side.

The problem is time. It doesn't just take more engineering effort, it takes more time. Product cycles for these chips are so short that doing this is just unfeasible for GPU's. It may happen, of course, in a few years when the design of GPU's stabilizes, but not now.

It's not just time. It won't happen in a few years either, because clocking higher means breaking larger pipeline stages into smaller ones, using more power (nonlinearly more) and more die size. That makes sense for a CPU, where you can't just add pipelines and where caches take most of the die area anyhow. It makes no sense for a GPU.

Chalnoth said:
A GPU's most simple operation is far more complex than that. Breaking up a bilinear filter into four pipeline stages has many costs (larger die, more fine-tuning of individual stages, extra logic for more precice clocking between stages) and the clock speed won't increase by a factor of four, more like a factor of 2.5 at best.
Ah, you finally got to the heart of the issue. What logic is added in a most simple implementation of splitting a pipeline stage into multiple parts? Well, let's say that, to first order, you don't even worry about how long each stage is. You just look at the algorithm and see where it can be split apart. Then you split the pipe and add a small buffer in between.

Now, what is the alternative? The latency has to be hidden anyway, so if this were not done, what you'd need is a FIFO buffer somewhere. As far as I can see, this is little more than taking the FIFO buffer and interlacing it with the pipelines (I'm sure it takes a few more transistors, but it can't be that many).

It isn't that simple. If your one-stage thingy that requires latency hiding can just as easily wait a few cycles rather than become more complicated, you might choose to do that.
Also, --- what latency ---? Most things are designed to avoid latency by kicking off the reads several clocks before being required. In most of a GPU there isn't a whole lot of latency to worry about, especially on the geometry side.
Latency is generally easy to hide as long as there is no branching, so the harder things are z-checks and dependent reads.



Chalnoth said:
I claim that the flaw in your logic is that in CPU's, there is no reason to have long pipelines other than the fact that you want to get high clock speeds. In GPU's, you're already spending transistors on FIFO buffers. So it should just make good sense to interlace those FIFO buffers with your logic.

Flaw? I have no flaws ;~D
I agree on the CPU point. I also agree that GPUs have other reasons to add pipeline stages. However I don't see how these latency hiding stages have a single thing to do with clockability. They don't have to break down a task into smaller parts, they can instead just use buffers. Breaking these things into smaller parts is a lot more work and increases the complexity, power, and die size more than just using buffers and allowing for multiple in-flight ops within a pipe.

Chalnoth said:
And, finally:
The fundamental unit of work for a CPU is register to register adds, shifts, boolean operations, and such small operations.

The fundamental unit of work for a GPU is things like a vertex transform, a texture blend, a fog operation, a perspective correction, a z write, or a pixel write.
...which you can all put in the form of register to register adds, shifts, etc., and thus I see no fundamental difference (most of the GPU operations are, fundamentally, just CPU operations done on many different values at once, and thus are very similar).

Yet there is absolutely no reason you would want to go through the hassle of doing that in a GPU. The reason those ops are fundamental in a CPU is that you actually might want to get at the intermediate results, thus they need to be in registers, etc.
On a GPU you don't want to get at the intermediate value in a triangle setup or perspective correction calculation, and putting such results into a buffer or register is a waste of transistor and power budget. The only exception is in shaders.
 
I wouldn't call "vertex transform" a fundamental unit of work. It's clearly implemented on most GPUs in terms of more primitive ops (dot product) which themselves are implemented via more primitive ops.
 
Yes, DemoCoder understands.

In one of ATI's video presentations it was stated that the basic block layout is something like this:

Vertex Fetch --> Vertex Shader --> Triangle Setup --> Pixel Shader --> FB fog+Blend

Each of these blocks is highly complex (several hundred stages). It was stated that it literally takes thousands of clock cycles to go from the front end of the graphics chip to the back end.

VPUs are 'an order of magnitude' longer (than CPUs) in numbers of pipeline stages.
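A quick sanity check of that claim, using assumed per-block stage counts (not ATI's actual figures) for the layout above.

```python
# Illustrative stage counts per block; the total lands in the "thousands of
# clock cycles front to back" range quoted above.

stages = {
    "Vertex Fetch":   300,
    "Vertex Shader":  600,
    "Triangle Setup": 300,
    "Pixel Shader":   700,
    "FB fog+Blend":   200,
}
total = sum(stages.values())
print(f"~{total} stages -> ~{total} cycles of latency from front end to back end")
```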
 
DemoCoder said:
I wouldn't call "vertex transform" a fundamental unit of work. It's clearly implemented on most GPUs in terms of more primitive ops (dot product) which themselves are implemented via more primitive ops.

Exactly. Dot products are a component of a vertex transform. And in addition they are useful on their own (lighting).
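A minimal sketch of that decomposition; this is generic linear algebra, not any particular GPU's datapath.

```python
# A 4x4 vertex transform decomposes into four 4-wide dot products.

def dot4(a, b):
    return sum(x * y for x, y in zip(a, b))     # 4 multiplies + 3 adds

def transform(matrix_rows, vertex):
    """matrix_rows: four 4-tuples (rows of a 4x4 matrix); vertex: (x, y, z, w)."""
    return tuple(dot4(row, vertex) for row in matrix_rows)

identity = ((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1))
print(transform(identity, (2.0, 3.0, 4.0, 1.0)))   # -> (2.0, 3.0, 4.0, 1.0)
```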

Interesting how a dot product and a bilinear filter blend are roughly the same amount of work for two different pipeline stages. Could it be that on today's processes a dot-product stage or blend stage can complete work at a rate of ~500 MHz? And that a single add goes at 2.5 GHz?

Funny how those speeds are roughly GPU and CPU speeds. hmmm.
 