General Purpose (Graphics) Processor

Generally speaking, modern GPUs only expose programmability at a couple of very specific points in the pipeline, i.e. the VS and PS. The number of programmable "spots" will increase in the future, but it is likely that a large part of the pipe will always remain fixed function, e.g. depth and stencil testing, texture addressing and filtering, etc. This is unlikely to change (any time soon), as programmable units capable of coming even close to the performance levels seen today would be massive. You may of course get the option to do these functions programmatically (e.g. ps2.0 and up can do a large part of the address-gen and filter functions, albeit slowly).

Basically, the internals of a modern GPU still look very different to those of a CPU; you might see some blocks with similar functionality, but how it all fits together is very different. This means that data flow tends to be constrained in ways that a CPU could never consider, e.g. as described in ET's post.

So although they are growing closer, they are still worlds apart, and will remain so for some time to come.

John.
 
arjan de lumens said:
Hmm - when I saw pictures of the AXE reference card design, I wondered why the chip needed such a large package, especially given that it didn't have a 256-bit external memory bus or any other fancy feature that normally leads to such large package sizes.

And package size doesn't necessarily say very much about the chip size - the K6-2 is not a terribly large chip at 81 mm^2, although it comes in a rather large package to fit into Socket 7 mobos.

This is going a bit off topic, but anyway...
The core size can be seen from the backside of the package, so I measured it: the core is roughly 20x20 mm, so there is your 400 mm^2 chip. :) The package contains around 620 pins/balls.

(Off the record, I have been told that the core logic implements a monstrous 11-channel memory controller: logically, 8 channels for the eDRAM and the rest for external memory and for the path to another chip in dual-chip mode.)
 
On a macro scale, i.e. a complete computer network for a random corporation, you break things down into individual units, like servers, switches, routers, workstations, etc. But to make it perform specified functions, you have to break the individual units down into functional units. 'Where is the DNS? On the server or on the router?'

Essentially, you need a set of distributed functional units to perform a specific function. And on that macro scale, from a logical point of view, separate units that do only one thing are much easier and better to manage than units that do a lot of things. 'Why is DNS broken? Does one of the units conflict with one of the others?'

But to sell those individual 'boxes', you want the smallest number of units that perform the largest number of functions possible. One box with one chip that can do it all would be best. You get the most bang for your buck.

That does not mean that more is better. Sure, it is very nice to have a large number of units that can do it. But for each specified task, you only want one that does it well and fast!

On that count, a 'General Purpose Unit' is really great, as long as you can specify which function it will perform in a certain task.

Being 'General Purpose' is bad when you have lots of redundant units that could do it as well, or when they conflict with one another. We want all the parts that make up our network to do something (otherwise they are wasted) and to complement the whole.

For blindingly fast and superb-looking graphics, it would be best (I think) if all parts of the hardware involved contributed to the work to be done as best they could. A unit that sits idle because there is no task it can do is just wasted. Let it contribute, but make sure there are no conflicts.

Does the above make sense?
 
Yep, the basic idea does make sense. Why have dedicated fixed FN HW that might sit idle half the time when you could add another generic programmable unit that can do that function, or something else when that function isn't required?

The reason why this falls down is simply that the number, and hence the area, of required generic units would be very large in comparison to a fixed FN unit.

E.g. consider trilinear filtering. This requires, what, 7 lrp's, or 7 muls and 7 mad's. When hard coded that's exactly what it needs, maybe plus some storage for pipelining. If you wanted the same performance in a GP unit you would need 14 pipelines (OK, you might make the pipes do lrp, but that makes them bigger as well; let's say 7 pipes anyway). Each of these pipelines needs a bunch of storage and other stuff to make it useful for other things, so each one would be at _least_ as big as a fixed FN filter unit. So just to get the same filter performance you need 7x as much area, just to do the filtering.
The other aspect of this is that you've made your pipeline much wider for much more of the chip: e.g. the fixed FN filter might take eight 32-bit buses and output a single one, whereas the GP approach requires all 8 buses to be carried all the way down the pipeline.
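To make the 7-lrp count concrete, here is a minimal C sketch of the filter-only arithmetic, per colour component (the names are mine, and the eight texels plus the three fractional weights are assumed to have already been produced by the addressing logic):

```c
/* Trilinear = two bilinear filters plus one lerp between mip levels.
   t[0..3] are the texels of the finer mip level, t[4..7] those of the
   coarser one; fx, fy, fl are the fractional blend weights. */

static float lerp(float a, float b, float t)
{
    return a + t * (b - a);   /* one lrp: roughly a mul plus a mad */
}

float trilinear_filter(const float t[8], float fx, float fy, float fl)
{
    /* Bilinear in the finer level: 3 lrps. */
    float fine   = lerp(lerp(t[0], t[1], fx), lerp(t[2], t[3], fx), fy);
    /* Bilinear in the coarser level: 3 lrps. */
    float coarse = lerp(lerp(t[4], t[5], fx), lerp(t[6], t[7], fx), fy);
    /* One lrp between the two levels: 7 lrps total per component. */
    return lerp(fine, coarse, fl);
}
```

A fixed FN unit does this for all four colour components in parallel, every clock; the argument above is about what it costs a GP design to match that rate.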

Basically this will probably happen at some point, but only when we have large amounts of area going free...

John.
 
JohnH said:
Yep, the basic idea does make sense. Why have dedicated fixed FN HW that might sit idle half the time when you could add another generic programmable unit that can do that function, or something else when that function isn't required?

The reason why this falls down is simply that the number, and hence the area, of required generic units would be very large in comparison to a fixed FN unit.

E.g. consider trilinear filtering. This requires, what, 7 lrp's, or 7 muls and 7 mad's. When hard coded that's exactly what it needs, maybe plus some storage for pipelining. If you wanted the same performance in a GP unit you would need 14 pipelines (OK, you might make the pipes do lrp, but that makes them bigger as well; let's say 7 pipes anyway). Each of these pipelines needs a bunch of storage and other stuff to make it useful for other things, so each one would be at _least_ as big as a fixed FN filter unit. So just to get the same filter performance you need 7x as much area, just to do the filtering.
The other aspect of this is that you've made your pipeline much wider for much more of the chip: e.g. the fixed FN filter might take eight 32-bit buses and output a single one, whereas the GP approach requires all 8 buses to be carried all the way down the pipeline.

Basically this will probably happen at some point, but only when we have large amounts of area going free...

John.

Yes, true. But those same 7 or 14 units could do a lot of other tasks as well. So, instead of splitting up all those fixed-function units, you would only need the same number of general-function units to run the same number of tasks.

When you want another function executed, they could do that as well; you don't need another set to replace the other fixed-function unit. So, the total number of units needed could be much smaller.
 
DiGuru,
In a dedicated piece of hardware, you can usually have your N units working at, say, >90% efficiency. On a general purpose processor you wouldn't stand a chance of coming close to achieving that.
 
Trilinear interpolation takes a lot more than ~7 multiplies and adds. The basic operation of trilinear is as follows:
  • First determine the appropriate mipmap level. This requires taking the logarithm of the derivatives of the texture coordinates with respect to screen coordinates (although in practice, you can get away with doing this at the polygon vertices only and interpolating the results over the polygon: 2 multiplies + 2 adds)
  • Then scale the texture coordinates according to mipmap level, and from those results compute the memory address of each of the 8 needed texels (2 shifts + lots of bit-fiddling)
  • Then compute the relative weight of each texel. (2 multiplies for each of 8 texels => 16 multiplies)
  • Then fetch the 8 texels from memory or texture cache and unpack them to an internal generic RGBA format (8 loads + a lot of bit-fiddling again)
  • Then multiply each texel by its weight - that's one multiply for each of the 4 color components of each texel (32 multiplies)
  • Finally add together the weighted texel results (28 adds)
In current pixel shaders, this is all wrapped neatly up into 1 instruction, which you can't just split up without running into a ~50-instruction avalanche requiring either absurdly long execution times or huge circuits. So dedicated circuits/instructions for texturing will be with us for a very long time.
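For what it's worth, here is a naive C sketch of the whole sequence above, just to show where a ~50-instruction avalanche comes from. The simplifications are mine (square power-of-two RGBA8 texture, derivatives already expressed in base-level texels, edge clamping and mip-range checks omitted); it is not a description of any actual GPU circuit.

```c
#include <math.h>
#include <stdint.h>

typedef struct { float r, g, b, a; } RGBA;

/* Unpack one RGBA8 texel into a generic float format -- the
   "bit-fiddling" step, repeated for each of the 8 texels. */
static RGBA unpack(uint32_t t)
{
    RGBA c = { (float)((t >> 24) & 0xFF), (float)((t >> 16) & 0xFF),
               (float)((t >>  8) & 0xFF), (float)( t        & 0xFF) };
    return c;
}

/* mips[l] points to mip level l; log2_size is log2 of the base width. */
RGBA trilinear_lookup(const uint32_t *mips[], int log2_size,
                      float u, float v, float du, float dv)
{
    /* Mip level from the texel-space derivative length (no clamping). */
    float lod   = 0.5f * log2f(du * du + dv * dv);
    int   level = (int)lod;
    float fl    = lod - (float)level;

    RGBA  sum     = { 0.0f, 0.0f, 0.0f, 0.0f };
    float w_level = 1.0f - fl;              /* finer level's weight */

    for (int l = level; l <= level + 1; l++) {
        /* Scale coordinates to this level, derive texel addresses
           (edge wrapping/clamping omitted). */
        int   size = 1 << (log2_size - l);
        float su = u * (float)size - 0.5f, sv = v * (float)size - 0.5f;
        int   x = (int)su, y = (int)sv;
        float fx = su - (float)x, fy = sv - (float)y;

        for (int j = 0; j <= 1; j++) {
            for (int i = 0; i <= 1; i++) {
                /* Two multiplies per texel for its weight... */
                float w = w_level * (i ? fx : 1.0f - fx)
                                  * (j ? fy : 1.0f - fy);
                /* ...one fetch-and-unpack... */
                RGBA t = unpack(mips[l][(y + j) * size + (x + i)]);
                /* ...and four weighted accumulates. */
                sum.r += w * t.r;  sum.g += w * t.g;
                sum.b += w * t.b;  sum.a += w * t.a;
            }
        }
        w_level = fl;                       /* coarser level's weight */
    }
    return sum;
}
```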
 
I did say _filter_!

The number of math ops used in address gen is, as you point out, even greater; I couldn't be bothered to work out the exact numbers!

John.
 
DiGuru said:
JohnH said:
Yep, the basic idea does make sense. Why have dedicated fixed FN HW that might sit idle half the time when you could add another generic programmable unit that can do that function, or something else when that function isn't required?

The reason why this falls down is simply that the number, and hence the area, of required generic units would be very large in comparison to a fixed FN unit.

E.g. consider trilinear filtering. This requires, what, 7 lrp's, or 7 muls and 7 mad's. When hard coded that's exactly what it needs, maybe plus some storage for pipelining. If you wanted the same performance in a GP unit you would need 14 pipelines (OK, you might make the pipes do lrp, but that makes them bigger as well; let's say 7 pipes anyway). Each of these pipelines needs a bunch of storage and other stuff to make it useful for other things, so each one would be at _least_ as big as a fixed FN filter unit. So just to get the same filter performance you need 7x as much area, just to do the filtering.
The other aspect of this is that you've made your pipeline much wider for much more of the chip: e.g. the fixed FN filter might take eight 32-bit buses and output a single one, whereas the GP approach requires all 8 buses to be carried all the way down the pipeline.

Basically this will probably happen at some point, but only when we have large amounts of area going free...

John.

Yes, true. But those same 7 or 14 units could do a lot of other tasks as well. So, instead of splitting up all those fixed-function units, you would only need the same number of general-function units to run the same number of tasks.

When you want another function executed, they could do that as well; you don't need another set to replace the other fixed-function unit. So, the total number of units needed could be much smaller.

You're missing the point: if you replace the various fixed FN units with GP ones, you will need to increase the area massively to get the same performance in most applications (filtering and addressing logic tend not to spend much time idle these days). If you look at the size of a typical high-end chip you'll see that they are already pushing the limits of tool chains, cost, yield, etc. Basically there is no area to spend on this at the moment. It doesn't matter what the benefits would be, it's just not particularly feasible right now; unless of course you're suggesting that we should make most things slower for some reason?

John.
 
A GP unit is not limited to the things a 'generic' CPU does.

A long time ago (~1985), I got some requests from people who wanted their HP-Basic programs, which ran on their 'old' HP 'laptop', converted to programs that could run on a PC. I could not do it. Impossible.

Well, I could do it, but those programs mostly consisted of very complex matrix calculations (which HP Basic did *really* well), and there were at that time no accessible libraries that could perform the same things on a PC. I would have had to write most of it myself, and the first simple test showed us that the HP laptop could do it much better and faster. In 64k, on a slow processor...

Nowadays, that would be peanuts. Current CPUs have no problem with that. And it would be the perfect thing to run on a GPU.

And what about multimedia extensions? Think about this: the hardware in your optical mouse (a dedicated and quite fast DSP) would have set you back about 50,000 bucks about 7 years ago.

A circuit integrated into a CPU that could do the matrix calculation needed for a trilinear lookup (the end result) could probably be used for other things as well. And a shader that can do a LOOP, IF, WHILE or CASE statement could be used as a general FP matrix processor.
 
DiGuru said:
A GP unit is not limited to the things a 'generic' CPU does.
Oh yes it is, unless you have somehow managed to build a computer that exceeds the limitations of Turing machines (very unlikely). Efficiency or performance may be very different, though.
And what about multimedia extensions? Think about this: the hardware in your optical mouse (a dedicated and quite fast DSP) would have set you back about 50,000 bucks about 7 years ago.
7 years? Make that 20. The DSP in a current Microsoft IntelliMouse runs at ~18 MIPS, roughly similar to a 486SX-33, which was getting phased out 7 years ago - I'd say it's closer to $50 than $50,000 anno 1996 (around that time, the PentiumPro was being introduced at 200 MHz). Also, I distinctly remember using an optical mouse at least 10 years ago (it required a special mouse mat back then, but worked just fine).
A circuit integrated into a CPU that could do the matrix calculation needed for a trilineair lookup (the end result) could probably be used for other things as well.
Not necessarily - what it does is a filtered table lookup; shoehorning it into other operations is possible, but tends to be inefficient or to require substantial added circuit area.
And a shader that can do a LOOP, IF, WHILE or CASE statement could be used as a general FP matrix processor.
Yup, although you have to take some care and work with what you've got. For example, for a large-matrix multiply, you will want to do pixel shading, mapping each 'pixel' to 1 element (or better: a group of 4 elements) in the result matrix, if you want good performance. This will exploit the available parallelism in the GPU. If you try calculations that don't naturally fit a parallel execution model, a programmable GPU may be able to do the job, but the performance will suck.
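To illustrate that mapping, here is a plain-C stand-in for the shader (the names are my own invention): each element of the result matrix is one independent 'pixel', and that independence is exactly what the GPU's parallelism feeds on.

```c
/* C = A * B for N x N matrices, arranged the way a pixel shader
   would see it: one invocation per output element. */

void matmul_pixel(const float *A, const float *B, float *C,
                  int N, int row, int col)
{
    /* One 'pixel' of the result: the dot product of row `row`
       of A with column `col` of B. */
    float acc = 0.0f;
    for (int k = 0; k < N; k++)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}

void matmul(const float *A, const float *B, float *C, int N)
{
    /* On a GPU every iteration of this double loop could run in
       parallel: no 'pixel' ever reads another 'pixel's' output. */
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++)
            matmul_pixel(A, B, C, N, row, col);
}
```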
 
arjan de lumens said:
DiGuru said:
A GP unit is not limited to the things a 'generic' CPU does.
Oh yes it is, unless you have somehow managed to build a computer that exceeds the limitations of Turing machines (very unlikely). Efficiency or performance may be very different, though.

A Turing machine only has one execution path and a very strict set of functions. That describes only one functional unit, or more probably a single function of that unit.

It is not a good analogy for a 'random' set of units that should perform a random function. Distributed computing, a computer network or a cluster would be a better representation of a whole system.

And what about multimedia extensions? Think about this: the hardware in your optical mouse (a dedicated and quite fast DSP) would have set you back about 50,000 bucks about 7 years ago.
7 years? Make that 20. The DSP in a current Microsoft IntelliMouse runs at ~18 MIPS, roughly similar to a 486SX-33, which was getting phased out 7 years ago - I'd say it's closer to $50 than $50,000 anno 1996 (around that time, the PentiumPro was being introduced at 200 MHz). Also, I distinctly remember using an optical mouse at least 10 years ago (it required a special mouse mat back then, but worked just fine).

You are comparing a CPU with a DSP ;-)

That's what this discussion is all about. See if you can get that 486SX-33 inside an optical mouse and make it calculate movement from the camera. The mouse you talk about is quite different: it used light-sensitive variable resistors that counted lines on the surface to drive counters; it did not use a DSP to analyze the pictures captured by a camera.

Counting pulses versus doing pattern recognition on the pictures captured by a camera in real time.

A circuit integrated into a CPU that could do the matrix calculation needed for a trilinear lookup (the end result) could probably be used for other things as well.
Not necessarily - what it does is a filtered table lookup; shoehorning it into other operations is possible, but tends to be inefficient or to require substantial added circuit area.

That's why a CPU uses general functions. Split it into useful operations and feed it to the circuit as a specific combination of operations performed in sequence, but executed in a single clock. VLIW components.

And a shader that can do a LOOP, IF, WHILE or CASE statement could be used as a general FP matrix processor.
Yup, although you have to take some care and work with what you've got. For example, for a large-matrix multiply, you will want to do pixel shading, mapping each 'pixel' to 1 element (or better: a group of 4 elements) in the result matrix, if you want good performance. This will exploit the available parallelism in the GPU. If you try calculations that don't naturally fit a parallel execution model, a programmable GPU may be able to do the job, but the performance will suck.

That's why you have to generalize it.
 
DiGuru said:
A Turing machine only has one execution path and a very strict set of functions. That describes only one functional unit, or more probably a single function of that unit.

It is not a good analogy for a 'random' set of units that should perform a random function. Distributed computing, a computer network or a cluster would be a better representation of a whole system.
A system with multiple functional units corresponds roughly to a multi-tape Turing machine; for every multi-tape Turing machine, you can always build a single-tape one capable of doing the exact same operations. Although, as a general rule, Turing machines tend to map to processors only in a very theoretical sense (a processor with 10000 bits of internal state translates to a Turing machine with 2^10000 states, for example) and may not be terribly useful for gauging what is possible from a practical point of view.
And what about multimedia extensions? Think about this: the hardware in your optical mouse (a dedicated and quite fast DSP) would have set you back about 50,000 bucks about 7 years ago.
7 years? Make that 20. The DSP in a current Microsoft IntelliMouse runs at ~18 MIPS, roughly similar to a 486SX-33, which was getting phased out 7 years ago - I'd say it's closer to $50 than $50,000 anno 1996 (around that time, the PentiumPro was being introduced at 200 MHz). Also, I distinctly remember using an optical mouse at least 10 years ago (it required a special mouse mat back then, but worked just fine).

You are comparing a CPU with a DSP ;-)
And how big is the difference? Admittedly DSPs usually have fast multiply instructions and the 486 had a very slow multiply instruction, but the differences aren't that large otherwise. Even SIMD instructions don't usually add more than a ~2-4x performance increase, which brings us up to about what? A Pentium-90?
That's what this discussion is all about. See if you can get that 486SX-33 inside an optical mouse and make it calculate movement from the camera. The mouse you talk about is quite different: it used light-sensitive variable resistors that counted lines on the surface to drive counters; it did not use a DSP to analyze the pictures captured by a camera.

Counting pulses versus doing pattern recognition on the pictures captured by a camera in real time.
OK ...
That's why a CPU uses general functions. Split it into useful operations and feed it to the circuit as a specific combination of operations performed in sequence, but executed in a single clock. VLIW components.
Which adds overhead; for trilinear, there isn't that much useful ground between the 1-instruction wrapped approach of pixel shaders and the ~50 instructions you will need on a CPU, and the decoding, scheduling and register file logic needed for 50 instructions in 1 clock is HUGE. A common problem with large-scale VLIW is that for N instructions you need a register file with ~3N ports, and such register files take O(N^2) area with large hidden constants. Also, the delay of accessing such a register file grows with O(N), so your clock speed goes down almost inversely proportionally to the number of processing units you add.
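As a back-of-the-envelope version of that scaling (my sketch, assuming the usual 2 read ports plus 1 write port per 3-operand instruction):

```latex
\begin{align*}
P &= 3N && \text{ports for } N \text{ instructions per clock} \\
A_{\mathrm{RF}} &\propto P^2 = 9N^2 && \text{each port adds a wordline and a bitline per cell} \\
t_{\mathrm{access}} &\propto P = 3N && \text{cell pitch, hence wire length, grows with } P
\end{align*}
```

So going from 1-wide to 16-wide issue inflates the register file area by a factor of ~256 and the access time by ~16 - which is where the clock speed goes.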
And a shader that can do a LOOP, IF, WHILE or CASE statement could be used as a general FP matrix processor.
Yup, although you have to take some care and work with what you've got. For example, for a large-matrix multiply, you will want to do pixel shading, mapping each 'pixel' to 1 element (or better: a group of 4 elements) in the result matrix, if you want good performance. This will exploit the available parallelism in the GPU. If you try calculations that don't naturally fit a parallel execution model, a programmable GPU may be able to do the job, but the performance will suck.

That's why you have to generalize it.
To what? Having a sea of generic processing units a la PixelFuzion gets inefficient/expensive for 3D graphics rather quickly. And vertex/pixel shader-type architectures work most efficiently if you have an array where you can process each element independently of every other - once you break that independence, the parallelism that makes the GPU so desirable in the first place falls apart like a house of cards.
 
The biggest difference between a DSP and a CPU is that the CPU must be exact, while the DSP has to deliver the best approximation within a given timeframe.

It would bother you greatly if your CPU delivered a rough estimate instead of an exact value because it ran out of time. And it would bother you just as much if your optical mouse only delivered a new coordinate once it was quite sure it was correct.

Computers as a whole would not function at all without very well defined results for each action. But your optical mouse has to move that cursor as well as it can.

The same goes for graphics hardware, as seen in the current speed-versus-image-quality debacle. Older (fixed-function) hardware was allowed to produce a fair approximation. Newer hardware is programmed, which requires the exact same output every time.

That is quite a difference!
 
DiGuru said:
The biggest difference between a DSP and a CPU is that the CPU must be exact, while the DSP has to deliver the best approximation within a given timeframe.

That's not really correct; a DSP itself is just as capable of delivering an answer as exact as any CPU's. However, when running a DSP algorithm you may choose to compromise precision in some way in order for it to run within a required time budget, although many "standards"-driven algorithms require a minimum precision; if you can't meet this precision in the required time then you need a more powerful processor. This is all true irrespective of whether you run the algorithm on a CPU or a DSP.

John
 