General Purpose (Graphics) Processor

Frank

From the point of view of a programmer, a CPU is very 'flat': it offers a set of linear functions, performed in sequence. A GPU, on the other hand, is highly pipelined by design. It is more like a few highly specialized DSPs, embedded in between some generic fixed-function electronics.

But that is only on the outside! Inside, from the POV of the chip designer, a CPU and a GPU have more in common than they differ from one another. And they keep growing closer.

While a lot of functions like an FPU, an MMU and DMA have been incorporated into the CPU, the CPU itself has been broken down into a pipeline composed of highly specialized fixed-function units. And they all use multiple pipelines nowadays.

So, what are the big differences?

Well, for starters, the datatypes used are quite different, although (generalizing) a GPU uses the same ones as some of the sub-units of a CPU (FPU, multimedia extensions). And while the CPU gets more specialized units, the GPU evolves towards a 'flatter', more general purpose CPU.

We all know a CPU can be used as a GPU. But is the opposite possible as well? And would the introduction of a general vector unit, preferably as a macro to incorporate into a processor design, create a hybrid? Or do we need more for that?

They use large (and cheap) PS2 clusters as 'poor man's' vector-processing supercomputers nowadays...

:)

EDIT: It would (I think) at the same time allow GPUs to cut down significantly on the number of transistors used, as they could benefit from the building blocks used to design CPUs, thereby abandoning the ASIC approach. (Edit2: which would *GREATLY* increase the clock speed of the GPU!)
 
About chip macros: some CPUs (like an Alpha-ARM7 processor) and just about all other components are also sold (or are even sold exclusively) as macros. RAMDACs, pixel processors (Bitboys, for example) and just about anything else are no problem. The only thing I'm not sure of is a decent vector processor.

If I were into the graphic chip market and had huge amounts of money ( ;-) ), I would want a project team to check that out, pronto!

It would also be really bleeding fast! That is, together with real nice drivers/compilers, of course.

I am just waiting for you guys to tell me what I'm missing, really.
 
DiGuru said:
So, what are the big differences?
A big difference is the lack of setup and rasterization hardware on a CPU. A GPU can do this quickly while a CPU takes many cycles.

DiGuru said:
We all know a CPU can be used as a GPU. But is the opposite possible as well?
I don't believe it's possible for a GPU to emulate a CPU today, but it will probably be possible someday.

DiGuru said:
And would the introduction of a general vector unit, preferably as a macro to incorporate into a processor design create a hybrid? Or do we need more for that?
SSE added vector units to the CPU.
 
In the shader competition, the "Retro" entry uses pixel shaders to do most of the program.

"All the game logic is in pixel shaders; the C++ code just performs initialisation and stores the time and key press in constants for the shaders to read."
 
I Googled a bit, and there seem to be plenty of vector and pixel processors you can get as a chip macro. To what extent they are designed for ASICs or can be implemented directly on a 'plain' die, I cannot make out from most of the documents.

But it would surprise me if there are no generally programmable ones among them. There are so many that it would be a large project just to sort through them.
 
3dcgi said:
DiGuru said:
So, what are the big differences?
A big difference is the lack of setup and rasterization hardware on a CPU. A GPU can do this quickly while a CPU takes many cycles.

What is needed to do that? Is there a general purpose part that could do that as well?
 
DiGuru said:
What is needed to do that? Is there a general purpose part that could do that as well?
Sure, CPUs are general purpose and they can do setup and rasterization. They just can't do it as fast as dedicated hardware. It requires a good number of math units among other things. You'll need to look at a book if you're interested in the exact calculations. Initially, this was the main purpose of graphics chips. Shaders get all the attention these days, but without good setup, etc. it doesn't matter how good your shaders are.
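To give a rough feel for what setup and rasterization cost in software, here is a bare-bones half-space rasterizer (my own illustration, not how any particular chip does it). Three edge functions get evaluated per candidate pixel; dedicated hardware does this for many pixels in parallel every clock, while a CPU grinds through them one at a time.

Code:
#include <algorithm>

// Bare-bones half-space triangle rasterizer, purely for illustration.
// Counter-clockwise winding is assumed.
struct Vec2 { float x, y; };

static float Edge(const Vec2& a, const Vec2& b, const Vec2& p)
{
    // > 0 if p lies to the left of the edge a->b.
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

void RasterizeTriangle(const Vec2& v0, const Vec2& v1, const Vec2& v2,
                       unsigned* frame, int width, int height, unsigned color)
{
    // Triangle setup: bounding box, clamped to the screen.
    int minX = std::max(0,          (int)std::min({v0.x, v1.x, v2.x}));
    int maxX = std::min(width  - 1, (int)std::max({v0.x, v1.x, v2.x}));
    int minY = std::max(0,          (int)std::min({v0.y, v1.y, v2.y}));
    int maxY = std::min(height - 1, (int)std::max({v0.y, v1.y, v2.y}));

    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x)
        {
            Vec2 p = { x + 0.5f, y + 0.5f };   // sample at the pixel centre
            if (Edge(v0, v1, p) >= 0 && Edge(v1, v2, p) >= 0 && Edge(v2, v0, p) >= 0)
                frame[y * width + x] = color;  // inside all three edges
        }
}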
 
Memory access with a GPU is more limited. You can't read and write to the same memory. In a pixel shader you can read from a texture and write to the specific pixel in the render target that you're currently on. This makes managing data with the GPU very difficult (and impossible without the CPU switching between the render target and texture, which I do in Frogger).
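A crude CPU-side analogy of that ping-pong pattern, just to show the shape of it (a sketch of the idea, not Frogger's actual code): keep two buffers, read from one, write to the other, and swap their roles each pass.

Code:
#include <cstddef>
#include <utility>
#include <vector>

// Ping-pong buffering: never read and write the same surface in one pass.
// bufA and bufB are assumed to have the same size.
void RunPasses(std::vector<float>& bufA, std::vector<float>& bufB, int passes)
{
    std::vector<float>* src = &bufA;   // plays the "texture" this pass
    std::vector<float>* dst = &bufB;   // plays the "render target" this pass

    for (int pass = 0; pass < passes; ++pass)
    {
        for (std::size_t i = 0; i < src->size(); ++i)
            (*dst)[i] = (*src)[i] * 0.5f;   // stand-in per-pixel update

        std::swap(src, dst);                // roles flip for the next pass
    }
}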

CPUs can interface with the outside world. They have access to I/O, which the GPU doesn't, except visually. That's why Frogger must get key presses and can't trigger sounds.

I don't see these limitations changing soon, since GPUs are geared towards a specific use.

In the other direction, for the CPU to work more quickly as a GPU, it'd have to have a way to access memory that performs texture filtering quickly, and similarly do fast Z processing, etc. GPUs have special logic and special caches for this. They can also change the memory layout of textures and other surfaces to be the most convenient for the job, since they are guaranteed no direct access to this memory.
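As a concrete example of what that special logic saves: a single bilinear texture fetch done 'by hand' already costs four memory reads plus a handful of multiplies and adds per pixel. A rough sketch, assuming a single-channel texture and clamp addressing:

Code:
#include <algorithm>

// One bilinear texture sample in software: four texel reads plus the
// weighting arithmetic, per pixel, per texture. On a GPU the texture
// units, texture caches and swizzled memory layout handle all of this.
float SampleBilinear(const float* texels, int w, int h, float u, float v)
{
    // Map [0,1] texture coordinates to texel space, clamped at the borders.
    float x = std::min(std::max(u * w - 0.5f, 0.0f), (float)(w - 1));
    float y = std::min(std::max(v * h - 0.5f, 0.0f), (float)(h - 1));

    int   x0 = (int)x,                  y0 = (int)y;
    int   x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
    float fx = x - x0,                  fy = y - y0;

    float t00 = texels[y0 * w + x0], t10 = texels[y0 * w + x1];
    float t01 = texels[y1 * w + x0], t11 = texels[y1 * w + x1];

    float top    = t00 + fx * (t10 - t00);   // blend horizontally
    float bottom = t01 + fx * (t11 - t01);
    return top + fy * (bottom - top);        // then vertically
}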

Adding this to a general purpose CPU would mean a lot of excess logic and cache area that is meant for only one task. Now, integrating the CPU and graphics on the same chip is possible, and has even been done, but they remain distinct parts even when they share a die, since the aforementioned graphics functionality has no use for general processing.

It may be possible to extend this integration such that the floating point processing logic will be used by both the CPU and the vertex and pixel shaders, but it'd likely not have the performance of a pure 3D chip, so couldn't be used as a general solution. We might have to wait for a shift in functionality -- such as real time ray tracing -- for this to become a practical proposition.
 
Very nice post, ET! Clearly described, very easy to read/follow... Thanks!

ET said:
Adding this to a general purpose CPU would mean a lot of excess logic and cache area that is meant for only one task.

Sounds like a job some FPGA circuitry could do! :) Part, or indeed ALL, of the CPU could be constructed on an FPGA and reconfigured as needed... :)

Assuming FPGAs can ever be brought into the GHz range, of course.
 
Another issue separating GPUs from CPUs is the degree of parallelism available: in a GPU, every vertex in a vertex array and every pixel in a framebuffer can usually be processed completely independently of every other, so you get practically linear performance scaling from adding functional units until you run into die size or memory bandwidth limits. In a standard (single-threaded, single-core) CPU, however, every instruction is supposed to execute as if every previous instruction has completed and no later instruction has started, limiting the throughput to about 2-4 instructions per clock no matter how many transistors you throw at it.
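In code, that independence is the whole story. Every iteration of a toy loop like the one below could be handed to its own functional unit with no coordination at all, which is exactly what a GPU does with its pixels:

Code:
// Toy illustration of why pixel work scales so well: each output element
// depends only on its own input, so the iterations could be split across
// any number of functional units without any inter-pixel communication.
void ShadePixels(const float* in, float* out, int count)
{
    for (int i = 0; i < count; ++i)     // every i is independent
        out[i] = in[i] * in[i] + 0.1f;  // stand-in for per-pixel shading math
}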

If you want to look at a GPU as a CPU replacement, it looks like a massively multi-core (1 pipeline = 1 core), multi-threaded processor (1 vertex or pixel in flight = 1 thread) with a highly vector-oriented instruction set - although the instruction set is not yet nearly as general-purpose as that of a general CPU (VS/PS 3.0, for example, still lack while-loops, a stack, writable memory, recursive functions and pointers, and still haven't got rid of instruction count limits).

As for FPGAs, today you do have beasts like the Xilinx Virtex-II Pro, which contain multiple PowerPC processing cores and are large/fast enough to hold Voodoo1-class 3D cores - at a cost of several thousand dollars. I don't see them replacing the Radeon 9800 Pro anytime soon.
 
arjan de lumens said:
In a standard (single-threaded, single-core) CPU, however, every instruction is supposed to execute as if every previous instruction has completed and no later instruction has started, limiting the throughput to about 2-4 instructions per clock no matter how many transistors you throw at it.
Hypothetically, if you have enough CPU registers (eg RISC-like) and code that has lots of independent calculations (and no branches :)) you could run lots of instructions at the same time. I believe that even the early PowerPCs were managing several instructions at the same time.
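A trivial illustration of that (nothing PowerPC-specific about it): the first loop below is one long dependency chain, so it can't go faster than one add per cycle, while the second has four independent chains that the hardware is free to overlap.

Code:
// Dependent vs. independent work, as seen by a superscalar CPU.
float SumSerial(const float* a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i];                     // every add waits for the previous one
    return s;
}

float SumFourChains(const float* a, int n)   // n assumed to be a multiple of 4
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4)
    {
        s0 += a[i];                    // four independent chains: the CPU
        s1 += a[i + 1];                // can issue these adds in parallel
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}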

In reality, of course, the scheduling to test for the independence of large numbers of instructions would be a combinatorial nightmare.

arjan de lumens said:
If you want to look at a GPU as a CPU replacement, ... (VS/PS 3.0, for example, still lack while-loops, ...)

You "sort of" can in VS3.0, because it has data-dependent tests. For example, you could use a LOOP-ENDLOOP with a BreakC to exit.
 
Simon F said:
arjan de lumens said:
In a standard (single-threaded, single-core) CPU, however, every instruction is supposed to execute as if every previous instruction has completed and no later instruction has started, limiting the throughput to about 2-4 instructions per clock no matter how many transistors you throw at it.
Hypothetically, if you have enough CPU registers (eg RISC-like) and code that has lots of independent calculations (and no branches :)) you could run lots of instructions at the same time. I believe that even the early PowerPCs were managing several instructions at the same time.

Even before that. Pipelining is another (cheaper) form of extracting instruction-level parallelism. So it goes all the way back to the '60s.

Modern general purpose CPUs spend enormous amounts of effort on speeding up sequential execution. That effort would be more or less wasted in a GPU, where workload parallelism is abundant.

Cheers
Gubbi
 
Simon F said:
Hypothetically, if you have enough CPU registers (eg RISC-like) and code that has lots of independent calculations (and no branches :)) you could run lots of instructions at the same time. I believe that even the early PowerPCs were managing several instructions at the same time.
The highest number I have heard for sustained IPC in a general-purpose processor (except for hand-optimized code with lots of manual prefetching and cache blocking) is about 1.4 (for a PowerPC G4, IIRC), even though there are several processors out there with a theoretical maximum IPC of as much as 6 (Alpha 21264, Itanium 2).
Simon F said:
In reality, of course, the scheduling to test for the independence of large numbers of instructions would be a combinatorial nightmare.
I have heard about instruction schedulers for N-way superscalar processors that take only O(N) logic, although I am not sure if they are fast/dense enough to be useful in practice (if they have O(N) delay or take O(N^2) routing area, they aren't really much of an improvement).

Otherwise, you can go the VLIW way and shift the scheduling burden from hardware to the compiler (although e.g. the Itanium architecture is a disaster in that respect).

You "sort of" can in VS3.0, because it has data-dependent tests. For example, you could use a LOOP-ENDLOOP with a BreakC to exit.
Such code will behave as some sort of while-loop with an upper ceiling on the number of iterations, almost but not quite like the while() in C/C++.
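In C/C++ terms it behaves roughly like this (my paraphrase of the semantics, not actual shader code; Converged and Step are just placeholders for whatever the shader computes):

Code:
// A LOOP ... BREAKC ... ENDLOOP construct amounts to a while-loop with a
// hard ceiling on the iteration count, unlike a true C/C++ while().
bool  Converged(float x) { return x < 0.001f; }   // placeholder for the break test
float Step(float x)      { return x * 0.5f; }     // placeholder loop body

float BoundedWhile(float x)
{
    const int kMaxIterations = 256;    // the loop count baked into the shader
    for (int i = 0; i < kMaxIterations; ++i)
    {
        if (Converged(x))              // the data-dependent break
            break;
        x = Step(x);
    }
    return x;
}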
 
Gubbi said:
Even before that. Pipelining is a another (cheaper) form of extracting instruction level parallism. So it goes all the way back to the 60s.
But that, typically, was still only issuing, at most, one instruction per clock. It just made it look as though the instructions took just one cycle to execute.

arjan de lumens said:
You "sort of" can in VS3.0, because it has data-dependent tests. For example, you could use a LOOP-ENDLOOP with a BreakC to exit.
Such code will behave as some sort of while-loop with an upper ceiling on the number of iterations, almost but not quite like the while() in C/C++.
That's why I said "sort of". IIRC, in the VS you can have 4 nested loops, each with a max of 256 iterations. You could lock up the graphics chip for quite a long time with 256^4, roughly 4x10^9, iterations.
 
The good old PixelFuzion 150 was something like what DiGuru is looking for, if I am correct... too bad that the Number Nine bankruptcy was the last nail in its coffin.
 
Wasn't the PixelFuzion 150 ridiculously large, like 400 mm2 (don't remember process)? Also, IIRC, it performed like a voodoo2 - I also seem to remember that the architecture was later turned into a network processor and enjoyed some level of success there?
 
arjan de lumens said:
Wasn't the PixelFuzion 150 ridiculously large, like 400 mm2 (don't remember process)? Also, IIRC, it performed like a voodoo2 - I also seem to remember that the architecture was later turned into a network processor and enjoyed some level of success there?

The process was 0.23µm UMC eDRAM; it had around 3MB of eDRAM. Can't remember the exact size though... But it was smaller than the Verite V4400, which was 0.20µm Micron eDRAM with 128 million transistors and around 400 mm^2.

Well, it turned into a network processor, and there really wasn't a need for any very big changes to the architecture because of its flexible programmability. And AFAIK, ClearSpeed's new GigaFLOPS-class math unit released lately still uses the same basic principles they already used for the PF 150.

EDIT:
and as someone already mentioned Bitboys here, I can reveal that the AXE graphics processor has almost the same package size as an AMD K6-2. (I have both sitting on my desk right now.) So the AXE isn't a small chip either. ;) I can post size comparison pictures if someone wants them... :)
 
Nappe1 said:
and as someone already mentioned Bitboys here, I can reveal that the AXE graphics processor has almost the same package size as an AMD K6-2. (I have both sitting on my desk right now.) So the AXE isn't a small chip either. ;) I can post size comparison pictures if someone wants them... :)
Hmm - when I saw pictures of the AXE reference card design, I wondered why the chip needed such a large package, especially given that it didn't have a 256-bit external memory bus or any other fancy feature that normally leads to such large package sizes.

And package size doesn't necessarily say very much about chip size - the K6-2 is not a terribly large chip at 81 mm^2, although it comes in a rather large package to fit into Socket 7 mobos.
 