OK, it seems there are a lot of things about GPU architecture that I don't understand:
andypski said:
It certainly accesses all the memory that is attached to it, yes. From an individual program that you would write, typically no.
As far as I know, a CPU's memory access path is a lot more complicated than a GPU's. It has to calculate the address, look it up in the L1 cache, translate it to a physical address, look it up in the L2 cache, raise a page fault when the data is not in RAM, etc. As far as I know, a GPU works with very limited caches, and its addresses are hardwired, unpaged and always physical. Overall, it is less complicated and more dedicated.
Certainly has associative caches. Whether they're 8-way or not would be private implementation details.
That makes little sense to me. For example, a texture cache doesn't need associativity because you only need texels that are close together. Or is there a unified cache for all texture units? In that case it would be logical, but I expect every texture unit can have its own tiny cache. I don't know, that just sounds most probable to me...
Might well be running an awful lot of threads.
Really? I thought a GPU just had hardwired control? Has this always been the case, or only recently? I'm just wondering because otherwise I would expect GPUs to be usable for much more than rendering. If I understand correctly, the GPU is just a bunch of SIMD units?
Every time it processes a new pixel?
Euh, you got me confused again. Can't that be hardwired? I mean, many GPUs have 4 pixel pipelines, not 4 pixel threads, don't they? I'm particularly interested in this because my software rendering project (see sig.) can't process pixels completely in parallel, and I thought a GPU could.
Does a processor have a built in memory controller? (Well, I guess Opteron does now).
You answered your own question
I thought a memory controller was not very complicated in design? All it has to do is generate row and column signals from a linear address?
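To make the "row and column signals from a linear address" idea concrete, here is a minimal Python sketch. The function name `split_address` and the bit widths (10 column bits, 13 row bits) are assumptions chosen for illustration, not taken from any real controller, and the sketch ignores banks, refresh and timing, which a real controller also has to handle.

```python
def split_address(addr, col_bits=10, row_bits=13):
    """Split a linear address into (row, column) for a DRAM-style array.

    Hypothetical layout: the low `col_bits` bits select the column,
    the next `row_bits` bits select the row.
    """
    col = addr & ((1 << col_bits) - 1)
    row = (addr >> col_bits) & ((1 << row_bits) - 1)
    return row, col


row, col = split_address(0x12345)
print(row, col)
```

Consecutive addresses stay in the same row until the column bits overflow, which is why the layout of the address bits matters for access patterns.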
Dual memory interface (AGP and local)?
Hmm, I'm getting intrigued. How is AGP handled? Does it include control over where the data has to be stored in the card's memory, or is that also directed by the GPU? Or am I completely wrong?
You mean the RAMDAC? That's a separate chip which doesn't have to be very complicated as far as I know.
OK, I don't really count that as part of the GPU, but you're right, nowadays it's also integrated on the chip. Isn't it implemented with a programmable DSP?
Same as above? Doesn't really have anything to do with 3D rendering?
Full featured BitBLT with ROPs?
A blit doesn't seem like a complicated operation to me, and ROPs only need a basic ALU. But I'm probably terribly wrong again...
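As a rough illustration of why a blit with raster operations looks like "a copy loop plus a basic ALU", here is a minimal Python sketch. The names `ROPS` and `blit`, the 8-bit pixel format and the four operations shown are assumptions for illustration; real hardware supports many more ROP codes and has to handle overlapping regions, clipping and pixel formats.

```python
# Hypothetical table of raster operations: each combines a source pixel
# with the existing destination pixel using one bitwise ALU operation.
ROPS = {
    "copy": lambda s, d: s,
    "and": lambda s, d: s & d,
    "or": lambda s, d: s | d,
    "xor": lambda s, d: s ^ d,
}


def blit(src, dst, sx, sy, dx, dy, w, h, rop="copy"):
    """Copy a w*h rectangle from src at (sx, sy) to dst at (dx, dy),
    combining each pixel with the destination through the chosen ROP.
    Pixels are assumed to be 8-bit values stored in lists of rows."""
    op = ROPS[rop]
    for y in range(h):
        for x in range(w):
            s = src[sy + y][sx + x]
            d = dst[dy + y][dx + x]
            dst[dy + y][dx + x] = op(s, d) & 0xFF


src = [[0x0F, 0xF0], [0xFF, 0x00]]
dst = [[0xFF] * 3 for _ in range(3)]
blit(src, dst, 0, 0, 1, 1, 2, 2, rop="xor")
```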
If I recall correctly, very little hardware has line drawing capabilities and most just draw a thin rectangle?
Well I'm not up to date with anti-aliasing techniques, but super-sampling just requires a bigger frame buffer and color averaging (with gamma correction). The BitBLT unit could maybe help here?
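The "bigger frame buffer and color averaging (with gamma correction)" step can be sketched as follows in Python. This is a minimal sketch under stated assumptions: a 2x2 box filter, a single 8-bit channel, and a pure power-law gamma of 2.2, which only approximates real display transfer curves. The function name `downsample_2x2` is made up for illustration.

```python
def downsample_2x2(img, gamma=2.2):
    """Average each 2x2 block of an 8-bit single-channel image in linear
    light, then convert the result back to gamma-encoded 8-bit values.
    Assumes the image width and height are both even."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            samples = [img[y][x], img[y][x + 1],
                       img[y + 1][x], img[y + 1][x + 1]]
            # Decode to linear intensity before averaging.
            linear = sum((s / 255.0) ** gamma for s in samples) / 4.0
            # Re-encode the averaged intensity.
            row.append(round(255.0 * linear ** (1.0 / gamma)))
        out.append(row)
    return out


small = downsample_2x2([[0, 255], [0, 255]])
```

Averaging two black and two white samples this way gives a value noticeably brighter than the naive 127 or 128, which is the point of doing the average in linear light.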
Hierarchical depth-culling?
Let's see. All you need is a few comparators and calculating the addresses of the hierarchical depth buffers. Of course a lot more complicated than straightforward depth buffers but it doesn't seem like it needs a radical design change.
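Here is a minimal Python sketch of the "few comparators" idea: one coarse level that stores the farthest depth per tile, and a single comparison that can reject a whole tile. It assumes smaller depth means closer, a 2x2 tile size, and only one hierarchy level; the names `build_coarse` and `tile_rejected` are hypothetical, and real implementations are private and certainly differ.

```python
def build_coarse(depth, tile=2):
    """Build one coarse level storing the maximum (farthest) depth of
    each tile. Assumes the buffer dimensions are multiples of `tile`."""
    h, w = len(depth), len(depth[0])
    return [
        [
            max(depth[y][x]
                for y in range(ty, ty + tile)
                for x in range(tx, tx + tile))
            for tx in range(0, w, tile)
        ]
        for ty in range(0, h, tile)
    ]


def tile_rejected(coarse, tx, ty, min_z):
    """True if a primitive whose nearest depth is min_z is hidden
    everywhere in tile (tx, ty), so no per-pixel test is needed."""
    return min_z > coarse[ty][tx]


depth = [
    [0.2, 0.3, 0.9, 0.9],
    [0.1, 0.4, 0.9, 0.8],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5, 0.6],
]
coarse = build_coarse(depth)
```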
Colour compression? Depth compression? Texture decompression? Colourspace conversion? Input and output gamma correction?
These also seem like 'plugins' that don't influence the rest of the chip much. So not too complicated.
Primitive assembly? High order surface tessellation?
I really wouldn't know. Seems complicated.
Hardwired Bresenham algorithm and interpolators?
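For reference, the integer-only form of Bresenham's line algorithm is short enough that hardwiring it seems plausible. Below is a minimal Python sketch of the common all-octant variant; it is a software illustration only and says nothing about how any particular chip actually rasterizes.

```python
def bresenham(x0, y0, x1, y1):
    """Return the list of integer pixel coordinates on the line from
    (x0, y0) to (x1, y1), using only additions and comparisons."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy  # running error term
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:  # step in x
            err += dy
            x0 += sx
        if e2 <= dx:  # step in y
            err += dx
            y0 += sy
    return points


line = bresenham(0, 0, 4, 2)
```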
Texturing? Trilinear filtering? Anisotropic filtering?
Can also be done with dedicated units? Again this seems like a 'component' to me that hasn't changed much in 'functionality' over the years.
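To show what such a dedicated filtering 'component' has to compute, here is a minimal Python sketch of trilinear filtering: two bilinear lookups in adjacent mip levels, blended by the fractional level of detail. It assumes single-channel textures stored as lists of rows, clamped edges, coordinates given in texels of the base level, and it ignores texel-center offsets; the names `lerp`, `bilinear` and `trilinear` are made up for illustration.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t


def bilinear(tex, u, v):
    """Sample tex at non-negative texel coordinates (u, v), blending
    the four surrounding texels. Edges are clamped."""
    h, w = len(tex), len(tex[0])
    x0 = min(int(u), w - 1)
    y0 = min(int(v), h - 1)
    x1 = min(x0 + 1, w - 1)
    y1 = min(y0 + 1, h - 1)
    fx = u - int(u)
    fy = v - int(v)
    top = lerp(tex[y0][x0], tex[y0][x1], fx)
    bottom = lerp(tex[y1][x0], tex[y1][x1], fx)
    return lerp(top, bottom, fy)


def trilinear(mips, u, v, lod):
    """Blend bilinear samples from the two mip levels around lod.
    mips[0] is the base level; each level halves the resolution."""
    l0 = min(int(lod), len(mips) - 1)
    l1 = min(l0 + 1, len(mips) - 1)
    f = lod - int(lod)
    a = bilinear(mips[l0], u / 2 ** l0, v / 2 ** l0)
    b = bilinear(mips[l1], u / 2 ** l1, v / 2 ** l1)
    return lerp(a, b, f)


mips = [[[0, 100], [100, 200]], [[40]]]
sample = trilinear(mips, 0.5, 0.5, 0.5)
```

Each trilinear sample reads eight texels, which is also why nearby pixels reusing the same texels makes a small texture cache effective.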
I'm sure clipping also can be hardwired. Don't know anything about scissoring.
Overlay? Alpha blending? Fogging? Stencil buffering?
Also seems like basic extensions of the pipelines...
How about doing all of these simultaneously? How about doing all of these simultaneously while also running vertex and pixel shading programs on multiple vectors?
Doesn't every unit just work independently? It seems nowhere near as complicated as out-of-order execution where everything is shared and there are hundreds of exceptions. Just to mention a few: jump misprediction, register renaming, resource dependency, interrupts, monitoring, address generation interlock, locked instruction execution, blocking instructions, etc. There are no independent components that handle this.
There's plenty of complexity in a VPU. A CPU's complexity is in the program. A VPU's is in all the many things that it does simultaneously to give you a high-speed graphics display.
There surely is a lot of complexity in the microcode, but it wouldn't be complex if the hardware wasn't complex. In a GPU every unit can just work nearly independently. Control isn't influenced much by the states of other units. It just processes what comes in and passes it to the next unit. Again, I could be very wrong about this because I never learned GPU architecture at university, but that's how I see things. If I'm terribly wrong please correct me.
A CPU can do many of the above things, but it doesn't have dedicated hardware for it, and will do it extremely slowly. It's hard work making a VPU.
Sure, I never implied a GPU is 'easy', but I find it a bit illogical to think it's more 'complicated' than a CPU. The way I see it, if you change one thing in a CPU, the whole design has to change. For a GPU it seems that certain things are completely reusable for every design and can be extended in functionality relatively simply, without depending on the rest of the chip's implementation.
I'm not sure of this, but I think a modern CPU, compared to a GPU with only one pipeline, has more 'functional' transistors. I mean leaving the caches and such aside. I'm sure a GPU can do thousands of operations per clock, but a CPU has hundreds of micro-instructions 'in flight'. The main difference is that each of these micro-instructions can change the control of the whole pipeline. A GPU works much more 'linearly', and computed results don't directly influence the execution of other parts of the chip.
Thanks