nobond said:
In the GPU field, there does not seem to be a canonical model like the five-stage pipeline of a RISC processor. The concepts of the geometry processor and the rasterization processor look too rough to provide any real insight, I think.
The first observation that motivates having a GPU in the first place is parallelism. Each pixel is essentially independent of every other pixel, which means that for a 1280x1024 framebuffer you have a bit over 1 million sequences of calculations that are pretty much independent of each other. This has some rather immediate implications:
- You will want to keep track of execution state for more than 1 pixel at a time, or else you are f***ed from the get-go; think multithreading and multi-core - a modern GPU keeps track of several hundred to a few thousand such states and can have dozens of execution units.
- If you think that you might be unable to keep an execution unit busy because of a data dependency, a cache miss, a branch mispredict or whatnot, don't stall the processor. Instead, make sure you have instructions ready for other pixels, so that you can continue feeding the execution unit for 100% utilization.
- Instruction latencies should not affect performance: if your execution unit has 100 cycles of latency but can accept 1 instruction per clock, you collect 100 pixels and interleave execution between them, so that your execution unit stays 100% busy all the time.
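The interleaving trick in the last bullet can be made concrete with a toy cycle count. The numbers below (100 cycles of latency, a 4-instruction dependent chain per pixel) are illustrative, not from any real GPU:

```python
# Toy model of latency hiding: one fully pipelined execution unit with
# 100 cycles of latency, fed by interleaving instructions from many
# independent pixel contexts. All numbers are illustrative.

LATENCY = 100          # cycles before a result comes back
INSTRS_PER_PIXEL = 4   # each pixel runs a short dependent chain

def cycles_to_finish(num_pixels):
    """Cycles to run every pixel's dependent instruction chain.

    With one pixel, each instruction must wait out the full latency.
    With >= LATENCY pixels, the unit can issue a new instruction every
    cycle and the latency is completely hidden.
    """
    issue_slots = num_pixels * INSTRS_PER_PIXEL
    if num_pixels >= LATENCY:
        # One issue per cycle, plus draining the pipeline at the end.
        return issue_slots + LATENCY
    # Not enough independent work: each round of issues must wait for
    # the previous round's results to return before it can continue.
    return INSTRS_PER_PIXEL * LATENCY

print(cycles_to_finish(1))    # 400 cycles for 4 instructions
print(cycles_to_finish(100))  # 500 cycles for 400 instructions
```

With one pixel the unit sits idle 99% of the time; with 100 pixels in flight it issues almost every cycle - which is exactly why a GPU tracks hundreds to thousands of execution states.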
That should give you some hints to the overall architecture of a pixel shader processing unit - vertex shader units are similar, but tend to have fewer high-latency operations (texturing in particular). There is a lot of stuff to learn about the internal workings of the various execution units (iterators, texture mappers, arithmetic circuits), which will give you something to chew on for a very long time. You can generally make the assumption that every execution unit will be fully pipelined.
In addition, there are a substantial number of subunits that do not look like processors, but serve more specialized purposes in the general 3d graphics data flow:
- Vertex shader pre-transform and post-transform caches
- Triangle Setup Unit
- Scan-converter
- Z and Stencil test units
- Framebuffer blend units
These units are conceptually not very complicated, but they leave room for large amounts of optimization at every level, from gates to algorithms, which often makes them extremely complex in practice. One example of such an optimization would be Hierarchical Z; there are many, many others.
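To give a feel for what one of these "conceptually simple" units does, here is a minimal sketch of the scan-converter's core job: deciding which pixel centers a triangle covers, using edge functions. Real hardware evaluates the edge functions incrementally across tiles of pixels; this brute-force version only shows the idea, and the counter-clockwise winding convention is an assumption of the sketch:

```python
# Minimal scan-converter sketch: test each pixel center inside a
# triangle's bounding box against three edge functions. Hardware does
# this incrementally and tile-by-tile; this is the naive version.

def edge(ax, ay, bx, by, px, py):
    # Signed area term: positive if point p lies to the left of the
    # directed edge a->b (counter-clockwise winding assumed).
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize(v0, v1, v2):
    """Return the (x, y) pixels whose centers a CCW triangle covers."""
    xs = [v[0] for v in (v0, v1, v2)]
    ys = [v[1] for v in (v0, v1, v2)]
    covered = []
    for y in range(int(min(ys)), int(max(ys)) + 1):
        for x in range(int(min(xs)), int(max(xs)) + 1):
            px, py = x + 0.5, y + 0.5   # sample at the pixel center
            if (edge(*v0, *v1, px, py) >= 0 and
                edge(*v1, *v2, px, py) >= 0 and
                edge(*v2, *v0, px, py) >= 0):
                covered.append((x, y))
    return covered

print(len(rasterize((0, 0), (8, 0), (0, 8))))  # 36 pixels covered
```

The per-pixel test is three multiply-adds and three sign checks - trivial logic, but the optimization space (tiling, incremental evaluation, coverage for multisampling) is where the real complexity lives.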
Finally, there is the memory subsystem, which supplies all the various units with the data they need, and returns to memory the data that the units produce (which should be limited to framebuffer data - that is, color, Z, and stencil). This subsystem needs to be rather deeply pipelined and highly parallel as well, to allow it to serve and prioritize a large number of execution units efficiently; a GDDR3 memory chip can easily transfer data twice per clock cycle and accept a random memory access every 2 cycles, but you can safely expect the latency of the chip itself to be on the order of 30 cycles (and that comes on top of any latency imposed by the memory controller).
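Why "deeply pipelined" follows directly from those numbers can be seen with Little's law (in-flight work = throughput x latency). A back-of-the-envelope sketch using the rough figures from the paragraph above:

```python
# Back-of-the-envelope: how many memory requests must be in flight to
# keep the memory chip busy? Little's law: in-flight = throughput x
# latency. Figures follow the rough numbers in the text.

latency_cycles = 30        # chip latency alone, ignoring the controller
cycles_per_access = 2      # one random access accepted every 2 cycles

throughput = 1 / cycles_per_access        # accesses per cycle
in_flight = throughput * latency_cycles   # requests that must be in flight

print(in_flight)  # 15.0 -> ~15 outstanding requests just for the chip
```

So even before counting controller latency, the memory subsystem must track on the order of 15 outstanding requests per chip to sustain full throughput, which is why it cannot be a simple request-reply design.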
Of course, as Captain Chickenpants notes, what probably will happen in practice is that you will be assigned to work on just a small, well-defined part of this whole. If you want more than that, you will likely need a substantial aptitude for algorithm and GPU architecture development.