Pipeline stages at a glance
This overview describes the pipeline from the point of view of synchronization:
At the very top of the pipeline is the pre-fetch processor (PFP), which is the part of the command processor that reads memory for the micro-engine (ME). The PFP also kicks off the direct memory access (DMA) for the vertex geometry and tessellation (VGT) block, specifically for index buffers and indirect draw buffers.
The PFP is responsible for:
- Reading command-buffer data.
- Triggering index-buffer DMA transfers for the VGT.
- Reading indirect buffer parameters.
- Reading predication information.
- Communicating and synchronizing with the ME.
The next stage in the pipeline is the ME. The PFP and the ME, both of which are parts of the command processor, are connected by two first-in, first-out queues. Before the ME can start executing commands, all of the data must have been read from memory by the PFP.
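The hand-off between the PFP and the ME can be pictured as a toy producer/consumer model. The sketch below is only an illustration. The class `ToyCommandProcessor`, its methods, and the packet labels are hypothetical and do not correspond to any real API. It also uses a single queue for simplicity, whereas the hardware uses two, whose individual roles are not described here.

```python
from collections import deque


class ToyCommandProcessor:
    """Toy model: the PFP fetches packets, the ME executes fetched packets.

    Hypothetical illustration only; a single FIFO stands in for the two
    hardware queues between the PFP and the ME.
    """

    def __init__(self, command_buffer):
        self.command_buffer = list(command_buffer)  # packets still in memory
        self.fetch_index = 0                        # next packet the PFP reads
        self.fifo = deque()                         # PFP -> ME queue
        self.executed = []                          # packets the ME has run

    def pfp_fetch(self, count=1):
        """PFP reads up to `count` packets from memory into the FIFO."""
        for _ in range(count):
            if self.fetch_index >= len(self.command_buffer):
                break
            self.fifo.append(self.command_buffer[self.fetch_index])
            self.fetch_index += 1

    def me_execute(self):
        """ME executes one packet, but only if the PFP has already fetched it."""
        if not self.fifo:
            return None  # nothing fetched yet: the ME has to wait
        packet = self.fifo.popleft()
        self.executed.append(packet)
        return packet


# Packet names are made-up labels, not real packet opcodes.
cp = ToyCommandProcessor(["SET_STATE", "DRAW", "DISPATCH"])
first_attempt = cp.me_execute()   # ME tries to run before any fetch
cp.pfp_fetch(2)                   # PFP runs ahead by two packets
second_attempt = cp.me_execute()  # ME executes in FIFO order
```

In this model the first `me_execute()` call has nothing to do because the PFP has not yet read any data, which mirrors the ordering constraint described above.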
The command processor always tries to run as far ahead of the rest of the GPU as possible and executes command packets whenever it is not stalled. Synchronization at the PFP level is more expensive than synchronization at the ME level, both because the PFP is farther from the rest of the GPU and because the PFP-to-ME synchronization mechanism is itself expensive.
The command processor also contains a unit called the CP DMA, which can be used to perform generic copies of memory and global data store (GDS) through direct memory access and through the GPU L2 cache. The CP DMA can run asynchronously or synchronously with regard to the rest of the command processor packets and to itself. It can be kicked off by the PFP or the ME, but it’s used mostly from the ME.
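The difference between a synchronous and an asynchronous CP DMA copy can be sketched with a simple timeline calculation. This is a simplified model under stated assumptions: the function `schedule`, the packet tuples, and the durations (arbitrary units, not measurements) are hypothetical, and the model ignores everything except whether a packet delays the one that follows it.

```python
def schedule(packets):
    """Compute start times for a list of (name, duration, is_async) packets.

    A synchronous packet delays the next packet until it has finished.
    An asynchronous packet lets the next packet start right away.
    Returns (start_times, finish_time_of_all_packets).
    """
    clock = 0
    start_times = {}
    latest_finish = 0
    for name, duration, is_async in packets:
        start_times[name] = clock
        latest_finish = max(latest_finish, clock + duration)
        if not is_async:
            clock += duration  # later packets wait for this one
    return start_times, latest_finish


# Hypothetical durations in arbitrary time units.
sync_starts, sync_end = schedule([("copy", 10, False), ("draw", 4, False)])
async_starts, async_end = schedule([("copy", 10, True), ("draw", 4, False)])
```

In the asynchronous case the draw starts while the copy is still in flight, so a later packet that depends on the copied data would need its own synchronization before reading it.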
For the purposes of this discussion, it is useful to consider the shader-execution blocks of the GPU as a monolithic block. Unless manual synchronization using shader atomics is involved, that part of the GPU pipeline is relatively straightforward when it comes to synchronization.
The fixed-function part of the GPU that executes color-buffer writes, blending, depth tests, and so forth can also be seen as a monolithic block.
Conceptually, the synchronization process through the pipeline is as follows:
- A draw packet or a dispatch command packet enters the command processor through the pre-fetch processor.
- The packet is executed by the micro-engine.
- The packet’s shader stages are executed by the shader processor input (SPI) and the sequencer (SQ).
- The fixed-function render back end (comprising the shader export block [SX], depth block [DB], and color block [CB]) finishes the non-shader-related work.
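The four steps above can be written down as an ordered enumeration, which makes it easy to reason about what a packet has already passed through when it reaches a given stage. The names `Stage`, `stages_completed_before`, and `trace` are hypothetical and exist only for this sketch. The shader and back-end blocks are each treated as a single stage, following the monolithic view taken earlier.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Pipeline stages in the order a packet passes through them."""
    PFP = 0       # pre-fetch processor: reads the packet from memory
    ME = 1        # micro-engine: executes the packet
    SHADER = 2    # SPI and SQ: run the packet's shader stages
    BACK_END = 3  # SX, DB, CB: fixed-function, non-shader work


def stages_completed_before(sync_stage):
    """Return the stages a packet has already passed when it reaches sync_stage."""
    return [stage for stage in Stage if stage < sync_stage]


def trace(packet_name):
    """Return the ordered (packet, stage name) steps for one packet."""
    return [(packet_name, stage.name) for stage in Stage]


draw_trace = trace("draw")
```

For example, `stages_completed_before(Stage.SHADER)` lists the PFP and the ME, the two command-processor stages a packet has cleared before any shader work starts.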