Fermi's subdivision of the geometry pipeline may also be different from GCN.
Fermi has polymorph engine and raster engine blocks, and it devotes a fabric to keeping the polymorph engines in each SM in sync with one another. Outside of cases where there is an ordering constraint, this allows more setup work to proceed in parallel.
AMD has kept the geometry engine outside of the CU block, which may mean that it is more conservative about how it sets up primitives and geometry.
The division is also different: in GCN, the pixel pipe contains both scan conversion and the render backend, while the primitive pipe contains tessellation and geometry.
Nvidia pairs edge setup, rasterization, and culling in one block, with the other functions placed in the polymorph block.
I'm now curious about the specialized bus in GCN for the ROPs and GDS.
Is it there to save bandwidth? Or is it also because the GDS and ROPs are part of a pipeline with rather strict ordering, and the arrays of CUs and their R/W subsystem are not consistent enough to maintain it?
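
To make the ordering point concrete, here is a toy Python sketch of the general idea: work can finish out of order on parallel units, but results are only released in submission order. This is purely an illustration of an ordering constraint, not a description of how Fermi's fabric or GCN's bus actually works; the names `ReorderBuffer`, `complete`, and `simulate` are made up for this example.

```python
import heapq


class ReorderBuffer:
    """Toy model: releases items in submission order even if they finish out of order."""

    def __init__(self):
        self._pending = []   # min-heap of (sequence_id, payload) waiting to retire
        self._next_id = 0    # next sequence id that is allowed to retire

    def complete(self, seq_id, payload):
        """Record a finished item and return whatever can now retire, in order."""
        heapq.heappush(self._pending, (seq_id, payload))
        retired = []
        # Drain the heap only while the oldest pending item is the next expected one.
        while self._pending and self._pending[0][0] == self._next_id:
            retired.append(heapq.heappop(self._pending)[1])
            self._next_id += 1
        return retired


def simulate(completion_order):
    """Feed primitives in the order the (hypothetical) parallel units finish them."""
    rob = ReorderBuffer()
    output = []
    for seq_id in completion_order:
        output.extend(rob.complete(seq_id, f"prim{seq_id}"))
    return output


if __name__ == "__main__":
    # Primitive 2 finishes first but is held until 0 and 1 have retired.
    print(simulate([2, 0, 1, 3]))
```

The cost of the constraint shows up as items sitting in the buffer: the more the completion order diverges from the submission order, the more has to be held back before anything can retire.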