Hint: with fully decoupled shader and texture pipelines, a Xenos-like scheduler becomes extremely useful, even if you only have 4 shader pipes and 4 texture pipes.
The key to ATI's new architectures is, I believe, out-of-order batch scheduling, which essentially means that a batch that needs texturing can be texturing while a batch that doesn't need texturing can be calculating.
Batches do a little dance through the GPU, each taking their turn on the dancefloors, according to whether they want to waltz (texture) or breakdance (calculate), and whether there's space on those dancefloors for them.
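
To make the dance a bit more concrete, here's a toy sketch in Python. It is not how Xenos or any real ATI part works; the names (`simulate`, `serial_cycles`, the `"tex"`/`"alu"` labels, the slot counts and cycle costs) are all made up for illustration. Each batch is just a list of phases, and on every cycle the scheduler issues any waiting batch whose next phase has a free slot on the matching "dancefloor", regardless of the order the batches arrived in.

```python
from collections import deque


def serial_cycles(batches):
    """Baseline: every phase of every batch runs one after another."""
    return sum(cycles for phases in batches.values() for _, cycles in phases)


def simulate(batches, slots):
    """Toy out-of-order batch scheduler (illustrative only).

    batches: {name: [(unit, cycles), ...]} with unit in slots' keys.
    slots:   {unit: how many batches that unit can hold at once}.
    Returns (total_cycles, issue_log), where issue_log holds
    (cycle, batch_name, unit) tuples.
    """
    pending = {name: deque(phases) for name, phases in batches.items()}
    waiting = [name for name in pending if pending[name]]
    running = {unit: [] for unit in slots}  # unit -> [[name, cycles_left], ...]
    cycle = 0
    log = []

    while waiting or any(running.values()):
        # Issue step: any waiting batch may go, as long as the unit
        # its next phase needs has space. Arrival order does not matter.
        for name in list(waiting):
            unit, cycles = pending[name][0]
            if len(running[unit]) < slots[unit]:
                pending[name].popleft()
                running[unit].append([name, cycles])
                waiting.remove(name)
                log.append((cycle, name, unit))

        # Advance one cycle; finished phases free their slot.
        cycle += 1
        for unit in running:
            still_busy = []
            for entry in running[unit]:
                entry[1] -= 1
                if entry[1] > 0:
                    still_busy.append(entry)
                elif pending[entry[0]]:
                    waiting.append(entry[0])  # batch wants another turn
            running[unit] = still_busy

    return cycle, log


# Two batches wanting the opposite dancefloor at each step.
demo = {
    "A": [("tex", 2), ("alu", 2)],
    "B": [("alu", 2), ("tex", 2)],
}
total, issue_log = simulate(demo, {"tex": 1, "alu": 1})
print(total, "cycles overlapped vs", serial_cycles(demo), "back to back")
```

With one texture slot and one ALU slot, batch A textures while batch B calculates, then they swap, so both units stay busy the whole time. If both batches want the same unit at once, one simply waits for space, which is the "whether there's space on those dancefloors" part.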