A rather more elegant solution, IMO, would be one controller/dispatcher chip that contains all the features that don't require massive parallelism: for example UVD, RAMDAC, thread dispatcher, etc.
Then you have multiple cores specializing in all those parallel tasks that can benefit from more units.
Each part would then be smaller, allowing for more chips or more redundancy to improve yields, as well as greater granularity.
This would also allow the whole thing to act as one big monolithic GPU, thus doing away with the need for AFR or other multi-GPU algorithms. However, how much latency would there be between the dispatcher chip and the cores? And would it be possible to effectively hide it?
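To put the latency question in rough terms, here is a back-of-the-envelope sketch. It assumes a very simple model that is mine, not anything from a real GPU: a core asks the dispatcher for its next batch as soon as it starts one, the reply takes a fixed link latency to arrive, and every batch takes the same time to compute. The function name and the numbers are made up purely for illustration.

```python
import math


def batches_needed_to_hide_latency(link_latency_ns: float,
                                   compute_ns_per_batch: float) -> int:
    """Estimate how many independent batches a core must hold at once
    so it never idles while waiting on the dispatcher chip.

    Hypothetical model: one batch is executing, and enough queued
    batches must be on hand to cover the dispatcher round trip.
    """
    if link_latency_ns < 0:
        raise ValueError("latency cannot be negative")
    if compute_ns_per_batch <= 0:
        raise ValueError("compute time per batch must be positive")
    # Queued batches needed to cover the wait, rounded up.
    queued = math.ceil(link_latency_ns / compute_ns_per_batch)
    # Plus the batch currently being executed.
    return 1 + queued


if __name__ == "__main__":
    # Illustrative numbers only, not measurements of any real part.
    for latency in (0, 50, 400):
        n = batches_needed_to_hide_latency(latency, 100)
        print(f"latency {latency} ns, 100 ns per batch: hold {n} batches")
```

Under this model the latency can be hidden as long as each core has room to keep that many batches in flight. The longer the chip-to-chip hop is relative to the work per batch, the more buffering each core needs.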
Then again, I'm not a chip designer, so all this might be virtually impossible or impractical for one reason or another.
Regards,
SB