I don't see DS as relevant here. It's merely a vertex shader that consumes naked vertices and paints them with attributes (position, normal, colour...). That's how DS was implemented all these years in ATI's pre-D3D11 tessellation pipeline.
My interpretation of there being 4 PolyMorph engines, rather than 1, per GPC is that this enables each SM to take vertices from VS all the way through TS, HS, DS, GS, SO and pre-setup triangle operations to produce a completed triangle. By localising the processing of primitives like this, keeping them private to an SM, NVidia has minimised the amount of communication outside of each SM. This is important because communication is expensive and slow.
Jawed
I think making tessellation 4 times faster is a much bigger gain. I mean, why would you keep the whole thing local to a single SM and then use only 1/4 of the SMs? Would it be slower to use one PolyMorph engine for 4 SMs than a single one for 1 SM? Of course not.
The 4 PolyMorph engines will surely work in parallel, or else it doesn't make much sense.