Well, because if exactly one (u,v) coordinate is sent for domain shading each clock, then the amplification amounts to a bandwidth saving only. Tessellating into many small triangles won't keep 1600 ALUs busy, since they'll all be waiting for the next small triangle to be rasterized, which takes a few clocks to pop out. What you'd want is for the (u,v) values to be sent to 64 different groups of domain shading ALUs, so that you can parallelize the tessellation as much as possible and not have those ALUs sitting idle.
On Fermi, you can be working on 16 different (u,v) values at once, domain shading them, setting up the triangles, and rasterizing separately. So even if a polymorph engine can only tessellate one set of coordinates per clock, Fermi can have 16 of them going at the same time, as well as set up 4 outputs from domain shaders each clock. It's less bottlenecked.
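To put rough numbers on the argument, here's a back-of-the-envelope sketch in Python. It's a toy steady-state model, not a description of either chip: the `alu_utilization` helper and the `COST` of 40 ALU ops per domain-shaded vertex are made up for illustration, and the ALU count is held at 1600 in both cases purely to isolate the effect of how many (u,v) values are issued per clock.

```python
def alu_utilization(uv_per_clock, alu_ops_per_vertex, num_alus):
    """Fraction of ALU capacity that domain shading can fill in steady state."""
    # ALU work arriving each clock: one domain shader invocation per (u,v).
    demanded = uv_per_clock * alu_ops_per_vertex
    # The ALUs can't be more than 100% busy.
    return min(1.0, demanded / num_alus)


COST = 40        # hypothetical ALU ops per domain-shaded vertex (assumption)
NUM_ALUS = 1600  # ALU count from the post, held fixed for both cases

single = alu_utilization(1, COST, NUM_ALUS)     # one (u,v) issued per clock
parallel = alu_utilization(16, COST, NUM_ALUS)  # 16 (u,v) values issued per clock

print(f"1 (u,v)/clock:  {single:.1%} of ALU capacity")
print(f"16 (u,v)/clock: {parallel:.1%} of ALU capacity")
```

With these made-up numbers, a single (u,v) per clock fills about 2.5% of the ALUs, while 16 per clock fills about 40%. The exact figures depend entirely on the assumed shader cost; the point is only that utilization scales with the issue rate until something else becomes the limit.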