For one, if the processing is done serially within each unit, it should be fairly easy to avoid any load-balancing issues.
That is, if unit A does all of the processing for triangle X, from shading its vertices to shading each pixel within it, then no unit has to sit idle waiting on a separate stage, and processing efficiency could be maximized. There may be additional problems with memory bandwidth efficiency, but I don't see how they would be all that different from those of current architectures.
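To put rough numbers on the load-balancing argument, here is a toy Python sketch. Everything in it is an assumption made for illustration: the function names, the unit counts (2 vertex + 6 pixel units versus 8 unified units), and the per-triangle costs are hypothetical, not figures for the NV40 or any real chip. It also ignores memory bandwidth and setup/rasterization entirely, and it gives the dedicated design the benefit of perfect work splitting within each pool.

```python
import heapq


def dedicated_time(triangles, vertex_units, pixel_units):
    """Idealized frame time with separate vertex and pixel unit pools.

    Assumes work splits perfectly within each pool, so the slower
    pool sets the pace while the other pool partly idles.
    """
    vertex_work = sum(v for v, _ in triangles)
    pixel_work = sum(p for _, p in triangles)
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def unified_time(triangles, units):
    """Frame time when each unit processes whole triangles serially.

    Each triangle (vertex work plus pixel work) is handed to the
    currently least-loaded unit.
    """
    loads = [0.0] * units
    heapq.heapify(loads)
    for vertex_cost, pixel_cost in triangles:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + vertex_cost + pixel_cost)
    return max(loads)


# Hypothetical workloads: (vertex_cost, pixel_cost) per triangle.
small_triangles = [(3, 1)] * 8   # vertex-heavy: few pixels per triangle
large_triangles = [(3, 9)] * 8   # pixel-heavy: matches the 2:6 unit split

print(dedicated_time(small_triangles, 2, 6), unified_time(small_triangles, 8))  # 12.0 4.0
print(dedicated_time(large_triangles, 2, 6), unified_time(large_triangles, 8))  # 12.0 12.0
```

In this toy model the two designs tie when the workload happens to match the fixed 2:6 split, but the unified design finishes the vertex-heavy batch in a third of the time, because no pixel units are left waiting on the vertex units.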
If the NV40 does indeed sport unified shading, then it seems logical that the shaders will be optimized for FP32 performance, as FP32 will need to be used almost exclusively when executing a vertex program.