I think you are getting a bit confused here by Huang's "It cost us 60 million transistors" statement. What he basically meant was that they could have done a PS 2.0 part with 16 pipes and all of the current tech for about 160 million transistors. Yes, adding SM 3.0 to the mix did increase the number of transistors per pipeline. However, if you look at the NV3x pipelines, they are also fairly complex, plus they have the extra texturing unit, as well as the other garbage that NV threw in there that didn't need to be there. NVIDIA totally revised their pipeline scheme to be more efficient in math operations (and not as robust in texturing operations). So, mixing and matching those things led to a decrease in overall transistors per pipeline.
Basically, if NV had just made a PS 2.0 compatible unit that stuck to those specifications, the NV40 would have been a much smaller chip. However, even with SM 3.0 capabilities, the NV4x pipeline is still smaller than the NV3x pipeline. You can do the simple math there: the NV35 was 125 million transistors with 4 pixel units and 3 VS units, and if you brought that architecture to a 16 pipeline design with 6 VS units, that chip would hit 330 million transistors easily.
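For anyone who wants to play with that scaling math, here is a rough sketch of one way to do it. It assumes a simple linear model: a fixed chunk of the chip that doesn't grow with the unit count, plus a per-unit cost for each pixel pipe and each VS unit. The per-unit costs below (15 million per pixel pipe, 8 million per VS unit) are made-up illustrative values, not NVIDIA figures; they were picked only because they land close to the 330 million ballpark. The function name and parameters are hypothetical too.

```python
def scaled_transistors(base_total, base_pixel, base_vs,
                       new_pixel, new_vs, per_pixel, per_vs):
    """Linear estimate of a scaled-up chip, all counts in millions of transistors.

    Whatever is left of the base chip after subtracting the per-unit costs
    is treated as fixed overhead that does not scale with unit count.
    """
    fixed = base_total - base_pixel * per_pixel - base_vs * per_vs
    if fixed < 0:
        raise ValueError("per-unit costs exceed the base chip's total")
    return fixed + new_pixel * per_pixel + new_vs * per_vs


# From the post: NV35 = 125 million transistors, 4 pixel units, 3 VS units.
NV35_TOTAL, NV35_PIXEL, NV35_VS = 125, 4, 3

# ASSUMPTION: illustrative per-unit costs, not published numbers.
ASSUMED_PER_PIXEL = 15
ASSUMED_PER_VS = 8

# Hypothetical NV3x-style chip scaled to 16 pixel pipes and 6 VS units.
estimate = scaled_transistors(NV35_TOTAL, NV35_PIXEL, NV35_VS,
                              16, 6, ASSUMED_PER_PIXEL, ASSUMED_PER_VS)
print(estimate)  # 329
```

Change the two assumed per-unit costs and the estimate moves a lot, so treat this as a way to sanity-check a ballpark figure rather than a real die-size estimate.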
So, chopping the NV4x down to a 2x1 product for integrated graphics would probably net you around 35 million transistors (lean) up to 45 million transistors (with some bells and whistles). On the 110 nm process, this would be a very affordable chip for the chipset market.