While on the topic of blending, does anyone know if NV40 can blend more pixels per clock for surfaces with fewer than 4 channels? DemoCoder said there are 8 blending units for either fixed or floating point, and now Dave is suggesting maybe only 4 are FP capable (which makes perfect sense from a bandwidth point of view). Either way, does anyone think throughput can be increased for single- or dual-channel rendering?
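To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. The function name and the cost model are my own assumptions: every blended pixel reads the destination once and writes it once, and compression and caching are ignored. The unit counts are just the 8 and 4 figures quoted above, not confirmed specs.

```python
def blend_bytes_per_clock(units, channels, bytes_per_channel):
    """Framebuffer traffic per clock if every blend unit is kept busy.

    Assumed model: one destination read plus one write per blended pixel.
    """
    bytes_per_pixel = channels * bytes_per_channel
    return units * bytes_per_pixel * 2  # read + write


# 8 units, 4-channel 8-bit fixed point
print(blend_bytes_per_clock(8, 4, 1))  # 64
# 8 units, 4-channel FP16
print(blend_bytes_per_clock(8, 4, 2))  # 128
# 4 units, 4-channel FP16
print(blend_bytes_per_clock(4, 4, 2))  # 64
# 8 units, 2-channel FP16
print(blend_bytes_per_clock(8, 2, 2))  # 64
```

Under this model, 4 units on 4-channel FP16 generate the same traffic as 8 units on 8-bit RGBA, and so would 8 units on a 2-channel FP16 surface, which is why I wonder whether the narrower formats can run at a higher pixel rate.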
Anyway, back to MSAA. I can see why MSAA on FP buffers isn't in the R3xx generation (and hence R420): I don't think StretchRect was available or even in the works at the time, so offscreen MSAA buffers were rather pointless (and FP buffers have to be offscreen).
For NV40, I think they could have implemented it, but it might have meant disabling FB compression, which IMO is not a big deal for longer shaders. The required per-sample blending would slow things down quite a bit if you only used the available blenders instead of adding new ones, but again, not a big deal IMO. The final downsampling could be done by the FP16 filtering units during a StretchRect call.
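To make the per-sample cost concrete, here is a minimal software sketch of what I mean. It is not how any hardware does it; the function names are made up, additive blending is assumed, and the resolve is a plain box filter standing in for whatever the filtering units would do during the downsample.

```python
def blend_add(dst, src):
    """Additive blend of two RGBA values (assumed blend mode)."""
    return [d + s for d, s in zip(dst, src)]


def blend_pixel_msaa(samples, src, coverage):
    """Blend src into each covered sample of one pixel.

    One blend operation per covered sample, so a fully covered 4x pixel
    costs four blends instead of one.
    """
    return [blend_add(s, src) if covered else s
            for s, covered in zip(samples, coverage)]


def resolve(samples):
    """Box-filter downsample: average all samples of a pixel per channel."""
    n = len(samples)
    return [sum(channel) / n for channel in zip(*samples)]


# One pixel with 4 samples, initially black.
pixel = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
# A fragment brighter than 1.0 (the point of FP buffers) covering 2 of 4 samples.
pixel = blend_pixel_msaa(pixel, [2.0, 4.0, 0.0, 0.0], [True, True, False, False])
print(resolve(pixel))  # [1.0, 2.0, 0.0, 0.0]
```

The blend work scales with the number of covered samples, which is the slowdown I'm talking about if the existing blenders have to do it all.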
My guess is that NVidia thought it wasn't worth the headache to tie up all the loose ends involved. They'd need a slightly different rendering path through the hardware for the reduced-performance FP MSAA I mentioned above. Given that FP rendering already carries a big performance hit, gamers probably wouldn't want to enable FSAA on top of it.
I think NVidia made the right decision. The FP blending allows developers to experiment now, and the next gen will give them, as Dave mentioned, a fully orthogonal solution.