I don't have any specific numbers to hand, but a year or so back, when I had to implement filtering in a shader for the ATI path (using FP16 textures on an ATI 9800, versus the NV 6600's hardware implementation) using techniques discussed in this thread, I didn't see such bad performance. ISTR the ATI path was within 2-3 Hz of the NV path - admittedly it's hardly a scientific comparison! Low-end hardware might well have masked it (I actually killed the 9800 with this project in the end).
A lot of 9800Pros died because the fan failed... I had one that did that. It caused very very strange behaviour in Excel when it was on its last legs...
FWIW, a few people I've spoken to who would have some influence on pushing programmable blending said it's really not a priority. I'm not sure I agree (but haven't thought about it much), but they argue that it's such a big change for a relatively small class of algorithms - most uses could be shoe-horned onto existing blend ops (e.g. use simple additive blending but abstract it out to f(x) = g(x) + h(x) - compute the complex functions g and h in shaders and then use the simple FF addition to composite...).
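To make that shoe-horning concrete, here's a minimal sketch of the decomposition in Python (purely illustrative - the function names and values are made up, and in real hardware h(dst) would need a separate pass or a texture read of the framebuffer, since the shader can't read dst directly):

```python
def g(src):
    # "shader" computes an arbitrary function of the incoming fragment
    return src * src            # e.g. some non-linear tone curve

def h(dst):
    # a second pass computes an arbitrary function of the framebuffer value
    return 0.5 * dst

def fixed_function_add(a, b):
    # the only blend op the ROP provides: simple saturating addition
    return min(a + b, 1.0)

# The "complex" blend F(src, dst) = g(src) + h(dst) is realised by
# evaluating g and h in shaders, then compositing with the simple FF add:
result = fixed_function_add(g(0.5), h(0.5))
```

The point being that any blend expressible as a sum of per-input terms fits the existing fixed-function pipeline; it's the blends that need arbitrary read-modify-write of the destination that don't.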
My main motivation in saying that was thinking of all that computing power sitting idle when there's no texture filtering/blending being done.
But there's the precision problem: there's hardly any FP32 texture filtering capability in current GPUs - it's mainly int8 - so maybe generalising it like this would cost too much in routing/scheduling etc.
Then again, with int16 and int32 as part of the ALU pipeline in D3D10, the common ground between the ALU pipeline and the TMU pipeline (which needs to filter int and FP 32-bit formats) seems to have increased.
There's also density, though - I assume for a given level of performance a fixed function filtering/blending pipeline will just be smaller.
Still, I can't help thinking there must come a cross-over point, based perhaps on available bandwidth versus bilinear filtering capability. Beyond that point the fixed-function bilinear pipeline's utility will simply tail off.
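That cross-over is easy to sketch as back-of-envelope arithmetic. All the figures below are made-up assumptions for illustration, not data from the thread - the point is just that once the bus can't feed the bilinear units, the extra fixed-function capability goes unused:

```python
# When does memory bandwidth, rather than the fixed-function bilinear
# units, bound filtering throughput? (Assumed figures throughout.)

bandwidth_gb_s    = 64.0   # assumed memory bandwidth
texel_bytes       = 8      # FP16 RGBA texel
texels_per_bilerp = 4      # 2x2 footprint, pessimistically ignoring cache reuse

# bilinear fetches per second the bus could feed (in G-bilerps/s):
bw_limited_gtexels = bandwidth_gb_s / (texel_bytes * texels_per_bilerp)

tmus      = 16             # assumed number of bilinear units
clock_ghz = 0.5            # assumed core clock
ff_limited_gtexels = tmus * clock_ghz   # one bilerp per TMU per clock

print(bw_limited_gtexels, ff_limited_gtexels)  # 2.0 vs 8.0: bandwidth-bound
```

With these (invented) numbers the fixed-function units could do 4x more bilerps than the bus can supply texels for - which is the tail-off: past the cross-over, adding bilinear hardware buys nothing, while general ALUs could at least do other work.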
I admit, texture addressing, LODding, biasing, and filtering/blending maths is stuff I always trip up on.
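For reference, the bilinear part at least is small - here's a sketch of a reference bilinear filter in Python (clamp-to-edge addressing assumed; texture and coordinates are illustrative, and real hardware obviously does this in fixed point with cache-aware fetching):

```python
import math

def bilinear(tex, u, v):
    """Reference bilinear filter; tex is a row-major 2D list, u/v in texel space."""
    w, h = len(tex[0]), len(tex)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    fx, fy = u - x0, v - y0          # fractional weights

    def texel(x, y):
        # clamp-to-edge addressing for the sketch
        return tex[max(0, min(h - 1, y))][max(0, min(w - 1, x))]

    # lerp horizontally across the 2x2 footprint, then vertically
    top = texel(x0, y0)     * (1 - fx) + texel(x0 + 1, y0)     * fx
    bot = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bot * fy
```

Three lerps per sample - which is exactly why a dedicated unit is so dense, and why emulating it in the ALUs (plus the four addressed fetches) looks expensive by comparison.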
Maybe the argument is more pertinent to ROPs? I guess ROPs have a more uneven workload, specifically blending and Z testing. Isn't Z testing within a shader program the holy grail? (Admittedly, with dire parallelism consequences, i.e. read-after-write conflicts.) Wasn't this the thrust of David Kirk's argument against ROPs that support AA + FP filtering - that they're too expensive for their utility - and that programmable output merge would come?...
Jawed