DC, you're starting to see my points now, though you may not know it.
Any vector processor worth a damn does indeed have at least MAD, and it has been around for ages now with blending stages, at least in the form of MUL and ADD. A pair of adds will make a MAD into a DP3. I know that having a native dot product instruction is very important for speed during complex mathematical routines. But for real-time effects back in the day, EMBM was vastly more important, or at least would have been if NVidia had included it and if developers had paid more attention to it. I don't know of a single "wow" effect in games from 1999 up to the DX9 era that couldn't be well approximated with EMBM.
All the examples you are giving me - PRT, PTM, BRDF lighting, Polybump - can be done without DP3 either exactly (using MAD/MUL with ADD) or approximately (using EMBM), often with little performance hit. I know they don't need dependent texture lookups, and acknowledged that in my last post. I haven't seen any NSR demos that go beyond what I've been talking about. You can have DOOM3 quality using EMBM, too. Look here, for example.
DP3 is a simple math op achieved by summing the components of a MUL. All those operations you mentioned have adds and multiplies as their basis. DP3 is an optimization. A matrix (translation, scaling, linear transformation, etc.) multiplied by a vector can be done just as fast with MAD as with DP3. I don't see how DP3 helps with a cross product (today it's implemented as a MUL followed by a MAD using swizzling). Why in god's name would you say you can't do vector calculus without dot product when you know multiply and add have been around for ages?
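To make that concrete, here's a rough Python sketch, with plain lists standing in for registers. The helper names (mul, mad, swizzle, dp3 and so on) are made up for illustration and aren't any real shader API. It shows DP3 as a MUL plus two adds, a 3x3 matrix times a vector done either with three DP3s (row form) or one MUL and two MADs (column form), and a cross product as a MUL followed by a MAD with swizzles.

```python
def mul(a, b):
    """Component-wise multiply, like a shader MUL."""
    return [x * y for x, y in zip(a, b)]

def mad(a, b, c):
    """Component-wise multiply-add, like a shader MAD: a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]

def swizzle(v, order):
    """Reorder components, e.g. swizzle(v, "yzx")."""
    idx = {"x": 0, "y": 1, "z": 2}
    return [v[idx[ch]] for ch in order]

def dp3(a, b):
    """Dot product built from one MUL and two adds."""
    p = mul(a, b)
    return p[0] + p[1] + p[2]

def matvec_dp3(rows, v):
    """3x3 matrix times vector, one DP3 per row."""
    return [dp3(r, v) for r in rows]

def matvec_mad(cols, v):
    """Same product from the matrix columns: one MUL, then two MADs."""
    acc = mul(cols[0], [v[0]] * 3)
    acc = mad(cols[1], [v[1]] * 3, acc)
    acc = mad(cols[2], [v[2]] * 3, acc)
    return acc

def cross(a, b):
    """Cross product as MUL then MAD: a.yzx * b.zxy - a.zxy * b.yzx."""
    t = mul(swizzle(a, "zxy"), swizzle(b, "yzx"))
    neg_t = [-x for x in t]  # stands in for a negate source modifier
    return mad(swizzle(a, "yzx"), swizzle(b, "zxy"), neg_t)
```

Both matrix routines take three vector instructions here; the only difference is whether you feed in the rows or the columns of the matrix.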
You've now qualified your assertion with "mathematically speaking", though, and I agree with you in that context. For graphical effects in games, however, especially creative ones, EMBM is far more important. These effects were just never exploited until pixel shaders came (even then very slowly), at which point fixed function EMBM wasn't used anymore.
As I've said before, EMBM was plenty fast, at least on the Radeon. It ran at 75% of the speed of the GF3, which had about three times the fillrate and twice the transistors. The GF3 was no dependent texturing slouch, either. I haven't seen any good data for the Kyro or G400, though. The cache "havoc" is not nearly as bad as you think, especially for a small lighting texture.
I don't know why you so badly want to underemphasize the importance of EMBM. It was the first dependent texture ability on consumer hardware. That is a milestone far greater than some simple wiring converting a MAD to a DP3. So what if EMBM is tied to a 2x2 matrix multiplication? That makes it a mega-fixed function with little relevance? Going from EMBM to general purpose dependent texturing is not a very big step at all. Just details, really. Be more specific about why EMBM wasn't as useful as general dependent texturing. Are you talking about it not working with cube-maps?
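For anyone following along, here's roughly what the EMBM stage does, as a small Python sketch. The function names, the nearest-neighbour sampling, the wrap addressing and the row/column layout of the 2x2 matrix are my own assumptions for illustration, not a description of any particular chip or API. The point is that the second lookup's coordinates depend on the result of the first one.

```python
import math

def sample(texture, u, v):
    """Nearest-neighbour lookup with wrap addressing (illustrative only)."""
    h = len(texture)
    w = len(texture[0])
    x = math.floor(u * w) % w
    y = math.floor(v * h) % h
    return texture[y][x]

def embm(bump_map, env_map, bump_matrix, u, v):
    """Dependent read: perturb (u, v) by a 2x2-transformed (du, dv)."""
    # First lookup: the bump map stores a (du, dv) offset per texel.
    du, dv = sample(bump_map, u, v)
    # Fixed-function step: 2x2 matrix applied to the offset.
    (m00, m01), (m10, m11) = bump_matrix
    u2 = u + du * m00 + dv * m10
    v2 = v + du * m01 + dv * m11
    # Second lookup uses coordinates computed from the first lookup.
    return sample(env_map, u2, v2)
```

Swap the 2x2 step for an arbitrary function of the first lookup and you have a general dependent read, which is why I call the rest details.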
Anyway, I think we should stop this discussion now, as we're not really bringing up any new points. In summary, this is my stance: DP3 is a natural math optimization, nothing more; in contrast, EMBM gives you 2D dependent texturing, opening a world of possibilities.