The next big leap was the Rage 128 PRO, which supports even some HDTV resolutions (720x480 is still officially the highest input resolution, but the chip actually manages much higher resolutions - 14xx*1xxx works; I haven't tested any higher, but if you give me a link to some HD-MPEG2 videos, I could test it for you). The scaler uses a 4-tap/4-tap filter, and I think this was the first consumer graphics chip to support 4-tap/4-tap filtering. The Rage 128 PRO consists of 8 million transistors (don't forget that this covers a dual-pipeline 3D core, AGP 4x interface, 128-bit memory controller, integrated TMDS with ratiometric expander, RAMDAC, etc.)
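For anyone wondering what "4-tap" means in practice, here is a rough sketch of one horizontal pass, where every output pixel is a weighted sum of four neighbouring source pixels. The Catmull-Rom weights and the function names are my own assumptions for illustration; I don't know which coefficients the Rage 128 PRO actually uses. A 4-tap/4-tap scaler would apply the same idea vertically as well.

```python
def catmull_rom_weights(t):
    """Four tap weights for fractional offset t (0 <= t < 1).

    Catmull-Rom is an assumed kernel, not the chip's real coefficients.
    The four weights always sum to 1.
    """
    return (
        (-t**3 + 2 * t**2 - t) / 2,
        (3 * t**3 - 5 * t**2 + 2) / 2,
        (-3 * t**3 + 4 * t**2 + t) / 2,
        (t**3 - t**2) / 2,
    )


def scale_row_4tap(row, out_width):
    """Resample one row of pixel values to out_width using 4 taps."""
    n = len(row)
    out = []
    for x in range(out_width):
        pos = x * n / out_width      # source position of this output pixel
        i = int(pos)                 # integer part: the tap left of pos
        t = pos - i                  # fractional part: selects the weights
        weights = catmull_rom_weights(t)
        acc = 0.0
        for k in range(4):
            # Taps at i-1, i, i+1, i+2, clamped at the row edges.
            idx = min(max(i - 1 + k, 0), n - 1)
            acc += weights[k] * row[idx]
        out.append(acc)
    return out
```

A 2-tap filter would only look at the pixels at `i` and `i + 1` (plain linear interpolation), which is why it needs less line-buffer bandwidth but looks softer.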
If you're talking rage128...
I don't know too much about that one, but it is supposedly pretty similar to the Radeon ones. The back-end scaler (BES) of the Radeons (pre-AVIVO; I don't think AVIVO-based cards have a BES) can handle source video up to 1536 or 1920 pixels wide (it depends on the exact model, and no one seems to know which is which... for instance r200 is 1920, rv250 is 1536), and at that width it can only do 2-tap filtering (the overlay line buffer is too small). For 4-tap the video must be no wider than half that. Vertically it's virtually unlimited. So playing a 1920x1080 video directly on an rv250 wouldn't work. What the chip can do, however, is predownscale, so you essentially get 960x1080 video (still with 2-tap filtering; it's actually hard to notice that quality is degraded...).
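The width rules above boil down to a small decision, roughly sketched here. `plan_overlay` and its parameters are made-up names, not anything from a real driver, and halving the width repeatedly until it fits is my assumption; the only case actually described above is a single halving.

```python
def plan_overlay(src_width, max_width):
    """Return (fetched_width, taps, predownscale) for a hypothetical BES.

    max_width is the widest source line the overlay line buffer can hold
    (1536 or 1920 depending on the chip, per the description above).
    """
    predownscale = 1
    width = src_width
    # Assumption: predownscale halves the fetched width until it fits.
    while width > max_width:
        predownscale *= 2
        width = src_width // predownscale
    # 4-tap filtering only fits when the line is at most half the maximum.
    taps = 4 if width <= max_width // 2 else 2
    return width, taps, predownscale


print(plan_overlay(1920, 1536))  # rv250-like limit: (960, 2, 2)
print(plan_overlay(1920, 1920))  # r200-like limit:  (1920, 2, 1)
print(plan_overlay(720, 1536))   # DVD-width source: (720, 4, 1)
```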
Of course, nowadays everyone uses front-end scalers: just use textured video, since the 3D engine is a pretty powerful scaler these days... The BES of the Radeon was also a bit nasty to program, as its setup depends not only on the source resolution but of course on the output resolution (timing) as well.
And of course back-end scalers have all sorts of limitations: there is typically only one per card, so not only can you output only one video that way, but it's limited to one output too (unless you have a true clone mode where you drive two displays with the same display controller, i.e. same timings). You can't use it if you have a rotated display (unless the BES supports rotation directly, which it probably doesn't on any graphics card), you can't really use it well in composited environments, etc.
I think that a scaler for HDTV resolutions can consist of less than 1 million transistors if you don't have any special requirements.
A scaler alone should be quite a bit simpler, I think, even if you include things like color-space conversion. Of course, for deinterlacing the required complexity can vary from basically zero (bob or weave) to something very complex... (even the first Radeons had some sort of adaptive algorithm, though I don't think you'd call it high-quality these days)
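To show why bob and weave cost next to nothing, here is a minimal sketch of both on fields stored as lists of rows. The function names are mine, and this bob is plain line doubling; a real implementation would more likely interpolate between lines and offset the two fields by half a line.

```python
def weave(top_field, bottom_field):
    """Interleave two fields into one full-height frame."""
    frame = []
    for top_row, bottom_row in zip(top_field, bottom_field):
        frame.append(top_row)
        frame.append(bottom_row)
    return frame


def bob(field):
    """Stretch a single field to full height by repeating each line."""
    frame = []
    for row in field:
        frame.append(row)
        frame.append(list(row))  # copy so the two lines are independent
    return frame


top = [[1, 1], [3, 3]]
bottom = [[2, 2], [4, 4]]
print(weave(top, bottom))  # [[1, 1], [2, 2], [3, 3], [4, 4]]
print(bob(top))            # [[1, 1], [1, 1], [3, 3], [3, 3]]
```

Weave keeps full vertical detail but combs on motion, while bob avoids combing at the price of half the vertical resolution. An adaptive deinterlacer has to choose between the two per region, and that is where the complexity comes from.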