Having not worked in this field or anything like it, you're obviously way more clued up than me and I can only go by what I hear from other sources. In this case, why is "In The Movies" so poor and why do the EyeToy people say it can't be done? Are they missing something?
This was a long time ago, and I didn't work on the motion detection stuff directly (I was doing CV on static images), so I would take this with a pinch of the hazy memory salt :smile:
I probably didn't clarify that I was talking about two very different techniques - background removal (In The Movies, etc.), and the motion-calculation technique whose name I forget (the gait recognition stuff).
As I understand it, background subtraction is difficult because even a supposedly stationary camera can change focus or exposure (especially as the foreground object you're trying to capture moves around). Also, the moving object itself will affect the light falling on various parts of the background - the most obvious example is a shadow or a reflection, but there are indirect effects too.
So when comparing each pixel in frame n to the reference image, you need a threshold on the difference to cover these variations. Too low and you incorrectly pick up background as foreground; too high and you incorrectly label foreground as background. There is a sweet spot where these errors are minimal (but usually it's not possible to get them to zero). The real problem is that the sweet spot on frame n may not be the same as the sweet spot on frame n+1, because of the changes in camera and scene characteristics I mentioned above. So using the same threshold across a whole video means almost every frame is non-optimal, and you really need to tune it manually per frame to get decent results.
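To make that concrete, here's a minimal sketch of the naive per-pixel approach I'm describing (Python/NumPy, purely illustrative - the threshold value is made up and the real systems were far more involved than this):

```python
import numpy as np

def foreground_mask(frame, reference, threshold=25):
    """Naive background subtraction: a pixel is foreground if it differs
    from the reference image by more than `threshold`.

    frame, reference: HxWx3 uint8 arrays of the same size.
    Too low a threshold lets background noise through as foreground;
    too high a threshold eats into the real foreground.
    """
    diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
    # Sum the per-channel differences and compare against the threshold.
    return diff.sum(axis=2) > threshold
```

The point is that `threshold` is a single knob trying to absorb every source of variation at once, which is exactly why the "right" value drifts from frame to frame.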
Compare this to greenscreen, where you can remove the background very accurately because you decide from the outset that the background is a single colour which is significantly different from any of the foreground colours. That lets you use a wide tolerance in the removal without losing any foreground, and no manual intervention is required.
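A crude chroma-key version of the same idea looks like this (again a hedged sketch - the key colour and tolerance here are invented, and real keyers work in a chroma space like YCbCr and handle spill and edges properly):

```python
import numpy as np

def chroma_key_mask(frame, key_colour=(0, 177, 64), tolerance=80):
    """Crude chroma key: anything within `tolerance` of the key colour
    (Euclidean distance in RGB) is treated as background.

    The tolerance can be generous precisely because the key colour was
    chosen to be far from any colour in the foreground.
    """
    diff = frame.astype(np.float32) - np.array(key_colour, dtype=np.float32)
    distance = np.sqrt((diff ** 2).sum(axis=2))
    return distance > tolerance  # True = keep (foreground)
```

Notice there's no per-frame tuning problem here: the separation is designed into the scene rather than recovered from it.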
The motion detection (gait recognition) system doesn't use a reference image; it just compares frame n with frame n-1, and where there are differences it computes the likeliest direction and speed of motion of each pixel between frames. So you get a kind of "velocity map", if you like, of pixel movements in image space. Of course there are erroneous results, but it is detecting objects covering hundreds of pixels, so averaging across those pixels smooths the errors out. It also used edge detection and suchlike, IIRC, to help clean up the results.
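I don't know exactly how their version was implemented, but a simple stand-in for that "velocity map" idea is block matching between consecutive frames - for each small block of the previous frame, find the offset in the current frame that matches it best. A rough sketch, assuming greyscale input and with made-up block/search sizes:

```python
import numpy as np

def block_matching_flow(prev, curr, block=8, search=4):
    """Coarse motion estimation by block matching: for each block in the
    previous frame, find the (dy, dx) offset within +/- `search` pixels
    that best matches the current frame.

    prev, curr: HxW greyscale float arrays of the same size.
    Returns one (dy, dx) vector per block - a crude version of the
    per-pixel motion field described above, not the actual algorithm.
    """
    h, w = prev.shape
    flow = np.zeros((h // block, w // block, 2), dtype=np.int32)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            ref = prev[y:y + block, x:x + block]
            best, best_off = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    cand = curr[yy:yy + block, xx:xx + block]
                    # Sum of absolute differences as the matching cost.
                    cost = np.abs(ref - cand).sum()
                    if cost < best:
                        best, best_off = cost, (dy, dx)
            flow[by, bx] = best_off
    return flow
```

Because each vector is supported by a whole block of pixels (and a walking person covers many blocks), individual bad matches average out, which is the robustness I was getting at.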
To give an idea of the precision here, it had 3D awareness (so it could pick up people walking towards or away from the camera, etc.), and could calculate stride characteristics, joint angles, rotations and so on with enough precision to identify individuals in the sample sets by the way they walk.
As I said, I didn't work on it directly, so this is all info I picked up from colleagues and saw in demos/videos, and it has been rattling around in the back of my brain for years since. I had a quick look for a video of the intermediate stage but I couldn't find one, which is a shame because it's illuminating and very interesting.