Good in theory, but in practice it'll be easier to just use a high-resolution camera (if those 8-12MP things can be called that) that is static, has no extra optics and, even worse, doesn't have to move itself around to find its target. Also, zoom would mean only a single person could be tracked at a time, and it would probably require a second camera for the depth buffer generation. Way too complex.
Agreed, too complex/costly/limited.
But the second camera part is already there.
For an HD camera:
Microsoft LifeCam Studio 1080p HD Webcam roughly $60 at retail (2MP)
The iPhone 4S has an 8MP camera on the back, but I have no idea what the video capabilities are. I'd assume 1080p at 30fps. I doubt the camera in this phone costs more than the webcam above, but I could be wrong, as it is in a $600 device.
Either way, we are talking about roughly 7 times more bandwidth, data, and detail in going with a 1080p video feed instead of the current Kinect's 640x480 one.
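To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes uncompressed 24-bit RGB at 30fps for both feeds and 640x480 for the current Kinect colour stream; the function name is made up for illustration, and real streams are compressed or subsampled, so treat the absolute figures as an upper bound.

```python
def raw_bandwidth_bytes_per_s(width, height, fps=30, bytes_per_pixel=3):
    """Uncompressed video bandwidth in bytes per second (assumed RGB24)."""
    return width * height * bytes_per_pixel * fps

vga = raw_bandwidth_bytes_per_s(640, 480)    # assumed current Kinect colour feed
hd = raw_bandwidth_bytes_per_s(1920, 1080)   # hypothetical 1080p feed
ratio = hd / vga

print(f"640x480:   {vga / 1e6:.1f} MB/s")
print(f"1920x1080: {hd / 1e6:.1f} MB/s")
print(f"ratio:     {ratio:.2f}x")  # 6.75x, i.e. "roughly 7 times"
```

The ratio only depends on pixel count, so it holds whatever pixel format you assume, as long as both feeds use the same one.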
Bandwidth savings could be had by smartly culling irrelevant data from the video stream. Doing so would enable Kinect2 (or Kinect HD) to be compatible with existing Xbox 360 hardware.
In fact, perhaps with the success of Kinect, MS will recognize the device is worthy of real investment and put guts on board which can generate accurate 3D skeletal information and just send that info down the USB.
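For a sense of how little data skeleton-only output would be, here is a rough sketch. The numbers are my assumptions, not a spec: 20 tracked joints per player, each sent as three 32-bit floats (x, y, z), at 30 updates per second, compared against an uncompressed 24-bit 1080p feed at 30fps.

```python
JOINTS = 20            # assumed joint count per skeleton
FLOATS_PER_JOINT = 3   # assumed x, y, z position only
BYTES_PER_FLOAT = 4    # 32-bit float
FPS = 30

# Bytes per second for one player's skeleton stream
skeleton_bytes_per_s = JOINTS * FLOATS_PER_JOINT * BYTES_PER_FLOAT * FPS

# Bytes per second for raw 1080p RGB24 video at the same rate
video_bytes_per_s = 1920 * 1080 * 3 * FPS

savings = video_bytes_per_s / skeleton_bytes_per_s

print(f"skeleton: {skeleton_bytes_per_s} B/s")
print(f"video:    {video_bytes_per_s} B/s")
print(f"video is about {savings:.0f}x larger")
```

Even with several players and extra per-joint data such as rotations, the skeleton stream stays tiny next to the video, which is the whole appeal of doing the processing on the device.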
The room itself is static, and if the camera is high-res, it could just get texture and model info of the player by doing a full hi-res scan and mapping the info to a full 3D model in the 360/720.
Then all they need to do is map the skeletal info to the 3D model of the person playing.
For uses other than full-body motion gaming, they would be sampling a much smaller area, so they could cull useless data outside whatever is being tracked (hand, face, etc.).
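The culling idea can be sketched as a simple crop to a region of interest. This is only an illustration under my own assumptions: the frame is a plain list of pixel rows, the tracker already knows the box around the hand or face, and the helper names `crop_to_roi` and `kept_fraction` are hypothetical.

```python
def crop_to_roi(frame, x, y, w, h):
    """Keep only the pixels inside the box (x, y, w, h), clamped to the frame."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    x0, y0 = max(0, x), max(0, y)
    # Clamp the far edges and never let them fall before the near edges
    x1 = max(x0, min(width, x + w))
    y1 = max(y0, min(height, y + h))
    return [row[x0:x1] for row in frame[y0:y1]]

def kept_fraction(frame, roi):
    """Fraction of the original pixels that survive the crop."""
    total = sum(len(row) for row in frame)
    kept = sum(len(row) for row in roi)
    return kept / total if total else 0.0

# Tiny 8x6 stand-in frame; each "pixel" is just its (x, y) coordinate
frame = [[(col, row) for col in range(8)] for row in range(6)]

hand = crop_to_roi(frame, 2, 1, 4, 3)   # 4x3 box around a tracked hand
edge = crop_to_roi(frame, 6, 4, 4, 4)   # box hanging off the frame gets clamped

print(kept_fraction(frame, hand))  # 0.25
```

On a real 1080p frame a 200x200 hand box would be about 2% of the pixels, so the savings from sending only the tracked region could be large.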