I'm also interested to see how well Kinect and voice recognition will work once you are not sitting in a small, quiet living room, but in a big, perhaps noisy one. Or even, if you are listening to a sports channel at a reasonable volume, how well the Box will respond to your commands (unless of course you have a microphone ready to pick up, but I'm guessing the mics are where you place the camera, which will be at a distance for maximum effect). And then there's still a distinction between a simple command like "Xbox One, channel 7" and one like "Xbox One, browser, search x y and z" (where x, y and z might be names/strings and not recognized commands).
These are important points, considering this is Microsoft's business approach: their vision of the future living room with Xbox. If this doesn't work well (as well as portrayed, but in daily use), what is left beyond the features that people are likely going to buy this for: games?
You brought up good, valid points in the rest of your post, but in the true tradition of forums, I'm only going to answer one small part of it.
MS has two technologies running in the Kinect to deal with the scenarios you posit: echo cancellation and beamforming.
Echo cancellation takes the outgoing audio stream and uses it, plus some serious DSP work, to remove the result of that signal from the incoming mic signals. If your sports game is being watched through the HDMI-in port on the One, it will have access to that outgoing audio stream. Results vary, but in general we could get around 40 dB of echo reduction. This is the equivalent of taking a food blender 3 feet from you and making it sound like a dishwasher in the next room. That's with Kinect v1. v2 will have a lot more CPU dedicated to echo reduction, and will be able to achieve even better results. (This right here is why you have to calibrate your Kinect. If you find voice control degrading, your room impulse response has probably changed, and you need to calibrate again.)
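For anyone who wants to see the idea in code, here is a minimal sketch of echo cancellation. It is not the Kinect implementation, which isn't described here; it swaps in NLMS, a textbook adaptive filter that learns the room response from the outgoing stream and subtracts the predicted echo from the mic. The room response, the noise level, and every name below are made up for illustration. For scale, 40 dB means the echo amplitude drops by a factor of 100 (10,000 in power).

```python
import math
import random


def simulate_echo(reference, room):
    """Convolve the TV audio with a (made-up) room impulse response."""
    return [
        sum(room[k] * reference[n - k] for k in range(len(room)) if n - k >= 0)
        for n in range(len(reference))
    ]


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Return the mic signal with an adaptively estimated echo removed."""
    weights = [0.0] * taps   # running estimate of the room response
    history = [0.0] * taps   # latest reference samples, newest first
    cleaned = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate          # what is left after cancellation
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def power_db(signal):
    """Mean power in dB (with a tiny floor to avoid taking log of zero)."""
    return 10 * math.log10(sum(s * s for s in signal) / len(signal) + 1e-20)


rng = random.Random(0)
tv_audio = [rng.uniform(-1.0, 1.0) for _ in range(4000)]   # outgoing stream
room = [0.6, 0.3, -0.2, 0.1]                               # hypothetical room response
echo = simulate_echo(tv_audio, room)
mic = [e + rng.gauss(0.0, 0.001) for e in echo]            # echo plus a little mic noise
cleaned = nlms_echo_cancel(tv_audio, mic)

# Compare the level before and after, once the filter has settled.
reduction_db = power_db(mic[-500:]) - power_db(cleaned[-500:])
print(f"echo reduced by about {reduction_db:.0f} dB")
```

This toy room is short and never changes, so the filter has an easy job. A real room response is much longer and shifts when furniture or people move, which is why a stale calibration hurts.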
For ambient noises in the room not coming from the TV, like a vacuum cleaner, loud conversation, etc., they can use beamforming. This is a technology where you use the physical characteristics of a multi-mic system to generate a directional filter. In effect, you only hear things coming from a single specific direction. Couple this with Kinect head tracking, and you can clearly hear a person speaking in a loud room. We did this with vacuum cleaners and other very loud environments, and the effect is quite impressive.
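The simplest form of beamforming is delay-and-sum, and a rough sketch of it looks like the code below. This is an illustration of the general technique, not the Kinect's actual filter. It assumes a four-mic array, whole-sample delays, and that the per-mic delays for the talker's direction are already known (in the real system that direction would come from something like head tracking). All names and numbers are hypothetical.

```python
import math
import random


def delayed(signal, delay):
    """Delay a signal by a whole number of samples, padding with zeros."""
    return [0.0] * delay + signal[: len(signal) - delay]


def delay_and_sum(channels, steering_delays):
    """Undo each mic's expected delay for the look direction, then average."""
    length = min(len(ch) - d for ch, d in zip(channels, steering_delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, steering_delays)) / len(channels)
        for n in range(length)
    ]


def mean_power(signal):
    return sum(s * s for s in signal) / len(signal)


rng = random.Random(1)
n_samples = 2000
voice = [math.sin(2 * math.pi * 0.01 * n) for n in range(n_samples)]  # stand-in for speech
vacuum = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]              # stand-in for noise

# Assumed arrival delays (in samples) at each of the four mics.
voice_delays = [0, 2, 4, 6]    # talker off to one side
vacuum_delays = [6, 4, 2, 0]   # noise source off to the other side

mics = [
    [v + z for v, z in zip(delayed(voice, dv), delayed(vacuum, dz))]
    for dv, dz in zip(voice_delays, vacuum_delays)
]

# Steer at the talker: the voice lines up across mics, the noise does not.
focused = delay_and_sum(mics, voice_delays)

noise_single = mean_power([m - v for m, v in zip(mics[0], voice)])
noise_focused = mean_power([f - v for f, v in zip(focused, voice)])
print(f"noise power, one mic: {noise_single:.2f}, beamformed: {noise_focused:.2f}")
```

The voice adds up coherently after alignment while the misaligned noise partly averages out. With only four mics and plain averaging the gain is modest; practical systems use fractional delays and frequency-dependent weights to do better.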
A third technology, which will probably no longer be used since they moved the mics out of the Kinect enclosure, was what we jokingly called the "Phase Compensator", or what the PMs called the "Static Tone Remover". This was a filter designed to detect and notch out a single frequency, and was originally used to compensate for the stupid fan in the Kinect housing (who the hell puts a fan next to a microphone???). It also worked well for other sounds that usually stick close to a single frequency, like vacuum cleaners, TV test tones, and sine wave generators. It has the disadvantage of reducing voice quality slightly if the tone is in the vocal range, which is why I think they won't use it anymore.
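To give a feel for the notching half of that filter, here is a small sketch using a standard second-order (biquad) notch with the well-known Audio EQ Cookbook coefficients. It is not the original "Static Tone Remover" code, and it skips the detection step entirely: the tone frequency is simply assumed. The 1000 Hz "fan" tone, the 16 kHz sample rate, and the Q of 30 are invented numbers.

```python
import math


def notch_coefficients(tone_hz, sample_rate_hz, q=30.0):
    """Biquad notch coefficients (Audio EQ Cookbook form), normalized so a[0] == 1."""
    w0 = 2 * math.pi * tone_hz / sample_rate_hz
    alpha = math.sin(w0) / (2 * q)   # higher q means a narrower notch
    a0 = 1 + alpha
    b = [1 / a0, -2 * math.cos(w0) / a0, 1 / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0]
    return b, a


def biquad(signal, b, a):
    """Run a signal through a second-order IIR filter (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def tone(freq_hz, sample_rate_hz, n):
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


def rms(signal):
    return math.sqrt(sum(s * s for s in signal) / len(signal))


fs = 16000                                  # assumed sample rate
b, a = notch_coefficients(1000.0, fs)       # pretend the fan hums at 1000 Hz

# Measure what survives after the filter has settled.
fan_left = rms(biquad(tone(1000.0, fs, 8000), b, a)[-1000:])
voice_left = rms(biquad(tone(300.0, fs, 8000), b, a)[-1000:])
print(f"1000 Hz tone after notch: {fan_left:.4f}, 300 Hz tone after notch: {voice_left:.4f}")
```

The tone at the notch frequency is wiped out while a tone well away from it passes almost untouched. Anything close to the notch gets attenuated too, which is the voice quality cost mentioned above when the tone sits in the vocal range.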