Higher resolution video processing requires MUCH faster GPUs because of the sheer volume of video data involved.
4x the resolution requires roughly 4x the processing power, unless you're using something like an O(n^2) algorithm that scales worse than linearly with pixel count.
Silicon is now fast enough, but the same issues that require separate memory for video buffers in GPUs (back buffer, front buffer) apply to inputting high-resolution video and running comparison templates to determine movement (old video frame compared to new frame) and what moved (frame compared to a template of a human body, face or voice).
You're kinda talking twaddle now. A 1080p frame is ~2 million pixels. At 24-bit RGB, that's ~6 megabytes a frame. 60 FPS would be 360 MB/s of bandwidth consumed for sifting through every pixel. A comparison between two 1080p60 framebuffers would be 720 MB/s. PS3 has >45 GB/s. Next gen will have more. A couple of GB/s from the total RAM pool is no great loss, and certainly nothing needing a whole extra memory system as you are suggesting.
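If anyone wants to check the arithmetic, here's a quick sketch of where those figures come from (assuming uncompressed 24-bit RGB frames touched once per pass; the exact numbers land a touch above the rounded ones):

```python
# Back-of-the-envelope bandwidth check for an uncompressed 1080p60 RGB stream.
WIDTH, HEIGHT = 1920, 1080       # 1080p
BYTES_PER_PIXEL = 3              # 24-bit RGB
FPS = 60

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL    # ~6.2 MB per frame
stream_bw   = frame_bytes * FPS                   # ~0.37 GB/s to read one stream
compare_bw  = stream_bw * 2                       # ~0.75 GB/s to read two framebuffers

print(f"frame size    : {frame_bytes / 1e6:.1f} MB")
print(f"single stream : {stream_bw / 1e9:.2f} GB/s")
print(f"frame compare : {compare_bw / 1e9:.2f} GB/s")
```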
Multitasking OSes have for decades been able to run concurrent, independent tasks on the same RAM and processors by switching between tasks on the fly. Video processing is no different. The current Kinect PC demos are doing exactly that, using the same processor to evaluate the Kinect data and run whatever other tasks are happening concurrently.
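As a toy illustration of that kind of time-sharing (nothing Kinect-specific, just stand-in workloads sharing one CPU and one memory pool):

```python
# One thread "evaluates camera frames" while the main loop gets on with other
# work; the OS time-slices both over the same processor and RAM.
import queue
import threading
import time

frames = queue.Queue()

def camera_worker():
    while True:
        frame = frames.get()                     # block until a new frame arrives
        if frame is None:                        # sentinel value: shut down
            break
        _ = sum(x * x for x in range(10_000))    # stand-in for tracking/recognition work

worker = threading.Thread(target=camera_worker, daemon=True)
worker.start()

for i in range(120):          # "game loop" running concurrently
    frames.put(i)             # hand the latest frame to the worker
    time.sleep(1 / 60)        # the rest of the frame's tasks would run here

frames.put(None)              # tell the worker to finish
worker.join()
```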
The cheat I mentioned before would have the color depth changed from 24 bits to 2 or 3, like a scanner switching from picture mode (greyscale JPEG) to text mode, which is either black or white with no gradations.
Again, this isn't at all accurate. You'd either have a posterized image losing all the information that denotes objects, or you'd have to dither it, making it nigh impossible to do optical processing on. And it'd still look like crap. If you're using the video feed in game, you'll need the full colour image. Good optical recognition wants as little noise and as much information as possible. JPEG compressing a video stream is bad enough, let alone throwing away most of the image information! And it's unnecessary. Future ports will be able to cope with higher camera resolutions. Maybe they'll be limited to 720p. Regardless, that's all covered by the general IO choices of the console and doesn't need any special attention, unless you feel a brand new port needs to be designed specifically for high-speed cameras because the likes of USB3 aren't up to it.
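For anyone curious what that bit-depth "cheat" actually does, here's a minimal sketch (assuming numpy, with a random array standing in for a real camera frame):

```python
# Quantising a 24-bit RGB frame down to a couple of bits per channel.
import numpy as np

frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)  # stand-in frame

def posterize(img, bits):
    """Keep only the top `bits` bits of each 8-bit channel."""
    shift = 8 - bits
    return (img >> shift) << shift

low_depth = posterize(frame, 2)   # 4 levels per channel: heavy banding
# Subtle gradients that recognition algorithms rely on collapse into flat
# blocks, which is the information loss described above.
```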
It's not that a main CPU can't do this; it's that memory time is being hogged by these processes, and conflicts with memory scheduling can result.
No different to every other system out there. We don't break PCs up into a processing component with CPU and RAM for audio, another for video, another for physics, another for browser, etc. We take one pool of resources and use it dynamically.
The choice is to increase memory speed even more (which would be expensive)...
Again, you're looking at a few GB/s maximum. In systems likely to have well in excess of 50 GB/s, that's not a problem that needs special attention.
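As a rough sketch of the proportion involved (using the earlier per-stream estimate and the ballpark system figure quoted above, not a measurement):

```python
# Share of a ~50 GB/s memory system consumed by the camera workload.
camera_bw_gbs = 0.75     # two 1080p60 framebuffer reads per frame, from the earlier estimate
system_bw_gbs = 50.0     # ballpark total bandwidth for a next-gen system
print(f"{camera_bw_gbs / system_bw_gbs:.1%} of total bandwidth")   # ~1.5%
```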
I would expect that we will have 3D 1080p (or possibly 720p) video cameras during the life of the PS4, and provisions to process two streams of video, which, combined with going from 480p to 1080p, will massively increase input video processing.
An 8x increase in needs will match a natural 8x increase in performance that comes with the next generation of console hardware. The impact will be no more than the current requirements are on this gen. XB360 wasn't designed with a memory and processing subsystem for a future 3D camera. Instead, Kinect works by using a fraction of the system's available resource pool, with no song-and-dance complications about it crippling the running of other applications because it's getting in the way of their memory accesses.
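To put a rough number on the increase, a quick pixel-count sketch (frame rate assumed unchanged, which is an assumption rather than a spec; the ballpark 8x figure above sits between the 720p and 1080p cases):

```python
# One 480p camera stream versus two 720p or two 1080p streams.
base        = 640 * 480            # single 480p stream
stereo_720  = 2 * 1280 * 720       # two 720p streams
stereo_1080 = 2 * 1920 * 1080      # two 1080p streams

print(f"720p stereo : {stereo_720 / base:.1f}x the pixels")    # ~6x
print(f"1080p stereo: {stereo_1080 / base:.1f}x the pixels")   # ~13.5x
```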
So even if you don't use a separate CPU and memory pool for video preprocessing, you would have to make provisions for 1080p 3D video input and gesture, facial and voice recognition, and that would affect CPU and memory choices. You must determine the features you will support before deciding what hardware and CPU you will use.
Only if the features are specialist. Everything you say can fit into the possible processing choices we've outlined before in this discussion. Only if you are doing something extraordinary that conventional processors can't cope with (in the same way a 2005 tri-core PPC and GPU can't cope with 2010 cutting-edge 3D vision tracking) would you need to consider extraordinary CPU solutions, but that would be cost prohibitive, meaning you'd drop that feature and go for a lesser one that works within budget-constrained hardware choices. Wii didn't include a gyro when it launched, even though that'd have provided the full features Nintendo wanted, because the cost was too high. They reduced features to match a price target. Next-gen consoles will have a CPU and GPU made to a price, and the interface options will be built around those, knowing that they'll consume a small fraction of resources.