Project Natal: MS Full Body 3D Motion Detection

Check it out ... http://www.engadget.com/2009/06/03/project-natal-video-hands-on-impressions-and-further-details/

"The body tracking is truly impressive -- according to Kudo, it's picking up 48 joint points on the human body. As soon as we stepped into line in front of the box, the avatar immediately took on our stance and movements. And we mean really took them on -- little gestures with our arms, the posture we had, front and back movements -- it completely tracked them with accuracy. We did notice a bit of stutter during some finer movements, but overall the effect was impressive (and more than a little eerie)."

"The accuracy is far better than you would imagine it could be; it's very impressive stuff."


I think we should take a wait-and-see approach before making any assumptions about the accuracy of the skeletal tracking. I have no doubt that finger tracking is out of the question, though. I'm also interested in knowing whether it could pick up hand rotations like Shifty suggested.

The dodgeball demo thing did seem to have those. Maybe it was just using IK to guess the position of the hand based on the position of the arm? Or, again, we may not have seen what we thought we saw, it wouldn't be the first time at E3.
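
For what it's worth, here's a minimal sketch of the sort of "guess the hand from the arm" trick being speculated about. Strictly this is forward extrapolation rather than full IK, and all the names and numbers are invented for illustration, not anything Natal is confirmed to do:

```python
import math

def guess_hand(shoulder, elbow, fore_len):
    """Guess a hand position from tracked shoulder and elbow points by
    assuming a straight arm: continue the upper-arm direction past the
    elbow by the forearm length."""
    dx, dy = elbow[0] - shoulder[0], elbow[1] - shoulder[1]
    norm = math.hypot(dx, dy) or 1.0  # avoid division by zero
    return (elbow[0] + fore_len * dx / norm,
            elbow[1] + fore_len * dy / norm)

# Example: shoulder at origin, elbow 0.3 m out along x, 0.25 m forearm.
print(guess_hand((0.0, 0.0), (0.3, 0.0), 0.25))  # -> (0.55, 0.0)
```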
 
Development cost depends on many factors. You can probably do a cheap or expensive project using the same technology.

I don't see Project Natal as a technology to reduce game dev cost in general. When you do mo-cap, you get tons of 3D geometry data. When you use a 3D camera, you really only get slices of 2D images. That's why the recognition can be fast.

Natal doesn't work by getting slices of 2D images, at least according to common consensus. It builds a 3D model of the room and updates it according to motion.

This is why it's a very different beast from 2D screen-diff motion capturing tech like EyeToy.
 
How can we say how limited it is? Speech-to-text is fairly robust, especially for game use. MS Research has an entire group dedicated to this and they've been working on it for many years: http://research.microsoft.com/en-us/groups/srg/

While I'm sure it won't enable fluid conversations with AI entities, it'll definitely be strong enough to pick up keywords and the general gist of speech to feed into conversation-tree systems like those in Oblivion or Mass Effect.

Of particular interest to the Milo demo are the following articles:
Understanding user's intent from speech: http://research.microsoft.com/en-us/projects/intentunderstanding/default.aspx
Microphone array processing and spatial sound: http://research.microsoft.com/en-us/projects/audioprocessing/default.aspx
Language modeling: http://research.microsoft.com/en-us/projects/language-modeling/default.aspx
Multimodal conversational user interface: http://research.microsoft.com/en-us/projects/slu/default.aspx
Speaker identification (who is speaking?): http://research.microsoft.com/en-us/projects/whisperid/default.aspx
SAPI (Speech API): http://research.microsoft.com/en-us/projects/sapi/default.aspx

MS' main advantage here is the enormous pool of MS Research talent they have to draw from. This kind of stuff has been worked on in many different silos for years, and Natal seems to just be putting them all together for a practical application.

I think his point is that, even with all that effort applied, it's still very, very limited compared to 'understanding natural human speech'. Even with pure text and tons and tons of material to 'learn' from, the success rate isn't great.
 
I think his point is that, even with all that effort applied, it's still very, very limited compared to 'understanding natural human speech'. Even with pure text and tons and tons of material to 'learn' from, the success rate isn't great.
It depends what we're talking about. Are we talking about developing a sentient AI, or gathering the information and intonation present in human speech? We can do the latter, not the former.

All we expect and need for gaming currently is the latter. Games will be on rails in some form for years to come; the computational complexity and algorithms simply aren't there for anything more. But the Oblivion example, that is definitely possible.

The tech currently exists to infer emotion and intonation (are they shouting, whispering, talking, asking a question, etc.) as well as getting a textual representation of what they are saying, which can feed into algorithms that trigger actions based on keywords or answers to asked questions (with synonyms, etc.). It won't be 100% perfect, but it need not be for things like conversations with NPCs in a single-player game...
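
As a rough illustration of that keyword-plus-synonyms triggering, here's a toy sketch. The intents and word lists are invented for the example; this is not any real speech API, just the shape of the idea:

```python
import re

# Map recognized text onto conversation-tree actions via keyword synonyms.
SYNONYMS = {
    "attack": {"attack", "fight", "strike", "kill"},
    "flee":   {"flee", "run", "escape", "retreat"},
    "agree":  {"yes", "yeah", "sure", "okay"},
}

def match_intent(recognized_text):
    """Return the first intent whose synonym set overlaps the recognized words."""
    words = set(re.findall(r"[a-z']+", recognized_text.lower()))
    for intent, synonyms in SYNONYMS.items():
        if words & synonyms:
            return intent
    return None  # fall back to a default dialogue branch

print(match_intent("Yeah, let's do it"))  # -> "agree"
```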
 
The dodgeball demo thing did seem to have those. Maybe it was just using IK to guess the position of the hand based on the position of the arm? Or, again, we may not have seen what we thought we saw, it wouldn't be the first time at E3.

What is IK? Do you mean the position of the hand as in whether your wrist is bent, or the position as in having it rotated a certain way? Otherwise the hand is commonly known to be at the end of your arm, which it should be able to track very accurately. They should also be able to do a bent wrist depending on where those 48 skeletal points are.
 
What is IK? Do you mean the position of the hand as in whether your wrist is bent, or the position as in having it rotated a certain way? Otherwise the hand is commonly known to be at the end of your arm, which it should be able to track very accurately. They should also be able to do a bent wrist depending on where those 48 skeletal points are.

That's more or less what I meant. It might not see or even be able to track a bent hand, but it might be able to 'guess' it and fake it, like I suspect the Motion+ does in a few demos I've seen.
 
It depends what we're talking about. Are we talking about developing a sentient AI, or gathering the information and intonation present in human speech? We can do the latter, not the former.

We can't really do either. Hence the 'very limited'. If you mean, say, filtering out everything but the few keywords you say, then fine, we can do that. But that's sort of equivalent to old Sierra games, except using voice.

The tech currently exists to infer emotion and intonation (are they shouting, whispering, talking, asking a question, etc.) as well as getting a textual representation of what they are saying, which can feed into algorithms that trigger actions based on keywords or answers to asked questions (with synonyms, etc.). It won't be 100% perfect, but it need not be for things like conversations with NPCs in a single-player game...

My point is that it's not even close to 100% perfect for text. If we set the bar low enough, sure, it works. But for heaven's sake, we're actually trying to set real goals based on something Molyneux tells us. Of course we'll fall short of the target.
 
They have great impressions of the 3D motion tracking. Why don't we wait for a hands-on writeup of the speech recognition too? What we have read so far is just tone and keyword recognition. I'll give the Milo concept video a rest.

Natal doesn't work by getting slices of 2D images, at least according to common consensus. It builds a 3D model of the room and updates it according to motion.

This is why it's a very different beast from 2D screen-diff motion capturing tech like EyeToy.

Yes. That has been clarified. My point about 3D tracking wasn't that it's impossible, but that it can usually be approximated. Things like color tracking may not even need 3D data.
 
They have great impressions of the 3D motion tracking. Why don't we wait for a hands-on writeup of the speech recognition too? What we have read so far is just tone and keyword recognition. I'll give the Milo concept video a rest.



Yes. That has been clarified. My point about 3D tracking wasn't that it's impossible, but that it can usually be approximated. Things like color tracking may not even need 3D data.

Fair enough, I'm just going by what I knew was state of the art in natural language processing in 2006-2007. Maybe there's been a gigantic breakthrough, but the problems were pretty fundamental. I really don't think consumer electronics will be on the forefront of this research (meaning, I'm sure MS Research's scientists are doing state-of-the-art work, but what they're doing isn't what we'll be seeing hit the market a year from now).
 
Check it out ... http://www.engadget.com/2009/06/03/project-natal-video-hands-on-impressions-and-further-details/

"The body tracking is truly impressive -- according to Kudo, it's picking up 48 joint points on the human body. As soon as we stepped into line in front of the box, the avatar immediately took on our stance and movements.
48? Sounding better all the time!

Natal doesn't work by getting slices of 2D images, at least according to common consensus. It builds a 3D model of the room and updates it according to motion.
Neither is an accurate description. What you get is exactly a z-depth value for each pixel along with the image. From the differences in distance from the camera, you can calculate object boundaries. From a knowledge of human anatomy, you can map areas of related depth to a human being and associate a physics skeleton with that. The understanding of the room is solely in the plane of the view.
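
To make that boundary-from-depth step concrete, a toy sketch (the numbers and array sizes are invented, and the real segmentation is obviously far more sophisticated):

```python
import numpy as np

# Toy depth map (metres): a "person" at ~2 m in front of a wall at ~4 m.
depth = np.full((6, 8), 4.0)
depth[1:5, 2:5] = 2.0

# Flag a boundary wherever adjacent pixels jump in depth by more than 0.5 m;
# the flagged pixels trace the person's silhouette against the wall.
h_edges = np.abs(np.diff(depth, axis=1)) > 0.5  # jumps between columns
v_edges = np.abs(np.diff(depth, axis=0)) > 0.5  # jumps between rows
```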

...The tech currently exists to infer emotion and intonation (are they shouting, whispering, talking, asking a question, etc) as well as getting a textual representation of what they are saying, which can feed into algorithms to trigger actions based on key words or answers to asked questions (with synonyms, etc). It won't be 100% perfect, but it need not be for things like conversations with NPCs in a single player game...
It doesn't need to be 100%, but it needs to be accurate enough that it doesn't get it wrong, which has long been the bane of voice control. Even a 1% failure rate means lots of faults. How'd you like to play a game where 1 time in a hundred pressing right causes your character to run left?! It needs to handle muddled speech without forcing the player to police their speech unnaturally. The voice filtering of an open mic has to be perfect, so a friend chatting or a dog barking in the background doesn't throw off the intonation measurements and cause a false positive. It'll also need some acting ability on the part of the user unless all the applications mirror the player's actual, everyday life, and then the system will need to understand hammed performances of the stoic warrior giving gruff commands to his comrades. Lots of pitfalls to worry about.
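
To put a rough number on that 1%: assuming independent errors, the odds of a clean session fall off quickly (a back-of-the-envelope sketch):

```python
# With a 1% per-command failure rate and independent errors, the chance of
# at least one misfire in a session of n voice commands is 1 - 0.99**n.
for n in (10, 50, 100):
    print(f"{n} commands: {1 - 0.99 ** n:.0%} chance of at least one misfire")
# 10 commands: 10% chance of at least one misfire
# 50 commands: 39% chance of at least one misfire
# 100 commands: 63% chance of at least one misfire
```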
 
It doesn't need to be 100%, but it needs to be accurate enough that it doesn't get it wrong, which has long been the bane of voice control. Even a 1% failure rate means lots of faults. How'd you like to play a game where 1 time in a hundred pressing right causes your character to run left?! It needs to handle muddled speech without forcing the player to police their speech unnaturally. The voice filtering of an open mic has to be perfect, so a friend chatting or a dog barking in the background doesn't throw off the intonation measurements and cause a false positive. It'll also need some acting ability on the part of the user unless all the applications mirror the player's actual, everyday life, and then the system will need to understand hammed performances of the stoic warrior giving gruff commands to his comrades. Lots of pitfalls to worry about.

And that's before adding a Scottish/French/Chinese/Indian accent, or a language other than English.

Edit: To me this sounds very similar to the PS2 Emotion Engine marketing, and that's about it :) Good stuff, but the amount of hype and the level of expectations are at a ludicrous level (but I would love to be wrong on this one).
 
We can't really do either. Hence the 'very limited'. If you mean, say, filtering out everything but the few keywords you say, then fine, we can do that. But that's sort of equivalent to old Sierra games, except using voice.


My point is that it's not even close to 100% perfect for text. If we set the bar low enough, sure, it works. But for heaven's sake, we're actually trying to set real goals based on something Molyneux tells us. Of course we'll fall short of the target.
This is not at all my experience. Most commercial solutions today hit between 98 and 99% accuracy for speech-to-text. If it's viable enough for use in fighter aircraft like the F-22 and F-35, it's good enough for a single-player RPG...
 
It's easier if there is a specialized vocabulary and location. Doctor, pilot, mechanic, helpdesk, ... applications are easier. Free-form speech from casual users worldwide in uncontrolled locations is harder.

In a game, the developers can cheat using keyword recognition anyway. So we may never get to see the natural language recognition.
 
This is not at all my experience. Most commercial solutions today hit between 98 and 99% accuracy for speech-to-text. If it's viable enough for use in fighter aircraft like the F-22 and F-35, it's good enough for a single-player RPG...

Are we talking about speech to text or understanding natural language? Very different problems. Voice -> internal representation is the easy part compared to trying to parse natural language.
 
This is not at all my experience. Most commercial solutions today hit between 98 and 99% accuracy for speech-to-text. If it's viable enough for use in fighter aircraft like the F-22 and F-35, it's good enough for a single-player RPG...
Is the range of vocabulary comparable between everything you could say in an RPG and every command you'd use in a jet fighter? Does intonation matter to a fighter pilot? The chance of errors grows rapidly as you increase the vocabulary and the number of similar sounds. As for speech-to-text being 99% accurate, that's great when you can edit the mistakes afterwards. If, in a game, one in a hundred times of talking to an NPC you accidentally insult them, that won't go down at all well! It needs to be dependable to be more than a novelty.
 
This is not at all my experience. Most commercial solutions today hit between 98 and 99% accuracy for speech-to-text. If it's viable enough for use in fighter aircraft like the F-22 and F-35, it's good enough for a single-player RPG...

Those are marketing numbers. You need hours of training the software, low noise, etc. I don't know why anyone is talking about speech recognition; you don't need a new device for this, everyone has a mic now. If it worked so well, we'd already be using it. The only examples I know of are the Tom Clancy RTS and SingStar.
 
Actually, I think you are all going too far in what voice recognition will enable; I guess it's the Molyneux effect... setting unrealistic expectations and making people forget what the tech could bring in the near future.
As far as I'm concerned, I would be happy if the system could properly understand many simple orders with a minimal failure rate; that would be a good start. As a controller, being able to reliably call up various menus, cast spells, give orders to units, etc. would do, and it would be even better if it worked just as well in local multiplayer.
 
Yes, then it's already shipping now. See DrJay24's SingStar example above. You don't need to wait until fall 2010.
 
Yeah, SingStar already has to pick up 1000+ song titles and a few hundred artists, and it manages to do so without any training, so that's not bad. It understands me almost perfectly, though my English is pretty good for a foreigner. ;)
 