When will we have good speech synthesis for games?

nenarek

I was thinking that one of the features that could really up the bar for interactive worlds is good speech synthesis. When are we likely to see this on consoles?

Voice acting is nice, but in a way I see it as limiting to game creators as recorded video was to the early full motion video games. There is no way to make thousands of characters interact vocally with the player through recorded speech. This is a storage space limitation as well as a voice actor time (and money) limitation.

AT&T has some reasonably good speech synthesis technology. Here is an example:

http://www.naturalvoices.att.com/demos/Recorded.html

If a game like GTA or Elder Scrolls could have cities full of people with synthesized speech, instead of a few canned phrases or a box full of text to read, it would really raise the bar on how immersive the worlds are.

The processor requirements for the AT&T technology don't seem too high (300 MHz CPU, 128 MB RAM), but I wonder if this would be practical in-game on the Xbox 360 or PS3. Of course, for this to be totally believable they would have to refine the technology and find some way of carrying emotion through synthesized speech.

If it isn't possible with this generation of consoles I hope XBOX 3 and PS4 will be able to do this.
 
Unfortunately, I think in a lot of cases this idea has been abandoned. It would be nice if it eventually gets picked back up though. :cry:
 
Hopefully stuff like this will become more possible with all these CPUs in the next-gen consoles. Put those suckers to work I say!
 
Even those examples sound blobby. Getting speech synthesis to replace recorded voices is a LONG way off, especially considering how far we HAVEN'T come since the '80s!
 
I recently looked at a whole bunch of solutions.

Basically, the dictionary of sounds they require for high-quality synthesis is huge. You can make it sound good IF you handcraft every sentence with all the hints, but you can't easily do this for dynamically created text. In the end, if you're going to the trouble of handcrafting every sentence in the game, you might as well just record the voice and stream it off disc.
 
The other solution would be wave synthesis: modelling the human vocal pathway through a physics model as air passes through the vocal apparatus and pressure changes are created. I imagine this will be the future. There's good work on this principle in musical instrument modelling, but of course a flute is a damned sight easier to model than lungs+larynx+tongue+mouth+lips!

So maybe in 10-20 years we'll start to see results?
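
To give a feel for the instrument-modelling side of this, here is a minimal sketch of the Karplus-Strong plucked-string algorithm, one of the simplest physical-modelling techniques. It is only an illustration of the principle, not anything close to a vocal tract model, and the function names and default values (sample rate, damping factor) are my own assumptions.

```python
import random
import struct
import wave


def pluck(frequency_hz, duration_s, sample_rate=22050, damping=0.996, seed=0):
    """Karplus-Strong string: a noise burst circulating in a damped delay line."""
    rng = random.Random(seed)
    # The delay-line length sets the pitch (one trip round the "string").
    period = max(2, int(sample_rate / frequency_hz))
    # Excitation: the pluck is modelled as a burst of random noise.
    line = [rng.uniform(-1.0, 1.0) for _ in range(period)]
    out = []
    for i in range(int(duration_s * sample_rate)):
        idx = i % period
        nxt = (idx + 1) % period
        sample = line[idx]
        out.append(sample)
        # Averaging two neighbouring samples acts as a low-pass filter,
        # so high harmonics die away first, as on a real string.
        line[idx] = damping * 0.5 * (sample + line[nxt])
    return out


def write_wav(path, samples, sample_rate=22050):
    """Write mono 16-bit PCM so the result can be auditioned (assumed helper)."""
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)


samples = pluck(220.0, 0.5)
```

Calling `write_wav("pluck.wav", samples)` would save the note for listening. A string needs one delay line and one filter; a voice would need a far richer model of the whole vocal apparatus, which is the hard part.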
 
I wonder when it will be more immersive for an RPG to use synthesized speech instead of a box full of text?

A robot voice in a medieval setting would be jarring, but you are also taken out of the game world when you have to read a box full of text when you are "talking" to a game character. It also ruins the immersion when game characters with recorded speech start repeating the same phrases after the first hour of gameplay.

Will games achieve photo realism before they can do realistic speech?

How realistic does speech have to be before it is good enough?

I guess it comes back to the industry focus on eye candy instead of other elements that could enhance the experience like solid AI or new gameplay mechanics. Is this because the methods for producing more realistic visuals are well known? When processing power increases developers seem to know how to make use of it for visual realism. What makes natural speech a harder problem to solve?

After the graphics are good enough maybe then we will see more work on speech and realistic AI.
 
nenarek said:
Will games achieve photo realism before they can do realistic speech?
Yes I think so.

How realistic does speech have to be before it is good enough?
Very!

I guess it comes back to the industry focus on eye candy instead of other elements that could enhance the experience like solid AI or new gameplay mechanics.
Producing good visuals is a doddle. It's easy to recreate a virtual world as objects and light paths, simulating a primitive rendition of optics. Human artists can create the models - we don't need to mathematically model them. So data can hold some car models, and the visual element need only simulate how they would look when lit up. That said, it still takes a lot of processing power (by our standards) to get realistic visuals.

Audio creation on the other hand can't just apply a simple process to premade building blocks. Or at least, as has been pointed out by ERP, you can't do it in a way worth bothering with. So to create realistic speech you need super advanced mathematical modelling of the speech process, something that's not known yet. It's not because eye-candy is more immersive (though it is I'm sure most'll agree), but eye-candy is much, much easier!

As for how good does simulated speech have to be? Extremely good if it's to contribute anything to the game experience and to rival actors. And that's a huge crux. It's not enough just to have a realistic voice that doesn't sound like a robot. That simulated voice needs to be able to 'act' as well. It needs to use intonation to convey emotion, and at a very subtle level as the human mind is very sensitive to this aspect of communication. Even if they could simulate the human voice perfectly in the 8th iteration of Elder Scrolls, if they can't get the intonation right it'll just sound like a terribly bad actor and still destroy the illusion.

After the graphics are good enough maybe then we will see more work on speech and realistic AI.
This next gen should see advances in AI and physics, and interaction (HDIP EyeToy for example). From there on...not sure. Maybe more abstract physics like fluid dynamics and worlds with actual volumetric atmosphere so as you drive past a tree the dead leaves around it lift up and swirl around after the car. But sound synthesis is still a long way off, unless there's a breakthrough discovery.

I'll be interested to see if Linux on Cell amounts to anything, and what happens with instrument synthesis, which is severely processor bound at the moment, but which I think is well suited to the SPEs.
 
If you consider the massive capacity of Blu-ray drives, you may decide it is just as easy to have actors read several hours of text and store their actual voices instead of trying to use AI to translate the printed conversations. The result is more authentic; save for half-baked voice acting, you have the bonus of authentic vocal intonation.
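
As a rough back-of-the-envelope check on the capacity point, here is a small sketch. The 25 GB figure is a single-layer Blu-ray disc; the 64 kbit/s speech bitrate and the 5-second average line length are purely assumptions for illustration, and no real game would give the whole disc to voice.

```python
BLU_RAY_SINGLE_LAYER_BYTES = 25 * 10**9  # single-layer disc capacity


def hours_of_speech(capacity_bytes, bitrate_kbps):
    """Hours of compressed audio that fit in the given capacity."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return capacity_bytes / bytes_per_second / 3600


def lines_of_dialogue(capacity_bytes, bitrate_kbps, seconds_per_line):
    """Number of recorded lines that fit, given an average line length."""
    bytes_per_line = bitrate_kbps * 1000 / 8 * seconds_per_line
    return int(capacity_bytes / bytes_per_line)


# Assumed values: 64 kbit/s compressed speech, 5 seconds per line.
hours = hours_of_speech(BLU_RAY_SINGLE_LAYER_BYTES, 64)
lines = lines_of_dialogue(BLU_RAY_SINGLE_LAYER_BYTES, 64, 5)
```

Under those assumptions a whole disc holds roughly 868 hours of speech, or about 625,000 five-second lines.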
 
I doubt that drive capacity will be the limiting factor in recorded voice for games. I think that actor time and the money developers have to pay to actors will limit voice. I don't think it will be financially possible to record enough voice for rich interaction with more than just a few characters.

Developers won't spend money on voice acting for complex interaction that most gamers will never explore, but the only way for recorded voice to be convincing is if there is enough of it that you won't explore the full interaction tree.

As long as we have to have recorded voice I think it will be hard to break away from games that have simple branching conversations and story lines. KOTOR had some nice voice conversations with characters in the game but it still reminded me of the old choose your own adventure books. If you would climb the wall, turn to page 47. If you would like to break down the door, turn to page 83.

Of course the new consoles and new games will break down other barriers to immersion. I just hope that sometime in the next few generations of hardware we can cross the speech barrier. (Of course it is more a software issue than a hardware issue but I am using hardware generations to approximate generations of gaming.)
 
There are a few companies out there that make some really believable voice sets for various things. A friend of mine was working on a project at his job that involved one (I can't for the life of me remember the company that made it though -- I think it was AT&T and Loquendo that I heard). They still aren't 100% life-like, but they are pretty close -- I think, for now, we're better off with scripts.

I think it's just a matter of whether companies think it's worth licensing these voice sets (because creating them is far more work than it's worth for going into a game). After we see graphics hit the imaginary bar we'll probably see stuff like this come into play more often.
 
Oh please, voice acting is so cheap that even respectable voice actors do not make enough from it for it to be their primary source of income.
 