Accurate human rendering in game [2014-2016]

That UFC looks pretty realistic to me. The characters are a little robotic, especially in their reactions to punches and kicks, but overall it looks very good.
 
That UFC looks pretty realistic to me. The characters are a little robotic, especially in their reactions to punches and kicks, but overall it looks very good.
It was all good until they started fighting. And also, 2 minutes 40 seconds before the first punch is thrown? That had better be skippable.
 
It was all good until they started fighting. And also, 2 minutes 40 seconds before the first punch is thrown? That had better be skippable.

I'm sure all the entrances and presentation are skippable. They have been in EA's boxing titles, and they are in their other sports titles like NHL, FIFA, and Madden.
 
While I'm a big fan of hand-crafted characters, there's also a trend to go for completely realistic styles, and that absolutely requires some sophisticated data acquisition. Several studios have moved on to this approach; the games include Crysis 3 and Ryse, Infamous Second Son, The Order, Quantum Break, Halo 4 and probably 5, Call of Duty: Advanced Warfare and so on. Beyond also did a subset of the stuff below, but they were using a different method of performance capture that's - IMHO - an inferior approach. Overall, I fully expect this trend to continue, with a lot of game developer R&D invested in it in the coming years.

Now, on our current project we've been working with scanned talent for a lot of the characters. I thought it might be interesting to share some of our experiences from the past few months with you guys; despite being quite familiar with the general process, I've still learned a lot about how it works in practice.

The most important thing to realize is that evolution has trained us to be incredibly good at recognizing even the smallest details of a human face. The ability to determine someone's intent, deduce their condition (stuff like restlessness, sickness, death etc.), follow their thought process and so on are huge advantages for survival. Funnily enough, most of this happens unconsciously, and it takes many years of study and research to gain an actual understanding of these factors. The current state of the art in CG is the result of decades of university research in a vast number of different fields, and while we're getting closer and closer, there's still a lot to learn.

The actors do the performance capture (body movement, facial expressions and voice), they also lend their likeness through face scans and facial expression scans, and we also get a LOT of reference photographs, both polarized and unpolarized (basically, using polarizer filters on the lights and the camera lens will remove all reflections / speculars - if you have polarized sunglasses, you can take a look at your own arms and hands to see the effect). To be clear, we do neither the photo shoots nor the performance capture here, they're all acquired by the client - but we do have our own scanning rig as well and we use it for various other aspects of the characters and environment elements, so I also have some first-hand experience with the approach.

Scanning is usually implemented using stereo photogrammetry - basically, surround the talent with a lot of cameras and fire them all at once. It helps to get the lighting as even and diffused as possible, to avoid reflections and shadows too.
Then the software (Agisoft PhotoScan most of the time) analyzes the images and matches points based on the surface texture variation (thus photogrammetry won't work on single-colored surfaces), building colored point cloud data. It can then build a mesh on top of it at several million triangles of detail. This is what the results can look like:

[Image: headscan_01.jpg]


The accuracy of the result depends on several factors - generally, more cameras allow better coverage of the subject; higher quality cameras provide the software with better tracking points (skin pores, moles, wrinkles etc.); more even lighting also helps. Any kind of hair (including eyelashes and eyebrows, or the tiny fibers on cloth) is near impossible to track and also muddles up the scan.
Incidentally, I had to shave my entire upper body this week... :p
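
To make the point-matching and reconstruction step a bit more concrete, here's a minimal sketch (Python/NumPy, with made-up camera numbers - not anything from Agisoft) of the triangulation that happens once a feature has been matched between two calibrated cameras; the real software just does this, far more robustly, for millions of matched features at once:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a single matched feature.

    P1, P2 : 3x4 projection matrices of two calibrated cameras
    x1, x2 : (u, v) pixel position of the same skin detail in each photo
    Returns the estimated 3D position of that detail.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Toy setup: two identical cameras (made-up intrinsics), the second shifted 0.5 m sideways.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

point = np.array([0.2, 0.1, 2.0])                        # a "skin pore" 2 m in front
x1 = P1 @ np.append(point, 1.0); x1 = x1[:2] / x1[2]     # its pixel position in camera 1
x2 = P2 @ np.append(point, 1.0); x2 = x2[:2] / x2[2]     # ...and in camera 2

print(triangulate(P1, P2, x1, x2))                       # recovers ~[0.2, 0.1, 2.0]
```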

There's also software that can use polarized and unpolarized image pairs to capture a very fine level of detail based on the speculars, but it isn't really available on the market yet. Interestingly enough, a pretty advanced one has actually been developed at Disney(!).
Polarized images may also make it possible to capture the proper reflectance properties of the face, as the amount of oil on top of the skin (which determines the size and strength of the highlights) changes across the various areas, and is another very important aspect of getting a good likeness. However, this is something that's based on anatomy and thus is usually very similar for most humans, so it's not strictly necessary to acquire data for it. The reference photo shoot can also provide the necessary information, especially if it's possible to match the lighting used in the studio and the renderer uses physically based shading and lighting.
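
As a rough illustration of why those polarized/unpolarized pairs are so useful (a simplified sketch, not our actual texture pipeline - the filenames are hypothetical and it glosses over alignment, calibration and color management): cross-polarizing the lights and the lens removes the speculars, so subtracting that photo from an unpolarized shot of the same pose leaves approximately the specular-only contribution, which is a handy starting point for authoring reflectance maps.

```python
import numpy as np
from imageio import imread, imwrite

# Cross-polarized shot: polarizers on the lights and the lens kill the surface reflections,
# leaving only the diffuse (subsurface) component.
diffuse = imread("cross_polarized.png").astype(np.float32) / 255.0

# Unpolarized shot of the exact same pose and lighting: diffuse + specular.
full = imread("unpolarized.png").astype(np.float32) / 255.0

# The specular-only signal is (approximately) whatever the polarizers removed.
specular = np.clip(full - diffuse, 0.0, 1.0)

imwrite("specular_estimate.png", (specular * 255).astype(np.uint8))
```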

This is what a setup looks like, courtesy of Ten24 in the UK, the studio that did the scanning on the Halo 4 Spartan Ops CG movies:
[Image: Full-Body-Scanner_001.jpg]



The baseline set for a good head scan is about 20 cameras, but results keep getting better up to about 40-50 units. Full body scans work better with landscape orientation and splitting up the height of the talent into 3-4 rows, so there you can go up to a hundred cameras. This is because Agisoft works better if most of the source image is the subject and not the background.
These all need to be the same DSLR model, and you also need a lot of extra equipment and software to manage stuff like synchronization, image downloads and deletion; you'll need professional flashes, a completely white studio environment to reflect diffused light, and you also don't want any batteries to go out. So it's not a cheap thing, but more importantly, the results are usually not immediately usable - not at all.

The only straightforward solution would be to use 4D scanning, which means that you capture the entire facial performance at 60fps and automatically generate a dynamic mesh and textures; with properly sophisticated equipment, this can create an animated 3D model that can be re-lit realistically. But it doesn't allow for any custom animation at all, and the datasets are also pretty big, so it's not really viable for realtime applications. LA Noire did use this method, but the results were still creepy IMHO; and it's also impossible to match the facial performance with body mocap, so there's always going to be some lack of coherence in the character. It also means that either you 4D scan EVERY character, or you need to develop two parallel asset production pipelines and then match the results...
As far as I know, no other developers have decided to use it (although I do know that some have indeed experimented with it).

Other methods of scanning include the Light Stage, a geodesic sphere arrangement of computer-controlled LED lights that allows much greater control, but the gear is FAR more expensive; and there's also laser scanning, which unfortunately takes some time for the light beam to trace the entire surface, so even the tiniest movements of the subject can create distortions. You also need to put the subject on a turntable, as the laser rig is usually too big to move. I believe laser scanning was the first to be used, and photogrammetry has only taken off in the past few years. Photogrammetry is, however, more affordable, as photographic equipment is a commodity product, available at relatively affordable prices around the world.

So, once you have your 3D scan of the face and perhaps the expressions, you can start to build the assets. This is fairly straightforward, but you'll definitely have to clean up the scans manually in ZBrush or Mudbox as a first step, removing messy parts caused by hair or lack of coverage (teeth and eyes are also notoriously hard to scan). Skin detail usually has to be augmented as well. Then you can use the polarized photos to create the color texture, but this is where you run into the first problem: accurate representations of real-life people can look pretty unattractive.
Movies and HD TV shows use a lot of pretty sophisticated make-up to hide the various skin blemishes and even compensate for weak facial features to some extent. It's also necessary to scale down the reflections, because cameras require very strong lighting and that'd bring out the natural oiliness - just look at photos taken with a simple flash to see what I mean. Finding the right balance between what to fix and smooth out and what to leave from the original images can be quite complicated, because using too much "digital make-up" will start to make the face look more and more artificial.

Human skin also requires some sort of subsurface scattering to render realistically - the skin itself is highly transparent, and most of the underlying tissue is basically tiny cells with transparent membranes and lots of water in them. This is also true for the veins, which carry red blood.
Our shader requires several textures for the various layers; the skin layer itself is surprisingly scary as it needs a greenish hue, while the deeper layers get warmer tones. The thickness of the tissue also changes a lot across the head; there are bony areas on the forehead and the bridge of the nose, but the jaw bone, for example, is actually pretty small, and even on slim people there's at least a centimeter-thick layer of muscle and skin covering it. Make-up worn by women is another interesting feature.
The actual amount of translucency of the skin is usually determined only by race and is fairly consistent between people, although age can also be an important factor.
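
For some context on how that layered, warm falloff is commonly approximated (a generic sum-of-Gaussians sketch, not our actual shader, and every number in it is illustrative rather than measured): the diffusion profile is modeled as a few Gaussian lobes whose red weight dominates at larger radii, and the lit signal is then blurred by that profile per channel.

```python
import numpy as np

def gaussian(r, v):
    """Normalized 1D Gaussian with variance v (mm^2), evaluated at radii r (mm)."""
    return np.exp(-(r ** 2) / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

# Illustrative sum-of-Gaussians diffusion profile: the narrow lobes carry all channels,
# the wide lobe is dominated by red, since red light travels furthest through the tissue.
# Variances and RGB weights are made up for this sketch, not measured skin data.
LOBES = [
    (0.05, np.array([0.30, 0.35, 0.40])),
    (0.60, np.array([0.35, 0.45, 0.45])),
    (2.00, np.array([0.35, 0.20, 0.15])),
]

def diffusion_profile(r):
    """RGB profile R(r): relative amount of light exiting at distance r from where it entered."""
    return sum(gaussian(r, v)[:, None] * w for v, w in LOBES)

# Scatter a 1D strip of irradiance (a hard shadow edge) to see the characteristic red bleed.
x = np.linspace(-5.0, 5.0, 501)              # positions along the skin, in mm
irradiance = (x > 0.0).astype(np.float64)    # lit on one side, shadowed on the other
dx = x[1] - x[0]

scattered = np.zeros((len(x), 3))
for i, xi in enumerate(x):
    weights = diffusion_profile(np.abs(x - xi))          # (N, 3) contribution of every sample
    scattered[i] = (weights * irradiance[:, None]).sum(axis=0) * dx

# Near x = 0 the red channel now reaches further into the shadow than green or blue,
# which is the soft, warm falloff you see around shadow edges on real skin.
```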

Eyes are also incredibly complex on their own - the tissue is highly translucent as well, requiring subsurface scattering; and it can also receive light through the eyelids, as they're thin enough (although much thicker than you'd think, 2-4 millimeters). The eyes are also covered by a thick fluid with a pretty uneven surface that scatters reflections and doesn't necessarily follow eyeball rotations 1:1, and it also tends to accumulate near the eyelids, creating a thin line of highlight at the bottom and the top. Then there's the refractivity of the cornea, which distorts the position of the pupil from side views and also creates caustics. The eyeballs also reflect the eyelashes pretty strongly (or rather, the eyelashes block the reflections of the environment) and receive subtle shadows from the eyelids too.
And the iris is flexible tissue as well, so it actually jiggles a lot as the eye moves - something that can only be seen in slow motion video but may affect our subconscious responses as well (it's also pretty damn freaky - https://www.youtube.com/watch?v=Fmg9ZOHESgQ). Oh, and eyeballs don't move at a constant speed; they tend to have this start-and-stop-and-start movement, and they never really rest still either - animators call these "eye darts".
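
Of all of that, the cornea refraction is actually one of the more tractable parts: refract the view ray at the corneal surface with Snell's law and intersect it with the iris plane to find which bit of the iris texture is really visible. Here's a small sketch with ballpark (not calibrated) numbers for the index of refraction and the iris depth:

```python
import numpy as np

def refract(I, N, eta):
    """Refract direction I at a surface with unit normal N (Snell's law).
    eta = n_incident / n_transmitted; returns None on total internal reflection."""
    cos_i = -np.dot(N, I)
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)
    if k < 0.0:
        return None
    return eta * I + (eta * cos_i - np.sqrt(k)) * N

# Ballpark eye numbers (assumptions for this sketch): cornea IOR around 1.376,
# iris plane sitting a few millimetres behind the front of the cornea.
ETA = 1.0 / 1.376          # air into cornea, so no total internal reflection can occur
IRIS_DEPTH = 0.003         # metres behind the corneal surface

def iris_hit(view_dir, surface_point, surface_normal):
    """Where the refracted view ray lands on the (flat) iris plane z = -IRIS_DEPTH,
    with the cornea apex at the origin and the eye looking along +z."""
    T = refract(view_dir, surface_normal, ETA)
    t = (-IRIS_DEPTH - surface_point[2]) / T[2]
    return surface_point + t * T

# A grazing view: the visible pupil position shifts compared to an unrefracted ray,
# which is the side-view distortion mentioned above.
view = np.array([0.4, 0.0, -0.9]); view /= np.linalg.norm(view)
hit = iris_hit(view, surface_point=np.array([0.0, 0.0, 0.0]),
               surface_normal=np.array([0.0, 0.0, 1.0]))
print(hit)   # the lateral offset on the iris plane is smaller than a straight ray would give
```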

Once you have a textured and shaded head, you'll also need to create the hair. We've found that it's very, very important to match it properly - checking the likeness to the talent is near impossible with a bald head. It's also important to get the eyebrows and eyelashes right (also some pretty complex stuff), as well as the size of the iris and the eyeball in general. These are all aspects where the scan data will be very little help, and you'll need some talented artists with a good eye to get the right results.
Here's a nice illustration for this stuff, from The Order:
[Image: image_the_order_1886-23533-2752_0004.jpg]


Something not implemented in games most of the time is that faces (and the entire body as well) are covered by an additional layer of thin, short white hairs, usually 2-4mm long and highly transparent - we tend to call this peach fuzz as it's similar to what you can see on that fruit. But due to the unique reflectance and transparency of hair, it actually changes the look of the face considerably, as the hairs along the silhouette can catch a lot of backlight that shines through. I'm not sure, but some games may try to recreate this in the skin shader.
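
If a skin shader does try to fake it, it's usually something cheap along these lines - a grazing-angle term that gets boosted when the light sits behind the silhouette. This is purely an illustrative sketch (the constants and falloff shape are made up), not how any particular game does it:

```python
import numpy as np

def peach_fuzz(N, V, L, fuzz_color=np.array([1.0, 0.9, 0.85]),
               rim_power=3.0, strength=0.25):
    """Very rough 'peach fuzz' approximation for a skin shader.

    N, V, L: unit surface normal, view direction (surface -> eye) and
    light direction (surface -> light). Returns an additive RGB term.
    """
    # Fuzz is mostly visible at grazing angles, where the tiny hairs catch the light.
    rim = (1.0 - max(np.dot(N, V), 0.0)) ** rim_power
    # It lights up strongest when the light is roughly behind the silhouette,
    # shining through the transparent hairs toward the camera.
    back = max(np.dot(-L, V), 0.0) * 0.5 + 0.5
    return strength * rim * back * fuzz_color
```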

Now that the head is complete and you can check the likeness, we've found that two pretty different things can happen. On some of the characters the likeness was immediately spot on and almost no extra work was required. On others though, everything felt completely off and the CG character looked way too artificial, despite matching the model to the scans as closely as we could.
I'm not yet sure about the cause of this problem, so I can only offer a few ideas here.
For example, the jaw can be held in many different positions depending on state of mind and such - the actor might try to present the best possible face for the photo shoot but may get tired of the scanning process and let it sag down a little by the time the cameras and lighting and focus are all aligned. The process can take as much as 15-30 minutes, and it's very exhausting to hold the same pose (especially for a full body scan) for that long.
There may also be cases where we sort of know the actor from various roles or publicity photos, where the director of photography and the lighting crew worked hard to present them at their best by carefully tuning the shadows, the camera angles and the lenses, not to mention the make-up artists. But the scans expose everything as it is, and thus you need to work on your CG lighting and cameras and such to match their efforts in order to create a recognizable likeness.

This is also the area where the 80-20 rule applies the most - or in other words, the effort invested vs. the results gained can be plotted as an exponential curve. Each extra percent of quality will require a lot more effort and work than the previous one, and you can easily end up fine-tuning the CG head for as long as it took to create the first version - maybe even longer. This is probably an exceptionally important issue for games with celebrity talent - well-known voice or TV actors and such - where the investment in hiring the talent will only pay off if the likeness of the 3D character is close enough, as otherwise the audience will reject the results, no matter how good the performance may be.

The other aspect of the scan-based workflow is of course the facial animation. Here, the standard method is to reference the FACS system, developed in the '70s by American psychologists as a tool to compare the facial expressions of their patients in order to gain better insight into, and an ability to compare, their underlying thought processes. A pretty good example of its use is detecting when someone is lying - I think there's actually been a TV show based on this concept with Tim Roth as the lead?
I think I've mentioned this here a few times; the first mainstream use in CG was on Gollum in LOTR 2, and it's become the base industry standard in the past 12 years. Most studios usually expand the range of individual facial Action Units though, in order to give more control to the animators. While FACS only defines about 40-50 AUs, it's not uncommon to have more than a hundred different facial animation controls. There are also some "procedural" elements to facial movement that have to be accounted for - the best example is how the lips tend to "stick" together before they're opened, especially with a dry mouth.

A full FACS scan session will create about 40-60 elemental expression scans, and maybe some extras for more complex emotions like anger, happiness or fear. Scans are a great help here because the actual look depends on many factors - lots of them based on the bone structure, the amount of fat and so on - making it near impossible to create the right behaviour for a face without exact references.
You need to match individual animation controls to these AUs on your face rig, and dozens of them have to be dialed in to create the above-mentioned complex expressions. The most complex area is of course the mouth, but the eyelids also have a lot of important motion that's usually not as evident to most people.
Complex expression scans are also important because the way computers combine the various AUs is based on simple mathematics, whereas in real life the skin tissue has a much more complex behavior. The same facial muscles may move the face very differently based on the context - a good example is raising the upper lip. It produces very different motion if the muscles fire from a neutral state, or while smiling, or while the jaw is opened as far as possible. The facial rig needs to have automatic compensations to produce the right results for the animator's intent.
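
To make those "automatic compensations" a bit more concrete, here's a generic sketch of a corrective-blendshape evaluation (not our rig, and the structure is simplified): on top of the per-AU deltas, the rig stores extra correction shapes for problematic AU combinations and blends them in with the product of the two weights, so that e.g. upper-lip raise plus open jaw doesn't just sum linearly.

```python
import numpy as np

def evaluate_face(neutral, au_deltas, corrective_deltas, weights):
    """Evaluate a FACS-style blendshape rig with pairwise correctives.

    neutral           : (V, 3) base mesh vertices
    au_deltas         : {au_name: (V, 3) delta from neutral at full activation}
    corrective_deltas : {(au_a, au_b): (V, 3) extra delta fixing the naive sum}
    weights           : {au_name: activation in [0, 1]} set by the animator or solver
    """
    mesh = neutral.copy()

    # Naive linear part: each Action Unit adds its sculpted delta.
    for au, delta in au_deltas.items():
        mesh += weights.get(au, 0.0) * delta

    # Correctives: only kick in when both AUs are active together, which is exactly
    # where simple addition stops matching real skin (lip raise + open jaw, etc.).
    for (a, b), delta in corrective_deltas.items():
        mesh += weights.get(a, 0.0) * weights.get(b, 0.0) * delta

    return mesh
```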

Most implementations of performance capture prefer a method where the solver software tries to interpret the facial expression and break it down into its basic components (the AUs). Previous methods tried to track the surface itself and then use a direct connection to drive the vertices - but the most common optical motion capture systems can't track more than a few dozen markers on the face, so the complexity of skin deformation is completely lost. This is very obvious in Beyond and the Sorcerer demo, where the facial folds and wrinkles aren't behaving properly (or at all) and the lips don't deform well enough either; however, the method can still work reasonably well on a smooth face (hence the young girl protagonist in both Beyond and the android tech demo).
Unfortunately there's still no proper facial solver software available on the market, so most studios tend to develop their own; and no matter how good the solver is, a lot of manual animation work is still required.
There is another approach, used by Mova, where the face is covered in rough UV make-up that allows the optical system to track tens of thousands of points instead of a few dozen optical markers. The software can generate a mesh with a high polygon density, but it also requires a special lighting setup (UV lights), so the subject has to stand still. This was the method used on Benjamin Button, where it took Brad Pitt's acting talents to match the timing of his various stunt doubles in his facial performance - and of course a lot of animation work to tweak the results. The Mova facial expression scans were also the basis for the facial expressions of the three differently aged CG versions, which were then "aged" by the artists.

Facial performance capture is also very similar to the surface scans in that it's more of a guideline than an actual solution. The Digital Domain guys talked about how things like moving the eyelids by half a millimeter were sometimes the key to selling a performance - we are as adept at unconsciously interpreting facial movement as we are at reading a static face. Capture methods are still not accurate enough to give a 100% solution here either.

Deformations are a pretty complex aspect as well. Human skin tissue is interesting in that it's highly flexible, but it actually doesn't like to stretch - it prefers to slide over the harder bone structure, and when it should compress it tends to form wrinkles and folds instead. Also, most facial movements affect much larger areas of the face than you'd expect - for example, squinting your eyes moves the skin as far down as the jawline. Or when you stretch your mouth as wide as you can, the thin layer of muscles and tendons down your neck flexes, and this usually moves even your nipples(!).
So the real-life surface "geometry" is very complex; most realtime assets prefer to depict this by blending between normal maps of a neutral state and one or more expression states. Skin color can also change if the expression changes the blood flow - certain areas may become more or less reddish because of this.
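
The normal-map blending mentioned above is typically just a masked, weighted blend in tangent space, driven by the same rig or expression weights. A small sketch with a hypothetical region mask (production shaders usually split the face into several wrinkle regions and do this per region):

```python
import numpy as np

def blend_wrinkle_normals(neutral_n, expression_n, region_mask, expression_weight):
    """Blend a neutral and an expression (e.g. 'brows raised') tangent-space normal map.

    neutral_n, expression_n : (H, W, 3) normal maps decoded to the [-1, 1] range
    region_mask             : (H, W) mask painting where this expression's wrinkles live
    expression_weight       : the rig's current activation for that expression, in [0, 1]
    """
    w = (region_mask * expression_weight)[..., None]      # (H, W, 1) per-pixel blend factor
    blended = (1.0 - w) * neutral_n + w * expression_n    # simple lerp of the normal vectors
    # Renormalize, since linearly interpolated normals are no longer unit length.
    return blended / np.linalg.norm(blended, axis=-1, keepdims=True)
```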

On top of all this, the skin isn't only driven by the facial muscles - other things like fat deposits, the rest of the body, general movement, gravity and so on also matter, and this is something near impossible to control without fully simulating bones, muscles and skin together. The neck is a particularly complex area.

We're only starting with the facial animation at this time, but initial tests using scanned expressions are quite promising. However I still expect that this will be the other area where the 80-20 rule will apply, so - being the facial rigging guy here - I look forward to very little sleep in the coming months ;) Once we're done, I'll try to get back to this topic to talk a bit about the experiences some more. I'll probably also be able to share what we're working on, so that you all can criticize the results as much as you like :D

But I imagine this lengthy post has already given you a sense of just how incredibly complex and hard it is to get this "accurate human rendering" right - and games have a lot of extra limitations because of the lack of rendering time, so they need to use many approximations and trade-offs instead of even trying to do things "right". We'll definitely see some truly remarkable human characters this generation, but I believe that completely convincing results will take a huge effort in both hardware capabilities and the rendering and asset creation aspects as well.

Edit: some extra information and explanations based on re-reading
 
Nice post. You must be procrastinating from doing something else to have that much time to write. :smile:
 
Great post. I would only add (from a layman's perspective) that everyone seems to keep underestimating how important the eyes are. Some cultures almost look at those exclusively for determining emotion. I think it should be a requirement to get the animation of the eyes - including dilation of the pupils, rotation, transparency, blinking and the skin movement for wide-open amazement, laugh wrinkles etc. - as good as you can get them. After that, the mouth, and the essential thing there is to get the lips and the inside of the mouth right, especially the teeth and the lighting, but if you can make the tongue move realistically that's also a win. It is not easy but certainly not impossible to do - I've studied phonetics and the position of the tongue is known for almost any vowel.

If you zoom out further, the animation of the body becomes far more important - different poses (animation of the shoulders and neck, relaxed stance vs. tensed stance), walks and proportions are how people can identify someone they know even if they can't see their face.

My wife always spots the eyes first in games (she calls it dead-eyes syndrome).
 
Nice post. You must be procrastinating from doing something else to have that much time to write. :smile:

Actually, it just takes me a few hours to calm down enough for sleep, no matter how late I leave ;)
 
everyone seems to keep underestimating how important the eyes are

I don't think that anybody is underestimating their importance, they're just very difficult. Even just the shading of the eyes needs to approximate complex interactions through multiple subsurface features in order for it to look accurate, and that's not even considering the animation. Getting good visibility (shadowing) is also tricky for games when it comes to small-scale features like eyes, especially when the surrounding geometry can change due to animation.
 
The video you posted of the slow-mo iris was pretty surprising. I also realized how much the eyelid's skin folds inward when it opens (a lot of skin texture basically disappears into its wrinkle).
Do you guys, or anybody, simulate that effectively?
Also, you mentioned that to get fully correct secondary motion of the skin from all factors such as tension, inertia, gravity, tendons and such procedurally, studios would eventually have to start doing a full physics sim of flesh, bone and other anatomical features, which seems pretty reasonable. Do you know of any research actually going that route with facial animation and getting good results?
 
Well, the full sim stuff is not really used by any VFX studios on faces, as far as I know.
Simulations are slow, hard to art direct, and the director-approved facial performance may get altered too much by them, so the original intent may get lost. It also adds a certain overhead that requires shots to be completed earlier, in order to leave the creature teams enough time to generate and polish the results.

Sims are, however, the preferred starting point for body deformations, as there are a lot more joints and muscles, and a higher need for dynamic movement. But the results are usually still manually tweaked (sculpted) in each shot to get the shapes and silhouettes looking right. Weta has the body sim stuff at a point where they use it on every character in every shot; ILM also has some different approaches that were used on the Hulk or the kaiju in Pacific Rim.

On faces, the blendshape-based rigs that are today's standard already offer this level of control, as the facial expressions are already sculpted - so it's kind of like skipping the time-consuming middle step. But as I've mentioned, the neck is a very complex area, so that usually does start with a sim.

Weta however did a surface-level skin sim on the apes in Rise, and most of the studios can still use various fake approaches to add some secondary dynamics to facial movements. On our previous project we used it on a jowl in some key shots of the character walking down some stairs and such.

But I fully expect Jim Cameron to strongly focus on this more physical skin movement in the Avatar sequels. He's very technical, as we all know, and may push the VFX crew to develop some practical solutions for the effects of gravity and such, especially if there really are going to be scenes taking place underwater. Maybe he's crazy enough to also push for underwater performance capture - after all, he did go very far with that kind of realism on The Abyss...

As for the eye, we don't do the iris dynamics stuff - not sure if it'd be visible and worth the extra effort, and I haven't really heard about anyone trying it either.
We do model and texture the hidden surface of the eyelids though, and I think this is also why many studios prefer to create their characters with the eyes at least halfway closed. If you don't account for it, you'll get some terrible texture stretching whenever you need to show a character with eyes closed or even just looking down.
 
This has been making the rounds in the past two days, thought it's relevant:

https://www.youtube.com/watch?v=8qeOFibRmoo

Basically a demonstration of a high-end game face rig.
It's most likely built around the techniques used in Ryse or Last of Us; a mostly bones based rig enhanced by corrective blendshapes, based on FACS, using blended normal maps for the wrinkles.
It could also be using facial scans for the model and reference expressions, but the wrinkle patterns are a bit generic so it could also be completely hand crafted.
Not sure if the guys are going to sell it as a product or a service, I think they're also working on a face camera based capture system as well.
 
Wow Yosh, that's some great info. I deeply appreciate the time you put into informing us. :) Thanks.
 
This has been making the rounds in the past two days, thought it's relevant:

https://www.youtube.com/watch?v=8qeOFibRmoo

Basically a demonstration of a high-end game face rig.
It's most likely built around the techniques used in Ryse or Last of Us; a mostly bones based rig enhanced by corrective blendshapes, based on FACS, using blended normal maps for the wrinkles.
It could also be using facial scans for the model and reference expressions, but the wrinkle patterns are a bit generic so it could also be completely hand crafted.
Not sure if the guys are going to sell it as a product or a service, I think they're also working on a face camera based capture system as well.

Very cool. The expressions are very lifelike.
 