Facial capture is pretty much standard - use an HD facecam and, when necessary, add some markers too. The fun part is the solver that converts the captured data into animation, and there are many possible ways to do that.
What KZ seems to be doing, and what Quantic Dream is doing, is using the same talent for the performance and for the likeness. This way, all they have to do is replicate the performance 100%, right down to the skin deformation.
This approach is a relatively good fit for face rigs that use only bones - basically, replicate the markers used on the actor as bones in the face rig and drive them with the 3D translation data. I'd consider this a "dumb" rig, in that the system has no idea what the actor is actually doing - is it a smile, a blink? - and thus it does not require a solver.
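Just to illustrate what I mean by "dumb": here's a rough sketch of direct marker-to-bone driving. All the names, numbers and the data layout are made up for illustration, not any studio's actual pipeline.

```python
# "Dumb" rig: each face bone is named after a capture marker and simply copies
# that marker's 3D translation, offset by a one-time neutral-pose measurement.

NEUTRAL = {"brow_L": (0.0, 9.8, 4.1), "jaw": (0.0, 2.1, 5.0)}   # marker positions at rest
BIND    = {"brow_L": (0.0, 9.8, 4.1), "jaw": (0.0, 2.1, 5.0)}   # bone positions on the rig

def drive_bones(marker_frame):
    """marker_frame: {marker_name: (x, y, z)} for one captured frame."""
    bone_frame = {}
    for name, pos in marker_frame.items():
        nx, ny, nz = NEUTRAL[name]
        bx, by, bz = BIND[name]
        # bone = its bind pose + however far the marker moved away from neutral
        bone_frame[name] = (bx + pos[0] - nx, by + pos[1] - ny, bz + pos[2] - nz)
    return bone_frame

print(drive_bones({"brow_L": (0.0, 10.3, 4.1), "jaw": (0.0, 1.2, 5.4)}))
```

No analysis happens anywhere - the rig just mirrors the markers, which is why it only works when the actor and the character share the same face.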
The problem is that realistic deformations would require a practically unlimited number of bones/markers to reproduce things like skin (up to 0.7-1 cm of soft tissue) sliding over the facial bones, thin skin wrinkling and folding, and also volume preservation (all human tissues are basically water, and thus not compressible). Also, you cannot capture the inside of the mouth or the tongue, both very visible and necessary for a convincing result.
It is possible to add a secondary set of bones to the rig that are not directly driven by the capture but can be either "programmed" or manually animated to compensate for the lack of subtlety in a bones-based rig. It is, however, a limited solution.
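By "programmed" I mean something like the toy example below: a corrective bone that isn't captured at all, but follows a capture-driven bone by some hand-tuned fraction. The bone names and factors are purely hypothetical.

```python
# Hypothetical "programmed" secondary bone: a nasolabial-fold corrective that
# slides as a fraction of how far the capture-driven mouth-corner bone moved.

def nasolabial_corrective(mouth_corner_offset, factor=0.35, limit=0.6):
    """mouth_corner_offset: (dx, dy, dz) of the mouth-corner bone from its bind pose."""
    clamp = lambda v: max(-limit, min(limit, v))
    dx, dy, dz = mouth_corner_offset
    # follow the mouth corner partially, clamped so it never overshoots
    return (clamp(dx * factor), clamp(dy * factor), clamp(dz * factor))

print(nasolabial_corrective((1.2, 0.4, -0.2)))
```

It helps, but it's still just geometry reacting to geometry - the rig still has no idea what expression it's making.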
If you don't use the same talent for the capture and the likeness, things get even worse, as you have to figure out the proper offsets to apply to the marker data. This produces increasingly freaky-looking results, which is why almost no one uses the "dumb" approach in such cases.
More complex face rigs require a solver: a math-based software tool that analyzes the captured data - either the video footage or the extracted marker movements - and attempts to figure out what the performance means, breaking it down into elemental facial movements. Almost every solver is based on psychological research going back to the '60s, the Facial Action Coding System (FACS). This is basically a set of about 40 elemental expressions called Action Units, which can be combined at various intensities to create practically every possible facial expression. So the solver's task is to break down the 2D or 3D data into values for these AUs.
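Real solvers are a lot more sophisticated, but at its core the marker-based case boils down to a fitting problem: given how each AU moves the markers (measured in a calibration session), find the AU weights that best explain the current frame. A minimal sketch with made-up numbers, two AUs and two markers:

```python
import numpy as np

# Columns = AUs, rows = stacked marker displacements (x, y, z per marker),
# measured once during calibration. Tiny invented example:
AU_BASIS = np.array([
    [0.0, 0.0],   # brow marker x
    [1.0, 0.0],   # brow marker y   (AU1, inner brow raiser, lifts the brow)
    [0.0, 0.0],   # brow marker z
    [0.0, 0.8],   # mouth-corner x  (AU12, lip corner puller, pulls sideways...)
    [0.0, 0.3],   # mouth-corner y  (...and slightly up)
    [0.0, 0.0],   # mouth-corner z
])

# one captured frame minus the neutral pose
frame_delta = np.array([0.02, 0.55, 0.0, 0.48, 0.2, 0.01])

# least-squares fit of AU weights, clamped to [0, 1] since AU intensities
# can't be negative (real solvers use properly constrained optimization)
weights, *_ = np.linalg.lstsq(AU_BASIS, frame_delta, rcond=None)
weights = np.clip(weights, 0.0, 1.0)
print(dict(zip(["AU1_inner_brow_raiser", "AU12_lip_corner_puller"], weights)))
```

The output is no longer "marker X moved 5 mm up" but "this is 55% of a brow raise and 60% of a smile" - which is exactly the metadata the next step needs.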
Then the face rig replicates these expressions using just bones, blendshapes, or a combination of both, and the solved values are used to drive them. So instead of a direct transformation of marker data, the animation is driven by this metadata, and the facial deformations can thus be independent of the performance.
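On the rig side, the blendshape case is the simplest to show: one sculpted delta per AU, blended on top of the neutral mesh by the solved weights. Again a toy sketch - three vertices, two shapes, all numbers invented:

```python
import numpy as np

# neutral mesh: one row per vertex (x, y, z)
neutral = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])

# per-AU blendshape deltas, sculpted by an artist on the character's own face
blendshape_deltas = {
    "AU12_lip_corner_puller": np.array([[0.1, 0.05, 0.0],
                                        [0.2, 0.10, 0.0],
                                        [0.0, 0.00, 0.0]]),
    "AU26_jaw_drop":          np.array([[0.0, -0.3, 0.0],
                                        [0.0, -0.1, 0.0],
                                        [0.0,  0.0, 0.0]]),
}

def evaluate(weights):
    """weights: {AU_name: 0..1} coming from the solver (or from hand animation)."""
    mesh = neutral.copy()
    for name, w in weights.items():
        mesh += w * blendshape_deltas[name]   # simple linear blend, per vertex
    return mesh

print(evaluate({"AU12_lip_corner_puller": 0.8, "AU26_jaw_drop": 0.3}))
```

Because the deltas are sculpted on the character's own face, the same AU weights work regardless of who performed them - which is the whole point.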
This also means that no match is required between the actor and the likeness (though it certainly helps), that it's easy to manually tweak the animation, and that it's also easy to create procedural animation (though the results can still look silly...).
The Order is using face capture, by the way - it's explicitly mentioned in the image linked in this thread. They also seem to be using a mixed bones-and-blendshapes rig (with blended normal maps to create facial wrinkles), which is very similar to Naughty Dog's and Crytek's tech.
I'm not sure what GTA V or Infamous does, but I do know that Halo 4 is almost exclusively blendshapes (they have a single bone for the jaw), so it's different (and they're using a set of expressions that differs somewhat from FACS).
I can also go into a little more depth on the pros/cons of bones vs. blendshapes if there's interest, but I think the above is enough for one post.