Without seeing the results you're talking about, I can only offer some general insight. The 3D depth image resolution isn't a problem, because the face mesh is effectively vacuum-formed over the point cloud. Per-sample depth resolution is coarse, but accumulating over multiple samples can be very accurate; we need only look at the incredible results achieved with the comparatively crude hardware of the Kinect 1 for real-time scanning. The disparity between the depth and video images is also irrelevant: you crop and scale the images, mapping between them using face-recognition tech.
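To illustrate the accumulation point, here's a minimal sketch in NumPy. The scene, noise level, and frame count are all assumptions for the demo, not values from any real sensor; the idea is just that averaging N noisy depth frames shrinks the noise by roughly 1/sqrt(N):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ground-truth depth surface (metres) -- an assumed test scene.
true_depth = 1.0 + 0.1 * rng.random((120, 160))
NOISE_SIGMA = 0.01   # ~1 cm of per-sample depth noise (assumed, not a Kinect spec)
N_FRAMES = 64

# Running sum over N frames, then divide: a simple temporal average.
acc = np.zeros_like(true_depth)
for _ in range(N_FRAMES):
    frame = true_depth + rng.normal(0.0, NOISE_SIGMA, true_depth.shape)
    acc += frame
fused = acc / N_FRAMES

# Residual noise should drop from ~NOISE_SIGMA to ~NOISE_SIGMA / sqrt(N_FRAMES).
fused_noise = float(np.std(fused - true_depth))
```

Real pipelines (KinectFusion and friends) do something far more sophisticated, fusing samples into a volumetric representation with camera tracking, but the statistical benefit is the same.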
Creating a 3D depth map from a stereo pair is a lot harder and more prone to errors. It will no doubt work the same way, building a volume and shrink-wrapping the head onto it. So I'd be inclined to believe it's the libraries giving poor results on the XB1, if they really are worse, or that it's not a best-case use of the tech. Does the user have to move towards and away from the camera, or move it around their head?
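For a sense of why stereo is the harder, more error-prone route, here's a toy brute-force block-matching sketch in plain NumPy (a synthetic flat scene with a known shift, not any real library's algorithm). Even in this best case, the result is only as good as the texture available for matching; on smooth skin, repetitive patterns, or occlusions, the cost minimum becomes ambiguous and the depth map degrades:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stereo pair: the right view is the left view shifted by a known
# disparity of 4 pixels (a flat, fronto-parallel, well-textured "scene").
TRUE_DISP = 4
H, W = 32, 64
base = rng.random((H, W + TRUE_DISP)).astype(np.float32)
left = base[:, :W]            # left[y, x] == right[y, x - TRUE_DISP]
right = base[:, TRUE_DISP:]

def block_match(left, right, max_disp=8, half=2):
    """Brute-force SAD block matching; returns an integer disparity map."""
    H, W = left.shape
    disp = np.zeros((H, W), dtype=np.int32)
    for y in range(half, H - half):
        for x in range(max_disp + half, W - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            # Sum-of-absolute-differences cost for each candidate disparity.
            costs = [
                np.abs(patch - right[y - half:y + half + 1,
                                     x - d - half:x - d + half + 1]).sum()
                for d in range(max_disp + 1)
            ]
            disp[y, x] = int(np.argmin(costs))
    return disp

disp = block_match(left, right)
# Interior region where the full search window fits.
valid = disp[2:-2, 10:-2]
```

With strong random texture, nearly every pixel recovers the true disparity; real faces offer far less texture, which is exactly where stereo starts producing the errors that structured-light or time-of-flight depth sensors avoid.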