Q: What is the difference between the 64 bit and 128 bit G-buffer layouts?
The 64 bit layout is the minimal layout. It provides the bare minimum needed by most real time applications: UV and tangent frame. The UV allows you to fetch all the material textures (and to calculate screen space gradients for the virtual texturing page ID pass and for anisotropic filtering). The tangent frame allows you to perform (anisotropic) lighting and parallax mapping. The UV also implicitly gives you the virtual texture page ID (UV divided by page size = VT page ID). This gives you the mip level (integer part) for free and allows you to fetch per material properties from the virtual texture, such as the texture renormalization mul+add, colorize color, lighting model ID, etc.
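A minimal sketch of that page ID math (assuming a square virtual texture and gradient based mip selection; the function and parameter names are made up for illustration):

uint2 UvToPageId(float2 uv, float2 uvDdx, float2 uvDdy,
                 float virtualSizeTexels, float pageSizeTexels, out uint mipLevel)
{
    // Screen space footprint in virtual texels. In the G-buffer pass the gradients come
    // from ddx/ddy; in a later full screen pass they can be reconstructed from the
    // neighboring G-buffer UVs.
    float footprint = max(length(uvDdx), length(uvDdy)) * virtualSizeTexels;

    // Integer part of log2(footprint) = mip level ("for free").
    mipLevel = (uint)clamp(floor(log2(max(footprint, 1.0))), 0.0, 16.0);

    // UV divided by the page size (at that mip) = VT page id.
    float pagesPerSide = virtualSizeTexels / (pageSizeTexels * exp2((float)mipLevel));
    return (uint2)(uv * pagesPerSide);
}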
The 128 bit layout gives you some extra space to store data that can be linearly interpolated (assuming the MSAA trick is used). The most common addition would be per pixel motion vectors (for motion blur and temporal reprojection). Per triangle constant data (such as an instance ID) can also be stored, as the MSAA trick reconstructs all per triangle constant data perfectly.
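An illustrative packing of the two layouts (the exact render target formats and bit allocation here are assumptions; only the 64/128 bit totals and the listed contents come from the description above):

// 64 bit layout: UV + tangent frame.
struct GBuffer64
{
    float2 uv       : SV_Target0;   // R16G16_UNORM, virtual texture UV
    float4 tangentQ : SV_Target1;   // R10G10B10A2_UNORM, tangent frame quaternion
};

// 128 bit layout: adds interpolatable data (motion) and per triangle constant data (instance id).
struct GBuffer128
{
    float2 uv         : SV_Target0;   // R16G16_UNORM, virtual texture UV
    float4 tangentQ   : SV_Target1;   // R10G10B10A2_UNORM, tangent frame quaternion
    float2 motion     : SV_Target2;   // R16G16_FLOAT, per pixel motion vector
    uint   instanceId : SV_Target3;   // R32_UINT, per triangle constant data
};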
The MSAA trick reduces the color bandwidth on average by 2.7x, so the 128 bit layout roughly matches a traditional G-buffer of 48 bits/pixel (+depth) in bandwidth usage (128 bits / 2.7 ≈ 47 bits per pixel). As practically everybody seems to be using temporal reprojection these days, having a cheap way to store per pixel motion vectors is an additional advantage of the MSAA trick technique.
Q: How can everything be rendered with a single pixel shader?
The pixel shader receives the interpolated UV coordinate and the interpolated tangent frame (quaternion) from the vertex shader (*). The tangent frame is encoded (10-10-10-2) and the virtual texture indirection is performed for the UV. There are not many different ways to perform these two operations, so a single pixel shader is enough for this task.
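A minimal sketch of such a pixel shader (the indirection texture contents, the scale/bias packing and the mip 0 indirection lookup are simplifying assumptions; a real implementation would pick the indirection mip from the UV gradients):

Texture2D<float4> IndirectionTex : register(t0);   // virtual page -> page cache scale/bias
SamplerState PointClamp : register(s0);

struct VsOut
{
    float4 position : SV_Position;
    float2 uv       : TEXCOORD0;   // interpolated virtual texture UV
    float4 tangentQ : TEXCOORD1;   // interpolated tangent frame quaternion
};

struct GBufferOut
{
    float2 uv       : SV_Target0;   // UV after virtual texture indirection
    float4 tangentQ : SV_Target1;   // 10-10-10-2 encoded tangent frame
};

GBufferOut MainPS(VsOut input)
{
    GBufferOut output;

    // Virtual texture indirection: remap the virtual UV to a page cache UV.
    float4 scaleBias = IndirectionTex.SampleLevel(PointClamp, input.uv, 0);
    output.uv = input.uv * scaleBias.xy + scaleBias.zw;

    // Renormalize the interpolated quaternion and encode it into the 10-10-10-2 unorm range.
    float4 q = normalize(input.tangentQ);
    output.tangentQ = q * 0.5 + 0.5;

    return output;
}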
Material/lighting permutations can be handled with exactly the same methods used by other deferred rendering pipelines. With tiled deferred rendering, this means either tile classification or an uber-shader. Clustered deferred shading (pixel binning) can also be used if a high number of material/lighting permutations is needed.
(*) To ensure that the VS->PS quaternion interpolation works properly, use the mesh preprocessing trick described by Peter Sikachev in GPU-Pro 5:
http://gpupro.blogspot.fi/2014/01/our-gpu-pro-5-chapter-describes.html
Q: Can deferred texturing support complex materials (mask textures to blend multiple material textures together)?
The virtual texture page cache provides storage for texture space transformations. Operations such as the one described above can be executed in a separate (artist authored/configured) shader that writes to the virtual texture page cache. The advantage of this method is that the complex blending operation only needs to be executed when a page becomes visible. On average, pages stay visible for more than 2 seconds. At 60 fps, this means that the cost of a material blend operation is amortized on average over 100 frames. Decal rendering works similarly and has similar gains.
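A hedged sketch of such a texture space blend pass, run once when a page becomes resident rather than every frame (the resource names, the two layer blend and the compute shader formulation are all illustrative; in practice the shader body would come from the artist authored material setup):

Texture2D<float4> MaterialLayerA : register(t0);
Texture2D<float4> MaterialLayerB : register(t1);
Texture2D<float>  BlendMask      : register(t2);
RWTexture2D<float4> PageCacheAlbedo : register(u0);
SamplerState LinearClamp : register(s0);

cbuffer PageConstants : register(b0)
{
    uint2  pageCacheOffset;   // destination texel offset of this page in the cache
    float2 pageUvScale;       // maps page local texels to the source material UVs
    float2 pageUvBias;
};

[numthreads(8, 8, 1)]
void BlendPageCS(uint3 tid : SV_DispatchThreadID)
{
    float2 uv = ((float2)tid.xy + 0.5) * pageUvScale + pageUvBias;

    float4 a = MaterialLayerA.SampleLevel(LinearClamp, uv, 0);
    float4 b = MaterialLayerB.SampleLevel(LinearClamp, uv, 0);
    float  m = BlendMask.SampleLevel(LinearClamp, uv, 0);

    // The expensive blend runs once per resident page and is amortized over ~100 frames.
    PageCacheAlbedo[pageCacheOffset + tid.xy] = lerp(a, b, m);
}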
The main limitation of this technique is that it requires a unique UV mapping for the object. We have unique UV mapping for most (90%+) objects and for terrain, but a unique UV can be problematic for large structures with lots of repeated (tiled) texture surfaces.
Q: Does deferred texturing support light maps?
Light mapping needs two sets of UVs. Deferred texturing could be extended to support multiple UVs, but that would make it less efficient. I would not recommend using deferred texturing with light maps.
Modern light mapping techniques such as SG (spherical Gaussian) light maps look awesome. This technique gave The Order: 1886 a very nice look. The Ready at Dawn guys gave a wonderful presentation about it at SIGGRAPH:
https://readyatdawn.sharefile.com/share#/download/s9979ff4b57c4543b
However, light mapping is a static technique: geometry and lights (*) cannot move. This is a perfect trade-off for many games, since most environments are inherently static. Static lighting (and the big offline generated data set per level) is not suitable for games that are built around dynamic content (lots of physics and/or structural destruction) and/or user generated content (levels loaded from a server).
Many next-gen AAA games have achieved convincing results without using light maps. Dynamic direct lights (with high quality soft shadow maps) + light probes + high quality screen space techniques (to mask out probe reflections -> much better object grounding) have proven to be good enough.
Tomasz Stachowiak's presentation at SIGGRAPH 2015 was a perfect example of the graphics fidelity achievable by combining high quality screen space techniques with light probes:
Presentation:
http://advances.realtimerendering.com/s2015/Stochastic Screen-Space Reflections.pptx
Video:
http://advances.realtimerendering.com/s2015/Stochastic Screen-Space Reflections.mp4
GPU-driven rendering can be used to speed up light probe rendering. You can render thousands of probes during level load time and do partial refreshes if the level geometry changes during gameplay. This makes it possible to combine good quality lighting with fully dynamic (& destructible) environments.
(*) Lights that are not baked to light maps can of course move. You don't get any GI from these lights (unless you combine light maps with probes). You could also bake the light transport, but that takes lots of extra space (meaning that lower light map resolution is needed). Dynamic geometry is not supported by light transport methods.
Q: Is per pixel discard (clip) possible with the MSAA trick?
Use ddx/ddy to calculate the four "virtual pixel center" UVs. Take four samples of the BC4 alpha mask texture (stored in the virtual texture). Output the SV_Coverage mask based on the results. This sounds expensive, but it is actually faster than traditional discard at full resolution, as you have a much smaller number of pixel shader invocations.
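A minimal sketch of that shader (assuming 4x MSAA standing in for a 2x2 block of output pixels; the sample index to pixel mapping and the 0.5 alpha threshold are assumptions):

Texture2D<float> AlphaMask : register(t0);   // BC4 alpha mask stored in the virtual texture
SamplerState LinearClamp : register(s0);

void AlphaTestPS(float4 pos : SV_Position,
                 float2 uv  : TEXCOORD0,
                 out float2 gbufferUv : SV_Target0,
                 out uint coverage    : SV_Coverage)
{
    // ddx/ddy span one coarse (2x2) pixel; quarter offsets land on the four
    // "virtual pixel centers" covered by this MSAA pixel.
    float2 du = ddx(uv) * 0.25;
    float2 dv = ddy(uv) * 0.25;

    coverage  = (AlphaMask.Sample(LinearClamp, uv - du - dv) >= 0.5) ? 0x1 : 0;
    coverage |= (AlphaMask.Sample(LinearClamp, uv + du - dv) >= 0.5) ? 0x2 : 0;
    coverage |= (AlphaMask.Sample(LinearClamp, uv - du + dv) >= 0.5) ? 0x4 : 0;
    coverage |= (AlphaMask.Sample(LinearClamp, uv + du + dv) >= 0.5) ? 0x8 : 0;

    gbufferUv = uv;   // the rest of the G-buffer output is omitted for brevity
}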