It'd also be nice to have full six-axis motion. Adding another channel for depth motion is a bit costly but straightforward. As for rotational motion, you could probably get away with a 10:10:10 packed format.
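A 10:10:10 layout just means quantizing three signed-normalized components to 10 bits each and packing them into one 32-bit word (on the GPU side you'd normally get this for free via a packed render target format like R10G10B10A2_UNORM). A minimal CPU-side sketch, with made-up function names, just to show the quantization:

```python
def pack_10_10_10(x, y, z):
    """Pack three [-1, 1] components into the low 30 bits of a uint32."""
    def q(v):
        v = max(-1.0, min(1.0, v))
        # map [-1, 1] -> [0, 1023] (signed-normalized, 10 bits)
        return int(round((v * 0.5 + 0.5) * 1023)) & 0x3FF
    return q(x) | (q(y) << 10) | (q(z) << 20)

def unpack_10_10_10(p):
    """Inverse of pack_10_10_10 (lossy to ~1/1023 per component)."""
    def u(bits):
        return (bits / 1023.0) * 2.0 - 1.0
    return (u(p & 0x3FF), u((p >> 10) & 0x3FF), u((p >> 20) & 0x3FF))
```

The leftover 2 bits (the "A2" in a real R10G10B10A2 target) could hold a coarse flag, e.g. "rotation data valid".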
What for? All you really need is translation and a significance value, to deal with areas which are likely *not* reconstructable by patching. Coincidentally, that's also a decent heuristic for where you should ramp up the render resolution (not limited to shading rate, but also full geometry detail). Depth variance, multiplied by the norm of the motion vector, multiplied by the age of the data, should be a good estimate of where the data is gonna be garbage. Take special care if you know you have areas with distinct patterns in texture or micro-geometry, or if you know normals are changing rapidly. You should evaluate the likelihood of successful reconstruction on both the source as well as the target buffer.
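That estimate is just a product of three per-tile quantities. A hedged sketch, assuming `depth_variance`, `motion_vec` and `age_frames` are per-tile inputs you compute yourself (the name and the unweighted product are placeholders; a real implementation would tune weights per term):

```python
import math

def garbage_estimate(depth_variance, motion_vec, age_frames):
    """Heuristic from the text: depth variance x |motion| x age.
    Higher value -> cached data is more likely garbage, so render
    that tile fresh (and at higher resolution) instead of patching."""
    speed = math.hypot(motion_vec[0], motion_vec[1])  # norm of 2D motion
    return depth_variance * speed * age_frames
```

Note how any one factor being zero (flat depth, no motion, or freshly rendered) drives the whole estimate to zero, which matches the intuition: such tiles patch cleanly.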
You might even get away entirely without emitting refined motion vectors explicitly (even though it helps!) by just using dedicated hardware for a straightforward optical flow analysis.
Optical flow analysis is part of the fully decoupled video encoder engines, so as long as you get access to the output of that (https://developer.nvidia.com/blog/an-introduction-to-the-nvidia-optical-flow-sdk/), and manage to feed it specifically non-ambiguous buffers (you don't want to get caught up on repeated patterns in textured content, and absolutely no illumination!), you should get an accurate forward motion vector field down to a 4-pixel granularity. And you've got enough throughput to do this at 4k@120fps, forward-only, with significance values for the motion estimate itself on the house.
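Those "on the house" significance values come as a per-vector matching cost alongside the flow field. A sketch of how you might consume them, assuming `costs` is one cost value per 4x4 output block and that higher cost means the estimator was less confident (the function name and threshold are mine, not part of any SDK):

```python
def unreliable_blocks(costs, cost_threshold):
    """costs: per-block matching cost from the flow engine.
    Blocks whose cost exceeds the threshold have motion estimates
    the hardware itself wasn't confident about -> treat that motion
    as garbage and schedule a full re-render there instead of patching."""
    return [i for i, c in enumerate(costs) if c > cost_threshold]
```

You'd feed the resulting block list into the same "render this fresh" path as the depth-variance heuristic above.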
You then only need to choose a threshold below which the (adjusted) significance is too low for your quality target, and render those parts at the highest resolution you can afford. For the stuff you can patch, remember to degrade the significance of data you haven't redrawn for a couple of frames (where the motion vector was also non-zero), so you know to refresh it eventually.
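Per frame that amounts to: decay the significance of stale moving tiles, then refresh everything that fell below the quality threshold. A minimal sketch under assumed data layout (each tile is a dict with `significance` and `motion`; the decay factor and reset-to-1.0 convention are placeholders):

```python
def select_tiles_to_refresh(tiles, quality_threshold, decay=0.9):
    """Run once per frame. Mutates tiles in place and returns the
    list of tiles that must be re-rendered at full resolution."""
    refresh = []
    for t in tiles:
        if t["motion"] > 0.0:
            # data that moved but wasn't redrawn degrades every frame,
            # guaranteeing an eventual refresh even of "patchable" areas
            t["significance"] *= decay
        if t["significance"] < quality_threshold:
            refresh.append(t)
            t["significance"] = 1.0  # freshly rendered -> full confidence
    return refresh
```

Static tiles (zero motion) never decay here, so they can be copied forward indefinitely, which is exactly the behavior the video-compression analogy below suggests.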
You don't even *want* TSAA side effects like motion blur if you can just get a stable, high frame rate instead, even if you're lagging behind by 2-3 frames (as you can't extrapolate as easily as you can interpolate). And you don't want multi-frame accumulation of detail either if you can just opt to "render the stuff that matters at full detail NOW", and get away with rendering nothing but highly simplified proxies into minimal buffers for the rest.
Think of the whole up-scaling business less like a denoiser, and take inspiration from video compression instead. That's all about encoding - or in this case re-creating - only the minimal amount of information necessary to introduce the details which are new in a frame, while straight-out copying the rest for as long as you can get away with it.