The point of AF is to approximate a supersampled texture result over the entirety of a pixel's coverage, as you suggest.
So I'm gonna stop you right there.
You think that, with
supersampled texture filtering, you will be able to get as much detail as native rendering by merging 2 offset samples?
Let's revisit the math:
0204060802040608020 <- a highly detailed scene, written as numbers
If you sample it with texture filtering to form your first "native" shot,
you get:
12341234, do you dispute this?
So, let's try sampling the scene as 2 offset frames, each at half resolution:
2424, frame 1, do you dispute this?
2424, frame 2, juddered by 1 original texel, or preferably,
3333, a better frame 2 when juddered by 2, do you dispute this?
Combined naively:
23432343, while your original is
12341234
The resulting image would look very good in terms of reproducing the native result, but that's not to say it's a perfect reconstruction, and the numbers do show that it's blurred.
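The walkthrough above can be sketched in a few lines of Python. The box filter here is one simple choice of texture filter; the two half-rate frame values (2424 and 3333) are taken straight from the numbers in this post rather than re-derived, since the exact filter footprint that produces them is assumed:

```python
# Sketch of the sampling argument above. The half-rate frame values are
# copied from the post; the offsets/filters behind them are assumptions.
signal = [0, 2, 0, 4, 0, 6, 0, 8, 0, 2, 0, 4, 0, 6, 0, 8]

def box_sample(sig, width):
    """Average `width` adjacent texels (a simple box filter), one sample per step."""
    return [sum(sig[i:i + width]) // width
            for i in range(0, len(sig) - width + 1, width)]

native = box_sample(signal, 2)   # full-rate filtered "native" shot
print(native)                    # [1, 2, 3, 4, 1, 2, 3, 4]

frame1 = [2, 4, 2, 4]            # half-rate frame 1 (values from the post)
frame2 = [3, 3, 3, 3]            # half-rate frame 2, juddered by 2 texels

# Naive combine: interleave the two offset frames back to full rate.
combined = [v for pair in zip(frame1, frame2) for v in pair]
print(combined)                  # [2, 3, 4, 3, 2, 3, 4, 3]

# Per-texel error vs. native: the 1s are gone and the 1..4 swing is
# compressed, i.e. the merged result is blurred.
print([abs(a - b) for a, b in zip(native, combined)])
```

Running it reproduces the 12341234 native sequence and the blurred 23432343 merge, which is the whole point: the two half-rate frames never captured the detail that the full-rate samples did.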
You can try point sampling the texture instead, but I think it'll just give a worse result.