It does use mip-maps, but it just point-samples them instead of using linear/tri-linear filtering.
It's sampling the latent feature mip-maps at two different levels, 4 points each.
That does permit the network to effectively do tri-linear / bicubic filtering, but it's not using the mip-maps the way they're intended, which is to provide a cutoff in frequency space as close to the required frequency as possible. Instead it's only using them as a reference point to ensure that features 2-3 octaves lower don't get completely missed. It's also only feeding that in as network inputs, so the chance that the trained network happens to actually do proper filtering is low.
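For reference, the "intended" use of a mip level is as a low-pass cutoff matched to the screen-space sampling rate. A minimal sketch of the standard level-of-detail selection (all names here are illustrative, not from the network under discussion):

```python
import math

def mip_level(du_dx, dv_dx, du_dy, dv_dy, tex_width, tex_height):
    # Texel-space footprint of one pixel step along screen x and y.
    fx = math.hypot(du_dx * tex_width, dv_dx * tex_height)
    fy = math.hypot(du_dy * tex_width, dv_dy * tex_height)
    rho = max(fx, fy)  # widest footprint axis
    # log2 of the footprint picks the mip whose cutoff sits as close
    # to the required frequency as possible.
    return max(0.0, math.log2(rho))
```

With a 1:1 pixel-to-texel mapping this yields level 0; a 2-texel footprint yields level 1, i.e. one octave lower.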
Then the answer is yes, but you have to decode multiple texel values before you interpolate, because you can't interpolate the parameters before they go into the neural network. That will increase the cost significantly and probably make it unusable.
All good if you can properly batch that. The cost is not in the arithmetic (not the tensor evaluation, let alone the activation function), nor in the register pressure from the hidden layers. It's primarily in streaming the tensors between layers.
So if you can reuse the tensor before it's gone from the cache, the overhead is acceptable.
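The batching idea amounts to evaluating all texels needed for one filtered sample as a single matrix multiply per layer, so each weight tensor is loaded once and reused across the whole batch while it's still in cache. A tiny two-layer MLP with hypothetical dimensions stands in for the real decoder:

```python
import numpy as np

def decode_batch(latents, w1, b1, w2, b2):
    # latents: (N, C), e.g. N = 8 texels (2 mip levels x 4 taps).
    # Each layer's weights are streamed in once and amortized over
    # all N rows of the batch.
    h = np.maximum(latents @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2
```

Decoding the 8 texels one at a time would instead re-stream `w1` and `w2` per texel, which is where the overhead comes from.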
It would not help, though. If you look closely, the noise dominates badly in the upper 1-2 octaves; only at the point where the signal is close to being supported by the "latent feature" mip-map does it turn stable enough that a reasonably low sample count could produce an improvement.
The high-frequency features appear to be incorrect in both frequency and phase; only the amplitude is somewhat correct. Only the features up to one octave above the support points of the mip-map appear to have been reconstructed reasonably well.