32 bit is high, two independent 64 bit fetches? Probably not so much. Traditional Gather4-style "clumped" jittered sampling isn't as nice as truly sparse sampling. They know that. The architectural cost of optimising for multiple, distinctly sampled 32-bit fetches per clock is high.
You say "Gather4 is just ATI's optimisation for when the data aligns within 128-bit buckets." That is not how I would implement a texture cache... If you store texels quad-ordered in cache, then you are going to be able to get 4 bilinear samples in a single go 20% of the time, which is about as often as your needed texels will be completely non-contiguous. I don't like them odds.
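To make the alignment argument concrete, here is a rough sketch of the simplest version of the problem: how many aligned 2x2 tiles a single bilinear footprint lands in. It assumes 32-bit texels stored as 2x2 quads (so one quad is one 128-bit bucket) and footprint origins spread evenly over texel positions; the function names are made up for illustration. It only counts tiles for one footprint, so it is not a derivation of the 20% figure for four bilinear samples.

```python
from collections import Counter


def blocks_touched(x, y, block=2):
    """Count the distinct aligned block x block tiles covered by the
    2x2 bilinear footprint whose top-left texel is (x, y)."""
    tiles = {
        ((x + dx) // block, (y + dy) // block)
        for dx in (0, 1)
        for dy in (0, 1)
    }
    return len(tiles)


def footprint_histogram(block=2):
    """Histogram of tiles touched, over every footprint origin within
    one tile period (assumes origins are equally likely)."""
    return Counter(
        blocks_touched(x, y, block)
        for x in range(block)
        for y in range(block)
    )


if __name__ == "__main__":
    hist = footprint_histogram()
    total = sum(hist.values())
    for tiles, count in sorted(hist.items()):
        print(f"{tiles} tile(s): {count}/{total} of footprint origins")
```

Under those assumptions only one origin in four sits entirely inside a single quad, and one in four is split across four different quads, which is the kind of odds being complained about here.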