I expect that the memory requirements of a multiple-small-batch scheduling architecture are somewhat higher than those of a conventional one - apart from anything else, instead of having only a few batches in flight (hence program counters, constants and texels in cache) you've got dozens if not hundreds of batches in flight.
But the batches are issued in order and they will tend to execute in order, round-robin. Dynamic flow control will break up that order. Then it comes down to whether there are any texture fetches in the "else" clause, or whatever it is that is rarely executed as a result of the flow control. If the else clause only executes a few times across hundreds of batches, then each successive texture fetch will probably find that the texture data pre-fetched by the previous instance has been flushed.
So that kind of texture fetch will definitely consume a disproportionately large amount of texture bandwidth.
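To put a rough shape on that, here's a toy simulation. It's only a sketch under my own assumptions: an LRU cache counted in whole lines, batches running strictly round-robin, four neighbouring batches sharing each "common" cache line, and a rare clause that fetches the same single line once every 64 batches. None of these numbers come from real hardware, and the names (simulate, rare_every and so on) are made up for illustration.

```python
from collections import OrderedDict

def simulate(num_batches=1024, cache_lines=8, batches_per_line=4, rare_every=64):
    """Toy LRU texture cache; returns the miss rate per fetch kind.

    All parameters are illustrative assumptions, not hardware figures.
    """
    cache = OrderedDict()                        # key -> True, oldest first
    stats = {"common": [0, 0], "rare": [0, 0]}   # [fetches, misses]

    def fetch(kind, line):
        stats[kind][0] += 1
        key = (kind, line)
        if key in cache:
            cache.move_to_end(key)               # hit: mark as most recently used
        else:
            stats[kind][1] += 1                  # miss: read from main memory
            cache[key] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)        # evict the least recently used line

    for batch in range(num_batches):
        # Coherent fetch: neighbouring batches land on the same cache line.
        fetch("common", batch // batches_per_line)
        # Rarely executed clause: always wants the same line, but only
        # once every rare_every batches.
        if batch % rare_every == 0:
            fetch("rare", 0)

    return {kind: misses / fetches for kind, (fetches, misses) in stats.items()}

if __name__ == "__main__":
    print(simulate(cache_lines=8))    # small cache
    print(simulate(cache_lines=64))   # cache big enough to outlive the gap
```

With these made-up numbers the common fetches miss a quarter of the time (once per line), while the rare fetch misses every single time with the 8-line cache. Bump the cache to 64 lines and the rare fetch only misses on its first use, which is the "size the cache for the more random pattern" argument in miniature.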
X800XT has probably got either 16KB or 32KB of texture cache in it, total - 4KB or 8KB per quad TMU. It's a very low base from which to start adding cache, if Xenos (for example) needs extra cache to support its effectively randomised texture fetching. It's worth remembering that the four quads of X800XT are 4-way MIMD (as a group), so taken as a group they issue randomised texture fetches anyway. Not as random as the batches of Xenos, maybe, but still more random than in, say, NV40.
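For a sense of how small that is, here's the arithmetic spelled out. The 16KB/32KB totals are my guesses from above, and the 4 bytes per texel is a further assumption (uncompressed 32-bit texels); compressed formats would fit more texels in the same space.

```python
import math

KB = 1024
QUADS = 4              # the four quad TMUs of X800XT
BYTES_PER_TEXEL = 4    # assumption: uncompressed 32-bit texels

def per_quad_footprint(total_cache_kb):
    """Return (bytes per quad, texels per quad, side of largest square tile)."""
    per_quad_bytes = total_cache_kb * KB // QUADS
    texels = per_quad_bytes // BYTES_PER_TEXEL
    side = math.isqrt(texels)   # largest whole square tile that fits
    return per_quad_bytes, texels, side

if __name__ == "__main__":
    for total_kb in (16, 32):   # the two guessed totals
        print(total_kb, per_quad_footprint(total_kb))
```

So at the 16KB guess each quad TMU holds about a 32x32 tile of uncompressed texels, and at 32KB roughly a 45x45 tile.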
NV40 has a two-level cache architecture because all quads are sharing a texture, but the quads don't perform coherent fragment shading, as they all share the workload for a triangle at a time (or multiple triangles in a batch), with pixel-quads allocated round-robin.
Obviously, being in the dark about batch sizes in these architectures really doesn't help.
In the end, pre-fetching is the main solution to the coherency problem. It's then a matter of sizing the texture cache to support the more randomised access patterns. The vast majority of texture fetches are going to be fairly coherent, and pre-fetched texels will stay in cache long enough that they won't need to be fetched multiple times.
It's only going to be the exceptional cases (texture fetches in rarely executed clauses) that cause multiple main memory reads.
(I should point out that my earlier post talking about batch switching upon a texture-fetch instruction isn't correct - realised this in bed just before I got up, sigh. Not a major thing, but I'll leave it as is because we're past that point now.)
Jawed