The latency has to be hidden by scheduling in either case. What you need to suspend is the actual instruction stream that depends on the texture request (i.e. the shader), by switching to something else, so that has to be handled in the cores.
The software rasterizer already does it that way; a separate unit just makes it easier.
The texture ops become fire-and-forget, with a better static target latency and uniform handling of alignment issues. A texture unit would also, hopefully, do a better job of not discarding its prefetches before performing the filter, and of not thrashing the limited L1 or the valuable L2 tile space the way a long stream of prefetches would.
In fact it would be less of an issue if filtering were being done in the cores, since then you're only hiding the latency of the tap fetches rather than the entire latency of the filtering operation.
I was under the impression that the filtering portion of the operation took significantly less time than the worst-case fetch latency.
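To put a rough shape on "how much latency is there to hide", here is a back-of-the-envelope sketch in C. It assumes a deliberately simple model that is not taken from any hardware documentation: every shader context does a fixed amount of ALU work, then waits a fixed number of cycles for its texture result. The function name and both parameters are hypothetical.

```c
/* Minimum number of independent shader contexts needed so that one is
   always runnable, under an assumed toy model: each context does
   `work_cycles` of ALU work, then waits `latency_cycles` for a texture
   result. Both values are illustrative inputs, not measured figures. */
unsigned contexts_to_hide(unsigned latency_cycles, unsigned work_cycles)
{
    if (work_cycles == 0)
        return 0; /* no work to overlap with: the model does not apply */

    /* ceil(latency / work) other contexts cover the wait,
       plus the one context that is doing the waiting. */
    return (latency_cycles + work_cycles - 1) / work_cycles + 1;
}
```

Under this model the count grows with whatever latency the cores actually see, so hiding only the tap fetches and hiding the whole filtered fetch differ only by how large `latency_cycles` is.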
As with all long-latency memory accesses, you should be prefetching and switching to other work before issuing the load/store, to avoid stalling the thread.
Prefetch with the scalar or vector prefetches?
The vector ones have the downside of hitting the L1 and they do leave the VPU's FP resources unused for one or more cycles.
The scalar ones can fetch to the larger L2, though the 1:16 disadvantage they'd face under the software rasterizer's strand-based organization, plus the back and forth between the vector and scalar sides, may make this less than universally useful.
Either way, this is a dual-issue core. There are only so many slots to burn.
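For reference, the prefetch-then-load pattern under discussion looks roughly like this in generic C. This is only a sketch: it uses the GCC/Clang `__builtin_prefetch` hint rather than the scalar or vector prefetch instructions being argued about, and `gather_sum`, `PREFETCH_AHEAD`, and the lookahead value are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookahead distance, in elements. The right value depends on
   memory latency and per-element work; 8 is an assumption, not a tuned number. */
#define PREFETCH_AHEAD 8

/* Sums texels gathered through an index list, issuing a prefetch hint for a
   later texel while the current one is being consumed. */
uint32_t gather_sum(const uint32_t *texels, const uint32_t *indices, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_AHEAD < n) {
            /* Hint only: read access (0), low temporal locality (1). */
            __builtin_prefetch(&texels[indices[i + PREFETCH_AHEAD]], 0, 1);
        }
        sum += texels[indices[i]]; /* demand load, ideally a cache hit by now */
    }
    return sum;
}
```

Note that every prefetch here costs an issue slot and an extra index load, which is the "slots to burn" concern in miniature.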
Texture sampling is just another long-latency event that needs hiding... it's irrelevant whether or not it is being done in a separate unit, except that if it is, it's actually harder to keep the main cores busy, not easier.
The math the cores would be doing would be shovel work, and the cache would be tracking dozens to hundreds of outstanding fetches. It's not a workload the P54 was meant to handle.
A custom texture unit could do all those things in its own way, and would merely have to produce the results within a given time window.
The main cores could spend their time doing something useful.