I can't imagine that you wouldn't have a texture header cache/buffer for each TMU. In fact, you would have had this ever since the beginning of multitexturing, since the TMU needs to know information about whichever texture slot is being sampled.
The difference with bindless is that it's a cache instead of a buffer. For a hit, the latency would probably be unaffected, though at worst it might add a single pipeline stage for the address comparison. For a miss, the TMU would have to read the texture header from memory, which of course is an extra indirection. When conventional texture slots are used, they conveniently map directly into the cache, meaning no misses.
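To make the hit/miss behaviour concrete, here is a toy model in Python. It is only a sketch under my own assumptions, not how any real GPU does it: the class `TextureHeaderCache`, its methods, the direct-mapped layout, and the "handle modulo line count" indexing are all made up for illustration.

```python
class TextureHeaderCache:
    """Toy direct-mapped texture header cache (hypothetical, for illustration)."""

    def __init__(self, num_lines=16):
        self.num_lines = num_lines
        self.tags = [None] * num_lines     # handle currently held by each line
        self.headers = [None] * num_lines  # cached header for each line
        self.hits = 0
        self.misses = 0

    def bind_slot(self, slot, header):
        # Conventional path (assumed): binding writes the header straight
        # into line `slot`, so the cache behaves like a plain buffer.
        self.tags[slot] = slot
        self.headers[slot] = header

    def lookup(self, handle, memory):
        # Assumed indexing: low bits of the handle pick the line.
        line = handle % self.num_lines
        if self.tags[line] == handle:
            # Hit: only the tag comparison stands between us and the header.
            self.hits += 1
        else:
            # Miss: the extra indirection, fetch the header from memory.
            self.misses += 1
            self.tags[line] = handle
            self.headers[line] = memory[handle]
        return self.headers[line]


# Fake "memory" holding a header per handle.
memory = {h: f"header{h}" for h in range(64)}

# Conventional slots: every lookup hits because slot i lives in line i.
slots = TextureHeaderCache(num_lines=4)
for s in range(4):
    slots.bind_slot(s, memory[s])
for s in (0, 1, 2, 3, 0, 1):
    slots.lookup(s, memory)

# Bindless handles: 8 and 12 collide on line 0 and keep evicting each other.
bindless = TextureHeaderCache(num_lines=4)
for h in (8, 12, 8):
    bindless.lookup(h, memory)
```

In this model the slot-based accesses never miss, while the two bindless handles that share a line miss every time, which is exactly where the associativity of the cache would start to matter.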
The more interesting question would be about this cache's associativity.