Depending on how much you're willing to stretch the definition, it could be said that it's already done in GF3/4, and possibly in a lot of other T/L hardware too.
Hyperthreading is about switching between threads on each instruction issue. This reduces the cost of flushing the pipeline after a branch misprediction, and it also hides latency for complex instructions. If an instruction has high throughput but high latency, the apparent latency seen by each thread is always reduced with hyperthreading. If an instruction has low throughput and high latency, the apparent latency is still reduced as long as the other thread(s) aren't issuing that instruction at the same time.
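To make the latency-hiding effect concrete, here's a toy cycle simulator (my own sketch, with made-up parameters: a 4-cycle-latency, fully pipelined instruction and a chain of dependent instructions per thread). With one thread, every instruction stalls on the previous result; with enough interleaved threads, the result is ready by the time the thread gets its next issue slot, so the apparent latency drops toward 1 cycle per instruction.

```python
# Toy model of round-robin issue between threads.
# Each thread runs a chain of dependent instructions; a result takes
# LATENCY cycles before the next instruction in that thread can issue.
# One issue slot per cycle, rotated round-robin over the threads.
# (Parameters are hypothetical, chosen only to illustrate the effect.)

LATENCY = 4          # cycles until a result can be consumed
INSTRUCTIONS = 8     # dependent instructions per thread

def total_cycles(num_threads):
    ready = [0] * num_threads    # cycle at which each thread's last result is ready
    done = [0] * num_threads     # instructions retired per thread
    cycle = 0
    while min(done) < INSTRUCTIONS:
        for t in range(num_threads):
            # A thread issues only if it still has work and its input is ready.
            if done[t] < INSTRUCTIONS and ready[t] <= cycle:
                ready[t] = cycle + LATENCY
                done[t] += 1
            cycle += 1           # one issue slot per cycle, used or not
    return cycle

for n in (1, 2, 4):
    c = total_cycles(n)
    print(f"{n} thread(s): {c} cycles, "
          f"{c / (n * INSTRUCTIONS):.2f} cycles per instruction")
```

With these numbers, one thread averages well over 3 cycles per instruction, while four interleaved threads get close to 1, which is the whole point of the technique.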
The VS hardware in GF3/4 has ALUs with high throughput but high latency, and the latency is hidden by operating on several vertices at once. If each run of the VS program on one vertex is considered one thread, then it could be called hyperthreading; at least it solves a problem in a similar way. It's of course a simplified version, since all threads are running the same program, maybe even with the program counter in sync.
I think I've read somewhere that even traditionally expensive operations like reciprocal are done in one cycle. To avoid needing a reciprocal unit with a throughput of 1 op/cycle, they could compute a static time shift between the threads so that only one thread uses that unit at a time. As long as reciprocals are sparse, it should be possible, and otherwise the driver could add "nop" instructions to make it possible. If this is done, and the threads are running "shifted in time" relative to each other, then it would be a rather close match for multithreading.
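A small sketch of that time-shift idea (entirely my own illustration, not how the actual hardware or driver works): assume n threads run the same program, with thread t lagging t cycles behind thread 0, so thread t executes instruction i at cycle i + t. Two threads then collide on a single shared reciprocal unit exactly when two "rcp" instructions sit fewer than n slots apart, and the driver can pad with "nop"s to restore the spacing.

```python
# Hypothetical scheduling sketch: n time-shifted threads sharing one RCP unit.
# Instruction names ("rcp", "nop") and the spacing rule are illustrative
# assumptions, not taken from any real ISA.

def pad_for_shared_rcp(program, num_threads):
    """Insert nops so consecutive 'rcp' ops are >= num_threads slots apart."""
    out = []
    last_rcp = None
    for op in program:
        if op == "rcp" and last_rcp is not None:
            gap = len(out) - last_rcp
            out.extend(["nop"] * max(0, num_threads - gap))
        out.append(op)
        if op == "rcp":
            last_rcp = len(out) - 1
    return out

def has_conflict(program, num_threads):
    """True if two time-shifted threads ever use the RCP unit the same cycle."""
    rcp_slots = [i for i, op in enumerate(program) if op == "rcp"]
    busy = set()
    for t in range(num_threads):
        for i in rcp_slots:
            cycle = i + t        # thread t runs instruction i at cycle i + t
            if cycle in busy:
                return True
            busy.add(cycle)
    return False

prog = ["mul", "rcp", "rcp", "add", "rcp"]
padded = pad_for_shared_rcp(prog, num_threads=4)
print(padded)
print(has_conflict(prog, 4), has_conflict(padded, 4))
```

The unpadded program conflicts (two rcps back to back), while the padded one keeps every rcp four slots apart, so the four shifted threads hit the unit on disjoint cycles.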