Hm... If they can execute instructions from both threads but not issue from both in the same cycle, won't that cause instruction bubbles in the pipeline?
No, not really. In the frontend, there is just a single 4-wide pipeline that is strictly in-order and reads instructions from alternating threads each clock, unless it has no valid data for one thread (such as when it doesn't yet know a good fetch address after missing a branch, or when the thread is idle).
This then feeds into a pool, which entirely decouples the execution of the uops from the frontend. And as long as there are enough independent instructions in the pool, the processor is able to go full tilt.
So there are no bubbles in the frontend (it's just a single pipe and almost always full), nor in the back-end, which only cares about having independent instructions to work on -- and instructions from separate threads are independent by definition.
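To make the fill/drain picture concrete, here's a toy Python sketch of the idea. Nothing in it is SNB-accurate -- the widths, pool size, and uop names are all made up -- it just shows a frontend pushing a batch from alternating threads into a shared pool while the backend pulls uops out of it without caring which thread they came from:

```python
# Toy model of a 2-thread SMT frontend feeding a shared uop pool.
# Purely illustrative -- widths and counts are invented, not SNB-accurate.
from collections import deque

FETCH_WIDTH = 4      # frontend delivers up to 4 uops/cycle from ONE thread
EXEC_WIDTH = 3       # backend drains up to 3 uops/cycle from the pool
CYCLES = 12

threads = {0: deque(f"T0_uop{i}" for i in range(40)),
           1: deque(f"T1_uop{i}" for i in range(40))}
pool = deque()
current = 0          # which thread the frontend reads this cycle

for cycle in range(CYCLES):
    # Frontend: read from the current thread if it has valid data,
    # otherwise fall back to the other one (e.g. after a branch miss).
    src = current if threads[current] else 1 - current
    fetched = [threads[src].popleft()
               for _ in range(min(FETCH_WIDTH, len(threads[src])))]
    pool.extend(fetched)
    current = 1 - current            # alternate threads each clock

    # Backend: execute uops straight out of the pool, regardless of
    # which thread they came from (they are independent by definition).
    executed = [pool.popleft() for _ in range(min(EXEC_WIDTH, len(pool)))]
    print(f"cycle {cycle:2d}: fetched {len(fetched)} from T{src}, "
          f"executed {len(executed)}, pool size {len(pool)}")
```

Run it and you'll see the pool never empties and the backend never goes idle, even though the frontend only ever talks to one thread per cycle.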
Do you know why this limitation exists? Maybe there just aren't enough cases where both threads have data available in cache AND there are unused execution units available for such a feature to be worth implementing?
I'm not even sure I'd call it a limitation. The key to understanding it is that the 4 decoders are not independent; they are more like a single machine that chomps an aligned 16-byte chunk and outputs 1-4 uops per cycle (if the chunk holds more than 4 instructions, it has to operate on the same chunk again). All the rest of the in-order pipe just flows from them, to work on the results of the decoders as efficiently as possible.
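If it helps, here's a rough sketch of that "one machine chomping a 16-byte chunk" idea. The instruction lengths and the 4-per-cycle cap are just illustrative numbers for this toy model, not a claim about how the real decoders are wired:

```python
# Toy sketch: decoders as a single machine that works on one aligned
# 16-byte chunk per cycle and emits at most 4 instructions' worth of uops.

CHUNK = 16          # bytes in one aligned fetch/decode window
MAX_PER_CYCLE = 4   # at most 4 instructions decoded per cycle

def decode_chunk(insn_lengths):
    """Return (instructions in the chunk, cycles spent on that chunk)."""
    n, offset = 0, 0
    for length in insn_lengths:
        if offset >= CHUNK:          # instruction starts in the next chunk
            break
        n += 1
        offset += length
    cycles = -(-n // MAX_PER_CYCLE)  # ceiling division: >4 means re-reading
    return n, cycles

# Eight 2-byte instructions pack into one chunk -> 2 cycles on the same chunk.
print(decode_chunk([2] * 8))        # (8, 2)
# Four 4-byte instructions -> the chunk is done in a single cycle.
print(decode_chunk([4] * 4))        # (4, 1)
```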
And as long as there are enough instructions in the pool, it makes no difference whether the frontend spits out 4 uops per cycle from alternating threads or 2 uops per cycle from each thread. Since they also want to be able to read 4 uops per cycle from a single thread when only one is running (or the other is stalled), they picked the "alternating threads" option.
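A trivial sanity check of that "makes no difference" claim, with made-up widths -- over any longer window, both policies deliver the same number of uops per thread into the pool:

```python
# Compare two hypothetical fetch policies over a 10-cycle window.
def delivered(policy, cycles=10):
    t0 = t1 = 0
    for c in range(cycles):
        if policy == "alternate":     # 4 uops from one thread per cycle
            if c % 2 == 0:
                t0 += 4
            else:
                t1 += 4
        else:                         # "split": 2 uops from each thread
            t0 += 2
            t1 += 2
    return t0, t1

print(delivered("alternate"))   # (20, 20)
print(delivered("split"))       # (20, 20)
```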
Basically, stop thinking about the CPU as one pipeline and start thinking of it as two separate ones, where the first fills the pool and the second empties it. On well-optimized code, the frontend should always "run away" from the back-end, so the pool the back-end operates on is always full -- until you miss a branch and have to toss half the contents of the pool. That is really where the CPU gains the most from HT: without it, the pool would run dry and the execution units would have nothing to do.
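And a toy model of that last point, just to show the shape of it (all the rates and the mispredict probability are invented): with one thread, a flush empties the pool and execution slots go idle until the frontend refills it; with two threads, the other thread's uops are still sitting there to run.

```python
# Toy model of why HT helps most right after a branch miss.
# FETCH/EXEC widths and the 10% mispredict rate are made-up numbers.
import random

FETCH, EXEC, CYCLES = 4, 3, 200

def run(n_threads, seed=0):
    random.seed(seed)
    pool, idle = [], 0                           # pool entries are thread ids
    for cycle in range(CYCLES):
        pool.extend([cycle % n_threads] * FETCH)       # frontend alternates threads
        if random.random() < 0.10:                     # some thread mispredicts...
            victim = random.randrange(n_threads)
            pool = [t for t in pool if t != victim]    # ...its uops get tossed
        executed = min(EXEC, len(pool))
        del pool[:executed]                            # backend drains the pool
        idle += EXEC - executed                        # unused execution slots
    return idle

print("idle exec slots, 1 thread: ", run(1))
print("idle exec slots, 2 threads:", run(2))
```

With one thread, every flush empties the whole pool; with two, only the victim's uops disappear, so the execution units waste far fewer slots.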
Sorry for making a post consisting of nothing but questions!
It's really no problem. If you want to read more about how modern CPUs work, I strongly suggest reading at least chapter 2 of the great microarchitecture guide (PDF) by Agner Fog, and then the in-depth RWT article about SNB.