Why go multithreading/multiple cores?

Hm... If they can execute instructions from two threads but not issue, won't that cause instruction bubbles in the pipeline?

Do you know why this limitation exists? Maybe there just aren't enough cases where both threads have data available in-cache AND there's unused execution units available for such a feature to be implemented?

Sorry for making a post consisting of nothing but questions! :D
 
Hm... If they can execute instructions from two threads but not issue, won't that cause instruction bubbles in the pipeline?

Well, not exactly. Pipeline bubbles are mostly caused by dependent instructions, where one can't start until another one finishes.

In order to reduce pipeline bubbles, what you want is to feed the execution units more independent uops decoded from x86 instructions. You can increase the size of the out-of-order window (to collect more instructions and, in theory, find more independent ones among them), but that's very expensive and subject to the law of diminishing returns. SMT, on the other hand, is a great candidate if you have enough threads, as instructions from a different thread are generally independent of instructions from this thread. So by decoding from alternating threads, you may be able to produce roughly twice as many independent uops for the execution units to run, and that increases efficiency.
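To make that concrete, here's a toy issue model (a sketch with made-up numbers, not a real CPU simulator): each thread is a chain of fully dependent ops with a 3-cycle latency, so a single thread can issue at most one op every 3 cycles, and a second, independent thread can fill the bubbles in between.

```python
def cycles_to_issue(num_threads, ops_per_thread, latency=3):
    """Cycles until every op has issued, with one issue slot per cycle;
    an op may only issue once its thread's previous op has completed."""
    ready = [0] * num_threads                   # cycle when each thread may issue next
    remaining = [ops_per_thread] * num_threads  # ops left per thread
    cycle = 0
    while any(remaining):
        for t in range(num_threads):            # scan threads, issue first ready op
            if remaining[t] and ready[t] <= cycle:
                remaining[t] -= 1
                ready[t] = cycle + latency      # next op depends on this result
                break                           # only one issue slot per cycle
        cycle += 1
    return cycle

print(cycles_to_issue(1, 10))  # 28 cycles: mostly bubbles
print(cycles_to_issue(2, 10))  # 29 cycles for twice the work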

Since the whole point of SMT is to reduce the number of pipeline bubbles, there is no point in increasing the decode width, as the pipeline is rarely decode-limited.
 
Hm... If they can execute instructions from two threads but not issue, won't that cause instruction bubbles in the pipeline?

No, not really. In the frontend, there is just a single 4-wide pipeline that is strictly in-order and which reads instructions from alternating threads each clock, unless it has no valid data for one thread (such as when it doesn't yet know a good fetch address after missing a branch, or when the thread is idle).

Then this feeds into a pool, which entirely decouples the execution of the uops from the frontend. And so long as there are enough independent instructions in the pool, the processor is able to go full tilt.

So there are no bubbles in the frontend (it's just a single pipe and almost always full), nor in the back-end, which only cares about having independent instructions, and instructions from the separate threads are independent by definition.
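A quick sketch of that alternating fetch policy, as described above (illustrative only, not the actual SNB arbitration logic): pick the other thread each clock, unless that thread has nothing valid to fetch, in which case the remaining thread gets the slot.

```python
def fetch_schedule(stalled, cycles):
    """Return which thread (0 or 1) fetches on each cycle.
    `stalled[t]` is the set of cycles on which thread t cannot fetch."""
    last = 1  # so that thread 0 goes first
    schedule = []
    for c in range(cycles):
        preferred = 1 - last            # alternate threads each clock
        if c in stalled[preferred]:     # preferred thread has no valid data
            preferred = last            # other thread keeps the pipe busy
        if c in stalled[preferred]:     # both stalled: frontend bubble
            schedule.append(None)
            continue
        schedule.append(preferred)
        last = preferred
    return schedule

# Thread 1 stalled on cycles 2-3 (e.g. waiting on a fetch address):
print(fetch_schedule([set(), {2, 3}], 6))  # [0, 1, 0, 0, 1, 0]
```

Note how thread 0 simply takes both slots while thread 1 is stalled, so the single frontend pipe stays full.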

Do you know why this limitation exists? Maybe there just aren't enough cases where both threads have data available in-cache AND there's unused execution units available for such a feature to be implemented?

I'm not even sure I'd call it a limitation. The key to understanding it is that the 4 decoders are not independent; they are more like a single machine that chomps an aligned chunk of 16 bytes and outputs 1-4 uops. (If there are more than 4 instructions in a chunk, it has to operate on the same chunk again.) All the rest of the in-order pipe just flows from them, to most efficiently work on the results of the decoder.
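A toy model of that decode behaviour (numbers taken from the description above; a real x86 decoder is far more complicated): the frontend chews one chunk at a time and emits up to 4 uops per cycle, revisiting a chunk when it holds more than 4 instructions.

```python
import math

def decode_cycles(instructions_per_chunk):
    """Cycles to decode a list of 16-byte chunks, each given as its
    instruction count, assuming one uop per instruction and a 4-wide decoder
    that must revisit a chunk holding more than 4 instructions."""
    return sum(math.ceil(n / 4) for n in instructions_per_chunk)

print(decode_cycles([4, 4, 4]))  # 3 cycles: full width every cycle
print(decode_cycles([6, 2, 1]))  # 4 cycles: the 6-instruction chunk takes 2
```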

And so long as there are enough instructions in the pool, it makes no difference whether the frontend spits out 4 uops per cycle from alternating threads, or 2 uops per cycle from each thread. They also want to be able to read 4 uops from a single thread when running just one (or when the other one is stalled), so they picked the "alternating threads" option.

Basically, stop thinking about the CPU as one pipeline, and start thinking of it as two separate ones, where the first one fills the pool and the second one empties it. On well-optimized code, the frontend should always "run away" from the back-end, so the pool the back-end gets to operate on is always full. Until you miss a branch and have to toss half the contents of the pool, which is really the point where the CPU gains most from HT -- without HT, the pool would run dry, and the execution units would have nothing to do.
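Here's a rough two-pipe model of that fill/drain picture (all the numbers are made up for illustration, not real Sandy Bridge parameters): the frontend adds `fill` uops per cycle to the pool, the backend drains up to `drain`, and a branch miss on thread 0 tosses that thread's share of the pool and stalls its fetch while the correct path is refetched.

```python
def drained_uops(cycles, threads=1, fill=4, drain=3, miss_at=10, refetch_delay=8):
    """Count uops the backend consumes over `cycles` cycles, with one
    branch mispredict on thread 0 at cycle `miss_at`."""
    pool = [0] * threads                          # pooled uops per thread
    stalled_until = [0] * threads                 # cycle until which a thread can't fetch
    turn = 0                                      # round-robin fetch pointer
    done = 0
    for c in range(cycles):
        if c == miss_at:                          # thread 0 misses a branch
            pool[0] = 0                           # toss its half of the pool
            stalled_until[0] = c + refetch_delay  # wait for the right path
        # Frontend: alternate threads, skipping a stalled one.
        for _ in range(threads):
            t = turn % threads
            turn += 1
            if c >= stalled_until[t]:
                pool[t] += fill
                break
        # Backend: drain up to `drain` uops; cross-thread uops are independent.
        budget = drain
        for t in range(threads):
            take = min(budget, pool[t])
            pool[t] -= take
            done += take
            budget -= take
    return done

print(drained_uops(30, threads=1))  # 66: pool runs dry after the flush
print(drained_uops(30, threads=2))  # 90: other thread keeps the units busy
```

With one thread the execution units sit idle for the whole refetch window; with two, the surviving thread's uops keep draining at full rate.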
Sorry for making a post consisting of nothing but questions!

It's really no problem. If you want to read more about how modern CPUs operate, I strongly suggest reading at least chapter 2 of the great microarchitecture guide (pdf) by Agner Fog, and then the in-depth RWT article about SNB.
 
Thanks guys, imba posts there, quite enlightening stuff. A bit too high-level for me, I suppose, to delve into seriously, as I'm not into microprocessor design, professionally or otherwise. :) I'm just a curious dabbler on this here forum.

But thanks again. I appreciate it!

Hyperthreading (hype-filled label aside ;)) has always seemed like a very good concept to me, even though its original implementation (in the Northwood rev. of the P4, wasn't it?) was rather hit or miss as to whether it actually benefited the user. Tech sometimes needs time to mature; Intel's fumbling with a host of different integer and float SIMD implementations shows that, if nothing else...
 