nutball said:
Maybe not in verification, but IBM regularly say (to me at least!) that adding SMT to POWER5 increased the transistor count by ~10%, so it's a *lot* cheaper than dual-core in transistor terms. In some codes I run it boosts performance by 30%, so it's potentially a big win.
The number I've seen for POWER5 is 24% extra area to make SMT useful. That's still far less than duplicating an entire core, and IBM claims the increase in throughput is larger than the increase in die area, which is good.
But it isn't stellar.
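To put the two quoted figures side by side (a rough sketch; the 24%, 30% and 100% numbers are the ones quoted above, not measurements of mine):

```python
# Figures quoted above: SMT on POWER5 costs ~24% extra core area
# (IBM's number; nutball quotes ~10%) and can buy ~30% more
# throughput on some codes. A second core costs ~100% extra core
# area for at most ~100% more throughput.
smt_area, smt_gain = 0.24, 0.30
dual_area, dual_gain = 1.00, 1.00

# Throughput gained per unit of extra area spent.
print("SMT:", round(smt_gain / smt_area, 2))    # better than 1.0
print("Dual-core:", round(dual_gain / dual_area, 2))
```

By this crude measure SMT pays off per transistor, but only on codes that actually see the ~30% gain.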
SMT is implemented to make better use of the execution units, but a lot of other resources have to be beefed up as well: rename registers, buffers, the D$ and the I$ all see increased pressure unless they are reworked to sustain multiple contexts.
Take a look at the P4 Northwood -> Prescott transformation. The die grew much more than the 32bit -> 64bit execution units and data lanes could account for. The thing is that Intel made a real effort to make SMT worthwhile. This meant increasing the number of rename registers from 128 to 256, doubling (or more, I forget) the number of write combine buffers, store buffers, etc. Most importantly, they doubled the D$ from 8KB to 16KB to reduce thrashing when two contexts are active.
Increasing these structures did not come without a cost: the D$ load-to-use latency went from 2 to 4 cycles, which has a very serious impact on single thread performance; the doubling of the D$ was there in part to alleviate this increase in latency.
Also, to maintain clock frequency with these larger structures, the new core had to be more deeply pipelined. This had the negative impact of a bigger branch mispredict penalty, so the branch predictor had to be beefed up to keep mispredictions fewer.
All in all this added up to a big increase in die area.
If you look at these slides from an AMD presentation you get a good feel for the breakdown of a modern CPU. An Opteron consists of:
1. FP exec units, FP registers+scheduler: 5%
2. LS (load/store): 4%
3. Integer exec units/register+scheduler: 4%
4. x86 decode: 2%
5. Branch prediction: 2%
6. Branch unit: 1%
7. D$: 6%
8. I$: 4%
9. Northbridge: 5%
10. L2: 42%
11. I/O: 25%
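As a sanity check, the breakdown above can be summed up (percentages are the ones from the slides as quoted; note the core items actually sum to 28%, which the post rounds to ~30%):

```python
# Approximate Opteron die-area breakdown, in percent, as quoted
# above (originally from an AMD presentation).
breakdown = {
    "FP exec units + regs/scheduler": 5,
    "Load/store": 4,
    "INT exec units + regs/scheduler": 4,
    "x86 decode": 2,
    "Branch prediction": 2,
    "Branch unit": 1,
    "D$": 6,
    "I$": 4,
    "Northbridge": 5,
    "L2": 42,
    "I/O": 25,
}

# Everything except Northbridge, L2 and I/O counts as "core".
uncore = {"Northbridge", "L2", "I/O"}
core = sum(v for k, v in breakdown.items() if k not in uncore)
print("Total:", sum(breakdown.values()), "%")  # 100 %
print("Core (with L1 caches):", core, "%")     # 28 %
```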
The core (with L1 caches) only takes up about 30% of the total die area (the items above sum to 28%). To implement SMT you'd need to increase the size of the following:
1. FP registers and preferably the size of the scheduler as well
2. Write combine buffers in the LS
3. INT registers and scheduler
7. D$
8. I$
Let's say 1.) grows by 50%, 2.) doesn't really grow, 3.) grows by 50%, 7.) by 100% (actually both size and associativity should be doubled to avoid a negative impact, so it could be even worse) and 8.) by 100%. That would make the core roughly 50% (42.5/28) larger than today. And what you get in return is a potential doubling of throughput and a guaranteed negative impact on single thread performance.
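Writing that back-of-the-envelope estimate out explicitly (the growth factors are the guesses from the paragraph above, not measured data; the per-component areas sum to 28%, which is the baseline used here):

```python
# (area % of total die, assumed SMT growth factor) per core component.
# Growth factors are the guesses from the text, not measurements.
core_area = {
    "FP regs/scheduler":  (5, 0.5),
    "Load/store":         (4, 0.0),
    "INT regs/scheduler": (4, 0.5),
    "x86 decode":         (2, 0.0),
    "Branch prediction":  (2, 0.0),
    "Branch unit":        (1, 0.0),
    "D$":                 (6, 1.0),
    "I$":                 (4, 1.0),
}

old = sum(a for a, _ in core_area.values())          # 28 (% of old die)
new = sum(a * (1 + g) for a, g in core_area.values())  # 42.5
print(f"core grows from {old}% to {new}% of the old die size: "
      f"+{new / old - 1:.0%}")
```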
I'm fairly certain that AMD has chosen not to implement SMT for these reasons, and has instead put multiple cores on a die. Because of the smaller core and the reduced complexity they get a time-to-market advantage.
Multiple threads/cores make a lot of sense as long as there is parallelism to exploit, but once you reach the point of diminishing returns Amdahl's law makes single thread performance matter again.
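On the Amdahl's law point, a hypothetical example: if 80% of a workload is parallelizable, speedup is capped at 1/(1-0.8) = 5x no matter how many threads you throw at it:

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Upper bound on speedup with n cores/threads when only
    parallel_fraction of the work can run in parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Hypothetical workload: 80% parallel, 20% serial.
for n in (1, 2, 4, 8, 1000):
    print(f"{n:5d} threads -> {amdahl_speedup(0.8, n):.2f}x")
```

Past a handful of threads the serial 20% dominates, which is exactly where single thread performance starts to matter again.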
So AMD's choice makes perfect sense, but then again so does Intel's and in particular IBM's.
Cheers