That is very wrong!
For one, the working set of 2 threads is larger than for 1 thread.
True, which is one of the trade-offs designs have to make when it comes to threading and the memory hierarchy.
Working sets have scaled much more gradually over time, and cache capacity is one of the cheaper knobs that can be tuned in a design, particularly for levels further out from the core.
So you need bigger L3 caches to accommodate 4 cores with 2xSMT versus 4 cores with no SMT.
Why would this be significant, particularly with designs like Intel's where the L3 is frequently highly inclusive of the L2 and L1? The L3's pressure is actually worse the more cores you have in that scenario. SMT doesn't materialize additional physical lines the L3 has to track.
The lower levels of cache are a fraction of the L3's size, so the impact has been measured to be modest. The L3 is also very adjustable in terms of capacity, and this is a very low-power adjustment versus doubling the active circuitry of the core complex.
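As a rough sanity check on that "fraction" claim, here is some toy arithmetic. The cache sizes below are my own assumptions, typical of Intel's inclusive-L3 era parts, not figures from this thread:

```python
# Assumed per-core capacities (typical of Intel's inclusive-L3 designs;
# these specific numbers are illustrative, not from the discussion above).
l1d_kib = 32    # L1 data cache per core
l2_kib = 256    # L2 per core
cores = 4
l3_mib = 8      # shared inclusive L3

# Lines an inclusive L3 must shadow for all the inner caches combined.
inner_kib = (l1d_kib + l2_kib) * cores
overhead = inner_kib / (l3_mib * 1024)   # fraction of L3 spent on inclusion

print(f"Inclusive duplication: {inner_kib} KiB = {overhead:.1%} of the L3")
```

Even with every inner line duplicated, the inclusion cost stays in the low double digits of a percent of the L3, which is why the impact is modest.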
Similarly, your L3 bandwidth requirements for a core with SMT are higher than for a core with no SMT.
The L3 bandwidth requirements are exactly the same as for that core without SMT: what matters is the physical number of transactions that need to be serviced by the next level of the hierarchy, assuming that level is local to the core. The L3 and the uncore see the cores in terms of their physical interface points and porting, which is why doubling the number of cores means more than reusing the hardware that is already in place. If the L3's bandwidth is not sufficient to supply 4 SMT cores' worth of traffic, it is at best half of what is needed for 8 cores. In practice it would be worse, due to coherence traffic, which grows with the number of active caches. That's why Intel starts upping the complexity of its internal interconnect at the higher core counts, and adds memory controllers as bandwidth needs rise.
SMT does not materialize additional physical ports or additional caches that need to maintain coherence.
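A back-of-envelope sketch of that point (my own toy model, not anything measured): the uncore's load scales with the number of physical cores and their caches, and the SMT ways drop out of the formula entirely:

```python
# Toy model of uncore load (illustrative assumptions, not measured data):
# one L3 interface point per physical core, and coherence work that grows
# with the number of caches that must snoop each other.

def uncore_load(physical_cores, smt_ways=1):
    # smt_ways is intentionally unused: SMT threads share the core's
    # existing port and cache, adding no new interface points or caches.
    l3_ports = physical_cores
    snoop_pairs = physical_cores * (physical_cores - 1) // 2
    return l3_ports, snoop_pairs

print(uncore_load(4, smt_ways=2))  # 4 cores with 2xSMT -> (4, 6)
print(uncore_load(8))              # 8 cores            -> (8, 28)
```

Doubling the physical cores doubles the interface points and more than quadruples the snoop pairs, while adding SMT changes neither.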
Or to make it even clearer: say you have very efficient SMT that allows your single core with 2xSMT to perform as well as 2 cores; obviously that single core then needs the same shared infrastructure as the 2 weaker cores.
It's not necessary that SMT match two separate cores. It just needs to yield better performance to justify its power and area cost on the workloads the chip will face.
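To make that cost/benefit framing concrete, here is a toy comparison. The ~25% SMT uplift and ~5% area adder below are my own illustrative assumptions, broadly in the range such figures are usually quoted, not numbers from this thread:

```python
# Illustrative assumption: SMT yields +25% throughput for +5% die area.
smt_uplift, smt_area = 1.25, 1.05
# Adding a second identical core: +100% throughput for +100% area.
dual_uplift, dual_area = 2.00, 2.00

smt_perf_per_area = smt_uplift / smt_area
dual_perf_per_area = dual_uplift / dual_area

print(f"SMT:         {smt_perf_per_area:.2f} throughput per unit area")
print(f"second core: {dual_perf_per_area:.2f} throughput per unit area")
```

Under those assumptions SMT wins on throughput per unit area even while falling far short of a second core's absolute performance, which is the whole justification.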
SMT helps justify a beefier core than could be justified without it, which yields a secondary benefit to single-threaded performance.
The overhead is effectively a small amount of additional context tracking and selection logic for these cores, since the OoO engine's rename and result tracking can, with very little alteration, keep two threads in flight.
This puts the die size penalty for SMT at around 5%, and that figure is for now-ancient architectures. Some of the split resources are now pooled, and the overall size of the cores and the complexity of the rest of the chip have grown far more since then.
http://www.cs.cmu.edu/afs/cs/academic/class/15740-f02/www/lectures/hv.pdf
The power penalty was put as high as 16% elsewhere, given sufficient utilization by multiple threads.
Having two whole cores run the same two threads as one SMT core is not a lower power penalty.
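Rough arithmetic on that, treating the ~16% figure cited above as the full SMT power adder and ignoring clocks, idle power, and workload differences for simplicity:

```python
# Simplified comparison (assumptions as stated above): one SMT core running
# two threads at +16% power, versus two whole cores at 2x the power of one.
one_core_power = 1.00
smt_power = one_core_power * 1.16       # the ~16% penalty quoted above
two_core_power = one_core_power * 2.00  # two full cores, same two threads

ratio = smt_power / two_core_power
print(f"SMT runs the thread pair at {ratio:.0%} of the two-core power")
```

Even granting the high end of the quoted penalty, the SMT core runs the pair of threads at well under two-thirds of the two-core power budget.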
Putting an 8-core solution against a 4-core means it will be compared on the workloads the 4-core does best in, which even now are heavily weighted towards benchmarks and applications that usually do not scale to 8 cores and that benefit from the higher clocks the 4-core can turbo to.