The quote I was addressing was the claim that in-order execution allowed dependent instructions to execute in the next cycle, as if OoO prevented it. Both in-order and out-of-order cores do this.
Sorry for the confusion, but that was not my claim. In fact I already mentioned forwarding on the first page. What I did say is that with in-order execution the complexity is lower, reducing execution latencies, so some instructions that previously took two clock cycles to execute might now require only one.
That has nothing to do with register pressure. Since each individual thread still only sees the architectural registers in the ISA, you could have five billion hyper-threads, and they'd all be constrained by the same small software-visible register pool.
Look, 16 registers is really OK as long as you're not doing heavy software pipelining. It's not ideal, there will still be some spills, but it's close enough. Now, Hyper-Threading makes software pipelining unnecessary: instead of working on multiple tasks simultaneously in the same thread, which would require extra logical registers, you work on them simultaneously in separate threads, which uses extra physical registers. So it does help reduce register pressure.
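To make the register-pressure argument concrete, here's a toy model (my own illustration, not from this thread): with software pipelining, roughly ceil(latency / initiation interval) iterations are in flight at once, and each one needs its own copy of every value that stays live across the overlap, so register demand scales with the pipelining depth.

```python
def live_values(latency, values_per_iteration, initiation_interval=1):
    """Rough count of registers a software-pipelined loop keeps live:
    iterations in flight (ceil(latency / II)) times live values per iteration."""
    in_flight = -(-latency // initiation_interval)  # ceiling division
    return in_flight * values_per_iteration

# A loop fed by a 6-cycle load with 3 values live per iteration:
print(live_values(6, 3))     # pipelined at II=1: 18 registers
print(live_values(6, 3, 6))  # one iteration at a time: 3 registers
```

With only 16 architectural registers the pipelined version spills, while four hardware threads each running the unpipelined loop need just 3 registers apiece, drawn from the larger physical pool.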
Just compare it with a GPU. It runs many threads per shader unit, but each of them individually only has a small number of physical registers. Correct me if I'm wrong, but don't they typically have far fewer than 16 registers?
So that's the other end of the spectrum. To create x86 mini-cores, a middle road would be to have, say, four threads per mini-core and 64 physical registers. That keeps both the logical and the physical register counts manageable, and the end result is good throughput per core.
There are limited gains beyond a certain point of SMT. Intel's magic limit was 2 simultaneous threads. IBM had up to 4, I believe. After that, the penalties from contention for buffers, register ports, and other critical resources make it better to just go with separate cores.
That's with out-of-order execution. Beyond two threads the complexity quickly becomes unmanageable: you have to widen every structure required to support out-of-order execution, which is a large part of the core area, to the point where separate cores are indeed far more interesting. But with in-order execution the complexity is fairly low, and you can keep adding threads up to the point where there's a good balance between core utilization and core area.
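A back-of-envelope utilization model may help here (numbers are hypothetical): if each in-order thread alternates B cycles of work with an L-cycle stall, and stalls overlap with other threads' work, a single-issue pipe reaches min(1, N*B/(B+L)) utilization with N threads. So the useful thread count is set by the stall-to-work ratio rather than any fixed magic number.

```python
def utilization(threads, work_cycles, stall_cycles):
    """Occupancy of a single-issue in-order pipe when each thread alternates
    work_cycles of execution with a stall_cycles stall, and the stalls of one
    thread are hidden by the work of the others (round-robin issue)."""
    return min(1.0, threads * work_cycles / (work_cycles + stall_cycles))

# 10 cycles of work followed by a 30-cycle miss: each thread can only cover
# a quarter of the cycles, so four threads saturate the pipe.
for n in range(1, 5):
    print(n, utilization(n, 10, 30))
```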
Hyperthreading also assumes a significant number of free parallel resources, which for an SPE-type core wouldn't exist. There wouldn't be 5 spare execution units waiting for another thread to take over, and there wouldn't be 5 times as many register ports to handle the traffic.
The Pentium 4 with Hyper-Threading didn't have extra execution units, yet the combined throughput of both threads was higher than one thread with Hyper-Threading disabled. That's because each thread only uses a limited number of execution units at any given time. The major reason it didn't work so well in practice was that speculative execution of each thread took away precious resources from the other thread. That doesn't hurt with just one thread: even if only 1 in 10 speculatively executed instructions turns out to be useful, it's still a win, since those resources would otherwise sit idle. So essentially out-of-order execution was the problem.
With in-order execution, Hyper-Threading would be used to hide latencies, and would only have a positive effect on combined throughput: every executed instruction is as good as one from another thread. And even with four threads I don't see a big need for extra execution units (although that's definitely an option). We just need high utilization. If adding execution units lowers the total utilization, even if it makes an individual thread run faster, don't do it. Increasing single-thread performance at all cost was Intel's motto for over a decade. Now they really have to start looking at throughput per area, just like a GPU.
SMT is useful when a wide processor is having trouble utilizing its resources. If the core is narrow, it is not that useful.
Switch-on-event multithreading or coarser methods would be preferable.
That's true when out-of-order execution is used to keep utilization high. With smaller in-order cores, wide or narrow, it has to hide latency to increase utilization.
Out-of-order execution is, in hardware terms, just a very expensive way to achieve high utilization. It was devised specifically to keep CPUs single-threaded. Now that that barrier has been broken, there's no strict need to hang on to it. With in-order execution and Hyper-Threading we'd ideally get the same utilization per core, just with far smaller cores.
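As a purely hypothetical throughput-per-area comparison (all numbers invented for illustration, not measurements): if a wide out-of-order core and a narrow in-order SMT mini-core reach similar utilization, the mini-core wins per unit area whenever the big core's area overhead outgrows its extra issue width.

```python
def throughput_per_area(utilization, issue_width, area):
    """Sustained instructions per cycle per unit of die area."""
    return utilization * issue_width / area

# Invented numbers: a 3-wide OoO core costing 4 area units vs a 1-wide
# in-order SMT mini-core costing 1 area unit, both ~90% utilized.
big_ooo   = throughput_per_area(0.9, 3, 4.0)
mini_core = throughput_per_area(0.9, 1, 1.0)
print(mini_core / big_ooo)  # mini-cores deliver 4/3 the throughput per area
```

Under these made-up figures, four mini-cores in the big core's footprint would sustain a third more total work, which is the whole throughput-per-area argument in miniature.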
Sometimes it is better to write code for something that is new but performs well than to run old code on something that performs like a joke. Software using a system shouldn't be ambushed by suddenly horrible performance.
If that were true for CPUs, x86 would be something only my dad remembers, we'd be coding for 16 cores already, and there'd be a new ISA every couple of years.
Don't underestimate the importance of software. If you create a new CPU and none of the legacy software runs any faster on it, it won't sell. Just like you can't sell a GPU that isn't DirectX compatible. And it's the server market that pays for much of the research cost, before the architecture goes mainstream.
That's why I believe x86 mini-cores would be very interesting. You do get a speedup for legacy software, and especially server applications would get almost the full potential. I also don't believe it would "perform like a joke". There's no reason why an x86 CPU with mini-cores couldn't perform equivalently to a Cell processor. It also has the benefit of already having excellent development tools.
That's more amenable to a Niagara approach, with non-threaded simple cores. Hyperthreading would do little without bloating a core of this type, and even Niagara's cores have more registers.
Niagara cores are threaded; strictly speaking it's CMT. In fact that's pretty much what I have in mind for x86 mini-cores. Heck, Intel could easily take a bite out of Sun's server market with an x86 processor with mini-cores. And with 32 registers, Niagara's SPARC cores don't really have an abundance like Cell's SPEs either. So I'm sure 16 is adequate when you have a fast L1 cache.