> For an architecture as narrow as the SPE, the cost would be relatively minor. Forwarding networks have delays that scale quadratically with issue width, but being mostly scalar pretty much makes it worthwhile.

Thanks for the information. I've had a closer look at the Cell SPE architecture and it has a forwarding network as well. So it must be a good trade-off between transistor count and performance.
Considering how sensitive in-orders are to latency, it was probably a very important feature to include.
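To put a rough number on that quadratic scaling, here's a back-of-envelope sketch. The stage count and operand counts are my own assumptions for illustration, not SPE specifics:

```python
# Back-of-envelope sketch (not a real timing model): the number of bypass
# wires in a full forwarding network grows with the product of producing
# and consuming ports, i.e. roughly quadratically in issue width.

def bypass_paths(issue_width: int, pipeline_stages: int = 3) -> int:
    """Count source-to-sink paths in a full bypass network.

    Each in-flight result bus must be muxed into each operand input
    (assumed 2 operands per instruction slot), for every pipeline stage
    whose result may still need forwarding (assumed 3 stages).
    """
    result_buses = issue_width * pipeline_stages   # in-flight producers
    operand_inputs = issue_width * 2               # consumers per cycle
    return result_buses * operand_inputs

for w in (1, 2, 4, 8):
    print(w, bypass_paths(w))   # 6, 24, 96, 384: 4x paths per doubling of width
```

A scalar design pays for 6 paths where a 4-wide design pays for 96, which is why a mostly scalar core like the SPE can afford full forwarding.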
> An Out-of-Order processor tries to execute an instruction that is ready to execute and is waiting in some reservation station at the front of an execution unit.

Indeed, it's very close to a barrel processor. But I wouldn't restrict it to round-robin execution. That would require as many threads as the longest possible latency. I'd only use as many threads as the average latency, executing an instruction from any thread that is ready. So low minimum latency is still important (which makes a forwarding network an interesting option). The goal would be high throughput without an insane number of threads.
A superscalar Out-of-Order processor attempts to do the same, it just has to scan for multiple instructions from a wider buffer.
A barrel processor just iterates through the thread list.
Your non-barrel FMT processor attempts to execute a ready instruction by simultaneously scanning multiple entries from some buffer of instructions in front of the execution unit...
It's not exactly the same, since superscalar OoO has to wrestle with quadratic problems with multiple issue and forwarding.
The multithreaded example has some drawbacks compared to a barrel model: determining readiness involves cross-chip communication, the issue check across multiple candidate threads is more timing-constrained (on a scalar processor, no less), and the lack of round-robin raises questions of fairness. Without round-robin, one well-utilized thread can starve the other 7.
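The fairness problem is easy to show with a toy simulation (my own simplification, not any shipped design): a barrel picker that strictly round-robins versus a greedy picker that always issues the lowest-numbered ready thread.

```python
# Toy issue-scheduler sketch with 8 threads. With zero extra result
# latency (full forwarding assumed), every thread is ready every cycle:
# the barrel shares issue slots evenly, while the greedy fixed-priority
# picker lets thread 0 monopolize the pipeline and starve the other 7.

def simulate(picker, n_threads=8, cycles=80, latency=0):
    ready_at = [0] * n_threads            # cycle when each thread is next ready
    issued = [0] * n_threads
    for cycle in range(cycles):
        t = picker(cycle, ready_at, n_threads)
        if t is not None:
            issued[t] += 1
            ready_at[t] = cycle + 1 + latency   # result latency before next issue
    return issued

def barrel(cycle, ready_at, n):
    t = cycle % n                                 # strict round-robin slot
    return t if ready_at[t] <= cycle else None    # bubble if not ready

def greedy(cycle, ready_at, n):
    for t in range(n):                            # fixed priority: lowest index wins
        if ready_at[t] <= cycle:
            return t
    return None

print(simulate(barrel))   # [10, 10, 10, 10, 10, 10, 10, 10]
print(simulate(greedy))   # [80, 0, 0, 0, 0, 0, 0, 0]
```

Real "pick any ready thread" designs would need something like rotating priority to avoid exactly this starvation.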
Your solution would probably be a minor variation of Niagara's barrel/switch-on-event hybrid scheme.
Your willingness to discount the importance of Niagara's larger register file seems to assume thread context takes no room in a cramped L1. Since no L1 I'm aware of is 32-way associative or 32 times the size of the smallest working cache on a single-threaded processor, the mythical 32-threaded minicore will thrash as a rule, unless you want a huge L1 with latencies that would require >128 threads to hide.
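Some illustrative arithmetic, with assumed rather than measured numbers, shows how badly the capacity math works out:

```python
# Assumptions for illustration: a common 32 KB L1 data cache, and a
# modest 8 KB working set per thread. With 32 threads sharing the L1,
# demand exceeds capacity by 8x, so the cache thrashes as a rule.

l1_size_kb = 32          # assumption: typical L1 data cache size
per_thread_ws_kb = 8     # assumption: modest per-thread working set
threads = 32

demand_kb = threads * per_thread_ws_kb
print(demand_kb, demand_kb / l1_size_kb)   # 256 KB demanded, 8x over capacity
```

Even halving the per-thread working set still leaves demand 4x over capacity, which is the point: no plausible L1 holds 32 contexts.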
If you want monster throughput, why not just toss out the cache entirely and add another 30 minicores, since the cache is unlikely to work anyway?
The programming model used by GPUs allows them to minimize the working set and context of each thread they run, and they go to great lengths to keep it that way. Independent CPUs cannot assume this.
A scalar, limited-OoO core would probably be in the same ballpark of complexity, and its L1 would still be useful for a much wider niche than a poor man's Niagara. It could even do limited multithreading with 2 threads, since the OoO hardware can be repurposed to do most of the heavy lifting.
We'll have to see what level of utilization Sun's upcoming Rock cores achieve; they still combine multiple cores with OoOE.
The future I envision is that we'll have 4-core on 45 nm, 8-core on 32 nm, 16-core on 22 nm, but somewhere around that point it actually becomes cheaper to have mini-cores running 32 threads. Software has to adapt to Cores-a-plenty™ anyway...
Do you mean 32 mini-cores running 32 threads each?
That would be multithreaded to an absurd degree, and from a wiring perspective, it would be pointless. Signal propagation through cache will not be 8 times as fast at 22nm as it is now, and the cache will be bigger to handle that much data. You'd probably need to increase the number of threads again to hide the latency of the cache needed to support 32 threads, then increase the thread count again to hide the latency of the cache needed to support that many threads, and so on.
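That feedback loop can be sketched as a fixed-point iteration. The constants here are made up purely to illustrate; the point is that with any context footprint and latency growth in this ballpark, the loop never closes:

```python
# Toy model of the spiral: more threads -> bigger cache to hold their
# contexts -> longer cache access latency -> more threads needed to hide
# it. All constants are assumptions for illustration only.

def cache_latency(threads):
    cache_kb = 8 * threads       # assumption: 8 KB of context/working set per thread
    return 2 + cache_kb // 8     # assumption: +1 cycle of access latency per 8 KB

t = 32
history = [t]
for _ in range(5):
    t = cache_latency(t)         # threads needed to hide the latency == the latency
    history.append(t)

print(history)   # [32, 34, 36, 38, 40, 42]: strictly increasing, no fixed point
```

With these constants each round of "hide the latency" demands more threads than the last, which is the divergence described above.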
I'd rather have 16 effective threads with mostly utilized hardware than have 100% utilization on weak hardware that spends most of its time dealing with the busywork of 32x32 ineffective threads.
> In known cases, x86-64 expands code enough that it reduces the effectiveness of the L1 cache.

I believe that's incorrect. A REX prefix is only needed for 64-bit operands or to access registers r8-r15. But most code still uses 32-bit values ('int' is still 32-bit), and if you're using the upper half of the register set, it means you've avoided spill instructions. So in practice, x86-64 code is more compact.
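To make the REX cost concrete, here are a few hand-assembled encodings (worth double-checking against a real assembler): the prefix is a single byte, and it only appears when 64-bit operands or the upper registers are involved.

```python
# Hand-assembled x86 encodings of the same ADD, as hex byte lists.
# REX adds exactly one byte, and only in the cases the text describes.

encodings = {
    "add eax, ebx": ["01", "D8"],         # 32-bit regs: no REX, 2 bytes
    "add rax, rbx": ["48", "01", "D8"],   # 64-bit operand: REX.W prefix, 3 bytes
    "add r8d, r9d": ["45", "01", "C8"],   # upper regs: REX.R + REX.B, 3 bytes
}

for asm, byts in encodings.items():
    print(asm, "->", len(byts), "bytes")
```

One extra byte on some instructions, traded against eliminating whole spill/reload instructions, is why the net code size can come out smaller.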
This leads to the overall improvement being a wash in many cases on aggressively OoO cores.
A high-clocked minicore as heavily dependent on the cache as the one you propose would suffer more.
> In a million years, any ISA will be a drag on performance. x86 is a horrible drag now, unless you use hardware to compensate. It will be even worse in the future, as silicon performance improvements falter, and the process lead the x86 manufacturers rely on to maintain competitive performance becomes harder to maintain.

I'm really interested to hear about other designs, as long as they are x86 and they maximize throughput. Theoretically, I fully agree that x86 should be ditched. But in practice it's not that simple, and any new ISA would also become a limitation in the long-term future, making x86 almost as good.
It's not too difficult to maintain incremental gains in x86 performance, and slowly ease in a few non-compatible cores.