Super-Scalar, But Not as You Know It
Super-scalar processor design is an old idea. The basic idea is simple: a single CPU has multiple functional units, so why not use them in parallel? For example, do an "add" in the same cycle as a "load" or some other instruction. In its most basic form, 2-way in-order super-scalar, this requires four main changes to work. Firstly, the whole pipeline must be able to process 2 instructions per cycle. Secondly, the CPU registers must be able to handle reads and writes from 2 instructions per cycle. Thirdly, some logic is needed to determine which functional units are available for parallel processing. Finally, the instruction issue part of the pipeline must be able to extract ILP from the instruction stream, so that issuing two instructions in parallel does not change the results of the program.
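To make the last requirement in particular a bit more concrete, here is a toy model of the check a 2-way in-order issue stage has to pass before pairing two instructions from the same thread. It is purely illustrative, not any real CPU's issue logic; the `Instr` type and the `can_dual_issue` helper are inventions for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    unit: str              # functional unit needed, e.g. "alu" or "load"
    srcs: Tuple[int, ...]  # source registers read
    dest: Optional[int]    # destination register written, or None

def can_dual_issue(first: Instr, second: Instr, free_units: dict) -> bool:
    """True if both instructions can leave the issue stage in the same cycle."""
    # Structural check: each instruction needs its own free functional unit.
    pool = dict(free_units)
    for unit in (first.unit, second.unit):
        if pool.get(unit, 0) <= 0:
            return False
        pool[unit] -= 1
    # ILP check: the second instruction must not read the register the first
    # one writes (a read-after-write hazard), or the results would change.
    if first.dest is not None and first.dest in second.srcs:
        return False
    return True

# An add paired with an independent load can dual-issue...
print(can_dual_issue(Instr("alu", (1, 2), 3), Instr("load", (4,), 5),
                     {"alu": 1, "load": 1}))   # True
# ...but not when the load's address comes from the add's result.
print(can_dual_issue(Instr("alu", (1, 2), 3), Instr("load", (3,), 5),
                     {"alu": 1, "load": 1}))   # False
```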
However, extracting much more ILP than this from the instruction stream is very inefficient, which is why most high-performance CPUs stop at 3-way or 4-way issue. But Niagara doesn't have to follow the same old pattern, as Niagara is explicitly designed to process multiple threads. Instructions from different threads can always be issued in parallel, so long as they use different functional units. So a future Niagara-based design could issue 2 instructions per cycle from 2 different threads. In other words, the maximum IPC of a single thread will still not exceed 1, but the maximum IPC per CPU core will now be 2.
This will require a double-width pipeline to sustain 2 instructions per cycle, plus some logic to determine which functional units are available for parallel processing, which might mean adding a stage or two to the existing pipeline. However, it will not require logic to find available ILP, nor more register ports, since each thread still issues at most one instruction per cycle against its own register file. So not only is it easier to issue multiple instructions per cycle this way than in a traditional super-scalar design, it also requires less logic.
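For contrast with the earlier sketch, here is the cross-thread case in the same toy model; again this is a guess at the shape of such logic, not anything Sun has published. With one instruction per thread, only the structural check on the functional units remains, because two threads never share architectural registers.

```python
def can_dual_issue_two_threads(unit_a: str, unit_b: str, free_units: dict) -> bool:
    """Issue one instruction from each of two threads in the same cycle.
    unit_a / unit_b are the functional units the two instructions need."""
    pool = dict(free_units)
    for unit in (unit_a, unit_b):
        if pool.get(unit, 0) <= 0:
            return False          # structural hazard: not enough free units
        pool[unit] -= 1
    return True                   # no cross-thread data hazards to check

print(can_dual_issue_two_threads("alu", "load", {"alu": 1, "load": 1}))  # True
print(can_dual_issue_two_threads("alu", "alu",  {"alu": 1, "load": 1}))  # False
```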
I wanted to point something out about the "leaked" Xbox 2 specs and this guesswork on future designs from Sun by Ace's Hardware. The leak states:
"The Xenon CPU is a custom processor based on PowerPC technology. The CPU includes three independent processors (cores) on a single die. Each core runs at 3.5+ GHz. The Xenon CPU can issue two instructions per clock cycle per core. At peak performance, Xenon can issue 21 billion instructions per second."
http://news.gamewinners.com/index.php/news/1225/
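The quoted peak figure is just cores times issue width times clock speed, which checks out:

```python
cores, issue_width, clock_hz = 3, 2, 3.5e9
print(cores * issue_width * clock_hz)   # 2.1e+10, i.e. 21 billion instructions per second
```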
I'm curious whether IBM is going to implement simultaneous multithreading or, as Sun does, coarse-grained multithreading, which Sun also calls "Vertical Multithreading".
The leak also states:
"Each core has two symmetric hardware threads (SMT), for a total of six hardware threads available to games. Not only does the Xenon CPU include the standard set of PowerPC integer and floating-point registers (one set per hardware thread), the Xenon CPU also includes 128 vector (VMX) registers per hardware thread. This astounding number of registers can drastically improve the speed of common mathematical operations."
With such a large number of registers, it would seem to make sense to take the Sun Vertical Multithreading approach instead of the traditional Power 5 simultaneous multithreading method.
A future Niagara-based design?
There is, however, one new aspect which requires a design trade-off. With 8 threads and up to 2 instructions issued per cycle, the instruction issue logic could either look for 2 instructions to issue from any 2 of the 8 threads, or from 2 specific threads. The former helps maximise IPC, while the latter is simpler. At a guess, I think there would not be much performance difference between the two, so the simpler solution would be best.
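To make the trade-off concrete, here is a sketch of the two selection policies. It is a guess at what such logic might look like, not anything Sun has described; the 4-threads-per-pipeline grouping is carried over from the next paragraph, and the `units_free` helper is an invention for this sketch.

```python
from itertools import combinations

def units_free(unit_a, unit_b, free_units):
    """True if a free functional unit exists for each of the two requests."""
    pool = dict(free_units)
    for unit in (unit_a, unit_b):
        if pool.get(unit, 0) <= 0:
            return False
        pool[unit] -= 1
    return True

# "ready" maps thread id (0-7) to the functional unit its next instruction
# needs, for threads that are not currently stalled.

def pick_any_two(ready, free_units):
    # Flexible policy: consider every pair drawn from all 8 threads.
    for a, b in combinations(sorted(ready), 2):
        if units_free(ready[a], ready[b], free_units):
            return a, b
    return None

def pick_one_per_pipeline(ready, free_units):
    # Simple policy: threads 0-3 feed pipeline 0 and threads 4-7 feed
    # pipeline 1; only one candidate thread from each group is considered.
    group0 = [t for t in sorted(ready) if t < 4]
    group1 = [t for t in sorted(ready) if t >= 4]
    if group0 and group1 and units_free(ready[group0[0]], ready[group1[0]], free_units):
        return group0[0], group1[0]
    return None

ready = {0: "alu", 1: "load", 5: "alu"}
print(pick_any_two(ready, {"alu": 1, "load": 1}))           # (0, 1)
print(pick_one_per_pipeline(ready, {"alu": 1, "load": 1}))  # None: misses the pairing the flexible policy finds
```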
The easiest way to implement this would be to simply duplicate the front-end of the pipeline, with some minor changes at instruction fetch and in the logic that actually issues instructions to the execution units. The current Niagara design has one active thread at a time and switches between threads on stalls. This possible dual-pipeline Niagara design would still have 1 active thread per pipeline, and would switch in the same way. The instruction issue logic would then look at the next instruction from the active thread of each of the two pipelines and issue both if there are no resource conflicts. If each pipeline has 4 threads, that gives 8 threads per core, which would also help the average IPC, though it also puts more pressure on the cache system.
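Finally, a sketch of the per-pipeline thread switching this would imply. The class name, the group sizes, and the stall signal are all assumptions for illustration, not details of any actual or planned Sun design.

```python
class PipelineFrontEnd:
    """One of the two duplicated front-ends: four threads, one active at a time."""

    def __init__(self, thread_ids):
        self.threads = list(thread_ids)   # e.g. [0, 1, 2, 3]
        self.active = 0                   # index of the currently active thread

    def next_thread(self, stalled):
        """Pick the thread to fetch/issue from this cycle.
        `stalled` is the set of thread ids waiting, e.g. on a cache miss.
        The active thread keeps running until it stalls, then we rotate."""
        for _ in range(len(self.threads)):
            tid = self.threads[self.active]
            if tid not in stalled:
                return tid
            self.active = (self.active + 1) % len(self.threads)
        return None                       # all four threads stalled: pipeline bubble

# The core would run two of these; the issue logic pairs their chosen
# instructions whenever the functional units allow it.
pipe0 = PipelineFrontEnd([0, 1, 2, 3])
pipe1 = PipelineFrontEnd([4, 5, 6, 7])
print(pipe0.next_thread({0, 1}), pipe1.next_thread(set()))   # 2 4
```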