DeadmeatGA said:
1. Blue Gene Cyclops
This one uses a radical SMT processor design to scale. It keeps 32 active threads on core but runs only 1 thread at a time. The first contender in the Blue Gene design competition.
Up to 32 threads execute simultaneously in the current spec, one
executing thread per thread group. The number of threads resident
on the chip is much larger: 256 in the reference design.
Each thread group shares a 64-entry register file, a program counter,
an ALU, an instruction sequencer, an FPU and a data cache (16 kB).
Instruction caches (32 kB) are shared by two thread groups.
There are 16 banks of 512 kB DRAM, with a bandwidth of 40 GB/s.
Alternate DRAM designs allow for up to 160 GB/s.
Any thread can issue on any cycle if execution resources allow.
If more than one thread tries to issue, execution is scheduled in
round-robin fashion.
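
To illustrate the round-robin arbitration described above, here is a minimal sketch in Python. It is not taken from any Cyclops documentation: the function names, the per-cycle request lists, and the choice to give thread 0 first priority are all assumptions made for the example.

```python
def round_robin_pick(requests, last_granted):
    """Return the index of the next requesting thread after last_granted.

    requests is a list of booleans, one per thread (hypothetical model).
    Returns None when no thread wants to issue this cycle.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None


def simulate(request_trace, n_threads):
    """Grant one thread per cycle, rotating priority after each grant."""
    last = n_threads - 1  # assumption: thread 0 has priority on the first cycle
    grants = []
    for requests in request_trace:
        winner = round_robin_pick(requests, last)
        grants.append(winner)
        if winner is not None:
            last = winner  # the winner drops to lowest priority next cycle
    return grants


# Four threads contending for one shared resource over five cycles.
trace = [
    [True, True, False, True],
    [True, True, False, True],
    [True, True, False, True],
    [False, False, False, False],
    [True, False, False, False],
]
grants = simulate(trace, 4)
print(grants)
```

Because priority rotates past the last winner, a thread that keeps requesting is never starved by its neighbours, which is the point of scheduling contended issue slots this way.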
The ISA is a 3-operand load-store architecture using 60 of the
most common PowerPC instructions with multi-threading extensions.
At 500 MHz a Cyclops chip peaks at 32 GFlop/s.
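
As a rough check on that peak figure, the sketch below back-computes flops per cycle from the numbers quoted in this post. The count of 32 FPUs is an assumption (one FPU per thread group, with 32 groups implied by 32 threads executing at once). The two flops per FPU per cycle that falls out would be consistent with a fused multiply-add unit, but that is an inference and not something stated here.

```python
CLOCK_HZ = 500_000_000          # 500 MHz, from the post
PEAK_FLOPS = 32_000_000_000     # 32 GFlop/s, from the post
DRAM_BYTES_PER_S = 40_000_000_000  # 40 GB/s baseline DRAM bandwidth, from the post
FPUS = 32                       # assumption: one FPU per thread group, 32 groups

# Peak floating-point operations the whole chip must retire each cycle.
flops_per_cycle = PEAK_FLOPS / CLOCK_HZ

# Share of that per FPU; 2.0 would match one multiply-add per cycle.
flops_per_fpu_per_cycle = flops_per_cycle / FPUS

# DRAM bytes available per peak flop with the baseline memory design.
bytes_per_flop = DRAM_BYTES_PER_S / PEAK_FLOPS

print(flops_per_cycle, flops_per_fpu_per_cycle, bytes_per_flop)
```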
Cyclops is a precursor to the final architecture for the BlueGene/P
machine, to which Cell may be related. It's not a big jump to
assume Cell will be well north of 32 GFlop/s.