Here's some food for thought...
Merrimac: Supercomputing with Streams
http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf
Single chip 90nm CMOS stream processor - 128 GFlop/s peak.
16 DRAM chips - 2 GB total.
Sustains up to half of peak in simulations on scientific
workloads.
Picking at the paper:
64bit FPU in 130nm CMOS, < 1mm^2
200 such FPUs on a 14mm x 14mm chip
Total production cost of such a chip < $100
Even at a conservative 500 MHz, > 1 GFlop/s per dollar
Moving data is more costly than computation, so
do less of the former and more of the latter.
"To exploit the capabilities of today’s VLSI technology
requires an architecture that can exploit parallelism — to
keep large numbers of arithmetic units busy while hiding
the ever increasing latency to memory, and locality
— to increase the ratio of arithmetic, which is inexpensive,
to global bandwidth, which is the limiting factor."
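To sanity-check the "> 1 GFlop/s per dollar" figure above, here is a quick back-of-the-envelope sketch. The FPU count, clock and cost are the numbers quoted from the paper; counting one flop per FPU per cycle is my assumption, and the names are purely illustrative.

```python
# Figures quoted above from the paper.
FPUS_PER_CHIP = 200        # 64-bit FPUs on a 14mm x 14mm chip
CLOCK_HZ = 500e6           # the "conservative" 500 MHz
CHIP_COST_USD = 100.0      # upper bound on production cost

# Assumption (not stated above): each FPU retires one flop per cycle.
FLOPS_PER_FPU_PER_CYCLE = 1


def gflops_per_dollar(fpus, clock_hz, cost_usd, flops_per_cycle=1):
    """Peak GFlop/s divided by chip cost in dollars."""
    peak_gflops = fpus * clock_hz * flops_per_cycle / 1e9
    return peak_gflops / cost_usd


ratio = gflops_per_dollar(
    FPUS_PER_CHIP, CLOCK_HZ, CHIP_COST_USD, FLOPS_PER_FPU_PER_CYCLE
)
print(f"{ratio:.2f} GFlop/s per dollar at the $100 upper bound")
```

Since $100 is an upper bound on cost, the ratio at that price is a lower bound, which is where the "greater than" comes from.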
The Merrimac stream processor chip:
~ 100mm^2
1 - MIPS64 processor - fetch instructions, dispatch stream instructions
16 - clusters - execute stream instructions
512KB cache
interface to 16 DRAM chips - 20 GB/s
network interface - to connect to routers in multi-processor systems
1ns cycle - 128 GFlop/s
31W - $200 - 90nm
1000 pin BGA
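The quote about the ratio of arithmetic to global bandwidth can be made concrete with the chip figures just listed. A rough sketch, assuming GB means 10^9 bytes and that all traffic is 64-bit words (both my assumptions):

```python
# Chip figures listed above.
PEAK_GFLOPS = 128.0           # peak arithmetic rate
DRAM_BANDWIDTH_GB_S = 20.0    # interface to the 16 DRAM chips

# Assumption: every memory access moves one 64-bit (8-byte) word.
BYTES_PER_WORD = 8

# Billions of words per second the DRAM interface can deliver.
gwords_per_second = DRAM_BANDWIDTH_GB_S / BYTES_PER_WORD

# Flops the chip must perform per word fetched to stay at peak.
flops_per_word = PEAK_GFLOPS / gwords_per_second

print(f"{gwords_per_second:.1f} GWords/s from DRAM")
print(f"{flops_per_word:.1f} flops needed per DRAM word to run at peak")
```

In other words, a kernel needs on the order of 50 arithmetic operations per word pulled from DRAM before the FPUs, rather than memory, become the limit.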
Merrimac cluster (stream processor):
768 - 64bit registers
8K - 64bit words stream register file
4 - FPU MADD units
8 GFlop/s @ 1 GHz
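The peak numbers follow directly from the unit counts. A small sketch, assuming a MADD is counted as two flops (one multiply plus one add), which is the usual convention and matches the figures above:

```python
# Cluster and chip figures listed above.
MADD_UNITS_PER_CLUSTER = 4
CLUSTERS_PER_CHIP = 16
CLOCK_GHZ = 1.0               # 1ns cycle

# Assumption: a fused multiply-add counts as 2 flops.
FLOPS_PER_MADD = 2

cluster_gflops = MADD_UNITS_PER_CLUSTER * FLOPS_PER_MADD * CLOCK_GHZ
chip_gflops = cluster_gflops * CLUSTERS_PER_CHIP

print(f"per cluster: {cluster_gflops:.0f} GFlop/s")
print(f"per chip:    {chip_gflops:.0f} GFlop/s")
```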
In simulation, for the benchmarks quoted, they achieved between
18% and 52% of peak.
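For a feel of what those percentages mean in absolute terms, here is the same range applied to the 128 GFlop/s peak. Nothing here beyond the two quoted percentages and the peak figure; the helper name is just illustrative.

```python
PEAK_GFLOPS = 128.0
EFFICIENCY_RANGE = (0.18, 0.52)   # fraction of peak achieved in simulation


def sustained_gflops(peak, efficiency):
    """Sustained rate implied by a given fraction of peak."""
    return peak * efficiency


low, high = (sustained_gflops(PEAK_GFLOPS, e) for e in EFFICIENCY_RANGE)
print(f"sustained: roughly {low:.0f} to {high:.0f} GFlop/s")
```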
Just pulling some figures out of my hat:
Broadband Engine:
65nm with bells and whistles.
32 bit arithmetic, so double the Flop/s per cycle.
Twice as large though, 200mm^2. The shrink also
reduces the area of our FPU ~ 50%. So up to 4 times
as many 'clusters/APUs' on 'our' chip.
Still at 1 GHz.
64 'clusters/APUs'.
60W
16 GFlop/s per 'cluster/APU' (single precision).
1024 GFlop/s per chip. (Ta da!)
$200 per chip.
~200-500 GFlop/s in applications (games).
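For anyone who wants to play with the hat figures, the extrapolation above can be written out as a small sketch. Every input is one of the guesses from this post, not a published spec. Applying Merrimac's 18% to 52% simulated efficiency to this hypothetical chip is my own assumption, though it lands close to the ~200-500 range.

```python
# Merrimac baseline (from the paper figures earlier in this post).
BASE_CLUSTERS = 16
BASE_AREA_MM2 = 100.0
MADD_UNITS_PER_CLUSTER = 4
FLOPS_PER_MADD = 2            # assumption: multiply + add = 2 flops

# Guessed figures for the hypothetical 65nm chip.
NEW_AREA_MM2 = 200.0          # twice the die area
FPU_AREA_SCALE = 0.5          # shrink roughly halves FPU area
SINGLE_PRECISION_FACTOR = 2   # 32-bit math doubles flops per cycle
CLOCK_GHZ = 1.0
CHIP_COST_USD = 200.0

# Twice the area and half the size per unit gives 4x the clusters.
clusters = int(BASE_CLUSTERS * (NEW_AREA_MM2 / BASE_AREA_MM2) / FPU_AREA_SCALE)

cluster_gflops = (
    MADD_UNITS_PER_CLUSTER * FLOPS_PER_MADD * SINGLE_PRECISION_FACTOR * CLOCK_GHZ
)
chip_gflops = clusters * cluster_gflops

# Assumption: the same 18%-52% of peak seen in the Merrimac simulations.
sustained_low = 0.18 * chip_gflops
sustained_high = 0.52 * chip_gflops

print(f"{clusters} clusters at {cluster_gflops:.0f} GFlop/s each")
print(f"peak: {chip_gflops:.0f} GFlop/s, {chip_gflops / CHIP_COST_USD:.2f} per dollar")
print(f"sustained: roughly {sustained_low:.0f} to {sustained_high:.0f} GFlop/s")
```

Change any of the guessed constants (clock, area, shrink factor) to see how sensitive the 1 TFlop/s headline is to each one.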