Check it out: Merrimac, a 128 GFlop/s stream processor

glw · Nov 18, 2003

Here's some food for thought...

Merrimac: Supercomputing with Streams
http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf

Single chip 90nm CMOS stream processor - 128 GFlop/s peak.

16 DRAM chips - 2 GB total.

Sustains up to half of peak in simulations on scientific
workloads.

Picking at the paper:

64bit FPU in 130nm CMOS, < 1mm^2

200 such FPUs on a 14mm x 14mm chip

Total production cost of such a chip < $100

Even at a conservative 500 MHz, > 1 GFlop/s per dollar
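
A quick sanity check of that last figure, as a rough sketch. It assumes each FPU retires one floating-point operation per cycle (the post does not say), and the function name is only illustrative:

```python
def gflops_per_dollar(num_fpus, clock_hz, chip_cost_usd, flops_per_cycle=1):
    """Peak GFlop/s per dollar for a chip full of identical FPUs.

    flops_per_cycle=1 is an assumption, not a number from the paper.
    """
    peak_flops = num_fpus * flops_per_cycle * clock_hz
    return peak_flops / 1e9 / chip_cost_usd

# Figures from the post: 200 FPUs, 500 MHz, $100 as the upper bound on cost.
ratio = gflops_per_dollar(num_fpus=200, clock_hz=500e6, chip_cost_usd=100)
print(ratio)
```

Since $100 is an upper bound on the cost, the value printed is a lower bound on GFlop/s per dollar.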

Moving data is more costly than computation, so
do less of the former and more of the latter.

"To exploit the capabilities of today's VLSI technology
requires an architecture that can exploit parallelism - to
keep large numbers of arithmetic units busy while hiding
the ever increasing latency to memory, and locality
- to increase the ratio of arithmetic, which is inexpensive,
to global bandwidth, which is the limiting factor."

The Merrimac stream processor chip:
~ 100mm^2
1 - MIPS64 processor - fetch instructions, dispatch stream instructions
16 - clusters - execute stream instructions
512KB cache
interface to 16 DRAM chips - 20 GB/s
network interface - to connect to routers in multi-processor systems
1ns cycle - 128 GFlop/s
31W - $200 - 90nm
1000 pin BGA

Merrimac cluster (stream processor):
768 - 64bit registers
8K - 64bit words stream register file
4 - FPU MADD units
8 GFlop/s @ 1 GHz

In simulation, for the benchmarks quoted, they achieved between
18% and 52% of peak.
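
The peak and sustained numbers above hang together if a MADD is counted as two floating-point operations. That counting is my assumption (it is the usual convention and it matches 4 units giving 8 GFlop/s at 1 GHz); the function name is just for illustration:

```python
def peak_gflops(clusters, madd_units_per_cluster, clock_ghz, flops_per_madd=2):
    """Peak GFlop/s, counting each multiply-add as flops_per_madd operations."""
    return clusters * madd_units_per_cluster * flops_per_madd * clock_ghz

cluster_peak = peak_gflops(clusters=1, madd_units_per_cluster=4, clock_ghz=1.0)
chip_peak = peak_gflops(clusters=16, madd_units_per_cluster=4, clock_ghz=1.0)

# Sustained range quoted for the simulated benchmarks: 18% to 52% of peak.
sustained_low = 0.18 * chip_peak
sustained_high = 0.52 * chip_peak

print(cluster_peak, chip_peak, sustained_low, sustained_high)
```

So the simulated applications would be sustaining somewhere between roughly 23 and 67 GFlop/s per chip.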

Just pulling some figures out of my hat:

Broadband Engine:

65nm with bells and whistles.
32 bit arithmetic, so double the Flop/s per cycle.
Twice as large though, 200mm^2. The shrink also
reduces the area of our FPU by ~50%. So up to 4 times
as many 'cluster/APUs' on 'our' chip.
Still at 1 GHz.

64 'clusters/APUs'.
60W
16 GFlop/s per 'cluster/APU' (single precision).
1024 GFlop/s per chip. (Ta da!)
$200 per chip.
~200-500 GFlop/s in applications (games).
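
For anyone who wants to follow the hat trick step by step, here is the scaling written out. Every factor is one of the guesses in this post, not a known Broadband Engine specification, and the variable names are made up:

```python
# Merrimac baseline taken from the paper summary in this post.
merrimac_clusters = 16
merrimac_cluster_gflops = 8.0   # 64-bit, at 1 GHz

# Guessed scaling factors (assumptions, not BE specs).
area_factor = 2        # 200 mm^2 die instead of ~100 mm^2
shrink_factor = 2      # FPU area roughly halves going to 65nm
precision_factor = 2   # 32-bit arithmetic instead of 64-bit

be_clusters = merrimac_clusters * area_factor * shrink_factor
be_cluster_gflops = merrimac_cluster_gflops * precision_factor
be_peak = be_clusters * be_cluster_gflops

# Apply the same 18%-52% sustained range Merrimac showed in simulation.
low, high = 0.18 * be_peak, 0.52 * be_peak

print(be_clusters, be_cluster_gflops, be_peak, low, high)
```

The sustained range comes out around 184 to 532 GFlop/s, which is where the ~200-500 figure comes from.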

Paul · Nov 18, 2003

200mm^2.

SCE and Toshiba will push for a bigger chip than 200mm2. Remember, they will be using 300mm wafers.

I estimate 250-270mm2.

64 'clusters/APUs'.

I don't see BE with 64APU's, this is 8 PE's if you go by what the Patent says.

Vince · Nov 18, 2003

Paul said:
I don't see BE with 64APU's, this is 8 PE's if you go by what the Patent says.

Paul, it's irrelevant at this point. They can up the concurrency* or the clock - it doesn't much matter. What this shows is the feasibility of the Broadband Engine - something which many have claimed is physically impossible.

* There have actually been articles talking of 8 cores for a gaming console, I believe. So, whatever...

Paul · Nov 18, 2003

What this shows is the feasibility of the Broadband Engine

Yes but did you ever really doubt it Vince?

Tahir2 · Nov 18, 2003

768 64bit registers...

What is that divided by 32?

Panajev2001a · Nov 18, 2003

glw:
Here's some food for thought...

Merrimac: Supercomputing with Streams
http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf

Single chip 90nm CMOS stream processor - 128 GFlop/s peak.

16 DRAM chips - 2 GB total.

Sustains up to half of peak in simulations on scientific
workloads.

Picking at the paper:

64bit FPU in 130nm CMOS, < 1mm^2

200 such FPUs on a 14mm x 14mm chip

Total production cost of such a chip < $100

Even at a conservative 500 MHz, > 1 GFlop/s per dollar

Moving data is more costly than computation, so
do less of the former and more of the latter.

"To exploit the capabilities of today's VLSI technology
requires an architecture that can exploit parallelism - to
keep large numbers of arithmetic units busy while hiding
the ever increasing latency to memory, and locality
- to increase the ratio of arithmetic, which is inexpensive,
to global bandwidth, which is the limiting factor."

The Merrimac stream processor chip:
~ 100mm^2
1 - MIPS64 processor - fetch instructions, dispatch stream instructions
16 - clusters - execute stream instructions
512KB cache
interface to 16 DRAM chips - 20 GB/s
network interface - to connect to routers in multi-processor systems
1ns cycle - 128 GFlop/s
31W - $200 - 90nm
1000 pin BGA

Merrimac cluster (stream processor):
768 - 64bit registers
8K - 64bit words stream register file
4 - FPU MADD units
8 GFlop/s @ 1 GHz

In simulation, for the benchmarks quoted, they achieved between
18% and 52% of peak.

Just pulling some figures out of my hat:

Broadband Engine:

65nm with bells and whistles.
32 bit arithmetic, so double the Flop/s per cycle.
Twice as large though, 200mm^2. The shrink also
reduces the area of our FPU by ~50%. So up to 4 times
as many 'cluster/APUs' on 'our' chip.
Still at 1 GHz.

64 'clusters/APUs'.
60W
16 GFlop/s per 'cluster/APU' (single precision).
1024 GFlop/s per chip. (Ta da!)
$200 per chip.
~200-500 GFlop/s in applications (games).

Somebody's face must be as red as a tomato while he gets his obvious "but the Broadband Engine will suck anyways" rebuttal ready.

Those $52 were the best I have spent, btw.

Vince... Paul... we have to meet on AIM soon

Megadrive1988 · Nov 18, 2003

Paul, it's irrelevant at this point. They can up the concurrency* or the clock - it doesn't much matter. What this shows is the feasibility of the Broadband Engine - something which many have claimed is physically impossible.

* There have actually been articles talking of 8 cores for a gaming console, I believe. So, whatever...

Yep, like the Mercury News report mentioned a PS3 Cell CPU with 72 processors, from 8 CPU Cores and 64 APUs.

nondescript · Nov 18, 2003

Sounds good. Great find! I'll read the pdf after I polish off some homework.

Guys, let's try to refrain from the small torments that some-other-people tend to enjoy. Retaliation is fine, but try not to start anything. Keep it clean, if you can.

DeadmeatGA · Nov 18, 2003

...

I don't understand why you people are so excited with this thing, this processor actually spells doom for teraflop CELL.

1. The whole thing is one processor. "Clusters" are not processors, they are mere execution units of a processor. Consider this a superlong Altivec.

2. The processor has a FLOP/memory instruction ratio of 100:1. The MIPS processor acts as the instruction decoder, fetcher, and L/S unit for all clusters. Very few algorithms can be tuned for an architecture like this, and certainly not physics and 3D rendering.

3. Each APU of CELL is an independent processor in its own right. In other words, the control logic, load/store, program counter, and synchronization table are repeated for every APU. Expect a tripling of die size if all clusters were made independent processors.

4. This thing costs $200 to build, offers 128 GFLOPS theoretical peak @ 90 nm, and burns 31 Watts.

This thing actually proves that it is physically impossible to pack 64 APUs into a die and have it clock at 4 GHz, as some Sony fans are dreaming up.

Wake up from your dream and smell the coffee.

Panajev2001a · Nov 18, 2003

Re: ...

DeadmeatGA said:
I don't understand why you people are so excited with this thing, this processor actually spells doom for teraflop CELL.

1. The whole thing is one processor. "Clusters" are not processors, they are mere execution units of a processor. Consider this a superlong Altivec.

2. The processor has a FLOP/memory instruction ratio of 100:1. The MIPS processor acts as the instruction decoder, fetcher, and L/S unit for all clusters. Very few algorithms can be tuned for an architecture like this, and certainly not physics and 3D rendering.

3. Each APU of CELL is an independent processor in its own right. In other words, the control logic, load/store, program counter, and synchronization table are repeated for every APU. Expect a tripling of die size if all clusters were made independent processors.

4. This thing costs $200 to build, offers 128 GFLOPS theoretical peak @ 90 nm, and burns 31 Watts.

This thing actually proves that it is physically impossible to pack 64 APUs into a die and have it clock at 4 GHz, as some Sony fans are dreaming up.

Wake up from your dream and smell the coffee.

Do you need to have more PROFESSIONAL developers ( in addition to tons of other people ), who get PAID good money for their hard work, explain to you that you cannot even wipe your butt with your "we want 0.5-1 TFLOPS, so we need to have 0.5 * 10^12 * 4 bytes/s of external memory bandwidth" ( given that we have tons of registers, each APU has 128 KB of LS and we have e-DRAM ) ?

We can stop at that thing you said and not go any further.

DeadmeatGA · Nov 18, 2003

...

Need to have more PROFESSIONAL developers ( in addition to tons of other people)

And you believe those PROFESSIONAL developers will massively shrink the die somehow???

We can stop at that thing you said and not go any further.

Exactly. This is turning into BS. This is nothing more than an Altivec16.

Panajev2001a · Nov 18, 2003

Re: ...

DeadmeatGA said:
Need to have more PROFESSIONAL developers ( in addition to tons of other people)

And you believe those PROFESSIONAL developers will massively shrink the die somehow???

Coders do not work on the physical die, but you know that already.

Coders know about the software side of things, they experience first hand why your argument leads to an incorrect conclusion.

Your weak attempt at humor is not amusing anyone; it is having the opposite effect, further weakening your argument.

DeadmeatGA · Nov 18, 2003

...

Coders do not work on the physical die, but you know that already.

But I thought you meant the developers of this thing. After all, all hardware designs are technically VHDL code and hardware designers are VHDL coders.

Your weak attempt at humor is not amusing anyone; it is having the opposite effect, further weakening your argument.

My argument is strengthening, not weakening. It is you who needs to take a cold shower and get a sense of reality. Or are you not able to understand exactly how this Merrimac thing works??? One integer unit, one control unit, one load/store unit, and one program counter shared among 64 FPUs. Sure, this thing is going to perform just fine in real world code...

Panajev2001a · Nov 18, 2003

Blah.... Blah.... Blah...

The "FLOPS rate * 4 bytes = bandwidth needed by the external memory" rule is something you are convinced of, and a lot of people have tried to help you see the problems with such a statement... of course, if you are not able to see things like that, if you cannot accept being wrong about anything, how can you see your credibility ( since you are the holy knight in the holy crusade of lecturing fanboys ) weakening ?

At 2 GHz we would have 500 GFLOPS or 16 GFLOPS per APU.

Each APU has 128 KB of SRAM with a 256 bits bus to the Register File.

The LS should be clocked at the same speed as the APU logic ( it is SRAM ).

( 256 bits / (8 bits/byte) ) * 2 GHz = 64 GB/s

16 GFLOPS * 4 bytes/FLOP = 64 GB/s

Whoa...

The Register file has four busses to the Execution Units: 3x256 bits to the Execution Units and 1x128 bits from the Execution Units.
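
The local-store arithmetic above can be reproduced in a few lines. This is only a sketch of the comparison being made: the 4 bytes/FLOP figure is the rule of thumb under dispute in this thread, and the helper names are made up:

```python
def bus_bandwidth_gb_s(bus_width_bits, clock_ghz):
    """Bandwidth of a bus that moves bus_width_bits every clock cycle."""
    return bus_width_bits / 8 * clock_ghz

def demand_gb_s(gflops, bytes_per_flop=4):
    """Bandwidth implied by the disputed "FLOPS rate * 4 bytes" rule."""
    return gflops * bytes_per_flop

# Figures from this post: 256-bit LS bus at 2 GHz, 16 GFLOPS per APU.
supply = bus_bandwidth_gb_s(bus_width_bits=256, clock_ghz=2.0)
demand = demand_gb_s(gflops=16)

print(supply, demand, supply >= demand)
```

In other words, even if you take the 4 bytes/FLOP rule at face value, the local storage (not the external memory) can feed one APU.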

DeadmeatGA · Nov 18, 2003

...

The "FLOPS rate * 4 bytes = bandwidth needed by the external memory" rule is something you are convinced of, and a lot of people have tried to help you see the problems with such a statement...

Why don't you and your friends try to help the real supercomputer designers see the problems with the "byte per flop" law??? Why do "real" supercomputer designs (Earth Simulator & Blue Gene/L) go for the lowest "byte per flop" ratio possible???

At 2 GHz we would have 500 GFLOPS or 16 GFLOPS per APU.

This Merrimac thing clocks at 1 GHz max and burns 31 watts for 128 theoretical GFLOPS. Have 8 of those and it will burn 250 watts.

Each APU has 128 KB of SRAM with a 256 bits bus to the Register File.

And each APU has its own control unit, its own L/S unit, its own address calculator, its own program counter, its own bus interface.

The most optimistic estimation I have come across for APU is 15 million transistors(5 million logic + 10 million SRAM) and 10 mm2 @ 65 nm. Have 32 of those and run them at full clock to see it set the new world record in microprocessor power consumption....

Each APU has 128 KB of SRAM with a 256 bits bus to the Register File.

The LS should be clocked at the same speed as the APU logic ( it is SRAM ).

( 256 bits / (8 bits/byte) ) * 2 GHz = 64 GB/s

16 GFLOPS * 4 bytes/FLOP = 64 GB/s

Whoa...

The problem with your calculation is that the local storage cannot hold all the data/code and must be continuously swapped from Yellowstone DRAM at 25.6 GB/s. The situation is similar to texture RAM of PSX2 GS; the GS VRAM is not large enough to hold all the textures of a scene, so the entire texture map of a scene must be reloaded from system RAM 60 times a second, hence the poor texture quality due to limited texture bandwidth availability.

Panajev2001a · Nov 18, 2003

Re: ...

DeadmeatGA said:
The "FLOPS rate * 4 bytes = bandwidth needed by the external memory" rule is something you are convinced of, and a lot of people have tried to help you see the problems with such a statement...

Why don't you and your friends try to help the real supercomputer designers see the problems with the "byte per flop" law??? Why do "real" supercomputer designs (Earth Simulator & Blue Gene/L) go for the lowest "byte per flop" ratio possible???

At 2 GHz we would have 500 GFLOPS or 16 GFLOPS per APU.

This Merrimac thing clocks at 1 GHz max and burns 31 watts for 128 theoretical GFLOPS.

Each APU has 128 KB of SRAM with a 256 bits bus to the Register File.

And each APU has its own control unit, its own L/S unit, its own address calculator, its own program counter, its own bus interface.

The most optimistic estimation I have come across for APU is 15 million transistors(5 million logic + 10 million SRAM) and 10 mm2 @ 65 nm. Have 32 of those and run them at full clock to see it set the new world record in microprocessor power consumption....

Well... your estimate was 10-15 Watts for 32 of them at 1 GHz...

Are you saying they will break the 60 Watts barrier to go at 2 GHz ?

You can keep listing features for the APU to make its size bloat in your mind, but it is not a problem... we already went over the size of the whole thing and either you count the SRAM as a whole and it comes out at around 20-30 mm^2 for 4 MB IIRC or you count the 128 KB of SRAM on each APU and make the APU bigger.

I can start now and count you features of MIPS and ARM cores and end up with 1 mm^2 of space, maybe less even with caches and stuff if we talk about 90 nm and beyond.

10 MTransistors for 128 KB of SRAM and 2 KB worth of Registers ? Are you for real ?

I cannot count more than 780 K to 1 M transistors for the 128 KB of LS ( no need for cache tags as it is not a cache ).

Also ever thought that BlueGene/L and supercomputers in general work at different problems than a processor designed for multi-media and 3D graphics processing ?

I showed you that the LS has enough bandwidth/cycle to sustain the Execution Units and the Register File does too ( 384 bits from the Register file is worth 3x128 bits operands and 128 bits for the bus to the Register file is good for the 128 bits result ), but you are free to go on and on... you are not looking any smarter by never admitting to being wrong... smart people learn from their frequent mistakes.

Panajev2001a · Nov 18, 2003

The problem with your calculation is that the local storage cannot hold all the data/code and must be continuously swapped from Yellowstone DRAM at 25.6 GB/s. The situation is similar to texture RAM of PSX2 GS; the GS VRAM is not large enough to hold all the textures of a scene, so the entire texture map of a scene must be reloaded from system RAM 60 times a second, hence the poor texture quality due to limited texture bandwidth availability.

Oh... I will have to tell that to the devs who made SH3 and the ones who pull around 2+ MB/frame of compressed textures on PS2...

Or Naughty Dog and their Jak II engine...

Or the obviously crappy textures in Primal...

They will not have 25.6 GB/s of Yellowstone DRAM and no e-DRAM.

That can be obtained with a 64 bits memory controller ( 128 data pins ) and 400 MHz of external clock.

I think that if they cannot put e-DRAM they would push the memory controller to 128 bits ( 256 data pins ) and/or the base clock to 800 MHz: they are getting close to it with current Direct RDRAM.

Without e-DRAM ( unlikely they will not use any ) they would probably push for a minimum of 51.2 GB/s of XDR bandwidth which is in the realms of possibility.
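
To show where the 25.6 GB/s and 51.2 GB/s figures come from, here is a rough sketch. It assumes XDR moves 8 bits per data line per external clock (octal data rate) and uses differential signalling, so 2 pins per data bit; both are assumptions as far as this post goes, and the function name is invented:

```python
def xdr_bandwidth_gb_s(data_bits, clock_mhz, transfers_per_clock=8):
    """Peak bandwidth in GB/s for a memory interface.

    transfers_per_clock=8 assumes octal data rate signalling.
    """
    bits_per_second = data_bits * clock_mhz * 1e6 * transfers_per_clock
    return bits_per_second / 8 / 1e9

base = xdr_bandwidth_gb_s(data_bits=64, clock_mhz=400)     # 128 data pins
wider = xdr_bandwidth_gb_s(data_bits=128, clock_mhz=400)   # 256 data pins
faster = xdr_bandwidth_gb_s(data_bits=64, clock_mhz=800)

print(base, wider, faster)
```

Doubling either the controller width or the base clock gets you from 25.6 GB/s to 51.2 GB/s.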

Their 65 nm e-DRAM cells are about 0.11 um^2, which means 1.1 * 10^-7 mm^2 per e-DRAM cell.

SRAM cells are 6 * 10^-7 mm^2 so they are not that large.
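
Turning those cell sizes into array areas, as a back-of-the-envelope sketch. It counts raw cells only (one cell per bit, no sense amps, decoders or redundancy), so real arrays would be larger; the names are just for illustration:

```python
EDRAM_CELL_MM2 = 1.1e-7   # 0.11 um^2 per e-DRAM cell at 65 nm
SRAM_CELL_MM2 = 6e-7      # SRAM cell size quoted above

def array_area_mm2(capacity_bytes, cell_area_mm2):
    """Raw cell area for a memory array, assuming one cell per bit."""
    return capacity_bytes * 8 * cell_area_mm2

ls_sram = array_area_mm2(128 * 1024, SRAM_CELL_MM2)           # one APU's LS
sram_4mb = array_area_mm2(4 * 1024 * 1024, SRAM_CELL_MM2)     # 32 x 128 KB
edram_4mb = array_area_mm2(4 * 1024 * 1024, EDRAM_CELL_MM2)

print(round(ls_sram, 2), round(sram_4mb, 2), round(edram_4mb, 2))
```

The ~20 mm^2 for 4 MB of SRAM lines up with the 20-30 mm^2 figure I mentioned earlier.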

nondescript · Nov 18, 2003

DeadmeatGA said:
FLOPS/memory bandwidth ratio (lower is better)

SH-4 : 1.6 GFLOPS / 0.8 GB/s = 2 FLOPS/byte
EE : 4.8 GFLOPS / 2.4 GB/s = 2 FLOPS/byte
PSP : 2.6 GFLOPS / 2.6 GB/s = 1 FLOPS/byte
Earth Simulator : 8 GFLOPS / 32 GB/s = 0.25 FLOPS/byte
BGL : 5.6 GFLOPS / 22 GB/s = 0.25 FLOPS/byte

CELL (DM version) : 32 GFLOPS / 25 GB/s = 1.3 FLOPS/byte
CELL (Sony fan version) : 1000 GFLOPS / 25 GB/s = 40 FLOPS/byte
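
The ratios in that list are easy to recompute. A small sketch using only the GFLOPS and GB/s figures as quoted (the dictionary layout is just for illustration):

```python
# (peak GFLOPS, memory bandwidth in GB/s), exactly as quoted in the list.
systems = {
    "SH-4": (1.6, 0.8),
    "EE": (4.8, 2.4),
    "PSP": (2.6, 2.6),
    "Earth Simulator": (8.0, 32.0),
    "BGL": (5.6, 22.0),
    "CELL (DM version)": (32.0, 25.0),
    "CELL (Sony fan version)": (1000.0, 25.0),
}

# FLOPS per byte of external memory bandwidth for each system.
ratios = {name: gflops / gb_s for name, (gflops, gb_s) in systems.items()}

for name, ratio in ratios.items():
    print(f"{name:24s} {ratio:6.2f} FLOPS/byte")
```

The quoted figures check out after rounding; the question in this thread is whether the ratio itself is the right yardstick.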

DeadmeatGA said:
Why don't you and your friends try to help the real supercomputer designers see the problems with the "byte per flop" law??? Why do "real" supercomputer designs (Earth Simulator & Blue Gene/L) go for the lowest "byte per flop" ratio possible???

Try reading before posting. I find it helps.

The Actual Paper said:
(From the intro)
Modern semiconductor technology makes arithmetic inexpensive and bandwidth expensive. To exploit this shift in cost, a high-performance computer system must exploit locality, to raise the arithmetic density (the ratio of arithmetic to bandwidth) of the application as well as parallelism to keep a large number of arithmetic units busy. Expressing an application as a stream program fulfills both of these requirements. It exposes large amounts of parallelism across stream elements and reduces global bandwidth by expressing locality within and between kernels.

(From section 5)
The Sustained GFLOPS and FP Ops / Mem Ref columns illustrate the arithmetic density of the applications; they are able to sustain from 18% to 52% of the node's peak arithmetic performance, by performing from 7 to 50 floating point operations for each global memory access.

Obviously, several computer systems researchers, including an eminent Stanford professor, don't know your "bytes per flop" law. Maybe you should e-mail them and tell them they're wrong.

Panajev2001a · Nov 18, 2003

Nondescript, of course that professor is only a crazy EE guy with no formal training in CS and that went to university in the 1910s...

nondescript · Nov 18, 2003

Panajev2001a said:
Nondescript, of course that professor is only a crazy EE guy with no formal training in CS and that went to university in the 1910s...

Not only that, he was Ken Kutaragi's roommate.
