On the Feasibility of the Broadband Engine

Vince

Veteran
Hey. So, yeah. With all the discussion on the feasibility of the Broadband Engine as seen in the well-known Suzuoki/SCE and IBM/STI patents, and in a few threads in which some (you can guess who) used BlueGene/L as a base, I had a feeling something didn't add up. So, I did some looking around and have an alternative take on the situation. I'm not saying it's totally accurate, but I propose that we have a central, civil discussion on this.

I'm going to present some numbers, show where they're from, and then open it up for people to draw their own conclusions and comments. We realize that these numbers are ballpark, but instead of shunning the variance, let's use it as a pseudo-constant, a fudge factor if you will, that will allow us to have a rough estimate of what can be done, what can't, and where it all stands. So, without further ado, here's the basic premise.

I propose a strict interpretation of the above patents, use of recent comments which are certain, and past precedent as a guide when in doubt. Let's try to keep this as plausible as possible. We'll add up the knowns and leave the overhead and fudge factors lumped together until the end. Then, depending on the magnitude, we'll see what can be done.

James Kahle said:
Kahle also had to remain aware of the eventual manufacturability of the chip at this point, but elected to put the main burden of this part of the effort on the implementation phase that would follow.

So, we'll start by assuming the BE needs to be a commercially replicable part and as such will have definitive upper bounds on its area. I figure that area is a much better metric than, say, density, due to it being invariant; I feel we'll all agree. I then propose that the 250nm Graphics Synthesizer and its 279mm2 physical size will provide a good reference point. We will also state that it'll be fabricated on the 65nm process, which is supportable by facts like this, or that I know Sony has bought rights to all Cell-based 65nm chips produced for an undisclosed period, blah... we can all agree on this.

We'll start with the PE core, which we shall assume to be the PowerPC 440 that Sony just licensed from IBM, as seen here. Some background information on this core is quoted below, and for more information there are some links further down.


PowerPC 440 Core Features said:
  • 0-667MHz performance
  • 9.8 mm2 hard core size
  • 7-stage pipeline, out-of-order issue, out-of-order execution and completion
  • Dynamic branch prediction
  • Parity error detection and recovery
  • Static design with extensive clock and power management support
  • 32x32 multiply, with single-cycle throughput

Memory Management Unit
  • 64-entry, fully associative unified TLB
  • Separate I- and D-side micro-TLBs
  • Flexible TLB management
  • Variable page sizes (1KB-256MB)
  • 32K/32K instruction cache/data cache controllers with parity
  • Write-back, write-through, non-blocking operation
  • Cache line locking (I and D)

The PowerPC 440 hardcore is quite compact at 9.8mm2 on the 130nm process. Given our previous assumption that the IC will be produced on the 65nm node, and scaling the area linearly with feature size, we're left with a PowerPC 440 core that's 4.9mm2 in size. As per the other assumptions above, we assume there will be four of them on a single die, which brings us to 19.6mm2 in utilized area (or down to 259mm2 of free area against our 279mm2 precedent).

* In the original 9.8mm2 figure, there is a good chance that the L1 caches were not included. But we are only talking about 64 KB per PowerPC 440 core, so this might mean an increase in the 19.6mm2 number to something like 20.8583mm2 (64 KB * 4 cores at an SRAM cell size of 0.6um2 at 65nm works out to 1.2583mm2), which is statistically irrelevant given the variance and accumulated error in this estimate.
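For anyone who wants to follow the arithmetic as we go, here's the PU step as a quick Python sketch. The only inputs are the 9.8mm2 hard-core figure, our deliberately conservative linear-ratio scaling (area shrunk by the feature-size ratio, not its square), and the 0.6um2 SRAM cell quoted further down:

# PU step under the conservative "linear" scaling assumption (sketch)
ppc440_65nm = 9.8 * (65 / 130)            # ~4.9 mm2 per core after the 130->65nm shrink
pu_cores = 4 * ppc440_65nm                # ~19.6 mm2 for four cores

# Optional L1 adjustment: 64 KB per core, 0.6 um2 per 65nm SRAM cell
l1_mm2 = 4 * 64 * 1024 * 8 * 0.6 / 1e6    # ~1.26 mm2, lost in the noise

print(pu_cores, l1_mm2, 279 - pu_cores)   # ~19.6 used, ~259.4 of the 279 mm2 budget left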

The PowerPC 440 core needs an FPU, so we’ll utilize this one. Here’s the background information:


PowerPC 440 FPU Core Specifications said:
Performance:
  • 1050 megaflops @ 525 MHz (Nominal)
  • 734 megaflops @ 367 MHz (Worst case)
Frequency:
  • 0-525 MHz nominal silicon, 1.8V, 55°C
  • 0-367 MHz slow silicon, 1.65 V, 85°C

  • Architecture: 32-bit PowerPC Book E compliant, IEEE-754 floating point
  • Precision: IEEE single-precision and double-precision
  • Superscalar, 2-way issue, out of order
  • Core size: 3.7 mm2
  • Technology: 0.18 µm (drawn) CMOS copper technology (CMOS 7SF)
  • Power supply: 1.8 ± 0.15 V
  • Transistors: < 800K
  • Temperature range: -40 to 125 °C

This will work well for our FPUs, one per PowerPC 440 core. At 180nm it's 3.7mm2, which would yield roughly 1.33mm2 at 65nm under the same scaling assumption. Four of them brings us to 24.9mm2 in utilized area thus far and 254mm2 of possible area left.
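And the same quick arithmetic for the FPU step (a sketch under the same conservative linear-ratio scaling):

# FPU step (sketch): one FPU per PowerPC 440 core
fpu_65nm = 3.7 * (65 / 180)           # ~1.34 mm2 per FPU, scaled from the 180nm core
pu_fpus = 4 * fpu_65nm                # ~5.3 mm2 for four FPUs
running_total = 19.6 + pu_fpus        # ~24.9 mm2 used so far
print(279 - running_total)            # ~254 mm2 of the budget left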



So, next we need to factor in the APUs, of which there's a plurality (32 in total, 8 per PowerPC 440), each of which contains 4 FMACs, 4 FXUs, registers, and all the assorted things we'll get to in a bit. First, let's start with the FMACs.

There are 32 APUs in the BE, yielding an aggregate of 128 FMACs. We'll base the FMAC off the one used in the Emotion Engine (thanks Quaz51), which was 1.7mm2 on the 250nm node. Using the linear scaling, we're left with each FMAC needing 0.442mm2. Thus, 128 FMACs will necessitate 56.57mm2, bringing the grand total up to 81.47mm2 utilized, with 197.43mm2 of the area left.

There are also 128 FXUs, for which I couldn't find a definitive area requirement. So, if someone knows and can help out, that would be awesome. I assumed, just for this, that each FXU is around 150% the size of an FMAC and rounded to 0.663mm2. That yields 84.864mm2 total, bringing the area count used up to 166.34mm2 and the amount of potential area down to 112.566mm2.

Within each APU (which we assume is reminiscent of the following patent on unified scalar and SIMD datapaths), we can assume that each APU will require control logic as well as instruction fetch, decode, an issue/branch unit, and the overhead that goes with them. We propose a conservative fudge factor of a full 2mm2 per APU for this, which comes to 64mm2 and brings us to a total used area of 230.34mm2 and roughly 48.566mm2 worth of potential area remaining.
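Here's the whole APU-logic step spelled out (a sketch; remember the FXU size and the 2mm2/APU overhead are assumptions on our part, not known figures):

# APU logic step (sketch)
fmac_65nm = 1.7 * (65 / 250)                   # ~0.442 mm2 per FMAC, scaled from the EE's 250nm FMAC
fxu_65nm = 1.5 * fmac_65nm                     # ~0.663 mm2 per FXU (assumed 1.5x an FMAC)
apu_logic = 128 * fmac_65nm + 128 * fxu_65nm   # ~141.4 mm2 of FMACs + FXUs
apu_overhead = 32 * 2.0                        # 64 mm2 of per-APU fetch/decode/control fudge
running_total = 24.9 + apu_logic + apu_overhead    # ~230.3 mm2 used
print(279 - running_total)                     # ~48.6 mm2 of the budget left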

Just something to keep in mind when looking at the remaining area of ~48mm2: the entire areas of the AMD Thoroughbreds, A and B, are 80mm2 and 84mm2 respectively on a 130nm process. You could fit one, perhaps two of them, normalized to 65nm, in the area remaining according to this.


We then have the SRAM based storage to account for: these being the Registers and LS. The SRAM will likely be derived from this. A brief summation:

Embedded SRAM cell said:
SRAM is sometimes used as cache memory in SoC systems. The Hi-NA 193-nm lithography with alternating phase shift mask and the slimming process, combined with the non-slimming trim mask process, will achieve the world's smallest embedded SRAM cell in the 65nm generation, with an area of only 0.6um2.

Knowing this, Pana first found the area allocated to the APU registers. Each APU's register file is 128 registers of 128 bits each, giving us an aggregate of 512Kbit of SRAM across the 32 APUs.

We then used the patent as a guide and found the area taken up by the aggregate 32Mbit of Local Storage SRAM. Combined, we have 32Mbit (LS) and 512Kbit (registers) of SRAM, which comes to 20.448mm2. This brings our total area used up to around 250.782mm2 and the potential area remaining to be tapped to 28.118mm2.
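The SRAM arithmetic, for anyone who wants to poke at it (a sketch using the 0.6um2/cell figure quoted above):

# SRAM step (sketch)
ls_bits = 32 * 2**20                           # 32 Mbit of Local Storage across the APUs
reg_bits = 32 * 128 * 128                      # 128 registers x 128 bits per APU, 32 APUs (~512 Kbit)
sram_mm2 = (ls_bits + reg_bits) * 0.6 / 1e6    # ~20.45 mm2
running_total = 230.34 + sram_mm2              # ~250.8 mm2 used
print(279 - running_total)                     # ~28 mm2 of the budget left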

Abstracted above the APUs and SRAM in the memory hierarchy is a layer of eDRAM. The patent states 64MB; several press releases state 32MB (256Mbit) or more. I suggest that we start at 32MB and move up from there depending on the size of the fudge factor at the end. So, again, Pana used the released specs of Sony and Toshiba's 65nm process for embedded RAM, as found above. Here's the relevant part:

Embedded DRAM cell said:
High-speed data processing requires a single-chip solution integrating a microprocessor and embedded large volume memory. Toshiba is the only semiconductor vendor able to offer commercial trench-capacitor DRAM technology for 90nm-generation DRAM-embedded System LSI. Toshiba and Sony have utilized 65nm process technology to fabricate an embedded DRAM with a cell size of 0.11um2, the world's smallest, which will allow DRAM with a capacity of more than 256Mbit to be integrated on a single chip.

So, using this as a guide, the 32MB of eDRAM will take up at least 29.528mm2 of die area (again, don't freak out over these numbers; there will be a fudge factor to eat up inefficiencies), which brings us to 280.31mm2 and leaves us with -1.41mm2 of potential area to utilize.
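If you want to check that figure yourself, here's the quick arithmetic (a sketch; 0.11um2/cell is the Toshiba/Sony number quoted above, and 32MB = 256Mbit):

# eDRAM step (sketch)
edram_mm2 = 256 * 2**20 * 0.11 / 1e6      # ~29.5 mm2 for 256 Mbit of eDRAM
running_total = 250.78 + edram_mm2        # ~280.3 mm2
print(279 - running_total)                # about -1.3 mm2: slightly over the GS-sized budget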

Now is the time to note that during this exploration of the feasibility of the Broadband Engine, we've been very conservative in our use of logic scaling. As Entropy has stated, logic scales down geometrically, bounded by [linear scale]^2. We haven't factored in the fact that we're dealing with a two-dimensional area, which, depending on the design of your microprocessor, may or may not make a large difference in the potential area.

Unlike many processors such as the Pentium M or Itanium lines, which use a plurality of their die area to fit in a large cache hierarchy, the Broadband Engine is more of a balance between the conventional CPU design and a modern stream processor or GPU, whose use of die area is heavily biased towards logic and relatively cache-anemic. From our calculations, the Broadband Engine is composed of 82% logic (230mm2), with the remaining 18% (50mm2) composed of SRAM and eDRAM.

So, keep in mind that 82% of the total area we've postulated so far has been linearly scaled, without the quadratic factor inherent in two-dimensional scaling being factored in.
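To put a rough number on how much headroom that conservative choice leaves, here's a quick sketch comparing the linear-ratio shrink we actually used against an ideal [linear scale]^2 shrink, applied to the scaled logic blocks only (the 64mm2 of per-APU overhead is a flat fudge factor, so it's excluded). This is only an illustration, not a claim about what a real implementation would achieve, but it's roughly where the "theoretical high of around 100mm2" mentioned below comes from:

# Extra area an ideal square-law shrink would free up vs. the linear-ratio shrink (sketch)
blocks = {                        # (area after linear-ratio shrink in mm2, source node in nm)
    "PPC440 cores": (19.6, 130),
    "FPUs":         (5.3, 180),
    "FMACs":        (56.6, 250),
    "FXUs":         (84.9, 250),  # the FXU size itself is an assumption (1.5x an FMAC)
}
saved = sum(area * (1 - 65 / node) for area, node in blocks.values())
print(saved)                      # ~118 mm2, give or take: the headroom hidden in the conservative scaling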

So, within that area, we could net anywhere from a highly [outrageously] conservative 0mm2, to a conservative 30mm2, up to a theoretical high of around 100mm2. Within that range, we need to fit 4 DMACs and the TLBs, and then go from there to cover the remaining control logic and wiring, the inefficiencies which weren't eaten up in the conservative 2mm2/APU factored in earlier, as well as places in which the highest density hasn't been used (which we'd hope [assume?!] is kept to a minimum on a modular processor like this). I'm not sure of the size of a DMAC, and don't really know where to start, but there would appear to be ample room, at least theoretically, as ~30mm2 on 65nm is a lot of room indeed. For a general idea, that's roughly the same amount of area as an AMD Thoroughbred if it were built on the 65nm process.



Before we get into inefficiencies, I do believe there is a strong case for this to be a highly area-efficient design. Since it is inherently customized for a specific application, unlike, say, the Power5, and includes a high degree of modularity, it should be possible to make each corpuscle highly optimized, with a very attractive equilibrium between area, power, and computational ability.

James Kahle said:
He directed some of his engineers to work from the bottom up, drafting circuit designs for specific functional elements of the chip, while other engineers were working from the top down, modeling the high-level architecture and providing the definition of the chip instruction set…. As with the high-level design, the implementation process is highly iterative, with the engineers building simulations of how each section of the chip will perform under real-life circumstances. "The trick is to achieve the right stability of the chip's smaller pieces," explains Kahle, "and then to build up bigger and bigger chunks, thereby improving the overall quality and stability of the design."

We've heard that tools such as IBM's EinsTuner have been used since the beginning, with theoretical efficiencies of 1/3rd in area savings and 15% in circuit performance. But one would still expect inefficiencies. Personally, I see variance in these numbers of ~10-25mm2 easily in each direction. But, again, the purpose of this isn't to show exactly what it will or may or might be, but what it could be. Just to add up a bunch of relatively known numbers and see what we have remaining, which is quite a bit indeed.

If anything, this shows that, with the commercial production of ICs approaching the order of 300mm2, be it the Graphics Synthesizer or the recently announced NV40, the Broadband Engine is highly feasible from a strictly area PoV. The numbers used here are all stand-alone and don't reflect the efficiencies gained by designing a modular, set-piece architecture, nor do they reflect the inefficiencies or other area drains - although, holistically, they should cancel each other out for the most part, perhaps leaving a slight bias towards the inefficiencies.

The bounds on this design would appear to rest with its power attributes, be they the direct problems of wiring and propagation or the thermal and power-intake issues. Levied against these are process technologies like low-k dielectrics and PD-SOI, which has worked amazingly well in AMD's recent use of it with the Opteron. Apparently, word on the street is that the BE implementation is limited by the thermal aspects, not size - which may be indicative of the speeds they seek to attain. But that's another thread, at another time.

PowerPC 440 Links:
PowerPC 440 Core: Whitepaper.
PowerPC 440 Core: Product Brief
PowerPC 440 Core: Product Brief 150nm
PowerPC 440 Core: Main Page

IBM Patent Links:
Processor with Redundant Logic
Reduction of interrupts in remote procedure calls
Symmetric Multi-Processing
Processor implementation having unified scalar and SIMD datapath
Token-Based DMA
 
...

What you missed out.

1. Assumption of 278 mm2 starting die size is no longer valid since SCEI is not in a position to lose $150 on first batch of units shipped this time around.

2. Notice that nVIDIA was forced to "DOWNCLOCK" NV40 to keep heat generation and power consumption under control. SCEI is not free from the laws of physics either and will be forced to downclock under massive heat generation if it manages to pack in as many transistors as possible. So you need to strike a balance between transistor count and clockspeed.

3. The PPC440 series is not a high-clocker. CELL is believed to be built around a Power4 core.

4. In James Kahle's "Redundancy" patent application, a description of CELL's physical layout is given. And from this reading it is quite obvious that CELL is a single PE device.

5. IBM explicitly stated they were looking to reach 1 TFLOPS in 2010.
 
I think you might be off base with your area calculations, resulting in an exaggeration of the estimated areas. When lithographic technique changes, the required area goes down as the (linear scale)^2. In your case, going from 0.13 to 0.065 um should lead to areas being reduced to one fourth, rather than halved.

Now, for various reasons the theoretical area reduction may not be fully realized, so it may be that using a halved area makes sense anyway as an extremely conservative estimate, but from what I've seen of other lithographic changes IBM has made, (most recently in shifting the 970 from .13 to .09 um, but also with the Power4 et cetera,) the area reduction should be better than that.
 
The PowerPC 440 hardcore is quite compact at 9.8mm2 on the 130nm process. Given our previous assumption that the IC will be produced on the 65nm node, and scaling the area linearly with feature size, we're left with a PowerPC 440 core that's 4.9mm2 in size. As per the other assumptions above, we assume there will be four of them on a single die, which brings us to 19.6mm2 in utilized area (or down to 259mm2 of free area against our 279mm2 precedent).

From 130nm to 65nm you would expect one fourth of the original size, not half.
 
Entropy said:
I think you might be off base with your area calculations, resulting in an exaggeration of the estimated areas. When lithographic technique changes, the required area goes down as the (linear scale)^2. In your case, going from 0.13 to 0.065 um should lead to areas being reduced to one fourth, rather than halved.

Now, for various reasons the theoretical area reduction may not be fully realized, so it may be that using a halved area makes sense anyway as an extremely conservative estimate, but from what I've seen of other lithographic changes IBM has made, (most recently in shifting the 970 from .13 to .09 um, but also with the Power4 et cetera,) the area reduction should be better than that.

I know, but Vince and I thought that it was better this way as it left more room for imprecisions, redundancy ( for fault tolerance ), inefficiencies, miscalculations of the impact of wide data busses, control logic, etc...

The thought was in a way simple yet straight to the point: if the numbers work with a quite conservative approximation of their technology scaling, then there is a chance that the chip might be realized in the real world.


...
What you missed out.

1. Assumption of 278 mm2 starting die size is no longer valid since SCEI is not in a position to lose $150 on first batch of units shipped this time around.

3. The PPC440 series is not a high-clocker. CELL is believed to be built around a Power4 core.

[...]

4. In James Kahle's "Redundancy" patent application, a description of CELL's physical layout is given. And from this reading it is quite obvious that CELL is a single PE device.


1.) First, IMHO the 45nm CMOS6 process will be available to Sony and Toshiba not too long after the launch of the console (2006 launch IMHO). Also, as Entropy noted, we were trying to be conservative with the technology scaling, so if you want to shave off some mm^2 from the total area, feel free to do so; we left some headroom for things like that too.

2.) The PU does not need to run at 4 GHz and using a POWER4 chip for the PU is IMHO wasteful.

4.) CELL, IA-64, IA-32, x86-64 = architectures.
 
V3 said:
The PowerPC 440 hardcore is quite compact at 9.8mm2 on the 130nm process. Given our previous assumption that the IC will be produced on the 65nm node, and scaling the area linearly with feature size, we're left with a PowerPC 440 core that's 4.9mm2 in size. As per the other assumptions above, we assume there will be four of them on a single die, which brings us to 19.6mm2 in utilized area (or down to 259mm2 of free area against our 279mm2 precedent).

From 130nm to 65nm you would expect one fourth of the original size, not half.

Yes, but you cannot rely on that as if it were gospel: it varies from chip to chip.
 
Entropy said:
I think you might be off base with your area calculations, resulting in an exaggeration of the estimated areas. When lithographic technique changes, the required area goes down as the (linear scale)^2. In your case, going from 0.13 to 0.065 um should lead to areas being reduced to one fourth, rather than halved.

Now, for various reasons the theoretical area reduction may not be fully realized, so it may be that using a halved area makes sense anyway as an extremely conservative estimate, but from what I've seen of other lithographic changes IBM has made, (most recently in shifting the 970 from .13 to .09 um, but also with the Power4 et cetera,) the area reduction should be better than that.

It was a conscious decision, for the reasons you stated. As I tried to make abundantly clear, the purpose of this was to demonstrate, utilizing the most conservative metrics available, that you can produce the BE within a manufacturable bound.

It wasn't that I don't comprehend this. I started to veer into the more realistic shrinkage projections when I was talking about fitting 3 Thoroughbreds when going to 65nm, but my belief is that if I can show it - utilizing the methodology I did, in which we take all the "known" parts, apply known sizes and scalings, and accomplish the task while being as conservative as possible - that's the best bet for demonstrating feasibility.

The whole concept was to throw everything in, add in a whole shit load of redundancy and then see how much is left over (+/-) in the 'fudge-factor' and go from there.

Personally, I wanted to factor in 512Mbit of eDRAM, and it's extremely realistic, but Panajev was insistent on being conservative and his thinking is valid.

EDIT: Woops, he beat me to it :)
 
V3 said:
It's what you would expect, given the ideal.

Yes and no. Yes, it's what I'd expect and would like to see - but there is [some] talk about the thermal tolerances, and I question how they'll be dealt with physically/architecturally (in addition to the process technology). If we are correct, the Broadband Engine will be the preeminent IC of its time, perhaps the first commercial IC with 100M gates, and active gates at that; with that come quite a few potential localized hotspots on the chip. I stand by our decision to be conservative - we all know what's *possible* and can extrapolate from there ourselves.
 
So how will leakage, heat, power consumption, yields, and basically getting a new chip up on a new micron process affect the chip?

Especially since it seems that with every new drop in micron we are having more and more problems getting things working.
 
Yes, it's what I'd expect and would like to see - but there is [some] talk about the thermal tolerances, and I question how they'll be dealt with physically/architecturally (in addition to the process technology).

So are you taking into account that most of the processors within are going to be clocked at 4 GHz when doing this?
 
...

So how will leakage, heat, power consumption, yields, and basically getting a new chip up on a new micron process affect the chip?
From now on, it will be the power consumption level that determines the actual logic gate count (SRAM gates don't burn up as much power since they don't change their state often) and clock frequency, not the lithography.

What good is having chips fabricated on 45nm when they will leak so much current that it takes more power just to maintain the equivalent clockspeed?

That 4-PE Broadband Engine thing was imagined before the power leakage problem at sub-100nm processes became apparent. No one will own the technology needed to fabricate something like the BE before 2010.

PS. The reason Dothan's successor Joshua has twin cores is that they are optimized for different purposes; one is tuned for high clockspeed on AC power, the other for long battery life. That's how much worse the power consumption problem has gotten...
 
1. Assumption of 278 mm2 starting die size is no longer valid since SCEI is not in a position to lose $150 on first batch of units shipped this time around.
You are talking about early PS2 production here, right? I'm not sure they ever lost $150 per unit to begin with. Their financial reports back then indicated that they were actually making money on sold PS2 units ever since they started selling them.

It's their massive initial investments in the chips used in the PS2 that made people think money had been lost per unit sold, as far as I can remember.
 
V3 said:
Yes, but you cannot rely on that as if it were gospel: it varies from chip to chip.

It's what you would expect, given the ideal.

Yes, but at times you might wish to leave the transistor width a little bit bigger, and sometimes you might not be able to scale wires as well as you would want. It depends on the design IMHO.
 
Re: ...

Deadmeat said:
So how will leakage, heat, power consumption, yields, and basically getting a new chip up on a new micron process affect the chip?
From now on, it will be the power consumption level that determines the actual logic gate count (SRAM gates don't burn up as much power since they don't change their state often) and clock frequency, not the lithography.

The problem with leakage is that we are wasting more and more power when the transistor is "not" switching, which worries some about the inherent advantage that everyone assumes CMOS to have by default.

Depending on your design, power consumption has always been of great importance.
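To illustrate the point with a textbook sketch (the numbers are made up, not for any real process): total chip power is roughly a switching term plus a leakage term, and only the switching term goes away when the transistors stop toggling.

# Rough CMOS power model (illustrative values only)
def chip_power_w(c_switched_f, vdd, freq_hz, activity, i_leak_a):
    dynamic = activity * c_switched_f * vdd**2 * freq_hz   # switching power
    static = vdd * i_leak_a                                # leakage power, paid even when idle
    return dynamic + static

# Hypothetical numbers, just to show the shape of the problem (~43 W dynamic + ~24 W static)
print(chip_power_w(c_switched_f=50e-9, vdd=1.2, freq_hz=4e9, activity=0.15, i_leak_a=20))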
 
Re: ...

IBM has a number of interesting and relevant whitepapers that refer to CMOS leakage, power consumption management schemes, and experimental chips where they have tried these concepts.

I've read a few. This one, titled "Power-constrained CMOS scaling limits", is of general interest, I believe, although fairly heavy reading:
http://www.research.ibm.com/journal/rd/462/frank.html

Generally speaking, IBM publishes a lot of interesting articles and whitepapers pertaining to various research subjects. These are often addressed at a level where an interested engineer/scientist can extract quite a bit of insight, as if they were targeting other departments within IBM.
http://www.research.ibm.com/ truly is a goldmine.
 
...

Given the present state of fab technology and a 25.6 GB/s memory bus, the most SCEI can manage to do is 2 PEs running at 2 GHz, and a single PE is much more probable and realistic.

Granted, 128 GFLOPS peak is not what some SCEI fans are expecting, yet this figure is 50x the FLOPS power of the PSP launching a year earlier, and around 5x the FLOPS rating of the XCPU2 (though the actual gap will be closer, since CELL is inherently very inefficient while the quad-threaded XCPU2 can be kept running near its peak rate).
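For those wondering where 128 GFLOPS comes from: it falls straight out of the per-PE layout at 2 GHz, assuming the patent's 8 APUs with 4 FMACs each and counting a fused multiply-add as two ops (a sketch, not a confirmed spec):

# Peak FLOPS for a single PE (sketch)
apus_per_pe = 8
fmacs_per_apu = 4
flops_per_fmac_per_cycle = 2        # multiply + add
clock_ghz = 2.0
peak_gflops = apus_per_pe * fmacs_per_apu * flops_per_fmac_per_cycle * clock_ghz
print(peak_gflops)                  # 128 GFLOPS per PE at 2 GHz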
 