Hey. With all the discussion on the feasibility of the Broadband Engine as seen in the well-known Suzuoki/SCE and IBM/STI patents, and in a few threads in which some (you can guess who) used BlueGene/L as a base, I had a feeling something didn't add up. So I did some looking around and have an alternative take on the situation. I'm not saying it's totally accurate, but I propose that we have a central, civil discussion on this.
I'm going to present some numbers, show where they're from, and then open it up for people to draw their own conclusions and comments. These numbers are ballpark, but instead of shunning the variance, let's use it as a pseudo-constant, a fudge factor if you will, that will allow a rough estimate of what can be done, what can't, and where it all stands. So, without further ado, here's the basic premise.
I propose a strict interpretation of the above patents, use of recent comments where they're firm, and past precedent as a guide when in doubt. Let's try to keep this as plausible as possible. We'll add up the knowns and leave the overhead and fudge factors lumped together until the end. Then, depending on the magnitude, we'll see what can be done.
James Kahle said: Kahle also had to remain aware of the eventual manufacturability of the chip at this point, but elected to put the main burden of this part of the effort on the implementation phase that would follow.
So, we'll start by assuming the BE needs to be a commercially replicable part and as such will have a definitive upper bound on its area. I figure that area is a much better metric than, say, density, due to it being invariant; I think we'll all agree. I then propose that the 250nm Graphics Synthesizer and its 279mm2 physical size will provide a good reference point. We'll also state that it'll be fabricated on the 65nm process, which is supportable by facts like this, or that Sony has bought rights to all Cell-based 65nm chips produced for an undisclosed period... we can all agree on this.
We'll start with the PE core, which we'll assume is the PowerPC 440 that Sony just licensed from IBM, as seen here. Some background information on this core follows, with more links at the bottom.
PowerPC 440 Core Features said:
- 0-667MHz performance
- 9.8 mm2 hard core size
- 7-stage pipeline; out-of-order issue, out-of-order execution and completion
- Dynamic branch prediction
- Parity error detection and recovery
- Static design with extensive clock and power management support
- 32x32 multiply with single-cycle throughput
Memory Management Unit:
- 64-entry, fully associative unified TLB
- Separate I- and D-side micro-TLBs
- Flexible TLB management
- Variable page sizes (1KB-256MB)
- 32K/32K instruction/data cache controllers with parity
- Write-back, write-through, non-blocking operation
- Cache line locking (I and D)
The PowerPC 440 hard core is quite compact at 9.8mm2 on the 130nm process. Given our assumption that the IC will be produced on the 65nm node with linear scaling, we're left with a PowerPC 440 core that's 4.9mm2 in size. As per the other assumptions above, there will be four of them on a single die, which brings us to 19.6mm2 of utilized area (or down to 259mm2 of free area, going by our 279mm2 precedent).
* There is a good chance the original 9.8mm2 figure does not include the L1 caches. But we're only talking about 64 KB per PowerPC 440 core, so this would only raise the 19.6mm2 number to something like 20.86mm2 (64 KB x 4 cores at an SRAM cell size of 0.6um2 on 65nm works out to about 1.26mm2), which is statistically irrelevant next to the variance and accumulated error in these estimates.
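For anyone who wants to follow along, here is a minimal Python sketch of the conservative "linear" scaling rule used throughout this thread: the area is divided by the ratio of the feature sizes, not by its square as ideal two-dimensional scaling would allow. The figures are the ones quoted above, not measured numbers.

```python
# Conservative linear node scaling, as used for every logic block in
# this thread: area shrinks by the linear feature-size ratio only.
def scale_area(area_mm2, from_nm, to_nm):
    """Scale a block's area by the linear ratio of process nodes."""
    return area_mm2 * to_nm / from_nm

ppc440_at_65 = scale_area(9.8, 130, 65)  # 9.8mm2 hard core -> ~4.9mm2
four_cores = 4 * ppc440_at_65            # -> ~19.6mm2 for the four PE cores
print(ppc440_at_65, four_cores)
```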
The PowerPC 440 core needs an FPU, so we’ll utilize this one. Here’s the background information:
PowerPC 440 FPU Core Specifications said:
Performance:
- 1050 megaflops @ 525 MHz (nominal)
- 734 megaflops @ 367 MHz (worst case)
Frequency:
- 0-525 MHz nominal silicon, 1.8V, 55°C
- 0-367 MHz slow silicon, 1.65V, 85°C
Architecture: 32-bit PowerPC Book E compliant; supports IEEE-754 floating point, single- and double-precision; superscalar, 2-way issue, out of order
Core size: 3.7 mm2
Technology: 0.18 µm (drawn) CMOS copper technology (CMOS 7SF)
Power supply: 1.8 ± 0.15 V
Transistors: < 800K
Temperature range: -40 to 125 °C
This will work well for our FPUs. At 180nm it's 3.7mm2, which yields roughly 1.33mm2 at 65nm under the same linear scaling. Four of them bring us to 24.9mm2 utilized thus far, with 254mm2 of possible area left.
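The same scaling rule applied to the FPU and added to the running total (again, just the thread's ballpark arithmetic, not official figures):

```python
# Four scaled 440 FPUs added to the four-core total, against the
# 279mm2 Graphics Synthesizer reference budget.
fpu_at_65 = 3.7 * 65 / 180          # 3.7mm2 at 180nm -> ~1.34mm2
used = 4 * 4.9 + 4 * fpu_at_65      # four cores + four FPUs
print(round(used, 1), round(279 - used, 1))  # ~24.9 used, ~254.1 left
```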
So, next we need to factor in the APUs, of which there are a plurality (32; 8 per PowerPC 440), each containing 4 FMACs, 4 FXUs, registers, and the assorted things we'll get to in a bit. First, let's start with the FPUs.
There are 32 FPUs per PE, yielding an aggregate of 128 FMACs. We'll base the FMAC on the one used in the Emotion Engine (thanks Quaz51), which was 1.7mm2 on the 250nm node. Linear scaling leaves each FMAC needing 0.442mm2, so 128 FMACs will necessitate 56.58mm2, bringing the grand total up to roughly 81.5mm2 utilized, with 197.5mm2 of the area left.
There are also 128 FXUs, for which I couldn't find a definitive area requirement. So, if someone knows and can help out, that would be awesome. For now I've assumed each FXU is around 150% the size of an FMAC, rounded to 0.663mm2. That yields 84.86mm2 in total, bringing the area used up to about 166.36mm2 and the potential area down to 112.64mm2.
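The FMAC and FXU aggregates, as a sketch. The EE FMAC area comes from Quaz51's figure above; the 150%-of-an-FMAC FXU size is purely this thread's assumption.

```python
# FMAC/FXU area totals under linear scaling from 250nm to 65nm.
fmac = 1.7 * 65 / 250        # EE FMAC scaled: ~0.442mm2 each
fxu = round(1.5 * fmac, 3)   # ASSUMED 1.5x an FMAC: ~0.663mm2 each
print(round(128 * fmac, 2), round(128 * fxu, 2))  # ~56.58 and ~84.86
```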
Within each APU (which we assume is reminiscent of the patent linked below on unified scalar and SIMD datapaths), we can assume there will be control logic as well as instruction fetch, decode, an issue/branch unit, and the overhead for all this. I propose a conservative fudge factor of a full 2mm2 per APU, which comes to 64mm2 and brings us to a total used area of about 230.36mm2, with roughly 48.64mm2 of potential area remaining.
Just something to keep in mind when looking at the remaining ~48mm2: the entire die of an AMD Thoroughbred, revisions A and B, comes to 80mm2 and 84mm2 respectively on a 130nm process. You could fit one, perhaps two of them, normalized to 65nm, in the area remaining.
We then have the SRAM-based storage to account for: the registers and the LS. The SRAM will likely be derived from this. A brief summation:
Embedded SRAM cell said: SRAM is sometimes used as cache memory in SoC systems. The Hi-NA 193-nm lithography with alternating phase-shift mask, and the slimming process combined with the non-slimming trim mask process, will achieve the world's smallest embedded SRAM cell in the 65nm generation, with an area of only 0.6um2.
Knowing this, Pana first found the area allocated to the APU registers. The register file in each APU is 128 registers of 128 bits (16Kbit), giving an aggregate of 512Kbit of SRAM across the 32 APUs.
We then used the patent as a guide for the aggregate 32Mbit of LS SRAM. Combined, we have 32Mbit (LS) and 512Kbit (registers) of SRAM, which comes to 20.45mm2 at that cell size. This brings our total area used up to around 250.78mm2 and the potential area remaining to be tapped to about 28.22mm2.
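That SRAM number falls straight out of the 0.6um2 cell figure quoted above; a quick sketch (cell-size-only, so it ignores decoders, sense amps, and other array overhead, which the end-of-thread fudge factor is meant to absorb):

```python
# SRAM footprint from the 0.6um2 65nm cell: 32Mbit of LS plus
# 512Kbit of APU registers. Array overhead is deliberately ignored.
CELL_UM2 = 0.6
bits = 32 * 2**20 + 512 * 2**10      # LS bits + register bits
sram_mm2 = bits * CELL_UM2 / 1e6     # um2 -> mm2
print(round(sram_mm2, 3))            # ~20.447
```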
Abstracted above the APUs and SRAM in the memory hierarchy is a layer of eDRAM. The patent states 64MB; several press releases state 32MB (256Mbit) or more. I suggest we start at 32MB and move up from there depending on the size of the fudge factor at the end. So, again, Pana used the released specs of Sony and Toshiba's 65nm embedded-DRAM process as found above. Here's the relevant part:
Embedded DRAM cell said: High-speed data processing requires a single-chip solution integrating a microprocessor and embedded large-volume memory. Toshiba is the only semiconductor vendor able to offer commercial trench-capacitor DRAM technology for 90nm-generation DRAM-embedded System LSI. Toshiba and Sony have utilized 65nm process technology to fabricate an embedded DRAM with a cell size of 0.11um2, the world's smallest, which will allow DRAM with a capacity of more than 256Mbit to be integrated on a single chip.
So, using this as a guide, the eDRAM will take up at least 29.53mm2 of die area (again, don't freak out at these numbers; there will be a fudge factor to eat up inefficiencies), which brings us to roughly 280.3mm2 and leaves us with about -1.3mm2 of potential area to utilize.
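The full running tally, for anyone who wants to check or tweak the inputs. Every entry is one of this thread's estimates (the FXU and per-APU overhead lines are pure assumptions), and depending on intermediate rounding you'll land within a tenth of a mm2 or so of the figures above:

```python
# Running tally of every block estimated so far, against the 279mm2
# Graphics Synthesizer reference. All inputs are this thread's estimates.
blocks = {
    "PPC440 cores":   4 * 4.9,                     # linear 130->65nm
    "440 FPUs":       4 * (3.7 * 65 / 180),        # linear 180->65nm
    "FMACs":          128 * (1.7 * 65 / 250),      # EE FMAC, 250->65nm
    "FXUs":           128 * 0.663,                 # ASSUMED 1.5x FMAC
    "APU overhead":   32 * 2.0,                    # ASSUMED 2mm2/APU
    "SRAM (LS+regs)": 34078720 * 0.6 / 1e6,        # 0.6um2 cell
    "eDRAM (32MB)":   32 * 2**20 * 8 * 0.11 / 1e6, # 0.11um2 cell
}
used = sum(blocks.values())
print(round(used, 2), round(279 - used, 2))  # ~280.36 used, ~-1.36 over
```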
Now is the time to note that throughout this exploration of the feasibility of the Broadband Engine, we've been very conservative in our use of logic scaling. As Entropy has stated, logic scales down geometrically, bounded by [linear scale]^2. We haven't factored in that we're dealing with a two-dimensional area, which, depending on the design of your microprocessor, may or may not make a large difference in the potential area.
Unlike processors such as the Pentium M or the Itanium line, which devote much of their die area to a large cache hierarchy, the Broadband Engine is more of a balance between a conventional CPU design and a modern stream processor or GPU, whose use of die area is heavily biased toward logic and away from cache. From our calculations, the Broadband Engine is composed of 82% logic (230mm2), with the remaining 18% (50mm2) composed of SRAM and eDRAM.
So, keep in mind that 82% of the total area we've postulated so far has been linearly scaled, without the quadratic factor inherent in two-dimensional scaling being factored in.
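To make the point concrete, here's how much the conservative rule leaves on the table for a single block; under ideal two-dimensional scaling each logic block shrinks by an extra linear factor beyond what was used in every figure above:

```python
# Linear (conservative, used in this thread) vs. ideal quadratic
# scaling of the 9.8mm2 PowerPC 440 hard core from 130nm to 65nm.
linear = 9.8 * (65 / 130)         # ~4.9mm2, the figure used above
quadratic = 9.8 * (65 / 130)**2   # ~2.45mm2 under ideal area scaling
print(linear, quadratic)
```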
So we have headroom that could range anywhere from an outrageously conservative 0mm2, to a conservative 30mm2, up to a theoretical high of around 100mm2. Within that range we need to fit 4 DMACs and the TLBs, and then work in the remaining control logic and wiring, the inefficiencies not eaten up by the conservative 2mm2/APU factored in earlier, and the places where highest density isn't achieved (which we'd hope, or assume, is kept to a minimum on a modular processor like this). I'm not sure of the size of a DMAC, and I don't really know where to start there, but there would appear to be ample room, at least theoretically; ~30mm2 at 65nm is a lot of room indeed. For a general idea, that's roughly the same area as an AMD Thoroughbred if it were built on the 65nm process.
Before we get into inefficiencies, I do believe there's a strong case for this being a highly area-efficient design. Since it's inherently customized for a specific application, unlike, say, the POWER5 with its high degree of modularity, it should be possible to make each corpuscle highly optimized, with a very attractive equilibrium between area, power, and computational ability.
James Kahle said: He directed some of his engineers to work from the bottom up, drafting circuit designs for specific functional elements of the chip, while other engineers were working from the top down, modeling the high-level architecture and providing the definition of the chip instruction set… As with the high-level design, the implementation process is highly iterative, with the engineers building simulations of how each section of the chip will perform under real-life circumstances. "The trick is to achieve the right stability of the chip's smaller pieces," explains Kahle, "and then to build up bigger and bigger chunks, thereby improving the overall quality and stability of the design."
We've heard that tools such as IBM's EinsTuner have been used since the beginning, with theoretical gains of a 1/3 saving in area and 15% in circuit performance. But one would still expect inefficiencies; personally, I could easily see variance in these numbers of ~10-25mm2 in each direction. Again, the purpose of this isn't to show exactly what it will or might be, but what it could be: just to add up a bunch of relatively known numbers and see what we have remaining, which is a fair bit indeed.
If anything, this shows that with commercial production of ICs approaching the order of 300mm2, be it the Graphics Synthesizer or the recently announced NV40, the Broadband Engine is highly feasible from a strictly area PoV. The numbers used here are all stand-alone and don't reflect the efficiencies gained by designing a modular, set-piece architecture, nor do they reflect the inefficiencies or other area drains; holistically, they should mostly cancel each other out, perhaps with a slight bias toward inefficiency.
The bounds on this design would appear to rest with the power attributes, be they the direct problems of wiring and propagation or the thermal and power-intake issues. Levied against this are process technologies like low-k dielectrics and PD-SOI, which worked amazingly well in AMD's recent use with the Opteron. Apparently, word on the street is that the BE implementation is limited by thermals, not size, which may be indicative of the speeds they seek to attain. But that's another thread, at another time.
PowerPC 440 Links:
PowerPC 440 Core: Whitepaper.
PowerPC 440 Core: Product Brief
PowerPC 440 Core: Product Brief 150nm
PowerPC 440 Core: Main Page
IBM Patent Links:
Processor with Redundant Logic
Reduction of interrupts in remote procedure calls
Symmetric Multi-Processing
Processor implementation having unified scalar and SIMD datapath
Token-Based DMA