On the Feasibility of the Broadband Engine

Saem said:
A note about feature size scaling. It takes place in three dimensions and they do not all scale at the same rate -- I believe Russ can confirm this.

You are correct, but again it is design dependent.
 
3. PPC440 series is not a high-clocker. CELL is believed to be built around a Power4 core.

The Giga core would be WAAAYYYY overkill for a PE (although nice). Also the 440 core is capable of much higher clocks than IBM ships it with. Moto has a quad-issue (well, technically 3) core with the same logical pipeline running at 1.5GHz on 130nm...

The patent said 4 floating-point units and 4 integer units. They could turn out to be FMACs I suppose, but I'll treat them as FPUs for the time being.

An FMAC *is* one implementation type of an FPU ALU (vs., say, having separate multipliers, accumulators, and shifters like you do in most x86 designs).
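For anyone following along without the hardware background, here's the distinction in plain C (fmaf is just standard C99 math.h, nothing CELL-specific; only meant to show what a fused multiply-accumulate buys you):

#include <math.h>

/* Two ways to compute a*b + c in single precision. The separate mul/add
   path rounds twice; an FMAC-style unit does it as one op with a single
   rounding at the end, which is why one fused unit can stand in for a
   separate multiplier and adder. */
float mul_then_add(float a, float b, float c)
{
    return (a * b) + c;     /* round after the mul, round again after the add */
}

float fused_mac(float a, float b, float c)
{
    return fmaf(a, b, c);   /* one fused operation, one rounding step */
}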

I say it because an FPU with a 3-stage pipeline doing FMAC operations could be a good idea, and VMX is only 4 million transistors with 4 Vint ops and 4 Vfp ops.

As somebody who loves, programs, and uses AltiVec, I'd say no. Note: only on the 7400/7410 was the VCIU 3 stages deep. The VPU was 1, the VSIU was 1, and the VFPU was 4. These also got longer on the 745x cores, and *MUCH* deeper on the 970.

Also the APU design is more like the VUs, which employed elements of VLIW. Plus there's a lot of junk in AltiVec you wouldn't really need. With each element having its own FMAC you don't need stuff like AltiVec's permute function, when you can just specify a swizzle or arbitrary element-to-element operations in instruction masks.
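To make that concrete (a hedged sketch using GCC-style AltiVec intrinsics, not anything out of the patent): just reordering the four floats in a vector costs you an explicit vec_perm through the permute unit, driven by a byte-granular control vector, whereas an ISA that encodes element selection in the instruction itself gets the same reorder for free.

#include <altivec.h>

/* AltiVec: a simple .wzyx swizzle needs a dedicated permute instruction
   fed by a 16-entry byte-select control vector. */
vector float reverse_xyzw(vector float v)
{
    const vector unsigned char wzyx =
        { 12,13,14,15,  8, 9,10,11,  4, 5, 6, 7,  0, 1, 2, 3 };
    return vec_perm(v, v, wzyx);
}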

If the APUs follow the same route as the PS2 VUs, a single APU's FPU would be smaller (no superscalar execution, no full IEEE compliance, no other fancy stuff...) than the FPU selected by Vince.

Hehe, you could say the same about the PE FPU, if you want to draw the analogy to the EE Core FPU. :p

The PowerPC 440 FPU is an out-of-order design and the FXUs are handled by out-of-order logic with register renaming and all other neat thingies.

Barely... :p

I'd think that they'll do without FDIV completely in the APUs. Instead they'll have a reciprocal estimate instruction (doing 4 estimates in parallel), which you can then refine to the desired precision with Newton-Raphson. This has the added bonus that it can be pipelined.

Depends. The VU FDIVs are actually pretty fast (although still not without stall penalties). OTOH, yeah, you could drop the FDIV, save on the logic space, and software-pipeline your recip. est./refinements... After all, that's pretty much what you have to do in AltiVec since there is no div (hell, divs are optional on the Alpha ISA)...
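For reference, the estimate-plus-refinement being described looks something like this in AltiVec C (a generic sketch, nothing CELL-specific; vec_re gives a rough estimate and one Newton-Raphson step pushes it to near full single precision, using nothing but pipelined madd-type ops):

#include <altivec.h>

/* Refine the hardware reciprocal estimate with one Newton-Raphson step:
     x1 = x0 + x0*(1 - a*x0)
   No divide instruction anywhere, so the whole thing pipelines and can be
   software-pipelined across four elements at a time. */
vector float recip_nr(vector float a)
{
    const vector float one = {1.0f, 1.0f, 1.0f, 1.0f};
    vector float x0 = vec_re(a);             /* rough 1/a estimate   */
    vector float t  = vec_nmsub(a, x0, one); /* error term: 1 - a*x0 */
    return vec_madd(x0, t, x0);              /* x0 + x0*(1 - a*x0)   */
}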

Also I think they'll build the Floating point hardware so that they can just push integers through as denormalized floating point values - saves transistors on execution units.

While I actually like this idea, it does have its drawbacks (depending on how much functionality resides in the APU). Any bit-mangling instructions would be a real pain to implement, and you can get pretty wasteful of space if you're aligning off of 16-byte or 32-byte boundaries (depending on what data types you're going to support).
 
Also I think they'll build the Floating point hardware so that they can just push integers through as denormalized floating point values - saves transistors on execution units.

Intel likes to do this with its muls.
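A toy illustration of why routing integer multiplies through the FP datapath works at all (my example, not Intel's actual circuit): a double's 53-bit mantissa holds every integer up to 2^53 exactly, so as long as the operands are narrow enough the detour through floating point is bit-exact.

#include <stdint.h>

/* Do an integer multiply on the FP multiplier by converting to double and
   back. With operands under 2^26 the product stays under 2^52, well inside
   the 53-bit mantissa, so no rounding ever happens. Real hardware shares
   the multiplier array at the circuit level instead of literally
   converting; this just shows why the trick is lossless. */
uint64_t imul_via_fp(uint32_t a, uint32_t b)
{
    /* sketch assumes a, b < 2^26 */
    double da = (double)a;
    double db = (double)b;
    return (uint64_t)(da * db);
}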

I wonder how long until we just have an enormous number of fast integer units and, rather than bothering with FP hardware, you just emulate it.
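That's basically soft-float, and for the easy cases it isn't even much code. A stripped-down single-precision multiply done purely with integer ops, ignoring zero, NaN, infinity, denormals and proper rounding (so very much a toy sketch, not a drop-in library routine):

#include <stdint.h>
#include <string.h>

float soft_fmul(float fa, float fb)
{
    uint32_t a, b;
    memcpy(&a, &fa, 4);
    memcpy(&b, &fb, 4);

    uint32_t sign = (a ^ b) & 0x80000000u;
    int32_t  ea   = (int32_t)((a >> 23) & 0xFF) - 127;
    int32_t  eb   = (int32_t)((b >> 23) & 0xFF) - 127;

    /* 24-bit significands with the implicit leading 1 re-attached */
    uint64_t ma = (a & 0x007FFFFFu) | 0x00800000u;
    uint64_t mb = (b & 0x007FFFFFu) | 0x00800000u;

    uint64_t prod = ma * mb;       /* up to 48 bits */
    int32_t  e    = ea + eb;

    if (prod & (1ull << 47)) {     /* significand product in [2,4) */
        prod >>= 24;
        e += 1;
    } else {                       /* significand product in [1,2) */
        prod >>= 23;
    }

    uint32_t bits = sign | ((uint32_t)(e + 127) << 23) | ((uint32_t)prod & 0x007FFFFFu);
    float out;
    memcpy(&out, &bits, 4);
    return out;
}

Proper rounding, the special cases, and addition (with its alignment shifts and cancellation handling) are where the real integer-op cost piles up.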
 
archie4oz said:
With each element having it's own FMAC you don't need stuff like AltiVec's permute function, when you can just specify a swizzle or arbitrary element to element operations in instruction masks.
It's tossing out sentences like this that makes you realize you're an incurable tech-head. ;)
 
...

[image: sun05.jpg]

The first die picture of Sun's Niagara processor: 340 mm².

CELL will look something like this...
 