Shifty Geezer said:
Well I'm going to push on with DP talk regardless, 'coz I'm reckless like that!
(I'd like to know why PS3's Cell isn't managing 1:2 performance SP vs DP floats).
The DP hardware in the SPEs isn't pipelined. For SP operations, there is a result latency of 7 cycles (IIRC) and an issue rate of 1 every cycle. SP operations are 4 wide SIMD.
For DP operations there is a result latency of 7 cycles and an issue rate of 1 every 7 cycles. DP operations are 2 wide SIMD. I don't know the exact hardware config of the DP MACs to know if they are sharing any logic with the SP hardware.
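To put numbers on that, here is a quick back-of-the-envelope sketch using only the figures above (SIMD width and issue interval). The function name is just something I made up for illustration, and the 7-cycle figure is the IIRC value from this post, not a verified spec number.

```python
from fractions import Fraction


def results_per_cycle(simd_width, issue_interval_cycles):
    """Sustained SIMD results per cycle for a unit that can start one
    instruction every `issue_interval_cycles` cycles (hypothetical helper)."""
    return Fraction(simd_width, issue_interval_cycles)


# Figures quoted in the post (recalled from memory, so treat as approximate).
sp = results_per_cycle(simd_width=4, issue_interval_cycles=1)  # pipelined SP
dp = results_per_cycle(simd_width=2, issue_interval_cycles=7)  # non-pipelined DP

ratio = sp / dp

print(f"SP results/cycle: {sp}")
print(f"DP results/cycle: {dp}")
print(f"SP:DP throughput ratio: {ratio}:1")
```

With those inputs the ratio comes out at 14:1 rather than 2:1: a factor of 2 from the narrower SIMD width and a factor of 7 from the lack of pipelining.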
So Aaronspink, you're saying a DP enhanced SPE occupies something like 3x the space of a conventional SPE? Or that going to DP enhanced would lose 2 SPEs from the 1:8 config? Assuming the same footprint for this DP+ Cell, would it be 1:6 or nearer 1:4?
No, a DP MAC will occupy on the order of 2x the space of 2 SP MACs. This is due to the much greater size of the multiplier needed for a 64x64 + offsets multiplication.
http://www.iccd-conference.org/proceedings/2001/12000497.pdf provides some area estimates for SP and DP floating point multipliers.
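As a rough sanity check on the "order of 2x" figure, here is a first-order sketch. The big assumption (mine, not taken from the paper above) is that multiplier area scales with the number of partial-product bits, i.e. the square of the significand width. The significand widths are the IEEE 754 ones; the function name is made up for illustration. Real designs with Booth encoding, adders, and alignment/normalization logic will not follow this exactly.

```python
# IEEE 754 significand widths, including the implicit leading bit.
SP_SIGNIFICAND_BITS = 24  # binary32: 23 stored + 1 implicit
DP_SIGNIFICAND_BITS = 53  # binary64: 52 stored + 1 implicit


def relative_multiplier_area(bits):
    """Assumed first-order model: area ~ number of partial-product bits,
    which is bits * bits for a simple array multiplier."""
    return bits * bits


dp_area = relative_multiplier_area(DP_SIGNIFICAND_BITS)
sp_area = relative_multiplier_area(SP_SIGNIFICAND_BITS)

dp_vs_one_sp = dp_area / sp_area   # one DP multiplier vs one SP multiplier
dp_vs_two_sp = dp_vs_one_sp / 2    # one DP multiplier vs two SP multipliers

print(f"DP vs 1 SP multiplier: {dp_vs_one_sp:.2f}x")
print(f"DP vs 2 SP multipliers: {dp_vs_two_sp:.2f}x")
```

Under that crude model one DP multiplier is a bit under 5x one SP multiplier, or somewhere between 2x and 3x a pair of them, which is in the same ballpark as the estimate above.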
If you look at the SPU on the die photo, the non-pipelined DP MAC occupies roughly half the area of the 4 SP MAC units. IBM likely saved a significant amount of space by not making the DP MACs pipelined. IBM has two options: they can either put in 2 pipelined DP MACs with the necessary logic to also do 2 SP MACs, or just increase the functionality of the current DP MAC block to support pipelined operation.
It would be nice if I could find die photos of two processors of the same micro-architecture, one supporting SSE and one supporting SSE2, because that would give a good idea of the area difference we are talking about.
Realistically, IBM will probably just increase the die size. Looking at the layout, to fit in the fully pipelined DP MACs they will likely require more horizontal real estate.
Aaron Spink
speaking for myself inc.