It has the possibility of being faster, but instruction latency is likely longer for a MAC instruction since it is doing more work. This could lead to bubbles in the instruction stream that slow overall performance.
Nah... Pretty much every RISC design out there can do single-cycle FMAC operations... At least single precision, some incur more latency for double precision, while others do not...
Dual FPUs performing FMACs could require 2x(2 operands + 1 result) or 32 bytes read and 16 bytes written per clock cycle. At 2.5 GHz.... Which is why the advantages of the 970 FPUs aren't obvious on all codes, but only those where the large general register set or cache residency can be brought to bear, reducing traffic to main memory.
Actually with PowerPC you need to load 3 operands to do a full FMAC (which all told will eat 4 registers, but you preserve the data in all of them). And that amount of loading isn't really a problem for the 970 (it has 2 load/store units); the real trick is trying to saturate the FPUs within the limitations of the dispatch groups.
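To put numbers on the operand-traffic argument above, here is a rough back-of-the-envelope sketch in C. It only does the arithmetic from the two posts (2 FPUs, 8-byte double operands, 2.5 GHz) and pretends every operand comes from memory. The struct and function names are made up for illustration, and none of this is a measured 970 figure.

```c
/* Hypothetical helper: operand traffic if every FMAC operand had to be
   fetched from memory. Names and parameters are illustrative only. */
typedef struct {
    double read_bytes_per_cycle;   /* source operands fetched per clock */
    double write_bytes_per_cycle;  /* results stored per clock */
    double read_gb_per_s;          /* read traffic at the given clock */
    double write_gb_per_s;         /* write traffic at the given clock */
} fmac_traffic;

fmac_traffic fmac_operand_traffic(int fpus, int source_operands,
                                  int bytes_per_operand, double clock_ghz)
{
    fmac_traffic t;

    /* Each FPU reads its source operands and writes one result per cycle. */
    t.read_bytes_per_cycle  = (double)fpus * source_operands * bytes_per_operand;
    t.write_bytes_per_cycle = (double)fpus * bytes_per_operand;

    /* bytes/cycle * (1e9 cycles/s per GHz) gives GB/s directly. */
    t.read_gb_per_s  = t.read_bytes_per_cycle  * clock_ghz;
    t.write_gb_per_s = t.write_bytes_per_cycle * clock_ghz;
    return t;
}
```

With the 2-operand count from the quoted post, `fmac_operand_traffic(2, 2, 8, 2.5)` works out to 32 bytes read and 16 bytes written per cycle, i.e. 80 GB/s and 40 GB/s. With 3 source operands per FMAC the read side becomes 48 bytes per cycle, or 120 GB/s, which is why keeping operands in registers and cache matters so much.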
That isn't quite true though (referring to the PPC family in general). I don't know about G3 derivatives, but the actual PPC750 family has single-cycle 64-bit FP.
While yes this is true for some instructions, it's not true for all... The 750 series has single-cycle double precision instructions except for fmadd, fmsub, and fmul (all of which take 2 cycles; IIRC there's also a stall incurred after something like 4-5 sequential fmadds are executed). Of course that doesn't include the typically slow fdiv, or other instructions like fsqrt/fsqrts/fsqrte. Motorola eliminated those double precision latencies in the G4s (although IIRC the sequential fmadd stall still exists) and I believe IBM has in the G5s (not sure about the actual execution times since the 970 is much deeper, but FMAC instructions, while complex, are not cracked).
That's not much to do with processing power - properly leveraging a large register set can have a huge impact on execution speed. Problem is most compilers aren't really up to the task at all...
Yes this is especially true of compilers born from register-starved accumulator ISAs [cough]x86/gcc[/cough]... Never mind even leveraging the large number of rename registers on some of today's deep OOE designs (although vendor-provided compilers tend to do a better job, e.g. icc, xlc)...
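For anyone wondering what "leveraging the register set" looks like in source, here is a minimal sketch, not anything taken from a real compiler or from IBM's docs. It is a dot product split across four independent accumulators, so no multiply-add has to wait on the result of the one right before it. The function name and the choice of four accumulators are arbitrary assumptions. Whether the compiler actually keeps the accumulators in registers and turns each line into a fused multiply-add depends on the compiler, target and flags.

```c
#include <stddef.h>

/* Illustrative dot product with four independent accumulators.
   The accumulator count (4) is an arbitrary choice for this sketch. */
double dot_4acc(const double *x, const double *y, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    size_t i = 0;

    /* Main loop: four multiply-adds per iteration, each with its own
       accumulator, so they form four separate dependency chains. */
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }

    /* Tail: handle the last 0-3 elements. */
    for (; i < n; ++i)
        acc0 += x[i] * y[i];

    /* Combine the partial sums. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Note that splitting the sum this way changes the order of the additions, so the last bits of the result can differ from a plain single-accumulator loop. That is also one reason a compiler won't make this transformation on its own unless you allow it to relax floating-point semantics.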
Okay smarty pants, decipher this for us! Where is the gridlock?
Well aside from optimizing for dispatch groups, the biggest surprises are the changes to AltiVec. Leveraging the DST instructions (software-directed hardware prefetch) and some of the DCB instructions (data cache block, namely DCBZ, DCBI and DCBA) that are so crucial for performance optimization on the G4s causes lots of performance problems on the G5s...
I didn't quite trust my memory so I checked it out.
It appears that the FP-unit may have been changed during the life-time of the "G3". For the last iteration from Motorola, the MPC755, these are the timings:
Yes the 755 was sort of like Moto's testbed for the improvements in the 74xx FPUs...
Quite remarkable to see them spend that kind of effort on non-proprietary software, even though the strategic value is undeniable. Bodes well for simple porting of FP heavy scientific/technical codes.
Why is that so remarkable? Have you seen the effort IBM has put behind Java, Linux and other open-source projects? And have you taken a look at OS X? The underlying OS, Core Foundation, IOKit, etc. are all open source (not to mention the bundling of a lot of open-source tools and apps with it). Then there are other portions like the Darwin Streaming Server, and WebKit (the HTML/JavaScript engine derived from KDE's KHTML)...
Gekko (a PowerPC 75x CPU) can do two parallel 32-bit FP MADD operations at a time with its 64-bit FPU, so I would think it can do a 64-bit ADD in a single cycle as well as a 64-bit MADD in a single cycle.
Gekko suffers all the same limitations of the 750. However its packed instructions are all still single cycle...