Xenon and Revolution CPUs: PowerPC 970 vs Power5?

ERP said:
Unless they're talking about a 64bit altivec, I'm missing the point.

I'm not sure I see the point of the quote myself, but the functionality of the IBM 970 processor is crystal clear. The 970 has two independent FPUs, each capable of single-cycle 64-bit FMAC. It can perform these operations independently of (and in addition to) its Altivec SIMD unit. It also supports ternary operations, by the way, which SSE(1,2,3) doesn't.

Compared to x86 processors, the 970 is an FPU monster. For double precision (64-bit) scientific codes, the difference in performance can be quite substantial. It's one of the nice aspects of its Power4 heritage.

To what extent this is an advantage for games is another question. The game coders will have to answer that one. There are other interesting aspects of the processor, such as the neat FSB, its instruction grouping et cetera, but digging into these minutiae may not be pertinent anyway, as the processor(s) going into the Xbox2 may differ.

What I would like to have explained to me is how having three dual-core CPU chips in the XBox2 can be possible given the likely cost in dollars and watts.
 
DeanoC said:
Or something completely different :LOL:

Honestly, I can't see how a triple dual-core system is supposed to be implemented in practice. It's interesting to know what is going into the XBox2, since IBM is typically big on IP reuse. If they develop something new, it is likely to end up in other machines; if not, the XBox2 solution is likely to be a pretty straightforward remapping of existing design(s). The first case is the most interesting, but it's difficult to envision a parallel but completely different effort of Cell-like magnitude. So if not Cell, I'd predict something derived from Power5.
 
Entropy:

what other people like Mfa have been saying is this: three CPU cores ( each, say PowerPC 976 ) with each core modified for SMT support ( up to 2 threads ).

SMT supposedly only adds like 20% to the core's area.
 
Panajev2001a said:
Entropy:

what other people like Mfa have been saying is this: three CPU cores ( each, say PowerPC 976 ) with each core modified for SMT support ( up to 2 threads ).

SMT supposedly only adds like 20% to the core's area.
That's what IBM says about their Power5.
Three such cores should be doable on a single chip, lending itself to further process shrinks with time for improved economics.
Sounds almost believable. :) Power5 is pretty well documented.
 
Edit: sorry, I misread your answer :LOL:.



Entropy said:
Panajev2001a said:
Entropy:

what other people like Mfa have been saying is this: three CPU cores ( each, say PowerPC 976 ) with each core modified for SMT support ( up to 2 threads ).

SMT supposedly only adds like 20% to the core's area.
That's what IBM says about their Power5.
Three such cores should be doable on a single chip, lending itself to further process shrinks with time for improved economics.
Sounds almost believable. :) Power5 is pretty well documented.

POWER5 has, in a single chip, two full cores.

Each core, in turn, was SMT enhanced to support up to 2 threads at a time.

POWER4 chips appeared to the OS as two CPUs, while POWER5 chips should appear as up to 4 logical CPUs.

PowerPC 970 takes only one of the two POWER4 cores, modifies its architecture a bit and adds an Altivec/VMX unit.

Add SMT to the PowerPC 970 and take three of such "enhanced" processors. The L2 cache in the Microsoft document was shared by the three cores and there was no on-die L3 cache support.
 
a688 said:
Apple never said that 32-bit chips COULDN'T do double precision, they said (according to that quote in the thread earlier) that it takes them MULTIPLE clock cycles to do the work.

Exactly!!

I was trying to point out G5's math abilities. That it got all mixed up in double-precision floating point mumbo jumbo is Apple's fault! :p

Apparently, precision is what differentiates a multiply-accumulate from a fused multiply-accumulate operation. The latter is only rounded once.

The Free Dictionary said:
A Fused Multiply-Add computes a multiply-accumulate

FMA (A,B,C) = A * B + C

with a single rounding of floating point numbers.

Source: The Free Dictionary

Other than that, they are functionally equivalent, with a plain multiply-accumulate being carried out as two operations: a multiplication followed by an addition, each rounded separately.

But G5 is doing all of this in a single operation! :oops:

So it would seem that the more arduous its mathematical workload, the more timesaving it should become, comparatively speaking of course. ;)
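The single-rounding distinction quoted above is easy to demonstrate numerically. Here's a minimal Python sketch (the fused result is simulated exactly with Fraction, purely for illustration; this isn't any particular FPU's behavior, just IEEE-style double arithmetic):

```python
from fractions import Fraction

# Operands chosen so that a*a is NOT exactly representable as a 64-bit double.
a = 1.0 + 2.0**-30          # exactly representable
c = -(1.0 + 2.0**-29)       # exactly representable

# Unfused multiply-accumulate: a*a is rounded to the nearest double first
# (the tiny 2**-60 term of the exact product is lost), THEN c is added.
unfused = a * a + c          # the small term vanishes -> 0.0

# Fused multiply-add: compute a*a + c exactly, round only once at the end.
fused = float(Fraction(a) * Fraction(a) + Fraction(c))   # 2**-60 survives

print(unfused)   # 0.0
print(fused)     # 2**-60, about 8.67e-19
```

So the two forms really are "functionally equivalent" except in the last bit: the fused version keeps precision that the two-step version throws away at the intermediate rounding.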
 
If indeed the Xbox2 CPU is 3 PowerPC 970 cores on one die, which looks like 6 logical CPUs to the OS (6 threads), I'll be very interested in finding out what Microsoft's reasoning for this particular configuration is.
 
Megadrive1988 said:
If indeed the Xbox2 CPU is 3 PowerPC 970 cores on one die, which looks like 6 logical CPUs to the OS (6 threads), I'll be very interested in finding out what Microsoft's reasoning for this particular configuration is.

Why not?

If it's true, it probably has more to do with cost/performance tradeoffs than anything else.
 
Pepto-Bismol said:
a688 said:
Apple never said that 32-bit chips COULDN'T do double precision, they said (according to that quote in the thread earlier) that it takes them MULTIPLE clock cycles to do the work.

Exactly!!

...Which is BULLSHIT, because 32-bit processors have had double-precision math units for friggin AGES, INCLUDING APPLE'S OWN SYSTEMS. I'm pretty sure even the old-old 68020/30-based Macintoshes with 68881/2 FPUs had double-precision math capabilities. In any case the 68040 Quadra line had it, and the PowerPC line has had it forever. These chips do NOT need to cycle anything multiple times through the FPU to do double precision; they'll issue one instruction that does all the work because the register size in the FPU is 64 bits.

So it would seem that the more arduous its mathematical workload, the more timesaving it should become, comparatively speaking of course. ;)

It has the possibility of being faster, but it's likely instruction latency is longer for a MAC instruction since it is doing more work. This could lead to bubbles in the instruction stream that slows overall performance. Also, if you're processing more data per time unit you need to have more bandwidth to deliver that data and remove the results... Lacking that, the FPU will simply stall. :p
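The "bubbles" point above is exactly why numeric code is commonly restructured to hide MAC latency: split one dependent accumulation chain into several independent accumulators, so the FPU always has a ready instruction while earlier ones are still in flight. A hedged Python sketch of that restructuring (Python itself won't show the speedup; the point is the shape of the dependency chains):

```python
def dot_single_acc(xs, ys):
    # One accumulator: every multiply-add depends on the previous result,
    # so a deep FPU pipeline sits idle (bubbles) between iterations.
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

def dot_four_acc(xs, ys):
    # Four independent accumulators: four multiply-add chains can be in
    # flight at once, hiding the instruction latency of a pipelined FPU.
    a0 = a1 = a2 = a3 = 0.0
    n = len(xs) - len(xs) % 4
    for i in range(0, n, 4):
        a0 += xs[i]     * ys[i]
        a1 += xs[i + 1] * ys[i + 1]
        a2 += xs[i + 2] * ys[i + 2]
        a3 += xs[i + 3] * ys[i + 3]
    for i in range(n, len(xs)):    # leftover elements
        a0 += xs[i] * ys[i]
    return a0 + a1 + a2 + a3

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.0, 4.0, 3.0, 2.0, 1.0]
print(dot_single_acc(xs, ys))  # 35.0
print(dot_four_acc(xs, ys))    # 35.0
```

On a real deeply pipelined FPU, a compiler or hand-tuner doing this unrolling is what turns a long-latency FMAC into one result per cycle.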
 
Brimstone said:
The VPU should end up with Vertex and Pixel shaders unified, with the CPU having Vertex shading also if a developer wants to use the CPU for that.

So if a developer wants to use all of the VPU's power for pixel shading, they can do so and then have the CPU work on vertex operations. But say the developer wants really complicated physics: they can have the CPUs work on that and have the VPU handle both pixel shading and vertex shading.

Megadrive1988 said:
that's nice flexibility. kinda like having the best of both the Emotion Engine from PS2 and the Vertex Shaders from Xbox.


ooooooooh........didn't think about it that way. Thanks for the explanations. :)
 
Guden Oden said:
Pepto-Bismol said:
a688 said:
Apple never said that 32-bit chips COULDN'T do double precision, they said (according to that quote in the thread earlier) that it takes them MULTIPLE clock cycles to do the work.

Exactly!!

...Which is BULLSHIT, because 32-bit processors have had double-precision math units for friggin AGES, INCLUDING APPLE'S OWN SYSTEMS.

I think you misunderstand the context of the quote.
The PPC7xx (G3) processors were capable of single cycle 32-bit FP, but 64-bit precision FP was performed as described in the Apple quote.

This is not competitive PR vs x86 PCs; it is directed towards their current customers, who are familiar with how Apple's previous processors have worked.

Your comments re: the memory subsystem limiting the realizable performance are obviously correct. Dual FPUs performing FMACs could require 2x(2 operands + 1 result), or 32 bytes read and 16 bytes written per clock cycle. At 2.5 GHz.... Which is why the advantages of the 970 FPUs aren't obvious on all codes, but only those where the large general register set or cache residency can be brought to bear, reducing traffic to main memory.
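The arithmetic behind that "At 2.5 GHz...." ellipsis is worth writing out. A back-of-the-envelope Python sketch, using the post's own assumptions (two FPUs, 64-bit operands, 2 sources + 1 result per FMAC; note that a full PowerPC fmadd actually reads three source operands, which would push the read figure higher still):

```python
# Peak operand traffic for dual FPUs each retiring one FMAC per cycle.
# Assumes the post's 2-source + 1-result accounting per FMAC.
fpus      = 2
clock_hz  = 2.5e9            # 2.5 GHz
operand_b = 8                # 64-bit double = 8 bytes

reads_per_cycle  = fpus * 2 * operand_b   # 32 bytes read per cycle
writes_per_cycle = fpus * 1 * operand_b   # 16 bytes written per cycle

print(reads_per_cycle * clock_hz / 1e9)   # 80.0 GB/s of operand reads
print(writes_per_cycle * clock_hz / 1e9)  # 40.0 GB/s of result writes
```

Roughly 120 GB/s of sustained operand traffic at peak, which no early-2000s memory subsystem can deliver from DRAM; hence the point about register and cache residency.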

As an aside, if the XBox2 processor cores are multithreaded (x2 in the Power5) that's a sure sign that it isn't a 970 derivative (which in turn is derived from the Power4) but is derived from the Power5.
 
Entropy said:
I think you misunderstand the context of the quote.

I don't think so. The wording of the quote is ambiguous enough to imply 32-bit CPUs either can't do double precision without faking it (using multiple instructions), or at least not single-cycle double precision - both being untrue. It's not some extra-special 64-bit-integer-only feature... Even the age-old Pentium had a single-cycle double-precision FPU, I believe.

This is not competitive PR vs x86 PCs; it is directed towards their current customers, who are familiar with how Apple's previous processors have worked.

To me it looks like FUD directed at 32-bit MPUs in general, i.e. typical behavior of Apple these days. Which is sad, really, because their 64-bit boxes are damn hot (no pun intended) and Apple should trust them to stand on their own merits without inventing non-existent advantages.

At 2.5 GHz.... Which is why the advantages of the 970 FPUs aren't obvious on all codes, but only those where the large general register set or cache residency can be brought to bear, reducing traffic to main memory.

Hm, registers are burnt through in no time and cache almost as fast with that kind of processing power, so unless the code is written to do a vast amount of calculations on a relatively small amount of data it won't come anywhere near peak efficiency. :) I would think a G5 would run Seti@Home much quicker (relatively speaking) than it would MP3 encoding for example. :) Heck, the entire work unit could easily fit inside L2 cache... :) Of course, the client uses a fair amount of buffer memory too so it couldn't entirely run out of cache, but it would be really nice to see a comparison between modern systems on Seti. :)

As an aside, if the XBox2 processor cores are multithreaded (x2 in the Power5) that's a sure sign that it isn't a 970 derivative (which in turn is derived from the Power4) but is derived from the Power5.

No it isn't! What makes you think that in the first place anyway? IBM alluded quite some time ago that work was going on to add both multi-core and multi-threading to the PPC9xx line; see the article at Ars Technica, I forget which one (the most recent blackpaper I believe).
 
passerby said:
Multiply/add has been around for a long, long, long time. I wouldn't be surprised if last generation's consoles have it(corrections welcome).
Last generation as in N64/PS1? As far as general CPUs go, I don't think so (afaik the MIPS ISA didn't have multiply-adds until the introduction of MIPS32/MIPS64, although some chips did have them prior to that through custom extensions (the R5900 in the EE, for instance)).
I'm not familiar with ISA for N64s vector units though, I'm sure ERP could help there.
Current gen have it across the board though (aside for Intel based stuff of course).

Entropy said:
The PPC7xx (G3) processors were capable of single cycle 32-bit FP, but 64-bit precision FP was performed as described in the Apple quote.
That isn't quite true though (referring to the PPC family in general). I don't know about G3 derivatives, but the actual PPC750 family has single-cycle 64-bit FP.

Guden said:
Hm, registers are burnt through in no time and cache almost as fast with that kind of processing power,
That's not much to do with processing power - properly leveraging a large register set can have huge impact on execution speed. Problem is most compilers aren't really up to the task at all...
 
Guden Oden said:
It has the possibility of being faster, but it's likely instruction latency is longer for a MAC instruction since it is doing more work. This could lead to bubbles in the instruction stream that slows overall performance.

Okay smarty pants, decipher this for us! Where is the gridlock? :p

Apple said:
Within the brains of the PowerPC G5 is more processing power than you’ve ever experienced from a desktop chip. Its massively parallel circuits are capable of handling multiple assorted tasks at the same time. Called an execution core, it’s where your Mac does all its thinking.

Derived from IBM’s 64-bit POWER series processors, the G5 offers two double-precision floating-point units, advanced branch prediction logic and a high-bandwidth frontside bus. To that superscalar, superpipelined execution core, Apple and IBM added the Velocity Engine to the design, so that every Mac OS X application could take advantage of vector processing. Additionally, the PowerPC G5 features processing innovations that optimize the flow of data and instructions.


L2 Cache
512K of L2 cache provides the core with ultrafast 64GBps access to data and instructions.


L1 Cache
Instructions are prefetched from the L2 cache into a large, direct-mapped 64K L1 cache at 64GBps. In addition, 32K of L1 data cache can prefetch up to eight active data streams simultaneously.


Fetch and Decode
As they are accessed from the L1 cache, up to eight instructions per clock cycle are fetched, decoded and divided into smaller, easier-to-schedule operations. This efficient preparation maximizes processing speed as instructions are dispatched into the execution core and data is loaded into the large number of registers behind the functional units.


Dispatch
Before instructions are dispatched into the functional units, they are arranged into groups of up to five. Within the core alone, the PowerPC G5 can track up to 20 groups at a time, or 100 individual instructions. This efficient group-tracking scheme enables the PowerPC G5 to manage an unusually large number of instructions “in flight”: 20 instructions in each of the five dispatch groups and an additional 100-plus instructions in the various fetch, decode and queue stages.


Queue
Once an instruction group is dispatched into the execution core, it is broken into individual instructions, which proceed to the appropriate functional unit. Each unit has its own dedicated queue, where multiple instructions are arranged for processing in whatever order is required.


Optimized Velocity Engine
The PowerPC G5 uses an optimized dual-pipelined Velocity Engine with two independent queues and dedicated 128-bit registers and data paths for efficient instruction and data flow. This vector processing unit accelerates data manipulation by applying a single instruction to multiple data at the same time, known as SIMD processing. The Velocity Engine in the PowerPC G5 uses the same set of 162 instructions as in the PowerPC G4, so it runs — and accelerates — existing Mac OS X applications already optimized for the Velocity Engine.


Load/Store
At the same time as instructions are queued, the load/store units load the associated data from L1 cache into the data registers behind the units that will be processing the data. After the instructions manipulate the data, these units store it back to L1 cache, L2 cache or main memory. Each functional unit is generously equipped with 32 registers that are 128-bit wide on the Velocity Engine and 64-bit wide on the floating-point units and the integer units. With two load/store units, the PowerPC G5 is able to keep these registers filled with data for maximum processing efficiency.


Condition Register
This special 32-bit register summarizes the states of the floating-point and integer units. The condition register also indicates the results of comparison operations and provides a means for testing them as branch conditions. By bridging information between the branch unit and other functional units, the condition register improves the flow of data throughout the execution core.


Three Component Branch Prediction Logic
The PowerPC G5 usually knows the answer before it asks the question, using branch prediction and speculative operation to increase efficiency. Like finishing someone else’s sentences, branch prediction anticipates which instruction should go next, and speculative operation causes that instruction to be executed. If the prediction is correct, the processor works more efficiently — since the speculative operation has executed an instruction before it’s required, as with a conversation that seems to be a mind meld. If the prediction is incorrect, the processor must clear the unneeded instruction and associated data, resulting in an empty space called a pipeline bubble. Pipeline bubbles reduce performance as the processor marks time waiting for the next instruction, not unlike wasting time hearing how very wrong your assumptions were. The G5 can predict branch processes with an accuracy of up to 95%, allowing the chip to efficiently use every processing cycle.


Complete
When operations on the data are complete, the PowerPC G5 recombines the instructions into the original groups of five and the load/store units store the data in cache or main memory for further processing.


Source: Apple
 
Fafalada said:
Entropy said:
The PPC7xx (G3) processors were capable of single cycle 32-bit FP, but 64-bit precision FP was performed as described in the Apple quote.
That isn't quite true though (referring to the PPC family in general). I don't know about G3 derivatives, but the actual PPC750 family has single-cycle 64-bit FP.

I didn't quite trust my memory so I checked it out.
It appears that the FP-unit may have been changed during the life-time of the "G3". For the last iteration from Motorola, the MPC755, these are the timings:
— Three-cycle latency, one-cycle throughput, single-precision multiply-add
— Three-cycle latency, one-cycle throughput, double-precision add
— Four-cycle latency, two-cycle throughput, double-precision multiply-add
Checking through IBM's tech docs didn't reveal any changes to the FPU, although changes to the cache subsystem are bound to have improved real-life performance significantly.

Re:compilers, Apple and IBM are pushing an effort which should provide reasonably decent auto-vectorization for gcc. Quite remarkable to see them spend that kind of effort on non-proprietary software, even though the strategic value is undeniable. Bodes well for simple porting of FP heavy scientific/technical codes.
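Those latency/throughput pairs translate directly into cycle counts for a run of independent instructions: roughly latency + (n - 1) x throughput cycles for n of them. A quick Python illustration using the MPC755 figures quoted above (a simplified model that assumes no dependency stalls):

```python
def cycles(n, latency, throughput):
    # n independent instructions on a pipelined unit: the first result
    # appears after `latency` cycles, then one result completes every
    # `throughput` cycles thereafter.
    return latency + (n - 1) * throughput

# MPC755 timings from the post above, for 8 independent multiply-adds:
print(cycles(8, latency=3, throughput=1))  # 10 cycles, single precision
print(cycles(8, latency=4, throughput=2))  # 18 cycles, double precision
```

So on this core, double-precision multiply-add throughput is roughly half that of single precision once the pipeline is full, which matches the Apple quote's "multiple clock cycles" framing.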
 
Entropy said:
Fafalada said:
Entropy said:
The PPC7xx (G3) processors were capable of single cycle 32-bit FP, but 64-bit precision FP was performed as described in the Apple quote.
That isn't quite true though (referring to the PPC family in general). I don't know about G3 derivatives, but the actual PPC750 family has single-cycle 64-bit FP.

I didn't quite trust my memory so I checked it out.
It appears that the FP-unit may have been changed during the life-time of the "G3". For the last iteration from Motorola, the MPC755, these are the timings:
— Three-cycle latency, one-cycle throughput, single-precision multiply-add
— Three-cycle latency, one-cycle throughput, double-precision add
— Four-cycle latency, two-cycle throughput, double-precision multiply-add
Checking through IBM's tech docs didn't reveal any changes to the FPU, although changes to the cache subsystem are bound to have improved real-life performance significantly.

Re:compilers, Apple and IBM are pushing an effort which should provide reasonably decent auto-vectorization for gcc. Quite remarkable to see them spend that kind of effort on non-proprietary software, even though the strategic value is undeniable. Bodes well for simple porting of FP heavy scientific/technical codes.

Gekko (a PowerPC 75x CPU) can do two parallel 32-bit FP MADD operations at a time with its 64-bit FPU: I would think it can do a 64-bit ADD in a single cycle as well as a 64-bit MADD in a single cycle.

Edit: in a single cycle... yeah... that is "pipelined" ( 1 cycle throughput ).
 
It has the possibility of being faster, but it's likely instruction latency is longer for a MAC instruction since it is doing more work. This could lead to bubbles in the instruction stream that slows overall performance.

Nah... Pretty much every RISC design out there can do single-cycle FMAC operations... At least single precision, some incur more latency for double precision, while others do not...

Dual FPUs performing FMACs could require 2x(2 operands + 1 result) or 32 bytes read and 16 bytes written per clock cycle. At 2.5 GHz.... Which is why the advantages of the 970 FPUs aren't obvious on all codes, but only those where the large general register set or cache residency can be brought to bear, reducing traffic to main memory.

Actually with PowerPC you need to load 3 operands to do a full FMAC (which all totalled will eat 4 registers, but you preserve the data in all of them). And that amount of loading isn't really a problem for the 970 (it has 2 load/store units); the real trick is trying to saturate the FPUs within the limitation of the dispatch groups.

That isn't quite true though (referring to the PPC family in general). I don't know about G3 derivatives, but the actual PPC750 family has single-cycle 64-bit FP.

While this is true for some instructions, it's not true for all... The 750 series has single-cycle double-precision instructions except for fmadd, fmsub, and fmul (which all take 2 cycles; IIRC there's also a stall incurred after something like 4-5 sequential fmadds are executed), and of course that doesn't include the typically slow fdiv, or other instructions like fsqrt/fsqrts/fsqrte. Motorola eliminated those double-precision latencies in the G4s (although IIRC the sequential fmadd stall still exists) and I believe IBM has in the G5s (not sure about the actual execution times since the 970 is much deeper, but FMAC instructions, while complex, are not cracked).

That's not much to do with processing power - properly leveraging a large register set can have huge impact on execution speed. Problem is most compilers aren't really up to the task at all...

Yes, this is especially true of compilers born from register-starved accumulator ISAs [cough]x86/gcc[/cough]... Never mind even leveraging the large number of rename registers on some of today's deep OOE designs (although vendor-provided compilers tend to do a better job, e.g. icc, xlc)...

Okay smarty pants, decipher this for us! Where is the gridlock?

Well, aside from optimizing for dispatch groups, the biggest surprises are the changes to AltiVec. Leveraging the DST instructions (software-directed hardware prefetch) and some of the DCB instructions (data cache blocking, namely DCBZ, DCBI and DCBA) that are so crucial for performance optimization on the G4s causes lots of performance problems on the G5s...

I didn't quite trust my memory so I checked it out.
It appears that the FP-unit may have been changed during the life-time of the "G3". For the last iteration from Motorola, the MPC755, these are the timings:

Yes, the 755 was sort of like Moto's testbed for the improvements in the 74xx FPUs...

Quite remarkable to see them spend that kind of effort on non-proprietary software, even though the strategic value is undeniable. Bodes well for simple porting of FP heavy scientific/technical codes.

Why is it so remarkable? Have you seen the effort IBM has put behind Java, Linux and other open-source projects? And have you taken a look at OS X? The underlying OS, Core Foundation, IOKit, etc. are all open source (not to mention the bundling of a lot of open-source tools and apps with it). Then there's other portions like the Darwin Streaming Server, and WebKit (the HTML/JavaScript engine derived from KDE's KHTML)...

Gekko (a PowerPC 75x CPU) can do two parallel 32-bit FP MADD operations at a time with its 64-bit FPU: I would think it can do a 64-bit ADD in a single cycle as well as a 64-bit MADD in a single cycle.

Gekko suffers all the same limitations of the 750. However, its packed instructions are all still single cycle...
 
archie4oz said:
Nah... Pretty much every RISC design out there can do single-cycle FMAC operations...

But it doesn't take a single cycle for the FPU to complete the operation from start to finish. The floating-point execution unit is typically even more deeply pipelined than the integer execution unit...
 