CELL Patents (J Kahle): APU, PU, DMAC, Cache interactions?

Jaws said:
  • "Btw, the PPC 400 series IP has been sold by IBM to AMCC : Source... "
  • "So in conclusion...
    ...it's not the PPC 400 series, IBM has sold the IPs as mentioned above..."

Please read your own link:

  • IBM press release on the AMCC deal (http://www-306.ibm.com/chips/news/2004/0413_power.html): "IBM also will continue to develop and use PowerPC 400 series embedded processor cores as building blocks for application specific integrated circuits (ASICs), SoC and other highly customized logic chips."
 
Vince said:
Jaws said:
  • "Btw, the PPC 400 series IP has been sold by IBM to AMCC : Source... "
  • "So in conclusion...
    ...it's not the PPC 400 series, IBM has sold the IPs as mentioned above..."

Please read your own link:

  • IBM press release on the AMCC deal (http://www-306.ibm.com/chips/news/2004/0413_power.html): "IBM also will continue to develop and use PowerPC 400 series embedded processor cores as building blocks for application specific integrated circuits (ASICs), SoC and other highly customized logic chips."

Bugger....I should stop skim reading every other paragraph and just stick to every other line! ;) Thanks for pointing that out :p

So...in hindsight...etc etc...*cough, cough*

...the PPC 400 series and 300 series are still candidates for Cell PUs and Xenon cores...
 
Okay, looking at the PPC 400 and 300 series...

If we assume the Xe CPU at ~ 200 mm2 to be a decent order of magnitude for the die size, and if Xe is indeed to have 3 cores at 90 nm process and if a PPC 440 core ~ 10 mm2 at 130 nm, then,

3 PPC 440 cores at 90 nm ~ 15 mm2 <<< 200 mm2

So we could safely discard the PPC 440 for Xe cores.
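
For reference, here's the scaling arithmetic behind that ~15 mm2 figure as a quick sketch. It assumes ideal square-law area scaling with feature size, which real shrinks never quite achieve:

```python
# Idealized die-area scaling: area shrinks with the square of the feature size.
# Real-world shrinks are never this clean -- this is just ballpark arithmetic.

def scale_area(area_mm2, from_nm, to_nm):
    """Scale a die area between process nodes (ideal square-law assumption)."""
    return area_mm2 * (to_nm / from_nm) ** 2

core_130nm = 10.0                                   # PPC 440 core: ~10 mm^2 at 130 nm
core_90nm = scale_area(core_130nm, 130, 90)
print(f"1 PPC 440 core at 90 nm : ~{core_90nm:.1f} mm^2")      # ~4.8 mm^2
print(f"3 PPC 440 cores at 90 nm: ~{3 * core_90nm:.1f} mm^2")  # ~14.4 mm^2 << 200 mm^2
```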

As for the Cell PUs, the PPC 440 cores are ageing and were not designed to clock > 1GHz, and if the PPC 300 series is its replacement, debuting at 65 nm as the rumours suggest, then we could also safely discard the 440 as a likely candidate. The 300 series would seem more plausible, especially as its roadmap suggests a 45 nm process as well.

So in conclusion etc etc, again...etc etc..

The PPC 300/350 series look like very good candidates for the Cell PUs and Xenon cores, but this would also suggest Xe CPU at a 65 nm process :?:
 
...it's not the PPC 400 series, IBM has sold the IPs as mentioned above...

Keep in mind the 401-405 are different beasts from the 440. The former are actually 603-based cores, and the latter is a different, Book E-compliant core.

If we assume the Xe CPU at ~ 200 mm2 to be a decent order of magnitude for the die size, and if Xe is indeed to have 3 cores at 90 nm process and if a PPC 440 core ~ 10 mm2 at 130 nm, then,

3 PPC 440 cores at 90 nm ~ 15 mm2 <<< 200 mm2

you're making assumptions about an entire CPU from pretty bare cores... You're not adding the additional real estate for the FPUs, vector extensions, SMT logic, or the chip/core integration logic...

As for the Cell PUs, the PPC 440 cores are ageing and were not designed to clock > 1GHz

Another false assumption... The only thing preventing 440 derivatives from achieving higher clock speeds has been lack of demand... The 750 is a meager 4 stage design and it achieves 1.1 GHz...

Of course you seem to be looking at this strictly as something from off-the-shelf components, when you could end up with a totally customized one-off design (not uncommon with IBM), or the derivative being quite radical... (sort of like how the Pentium M is a radical derivative of the P6 core).
 
Vince said:
Please read your own link:

  • IBM press release on the AMCC deal (http://www-306.ibm.com/chips/news/2004/0413_power.html): "IBM also will continue to develop and use PowerPC 400 series embedded processor cores as building blocks for application specific integrated circuits (ASICs), SoC and other highly customized logic chips."

PPC4XX IP is not shared between those two. It only means IBM continues to work as the foundry, making PPC4XX processors on demand for AMCC and others as before to reassure current customers, nothing more. In other words, PPC4XX is no longer in IBM's strategic IP portfolio. At the least, Sony won't want PPC4XX in the Cell SoC without a cross-license with AMCC, which would add unwanted risk. Then again, as I wrote in another thread, IBM couldn't license PPC4XX to Microsoft. In the press release of the MS-IBM agreement the licensor is not AMCC or Motorola but IBM.
 
archie4oz said:
If we assume the Xe CPU at ~ 200 mm2 to be a decent order of magnitude for the die size, and if Xe is indeed to have 3 cores at 90 nm process and if a PPC 440 core ~ 10 mm2 at 130 nm, then,

3 PPC 440 cores at 90 nm ~ 15 mm2 <<< 200 mm2

you're making assumptions about an entire CPU from pretty bare cores... You're not adding the additional real estate for the FPUs, vector extensions, SMT logic, or the chip/core integration logic...

Yeah... I was just making a quick point about the order of magnitude. If we did add the above, then we'd probably be looking at around ~50 mm2 <<< 200 mm2. There'd still be a significant difference.

archie4oz said:
As for the Cell PUs, the PPC 440 cores are ageing and were not designed to clock > 1GHz

Another false assumption... The only thing preventing 440 derivatives from achieving higher clock speeds has been lack of demand... The 750 is a meager 4 stage design and it achieves 1.1 GHz...

Okay... but the 750 != 440. Also, according to this 2004 IBM PowerPC roadmap, the fastest PPC 440 is ~667 MHz. Still substantially below the fastest clockers at ~2 GHz. This is what I was basing my assumption on. ;)

archie4oz said:
Of course you seem to be looking at this strictly as something form off-the-shelf components when you could end up with totally customized one off design (not uncommon with IBM), or the derivative been quite radical... (sort've like how the Pentium M is a radical derivative of the P6 core).

Yeah... just to gauge the possibilities. The IBM off-the-shelf processor line is so 'incestuous' that you can trace basically any core to any other core! :) We'd prolly get debates similar to the Xbox CPU 'is it a Celeron or is it a PIII' argument, etc! :) Relative likelihood, IMO, if the PPC300 is real,

Xe core: PPC3xx>PPC9xx>Power4>PPC7xx>PPC4xx>Power5.

Cell PU core: custom core> PPC3xx>PPC4xx.
 
Yeah... I was just making a quick point about the order of magnitude. If we did add the above, then we'd probably be looking at around ~50 mm2 <<< 200 mm2. There'd still be a significant difference.

Are you sure that's what the additional logic would add up to? You're still forgetting the L1s (32KB/32KB, which aren't included in 440 die-size calculations because it's variable), the 1MB L2 (which I don't even think the 440 supports), security components, the memory controller, DMA controllers...

Okay... but the 750 != 440. Also, according to this 2004 IBM PowerPC roadmap, the fastest PPC 440 is ~667 MHz. Still substantially below the fastest clockers at ~2 GHz. This is what I was basing my assumption on.

You totally missed the point... The 440 has more headroom than the 750 (while the 750 is truckin' along @1.1GHz). The main reason the 750s are available at higher clocks is that IBM had customer demand for them...

Speaking of which, I think I need to reiterate someone else's post about reading your own links (the roadmap shows 440s available up to 800MHz), and it's not very accurate anyway, because there are faster 750FXs and GXs than that roadmap lists...

The IBM off-the-shelf processor line is so 'incestuous' that you can trace basically any core to any other core!

You wanna try tracing a 970 to a 750? Good luck...

We'd prolly get debates similar to the Xbox CPU 'is it a Celeron or is it a PIII' argument, etc!

There's nothing to debate, it's a mobile Celeron...
 
archie4oz said:
Yeah... I was just making a quick point about the order of magnitude. If we did add the above, then we'd probably be looking at around ~50 mm2 <<< 200 mm2. There'd still be a significant difference.

Are you sure that's what the additional logic would add up to? You're still forgetting the L1s (32KB/32KB, which aren't included in 440 die-size calculations because it's variable), the 1MB L2 (which I don't even think the 440 supports), security components, the memory controller, DMA controllers...

I did some rough calculations earlier in this thread to get a rough estimate of the BE die area using this PSX core below,

[Image: EE-GS.jpg (PS2 EE+GS die shot)]


Didn't use anything fancy, just looked at the EE with its 3 major cores (VU0, VU1, MIPS core), DMA, registers, cache, etc... I also worked out earlier in this thread that 4MB of L3 cache at 65 nm is ~14 mm2, so 1MB at 90 nm is ~7 mm2.

So the approx. ~50 mm2 for 3 PPC 440 cores is just an order-of-magnitude estimate from that diagram, to compare against 200 mm2. Nothing tremendously accurate or anything... if I get any time I may attempt an accurate go! But it's just to illustrate that, IMO, 3 PPC 440 cores at 90nm <<< 200 mm2. So either MS is looking for a relatively cheap-to-manufacture CPU with a compact die, or they are using different cores, i.e. the rumoured PPC300 series, for more 'ooomph' with a larger die intended for a smaller process and lower heat generation...
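
The cache figure follows the same idealized scaling. A quick sketch, assuming SRAM area scales linearly with capacity and with the square of the feature size:

```python
# Same idealized square-law scaling, applied to the cache estimate above.

def scale_area(area_mm2, from_nm, to_nm):
    return area_mm2 * (to_nm / from_nm) ** 2

l3_4mb_65nm = 14.0                        # earlier estimate: 4 MB at 65 nm ~ 14 mm^2
per_mb_65nm = l3_4mb_65nm / 4             # linear in capacity: ~3.5 mm^2 per MB
per_mb_90nm = scale_area(per_mb_65nm, 65, 90)
print(f"1 MB at 90 nm: ~{per_mb_90nm:.1f} mm^2")   # ~6.7 mm^2, i.e. the ~7 mm^2 above
```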

archie4oz said:
Okay... but the 750 != 440. Also, according to this 2004 IBM PowerPC roadmap, the fastest PPC 440 is ~667 MHz. Still substantially below the fastest clockers at ~2 GHz. This is what I was basing my assumption on.

You totally missed the point... The 440 has more headroom than the 750 (while the 750 is truckin' along @1.1GHz). The main reason the 750s are available at higher clocks is that IBM had customer demand for them...

I think we're trying to make two different points here. I'm trying to eliminate the PPC 440 as the cores for the Xe CPU by implying that they've never been clocked anywhere near 3GHz. And you're saying that they have the capacity to do this? Maybe they can, but it seems to me that they were not designed to approach those speeds (or have never been tested anywhere near those speeds in the real world), hence my point that they'll use a different core, i.e. the rumoured replacement 300 series or something else designed for those clock ranges.

archie4oz said:
Speaking of which, I think I need to reiterate someone else's post about reading your own links (the roadmap shows 440s available up to 800MHz), and it's not very accurate anyway, because there are faster 750FXs and GXs than that roadmap lists...

Bugger... again! Bah... what's a few GHz here and there! ;) Thanks for pointing that out! But my above point still stands...

*Mental note: don't wear shades indoors!* *Mental note 2: never assume last figure is highest! * *Mental note 3: never run with scissors!*

archie4oz said:
The IBM off-the-shelf processor line is so 'incestuous' that you can trace basically any core to any other core!

You wanna try tracing a 970 to a 750? Good luck...

How about both their forefathers are the Power1 ;)
 
Didn't use anything fancy, just looked at the EE with its 3 major cores (VU0, VU1, MIPS core), DMA, registers, cache, etc... I also worked out earlier in this thread that 4MB of L3 cache at 65 nm is ~14 mm2, so 1MB at 90 nm is ~7 mm2.

If you're going to use Sony parts, then a more comparable estimate would be 3 EE cores (with VU0)...

So the approx. ~50 mm2 for 3 PPC 440 cores is just an order-of-magnitude estimate from that diagram, to compare against 200 mm2.

I'd guess more around 90-110mm² myself... But why the infatuation with 200mm²?

I'm trying to eliminate the PPC 440 as the cores for the Xe CPU by implying that they've never been clocked anywhere near 3GHz

Why are you trying to eliminate the 440? You don't like it or something?

And you're saying that they have the capacity to do this? Maybe they can, but it seems to me that they were not designed to approach those speeds (or have never been tested anywhere near those speeds in the real world), hence my point that they'll use a different core, i.e. the rumoured replacement 300 series or something else designed for those clock ranges.

I'm just saying the 440 has more architectural headroom than the 750... And the 750 isn't any more engineered for >1GHz speeds than the 440 is... All I'm saying is look more towards IBM's embedded dual-issue designs as a starting platform rather than their massive, wide OOE server/workstation designs...

How about both their forefathers are the Power1

No... They're not even binary compatible...
 
archie4oz said:
I'm just saying the 440 has more architectural headroom than the 750... And the 750 isn't any more engineered for >1GHz speeds than the 440 is... All I'm saying is look more towards IBM's embedded dual-issue designs as a starting platform rather than their massive, wide OOE server/workstation designs...

Strictly speaking these CPUs also do OoO execution, albeit to a very limited degree.

I think you're right when you say a narrow superscalar core is the more likely. But I don't think it'll be based off either PPC 440 or 750.

Throughput matters in a console, so I think we'll see a new design with much longer pipelines (like the 970), but 2- (or 3-) way superscalar. Also, you'd want your OOO capabilities to hide the latency from your caches (at least to level 2). So if we assume level 2 cache latency will be 20 cycles, you'd want at least 40 instructions in your OOO scheduling window in a 2-way superscalar, which is beyond what the 750 or 440 can do.
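
The window arithmetic spelled out, with the 20-cycle L2 latency as the stated assumption:

```python
# Instructions that must be in flight to hide a cache access completely:
#   window >= issue_width * latency_cycles
issue_width = 2        # 2-way superscalar
l2_latency = 20        # assumed L2 hit latency, in cycles
window = issue_width * l2_latency
print(f"minimum OOO scheduling window: {window} instructions")  # 40
```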

Cheers
Gubbi
 
Gubbi said:
archie4oz said:
I'm just saying the 440 has more architectural headroom than the 750... And the 750 isn't any more engineered for >1GHz speeds than the 440 is... All I'm saying is look more towards IBM's embedded dual-issue designs as a starting platform rather than their massive, wide OOE server/workstation designs...

Strictly speaking these CPUs also do OoO execution, albeit to a very limited degree.

I think you're right when you say a narrow superscalar core is the more likely. But I don't think it'll be based off either PPC 440 or 750.

Throughput matters in a console, so I think we'll see a new design with much longer pipelines (like the 970), but 2- (or 3-) way superscalar. Also, you'd want your OOO capabilities to hide the latency from your caches (at least to level 2). So if we assume level 2 cache latency will be 20 cycles, you'd want at least 40 instructions in your OOO scheduling window in a 2-way superscalar, which is beyond what the 750 or 440 can do.

Cheers
Gubbi

From Ace's Hardware

Brian Neal [Ace's Hardware]: Do you see out-of-order execution (OOOE) to be complementary to on-chip multithreading or is that something that's kind of obsolete or not worth the logic?

Dr. Marc Tremblay: I think there are various forms of out-of-order execution and the stuff that tries to speculate in parallel is not really well suited because of the power requirements. It's really burning power and speculating to try to optimize single-thread performance at the expense of running other stuff that's non-speculative. So, we'll have to be very careful about applying that old style architecture to the CMTs.

Now notice that for x86, you don't have much of a choice. [People] keep saying RISC technique won the war against CISC. Obviously CISC chips are doing just fine considering how well Intel is doing, but Intel had to go to hardware translation to go from CISC to RISC and basically they have a RISC pipeline using micro-ops. The enabler for that was advanced branch prediction techniques that allowed them to stretch the front-end of the pipeline to be able to do translation over multiple cycles or to introduce a trace cache to cache the translation and so on, and then they run the RISC pipeline in the back-end. The ordering of instructions is unrelated almost to how software scheduled them originally, at least the micro-ops are, so therefore an out-of-order engine in that space makes a lot of sense, although it does cost Intel a lot of power.

Chris Rijk [Ace's Hardware]: To clarify slightly, with regards to out-of-order, it's not so much you're necessarily against it inherently, it's just to some extent more specific implementations, particularly reducing speculation overhead?

Dr. Marc Tremblay: Yes.

http://www.aceshardware.com/read.jsp?id=55000248
 
Marc Tremblay is a microprocessor architect who works for Sun. Sun has had inferior single-thread performance (compared to PA-RISC, Alpha, x86, heck anybody) for well over a decade now.

A large part of that has to do with Sun's insistence on making statically scheduled (in-order execution) superscalar CPUs.

As for the
Dr. Marc Tremblay: I think there are various forms of out-of-order execution and the stuff that tries to speculate in parallel is not really well suited because of the power requirements. It's really burning power and speculating to try to optimize single-thread performance at the expense of running other stuff that's non-speculative. So, we'll have to be very careful about applying that old style architecture to the CMTs.

Here he's just confused, mixing up OOO execution with conditional execution (which facilitates eager execution). The commercial processor family that facilitates conditional execution the most is IA64, and all implementations of it (both of them) are in-order executing superscalars.

For OOO CPUs, the capability to schedule and execute instructions past ones that would stall an in-order CPU on data dependencies of course allows an OOO CPU to speculate further than an in-order one, wasting more work (throwing more away) when, e.g., a branch is mispredicted. However, in-order CPUs waste power when they hang on data dependencies, so I think that is a toss-up. The real boon of OOO execution is a much higher tolerance to memory latency.
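
A toy model of that last point, purely illustrative (made-up latencies, not any real machine): the in-order core serializes behind a long-latency load, while an idealized OOO core overlaps the independent work with it.

```python
# Toy comparison: in-order vs (idealized) out-of-order on a cache miss.
# The load takes 10 cycles; every other op takes 1 cycle.
# Instruction format: (op, latency, depends_on_load)
stream = [("load r1",    10, False),
          ("add r2, r1",  1, True),    # needs the load's result
          ("mul r3, r4",  1, False),   # independent
          ("sub r5, r6",  1, False)]   # independent

# In-order, single-issue: every op waits for the previous one to complete.
in_order_cycles = sum(lat for _, lat, _ in stream)              # 13

# Idealized OOO: the independent ops execute under the load's shadow;
# only the dependent add has to wait until the load completes at cycle 10.
ooo_cycles = 10 + 1                                             # 11

print(f"in-order: {in_order_cycles} cycles, OOO: {ooo_cycles} cycles")
```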

Cheers
Gubbi
 
Gubbi said:
Marc Tremblay is a microprocessor architect who works for Sun. Sun has had inferior single-thread performance (compared to PA-RISC, Alpha, x86, heck anybody) for well over a decade now.

A large part of that has to do with Sun's insistence on making statically scheduled (in-order execution) superscalar CPUs.

As for the
Dr. Marc Tremblay: I think there are various forms of out-of-order execution and the stuff that tries to speculate in parallel is not really well suited because of the power requirements. It's really burning power and speculating to try to optimize single-thread performance at the expense of running other stuff that's non-speculative. So, we'll have to be very careful about applying that old style architecture to the CMTs.

Here he's just confused, mixing up OOO execution with conditional execution (which facilitates eager execution). The commercial processor family that facilitates conditional execution the most is IA64, and all implementations of it (both of them) are in-order executing superscalars.

For OOO CPUs, the capability to schedule and execute instructions past ones that would stall an in-order CPU on data dependencies of course allows an OOO CPU to speculate further than an in-order one, wasting more work (throwing more away) when, e.g., a branch is mispredicted. However, in-order CPUs waste power when they hang on data dependencies, so I think that is a toss-up. The real boon of OOO execution is a much higher tolerance to memory latency.

Cheers
Gubbi

Things will probably be different on the XboX NeXt CPU, since it looks to be designed with thread-level parallelism in mind. Sun's Niagara is targeted at the server market, so obviously it won't reflect the true nature of the Microsoft CPU, although it will probably take advantage of many of Niagara's tricks to a lesser degree.

Once again I'll quote some information from Ace's Hardware.

Niagara's branch predictor probably only has simple static branch prediction logic. If there are 4 threads on a core and all are ready to execute then when a branch is encountered, it doesn't hurt to use simple logic to guess which way the branch will go. If the guess is right, execution continues normally, but if the guess is wrong, then execution switches to another thread.

...

By making use of other threads, Niagara needs only very simple logic to help handle branches well. This has several benefits, including reduced design and test time, less logic, smaller chip size, lower power and a shorter pipeline. Not only that, but the cost of branches becomes very small, improving performance. This is an example of how TLP optimised designs can be naturally efficient.

...



What About Branches on Super-scalar SMT Designs?

Though Niagara is the focus of this article, let's take a brief pause to look at how super-scalar multi-threaded CPU cores could compare.

Niagara probably has quite a short pipeline, and a branch miss may only cost 5 or 6 cycles. In addition, since Niagara executes only 1 instruction per cycle, that means only 5 or 6 instructions are lost in pre-fetching, decode and so on. On Intel's "Prescott" Pentium 4 E, a branch miss-prediction costs 30 cycles, and with 3 instructions per pipeline stage per cycle, that's an awful lot of instructions that could depend on one branch prediction. Clearly things are rather different here.
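
The slot arithmetic behind that comparison, using the article's own numbers:

```python
# Issue slots wasted by one branch mispredict ~= penalty_cycles * issue_width
niagara_slots  = 6 * 1     # ~5-6 cycle penalty, 1 instruction/cycle
prescott_slots = 30 * 3    # 30 cycle penalty, 3 instructions/cycle
print(f"Niagara: ~{niagara_slots} slots, Prescott: ~{prescott_slots} slots")  # 6 vs 90
```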

...
Speculation Mode


Super-Scalar, But Not as You Know It

Super-scalar processor design is an old idea. The basic idea is simple - a single CPU has multiple function units, so why not use them in parallel? E.g., do an "add" in the same cycle as a "load" or some other instruction. In its most basic form, 2-way in-order super-scalar, this requires four main changes to work. Firstly, the whole pipeline must be able to process 2 instructions per cycle. Secondly, the CPU registers must be able to handle reads and writes from 2 instructions per cycle. Thirdly, some logic is needed to determine what functional units are available for parallel processing. Finally, the instruction issue part of the pipeline must be able to extract ILP from the instruction stream so that issuing two instructions in parallel does not cause processing to change.

However, extracting much more ILP than this from the instruction stream is very inefficient, which is why most high-performance CPUs are 3-way or 4-way. But Niagara doesn't have to follow the same old pattern as Niagara is explicitly designed to process multiple threads. Instructions from different threads can always be issued in parallel, so long as they use different functional units. So a future Niagara based design could issue 2 instructions per cycle, from 2 different threads. In other words, the maximum IPC of a single thread will not exceed 1, but the IPC per CPU core will now be a maximum of 2.

This will require a double-width pipeline to sustain 2 instructions per cycle and some logic to determine what functional units are available for parallel processing. This might require adding a stage or two to the existing pipeline. However, it will not require logic to find available ILP or to increase the number of register ports. So not only is it easier to issue multiple instructions per cycle this way compared to a traditional super-scalar design, but less logic is required.
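
A minimal sketch of that issue rule (hypothetical logic, not Sun's actual design): take the next instruction from each of two threads and issue both in the same cycle only when they need different functional units.

```python
# Hypothetical dual-issue-from-two-threads logic, per the article's description:
# two instructions go out together only if their functional units differ.

def issue(thread_a, thread_b):
    """Each thread is a list of (op, functional_unit); returns per-cycle issue groups."""
    schedule = []
    while thread_a or thread_b:
        ia = thread_a[0] if thread_a else None
        ib = thread_b[0] if thread_b else None
        if ia and ib and ia[1] != ib[1]:      # no resource conflict: issue both
            schedule.append((ia[0], ib[0]))
            thread_a.pop(0); thread_b.pop(0)
        elif ia:                              # conflict (or B drained): issue A alone
            schedule.append((ia[0], None))
            thread_a.pop(0)
        else:
            schedule.append((None, ib[0]))
            thread_b.pop(0)
    return schedule

a = [("add", "ALU"), ("ld",  "LSU")]
b = [("mul", "FPU"), ("add", "ALU")]
print(issue(a, b))  # [('add', 'mul'), ('ld', 'add')] -- core IPC 2, per-thread IPC still 1
```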

I wanted to point something out about the "leaked" Xbox 2 specs and this guesswork on future Sun designs by Ace's Hardware. The leak states:

"The Xenon CPU is a custom processor based on PowerPC technology. The CPU includes three independent processors (cores) on a single die. Each core runs at 3.5+ GHz. The Xenon CPU can issue two instructions per clock cycle per core. At peak performance, Xenon can issue 21 billion instructions per second."

http://news.gamewinners.com/index.php/news/1225/
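
The peak figure in the leak is just the product of those three numbers:

```python
cores, clock_ghz, issue_per_clock = 3, 3.5, 2
peak = cores * clock_ghz * issue_per_clock
print(f"peak issue rate: {peak} billion instructions/sec")  # 21.0
```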

I'm curious whether IBM is going to implement simultaneous multithreading or, like Sun, coarse-grained multithreading (which Sun also calls "Vertical Multithreading").


The leak also states:

"Each core has two symmetric hardware threads (SMT), for a total of six hardware threads available to games. Not only does the Xenon CPU include the standard set of PowerPC integer and floating-point registers (one set per hardware thread), the Xenon CPU also includes 128 vector (VMX) registers per hardware thread. This astounding number of registers can drastically improve the speed of common mathematical operations."

With a large number of registers, it would seem to make sense to take Sun's vertical multithreading approach instead of the traditional Power5 simultaneous multithreading method.

A future Niagara based design?
There is however, one new aspect, which requires a design trade-off. With 4 threads and up to 2 instructions per cycle, the instruction issue logic could either look for 2 instructions to issue from any 2 of the 8 threads, or 2 specific threads. The former will help maximise IPC, while the latter is simpler. At a guess, I think there would not be much performance difference between the two, so the simpler solution would be best.

The easiest way to implement this would be to simply duplicate the front-end of the pipeline, with some minor changes at instruction fetch and the logic actually issuing the instructions to the execution units. The current Niagara design has one active thread at a time, and switches between them on stalls. This possible dual-pipeline Niagara design would still have 1 active thread per pipeline, and switch in the same way. So the instruction issue logic would look at the next instruction from the active thread from each of the two pipelines and issue both if there are no resource conflicts. If each pipeline has 4 threads, that gives 8 threads per core, which would also help the average IPC, though also puts more pressure on the cache system.

http://www.aceshardware.com/read.jsp?id=65000297
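
A toy version of that switch-on-stall policy (illustrative only, not Sun's implementation): the active thread issues until one of its ops stalls, then the core rotates to the next thread with work left.

```python
# Toy switch-on-stall scheduler, per the article: one active thread at a time,
# switching to the next thread whenever the active one stalls.

def run(threads):
    trace, active = [], 0
    while any(threads):
        if not threads[active]:                 # this thread is done: rotate
            active = (active + 1) % len(threads)
            continue
        op, stalls = threads[active].pop(0)
        trace.append((active, op))
        if stalls:                              # e.g. cache miss: switch threads
            active = (active + 1) % len(threads)
    return trace

t0 = [("ld",  True), ("add", False)]
t1 = [("mul", False), ("ld",  True)]
print(run([t0, t1]))
# [(0, 'ld'), (1, 'mul'), (1, 'ld'), (0, 'add')]
```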
 
Strictly speaking these CPUs also do OoO execution, albeit to a very limited degree.

Yes, I know they have some limited OOE capabilities (I'm pretty intimate with 750s and 74xxs)...

I think you're right when you say a narrow superscalar core is the more likely. But I don't think it'll be based off either PPC 440 or 750.

Well, it depends on what you mean by 'based off of'... I mean, I can see them taking a 750 and stretching the snot out of its execution pipeline, increasing the rename resources and instruction window, enlarging the cache buffers, etc... Anyway, I sorta envision something more along the lines of Intrinsity's FastMIPS cores...
 
archie4oz said:
Didn't use anything fancy, just looked at the EE with 3 major cores, VU0, VU1, Mips core, DMA, registers, cache etc...also worked out earlier in this thread that 4MB L3 cache at 65 nm ~ 14mm2, so 1MB at 90 nm ~ 7mm2.

If you're going to use Sony parts, then a more comparable estimate would be 3 EE cores (with VU0)...

Okay...that would be more accurate...

archie4oz said:
So the approx. ~ 50 mm2 for 3 PPC 440 cores is just an order of magnitude estimation looking at that diagram to compare against 200mm2.

I'd guess more around 90-110mm² myself... But why the infatuation with 200mm²?

Won't argue with that figure... still <<< 200 mm2

I'm using the 200 mm2 as a yardstick at 90 nm. I'm kinda on the optimistic side! :) Earlier in this thread I was trying to show whether the PS3 BE was feasible for Sony at 65 nm, and the die area came out around ~300 mm2. The PS2 EE was 240 mm2 at 250 nm and the GS was 279 mm2 at 250 nm at launch.

Others may disagree, but I'm hedging my bet on Sony releasing the BE at 65 nm and then quickly moving to 45 nm, similar to what they did with the PS2 from 250 nm to 180 nm.

IMHO, I'm pretty sure Sony will not release a PS3 CPU under 200 mm2 at 65nm. If MS is on an older process for Xenon, then to compete with the PS3 CPU they will need something in the range of 200mm2 at 90nm at least. That would be comparable to 100mm2 at 65nm against Sony's 300mm2. A significant gap. So if we estimate any Xe dies <<< 200mm2, then it's unlikely to be realised unless MS goes for the cost-saving route instead of matching the PS3.
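
Same square-law yardstick as before, for the process-gap point:

```python
def scale_area(area_mm2, from_nm, to_nm):
    return area_mm2 * (to_nm / from_nm) ** 2

xe_at_65nm = scale_area(200.0, 90, 65)
print(f"200 mm^2 at 90 nm ~ {xe_at_65nm:.0f} mm^2 at 65 nm")  # ~104 mm^2, vs a ~300 mm^2 BE
```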


archie4oz said:
I'm trying to eliminate the PPC 440 as the cores for the Xe CPU by implying that they've never been clocked anywhere near 3GHz

Why are you trying to eliminate the 440? You don't like it or something?

Nothing against them; IIRC, they're being used in BlueGene/L 8) I'm trying to get a plausible shortlist of candidates for the Xe CPU cores from IBM. I've seen everything from the PPC 603 to the Power5+ bandied around! :oops:

Earlier in this thread, comments were made about the patents describing processors' local memories reading other processors' caches, and how both the Xe and PS3 CPUs could have a lot in common in that sense. But perhaps that could also extend to the Cell's PUs and the Xe cores, if IBM is developing a new 64-bit embedded PPC range?

archie4oz said:
How about both their forefathers are the Power1

No... They're not even binary compatible...

I have a feeling you've answered this question before! ;)

IIRC, the tops of their family trees would still trace back to the Power1, no? The bastard child was the 601, and the 32-bit PPCs followed from there to the 750. The 64-bit 620 was also ultimately derived from the 601 and was meant to give birth to the 970, but it wasn't successful or something, and IBM used the Power4 core instead to make the 64-bit PPC 970. If that's the case, their family trees still trace back to the Power1.
 
Brimstone said:
Speculation Mode


Super-Scalar, But Not as You Know It

<snip>

IMO, Sun are doomed. They have taken their eye off the single-thread performance ball and they are going to lose.

They make it sound like wide superscalars are a bad idea. They argue that most of your execution units will idle, since many programs have limited instruction-level parallelism (ILP). This is where SMT comes in. SMT takes advantage of the fact that you're likely to have an excess of execution units, with all that goes with it (lots of rename registers etc.). The reason why SMT is comparatively cheap (10-15% extra logic) is that all that is added is the capability of the OOO scheduling engine to track more than one context (plus registers for the extra context).

For the many programs with limited ILP, Sun's approach and a wide superscalar with SMT will do approximately the same, both keeping the majority of their execution units busy by executing multiple threads. But a significant fraction of programs will have lots of ILP, and these will fly like shit off a silver shovel on our wide superscalar compared to Sun's Niagara.
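
A back-of-the-envelope utilization model of that argument (toy numbers, nothing measured): low-ILP threads need SMT to fill a wide core, while a high-ILP program fills it on its own.

```python
# Toy issue-slot utilization for a 4-wide superscalar core.
# Each thread is characterized by the ILP it can sustain on its own.
WIDTH = 4

def utilization(thread_ilps):
    """Fraction of issue slots filled by the co-scheduled threads (idealized SMT)."""
    return min(sum(thread_ilps), WIDTH) / WIDTH

print(utilization([1.2]))        # one low-ILP thread alone: 30% of the core
print(utilization([1.2, 1.2]))   # two such threads via SMT: 60%
print(utilization([3.8]))        # one high-ILP program: 95% from a single thread --
                                 # a 1-wide Niagara core would cap it at 1 IPC
```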

Cheers
Gubbi
 
archie4oz said:
Strictly speaking these CPUs also do OoO execution, albeit to a very limited degree.

Yes, I know they have some limited OOE capabilities (I'm pretty intimate with 750s and 74xxs)...
I knew you'd know, I just couldn't help myself :)

I think you're right when you say a narrow superscalar core is the more likely. But I don't think it'll be based off either PPC 440 or 750.

archie4oz said:
Well, it depends on what you mean by 'based off of'... I mean, I can see them taking a 750 and stretching the snot out of its execution pipeline, increasing the rename resources and instruction window, enlarging the cache buffers, etc... Anyway, I sorta envision something more along the lines of Intrinsity's FastMIPS cores...

Given the rumours concerning SMT, I think it'd be more likely that it will be a cut-down 970 (or Power5 derivative). Removing some issue (dispatch :) ) ports, execution units, and rename registers seems easier than stretching pipelines, enlarging rename resources, and building context-aware instruction scheduling.

I also think Microsoft is aiming for something that will be relatively easy to develop for. The FastMIPS/FastMATH systems would require some serious to-the-metal programming skills to make fly.

I think SIMD is a given, and since Altivec is about as neat as SIMD extensions go, a PPC derivative is the most likely (fast 7450 or cut down 970/P5 with Altivec), all IMO of course.

Cheers
Gubbi
 
Gubbi said:
Brimstone said:
Speculation Mode


Super-Scalar, But Not as You Know It

<snip>

IMO, Sun are doomed. They have taken their eye off the single-thread performance ball and they are going to lose.

They make it sound like wide superscalars are a bad idea. They argue that most of your execution units will idle, since many programs have limited instruction-level parallelism (ILP). This is where SMT comes in. SMT takes advantage of the fact that you're likely to have an excess of execution units, with all that goes with it (lots of rename registers etc.). The reason why SMT is comparatively cheap (10-15% extra logic) is that all that is added is the capability of the OOO scheduling engine to track more than one context (plus registers for the extra context).

For the many programs with limited ILP, Sun's approach and a wide superscalar with SMT will do approximately the same, both keeping the majority of their execution units busy by executing multiple threads. But a significant fraction of programs will have lots of ILP, and these will fly like shit off a silver shovel on our wide superscalar compared to Sun's Niagara.

Cheers
Gubbi

I think Alpha's Aranha/EV8, with an 8-way super-scalar engine + SMT, would have upset Sun's designers quite a bit ;).
 
Gubbi said:
But a significant fraction of programs will have lots of ILP, and these will fly like shit off a silver shovel on our wide superscalar compared to Sun's Niagara.

Cheers
Gubbi

How about DMT, or spawning new threads to take advantage of that ILP, to drive the Niagara cores?
 