Future console CPUs: will they go back to OoOE, and other questions.

In fact, I wouldn't be surprised if the dual PPC970 X360 devkits yielded better overall CPU performance than the final machine.

Funny that you mention that... :) I think some 'early adopters' could really tell stories on this subject.
 
Microsoft could have used an OOO core if they really wanted it. Xenon was built to order for MS so IBM would have given them whatever they desired.

I recently talked to a guy formerly from Apple's OS group, generally just bitching about the decisions in both processors.

He claimed that both Xenon and Cell are based on cores originally designed by IBM in conjunction with Apple and other customers. Apple in the end turned it down, but I doubt very much that MS went in and said "no, we don't want an OOO core".

It's much more likely they went in with "we want a multi-core CPU clocked in the 3GHz range, with power output no greater than X and an emphasis on FP performance".

FWIW, in most code the original 2GHz PPC CPUs in the devkits are faster than the final units. But this was pretty easy to predict given its 3-issue design, although a lot of early devs didn't predict it.

We run all our benchmarks on PPU, SPU, X360 CPU and usually a PC, and I've yet to see the PC not totally dominate on a single processor benchmark.
 
1. At 3.5GHz?
2. Maybe something changed there recently (I really didn't bother following this), but the 970MP was one of IBM's phantom chips last I checked (alongside the 970FX and a bunch of 750xxx derivatives).
IBM has talked about them on paper forever (and they looked really nice), but I'm not aware of any actual real-world showing for any of them (but like I said, I haven't followed this development recently. And at any rate, you'll have one hell of a time arguing that the MP could have been ready for very-large-volume production in 2004).

No, IBM probably couldn't have provided an OOOE CPU at 3.5GHz, but their in-order CPU only made 3.2GHz, and supposedly had poor yields. Say they only hit 2.4GHz to 2.6GHz on an OOOE CPU: Xenon already very likely gives lower performance per MHz than a Pentium 4, and IBM already had OOOE CPUs at 2.5GHz in mass production by that point in time that could beat a 3.2GHz P4. (BTW, the Xbox 360 launched in 2005, right? So 2004 isn't relevant at all?)
 
I recently talked to a guy formerly from Apple's OS group, generally just bitching about the decisions in both processors.

He claimed that both Xenon and Cell are based on cores originally designed by IBM in conjunction with Apple and other customers. Apple in the end turned it down, but I doubt very much that MS went in and said "no, we don't want an OOO core".

What's the point of this comment? Apple wanted a processor that was going to be competitive with the PC market. Basically they were going the route of competing with PC hardware all the way, and then beating the competition with their design and their OS, keeping the option open for running dual-boot OS/X and Windows.

This did not match the Cell processor at all - it would have been hard enough to optimise OS/X for the Cell, and they would again have been gambling on whether IBM could produce chips that could compete with the Intel/AMD arms race (they were too late with a suitable mobile CPU and that cost Apple dearly for their Powerbooks, another reason they wanted to shift to Intel).

Furthermore, all it says is that the PPE part of the Cell processor is an evolution of the PowerPC core. Yes, we all know that. We also know that the Cell's PPE (which, among other things, has more advanced multi-threading support) isn't the most exciting part of the Cell's chip design, now is it?

It's much more likely they went in with "we want a multi-core CPU clocked in the 3GHz range, with power output no greater than X and an emphasis on FP performance".

Yes, and the result is this in-order chip.

FWIW, in most code the original 2GHz PPC CPUs in the devkits are faster than the final units. But this was pretty easy to predict given its 3-issue design, although a lot of early devs didn't predict it.

We run all our benchmarks on PPU, SPU, X360 CPU and usually a PC, and I've yet to see the PC not totally dominate on a single processor benchmark.

When you say PPU, SPU, what do you mean exactly?

And what kind of benchmarks are you running? I don't quite follow. Do you mean a single core benchmark, and what kind of PC processor are you matching against what other CPUs exactly, and how? (though I can imagine that this kind of stuff is confidential)
 
When you say PPU, SPU, what do you mean exactly?

And what kind of benchmarks are you running? I don't quite follow. Do you mean a single core benchmark, and what kind of PC processor are you matching against what other CPUs exactly, and how? (though I can imagine that this kind of stuff is confidential)

I read it as a single-threaded benchmark on the PPU/E, an SPU/E, a Xenon core, and a P4/C2D/A64 core. The high-performance, aggressively OOO PC parts yield the highest performance -- which, really, is just plain common sense.
 
Nothing concrete about this approach has materialized. How does this scouting thread know where to stream in from? It's exactly these data-dependent loads that are the problem, and that OOO is so good at scheduling around.

Needing revolutionary breakthroughs in software to get good performance on your new core? Sorry, but I'm a sceptic.

It was reasonably concretely described in Sun's IEEE paper on throughput computing. It is also not a software technique, but a hardware one. It's simply a twist on traditional speculative execution, only the execution happens using idle thread resources. When you miss a load, you set a hardware checkpoint, and your current thread blocks waiting for a load into a register, while another thread, the scouting thread, continues to execute. It does not commit any writes to memory, but merely serves to "warm up" the caches/fetch predictors. Any future load dependent on a past load which has not yet resolved is ignored. The branch prediction unit is used to speculatively continue executing past a branch whose conditional was calculated using unloaded registers. When the loads finish, you return to executing from the checkpoint. A future technique, "out-of-order commit", will be able to leverage speculatively calculated results from a scout thread without having to recalculate them, as long as they can be proven correct. This, too, is done in hardware, by reusing the special "not loaded yet" bits already utilized by the scout thread on registers.
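
To make that flow concrete, here's a toy Python sketch of how I picture it (the instruction encoding, register model and latency number are my own assumptions, not anything from Sun's paper): checkpoint on a miss, let the scout run ahead tagging anything dependent on the missing value as "not ready" while prefetching the independent loads, then restart from the checkpoint against a warmer cache.

```python
# Toy model only: 'program' is a list of ('load', dst, addr_reg) or ('op', dst, [src_regs]);
# 'regs' maps register names to values (all assumed pre-initialized); 'cache' is a set of
# resident addresses. Only cache warming and a rough cycle count are modeled.
def run_with_scout(program, regs, cache, miss_latency=100):
    pc, cycles = 0, 0
    while pc < len(program):
        kind, dst, src = program[pc]
        if kind == 'load' and regs[src] not in cache:
            checkpoint = pc                        # hardware checkpoint on the miss
            not_ready = {dst}                      # tag the destination "not loaded yet"
            for s_kind, s_dst, s_src in program[pc + 1:]:   # the scout runs ahead
                if s_kind == 'load':
                    if s_src in not_ready:
                        not_ready.add(s_dst)       # address depends on the miss: skip it
                    else:
                        cache.add(regs[s_src])     # independent load: warm the cache
                elif any(r in not_ready for r in s_src):
                    not_ready.add(s_dst)           # poisoned operands propagate the tag
            cycles += miss_latency                 # the original miss finally resolves
            cache.add(regs[src])
            pc = checkpoint                        # resume (re-execute) from the checkpoint
        else:
            cycles += 1                            # cache hit or ALU op: proceed normally
            pc += 1
    return cycles
```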

Sun's paper includes actual benchmarks on the technique, which show impressive database, SpecInt2000 and SpecFp2000 improvements. It boosts single-thread performance by 40% on average. Also, Azul Systems ships systems today for Java which leverage the technique and use up to 378 threads per system, and 48 threads per core.

You should look at it, well at least a future generalization of it, as simply another way to implement OOOe (when out-of-order commit techniques are added), but one that potentially scales better vs on-chip resources and requires less silicon. The main difference with Rock vs today's OOOe is that you don't get the benefit of retired instructions and a store queue. However, with TLP processors this is less of an issue, because you will still retire instructions for other threads while waiting, so work is getting done, just not in your thread, and overall throughput goes up.

I see the ILP/TLP approaches as just specialized cases of future data-parallel/transactional memory designs, where the CPU core, depending on what work is being done, essentially profiles and finds data-parallel snippets of code, executes them in separate contexts, be it ILP or TLP, sometimes optimistically, and then commits them in a transactional fashion. You should then be able to scale up the number of instructions in flight to thousands, maybe tens of thousands, just like GPUs. The current instruction windows of OOOe processor implementations simply aren't going to scale power-wise, since they scale n^2 with window size.

IMHO, if you've got hardware to accelerate faster single-thread performance, it should also be able to accelerate faster multithread performance as well. The approaches need to be unified. Sun's Niagara/Rock approaches just show the tip of the iceberg. It would be nice if CPUs could extract ILP out of tiny windows, and then, using the same on-chip resources, switch to TLP when ILP fails. It would also be nice if concurrency-safe computing were "natural" and inherent, instead of implemented via locking (which also hurts OOOe performance).

And OOO leads in-order designs in performance/power at all performance points, all the way down to where it gets completely uninteresting.

Does it? At least in SpecWeb, Sun's Niagara-based T1000 has 5 times the performance per watt of a quad-core Xeon Dell PowerEdge 2850: twice the throughput and less than 1/2 the power consumption. In database OLTP benchmarks, triple the throughput (transactions per minute) at 1/2 the power consumption.

Doesn't seem so clear cut to me. If you look at other TLP systems like Azul (which is designed to run threaded Java applications), the performance-per-watt advantage over commodity Intel boxes is incredible.

I'm not saying a generalized approach is going to beat OOOe cores at single-thread execution speed, although I think they could probably win with equal transistor budgets, but what I am saying is that I think TLP has gotten a bad rap as people think "oh, databases and web servers", whereas, if you look at recent IEEE papers, activity in the TLP area has produced some fascinating approaches to extracting memory parallelism and out-of-order execution by reusing the same HW resources used to deal with threads. It also scales well and removes many limitations that the fixed-size chip buffers/windows impose. For example, hardware scouting can prefetch thousands of loads ahead in theory, limited only by the L1 cache architecture; there are no load/store queues, instruction queues, or reorder buffers to impose limits.

Adding out-of-order commit can be done without as many resources as traditional OOOe and scales better, according to the out-of-order-commit papers I've read (interestingly, they were invented for traditional ILP performance speedup). They basically get rid of the reorder buffer and replace it with a checkpoint table that essentially stores thread context at various points in execution. The OOOe then scales with the size of the checkpoint table and the store queue.
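
Roughly, as I read those papers, the bookkeeping looks something like this (the interval, table size and method names below are my own guesses, purely for illustration):

```python
CHECKPOINT_INTERVAL = 64     # assumption: snapshot architectural state every 64 instructions

class CheckpointTable:
    """Toy stand-in for a ROB: track whole-context snapshots instead of per-instruction entries."""
    def __init__(self, max_checkpoints=8):
        self.checkpoints = []                        # list of (pc, register snapshot)
        self.max_checkpoints = max_checkpoints

    def maybe_checkpoint(self, pc, regs):
        # Assumes a checkpoint is always taken at pc 0 (0 % CHECKPOINT_INTERVAL == 0).
        if pc % CHECKPOINT_INTERVAL == 0:
            if len(self.checkpoints) == self.max_checkpoints:
                raise RuntimeError("no free checkpoint: the front end stalls")
            self.checkpoints.append((pc, dict(regs)))    # store thread context in bulk

    def commit_oldest(self):
        """Instructions between two checkpoints may complete in any order;
        the oldest checkpoint is released once they have all finished."""
        return self.checkpoints.pop(0)

    def rollback(self, fault_pc):
        """On a fault/misprediction, discard everything after the newest
        checkpoint at or before fault_pc and restart from its snapshot."""
        while len(self.checkpoints) > 1 and self.checkpoints[-1][0] > fault_pc:
            self.checkpoints.pop()
        return self.checkpoints[-1]
```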

Moreover, one paper I read takes an interesting approach that reminds me of generational garbage collection in virtual machines, in that they detect instructions which they can predict will take a long time to complete (long-latency memory I/O), and they move these instructions into their own separate "slow" queues. This means that instructions which take much, much longer than others won't tie up the on-chip resources (queues) used by "fast" instructions, eating up slots that could be used for others while they wait.
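
Something like this, in spirit (the queue sizes, threshold and tuple format are made up; the paper's actual heuristics are more involved):

```python
from collections import deque

FAST_WINDOW = 32        # assumed capacity of the small, fast scheduling window
SLOW_THRESHOLD = 100    # assumed cycle count that marks an op as "will wait on DRAM"

def dispatch(instr, predicted_latency, fast_q, slow_q):
    """Route an instruction at dispatch time based on its predicted latency."""
    if predicted_latency >= SLOW_THRESHOLD:
        slow_q.append(instr)          # long-latency op: park it, keep the fast window free
    elif len(fast_q) < FAST_WINDOW:
        fast_q.append(instr)          # normal op: contend for the fast window
    else:
        slow_q.append(instr)          # fast window full: spill rather than stall dispatch

fast_q, slow_q = deque(), deque()
dispatch(("load", "r1", "[r2]"), predicted_latency=300, fast_q=fast_q, slow_q=slow_q)
dispatch(("add", "r3", "r1", "r4"), predicted_latency=1, fast_q=fast_q, slow_q=slow_q)
```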

IMHO, it's much like the debates over unified shading. NVidia might still win the performance crown for the near future with "big" non-US designs (like big OOOe chips), but throughput-oriented designs may end up beating traditional OOOe designs in the future, even in single-thread performance, simply because of fundamental scaling limits in current OOOe designs.
 
What's the point of this comment? Apple wanted a processor that was going to be competitive with the PC market. Basically they were going the route of competing with PC hardware all the way, and then beating the competition with their design and their OS, keeping the option open for running dual-boot OS/X and Windows.

My point wasn't Apple's decision, more that neither the X360's core nor Cell are as custom as some people would believe.

They were both basically customised parts that IBM already had lying around.
 
It was reasonably concretely described in Sun's IEEE paper on throughput computing. It is also not a software technique, but a hardware one. It's simply a twist on traditional speculative execution, only the execution happens using idle thread resources. When you miss a load, you set a hardware checkpoint, and your current thread blocks waiting for a load into a register, while another thread, the scouting thread, continues to execute. It does not commit any writes to memory, but merely serves to "warm up" the caches/fetch predictors. Any future load dependent on a past load which has not yet resolved is ignored. The branch prediction unit is used to speculatively continue executing past a branch whose conditional was calculated using unloaded registers. When the loads finish, you return to executing from the checkpoint. A future technique, "out-of-order commit", will be able to leverage speculatively calculated results from a scout thread without having to recalculate them, as long as they can be proven correct. This, too, is done in hardware, by reusing the special "not loaded yet" bits already utilized by the scout thread on registers.

That makes sense. So it ignores data dependencies and speculates control dependencies to load as much as possible. It'll be interesting to see if it's actually a win, since it's effectively using two threads to do one thread's work.

Sun's paper includes actual benchmarks on the technique, which show impressive database, SpecInt2000 and SpecFp2000 improvements. It boosts single-thread performance by 40% on average. Also, Azul Systems ships systems today for Java which leverage the technique and use up to 378 threads per system, and 48 threads per core.

You should look at it, well at least a future generalization of it, as simply another way to implement OOOe (when out-of-order commit techniques are added), but one that potentially scales better vs on-chip resources and requires less silicon. The main difference with Rock vs today's OOOe is that you don't get the benefit of retired instructions and a store queue.

From your description it's clear it'll help single-thread performance in most cases, since the actual executing thread will have a higher hit rate in the caches than without the scout. The only exception would be if the scout speculates off on a wild goose chase, wasting bandwidth that would have been used to load needed data (i.e. this will only happen when the memory bus is saturated).

I see the ILP/TLP approaches as just specialized cases of future data-parallel/transactional memory designs, where the CPU core, depending on what work is being done, essentially profiles and finds data-parallel snippets of code, executes them in separate contexts, be it ILP or TLP, sometimes optimistically, and then commits them in a transactional fashion. You should then be able to scale up the number of instructions in flight to thousands, maybe tens of thousands, just like GPUs. The current instruction windows of OOOe processor implementations simply aren't going to scale power-wise, since they scale n^2 with window size.

Well, today's cache-line writes can be seen as transactional commits, but I'm guessing you want coarser (and user/compiler-controlled) granularity, and I'm guessing that's why you like the CELL approach. I just think CELL will fail because it's virtually impossible to virtualize the SPEs, which IMO is essential in a massively threaded approach.

A ROB should scale linearly with size. It will of course be slower, and in order to run it at the same latency, power would go up. But look at two wildly different approaches:
1.) P4 with one big, fat 128-entry instruction ROB.
2.) K8 with multiple smaller ROBs.

K8 is particularly interesting IMO. It groups instructions in threes, and thereby effectively has 3 smaller ROBs in parallel for its global scheduler. This is the same technique used by IBM in the Power4/5 and PPC970, which have 5 (4 + branch) instructions in a group. From the global scheduler, instructions are issued to the int and FP schedulers, which are smaller and faster.

There's no reason this principle of hierarchical ROBs couldn't be extended (indefinitely, as we see with caches).
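
A toy model of that grouped bookkeeping, just to illustrate the idea (the group count and retirement policy below are my assumptions, not the actual K8/Power4 logic):

```python
GROUP_SIZE = 3                 # K8-style groups; Power4/5 and PPC970 use 5 (4 + branch)

class GroupedROB:
    """Track fixed-size groups instead of individual instructions; retire whole groups in order."""
    def __init__(self, max_groups=24):            # 24 groups of 3 ~= 72 instructions in flight
        self.groups, self.max_groups = [], max_groups

    def dispatch(self, instrs):
        for i in range(0, len(instrs), GROUP_SIZE):
            if len(self.groups) == self.max_groups:
                raise RuntimeError("ROB full: dispatch stalls")
            chunk = instrs[i:i + GROUP_SIZE]
            self.groups.append({"instrs": chunk, "done": [False] * len(chunk)})

    def complete(self, group_idx, slot):
        self.groups[group_idx]["done"][slot] = True   # completion may happen out of order

    def retire(self):
        """Groups retire oldest-first, and only once every slot in the group has completed."""
        while self.groups and all(self.groups[0]["done"]):
            yield self.groups.pop(0)["instrs"]
```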

Does it? At least in SpecWeb, Sun's Niagara-based T1000 has 5 times the performance per watt of a quad-core Xeon Dell PowerEdge 2850: twice the throughput and less than 1/2 the power consumption. In database OLTP benchmarks, triple the throughput (transactions per minute) at 1/2 the power consumption.

Congrats! :) You succeeded in finding one of the few benchmarks that has basically an infinite amount of parallelism in it, since each request is completely independent of every other request (and thereby isn't affected by Amdahl's law).

If you look at other OLTP benchmarks, like SAP, you see that the T1 roughly equals two dual-core 2.2GHz Opterons, the former having an edge in power (72W vs ~80-100W) and the latter an edge in cost. A quad-core Opteron would have approximately the same die size as Niagara and would burn the same amount of power.

I generally agree with you that in the long run, to scale performance, we need to be much better at extracting TLP. I also think that the approaches will converge; we're already seeing Niagara 2 improving significantly on single-thread performance over Niagara. And I'll bet that we'll see acceleration of context switching if that ever gets critical in x86 land.

Cheers
 
My point wasn't Apple's decision, more that neither the X360's core nor Cell are as custom as some people would believe.

They were both basically customised parts that IBM already had lying around.

But this sort of depends also on whether you consider the PPE to be 'Cell' or not... I think most people would view the SPEs as the core of the architecture, and certainly those are custom.

I remember the series of articles One posted here last year on the creation of Cell, and I don't think many would find it surprising that IBM rammed an already-explored design into it once the compromise of a heterogeneous Power-based design was reached.

For my part, I was surprised as details of the XeCPU's core came out that it was able to be built around the same part/concept, but hey, whatever.
 
Take this for what it's worth (3rd-hand information on the internet), but the Apple guy claims that IBM had a lot more than just the PPU "lying around", and that Sony/Toshiba were minor players in the Cell design.
 
IMHO, it's much like the debates over unified shading. NVidia might still win the performance crown for the near future with "big" non-US designs (like big OOOe chips), but throughput-oriented designs may end up beating traditional OOOe designs in the future, even in single-thread performance, simply because of fundamental scaling limits in current OOOe designs.
Nice post, and in my naive way I think this final comparison is highly valid.

One thing that seems to have been missed in this discussion is the very large register files that Xenon cores and SPEs have (for Vec4 "VMX"). GPUs also have seriously large register files.

NVidia describes each fragment in flight as being a thread - consequently each thread (fragment) has, say, 3 registers assigned to it. There might be 5000+ threads in flight, in this counting scheme. Being more down to earth, in G71 there's effectively only a maximum of 6 threads actually in flight (one per quad-pipeline) where each pipeline has its own program counter. Each thread in a GPU is doing its own per-clause (or per pair of instructions) windowing. Still the register file has to be relatively huge to accommodate all the fragments that are in flight (5000+). ATI's newer GPUs are obviously more interesting because they run batches of fragments out of order, as well as re-ordering ALU and TEX instructions within a "thread".
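
Quick back-of-the-envelope on why that register file gets so big, using the illustrative "3 registers per fragment" figure above and assuming Vec4 FP32 registers:

```python
fragments_in_flight = 5000          # "threads" in NVidia's counting scheme, as above
regs_per_fragment   = 3             # illustrative Vec4 register allocation per fragment
bytes_per_register  = 4 * 4         # 4 components x 32-bit float

register_file_bytes = fragments_in_flight * regs_per_fragment * bytes_per_register
print(register_file_bytes / 1024, "KB")   # ~234 KB of registers just to keep fragments in flight
```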

Jawed
 
And you may make the case that the 970 doesn't reach the same 3.2 GHz frequency as Xenon, but the issue-width is considerably wider. In fact, I wouldn't be surprised if the dual PPC970 X360 devkits yielded better overall CPU performance than the final machine.
The two CPUs in the alpha kit are IBM PowerPC 970 CPUs. These chips do not have the new features that will be in the Xbox 360 CPUs. There are also some performance differences to be aware of:
· The two CPUs in the alpha kits don’t support simultaneous multithreading (SMT), whereas the final hardware will have three SMT CPUs.
· The two CPUs in the alpha kit are separate chips, each with its own independent 512-KB L2 cache, whereas the final hardware will have three CPUs on one chip with 1 MB of L2 cache on the same chip.
· The PowerPC 970 can dispatch up to five instructions per clock cycle with aggressive out-of-order execution. In contrast, each Xbox 360 CPU core will be able to issue a maximum of two instructions per clock cycle in a more linear fashion, meaning that the Xbox 360 CPU may execute fewer instructions per clock cycle. However, the Xbox 360 CPU will run at a much higher clock speed and will have much higher peak vector performance.
· The instruction latencies on the alpha kit CPU are typically fewer clock cycles compared to the Xbox 360 CPU, but the clock cycles will be shorter on the Xbox 360 CPU.
· The vector units in the alpha kits have fewer registers and are missing some important instructions, such as dot product.

Jawed
 
Take this for what it's worth (3rd-hand information on the internet), but the Apple guy claims that IBM had a lot more than just the PPU "lying around", and that Sony/Toshiba were minor players in the Cell design.

Well, I mean that might or that might not be the case - there's not too much to go on out there unfortunately, and most everything we would hear in the English-speaking world would normally come out of IBM or their partner companies (Apple) anyway.

Still I'll provide the link to that older thread of One's just because I still find it a great read.

The Engineers Who Created Cell
 
But this sort of depends also on whether you consider the PPE to be 'Cell' or not... I think most people would view the SPEs as the core of the architecture, and certainly those are custom.

Personally, I'd point to the communications/data flow aspects and the corresponding resources as the core of the architecture. Not a FLOP to be found there though, and its features are rather technical, maybe that's why it doesn't get much attention.
But then, I'm a data flow kind of guy.
 
That makes sense. So it ignores data dependencies and speculates control dependencies to load as much as possible. It'll be interesting to see if it's actually a win, since it's effectively using two threads to do one thread's work.

Well, but a thread in this case just represents CPU context state which must be stored, so it won't be that expensive. OOOe also stores state, just with different granularity. TLP trades off computation for space, and writes contexts out at checkpoints in chunks (every X instructions) instead of recording CPU pipeline state for each instruction encountered (the OOOe approach: you do it incrementally, recording state as you go, instead of doing it in chunks during checkpoints/bulk commits).


From your description it's clear it'll help single-thread performance in most cases, since the actual executing thread will have a higher hit rate in the caches than without the scout. The only exception would be if the scout speculates off on a wild goose chase, wasting bandwidth that would have been used to load needed data (i.e. this will only happen when the memory bus is saturated).

They try to limit this by 1) having the scout ignore any addresses dependent on calculated data that hasn't loaded yet, and 2) using the branch predictor to guess conditionals that can't be evaluated yet because they depend on unloaded data. Now, early-out loops could lead to wastage if you start a very slow load right before a loop that would have quit in the first iteration depending on that data, whilst the scout keeps executing the loop over and over. But I think that's probably statistically much less likely, and therefore the expected benefit is still >1.

Well, today's cache-line writes can be seen as transactional commits, but I'm guessing you want coarser (and user/compiler-controlled) granularity, and I'm guessing that's why you like the CELL approach. I just think CELL will fail because it's virtually impossible to virtualize the SPEs, which IMO is essential in a massively threaded approach.

But I'm not talking about SPEs. I'm talking about a future TLP processor + out-of-order commits + maybe transactional memory. It appears to me that once you spend the resources to do out-of-order commits, you can also take advantage of the same checkpoint tables to realize some form of transactional memory, at least with bigger granularity than a cache line. This would allow the elimination of most locks, the bane of any multithreaded app. The essential structure used in the out-of-order-commit scheme resembles a transaction log to me.

Look up "software transactional memory" for the compiler/virtual-machine-oriented approach. I was hoping for something with hardware support, at least for transactions small enough to fit within on-chip resources, to overcome the costs of doing it in software (much like Azul's extra bit in address registers to support zero-pause, truly concurrent garbage collection). Most locks that cause problems are "hot locks" around small shared data structures, like adding a node to a list, tree, or set. The update would certainly fit within a chip's resources without requiring the elaborate STM machinery.
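
As a software-level illustration of the kind of tiny transaction I mean (in hardware the read/write set and the commit would live in on-chip checkpoint resources; the class and names below are mine):

```python
import threading

class VersionedCell:
    """Holds a small shared structure plus a version number used to detect conflicts."""
    def __init__(self, value):
        self.value, self.version = value, 0
        self._commit = threading.Lock()     # models only the atomic commit point

def transactional_append(cell, item):
    while True:
        seen = cell.version                 # begin transaction: record what we read
        updated = cell.value + [item]       # do the update on a private copy
        with cell._commit:                  # attempt to commit atomically
            if cell.version == seen:        # nobody else committed in the meantime
                cell.value, cell.version = updated, cell.version + 1
                return
        # conflict: another thread won the race, so retry the whole transaction

shared = VersionedCell([])
transactional_append(shared, "node")
```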


A ROB should scale linearly with size. It will of course be slower, and in order to run it at the same latency, power would go up.

The out-of-order-commit papers claim ROBs don't scale like this, because ROBs are intimately related to the size of the instruction queues and load/store queues, all of which must be upsized too.

I'll quote the original paper that started the fuss:
Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.


Congrats! :) You succeeded in finding one of the few benchmarks that has basically an infinite amount of parallelism in it, since each request is completely independent of every other request (and thereby isn't affected by Amdahl's law).

Well, OLTP isn't truly parallel, since you have shared mutable state. It just so happens that RDBMS are very good at avoiding contention.

If you look at other OLTP benchmarks, like SAP, you see that the T1 roughly equals two dual-core 2.2GHz Opterons, the former having an edge in power (72W vs ~80-100W) and the latter an edge in cost. A quad-core Opteron would have approximately the same die size as Niagara and would burn the same amount of power.

Maybe, but the T1 would probably still smash it in many enterprise applications, and remember, the T1 is essentially a prototype first-gen chip, practically lifted wholesale from a startup company with a couple of modifications to rescue Sun's failing HW line. Niagara 2 and Rock will probably have better power and single-thread performance, coupled with many, many more cores and threads. If such a chip were to arrive in 2007, it would most likely smoke a quad-core Opteron in any enterprise server benchmark, and would probably give them a run for their money if someone wrote an optimized game for it. :)
 
Personally, I'd point to the communications/data flow aspects and the corresponding resources as the core of the architecture. Not a FLOP to be found there though, and its features are rather technical, maybe that's why it doesn't get much attention.
But then, I'm a data flow kind of guy.

Indeed, this seems to be one very important aspect of the 'Cell Broadband Engine', implied in the name even. If you look at the stuff processors do today that really brings them down, it comes down to handling and modifying large streams of data.

Also in the Cell's favor in this respect is surely the new supercomputer IBM has been contracted to build, containing 16,000 Cell processors. I assume that everyone here already knows that the Cell Programmer's Handbook is online and answers all questions? I have it on my PSP ... :D

And surely the recent testing of the Cell's FLOPS capabilities by the science guys, which rated even its double-precision floating-point performance at 14.6 GFLOPS, is nothing to sneeze at either?

(BookR does an amazing job of letting me read that 10MB / 600-page PDF on it. I wish the official firmware supported BookR!)
 
BTW, my idea for these future processors is that you start out with a large number of functional units and general-purpose context state, and the CPU performs the magic described above (TLP prefetching, out-of-order commits, etc.); in the process it also tries to distribute to the functional units as much ILP as it can extract from data-parallel sections within a thread. You would probably have many more functional units than current OOOe designs.

Of course, a single-threaded app will never maximize usage of all the functional units or context storage, and will certainly leave idle, unused capacity. However, an application with many threads probably would be able to maximize them.

Then we can dispense with the "N-core CPU" concept, just like we got rid of the "N-pipeline" GPU concept. CPUs in the future would be described as ratios, just like GPUs: number of functional units, max # of threads, max # of caches (maybe partitionable). A "dual-core CPU" is then just a virtual notion, like one of these TLP processors with half the functional units, cache, and thread state partitioned to each virtual unit; much like a 16-pipeline GPU in next-gen architectures is really a virtual concept, a grouping of underlying resources.

I would not necessarily call them "in-order", since if they implement out-of-order commit, they are OOOe in a way, just a different way.
 
Personally, I'd point to the communications/data flow aspects and the corresponding resources as the core of the architecture. Not a FLOP to be found there though, and its features are rather technical, maybe that's why it doesn't get much attention.
But then, I'm a data flow kind of guy.

Very true, I agree. I was highlighting the SPEs as being the meat of the execution logic of Cell - perhaps the focus of the architecture was not the right emphasis for me to draw. :)

I definitely recognize where you're coming from on the dataflow aspects; in fact if you haven't already seen it, you'll probably be a fan of the thinking espoused here in this interview.
 
Take this for what it's worth (3rd-hand information on the internet), but the Apple guy claims that IBM had a lot more than just the PPU "lying around", and that Sony/Toshiba were minor players in the Cell design.
IMHO your source is not worth that much. The Sony and Toshiba guys were not minor players at all; they have co-signed dozens of patents with the IBM guys since 2002, and the Japanese guys even moved to the US to work closely with the IBM guys.
 
IMHO your source is not worth that much. The Sony and Toshiba guys were not minor players at all; they have co-signed dozens of patents with the IBM guys since 2002, and the Japanese guys even moved to the US to work closely with the IBM guys.

Just an interesting datapoint. If he'd had no credibility, I'd have dismissed it out of hand.
And I'm sure that Sony and Toshiba worked with IBM; he just said that he'd seen plans for a similar processor while at Apple, before the Cell project happened.
 