In fact, I wouldn't be surprised if the dual PPC970 X360 devkits yielded better overall CPU performance than the final machine.
Funny that you mention that... :) I think some 'early adopters' could really tell stories on this subject...
In fact, I wouldn't be surprised if the dual PPC970 X360 devkits yielded better overall CPU performance than the final machine.
Microsoft could have used an OOO core if they really wanted it. Xenon was built to order for MS so IBM would have given them whatever they desired.
1. At 3.5GHz?
2. Maybe something changed there recently (I really didn't bother following this), but the 970MP was one of IBM's phantom chips last I checked (alongside the 970FX and a bunch of 750xxx derivatives).
On paper IBM has talked about them forever (and they looked really nice), but I'm not aware of any actual real-world showing for any of them (like I said, I haven't followed this development recently; and at any rate, you'll have one hell of a time arguing that the MP could have been ready for very-large-volume production in 2004).
I recently talked to a guy formerly from Apple's OS group, generally just bitching about the decisions in both processors.
He claimed that both Xenon and Cell are based off cores originally designed by IBM in conjunction with Apple and other customers. Apple in the end turned it down, but I doubt very much that MS went in and said "no, we don't want an OOO core."
It's much more likely they went in with "we want a multi-core CPU clocked in the 3GHz range with power output no greater than X and an emphasis on FP performance."
FWIW, in most code the original 2GHz PPC CPUs in the devkits are faster than the final units. But this was pretty easy to predict given their 3-issue design, although a lot of early devs didn't see it coming.
We run all our benchmarks on PPU, SPU, X360 CPU and usually a PC, and I've yet to see the PC not totally dominate on a single processor benchmark.
When you say PPU, SPU, what do you mean exactly?
And what kind of benchmarks are you running? I don't quite follow. Do you mean a single core benchmark, and what kind of PC processor are you matching against what other CPUs exactly, and how? (though I can imagine that this kind of stuff is confidential)
Nothing concrete about this approach has materialized. How does this scouting thread know where to stream in from? It's exactly these data-dependent loads that are the problem and that OOO is so good at scheduling around.
Needing revolutionary breakthroughs in software to get good performance on your new core? Sorry, but I'm a sceptic.
And OOO leads in-order designs in performance/power at all performance points, all the way down to where it gets completely uninteresting.
What's the point of this comment? Apple wanted a processor that was going to be competitive with the PC market. Basically they are going for the route where they want to be able to compete with PC hardware all the way, and then beat the competition with their design and their OS, keeping the option open for running dual-boot OS/X and Windows.
It was reasonably concretely described in Sun's IEEE paper on throughput computing. It is also not a software technique, but a hardware one. It's simply a twist on traditional speculative execution, only the execution happens using idle thread resources. When you miss a load, you set a hardware checkpoint, and your current thread blocks waiting for the load into a register, while another thread, the scouting thread, continues to execute. It does not commit any writes to memory, but merely serves to "warm up" the caches/fetch predictors. Any future load dependent on a past load which has not yet resolved is ignored. The branch prediction unit is used to speculatively continue executing past a branch whose condition was calculated from unloaded registers. When the loads finish, you return to executing from the checkpoint.

A future technique, "out-of-order commit", will be able to leverage speculatively calculated results from a scout thread without having to recalculate them, as long as they can be proven correct. This, also, is done in hardware, by reusing the special "not loaded yet" bits already utilized by the scout thread on registers.
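To make the control flow concrete, here's a rough, purely illustrative sketch of the idea in Python. The instruction format, cache model and function names are all made up for illustration; the real mechanism is implemented entirely in hardware, and this toy skips the branch-speculation part and any timing model.

# Toy model of hardware scouting: on a load miss, checkpoint, run ahead
# without committing anything, "poison" registers that depend on the missed
# load, and prefetch independent loads so the caches are warm when the
# main thread resumes from the checkpoint. Purely illustrative.

def scout_run(instrs, regs, cache, miss_reg):
    """Run ahead past a missed load; nothing here is ever committed."""
    poisoned = {miss_reg}                 # stands in for the "not loaded yet" bits
    for op, dst, srcs in instrs:
        if poisoned.intersection(srcs):
            poisoned.add(dst)             # depends on unresolved data: ignore it
            continue
        if op == "load":
            cache.add(regs[srcs[0]])      # the payoff: prefetch into the cache
        elif op == "add":
            regs[dst] = regs[srcs[0]] + regs[srcs[1]]  # results are thrown away

def execute(instrs, regs, cache, memory):
    for i, (op, dst, srcs) in enumerate(instrs):
        if op == "load":
            addr = regs[srcs[0]]
            if addr not in cache:         # miss: checkpoint and scout ahead
                scout_run(instrs[i + 1:], dict(regs), cache, dst)
                cache.add(addr)           # the original miss finally returns
            regs[dst] = memory[addr]
        elif op == "add":
            regs[dst] = regs[srcs[0]] + regs[srcs[1]]
    return regs                           # main thread resumed from the checkpoint

# Example: the load through r1 misses; the scout skips the dependent add but
# prefetches the independent load through r3, which then hits in the cache.
regs = execute(
    [("load", "r2", ("r1",)), ("add", "r4", ("r2", "r2")), ("load", "r5", ("r3",))],
    {"r1": 0, "r3": 8}, cache=set(), memory={0: 7, 8: 9},
)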
Sun's paper includes actual benchmarks of the technique, which show impressive database, SpecInt2000 and SpecFp2000 improvements: it boosts single-thread performance by 40% on average. Also, Azul Systems ships systems today for Java which leverage the technique and use up to 378 threads per system, and 48 threads per core.
You should look at it, well at least a future generalization of it, as simply another way to implement OOOe (once out-of-order commit techniques are added), but one that potentially scales better with on-chip resources and requires less silicon. The main difference between Rock and today's OOOe is that you don't get the benefit of retired instructions and a store queue.
I see the ILP/TLP approaches as just specialized cases of future data-parallel/transactional-memory designs, where the CPU core, depending on what work is being done, essentially profiles and finds data-parallel snippets of code, executes them in separate contexts, be it ILP or TLP, sometimes optimistically, and then commits them in a transactional fashion. You should then be able to scale up the number of instructions in flight to thousands, maybe tens of thousands, just like GPUs. The current instruction windows of OOOe processor implementations simply ain't going to scale power-wise, since they scale n^2 with window size.
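Loosely, that n^2 claim boils down to the wakeup logic: every result tag broadcast in a cycle has to be compared against the source tags of every waiting window entry, and issue width is typically grown along with the window. Here's a back-of-the-envelope sketch; the numbers and the assumption that issue width tracks window size are mine, and comparator count is only a crude proxy for dynamic power, not a real model.

# Naive CAM-based issue window: comparisons per cycle grow with
# issue_width * window_entries, so roughly quadratically if width
# is scaled along with the window. All numbers here are made up.

def wakeup_comparisons(window_entries, issue_width, src_tags_per_entry=2):
    """Tag comparisons per cycle in a naive wakeup CAM."""
    return issue_width * window_entries * src_tags_per_entry

for entries in (32, 64, 128, 256):
    width = entries // 16             # assume issue width tracks window size
    print(entries, wakeup_comparisons(entries, width))
# prints: 32 128, 64 512, 128 2048, 256 8192 -- roughly 4x per doubling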
Does it? At least in SpecWeb, Sun's Niagara-based T1000 has 5 times the performance per watt of a quad-core Xeon Dell PowerEdge 2850: twice the throughput at less than half the power consumption. In database OLTP benchmarks, triple the throughput (transactions per minute) at half the power consumption.
My point wasn't Apple's decision, more that neither the X360's core nor Cell are as custom as some people would believe.
They were both basically customised parts that IBM already had lying around.
Nice post, and in my naive way I think this final comparison is highly valid. IMHO, it's much like the debates over unified shading: NVidia might still win the performance crown for the near future with "big" non-US designs (like big OOOe chips), but throughput-oriented designs may end up beating traditional OOOe designs in the future, even in single-thread performance, simply because of fundamental scaling limits in current OOOe designs.
The two CPUs in the alpha kit are IBM PowerPC 970 CPUs. These chips do not have the new features that will be in the Xbox 360 CPUs. There are also some performance differences to be aware of:
And you may make the case that the 970 doesn't reach the same 3.2 GHz frequency as Xenon, but the issue width is considerably wider. In fact, I wouldn't be surprised if the dual PPC970 X360 devkits yielded better overall CPU performance than the final machine.
Take this for what it's worth (3rd-hand information on the internet), but the Apple guy claims that IBM had a lot more than just the PPU "lying around", and that Sony/Toshiba were minor players in the Cell design.
But this sort of depends also on whether you consider the PPE to be 'Cell' or not... I think most people would view the SPEs as the core of the architecture, and certainly those are custom.
That makes sense. So it ignores data dependencies and speculates control dependencies to load as much as possible. It'll be interesting to see if it's actually a win, since it's effectively using two threads to do one thread's work.
From your description it's clear it'll help single-thread performance in most cases, since the actual executing thread will have a higher hit rate in the caches than without the scout. The only exception would be if the scout speculates off on a wild goose chase, wasting bandwidth that would have been used to load needed data (i.e. it will only happen when the memory bus is saturated).
Well, today's cache-line writes can be seen as transactional commits, but I'm guessing you want coarser (and user/compiler-controlled) granularity, and I'm guessing that's why you like the CELL approach. I just think CELL will fail because it's virtually impossible to virtualize the SPEs, which IMO is essential in a massively threaded approach.
A ROB should scale linearly with size. It will of course be slower, and in order to run it at the same latency, power would go up.
Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.
Congrats! You succeeded in finding one of the few benchmarks that has basically an infinite amount of parallelism in it, since each request is completely independent of the other requests (and thereby isn't affected by Amdahl's law).
If you look at other OLTP benchmarks, like SAP, you see that the T1 roughly equals two dual-core 2.2GHz Opterons, the former having an edge in power (72W vs ~80-100W) and the latter an edge in cost. A quad-core Opteron would have approximately the same die size as Niagara and would burn the same amount of power.
Personally, I'd point to the communications/data flow aspects and the corresponding resources as the core of the architecture. Not a FLOP to be found there though, and its features are rather technical, maybe that's why it doesn't get much attention.
But then, I'm a data flow kind of guy.
Take this for what it's worth (3rd-hand information on the internet), but the Apple guy claims that IBM had a lot more than just the PPU "lying around", and that Sony/Toshiba were minor players in the Cell design.
IMHO your source is not worth that much. The Sony and Toshiba guys were not minor players at all; they have co-signed dozens of patents with the IBM guys since 2002, and the Japanese guys even moved to the US to work closely with the IBM guys.