Future console CPUs: will they go back to OoOE, and other questions.

It seems to me a lot of people are missing the proper argument here. It isn't really 'is OOOE better than IOE', or 'is SMP better than UMP', or 'is lots of cache better than little cache'. It's a question of whether we will get rid of IOE in future console processors, so that in whatever multicore system you have, it's all OOOE.

Putting a better alternative to the PPE into Cell makes sense, and is something I'm sure everyone wants - but does improving Cell also mean adding OOOE to the SPEs? Or will future processors get rid of SPE-like processors altogether and go back to multi-core GP processors?

There's a lot of discussion here but I think most of it isn't getting to the heart of the debate ;)

What the best solution is depends on the application.

1) File, web, and database servers:
Multiple symmetrical OOOE cores with lots of cache are the way to go. Servers are limited by I/O performance rather than CPU or FP performance. They require large stacks, caches, buffers, and other pointer-addressed data structures in RAM, which is exactly what the SPEs are rubbish at. Also, a server has to handle many independent connections and requests, which are most easily handled by spawning new threads/processes and letting the OS schedule them on an SMP or NUMA architecture with many identical cores. No need for manual scheduling here; much easier to let the OS do it automatically.

2) HPC/supercomputing:
The Cell concept - one GP IOE core plus lots of asymmetric DSP cores - is ideal here. The problem needs to be forked out, and then the results need to be put together at various stages. Hence, for performance, the parallel processes require close coordination, which requires manual scheduling. Using the OS to schedule processes as a means of distributing workload to multiple cores won't work well unless the processes can run independently. With manual scheduling required, the advantage of easier programming using symmetric cores and automatic scheduling is lost. For hardware efficiency, Cell is the best approach.

3) Games consoles:
The Cell concept wins here too. Tight coordination is required between parallel processes, so automatic scheduling of processes by the OS can't be used. Hence there is no advantage in using symmetric cores - better to optimise the cores to do best what they will be asked to do: GP code execution for the PPE, DSP-type applications for the SPEs. OOOE creates indeterminacy in timing, which is not particularly desirable in games or in tightly bound parallel code, so why bother with OOOE?

4) Desktop ix86 PC:
The optimum in terms of cost/performance is one OOOE core with lots of cache to give the best possible Windows or Linux single-thread performance on non-optimised code (Windows OS and application code will always be generic, and for Linux on ix86 the same will be true for most distributions and applications), plus lots of SPEs to boost sound, media playback, etc., and to boost FP performance. Also a powerful GPU, maybe with SPEs to help with window and display management. AMD and Intel may be able to use the CPU-GPU-on-a-die concept to tailor the GPU and SPEs to complement each other here. The only problem with this approach is that the SPEs, being asymmetric, can only accelerate code that can be rewritten and optimised for the SPE architecture - which for Windows means only drivers and emulated devices can be accelerated, and for Linux, drivers, emulated devices, and standard libraries. With Linux it is possible to accelerate any open source program if there is a need, but unless there is a standard ix86 architecture that includes a universal SPE architecture, who will bother for a small fraction of the market? Still, because certain things like media players and graphics are very speed critical, I think it is worthwhile even with only these accelerated.
 
Let's speculate about a possible scenario for the next-generation console.
MS could design a new console CPU for 2011 with the following goals:
a - Increase the number of customers worldwide.
b - Lower the cost (and price) of the console.
c - Increase the production.
d - Learn from the XBox 360 experience.
e - Enjoy the XBox 360 ecosystem.
f - Better pixels, not more pixels.

A more evolutionary/conservative step.

One possibility is to reduce the CPU die size to below 100mm2 using a stable/mature process by 2011 (45nm would be a good candidate).
This means more chips per wafer and better yields, with consequently higher production and lower cost.

The next thing is to perfect the XBox 360 CPU (let's call it the XBox 720 CPU :) ).
- Improve single-thread performance (developers complain about it).
- Make it easier to program/develop a game.
- Keep the current multithread capability.

Then maybe something like the chip below:
- ~80mm2 die size, 45nm process, ~260 million transistors
- a single shared 2MB L2 cache at 3.2GHz
- 4 symmetrical IOE deep-pipeline PowerPC cores at 6.4GHz (the same 2-issue core, with small improvements/changes)
- 8 simultaneous threads
- 51.2 giga issues per second
- 25.6 giga dot products per second
- 309 GFlops
- 12.8 GBytes/sec FSB
- ~30% lower TDP (Thermal Design Power)

It could also enjoy improved/mature compilers and tools, and a large installed base/ecosystem.

Could MS improve the cores with some basic OoOE (let's call it BOOE)?
I mean an intermediate step between the advanced OOOE in larger cores and IOE, implementing just one or two tricks without changing the transistor count/core size too much.
What would you do as BOOE, and what would be the consequences (core size and single-thread performance increase)?
 
If the presentations I've seen recently from various vendors are to be believed, more logical threads per physical core are to be expected. My best guess would be that even the best OOO designs aren't getting close to maxing out their execution resources in the general case.

I've heard short-term discussion of 4x4x4 in the PC space: 4 hardware threads per core, 4 cores per chip, 4 physical packages.

As I've said before, though, I don't think too many people would disagree about where processors are heading; it's more interesting to discuss memory architecture and how inter-processor communication is achieved. The more threads you have, the finer the granularity of parallelism, and the bigger an issue this becomes.
 
I've heard short-term discussion of 4x4x4 in the PC space: 4 hardware threads per core, 4 cores per chip, 4 physical packages.
64 Threads ???

If I buy hardware like this I will probably go nowhere because:
- I will have no money to buy software (it will all be spent on hardware)
- No software will use it efficiently
- No money to pay the electricity bill (including air conditioning)
:LOL:

Do you know how much an XBox 360 costs here? US$ 1,100.00 without accessories, and good games around US$ 90.00 each.
 
As I've said before, though, I don't think too many people would disagree about where processors are heading; it's more interesting to discuss memory architecture and how inter-processor communication is achieved. The more threads you have, the finer the granularity of parallelism, and the bigger an issue this becomes.
Hope for the future with new SOI process to see Z-RAM caches or similar to have bigger caches. http://en.wikipedia.org/wiki/Z_RAM
 
Whether or not someone can write safe multithreaded code isn't the only problem. The problem is whether your programmers have the skills to design their own parallel algorithms in the first place - more academically inclined coders, versus mostly "off the shelf" programmers who use STL or Boost and reimplement Graphics Gems, A*, and other off-the-shelf algorithms.

believe it or not, but that's exactly what i tried to say a couple of pages ago. it did not turn out so eloquent, though.

basically, writing "safe" multithreaded code and writing multithreaded code that actually yields a gain are two orthogonal things. ideally you want advancement along both axes.

Safe code is one thing, but to take advantage of multiple cores you need to write more than just bug-free code; you need to know how to take advantage of parallelism, and it isn't always as obvious as offloading monolithic parts of your app to different cores.

precisely.

and to put a bit more value into this post of mine, let me say that i actually find the SPE paradigm somewhat stimulating for developers to start pushing the bar along both those axes mentioned above. why? for two reasons: because of the extreme, almost-manual-labor NUMA approach of the SPE local stores ('coherence what?'), and the abundance of independent computational resources that the cell SPEs constitute. eventually, cell may not be the best multi-threaded design possible, but i think devs who have grasped the principles of productivity on a cell will feel very much at home during the next desktop iterations. it'd be curious to hear what actual PS3 devs feel on this subject.
 
Sounds suspiciously like something you'd see in a server machine.

It will be possible on a single chip once Sun's Niagara II comes out next year:
8 cores with 8 threads each.

Should be very good on servers dedicated to tasks with light to medium compute intensive threads.
 
Just finished reading this very good article at Ars Technica; it is old but worth reading: http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars

Inside the Xbox 360, Part II: the Xenon CPU
Wednesday, June 01, 2005

......

Rumors and some game developer comments (on the record and off the record) have Xenon's performance on branch-intensive game control, AI, and physics code as ranging from mediocre to downright bad. Xenon will be a streaming media monster, but the parts of the game engine that have to do with making the game fun to play (and not just pretty to look at) are probably going to suffer. Even if the PPE's branch prediction is significantly better than I think it is, the relatively meager 1MB L2 cache that the game control, AI, and physics code will have to share with procedural synthesis and other graphics code will ensure that programmers have a hard time getting good performance out of non-graphics parts of the game.

And here an old overview about the Xenon: http://www-128.ibm.com/developerworks/power/library/pa-fpfxbox/?ca=dgr-lnxw09XBoxDesign
figure1.gif (die layout diagram from the IBM article)

~168mm2 total chip
~28mm2 each core
~38mm2 L2 cache
~46mm2 glue and other logic

Hmmm, maybe twice the l2 cache could help here ...
edited: at 45nm, 3 cores and 2MB of L2 could be only ~55mm2 and very fast :)
 
This is getting more and more ridiculous. First you compare a VIA Eden CPU, intended for low-cost fanless embedded SBC applications, with the latest Athlon CPU, claiming the difference is down to OOOE. Now you quote a $40 average die manufacturing cost for the P4 (note: die cost, not chip cost) to prove that the latest dual-core Athlons and Conroes could have been viable in the Xenon slot on a cost basis. If you had bothered to read the link you quoted:


I am looking at and comparing the cost to Microsoft of the CPU chip they are putting into the XBox 360. The die cost you are quoting is the average cost to the foundry of creating a die, which is a small piece of bare silicon, excluding overheads. Also, I believe the $40 quoted is typical for a single-core P4, not a dual-core CPU.

The die cost does not include packaging (mounting the die on the package and bonding gold wires between the die and the pins), testing each chip, the cost of defective chips you have to throw away (likely around 60% for a large chip early in its life), or the overheads and profit that pay for salaries, maintaining the manufacturing plant, and return on investment. All of these are significant costs. The total can easily come to 3 times or more the cost of manufacturing a die. $160 is about right for the cost to Microsoft of a tested and packaged low-end dual-core Athlon X2 chip. Look up bulk trade prices on components to get an idea of the cost of supplied, tested components.

As I said, current AMD and Intel dual-core chip prices are too high for them to be used in consoles. This would have been even more the case when the XBox 360 started production a year and a half ago (if production of a dual-core part was indeed feasible for the XBox 360 deadline).
Good things come to those who wait, they say. ;-)
Apparently Intel published detailed yields and costs for future quad-core CPUs at IDF. They stated that from each 300mm wafer they get about 320 good Conroe dies, which leads to a yield rate of 75%.
The following table shows the detailed costs Intel gave for Clovertown:
Intel (courtesy of German c't magazine) said:
Code:
Clovertown      interpolated prediction for 4Q07
Die             $29.37
Package         $21.02
Assembly        $1.39
Test            $3.62
Mfg Overhead    $7.62
Core Yield loss $8.08
Total Cost      $71.10
Remember these figures are for Clovertown, which uses 2 Conroe dies in a multi-chip package. So we can estimate the cost of a Conroe (for simplicity's sake I will assume that the cost for core yield loss and manufacturing overhead stays the same and the cost of the rest halves):
estimated cost of Conroe:
Code:
Conroe (4 MB L2) estimated
Die             $14.69
Package         $10.51
Assembly        $0.70
Test            $1.81
Mfg Overhead    $7.62
Core Yield loss $8.08
Total Cost      $43.40
Now it's important to remember that Clovertown is made of two Conroe dies with 4MB of L2 cache each, resulting in a whopping 291M transistors per die. Allendale, with 2MB of L2 cache, only has 167M transistors, which is comparable to Xenon's 165M. Scaling the Conroe die cost by transistor count gives $14.69 × (167/291) ≈ $8.43. Yield will be higher as well, so core yield loss should be about 15% less, or $7.03. So the cost of an Allendale is about $36.10.

Either way, $43.40 or $36.10 is very comfortably below $60.

qed.
 
Hmmm... now are you talking about the XeCPU eventually on 65nm? Or in the present day on 90nm? Because you need to account for that difference (double the die size).

Also Intel's fab operation is considered the best in the world in terms of 'efficiency,' so always be ready to tack on a couple more dollars for the other guys' dies.
 
They seem to be comparing the cost of packaging two Conroe/Woodcrest dual-core dies in a single package with the cost of a monolithic quad core, and have concluded that the monolithic quad core will still be too expensive in Q4 of 2007, so Intel will mount two dual-core Conroes inside a single chip housing and wire them up to make a quad core. They say it will remain that way even for the first chips of the 45nm generation.

Google translation:
About 320 good Conroe/Woodcrest dies come from each 300mm wafer, which can be combined into 160 quad cores - corresponding to a yield rate of roughly 75 per cent. Experience shows that the defect probability rises exponentially with die size, so a monolithic quad core could only be reckoned at about 130 good dies (23 per cent lower yield). With all manufacturing costs projected onto the fourth quarter of 2007, the result is a difference of $79.86 versus $71.10, i.e. about 12 per cent. Reason enough for Intel to stay with two dies in one housing for now. That will apparently also remain so for the first chips of next year's planned 45nm generation. Only later chips such as Yorkfield are to unite all four cores on one die.

Don't forget also that you need to add Intel's profit, R&D overhead, etc. to get the cost of chips sold to others like Microsoft. This will be at least on the order of 20-30% for large-scale mass-produced chips. For smaller production runs it could be anything, maybe up to 100% or even more.

Price to a third party for a dual-core Conroe in Q4 2007, 65nm process (a year's time):

Die $29.37/2 = $14.69
Package $21.02 (may reduce, but not by much)
Assembly $1.39
Test $3.62
Mfg Overhead $7.62 (may reduce, but not by much)
Core Yield loss $8.08
Total Cost $56.42

Including profit and non-manufacturing overhead: $56.42 × 1.25 = say $70 cost to a third party buying a dual-core Conroe chip. This seems reasonable for a year's time. Currently the lowest-cost low-end dual-core Athlon 64 to a third party is about $100-$140.

What you should be looking at is the 90nm process in Q4 of 2005 when comparing a dual core with the 90nm Xenon in Q4 2005. The decision to go for the in-order triple-core Xenon would have been made on the estimated feasibility, yields, and cost for that time, perhaps two years in advance.
 
Xenon is made on a completely different, more complex process (SOI, DSL, etc.), so the figures for it will be completely different. Xenon contains a relatively small cache with more logic transistors; they all have to work fully at 3.2GHz at the correct voltage with all 3 cores functional - you won't see any XBox 360s with "Celeron" processors. These factors will send yield down.

However, it's all academic anyway, since Xenon was clearly designed as a streaming floating-point monster and Core 2-based parts weren't. To match Xenon, Intel would have to up the clock and add another core; silicon costs would go up and yields would fall (probably sharply). You would also have to cool the now 120W+ chip.

The real challenge would be to get Intel to sell what would be their highest end part for around the price they sell their lowest end chip...

Xenon was built to order to give specific results at a certain price point. A Core2 based chip might be able to give the same results but it'll wreck Intel's business model.
 
aaaaa00, it is a server, right?

Question to the developers: typically, how many threads could make sense in a game?
- Game engine core and control
- AI (one for each entity?)
- Physics
- Graphics (multiple?)
- 3D Sound
- online tools
- network

Thanks :)
 
This discussion has been about OOO being added back into console processors. I've been reading Intel's presentations on "Terascale" processing, which is planned to appear by around 2010, and it looks very much like they are planning to drop OOO in favour of a large number of heavily threaded processors.

So, if even Intel is planning to drop OOO, what chance is there of it going into a future console processor?
 
This discussion has been about OOO being added back into console processors. I've been reading Intel's presentations on "Terascale" processing, which is planned to appear by around 2010, and it looks very much like they are planning to drop OOO in favour of a large number of heavily threaded processors.

So, if even Intel is planning to drop OOO, what chance is there of it going into a future console processor?

Well, considering that AMD likes to reuse designs much more than Intel does, I wouldn't be surprised to see their processors in 2010 still using the K8 or K8L core (or something similar), but shrunk. It may even give them a performance advantage in key areas if they're the only ones still using OOOE.
 
Well, considering that AMD likes to reuse designs much more than Intel does, I wouldn't be surprised to see their processors in 2010 still using the K8 or K8L core (or something similar), but shrunk. It may even give them a performance advantage in key areas if they're the only ones still using OOOE.

AMD likes it more? Well, derivatives of the P6 core have been in use since ~1995. Yonah is an awful lot like the Pentium Pro in many ways, and Core 2 Duo is essentially a modernized version of it. NetBurst was the only major change, and it's gone now, lol.

The same can be said of K8 and K7, but those aren't even close to being the same age as P6.
 