Future console CPUs: will they go back to OoOE, and other questions.

I don't know about this part, but aren't MS/IBM providing good compilers for the Xenon CPU?

To be honest I don't know what MS are providing, but IBM are relying on GCC for the PPE, which is essentially the same CPU. I don't think you can compile the vast quantities of stuff that make up a complete desktop operating system with GCC for an architecture like that and expect blanket decent performance without investing a lot of developer effort in optimization. Still - I admit this is hand-waving on my part - I don't have much experience of GCC outside of x86. Also, RISC has been waiting for compilers to take the place of silicon for twenty years now, and both Transmeta's stuff and IA64 seem to argue that we haven't reached that point yet...
 
Does an OOE processor have an innate advantage with multitasking? And considering most applications aren't coded by people who know how to optimize things, I think performance would be shockingly terrible in most situations. It seems unrealistic to me to switch to IOE on such a general platform.

OOE doesn't extract parallel instructions from separate tasks; that would be multithreading.
If the in-order core has a shorter pipeline, it would probably do better (if everything else were equal, which never really happens).

However, there are other factors that can influence performance in a multi-tasking scenario, or in any workload where multiple threads are taking turns with the same cache.

If the memory behavior isn't perfect between the threads, cache miss rates go up, especially for the smaller L1 caches. If there is a lower level of cache to catch the evicted data, this performance loss could be managed.

However, this all adds unpredictable latency to execution, something in-order hates. As long as the data stays on chip, OO can be expected to handle it very cleanly, or at least more gracefully than an in-order.
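
A toy illustration of this point - nothing like a real cache simulator, and the sizes (a 32KB direct-mapped L1, 24KB per-thread working sets) are made up for the sketch - but it shows how two individually well-behaved threads can still thrash a shared L1 when they take turns:

```cpp
// Toy model: two threads time-slice on one core's direct-mapped "L1"
// (hypothetical 32KB, 64B lines -> 512 sets). Each thread streams over
// its own 24KB buffer: alone it fits and hits; interleaved, the threads
// keep evicting each other's lines.
#include <cstdio>
#include <vector>

int main() {
    const int kSets = 512;                    // 32KB / 64B lines
    std::vector<long> tag(kSets, -1);
    long hits = 0, misses = 0;

    auto touch = [&](long line) {             // model one cache-line access
        int set = static_cast<int>(line % kSets);
        if (tag[set] == line) ++hits;
        else { tag[set] = line; ++misses; }
    };

    const int kBufLines = 384;                // 24KB working set per thread
    for (int slice = 0; slice < 100; ++slice)
        for (long t = 0; t < 2; ++t)          // threads take turns each slice
            for (long i = 0; i < kBufLines; ++i)
                touch(t * 10000 + i);         // disjoint address ranges

    std::printf("miss rate: %.1f%%\n", 100.0 * misses / (hits + misses));
    return 0;
}
```

Run either "thread" alone and the miss rate is near zero after warm-up; interleaved, the conflicting sets miss on most of the slices.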

To reverse the question, what performance would you expect from the Xenon CPU on pointer-chasing OOP method calls versus Conroe?

If it's highly serial and nasty spaghetti code, I'd expect an in-order like Xenon to be strongly competitive with a Conroe.
There's little chance of accurately predicting through layers of indirection, and if it's just one long chain of dependencies, Conroe's issue width is wasted.
In a lot of tricky integer apps, raw clock speed is often the primary determinant of performance.

This does assume, however, that the application in question basically ignores all of Conroe's strengths and is an all-out serial pointer chase. Even basic checks done by a lot of object-oriented code supply some ILP.
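
For the record, the kind of all-out serial pointer chase meant here looks something like this minimal sketch - every load's address depends on the previous load's result, so a wide out-of-order core has nothing extra to issue:

```cpp
// Minimal pointer-chase loop: each iteration's load address comes from
// the previous load, leaving no independent work for a wide OoO core.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t n = 1 << 20;                 // ~1M "nodes"
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::shuffle(next.begin(), next.end(), std::mt19937{42});

    size_t i = 0;
    for (int step = 0; step < 10000000; ++step)
        i = next[i];                          // serial dependency chain
    std::printf("%zu\n", i);                  // keep the result live
    return 0;
}
```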
 
One more thing people forget is the core size.
IIRC a Xenon CPU core is only 28mm2. Scale it down to 65nm and you have ~14mm2, probably less than half the size of a Conroe core. That means more cores per chip.
Could someone confirm my guess? Thanks :)
 
That sounds odd... unless they found a way to stuff part of the die into another dimension.

I would imagine Xenon's die should be quite a bit bigger than that -- somewhere around 150mm^2 (can't remember actual size).
 
I think he cut out cache and is just counting each core.
 
If it's highly serial and nasty spaghetti code, I'd expect an in-order like Xenon to be strongly competitive with a Conroe.
There's little chance of accurately predicting through layers of indirection, and if it's just one long chain of dependencies, Conroe's issue width is wasted.
In a lot of tricky integer apps, raw clock speed is often the primary determinant of performance.

This does assume, however, that the application in question basically ignores all of Conroe's strengths and is an all-out serial pointer chase. Even basic checks done by a lot of object-oriented code supply some ILP.

OK. In a less extreme case, though, I'd have thought that stuff like function call return address caching + out-of-order loads would give Conroe quite an edge on OOP code - provided some actual computation is done between method calls + other pointer indirections?
 
That sounds odd... unless they found a way to stuff part of the die into another dimension.

I would imagine Xenon's die should be quite a bit bigger than that -- somewhere around 150mm^2 (can't remember actual size).
No other dimension, this one.

The Xenon CPU cores at 90nm total 3 x 28mm2 (per core) = 84mm2.
Now add 1MByte of cache (50mm2) and glue logic (20mm2): 84 + 50 + 20 = 154mm2 (more or less).

Now scale a single Xenon CPU core from 90nm (28mm2) down to 65nm and you get something around 14mm2. Isn't the Conroe core at ~30mm2 on 65nm?
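
Sanity-checking that arithmetic, assuming ideal optical scaling (which real shrinks rarely achieve in full):

```cpp
// Ideal 90nm -> 65nm shrink: linear dimensions scale by 65/90,
// so area scales by (65/90)^2 ~ 0.52.
#include <cstdio>

int main() {
    const double core90 = 28.0;   // mm2 per Xenon core at 90nm, as quoted above
    const double shrink = (65.0 / 90.0) * (65.0 / 90.0);
    std::printf("one core at 65nm: ~%.1f mm2\n", core90 * shrink);  // ~14.6
    std::printf("90nm die: %.0f mm2\n", 3 * core90 + 50.0 + 20.0);  // 154
    return 0;
}
```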
 
Yeah, makes sense now -- I guess I should keep my mouth shut if I am not paying that close attention to the conversation at hand. I thought you were talking about the full die. =p
 
Is that relevant though? IMO, the entire die is what counts, not how small the cores are.

In roughly the same space where Xenon has 3 CPU cores and 1MB of cache, Conroe has 2 CPU cores and 2MB of cache. Different process generations, yes, but I don't think there's much reason to arbitrarily hold that against Conroe any more than you would hold one CPU's MHz against another; it was part of the design and it's what has been successfully achieved for some time now. If you're going to compare the chips, compare them as they are now. Right now, a 65nm Xenon is just as real as a single-core Conroe with Cell's SPEs (except the former will actually happen some day).
 
I would rather know what is considered "media centric" and what is considered "general purpose".

Seems to me like 90% of your average benchmarking suite at any given site could be considered "media centric" (encoding, decoding, (de)compression, gaming, file conversion) with pretty much only your "office productivity" tests being considered "general purpose", at least in the sense that general purpose is everything that isn't "media centric".

But if that's the case, I'm finding it hard to believe that AMD/Intel spent so much effort speeding up the part of the CPU which is already clearly fast enough, at the relative expense (compared to what they could be doing) of the parts which are mostly used to judge their speed these days.

If anything I would have thought Intel and AMD would focus most on gaming performance, since that's what gets the most focus from the media. Hell, the P4 and A64 were fairly even outside of gaming except for the last 2-3 processor speed bumps, but the whole world seemed to focus on the A64 being the far superior CPU purely because it thrashed the P4 in games.

Yes, but doesn't application benchmarking of a PC mean running exactly the same binary program on a number of different systems to check performance differences? These benchmarks test only conventional single-threaded code, and occasionally SMP OS-scheduled multi-tasking code. It is not surprising that conventional OoOE processors come out looking good in these benchmarks. If you rewrote an application to actually use all the cores in Cell or Xenon as they were intended to be used, you would get a fairer comparison of CPU performance potential.

Maybe that is the reason why PC microprocessor manufacturers have so far stuck with the deep OoOE and massive cache philosophy to extract the last drop of single-thread performance at the expense of everything else - simply because when you benchmark them on the same binary applications, that is the only thing that shows up.

Windows is a big weight holding things back here - Microsoft won't support anything but generic x86 libraries, so accelerating Windows by rewriting various libraries, say for Cell or Xenon, to use parallel processing can't be done. The same applies to the use of compilers optimised for a particular IOE processor - proprietary applications and Windows are closed source and can't be recompiled.

The same thing applies to general purpose Linux applications that run on many platforms, but the fact that Linux and most Linux applications are open source means that distributions targeting specific non-PC platforms, e.g. Linux on the PS3, can make use of these optimisations - particularly for libraries.
 
One more thing people forget is the core size.
IIRC a Xenon CPU core is only 28mm2. Scale it down to 65nm and you have ~14mm2, probably less than half the size of a Conroe core. That means more cores per chip.
Could someone confirm my guess? Thanks :)

Given that those Xenon cores are only 2-issue, that's no surprise.
 
Is that relevant though? IMO, the entire die is what counts, not how small the cores are.
Yes, it is for the topic of this thread. We are talking about architectures, aren't we?

It shows that an IOE, deep-pipeline RISC with strong SIMD support can be highly efficient for a multicore chip.

In the same 65nm die space where you have 2MB of cache and 2 Conroe cores, you could (maybe) have:
- 2MB cache
- 4 IOE deep pipe RISC at 4.5GHz
- 18 Giga dot products/sec.
- 218 GFlops peak
:cool:
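
A back-of-envelope check on those last two figures, assuming one VMX dot product per cycle per core and 12 flops/cycle/core (the per-core rate implied by Xenon's commonly quoted 115.2 GFlops at 3 cores x 3.2GHz):

```cpp
#include <cstdio>

int main() {
    const double cores = 4, ghz = 4.5, flops_per_cycle = 12;  // assumptions above
    std::printf("dot products: %.0f G/s\n", cores * ghz);               // 18
    std::printf("peak: %.0f GFlops\n", cores * ghz * flops_per_cycle);  // 216
    return 0;
}
```

That lands at 216 GFlops, close to the 218 quoted.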

Given that those Xenon cores are only 2-issue, that's no surprise.
And IOE too.
 
I don't think cache size should be tossed out of die-size calculations, though; after all, cache has a direct material effect on the performance of many of the architectures we're discussing.
 
In the same 65nm die space where you have 2MB of cache and 2 Conroe cores, you could (maybe) have:
- 2MB cache
- 4 IOE deep pipe RISC at 4.5GHz
- 18 Giga dot products/sec.
- 218 GFlops peak


Xenon would scale to 4.5GHz today but it'd use an obscene amount of power in the process. The purpose of a 65nm shrink in the case of the 360 will be to cut costs and power consumption so don't expect to see any clock speed or cache size changes.

Cell is a different matter as it'll be selling in different markets, it'll appear at 4GHz at some point even in 90nm and it'll probably go higher in 65nm. The 65nm Cell in the PS3 will most likely run at the same 3.2GHz speed.
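
A rough illustration of why that frequency bump is so expensive on the same process: dynamic power scales roughly with C*V^2*f, and the extra ~40% of frequency would normally demand a voltage bump on top (the 10% voltage figure below is purely an assumption):

```cpp
#include <cstdio>

int main() {
    const double f_ratio = 4.5 / 3.2;  // ~1.41x frequency
    const double v_ratio = 1.10;       // assumed voltage increase for 4.5GHz
    std::printf("dynamic power: ~%.0f%% of baseline\n",
                100.0 * f_ratio * v_ratio * v_ratio);  // ~170%
    return 0;
}
```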
 
To reverse the question, what performance would you expect from the Xenon CPU on pointer-chasing OOP method calls versus Conroe?

Well, I can tell you how one Xenon core compares to one 3GHz P4 core. I wrote a little benchmark with some typical STL usage and the result is pretty shocking: basically the P4 is TWICE as fast. The code generated for the PPC actually looks better than the x86 code, yet it fails to perform as fast.
Of course your mileage may vary.
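
The poster's actual benchmark isn't shown; a hypothetical sketch of "typical STL usage" in this vein would be something like a node-based container walk, which is mostly dependent loads - exactly the pattern an in-order PPE core handles worst:

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<int, std::string> m;       // red-black tree: one heap node per entry
    for (int i = 0; i < 100000; ++i) m[i] = "value";

    auto t0 = std::chrono::steady_clock::now();
    size_t total = 0;
    for (const auto& kv : m)            // tree walk: a pointer chase per node
        total += kv.second.size();
    auto t1 = std::chrono::steady_clock::now();

    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("%zu bytes, %lld us\n", total, us);
    return 0;
}
```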
 
I don't think cache size should be tossed out though of die-size calculations; afterall, cache has a direct material effect on the performance of many of the architectures we're discussing.
Nobody is disregarding the cache size.

OK, some guesswork. Let's think about a 65nm process:
- 2MB cache Xenon CPU style 50mm2
- chip glue logic 25mm2
- 4 IOE RISC cores 4 x 14mm2 = 56mm2

Total 50 + 25 + 56 = 131mm2
 
Xenon would scale to 4.5GHz today but it'd use an obscene amount of power in the process. The purpose of a 65nm shrink in the case of the 360 will be to cut costs and power consumption so don't expect to see any clock speed or cache size changes.
I was thinking of the same TDP.

But that doesn't change the overall architectural assumptions.
 
OK. In a less extreme case, though, I'd have thought that stuff like function call return address caching + out-of-order loads would give Conroe quite an edge on OOP code - provided some actual computation is done between method calls + other pointer indirections?

That last qualifier basically requires me to say yes, mostly. If there's enough ILP between loads, yes, Conroe can win easily.
If there isn't, the race can be a lot closer than it would seem otherwise.

Caching the return address would help, but I'm not sure that Xenon can't do that. In-order cores can store an address as well as any other core.

Out of order loads won't help with pointer chasing, because you can't send a load before its target address is calculated. If a chip is pointer chasing, the target address can't be calculated until after the prior load is completed.

This is part of the reason why Intel's high-clocking Xeons could sometimes beat lower-clocked Opterons on integer code, despite AMD's offerings excelling everywhere else.

Sometimes the limiting factor is how fast you can churn through those dependency chains.
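
The contrast in those last two points, as a sketch: the first loop's loads are all independent and can overlap, while the second loop's loads each wait on the previous one, so the memory latency serializes completely:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 16;
    std::vector<size_t> a(n, 1), next(n);
    for (size_t i = 0; i < n; ++i)
        next[i] = (i + 7919) % n;   // fixed-stride links; still a serial chain

    size_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];                // independent loads: plenty of ILP

    size_t j = 0;
    for (size_t i = 0; i < n; ++i)
        j = next[j];                // each load depends on the last: no ILP
    std::printf("%zu %zu\n", sum, j);
    return 0;
}
```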
 