End of the line for fast consumer CPU means what for gaming?

Can you give more detail on what parallels you see?
Well, sorry for not being clear to begin with. In the aforementioned thread there were several comments about how neither Xenon nor the PPU is exactly a poster child for "high"-performance in-order cores. I found what ARM did with this new CPU interesting, as they managed (supposedly) to improve its performance compared to their previous in-order cores while lowering power consumption (partly down to the lithography used, though) and using fewer transistors. ARM stated that they optimized the CPU for Android usage. The A7 looks like an A8 done right. I believe the IBM POWER A2 is Xenon's heir, and I expect it to perform closer to its peak than Xenon does. To me an AT module looks like a Xenon done right.
I find both of those CPUs interesting, as the designers seem to have significantly upped performance without relying on out-of-order execution.

Can you point to the statements that indicated this?
About the R&D cost of implementing an OoO engine and the testing that follows, I believe there were multiple comments on the matter.
For the PPU and Px, as well as the idea of the SPUs, I believe it was ERP who heard confusing noise/rumors on the matter (no matter the official statements).
Overall, what I was trying to say is that neither MS nor Sony may fund IBM with enough money for them to "outdo themselves". I don't expect IBM to do a better front end than what they did for POWER7 on what is most likely a "lesser" project, nor do I expect them to push out better "throughput-optimized cores" than the POWER A2 for such a project. IBM may jump on the opportunity of having MS/Sony fund projects they already had hanging around, but my idea was that MS or Sony might as well go with off-the-shelf parts, save on R&D, and focus their R&D on putting everything together properly. I was actually asking your opinion about such a choice.
 
The A7 looks like an A8 done right.
The A8 was a less-than-optimal implementation of the ARM architecture, which the A7 corrects.
The A7 is also designed to appear functionally identical to the OoO A15.

I believe the IBM POWER A2 is Xenon's heir, and I expect it to perform closer to its peak than Xenon does. To me an AT module looks like a Xenon done right.
I find both of those CPUs interesting, as the designers seem to have significantly upped performance without relying on out-of-order execution.
There's a lot to the POWERPC EN platform besides the cores. It targets workloads Xenon does not, and its much lower clock and small core-level caches may not lead to uniform improvement.


About the R&D cost of implementing an OoO engine and the testing that follows, I believe there were multiple comments on the matter.
Which comments stated IBM used an already existing design they had hanging around?
I do not remember strong evidence of that. Given what Sony and Microsoft paid IBM, hopefully more work was put into the project.
Microsoft was willing to pay for an OoO Xenon, but it didn't have the time. There will be plenty of cheap OoO cores out in the market by the time the next consoles roll out, so it's less of an excuse this time around.
 
The 3DS has some major problems to fight. One, it looks like a DS, and the games, at least from what I know, look like DS games.

Without wanting to offend you, that's a really ignorant thing to say. The 3DS is easily a generation ahead of the DS graphically.
 
Without wanting to offend you, that's a really ignorant thing to say. The 3DS is easily a generation ahead of the DS graphically.
It just occurred to me that maybe the 3DS appears to offer limited graphical progress because the DS had a lot of 2D content? I may be very wrong here; I have very little experience of the DS's library. But what I have seen is often sprite-based (Layton, Super Mario, that sort of thing). The 3DS naturally wants 3D vector graphics, which look poop by comparison to beautiful hand-drawn sprites or cleanly rendered 2D vectors. So where the 3DS improves on the DS's visuals, comparing 3DS graphics to the DS's (2D) graphics, it doesn't look like an improvement. Also, are devs using the 3DS's advanced features yet, or are they still using DS-type designs, as you'd expect from early titles?

Please do correct me if I'm wrong. ;)
 
The A8 was a less-than-optimal implementation of the ARM architecture, which the A7 corrects.
The A7 is also designed to appear functionally identical to the OoO A15.
OK, no problem with that, but would you deem Xenon a good in-order processor?

There's a lot to the POWERPC EN platform besides the cores. It targets workloads Xenon does not, and its much lower clock and small core-level caches may not lead to uniform improvement.
Well, the POWER A2 core, or the AT module (4 cores + 2MB of L2), is only one part of the Power EN wire-speed processor. They are also used in the new Blue Gene chips. They do indeed have smaller L1 caches but a bigger L2; so far IBM hasn't disclosed much about the respective L1 and L2 latencies, but I would bet that the L2 is not running at half the chip's clock speed as in Xenon, and that the latency figures are better. I don't know about the L1. IBM has left a lot of things in the dark: what the pipeline lengths are, whether the branch predictor has been improved, etc.
Charlie Johnson stated here that the chip could run at ~3GHz, but that for them this was not the sweet spot in power efficiency. I don't want to read too much into it (being two process nodes ahead of Xenon, if the chip were as heavily pipelined as Xenon it should be able to reach way higher clocks, POWER6-like).
So for now it's a belief; when the chip becomes commercially available I guess we will learn more (or if someone gets access to details about the architecture). But I believe that the POWER A2 core (not the whole wire-speed processor or Blue Gene) is an evolution of Xenon, and it has to be better than Xenon at its job, i.e. achieving higher overall throughput per watt. I would not be surprised if Xenon served as guidance, if not a prototype, while IBM fine-tuned/designed the POWER A2 cores. A POWER A2 running at the same speed may lose to Xenon on some tasks, but my gut feeling is that it would beat it across the board.
Which comments stated IBM used an already existing design they had hanging around?
I do not remember strong evidence of that. Given what Sony and Microsoft paid IBM, hopefully more work was put into the project.
Microsoft was willing to pay for an OoO Xenon, but it didn't have the time. There will be plenty of cheap OoO cores out in the market by the time the next consoles roll out, so it's less of an excuse this time around.
I'm speaking of this, this, this, and this comment, and maybe a few others further down the thread.
I don't think we will know the truth anyway; if one company somehow tricked another into co-funding a processor that it used as a prototype for upcoming products, neither would want it to be public. Either way, IBM may really have failed, yet still used Xenon as a test platform and worked on improvements for their A2 cores.
 
OK, no problem with that, but would you deem Xenon a good in-order processor?
No. Xenon is not particularly good.
The thing about A7 is that one significant part of its specification is its close resemblance to the OoO A15. Part of its functionality going forward is working alongside OoO cores on the same chip.

I'm speaking of this, this, this, and this comment, and maybe a few others further down the thread.
I don't think we will know the truth anyway; if one company somehow tricked another into co-funding a processor that it used as a prototype for upcoming products, neither would want it to be public. Either way, IBM may really have failed, yet still used Xenon as a test platform and worked on improvements for their A2 cores.

At least some of that information is inaccurate. Microsoft could have had an OoO design, but it was warned it would take longer. Microsoft opted for in-order.
IBM's input on Cell was heavily focused on the PPE and the ring bus. The SPEs were part of what Sony and Toshiba wanted. IBM wanted a homogeneous PowerPC-based chip. Toshiba wanted a bunch of SPE-like cores of a different ISA.
 
At least some of that information is inaccurate. Microsoft could have had an OoO design, but it was warned it would take longer. Microsoft opted for in-order.
IBM's input on Cell was heavily focused on the PPE and the ring bus. The SPEs were part of what Sony and Toshiba wanted. IBM wanted a homogeneous PowerPC-based chip. Toshiba wanted a bunch of SPE-like cores of a different ISA.
Well, that's what I thought till I re-read this old thread. I guess that six years after release we've learned enough to discard ERP's source. Anyway, that was not really the main part of my post.
I'll try to be clearer so it's easier for you to answer.

If I'm getting your PoV properly, you're calling for an SMP design, right?
You would want relatively tiny cores with not-too-aggressive out-of-order execution, and 2-way SMT to extract further parallelism from the code or to cover for the out-of-order engine when it fails to hide, say, L2 latency. You think IBM should be able to deliver, as they are unlikely to be time-constrained this time around.

My questions (not really a PoV) were:
1) Even given enough time, will IBM do as good a job as they did with POWER7 or POWER A2? It's somewhat budget-related: do you think MSFT or Sony could provide enough money for IBM to really optimize a chip as thoroughly as they do their server chips (which, together with the matching software, earn them a lot of money, so they may indeed optimize those chips heavily, not only for performance but also for power-saving features, etc.)?

2) Assuming the answer to the aforementioned question is no, I wonder if it could be better to take what are supposedly really optimized parts and put them together into a heterogeneous design. So instead of "average cores" you would have one big core with high single-thread performance (POWER7) and tinier cores (the in-order POWER A2 ones) which are optimized for throughput per watt.
So, as a gross example: would you go with 2 POWER7 cores, 4 "average" cores, or 1 POWER7 and 4 POWER A2 cores?
It's not that simple, as even at this gross level the choice has an impact on the amount of cache and the cache hierarchy: dual setup, 2x256KB L2 + 3MB L3; quad setup, 4x512KB L2; heterogeneous, 256KB L2 plus 3MB L3 (the L3 acting as the L2 for the POWER A2 cores).
In regard to power consumption, as a ranking (lowest to highest) I would expect: heterogeneous, dual-core, quad-core. Either way, clock speeds could go lower in the same order.

I don't expect a precise answer to such a "gross" presentation, just your gut feeling about it (other members are welcome to chime in, by the way): heterogeneous versus SMP, and in the case of a heterogeneous setup, how about going with existing parts?
 
If I'm getting your PoV properly, you're calling for an SMP design, right?
You would want relatively tiny cores with not-too-aggressive out-of-order execution, and 2-way SMT to extract further parallelism from the code or to cover for the out-of-order engine when it fails to hide, say, L2 latency. You think IBM should be able to deliver, as they are unlikely to be time-constrained this time around.
I'd want a massively overpowered system, but I'd expect a multicore design with modest OoO cores that is at a high level similar to what is out there now with multicore general purpose SMP. Alongside this will probably be specialized graphics and other dedicated hardware for a media device.

Since there are now cheaper designs done by teams with far less experience at OoO than IBM Microelectronics, it would take a lot of explaining why they can't do that as well.

My questions (not really a PoV) were:
1) Even given enough time, will IBM do as good a job as they did with POWER7 or POWER A2?
With enough time and money, they could. However, they probably still wouldn't because the rest of the console system will probably be nowhere near good enough to feed a core with the raw performance of a POWER7, and they won't have a massive silicon budget either. The chip would not target a workload that needs a POWER7 or A2.

2) Assuming the answer to the aforementioned question is no, I wonder if it could be better to take what are supposedly really optimized parts and put them together into a heterogeneous design. So instead of "average cores" you would have one big core with high single-thread performance (POWER7) and tinier cores (the in-order POWER A2 ones) which are optimized for throughput per watt.
It's tougher to justify the engineering investment in something like what ARM offers with the A7 and A15, since consoles have much more wiggle room in power terms than a mobile device. The inefficiencies of the console's power supply would probably swamp the difference, assuming the console chip has power gating and power management. Unlike in the current-gen consoles, deeply integrated power management will probably be a requirement.

Unlike the A7 and A15, I don't think POWER7 and A2 are able to impersonate one another with full compatibility.
 
3dilettante, I have a rather off-topic question but I hope you will answer as you seem very knowledgeable.
In terms of running typical game code, what x86 PC CPU is most equivalent to Xenon? I recall it being compared to a 2GHz Athlon X2 (this would have been back in the S939 days).

I am always interested to know these things, and from a gamer's perspective it's really hard to judge since console -> PC ports are generally poorly optimized and require much more CPU power than they logically should.

Relating to the thread title, this is not really the end of the line for fast consumer CPUs. Even at 77W, Ivy Bridge will be incredibly fast, faster than Sandy Bridge, and we all know what a beast that is. Further, Haswell will move back to a 95W top TDP (perhaps because more functionality is moving from the motherboard onto the CPU, but still) and we can only imagine how fast that will be.

So long as there is a demand, there will be fast consumer CPUs. We are simply in an odd place right now since Intel is in a league of its own and has nobody to compete with but itself. If Piledriver comes out strong and puts some real pressure on Intel (not likely but we can hope), don't be surprised if Intel changes its tune regarding TDP.
 
In terms of running typical game code, what x86 PC CPU is most equivalent to Xenon? I recall it being compared to a 2GHz Athlon X2 (this would have been back in the S939 days).
The Athlon 64 X2 (2GHz) was the most powerful PC CPU at the time the Xbox 360 was released. It was natural that some developers building the first generation of Xbox 360 games compared the two (and found that the Athlon X2 performed better in some of their existing code). But you have to understand that the Xbox 360 hardware was brand new, and the developers didn't have much experience with it. Multithreaded game programming was just taking its first baby steps, and suddenly they had to program for a 6-thread (SMT) in-order CPU (with powerful VMX128 vector units). Most games were single-threaded back then (all previous consoles were single-threaded, and the PC got its first dual-core CPUs in 2005). The Xbox 360, on the other hand, required developers to fully split their code across six threads if they wanted to get anywhere close to full performance out of it. It was a big change.
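
To make "split their code across six threads" concrete, here is a minimal sketch in modern C++ (std::thread rather than the 360's actual threading API; the six-subsystem breakdown is purely illustrative, not from any real engine):

```cpp
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

int main() {
    // One independent job per hardware thread: Xenon offered
    // 3 cores x 2-way SMT = 6 threads. Subsystem names are made up.
    std::vector<std::function<void()>> jobs = {
        []{ std::puts("AI"); },
        []{ std::puts("physics"); },
        []{ std::puts("animation"); },
        []{ std::puts("audio"); },
        []{ std::puts("render lists"); },
        []{ std::puts("asset streaming"); },
    };
    std::vector<std::thread> workers;
    for (auto& job : jobs) workers.emplace_back(job);  // fork the frame's work
    for (auto& t : workers) t.join();                  // sync before presenting
}
```

The hard part, of course, is not the forking but making the subsystems independent enough to run concurrently in the first place.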

If you compare the Xbox 360 launch titles to the games we have now (Battlefield 3, Crysis 2, Rage), the difference is huge. A 2GHz dual-core Athlon would likely have resulted in slightly better launch titles... but for running the recent, fully optimized multithreaded games, it wouldn't have any chance of competing against the six-threaded XCPU. At the time, the XCPU also had a very forward-looking vector instruction set. VMX128 includes dot product (SSE4.1, 2008), FMA (Bulldozer, 2011) and float16/32 conversion (Bulldozer / Ivy Bridge, 2011/2012) instructions.
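
As a rough illustration for x86 readers, here is how those operations look with the intrinsics that eventually matched VMX128; this sketch assumes a CPU and compiler with SSE4.1 and FMA support (e.g. g++ -msse4.1 -mfma):

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    __m128 a = _mm_set_ps(4.f, 3.f, 2.f, 1.f);   // lanes: 1, 2, 3, 4
    __m128 b = _mm_set_ps(8.f, 7.f, 6.f, 5.f);   // lanes: 5, 6, 7, 8
    __m128 c = _mm_set1_ps(10.f);

    // 4-wide dot product in one instruction (SSE4.1 dpps, 2008):
    // 1*5 + 2*6 + 3*7 + 4*8 = 70, stored in lane 0.
    __m128 dot = _mm_dp_ps(a, b, 0xF1);

    // Fused multiply-add a*b + c (x86 FMA): lane 0 is 1*5 + 10 = 15.
    __m128 fma = _mm_fmadd_ps(a, b, c);

    float d, f[4];
    _mm_store_ss(&d, dot);
    _mm_storeu_ps(f, fma);
    std::printf("dot = %.1f, fma lane 0 = %.1f\n", d, f[0]);
}
```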

It would be really hard to compare the XCPU directly to PC CPUs. In-order vs. out-of-order is the first difficulty; then there's SMT, the RISC instruction set, and the different vector instructions. In-order execution hurts less if you optimize for it; SMT, however, helps less if the code doesn't have any stalls; vector instructions do help a lot, but only if the particular code can be vectorized. There isn't any PC in-order CPU (except for Atom, but it's not in any way comparable); there are some 3-core (AMD) CPUs, but those do not have SMT (6 threads); and none of the older SSE versions exactly match VMX128's capabilities. AVX does, but its vectors are twice as wide (much higher throughput).
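
A small sketch of the two loop shapes in play here (illustrative code, not from any game): a dependent pointer chase, where SMT can fill the stall cycles, and an independent loop, where the vector units pay off instead:

```cpp
#include <cstddef>

// Pointer chasing: every load depends on the previous one, so the core
// stalls on each cache miss. This is where a second SMT thread can use
// the otherwise-dead cycles.
struct Node { Node* next; int value; };
int sum_list(const Node* n) {
    int sum = 0;
    while (n) { sum += n->value; n = n->next; }
    return sum;
}

// Independent iterations: few stalls to hide (SMT gains little), but the
// loop vectorizes trivially, which is where VMX128-class units shine.
void scale(float* out, const float* in, float k, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        out[i] = in[i] * k;
}
```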
 
The Athlon 64 X2 (2GHz) was the most powerful PC CPU at the time the Xbox 360 was released. It was natural that some developers building the first generation of Xbox 360 games compared the two (and found that the Athlon X2 performed better in some of their existing code). But you have to understand that the Xbox 360 hardware was brand new, and the developers didn't have much experience with it. Multithreaded game programming was just taking its first baby steps, and suddenly they had to program for a 6-thread (SMT) in-order CPU (with powerful VMX128 vector units). Most games were single-threaded back then (all previous consoles were single-threaded, and the PC got its first dual-core CPUs in 2005). The Xbox 360, on the other hand, required developers to fully split their code across six threads if they wanted to get anywhere close to full performance out of it. It was a big change.

If you compare the Xbox 360 launch titles to the games we have now (Battlefield 3, Crysis 2, Rage), the difference is huge. A 2GHz dual-core Athlon would likely have resulted in slightly better launch titles... but for running the recent, fully optimized multithreaded games, it wouldn't have any chance of competing against the six-threaded XCPU. At the time, the XCPU also had a very forward-looking vector instruction set. VMX128 includes dot product (SSE4.1, 2008), FMA (Bulldozer, 2011) and float16/32 conversion (Bulldozer / Ivy Bridge, 2011/2012) instructions.

It would be really hard to compare the XCPU directly to PC CPUs. In-order vs. out-of-order is the first difficulty; then there's SMT, the RISC instruction set, and the different vector instructions. In-order execution hurts less if you optimize for it; SMT, however, helps less if the code doesn't have any stalls; vector instructions do help a lot, but only if the particular code can be vectorized. There isn't any PC in-order CPU (except for Atom, but it's not in any way comparable); there are some 3-core (AMD) CPUs, but those do not have SMT (6 threads); and none of the older SSE versions exactly match VMX128's capabilities. AVX does, but its vectors are twice as wide (much higher throughput).

Thank you for going into all that.

So from what I get out of that, the XCPU wasn't particularly good for the workloads common at launch, but is quite good for the workloads we'll be seeing now and in the future (on both console and PC). Interesting that it was so forward-looking that some of its features weren't matched in desktop CPUs until just recently.

Regards,
SB
 
The emphasis on vector throughput is something that the console chips seized upon ahead of the desktop cores, at the expense of performance on the bulk of the desktop workloads Intel and AMD supported.

As noted, the vector capability of Xenon and Cell was far beyond what a desktop chip could offer. In terms of peak SP throughput, it wasn't until the quad cores came out that the desktop got anywhere near Xenon, and not until Sandy Bridge and AVX that it got near Cell.
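
For a rough sense of those numbers, the usual peak-FLOPS arithmetic is clock x cores x SIMD lanes x flops per lane per cycle; the figures below are theoretical peaks under that formula, not measurements:

```cpp
#include <cstdio>

// Peak single-precision GFLOPS: GHz x cores x SIMD lanes x flops/lane/cycle.
double peak_gflops(double ghz, int cores, int lanes, int flops_per_lane) {
    return ghz * cores * lanes * flops_per_lane;
}

int main() {
    // Xenon: 3 cores @ 3.2 GHz, 4-wide VMX128 with FMA (2 flops/lane).
    std::printf("Xenon          ~%6.1f GFLOPS\n", peak_gflops(3.2, 3, 4, 2));
    // Early desktop quad: 4 cores @ 2.4 GHz, 4-wide SSE mul+add.
    std::printf("Desktop quad   ~%6.1f GFLOPS\n", peak_gflops(2.4, 4, 4, 2));
    // PS3 Cell: 6 usable SPEs @ 3.2 GHz, 4-wide FMA (PPE excluded).
    std::printf("Cell (6 SPEs)  ~%6.1f GFLOPS\n", peak_gflops(3.2, 6, 4, 2));
    // Sandy Bridge: 4 cores @ 3.4 GHz, 8-wide AVX add + mul ports.
    std::printf("Sandy Bridge   ~%6.1f GFLOPS\n", peak_gflops(3.4, 4, 8, 2));
}
```

Which matches the claim above: the desktop quads land in Xenon's neighborhood, and only AVX-class width gets a desktop part near Cell.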

On integer code, it would be nice to see comparative benchmarks. The hardware on the console chips is inferior, and the desktop chips were optimized for strong performance on applications that do not run on consoles.
 
On integer code, it would be nice to see comparative benchmarks. The hardware on the console chips is inferior, and the desktop chips were optimized for strong performance on applications that do not run on consoles.
There must be many multiplatform developers lurking in the forums. Would be nice to know their opinion about the matter.

The Llano APU (four AMD Stars cores + an integrated Radeon with 160 unified VLIW5 shaders) runs most console ports at 40-60 fps when the resolution is set to 720p and antialiasing and anisotropic filtering are disabled (matching the console experience). So we could estimate that Llano's GPU and CPU are both faster than their console counterparts when running average game code. Also, if you equip Llano with a Radeon 6970, the frame rates are even higher, indicating that Llano's GPU is the bottleneck when running console ports. The four-core Stars CPU is not the limiting factor.

There are some console ports (Mass Effect 2, for example) that run smoothly at 30 fps (720p, minimum settings) even on an AMD E350 APU. It has two Bobcat (out-of-order) cores running at 1.6 GHz and an integrated Radeon core with 80 unified (VLIW5) shaders.
 
There must be many multiplatform developers lurking in the forums. Would be nice to know their opinion about the matter.

The Llano APU (four AMD Stars cores + an integrated Radeon with 160 unified VLIW5 shaders) runs most console ports at 40-60 fps when the resolution is set to 720p and antialiasing and anisotropic filtering are disabled (matching the console experience). So we could estimate that Llano's GPU and CPU are both faster than their console counterparts when running average game code. Also, if you equip Llano with a Radeon 6970, the frame rates are even higher, indicating that Llano's GPU is the bottleneck when running console ports. The four-core Stars CPU is not the limiting factor.

There are some console ports (Mass Effect 2, for example) that run smoothly at 30 fps (720p, minimum settings) even on an AMD E350 APU. It has two Bobcat (out-of-order) cores running at 1.6 GHz and an integrated Radeon core with 80 unified (VLIW5) shaders.

I'm not sure you can do a comparison like that, though. PC versions of games tend to lean even less on the CPU than the console versions do, and multi-platform titles are suspect in that regard anyway. I looked at some PC performance tables a while back, and it is quite interesting to see at what point a CPU starts to slow down a GPU.

It is almost more informative in that respect to look at Folding@Home's indication of which types of jobs run better on which types of CPU.

Also, to take it to the other side, Llano would basically be 5-10 times more powerful than the console equivalent if it can run a console title at a higher framerate with otherwise identical settings, considering how optimised console versions can be. At least, that would be the case if we assume a UE3-engine game here, and that Epic's (and others') comments on how far you can optimise for dedicated hardware are true.

And then we have questions such as what type of memory is Llano using in this scenario, how much, etc.
 
The Llano APU (four AMD Stars cores + an integrated Radeon with 160 unified VLIW5 shaders) runs most console ports at 40-60 fps when the resolution is set to 720p and antialiasing and anisotropic filtering are disabled (matching the console experience).

Llano has 400 shaders and a maximum TDP of 100W (vastly higher than the 360S), but some models have varying numbers of shaders disabled. Only the very slowest desktop parts have 160 active shaders.

There are some console ports (Mass Effect 2, for example) that run smoothly at 30 fps (720p, minimum settings) even on an AMD E350 APU.

But is that a stable 30 fps with vsync on, even in hectic battles? A lot of the time Mass Effect 2 looks great but probably isn't pushing the CPU. A game with a 30fps cap could have lots of idle time while running, making comparison to a 30fps "average" uncapped PC game (with either no vsync or triple buffering) very misleading.
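
A quick back-of-the-envelope sketch of why a capped 30 fps can hide a lot of headroom (the 20 ms of work per frame is an invented figure, purely for illustration):

```cpp
#include <cstdio>

int main() {
    const double cap_frame_ms  = 1000.0 / 30.0;  // 33.3 ms budget at a 30 fps cap
    const double work_ms       = 20.0;           // hypothetical work per frame
    const double idle_fraction = 1.0 - work_ms / cap_frame_ms;
    const double uncapped_fps  = 1000.0 / work_ms;
    // Prints: idle ~40% of each frame; uncapped it would run at ~50 fps.
    std::printf("idle %.0f%% of each frame, uncapped ~%.0f fps\n",
                idle_fraction * 100.0, uncapped_fps);
}
```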

Running the 360 game with no cap / vsync off is the only way to get a reasonable comparison to the frame rate figures that PC gamers and hardware websites almost exclusively use. But only developers can do that.
 
I'm not sure you can do a comparison like that, though. PC versions of games tend to lean even less on the CPU than the console versions do
Odd, as in my experience it's the opposite, and you'd expect it to be, considering the APIs that the PC is always forced to go through. Console *ports* generally have very low CPU usage (my Athlon X2 215 is very rarely the limiting factor in getting significantly better framerates in console titles than their native counterparts), but that just seems to be due to the required level of optimization.

CPU hogs are far more often games that are designed for the PC from the outset (Metro, Stalker, and RTSes, for example).
 