Predict: The Next Generation Console Tech

That depends.

The SPEC *_rate benchmarks always use multiple CPUs/cores; the non-rate runs might not. It is, however, allowed to compile the non-rate benchmarks so that the compiler tries to automagically parallelize the code and run it in multiple threads. Whether auto-parallelization is enabled is stated on the detailed benchmark results page.
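As an illustration of what that auto-parallelization covers, here is a minimal sketch (not the actual SPEC build setup): the loop is just a stand-in for the kind of dependency-free code a compiler can split across threads on its own, and the flags in the comment are the usual auto-parallelization switches for GCC and ICC.

```cpp
// Illustrative only: a trivially parallelizable loop that an auto-parallelizing
// compiler can split across threads with no source changes. Typical switches:
//   GCC: g++ -O3 -ftree-parallelize-loops=4 autopar.cpp
//   ICC: icpc -O3 -parallel autopar.cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // No loop-carried dependencies, so iterations can be distributed freely.
    for (std::size_t i = 0; i < n; ++i)
        c[i] = 0.5 * a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]);
    return 0;
}
```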

Thanks, that makes it even harder to compare the Xeon QX6850's and Power 6's SpecInt numbers, since the Intel CPU had "auto parallel" on and the IBM CPU had "auto parallel" off for the SpecInt tests mentioned previously.

The fact that the QX6850 has 4 cores and Power6 has 2 cores (2 threads/core) adds even more confusion to how to judge the efficiency of the different architectures.
 
Thanks, that makes it even harder to compare the Xeon QX6850's and Power 6's SpecInt numbers, since the Intel CPU had "auto parallel" on and the IBM CPU had "auto parallel" off for the SpecInt tests mentioned previously.

My bad. The highest non-auto-parallelization score a Core 2 chip achieves is 20.2, still beating Power 6. Note that the chip here is a dual core, but with a similar level 2 cache (4 MB per two cores) to that of the QX6850, so single-thread performance should be exactly the same.

The fact that the QX6850 has 4 cores and Power6 has 2 cores (2 threads/core) adds even more confusion to how to judge the efficiency of the different architectures.

Indeed it does; I already said it was an apples-to-oranges comparison. Power 6 is very clearly built for the OLTP and financial transaction/analysis market; the massive bandwidth and the decimal floating-point capability cater directly to those markets' needs.

Cheers
 
IMHO future processors should be prepared for a mix of workloads.

I wholeheartedly agree. We need CPU cores with good, solid single-thread performance, but at a low enough power point that we can stick a whole bunch of them on the same die, which is why I still consider PA Semi's PPC core the best building block for the next-gen massively multicore MPU SoC.

Cheers
 
I agree that we need good CPUs with solid performance at low power usage. The question is how low? Also, how many CPUs?

My guess is developers will have a really hard time trying to use more than 16 CPUs usefully.

Now what to do with the 32 CPUs in a 32 nm chip? Or the 128 CPUs in a 16 nm chip?

I will have a look at PA Semi's PPC core :smile:
 
My point was that if they use an "upgraded" Cell that won't be a cutting edge architecture by the next gen even if it is a "new design".
What constitutes cutting edge then if not massively parallel many-core SIMD stream processing?
What exactly are the problems that you think Nintendo would have with using a new chip? Do you think they won't use anything except an old chip without any changes?
I think they won't use anything but a cheap chip ;) They won't splash out on expensive, massively multicore processors, so they won't be using a next-gen Cell.
 
Well, one can get quite decent dual-core x86 CPUs for ~$40 at the moment; I bet some kind of Power-based quad-core is doable for <$30 by the time Wii2 debuts.

Then again, perhaps something with an integrated GPU is cheaper thanks to a simplified memory hierarchy and fewer connections? Too bad IBM doesn't have GPUs; a mini-Cell with added GPU functionality sounds like an interesting thing :)
 
That'd be my guess. Some form of standard multicore CPU. Anything somewhat current generation. I can't see them going with an exotic architecture like Larrabee or Cell.
 
I'd be more interested in what Nintendo will design next. Perhaps something more curved and less angular? They probably won't have a device bigger than the Wii, and the DVD drive can be made only so small.
 
I agree that we need good CPUs with solid performance at low power usage. The question is how low? Also, how many CPUs?

My guess is developers will have a really hard time trying to use more than 16 CPUs usefully.

Now what to do with the 32 CPUs in a 32 nm chip? Or the 128 CPUs in a 16 nm chip?

I doubt developers would have a hard time finding work for more cores at all..

In the next generation of games you're going to have richer worlds with greater need for more real-time processing (& not necessarily more types of processing to any huge degree)..

Greater believability will be achieved by having more animated entities (with higher-fidelity animation data & greater degrees of actuation) in the scene at once, for instance. This means more work to be done in the same time frame with little to no data dependencies. These kinds of tasks are extremely well suited to vast arrays of stream processors (e.g. SPEs) & you only have to look at something like SPURS to see how performance in this area can scale very well per core.
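To make that concrete, here is a minimal sketch of that kind of workload on a conventional multicore CPU. The Entity type and update_all helper are made up for illustration: the per-entity update has no cross-entity dependencies, so the frame's work can simply be cut into chunks and handed to however many worker cores are available, which is essentially the pattern a job scheduler like SPURS uses to fan work out across SPEs.

```cpp
// Data-parallel entity update: every entity can be advanced independently,
// so the array is split into contiguous chunks, one per worker core.
// All names here are illustrative, not any real engine or console API.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Entity {
    float position[3];
    float velocity[3];
};

static void animate(Entity& e, float dt) {
    // Stand-in for the real per-entity animation/actuation work.
    for (int i = 0; i < 3; ++i)
        e.position[i] += e.velocity[i] * dt;
}

// workers is assumed to be >= 1.
void update_all(std::vector<Entity>& entities, float dt, unsigned workers) {
    std::vector<std::thread> pool;
    const std::size_t chunk = (entities.size() + workers - 1) / workers;

    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end   = std::min(begin + chunk, entities.size());
        if (begin >= end) break;
        pool.emplace_back([&entities, dt, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                animate(entities[i], dt);   // each core chews on its own slice
        });
    }
    for (auto& t : pool) t.join();
}
```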

Then, with the versatility of SPE-like cores, you could have a system with tighter interaction (massive bandwidth) between the GPU & CPU [than, say, PS3], which could really help the system perform well doing things like advanced lighting, post-processing, physics etc.

If you're talking about general PC applications then it's easy to imagine most of a 128-core monster going unused. But when you're talking about consoles and video games, which by their very nature require massively parallel computing (as well as strong support for things like branch-heavy single-threaded work), it wouldn't be difficult to design an engine to utilise all your cores (maybe not all of the time, though).


Imagine a console with enough hardware resources to allow you, whilst playing a game & using several reserved OS threads, to be downloading several game demos, trailers & music, doing protein folding, ripping a DVD & recording your favourite TV programme, all in the background at full whack. That does sound interesting!
 
Imagine a console with enough hardware resources to allow you, whilst playing a game & using several reserved OS threads, to be downloading several game demos, trailers & music, doing protein folding, ripping a DVD & recording your favourite TV programme, all in the background at full whack. That does sound interesting!

A multi-tasking system… that has already existed since the Amiga… ;)
 
Quick correction to a data point I made earlier. Speculation is that POWER6 has a TDP in a similar range to POWER5+: ~150-170 W.
The external L3 cache chip has its own power draw, so the figure per POWER6 chip is ~200 W, not 250 W.

This still does not compare favorably to Conroe's per-chip TDP of 65W, but it's already been pointed out that the workloads POWER6 targets are very specific, and large-system scaling is a key component of the design while price, yield, and power consumption are more relaxed parameters.

That kind of scaling is something Core2 is weaker in, and game consoles have no reason to care about.
 
The number of register ports necessary is determined by the number of operands of the instructions issued simultaneously. That is a consequence of being superscalar, not of being OoO.

Since each instruction can have two source operands, the register file needs two read ports per issue slot: 3 slots in K8 and 4 in Core 2, for a total of 6 and 8 ports respectively. This is the exact same number of ports needed in an in-order design of similar width.

If you want to double the performance of an SPE you'll somehow have to get it to issue twice the number of instructions per cycle; that means you'll have to read more from the register file, ergo more ports are required. Whether it is superscalar or OoO is irrelevant.
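As a quick worked version of that arithmetic (assuming, as above, two read operands per issued instruction and counting only read ports):

```cpp
// Read-port count scales with issue width, not with in-order vs. OoO.
constexpr int read_ports(int issue_width) { return issue_width * 2; }

static_assert(read_ports(3) == 6, "K8: 3 issue slots -> 6 read ports");
static_assert(read_ports(4) == 8, "Core 2: 4 issue slots -> 8 read ports");
// A hypothetical SPE doubled from dual-issue to quad-issue needs 8 read
// ports too, whether it stays in-order or goes out-of-order.
static_assert(read_ports(4) == 8, "doubled SPE: 4 issue slots -> 8 read ports");
```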


The basic rule of thumb for OoO is an average 50% performance advantage over an in-order design when all else is equal. That was prior to the very aggressive OoO chips we know today.

It'll depend on a number of factors so I don't think a single figure is possible.


POWER6 has an undisclosed TDP that likely breaks 250 Watts.
According to IBM the figure is marginally higher than the POWER5+.

However, that is not the point I was making. I was answering the claim that OOO = higher performance, which it isn't necessarily. In the case of POWER6, IBM gained performance by dropping OOO. Sun's Niagara processors are also in-order and their performance is also massively higher than their predecessors, although it is for a relatively restricted range of tasks.



Per-core performance in SPECInt is actually inferior to the OoO Core 2, which has a TDP less than half that.

As has been pointed out these are completely different processors for different applications. I was comparing POWER6 to POWER5+.

Anyway, SPEC is more of a compiler test than anything else these days; since Niagara and Cell appeared it's become effectively useless for anything other than comparing Intel and AMD. We know that many things can run very fast on Cell; however, the SPEC rules mean Cell can't be tested properly, so SPEC would run like a dog on it.

IBM used the PPE and Xenon as test beds for some of the ideas that went into POWER6. It can thank Sony and Microsoft for sharing the wealth.

It's the other way around; the PPC processors in Cell are based on a design which started back in 1997. High-end processors take a very long time to develop, so it's unlikely they could have learned anything from it that could have had any impact.

--

Anyway this is all irrelevant.
My point was that you cannot add OOO and expect a doubling of performance without increasing power consumption. It's just not going to happen!

As for being cutting edge, some of the most important features in Cell were present in Cray 1, 30 years ago!
 
It'll depend on a number of factors so I don't think a single figure is possible.
A rule of thumb is not an absolute. It is a quick estimate that usually is roughly correct.

According to IBM the figure is marginally higher than the POWER5+.
Yes, another undisclosed and very high TDP.
I was off by roughly 50 Watts. POWER6 would be over 150-170W, without taking into account the TDP of the L3 cache it is tied to.

However, that is not the point I was making. I was answering the claim that OOO = higher performance, which it isn't necessarily. In the case of POWER6, IBM gained performance by dropping OOO.
And making a process transition, upping the clock speed, adding look-ahead load execution, fixing a number of design issues, expanding the caches, and seriously expanding socket level bandwidth.

Without a lot of extra logic and resources, POWER6 would suffer a huge dip due to its lower efficiency.

Sun's Niagara processors are also in-order and their performance is also massively higher than their predecessors, although it is for a relatively restricted range of tasks.
It's different enough that I'm hesitant to say Niagara has predecessors, except in a limited subset of SPARC machines.

As has been pointed out these are completely different processors for different applications. I was comparing POWER6 to POWER5+.
I think it would be an interesting comparison to see how much closer POWER5+ would be if it had the same expansion of bandwidth, better process, and time to refactor some implementation-specific faults, like its 2-cycle result forwarding.

Anyway, SPEC is more of a compiler test than anything else these days; since Niagara and Cell appeared it's become effectively useless for anything other than comparing Intel and AMD.
Unless you use real-world applications that are either the exact same programs as, or operate similarly to, the real-world exemplars in SPEC.

The key point is that there are measures of performance where a spectacularly aggressive in-order design loses to an OoO chip, and in some measures, such as power and cost, it loses incredibly badly.

We know that many things can run very fast on Cell; however, the SPEC rules mean Cell can't be tested properly, so SPEC would run like a dog on it.
What rule would that be?

It's the other way around; the PPC processors in Cell are based on a design which started back in 1997. High-end processors take a very long time to develop, so it's unlikely they could have learned anything from it that could have had any impact.
The circuit design techniques used in the PPE and Xenon did inform the final circuit design for POWER6. It's not entirely coincidence that they all are on high-performance SOI processes. IBM has low-power and bulk processes as well, and they would have been cheaper.

Anyway this is all irrelevant.
My point was that you cannot add OOO and expect a doubling of performance without increasing power consumption. It's just not going to happen!
That would be true, but the specific point you used, tying OoO to more register file read ports, was wrong.

Furthermore, using POWER6 as an example is fraught with danger because the chip uses a gigantic amount of resources to make up for its being in-order. It does a number of things that would make no sense for an SPE.

Since clock scaling leads to rapid climbs in power consumption and future processes are making it increasingly difficult to yield great clocks, irrespective of design, console chips will likely have modest gains in clock speed, if any.

Since clock speed is a prime factor for in-order performance (such a design often lacks much else), there is a clear benefit to slightly increasing design complexity in order to improve per-clock efficiency.

It may not benefit the SPE too much right now, but it might help the PPE.
Since the console market's TDP cap artificially limits clocks, there is a gray area where evaluating power consumption versus efficiency can lead to interesting outcomes.
 
Would they be better off going with a few state-of-the-art x86 cores, or with more, SPU-like cores?

I think MS will go with more homogeneous cores with more (shared) cache. In the 2010 time frame I expect they'll be able to fit about 8 cores on a die, with up to 4 hyperthreads each. I believe they'll stick to in-order execution.
Hyperthreads should be the preferred way to hide latency (as opposed to OOO), except for one little problem: they force you to manually synchronize access to shared memory. I wish there were a HW solution that does super-cheap atomic operations if the contention is between hyperthreads, or just disables switching to a hyperthread whenever you're in a critical section. Right now you pay the same (expensive) price for atomic ops or mutex locks regardless of whether you have contention, and regardless of whether the contention is between real threads or hyperthreads.
Of course a programmer has no way of knowing where the contention is coming from, unless one micromanages where the threads are assigned, which might be quite difficult if you have 32 logical cores, not to mention the potential perils if one were to mistakenly assign threads to the wrong logical cores.
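For reference, the two costs being compared look something like this; a minimal C++ sketch with made-up names, not any particular console API. Today both paths pay their full price whether the other contender is a different physical core or just a sibling hyperthread, which is exactly the distinction the hardware currently can't exploit.

```cpp
// Two ways to guard a shared counter: a single atomic read-modify-write
// versus a full mutex acquire/release. Both are priced identically whether
// the contention comes from another core or from a hyperthread sharing
// this physical core. Names are illustrative only.
#include <atomic>
#include <mutex>

std::atomic<int> shared_counter{0};
int plain_counter = 0;
std::mutex plain_counter_lock;

void bump_with_atomic() {
    // One atomic op; still a locked/serialising operation even if the only
    // other party is a hyperthread on the same core.
    shared_counter.fetch_add(1, std::memory_order_relaxed);
}

void bump_with_mutex() {
    // Mutex lock/unlock: heavier still, and again the cost does not depend
    // on where the contention actually comes from.
    std::lock_guard<std::mutex> guard(plain_counter_lock);
    ++plain_counter;
}
```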

In general, the beauty of GPU architecture/shaders is that it hides the access latency AND deals with concurrency issues for you.

We still need to figure out the correct mix of SW and HW to achieve this kind of abstraction on a general purpose core.
 
Unlike CPUs, GPUs are a lot more lax about managing synchronization.

Synchronization behavior in a number of areas is simply undefined, so yeah, there are fewer "issues" beyond "hope it's not a problem" or "don't do that". I do not think there is a strong definition of synchronization behavior for code executing in different clusters on G80, for example.
Memory consistency/coherency when threads read and write in the same memory is also less stringent than it is for some CPUs.

GPUs these days also include explicit synchronization instructions, so can you clarify how that's different from CPUs?
 
I think MS will go with more homogeneous cores with more (shared) cache. In the 2010 time frame I expect they'll be able to fit about 8 cores on a die, with up to 4 hyperthreads each.

CPUs will be going from multi-core (2 to 8 cores) to manycore (10 or more cores). I think all next-gen consoles except for Nintendo will have manycore CPUs.

I'm thinking 16 cores given a 2011-2012 timeframe.

p.s. (pure speculation) I do not think Microsoft will use the same PPE cores for the next-gen Xbox. I expect more robust cores. I don't even know if they'll be PPC at all. Microsoft wants its own CPU, and has its Computer Architecture Group working on it. My best guess is a new core jointly designed by MS's CAG and IBM that meets MS's goals. The new core will probably be able to adapt to or handle different types of work pretty well, unlike the PPE. Depending on the time of introduction, 10 to 16 of these cores will be put on a single chip. Microsoft can achieve BC by including Xenon as a separate chip or through emulation. Microsoft will stay with AMD/ATI for the GPU.
 
If MS sticks with the same tech, just more cores for the CPU, would it be possible for them to use a 256-bit bus for the entire system for more bandwidth? That seemed like a failing on MS's part this generation, and something that could be fixed, assuming such a wide bus isn't still super expensive.

That and maybe add more cache?
 
Each generation has its own constraints and demands. E.g. the bottlenecks of the N64/PS generation, the PS2/Xbox/GCN generation, and the PS3/360 generation are all different, because hardware technology, how it is exploited, and the direction software makers take their gameplay designs don't progress linearly.

The GPU side will be mainly constrained by the development path of NV and ATI/AMD, with some outside dark horses (Intel). MS will want to leverage DX. On the CPU side the door is more open... but just keeping the "same tech" as the Xbox 360 will run into many issues, especially intra-chip communication and cache coherence. Does anyone see 15 in-order cores sharing 5 MB of cache as being a "Good Thing (TM)" in 2011-2017? MS may go with a more "traditional" multicore design, but there is going to be a need for a lot of progress in the design beyond slapping the same system together with more cores. That most likely wouldn't meet their desired goals anyhow.

Gubbi has some good posts in this thread; all I would add to them is that MS will want designs that meet their general performance goals within their budget. They were willing to ditch Intel and x86/BC for IBM/PPC, so I wouldn't put it past them to do something similar down the road if a new design meets their IP and cost/performance demands.
 