Predict: The Next Generation Console Tech

At 32nm Cell should be around 15-20W and 50mm^2.
They've already talked about a Cell with 32 SPEs and 2 PPEs and more than likely DP as well.

That and a modest clock-rate increase sounds just about right.

Then again from something I read there might be something extra in there...

Oh, and of course they'll probably be using Rambus's terabyte-per-second link technology (though I wouldn't expect quite that level of bandwidth).
 
OoOE on the SPE makes little sense due to the nature of its workload. Tailor-made algorithms will have far higher performance per transistor and per watt than OoOE.
IIRC, the current SPEs running simple scalar algorithms have performance similar to a 1GHz CPU. They are floating-point monsters, but they are not very good at linear scalar algorithms. A simple (not sophisticated) OoOE could double that scalar performance without a large penalty in watts or silicon area.

In fact, CELL is a compromise between traditional CPUs with lots of optimizations and very simple stream processors (GPUs). That was the balance point for the ~90nm fabrication process range. Will future fabrication processes (<=32nm) and future game-engine needs produce a different balance point?

What I worry about is:
- developers' ability to scale engines across a large number (more than 16) of processors;
- the possibility of non-scalable algorithms becoming the performance bottleneck;
- the possibility of important algorithms/procedures being dropped because no sufficiently powerful processor is available.

So maybe the processor architecture should be asymmetrical: a few very fast processors for serial work, and a large number of more stream-oriented processors for highly parallel workloads.


Why add more instructions when the same workload can be done in the same amount of time with the limited instruction set? That was the whole principle of RISC.
Well, the Xbox 360's RISC CPU cores can do dot products. But if it can be done just as fast with the current ISA, then fine :smile:
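For what it's worth, a dedicated dot-product instruction mostly saves the final horizontal reduction; the body is just multiply-accumulates that any RISC ISA already has. A minimal sketch in plain C (illustrative only, not actual VMX128 code):

```c
/* Illustrative 4-element dot product built from plain multiply-accumulate.
   A dedicated dot-product instruction would mainly fold the final adds. */
float dot4(const float a[4], const float b[4])
{
    float acc = 0.0f;
    for (int i = 0; i < 4; i++)
        acc += a[i] * b[i];   /* one multiply-accumulate per element */
    return acc;
}
```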

Double precision with a lower penalty is already in there for the next-gen Cell. They've got 50% DP performance, and IIRC they are looking at much closer to 100% DP performance, although this is pretty irrelevant for games.
But it could be relevant for other uses of the PS4. I really like the idea of consoles being more flexible and maybe capable of general-purpose computing.

9 GHz isn't going to happen! Not without liquid nitrogen cooling or a fundamental shift in technology.
Well, I just did the simple math: 9.0 GHz = 3.2 GHz x (90nm / 32nm).
Maybe the clock will not scale 100%, but it could land somewhere in the middle, like 6.4GHz.
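For reference, that naive estimate assumes the clock scales linearly with the shrink ratio (Dennard-style scaling, which had already started breaking down around 90nm). A trivial sketch of the arithmetic in C:

```c
#include <stdio.h>

/* Naive clock scaling with feature size: f_new = f_old * (old / new).
   Purely illustrative; real 32nm clocks fell well short of this. */
int main(void)
{
    double f_90nm = 3.2;           /* GHz, Cell clock at 90nm */
    double shrink = 90.0 / 32.0;   /* = 2.8125 */
    printf("ideal 32nm clock: %.1f GHz\n", f_90nm * shrink);  /* 9.0 GHz */
    return 0;
}
```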

GPUs are already stream processors. And they don't have OoOE either ;)
Yes I know ;)
 
Nope, baseless speculation. There's no evidence to support that line of thought, and plenty of evidence to counter it. For Microsoft, the reason is simple: the design is not their own. Owning the micro-architectural IP rights was key to the design that eventually went into the 360, and worse still, Cell belongs in part to their competitor. Nintendo is the wild card here. We know literally nothing about their plans, but the likelihood of them going with a CPU design belonging partly to a competitor seems low.

What is realistic, and can be confirmed, is that all three are still talking to IBM, and the specifics will include multi-core CPUs. For Sony, newer revisions of CELL are obvious, and for Microsoft a new spin on the 360's CPU is likely, too. Nintendo is still a wild card, but they may just go with what they've got, multiplied a few times over: nothing radical, nothing exotic.
 
I agree with the "baseless speculation" part of your post, but I think that if MS wants a competitive CPU, Xenon will need a major redesign.
 
That's why I think it'll probably be a respin, not just a rehash. Obviously, it'll have similarities: multicore architecture, for example, seems to be here to stay, so that's a given, and a focus on strong, flexible communication among processors and processing elements is also a given. In those regards the legacy will carry forward. But IMO we appear to be on the doorstep of a paradigm shift in how CPUs and GPUs are defined and how they complement each other, as well as the rest of the components on the board, so there will have to be major revisions along the way, too.
 
Supposedly all three have signed on to use the next-generation Cell processor as their CPUs.
Not gonna happen. Nintendo's remit is to reuse standard technology in interesting ways, and not to go cutting edge on people. Nintendo dropping a complex latest-generation Cell architecture on the development teams isn't going to happen unless the next Cell is quite a shift from the current design. I peg them at a symmetrical multicore CPU.
 
The only way Microsoft would end up using Larrabee is if it were so mind-bogglingly good that they would have to, and there's really little to indicate that it is, other than a few marketing PowerPoint slides.

I agree. As much as I like the potential of Larrabee, I don't think Intel will sell their new brainchild to MS to go into a console. Maybe in a next-next-gen console, if it turns out Intel has gotten things right, once the technology gets cheaper and some of the programming obstacles have been cleared.
If MS were to go for an x86 derivative, it has to be from AMD/ATI. First, AMD is in a weak position right now, so they might be willing to sell some IP to fund R&D to battle Intel, or whatever. Not to mention that ATI's chip inside the Xbox 360 is a fantastic piece of equipment; they should stick with the guys that made that.
If you ask me they should drop the PPC cores. Way too many problems, especially because they are in-order: cache misses, load-hit-stores, branch penalties, etc.
 
IIRC, the current SPEs running simple scalar algorithms have performance similar to a 1GHz CPU. They are floating-point monsters, but they are not very good at linear scalar algorithms. A simple (not sophisticated) OoOE could double that scalar performance without a large penalty in watts or silicon area.

I think this has been flogged to death already, but:

An SPE will run a "linear scalar" algorithm (by which you presumably mean an algorithm that you can't split into parallel execution streams) very fast indeed, if you can fit it into local store. Of course, if you can't split the algorithm up between processors, you are talking about executing it on a single core: a single SPE, or, if the code can't fit into the SPE's local store, the PPE. OoOE won't double PPE performance, nor will it match an SPE's performance when the code fits in local store, but it will use up disproportionately greater hardware resources. That is why every advanced massively multi-core design (including Larrabee) uses in-order execution.

To put it another way, OoOE just doesn't give you enough performance for the hardware resources you have to allocate to it to make it worthwhile, in cases where you can optimise the code and where many of the algorithms can be parallelized.

OoOE is worthwhile for running non-optimized, general-purpose code: binary operating-system code and other code written for single-core execution. It is only worth having if you are limited to a single core, or if you run older games or legacy code written for single-core processors, compiled for a different architecture, and impossible to recompile with optimizations for the in-order processor. The ideal application for an OoOE processor is running a general-purpose OS like Windows or non-optimized (binary) Linux, or a multi-core server where SMP is used to load-balance the independent threads spawned for every new connection.
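On the local-store point above: the way an SPE gets that speed is by streaming data through local store with explicit DMA, overlapping transfer and compute. A hypothetical double-buffering sketch using the Cell SDK's spu_mfcio.h intrinsics (the chunk size and the process() routine are invented for illustration):

```c
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer (max is 16KB per request) */

static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *data, unsigned int n);   /* assumed user routine */

/* Stream nchunks * CHUNK bytes from effective address ea through local
   store, fetching the next chunk while the current one is processed. */
void stream(unsigned long long ea, unsigned int nchunks)
{
    unsigned int cur = 0;

    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);   /* kick off first transfer, tag 0 */

    for (unsigned int i = 0; i < nchunks; i++) {
        unsigned int next = cur ^ 1;
        if (i + 1 < nchunks)               /* prefetch next chunk, other tag */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);      /* wait only on current buffer */
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);          /* compute overlaps the DMA */
        cur = next;
    }
}
```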
 
Not gonna happen. Nintendo's remit is to reuse standard technology in interesting ways, and not to go cutting edge on people. Nintendo dropping a complex latest-generation Cell architecture on the development teams isn't going to happen unless the next Cell is quite a shift from the current design. I peg them at a symmetrical multicore CPU.
The Cell architecture won't be "cutting edge" by next gen if it doesn't shift from its current design. They could inherit a decent amount of PS3 middleware for their next machine, and they could use a smaller version of Cell2 than Sony/Microsoft.

I also peg them at a symmetrical multicore CPU, but I don't think using a small Cell2 derivative would be bad for them if Sony/MS also use Cell2-based CPUs.
 
I agree. As much as I like the potential of Larrabee, I don't think Intel will sell their new brainchild to MS to go into a console. Maybe in a next-next-gen console, if it turns out Intel has gotten things right, once the technology gets cheaper and some of the programming obstacles have been cleared.
If MS were to go for an x86 derivative, it has to be from AMD/ATI. First, AMD is in a weak position right now, so they might be willing to sell some IP to fund R&D to battle Intel, or whatever. Not to mention that ATI's chip inside the Xbox 360 is a fantastic piece of equipment; they should stick with the guys that made that.
If you ask me they should drop the PPC cores. Way too many problems, especially because they are in-order: cache misses, load-hit-stores, branch penalties, etc.
So, Barbarian, as a developer, do you think:

they'd be better off with a few state-of-the-art x86 cores?

or

they'd be better off with more, more-SPU-like cores?

An SPU-like core could be an SPE with really fast L1 and L2 caches (as Intel promises with Larrabee), so performance would be, if not as good as with the LS, still good enough.

EDIT: yes, I found the old thread I'd been unable to find for days (don't know why...).
Anyway, there's a lot of interesting talk in there, always worth a read.
http://forum.beyond3d.com/showthread.php?t=33335&highlight=future+console+cpu&page=2
 
A simple (not sophisticated) OoOE could double that scalar performance without a large penalty in watts or silicon area.

Even simple OOO will require fast tracking logic, and that consumes power. It'll also require a whole load more read ports on the register file: lots more power. To allow multiple issue you'll need to double the execution units, which is again more power and a load more chip space. You'll also need to increase the SPE's LS read bandwidth and memory bandwidth. OOO is going to have no small impact on wattage.

As for OOO doubling computing power: how? Intel quotes high figures (300%) for OOO, but that's because it uses OOO to get around the limitations of the rather limited x86 instruction set; OOO effectively gives it far more registers than the ISA exposes.

PPC is quite different, and IBM quotes figures of around 35%; there are fewer limitations to get around, so OOO doesn't help nearly as much.

The SPEs are different again: they have tons of registers, and the compiler can use them to do exactly the sorts of things OOO does, so its impact on an SPE is likely to be rather less than even 35%.

So in the case of the SPE, while OOO may gain some performance, the gain is likely to be very limited. The cost of adding it is likely much greater than any performance gained, and you'll probably find that's exactly why it's not present.
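To illustrate what a compiler can do with all those registers (a generic C sketch, not SPE-specific): keeping several independent accumulators in flight hides execution latency in software, which is exactly the reordering OOO hardware would otherwise do.

```c
/* Sum an array with four independent accumulators so consecutive adds
   don't wait on each other; assumes n is a multiple of 4. With 128
   registers a compiler can unroll much further than this. */
float sum4(const float *a, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];       /* the four chains are independent, */
        s1 += a[i + 1];   /* so each add can issue without    */
        s2 += a[i + 2];   /* stalling on the previous one     */
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```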



If you ask me they should drop the PPC cores. Way too many problems, especially because they are in-order: cache misses, load-hit-stores, branch penalties, etc.

You evidently haven't heard of POWER6, an in-order PPC running at 4.7GHz. Its predecessor was an aggressive OOO design, but the new chip runs rings around it.

I expect some of this tech could make its way into future PPEs & Xenons, giving a healthy performance boost without a big power increase.

--

As for MS using Larrabee, I doubt it but I wouldn't fully discount it. MS could always include a Xenon for backwards compatibility and OS functions and use Larrabee for the grunt.

However, Intel's problem is that they won't want to sell MS a powerful version of the chip, as it could potentially cannibalize their high end.
 
Even simple OOO will require fast tracking logic, and that consumes power. It'll also require a whole load more read ports on the register file: lots more power.
The number of register-file ports needed is determined by the number of operands of the instructions issued simultaneously. That is a property of superscalar width, not of OoO. The design with the most register ports I can think of right now is the in-order Itanium.

To allow multiple issue you'll need to double the execution units, which is again more power and a load more chip space. You'll also need to increase the SPE's LS read bandwidth and memory bandwidth. OOO is going to have no small impact on wattage.

As for OOO doubling computing power: how? Intel quotes high figures (300%) for OOO, but that's because it uses OOO to get around the limitations of the rather limited x86 instruction set; OOO effectively gives it far more registers than the ISA exposes.

PPC is quite different, and IBM quotes figures of around 35%; there are fewer limitations to get around, so OOO doesn't help nearly as much.
The basic rule of thumb for OoO is an average gain of 50% over an in-order design, all else being equal. That was prior to the very aggressive OoO chips we know today.

You evidently haven't heard of POWER6, an in-order PPC running at 4.7GHz. Its predecessor was an aggressive OOO design, but the new chip runs rings around it.
POWER6 has an undisclosed TDP that likely breaks 250 watts; the entire MCM that houses the top model at top clocks consumes the better part of a kW.
Per-core SPECint performance is actually inferior to the OoO Core 2, which has a TDP less than half that.

The power/performance continuum is not as clear-cut as whether a chip is in-order, as POWER6 shows.

I expect some of this tech could make its way into future PPEs & Xenons, giving a healthy performance boost without a big power increase.

IBM used the PPE and Xenon as test beds for some of the ideas that went into POWER6. It can thank Sony and Microsoft for sharing the wealth.

The gain would be somewhat more modest because those cores already share some of the circuit techniques used in POWER6.
 
You evidently haven't heard of POWER6, an in-order PPC running at 4.7GHz. Its predecessor was an aggressive OOO design, but the new chip runs rings around it.

As 3dilettante already pointed out, comparing POWER6 to anything else will always be an apples-to-oranges comparison.

And yeah, comparing pears to apples: a 65nm Xeon beats it on both SpecInt and SpecInt Rate 2006 (4 copies/socket):

Xeon QX6850's 64.9 SpecInt Rate vs Power 6's 53.2 SpecInt Rate

Not to mention the single thread SpecInt performance:
QX6850's 21.6 vs Power 6's 17.8

That's a 286mm^2 (at 65nm), $999 commodity part you can put in a cheap and cheerful motherboard, beating a 341mm^2 boutique chip with 7352 pins, multiple custom eDRAM chips for the 32MB level-3 cache, and a system infrastructure that adds up to >75GB/s of bandwidth (50 read / 25 write).

Reading the microarchitecture article here, it's clear that it wasn't built to run Spec. It's also clear that whatever power they saved by not implementing OOO, they spent on the cache hierarchy and on minimizing latencies in the pipeline in general.

There are certainly things that could be carried over from POWER6 to a PPE design, most notably the single-cycle integer execution latency (compared to two cycles in the PPE) and the 4-cycle load-to-use latency of POWER6's caches versus the PPE's 6 cycles.

Cheers
 
Even simple OOO will require fast tracking logic, and that consumes power. It'll also require a whole load more read ports on the register file: lots more power. To allow multiple issue you'll need to double the execution units, which is again more power and a load more chip space. You'll also need to increase the SPE's LS read bandwidth and memory bandwidth. OOO is going to have no small impact on wattage.

I agree with you that OOO in the SPEs would make little sense. A large part of the attraction of OOO is its ability to deal with non-deterministic latencies from the memory system. Since an SPE executes straight out of the local store, that advantage simply isn't there, hence OOO is superfluous.

However, what you state above regarding complexity and power isn't really true.

Both K8 and Core 2 use a similar ROB structure called a data-capture scheduler. In both you have a register file for the retired, guaranteed-valid state and one for the state of the CPU at the speculated program counter: in K8 it's called the future file, in Core 2 (going back to its PPro heritage) the active register file. Values from the active register/future file are read when instructions are inserted into the ROB. Since each instruction can have two operands, this register file needs two ports per issue slot (3 slots in K8, 4 in Core 2), for a total of 6 and 8 ports respectively. That is exactly the number of ports needed in an in-order design of similar width.

The trick is that results from the execution units are broadcast on result buses that are snooped by the individual entries in the ROB. The complexity of the ROB is then determined by the number of result buses and the number of instruction-slot entries. Note that in the original PPro design the ROB and RAT made up 10% of the core; in today's Core 2s they would be a relatively much smaller part.
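A toy model of that snooping in C (the field names and widths are invented; this sketches the idea, not any shipping design):

```c
/* One ROB slot in a data-capture scheduler: it latches each operand the
   moment a matching tag appears on a result bus. */
typedef struct {
    int      src_tag[2];    /* tags this entry's two operands wait on */
    int      src_ready[2];  /* set once the value has been captured   */
    unsigned src_val[2];    /* operand values captured off the bus    */
} rob_entry;

/* Every cycle, each entry compares every result bus against its
   outstanding operand tags; cost grows with buses x entries. */
void snoop(rob_entry *e, int nbus,
           const int bus_tag[], const unsigned bus_val[])
{
    for (int b = 0; b < nbus; b++)
        for (int s = 0; s < 2; s++)
            if (!e->src_ready[s] && e->src_tag[s] == bus_tag[b]) {
                e->src_val[s]   = bus_val[b];
                e->src_ready[s] = 1;   /* may issue once both are ready */
            }
}
```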

You also get higher utilization of your execution units. Since execution will not halt when two consecutive instructions need the same execution unit, you can get away with fewer of them. An in-order design runs in feast-or-famine mode: when there are no data hazards, you need enough execution units to get all the work done at once.

And OOO lets you save power elsewhere: since the core has a higher tolerance for latency, you can opt for a lower-power cache hierarchy and execution units.

Cheers
 
That was the point being raised...
My point was that if they use an "upgraded" Cell, it won't be a cutting-edge architecture by next gen, even if it is a "new design". What exactly are the problems you think Nintendo would have with using a new chip? Do you think they won't use anything except an old chip without any changes?
 
A noob question: do the SpecInt performance numbers scale with the number of cores, or is it a strictly single-threaded test?

That depends.

Spec*_rate always uses multiple CPUs/cores. Non-rate might not, though it is permitted to compile the non-rate benchmarks so that the compiler tries to automagically parallelize the code and run it in multiple threads. Whether auto-parallelization was enabled is noted on the detailed benchmark results page.
 
The fundamental question is how future applications (game engines, etc.) will be developed. Will they be highly parallelized? Will they have bottlenecks?

IMHO future processors should be prepared for a mix of workloads.
 