IBM Power7 Derivative: A Viable Console CPU?

"Pleasure to read as usual :)
As you are here, may I ask your PoV on something?
You may have read and taken part in the old "next generation CPU: will they go back to OoO execution, etc." thread, which I couldn't find after multiple searches.

It seems that pretty much everybody agrees now that OoO execution should be part of the next-generation CPU. I'm straying a bit from this thread's topic, but I wonder if throughput is still a relevant design goal for next-generation CPUs?

What is your opinion on the matter? From your posts I would guess that you think big OoO cores akin to Intel's are the way to go, but I wonder how a more (FP) throughput-oriented CPU would be perceived by those with actual knowledge of these things.

Lately I wondered about the relevance of having a pretty "big" CPU core feed more than one SIMD unit. I figured that it could have benefits, especially in a chip supporting 4-way SMT.
Basically it would be like Bulldozer in concept: sharing the cost of the front end, OoO engine, etc., not across multiple "cores" but across SIMD units.
My idea is that it may be easier to feed two SIMD units than one bigger one (loads and stores on the two units are unlikely to happen at the same time, though they could), and that it could be overall more efficient than having a single SIMD unit twice as big (the two are not mutually exclusive, though).

Is that a complete misunderstanding? If not, do you think it could be something desirable for a next-gen CPU?
 
Power7 contains a lot of stuff that is completely useless in a console. The core is balanced for single-element double-precision throughput, with 4 individual double-precision execution units (and even one decimal FPU!). This is essentially completely wasted in a console. Any Power7 CPU cut down to console use cases would no longer resemble a Power7 very much.

Also, the Power7 line is not designed to be modular and embeddable. In that way, it's no worse than any previous IBM CPU. It's just that in this generation IBM does have a CPU designed to be modular and embeddable. I am, of course, talking about the PowerPC 470S. Its floating-point unit is designed to be swappable, so you can switch out the double-precision one for anything from the VMX line you fancy. Its bus design is built so that it can work as part of a cache-coherent whole with parts not built by IBM, so all the game dev gods get what they want. It's a very energy- and die-space-efficient design, so it produces admirable performance while leaving most of the design TDP and space for the GPU. And while its single-threaded performance is nothing approaching a Power7, it would still be a huge, huge improvement over the present gen, especially in the worst-case situations.

Also, the 1.6GHz is not the absolute maximum the design can stand, it's just the frequency IBM decided to pimp it out at as a power-efficient embedded CPU. Give it a modern process, and just a tiny bit more power budget, and we are talking frequencies that near the "magical" 3GHz barrier last gen shipped at. With a 4-issue CPU (compared to the 2-issue ones last gen), and enough OoOe resources that it shouldn't hopelessly stall on every L1 and L2 miss.

I really, honestly think that the 470S and its successors are not just the best available options but, considering all design constraints, really very close to being the best possible options. I'm really hoping that the "16 cores" leak means that MS is shipping with a full 470S solution.

Isn't the 470 32-bit only?
How does it compare to the PPC A2?

For those interested, a PPT on the PowerPC 470S. It is a 32-bit OoOe CPU. On the Memory Management slide it notes a 42-bit (4TB) real address space. General info:

  • Synthesizable core (the 476FP is an IBM ASIC hard core) with coherency-enabled L1 caches. L1 instruction and data caches: 32KB, 4-way set-associative, 32-byte cache line
  • Coherent L2 cache in 256KB, 512KB, and 1024KB configurations (half-rate L2? "L2 cache & PLB6 complex clock: 800MHz")
  • 32bit PPC implementation
  • SMP multicore support
  • 9-stage, four-issue, out-of-order issue and execution, in-order completion
  • Up to 32 instructions in flight
  • Five integer pipelines: Complex Integer, Multiply/Divide, Load/Store, Simple Integer, Branch
  • Supports a FP Unit as well as extensions like VMX
  • 45nm SOI, Eight metal layer; 1.53GHz @ 0.9V, 1.6GHz @ 0.94V
  • 1.9W @ 1.0V
  • 3.765mm^2 (with how much L2?)
Interesting. I could be wrong (?) but it looks like the L2 is half speed? And while the power and size are great at 1.6GHz, given some of the trade-offs in the design I wonder: would it be much faster than Xenon? I would guess ramping up to 3.2GHz would probably require the pipeline to grow beyond 9 stages?

EDIT: Looks like the discussion and links got ahead of me. At least I outlined my notes :p
 

I was wrong. The 476FP manual makes it abundantly clear that the core only implements the 32-bit subset of Power ISA 2.06 Book III-E. In my defense, I was probably fooled by the way the registers are still referred to as if they were 64-bit, though only the lowest 32 bits actually work. Sorry.

(And now I have no idea whatsoever what they mean when they claim 49 bits of virtual address space. Are they counting process tags or something?)

As you are here, may I ask your PoV on something?
You may have read and taken part in the old "next generation CPU: will they go back to OoO execution, etc." thread, which I couldn't find after multiple searches.

It seems that pretty much everybody agrees now that OoO execution should be part of the next-generation CPU. I'm straying a bit from this thread's topic, but I wonder if throughput is still a relevant design goal for next-generation CPUs?

I have some ... strong opinions on the subject. Specifically, I don't think throughput was a relevant goal for the last-gen CPUs. Most code isn't hand-written assembler for vector registers. What matters is how fast compiled object-oriented C++ runs, or, increasingly, how fast the best Lua implementation can run. The rest is mostly "mental masturbation". Last gen provided throughput not because it was asked for, but because it was what IBM had available to sell. (And no one else really had anything better.)

What is your opinion on the matter? From your posts I would guess that you think big OoO cores akin to Intel's are the way to go, but I wonder how a more (FP) throughput-oriented CPU would be perceived by those with actual knowledge of these things.
I actually have a bit of a hard-on for small OoO cores. Big fat cores like Intel's and the Power7 push the single-threaded perf way past the knee of the curve. If that's what you need, then fine. For a game console, I'm probably willing to put up with the complication of twice the threads for an order of magnitude simpler and cheaper cores.

Lately I wondered about the relevance of having a pretty "big" CPU core feed more than one SIMD unit. I figured that it could have benefits, especially in a chip supporting 4-way SMT.
Basically it would be like Bulldozer in concept: sharing the cost of the front end, OoO engine, etc., not across multiple "cores" but across SIMD units.
My idea is that it may be easier to feed two SIMD units than one bigger one (loads and stores on the two units are unlikely to happen at the same time, though they could), and that it could be overall more efficient than having a single SIMD unit twice as big (the two are not mutually exclusive, though).

Is that a complete misunderstanding? If not, do you think it could be something desirable for a next-gen CPU?
That is a valid design, and sort of where x86 ended up. IBM seems to prefer simple self-contained VMX units, without the complication of dealing with forwarding and its ilk. Probably as much for a cheaper and safer design as for any performance reasons.

Then again, I don't actually develop low-level game engine code for a living, I leave that part to the professionals. You could ask them? Carmack is pretty responsive on Twitter. (ID_AA_Carmack)
 
Just a dumb question:
Is Freescale a completely different company, or can IBM sell its Power-based designs?
 
Some random questions, to stimulate discussion, for those who may actually know something about POWER7.

Question #1: Is this even remotely possible? Is this far too optimistic, or a roughly accurate ballpark for what IBM could fit within that silicon/power budget?


Question #2: Would this make a good console CPU?
I think so.

Question #3: What would you reduce? Frequency, L3, memory controller, execution units, etc? What execution units and why?
Reduce L3 cache to 8MB. Remove the memory controllers and have the CPU interface with the GPU through a fast interface (similar to how the 360 does it). Remove the decimal floating point units. Reduce the 4 issue ports for floating point to two. Use SIMD to get FP throughput. Get rid of all the RAS features.

Question #4: What would you add? VMX128 support? At what cost?

They might want to add VMX128 for backwards compatibility, but otherwise it is a waste. It was expanded to 128 registers because of the in-order nature of the 360's CPU; with OoOe you get 128 rename registers (or more).

Question #5: To my knowledge IBM only sells Power7 chips in complete server packages for tens of thousands of dollars for the low end. Would IBM even be interested in creating a console variant of POWER7?
Why wouldn't they? It wouldn't cannibalize any of IBM's product lines and would be revenue for their chip design unit (and possibly fab). Also bragging rights.

Question #6: How is the POWER7's real code performance compared to an AMD Bulldozer core? Per-mm^2? Per-Watt?

IMO, it's roughly comparable to Intel's Sandy Bridge: slower on single-threaded workloads, faster on throughput. Regarding size, even though the P7 chip is massive, each core is only 27 mm² on 45nm; on 32nm that would equate to 16-18 mm².

Question #9: As a developer, thinking of the 5-7 year window of console development, would you prefer 4 cores/16 threads in a robust CPU (IBM design) or the shift of budgets to a 2m/4c AMD design but with on-die Shader Array? Why?

I think 4 cores (of any kind) is thoroughly unambitious. We're talking about a system design that will live until 2020. I expect at least 8 cores.

Question #10. Would this IBM design need a beefed-up vector unit, or is the real-world performance/throughput of POWER7 chips more than sufficient?

It would need SIMD.

Question #11. Thinking in console contexts, if you could change one thing about POWER7, what would it be?

The derivative would need to be optimized for power and for process variations.

P7 is built for speed, with exotic power consumption as a result. It is also binned aggressively with a fairly big spread in speeds, the fastest chips going into high-end servers and slower ones into blades. None of this can be afforded in a console design. Power consumption per core needs to be lowered, and the target frequency of the console should be low enough that you can use most of your good CPUs.

Question #12. Does a POWER7 design indicate a split memory design?

No, let the CPU interface to the GPU; it already has massive memory controller resources: lots of bandwidth and lots of outstanding memory transactions.

Cheers
 
Thanks for the insights :)

You are pretty much in Gubbi's (or was it 3dilletante's?) camp; back then they thought the perfect CPUs for this gen should have looked more like those PWRficient CPUs (in dual-core form) than what we ended up with. I believe the IP got bought by the government before Apple took over P.A. Semi.
So I take it your answer is that you would take 8 of those with 4-wide SIMD over 4 SnB cores with 8-wide SIMD. There are still some white papers out there for those CPUs.

Overall, the sweet spot according to you might be a 3-issue CPU with a reasonable number of execution units. PWRficient, Krait, or A15 in the RISC world, or maybe K7/Athlon in x86, would be a reasonable building block / sane basis for the design.
---------------------
WRT having 2 SIMD units in such CPUs (or in wider ones), as well as the relevance of no SMT vs. 2-way or 4-way SMT for the kind of code that runs in games, let's say I would never dare to post on J. Carmack's blog and so take from his time.

Maybe some people here, if they have the time (like Barbarian, Nao, Nick, ERP, Sebbi and others), could give their opinions. Not that I think their time is cheaper than JC's, but they used to post here.
If anything, if some members who could put the question in proper form and properly understand the response a real programmer would give feel interested in the matter, they may go ahead.


EDIT: I was answering Tunafish; dirty copy-paste job, sorry.
 
Wasn't Xenon already a highly custom part? Why couldn't the next-gen Xbox CPU just inherit some of the better parts of the POWER7 architecture, like cache latency, and throw out the functional units they don't need (while adding the ones they do)? Surely Microsoft is prepared to pay for the custom design behind a CPU that will likely move 50+ million units over the course of 8+ years, if the cycle goes like it did this time.
 
Wasn't Xenon already a highly custom part? Why couldn't the next-gen Xbox CPU just inherit some of the better parts of the POWER7 architecture, like cache latency, and throw out the functional units they don't need (while adding the ones they do)? Surely Microsoft is prepared to pay for the custom design behind a CPU that will likely move 50+ million units over the course of 8+ years, if the cycle goes like it did this time.

I think it depends on what their requirements are.

It's really clear that the Xbox CPU was all about flops to the exclusion of pretty much everything else. But it wasn't all that different from the part IBM used as the PPU in Cell.
I.e. it wasn't all that custom; there were some minor additions/changes, but it was heavily based on a design that IBM already had (I've been told the design predates Cell as well, FWIW).

I think we'll likely see the same thing, it won't be an off the shelf part, but it will be very similar to a design that already exists.

The Power7 is heavily optimized for the workloads it's used for. Although you could just start lopping cache off it, the current design is built to function with a lot of cache, and lopping cache off may have unforeseen performance consequences. The same would be true for scaling any part of the chip down without addressing the consequences. I think it's easier to add than to subtract in this case.
 
ERP, in that case what do you think of the PowerPC 476S (or FP)? They seem to be designs that are a step up from Xenon in some areas (e.g. cache latency), small, with lower power consumption, etc., and the 'S' version is intended to be custom-modified. It seems the 476 could well be a chip happy to be changed via addition where needed. The one problem I see is that it is 32-bit (Xenon was 64-bit, iirc) and it is aimed at 1.6GHz on 45nm (16 chips at 65W, iirc). It may need some work to double the frequency--and that may come at the expense of some of the latencies.

The other IBM chip I can think of is the A2 but it seems to fall into the same criteria as POWER7 as the changes would be more toward subtraction as it, too, is a larger chip.
 
The latency of the caches in the 476FP is only shorter measured in cycles; it has the same latency measured in real time. The clock of the 476FP tops out at 1.6GHz in 45nm, exactly half the speed of Xenon's 3.2GHz, and load-to-use latency is 2 vs. 4 cycles and thus the same.

Power consumption is only 2W/core at 1.6GHz, but Microsoft would want a SIMD vector unit in there, so the datapaths would need to be reworked with a wider load/store path to the caches. The OoO scheduler is an "old" schedule-after-read design (future/active register file) instead of the more modern read-after-schedule used in BD, Sandy Bridge, Power7 and ARM Cortex A9/A15. If you add 128-bit SIMD you have to make all your ROB entries and result buses 128 bits wide. The ROB also only holds 32 entries, which is just 8 cycles at full tilt. Power consumption can only go up.

Then there is the complexity of using a lot of weak cores instead of a few fast ones. Paraphrasing Seymour Cray: Which would you use for plowing, a few oxen or a flock of chickens?

Cheers
 
Actually it was co-developed by Sony. Sony not knowing it XD
Magic of IBM's R&D management

Bollocks.

It follows the same narrow in-order philosophy that led to Power6. The Xenon and the PPU are more likely the result of an early (test) implementation of said philosophy.

Cheers
 
Hey Gubbi, it seems that you along with quite a few of the other serious developers & industry experienced guys here aren't overly enthusiastic about the performance of Xenon as a console CPU. Possibly Cell too although I'm never quite sure of the general opinion there.

Anyway, there was a link posted recently from the developers of Metro 2033 which talked about Xenon (all 3 cores) being equivalent in power to about 75-85% of a single Nehalem core at the same clockspeed. That is, unless you properly vectorise the code, in which case Xenon can actually be faster than a Nehalem on a clock/thread basis. Or in other words, in properly vectorised code, Xenon could have roughly the performance of a quad Nehalem at 3.2GHz.

What's your take on this? Is it possible to vectorise a significant portion of CPU gaming code to extract that level of performance out of something like Xenon (or Cell)? If so then it seems that a scaled up version of either of those CPU's could be pretty potent for a next gen console.
 
Hey Gubbi, it seems that you along with quite a few of the other serious developers & industry experienced guys here aren't overly enthusiastic about the performance of Xenon as a console CPU. Possibly Cell too although I'm never quite sure of the general opinion there.

Anyway, there was a link posted recently from the developers of Metro 2033 which talked about Xenon (all 3 cores) being equivalent in power to about 75-85% of a single Nehalem core at the same clockspeed. That is, unless you properly vectorise the code, in which case Xenon can actually be faster than a Nehalem on a clock/thread basis. Or in other words, in properly vectorised code, Xenon could have roughly the performance of a quad Nehalem at 3.2GHz.

What's your take on this? Is it possible to vectorise a significant portion of CPU gaming code to extract that level of performance out of something like Xenon (or Cell)? If so then it seems that a scaled up version of either of those CPU's could be pretty potent for a next gen console.

Are we talking vectorized code on Xenon vs. non-AVX vector code on Nehalem, or with AVX involved? Because I'm pretty sure that with AVX involved, it won't be pretty for Xenon.
 
What's your take on this? Is it possible to vectorise a significant portion of CPU gaming code to extract that level of performance out of something like Xenon (or Cell)?
Not all code can be vectorized. Some things, like physics or media processing, can be very neatly vectorized, with almost a linear speedup in the vector width. But some things, like "game script" or AI, really gain absolutely nothing from vectorization. Generally, simple "smooth" loads vectorize nicely, but if your problem needs to branch a lot, the vectorized code path suffers a combinatorial explosion in the paths it needs to take, so really, after two ifs you'd be better off not bothering at all.

The codebase of a game consists of a lot of different kinds of code. For some of it, Cell and Xenon are actually very good CPUs. But they push performance for one kind of load way, way past the point of sense. And when you have a set of different loads, the more you optimize one kind of load, the less you gain from each linear improvement, because the portion of your total time budget spent on it shrinks.
 
Are we talking vectorized code on Xenon vs non-AVX vector code on Nehalem, or with AVX involved, because I'm pretty sure, with AVX involved, it won't be pretty for Xenon.
Then again Nehalem doesn't support AVX, that came with Sandy Bridge :)

Though yes, it does have 128-bit SSE, but I'm quite sure that, with the extra registers and functions in Xenon, Nehalem still probably can't catch up with Xenon on a per-core basis (as long as the problem isn't heavily cache/memory latency bound).
 
Then again Nehalem doesn't support AVX, that came with Sandy Bridge :)

Though yes, it does have 128-bit SSE, but I'm quite sure that, with the extra registers and functions in Xenon, Nehalem still probably can't catch up with Xenon on a per-core basis (as long as the problem isn't heavily cache/memory latency bound).

Huh, I thought Nehalem had AVX. Learned something new......
 
I'm trying to understand here: is VSX only 128 bits wide, or is it IBM's 256-bit competitor to AVX?

Assuming VSX in Power7 is 128-bit...
The confusing thing with the VMX128 units is that "128" refers to the number of registers.
At the same time, the units are 4 wide (4 FP32 elements), so they are also 128 bits wide.
The SIMD units in POWER7 are 4 wide.

The nice thing is that we could have a VMX128 v2 which would be 8 wide :)
So 128 256-bit registers.

In the case of an OoO processor they may not need that many visible registers. AltiVec usually has 32 (4-wide / 128-bit registers).

Anyway, for now IBM has no SIMD that matches the width of Intel's AVX (8 wide, in FP only, if memory serves right).
 
Xenon v2 may be a good excuse for IBM to make a 256-bit SIMD unit. And they could later make it part of the PPC spec, just like AltiVec ;)
 