IBM Power7 Derivative: A Viable Console CPU?

Discussion in 'Console Technology' started by Acert93, Apr 8, 2012.

  1. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    465
  2. liolio

    liolio French frog
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,565
    Location:
    Bx, France
    Since you are here, may I ask your point of view on something?
    You may have read or taken part in the old "will next-generation CPUs go back to OoO execution?" thread, which I couldn't find after multiple searches.

    It seems that pretty much everybody now agrees that OoO execution should be part of the next-generation CPUs. I'm straying a bit from this thread's topic, but I wonder whether throughput is still a relevant design goal for a next-generation CPU.

    What is your opinion on the matter? From your posts I would guess that you think big OoO cores akin to Intel's are the way to go, but I wonder how a more (FP) throughput-oriented CPU would be perceived by those with actual knowledge of these things.

    Lately I have wondered about the relevance of having a pretty "big" CPU core feed more than one SIMD unit. I figured it could be beneficial, especially on a chip supporting 4-way SMT.
    Basically it would be like Bulldozer in concept: sharing the cost of the front end, OoO engine, etc., not across multiple "cores" but across SIMD units.
    My idea is that it may be easier to feed two narrower SIMD units than one bigger one (loads and stores on the two units are unlikely to happen at the same time, though they could), and that it could be more efficient overall than a single SIMD unit twice as wide (the two options are not exclusive, though).

    Is that a complete misunderstanding? If not, do you think it could be something desirable for a next-gen CPU?
     
  3. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Location:
    Seattle
    For those interested, a PPT on the PowerPC 476S. It is a 32-bit OoOe CPU. The Memory Management slide notes a 42-bit (4TB) real address space. General info:

    • Synthesizable core (the 476FP is an IBM ASIC hard core) with coherency-enabled L1 caches. L1 instruction and data caches: 32KB, 4-way set-associative, 32-byte cache lines
    • Coherent L2 cache in 256KB, 512KB, and 1024KB configurations (half-rate L2? "L2 cache & PLB6 complex clock: 800MHz")
    • 32-bit PPC implementation
    • SMP multicore support
    • 9-stage, four-issue, out-of-order issue and execution, in-order completion
    • Up to 32 instructions in flight
    • Five integer pipelines: Complex Integer, Multiply/Divide, Load/Store, Simple Integer, Branch
    • Supports an FP unit as well as extensions like VMX
    • 45nm SOI, eight metal layers; 1.53GHz @ 0.9V, 1.6GHz @ 0.94V
    • 1.9W @ 1.0V
    • 3.765mm^2 (with how much L2?)
    Interesting. I could be wrong (?) but it looks like the L2 is half speed? And while the power and size are great at 1.6GHz, given some of the trade-offs in the design I wonder: would it be much faster than Xenon? I would guess ramping up to 3.2GHz would probably require the pipeline to grow longer than 9 stages.

    EDIT: Looks like the discussion and links got ahead of me. At least I outlined my notes :p
     
  4. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    I was wrong. The 476FP manual makes it abundantly clear that the core only implements the 32-bit subset of Power ISA 2.06 Book III-E. In my defense, I was probably fooled by the way the registers are still referred to as if 64-bit, when only the lowest 32 bits actually work. Sorry.

    (And now I have no idea whatsoever what they mean when they claim 49 bits of virtual address space. Are they counting process tags or something?)

    I have some ... strong opinions on the subject. Specifically, I don't think throughput was a relevant goal for the last gen CPUs. Most code isn't hand-written assembler for vector registers. What matters is how fast compiled object oriented C++ runs, or, increasingly, how fast the best Lua implementation can run. The rest is mostly "mental masturbation". Last gen provided throughput not because it was asked for, but because it was what IBM had available to sell. (And no-one else really had anything better.)

    I actually have a bit of a hard-on for small OoO cores. Big fat cores like Intel's and the Power7 push the single-threaded perf way past the knee of the curve. If that's what you need, then fine. For a game console, I'm probably willing to put up with the complication of twice the threads for an order of magnitude simpler and cheaper cores.

    That is a valid design, and sort of where x86 ended up. IBM seems to prefer simple self-contained VMX units, without the complication of dealing with forwarding and its ilk. Probably for cheaper and safer design as much as for any performance reasons.

    Then again, I don't actually develop low-level game engine code for a living; I leave that part to the professionals. You could ask them? Carmack is pretty responsive on Twitter. (ID_AA_Carmack)
     
  5. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,115
    Location:
    Somewhere over the ocean
    Just a dumb question:
    is Freescale a completely different company, or can IBM sell its Power-based designs?
     
  6. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,095
    I think so.

    Reduce L3 cache to 8MB. Remove the memory controllers and have the CPU interface with the GPU through a fast interface (similar to how the 360 does it). Remove the decimal floating point units. Reduce the 4 issue ports for floating point to two. Use SIMD to get FP throughput. Get rid of all the RAS features.

    They might want to add VMX128 for backwards compatibility but otherwise it is a waste. It was expanded to 128 registers because of the in-order nature of the 360's CPU, with OOOe you get 128 rename registers (or more).

    Why wouldn't they? It wouldn't cannibalize any of IBM's product lines and it would be revenue for their chip design unit (and possibly fab). Also bragging rights.

    IMO, it's roughly comparable to Intel's Sandy Bridge: slower on single-threaded workloads, faster on throughput. Regarding size, even though the P7 chip is massive, each core is only 27 mm² on 45nm; on 32nm that would equate to 16-18 mm².

    I think 4 cores (of any kind) is thoroughly unambitious. We're talking about a system design that will live until 2020. I expect at least 8 cores.

    It would need SIMD.

    The derivative would need to be optimized for power and for process variations.

    P7 is built for speed, with exotic power consumption as a result. It is also binned aggressively, with a fairly big spread in speeds, the fastest chips going into high-end servers and slower ones into blades. None of this can be afforded in a console design. Power consumption per core needs to be lowered, and the target frequency of the console should be low enough that you can use most of your good CPUs.

    No, let the CPU interface to the GPU; it already has massive memory controller resources: lots of bandwidth and lots of outstanding memory transactions.

    Cheers
     
  7. liolio

    liolio French frog
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,565
    Location:
    Bx, France
    Thanks for the insights :)

    You are pretty much in Gubbi's (or was it 3dilettante's?) camp; back then they thought the perfect CPUs for this generation might have looked more like the PWRficient CPUs (in dual-core fashion) than what we ended up with. I believe the IP got bought by the government before Apple took over P.A. Semi.
    So I take it your answer is that you would rather have 8 of those with 4-wide SIMD than 4 SnB cores with 8-wide SIMD. There are still some white papers out there for those CPUs.

    Overall, the sweet spot according to you might be a 3-issue CPU with a reasonable number of execution units. PWRficient, Krait or the A15 in the RISC world, maybe the K7/Athlon in the x86 world, would be a reasonable building block / sane basis for the design.
    ---------------------
    WRT having 2 SIMD units in such CPUs (or in wider ones), as well as the relevance of no SMT vs. 2-way or 4-way SMT for the kind of code that runs in games: let's just say I would never dare to post on J. Carmack's blog and take up his time.

    Maybe some people here, if they have the time (like Barbarian, Nao, Nick, ERP, Sebbi and others), could give their opinion. Not that I think their time is cheaper than JC's, but they used to post here.
    If anything, if some members who could put the question in proper form, and properly understand the response a real programmer would give, feel interested in the matter, they may go ahead.


    EDIT: I was answering Tunafish; dirty copy-paste job, sorry.
     
    #27 liolio, Apr 10, 2012
    Last edited by a moderator: Apr 10, 2012
  8. anexanhume

    Regular

    Joined:
    Dec 5, 2011
    Messages:
    842
    Wasn't Xenon already a highly custom part? Why couldn't the next-gen Xbox CPU just inherit some of the better parts of the POWER7 architecture, like cache latency, and throw out the functional units they don't need (while adding the ones they do)? Surely Microsoft is prepared to pay for the custom design behind a CPU that will likely move 50+ million units over the course of 8+ years, if the cycle goes like it did this time.
     
  9. ERP

    ERP
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Location:
    Redmond, WA
    I think it depends on what their requirements are.

    It's really clear that the XBox CPU was all about flops, to the exclusion of pretty much everything else. But it wasn't all that different from the part IBM used as the PPU in Cell.
    i.e. it wasn't all that custom; there were some minor additions/changes, but it was heavily based on a design that IBM already had (I've been told the design predates Cell as well, FWIW).

    I think we'll likely see the same thing: it won't be an off-the-shelf part, but it will be very similar to a design that already exists.

    The Power7 is heavily optimized for the workloads it's used for. Although you could just start lopping cache off it, the current design is built to function with a lot of cache, and lopping cache off may have unforeseen performance consequences. The same would be true for scaling any other part of the chip down without addressing the consequences. I think it's easier to add than to subtract in this case.
     
  10. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Location:
    Seattle
    ERP, in that case what do you think of the PowerPC 476S (or FP)? They seem to be designs that are a step up from Xenon in some areas (e.g. cache latency), small, with low power consumption, etc., and the 'S' version is meant to be custom-modified. It seems the 476 could well be a chip happy to be changed via addition where needed. The one problem I see is that it is 32-bit (Xenon was 64-bit IIRC) and it is aimed at 1.6GHz on 45nm (16 chips at 65W IIRC). It may need some work to double the frequency, and that may come at the expense of some of the latencies.

    The other IBM chip I can think of is the A2, but it seems to fall into the same category as POWER7, as the changes would be more toward subtraction since it, too, is a larger chip.
     
  11. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,095
    The latency of the caches in the 476FP is only shorter measured in cycles; it is the same measured in real time. The clock of the 476FP tops out at 1.6GHz on 45nm, exactly half of Xenon's 3.2GHz, and the load-to-use latency is 2 vs. 4 cycles, thus the same.

    Power consumption is only 2W/core at 1.6GHz, but Microsoft would want a SIMD vector unit in there, so the datapaths would need to be reworked with wider load/store paths for the caches. The OoO scheduler is an "old" schedule-after-read design (future/active register file) instead of the more modern read-after-schedule used in Bulldozer, Sandy Bridge, Power7 and ARM Cortex A9/A15. If you add 128-bit SIMD you have to make all your ROB entries and result buses 128 bits wide. The ROB also only holds 32 entries, which is just 8 cycles at full tilt. Power consumption can only go up.

    Then there is the complexity of using a lot of weak cores instead of a few fast ones. Paraphrasing Seymour Cray: Which would you use for plowing, a few oxen or a flock of chickens?

    Cheers
     
  12. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,115
    Location:
    Somewhere over the ocean
    Actually it was co-developed by Sony. Sony just didn't know it XD
    The magic of IBM's R&D management.
     
  13. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,095
    Bollocks.

    It follows the same narrow in-order philosophy that led to Power6. The Xenon and the PPU are more likely the result of an early (test) implementation of said philosophy.

    Cheers
     
  14. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,452
    Location:
    Guess...
    Hey Gubbi, it seems that you, along with quite a few of the other serious developers and industry-experienced guys here, aren't overly enthusiastic about the performance of Xenon as a console CPU. Possibly Cell too, although I'm never quite sure what the general opinion is there.

    Anyway, there was a link posted recently from the developers of Metro 2033 which talked about Xenon (all 3 cores) being equivalent in power to about 75-85% of a single Nehalem core at the same clock speed. That is, unless you properly vectorise the code, in which case Xenon can actually be faster than Nehalem on a clock/thread basis. In other words, in properly vectorised code Xenon could have roughly the performance of a quad Nehalem at 3.2GHz.

    What's your take on this? Is it possible to vectorise a significant portion of CPU gaming code to extract that level of performance out of something like Xenon (or Cell)? If so, then it seems a scaled-up version of either of those CPUs could be pretty potent for a next-gen console.
     
  15. Mobius1aic

    Mobius1aic Quo vadis?
    Veteran

    Joined:
    Oct 30, 2007
    Messages:
    1,392
    Location:
    Texas, USA
    Are we talking vectorized code on Xenon vs. non-AVX vector code on Nehalem, or with AVX involved? Because I'm pretty sure that with AVX involved it won't be pretty for Xenon.
     
  16. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    Not all code can be vectorized. Some things, like physics or media processing, can be very neatly vectorized, with an almost linear speedup in the vector width. But some things, like "game script" or AI, really gain absolutely nothing from vectorization. Generally, simple "smooth" loads vectorize nicely, but if your problem needs to branch a lot, the vectorized code path suffers a combinatorial explosion in the paths it needs to take; after about two ifs you'd be better off not bothering at all.

    The codebase of a game consists of a lot of different kinds of code. For some of it, Cell and Xenon are actually very good CPUs. But they push performance for one kind of load way, way past the point of sense. And when you have a mix of different loads, the more you optimize one kind, the less you gain from each further improvement, because the portion of your total time budget it consumes shrinks.
     
  17. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Location:
    Estonia
    Then again, Nehalem doesn't support AVX; that came with Sandy Bridge :)

    Though yes, it does have 128-bit SSE, but I'm quite sure that with the extra registers and instructions in Xenon, Nehalem still probably can't catch up with it on a per-core basis (as long as the problem isn't heavily cache/memory latency bound).
     
  18. Mobius1aic

    Mobius1aic Quo vadis?
    Veteran

    Joined:
    Oct 30, 2007
    Messages:
    1,392
    Location:
    Texas, USA
    Huh, I thought Nehalem had AVX. Learned something new...
     
  19. liolio

    liolio French frog
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,565
    Location:
    Bx, France
    The confusing thing with the VMX128 unit is that the "128" refers to the number of registers.
    At the same time, the unit being 4 wide, i.e. 4 FP32 elements, it is also 128 bits wide.
    The SIMD units in POWER7 are 4 wide as well.

    The nice thing is that we could have a VMX128 v2 which would be 8 wide :)
    So 128 registers of 256 bits each.

    In the case of an OoO processor they may not need that many visible registers. Altivec usually has 32 (4-wide / 128-bit registers).

    Anyway, for now IBM has no SIMD that matches the width of Intel's AVX (8 wide, in FP only, if memory serves right).
     
  20. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Location:
    Seattle
    Xenon v.2 may be a good excuse for IBM to make a 256bit SIMD unit. And they could later make it part of the PPC spec, just like Altivec ;)
     
    #40 Acert93, Apr 15, 2012
    Last edited by a moderator: Apr 15, 2012
