IBM Power7 Derivative: A Viable Console CPU?

Acert93

Discuss: Is a CPU, derived from IBM's POWER7 architecture, viable for consoles?

Sources:
Wikipedia
Ars
Anand
Information Week

Some quick facts:

  • Launched in 2010 on 45nm SOI
  • 567mm^2, 1.2B transistors
  • 3.0GHz to 4.14GHz
  • 33 GFLOPs (peak) per core at 4.14GHz
  • 100 to 170W TDP; IBM has fit both 4 core and 8 core Power7 variants operating at 3.0GHz, the BladeCenter PS700 and PS701 respectively, into a single Blade Slot.
  • 4, 6, and 8 Core Variants (EDIT: Possible correction, almost 250W for 8 cores at 4.14GHz)
  • 4 SMT Threads per Core (Power6 was 2 way SMT per core)
  • 32+32 KB L1 Cache per Core (Power7: 2 cycles latency; Power6: 4 cycles)
  • 256 KB L2 Cache per Core (Power7: 8 cycles latency; Power6: 26 cycles)
  • 4MB L3 Cache (eDRAM) per Core; up to 32MB per Chip
  • 12 execution units per core (2 fixed-point units; 2 load/store units; 4 double-precision floating-point units; 1 vector unit supporting VSX (AltiVec); 1 decimal floating-point unit; 1 branch unit; 1 condition register unit)
  • Aggressive OOOe. Per IBM via Wiki, "Each POWER7 processor core implements aggressive out-of-order (OoO) instruction execution to drive high efficiency in the use of available execution paths. The POWER7 processor has an Instruction Sequence Unit that is capable of dispatching up to six instructions per cycle to a set of queues. Up to eight instructions per cycle can be issued to the Instruction Execution units. The POWER7 processor has a set of twelve execution units as described above."
  • TurboCore: Half of the cores can be disabled so frequency is ramped up for remaining cores; remaining cores have full access to all of the chip cache and full memory controller
  • Although the Power7 architecture operates at lower frequencies than Power6 a Power7 core is "up to twice" the performance of a Power6 core
  • POWER7 features two DDR3 memory controllers that can do up to 100GB/s
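As a sanity check on the 33 GFLOPs-per-core figure above, the number falls out of a simple product, assuming each of the four DP FPUs retires one fused multiply-add (2 flops) per cycle -- the usual reading of IBM's published specs:

```python
fpus_per_core = 4       # POWER7 double-precision FP units per core
flops_per_fma = 2       # one fused multiply-add counts as 2 flops
clock_ghz = 4.14        # top-bin frequency

gflops_per_core = fpus_per_core * flops_per_fma * clock_ghz
print(f"{gflops_per_core:.2f} GFLOPs per core")  # 33.12
```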
Power7 is a beefy chip designed for high performance and, relative to Power6, better power efficiency. As many have noted, future processor performance on real-world code is as dependent on the memory architecture as on peak execution performance. In this regard Power7 has made a number of architectural changes. Per Ars, "First in the chain is the 32KB L1 data cache, which has seen its latency cut in half, from four cycles in the POWER6 to two cycles in POWER7. Then there's the 256KB L2, the latency of which has dropped from 26 cycles in POWER6 to eight cycles in POWER7—that's quite a reduction, and will help greatly to mitigate the impact of the shared L3's increased latency."

The last bit is interesting as IBM has migrated to slower eDRAM for the L3 cache. The benefit, of course, is that eDRAM is substantially more dense than SRAM, meaning a chip can either pack more memory on chip or be smaller--or both--compared to an SRAM design. Importantly, eDRAM also provides significant power savings over SRAM: "The POWER7's L3 is its most unique feature, and, at 32MB, it's positively gigantic. IBM was able to cram such a large L3 onto the chip by making it out of embedded DRAM (eDRAM) instead of the usual SRAM. This decision cost the cache a few cycles of latency, but in exchange IBM got a 3.5x improvement in power efficiency and a 3x improvement in cache density."

The Rumor: A recent unsubstantiated rumor suggested Microsoft's third Xbox edition (code name "Durango") will use an IBM processor with 16 "cores."

The size and power requirements for a Power7 chip, as they currently stand, are far and away outside the design limitations of a console. Considering the Xbox 360 and PS3 had a total power draw in the low 200W range, a Power7 chip far exceeds the budget for a console CPU. Furthermore, the silicon budget of an 8 core (32 thread) Power7 chip is equal to, or greater than, the total silicon budget of both past generation consoles.

Making POWER7 work for Consoles?: If Microsoft has decided on a POWER7 derivative, what would they need to do to fit it into a console in 2013? Some thoughts...

First in regards to getting the die size within console budgets:

  • 4 cores (16 threads). This should cut the die size nearly in half, from 567mm^2 to just under 300mm^2 on 45nm. Still too large for a console.
  • Migration to 32nm (or 28nm). IIRC IBM has been working with GlobalFoundries on 32nm. This could see the total die size reduced by 30-50%, depending on the memory controller and how dense the logic can go. As caches scale better than logic, there is the potential for a 32nm variant scaling closer to a 50% size reduction.
  • Elimination of some under-utilized (for game code) execution units.
  • Memory controller re-design. POWER7 has a (max) 100GB/s on two DDR3 memory controllers. IBM used a shared controller on the Xbox 360 with the GPU; minimally it would seem a 4 core variant would only need 1 memory controller.
  • Reduction in L3 cache size; e.g. a move from 4MB per core to 2MB per core (16MB down to 8MB). This will impact performance, and eDRAM is already fairly small and fairly power efficient, but it may be deemed a fair sacrifice to reduce the area budget without a significant impact on console game code.
It seems possible, in theory, to reduce the humongous POWER7 (567mm^2 for the 8 core variant on 45nm) to a more reasonable 120-170mm^2, 4 core (16 thread) derived design on 32nm. To put this into perspective the PS3 Cell was over 230mm^2 on 90nm.
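The 120-170mm^2 ball park can be sketched with a naive area model: halve the die for the core cut, then apply an ideal optical shrink from 45nm to 32nm. Real designs scale worse (I/O and pads barely shrink), so treat this as an optimistic bound:

```python
die_8core_45nm = 567.0          # mm^2, published 8-core figure
die_4core_45nm = die_8core_45nm / 2   # crude: half the cores, half the die

ideal_shrink = (32 / 45) ** 2   # ideal optical scaling, ~0.51
die_4core_32nm = die_4core_45nm * ideal_shrink
print(f"~{die_4core_32nm:.0f} mm^2")  # ~143, within the 120-170 range
```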

Moving on to power:

  • With the reduction in (a) cores from 8 to 4 and (b) migration to the 32nm node there should be a significant drop in power usage. A move to 32nm should provide a 30-40% power efficiency gain per transistor. Assuming a 3.0GHz, 4 core POWER7 chip is 100W (unconfirmed, but it is the low end of the range) on the 45nm process, a 32nm variant could come in as low as 60-70W.
  • Reduction in frequency. POWER7 is much, much faster than POWER6 per clock. Further sacrificing frequency for a lower voltage design may be possible while keeping performance in an acceptable range.
  • Reduction in execution units and features. As a server oriented chip the POWER7 has a number of features that may be expendable in a console environment.
  • Reduction in eDRAM. While eDRAM requires much less power than SRAM, cutting the eDRAM in half (from 16MB to 8MB for a 4 core design) could yield some additional power savings.
  • Interposer. I know it is all the rage, but the power required to drive the traces from a CPU to memory is significant. IBM has been developing 4 chip POWER7 interposer designs (up to 32 cores, 128 threads). Still, the chances of a CPU/memory interposer in a console are slim to none.
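Putting rough numbers on the first bullet -- the 100W starting point is unconfirmed, as noted, and the 30-40% node saving is the post's own assumption:

```python
tdp_4core_45nm = 100.0   # W; low end of IBM's published 100-170W range (unconfirmed for 4 cores)
node_saving_lo, node_saving_hi = 0.30, 0.40   # assumed 32nm efficiency gain

tdp_hi = tdp_4core_45nm * (1 - node_saving_lo)   # 70 W
tdp_lo = tdp_4core_45nm * (1 - node_saving_hi)   # 60 W
print(f"{tdp_lo:.0f}-{tdp_hi:.0f} W")
```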

Let's say you work for IBM and are trying to sell Microsoft on Power7 for a 2013 console. Your spec sheet for a 2013 product looks something like this:

  • POWER7 derivative
  • About 150mm^2 on 32nm
  • 2.8-3.2GHz, 60W TDP
  • 4 Cores, 16 SMT threads
  • 32+32 KB L1, 256KB L2, 8MB L3 eDRAM (2MB per core)
Some random questions, to stimulate discussion, for those who may actually know something about POWER7.

Question #1: Is this even remotely possible? Is this far too optimistic or a roughly accurate ball park for what IBM could fit within that silicon/power budget?

Question #2: Would this make a good console CPU?

Question #3: What would you reduce? Frequency, L3, memory controller, execution units, etc? What execution units and why?

Question #4: What would you add? VMX128 support? At what cost?

Question #5: To my knowledge IBM only sells Power7 chips in complete server packages for tens of thousands of dollars at the low end. Would IBM even be interested in creating a console variant of POWER7?

Question #6: How is the POWER7's real code performance compared to an AMD Bulldozer core? Per-mm^2? Per-Watt?

Question #7: Does IBM have a better CPU architecture/solution that can be used in the 150mm^2 / 60W range? (Preferably something that is already in that range or can be scaled DOWN... just scaling chips up, especially the idea of throwing 16 single cores on a chip as if that "just works" is a non-starter. If you don't know why "just" throwing 16 Xenon cores on a die and calling it a day is a non-starter please skip this question. I want to know what other many-core architectures IBM has actively discussed that may work, not theoretical new designs connected with fanboy duct tape.)

Question #8: How does this theoretical POWER7 design compare against a 2 module / 4 int. core / 480 SP AMD APU at 3.0GHz?

Question #9: As a developer, thinking of the 5-7 year window of console development, would you prefer 4 cores/16 threads in a robust CPU (IBM design) or the shift of budgets to a 2m/4c AMD design but with on-die Shader Array? Why?

Question #10. Would this IBM design need a beefed up vector unit, or is the real world performance/throughput on POWER7 chips more than sufficient?

Question #11. Thinking in console contexts, if you could change one thing about POWER7, what would it be?

Question #12. Does a POWER7 design indicate a split memory design?

Question #13. Would TurboCore be a feature valuable to consoles? e.g. For Arcade games that may be single threaded?
 
Discuss:
Question #7: Does IBM have a better CPU architecture/solution that can be used in the 150mm^2 / 60W range? (Preferably something that is already in that range or can be scaled DOWN... just scaling chips up, especially the idea of throwing 16 single cores on a chip as if that "just works" is a non-starter.

The PowerPC 476FP embedded core is designed to scale up to a 16 core SMP configuration over IBM's PLB6 bus, using only 1.6W and 3.7mm^2 per 1.6GHz core on IBM's 45nm process.
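At those published per-core figures, a full 16-core 476FP complex is strikingly cheap -- though note this counts cores only (the PLB6 bus, caches and other uncore are excluded):

```python
cores = 16
watts_per_core = 1.6     # W at 1.6GHz on 45nm, per IBM's figures
area_per_core = 3.7      # mm^2, cores only (no uncore)

watts = cores * watts_per_core   # 25.6 W
area = cores * area_per_core     # 59.2 mm^2
print(f"{watts:.1f} W, {area:.1f} mm^2 for {cores} cores")
```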

Although I don't think it would be a good choice for the PS4/Next Xbox for the reasons outlined: http://semiaccurate.com/forums/showpost.php?p=158733&postcount=171

The 476FP was launched in 2009 though, so Microsoft/Sony would probably be considering its successor and whatever improvements that makes to the design.


Question #9: As a developer, thinking of the 5-7 year window of console development, would you prefer 4 cores/16 threads in a robust CPU (IBM design) or the shift of budgets to a 2m/4c AMD design but with on-die Shader Array? Why?
IBM is an APU and SoC innovator as well; there's no reason you couldn't mix Power7 CPU cores and GCN(+) shader arrays on a future console APU.
 
This post is appropriate for over here. I added Ninja's link to RealWorldTech about the TDP of an 8 core / 4.14GHz POWER7 being over 240W.
The chip in the Anandtech link is 3.3GHz and goes up to 170W... there is your proof. Those are server setups; you usually only see high clocked ones in HPC machines. If you dig around there is an IBM whitepaper with Power7 8 core at 3.55GHz with turbo up to 3.86GHz; IIRC it says TDP is 200W. The 4GHz to 4.14GHz with turbo is 240W+. David Kanter's info is always solid.

I am not saying David is wrong, but I don't think what Anandtech says indicates conclusively that the 3.3GHz model is 170W. Read it again:

The Power 7 CPUs are in the 100 to 170W TDP range, while the Xeon E7s are in the 95 to 130W TDP range.
The Xeon E7 is a product lineup (Westmere-EX). For example the E7-4807, a 6 core 1.87GHz chip, has a TDP of 95W and the 8 core 2.67GHz E7-8837 is 130W. So when Anand says the E7's have a 95-130W TDP range that fits *exactly* with the product lineup (check my link).

If that is the context Anand is using in that sentence there remains ambiguity as the POWER7 lineup runs from 4 core to 8 core variants ranging from 3.0GHz all the way up to 4.14GHz (Turbo). If Anand's information is correct (?) you would think a 3.3GHz 4 core chip would fall on the lower end of the 100W-170W range.

Put that into perspective: if a 3.3GHz 4 core chip is 170W, how is a 4.14GHz 8 core chip only (less than) 250W? Doubling the cores and jacking the frequency up 25% for only a 50% bump in TDP would be amazing scaling--and not likely, so pardon my reservations. I don't necessarily doubt David's 4.14GHz / 8 cores / 240W figure, but that would indicate the 3.0GHz / 4 core models are running at the 100W low end of the TDP range Anand provides.

As for those whitepapers, I have not seen or found them. You may be right, and I have been digging. But what I do know is that, based on the following slides (page 14), a Power 755 4U with 8 chips (32 total cores, 4 cores per chip) at 3.3GHz and 256GB of memory has a total peak power draw of 1650W. While that may reconcile to 170W per chip, it could also reconcile to a lower figure.
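The 1650W system figure by itself can't pin down per-chip TDP, since it includes DIMMs, fans, and PSU losses. A quick sketch with purely illustrative (assumed) non-CPU overheads shows how wide the plausible range is:

```python
system_peak_w = 1650.0   # Power 755 4U peak draw from the slides
chips = 8

naive_per_chip = system_peak_w / chips   # ~206 W -- clearly too high for CPUs alone
for overhead_w in (300, 500, 800):       # assumed non-CPU draw (illustrative only)
    per_chip = (system_peak_w - overhead_w) / chips
    print(f"non-CPU {overhead_w}W -> {per_chip:.2f} W/chip")
```

Depending on the assumed overhead, the per-chip figure spans roughly 170W down toward 100W, which is why the slide can't settle the 170W-vs-lower question.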
 
It would of course need a much-beefed-up FPU, since even the 8-core, 4.14GHz version doesn't even approach the now over half-decade old Cell processor. Cut this CPU in half for a 4-core version and downclock it to save power, and it's no faster at floating point calculations than the even older (from an on-the-market point of view), IBM-developed Xenon CPU from the 360.

That'd be rather anticlimactic I would think, and wouldn't please developers very much. They have a reasonable expectation of power increase, not the opposite (even if that power ought to be considerably easier to tap compared to current console hardware.)
 
It would of course need a much-beefed-up FPU, since even the 8-core, 4.14GHz version doesn't even approach the now over half-decade old Cell processor. Cut this CPU in half for a 4-core version and downclock it to save power, and it's no faster at floating point calculations than the even older (from an on-the-market point of view), IBM-developed Xenon CPU from the 360.

That'd be rather anticlimactic I would think, and wouldn't please developers very much. They have a reasonable expectation of power increase, not the opposite (even if that power ought to be considerably easier to tap compared to current console hardware.)

It isn't just about it being "easier" to extract performance. Modern processors are just faster--regardless of the FLOPs rating.

A lot of architectural issues impact utilization. As for the POWER7, just look at something like Intel's i5 (4 core, 8 thread), which just trashes Cell in almost any application. Sure, there are specific segments of code that run better on Cell than said i5, but I don't think you will find even the most ardent Cell supporters who, given a choice of "what is faster for game code?", would pick the PS3 Cell -- higher peak FLOPs and all -- over said i5. Take a peek at the NV/AMD architectures prior to GCN, where the FLOPs of a GPU didn't dictate which performed better on real code. We saw the same situation with Cell versus Xenon; not every problem played to Cell's strengths (more cores, fast local memory, SIMD). It wasn't simply an issue of lazy developers or not enough time to extract performance; not all solutions map well to an architecture. This is why, after all, we have a discrete chip for graphics (GPU) and another discrete chip for general purpose code (CPU).

All that to say that Power7 per core is a LOT faster than Xenon (Waternoose). Having a ton of eDRAM is going to avoid a lot of 600+ cycle penalties for a cache miss, and the L2 (8 cycles) is very fast. Latencies and penalties were a big drawback in Xenon. So mitigating many of these -- more eDRAM to avoid cache misses and calls to system memory, diminished penalties, and improved L2 performance -- are all architectural changes that make a big improvement. That is not to mention the fact POWER7 is OOOe with more execution units and more threads per core (4) to hide stalls. The links I posted in the OP actually have information from IBM comparing the Power6 architecture and showing how, even though it had a higher frequency, architectural issues (e.g. a longer pipeline) led to significantly less performance.

FLOPs are no different than Frequency. Most, by now, understand frequency alone does not determine performance. Peak FLOPs is the same as it won't tell you what is a faster/better processor for game code.

And ... what if ... a developer had code that was embarrassingly parallel and mapped well to SIMD? Sure, a sea of SPEs would be nice for those situations but it is going to be very fast on a Power7 (or i5) also but if a developer was demanding a huge performance leap you would think at that point the code would be shuffled to the GPU as a compute task as that sort of problem will many times work well there. Chances are an embarrassingly parallel problem that maps well to SIMD that requires significant resources is actually a graphics problem anyways ;)
 
Acert: Fantastic intro... that's gotta be the most detailed start-off I've seen. Good stuff!

Personally I think you have hit the nail on the head with that... the only thing I would question, like Grall says, is the FPU... you would expect an upgraded VMX 256 or something that could fit into the budget... that processor, 4x OoOE with 4x SMT, 8MB cache + VMX 256 on 32nm... yes, I think that's certainly possible... and boy would that be awesome for games! ;)
If they could find a way to get a Tahiti Pro + 4GB RAM in there... well, we would all be laughing!

Edit: Question, I'm not too hot on these things, so would the 4x SMT apply to FPU instructions as well? Or does that count as 1 VMX thread per core, i.e. separate from integer?
 
I'd like to see dual 256-bit VMX units per core. It would be really interesting to see what talented developers could do with some truly astounding, easy-to-use float performance. Shoving off work to the GPU is all well and good for some tasks perhaps, but it takes away a lot of rendering performance. Time spent doing calculations for...whatever, is time not spent drawing stuff that goes on the screen.

And if there's one thing history has shown us since the era of 3D graphics consoles began, it's that persistent 60Hz screen updates in every game is NOT something we'll see next generation. So I don't want that GPU spending time on anything other than actually drawing graphics, if it is at all possible to avoid it.... :p
 
It would of course need a much-beefed-up FPU, since even the 8-core, 4.14GHz version doesn't even approach the now over half-decade old Cell processor. Cut this CPU in half for a 4-core version and downclock it to save power, and it's no faster at floating point calculations than the even older (from an on-the-market point of view), IBM-developed Xenon CPU from the 360.

That'd be rather anticlimactic I would think, and wouldn't please developers very much. They have a reasonable expectation of power increase, not the opposite (even if that power ought to be considerably easier to tap compared to current console hardware.)

IIRC a Power7 at 4.14GHz is a little over double the FP of a full Cell at 3.2GHz. Comparing to the Cell in PS3, with only 7 SPEs and one reserved for system functions, it's around 3.33x. Unless I'm reading the spec wrong? Not to mention it's OoOE and probably gets much higher real world utilization. In terms of usable perf it's probably ~5x+ what PS3's Cell is.
 
IIRC a Power7 at 4.14GHz is a little over double the FP of a full Cell at 3.2GHz. Comparing to the Cell in PS3, with only 7 SPEs and one reserved for system functions, it's around 3.33x. Unless I'm reading the spec wrong? Not to mention it's OoOE and probably gets much higher real world utilization. In terms of usable perf it's probably ~5x+ what PS3's Cell is.

Per clock they are essentially identical peak FLOPs:

POWER7 is a max 33.12 GFLOPS per core at 4.14GHz (8GFLOPs/GHz)
http://en.wikipedia.org/wiki/POWER7#Specifications

Cell SPEs are a max of 25.6 GFLOPs at 3.2GHz (8GFLOPs/GHz)
http://en.wikipedia.org/wiki/Cell_Processor#Synergistic_Processing_Elements_.28SPE.29

Even assuming 1 disabled SPE and 1 reserved, 6 SPEs + 1 PPE (another 25.6GFLOPs) is 179.2GFLOPs for Cell. Seeing as there is no way we will see an 8 core POWER7, let alone one at 4.14GHz, I think it is safe to say Cell's peak is better than what we would find in a console (e.g. a 3.2GHz 4 core at 102.4 GFLOPs). Even an 8 core at 4.14GHz is "only" 265GFLOPs, and once you apply the criteria of (a) 1 core disabled and (b) 1 core reserved for the OS it drops down to 199GFLOPs. And of course Cell variants went up to 4.0GHz, so there you go: if you are taking a top end Cell versus a top end POWER7, adding similar restrictions, Cell wins in peak FLOPs.
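The peak arithmetic here is easy to verify, using the SP figures as quoted:

```python
spe_gflops = 25.6        # per SPE at 3.2GHz (SP, as quoted)
ppe_gflops = 25.6        # PPE contribution at 3.2GHz (SP, as quoted)
p7_core_gflops = 33.12   # per POWER7 core at 4.14GHz

cell_ps3 = 6 * spe_gflops + ppe_gflops   # 6 usable SPEs + PPE = 179.2
p7_full = 8 * p7_core_gflops             # hypothetical 8 cores at 4.14GHz, ~265
p7_cut = 6 * p7_core_gflops              # 1 core disabled + 1 reserved, ~199
print(cell_ps3, p7_full, p7_cut)
```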

But I agree that in most situations the POWER7 is going to be a lot faster and those problems that mapped well to Cell to hit peak rates would seem to be good candidates in general to move to the GPU.
 
Why not include SPEs on the chip itself? A cut down 4 core Power7 with 4 SPEs per core is something I've been drooling over for some time now. The chip would still be Power7, but also a refined Cell, providing the best of both worlds. It would be a monster of a CPU and would bring so much power to the table. It's not like it would be unworkable and devs wouldn't be able to get the hang of it; it just might take a while. Of course having 16 SPEs might be overkill; maybe 3 per core would be better in terms of transistor count and manageability.

But still, I guess the CPU going into Wii U is either based off of Power7 or A2. If it's Power7 based then cool, I look forward to seeing how it will compete with the 360 and PS3 in terms of programming and all that. The rumors from last year stated 4 core with edram, so 4 MB per core is great, but seems overkill for a console. The 2 MB you suggested before sounds good.
 
Per clock they are essentially identical peak FLOPs:

POWER7 is a max 33.12 GFLOPS per core at 4.14GHz (8GFLOPs/GHz)
http://en.wikipedia.org/wiki/POWER7#Specifications

Cell SPEs are a max of 25.6 GFLOPs at 3.2GHz (8GFLOPs/GHz)
http://en.wikipedia.org/wiki/Cell_Processor#Synergistic_Processing_Elements_.28SPE.29

Even assuming 1 disabled SPE and 1 reserved, 6 SPEs + 1 PPE (another 25.6GFLOPs) is 179.2GFLOPs for Cell. Seeing as there is no way we will see an 8 core POWER7, let alone one at 4.14GHz, I think it is safe to say Cell's peak is better than what we would find in a console (e.g. a 3.2GHz 4 core at 102.4 GFLOPs). Even an 8 core at 4.14GHz is "only" 265GFLOPs, and once you apply the criteria of (a) 1 core disabled and (b) 1 core reserved for the OS it drops down to 199GFLOPs. And of course Cell variants went up to 4.0GHz, so there you go: if you are taking a top end Cell versus a top end POWER7, adding similar restrictions, Cell wins in peak FLOPs.

But I agree that in most situations the POWER7 is going to be a lot faster and those problems that mapped well to Cell to hit peak rates would seem to be good candidates in general to move to the GPU.

IIRC, again just from memory, the Power7 spec is using DP FLOPs and the Cell is using SP. I don't think you would just disable cores if you were using P7; it would be a huge waste relative to disabling one SPE, which isn't that much. Nor would you need a whole core just for the OS; an SPE does not equal a core. I also don't think you can add the PPE's throughput to the SPEs', since most of its time is spent managing the SPEs.
 
I'd like to see dual 256-bit VMX units per core. It would be really interesting to see what talented developers could do with some truly astounding, easy-to-use float performance. Shoving off work to the GPU is all well and good for some tasks perhaps, but it takes away a lot of rendering performance. Time spent doing calculations for...whatever, is time not spent drawing stuff that goes on the screen.

And if there's one thing history has shown us since the era of 3D graphics consoles began, it's that persistent 60Hz screen updates in every game is NOT something we'll see next generation. So I don't want that GPU spending time on anything other than actually drawing graphics, if it is at all possible to avoid it.... :p


I'm not sure Tim Sweeney would desire VMX units.

The big lesson we can learn from GPUs is that a powerful, wide vector engine can boost the performance of many parallel applications dramatically. This adds a whole new dimension to the performance equation: it's now a function of Cores * Clock Rate * Vector Width.

For the past decade, this point has been obscured by the underperformance of SIMD vector extensions like SSE and Altivec. But, in those cases, the basic idea was sound, but the resulting vector model wasn't a win because it was far too narrow and lacked the essential scatter/gather vector memory addressing instructions.

All of this shows there's a compelling case for Intel and AMD to put Larrabee-like vector units in future mainstream CPUs, gaining 16x more performance on data-parallel code very economically.

http://rebelscience.blogspot.com/2008/08/larrabee-intels-hideous-heterogeneous.html
 
Well, whatever you wanna do to reach the necessary goal, as long as it does not involve dumping CPU processing on the GPU. There should be no particular bias towards any one particular technology/implementation; it's the end result that matters. If CPUs need scatter/gather, give it to them. And so on.

Then again, I'm not sure I'd listen all that much to Tim Sweeney's predictions of the future; the guy's very good at what he's actually doing (UE3 is the most flexible, powerful and technically impressive 3D engine out there), but his soothsaying powers have proven to be fairly weaksauce. :D
 
How do they get 100GB/s out of 2 DDR3 memory controllers? Unless each controller is quad channel?

FLOPS wise, assuming it has the same throughput per core/clock as Sandy Bridge, a 4 core 3.2GHz version would come in at 204.8 GFLOPS. It would probably use at most a single memory controller as well. Drop off a load of L3 and maybe you're getting something approaching usable in a console (although probably still not a great choice).
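That 204.8 GFLOPS figure follows from assuming Sandy Bridge's 16 single-precision flops per core per cycle (8-wide AVX with separate add and multiply ports):

```python
cores = 4
clock_ghz = 3.2
sp_flops_per_cycle = 16   # 8-wide AVX add + 8-wide AVX mul per cycle (SNB-like)

gflops = cores * clock_ghz * sp_flops_per_cycle
print(gflops)  # 204.8
```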
 
How do they get 100GB/s out of 2 DDR3 memory controllers? Unless each controller is quad channel?
Yes, it's two quad-channel controllers.
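Eight DDR3 channels land right around the quoted figure. The DDR3-1600 speed grade below is an assumption for the sketch (POWER7 actually hangs its DRAM off buffer chips, so real effective numbers differ):

```python
channels = 2 * 4               # two quad-channel controllers
bytes_per_transfer = 8         # 64-bit data path per channel
transfers_per_sec = 1.6e9      # DDR3-1600 (assumed speed grade)

gb_per_sec = channels * bytes_per_transfer * transfers_per_sec / 1e9
print(f"{gb_per_sec:.1f} GB/s")  # 102.4
```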


Power7 would in my opinion be a relatively bad CPU to pick as the basis for a console CPU. You'd get a much nicer result by taking an already relatively simple design and adding console-specific stuff to it (wide SIMD) than by taking a huge behemoth with enormous amounts of resources put into improving mainframe-style workloads (huge internal and external buses, lots of wiring to get energy around, ...) and cutting it down to something usable. Even then the P7 would need a SIMD unit added to it, as I don't think it really has something good enough for consoles.
 
Yes, it's two quad-channel controllers.


Power7 would in my opinion be a relatively bad CPU to pick as the basis for a console CPU. You'd get a much nicer result by taking an already relatively simple design and adding console-specific stuff to it (wide SIMD) than by taking a huge behemoth with enormous amounts of resources put into improving mainframe-style workloads (huge internal and external buses, lots of wiring to get energy around, ...) and cutting it down to something usable. Even then the P7 would need a SIMD unit added to it, as I don't think it really has something good enough for consoles.

I'm trying to understand here, is VSX only 128 bit wide, or is it IBM's 256 competitor to AVX?

Assuming VSX in Power7 is 128........

With the rich and storied history of PowerPC processors the past decade, there are a number of hypothetical candidates for the Wii U CPU.

How about a quad core Power5 with expanded L2 cache, VMX 128 or VSX 256, and GDDR5? IIRC Power5 is a dual issue OoO architecture. I assume a quad version could be approached in a similar manner to Xenon but devs wouldn't have to worry about the anemic L2 cache and in-order processing related problems. It would be a bit limiting compared to today's best solutions, but it would be very familiar territory for current developers, with hugely expanded real world usable GFLOPS. Even in quad configuration with 4 MB L2 cache and increased vector processing capability, it would probably come in under 150 mm². Power5+ @ 90 nm was 243 mm². 32 nm would bring that under 100 mm² easy, hence my assumption for under 150 mm² for improved quad. Lastly I would ramp the clocks up to 3.2 GHz for parity with the other systems.
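A naive shrink of the Power5+ number gives a feel for the headroom under that 150 mm² guess -- this ignores the added cores' interconnect, the bigger caches, and the vector hardware being proposed:

```python
area_dual_90nm = 243.0                               # Power5+ (dual core) at 90nm
ideal_dual_32nm = area_dual_90nm * (32 / 90) ** 2    # ideal optical shrink, ~31 mm^2
quad_guess_32nm = ideal_dual_32nm * 2                # naive doubling for 4 cores
print(f"~{quad_guess_32nm:.0f} mm^2 before extra cache/VSX")
```

Even at ~61 mm² for the bare quad, there is plenty of room left under 150 mm² for the expanded L2 and wider vector units, which is consistent with the estimate above.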

It would be pricey to develop a new processor, or even a current one with "bolted on" features. A quad core Power7 on 32 nm with 256 bit VSX and memory controllers adapted to run GDDR5 makes sense to me if the power and TDP can be brought down. Its clock efficiency, 4 threads per core and brilliant integer performance would be good for the current crop of developers who are used to such wide cores on PC and sick of the narrow in-order ones on the 360 and PS3.
 
Power7 contains a lot of stuff that is completely useless in a console. The core is balanced for single element double precision throughput, with 4 individual double precision execution units (and even one decimal FPU!). This is essentially completely wasted in a console. Any power7 cpu cut down to console use cases would no longer resemble a power7 cpu very much.

Also, the Power7 line is not designed to be modular and embeddable. In that way, it's no worse than any previous IBM CPU. It's just that in this generation IBM does have a CPU designed to be modular and embeddable. I am, of course, talking of the PowerPC 470S. Its floating point unit is designed to be swappable, so you can switch out the double precision one for anything from the VMX line you fancy. Its bus design is built so that it can work as part of a cache-coherent whole with parts not built by IBM, so all the game dev gods get what they want. It's a very energy and die space efficient design, so it produces admirable performance while leaving most of the design TDP and space for the GPU. And while its single-threaded performance is nothing approaching a Power7, it would still be a huge, huge improvement over the present gen, especially in the worst-case situations.

Also, 1.6GHz is not the absolute maximum the design can stand; it's just the frequency IBM decided to pimp it out at as a power-efficient embedded CPU. Give it a modern process and just a tiny bit more power budget, and we are talking frequencies that near the "magical" 3GHz barrier last gen shipped at. With a 4-issue CPU (compared to the 2-issue ones last gen), and enough OOOe resources that it shouldn't hopelessly stall on every L1 and L2 miss.

I really, honestly think that the 470S and its successors are not just the best available options but, considering all design constraints, really very close to being the best possible options. I'm really hoping that the "16 cores" leak means that MS is shipping with a full 470S solution.
 
Thanks tunafish, that pretty much puts the Power 7 theories to bed. So we could actually be looking at a genuine 16 core CPU using customised 470S cores.

Any idea how 16 stock 470S's would perform vs say a quad Sandybridge with hyperthreading? I assume the 470 is single threaded so still comes in at twice the threads of a quad Sandybridge with HT?
 
Thanks tunafish, that pretty much puts the Power 7 theories to bed. So we could actually be looking at a genuine 16 core CPU using customised 470S cores.

"Customized" is a strongly overloaded word. Here it would likely mean "The cores themselves are untouched, it's just that IBM built a core with some pluggable parts, and the customer can choose which ones to plug in", as opposed to the very heavy design and customization work that was done for the PPE, and even Xenon.

Any idea how 16 stock 470S's would perform vs say a quad Sandybridge with hyperthreading? I assume the 470 is single threaded so still comes in at twice the threads of a quad Sandybridge with HT?
Yes, single threaded. The biggest differences are that the 470 only has a single (128-bit) load-store pipe. This makes sense from a power/cost saving perspective -- doing proper memory OoOe gets a *lot* more expensive when you add more memory pipes. It also, unfortunately, severely limits performance. I believe the load-store pipe is the single biggest bottleneck of the chip, and that it would limit optimal throughput to roughly half of SNB, clock-for-clock and per core.

Other than that, I'd expect really nice IPC. Short pipeline, 32 instructions wide instruction window (not really, it's actually 8*4 wide instruction window, which is not quite as good), and 2-cycle access to a 32kB L1i cache (so twice as large per thread as SNB or Xenon), should be enough to mask all L1 accesses, and get some real work done during L2 ones.

Isn't 470 32bit only?
No, and I have no idea how this one started. There hasn't been a new 32-bit power chip for quite some time -- the 470 is 64-bit, with 42 bits of real address space and 49 bits of virtual address space. (I really, really hope they allow putting tags into the upper 16 bits of the pointers. That is death for forward compatibility on the pc, so I can see why it's disallowed there, but why not for consoles?)

How does it compare to PPC A2?
A2 is meant for simple throughput loads, for shifting around large amounts of data and doing computation on it. For object-oriented loads and their ilk, it would be quite a lot slower than a 470. In pure achievable flops throughput, it would completely blow it out of the water. I hope we won't get one more of those.
 