Old 08-Apr-2012, 00:05   #1
Acert93
Artist formerly known as Acert93
 
Join Date: Dec 2004
Location: Seattle
Posts: 7,811
IBM Power7 Derivative: A Viable Console CPU?

Discuss: Is a CPU, derived from IBM's POWER7 architecture, viable for consoles?

Sources:
Wikipedia
Ars
Anand
Information Week

Some quick facts:
  • Launched in 2010 on 45nm SOI
  • 567mm^2, 1.2B transistors
  • 3.0GHz to 4.14GHz
  • 33 GFLOPs (peak) per core at 4.14GHz
  • 100 to 170W TDP; IBM has fit both 4 core and 8 core Power7 variants operating at 3.0GHz, the BladeCenter PS700 and PS701 respectively, into a single Blade Slot.
  • 4, 6, and 8 Core Variants (EDIT: Possible correction, almost 250W for 8 cores at 4.14GHz)
  • 4 SMT Threads per Core (Power6 was 2 way SMT per core)
  • 32+32 KB L1 Cache per Core (Power7: 2 cycles latency; Power6: 4 cycles)
  • 256 KB L2 Cache per Core (Power7: 8 cycles latency; Power6: 26 cycles)
  • 4MB L3 Cache (eDRAM) per Core; up to 32MB per Chip
  • 12 execution units per core (2 fixed-point units; 2 load/store units; 4 double-precision floating-point units; 1 vector unit supporting VSX (AltiVec); 1 decimal floating-point unit; 1 branch unit; 1 condition register unit)
  • Aggressive OOOe. Per IBM via Wiki: "Each POWER7 processor core implements aggressive out-of-order (OoO) instruction execution to drive high efficiency in the use of available execution paths. The POWER7 processor has an Instruction Sequence Unit that is capable of dispatching up to six instructions per cycle to a set of queues. Up to eight instructions per cycle can be issued to the Instruction Execution units. The POWER7 processor has a set of twelve execution units as [described above]."
  • TurboCore: Half of the cores can be disabled so frequency is ramped up for remaining cores; remaining cores have full access to all of the chip cache and full memory controller
  • Although the Power7 architecture operates at lower frequencies than Power6, a Power7 core has "up to twice" the performance of a Power6 core
  • POWER7 features two DDR3 memory controllers that can do up to 100GB/s
Power7 is a beefy chip designed for high performance and, relative to Power6, better power efficiency. As has been noted by many, future processor performance in real-world code is as dependent on the memory architecture as on peak execution performance. In this regard Power7 has made a number of architectural changes. Per Ars, "First in the chain is the 32KB L1 data cache, which has seen its latency cut in half, from four cycles in the POWER6 to two cycles in POWER7. Then there's the 256KB L2, the latency of which has dropped from 26 cycles in POWER6 to eight cycles in POWER7--that's quite a reduction, and will help greatly to mitigate the impact of the shared L3's increased latency."

The last bit is interesting, as IBM has migrated to slower eDRAM for the L3 cache. The benefit, of course, is that eDRAM is substantially denser than SRAM, meaning a chip can either pack more memory on chip or be smaller--or both--than an SRAM design. Importantly, eDRAM also provides significant power savings over SRAM: "The POWER7's L3 is its most unique feature, and, at 32MB, it's positively gigantic. IBM was able to cram such a large L3 onto the chip by making it out of embedded DRAM (eDRAM) instead of the usual SRAM. This decision cost the cache a few cycles of latency, but in exchange IBM got a 3.5x improvement in power efficiency and a 3x improvement in cache density."

The Rumor: A recent unsubstantiated rumor suggested Microsoft's third Xbox edition (code name "Durango") will use an IBM processor with 16 "cores."

The size and power requirements for a Power7 chip, as they currently stand, are far and away outside the design limitations of a console. Considering the Xbox 360 and PS3 had a total power draw in the low 200W range, a Power7 chip far exceeds the budget for a console CPU. Furthermore, the silicon budget of an 8 core (32 thread) Power7 chip is equal to, or greater than, the total silicon budget of both past generation consoles.

Making POWER7 work for Consoles?: If Microsoft has decided on a POWER7 derivative, what would they need to do to fit it into a console in 2013? Some thoughts...

First in regards to getting the die size within console budgets:
  • 4 cores (16 threads). This should cut the die size nearly in half, from 567mm^2 to just under 300mm^2 on 45nm. Still too large for a console.
  • Migration to 32nm (or 28nm). iirc IBM has been working with GlobalFoundries on 32nm. This could see the total die size reduced by 30-50%, depending on the memory controller and how dense the logic can go. As caches scale better than logic, there is the potential for a 32nm variant to get closer to a 50% size reduction.
  • Elimination of some under-utilized (for game code) execution units.
  • Memory controller re-design. POWER7 has a (max) 100GB/s on two DDR3 memory controllers. IBM used a shared controller on the Xbox 360 with the GPU; minimally it would seem a 4 core variant would only need 1 memory controller.
  • Reduction in L3 cache size; e.g. a move from 4MB per core to 2MB per core (16MB down to 8MB). This will impact performance, and eDRAM is already both fairly small and fairly power efficient, but it may be deemed a fair sacrifice to reduce the area budget without a significant impact on console game code.
It seems possible, in theory, to reduce the humongous POWER7 (567mm^2 for the 8 core variant on 45nm) to a more reasonable 120-170mm^2, 4 core (16 thread) derived design on 32nm. To put this into perspective the PS3 Cell was over 230mm^2 on 90nm.
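
The arithmetic behind that range can be sketched in a few lines of Python. This is only a back-of-envelope model: the core halving and the 30-50% shrink come from the bullets above, while the 15% saved by trimming the L3, one memory controller and some execution units is a pure guess picked for illustration, not an IBM figure.

```python
# Back-of-envelope die-area estimate for a cut-down POWER7 derivative.
# Every scaling factor here is a rough assumption, not IBM data.

FULL_DIE_MM2 = 567.0  # 8-core POWER7 on 45nm


def estimate_area(full_die_mm2, core_fraction, shrink_reduction, trim_reduction):
    """Apply each reduction in turn and return the estimated area in mm^2."""
    area = full_die_mm2 * core_fraction   # keep only a fraction of the cores
    area *= (1.0 - shrink_reduction)      # 45nm -> 32nm shrink
    area *= (1.0 - trim_reduction)        # smaller L3, one memory controller, fewer units
    return area


TRIM_GUESS = 0.15  # hypothetical saving from trimming features

# 4 of 8 cores, with the low and high ends of the assumed shrink benefit.
pessimistic = estimate_area(FULL_DIE_MM2, 0.5, 0.30, TRIM_GUESS)
optimistic = estimate_area(FULL_DIE_MM2, 0.5, 0.50, TRIM_GUESS)

print(f"pessimistic: {pessimistic:.0f} mm^2")
print(f"optimistic:  {optimistic:.0f} mm^2")
```

With these guesses the result lands at roughly 120 to 170mm^2. Without any trimming the pessimistic case stays near 200mm^2, so the low end of the range depends on the shrink going well and on cutting features.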

Moving on to power:
  • With (a) the reduction in cores from 8 to 4 and (b) migration to the 32nm node there should be a significant drop in power usage. A move to 32nm should provide a 30-40% improvement in power efficiency per transistor. Assuming a 3.0GHz, 4 core POWER7 chip is 100W (unconfirmed, but it is the low end of the range) on the 45nm process, a 32nm variant could come in as low as 60-70W.
  • Reduction in frequency. POWER7 is much, much faster than POWER6 per clock. Further sacrificing frequency for a lower voltage design may be possible while keeping performance in an acceptable range.
  • Reduction in Execution Units, Features. As a server oriented chip the POWER7 has a number of features that may be expendable in a console environment.
  • Reduction in eDRAM. While eDRAM requires much less power than SRAM, cutting the eDRAM in half (from 16MB to 8MB for a 4 core design) could yield some additional power savings.
  • Interposer. I know it is all the rage, as the power required to drive the traces from a CPU to memory is significant. IBM has been developing 4 chip POWER7 interposer designs (up to 32 cores, 128 threads). But the chances of an interposer for the CPU/memory are slim to none.
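
The power bullets can be turned into a similar rough sketch. The 100W baseline is the unconfirmed figure from the first bullet; the frequency and voltage step uses the common rule of thumb that dynamic power scales with frequency times voltage squared, and the 2.8GHz / 5% lower voltage operating point is a hypothetical example, not a known IBM configuration.

```python
# Rough power estimate for a 4-core derivative moved from 45nm to 32nm.
# Baseline and scaling factors are assumptions for illustration only.

BASELINE_W = 100.0    # assumed 4-core, 3.0GHz POWER7 on 45nm (unconfirmed)
BASELINE_GHZ = 3.0


def shrink_power(power_w, efficiency_gain):
    """Power after a node shrink that saves `efficiency_gain` (0..1) per transistor."""
    return power_w * (1.0 - efficiency_gain)


def rescale_clock(power_w, old_ghz, new_ghz, voltage_ratio=1.0):
    """Rule of thumb: dynamic power scales with frequency times voltage squared."""
    return power_w * (new_ghz / old_ghz) * voltage_ratio ** 2


best = shrink_power(BASELINE_W, 0.40)    # high end of the assumed 30-40% gain
worst = shrink_power(BASELINE_W, 0.30)   # low end of the assumed gain

# Hypothetical operating point: 2.8GHz at 5% lower voltage, from the worst case.
downclocked = rescale_clock(worst, BASELINE_GHZ, 2.8, voltage_ratio=0.95)

print(f"32nm at 3.0GHz: {best:.0f}W to {worst:.0f}W")
print(f"32nm at 2.8GHz, 0.95x voltage: {downclocked:.1f}W")
```

The model ignores leakage and anything that does not scale with clock, so treat the output as a plausibility check on the 60-70W claim rather than a prediction.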

Let's say you work for IBM and are trying to sell Microsoft on Power7 for a 2013 console. Your spec sheet for a chip produced in 2013 looks something like this:
  • POWER7 derivative
  • About 150mm^2 on 32nm
  • 2.8-3.2GHz, 60W TDP
  • 4 Cores, 16 SMT threads
  • 32+32 KB L1, 256KB L2, 8MB L3 eDRAM (2MB per core)
Some random questions, to stimulate discussion, for those who may actually know something about POWER7.

Question #1: Is this even remotely possible? Is this far too optimistic or a roughly accurate ball park for what IBM could fit within that silicon/power budget?

Question #2: Would this make a good console CPU?

Question #3: What would you reduce? Frequency, L3, memory controller, execution units, etc? What execution units and why?

Question #4: What would you add? VMX128 support? At what cost?

Question #5: To my knowledge IBM only sells Power7 chips in complete server packages for tens of thousands of dollars for the low end. Would IBM even be interested in creating a console variant of POWER7?

Question #6: How is the POWER7's real code performance compared to an AMD Bulldozer core? Per-mm^2? Per-Watt?

Question #7: Does IBM have a better CPU architecture/solution that can be used in the 150mm^2 / 60W range? (Preferably something that is already in that range or can be scaled DOWN... just scaling chips up, especially the idea of throwing 16 single cores on a chip as if that "just works", is a non-starter. If you don't know why "just" throwing 16 Xenon cores on a die and calling it a day is a non-starter please skip this question. I want to know what other many-core architectures IBM has actively discussed that may work, not theoretical new designs connected with fanboy duct tape.)

Question #8: How does this theoretical POWER7 design compare against a 2 module / 4 int. core / 480 SP AMD APU at 3.0GHz?

Question #9: As a developer, thinking of the 5-7 year window of console development, would you prefer 4 cores/16 threads in a robust CPU (IBM design) or the shift of budgets to a 2m/4c AMD design but with on-die Shader Array? Why?

Question #10. Would this IBM design need a beefed up vector unit or is the real world performance/throughput on POWER7 chips more than sufficient?

Question #11. Thinking in console contexts, if you could change one thing about POWER7, what would it be?

Question #12. Does a POWER7 design indicate a split memory design?

Question #13. Would TurboCore be a feature valuable to consoles? e.g. For Arcade games that may be single threaded?
__________________
"In games I don't like, there is no such thing as "tradeoffs," only "downgrades" or "lazy devs" or "bugs" or "design failures." Neither do tradeoffs exist in games I'm a rabid fan of, and just shut up if you're going to point them out." -- fearsomepirate
Old 08-Apr-2012, 02:14   #2
kalelovil
Member
 
Join Date: Sep 2011
Posts: 288

Quote:
Originally Posted by Acert93 View Post
Discuss:
Question #7: Does IBM have a better CPU architecture/solution that can be used in the 150mm^2 / 60W range? (Preferably something that is already in that range or can be scaled DOWN... just scaling chips up, especially the idea of throwing 16 single cores on a chip as if that "just works", is a non-starter.
The PowerPC 476FP embedded core is designed to scale up to a 16 core SMP configuration over IBM's PLB6 bus, using only 1.6W and 3.7mm^2 per 1.6GHz core on IBM's 45nm process.

Although I don't think it would be a good choice for the PS4/Next Xbox for the reasons outlined: http://semiaccurate.com/forums/showp...&postcount=171

The 476FP was launched in 2009 though, so Microsoft/Sony would probably be considering its successor and whatever improvements that makes to the design.


Quote:
Originally Posted by Acert93 View Post
Question #9: As a developer, thinking of the 5-7 year window of console development, would you prefer 4 cores/16 threads in a robust CPU (IBM design) or the shift of budgets to a 2m/4c AMD design but with on-die Shader Array? Why?
IBM is an APU and SoC innovator as well; there's no reason you couldn't mix Power7 CPU cores and GCN(+) shader arrays on a future console APU.
Old 08-Apr-2012, 04:36   #3
Acert93
Artist formerly known as Acert93
 
Join Date: Dec 2004
Location: Seattle
Posts: 7,811

This post is appropriate for over here. I added Ninja's link to RealWorldTech about the TDP of an 8 core / 4.14GHz POWER7 being over 240W.
Quote:
Quote:
Originally Posted by Ninjaprime View Post
The chip in the Anandtech link is 3.3ghz and goes up to 170w... there is your proof. Those are server setups, you only see high clocked ones in HPC machines usually. If you dig around there is a IBM whitepaper with Power7 8 core at 3.55 with turbo up to 3.86, IIRC it says TDP is 200w. The 4ghz to 4.14 with turbo is 240w+. David Kanters info is always solid.
I am not saying David is wrong, but I don't think what Anandtech says indicates conclusively that the 3.3GHz model is 170W. Read it again:

Quote:
The Power 7 CPUs are in the 100 to 170W TDP range, while the Xeon E7s are in the 95 to 130W TDP range.
The Xeon E7 is a product lineup (Westmere-EX). For example the E7-4807, a 6 core 1.87GHz chip, has a TDP of 95W and the 8 core 2.67GHz E7-8837 has a TDP of 130W. So when Anand says the E7s have a 95-130W TDP range, that fits *exactly* with the product lineup (check my link).

If that is the context Anand is using in that sentence there remains ambiguity as the POWER7 lineup runs from 4 core to 8 core variants ranging from 3.0GHz all the way up to 4.14GHz (Turbo). If Anand's information is correct (?) you would think a 3.3GHz 4 core chip would fall on the lower end of the 100W-170W range.

Put that into perspective: if a 3.3GHz 4 core chip is 170W, how is a 4.14GHz 8 core chip only (less than) 250W? That would be amazing scaling: doubling the cores and jacking the frequency up 25% for only a 50% bump in TDP. And not likely, so pardon my reservations. I don't necessarily doubt David's 4.14GHz / 8 cores / 240W, but that would indicate the 3.0GHz / 4 core models are running at the 100W low end of the TDP range Anand provides.

As for those whitepapers, I have not seen or found them. You may be right, and I have been digging. But what I do know is that, based on the following slides (page 14), a Power 755 4U with 8 chips (32 total cores, 4 cores a chip) at 3.3GHz and 256GB of memory has a total peak power draw of 1650W. While that may reconcile to 170W per chip, it could also reconcile to the lower end of the range.
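
A quick consistency check on those TDP figures, as a minimal sketch. It assumes power grows linearly with core count and clock at a fixed voltage, which is a deliberate simplification: shared logic does not double with the cores, and higher clocks usually need more voltage. The function name is made up for illustration.

```python
# Sanity check on the quoted POWER7 TDP figures using a naive linear model.
# Assumption: power is proportional to core count times clock (fixed voltage).

def scale_tdp(base_w, core_ratio, freq_ratio):
    """Scale a TDP figure by a core-count ratio and a frequency ratio."""
    return base_w * core_ratio * freq_ratio


FREQ_RATIO = 4.14 / 3.3  # about 1.25

# If a 4-core 3.3GHz part really were 170W, an 8-core 4.14GHz part would be:
implied_top_w = scale_tdp(170.0, 2.0, FREQ_RATIO)

# Working backwards from 240W at 8 cores / 4.14GHz to 4 cores / 3.3GHz:
implied_low_w = scale_tdp(240.0, 0.5, 1.0 / FREQ_RATIO)

# Ceiling per chip if the whole 1650W of the Power 755 went to its 8 chips
# (it does not: memory, fans and power supply losses are included in that draw).
per_chip_ceiling_w = 1650.0 / 8

print(f"8c/4.14GHz implied by 170W at 4c/3.3GHz: {implied_top_w:.0f}W")
print(f"4c/3.3GHz implied by 240W at 8c/4.14GHz: {implied_low_w:.0f}W")
print(f"Power 755 per-chip ceiling: {per_chip_ceiling_w:.2f}W")
```

Under this model the two claims cannot both hold: 170W for the small part implies well over 400W for the big one, while 240W for the big part implies something under 100W for the small one.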
Old 08-Apr-2012, 08:18   #4
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 6,806

It would of course need a much-beefed-up FPU, since even the 8-core, 4.4GHz version doesn't even approach the now over half-decade old Cell processor. Cutting this CPU in half for a 4-core version and downclocking to save power, and it's no faster at floating point calculations than the even older (from an on-the-market point-of-view), IBM-developed Xenon CPU from the 360.

That'd be rather anticlimactic I would think, and wouldn't please developers very much. They have a reasonable expectation of power increase, not the opposite (even if that power ought to be considerably easier to tap compared to current console hardware.)
__________________
"Du bist Metall!"
-L.V.
Old 08-Apr-2012, 09:17   #5
Acert93
Artist formerly known as Acert93
 
Join Date: Dec 2004
Location: Seattle
Posts: 7,811

Quote:
Originally Posted by Grall View Post
It would of course need a much-beefed-up FPU, since even the 8-core, 4.4GHz version doesn't even approach the now over half-decade old Cell processor. Cutting this CPU in half for a 4-core version and downclocking to save power, and it's no faster at floating point calculations than the even older (from an on-the-market point-of-view), IBM-developed Xenon CPU from the 360.

That'd be rather anticlimactic I would think, and wouldn't please developers very much. They have a reasonable expectation of power increase, not the opposite (even if that power ought to be considerably easier to tap compared to current console hardware.)
It isn't just about it being "easier" to extract performance. Modern processors are just faster--regardless of the FLOPs rating.

A lot of architectural issues impact utilization. As for the POWER7, just look at something like Intel's i5 (4 core, 8 thread) which just trashes Cell in almost any application. Sure, there are specific segments of code that run better on Cell than said i5, but I don't think you will find even the most ardent Cell supporters who, given a choice of "what is faster for game code?", would pick the PS3 Cell -- higher peak FLOPs and all -- over said i5. Take a peek at the NV/AMD architectures prior to GCN, where the FLOPs of a GPU didn't dictate which performed better on real code. We saw the same situation with Cell versus Xenon; not every problem played to Cell's strengths (more cores, fast local memory, SIMD). It wasn't simply an issue of lazy developers or not enough time to extract performance; not all solutions map well to an architecture--this is why, after all, we have one discrete chip for graphics (GPU) and another discrete chip for general purpose code (CPU).

All that to say that Power7 per core is a LOT faster than Xenon (Waternoose). Having a ton of eDRAM is going to avoid a lot of 600+ cycle penalties for a cache miss, and the L2 (8 cycles) is very fast. Latencies and penalties were a big drawback in Xenon. So mitigating many of these with bigger eDRAM to avoid cache misses and calls to system memory, diminishing penalties, and improving L2 performance are all architectural changes that make a big improvement. That is not to mention the fact POWER7 is OOOe with more execution units and more threads per core (4) to hide stalls. The links I posted in the OP actually have information from IBM comparing it with the Power6 architecture and showing how Power6, even though it had a higher frequency, was held back by architectural issues (e.g. a longer pipeline) that led to significantly less performance.

FLOPs are no different than frequency. Most, by now, understand frequency alone does not determine performance. Peak FLOPs is the same: it won't tell you which is the faster/better processor for game code.

And ... what if ... a developer had code that was embarrassingly parallel and mapped well to SIMD? Sure, a sea of SPEs would be nice for those situations, but that code is going to be very fast on a Power7 (or i5) as well. And if a developer was demanding a huge performance leap, you would think at that point the code would be shuffled to the GPU as a compute task, as that sort of problem will many times work well there. Chances are an embarrassingly parallel problem that maps well to SIMD and requires significant resources is actually a graphics problem anyway.
Old 08-Apr-2012, 10:44   #6
french toast
Senior Member
 
Join Date: Jan 2012
Location: Leicestershire - England
Posts: 1,634

ACERT: Fantastic intro... that's gotta be the most detailed opening post I've seen. Good stuff!

Personally I think you have hit the nail on the head with that... the only thing I would question is, like Grall says, the FPU... you would expect an upgraded VMX 256 or something that could fit into the budget... that processor, 4x OoOE cores with 4x SMT, 8MB cache + VMX 256 on 32nm... yes, I think that's certainly possible... and boy would that be awesome for games!
If they could find a way to get a Tahiti Pro + 4GB RAM in there... well, we would all be laughing!

Edit: Question, I'm not too hot on these things, so would the 4x SMT apply to FPU instructions as well? Or does that count as 1 VMX thread per core, i.e. separate from integer?
Old 08-Apr-2012, 15:21   #7
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 6,806

I'd like to see dual 256-bit VMX units per core. It would be really interesting to see what talented developers could do with some truly astounding, easy-to-use float performance. Shoving off work to the GPU is all well and good for some tasks perhaps, but it takes away a lot of rendering performance. Time spent doing calculations for...whatever, is time not spent drawing stuff that goes on the screen.

And if it's one thing history has shown us since the era of 3D graphics consoles began, it's that persistent 60Hz screen updates in every game is NOT something we'll see the next generation. So I don't want that GPU spending time on anything else other than actually drawing graphics, if it is at all possible to avoid it....
Old 08-Apr-2012, 22:34   #8
Ninjaprime
Member
 
Join Date: Jun 2008
Posts: 337

Quote:
Originally Posted by Grall View Post
It would of course need a much-beefed-up FPU, since even the 8-core, 4.4GHz version doesn't even approach the now over half-decade old Cell processor. Cutting this CPU in half for a 4-core version and downclocking to save power, and it's no faster at floating point calculations than the even older (from an on-the-market point-of-view), IBM-developed Xenon CPU from the 360.

That'd be rather anticlimactic I would think, and wouldn't please developers very much. They have a reasonable expectation of power increase, not the opposite (even if that power ought to be considerably easier to tap compared to current console hardware.)
IIRC a Power7 at 4.14GHz is a little over double the FP of a full Cell at 3.2GHz. Compared to the Cell in PS3, with only 7 SPEs and one reserved for system functions, it's around 3.33x. Unless I'm reading the spec wrong? Not to mention it's OoOE and probably gets much higher real world utilization. In terms of usable perf it's probably ~5x+ what PS3's Cell is.
Old 08-Apr-2012, 22:56   #9
Acert93
Artist formerly known as Acert93
 
Join Date: Dec 2004
Location: Seattle
Posts: 7,811

Quote:
Originally Posted by Ninjaprime View Post
IIRC a Power7 at 4.14GHz is a little over double the FP of a full Cell at 3.2GHz. Compared to the Cell in PS3, with only 7 SPEs and one reserved for system functions, it's around 3.33x. Unless I'm reading the spec wrong? Not to mention it's OoOE and probably gets much higher real world utilization. In terms of usable perf it's probably ~5x+ what PS3's Cell is.
Per clock they are essentially identical peak FLOPs:

POWER7 is a max 33.12 GFLOPS per core at 4.14GHz (8GFLOPs/GHz)
http://en.wikipedia.org/wiki/POWER7#Specifications

Cell SPEs are a max of 25.6 GFLOPs at 3.2GHz (8GFLOPs/GHz)
http://en.wikipedia.org/wiki/Cell_Pr...ents_.28SPE.29

Even assuming 1 disabled SPE and 1 reserved, 6 SPEs + 1 PPE (another 25.6GFLOPs) is 179.2GFLOPs for Cell. Seeing as there is no way we will see an 8 core POWER7, let alone one at 4.14GHz, I think it is safe to say Cell's peak is better than what we would find in a console (e.g. a 3.2GHz 4 core, 102.4 GFLOPs). Even an 8 core at 4.14GHz is "only" 265GFLOPs, and once you apply the criteria (a) 1 core disabled and (b) 1 core reserved for the OS it drops down to 199GFLOPs. And of course Cell variants went up to 4.0GHz, so there you go: if you are taking a top end Cell versus a top end POWER7, adding similar restrictions, Cell wins in peak FLOPs.

But I agree that in most situations the POWER7 is going to be a lot faster and those problems that mapped well to Cell to hit peak rates would seem to be good candidates in general to move to the GPU.
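
For anyone who wants to re-run the peak numbers above, here is a small sketch. It takes the 8 FLOPs per clock per core (or per SPE) from the Wikipedia figures cited in the post at face value and treats the two chips' figures as directly comparable, which glosses over any difference in precision between the two specs.

```python
# Peak FLOPs arithmetic from the post: units * clock (GHz) * FLOPs per cycle.
# 8 FLOPs/cycle is taken from the cited Wikipedia figures, not measured.

FLOPS_PER_CYCLE = 8


def peak_gflops(units, ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOPs for `units` cores/SPEs at `ghz`."""
    return units * ghz * flops_per_cycle


cell_ps3 = peak_gflops(7, 3.2)             # 6 usable SPEs + 1 PPE
power7_console = peak_gflops(4, 3.2)       # hypothetical 4-core console part
power7_full = peak_gflops(8, 4.14)         # top-end 8-core POWER7
power7_restricted = peak_gflops(6, 4.14)   # 1 core disabled, 1 reserved for the OS

for name, value in [
    ("Cell (PS3, 6 SPE + PPE)", cell_ps3),
    ("POWER7 4c @ 3.2GHz", power7_console),
    ("POWER7 8c @ 4.14GHz", power7_full),
    ("POWER7 6c @ 4.14GHz", power7_restricted),
]:
    print(f"{name}: {value:.1f} GFLOPs")
```

These are paper peaks only; as argued earlier in the thread, they say little about which chip is faster on real game code.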
Old 09-Apr-2012, 05:27   #10
Sonic
Senior Member
 
Join Date: Feb 2002
Location: San Francisco, CA
Posts: 1,714

Why not include SPEs on the chip itself? A cut down 4 core Power7 with 4 SPEs per core is something I've been drooling at for some time now. The chip would still be Power7, but also a refined CELL where it provides the best of both worlds. It would be a monster of a CPU and would bring so much power to the table. It's not like it would be unworkable and devs wouldn't be able to get the hang of it, it just might take a while. Of course having 16 SPEs might be overkill; maybe 3 per core would be better in terms of transistor count and manageability.

But still, I guess the CPU going into Wii U is either based off of Power7 or A2. If it's Power7 based then cool, I look forward to seeing how it will compete with the 360 and PS3 in terms of programming and all that. The rumors from last year stated 4 core with eDRAM, so 4 MB per core is great, but seems overkill for a console. The 2 MB you suggested before sounds good.
Old 09-Apr-2012, 07:14   #11
Ninjaprime
Member
 
Join Date: Jun 2008
Posts: 337

Quote:
Originally Posted by Acert93 View Post
Per clock they are essentially identical peak FLOPs:

POWER7 is a max 33.12 GFLOPS per core at 4.14GHz (8GFLOPs/GHz)
http://en.wikipedia.org/wiki/POWER7#Specifications

Cell SPEs are a max of 25.6 GFLOPs at 3.2GHz (8GFLOPs/GHz)
http://en.wikipedia.org/wiki/Cell_Pr...ents_.28SPE.29

Even assuming 1 disabled SPE and 1 reserved, 6 SPEs + 1 PPE (another 25.6GFLOPs) is 179.2GFLOPs for Cell. Seeing as there is no way we will see an 8 core POWER7, let alone one at 4.14GHz, I think it is safe to say Cell's peak is better than what we would find in a console (e.g. a 3.2GHz 4 core, 102.4 GFLOPs). Even an 8 core at 4.14GHz is "only" 265GFLOPs, and once you apply the criteria (a) 1 core disabled and (b) 1 core reserved for the OS it drops down to 199GFLOPs. And of course Cell variants went up to 4.0GHz, so there you go: if you are taking a top end Cell versus a top end POWER7, adding similar restrictions, Cell wins in peak FLOPs.

But I agree that in most situations the POWER7 is going to be a lot faster and those problems that mapped well to Cell to hit peak rates would seem to be good candidates in general to move to the GPU.
IIRC, again, just from memory, the Power7 spec is using DP FLOPs and the Cell is using SP. I don't think you would just disable cores if you were using P7; it would be a huge waste relative to disabling one SPE, which isn't that much. Nor would you need a whole core just for the OS. An SPE does not equal a core. I also don't think you can add the PPE's throughput to the SPEs', since most of its time is spent managing the SPEs.
Old 09-Apr-2012, 07:48   #12
Brimstone
B3D Shockwave Rider
 
Join Date: Feb 2002
Posts: 1,835

Quote:
Originally Posted by Grall View Post
I'd like to see dual 256-bit VMX units per core. It would be really interesting to see what talented developers could do with some truly astounding, easy-to-use float performance. Shoving off work to the GPU is all well and good for some tasks perhaps, but it takes away a lot of rendering performance. Time spent doing calculations for...whatever, is time not spent drawing stuff that goes on the screen.

And if it's one thing history has shown us since the era of 3D graphics consoles began, it's that persistent 60Hz screen updates in every game is NOT something we'll see the next generation. So I don't want that GPU spending time on anything else other than actually drawing graphics, if it is at all possible to avoid it....

I'm not sure Tim Sweeney would desire VMX units.

Quote:
The big lesson we can learn from GPUs is that a powerful, wide vector engine can boost the performance of many parallel applications dramatically. This adds a whole new dimension to the performance equation: it's now a function of Cores * Clock Rate * Vector Width.

For the past decade, this point has been obscured by the underperformance of SIMD vector extensions like SSE and Altivec. But, in those cases, the basic idea was sound, but the resulting vector model wasn't a win because it was far too narrow and lacked the essential scatter/gather vector memory addressing instructions.

All of this shows there's a compelling case for Intel and AMD to put Larrabee-like vector units in future mainstream CPUs, gaining 16x more performance on data-parallel code very economically.
http://rebelscience.blogspot.com/200...rogeneous.html
__________________
When God plays an online shooter he plays Shadowrun. He buys resurrection first round and selects Dwarf.

www.shadowrunshow.com
Old 09-Apr-2012, 09:56   #13
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 6,806

Well, whatever you wanna do to reach the necessary goal, as long as it does not involve dumping CPU processing on the GPU. There should be no bias towards any one particular technology/implementation; it's the end result that matters. If CPUs need scatter/gather, give it to them. And so on.

Then again, I'm not sure I'd listen all that much to Tim Sweeney's predictions of the future; the guy's very good at what he's actually doing (UE3 is the most flexible, powerful and technically impressive 3D engine out there), but his soothsaying powers have proven to be fairly weaksauce.
Old 09-Apr-2012, 12:42   #14
pjbliverpool
B3D Scallywag
 
Join Date: May 2005
Location: Guess...
Posts: 5,899

How do they get 100GB/s out of 2 DDR3 memory controllers? Unless each controller is quad channel?

FLOPS wise, assuming it has the same throughput per core/clock as Sandy Bridge, a 4 core 3.2 GHz version would come in at 204.8 GFLOPS. It would probably use at most a single memory controller as well. Drop off a load of L3 and maybe you're getting something approaching usable in a console (although probably still not a great choice).
__________________
PowerVR PCX1 -> Voodoo Banshee -> GeForce2 MX200 -> GeForce2 Ti -> GeForce4 Ti 4200 -> 9800Pro -> 8800GTS -> Radeon HD 4890 -> GeForce GTX 670 DCUII TOP

8086 8Mhz -> Pentium 90 -> K6-2 233Mhz -> Athlon 'Thunderbird' 1Ghz -> AthlonXP 2400+ 2Ghz -> Core2 Duo E6600 2.4 Ghz -> Core i5 2500K 3.3Ghz
Old 09-Apr-2012, 12:58   #15
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218

Quote:
Originally Posted by pjbliverpool View Post
How do they get 100GB/s out of 2 DDR3 memory controllers? Unless each controller is quad channel?
Yes, it's two quad-channel controllers.
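
As a rough check on how eight channels get to that figure, here is a sketch that assumes plain 64-bit DDR3 channels. POWER7's actual memory interface may well differ from that (this is an assumption for illustration), so read it as an order-of-magnitude check only.

```python
# Peak DDR3 bandwidth for N channels, assuming ordinary 64-bit channels.
# The channel width and transfer rates are assumptions, not POWER7 specs.

BYTES_PER_TRANSFER = 8  # 64-bit channel


def peak_bandwidth_gbs(channels, mega_transfers_per_s):
    """Theoretical peak bandwidth in GB/s."""
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


def required_rate_mts(target_gbs, channels):
    """Transfer rate (MT/s) each channel needs to reach `target_gbs` in total."""
    return target_gbs * 1e9 / (channels * BYTES_PER_TRANSFER) / 1e6


CHANNELS = 2 * 4  # two quad-channel controllers

print(f"needed per channel for 100GB/s: {required_rate_mts(100, CHANNELS):.1f} MT/s")
print(f"8 channels at 1600 MT/s: {peak_bandwidth_gbs(CHANNELS, 1600):.1f} GB/s")
print(f"2 channels at 1600 MT/s: {peak_bandwidth_gbs(2, 1600):.1f} GB/s")
```

Under these assumptions, eight channels at around 1600 MT/s reach the 100GB/s ballpark, while a desktop-style dual-channel setup at the same rate gets about a quarter of that.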


Power 7 would in my opinion be a relatively bad CPU to pick as a basis for a console CPU. You'd get a much nicer thing by taking an already relatively simple design and adding console-specific stuff to it (wide SIMD) than by taking a huge behemoth with enormous amounts of resources put into improving mainframe-style workloads (huge internal and external buses, lots of wiring to get energy around, ...) and cutting it down to something usable. Even then the P7 will need a SIMD unit added to it, as I don't think it really has something good enough for consoles.
Old 09-Apr-2012, 18:36   #16
Mobius1aic
Quo vadis?
 
Join Date: Oct 2007
Location: Texas, USA
Posts: 1,367

Quote:
Originally Posted by hoho View Post
Yes, it's two quad-channel controllers.


Power 7 would in my opinion be a relatively bad CPU to pick as a basis for a console CPU. You'd get a much nicer thing by taking an already relatively simple design and adding console-specific stuff to it (wide SIMD) than by taking a huge behemoth with enormous amounts of resources put into improving mainframe-style workloads (huge internal and external buses, lots of wiring to get energy around, ...) and cutting it down to something usable. Even then the P7 will need a SIMD unit added to it, as I don't think it really has something good enough for consoles.
I'm trying to understand here: is VSX only 128 bits wide, or is it IBM's 256-bit competitor to AVX?

Assuming VSX in Power7 is 128........

With the rich and storied history of PowerPC processors the past decade, there are a number of hypothetical candidates for the Wii U CPU.

How about a quad-core Power5 with expanded L2 cache, VMX128 or 256-bit VSX, and GDDR5? IIRC Power5 is a dual-issue OoO architecture. I assume a quad version could be approached in a similar manner to Xenon, but devs wouldn't have to worry about the anemic L2 cache and in-order processing related problems. It would be a bit limiting compared to today's best solutions, but it would be very familiar territory for current developers, with hugely expanded real-world usable GFLOPS. Even in a quad configuration with 4 MB of L2 cache and increased vector processing capability, it would probably come in under 150 mm². Power5+ @ 90 nm was 243 mm². 32 nm would bring that under 100 mm² easily, hence my assumption of under 150 mm² for an improved quad. Lastly, I would ramp the clocks up to 3.2 GHz for parity with the other systems.
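
The shrink estimate above can be roughed out as follows. This is only a sketch: it assumes ideal scaling, where area shrinks with the square of the feature-size ratio, and the 2.0 fudge factor is a made-up illustration of the fact that real shrinks (I/O, analog blocks, wiring) do worse than ideal. `ideal_scaled_area` is a hypothetical helper name.

```python
def ideal_scaled_area(area_mm2, from_nm, to_nm):
    """Area after an ideal shrink: scales with the square of the node ratio."""
    return area_mm2 * (to_nm / from_nm) ** 2


POWER5_PLUS_MM2 = 243.0  # figure quoted in the post, at 90 nm

ideal = ideal_scaled_area(POWER5_PLUS_MM2, from_nm=90, to_nm=32)

# Hypothetical fudge factor: assume the real die ends up twice the ideal size.
pessimistic = ideal * 2.0

print(f"ideal 32 nm area:       {ideal:.1f} mm^2")
print(f"pessimistic 32 nm area: {pessimistic:.1f} mm^2")
```

Even the pessimistic number stays well under 100 mm², so the under-150 mm² guess for an expanded quad leaves a fair amount of headroom for the extra cores, cache and vector units.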

It would be pricey to develop a new processor, or even a current one with "bolted on" features. A quad-core Power7 on 32 nm with 256-bit VSX and memory controllers adapted to run GDDR5 makes sense to me if the power draw and TDP can be brought down. Its clock efficiency, 4 threads per core, and brilliant integer performance would be good for the current crop of developers, who are used to such wide cores on PC and sick of the narrow in-order ones on the 360 and PS3.

Last edited by Mobius1aic; 09-Apr-2012 at 18:47.
Old 09-Apr-2012, 18:50   #17
tunafish
Member
 
Join Date: Aug 2011
Posts: 408

Power7 contains a lot of stuff that is completely useless in a console. The core is balanced for single-element double-precision throughput, with 4 individual double-precision execution units (and even one decimal FPU!). This is essentially completely wasted in a console. Any Power7 CPU cut down to console use cases would no longer resemble a Power7 CPU very much.

Also, the Power7 line is not designed to be modular and embeddable. In that way, it's no worse than any previous IBM CPU. It's just that in this generation IBM does have a CPU designed to be modular and embeddable. I am, of course, talking about the PowerPC 470S. Its floating-point unit is designed to be swappable, so you can switch out the double-precision one for anything from the VMX line you fancy. Its bus design is built so that it can work as part of a cache-coherent whole with parts not built by IBM, so all the game dev gods get what they want. It's a very energy- and die-space-efficient design, so it produces admirable performance while leaving most of the design TDP and space for the GPU. And while its single-threaded performance is nothing approaching a Power7, it would still be a huge, huge improvement over the present gen, especially in the worst-case situations.

Also, the 1.6GHz is not the absolute maximum the design can stand; it's just the frequency at which IBM decided to pimp it out as a power-efficient embedded CPU. Give it a modern process and just a tiny bit more power budget, and we are talking frequencies near the "magical" 3GHz barrier last gen shipped at, with a 4-issue CPU (compared to the 2-issue ones last gen) and enough OOOe resources that it shouldn't hopelessly stall on every L1 and L2 miss.

I really, honestly think that the 470S and its successors are not just the best available options, but, considering all design constraints, really very close to being the best possible options. I'm really hoping that the "16 cores" leak means that MS is shipping with a full 470S solution.
Old 09-Apr-2012, 19:21   #18
pjbliverpool
B3D Scallywag
 
Join Date: May 2005
Location: Guess...
Posts: 5,899

Thanks tunafish, that pretty much puts the Power 7 theories to bed. So we could actually be looking at a genuine 16 core CPU using customised 470S cores.

Any idea how 16 stock 470S's would perform vs, say, a quad Sandy Bridge with hyperthreading? I assume the 470 is single-threaded, so it still comes in at twice the threads of a quad Sandy Bridge with HT?
__________________
PowerVR PCX1 -> Voodoo Banshee -> GeForce2 MX200 -> GeForce2 Ti -> GeForce4 Ti 4200 -> 9800Pro -> 8800GTS -> Radeon HD 4890 -> GeForce GTX 670 DCUII TOP

8086 8Mhz -> Pentium 90 -> K6-2 233Mhz -> Athlon 'Thunderbird' 1Ghz -> AthlonXP 2400+ 2Ghz -> Core2 Duo E6600 2.4 Ghz -> Core i5 2500K 3.3Ghz
Old 09-Apr-2012, 19:25   #19
fehu
Member
 
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 798

Isn't the 470 32-bit only?
How does it compare to the PPC A2?
Old 10-Apr-2012, 01:14   #20
tunafish
Member
 
Join Date: Aug 2011
Posts: 408

Quote:
Originally Posted by pjbliverpool View Post
Thanks tunafish, that pretty much puts the Power 7 theories to bed. So we could actually be looking at a genuine 16 core CPU using customised 470S cores.
"Customized" is a strongly overloaded word. Here it would likely mean "The cores themselves are untouched, it's just that IBM built a core with some pluggable parts, and the customer can choose which ones to plug in", as opposed to the very heavy design and customization work that was done for the PPE, and even Xenon.

Quote:
Any idea how 16 stock 470S's would perform vs, say, a quad Sandy Bridge with hyperthreading? I assume the 470 is single-threaded, so it still comes in at twice the threads of a quad Sandy Bridge with HT?
Yes, single-threaded. The biggest difference is that the 470 only has a single (128-bit) load-store pipe. This makes sense from a power/cost saving perspective -- doing proper memory OoOe gets a *lot* more expensive when you add more memory pipes. It also, unfortunately, severely limits performance. I believe the load-store pipe is the single biggest bottleneck of the chip, and that it would limit optimal throughput to roughly half of SNB, clock-for-clock and per core.

Other than that, I'd expect really nice IPC. A short pipeline, a 32-instruction window (not really; it's actually an 8*4 window, which is not quite as good), and 2-cycle access to a 32kB L1i cache (so twice as large per thread as SNB or Xenon) should be enough to mask all L1 accesses and get some real work done during L2 ones.

Quote:
Originally Posted by fehu View Post
Isn't the 470 32-bit only?
No, and I have no idea how this one started. There hasn't been a new 32-bit Power chip for quite some time -- the 470 is 64-bit, with 42 bits of real address space and 49 bits of virtual address space. (I really, really hope they allow putting tags into the upper 16 bits of the pointers. That is death for forward compatibility on the PC, so I can see why it's disallowed there, but why not for consoles?)
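
The tag idea in the parenthesis can be sketched as follows, with Python integers standing in for 64-bit pointers. The 48/16 split and the names `pack` and `unpack` are assumptions for illustration only; nothing here is taken from the 470 documentation.

```python
ADDRESS_BITS = 48   # assumption: low 48 bits hold the address
TAG_BITS = 16       # assumption: upper 16 bits are free for a tag
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def pack(address, tag):
    """Combine an address and a 16-bit tag into one 64-bit word."""
    if address >> ADDRESS_BITS:
        raise ValueError("address does not fit in the low 48 bits")
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag does not fit in 16 bits")
    return (tag << ADDRESS_BITS) | address


def unpack(word):
    """Split a tagged word back into (address, tag)."""
    return word & ADDRESS_MASK, word >> ADDRESS_BITS


word = pack(0x7FFF12345678, 0xBEEF)
address, tag = unpack(word)
print(hex(word), hex(address), hex(tag))
```

The forward-compatibility problem shows up in `pack`: as soon as later hardware hands out addresses wider than 48 bits, the address and the tag collide. On a fixed console platform the address width never changes, which is the point being made above.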

Quote:
How does it compare to the PPC A2?
A2 is meant for simple throughput loads, for shifting around large amounts of data and doing computation on it. For object-oriented loads and their ilk, it would be quite a lot slower than a 470. In pure achievable flops throughput, it would completely blow it out of the water. I hope we won't get one more of those.
Old 10-Apr-2012, 01:39   #21
kalelovil
Member
 
Join Date: Sep 2011
Posts: 288

The PowerPC 470 series is 32bit:
http://www-03.ibm.com/technology/logic/powerpc.html
Old 10-Apr-2012, 02:36   #22
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,982

Quote:
Originally Posted by tunafish View Post
Pleasure to read, as usual.
As you are here, may I ask your POV on something?
You may have read and taken part in the old "next generation CPU will they go back to OoO execution etc." thread, which I couldn't find after several searches.

It seems that pretty much everybody now agrees that OoO execution should be part of the next-generation CPU. I'm straying a bit from this thread's topic, but I wonder if throughput is still a relevant design goal for a next-generation CPU?

What is your opinion on the matter? From your posts I would guess that you think big OoO cores akin to Intel's are the way to go, but I wonder how more (FP) throughput-oriented CPUs would be perceived by those with actual knowledge of these things.

Lately I have wondered about the relevance of a pretty "big" CPU core feeding more than one SIMD unit. I figured that it could have a benefit, especially with a chip supporting 4-way SMT.
Basically, it would be like Bulldozer in concept, sharing the cost of the front end, OoO engine, etc. not across multiple "cores" but across SIMD units.
My idea is that it may be easier to feed 2 SIMD units than one bigger one (loads and stores on the 2 units are unlikely to happen at the same time, though they could) and that it could be more efficient overall than having a SIMD unit twice as big (the two are not mutually exclusive, though).

Is that a complete misunderstanding? If not, do you think it could be something desirable for a next-gen CPU?
__________________
Sebbbi about virtual texturing
The Law, by Frederic Bastiat
'The more corrupt the state, the more numerous the laws'.
- Tacitus
Old 10-Apr-2012, 03:20   #23
Acert93
Artist formerly known as Acert93
 
Join Date: Dec 2004
Location: Seattle
Posts: 7,811

Quote:
Originally Posted by tunafish View Post
Power7 contains a lot of stuff that is completely useless in a console. The core is balanced for single-element double-precision throughput, with 4 individual double-precision execution units (and even one decimal FPU!). This is essentially completely wasted in a console. Any Power7 CPU cut down to console use cases would no longer resemble a Power7 CPU very much.

Also, the Power7 line is not designed to be modular and embeddable. In that way, it's no worse than any previous IBM CPU. It's just that in this generation IBM does have a CPU designed to be modular and embeddable. I am, of course, talking about the PowerPC 470S. Its floating-point unit is designed to be swappable, so you can switch out the double-precision one for anything from the VMX line you fancy. Its bus design is built so that it can work as part of a cache-coherent whole with parts not built by IBM, so all the game dev gods get what they want. It's a very energy- and die-space-efficient design, so it produces admirable performance while leaving most of the design TDP and space for the GPU. And while its single-threaded performance is nothing approaching a Power7, it would still be a huge, huge improvement over the present gen, especially in the worst-case situations.

Also, the 1.6GHz is not the absolute maximum the design can stand; it's just the frequency at which IBM decided to pimp it out as a power-efficient embedded CPU. Give it a modern process and just a tiny bit more power budget, and we are talking frequencies near the "magical" 3GHz barrier last gen shipped at, with a 4-issue CPU (compared to the 2-issue ones last gen) and enough OOOe resources that it shouldn't hopelessly stall on every L1 and L2 miss.

I really, honestly think that the 470S and its successors are not just the best available options, but, considering all design constraints, really very close to being the best possible options. I'm really hoping that the "16 cores" leak means that MS is shipping with a full 470S solution.
Quote:
Originally Posted by fehu View Post
Isn't the 470 32-bit only?
How does it compare to the PPC A2?
For those interested, there is a PPT on the PowerPC 470S. It is a 32-bit OOOe CPU. The Memory Management slide notes a 42-bit (4TB) real address space. General info:
  • Synthesizable core (476FP is an IBM ASIC hard core) with coherency-enabled L1 caches. L1 instruction and data caches: 32KB, 4-way set-associative, 32-byte cache line
  • Coherent L2 cache in 256KB, 512KB, and 1024KB configurations (half-rate L2? "L2 cache & PLB6 complex clock: 800MHz")
  • 32-bit PPC implementation
  • SMP multicore support
  • 9-stage, four-issue, out-of-order issue and execution, in-order completion
  • Up to 32 instructions in flight
  • Five integer pipelines: Complex Integer, Multiply/Divide, Load/Store, Simple Integer, Branch
  • Supports an FP unit as well as extensions like VMX
  • 45nm SOI, eight metal layers; 1.53GHz @ 0.9V, 1.6GHz @ 0.94V
  • 1.9W @ 1.0V
  • 3.765mm^2 (with how much L2?)
Interesting. I could be wrong (?), but it looks like the L2 is half speed? And while the power and size are great at 1.6GHz, given some of the trade-offs in the design I wonder: would it be much faster than Xenon? I would guess ramping up to 3.2GHz would probably require the pipeline to become longer than 9 stages?
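
To put the slide figures in perspective, here is a quick sketch of what 16 such cores would add up to, plus a check of the 4TB figure. It assumes the quoted 3.765mm^2 and 1.9W apply per core and scale linearly, and it ignores L2, interconnect and any clock increase. Those are assumptions for illustration, not something the slides state, and `aggregate` is a made-up helper name.

```python
CORE_AREA_MM2 = 3.765  # per-core area quoted above (L2 share unknown)
CORE_POWER_W = 1.9     # per-core power quoted above, at 1.0V


def aggregate(cores):
    """Total (area in mm^2, power in W), assuming naive linear scaling."""
    return cores * CORE_AREA_MM2, cores * CORE_POWER_W


area, power = aggregate(16)
print(f"16 cores: {area:.2f} mm^2, {power:.1f} W")

# A 42-bit real address space covers 2**42 bytes; express that in TiB.
real_address_tib = (1 << 42) // (1 << 40)
print(f"42-bit real address space: {real_address_tib} TiB")
```

On those numbers, even 16 cores at stock clocks would take a small slice of a console's die and power budget, before counting cache and the cost of any move toward 3.2GHz.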

EDIT: Looks like the discussion and links got ahead of me. At least I outlined my notes :P
__________________
"In games I don't like, there is no such thing as "tradeoffs," only "downgrades" or "lazy devs" or "bugs" or "design failures." Neither do tradeoffs exist in games I'm a rabid fan of, and just shut up if you're going to point them out." -- fearsomepirate
Old 10-Apr-2012, 03:40   #24
tunafish
Member
 
Join Date: Aug 2011
Posts: 408

Quote:
Originally Posted by kalelovil View Post
I was wrong. The 476FP manual makes it abundantly clear that the core only implements the 32-bit subset of Power ISA 2.06 Book III-E. In my defense, I was probably fooled by the way the registers are still referred to as if they were 64-bit, when only the lowest 32 bits actually work. Sorry.

(And now I have no idea whatsoever what they mean when they claim 49 bits of virtual address space. Are they counting process tags or something?)

Quote:
Originally Posted by liolio View Post
As you are here, may I ask your POV on something?
You may have read and taken part in the old "next generation CPU will they go back to OoO execution etc." thread, which I couldn't find after several searches.

It seems that pretty much everybody now agrees that OoO execution should be part of the next-generation CPU. I'm straying a bit from this thread's topic, but I wonder if throughput is still a relevant design goal for a next-generation CPU?
I have some ... strong opinions on the subject. Specifically, I don't think throughput was a relevant goal for the last gen CPUs. Most code isn't hand-written assembler for vector registers. What matters is how fast compiled object oriented C++ runs, or, increasingly, how fast the best Lua implementation can run. The rest is mostly "mental masturbation". Last gen provided throughput not because it was asked for, but because it was what IBM had available to sell. (And no-one else really had anything better.)

Quote:
What is your opinion on the matter? From your posts I would guess that you think big OoO cores akin to Intel's are the way to go, but I wonder how more (FP) throughput-oriented CPUs would be perceived by those with actual knowledge of these things.
I actually have a bit of a hard-on for small OoO cores. Big fat cores like Intel's and the Power7 push the single-threaded perf way past the knee of the curve. If that's what you need, then fine. For a game console, I'm probably willing to put up with the complication of twice the threads for an order of magnitude simpler and cheaper cores.

Quote:
Lately I have wondered about the relevance of a pretty "big" CPU core feeding more than one SIMD unit. I figured that it could have a benefit, especially with a chip supporting 4-way SMT.
Basically, it would be like Bulldozer in concept, sharing the cost of the front end, OoO engine, etc. not across multiple "cores" but across SIMD units.
My idea is that it may be easier to feed 2 SIMD units than one bigger one (loads and stores on the 2 units are unlikely to happen at the same time, though they could) and that it could be more efficient overall than having a SIMD unit twice as big (the two are not mutually exclusive, though).

Is that a complete misunderstanding? If not, do you think it could be something desirable for a next-gen CPU?
That is a valid design, and sort of where x86 ended up. IBM seems to prefer simple self-contained VMX units, without the complication of dealing with forwarding and its ilk. Probably for cheaper and safer design as much as for any performance reasons.

Then again, I don't actually develop low-level game engine code for a living; I leave that part to the professionals. You could ask them? Carmack is pretty responsive on Twitter. (ID_AA_Carmack)
Old 10-Apr-2012, 08:54   #25
fehu
Member
 
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 798

Just a dumb question:
Is Freescale a completely different company, or can IBM sell its Power-based designs?