Old 10-Apr-2012, 09:07   #26
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,856
Default

Quote:
Originally Posted by Acert93 View Post
Some random questions, to stimulate discussion, for those who may actually know something about POWER7.

Question #1: Is this even remotely possible? Is this far too optimistic or a roughly accurate ballpark for what IBM could fit within that silicon/power budget?


Question #2: Would this make a good console CPU?
I think so.

Quote:
Originally Posted by Acert93 View Post
Question #3: What would you reduce? Frequency, L3, memory controller, execution units, etc? What execution units and why?
Reduce L3 cache to 8MB. Remove the memory controllers and have the CPU interface with the GPU through a fast interface (similar to how the 360 does it). Remove the decimal floating point units. Reduce the 4 issue ports for floating point to two. Use SIMD to get FP throughput. Get rid of all the RAS features.

Quote:
Originally Posted by Acert93 View Post
Question #4: What would you add? VMX128 support? At what cost?
They might want to add VMX128 for backwards compatibility, but otherwise it is a waste. It was expanded to 128 registers because of the in-order nature of the 360's CPU; with OOOe you get 128 rename registers (or more).

Quote:
Originally Posted by Acert93 View Post
Question #5: To my knowledge IBM only sells Power7 chips in complete server packages for tens of thousands of dollars for the low end. Would IBM even be interested in creating a console variant of POWER7?
Why wouldn't they? It wouldn't cannibalize any of IBM's product lines and would be revenue for their chip design unit (and possibly fab). Also bragging rights.

Quote:
Originally Posted by Acert93 View Post
Question #6: How is the POWER7's real code performance compared to an AMD Bulldozer core? Per-mm^2? Per-Watt?
IMO, it's roughly comparable to Intel's Sandy Bridge: slower on single-thread workloads, faster on throughput. Regarding size, even though the P7 chip is massive, each core is only 27 mm² on 45nm; on 32nm that would equate to 16-18 mm².
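As a back-of-the-envelope check (the 80% scaling-efficiency factor below is an illustrative assumption, not an IBM figure): an ideal shrink from 45nm to 32nm scales area by (32/45)² ≈ 0.51, which would take a 27 mm² core to about 13.7 mm²; real shrinks recover less than that, which is consistent with the 16-18 mm² estimate.

```python
# Back-of-the-envelope die-area scaling between process nodes.
# Ideally, area scales with the square of the feature-size ratio;
# real designs recover less (SRAM, I/O and analog shrink poorly).

def scaled_area_mm2(area_mm2, node_from_nm, node_to_nm, efficiency=1.0):
    """Estimate core area after a process shrink.
    efficiency=1.0 is the ideal case; <1.0 models imperfect scaling."""
    ideal = area_mm2 * (node_to_nm / node_from_nm) ** 2
    return ideal / efficiency

ideal = scaled_area_mm2(27.0, 45, 32)            # ideal shrink: ~13.7 mm^2
realistic = scaled_area_mm2(27.0, 45, 32, 0.8)   # assumed 80% efficiency: ~17.1 mm^2
```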

Quote:
Originally Posted by Acert93 View Post
Question #9: As a developer, thinking of the 5-7 year window of console development, would you prefer 4 cores/16 threads in a robust CPU (IBM design) or the shift of budgets to a 2m/4c AMD design but with on-die Shader Array? Why?
I think 4 cores (of any kind) is thoroughly unambitious. We're talking about a system design that will live until 2020. I expect at least 8 cores.

Quote:
Originally Posted by Acert93 View Post
Question #10. Would this IBM design need a beefed up vector unit or is the real world performance/throughput on POWER7 chips more than sufficient?
It would need SIMD.

Quote:
Originally Posted by Acert93 View Post
Question #11. Thinking in console contexts, if you could change one thing about POWER7, what would it be?
The derivative would need to be optimized for power and for process variations.

P7 is built for speed, with exotic power consumption as a result. It is also binned aggressively, with a fairly big spread in speeds: the fastest chips go into high-end servers and slower ones into blades. None of this can be afforded in a console design. Power consumption per core needs to be lowered, and the target frequency of the console should be low enough that you can use most of your good CPUs.

Quote:
Originally Posted by Acert93 View Post
Question #12. Does a POWER7 design indicate a split memory design?
No, let the CPU interface to the GPU; the GPU already has massive memory controller resources: lots of bandwidth and lots of outstanding memory transactions.

Cheers
__________________
I'm pink, therefore I'm spam
Gubbi is offline   Reply With Quote
Old 10-Apr-2012, 15:48   #27
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,977
Default

Thanks for the insights

You are pretty much in Gubbi's (or was it 3dilettante's?) camp: back then they thought the ideal CPUs for this gen should have looked more like those PWRficient CPUs (in dual-core form) than what we ended up with. I believe the IP got bought by the government before Apple took over P.A. Semi.
So I take it your answer is that you would take 8 of those with 4-wide SIMD over 4 SnB cores with 8-wide SIMD. There are still some white papers out there for those CPUs.

Overall, the sweet spot according to you might be a 3-issue CPU with a reasonable number of execution units. PWRficient, Krait or A15 in the RISC world, or maybe K7/Athlon in the x86 world, would be a reasonable building block / sane basis for the design.
---------------------
WRT having 2 SIMD units in such CPUs (or wider ones), as well as the relevance of no SMT vs. 2-way or 4-way SMT for the kind of code that runs in games, let's just say I would never dare to post on J. Carmack's blog and take up his time.

Maybe some people here (like Barbarian, Nao, Nick, ERP, Sebbbi and others), if they have the time, could give their opinion. Not that I think their time is cheaper than JC's, but they do post here.
If anything, if some members who can put the question in proper form, and properly understand the response a real programmer could give to it, feel interested in the matter, they may go ahead.


EDIT: I was answering Tunafish; dirty copy-paste job, sorry.
__________________
Sebbbi about virtual texturing
The Law, by Frederic Bastiat
'The more corrupt the state, the more numerous the laws'.
- Tacitus

Last edited by liolio; 10-Apr-2012 at 17:32.
liolio is online now   Reply With Quote
Old 13-Apr-2012, 19:10   #28
anexanhume
Member
 
Join Date: Dec 2011
Posts: 808
Default

Wasn't Xenon already a highly custom part? Why couldn't the next gen Xbox CPU just inherit some of the better parts of the POWER7 architecture, like cache latency, and throw out the functional units they don't need (while adding the ones they do)? Surely Microsoft is prepared to pay for the custom design behind a CPU that will likely move 50+ million units over the course of 8+ years if the cycle goes like it did this time.
anexanhume is offline   Reply With Quote
Old 13-Apr-2012, 19:44   #29
ERP
Moderator
 
Join Date: Feb 2002
Location: Redmond, WA
Posts: 3,669
Default

Quote:
Originally Posted by anexanhume View Post
Wasn't Xenon already a highly custom part? Why couldn't the next gen Xbox CPU just inherit some of the better parts of the POWER7 architecture, like cache latency, and throw out the functional units they don't need (while adding the ones they do)? Surely Microsoft is prepared to pay for the custom design behind a CPU that will likely move 50+ million units over the course of 8+ years if the cycle goes like it did this time.
I think it depends on what their requirements are.

It's really clear that the XBox CPU was all about flops to the exclusion of pretty much everything else. But it wasn't all that different from the part IBM used as the PPU in Cell.
i.e. it wasn't all that custom: there were some minor additions/changes, but it was heavily based on a design that IBM already had (I've been told the design predates Cell as well, FWIW).

I think we'll likely see the same thing, it won't be an off the shelf part, but it will be very similar to a design that already exists.

The Power7 is heavily optimized for the workloads it's used for. Although you could just start lopping cache off it, the current design is built to function with a lot of cache, so doing that may have unforeseen performance consequences. The same would be true for scaling any other part of the chip down without addressing the consequences. I think it's easier to add than to subtract in this case.
ERP is offline   Reply With Quote
Old 13-Apr-2012, 19:50   #30
Acert93
Artist formerly known as Acert93
 
Join Date: Dec 2004
Location: Seattle
Posts: 7,809
Default

ERP, in that case what do you think of the PowerPC 476S (or FP)? They seem to be designs that are a step up from Xenon in some areas (e.g. cache latency), small, with lower power consumption, etc., and the 'S' version is explicitly aimed at custom modification. It seems the 476 could well be a chip happy to be changed via addition where needed. The one problem I see is that it is 32-bit (Xenon was 64-bit iirc) and it is aimed at 1.6GHz on 45nm (16 chips at 65W iirc). It may need some work to double the frequency--and that may come at the expense of some of the latencies.

The other IBM chip I can think of is the A2, but it seems to fall into the same category as POWER7, as the changes would be more toward subtraction since it, too, is a larger chip.
__________________
"In games I don't like, there is no such thing as "tradeoffs," only "downgrades" or "lazy devs" or "bugs" or "design failures." Neither do tradeoffs exist in games I'm a rabid fan of, and just shut up if you're going to point them out." -- fearsomepirate
Acert93 is offline   Reply With Quote
Old 13-Apr-2012, 22:47   #31
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,856
Default

The latency of the caches in the 476FP is only shorter measured in cycles; it is the same measured in real time. The clock of the 476FP tops out at 1.6GHz on 45nm, exactly half of Xenon's 3.2GHz, and load-to-use latency is 2 vs. 4 cycles, and thus the same in wall-clock terms.
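To make the cycles-vs-wall-clock point concrete, here is a trivial sketch converting load-to-use latency to nanoseconds at each clock:

```python
def load_to_use_ns(cycles, clock_ghz):
    # One cycle lasts 1/clock_ghz nanoseconds.
    return cycles / clock_ghz

lat_476fp = load_to_use_ns(2, 1.6)  # 476FP: 2 cycles at 1.6 GHz
lat_xenon = load_to_use_ns(4, 3.2)  # Xenon: 4 cycles at 3.2 GHz
# Both are 1.25 ns: halving the cycle count at half the clock is a wash.
```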

Power consumption is only 2W/core at 1.6GHz, but Microsoft would want a SIMD vector unit in there, so datapaths would need to be reworked with wider load/store paths to the caches. The OOO scheduler is an "old" schedule-after-read design (future/active register file) instead of the more modern read-after-schedule used in Bulldozer, Sandy Bridge, Power7 and ARM Cortex A9/A15. If you add 128-bit SIMD you have to make all your ROB entries and result buses 128 bits wide. The ROB also only holds 32 entries, which is just 8 cycles at full tilt. Power consumption can only go up.

Then there is the complexity of using a lot of weak cores instead of a few fast ones. Paraphrasing Seymour Cray: Which would you use for plowing, a few oxen or a flock of chickens?

Cheers
Gubbi is offline   Reply With Quote
Old 13-Apr-2012, 22:50   #32
fehu
Member
 
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 790
Default

Quote:
Originally Posted by anexanhume View Post
Wasn't Xenon already a highly custom part?
Actually it was co-developed by Sony. Sony just didn't know it XD
Magic of IBM's R&D management
fehu is offline   Reply With Quote
Old 13-Apr-2012, 23:18   #33
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,856
Default

Quote:
Originally Posted by fehu View Post
Actually it was co-developed by Sony. Sony just didn't know it XD
Magic of IBM's R&D management
Bollocks.

It follows the same narrow in-order philosophy that led to Power6. The Xenon and PPU are more likely the result of an early (test) implementation of said philosophy.

Cheers
Gubbi is offline   Reply With Quote
Old 14-Apr-2012, 00:29   #34
pjbliverpool
B3D Scallywag
 
Join Date: May 2005
Location: Guess...
Posts: 5,882
Default

Hey Gubbi, it seems that you along with quite a few of the other serious developers & industry experienced guys here aren't overly enthusiastic about the performance of Xenon as a console CPU. Possibly Cell too although I'm never quite sure of the general opinion there.

Anyway, there was a link posted recently from the developers of Metro 2033 which talked about Xenon (all 3 cores) being equivalent in power to about 75-85% of a single Nehalem core at the same clock speed. That is, unless you properly vectorise the code, in which case Xenon can actually be faster than Nehalem on a clock/thread basis. Or in other words, in properly vectorised code, Xenon could have roughly the performance of a quad Nehalem at 3.2GHz.

What's your take on this? Is it possible to vectorise a significant portion of CPU gaming code to extract that level of performance out of something like Xenon (or Cell)? If so then it seems that a scaled-up version of either of those CPUs could be pretty potent for a next-gen console.
__________________
PowerVR PCX1 -> Voodoo Banshee -> GeForce2 MX200 -> GeForce2 Ti -> GeForce4 Ti 4200 -> 9800Pro -> 8800GTS -> Radeon HD 4890 -> GeForce GTX 670 DCUII TOP

8086 8Mhz -> Pentium 90 -> K6-2 233Mhz -> Athlon 'Thunderbird' 1Ghz -> AthlonXP 2400+ 2Ghz -> Core2 Duo E6600 2.4 Ghz -> Core i5 2500K 3.3Ghz
pjbliverpool is offline   Reply With Quote
Old 14-Apr-2012, 06:11   #35
Mobius1aic
Quo vadis?
 
Join Date: Oct 2007
Location: Texas, USA
Posts: 1,364
Default

Quote:
Originally Posted by pjbliverpool View Post
Hey Gubbi, it seems that you along with quite a few of the other serious developers & industry experienced guys here aren't overly enthusiastic about the performance of Xenon as a console CPU. Possibly Cell too although I'm never quite sure of the general opinion there.

Anyway, there was a link posted recently from the developers of Metro 2033 which talked about Xenon (all 3 cores) being equivalent in power to about 75-85% of a single Nehalem core at the same clockspeed. That is unless you properly vectorise the code in which case Xenon can actually be faster than a Nehalem on a clock/thread basis. Or in other words, In properly vectorised code, Xenon could have roughly the performance of a quad Nehalem at 3.2Ghz.

What's your take on this? Is it possible to vectorise a significant portion of CPU gaming code to extract that level of performance out of something like Xenon (or Cell)? If so then it seems that a scaled up version of either of those CPU's could be pretty potent for a next gen console.
Are we talking vectorized code on Xenon vs non-AVX vector code on Nehalem, or with AVX involved? Because I'm pretty sure that with AVX it won't be pretty for Xenon.
Mobius1aic is offline   Reply With Quote
Old 14-Apr-2012, 06:47   #36
tunafish
Member
 
Join Date: Aug 2011
Posts: 408
Default

Quote:
Originally Posted by pjbliverpool View Post
What's your take on this? Is it possible to vectorise a significant portion of CPU gaming code to extract that level of performance out of something like Xenon (or Cell)?
Not all code can be vectorized. Some things, like physics or media processing, can be very neatly vectorized, with almost linear speedup in the vector width. But some things, like "game script" or AI, gain absolutely nothing from vectorization. Generally, simple "smooth" loads vectorize nicely, but if your problem needs to branch a lot, the vectorized code path suffers a combinatorial explosion in the paths it needs to take; really, after about two ifs you'd be better off not bothering at all.

The codebase of a game consists of a lot of different kinds of code. For some of it, Cell and Xenon are actually very good CPUs. But they push performance for one kind of load way, way past the point of sense. And when you have a set of different loads, the more you optimize one kind of load, the less you gain from each linear improvement, because the portion of your total time budget it consumes shrinks.
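Tunafish's diminishing-returns argument is essentially Amdahl's law applied to a frame budget; the 40/60 split below is a made-up illustration, not measured data:

```python
def overall_speedup(vector_fraction, vector_speedup):
    # Amdahl's law: only `vector_fraction` of the frame benefits
    # from the `vector_speedup`x faster vector units.
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)

# Assume 40% of frame time (physics, media) vectorizes; 60%
# (game script, AI) does not.
s4 = overall_speedup(0.4, 4.0)            # 4-wide SIMD: ~1.43x overall
s8 = overall_speedup(0.4, 8.0)            # 8-wide SIMD: ~1.54x, a small extra gain
cap = overall_speedup(0.4, float("inf"))  # infinitely wide SIMD: capped at ~1.67x
```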
tunafish is offline   Reply With Quote
Old 14-Apr-2012, 08:31   #37
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218
Default

Quote:
Originally Posted by Mobius1aic View Post
Are we talking vectorized code on Xenon vs non-AVX vector code on Nehalem, or with AVX involved, because I'm pretty sure, with AVX involved, it won't be pretty for Xenon.
Then again, Nehalem doesn't support AVX; that came with Sandy Bridge.

Though yes, it does have 128-bit SSE, but I'm quite sure that, with the extra registers and functions in Xenon, Nehalem still probably can't catch up with it on a per-core basis (as long as the problem isn't heavily cache/memory-latency bound).
hoho is offline   Reply With Quote
Old 15-Apr-2012, 06:31   #38
Mobius1aic
Quo vadis?
 
Join Date: Oct 2007
Location: Texas, USA
Posts: 1,364
Default

Quote:
Originally Posted by hoho View Post
Then again Nehalem doesn't support AVX, that came with Sandy Bridge

Though yes, it does have 128bit SSE but I'm quite sure that with extra registers and functions in Xenon it still probably can't catch up with Xenon at per-core basis (as long as the problem isn't heavily cache-memory latency bound).
Huh, I thought Nehalem had AVX. Learned something new...
Mobius1aic is offline   Reply With Quote
Old 15-Apr-2012, 06:52   #39
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,977
Default

Quote:
Originally Posted by Mobius1aic View Post
I'm trying to understand here: is VSX only 128 bits wide, or is it IBM's 256-bit competitor to AVX?

Assuming VSX in Power7 is 128 bits...
The confusing thing with the VMX128 unit is that the "128" refers to the number of registers. At the same time, the unit being 4 wide (4 FP32 elements), the registers are also 128 bits wide. The SIMD units in POWER7 are likewise 4 wide.

The nice thing is that we could have a "VMX128 v2" which would be 8 wide, i.e. 128 registers of 256 bits.

In the case of an OoO processor they may not need that many visible registers; Altivec usually has 32 (4-wide / 128-bit registers).

Anyway, for now IBM has no SIMD that matches the width of Intel's AVX (8 wide, FP only, if memory serves right).
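Put as plain arithmetic (the 8-wide "VMX128 v2" here is the speculative design above, not a real IBM ISA):

```python
def fp32_lanes(reg_bits):
    # FP32 lanes per SIMD register.
    return reg_bits // 32

def regfile_bits(num_regs, reg_bits):
    # Total architectural SIMD register-file capacity.
    return num_regs * reg_bits

altivec   = regfile_bits(32, 128)    # plain Altivec/VMX: 32 x 128-bit
vmx128    = regfile_bits(128, 128)   # Xenon's VMX128: 128 x 128-bit, 4 FP32 lanes
avx       = regfile_bits(16, 256)    # Sandy Bridge AVX: 16 x 256-bit, 8 FP32 lanes
vmx128_v2 = regfile_bits(128, 256)   # speculative 8-wide "v2": 128 x 256-bit
```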
liolio is online now   Reply With Quote
Old 15-Apr-2012, 15:28   #40
Acert93
Artist formerly known as Acert93
 
Join Date: Dec 2004
Location: Seattle
Posts: 7,809
Default

Xenon v.2 may be a good excuse for IBM to make a 256-bit SIMD unit. And they could later make it part of the PPC spec, just like Altivec.

Last edited by Acert93; 15-Apr-2012 at 21:02. Reason: Grammar.
Acert93 is offline   Reply With Quote
Old 15-Apr-2012, 16:15   #41
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,977
Default

Quote:
Originally Posted by Acert93 View Post
Xenon v.2 may be a good excuse for IBM to make a 256-bit SIMD unit. And they could later make it part of the PPC spec, just like Altivec
It could, though. Especially after Tunafish's answer, I wonder about the odds of a "reasonably" wide OoO CPU which would include two 4-wide units. It may be easier in regard to data paths, etc.

I won't ask for Carmack's POV, but maybe some members like Sebbbi, who deal with CPUs too, could give us their take on such a design (though without details of the implementation, just rough ideas about it).

Last edited by liolio; 15-Apr-2012 at 17:09.
liolio is online now   Reply With Quote
Old 15-Apr-2012, 23:02   #42
fehu
Member
 
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 790
Default

Is it the right time for a big.LITTLE-like design?
Maybe 2 fat POWER7-class cores with a lot of A2-class ones, working as a traditional multicore rather than as a Cell.
fehu is offline   Reply With Quote
Old 16-Apr-2012, 15:29   #43
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,435
Default

big.LITTLE starts to matter when you have a TDP measured in the single-digit watt range and will be running on battery power for days.

The idea is more restrictive, since it is meant for significant power savings on low-performance and near-idle loads.
With console TDPs still in the hundreds of watts, the general inefficiency of a power supply and voltage regulators specced for that order of magnitude would probably waste more power than a big.LITTLE chip would save.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 16-Apr-2012, 16:23   #44
fehu
Member
 
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 790
Default

I was thinking of a more traditional implementation, in which you can have all the cores at max frequency at the same time.
You can't have a 16-core Power7, but a 16-core A2 can be good for parallel tasks while bad for single thread, so I was thinking about a heterogeneous design in which you have some powerful cores to run demanding tasks, and a lot of decent cores to run the rest.
Considering that all IBM designs at the moment share the same instruction set, it's not out of the question. But would something like this really be useful for a developer?
fehu is offline   Reply With Quote
Old 16-Apr-2012, 18:18   #45
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,435
Default

big.LITTLE is concerned with migrating all work off of the big cores to power-optimized low-power cores when performance is not needed.

The more general case, with different cores that run different loads based on their strengths, is some form of asymmetric or heterogeneous multiprocessing.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 16-Apr-2012, 20:19   #46
darkblu
Senior Member
 
Join Date: Feb 2002
Posts: 2,642
Default

I'm entirely with Gubbi here. While the 47x might seem like a very nice embedded choice, I'm not convinced it would scale up that well, not if the 'should sorta meet the former gen's throughput' part of the order is heavily weighted (apparently speaking primarily of Xenon here rather than any geometry-culling-busy SPU flocks). Basically, the way I see it, IBM's options this gen are:
(1) Carry on with some Xenon/PPE iteration, trying to strike some sort of a balanced design, or...
(2) Go P7-based. Viewed from my armchair here, P7 seems to offer a lot for trimming.

Quote:
Originally Posted by fehu View Post
Just a dumb question: is Freescale a completely different company, or can IBM sell its Power-based designs?
A completely different company. Freescale are the former Motorola Semiconductor (of the AIM alliance fame).
darkblu is offline   Reply With Quote
Old 16-Apr-2012, 20:45   #47
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,435
Default

We will have to wait and see.
Taking apart a multibillion transistor chip to make a derivative isn't free either.

The floating point resources would likely need to be redesigned, as would the memory hierarchy and cache coherence protocol. The front end and scheduling logic would likely face cuts, and this is all assuming IBM even wants its best tech licensed to Microsoft to produce on its own.

It's not a simple task to cut chunks out of a complex core whose components are balanced and verified for that design point. I wouldn't be certain it would be any easier to cut a POWER7 down than it would be to speed an A2-type design up.

I'm not saying it's impossible, but it might not be desirable.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 17-Apr-2012, 17:59   #48
anexanhume
Member
 
Join Date: Dec 2011
Posts: 808
Default

I guess it all comes down to perceived return on investment. I'm going to guess the trade study will show that a highly customized CPU with high floating-point throughput that is easy to program for isn't necessarily going to translate to more market share in any scenario. No matter how hard the PS4 is to program for, it's still going to get multi-platform games, and those games aren't going to look a lot more fantastic on the Xbox just because the hardware is vastly superior, because optimization takes time. It will also depend on whether Microsoft gets to own the design (360 GPU) or has to buy it (original Xbox GPU).

I would expect a lightly modified Power7 core as a trade between optimization and development cost.
anexanhume is offline   Reply With Quote
Old 17-Apr-2012, 21:03   #49
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,856
Default

Quote:
Originally Posted by anexanhume View Post
No matter how hard the PS4 is to program for, it's still going to get multi-platform games, and those games aren't going to look a lot more fantastic on the xbox just because the hardware is vastly superior because optimization takes time
It took multi-platform games on the PS3 three years to reach parity with the 360, largely because of the added complexities of its architecture (split memory, CPU).

Developers do the bulk of their development on workstations, i.e. bog-standard PC hardware. The fewer pathological cases and gotchas developers hit when moving to production hardware, the better the end result will be.

Cheers
Gubbi is offline   Reply With Quote
Old 17-Apr-2012, 21:22   #50
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218
Default

Quote:
Originally Posted by Gubbi View Post
It took multi platform games on the PS3 three years to reach parity with the 360, largely because of added complexities in architecture (split memory, CPU).
Wasn't the huge difference in CPU and GPU power between the two consoles a much bigger contributor than the memory model or the complexity of Cell? I would imagine it's far harder to offload GPU work to Cell than to balance memory usage between pools.
hoho is offline   Reply With Quote
