#26 | ||||||||
|
Senior Member
Join Date: Feb 2002
Posts: 2,577
|
Quote:
Quote:
They might want to add VMX128 for backwards compatibility, but otherwise it is a waste. It was expanded to 128 registers because of the in-order nature of the 360's CPU; with OoO execution you get 128 rename registers (or more). Quote:
Quote:
Quote:
Quote:
Quote:
P7 is built for speed, with exotic power consumption as a result. It is also binned aggressively with a fairly big spread in speeds, the fastest chips going into high-end servers and slower ones into blades. None of this can be afforded in a console design. Power consumption per core needs to be lowered, and the target frequency of the console should be low enough that you can use most of your good CPUs. Quote:
Cheers
__________________
I'm pink, therefore I'm spam |
||||||||
|
|
|
|
|
#27 |
|
French frog
Join Date: Jun 2005
Location: France
Posts: 4,172
|
Thanks for the insights
You are pretty much in Gubbi's (or was it 3dilletante's?) camp. Back then they thought the perfect CPU for this gen should have looked more like those PWRficient CPUs (in dual-core fashion) than what we ended up with. I believe the IP got bought by the government before Apple took over P.A. Semi. So I take it your answer is that you would take 8 of those with 4-wide SIMD over 4 SnB cores with 8-wide SIMD. There are still some white papers out there for those CPUs. Overall, the sweet spot according to you might be a 3-issue CPU with a reasonable number of execution units. PWRficient, Krait or A15 in the RISC world, or maybe K7/Athlon in the x86 world, would be a reasonable building block / sane basis for the design.
---------------------
WRT having 2 SIMD units in such CPUs (or in wider ones), as well as the relevance of no SMT, 2-way or 4-way SMT for the kind of code that runs in games: let's say I would never dare to post on J. Carmack's blog and so take up his time. Maybe some people here, if they have the time (like Barbarian, Nao, Nick, ERP, Sebbbi and others), could give their opinion. Not that I think their time is cheaper than JC's, but they do post here. If any members who could put the question in proper form, and properly understand the answer a real programmer would give, are interested in the matter, they may go ahead.
EDIT: I was answering Tunafish; dirty copy-paste job, sorry.
__________________
What's trying to be a bunch of presentations PS360 youtube channel Sebbbi about virtual texturing Tuned EADGCF and liking it :) Last edited by liolio; 10-Apr-2012 at 17:32. |
|
|
|
|
|
#28 |
|
Member
Join Date: Dec 2011
Posts: 680
|
Wasn't Xenon already a highly custom part? Why couldn't the next-gen Xbox CPU just inherit some of the better parts of the POWER7 architecture, like cache latency, and throw out the functional units they don't need (while adding the ones they do)? Surely Microsoft is prepared to pay for the custom design behind a CPU that will likely move 50+ million units over the course of 8+ years if the cycle goes like it did this time.
|
|
|
|
|
|
#29 | |
|
Moderator
Join Date: Feb 2002
Location: Redmond, WA
Posts: 3,198
|
Quote:
It's really clear that the Xbox CPU was all about FLOPS to the exclusion of pretty much everything else. But it wasn't all that different from the part IBM used as the PPU in Cell, i.e. it wasn't all that custom: there were some minor additions/changes, but it was heavily based on a design that IBM already had (I've been told the design predates Cell as well, FWIW). I think we'll likely see the same thing: it won't be an off-the-shelf part, but it will be very similar to a design that already exists. The Power 7 is heavily optimized for the workloads it's used for. Although you could just start lopping cache off it, the current design is built to function with a lot of cache, and lopping cache off may have unforeseen performance consequences. The same would be true for scaling any part of the chip down without addressing the consequences. I think it's easier to add than to subtract in this case. |
|
|
|
|
|
|
#30 |
|
Artist formerly known as Acert93
Join Date: Dec 2004
Location: Seattle
Posts: 7,704
|
ERP, in that case what do you think of the PowerPC 476S (or FP)? They seem to be designs that are a step up from Xenon in some areas (e.g. cache latency), small, lower power consumption, etc and in the 'S' version it is aimed to be custom modified. It seems the 476 could well be a chip happy to be changed via addition where needed. The one problem I see is it is 32bit (Xenon was 64bit iirc) and it is aimed at 1.6GHz on 45nm (16 chips at 65W iirc). It may need some work to double the frequency--and that may come at the expense of some of the latencies.
The other IBM chip I can think of is the A2 but it seems to fall into the same criteria as POWER7 as the changes would be more toward subtraction as it, too, is a larger chip.
__________________
"In games I don't like, there is no such thing as "tradeoffs," only "downgrades" or "lazy devs" or "bugs" or "design failures." Neither do tradeoffs exist in games I'm a rabid fan of, and just shut up if you're going to point them out." -- fearsomepirate |
|
|
|
|
|
#31 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,577
|
The latency of the caches in the 476FP is only shorter measured in cycles; it has the same latency measured in real time. The clock of the 476FP tops out at 1.6GHz in 45nm, exactly half the speed of Xenon's 3.2GHz, so load-to-use latency is 2 vs. 4 cycles and thus the same in nanoseconds.
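The cycles-versus-time point can be checked with a line of arithmetic. A minimal sketch using only the clock speeds and cycle counts quoted in this post (the function name is made up for illustration):

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


# Figures quoted in the post.
latency_476fp_ns = cycles_to_ns(2, 1.6)  # 476FP: 2 cycles at 1.6 GHz
latency_xenon_ns = cycles_to_ns(4, 3.2)  # Xenon: 4 cycles at 3.2 GHz

print(f"476FP: {latency_476fp_ns:.2f} ns, Xenon: {latency_xenon_ns:.2f} ns")
```

Both work out to 1.25 ns, so the halved cycle count is fully offset by the halved clock.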
Power consumption is only 2W/core at 1.6GHz, but Microsoft would want to have a SIMD vector unit in there, so datapaths would need to be reworked with a wider load/store path for the caches. The OOO scheduler is an "old" schedule-after-read (future/active-register file) design instead of the more modern read-after-schedule used in BD, Sandy Bridge, Power7 and ARM Cortex A9/15. If you add 128-bit SIMD you have to make all your ROB entries and result buses 128 bits wide. The ROB also only holds 32 entries, which is just 8 cycles at full tilt. Power consumption can only go up. Then there is the complexity of using a lot of weak cores instead of a few fast ones. Paraphrasing Seymour Cray: Which would you use for plowing, a few oxen or a flock of chickens? Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
#32 |
|
Member
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 634
|
|
|
|
|
|
|
#33 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,577
|
Quote:
It follows the same narrow in-order philosophy that led to Power 6. The Xenon and PPU are more likely the result of an early (test) implementation of said philosophy. Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#34 |
|
B3D Scallywag
|
Hey Gubbi, it seems that you, along with quite a few of the other serious developers and industry-experienced guys here, aren't overly enthusiastic about the performance of Xenon as a console CPU. Possibly Cell too, although I'm never quite sure of the general opinion there.
Anyway, there was a link posted recently from the developers of Metro 2033 which talked about Xenon (all 3 cores) being equivalent in power to about 75-85% of a single Nehalem core at the same clock speed. That is, unless you properly vectorise the code, in which case Xenon can actually be faster than a Nehalem on a clock/thread basis. In other words, in properly vectorised code, Xenon could have roughly the performance of a quad Nehalem at 3.2GHz. What's your take on this? Is it possible to vectorise a significant portion of CPU gaming code to extract that level of performance out of something like Xenon (or Cell)? If so, then it seems that a scaled-up version of either of those CPUs could be pretty potent for a next-gen console.
__________________
PowerVR PCX1 4MB --> Voodoo Banshee 16MB --> GeForce2 MX200 32MB --> GeForce2 Ti 64MB --> GeForce4 Ti 4200 128MB --> 9800Pro 128MB --> 8800GTS 640MB --> Radeon HD 4890 1GB --> GeForce GTX 670 DirectCU II TOP 2GB |
|
|
|
|
|
#35 | |
|
Quo vadis?
Join Date: Oct 2007
Location: Texas, USA
Posts: 1,338
|
Quote:
|
|
|
|
|
|
|
#36 | |
|
Member
Join Date: Aug 2011
Posts: 371
|
Quote:
The codebase of a game consists of a lot of different kinds of code. For some of it, Cell and Xenon are actually very good CPUs. But they push performance for one kind of load way, way past the point of sense. And when you have a set of different loads, the more you optimize one kind of load, the less you gain from each linear improvement, because the portion of your total time budget spent on it shrinks. |
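This diminishing-returns argument is essentially Amdahl's law. A minimal sketch, assuming a hypothetical frame in which 40% of the time goes to the kind of load being optimized (the 40% figure and the function name are illustrative, not from the thread):

```python
def frame_time(fast_fraction: float, speedup: float) -> float:
    """Relative frame time after speeding up only one kind of load.

    fast_fraction: share of the original time spent in the optimized load
                   (hypothetical value, chosen for illustration).
    speedup:       how many times faster that load now runs.
    """
    return (1.0 - fast_fraction) + fast_fraction / speedup


previous = frame_time(0.4, 1)
for speedup in (2, 4, 8, 16):
    current = frame_time(0.4, speedup)
    # Each doubling of the speedup saves less total time than the last.
    print(f"{speedup:>2}x: frame time {current:.3f}, saved {previous - current:.3f}")
    previous = current
```

Under these assumed numbers each doubling of the speedup saves half as much as the previous one, and the frame time can never drop below the untouched 60%.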
|
|
|
|
|
|
#37 | |
|
Senior Member
|
Quote:
Though yes, it does have 128-bit SSE, I'm quite sure that, given the extra registers and functions in Xenon, it still probably can't catch up with Xenon on a per-core basis (as long as the problem isn't heavily cache/memory latency bound). |
|
|
|
|
|
|
#38 | |
|
Quo vadis?
Join Date: Oct 2007
Location: Texas, USA
Posts: 1,338
|
Quote:
|
|
|
|
|
|
|
#39 | |
|
French frog
Join Date: Jun 2005
Location: France
Posts: 4,172
|
Quote:
At the same time, the units are 4 wide (4 FP32 elements), so they are also 128 bits wide. The SIMD units in POWER7 are 4 wide. The nice thing is that we could have a VMX128 v2 which would be 8 wide, so 128 256-bit registers. In the case of an OoO processor they may not need that many visible registers; AltiVec usually has 32 (4-wide / 128-bit registers). Anyway, for now IBM has no SIMD that matches the width of Intel AVX (8 wide in FP only, if memory serves right).
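To put the widths being discussed side by side, here is a small sketch that derives FP32 lane counts and visible register file sizes from the register counts and widths mentioned in this thread. The "VMX128 v2" entry is the poster's hypothetical, and the helper names are made up for illustration:

```python
def lanes(register_bits: int, element_bits: int = 32) -> int:
    """Number of elements (e.g. FP32 values) that fit in one SIMD register."""
    return register_bits // element_bits


def register_file_bytes(num_registers: int, register_bits: int) -> int:
    """Size of the architecturally visible register file in bytes."""
    return num_registers * register_bits // 8


# (register count, register width in bits) for each configuration.
configs = {
    "AltiVec (32 x 128-bit)": (32, 128),
    "VMX128 (128 x 128-bit)": (128, 128),
    "hypothetical VMX128 v2 (128 x 256-bit)": (128, 256),
}
for name, (count, bits) in configs.items():
    print(f"{name}: {lanes(bits)} FP32 lanes, {register_file_bytes(count, bits)} bytes")
```

The hypothetical 8-wide variant would double the visible register state relative to VMX128, which is part of why an out-of-order design might prefer fewer visible registers backed by rename registers.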
__________________
What's trying to be a bunch of presentations PS360 youtube channel Sebbbi about virtual texturing Tuned EADGCF and liking it :) |
|
|
|
|
|
|
#40 |
|
Artist formerly known as Acert93
Join Date: Dec 2004
Location: Seattle
Posts: 7,704
|
Xenon v.2 may be a good excuse for IBM to make a 256bit SIMD unit. And they could later make it part of the PPC spec, just like Altivec
__________________
"In games I don't like, there is no such thing as "tradeoffs," only "downgrades" or "lazy devs" or "bugs" or "design failures." Neither do tradeoffs exist in games I'm a rabid fan of, and just shut up if you're going to point them out." -- fearsomepirate Last edited by Acert93; 15-Apr-2012 at 21:02. Reason: Grammar. |
|
|
|
|
|
#41 | |
|
French frog
Join Date: Jun 2005
Location: France
Posts: 4,172
|
Quote:
I won't ask Carmack for his POV, but maybe some members like Sebbbi who deal with CPUs too could give us their take on such a design.
__________________
What's trying to be a bunch of presentations PS360 youtube channel Sebbbi about virtual texturing Tuned EADGCF and liking it :) Last edited by liolio; 15-Apr-2012 at 17:09. |
|
|
|
|
|
|
#42 |
|
Member
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 634
|
Is it the right time for a big.LITTLE-like design?
Maybe 2 fat POWER7-class cores with a lot of A2-class ones, working as a traditional multicore rather than like Cell. |
|
|
|
|
|
#43 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,141
|
big.LITTLE starts to matter when you have a TDP measured in the single-digit watt range and will be running on battery power for days.
The idea is more restrictive since it is meant for significant power savings on low-performance and near-idle loads. With console TDPs still in the hundreds of watts, the general inefficiency of a power supply and voltage regulators specced for that order of magnitude would probably waste more power than a big.LITTLE chip would save.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#44 |
|
Member
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 634
|
I was thinking of a more traditional implementation in which you can have all the cores at max frequency at the same time.
You can't have a 16-core Power7, while a 16-core A2 can be good for parallel tasks but bad for single-threaded ones, so I was thinking about a heterogeneous design in which you have some powerful cores to run demanding tasks and a lot of decent cores to run the rest. Considering that all IBM designs at the moment share the same instruction set, it's not unrealistic, but would something like this really be useful for a developer? |
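The trade-off in this post can be sketched with a toy model. Everything here is an assumption made for illustration: the 3x speed ratio between a "big" and a "small" core, the 30/70 serial/parallel split, and the idealised scheduling. None of it is measured data for POWER7 or A2:

```python
def run_time(serial_work: float, parallel_work: float, core_speeds: list[float]) -> float:
    """Time to finish a workload on a set of cores.

    The serial part runs on the fastest core only; the parallel part is
    spread over all cores in proportion to their speed (an idealised model).
    """
    return serial_work / max(core_speeds) + parallel_work / sum(core_speeds)


# Made-up numbers: a "big" core is assumed to be 3x as fast as a "small" one,
# and 30% of the work is assumed to be serial.
BIG, SMALL = 3.0, 1.0
serial, parallel = 30.0, 70.0

many_small = run_time(serial, parallel, [SMALL] * 16)
mixed = run_time(serial, parallel, [BIG] * 2 + [SMALL] * 10)

print(f"16 small cores:   {many_small:.3f}")
print(f"2 big + 10 small: {mixed:.3f}")
```

Both configurations have the same aggregate throughput in this model (16 units), so the whole difference comes from the serial part running on a faster core. Whether real game code splits this cleanly is exactly the open question.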
|
|
|
|
|
#45 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,141
|
big.LITTLE is concerned with migrating all work off of the big cores to power-optimized low-power cores when performance is not needed.
The more general case, with different cores that run different loads based on their strengths, is some form of asymmetric or heterogeneous multiprocessing.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#46 |
|
Senior Member
Join Date: Feb 2002
Posts: 2,636
|
I'm entirely with Gubbi here. While 47x might seem like a very nice embedded choice, I'm not convinced it would scale up that well, not if the 'should sorta meet the former gen's throughput' part of the order was heavily weighted (apparently speaking primarily of Xenon here rather than any geometry-culling-busy SPU flocks). Basically, the way I see IBM's options this gen is:
(1) Carry on with some Xenon/PPE iteration, trying to strike some sort of a balanced design, or
(2) Go P7-based.
Viewed from my armchair here, P7 seems to offer a lot for trimming.
A completely different company. Freescale are the former Motorola Semiconductor (of the AIM alliance fame). |
|
|
|
|
|
#47 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,141
|
We will have to wait and see.
Taking apart a multibillion-transistor chip to make a derivative isn't free either. The floating point resources would likely need to be redesigned, as well as the memory hierarchy and cache coherence protocol. The front end and scheduling logic would likely face cuts, and this is all assuming IBM even wants its best tech licensed to Microsoft to produce on its own. It's not a simple task to cut chunks out of a complex core whose components are balanced and verified for that design point. I wouldn't be certain it would be any easier to cut a POWER7 down than it would be to speed an A2-type design up. I'm not saying it's impossible, but it might not be desirable.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#48 |
|
Member
Join Date: Dec 2011
Posts: 680
|
I guess it all comes down to perceived return on investment. I'm going to guess the trade-off analysis won't show that a highly customized CPU with high floating-point throughput that is easy to program for necessarily translates to more market share in any scenario. No matter how hard the PS4 is to program for, it's still going to get multi-platform games, and those games aren't going to look a lot more fantastic on the Xbox just because the hardware is vastly superior, because optimization takes time. It will also depend on whether Microsoft gets to own the design (360 GPU) or has to buy it (original Xbox GPU).
I would expect a lightly modified Power7 core as a trade between optimization and development cost. |
|
|
|
|
|
#49 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,577
|
Quote:
Developers do the bulk of their development on workstations: bog-standard PC hardware. The fewer pathological cases and gotchas developers experience when moving to production hardware, the better the end result will be. Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#50 |
|
Senior Member
|
Wasn't the huge difference between CPU and GPU power in each console a much bigger contribution than the memory model or the complexity of Cell? I would imagine it's far harder to offload some GPU stuff to Cell than to balance memory usage between pools.
|
|
|
|
|