Predict: The Next Generation Console Tech

Status
Not open for further replies.
So what do we need the extra CPU power for? Is it feasible for Sony to use Cell, as is, in PS4 and, with the savings made, have a much better GPU?

I would think that the main problem with this is diminishing returns: if PS4 uses a top-of-the-line GPU anyway, doubling the amount spent on the GPU won't give double the power, so in this regard spending that budget on the CPU could give better gains. Hope that made sense :LOL:
 
If the GPU is truly going to consume the CPU, then what's the point in a new CPU architecture at this point? In Microsoft's case especially, is there any reason why they would need to do more than add, say, another core and out-of-order execution at the 28nm GlobalFoundries node, with that taking up 50-75mm^2 of die space, and simply devote the rest of the 200-250mm^2 die to a new and advanced GPU architecture, beef up the eDRAM to 30MB and chuck in a GDDR5 memory bus with 8 * 1024Mbit/2048Mbit GDDR5 modules on a 128-bit bus?

Is there any reason why more than this minimum is required to extract good/cheap performance for the next generation? 80GB/s of bandwidth should be more than enough with eDRAM + memory bandwidth, and slightly more than Juniper performance should give you enough juice to run most games at 60 FPS @ 1920x1080 with 2xMSAA in an architecture that shouldn't use more than 100-120W all up.
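
For what it's worth, here is a rough sketch of where an 80GB/s figure could come from. The 128-bit bus width and the 5 Gbit/s per-pin data rate are assumptions for illustration (the post does not state a per-pin rate), and `bus_bandwidth_gb_s` is just a hypothetical helper name.

```python
def bus_bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s: data pins times per-pin rate, divided by 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8


# Assumed for illustration: 128 data pins at 5 Gbit/s each (not taken from the post).
gddr5_bandwidth = bus_bandwidth_gb_s(128, 5.0)
print(f"{gddr5_bandwidth:.1f} GB/s")
```

This is a peak number only; sustained bandwidth in a real system would be lower.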
 
Games tend to be computation heavy so I think games would benefit more from 4 SPEs than 1 Power7 core.

Power 7 has dual issue for VMX. Being OOO and having 6-way issue would bring it pretty close to your four SPEs in raw computing power. Running real programs, it would likely stomp all over them.

And you shouldn't really pitch one P7 vs. 4SPEs, you should pitch 8 P7s vs 32 SPEs.

Another benefit of the SPEs is that they don't depend on having an additional Level 3 cache to sustain performance, so that extra die space could be used for additional SPEs as well.

That's not a benefit.... At all!

It means that all data going into and out of the SPEs is from/to main memory, wasting precious bandwidth (and adding latency).

Cheers
 
Power 7 has dual issue for VMX. Being OOO and having 6-way issue would bring it pretty close to your four SPEs in raw computing power. Running real programs, it would likely stomp all over them.

And you shouldn't really pitch one P7 vs. 4SPEs, you should pitch 8 P7s vs 32 SPEs.

Why does the Wikipedia article on Power7, which quotes an IBM presentation, say that an 8 core Power7 chip gives 258.6 GFlops? The IBM presentation does not specify what that is exactly; it could be DP, it could be SP, but 260 GFlops is not much more than what Cell can theoretically deliver.

http://www.it.utah.edu/leadership/committees/IT_Managers/papers/IBMinEducation.ppt

It's on the last page: 32.3 GFlops per core for a Power7 core, which is basically the same as an SPE at 4GHz. 32 SPEs would give a theoretical TFlop.

What's the difference in power consumption?
 
Power 7 has dual issue for VMX. Being OOO and having 6-way issue would bring it pretty close to your four SPEs in raw computing power.
Integer performance?
The SPUs are dual issue, so well-written code should have an IPC close to 2.

Running real programs, it would likely stomp all over them.
But consoles run games so real programs are not really interesting.

And you shouldn't really pitch one P7 vs. 4SPEs, you should pitch 8 P7s vs 32 SPEs.
I am not sure I understand your point, could you elaborate?

That's not a benefit.... At all!

It means that all data going into and out of the SPEs is from/to main memory, wasting precious bandwidth (and adding latency).

Or you could look at it from the other side: asynchronous memory access lets you use the memory bandwidth close to its maximum without seeing a drop in performance. Do you really want to store all streamed data in the L3 cache anyway?
 
Why does the Wikipedia article on Power7, which quotes an IBM presentation, say that an 8 core Power7 chip gives 258.6 GFlops? The IBM presentation does not specify what that is exactly; it could be DP, it could be SP, but 260 GFlops is not much more than what Cell can theoretically deliver.

http://www.it.utah.edu/leadership/committees/IT_Managers/papers/IBMinEducation.ppt

It's on the last page: 32.3 GFlops per core for a Power7 core, which is basically the same as an SPE at 4GHz. 32 SPEs would give a theoretical TFlop.

What's the difference in power consumption?

It's double precision. Since each core does 4 FMADDs per cycle with just two FP issue ports, it is vectorized FP. Single precision would be double that, at least it would be in a consolized version.
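
To make that arithmetic concrete, here is a small sketch using only the figures quoted in this thread (258.6 GFLOPS for 8 cores, 4 FMADDs per cycle per core). The SPE line assumes 4 single precision FMADDs per cycle at the hypothetical 4GHz clock mentioned above; `peak_gflops` is an illustrative helper, not anything from IBM's material.

```python
def peak_gflops(fma_per_cycle: int, clock_ghz: float, cores: int = 1) -> float:
    """Theoretical peak: each fused multiply-add counts as 2 floating point ops."""
    return fma_per_cycle * 2 * clock_ghz * cores


# Clock implied by 258.6 GFLOPS over 8 cores at 4 DP FMADDs per cycle each.
implied_clock_ghz = 258.6 / 8 / (4 * 2)

# If single precision ran at twice the DP rate, as suggested for a console version.
power7_sp_chip = peak_gflops(8, implied_clock_ghz, cores=8)

# Assumed: 32 SPEs, 4 SP FMADDs per cycle each, at a hypothetical 4 GHz.
spe_32_chip = peak_gflops(4, 4.0, cores=32)

print(f"implied POWER7 clock: {implied_clock_ghz:.2f} GHz")
print(f"8-core POWER7, SP:    {power7_sp_chip:.1f} GFLOPS")
print(f"32 SPEs at 4 GHz, SP: {spe_32_chip:.1f} GFLOPS")
```

These are paper peaks only, which is rather the point of the next paragraph.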

Anyway, stop looking at megabollocks/second. There isn't all that much correlation between peak FP throughput and actual performance. A Core 2 Duo @ 2.4GHz has under 40 GFLOPS of peak SP throughput, but crushes XCPU in every single game that is cross platform.

Cheers
 
Integer performance?
I am not sure I understand your point, could you elaborate?

There's a world of difference between having a concurrency of 4 and 32.

Or you could look at it from the other side: asynchronous memory access lets you use the memory bandwidth close to its maximum without seeing a drop in performance.

But you would still be limited by your main memory bandwidth. At CELL's 256 GFLOPS and 20GB/s bandwidth you need to do, on average, 12 floating point ops per byte, or around 50 ops per 32bit sp value.
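
Spelling that ratio out, using the 256 GFLOPS and 20GB/s figures quoted above (the helper name `flops_per_byte` is only for illustration):

```python
def flops_per_byte(peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Floating point ops needed per byte of memory traffic to stay compute bound."""
    return peak_gflops / bandwidth_gb_s


intensity = flops_per_byte(256.0, 20.0)  # ops per byte moved to or from main memory
per_sp_value = intensity * 4             # a 32-bit single precision value is 4 bytes

print(f"{intensity:.1f} flops per byte, {per_sp_value:.1f} flops per 32-bit value")
```

Any kernel doing less work per value than that would, on these numbers, be limited by main memory rather than by the SPEs.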

Do you really want to store all streamed data in the L3 cache anyway?

No, I don't. I want a reasonable prefetcher to load the data for me, and I want to use stores with non-temporal hints to store past the caches.

Cheers
 
It's double precision. Since each core does 4 FMADDs per cycle with just two FP issue ports, it is vectorized FP. Single precision would be double that, at least it would be in a consolized version.

Anyway, stop looking at megabollocks/second. There isn't all that much correlation between peak FP throughput and actual performance. A Core 2 Duo @ 2.4GHz has under 40 GFLOPS of peak SP throughput, but crushes XCPU in every single game that is cross platform.

Cheers

Isn't that the difference though? Games don't need DP, so if an 8 core Power7 delivers half a TFlop, it's still hypothetically less than a 32 SPE Cell.

As for the Core 2 Duo vs the Xenon, I thought that there was a consensus that the PPE or Xenon core was a relatively poor processor? Cell isn't a 3 core PPE. Also, what work does a Core 2 Duo do in a PC game compared with, say, what Cell does in certain PS3 games? Games have fixed performance on the consoles because developers know what is going on with the hardware at any given moment. That cannot be said about a PC game.

What is the power consumption of an 8 core Power7?
 
A Core 2 Duo @ 2.4GHz has under 40 GFLOPS of peak SP throughput, but crushes XCPU in every single game that is cross platform.

Actually, in some heavily threaded games like GTA4 and Lost Planet (in the CPU intensive Caves section for example) the Core 2 Duo isn't all that much faster. In Saints Row 2 it's actually slower, iirc.

Not that this changes the essence of your point, but there are cases where for whatever reason the XCPU does okay compared to dual core processors from 2006.
 
There's a world of difference between having a concurrency of 4 and 32.
Yes and no. If you arrange your jobs in one queue, which is the preferred method on Cell, then increasing the number of SPEs does not add any additional complexity at all.
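
A minimal sketch of the single-queue idea, with Python threads standing in for SPEs. It does not model DMA, local store, or real SPE scheduling, and `run_jobs` is a made-up name; it only shows that the calling code stays the same when the worker count changes.

```python
import queue
import threading


def run_jobs(jobs, num_workers):
    """Drain one shared job queue with num_workers threads; results keep job order."""
    pending = queue.Queue()
    for indexed_job in enumerate(jobs):
        pending.put(indexed_job)
    results = [None] * len(jobs)

    def worker():
        # Each worker pulls the next available job until the queue is empty.
        while True:
            try:
                index, job = pending.get_nowait()
            except queue.Empty:
                return
            results[index] = job()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# The same job list runs unchanged on 4 or 32 workers.
jobs = [lambda n=n: n * n for n in range(100)]
results_4 = run_jobs(jobs, 4)
results_32 = run_jobs(jobs, 32)
print(results_4 == results_32)
```

Whether the extra workers actually help is a separate question that depends on how many independent jobs there are per frame.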

BTW, didn't the Power7 have 4 hardware threads/core?

But you would still be limited by your main memory bandwidth. At CELL's 256 GFLOPS and 20GB/s bandwidth you need to do, on average, 12 floating point ops per byte, or around 50 ops per 32bit sp value.

No, I won't. I want a reasonable prefetcher to load the data for me, and I want to use stores with non-temporal hints to store past the caches.
There are of course cases where Cell would benefit from an L3 (L2) cache, and I am not sure how many of the DMA transfers go through the L2 cache of Cell, but if you want to get elaborate with hints you can also start passing data between SPEs within the CPU; their local memory will after all be four times the density of Power7.

And I can't help believing that process switches are what really make that big L3 cache shine. Therefore I think Power7 is more likely to end up in servers than in a next gen console.
 
Actually, in some heavily threaded games like GTA4 and Lost Planet (in the CPU intensive Caves section for example) the Core 2 Duo isn't all that much faster. In Saints Row 2 it's actually slower, iirc.

Not that this changes the essence of your point, but there are cases where for whatever reason the XCPU does okay compared to dual core processors from 2006.

I think that has much more to do with the level of effort from the developers than the relative performance of the CPUs.

Lost Planet has great performance on a C2D anyway. Saints Row on the other hand must have been programmed by monkeys. I've never seen such an insanely bad porting effort in my life. The game would probably run better through an emulator!
 
Therefore I think Power7 is more likely to end up in servers than in a next gen console.

And what about a custom processor built starting from Power7 cores without the DP FP transistors?

Could something like a 4 core / 16 thread chip with 8MB of eDRAM and almost nothing else be powerful enough, easy to program for, small to produce, and low power enough to fit in a next-gen console?
 
Did Repi post his love for Nehalem here yet?

16 Threads on the CPU seems like a decent average target for 2012.
 
And what about a custom processor built starting from Power7 cores without the DP FP transistors?

Could something like a 4 core / 16 thread chip with 8MB of eDRAM and almost nothing else be powerful enough, easy to program for, small to produce, and low power enough to fit in a next-gen console?

That's for a next-gen console targeting a 2012 release, right?
 
And what about a custom processor built starting from Power7 cores without the DP FP transistors?

Could something like a 4 core / 16 thread chip with 8MB of eDRAM and almost nothing else be powerful enough, easy to program for, small to produce, and low power enough to fit in a next-gen console?

I don't think the question is whether you can make something that is good enough, because you obviously can. The question is more whether it is the best alternative considering the target application, which is a game console.

Questions like:
  • Is single thread performance that important?
  • Is integer performance more important than floating point performance?
  • Is the L3 cache that important if you have full control of your data streams and don't have to bother with frequent process switches?

I am really not pitching Cell and SPUs as the strongest contender; IBM probably has more alternatives up its sleeve. Anyway, I think it is quite telling that neither the 360 nor the PS3 ended up with OOO CPUs this generation, but the criteria may have changed in this round.

I also doubt we will see eDRAM in any console CPU unless it can be supported by multiple foundries, but I do hope it will happen.
 
Power7

All this talk of POWER7 and what is in the PS4 is completely ignoring power consumption and cost.

POWER7 is a 200W chip that is large and has huge memory and I/O busses.

Cell on the other hand has all the top places on the Green500 list. It's fast, efficient and these days about 35W in the PS3.

What they might be able to do is have a bunch of POWER7 cores, remove the L3 and add in a load of SPEs, pretty much what the PowerXCell 32iv was going to be.

However that may be far too big. I think a POWER7 core or 2 is a distinct possibility, they'll be fast and low power enough. However the SPEs are very efficient and very good at what they're designed for. There's also a lot of experience using them now and tools to use. I think they'll stick with them.

Changing to Larrabee in place of Cell would be suicidal. It's a completely different arch and it's not as if it's a normal x86; I suspect it's going to be just as difficult, if not more so, to program than Cell.
 
From what I've read, Fermi defines a significant step toward general purpose computation on GPU chips. Having both a Cell-like CPU and a Fermi-like GPU doesn't sound likely to me. What algorithms run well on the SPE but aren't going to run well on a GPU like Fermi? Wouldn't having a fair amount of OOO with a large cache be a better option? It would provide better support for algorithms that don't run well on the SPE and make multiplatform titles easier to port to the PS4. On the other hand, stream oriented computations could be run directly on the GPU cores. Another option would be making a Cell-based GPU, adding texture units and whatever dedicated hardware is still worth having in the next decade.
 
ok look at this :p

The PS3's Cell has only 7 active SPEs, and some are reserved for the OS.
SPE cores are very small, even on the current production process.

What about a reworked Power7+ with 4 traditional cores and 4 SPEs, put there for compatibility and so as not to waste all the matured know-how?
 