CELL V2.0 (out-of-order vs in-order PPE)

ioe vs. oooe PPE cores for CELL V2.0

  • Unleash the hounds Smithers: go for ioe and clock speed.

    Votes: 9 (25.0%)
  • Appease pesky multiplatform developers and implement oooe.

    Votes: 27 (75.0%)

  • Total voters: 36

Brimstone

B3D Shockwave Rider
Veteran
So...

Have oooe PPE cores, which are more power hungry, or go for clock speed (a 6.4 GHz Cell V2.0) with ioe PPE cores?
 
If you could marry an i3 level dual core/4 thread OoOE design with 32-64 SPEs, I think that'd be a pretty good start to the next gen.
 
If you could marry an i3 level dual core/4 thread OoOE design with 32-64 SPEs, I think that'd be a pretty good start to the next gen.

This is the sort of thing I've been suggesting all over this board! Well, my suggestion was something like an i3-2100 married to a GTX460-level GPU with DX11.1/12 capabilities. I think SPEs would be rather redundant on a next-gen console if it has a Fermi or GCN type GPU.
 
I think SPEs would be rather redundant on a next-gen console if it has a Fermi or GCN type GPU.

Can't Sony go with 128 or 256 SPEs paired with a handful of something like the PowerVR CLX2 that's in the Dreamcast and Naomi 2? Let the SPEs do all the shading work; then the SPEs won't be redundant.
 
The current PPU wastes a lot of cycles because of long pipelines and stalls.
I believe a reasonably clocked ARM core (1.5-2 GHz) could outrun it with a fraction of the power consumption (still, VMX >> NEON).

I think SPEs would be rather redundant on a next-gen console if it has a Fermi or GCN type GPU.
They'll be as useful as they are today. GPU calculations have high latency. In a console they could be done much faster, but still not as fast as SPU jobs. A lot of code is inefficient on GPUs but good for SPUs.
 
Can't Sony go with 128 or 256 SPEs paired with a handful of something like the PowerVR CLX2 that's in the Dreamcast and Naomi 2? Let the SPEs do all the shading work; then the SPEs won't be redundant.

Well, aside from following the path of Larrabee, consider that Cell on 45nm is 115mm^2...

Shaders do need texture units and the sort, too. :p
 
The current PPU wastes a lot of cycles because of long pipelines and stalls.
I believe a reasonably clocked ARM core (1.5-2 GHz) could outrun it with a fraction of the power consumption (still, VMX >> NEON).
Only recent Intel and AMD CPUs can read data from the CPU store queue (and they require certain alignment conditions to be met), but I don't know if ARM CPUs can do the same yet. So we could be getting LHS stalls on ARM as well if data is moved between float <-> vector <-> int registers (or if just-written data is accessed again). Of course, the out-of-order execution would fill some of the stall cycles with the next instructions, but ARM doesn't have SMT, so all the instructions must come from the same thread.
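
To make the LHS pattern concrete, here is a minimal C++ sketch of the kind of memory round-trip that triggers it (the struct and function are hypothetical, purely for illustration):

```cpp
#include <cstdint>

struct Vec4 { float x, y, z, w; };  // stand-in for a VMX/NEON register

// Moving a vector lane into an integer register typically goes through
// memory on these architectures: store the vector, then load a lane back.
int32_t vector_lane_to_int(const Vec4& v)
{
    float tmp[4];
    tmp[0] = v.x; tmp[1] = v.y; tmp[2] = v.z; tmp[3] = v.w;  // stores

    // This load hits the still-in-flight stores above. Without
    // store-to-load forwarding the pipeline stalls (the LHS case)
    // until the stored data actually reaches the cache.
    return static_cast<int32_t>(tmp[2]);
}
```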

I have only programmed in-order ARM CPUs and have no first-hand experience with their out-of-order designs, but I doubt their second-generation out-of-order CPU would yet match the IBM, Intel or AMD designs that have been refined for 15-20 years. And I doubt they are even trying to match the single-threaded IPC of those monsters, since low power consumption is one of their main goals. An ARM in-order design would likely be much better than the current in-order console CPUs, but I would expect some stall cases to remain (especially when moving data between vector registers <-> general purpose registers). And a gaming console must have powerful vector instructions, and those will be used a lot.
I think SPEs would be rather redundant on a next-gen console if it has a Fermi or GCN type GPU.
They'll be as useful as they are today. GPU calculations have high latency. In a console they could be done much faster, but still not as fast as SPU jobs. A lot of code is inefficient on GPUs but good for SPUs.
Agreed completely. CPUs execute more and more threads and get wider and wider vector units all the time (AVX is 256 bits, and Intel has also stated they will expand it to 1024 bits). At the same time, GPUs become more and more programmable (better branching, new synchronization primitives, shared memory, etc). It will be harder and harder to fit something in between them (Cell/SPEs have fewer use cases left where they perform the best).
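
For reference, this is roughly what the 256-bit AVX width mentioned above looks like at the intrinsics level (a minimal sketch; the function name is just for illustration, compile with -mavx):

```cpp
#include <immintrin.h>

// One AVX instruction operates on eight packed floats at once.
void add8(const float* a, const float* b, float* out)
{
    __m256 va = _mm256_loadu_ps(a);     // load 8 floats from a
    __m256 vb = _mm256_loadu_ps(b);     // load 8 floats from b
    __m256 vr = _mm256_add_ps(va, vb);  // 8 additions in a single op
    _mm256_storeu_ps(out, vr);          // store 8 results
}
```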

I don't believe we would even need a Fermi or GCN, as the current VLIW4 Radeons are performing very well in DirectCompute. Yes, they are a bit slower in general purpose scalar heavy code than Fermis, but offer the best performance (and performance per watt) for highly optimized vectorized DirectCompute code. Pretty much any recent GPU could be used as general purpose parallel processor, so we can expect next gen consoles to have one.
People who are talking about >50 SPE Cells: what kind of interconnect system would they use?
That's what I was wondering as well... to feed 256 SPEs you would need a lot of main memory bandwidth, and a radically faster bus between the 256 local stores <-> main memory.
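
A back-of-envelope sketch of that scaling, assuming the launch Cell's figures (8 SPEs fed from ~25.6 GB/s of XDR) and naively keeping the same main memory bandwidth budget per SPE:

```cpp
// Assumption: per-SPE bandwidth demand stays at the launch Cell's level.
constexpr double kCellMainMemGBs = 25.6;  // original Cell XDR bandwidth
constexpr int    kCellSpeCount   = 8;

constexpr double mainMemNeededGBs(int speCount)
{
    return kCellMainMemGBs * speCount / kCellSpeCount;
}

static_assert(mainMemNeededGBs(256) > 800.0, "256 SPEs: ~819 GB/s");
```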
 
Stuffing in a gazillion SPEs is not really a practical idea. Managing an explicitly parallel model on so many compute units would be way over the top for a gaming/media console. An efficient interconnect is another headache to consider. I think, if Sony is to stick with Cell, the next iteration should boast a more advanced ISA, wider vectors (and maybe a scalar component) and a more efficient memory model -- the SPE count should really be a minor consideration. If Sony manages a good and balanced Cell design, then it really shouldn't need an exotic GPGPU monster for graphics, just a proper GPU that will be good for what it is: graphics.
 
Sebbbi, maybe this could help with regard to CPU/GPU communication in ARM-based SoCs.

Regarding in-order designs, it would be interesting to compare the per-cycle performance of the next Cortex-A7 against that of Xenon, for example. With an integer pipeline roughly a third the length of Xenon's and much faster L1 and L2 access, I expect the A7 to perform significantly better :)
Still, it's impossible for now, as there is no A7 on the market :(
 
With an integer pipeline roughly a third the length of Xenon's and much faster L1 and L2 access, I expect the A7 to perform significantly better :)
Still, it's impossible for now, as there is no A7 on the market :(
The Cortex-A7 has a short 8-stage pipeline, but it's an in-order CPU. ARM is promising 20% improved performance over the Cortex-A8 and much lower power consumption, but no scaling beyond 1.2 GHz. Its bigger brother, the Cortex-A15, sounds like a much more capable CPU for gaming: it scales to 2.5 GHz and has out-of-order execution and beefier NEON units.

I don't know how much the "way faster L1" would help. Compilers reorder instructions so that L1 latency is mostly hidden. Faster L2 would of course help, as would the very short 8-stage pipeline. But running at 1.2 GHz it would be no contender against a 3.2 GHz CPU (both being in-order CPUs). Most instructions take only one cycle to execute, and a longer pipeline doesn't reduce peak performance. A longer pipeline is of course harder to keep perfectly utilized, and it makes stalls longer (if we just compare cycle counts; at 3.2 GHz the cycles are of course shorter as well). If you want to run the CPU at higher frequencies, you need longer pipelines (since you have less time to do processing in each pipeline stage).
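
A minimal sketch of the latency hiding described above, written out by hand (a compiler's scheduler does this automatically; the function is purely illustrative):

```cpp
// Independent work is placed between a load and its first use, so the
// L1 hit latency overlaps with useful instructions instead of stalling
// an in-order pipeline.
float blend(const float* p, float s, float t)
{
    float a = p[0];        // load issues; result not needed yet
    float b = p[1];        // second, independent load
    float u = s * t;       // independent math fills the load latency
    float v = s + t;       // more independent work
    return a * u + b * v;  // by now the loaded values are ready
}
```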
 
I doubt there will be a lot of SPEs. IBM has POWER8 set for 2013, which seems possible to harvest for Cell PPE cores.

With the oooe strategy:

2 POWER8 cores with 4 threads per core
16 SPEs

With ioe... and cranking the clock speed:

4 POWER6 cores
32 SPEs
 
So...

Have oooe PPE cores, which are more power hungry, or go for clock speed (a 6.4 GHz Cell V2.0) with ioe PPE cores?
As I understand it, the Cell, as a whole, acts as an OoOE CPU. However, it has the power efficiency of an in-order CPU. With that in mind, why not just crank up the clock?

It's kind of hard to imagine the strength of 16 or more SPUs at 6.4 GHz. I mean, the Cell is still faster than almost all processors made today when used efficiently for most gaming-related tasks. 16 SPUs at that clock speed should stay well ahead of anything on the CPU end until the next-next gen consoles roll out.
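
For a rough sense of scale, peak single-precision throughput works out as below (using the usual 8 flops/cycle/SPE figure from the 4-wide FMA; peak numbers only, not sustained):

```cpp
constexpr double kFlopsPerCycle = 8.0;  // 4-wide single-precision FMA per SPE

constexpr double peakGflops(int spes, double ghz)
{
    return spes * kFlopsPerCycle * ghz;
}

static_assert(peakGflops(8, 3.2)  > 204.0, "launch Cell SPEs: ~204.8 GFLOPS");
static_assert(peakGflops(16, 6.4) > 819.0, "16 SPEs @ 6.4 GHz: ~819.2 GFLOPS");
```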

Of course, Vitaly mentioned that the PPU is wasting cycles due to long pipelines and stalls. That should be improved first, before raising the clock.
 
I have read rumours of Sony adopting the IBM BlueGene/Q architecture, with modifications, for the PS4. Could the FPU of the BlueGene/Q cores be modified to run SPU code? How much silicon would it need?
 
Of course, Vitaly mentioned that the PPU is wasting cycles due to long pipelines and stalls. That should be improved first, before raising the clock.
The reason for long pipelines is to enable higher clocks. If you shorten the pipeline, the chip doesn't clock as high any more. It's a trade-off. Of course, with newer process technologies you can clock chips higher. If you aim at the same 3.2 GHz, you could make the pipelines shorter. But if you aim at 4-5 GHz (like the recent IBM speed demons), you couldn't do that.
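
A toy calculation of that trade-off (the total logic depth is a made-up number, and latch overhead is ignored):

```cpp
constexpr double kLogicDepthNs = 5.0;  // hypothetical total logic delay

// Splitting the same logic across more stages shortens each stage,
// and the clock period only has to cover one stage.
constexpr double maxClockGhz(int stages)
{
    return stages / kLogicDepthNs;  // 1 / (kLogicDepthNs / stages)
}

static_assert(maxClockGhz(16) > maxClockGhz(8), "deeper pipeline, higher clock");
```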
 
As I understand it, the Cell, as a whole, acts as an OoOE CPU. However, it has the power efficiency of an in-order CPU.
I don't see how anyone can make the claim that Cell acts like an out-of-order CPU. It's different enough that it doesn't act like other CPUs.
 
As I understand it, the Cell, as a whole, acts as an OoOE CPU.
I cannot think of a way that this could be justified.
It's a bunch of in-order cores. No instruction stream exists on more than one core, so no instructions are reordered.


I'm not certain Sony or IBM are committed to using the SPE in the future. Development on Cell has been dead for quite some time now. Maybe there will be an SPE-like element to the design, but IBM's position is more that it may use a few elements in a later core, not revive Cell.

The SPE has proven useful in keeping the PS3's GPU from strangling the design, but that's only a strong argument for using it in the PS4 if Sony assumes it's going to have a gimpy GPU again.
That's not a healthy mindset to take.
Backwards compatibility might be an argument, but even with all those shrinks 8 SPEs would still be a decent amount of die area just for that limited marketing checkbox.
 
The SPE has proven useful in keeping the PS3's GPU from strangling the design, but that's only a strong argument for using it in the PS4 if Sony assumes it's going to have a gimpy GPU again.
That's not a healthy mindset to take.
Backwards compatibility might be an argument, but even with all those shrinks 8 SPEs would still be a decent amount of die area just for that limited marketing checkbox.

Aren't there some games (I'm thinking primarily of later revisions of Super Stardust HD) that actually use Cell to do an unusually high number of particles / amount of geometry?

That seemed to be the big selling point of Cell in Sony's presentations prior to the realization of how crappy RSX really was.

Are game programmers doing game world simulation on GPU these days to any extent?

Is it easier or harder to do that than it is to do it on SPU with Cell?
 
I agree with 3dilettante's view in principle. Sony should keep their options open and re-evaluate. Perhaps PS3 games can be emulated on a different architecture, or on a beefed-up SPU architecture. Reusing tech is fine, but only if it gives them and the users the biggest bang for the buck.

Cell was an attempt, 5-7 years ago, to leapfrog hardware limitations: not purely for RSX, but the memory wall, the power wall, and such in general. Today, they will have new possibilities.
 