CELL V2.0 (out-of-order vs in-order PPE)

ioe vs. oooe PPE cores for CELL V2.0

  • Unleash the hounds Smithers: go for ioe and clock speed.

    Votes: 9 25.0%
  • Appease pesky multiplatform developers and implement oooe.

    Votes: 27 75.0%

  • Total voters
    36
I am not aware of any significant simulation being done on GPUs on desktops, nor the older hardware on consoles. As far as simulation goes, at present I'd give the edge to the SPE over a GPU.
The PPE as it exists is not a high point for Cell and would be replaced. The SPE is the primary driver of peak performance, but a huge chunk of that is often devoted to compensating for the GPU.

A future design could probably get decent peak performance with a design that differs from the SPE, and if there is better planning on the GPU side, the graphics portion won't need a massive amount of hand holding.
 
With their experience on Cell, I think they should have already learned a thing or two about new approaches in software (rendering, security) as well as overall system design. It's time to consolidate and innovate yet again.
 
Only recent Intel and AMD CPUs can forward data from the store queue to a subsequent load (and certain alignment conditions must be met), but I don't know whether ARM CPUs can do the same yet.
Seems this needs to be thoroughly tested.

So we could be getting LHS stalls on ARM as well if data is moved between float<->vector<->int registers (or if just-written data is accessed again).
Highly unlikely.
1) ARM has direct path between register files
2) A9+ is OOOE

And a gaming console must have powerful vector instructions, and those will be used a lot.
NEON in some ways is much more flexible than VMX.
But it has a 64-bit datapath and half the issue rate, IIRC.

That's what I was wondering as well... to feed 256 SPEs you would need a lot of main memory bandwidth.
To keep the same BW/SPU ratio, it would be 25.6 * (256/8) = 819.2GB/s :cool:
So 1TB/s of memory bandwidth looks pretty reasonable.
 
I don't see how anyone can make the claim that Cell acts like an out-of-order CPU. It's different enough that it doesn't act like other CPUs.

I cannot think of a way that this could be justified.
It's a bunch of in-order cores. No instruction stream exists on more than one core, so no instructions are reordered.


I'm not certain Sony or IBM are committed to using the SPE in the future. Development on Cell has been dead for quite some time now. Maybe there will be an SPE-like element to the design, but IBM's position is more that it may use a few elements in a later core, not revive Cell.

The SPE has proven useful in keeping the PS3's GPU from strangling the design, but that's only a strong argument for using it in the PS4 if Sony assumes it's going to have a gimpy GPU again.
That's not a healthy mindset to take.
Backwards compatibility might be an argument, but even with all those shrinks 8 SPEs would still be a decent amount of die area just for that limited marketing checkbox.

Isn't the PPU designed to handle scheduling for the SPEs? It can hand any code to any SPE to be processed during its free cycles, right? Also, the MFC is not in-order. That means, in the end, it doesn't have to be in order, right? Isn't that OoOE-like?

http://cellperformance.beyond3d.com/articles/public/concurrency_rabit_hole.pdf
(page 191)

Edit: BTW, only the vertex processing side of the RSX is less capable. According to Naughty Dog's slides, Uncharted 2 performed vertex processing on 5 SPUs.
 
Isn't the PPU designed to handle scheduling to SPEs?
Aside from some initialization, the PPU handles scheduling for the SPEs as much as the software task system is coded to do so. The PPE can coordinate SPEs, but the software can have them working independently.
This has helped in various situations where SPEs inevitably found themselves idling while the PPE choked on something more robust processors breeze through.

It can hand any code to any SPE to be processed during its free cycles, right? Also, the MFC is not in-order. That means, in the end, it doesn't have to be in order, right? Isn't that OoOE-like?
OoOE is related to whether the hardware can execute instructions within a single thread based on when their data is (usually) ready instead of the order that they are read from memory.
What multiple threads do in relation to each other doesn't matter because there is no established relationship between instructions in other threads outside of synchronization instructions, which the MFC helps with.
 
I am not aware of any significant simulation being done on GPUs on desktops, nor the older hardware on consoles. As far as simulation goes, at present I'd give the edge to the SPE over a GPU.
The PPE as it exists is not a high point for Cell and would be replaced. The SPE is the primary driver of peak performance, but a huge chunk of that is often devoted to compensating for the GPU.

A future design could probably get decent peak performance with a design that differs from the SPE, and if there is better planning on the GPU side, the graphics portion won't need a massive amount of hand holding.
I mainly agree. On the SPUs-helping-the-GPU side, I think of the SPUs as the unified shaders of a GPU: you can switch them all toward one task for part of the frame and something else for another part. It's just that the SPUs can handle many different tasks.

Will there ever be a time where we will not want more vertex processing or pixel processing on a GPU? That's where the Cell comes in. It can add new life to an always/forever aging GPU, among other things.
 
Mike Houston discussed that the main game loop could actually run on an SPE. I think that SPEs can even be fed their instructions from other SPEs, but I'm not sure.
 
ARM has direct path between register files
I stand corrected. I browsed the intrinsic list, and things like float32_t vgetq_lane_f32 (float32x4_t, const int) (generates single vmov.32) sound very good indeed. I just wonder why x86 vector instruction sets do not have similar instructions yet. Maybe the history of separate FPU is still affecting things on x86 side. AVX2 gather load instructions are going to help a bit, but it's not going to solve all the issues.
 
Mike Houston discussed that the main game loop could actually run on an SPE. I think that SPEs can even be fed their instructions from other SPEs, but I'm not sure.

Data (or code) can be sent between the local stores of different SPEs, if that was what you meant?
 
Data (or code) can be sent between the local stores of different SPEs, if that was what you meant?

Yes, that's what I meant, particularly the code part. Because the PPE could sometimes be bottlenecked, Mike thought they could improve their games further by using an SPE to run the main game loop / manage SPE jobs. That is, if I remember correctly.
 
people are missing the obvious thing here. How bout 3 Cell CPUs cennected together? Now that is supreme power

cennected = connected via cement ? :runaway:

I think besides interconnects and memory layout, they would want to research into ways to make the compute elements more integrated with the GPU since it's been used more often there.
 
That costs 4 x PS3 though. A nextgen box should be able to hit a better performance/price ratio. The SPU itself can also be improved.
 
I think at this point they might just end up with an ARM CPU. I don't think backwards compatibility is even being considered anymore.
 
I'm just reading this for a new PC graphics card: "Power consumption wise we estimate a needed 210 Watt to feed the GPU when peaking, add some overhead for overclocking and the rest of your PC and we feel that a 550 Watt power supply should be your bare minimum. "

Yeah. So, hoping that the next-gen console will be something really efficient. Also, I really hope that whatever next gen brings, it will help make worlds more alive, more physics, more things breaking apart, lots of birds, fluids, animations. One thing that struck me about this gen is that Cell seemed to be able to do interesting things in real-time vertex animation that were relatively hard to do otherwise. I wouldn't want to give up something like Cell that easily to be honest. I'd want a good balance and range of stuff and above all, I think the most interesting part of next-gen will be again how the various components work together. Follow the data!
 
With heterogeneous designs like Cell, we could easily see 12-16 high-performance cores while preserving high performance for single-threaded applications. I would hypothesize that a homogeneous design risks having lower single-thread performance, or sacrificing it to increase core count for parallelism. It would also seem that disabling a small SPU or two out of 13-17 should be less costly than disabling 1-2 large cores out of 4-7.

I'm curious how that would compare in physics simulations with the sort of mid-range GPU that's likely for a next-gen console. I would imagine a next-gen Cell with 18 cores, freed up by a better GPU, would be pretty tough to beat as far as console CPUs go.
 
If you could marry an i3-level dual-core/4-thread OoOE design with 32-64 SPEs, I think that'd be a pretty good start to the next gen.
Can't Sony go with 128 or 256 SPEs paired with a handful of something like the PowerVR CLX2 that's in the Dreamcast and Naomi 2? Let the SPEs do all the shading work; then the SPEs won't be redundant.

You gents are thinking small. Let's try 512 SPEs with 4 core Intel Core-class style PPE. TDP and area be damned!

people are missing the obvious thing here. How bout 3 Cell CPUs cennected together? Now that is supreme power

Ahhh can we clock these up to 7GHz on air for fun? What is that gonna take, 2.6volts?

On a serious note, Cell's 45nm shrink wasn't done well (a hack job), but even a clean shrink wouldn't scale a ton. On 28nm, if Sony were aiming for the same footprint as Cell in 2006 (unlikely due to cost, as Cell was pretty big even back then), they might be able to get 32 SPEs in there.

From reading Wiki: at 90nm, DD1 was 221mm^2 and DD2 was 235mm^2. The 65nm chip was 120mm^2, and total PS3 power consumption dropped from 200W to 135W (although there may have been other changes). At 45nm, IBM said it would get a 40% reduction in power consumption and a 34% area reduction.

But these numbers look wrong according to the B3D article: (What, B3D has a homepage?) "Less exciting than the thermal and frequency gains has been the pace of die size reduction, which after the second full node shrink has only now come down to an area of ~115mm2, a size that would not have seemed out of place had it been reached on the previous 65nm node (the original 90nm chip had a die area of 235mm2, and the 65nm chip an area of ~174mm2). Plaguing the rate of reduction are the analog circuits and I/O logic associated with the Rambus interfaces, which place a higher cap on the dimensional reduction of the die than could otherwise be achieved."

So it sounds like at 45nm there is room to spare (a hand-tweaked version would hit under 100mm^2). Anyway, that means a new Cell design with a better PPE would not go much above 32 SPEs without pushing beyond Cell's footprint on 90nm. Maybe if the PPEs weren't a lot bigger, didn't see a big bump in cache, and the SPEs didn't get a lot of new functionality, they could go a little above 32, but there are power issues to consider. And there may be desirable upgrades, like the one below from Wiki, that could help address the heat and usability issues, making it seem unlikely we'd see 64- or 256-SPE (!) designs.

It would be feasible to double the local store to 512 KiB per SPU leaving the total die area devoted to the SPU processors roughly unchanged. In this scenario, the SPU area devoted to the local store would increase to 60% while other areas shrink by half. Going this route would reduce heat, and increase performance on memory intensive workloads, but without yielding IBM much if any reduction in cost of manufacture.

GPGPUs are eating the SPE lunch anyhow. Cell v2.0 is Fermi/GCN ;)
 
Plaguing the rate of reduction are the analog circuits and I/O logic associated with the Rambus interfaces, which place a higher cap on the dimensional reduction of the die than could otherwise be achieved.
I hope this doesn't apply to Rambus's terabyte-bandwidth XDR tech.
 