CELL V2.0 (out-of-order vs in-order PPE)

ioe vs. oooe PPE cores for CELL V2.0

  • Unleash the hounds Smithers: go for ioe and clock speed.

    Votes: 9 25.0%
  • Appease pesky multiplatform developers and implement oooe.

    Votes: 27 75.0%

  • Total voters
    36
Analog and pad limitations aren't unique to Rambus interfaces; they're something all chips face.
edit: or they will be, if they use most of their perimeter prior to a shrink
 
Why wouldn't it?

I would imagine they've tried to improve in this area.

BTW, looking at the jump from the EE to Cell we got something like a 30+x gain in GFLOPS, but it would seem the next jump could be just 2-5x in GFLOPS (2-4 main cores and a 10+ SPU pool). Would there be any viable changes to the architecture that could offer a similar 30+x gain on a next-gen Cell?

If any such changes are possible, they would put paper performance on par with the 6 TFLOPS Cell cluster that can do realtime raytracing of large, complex scenes.
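As a rough sanity check on those multipliers (a back-of-the-envelope calculation of mine, assuming the commonly cited single-precision peaks of roughly 6.2 GFLOPS for the Emotion Engine and roughly 204.8 GFLOPS for the PS3's Cell):

```latex
\frac{204.8\ \mathrm{GFLOPS}}{6.2\ \mathrm{GFLOPS}} \approx 33\times ,
\qquad
204.8\ \mathrm{GFLOPS} \times 30 \approx 6.1\ \mathrm{TFLOPS}
```

So another 30x jump would indeed land right around the 6 TFLOPS figure of that raytracing cluster.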
 
GPGPUs are eating the SPEs' lunch anyhow. Cell v2.0 is Fermi/GCN ;)

I dunno if you're being sarcastic, but I've been trying to make this argument a lot around here. Would Cell really be necessary with a fat Fermi/Kepler/GCN GPU right beside it? Or even with an older-design GPU like Cayman?
 
I dunno if you're being sarcastic, but I've been trying to make this argument a lot around here. Would Cell really be necessary with a fat Fermi/Kepler/GCN GPU right beside it? Or even with an older-design GPU like Cayman?
If it's as flexible as Cell, I would say no. Are they as flexible, though? Plus, SPEs draw very little power. The 7 SPEs in the PS3 draw only about 35W, total, at 90nm. What would it be for 32 of them at 28nm?
 
If it's as flexible as Cell, I would say no. Are they as flexible, though? Plus, SPEs draw very little power. The 7 SPEs in the PS3 draw only about 35W, total, at 90nm. What would it be for 32 of them at 28nm?

Flexible isn't a very useful word. Specifically, the GPGPUs aren't as fast in running code with a lowish threadcount, lots of jumps, and whose data fits well in the small local pool.

On the other hand, they are much better in loads where you want to directly access a larger pool of memory.
 
If it's as flexible as Cell, I would say no. Are they as flexible, though? Plus, SPEs draw very little power. The 7 SPEs in the PS3 draw only about 35W, total, at 90nm. What would it be for 32 of them at 28nm?

It would be around 7W at 28nm if they didn't do any improvements, or about 1W per SPE.
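Taking those figures at face value (a quick bit of arithmetic of mine, not from the post), the 32-SPE case asked about above works out to roughly:

```latex
\frac{35\ \mathrm{W}}{7\ \mathrm{SPEs}} = 5\ \mathrm{W/SPE}\ \text{at 90nm},
\qquad
32 \times 1\ \mathrm{W} \approx 32\ \mathrm{W}\ \text{at 28nm}
```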

I have a hard time believing you'd be able to find any significant idle time on the GPU to run this GPGPU code, let alone run it efficiently or autonomously. And if you do have significant idle time, maybe you'd be better off using a smaller GPU.
 
Flexible isn't a very useful word. Specifically, the GPGPUs aren't as fast in running code with a lowish threadcount, lots of jumps, and whose data fits well in the small local pool.

On the other hand, they are much better in loads where you want to directly access a larger pool of memory.

Yep, even if you could get a CU to be as flexible as an SPE, in the end it still runs at a roughly 4x slower clock.

With 32 SPEs, you can get a lot of parallelism going on a large data set.
 
Flexible isn't a very useful word. Specifically, the GPGPUs aren't as fast in running code with a lowish threadcount, lots of jumps, and whose data fits well in the small local pool.

On the other hand, they are much better in loads where you want to directly access a larger pool of memory.

Generally one would imagine physics (cloth/hair/particles/fluids/collision/etc.) would be the main resource hog, and the one that would provide the most striking improvement in visuals. How would they compare at that?
 
I was checking elsewhere with regard to the physics question and found this in another thread:
My opinion is that GPGPU has proven to be a huge red herring. HardOCP has been posting videos of tech demos from the Infernal Engine all week. The developers narrating are very emphatic on the point that it is far preferable to have your GPU processing graphics. Their physics engine, which is very impressive, relies on heavily threaded, CPU-based processing. The results are pretty cool, but they still point out that the PS3 version is the one they like to demo because it can simulate the most objects. Even better than their PC build with 8 threads on an i7 platform. -Brad Grenz
Considering a next-gen Cell would be paired with a unified GPU, it would be freed up to take on such tasks, and physics is likely the biggest hog and the one that can provide the most impressive visual bang. Hearing that the five-year-old Cell outcompetes a 4-core i7 is quite nice, and if true it is further evidence that heterogeneous cores are a must-have in the console space (and it makes one wonder about the performance difference between Cell and the Wii U's 3-core CPU, as surpassing the i7 core for core is doubtful).
Cell has 8 and 16 bit SIMD instructions for one. For another, GPUs use parallelism not merely to hide memory latency, they use it to hide pretty much all latency in the pipeline ... because they usually work on massively parallel problems and because memory latency is orders of magnitude higher than everything else they just don't bother trying to keep latencies low for anything else. Their actual instruction latency is hideous compared to Cell. That means you are forced to have many more threads than on Cell. -MfA
I wonder how this affects physics calculations on GPGPUs vs. Cell.
 
It seems that many persons around here have the false assumption that physics is just raw vector maths (number crunching). It's not. Physics engines use a lot of complex acceleration structures to speed up their work and to keep the memory access patterns manageable. Traversing these structures often includes a lot of branches and semi random memory accesses. There are some forms of physics simulations (for example particle systems) that are straightforward to simulate in parallel, but also systems that are much harder (for example complex rigid body systems with lots of constraints between the bodies).
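To make the "branches and semi random memory accesses" point concrete, here is a minimal sketch (my own illustration, not taken from any particular engine) of the kind of bounding-volume-tree query a rigid body broad phase runs constantly; every node visit is a data-dependent branch plus a pointer dereference, which is exactly the pattern wide GPU warps dislike:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical axis-aligned bounding box and BVH node layout. */
typedef struct { float min[3], max[3]; } AABB;

typedef struct Node {
    AABB         box;
    struct Node *left, *right;   /* NULL in leaf nodes          */
    int          body_index;     /* valid only in leaf nodes    */
} Node;

static bool overlaps(const AABB *a, const AABB *b)
{
    for (int i = 0; i < 3; ++i)
        if (a->max[i] < b->min[i] || b->max[i] < a->min[i])
            return false;
    return true;
}

/* Collect indices of bodies whose AABB overlaps 'query'.
 * Traversal order and depth depend entirely on the data, so the
 * branches are hard to predict and each visited node is a
 * potentially cache-missing pointer chase. */
static size_t query_tree(const Node *root, const AABB *query,
                         int *out, size_t max_out)
{
    const Node *stack[64];
    int top = 0;
    size_t count = 0;

    stack[top++] = root;
    while (top > 0) {
        const Node *cur = stack[--top];
        if (!cur || !overlaps(&cur->box, query))
            continue;                          /* data-dependent branch  */
        if (!cur->left && !cur->right) {       /* leaf: report the body  */
            if (count < max_out)
                out[count++] = cur->body_index;
        } else if (top + 2 <= 64) {            /* interior: descend      */
            stack[top++] = cur->left;
            stack[top++] = cur->right;
        }
    }
    return count;
}
```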
 
If any such changes are possible, they would put paper performance on par with the 6 TFLOPS Cell cluster that can do realtime raytracing of large, complex scenes.

I wonder if a newer processor like the SPE could be made wider, like 16-wide or something, with a larger local store to accommodate it. The SPE was designed with clock speed in mind, but we know that chasing clock speed is not the way to go, unless they made some sort of breakthrough in managing power.

Also, another old discussion: was including a local store instead of a cache in the SPE a mistake?
 
There's a whole other thread discussing Cell. Regarding the LS, IBM chose 256 kB as the best compromise between size and latency. Larger storage means more cycles per read. High clock is also more valuable in some processing cases where parallelism isn't good. Again, see the existing Cell thread for discussion of the specifics. Suffice it to say a GPU isn't a direct replacement for Cell or any other CPU.
 
The LS can easily be increased as it's already set up with an LS mem check on boot-up. As to latency, yes, it increases the size of your register table, but going to main mem is a huge hit in cycles, so you've got to look at your typical access patterns (which they should be well aware of by now). Sony dictated the Cell specs, not IBM, and size and power requirements were more important than a cycle or two.

Going wider to 256-bit SIMD is a possibility, as is adding an additional even pipe. Going 256-bit, though, would probably require both an added even pipe and additional LS. That makes for a pretty beefy SPE (47% larger), so you've got to ask whether you're not better off just adding more SPEs instead.

One last thing: if you add a lot of eDRAM on chip you could realize a pretty significant boost there too and make for a better dev environment.
 
I wonder if a newer processor like the SPE could be made wider, like 16-wide or something, with a larger local store to accommodate it. The SPE was designed with clock speed in mind, but we know that chasing clock speed is not the way to go, unless they made some sort of breakthrough in managing power.

Also, another old discussion: was including a local store instead of a cache in the SPE a mistake?
I was reading a thread here at Beyond3D titled something like "is there something Cell can still do better than modern CPUs/GPUs". The answer seemed to be that one of the problems with architectures such as Larrabee was the cache: cache structures introduce complexity and power problems as you scale core counts, whereas Cell's local store can scale much better, according to some people in that thread.

It was also suggested that the memory bandwidth allowed Cell cores to increase performance linearly in some tasks, in part because the cores did not have to share cache.

If those assertions are true, then while it may be harder to program, the design seems to provide more scalability, efficiency and performance.

The lack of coherence between the Local Stores is probably seen as a disadvantage but once you start to scale Cell it'll turn out to be a big advantage.

Once you start adding in piles of cores coherent caches will become a major source of latency and power consumption. -ADEX
That was one of the larger design decisions in creating Cell: there's a limit to the amount of cache you can use before you hit diminishing returns, whereas the local store is not only predictable, it's infinitely scalable. -Terarrim
If physics is not well suited to run on CELL then the CELL designers have failed, because physics was one of the applications they tried to address with the CELL design.

I believe they know better than you and me; in fact the CELL architecture seems well suited for physics calculations. -Nao
I also found this quote with regard to crowd AI somewhere:
Each individual chicken has its own behavior model interacting with other birds. The simulator was demonstrated to provide realtime (30fps) performance with several thousand chickens. In fact, when the number of chickens was increased to a total of 15,000 birds the Cell B.E. processor was still able to perform the simulation at interactive speed, but the graphics rendering was not able to keep pace, even on a state-of-the-art NVIDIA GPU, and started dropping 2 out of 3 frames, resulting in 10fps "sluggish" video output. -RapidMind chicken farm simulation
15,000 at 30fps; pretty impressive.

The memory wall: the processor frequency has now surpassed the speed of the DRAM and the current workaround of using multilevel caching leads to increased memory latency....

The slow main memory access on traditional x86 architectures creates a data flow bottleneck causing processor idle times. This results in much lower sustained performance than the theoretical peak of the CPU. To combat the bottleneck, state of the art processors have significant cache (L1, L2, L3), typically several megabytes on the processor chip. This uses up space that would otherwise be available to allow more transistors (and more processing power, as well as more heat). This “wasted” cache memory area is one explanation for why Moore’s law no longer translates into equivalent performance increases.-link


The memory wall is said by some to limit standard architectures to about 8 cores before performance starts to drop drastically, with 16 cores delivering 2-core performance at some tasks, and performance going off a cliff as one approaches 64 cores.
 
It seems that many persons around here have the false assumption that physics is just raw vector maths (number crunching). It's not. Physics engines use a lot of complex acceleration structures to speed up their work and to keep the memory access patterns manageable. Traversing these structures often includes a lot of branches and semi random memory accesses. There are some forms of physics simulations (for example particle systems) that are straightforward to simulate in parallel, but also systems that are much harder (for example complex rigid body systems with lots of constraints between the bodies).

That has been my assumption. Can GPGPU-accelerated physics in a next-generation console actually interact with the gameplay to the degree demanded, or will it always be the realm of PhysX-style billowing smoke and fluttering scraps of paper? Doesn't an upgraded Cell theoretically provide a happy medium, simulating more complex systems than are feasible on a homogeneous OoOE CPU while maintaining gameplay interaction with the player to a degree not possible through GPGPU?
 
It seems that many persons around here have the false assumption that physics is just raw vector maths (number crunching). It's not.

Not just physics, performance of a given problem in general.

IMO, the memory system is much more important.

How many outstanding requests, bandwidth and latency are the most important parameters. Look at how much die area is devoted to floating point math and to the memory system. A (simd) FPU is a fraction of a core and a core is a fraction of a CPU die. At the same time you have load store units the size of FPUs, on-die L1, L2 and L3 cache and integrated memory controllers. The combined memory subsystem of a modern CPU can be more than 60% of the total die size.

Wrt. CELL: The SPUs in CELL are really dumb, but fast, small processors. What empowers the SPUs is the DMA engine of CELL. The semi-autonomous DMA engine can implement flexible aggregating operations, eg. gather/scatter is trivial.

Cheers
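To make the gather point concrete, here is a minimal SPU-side sketch using the Cell SDK's list DMA (spu_mfcio.h, mfc_getl, mfc_list_element_t and the tag-status intrinsics are the real SDK names; the particle layout, buffer sizes, tag number and the assumption that all effective addresses share the same upper 32 bits are mine, for illustration only):

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define N_ELEMS 64

/* One 16-byte record per scattered particle in main memory. */
typedef struct { float pos[3]; float mass; } particle_t;

static volatile particle_t ls_buf[N_ELEMS] __attribute__((aligned(128)));
static mfc_list_element_t  dma_list[N_ELEMS] __attribute__((aligned(8)));

/* Gather N_ELEMS particles scattered through main memory into a
 * contiguous local-store buffer with a single list DMA command.
 * The MFC walks the list on its own while the SPU keeps computing. */
void gather_particles(const uint64_t ea_table[N_ELEMS])
{
    const uint32_t tag = 1;

    /* Build the gather list: one small transfer per particle. */
    for (int i = 0; i < N_ELEMS; ++i) {
        dma_list[i].notify   = 0;
        dma_list[i].reserved = 0;
        dma_list[i].size     = sizeof(particle_t);
        dma_list[i].eal      = (uint32_t)ea_table[i];  /* low 32 bits of EA */
    }

    /* Kick off the list DMA. The 'ea' argument supplies the shared
     * upper 32 bits of the effective addresses (assumed equal here). */
    mfc_getl((void *)ls_buf, ea_table[0] & ~0xFFFFFFFFull,
             dma_list, N_ELEMS * sizeof(mfc_list_element_t),
             tag, 0, 0);

    /* ...independent work can overlap with the transfer here... */

    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();   /* block until the whole gather lands */
}
```

The same pattern, double-buffered with two tags, is what lets an SPU hide main-memory latency almost entirely for workloads with predictable access patterns.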
 
The memory wall is said by some to limit standard architectures to about 8 cores before performance starts to drop drastically, with 16 cores delivering 2-core performance at some tasks, and performance going off a cliff as one approaches 64 cores.
Yes, this is true in general. However, the scaling depends very much on the memory access patterns. If each core has its own L2 cache and the majority of memory operations happen inside the core's own L2, the scaling is much better.

Cache is basically an automated local store. If you run an algorithm with similar memory access patterns as an algorithm optimized for Cell local store, the CPU with cache shouldn't do any more main memory accesses than the Cell-based system (with all other things being equal). The automated cache logic isn't 100% as efficient as the manual memory transfers between main memory <-> local store, but with manual cache control instructions you can pretty much reach parity. As long as there is a single shared main system memory, the multicore scaling will be limited. Local work memories (automated caches or manual memories) do help, but do not completely solve the scaling problem (assuming all data must be loaded from main memory and results stored to main memory at some point).

Of course the automated cache logic (and coherency logic) costs die space (and causes extra heat and manufacturing cost). The big question is, how much? And how many more execution units (and other performance-boosting features) could we have if we didn't have automated caches? Many algorithms require fast data caching, and it's very hard to beat (fixed-function) hardware cache logic with software cache implementations. Also, it's harder to implement general-purpose (not performance-critical) code without any automated cache logic. So the question also becomes: is the required extra software development cost reasonable just to get a few percent performance boost by making the hardware simpler?
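As a small illustration of that "manual cache control" parity argument (a sketch of mine, not from the post above): a streaming loop on a cached CPU can mimic a double-buffered SPU job by software-prefetching a fixed distance ahead. __builtin_prefetch is the GCC/Clang hint; the prefetch distance and the per-element kernel are just placeholder assumptions:

```c
#include <stddef.h>

/* Distance (in elements) to prefetch ahead; a tuning assumption. */
#define PREFETCH_AHEAD 256

/* Placeholder per-element work; stands in for whatever an SPU would
 * do on a chunk that has already been DMA'd into its local store. */
static inline float kernel(float x) { return x * 1.0001f + 0.5f; }

void process_stream(float *data, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Get the data moving ahead of time: the cache-world analogue
         * of queueing the next DMA transfer on Cell. One hint per
         * 64-byte cache line is plenty. */
        if (i + PREFETCH_AHEAD < n && (i % 16) == 0)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);

        data[i] = kernel(data[i]);
    }
}
```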
 
IMO, the memory system is much more important.

How many outstanding requests, bandwidth and latency are the most important parameters. Look at how much die area is devoted to floating point math and to the memory system. A (simd) FPU is a fraction of a core and a core is a fraction of a CPU die. At the same time you have load store units the size of FPUs, on-die L1, L2 and L3 cache and integrated memory controllers. The combined memory subsystem of a modern CPU can be more than 60% of the total die size.
Exactly this. Compute is cheap, and raw gigaflop numbers are mostly meaningless. What matters is getting data where it needs to be.

Wrt. CELL: The SPUs in CELL are really dumb, but fast, small processors. What empowers the SPUs is the DMA engine of CELL. The semi-autonomous DMA engine can implement flexible aggregating operations, eg. gather/scatter is trivial.
But *only* for operations where the granularity of memory operations is large. There are workloads where a 32-bit x 4 gather over the entire memory pool through a cache will absolutely crush the Cell DMA. Also, the burden of implementing all of that is on the programmer.

Manually managed pools (a la Cell) are easier to build, and give you better performance numbers for the same number of transistors spent, but have a huge cost in programmer productivity.

Some think this is okay, but as a programmer my outlook is quite a bit bleaker. IMHO, a CPU that is hard to program for is a pointless waste of silicon. There are always a few superstars who use them to their fullest, but most games aren't built by superstars.

If the PS4 has SPUs, Sony should package the dev kit with an SPU-enabled physics engine, and never expect the game programmers to touch them directly. (But that should of course be allowed.)
 
Manually managed pools (a la Cell) are easier to build, and give you better performance numbers for the same number of transistors spent,

Easier to build, gives better performance if the problem fits the architecture, and greatly increases the burden on the programmer.

but have a huge cost in programmer productivity.

Agreed. CELL has a steeper learning curve and more work associated with it. It gave the PS3 a big time-to-parity disadvantage, -the time before developers reached parity with the competition.

The local store effectively has register semantics. You manually load and store data to and from it, - just like with registers. The local store thus has the same advantages and disadvantages as registers. Great when you have data with lots of temporal locality, and modified state doesn't need to be visible to other thread/contexts, - and almost useless if that isn't the case.

As I see it, the problem with CELL is that they went all in on the local store idea. IMO, a smallish core with a modest set of wide vector registers would have almost all of the advantages of the SPUs, but would be a lot more flexible for normal code.

Something like:
2-way superscalar
32 regular registers
32 x 1024-bit vector registers (each holding 32 x 32-bit values or 16 x 64-bit ones)
128-bit FPU (4 x 32-bit, 2 x 64-bit)

8 KB data cache, 16 KB instruction cache and 128 KB L2 cache.

Vector loads and stores go directly to L2. Build the 128 KB L2 with 32 sets and support full gather/scatter without thrashing L2.

Cheers
 
Of course the automated cache logic (and coherency logic) costs die space (and causes extra heat and manufacturing cost). The big question is, how much? And how many more execution units (and other performance-boosting features) could we have if we didn't have automated caches? Many algorithms require fast data caching, and it's very hard to beat (fixed-function) hardware cache logic with software cache implementations. Also, it's harder to implement general-purpose (not performance-critical) code without any automated cache logic. So the question also becomes: is the required extra software development cost reasonable just to get a few percent performance boost by making the hardware simpler?
It depends. What I'd like to see is clothing physics on all characters, hair physics, deformable terrain, environments composed of destructible objects, muscle deformation physics, fluid physics, weather simulation, volumetric clouds affected by wind and maybe cloud formation physics, etc. - not just token use of physics here and there, but ubiquitous use of physics everywhere, affecting both gameplay and looks. This could very well take up most of the performance provided, not just a few percent.

For general-purpose work a few high-performance OoO cores are provided, but I'm not seeing the use for tens of large cores for that purpose in a console; I see things like physics eating up most of the resources.

The Ageia PPU (physics accelerator) was said to offer up to 200 times the performance of CPUs at some tasks, with a design said to be similar to Cell. What we want is a huge performance increase in this area, something that makes those nice old realtime 30fps GPU clothing and hair simulations practical in games. If the design is an order of magnitude faster at game physics but slower at running a word processor, I don't see the relevance of performance in the latter for a heterogeneous design, seeing as we'd have several OoO high-performance general-purpose cores to handle it.
Manually managed pools (a la Cell) are easier to build, and give you better performance numbers for the same number of transistors spent, but have a huge cost in programmer productivity.
Consoles are supposed to last about a decade, and middleware developers can handle software used by multiple companies. Even those with internal engines can reuse and optimize what they learn. A next-gen Cell can also leverage what was learned in this generation.

It seems getting the most performance for the least cost is what's desired in this space.
Yes, this is true in general. However, the scaling depends very much on the memory access patterns. If each core has its own L2 cache and the majority of memory operations happen inside the core's own L2, the scaling is much better.
I've heard the caches are getting ever bigger in an attempt to deal with the issue, but as the number of cores goes up the approach breaks down.

In this respect, the "memory wall" is a classic producer/consumer problem, and it's the reason that on-die cache sizes have ballooned in recent years. As the memory wall gets higher and higher, it takes more and more cache to get you over it.-arstechnica

It would be an interesting exercise to track cache size per core in deployed HPC systems, since larger caches have been the biggest defense against the memory wall. Cache has been growing exponentially to try and keep up with the multiplying cores.-hpcwire
If cache sizes have had to balloon exponentially to keep up with just a few cores, I'm not entirely sure that putting 30+ cores with smallish caches on a chip won't result in subpar performance, as the memory wall issues would predict.
 