Feasibility of Cell emulation

I don't think that you could emulate it right now without SERIOUS cost restrictions.

Why bother to emulate it in any case? Even if they don't want to use it as their main CPU in whatever design they have for PS4, why not use it as some sort of satellite processor, not only providing BC for PS3 games but also giving whatever other CPU they might use a big fat multimedia/maths co-processor hung off the side? Not harking back to the days of 68k and 68882 here, but isn't this how Toshiba have basically used the SPURS engine?

How much would it cost them to stick say a 28nm Cell into a PS4 even if it had a different main CPU?

Also, could the fact that they have not bothered to integrate Cell with RSX on the same die/package, as Microsoft has done, suggest that Sony as a whole has plans to use the chip in other CE devices once it's on a small enough node? What I mean by this is that the only reason to integrate it with RSX would be if they were never going to use it for anything else in the future other than PS3. Why have they not done this?
 
I think it holds merit, so with all due apologies to Shifty I'll reopen this one. Also, in order to rile things up, I suggest that people do a slightly better job of documenting themselves... moving beyond "omagad it's deep magic™" would be good. CELL is old, cute but not exactly cutting edge: it doesn't have an awesome memory pipe, it doesn't have hardware caching, the ISA is not exactly from another galaxy, and IPC is pretty gimpy. As long as the emulator writers are actually aware of prefetching to cater to the DMA work (albeit modern prefetchers should do a reasonably good job anyway), you'd probably end up doing the bulk of the work from L2.
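To make that last point concrete, here is a rough sketch (hypothetical helper, not from any real emulator) of the kind of software prefetching an emulator could wrap around an emulated DMA "get", so the copy and the work that follows run mostly out of cache:

```cpp
#include <cstddef>
#include <cstring>
#include <xmmintrin.h>  // _mm_prefetch

// Emulate an SPU DMA "get" as a plain copy, but walk ahead of it with
// software prefetches (64-byte cache lines assumed) so the data is already
// in flight towards L2/L1 by the time the copy and the compute touch it.
void emulated_dma_get(void* ls_dst, const void* ea_src, std::size_t size) {
    const char* src = static_cast<const char*>(ea_src);
    for (std::size_t off = 0; off < size; off += 64)
        _mm_prefetch(src + off, _MM_HINT_T0);
    std::memcpy(ls_dst, ea_src, size);  // the "DMA" itself is just a copy
}
```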

Yap, Cell is an exercise in losing weight to gain extra speed. It couldn't rely on powerful and complicated tech because of the power, heat and memory restrictions of 5-7 years ago. Once everything is simplified and carries "no extra weight", it runs fast naturally, as long as the programmer knows what he (or she) is doing.

Now, with more room and newer tech, one should be able to hit a similar performance envelope and then some. This is especially true if the new hardware ties the compute elements closer to the GPU (i.e., less communication overhead for Cell rendering work).

A reasonably modern core like Nehalem or Sandy Bridge is considerably ahead of CELL in terms of execution prowess, and now it's working from a cache that's quite fast. You also have HT to help fill in some bubbles. More importantly, people have done work behind the scenes on this and it's reasonable, to say the least. So please, less rehashing of memes like: "we tweaked CELL so hard it's over 9000 now; if we had an updated version that ran at Graham's Number GHz and had Moser's Number SPEWs it would crack the very fabric of the universe".

Cell programs should be relatively well behaved and predictable, since the code and run-time are highly disciplined (otherwise performance would suck). That's part of its DNA. We should be able to move the software around as long as the hooks to the outside world "look" the same to the SPUs and PPU. The added complication may be the subtle dependencies between RSX and Cell, since they work rather closely together. However, if the new GPU is significantly faster, it may be okay.

The question is how this will compromise other design choices (e.g., a more GPGPU-like setup). And PS2 emulation ? :runaway:

EDIT: The security subsystem is another consideration. We will need a way to segregate the hardware just like how the secure SPU works. Otherwise, even if the secure SPU code runs, the system is wide open.
 
The thing is: with heavily multithreaded PS3 code running in an emulator, you're going to hit all sorts of conditions that weren't there on CELL. Performance will be different and unpredictable, but if you have enough horsepower, you should be OK. This is pretty much what MS did with 360 back compat and that's why some games did not work (corner cases with horrible performance characteristics and code too performant and HW-dependent to run in emu on 360).

The current Cell software cannot go over the existing main and video memory bandwidth. The DMA and MFC insulate the SPUs' data and code from the rest of the world. So if they have some way to make DMA and MFC work within the "specs" in the new environment, it _should_ be ok.

In the worst case, when PS3 code is running, they can limit the use/influence of other programs.

EDIT:
Btw, that's one of the reasons I :love: Cell software. It should be "movable" after the fact because it relies on so little (hardware) and is more predictable ("everything" is specified in the code by the programmer). We should be able to run it in a new environment or, in more adventurous mode, spread it across a few nodes, as long as we can guarantee the DMA/MFC performance somehow.

Even if the DMA/MFC performance doesn't match up, the ported code should run efficiently on other hardware just because data locality (for the entire software library/base) is observed.
 
The abstraction of the external system from the SPE's POV does remove some difficulties.
The SPE doesn't need to know about the protection model or system-level things like how memory is structured, since it relies on asynchronous units to handle the translation.

Another nice thing is that questions about the memory model are less difficult with regard to consistency: since the LS is non-coherent, there is no consistency to worry about.
An SPE could be plopped down in a wide variety of surrounding architectures, so long as the interface was updated.

The downside is that the ISA is explicitly different, and that does pose a problem unless the future SPE keeps the same ISA, is architected as a superset of the original, or has dual-mode decoding.
If the core trying to emulate the SPE is a standard coherent processor, it would need to support the SPE's interfaces on top of the regular methods of communication. There may also need to be an LS mode or advanced cache control to match SPE performance in tightly optimized games.

This may not be a bad thing, since the SPE's ability to efficiently and quickly message and coordinate with other cores was quite evolved and has an edge in some scenarios over the standard SMP model.

Sony could try binary translation on the fly, or a wholesale translation of the code, if ISA compatibility is not present. The fastest option would be to translate and store the results as code is encountered. This is a known performant solution, but it has raised copyright concerns in the past because it copies code. Since Sony owns the platform and sometimes the developer, this may not be a problem.
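A minimal sketch of that "translate once, store, reuse" idea, with all names hypothetical:

```cpp
#include <cstdint>
#include <unordered_map>

// The first time a block of guest (SPU) code at a given address is seen, it is
// translated to host code and cached; every later visit reuses the stored result.
using HostBlock = void (*)();

static void placeholder_block() {}  // stands in for emitted host code

struct TranslationCache {
    std::unordered_map<uint32_t, HostBlock> blocks;  // guest PC -> host code

    // Placeholder: a real translator would decode SPU instructions starting at
    // guest_pc and emit executable host code; here it just returns a stub.
    HostBlock translate_block(uint32_t /*guest_pc*/) { return &placeholder_block; }

    HostBlock lookup_or_translate(uint32_t guest_pc) {
        auto it = blocks.find(guest_pc);
        if (it != blocks.end()) return it->second;  // reuse the stored translation
        HostBlock hb = translate_block(guest_pc);   // pay the translation cost once
        blocks.emplace(guest_pc, hb);
        return hb;
    }
};
```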

What likely would not work is a code-morphing solution or pure software emulation. Such measures require straight-line performance that is a multiple of the SPE's in order to hide the extra work needed to emulate it. Since the SPE clocks at 3.2 GHz, it does not seem likely we will see 6-9 GHz replacements.
 
Now that we have multicore processors, isn't it an option to have one (or more) cores dedicated to translating the SPU code into native code? If next-gen went for 16 simple cores, like ARMs, clocked the same as Cell, couldn't a load of them be tasked with translating the code on the fly, and be interleaved so that they can churn out instructions in a timely fashion? Or would that be untenable because of feeding the I-cache piecemeal, or having to buffer a load of instructions, meaning major code latency?
 
Dedicating cores to translate at runtime may not be good unless the results can be stored for later reuse. The slowness of the initial loads may not be acceptable.
Actually, the binary translation option may not work well either in many cases.
Even if another core translated everything ahead of time, the execution of that code needs to be fast enough.

It is possible that if the translation code is also an optimizing compiler, it could lead to improvement, but in areas where no optimization is possible, it falls back to raw clock speed.
If the emulating core is not able to execute a given SPE instruction in roughly the same number of clocks, clock speed needs to be higher to compensate.
It would help if the core handles a superset of SPE functionality, because then it's 1:1 in instruction execution. If it requires multiple instructions to arrive at the same result, that can be hidden if superscalar issue can dispatch them at the same time, but it comes back to clock speed if they are serial.


This is pretty much why modern x86 translates its ISA to an internal format via hardware means.
 
Presumably anything simulating SPUs would need to have the large register bank for loop unrolling and the very high speed LS access to have a chance, no?
 
Emulation is usually helped by having an even larger number of software visible registers. An equal number is probably the minimum.

The LS could be emulated with proper cache controls. Managing latency could be a problem, but that may vary based on load. There are examples of L2 caches the size of the LS that get within several cycles of the SPE's latency. A small L1 could reduce the average latency even further, unless the access pattern is enough to thrash the L1.
Upping the clock would help here as well.
 
I haven't been tracking. Assuming the data magically appear at the right places, how does the SPU compare to the compute elements in high end PC GPGPUs ? Can the latter "emulate" the SPU math operations ?
 
Math operations on streamed data, probably. Fast and random access to around 100-200kb of data, quite definitely not.
 
Dominik D said:
This is pretty much what MS did with 360 back compat and that's why some games did not work (corner cases with horrible performance characteristics and code too performant and HW-dependent to run in emu on 360).
Afaik the Xbox (as well as PS2) emulators are both rather complete in terms of hardware quirks. It's just that turning "some" of those features on can dramatically affect performance, so you end up with "game profiles" where you selectively enable certain shortcuts (ones that a specific title happens to not depend on 'enough') to maintain performance.
An obvious example of this would be say - accurate FP emulation (there's no standards here for consoles).
 
Math operations on streamed data, probably. Fast and random access to around 100-200kb of data, quite definitely not.

Yes, I am not expecting a fast random access mechanism similar to Cell in standard PC parts. But I'm curious about their relative architectural differences.

Do you have a more detailed comparison ? e.g., accuracy, features, speed, workflow etc.
 
Yes, I am not expecting a fast random access mechanism similar to Cell in standard PC parts. But I'm curious about their relative architectural differences.

Do you have a more detailed comparison ? e.g., accuracy, features, speed, workflow etc.

To put it in really simple terms, the approach Cell took to hiding memory latency was to have a small local pool that has deterministic fast access, while GPUs use multithreading on a very massive scale to provide the units with more work while other threads wait for the high-latency memory accesses.
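For readers who haven't seen SPU code, the canonical double-buffering pattern is what that latency hiding looks like in practice. This is only a sketch built on the Cell SDK's MFC intrinsics; process_chunk, stream and the chunk size are made up for illustration:

```cpp
#include <spu_mfcio.h>  // Cell SDK MFC intrinsics (SPU side)
#include <stdint.h>

#define CHUNK 16384
static char buf[2][CHUNK] __attribute__((aligned(128)));

void process_chunk(char* data, unsigned size);  // the actual work, assumed elsewhere

// While the SPU crunches one buffer, the MFC streams the next chunk into the
// other, hiding the hundreds-of-cycles main-memory latency behind useful work.
void stream(uint64_t ea, unsigned nchunks) {
    int cur = 0;
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);              // prime buffer 0 (tag 0)
    for (unsigned i = 0; i < nchunks; ++i) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                          // kick off the next transfer now
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                 // wait only on the current buffer's tag
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);
        cur = nxt;
    }
}
```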

While most of the workloads that Cell is good at can also be run (faster) on modern GPUs, I don't think there's a snowball's chance in hell of automatically translating the existing code.
 
I haven't been tracking. Assuming the data magically appear at the right places, how does the SPU compare to the compute elements in high end PC GPGPUs ? Can the latter "emulate" the SPU math operations ?
Earlier posters have already said that the memory layout is one big difference between SPU and GPU, but an even bigger difference is the computational model. GPUs execute instructions using the SIMT (Single Instruction Multiple Threads) model. This paradigm relies on thousands of threads executing the same code at the same time. The thread count is often chosen to match the size of the computational problem, not the number of hardware processing units (num threads >> num computational resources). On a GPU you have to formulate your algorithms in a (perfectly) parallel way, and this often results in very different algorithms (and data structures) compared to those that are efficient on traditional serial processing architectures.

The SPU, on the other hand, can efficiently execute single-threaded (narrow) vector operations, just like any other CPU with a vector coprocessor. A single core executes one thread at a time. The SPU is much closer to a traditional CPU than to a GPU. The biggest difference is in the implementation of the fast working memory (local store vs. cache). The SPU instruction set doesn't differ that much from a modern CPU vector instruction set (like AVX). The majority of SPU instructions could be translated 1:1 to CPU vector instructions. Of course there are some instructions that would translate to 2 or even more instructions.
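As a (hypothetical) illustration of that mapping, compare an SPU add, which maps 1:1 onto an SSE add, with an SPU fused multiply-add, which takes two SSE instructions on hardware without FMA (and would need extra care if bit-exact results mattered, given the SPU's non-IEEE single precision):

```cpp
#include <xmmintrin.h>

// SPU:  c = spu_add(a, b);      -> one instruction either way
__m128 add_qw(__m128 a, __m128 b) { return _mm_add_ps(a, b); }

// SPU:  d = spu_madd(a, b, c);  -> one fused multiply-add on the SPU, but two
// SSE instructions (with an extra rounding step) on hardware without FMA.
__m128 madd_qw(__m128 a, __m128 b, __m128 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}
```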

I don't personally think SPU emulation would be impossible, since it is basically working inside a sandbox, and the memory transfer operations in/out of the sandbox have long latencies (hundreds of cycles, so they could be emulated by memory block copies). Of course you would need a very fast CPU with very fast (and feature-complete) vector units (with data store/load using vector registers as addresses) and a big enough L2 cache to hold all the SPU local stores plus all other cached data.

Basically:
1. Allocate a static 256 KB memory area for each local store from system RAM (it will also be automatically cached to L1/L2 when required).
2. Offline-translate all SPU code to matching CPU vector code. The resulting code would likely require more instructions, but nothing major.
3. Offline-translate all local store addresses in the SPU code to the static 256 KB system RAM area that was allocated to that local store.
4. Memory transfers to/from the local store are translated to memory copies (RAM<->RAM). If the areas are in the cache, the CPU will (automatically) do the copying in L1/L2 instead.

Of course there would likely be some low-level specifics (that I am not aware of) that need special handling. But in general, as long as the CPU executing the SPU code is fast enough, has fast enough L2 bandwidth (and at least 2 MB of L2), and has more memory bandwidth than Cell, it could work fine. Of course it would never be as energy efficient as running the native SPU code, and it would require a considerably more powerful CPU to execute things in a timely manner. Maybe if PS4 had a 12-thread AVX2-powered Haswell :)
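A minimal sketch of steps 1, 3 and 4 under those assumptions (all names hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Each emulated SPU gets a static 256 KB block of system RAM standing in for
// its local store. SPU local-store addresses wrap at 256 KB, so masking maps
// any LS address into the block; DMA in/out of the LS becomes a plain block copy.
constexpr std::size_t LS_SIZE = 256 * 1024;

struct EmulatedSPU {
    alignas(128) uint8_t ls[LS_SIZE];                       // step 1: the LS image

    uint8_t* ls_ptr(uint32_t ls_addr) {                     // step 3: address translation
        return &ls[ls_addr & (LS_SIZE - 1)];
    }

    void dma_get(uint32_t ls_addr, const void* ea, std::size_t size) {  // step 4
        std::memcpy(ls_ptr(ls_addr), ea, size);
    }
    void dma_put(uint32_t ls_addr, void* ea, std::size_t size) {
        std::memcpy(ea, ls_ptr(ls_addr), size);
    }
};
```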
 
Earlier posters have already said that the memory layout is one big difference between SPU and GPU, but an even bigger difference is the computational model. GPUs execute instructions using the SIMT (Single Instruction Multiple Threads) model. This paradigm relies on thousands of threads executing the same code at the same time. The thread count is often chosen to match the size of the computational problem, not the number of hardware processing units (num threads >> num computational resources). On a GPU you have to formulate your algorithms in a (perfectly) parallel way, and this often results in very different algorithms (and data structures) compared to those that are efficient on traditional serial processing architectures.

The SPU, on the other hand, can efficiently execute single-threaded (narrow) vector operations, just like any other CPU with a vector coprocessor. A single core executes one thread at a time. The SPU is much closer to a traditional CPU than to a GPU. The biggest difference is in the implementation of the fast working memory (local store vs. cache). The SPU instruction set doesn't differ that much from a modern CPU vector instruction set (like AVX). The majority of SPU instructions could be translated 1:1 to CPU vector instructions. Of course there are some instructions that would translate to 2 or even more instructions.

I don't personally think SPU emulation would be impossible, since it is basically working inside a sandbox, and the memory transfer operations in/out of the sandbox have long latencies (hundreds of cycles, so they could be emulated by memory block copies). Of course you would need a very fast CPU with very fast (and feature-complete) vector units (with data store/load using vector registers as addresses) and a big enough L2 cache to hold all the SPU local stores plus all other cached data.

Basically:
1. Allocate a static 256 KB memory area for each local store from system RAM (it will also be automatically cached to L1/L2 when required).
2. Offline-translate all SPU code to matching CPU vector code. The resulting code would likely require more instructions, but nothing major.
3. Offline-translate all local store addresses in the SPU code to the static 256 KB system RAM area that was allocated to that local store.
4. Memory transfers to/from the local store are translated to memory copies (RAM<->RAM). If the areas are in the cache, the CPU will (automatically) do the copying in L1/L2 instead.

Of course there would likely be some low-level specifics (that I am not aware of) that need special handling. But in general, as long as the CPU executing the SPU code is fast enough, has fast enough L2 bandwidth (and at least 2 MB of L2), and has more memory bandwidth than Cell, it could work fine. Of course it would never be as energy efficient as running the native SPU code, and it would require a considerably more powerful CPU to execute things in a timely manner. Maybe if PS4 had a 12-thread AVX2-powered Haswell :)

This is the crux though, isn't it: an array of 8 SPUs is only about 18 mm^2 at 28 nm, so at some point (pretty early, probably) you have to decide whether it's just easier to include them. Personally, for a game console, I can't think of a reason why having an array of high-speed vector processors is a bad thing.
 
This is the crux though, isn't it: an array of 8 SPUs is only about 18 mm^2 at 28 nm, so at some point (pretty early, probably) you have to decide whether it's just easier to include them. Personally, for a game console, I can't think of a reason why having an array of high-speed vector processors is a bad thing.

That's what I would think Sony will want to do.
 
To put it in really simple terms, the approach Cell took to hiding memory latency was to have a small local pool that has deterministic fast access, while GPUs use multithreading on a very massive scale to provide the units with more work while other threads wait for the high-latency memory accesses.

While most of the workloads that Cell is good at can also be run (faster) on modern GPUs, I don't think there's a snowball's chance in hell of automatically translating the existing code.

Earlier posters have already said that the memory layout is one big difference between SPU and GPU, but an even bigger difference is the computational model. GPUs execute instructions using the SIMT (Single Instruction Multiple Threads) model. This paradigm relies on thousands of threads executing the same code at the same time. The thread count is often chosen to match the size of the computational problem, not the number of hardware processing units (num threads >> num computational resources). On a GPU you have to formulate your algorithms in a (perfectly) parallel way, and this often results in very different algorithms (and data structures) compared to those that are efficient on traditional serial processing architectures.

Yes, I am familiar with massive SIMD architectures at a high level. Programmed on a CM-* machine briefly. A year or two ago, I also watched a talented programmer map a regular (but very old) program to run on a GPGPU (s-l-o-w-l-y ^_^) in his spare time. Was wondering how far GPGPU has evolved.


EDIT:
This is the crux though, isn't it: an array of 8 SPUs is only about 18 mm^2 at 28 nm, so at some point (pretty early, probably) you have to decide whether it's just easier to include them. Personally, for a game console, I can't think of a reason why having an array of high-speed vector processors is a bad thing.

Yes yes, if you want, you can throw in more SPUs and/or clock them higher. I don't really track GPU progress, but I am casually curious about any cross-over efforts between SPU-like and GPU-like architectures.

As more developers optimize their software for GPGPU, the expertise and efficiency of that approach will also mature faster. So relying on GPGPU progress is not a bad idea either.
 
Now that we have multicore processors, isn't it an option to have one (or more) cores dedicated to translating the SPU code into native code? If next-gen went for 16 simple cores, like ARMs, clocked the same as Cell, couldn't a load of them be tasked with translating the code on the fly, and be interleaved so that they can churn out instructions in a timely fashion? Or would that be untenable because of feeding the I-cache piecemeal, or having to buffer a load of instructions, meaning major code latency?
Distributing a single-thread workload to multiple emulation threads is not generally feasible, because data dependencies do happen more often than not in an instruction sequence, and "sending" those dependencies back to a different processor after the fact is never efficient. Bigger, beefier individual cores are the only surefire way to emulate less beefy ones.

The thing with SPEs is that they can do 8 flops per clock each, they are no slouch at integer SIMD either, they have tons of register space and L1-cache-worthy memory access latencies, and they are pretty much general purpose. Finding a core that is as fast as or faster than a 3.2 GHz SPE in any and all cases is already a challenge. Finding one that has enough performance headroom not to be bogged down by the additional housekeeping required for translation/emulation seems impossible to me right now. Maybe in another five years.

If Sony wants decent BC, they'll have to incorporate a (revamped/extended) Cell design in the PS4. I don't see a way around it.
 
I don't think that you could emulate it right now without SERIOUS cost restrictions.

Why bother to emulate it in any case? Even if they don't want to use it as their main CPU in whatever design they have for PS4, why not use it as some sort of satellite processor, not only providing BC for PS3 games but also giving whatever other CPU they might use a big fat multimedia/maths co-processor hung off the side? Not harking back to the days of 68k and 68882 here, but isn't this how Toshiba have basically used the SPURS engine?

How much would it cost them to stick say a 28nm Cell into a PS4 even if it had a different main CPU?

Also, could the fact that they have not bothered to integrate Cell with RSX on the same die/package, as Microsoft has done, suggest that Sony as a whole has plans to use the chip in other CE devices once it's on a small enough node? What I mean by this is that the only reason to integrate it with RSX would be if they were never going to use it for anything else in the future other than PS3. Why have they not done this?

Sticking a Cell in the PS4 for BC would cost a lot if you're going to put the 256 MB of XDR alongside it.
It doesn't feel easy: you need a lot of bandwidth from Cell to the GPU and a lot from Cell to memory, maybe using a fast coherent link to the main CPU if you want to get rid of the XDR.

It's not like the old days, when you could just put a Z80 alongside your 68K and use it either for sound or for BC with the Sega Master System.

I like the idea of a four-or-so-core modern PPU with 8 of the same old SPEs.
 