Feasibility of Cell emulation

Discussion in 'Console Technology' started by Shifty Geezer, Nov 14, 2011.

  1. Weaste

    Newcomer

    Joined:
    Nov 13, 2007
    Messages:
    175
    Likes Received:
    0
    Location:
    Castellon de la Plana
I don't think that you could emulate it right now without SERIOUS cost implications.

    Why bother to emulate it in any case? Even if they don't want to use it as their main CPU in whatever design they have for PS4, why not use it as some sort of satellite processor, not only providing BC for PS3 games but also giving whatever other CPU they might use a big fat multimedia/maths co-processor hung off the side? Not harking back to the days of 68k and 68882 here, but isn't this how Toshiba have basically used the SPURS engine?

    How much would it cost them to stick say a 28nm Cell into a PS4 even if it had a different main CPU?

    Also, could the fact that they have not bothered to integrate Cell with RSX on the same die/package, as Microsoft have done, suggest that Sony as a whole has plans to use the chip in other CE devices once it's on a small enough node? What I mean by this is that the only reason to integrate it with RSX would be if they were never going to use it for anything in the future other than PS3. Why have they not done this?
     
  2. Weaste

    Newcomer

    Joined:
    Nov 13, 2007
    Messages:
    175
    Likes Received:
    0
    Location:
    Castellon de la Plana
    Yet SPU and PPU don't share the same ISA do they?
     
  3. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Yap, Cell is an exercise in losing weight to gain extra speed. It couldn't rely on powerful and complicated tech because of the power, heat, and memory restrictions of 5-7 years ago. Once everything is simplified and carries "no extra weight", it runs fast naturally, as long as the programmer knows what he (or she) is doing.

    Now, with more room and newer tech, one should be able to hit a similar performance envelope and then some. This is especially true if the new hardware ties the compute elements closer to the GPU (i.e., less communication overhead for Cell rendering work).

    Cell programs should be relatively well behaved and predictable, since the code and run-time are highly disciplined (otherwise performance would suck). That's part of its DNA. We should be able to move the software around as long as the hooks to the outside world "look" the same to the SPUs and PPU. The added complication may be the subtle dependencies between RSX and Cell, since they work rather closely together. However, if the new GPU is significantly faster, it may be okay.

    The question is how this will compromise other design choices (e.g., a more GPGPU-like setup). And PS2 emulation ? :runaway:

    EDIT: The security subsystem is another consideration. We will need a way to segregate the hardware, just like how the secure SPU works. Otherwise, even if the secure SPU code runs, the system is wide open.
     
  4. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    The current Cell software cannot go over the existing main and video memory bandwidth. The DMA and MFC insulate the SPUs' data and code from the rest of the world. So if they have some way to make DMA and MFC work within the "specs" in the new environment, it _should_ be ok.

    In the worst case, when PS3 code is running, they can limit the use/influence of other programs.

    EDIT:
    Btw, that's one of the reasons I <3 Cell software. They should be "movable" after the fact because they rely on so little (hardware), and are more predictable ("everything" is specified in the code by the programmer). We should be able to run them in a new environment, or, in a more adventurous mode, spread them across a few nodes, as long as we can guarantee the DMA/MFC performance somehow.

    Even if the DMA/MFC performance doesn't match up, the ported code should run efficiently on other hardware just because data locality (for the entire software library/base) is observed.
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The abstraction of the external system from the SPE POV does remove some difficulties.
    The SPE doesn't need to know about the protection model or system-level things like how memory is structured since it relies on asynchronous units for handling the translation.

    Another decent thing is that questions about the memory model are less difficult with regard to consistency: since the LS is non-coherent, there is none.
    An SPE could be plopped down in a wide variety of surrounding architectures, so long as the interface was updated.

    The downside is that the ISA is explicitly different, and that does pose a problem unless the future SPE keeps the same ISA or it is architected to have a superset of the original or have dual mode decoding.
    If the core trying to emulate the SPE is a standard coherent processor, it would need to also support the SPE's interfaces and then also have the regular methods of communication. There may need to be an LS mode or advanced cache control to match SPE performance in tightly optimized games.

    This may not be a bad thing, since the ability to efficiently and quickly message and coordinate other cores was very evolved and has an edge in some scenarios over the standard SMP model.

    Sony could try binary translation on the fly, or a wholesale translation of the code, if ISA compatibility is not present. The fastest would be to translate and store the results as code is encountered. This is a known performant solution, but it has raised copyright concerns because it copies code. Since Sony owns the platform, and sometimes the developer, this may not be a problem.

    What would likely not work is a code-morphing solution or pure software emulation. Such measures require straight-line performance that is a multiple of the SPE's in order to hide the extra work needed to emulate it. Since the SPE clocks at 3.2 GHz, it does not seem likely we will see 6-9 GHz replacements.
     
  6. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    Now that we have multicore processors, isn't it an option to have one (or more) cores dedicated to translating the SPU code into native code? If next-gen went for 16 simple cores, like ARMs, clocked the same as Cell, couldn't a load of them be tasked with translating the code on the fly, and be interleaved so that they can churn out instructions in a timely fashion? Or would that be untenable, with feeding the I-cache piecemeal, or having to buffer a load of instructions meaning major code latency?
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Dedicating cores to translate at runtime may not be good unless the results can be stored for later reuse. The slowness of the initial loads may not be acceptable.
    Actually, the binary translation option may not work well either in many cases.
    Even if another core translated everything ahead of time, the execution of that code needs to be fast enough.

    It is possible that if the translation code is also an optimizing compiler, it could lead to improvement, but in areas where no optimization is possible, it falls back to raw clock speed.
    If the emulating core is not able to execute a given SPE instruction in roughly the same number of clocks, clock speed needs to be higher to compensate.
    It would help if the core handles a superset of SPE functionality, because then it's 1:1 in instruction execution. If it requires multiple instructions to arrive at the same result, the cost can be hidden if superscalar issue can dispatch them at the same time, but it comes back to clock speed if they are serial.


    This is pretty much why modern x86 translates its ISA to an internal format via hardware means.
     
  8. jonabbey

    Regular

    Joined:
    Oct 12, 2006
    Messages:
    809
    Likes Received:
    1
    Location:
    Austin, TX
    Presumably anything simulating SPUs would need to have the large register bank for loop unrolling and the very high speed LS access to have a chance, no?
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Emulation is usually helped by having an even larger number of software visible registers. An equal number is probably the minimum.

    The LS could be emulated with proper cache controls. Managing latency could be a problem, but that may vary based on load. There are examples of L2 caches the size of the LS that can get within several cycles of the SPE's latency. A small L1 could even reduce the average latency further, unless the access pattern is enough to thrash the L1.
    Upping the clock would help here as well.
     
  10. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    I haven't been tracking. Assuming the data magically appear at the right places, how do the SPUs compare to the compute elements in high-end PC GPGPUs ? Can the latter "emulate" the SPU math operations ?
     
  11. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    Math operations on streamed data, probably. Fast and random access to around 100-200 KB of data, quite definitely not.
     
  12. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    Afaik the XBox (as well as PS2) emulators are both rather complete in terms of hardware quirks. It's just that turning "some" of those features on can dramatically affect performance so you end up with "game-profiles" where you selectively enable certain shortcuts (that a specific title happens to not depend on 'enough') to maintain performance.
    An obvious example of this would be, say, accurate FP emulation (there are no standards here for consoles).
     
  13. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Yes, I am not expecting a fast random access mechanism similar to Cell in standard PC parts. But I'm curious about their relative architectural differences.

    Do you have a more detailed comparison ? e.g., accuracy, features, speed, workflow etc.
     
  14. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    To put it in really simple terms, the approach Cell took to hiding memory latency was to have a small local pool that has deterministic fast access, while GPUs use multithreading on a very massive scale to provide the units with more work while other threads wait for the high-latency memory accesses.

    While most of the workloads that Cell is good at can also be run (faster) on modern GPUs, I don't think there's a snowball's chance in hell of automatically translating the existing code.
     
  15. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    Earlier posters have already said that the memory layout is one big difference between SPU and GPU, but an even bigger difference is the computational model. GPUs execute instructions using the SIMT (Single Instruction, Multiple Threads) model. This paradigm relies on thousands of threads executing the same code at the same time. The thread count is often chosen to match the size of the computational problem, not the number of hardware processing units (num threads >> num computational resources). On a GPU you have to formulate your algorithms in a (perfectly) parallel way, and this often results in very different algorithms (and data structures) compared to those that are efficient on traditional serial processing architectures.

    The SPU, on the other hand, can efficiently execute single-threaded (narrow) vector operations, just like any other CPU with a vector coprocessor. A single core executes one thread at a time. The SPU is much closer to a traditional CPU than to a GPU. The biggest difference is in the implementation of the fast work memory (local store vs cache). The SPU instruction set doesn't differ that much from a modern CPU vector instruction set (like AVX). The majority of SPU instructions could be translated 1:1 to CPU vector instructions. Of course, there are some instructions that would translate to 2 or even more instructions.

    I don't personally think SPU emulation would be impossible, since it is basically working inside a sandbox, and the memory transfer operations in/out of the sandbox have long latencies (hundreds of cycles = could be emulated by memory block copies). Of course, you would need a very fast CPU with very fast (and feature-complete) vector units (with data store/load using vector registers as addresses) and a big enough L2 cache to hold all the SPU local stores plus all other cached data.

    Basically:
    1. Allocate a static 256k memory area for each local store from system RAM (it will also be automatically cached to L1/L2 when required).
    2. Offline translate all SPU code to matching CPU vector code. The resulting code would likely require more instructions, but nothing major.
    3. Offline translate all local store addresses in the SPU code to the static 256k system RAM area that was allocated to the local store.
    4. Memory transfers to/from local store are translated to memory copies (RAM<->RAM). If the areas are in cache, the CPU will (automatically) do the copying in L1/L2 instead.

    Of course, there would likely be some low-level specifics (that I am not aware of) that need special handling. But in general, as long as the CPU executing the SPU code is fast enough, has fast enough L2 bandwidth (and at least 2 MB of L2), and has more memory bandwidth than Cell, it could work fine. Of course, it would never be as energy efficient as running the native SPU code, and it would require a considerably more powerful CPU to execute things in a timely manner. Maybe if PS4 had a 12-thread AVX2-powered Haswell :)
     
  16. upnorthsox

    Veteran

    Joined:
    May 7, 2008
    Messages:
    2,106
    Likes Received:
    380
    This is the crux though, isn't it: an array of 8 SPUs is only about 18mm^2 @28nm, so at some point (pretty early, probably) you have to decide whether it's just easier to include them. Personally, for a game console, I can't think of a reason why having an array of high-speed vector processors is a bad thing.
     
  17. TheChefO

    Banned

    Joined:
    Jul 29, 2005
    Messages:
    4,656
    Likes Received:
    32
    Location:
    Tampa, FL
    That's what I would think Sony will want to do.
     
  18. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Yes, I am familiar with massive SIMD architectures at a high level. I programmed on a CM-* machine briefly. A year or two ago, I also watched a talented programmer map a regular (but very old) program to run on a GPGPU (s-l-o-w-l-y ^_^) in his spare time. I was wondering how far GPGPU has evolved.


    EDIT:
    Yes yes, if you want, you can throw in more SPUs and/or clock them higher. I don't really track GPU progress, but I am casually curious about any cross-over effort between SPU-like and GPU-like architectures.

    As more developers optimize their software for GPGPU, the expertise and efficiency of that approach will also mature faster. So relying on GPGPU progress is not a bad idea either.
     
  19. Rolf N

    Rolf N Recurring Membmare
    Veteran

    Joined:
    Aug 18, 2003
    Messages:
    2,494
    Likes Received:
    55
    Location:
    yes
    Distributing a single-thread workload to multiple emulation threads is not generally feasible, because data dependencies do happen more often than not in an instruction sequence, and "sending" those dependencies back to a different processor after the fact is never efficient. Bigger, beefier individual cores are the only surefire way to emulate less beefy ones.

    The thing with SPEs is that they can do 8 flops per clock each, they are no slouch at integer SIMD either, they have tons of register space, L1-cache-worthy memory access latencies, and they are pretty much general purpose. Finding a core that is as fast as or faster than a 3.2GHz SPE in any and all cases is already a challenge. Finding one that has enough performance headroom to not be bogged down by the additional housekeeping required for translation/emulation seems impossible to me right now. Maybe in another five years.

    If Sony wants decent BC, they'll have to incorporate a (revamped/extended) Cell design in the PS4. I don't see a way around it.
     
  20. Blazkowicz

    Legend

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    Sticking a Cell in the PS4 for BC would cost a lot, if you're going to put the 256MB of XDR alongside it.
    It doesn't feel easy: you need a lot of bandwidth from Cell to the GPU and a lot of it from Cell to memory, maybe using a fast coherent link to the main CPU if you want to get rid of the XDR.

    It's not like the old times, where you would just put a Z80 alongside your 68k and use it either for sound or for BC with the Sega Master System.

    I like the idea of a four-or-something core modern PPU with 8 of the same old SPEs.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.