A glimpse inside the CELL processor

Discussion in 'Console Technology' started by mckmas8808, Jul 14, 2006.

  1. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    No.
     
  2. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
Lost. Saw your previous post. Three cores gives 333K for one thread per core; two threads per core is 166K if the cache is locked... if not, one can't guarantee how much of the cache holds data a thread may find useful.

Logical threads on SPEs... I was thinking something totally different. Partitioning the cache between logical threads would have the effect you describe. Not sure if logical threads are the way to go though.

    Anyway my bad. It's too early in the morning.
     
    #102 scificube, Jul 26, 2006
    Last edited by a moderator: Jul 26, 2006
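[The cache split discussed above can be sketched as back-of-envelope arithmetic. Note this uses 1 MB = 1024 KB, which gives ~341 KB per core; the "333K" figure in the posts comes from dividing 1000 by three instead.]

```python
# Rough split of Xenon's shared 1 MB L2 cache, as discussed above.
# Figures are illustrative only, not a statement about how the
# hardware actually partitions the cache.
L2_KB = 1024
cores = 3

per_core = L2_KB // cores        # ~341 KB if each core got an equal share
per_hw_thread = per_core // 2    # ~170 KB if the two HW threads split it

print(per_core, per_hw_thread)
```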
  3. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria
This is also what MS suggests, according to GDC slides: one big "worker" thread per core, and one less active, less critical "background" thread which jumps in when there's not so much to do (IO, sound, etc.).
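[The worker/background split described here can be sketched roughly as below. This is a hypothetical illustration using Python threads, not the actual MS-recommended code; the job names and queue are invented for the example.]

```python
import queue
import threading

# Sketch of the "one busy worker + one background thread" arrangement:
# the worker does the main per-frame work and hands latency-tolerant
# jobs (IO, sound, etc.) to a background thread that runs when asked.
background_jobs = queue.Queue()
results = []

def background_thread():
    # Drains deferred, less critical jobs whenever the worker posts them.
    while True:
        job = background_jobs.get()
        if job is None:           # sentinel: no more work, shut down
            break
        results.append(job())     # e.g. stream audio, touch the filesystem

def worker_thread():
    # The "big" per-core thread: main loop work, deferring the rest.
    for frame in range(3):
        # ... main per-frame work would go here ...
        background_jobs.put(lambda f=frame: f"io-done-{f}")
    background_jobs.put(None)

bg = threading.Thread(target=background_thread)
bg.start()
worker_thread()
bg.join()
print(results)  # deferred jobs completed off the worker's critical path
```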
     
  4. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
When you switch threads though, surely the cache used by the thread you're switching from doesn't necessarily all suddenly become available? I mean, if you're switching from a thread because it's memory bound, there'll be data coming into cache for it while your second thread is executing. Moreover, if you switch back and the data that was in the cache for that thread previously is now gone, it'll possibly block on memory again very quickly.
     
    #104 Titanio, Jul 26, 2006
    Last edited by a moderator: Jul 26, 2006
  5. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
Not sure I completely understand what you're asking or trying to say, so bear with me.

If you switch contexts on the core, how would you guarantee that data wasn't garbage, rather than useful to the new thread? You could try to have the threads access the same system memory, and then an image of that data would (most likely) still reside in cache... but then both threads have to need that data in the first place for that to work. For some tasks this may be the case; for other tasks it will not.

What I was referring to in what you quoted was not related to cache but to the execution pipelines. The execution pipelines are not duplicated for each context on a Xenon core, so the contexts must share them. This is why both HW threads can't always execute at the same time: sometimes instructions from both HW threads will need to flow down the same pipeline, and one of them must wait until the other is done with the resource. I don't see it being easy to guarantee that one thread uses the even pipe and the other thread uses the odd pipe, especially if the threads are working toward the same goal. It is easier to just use one primary thread of execution and let it go, and either have an unrelated task (which you can let take longer to complete) run alongside it, or have the other thread be complementary to the primary thread altogether. A sort of worker/helper thread arrangement, as I think MS suggests, is best. For instance, a task that goes out to memory a lot but still requires a lot of computation could be split in two: you have the helper fill the cache by going out to memory and blocking until the request is serviced, allowing the other thread to churn away at getting the job done.

I just find it difficult to imagine six threads executing all at the same time without there being contention for execution units in one pipeline or the other. I'm betting the compiler won't save you, and you'd certainly have a hard time getting it done yourself "by design" in code.

edit: What I said was tangential anyway, and a response born of misunderstanding, Shifty. I don't mean to change the direction of the discussion.
     
    #105 scificube, Jul 26, 2006
    Last edited by a moderator: Jul 27, 2006
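[The worker/helper split described in the post above can be sketched as follows. This is a hedged illustration in Python, with invented data and names: the helper blocks on (simulated) slow fetches and feeds a bounded queue, so the compute thread only ever touches data that has already arrived.]

```python
import queue
import threading

# Helper/worker prefetch arrangement: the helper is the thread that
# "goes out to memory" and blocks; the worker churns on ready data.
ITEMS = [1, 2, 3, 4]
ready = queue.Queue(maxsize=2)  # bounded: models limited cache/buffer space

def helper():
    # Stands in for the thread that issues memory requests and blocks
    # until each one is serviced (a sleep here would model the latency).
    for x in ITEMS:
        ready.put(x * 10)        # "fetch" completes; data is now resident
    ready.put(None)              # sentinel: nothing more to fetch

def worker(out):
    # The compute-heavy thread: data is already resident when it gets it.
    while True:
        x = ready.get()
        if x is None:
            break
        out.append(x + 1)        # the actual computation

out = []
t = threading.Thread(target=helper)
t.start()
worker(out)
t.join()
print(out)  # → [11, 21, 31, 41]
```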
  6. Titanio

    Legend

    Joined:
    Dec 1, 2004
    Messages:
    5,670
    Likes Received:
    51
I agree that you're not going to have six threads executing simultaneously the majority of the time (even with instruction interleaving, if it's there), but I was just wondering about the "333k of cache on average per thread" statement. If you have just three threads, yes, but if you have more threads than that that you're switching between, there likely won't be an average of 333k for each of the executing threads. At least I don't think so? You'll still need to keep data that the blocked threads are using in cache, there still might be data coming into cache for them while they're blocked (in fact they may be blocked waiting for that data), etc. That's really all I was wondering about, sorry if I'm not being totally clear..
     
  7. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
    Don't apologize to me. It's not as if I know what I'm talking about :)

I only went with the 333 split because that was the simplest split between three cores, and then I assumed threads on a core would co-operate vs. being given their own piece of the cache pie. If cache lines aren't locked then it's sort of a matter of... luck whether the data in cache is usable by threads as they switch, if one hasn't optimized the code to ensure that is the case. There would be nothing stopping one thread from wiping all the data another thread "was" using from the cache altogether. Given the nature of console programming, I'm sure not much will be left up to luck if it can be helped, though.

(When IO finishes for a thread in wait, it would be placed in the ready list and executed before that data is allowed to be touched or moved... otherwise IO could block a thread indefinitely if data were allowed to be purged before it could be used, because the thread would continually ask for data from system memory.)
     
    #107 scificube, Jul 26, 2006
    Last edited by a moderator: Jul 26, 2006
  8. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria

Btw, is it now officially confirmed that the CELL PPU has double the execution logic since DD2.1, compared to Xenon cores? Or is this still considered a rumor? If so, at least this problem should not be as bad on the CELL PPU, making it better suited for multithreading?
     
  9. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
    I've seen no such confirmation or even that suggestion before. Personally, I'd doubt it until I saw solid proof otherwise.
     
  10. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria
It's been discussed here a lot in the past. There's also an article on realworldtech.com about it. At least the size of the CELL PPU grew considerably from DD1.0 to DD2.1+, and it's in the region of the VMX units... :)
     
  11. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria
  12. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
Yes, we've seen those images for a while now and have been aware that the PPE has grown in size since DD1. However, there's been no public announcement of true SMP on the PPU in Cell. I would also point out that the 7-SPE Cells are rejected 8-SPE Cells from the ones used in the PC space. Cells in IBM's blade servers do not feature true SMP on the PPU, and thus there is no duplication of pipelines. What exactly happened from DD1 to DD2 is not public knowledge, but I'm fairly sure it was not a duplication of pipelines.
     
  13. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria
Yes; on the other hand, there is an interview with CryTek about the SMT abilities of CELL and Xenon where it's said that Xenon's multithreading is comparable to a P4's hyperthreading, while CELL's is a little better than that. Might be an indication of more execution hardware...

    This is the interview (my translation from www.gamestar.de) :)

     
  14. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
    Refer to the BE Handbook you can download from IBM's website.

    10. PPE Multithreading

    "To software, the PPE implementation of multithreading looks similar to a multiprocessor implementation, but there are several important differences....."

    "Since most of the PPE hardware is shared by the two threads of execution, the hardware cost of the PPE multithreading implementation is dramatically lower than the cost of replicating the entire processor core. The PPE’s dual-threading typically yields about one-fifth the performance increase of a dual-core implementation, but the PPE achieves this 10%-to-30% performance boost at only one-twentieth of the cost of a dual-core implementation."


    10.1 Multithreading Guidelines

    "In the PPE multithreading implementation, the two hardware threads share execution resources.
    Consequently, concurrent threads that tend to use different subsets of the shared resources will
    best exploit the PPE’s multithreading implementation."


    ;)
     
    #114 scificube, Jul 27, 2006
    Last edited by a moderator: Jul 28, 2006
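[The handbook's "one-fifth of a dual-core gain" claim quoted above can be checked with trivial arithmetic. This sketch assumes, for illustration, that a second full core would ideally add ~100% throughput.]

```python
# Arithmetic behind the BE Handbook quote: if replicating the whole
# core would ideally add ~100% throughput (an assumed ideal figure),
# one-fifth of that lands inside the quoted 10%-to-30% band.
dual_core_gain = 1.00             # +100% from a full second core (assumed)
smt_gain = dual_core_gain / 5     # "about one-fifth" per the handbook

print(f"{smt_gain:.0%}")          # ~20%, within the 10%-30% range quoted
```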
  15. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria
I know these docs, but maybe they refer to DD1 Cells (although it would be strange, if DeanoC worked on one)? Or the ones that are used in IBM's blades, and not the PS3 ones? Because, as you said (I think), these Cells do not have the kind of SMT support the PS3 Cell has... but I might be wrong :)

Still, we have not heard anything regarding this; I guess it's because of NDAs, but if Deano or someone might jump in here, we'd really appreciate that ;)
     
  16. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria
I've read that too :) But then again, as I said above, we do not clearly know which CELL revision this refers to.
     
  17. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
DD1 never made it into the wild by all accounts. The move to DD2 came very swiftly, and DD1 likely won't be found in anything but the earliest prototype Cell kits at IBM. I think DeanA was the source of this info, but someone definitely said this on this forum.
     
  18. Jesus2006

    Regular

    Joined:
    Jul 14, 2006
    Messages:
    506
    Likes Received:
    10
    Location:
    Bavaria
Hm, I see. Anyway, I remember some Xenon die shots which someone compared to the CELL PPU some time ago; any idea where that was? It would make comparisons easier (although we are running completely OT here :p )
     
  19. chachi

    Newcomer

    Joined:
    Sep 15, 2004
    Messages:
    120
    Likes Received:
    3
    There's probably a better image out there but google is your friend, etc.

    http://www.gigascale.org/mescal/maw/index.html

As you can see from the publicly released photos, the revamped PPE looks remarkably similar to the Xenon core when you compare the layout of the processor elements (with some obvious differences that were beaten to death in old posts), and it'd be surprising if IBM didn't use an updated shared common-core ancestor for both (one which may or may not find its way to market as a standalone part for embedded use someday).

    I have no idea what CryTek was talking about but the two systems use completely different compilers so how code is scheduled and optimized is likely to be very different. It also seems more important on the 360 to maximize performance for single-threaded tasks (or at least for primary tasks, say the main game loop or physics) on a core while on the PS3 you have the SPEs you can also rely on so the Cell compiler might be generating code that takes better advantage of the ability to run two threads while sacrificing a bit of performance for any single thread.
     
  20. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,709
    Likes Received:
    145
    Yes, the general idea is to have the software designed specifically for the Cell architecture. I am not sure whether this is always possible due to external factors. I think it's probably "in-betweens" in real life.

    Sorry for reviving the confusion part of this thread. I'll keep quiet now. :)
     