A glimpse inside the CELL processor

Shifty Geezer said:

Lost. Saw your previous post. 3 cores, 333K for 1 thread per core. 2 threads per core is 166K if the cache is locked... if not, one can't guarantee what amount of cache has data a thread may find useful.
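(For reference, the rough arithmetic behind those figures, using Xenon's publicly stated 1 MB of L2 shared across the three cores: 1 MB ≈ 1000 KB, 1000 KB / 3 cores ≈ 333 KB per core, and 333 KB / 2 hardware threads ≈ 166 KB per thread, assuming the cache were split evenly rather than shared dynamically.)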

Logical threads on SPEs... I was thinking something totally different. Partitioning the cache between logical threads would have the effect you describe. Not sure if logical threads are the way to go though.

Anyway my bad. It's too early in the morning.
 
scificube said:
I myself feel only 3 threads will dominate execution time because each core shares pipes rather than duplicating them, so on average 3 of the six contexts will be blocked while the others are using the pipelines in each core. Staging is eliminated because contexts != instructions, and so instructions belonging to different contexts can't be staged AFAIK. With this in mind I think it may be better to say 3 threads would have 333K available to them on average.

This is also what MS suggests, according to GDC slides: one big "worker" thread per core and one less active, less critical "background" thread which jumps in when there's not so much to do (IO, sound etc.).
 
scificube said:
I myself feel only 3 threads will dominate execution time because each core shares pipes rather than duplicating them, so on average 3 of the six contexts will be blocked while the others are using the pipelines in each core. Staging is eliminated because contexts != instructions, and so instructions belonging to different contexts can't be staged AFAIK. With this in mind I think it may be better to say 3 threads would have 333K available to them on average.

When you switch threads though, surely the cache used by the thread you're switching from doesn't necessarily all suddenly become available? I mean, if you're switching from a thread because it's memory bound, there'll be data coming into cache for it while your second thread is executing. Moreover, if you switch back and data that was in the cache for that thread previously is now gone, it'll possibly block on memory again very quickly.
 
Titanio said:
When you switch threads though, surely the cache used by the thread you're switching from doesn't necessarily all suddenly become available? I mean, if you're switching from a thread because it's memory bound, there'll be data coming into cache for it while your second thread is executing. Moreover, if you switch back and data that was in the cache for that thread previously is now gone, it'll possibly block on memory again very quickly.

Not sure I understand completely what you're asking or trying to say, so bear with me.

If you switch contexts on the core, how would you guarantee that the data left in cache is useful to the new thread rather than garbage? You could try to have both threads access the same system memory, and then that data would (most likely) still reside in cache... but then both threads have to need that data in the first place for that to work. For some tasks this may be the case; for other tasks it will not.

What I was referring to in what you quoted was not related to cache but to execution pipelines. The execution pipelines are not duplicated for each context on a Xenon core, and thus contexts must share the pipelines. This is why both HW threads can't be executing at all times... sometimes instructions from both HW threads will need to flow down the same pipeline, and thus one of them must wait until the other is done with the resource. I don't see it being easy to guarantee one thread uses the even pipe and the other thread uses the odd pipe, especially if the threads are working toward the same goal. It is easier to just use one primary thread of execution and let it go, and either have an unrelated task that you can tolerate taking longer to complete run alongside it, or have the other thread be complementary to the primary thread altogether. A sort of worker/helper thread arrangement, as I think MS suggests, is best. For instance, a task that goes out to memory a lot but still requires a lot of computation could be split into two, where you have the helper fill the cache by going out to memory and blocking until the request is serviced, allowing the other thread to churn away at getting the job done.
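(A rough, hedged sketch of that worker/helper split in portable C++, using std::thread rather than any console API; the names fetch_chunk, process_chunk and kChunks are made up for illustration. The helper eats the memory latency and double-buffers the data, while the worker only ever touches chunks that are already resident.)

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kChunks = 8;                 // how many chunks the "job" consists of
constexpr size_t kChunkSize = 16 * 1024;   // elements per chunk

std::vector<float> buffer[2];              // double buffer shared by both threads
std::atomic<int> filled[2] = {0, 0};       // 1 = helper filled the slot, worker may consume it

// Stand-in for the slow, memory/IO-bound fetch the helper blocks on.
void fetch_chunk(std::vector<float>& dst, int n) {
    dst.assign(kChunkSize, float(n));
}

// Stand-in for the compute-heavy work the worker churns through.
float process_chunk(const std::vector<float>& src) {
    float sum = 0.f;
    for (float v : src) sum += v;
    return sum;
}

void helper() {                            // keeps data flowing into the buffers
    for (int n = 0; n < kChunks; ++n) {
        int i = n & 1;
        while (filled[i].load()) {}        // wait until the worker has consumed this slot
        fetch_chunk(buffer[i], n);
        filled[i].store(1);
    }
}

void worker() {                            // does the actual computation on resident data
    for (int n = 0; n < kChunks; ++n) {
        int i = n & 1;
        while (!filled[i].load()) {}       // wait for the helper to fill this slot
        std::printf("chunk %d -> %.1f\n", n, process_chunk(buffer[i]));
        filled[i].store(0);
    }
}

int main() {
    std::thread h(helper), w(worker);
    h.join();
    w.join();
    return 0;
}

(On a real Xenon core the "fetch" side would be the thread that stalls on L2 misses or kicks off streaming reads, but the shape of the arrangement is the same: one thread absorbs the memory latency, the other absorbs the ALU time.)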

I just find it difficult to see six threads executing all at the same time without there being contention for exec units in one pipeline or the other. I'm betting the compiler won't save you, and you'd certainly have a hard time of it getting it done yourself "by design" in code.

edit: What I said was tangential anyway, and a response out of misunderstanding, Shifty. I don't mean to change the direction of the discussion.
 
I agree that you're not going to have 6 threads executing simultaneously the majority of the time (even with instruction interleaving, if it's there), but I was just wondering about the "333K of cache on average per thread" statement. If you have just 3 threads, yes, but if you have more than that to switch between, there likely won't be an average of 333K for each of the executing threads. At least I don't think so? You'll still need to keep data that the other blocked threads are using in cache, there still might be data coming into cache for them while they're blocked (in fact they may be blocked waiting for that data), etc. That's really all I was wondering about, sorry if I'm not being totally clear.
 
Titanio said:
I agree that you're not going to have 6 threads executing simultaneously the majority of the time (even with instruction interleaving, if it's there), but I was just wondering about the "333K of cache on average per thread" statement. If you have just 3 threads, yes, but if you have more than that to switch between, there likely won't be an average of 333K for each of the executing threads. At least I don't think so? You'll still need to keep data that the other blocked threads are using in cache, there still might be data coming into cache for them while they're blocked (in fact they may be blocked waiting for that data), etc. That's really all I was wondering about, sorry if I'm not being totally clear.

Don't apologize to me. It's not as if I know what I'm talking about :)

I only went with the 333 split because that was the simplest between three cores, and then I assumed threads on a core would co-operate vs. being given their own piece of the cache pie. If cache lines aren't locked, then it's sort of a matter of... luck whether the data in cache is usable by threads as they switch, if one hasn't optimized the code to ensure that to be the case. There would be nothing stopping one thread from wiping all the data another thread "was" using from the cache altogether. Given the nature of console programming, I'm sure not much will be left up to luck if it can be helped though.

(When IO finishes for a thread in wait, the thread would be placed on the ready list and executed before that data is allowed to be touched or moved... otherwise IO could block a thread indefinitely if the data were allowed to be purged before it could be used, because the thread would continually ask for the data from system memory.)
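(A toy sketch of the invariant in that parenthetical, using ordinary C++ synchronization rather than any real scheduler API; every name here is hypothetical. The waiter is only released once its IO has completed, and the buffer cannot be recycled until the waiter has actually consumed it.)

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
std::vector<char> io_buffer;    // data the waiting thread asked for
bool io_done = false;           // set when the "IO" completes
bool consumed = false;          // set once the waiter has used the data

void io_thread() {
    {
        std::lock_guard<std::mutex> lk(m);
        io_buffer.assign(4096, 'x');          // stand-in for a completed disk/memory read
        io_done = true;                       // "place the waiter on the ready list"
    }
    cv.notify_all();
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return consumed; });     // buffer may not be purged before this point
    io_buffer.clear();                        // only now is it safe to recycle the buffer
}

void waiting_thread() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return io_done; });      // woken once the request is serviced
    char first = io_buffer.front();           // use the data while it is guaranteed resident
    (void)first;
    consumed = true;
    lk.unlock();
    cv.notify_all();
}

int main() {
    std::thread a(io_thread), b(waiting_thread);
    a.join();
    b.join();
    return 0;
}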
 
scificube said:
What I was referring to in what you quoted was not related to cache but to execution pipelines. The execution pipelines are not duplicated for each context on a Xenon core, and thus contexts must share the pipelines. This is why both HW threads can't be executing at all times... sometimes instructions from both HW threads will need to flow down the same pipeline, and thus one of them must wait until the other is done with the resource.


Btw. is it now officially confirmed that the CELL PPU has double the execution logic since DD2.1, compared to Xenon's cores? Or is this still considered to be a rumor? If so, at least this problem should not be as bad on the CELL PPU, making it better suited for multithreading?
 
Jesus2006 said:
Btw. is it now officially confirmed that the CELL PPU has double the execution logic since DD2.1, compared to Xenon's cores? Or is this still considered to be a rumor? If so, at least this problem should not be as bad on the CELL PPU, making it better suited for multithreading?

I've seen no such confirmation or even that suggestion before. Personally, I'd doubt it until I saw solid proof otherwise.
 
scificube said:
I've seen no such confirmation or even that suggestion before. Personally, I'd doubt it until I saw solid proof otherwise.

It's been discussed here a lot in the past. There's also an article on realworldtech.com about it. At least the size of the CELL PPU grew considerably from DD1.0 to DD2.1+, and it's in the region of the VMX units... :)
 
Here's the link:

http://www.realworldtech.com/page.cfm?ArticleID=RWT072405191325&p=1

[Attached images: cell3-1.gif, cell3-2.gif, cell3-4.jpg]
 
Yes, we've seen those images for a while now and been aware that the PPE has grown in size since DD1. However, there's been no public announcement as to true SMP on the PPU in Cell. I would also point out that the 7-SPE Cells are rejects of the 8-SPE Cells used in the PC space. Cells in IBM's blade servers do not feature true SMP on the PPU, and thus there is no duplication of pipelines. What exactly happened from DD1 to DD2 is not public knowledge, but I'm fairly sure it was not a duplication of pipelines.
 
scificube said:
Yes, we've seen those images for a while now and been aware that the PPE has grown in size since DD1. However, there's been no public announcement as to true SMP on the PPU in Cell. I would also point out that the 7-SPE Cells are rejects of the 8-SPE Cells used in the PC space. Cells in IBM's blade servers do not feature true SMP on the PPU, and thus there is no duplication of pipelines. What exactly happened from DD1 to DD2 is not public knowledge, but I'm fairly sure it was not a duplication of pipelines.

Yes; on the other hand, there is an interview with CryTek about the SMT abilities of CELL and Xenon where it's said that Xenon's multithreading is comparable to a P4's hyperthreading, while CELL's is a little better than that. Might be an indication of more execution hardware...

This is the interview (my translation from www.gamestar.de) :)

The 360's solution is similar to hyperthreading. In principle, it's 3 CPUs with 2 hyperthreads each. If you ask the hardware vendors, they would deny this. But if you analyse it as a software developer, it's nothing other than hyperthreading. That means you've got 6 threads, but actually it's only like 3 times 1.5 threads. On PS3 it's different with CELL: the main CPU [PPU] has 2 threads (a little better than hyperthreading) and in addition the seven synergistic processors. The eighth SPU, which is still visible in the design, has been removed.
 
Refer to the BE Handbook you can download from IBM's website.

10. PPE Multithreading

"To software, the PPE implementation of multithreading looks similar to a multiprocessor implementation, but there are several important differences....."

"Since most of the PPE hardware is shared by the two threads of execution, the hardware cost of the PPE multithreading implementation is dramatically lower than the cost of replicating the entire processor core. The PPE’s dual-threading typically yields about one-fifth the performance increase of a dual-core implementation, but the PPE achieves this 10%-to-30% performance boost at only one-twentieth of the cost of a dual-core implementation."


10.1 Multithreading Guidelines

"In the PPE multithreading implementation, the two hardware threads share execution resources.
Consequently, concurrent threads that tend to use different subsets of the shared resources will
best exploit the PPE’s multithreading implementation."


;)
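(A hedged illustration of that 10.1 guideline in plain portable C++, not PPE intrinsics; both function names are invented. These are two loops that lean on different execution resources, which is the kind of pairing the handbook says exploits the shared-resource design best.)

// Leans mostly on the floating-point/VMX side of the core.
float fp_heavy(const float* a, const float* b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

// Leans mostly on the integer and load/store side (dependent table lookups).
int lookup_heavy(const int* table, const int* idx, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += table[idx[i]];
    return acc;
}

(Running fp_heavy on one hardware thread and lookup_heavy on the other should interleave better than running two copies of either, since they contend less for the same issue slots and execution units.)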
 
scificube said:
While it's not public knowledge what exactly happened from DD1 to DD2 to whatever, I'm fairly sure it was not a move to full SMP on the PPU. There may or may not be more exec units in the PPU on Cell than on Xenon, allowing for more instances where each thread can successfully issue an instruction, but whatever Crytek is referring to, it does not strike me as being full-on SMP, as that is rather more than "slightly better than hyperthreading". Look no further than AMD's X2 vs. hyper-threaded P4s and you can see that... basically SMP is like adding another complete core altogether, which is what AMD did and Intel did not.

I know these docs, but maybe they (although it would be strange if DeanoC worked on one) refer to DD1 Cells? Or the ones that are used in IBM's blades, and not the PS3 ones? Because, as you said (I think), these Cells do not have that kind of SMT support compared to the PS3 Cell... but I might be wrong :)

Still, we have not heard anything regarding this; I guess it's because of NDAs, but if Deano or someone might jump in here we'd really appreciate that ;)
 
scificube said:
10.1 Multithreading Guidelines

"In the PPE multithreading implementation, the two hardware threads share execution resources.
Consequently, concurrent threads that tend to use different subsets of the shared resources will
best exploit the PPE’s multithreading implementation."


;)

I've read that too :) But then again, as I said above, we do not clearly know which CELL revision this refers to.
 
DD1 never made it into the wild by all accounts. The move to DD2 came very swiftly, and DD1 likely won't be found in anything but the earliest of prototype Cell kits at IBM. I think DeanA was the source of this info, but someone definitely said this on this forum.
 
Shifty Geezer said:
DD1 never made it into the wild by all accounts. The move to DD2 came very swiftly, and DD1 likely won't be found in anything but the earliest of prototype Cell kits at IBM. I think DeanA was the source of this info, but someone definitely said this on this forum.

Hm, I see. Anyway, I remember some Xenon die shots which someone compared to the CELL PPU some time ago; any idea where that was? It would make comparisons easier (although we are running completely OT here :p )
 
Jesus2006 said:
Hm, I see. Anyway, I remember some Xenon die shots which someone compared to the CELL PPU some time ago; any idea where that was? It would make comparisons easier (although we are running completely OT here :p )
There's probably a better image out there but google is your friend, etc.

http://www.gigascale.org/mescal/maw/index.html

As you can see from the publicly released photos, the revamped PPE looks remarkably similar to the Xenon core when you compare the layout of the processor elements (with some obvious differences that were beaten to death in old posts), and it'd be surprising if IBM didn't use an updated, shared common-core ancestor (which may or may not find its way to market as a standalone for embedded use someday) for both.

I have no idea what CryTek was talking about, but the two systems use completely different compilers, so how code is scheduled and optimized is likely to be very different. It also seems more important on the 360 to maximize performance for single-threaded tasks (or at least for primary tasks, say the main game loop or physics) on a core, while on the PS3 you also have the SPEs to rely on, so the Cell compiler might be generating code that takes better advantage of the ability to run two threads while sacrificing a bit of performance for any single thread.
 
Jesus2006 said:
I don't think it's a good idea to restructure your data at runtime to make it fit the SPEs. That's something you should think about before you start programming for CELL :)

Yes, the general idea is to have the software designed specifically for the Cell architecture. I am not sure whether this is always possible due to external factors. I think in real life it's probably somewhere in between.

Sorry for reviving the confusion part of this thread. I'll keep quiet now. :)
 