No.

scificube said: Shifty, did you get some of your math wrong before?
Shifty Geezer said:
scificube said: I myself feel only 3 threads will dominate execution time, because each core shares pipes rather than duplicating them, so on average 3 of the six contexts will be blocked while the others are using the pipelines in each core. Staging is eliminated because contexts != instructions, so instructions belonging to different contexts can't be staged AFAIK. With this in mind, I think it may be better to say 3 threads would have 333K available to them on average.
Titanio said: When you switch threads, though, surely the cache used by the thread you're switching from doesn't necessarily all suddenly become available? I mean, if you're switching away from a thread because it's memory bound, there will be data coming into cache for it while your second thread is executing. Moreover, if you switch back and the data that was in cache for that thread previously is now gone, it'll possibly block on memory again very quickly.
Titanio said: I agree that you're not going to have 6 threads executing simultaneously the majority of the time (even with instruction interleaving, if it's there), but I was just wondering about the "333K of cache on average per thread" statement. If you have just 3 threads, yes, but if you have more than that and you're switching between them, there likely won't be an average of 333K for each of the executing threads. At least, I don't think so? You'll still need to keep data that the other blocked threads are using in cache, there might still be data coming into cache for them while they're blocked (in fact they may be blocked waiting for that data), etc. That's really all I was wondering about, sorry if I'm not being totally clear.
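(For reference, the "333K" figure presumably just comes from dividing Xenon's 1 MB shared L2 by the three threads assumed to be actively executing; a rough back-of-the-envelope sketch, ignoring the blocked threads' resident data, which is exactly the point Titanio is questioning:)

    /* Sketch: per-thread share of Xenon's 1 MB shared L2, assuming only
       the actively executing threads count and the cache splits evenly.
       Blocked threads' working sets are ignored here on purpose. */
    #include <stdio.h>

    int main(void)
    {
        const int l2_kb = 1024;                      /* Xenon's shared L2 */
        for (int active = 3; active <= 6; ++active)
            printf("%d active threads -> ~%d KB each\n",
                   active, l2_kb / active);
        return 0;
    }
    /* 3 active threads -> ~341 KiB each (the "333K" ballpark if you
       call the cache 1000 KB); 6 active threads -> ~170 KiB each. */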
scificube said: What I was referring to in what you quoted was not related to cache but to execution pipelines. The execution pipelines are not duplicated for each context on a Xenon core, so the contexts must share the pipelines. This is why both HW threads can't be executed at the same time at all times... sometimes instructions from both HW threads will need to flow down the same pipeline, and one of them must wait until the other is done with the resource.
Jesus2006 said: Btw. is it now officially confirmed that the CELL PPU has had double the execution logic since DD2.1, compared to the Xenon cores? Or is this still considered a rumor? If so, at least this problem shouldn't be as bad on the CELL PPU, making it better suited for multithreading?
scificube said:I've seen no such confirmation or even that suggestion before. Personally, I'd doubt it until I saw solid proof otherwise.
scificube said: Yes, we've seen those images for a while now and have been aware that the PPE has grown in size since DD1. However, there's been no public announcement of true SMP on the PPU in Cell. I would also point out that the 7-SPE Cells are rejected 8-SPE Cells of the kind used in the PC space. The Cells in IBM's blade servers do not feature true SMP on the PPU, and thus there is no duplication of pipelines. What exactly happened from DD1 to DD2 is not public knowledge, but I'm fairly sure it was not a duplication of pipelines.
The 360's solution is similar to hyperthreading. In principle, it's 3 CPUs with 2 hyperthreads each. If you ask the hardware vendors, they would deny this, but if you analyse it as a software developer it's nothing other than hyperthreading. That means you get 6 threads, but actually it's only like 3 times 1.5 threads. On PS3 it's different with CELL: the main CPU [PPU] gets 2 threads (a little better than hyperthreading) and, in addition, the seven synergistic processors. The eighth SPU, which is still visible in the design, has been removed.
scificube said: While it's not public knowledge exactly what happened from DD1 to DD2 or whatever, I'm fairly sure it was not a move to full SMP on the PPU. There may or may not be more execution units in the PPU on Cell than on Xenon, allowing for more instances where each thread can successfully issue an instruction, but whatever Crytek is referring to, it does not strike me as full-on SMP, as that is rather more than "a little better than hyperthreading". Look no further than AMD's X2 vs. hyper-threaded P4s and you can see that... basically, SMP is like adding another complete core altogether, which is what AMD did and Intel did not.
scificube said: 10.1 Multithreading Guidelines
"In the PPE multithreading implementation, the two hardware threads share execution resources.
Consequently, concurrent threads that tend to use different subsets of the shared resources will
best exploit the PPE’s multithreading implementation."
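(A hedged illustration of that guideline, not anything from the docs themselves: the workloads and names below are made up, and thread-to-core pinning is omitted because it would be platform-specific. The idea is simply that co-scheduling an FP-heavy thread with an integer-heavy thread on one core's two hardware threads contends less than two copies of the same loop fighting over the same pipeline.)

    /* Sketch: two workloads intended to share one core's two hardware
       threads. They stress different execution pipelines (FPU vs.
       integer ALU), which is the "different subsets of the shared
       resources" the guideline is talking about. */
    #include <pthread.h>

    static void *fp_heavy(void *arg)          /* mostly FPU work */
    {
        (void)arg;
        volatile double x = 1.0;
        for (long i = 0; i < 100000000L; ++i)
            x = x * 1.000001 + 0.5;
        return NULL;
    }

    static void *int_heavy(void *arg)         /* mostly integer ALU work */
    {
        (void)arg;
        volatile unsigned int n = 1;
        for (long i = 0; i < 100000000L; ++i)
            n = n * 1664525u + 1013904223u;   /* simple LCG step */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, fp_heavy, NULL);
        pthread_create(&b, NULL, int_heavy, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }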
Shifty Geezer said: DD1 never made it into the wild by all accounts. The move to DD2 came very swiftly, and DD1 likely won't be found in anything but the earliest of prototype Cell kits at IBM. I think DeanA was the source of this info, but someone definitely said this on this forum.
There's probably a better image out there, but google is your friend, etc.

Jesus2006 said: Hm, I see. Anyway, I remember some Xenon die shots which someone compared to the CELL PPU some time ago, any idea where that was? It would make comparisons easier (although we are running completely OT here).
Jesus2006 said: I don't think it's a good idea to restructure your data at runtime to make it fit the SPEs. That's something you should think about before you start programming for CELL.
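(To illustrate the "decide the layout up front" point, a minimal sketch of the usual structure-of-arrays approach; the names and array sizes are invented for illustration, and the alignment attribute is GCC syntax as used by the Cell toolchains. The point is that each field is already contiguous, so a slice of it can be streamed into an SPE's local store with straightforward DMA instead of being reshuffled at runtime.)

    /* Array-of-structures: fields interleaved, awkward to DMA piecewise. */
    struct particle_aos {
        float x, y, z;
        float vx, vy, vz;
    };

    /* Structure-of-arrays: each field is contiguous and 16-byte aligned,
       so a block of positions (or velocities) maps directly onto one
       DMA transfer into an SPE's local store. */
    struct particles_soa {
        float x[1024]  __attribute__((aligned(16)));
        float y[1024]  __attribute__((aligned(16)));
        float z[1024]  __attribute__((aligned(16)));
        float vx[1024] __attribute__((aligned(16)));
        float vy[1024] __attribute__((aligned(16)));
        float vz[1024] __attribute__((aligned(16)));
    };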