How many developers are using all 3 cores for X360?

DeanoC said:
Most modern in-order processors (e.g. the MIPS R5900, Pentium) are "dual issue", which means that every cycle they can pick up two instructions out of a single stream (i.e. it's ILP, not TLP) and attempt to issue them to two execution units at once. Whether they can issue them at the same time depends on lots of things (for example register dependencies) but most importantly on whether you have two execution units free. If you only have one execution unit of a particular type (say VMX) you can never dual issue two instructions that both use that unit.

*ILP = Instruction-Level Parallelism
*TLP = Thread-Level Parallelism

Suddenly that makes everything make a whole lot more sense.

There is certainly a lot to talk about with these consoles.
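To put the ILP side of Deano's point in concrete terms, here's a hypothetical C++ sketch (my own illustration, not from his post; the function names are mine): the first loop is one long dependency chain, so a dual-issue core rarely has two independent instructions to pair, while the second interleaves two independent chains that can issue side by side.

```cpp
#include <cstddef>

// One long dependency chain: every add needs the previous sum, so a
// dual-issue core rarely finds a second independent instruction to pair.
float dot_serial(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];              // depends on the previous iteration
    return sum;
}

// Two independent chains: sum0 and sum1 never touch each other, so the
// core can (in principle) issue an instruction from each side by side.
float dot_paired(const float* a, const float* b, std::size_t n) {
    float sum0 = 0.0f, sum1 = 0.0f;
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        sum0 += a[i]     * b[i];         // chain 0
        sum1 += a[i + 1] * b[i + 1];     // chain 1
    }
    if (n % 2) sum0 += a[n - 1] * b[n - 1];  // odd element left over
    return sum0 + sum1;
}
```

Compilers can sometimes do this unrolling for you, but on in-order cores it often pays to write it explicitly.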
 
Wait...where was it confirmed that the PPE had two VMX units? And did you post that ppt before you censored it, Deano? If so, does anyone care to PM me the details of what he's referring to? ;) PEACE.
 
scificube said:
2. It's more related to having finite execution resources. There really aren't enough there for two threads to fire off at 3.2GHz. There is no restriction on what resources a thread can use beyond whether they are available or not...so in essence it may be impossible to ever have enough execution resources if one makes their code greedy...is this correct? Makes sense to me because if you did elect to use one thread...why shouldn't you have access to all available resources?
It doesn't quite work that way. In hardware, the "resources" that you're worried about are arranged to form a pipeline. With a superscalar architecture (of which dual-issue is a subset), the stages after instruction decoding fork into 2 or more execution pipelines. These pipes don't have to be identical (do the same function) or symmetrical (have the same length). For example, one may exclusively handle integer arithmetic while the other handles floats. Having same-length pipelines does make life easier though, and you usually see some delay slots injected purposely on shorter pipelines. It may help you to visualize this by looking at the diagram depicting the PPE pipeline in the MPR document linked a couple of posts above this one. Hannibal's piece on Xenon at Ars Technica describes the same diagram.

At any single instant in time an instruction from one thread is occupying one stage of one pipeline. If only one thread is running, then the counterpart stage on the alternate pipeline is idle. In an OOOE processor, that would-be-idle stage could likely have been populated by an instruction taken out-of-sequence from the same thread. In a multithreaded processor, the idle "resource" could/would be taken up by an instruction from the second thread.

So, the problem is not that two threads are fighting for control over resources (although that might also happen); rather, when one thread is currently using a "resource", there is another perfectly functioning resource sitting on its hands at that same point in time.
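A toy C++ model may help here (entirely invented instruction streams, and a deliberately crude rule that each thread can occupy at most one issue slot per cycle; a real dual-issue core can also pair two instructions from one stream, as Deano said above): the same twelve instructions finish in far fewer cycles when a second thread can fill the slots the first leaves idle.

```cpp
#include <cstdio>
#include <vector>

// Toy dual-issue model: two issue slots per cycle. A "thread" is a
// stream of instructions, some of which stall for one cycle (e.g.
// waiting on a load). Crude rule: each thread may take at most one
// slot per cycle, so a lone thread always leaves one slot idle.
struct Instr { bool stalls; };

static int run(std::vector<Instr> t0, std::vector<Instr> t1) {
    size_t i0 = 0, i1 = 0;
    int cycles = 0;
    while (i0 < t0.size() || i1 < t1.size()) {
        ++cycles;
        int slots = 2;                                  // dual issue
        if (i0 < t0.size()) {
            if (t0[i0].stalls) t0[i0].stalls = false;   // stall resolves next cycle
            else if (slots-- > 0) ++i0;                 // slot taken by thread 0
        }
        if (i1 < t1.size()) {
            if (t1[i1].stalls) t1[i1].stalls = false;
            else if (slots-- > 0) ++i1;                 // leftover slot -> thread 1
        }
    }
    return cycles;
}

int main() {
    std::vector<Instr> a = {{false},{true},{false},{false},{true},{false}};
    std::vector<Instr> b = a;
    std::vector<Instr> ab = a;                          // same work as one stream
    ab.insert(ab.end(), b.begin(), b.end());
    std::printf("one thread : %d cycles\n", run(ab, {}));
    std::printf("two threads: %d cycles\n", run(a, b));
}
```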

3. Even if there aren't enough resources to go around, there is still benefit in having enough registers in the HW to save both threads' state/PCB, so that if an opportunity arises a very fast context switch between threads can take place, reducing overhead and adding to how much useful work can be done in a finite amount of time.

So...logical threads boil down to HW support for faster context switching and/or if threads aren't greedy they can run full blast?
Going along with my earlier paragraph, there are (almost) always enough resources to go around, and we want to make sure that they are always being utilized. There is no context switch involved; both contexts are "alive". That's why it appears to the OS that there are two logical processors. What multi-threading boils down to is this: when an instruction from one thread is going down one execution pipeline, the chances are good that another instruction from the other thread may want to use the other pipeline. When that happens you are fully utilizing the processing resources made available to you; otherwise the idle execution units are just dead weight.
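From the software side, that "both contexts are alive" property shows up simply as a larger processor count reported by the OS. A minimal sketch using portable C++11 threads (rather than whichever console-specific API a real title would use):

```cpp
#include <cstdio>
#include <thread>

void worker(int id) {
    // Integer-heavy work here could overlap with FP-heavy work on the
    // sibling hardware thread, each filling the other's idle slots.
    std::printf("thread %d running\n", id);
}

int main() {
    // On an SMT machine this counts logical processors, so one
    // physical core with two hardware threads reports as 2.
    std::printf("logical processors: %u\n",
                std::thread::hardware_concurrency());

    std::thread t0(worker, 0), t1(worker, 1);   // two "alive" contexts
    t0.join();
    t1.join();
}
```

A real 360 title would use the platform's own thread APIs, but the principle is the same: the scheduler just sees six logical processors across the three cores.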
 
For information on hyperthreading, superthreading, multithreading and how an instruction pipeline works, please look at this Ars Technica article.

A lot of good information there; that's where I first cut my teeth on this topic.
 
This comes from a supposed MS employee posting on TXB. Please don't hold me responsible if this guy turns out to be fake, but what he has to say about GOW is relevant to this topic and seems very interesting, to say the least...

A couple folk have been curious as to whether or not the use of only one core really affects graphics, or if it's more an interesting fact due to the possibilities it presents with physics and AI usage.

The short answer is, MS was right - general purpose processing is the way to go, because no one has a strong idea of how best to utilize the multi-threading.

The longer answer has to do with what Cliffy doesn't mention. UE3 code isn't only using one core, it's only using one thread of one core. It's still single threaded code. There's SOOOOooo much more they can do with this CPU.

Currently, none of the X360 games I've investigated are doing any real multi-threading... and what happens is the one thread on the one core being used gets maxed out, and then the GPU has to wait on the CPU... so when games start really using the CPU, the GPU will have fewer idle cycles, and games can start looking much, much better.

Point is - graphics will indeed go up, potentially very significantly, once the CPU is better utilized.

Additionally, the GPU works so closely with the CPU in X360 that it is very possible some of the graphics work could be offloaded onto the CPU, but honestly - I don't know exactly how that could be done, and won't pretend to know how... but there are people smarter than me working on it all the time, and I wouldn't doubt that it will happen.

What interests me most in all of this is the concept of multithreaded code. I can't wait to use some of the diagnostic tools we've got, and see developers milking all 6 threads, and seeing the GPU idle cycles diminish to 0-5% average... it's just ridiculous to think about what this machine can do.

...but again, multithreaded code is much more difficult than it sounds. "Doing physics on one core" seems like such a relatively easy thing to implement to many, but in practice - it's really, really challenging. Far more so on something like the PS3, but still - it's really challenging to have different code running on different processors simultaneously, keeping it all in sync, and then debugging/optimizing on top of it all... looking at a 6-part call stack (one per thread) is NOT a trivial, non-intimidating reality of debugging for any developer that's going to actively use all 6 threads.

Anyway, I'll stop my ramblings now, but suffice to say, yeah - when all 6 threads get used efficiently, we'll see some phenomenally amazing titles.

http://forum.teamxbox.com/showpost.php?p=5939801&postcount=93
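To give a flavour of what the quoted poster means by physics-on-another-core being "really, really challenging", here is a hypothetical double-buffered sketch in C++11 (my own toy arrangement, not how UE3 or any shipping engine does it): even this minimal split needs a per-frame handshake over whose copy of the state is readable.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct Body { float x, vx; };

int main() {
    std::vector<Body> buf[2] = { std::vector<Body>(100, {0.f, 1.f}),
                                 std::vector<Body>(100, {0.f, 1.f}) };
    int readable = 0;              // buffer the game thread may read
    bool ready = false, quit = false;
    std::mutex m;
    std::condition_variable cv;

    std::thread physics([&] {
        while (true) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !ready || quit; });
            if (quit) return;
            int w = 1 - readable;
            buf[w] = buf[readable];               // start from last published state
            lk.unlock();
            for (Body& b : buf[w]) b.x += b.vx;   // integrate one step
            lk.lock();
            readable = w;                         // publish the new frame
            ready = true;
            cv.notify_one();
        }
    });

    for (int frame = 0; frame < 5; ++frame) {     // "game" thread: render, AI...
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return ready; });
        std::printf("frame %d: body0.x = %.1f\n", frame, buf[readable][0].x);
        ready = false;                            // hand the baton back
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); quit = true; }
    cv.notify_one();
    physics.join();
}
```

Multiply that handshake by AI, animation, audio and streaming all wanting consistent views of the same state, and the point about debugging a 6-part call stack starts to bite.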
 
A single efficient thread can make good use of a core. Adding a second thread alongside could net negligible performance improvements. Also, the poster's statement
The short answer is, MS was right - general purpose processing is the way to go, because no one has a strong idea of how best to utilize the multi-threading.
is meaningless gibberish. General Purpose code has sod all to do with multi-threading. You can have multi-threaded GP programs and single-threaded vector crunching. I also don't see why having a PS3 core doing physics would be any harder to integrate than having an XeCPU core doing physics. Both will be processing the same data and need to integrate the results with the rest of the program. Both offer similar solutions to achieve this.

Colour me skeptical.
 
Shifty Geezer said:
A single efficient thread can make good use of a core. Adding a second thread alongside could net negligible performance improvements. Also, the poster's statement

is meaningless gibberish. General Purpose code has sod all to do with multi-threading. You can have multi-threaded GP programs and single-threaded vector crunching. I also don't see why having a PS3 core doing physics would be any harder to integrate than having an XeCPU core doing physics. Both will be processing the same data and need to integrate the results with the rest of the program. Both offer similar solutions to achieve this.

Colour me skeptical.

What he means is that at least as far as his game goes, a General Purpose core is better than an SPE.
 
It seems obvious to me that 9 threads would be harder to manage simultaneously than 6.

Also, the PS3's SPUs are specialized, while the PPE is general purpose, so you have two different types of processors that require different types of optimizations. XeCPU has 3 identical cores.

Seems like simple common sense to me. I think it's a no-brainer that the XeCPU is simpler and easier to program for than the CELL will be.

And that's not even considering the fact that MS probably has better software for the developers.
 
scooby_dooby said:
It seems obvious to me that 9 threads would be harder to manage simultaneously than 6.

Also, the PS3's SPUs are specialized, while the PPE is general purpose, so you have two different types of processors that require different types of optimizations. XeCPU has 3 identical cores.

Seems like simple common sense to me. I think it's a no-brainer that the XeCPU is simpler and easier to program for than the CELL will be.

And that's not even considering the fact that MS probably has better software for the developers.

To me the question is, if it is easier, then how much easier is it than the PS3? I don't think we can easily assume the X360 is WAAAY easier than the PS3. It may be easier; I would like to know by how much and why.
 
You know, if you really want to get into discussions of "X is easier than Y", you really need to be able to define "easy" in some measurable way.

Agree on a metric, and then you have a possible discussion.

If not, it's just going to turn into a pissing contest, as demonstrated by the many other threads that have done so.
 
Shifty Geezer said:
A single efficient thread can make good use of a core. Adding a second thread alongside could net negligible performance improvements.
I'm sorry, but you're quite wrong. Why would Deano suggest thinking of the Cell's PPE as 2 1.6 GHz processors otherwise? Why would anyone have SMT if you are right?

Also the poster's statement

is meaningless gibberish. General Purpose code has sod all to do with multi-threading. You can have multi-threaded GP programs and single-threaded vector crunching. I also don't see why having a PS3 core doing physics would be any harder to integrate than having an XeCPU core doing physics. Both will be processing the same data and need to integrate the results with the rest of the program. Both offer similar solutions to achieve this.
The point, which I thought was clear, is that trying to tap XeCPU is easier than tapping Cell when you have no real plan for multithreading. And that is true, from certain points of view. Holding one point of view does not make others "meaningless gibberish."
 
SPYRTEK and CLIMAX are both using all 3 cores of the X360. They were largely disappointed that Xenon has nothing new in it and that each core cannot even run 2 full threads (1.5, according to them). They regretted that they were not even close to utilizing even 50% of Cell's power, but I believe what CLIMAX said: that they would eventually come up with something jaw-dropping on PS3.

Sceptics, don't just argue with my claims... I have already posted the interviews with CLIMAX and SPYRTEK, along with the IBM article showing Cell has 128 128-bit registers whereas the X360's counterparts number just 32, in the thread CELL vs XENON.
 
The PPE is the only dual-threaded processor that can run 2 full threads at optimum speed, and it has slightly better multithreading capabilities... each Xenon core can run only 1.5 according to them, and has the same multithreading capabilities as normal PC processors... about the SPEs, they also said that those are 100% efficient and cannot be matched by either Xenon or PC-based processors...
X360's CPU has 160 million transistors; Cell has 234 million transistors... although we know that X360 has allocated 100 million transistors for eDRAM, Sony has so far given no stats for the superior XDR DRAM used in PS3... we wait till TGS for the official info.
 
scooby_dooby said:
It seems obvious to me that 9 threads would be harder to manage simultaneously than 6.
The guy was talking about the difficulties in working with multithreading, saying "Doing physics on one core" seems like such a relatively easy thing to implement to many, but in practice - it's really, really challenging.

He wasn't talking about harnessing all cores but giving an example that even something 'relatively straightforward' like moving the physics onto a separate core isn't easy. And he said it was less easy on PS3. He wasn't talking about the difficulties in writing for SPEs, but the difficulties in multithreading, and so presumably his observation on PS3 is 'it's harder to multithread on Cell'. It'd be a pointless comment if what he meant was 'multithreading is hard. Even moving one thread like physics onto a separate core isn't easy. And PS3's SPEs are a pain to write for.' In that case the comment on PS3 is totally out of context. Ergo he's commenting on the increased difficulty in parallelising workloads on PS3, and I don't know why that would be harder. Unless perhaps his concerns were the memory management? But it's expected XeCPU needs pretty low-level management too, so I still can't see any difference in the inherent difficulty of getting the two systems to multithread.
 
Inane_Dork said:
I'm sorry, but you're quite wrong. Why would Deano suggest thinking of the Cell's PPE as 2 1.6 GHz processors otherwise? Why would anyone have SMT if you are right?
2 1.6 GHz cores would be a slower system than 1 3.2 GHz core. SMT offers efficiency improvements in using a CPU's resources. The idea is that when one thread stalls, the other can jump in and be worked on until the first thread is picked up again. IIRC the average improvement for SMT on a P4 is something like 15%. Now with an in-order CPU in a closed-box system, where the devs are already using careful cache management, there's scope in theory to keep the CPU running pretty much full whack on the one thread. I'm not sure how the second thread can use unused resources concurrently though. If a second thread can be executed on the VMX unit and run in parallel with the generic code of the first thread, that'll have an obvious advantage. TTBOMK that's not possible, but I'm hazy on the subject.

But when the guy says 'we're only using ONE thread, just wait 'til we use TWO!!' he's talking about maybe a 10% improvement in performance? My point is that adding a second thread doesn't do something magical like double the power. A single thread doesn't use half the capacity of a core. A single thread on PPE or XeCPU is NOT equivalent to a 1.6 GHz core but to a 3.2 GHz core with some holes in execution. Dual threading on that core is equivalent to a 3.2 GHz core with far fewer holes.
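A rough way to see where that SMT gain comes from (a hypothetical micro-benchmark; the numbers depend entirely on the machine, and actually pinning both threads to one SMT core would need OS-specific affinity calls I've left out): a pointer-chasing loop is mostly memory-stall "holes", which is exactly what a second hardware thread can fill.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Pointer chasing: every load depends on the previous one, so the core
// spends most cycles stalled on memory -- the "holes in execution".
size_t chase(const std::vector<size_t>& next, size_t steps) {
    size_t i = 0;
    for (size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}

int main() {
    const size_t N = 1 << 22, STEPS = 1 << 24;
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937(42)); // random walk

    auto time_threads = [&](int nthreads) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> ts;
        for (int t = 0; t < nthreads; ++t)
            ts.emplace_back([&] { volatile size_t r = chase(next, STEPS); (void)r; });
        for (auto& t : ts) t.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    };

    // If both threads share one SMT core, the two chases overlap their
    // stalls and finish in much less than 2x the single-thread time.
    std::printf("1 thread:  %.2fs\n", time_threads(1));
    std::printf("2 threads: %.2fs\n", time_threads(2));
}
```

Whether that shows up as 10% or 40% depends entirely on how many holes the single-threaded code left in the first place.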

The point, which I thought was clear, is that trying to tap XeCPU is easier than tapping Cell when you have no real plan for multithreading. And that is true, from certain points of view. Holding one point of view does not make others "meaningless gibberish."
Of course it doesn't. But saying 'we should use general purpose cores because we don't know how to multithread' is nonsense. It's like saying 'all cars should be diesels because we don't know how to drive'. Whether a CPU is GP or not (and for GP read 'optimized for unstructured memory accessing') isn't at all related to multithreading, just as one's ability to drive isn't related to the type of engine in the car. Adding GP doesn't mitigate any of the demands of multithreading, doesn't make it easier, and doesn't provide an alternative so that multithreading isn't necessary.

PS : If anyone sees Deano please give him a smack! That slide is causing no end of confusion! He must post an explanation before the whole internet grows up to think weird things about multithreaded cores. :p
 
Shifty Geezer said:
2 1.6 GHz cores would be a slower system than 1 3.2 GHz core. SMT offers efficiency improvements in using a CPU's resources. The idea is that when one thread stalls, the other can jump in and be worked on until the first thread is picked up again. IIRC the average improvement for SMT on a P4 is something like 15%. Now with an in-order CPU in a closed-box system, where the devs are already using careful cache management, there's scope in theory to keep the CPU running pretty much full whack on the one thread. I'm not sure how the second thread can use unused resources concurrently though. If a second thread can be executed on the VMX unit and run in parallel with the generic code of the first thread, that'll have an obvious advantage. TTBOMK that's not possible, but I'm hazy on the subject.

But when the guy says 'we're only using ONE thread, just wait 'til we use TWO!!' he's talking about maybe a 10% improvement in performance? My point is that adding a second thread doesn't do something magical like double the power. A single thread doesn't use half the capacity of a core. A single thread on PPE or XeCPU is NOT equivalent to a 1.6 GHz core but to a 3.2 GHz core with some holes in execution. Dual threading on that core is equivalent to a 3.2 GHz core with far fewer holes.

Of course it doesn't. But saying 'we should use general purpose cores because we don't know how to multithread' is nonsense. It's like saying 'all cars should be diesels because we don't know how to drive'. Whether a CPU is GP or not (and for GP read 'optimized for unstructured memory accessing') isn't at all related to multithreading, just as one's ability to drive isn't related to the type of engine in the car. Adding GP doesn't mitigate any of the demands of multithreading, doesn't make it easier, and doesn't provide an alternative so that multithreading isn't necessary.

PS : If anyone sees Deano please give him a smack! That slide is causing no end of confusion! He must post an explanation before the whole internet grows up to think weird things about multithreaded cores. :p

Imo, the PS3 dev model will be easier to understand, or more efficient (which doesn't necessarily mean easier to program for, but who knows? :) ).

Basically, what you have on CELL is one dedicated controller unit (e.g. one thread of the PPU) whose task is to control and feed the SPEs. It's much like a master-slave(s) arrangement, quite different from "dual-coring" as it is done on Xenon (or even worse, "triple-coring", in this case).

On Xenon you have 3 almost identical cores (almost, because cores 1 & 2 are also doing driver stuff), which have to fight for resources (primarily cache), and the programmer has to build up a good multithreaded game engine whose threads don't interfere too much with one another, otherwise you get unpredictable slow-downs because of bad synchronization. That's a serious bottleneck imo that does not exist in this way on a CELL processor.
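A crude way to picture that master/slave arrangement in ordinary C++ (purely schematic: real SPE code runs out of local store via DMA, which this ignores entirely, and the Job structure is invented): one control thread creates and feeds self-contained jobs to workers that never touch shared game state directly.

```cpp
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Job { int id; };            // stand-in for a self-contained work packet

std::deque<Job> jobs;
std::mutex m;
std::condition_variable cv;
bool done = false;

void slave(int worker) {           // stand-in for an SPE: pull, crunch, repeat
    while (true) {
        Job j;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return !jobs.empty() || done; });
            if (jobs.empty()) return;          // drained and master is done
            j = jobs.front();
            jobs.pop_front();
        }
        std::printf("worker %d ran job %d\n", worker, j.id);
    }
}

int main() {                       // the "PPU" master: create and feed jobs
    std::vector<std::thread> workers;
    for (int w = 0; w < 6; ++w) workers.emplace_back(slave, w);
    for (int id = 0; id < 24; ++id) {
        { std::lock_guard<std::mutex> lk(m); jobs.push_back({id}); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
    for (auto& t : workers) t.join();
}
```

The appeal is that synchronisation lives in exactly one place, the queue, rather than being spread across every pair of communicating threads.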
 
Both CPUs are going to be trying to manage the same synchronisation challenges. If you're updating physics, AI, model animations, sounds and game code between 5 threads, you're going to need to make sure they're communicating as needed. In this case the player shoots a cannonball: the physics determines the physical effects of the cannonball hitting objects, the AI needs to react to the player's actions, and the sound engine needs to play audio to match what is happening, and where it's happening, with 3D audio. I don't think either platform has an inherent advantage in this synchronisation. As far as I'm aware that's all down to software design. Perhaps Cell becomes more complicated when more threads are used, but other than that, architecturally I don't see there's anything to choose between Cell and XeCPU when it comes to multithreading.

And that said, as I mentioned elsewhere, you may not want to parallelise the execution of your tasks, but instead would be better off executing each task serially across all available units, if the task can be so parallelised. It'll be interesting to see which techniques devs use and develop for XB360 and PS3.
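As a sketch of that second option, executing each task serially but spreading its data across all units (a generic fork/join illustration of my own, not either console's actual job system; the helper name is invented):

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Run ONE task data-parallel across all workers; the join at the end
// is the barrier that keeps the tasks themselves serial.
void runAcrossAllCores(const std::function<void(size_t, size_t)>& task,
                       size_t items, unsigned workers) {
    std::vector<std::thread> ts;
    size_t chunk = (items + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = w * chunk;
        size_t end = std::min(items, begin + chunk);
        if (begin < end) ts.emplace_back(task, begin, end);
    }
    for (auto& t : ts) t.join();   // every unit finishes before the next task
}

int main() {
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<float> bodies(10000), agents(2000);

    // Task 1: physics over every core...
    runAcrossAllCores([&](size_t b, size_t e) {
        for (size_t i = b; i < e; ++i) bodies[i] += 1.0f;
    }, bodies.size(), cores);

    // ...then task 2: AI over every core. Tasks serial, data parallel.
    runAcrossAllCores([&](size_t b, size_t e) {
        for (size_t i = b; i < e; ++i) agents[i] *= 0.5f;
    }, agents.size(), cores);
}
```

The join at the end of each task doubles as the synchronisation point, which sidesteps much of the cross-thread communication worried about above.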
 