How many developers are using all 3 cores for X360?

onetimeposter said:
.....so I'm wondering how many games there are that use all 3 cores on the Xbox 360. Unlike the PS3, where you have to use some if not all of the SPEs instead of only the core PPC, Xbox 360 games can be made on one core only, so which ones are using all 3 of them?

IMHO the answer to your question is that no one knows. Good question though.
 
Here's what I gather so far:

Looking at the PPE as two 1.6GHz CPUs could mean:

1. This is more of a way to look at them than an actual representation of performance. Given dependencies, latencies, etc., it will be hard to get both threads doing the same amount of work as two 3.2GHz CPUs. It's more of a "don't worry if you can't get the PPE to do magic" kind of thing...

2. It's more related to having finite execution resources. There really aren't enough there for two threads to both fire off at 3.2GHz. There is no restriction on what resources a thread can use beyond whether they are available or not...so in essence it may be impossible to ever have enough execution resources if one makes their code greedy...is this correct? Makes sense to me, because if you did elect to use one thread...why shouldn't you have access to all available resources?

3. Even if there aren't enough resources to go around, there is still a benefit in having enough registers in the HW to save both threads' state/PCB, so that if an opportunity arises a very fast context switch between threads can take place, reducing overhead and adding to how much useful work can be done in a finite amount of time.

So...logical threads boil down to HW support for faster context switching, and/or if the threads aren't greedy they can both run full blast?

If it's just faster context switching I still see the value in the optimization, but I feel I've been duped into thinking a lot more work could be done than really could be with these CPUs.

This could explain why the X2 can dust a P4 with HT. The X2 utilizes two HW threads (one per core) while the P4's logical threads are at a disadvantage.

Question:

Can a thread in Cell use both VMX units at the same time, with the catch that a switch would be more expensive because each VMX unit only has one set of registers? This in contrast to MS's VMX unit, which has extra registers it could use to save state for faster switching?

...I'm not sure I'm getting any closer to understanding anything...I'm not stupid am I?
 
scificube said:
Question:

Can a thread in Cell use both VMX units at the same time, with the catch that a switch would be more expensive because each VMX unit only has one set of registers? This in contrast to MS's VMX unit, which has extra registers it could use to save state for faster switching?

...I'm not sure I'm getting any closer to understanding anything...I'm not stupid am I?

Why do context switching when you have two independent hardware VMX units, each capable of executing one thread independently of the other (in contrast to Xenon)?

The thing is, the CELL PPE has 2 hardware VMX units on the die whereas Xenon only has one per core, with the difference that the single Xenon VMX unit has 128 registers, and each of the CELL PPE's VMX units has its own 32 registers (=2x32). So there is no context switching. The only drawback of the CELL PPE is the shared L2 cache, but even that is much worse on the 360's Xenon.
 
I might add, the original CELL PPE design and also the first revisions (maybe that's what Deano Calver speaks about?) only had one VMX unit, so they could only "hyperthread" 2 threads, just like each Xenon core can do. But Sony changed this and increased the die size by adding a second hardware VMX unit, so that now two real threads can be run independently. Guess why they've done that? ;)
 
Nemo80 said:
Why do context switching when you have two independent hardware VMX units, each capable of executing one thread independently of the other (in contrast to Xenon)?

The thing is, the CELL PPE has 2 hardware VMX units on the die whereas Xenon only has one per core, with the difference that the single Xenon VMX unit has 128 registers, and each of the CELL PPE's VMX units has its own 32 registers (=2x32). So there is no context switching. The only drawback of the CELL PPE is the shared L2 cache, but even that is much worse on the 360's Xenon.

Was it ever confirmed that the Cell PPE had 2 VMX units? Maybe two sets of registers, but 2 VMX units would imply nearly double the FP performance for the PPE compared to what STI has quoted to date. If they could quote more FP performance I'm sure they would :p

(Its performance would be (8+8+4) * 3.2GHz, vs the presumed (8+4) * 3.2GHz, no?)
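
To put rough numbers on that, here is a quick back-of-the-envelope sketch. The per-cycle figures (8 and 4) are just the ones from the line above; reading them as "8 FLOPs per cycle per VMX unit, 4 per cycle for the scalar FPU" is my assumption, not anything STI has published, and `peak_gflops` is a made-up helper name.

```python
# Peak FLOPs-per-cycle figures are taken from the post above.
# ASSUMPTION: 8 = one VMX unit, 4 = the scalar FPU. Not an official STI number.
CLOCK_GHZ = 3.2

def peak_gflops(flops_per_cycle_by_unit, clock_ghz=CLOCK_GHZ):
    """Sum the per-unit FLOPs/cycle and scale by the clock in GHz to get GFLOPS."""
    return sum(flops_per_cycle_by_unit) * clock_ghz

one_vmx = peak_gflops([8, 4])      # presumed PPE: one VMX unit + FPU
two_vmx = peak_gflops([8, 8, 4])   # hypothetical PPE: two VMX units + FPU

print(f"one VMX: {one_vmx:.1f} GFLOPS")
print(f"two VMX: {two_vmx:.1f} GFLOPS")
print(f"ratio:   {two_vmx / one_vmx:.2f}x")
```

So under those assumptions a second VMX unit would raise the theoretical peak by about two thirds rather than doubling it, which is still far more than anything quoted so far.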

scificube said:
2. It's more related to having finite execution resources. There really aren't enough there for two threads to both fire off at 3.2GHz. There is no restriction on what resources a thread can use beyond whether they are available or not...so in essence it may be impossible to ever have enough execution resources if one makes their code greedy...is this correct? Makes sense to me, because if you did elect to use one thread...why shouldn't you have access to all available resources?

...

If it's just faster context switching I still see the value in the optimization, but I feel I've been duped into thinking a lot more work could be done than really could be with these CPUs.

I've had similar thoughts, although I don't quite feel "duped" because I just never bothered to look into it myself ;) My lingering question is: if one thread isn't using an execution unit, can the other use it simultaneously? Of course, having one thread need it exactly when the other doesn't could be tricky..

This is all quite elementary, and I probably learned it in a class at one point. I feel quite embarrassed to be unsure about it :p

As far as I can tell, and this is wholly academic of course, if you switched between 2 threads evenly on a 3.2GHz CPU, they'd each get 1.6 billion cycles per second to work with. They are sharing resources (barring any clarification on whether a second thread can use an execution unit the first leaves idle). Of course, that's not necessarily just like a 1.6GHz CPU: if each thread spent the same proportion of time blocked on the 1.6GHz CPU as on the 3.2GHz CPU, that'd be half the cycles again. Then again, threads may not spend half their time blocked, and a thread could be left waiting for a while on an SMT processor even if it's ready to go. In which case, you could indeed be better off running your code on two separate 1.6GHz CPUs.
 
As far as I can see, this two-threading mechanism on each core is a way to hide the inefficiencies caused by dropping the logic for out-of-order execution (OOE). When there is OOE logic in the CPU, the CPU can re-order the instructions (and even rename the registers to enhance performance) so that it can keep all the functional units as busy as possible and maximize instructions-per-cycle (IPC) throughput. But as this logic is absent in both the X360 CPU and CELL, they use multi-threading to hide this. Even if one thread requires execution of an instruction that locks its whole pipeline for a couple of cycles (i.e. one that keeps only one functional unit busy and leaves the rest idle), the other thread can continue to execute and use the other functional units without any slowdown.

I believe multi-threading works best when one thread is int+FPU heavy and the other thread is VMX heavy (btw in this case, you can think of both threads as working at 3.2GHz, instead of 1.6GHz).
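
To illustrate the stall-hiding idea, here is a toy model in Python. Everything in it is an assumption for illustration: a single issue slot per cycle, round-robin selection between ready threads, and each instruction described only by how many cycles its thread is stalled afterwards. It is not how the PPE or the X360 CPU actually schedules anything, and `run` is a made-up name.

```python
def run(threads):
    """Toy single-issue in-order core with hardware threads.

    Each thread is a list of stall lengths: after issuing an instruction
    with stall s, that thread cannot issue again for s extra cycles.
    Returns the number of cycles until every thread has finished.
    """
    pending = [list(t) for t in threads]
    ready_at = [0] * len(pending)   # first cycle each thread may issue again
    cycle, last = 0, -1
    while any(pending):
        n = len(pending)
        # Round robin: start looking at the thread after the last one issued.
        for k in range(1, n + 1):
            i = (last + k) % n
            if pending[i] and ready_at[i] <= cycle:
                stall = pending[i].pop(0)
                ready_at[i] = cycle + 1 + stall
                last = i
                break
        cycle += 1  # the slot is simply wasted if no thread was ready
    return max(ready_at)

stally = [2, 0, 2, 0]   # every other instruction stalls its thread for 2 cycles
greedy = [0, 0, 0, 0]   # never stalls, wants the issue slot every cycle

print("one stally thread:   ", run([stally]))
print("two stally threads:  ", run([stally, stally]))
print("two greedy threads:  ", run([greedy, greedy]))
```

In this model two stall-heavy threads finish in fewer cycles together than back to back, because each fills the other's bubbles, while two greedy threads gain nothing and each just sees half the issue slots. That second case is roughly the "two 1.6GHz CPUs" picture.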
 
Titanio said:
Was it ever confirmed that the Cell PPE had 2 VMX units? Maybe two sets of registers, but 2 VMX units would imply nearly double the FP performance for the PPE compared to what STI has quoted to date. If they could quote more FP performance I'm sure they would :p

I don't know, but there is a die comparison shot between the PPE and Xenon, showing the DD2 CELL revision PPE (it's not from me :) ):

[Attached image: cx.JPG]
 
I am... *somewhat* shocked that those two pics look very similar... maybe I missed something about the PPE being almost the same as one core of the XeCPU...
 
The interview nemo posted with Crytek is VERY interesting in terms of differences they see between threading etc. on the two systems. Very relevant to a lot of discussion that's been going on here too (actually directly answers some questions raised here). I have a translation, I'd post the thread but I think Nemo should since he found it.
 
silhouette said:
As far as I can see, this two-threading mechanism on each core is a way to hide the inefficiencies caused by dropping the logic for out-of-order execution (OOE). When there is OOE logic in the CPU, the CPU can re-order the instructions (and even rename the registers to enhance performance) so that it can keep all the functional units as busy as possible and maximize instructions-per-cycle (IPC) throughput. But as this logic is absent in both the X360 CPU and CELL, they use multi-threading to hide this. Even if one thread requires execution of an instruction that locks its whole pipeline for a couple of cycles (i.e. one that keeps only one functional unit busy and leaves the rest idle), the other thread can continue to execute and use the other functional units without any slowdown.

I believe multi-threading works best when one thread is int+FPU heavy and the other thread is VMX heavy (btw in this case, you can think of both threads as working at 3.2GHz, instead of 1.6GHz).

I'm not sure I agree with the idea that dual-issue cores were a solution to removing OoOE...don't think I can go with that.

However, the last sentence in that post would seem a good argument for using your resources wisely with these CPUs.
 
Nemo80 said:
I might add, the original CELL PPE design and also the first revisions (maybe that's what Deano Calver speaks about?) only had one VMX unit, so they could only "hyperthread" 2 threads, just like each Xenon core can do. But Sony changed this and increased the die size by adding a second hardware VMX unit, so that now two real threads can be run independently. Guess why they've done that? ;)

You're really confused about the difference between dual issue and dual threading. Threading and issue are not connected, and more to the point, your guess is wrong.

The original version of my GDCE article (before I self-censored) explained exactly how the PPE issues instructions and switches threads, but as that hasn't been publicly released I pulled the section. The only bit left IS the section on thinking of the PPE as two 1.6GHz dual-issue pipelined processors.
 
scificube said:
I guess since it's not public knowledge yet...you can clarify this for us?

Nope, until STI releases some details that I can work with, I can't talk about the subject. That's why I pulled it from the presentation...
 
DeanoC said:
Nope, until STI releases some details that I can work with, I can't talk about the subject. That's why I pulled it from the presentation...

But could you maybe explain then the advantage of having two hardware VMX units, in comparison to only having one as it was before?
 
Nemo80 said:
But could you maybe explain then the advantage of having two hardware VMX units, in comparison to only having one as it was before?

Most modern in-order processors (e.g. MIPS 5900, Pentium) are "dual issue", which means that every cycle they can pick up two instructions out of a single stream (i.e. it's ILP, not TLP) and attempt to issue them to two execution units at once. Whether they can issue them at the same time depends on lots of things (for example register dependencies), but most importantly on whether you have two execution units free. If you only have one execution unit of a particular type (say VMX), you can never dual issue two instructions that both use that unit.

*ILP = Instruction-Level Parallelism
*TLP = Thread-Level Parallelism
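
A minimal sketch of that pairing check, in Python. The `Instr` type, the unit names and the unit counts are all hypothetical and chosen only for illustration; real issue logic has many more rules than these two.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    unit: str            # execution unit type, e.g. "INT", "FPU", "VMX"
    dst: str             # destination register
    srcs: tuple = ()     # source registers

def can_dual_issue(first, second, units_available):
    """True if two adjacent instructions from ONE stream may issue together."""
    # Register dependency: second reads, or also writes, what first writes.
    if first.dst in second.srcs or first.dst == second.dst:
        return False
    # Structural limit: each instruction needs its own free execution unit.
    needed = {}
    for ins in (first, second):
        needed[ins.unit] = needed.get(ins.unit, 0) + 1
    return all(units_available.get(u, 0) >= n for u, n in needed.items())

one_vmx = {"INT": 1, "FPU": 1, "VMX": 1}   # hypothetical core, single VMX unit
two_vmx = {"INT": 1, "FPU": 1, "VMX": 2}   # hypothetical core, two VMX units

vadd = Instr("VMX", "v1", ("v2", "v3"))
vmul = Instr("VMX", "v4", ("v5", "v6"))    # independent of vadd
vdep = Instr("VMX", "v7", ("v1", "v6"))    # reads vadd's result
iadd = Instr("INT", "r1", ("r2", "r3"))

print(can_dual_issue(vadd, vmul, one_vmx))  # both need the only VMX unit
print(can_dual_issue(vadd, vmul, two_vmx))  # a second unit removes that limit
print(can_dual_issue(vadd, iadd, one_vmx))  # different units, no dependency
print(can_dual_issue(vadd, vdep, two_vmx))  # blocked by the register dependency
```

Note that all of this is about two instructions from the same thread. It says nothing about how many threads the core can run.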
 
DeanoC said:
The original version of my GDCE article (before I self-censored) explained exactly how the PPE issues instructions and switches threads, but as that hasn't been publicly released I pulled the section. The only bit left IS the section on thinking of the PPE as two 1.6GHz dual-issue pipelined processors.

It sounds to me like you are implying that the PPE is unable to issue instructions from both threads in the same clock cycle.

This is also in line with the Microprocessor Report article published in February this year. (can be found here)
The article suggests that the PPE has fine-grained multithreading, and not simultaneous multithreading (SMT) as in the P4 processors (last part of page 3 and the beginning of page 4).
 