Xenon question...

scificube

It is my understanding that the execution of threads is interleaved on a core in Xenon.

It's been said the VMX unit can handle two threads simultaneously. This makes little sense if thread execution is rigidly interleaved unless I'm missing something. I first took this to mean the VMX unit had register space reserved for two threads so that when a switch occurred it would be that much faster...but this seems goofy when there are already HW facilities for fast switching elsewhere.

I have misunderstood something somewhere (I suppose I should have read that MPR floating around...if I could understand it, that is). Either thread execution is not rigidly interleaved on a core in Xenon, or I don't understand how the VMX unit works if thread execution is in fact interleaved between the two contexts a core can handle.

If the VMX unit can handle two threads, would it not make sense for the FP unit to be able to as well, or for there to be two FP units, so that two threads could fire along simultaneously if both needed an FPU? I mean, shouldn't other execution elements work in the same manner or be duplicated, or is there a good reason the VMX unit alone would have such a capability?
 
Why would threads be rigidly interleaved?
It would make for a very inflexible system. Part of the reason for multiple hardware threads is to hide latency, and rigidly interleaving them would really not do a good job of that.
 
scificube said:
It's been said the VMX unit can handle two threads simultaneously. This makes little sense if thread execution is rigidly interleaved unless I'm missing something. I first took this to mean the VMX unit had register space reserved for two threads so that when a switch occurred it would be that much faster...but this seems goofy when there are already HW facilities for fast switching elsewhere.
Those hardware facilities still need the VMX registers for two threads. Otherwise that data will need to be saved/loaded on each context switch, which kinda defeats the point of hardware multithreading, especially given the large number of registers.
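
To put a rough number on that, here's a toy C++ sketch (nothing Xenon-specific; the 128-register count is just the figure quoted later in this thread):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical illustration: the VMX state a software context switch would
// have to spill and reload if the register file were NOT duplicated per
// hardware thread. 128 vector registers x 128 bits each, per the Xenon
// figures quoted later in this thread.
struct VmxContext {
    std::uint8_t vr[128][16];  // 128 registers, 16 bytes (128 bits) apiece
};

int main() {
    // The VMX registers alone come to 2 KB of state per thread; add GPRs,
    // FPRs and special-purpose registers and every switch gets expensive.
    std::printf("VMX state per thread: %zu bytes\n", sizeof(VmxContext));
    return 0;
}
```

With duplicated register files the hardware just selects the other thread's set instead of touching memory at all.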
 
Sorry I got tied up for a little while.

Shifty...that's my point so I know something is wrong somewhere.

ERP...I only mean rigidly interleaved in the sense that ONE thread is executing on a core at a time, not TWO threads simultaneously.

It made sense because Xenon is an in-order proc. If one thread needed to go out to memory, or there was some dependency in one thread, then the other thread could be switched to in order to hide the latency, and so forth. As a secondary measure, and a natural one I think, the scheduler would intelligently interleave the execution of the threads on the core(s) so as to maintain fairness, throughput, etc.

This is what I meant by rigidly interleaved. I didn't mean that at some tick a switch is bound to occur; rather, that is decided by the programmer or the scheduler, but the result is still that only one thread is executing per unit of time on a core in Xenon.

If it's two threads, then I have to rethink some things about the chip and look elsewhere to understand Deano C's comments to the effect that the PPE is like unto two 1.6GHz procs. I gather the PPE is very similar to Xenon's cores, and from that I assumed the cycles on the PPE and a Xenon core were split (not evenly but intelligently) between the two threads on the core. If this is not the case and two threads can go full bore simultaneously on a Xenon core, then I don't know why the PPE should be viewed as two 1.6GHz procs. Perhaps this could lead to a very significant difference between Cell's PPE and a core in Xenon. Much more significant than anything else, I would think.

I apologize to the above poster. I just wanted to explain my thinking before I read the article you linked to. If it clears up things for me I'll let you know :)
-------------------------------------------------------------------------------------

In the article romiced linked to, this is stated:

"At 3.2 GHz this is the highest frequency Power PC architecture core IBM is shipping.
The cpu core is a dual issue in order execution micro-architecture with simultaneous multi-threading and support facilities for 2 threads. Because dynamic power consumption is key we implemented extensive clock gating to shutdown pipelines until instructions are active"

When it says the core has simultaneous multi-threading support, I have in the past taken this to mean the Xenon can handle 3 threads simultaneously across its three cores. The support for 2 threads I have taken to be facilities to store a HW context for the next thread to execute on a core, which allows for extremely fast switching that could be used to hide latency.

What the article doesn't say is that the VMX unit can handle 2 threads simultaneously. Nor does it mention that any other execution pipe can. This is in line with what makes sense to me, but as I know diddly about anything, that's far too little for me to be confident I understand things. What I didn't know was that the FPU and VMX pipes had buffers that allowed for OoOe in them. That's a good thing IMO.

---------------------------------------------------------------------------

I'm just asking for someone to show me the way a little bit...am I right, wrong, or close is all I'm asking for. (and private lessons but we'll work that out on the side :))
 
Per core, it's most likely almost like P4 hyperthreading: two execution contexts running simultaneously, grabbing whatever execution units they can get their hands on (causing it to be slower than 2x3.2GHz when both contexts are using the same execution units, but if they execute sufficiently different code it should be possible to get close to the theoretical ceiling of 2x3.2GHz).

Of course, when one context is waiting for memory, the other has access to all the execution units. This is a more efficient utilization of the execution units than just having a single-threaded core that will do no work every time it waits for memory.

(here I use execution unit as "instruction pipeline", such as the VMX pipe, one of the (multiple?) integer pipes, etc).

And a disclaimer: This may contain any number of errors as it's just how I've understood things.
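
A rough toy model of that point, in C++ (the pipe mix and instruction counts are made up for illustration, not the real Xenon pipelines):

```cpp
#include <array>
#include <cstdio>
#include <queue>

// Two hardware threads issue into a shared pool of execution pipes; each
// pipe accepts one instruction per cycle. Threads that lean on different
// pipes interleave almost for free, while threads fighting over the same
// pipe roughly halve each other.
enum Pipe { IALU, LSU, VMX, NUM_PIPES };

int cyclesToDrain(std::array<std::queue<Pipe>, 2> threads) {
    int cycles = 0;
    while (!threads[0].empty() || !threads[1].empty()) {
        bool busy[NUM_PIPES] = {};          // which pipes are taken this cycle
        for (auto& t : threads) {
            if (!t.empty() && !busy[t.front()]) {
                busy[t.front()] = true;     // pipe free: issue and retire
                t.pop();
            }                               // else: stall, retry next cycle
        }
        ++cycles;
    }
    return cycles;
}

int main() {
    std::array<std::queue<Pipe>, 2> different, clashing;
    for (int i = 0; i < 8; ++i) {
        different[0].push(IALU); different[1].push(VMX);  // integer vs. vector work
        clashing[0].push(VMX);   clashing[1].push(VMX);   // both want the VMX pipe
    }
    std::printf("different pipes: %d cycles\n", cyclesToDrain(different)); // 8
    std::printf("same pipe:       %d cycles\n", cyclesToDrain(clashing));  // 16
    return 0;
}
```

Which is just the point above in numbers: sufficiently different code gets you near the 2x ceiling, identical code halves it.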
 
ector said:
Per core, it's most likely almost like P4 hyperthreading: two execution contexts running simultaneously, grabbing whatever execution units they can get their hands on (causing it to be slower than 2x3.2GHz when both contexts are using the same execution units, but if they execute sufficiently different code it should be possible to get close to the theoretical ceiling of 2x3.2GHz).

Of course, when one context is waiting for memory, the other has access to all the execution units. This is a more efficient utilization of the execution units than just having a single-threaded core that will do no work every time it waits for memory.

(here I use execution unit as "instruction pipeline", such as the VMX pipe, one of the (multiple?) integer pipes, etc).

And a disclaimer: This may contain any number of errors as it's just how I've understood things.

Thanks.

So two threads can execute simultaneously, but there is contention for execution units, with the exception of the FPU and VMX units?

I was under the impression the situation was not like a P4's hyper-threading, but if the FPU and VMX units can handle requests from both threads at the same time then the speed increase for games should be better than the typical 15-20% boost Intel's HT provides.

I still have to wrestle with looking at a core as two 1.6GHz procs. It would seem this description is rather conservative about average performance, no?
 
scificube said:
I still have to wrestle with looking at a core as two 1.6GHz procs. It would seem this description is rather conservative about average performance, no?

Yep, I think it should be considerably better than that.
 
ector said:
Per core, it's most likely almost like P4 hyperthreading: two execution contexts running simultaneously, grabbing whatever execution units they can get their hands on (causing it to be slower than 2x3.2GHz when both contexts are using the same execution units, but if they execute sufficiently different code it should be possible to get close to the theoretical ceiling of 2x3.2GHz).

Of course, when one context is waiting for memory, the other has access to all the execution units. This is a more efficient utilization of the execution units than just having a single-threaded core that will do no work every time it waits for memory.

(here I use execution unit as "instruction pipeline", such as the VMX pipe, one of the (multiple?) integer pipes, etc).

And a disclaimer: This may contain any number of errors as it's just how I've understood things.

I think there is a fundamental difference from Pentium IV's HT.

In the Pentium IV, the dynamic scheduler looks at the instructions in the ROB and issues as many of them as possible (if one thread has limited parallelism and can only issue a small number of uops, but another thread has some instructions ready to go, the scheduler will "simply" notice there are instructions ready to be issued and proceed to issue them). IIRC that scheduler is thread-unaware.

The XeCPU, like the PPE in CELL (and the two descend from the same basic core), has threads alternating in the pipeline: each thread, for example, fetches two instructions every other cycle, and the threads alternate the fetch and decode phases while pushing instructions into the Issue Queue (which is only the top-level queue, as for the FP units there is a separate VMX/FP Issue Queue underneath the regular, shared Top Level Issue Queue).

Instructions from different threads can surely end up sharing execution units (although both processors only have 1 iALU/FXU, 1 LSU, 1 BEU and a separate VMX/FP pipe), but it still works a little bit differently compared to the Pentium IV's HT. With a separate FP Issue Queue sitting under the other Issue Queue, you can see how this "sharing" of the execution unit array can easily happen between an integer-heavy thread and an FP-heavy one. The main Issue Queue can issue up to two instructions per cycle from a thread, and the same throughput can be seen in the VMX/FP Issue Queue (so you could have cases in which the core is actually a 4-way issue machine, kind of ;)). This is how SMT is described in the latest CELL docs released by IBM, and they should be referring to the famous DD2 revision (DD1 is not the one going into the production CELL Broadband Engine processor; DD2 or a further revision should be, as affirmed by one of their designers, Dr. Peter H. Hofstee).
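
Here's a very rough sketch of that dual-queue picture in C++ (the per-thread instruction counts are invented and real issue restrictions are far more involved; this just shows why an integer-heavy thread pairs so well with an FP/VMX-heavy one):

```cpp
#include <cstdio>

// Toy model: a shared top-level issue queue feeding the integer/load-store/
// branch units, plus a separate VMX/FP issue queue underneath it, each able
// to issue up to two instructions per cycle.
struct Thread { int intOps; int fpOps; };  // pending instructions per type

int cyclesToRetire(Thread a, Thread b) {
    int cycles = 0;
    while (a.intOps + a.fpOps + b.intOps + b.fpOps > 0) {
        int intSlots = 2;  // top-level issue queue: up to 2 per cycle
        int fpSlots  = 2;  // VMX/FP issue queue:    up to 2 per cycle
        Thread* order[2] = {&a, &b};
        for (Thread* t : order) {
            while (intSlots > 0 && t->intOps > 0) { --intSlots; --t->intOps; }
            while (fpSlots  > 0 && t->fpOps  > 0) { --fpSlots;  --t->fpOps; }
        }
        ++cycles;
    }
    return cycles;
}

int main() {
    // Integer-heavy thread + FP-heavy thread: both queues stay busy,
    // "kind of" 4 issues per cycle across the core.
    std::printf("int-heavy + fp-heavy: %d cycles\n", cyclesToRetire({100, 0}, {0, 100}));
    // Two FP-heavy threads: they end up sharing the two VMX/FP slots.
    std::printf("fp-heavy  + fp-heavy: %d cycles\n", cyclesToRetire({0, 100}, {0, 100}));
    return 0;
}
```

The first pairing drains in about half the cycles of the second, which is the whole argument for mixing the workloads of the two threads on a core.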

If we really wanted to go into the "no-one talks about it" territory, we should see what changes between the PPE and each of the XeCPU cores since, with CELL's DD2 revision, the PPE is almost 2x in size even though each of the XeCPU cores has 4x more VMX registers (128x128 bits versus 32x128 bits), a more complex VMX unit (support for dot product and horizontal math operations as well as custom Direct3D-oriented instructions), and the SAME specs in all the other areas of the processor core. Multi-threading is mentioned with the same figures as in the above paragraph for both, but it seems clear to me that those ADDED transistors are not just sitting idle.

My guess is that the PPE has more replicated logic and enhancements (to reduce contention and starvation of shared resources) and perhaps a few twists to the FP pipe... it is possible that the PPE core is more efficient (has fewer restrictions) than each of the XeCPU cores...
 