How many developers are using all 3 cores for X360?

Lysander said:
But, then X2cpu is IBM's "only PPE" idea chip.

Indeed. Basically, "what's left from the CELL deal" garbage, with a lot of weak points, especially a shared cache for three cores, which already seems to be a serious limitation according to the AGEIA people (cache issues) :) It's far from executing 3 threads independently as stated by Microsoft (not to speak of 6, that's total bs).
 
scificube said:
I'm thinking of Deano Calver's comments about not thinking of the PPE as a single 3.2GHz core but as two 1.6GHz cores. If you only used one thread on the PPE, you would effectively be using a 1.6GHz CPU.

I always wondered about that quote.

EDITED: while Titanio was writing

So the XeCPU would be 3 cores x (2 threads @ 1.6GHz)?

Wouldn't the Cell then be 1 core (2 threads @ 1.6GHz) + 7 cores @ 3.2GHz?

Makes no sense to me...
 
blakjedi said:
I always wondered about that quote.

wouldn't the XeCPU be 3/3.2 = 1.06 per core then?

wouldn't the Cell then be 8/3.2 = 400MHz per core?

I wonder about that quote too, but if you were applying that logic, it'd be

3.2GHz/2 per core on X360 (each core has two hardware threads)

and

3.2GHz/2 per PPE on Cell and 3.2GHz/1 (i.e. 3.2GHz) per SPU (SPUs are single-threaded)
 
The whole 2x1.6GHz idea is predicated on 2 threads getting "equal time" on the actual execution hardware. A goes, then B, then A, and so on... Conceptually, it could operate anywhere between 2x1.6GHz and 1x3.2GHz at a given moment, depending on just how "hungry" a particular thread ends up being. You can design hardware for 2 threads or 2 "gazillion" threads, but it all still has to shoehorn into the same one core. There is no extraction of magical clock cycles of execution from thin air. You are simply maximizing usage of the finite handful of execution cycles that the hardware can facilitate.
 
randycat99 said:
The whole 2x1.6GHz idea is predicated on 2 threads getting "equal time" on the actual execution hardware. A goes, then B, then A, and so on... Conceptually, it could operate anywhere between 2x1.6GHz and 1x3.2GHz at a given moment, depending on just how "hungry" a particular thread ends up being. You can design hardware for 2 threads or 2 "gazillion" threads, but it all still has to shoehorn into the same one core. There is no extraction of magical clock cycles of execution from thin air. You are simply maximizing usage of the finite handful of execution cycles that the hardware can facilitate.

I'm slightly confused by this, but it's an issue I've wondered about for a while so..

..what's the distinction between a core/cpu with support for 2 "hardware" threads vs a core/cpu that supports only 1 "hardware" thread but just switches between software threads? Does it just provide for faster switching between two threads or..?
 
"whats left from the CELL deal"-garbage

Uh, the PPE is built from the Power5 chip; that is no garbage.
The cache could be a limitation for careless software makers, but the chip has the ability to let its data bypass the cache (L1, L2, or both) and go straight to the GPU and system memory.
And yes, the threads should be autonomous, according to the X2 patent. (But how, that is the question.)
 
Titanio said:
I'm slightly confused by this, but it's an issue I've wondered about for a while so..

..what's the distinction between a core/cpu with support for 2 "hardware" threads vs a core/cpu that supports only 1 "hardware" thread but just switches between software threads? Does it just provide for faster switching between two threads or..?

Functionally, they accomplish a very similar effect - it's just that the thread granularity is taken one step closer to the hardware when it comes to "hardware threads" (I guess "in the hardware" would be a better description). We are digging deeper into the hardware itself to recover "unused" execution cycles.
 
randycat99 said:
Functionally, they accomplish a very similar effect - it's just that the thread granularity is taken one step closer to the hardware when it comes to "hardware threads" (I guess "in the hardware" would be a better description). We are digging deeper into the hardware itself to recover "unused" execution cycles.

More registers for maintaining both threads' state in memory or..?

I really should look into this. Thanks for your help :)

I have up till now considered threads on X360 vs Cell to be equivalent but I'm not sure if I should be doing so now.
 
I posted a lengthy explanation but the browser hung. :rolleyes:

The functional units (like a VMX unit) work on registers. With one set of registers, when you context switch to another thread, the contents of those registers need to be saved out to record the current progress of the thread. When the first thread is switched back in, thread 2's contents need to be saved out and thread 1's loaded back in. This uses up cycles in preparing to continue where the thread left off. Hardware support adds extra registers (and maybe other resources) so each of the two threads has its current state permanently to hand; the moment a cycle is free to use, either thread can jump right in.
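The save/restore cost described above can be sketched as a toy model. This is purely illustrative (the register count, the two-cycles-per-register cost, and the class names are all assumptions, not real hardware figures): one model has a single register file and must spill/reload state on every thread switch, the other keeps a register set per hardware thread and switches for free.

```python
# Toy model (not real hardware): contrast a core with one architectural
# register set, where every thread switch must save and restore register
# contents, against an SMT-style core with a register set per hardware thread.

NUM_REGS = 32  # register count is illustrative

class SingleSetCore:
    """One register file: switching threads costs a save plus a restore."""
    def __init__(self):
        self.regs = [0] * NUM_REGS
        self.saved = {}           # thread id -> saved register contents
        self.switch_cost = 0      # "cycles" spent just moving register state

    def switch_to(self, old_tid, new_tid):
        self.saved[old_tid] = list(self.regs)                 # spill old state
        self.regs = self.saved.get(new_tid, [0] * NUM_REGS)   # reload new state
        self.switch_cost += 2 * NUM_REGS                      # save + restore

class SMTCore:
    """One register file per hardware thread: switching moves no state."""
    def __init__(self, hw_threads=2):
        self.regs = {tid: [0] * NUM_REGS for tid in range(hw_threads)}
        self.switch_cost = 0      # nothing to spill or reload

core = SingleSetCore()
for i in range(1000):             # ping-pong between two threads
    core.switch_to(i % 2, (i + 1) % 2)
print(core.switch_cost)           # 64000 "cycles" burned purely on state movement

smt = SMTCore()
print(smt.switch_cost)            # 0
```

The point of the model is only the asymmetry: duplicated register state makes the per-switch cost disappear, which is what lets hardware threads interleave cycle by cycle.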
 
randycat99 said:
The whole 2x1.6GHz idea is predicated on 2 threads getting "equal time" on the actual execution hardware. A goes, then B, then A, and so on... Conceptually, it could operate anywhere between 2x1.6GHz and 1x3.2GHz at a given moment, depending on just how "hungry" a particular thread ends up being. You can design hardware for 2 threads or 2 "gazillion" threads, but it all still has to shoehorn into the same one core. There is no extraction of magical clock cycles of execution from thin air. You are simply maximizing usage of the finite handful of execution cycles that the hardware can facilitate.

Not really. The 2x1.6 thing is a conceptual idea to try to make it easier to get good performance out of the cores.

Basically, the principle is that the instruction and memory latencies on the cores are such that it's hard to hide them, but if you consider it as two processors with half the core speed, you only have to hide half of the latency.

It's probably not a bad idea to treat the Xenon cores the same way. They have more registers than the PPE's, so they can potentially hide more latency, but IME optimisations in the running code become less and less significant as the thread count goes up; everything ends up dominated by communications, and even crappy code runs just as fast as the hand-optimised stuff.
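A back-of-the-envelope way to see the "two half-speed processors" framing (the latency number below is assumed for illustration, not a published figure for the PPE): with two hardware threads alternating issue slots, a result that takes L core cycles costs a given thread only about L/2 of *its own* issue opportunities, so per-thread latency looks halved even though absolute time is unchanged.

```python
# Illustrative arithmetic only: why an SMT core can be reasoned about as
# two half-speed cores for latency-hiding purposes.

LOAD_LATENCY = 8   # assumed load-to-use latency in core cycles (illustrative)

# One thread owning the whole 3.2GHz core: every cycle is its own issue
# slot, so it must find 8 cycles of independent work to hide the load.
slots_to_hide_single = LOAD_LATENCY

# Two hardware threads alternating cycles: only every other cycle belongs
# to this thread, so only 4 of its own issue slots pass during the stall.
slots_to_hide_smt = LOAD_LATENCY // 2

print(slots_to_hide_single, slots_to_hide_smt)  # 8 4
```

Same wall-clock latency either way; the scheduling problem each thread's code has to solve is simply half as hard.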
 
Naturally, there will be perks to exploit all over the place. I think it still comes down to the same thing, in the end- maximizing the utilization of a finite number of execution cycles. How you do it...that is where you have all those different methods and perks to wield.
 
Titanio said:
I'm slightly confused by this, but it's an issue I've wondered about for a while so..

..what's the distinction between a core/cpu with support for 2 "hardware" threads vs a core/cpu that supports only 1 "hardware" thread but just switches between software threads? Does it just provide for faster switching between two threads or..?

A thread context switch is really expensive in software, but cheap in hardware.

You could (on a PC) burn thousands and thousands of cycles on a software managed context switch between two threads in pretty much any of the modern OSes. You could do somewhat better by switching to a special purpose RTOS or tossing the OS entirely and running on the bare metal.

But in hardware like a hyperthreaded P4, you can hardware context switch on the next cycle between two threads, because all the register state is duplicated on chip between them.

Basically, software context switching is a way to make the CPU appear to run more than one thing from the user's point of view. Humans operate on the "tens of milliseconds" time scale at best, so blowing thousands of cycles on a context switch doesn't really mean a whole lot.

But hardware context switching is a technique to recover wasted execution cycles from the CPU's point of view by having something else immediately ready to run and use the idle execution units.
 
aaaaa00 said:
A thread context switch is really expensive in software, but cheap in hardware.

You could (on a PC) burn thousands of cycles pretty easily on a software managed context switch between two threads in any of the modern OSes, but in hardware like a hyperthreaded P4, you can hardware context switch on the next cycle between two threads, because all the register state is duplicated on chip between them.

Basically, software context switching is a way to make the CPU appear to run more than one thing from the user's point of view, since humans operate on the 100+ millisecond time scale, and blowing thousands of cycles doesn't mean a whole lot on that time scale.

But hardware context switching is a technique to recover wasted execution cycles from the CPU's point of view by having something else immediately ready to run and use the idle execution units, because in the CPU's timescale it does care about wasting a few hundred cycles waiting for memory.

Cheers very much!

So there's no duplication of execution hardware? It's just about the switching? 3 threads executing at any one time, or can more execute if they use mutually exclusive sets of execution units?

edit - Also, if you have a pool of software threads you're switching between, with 6 of them in hardware, is it as expensive to switch one of the "software threads" into hardware as it is for a non-MT core to switch between software threads?

Which leads to the question, would you be ill-advised to use more threads in your application than you can keep "in hardware"?
 
Titanio said:
Which leads to the question, would you be ill-advised to use more threads in your application than you can keep "in hardware"?

You generally need to experiment with this and tune things.

If you have a bunch of tasks that very rarely wake up and are roughly independent of other things in your engine, then it may be cleaner to set aside one hardware thread and put all of these tasks onto separate software threads, which will be scheduled onto that hardware thread when they need to run.

I can think of audio as a reasonably good candidate for something like this. Audio tends to operate on the human "tens of milliseconds" scale, and people just won't be able to tell the difference in latency when buffering ahead a frame. So architecturally, it may be cleaner in your engine to just have a software scheduled thread that fires and does the sound processing. Other things may work too, like network, disk I/O, user input scanning, etc.

But typically you want to stick to the number of hardware threads you have in your system, be it 2, 6, 100 or whatever.
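The layout described above can be sketched with a tiny round-robin run queue (all names here are made up for illustration; a real engine would use the platform's thread affinity APIs): low-rate housekeeping tasks share one hardware-thread slot, while latency-critical jobs would each get a hardware thread of their own.

```python
# Sketch of multiplexing several low-rate software threads onto one
# hardware-thread slot, as suggested for audio/network/input style tasks.

from collections import deque

class SharedSlot:
    """Software tasks cooperatively multiplexed onto one hardware thread."""
    def __init__(self):
        self.run_queue = deque()

    def spawn(self, name, fn):
        self.run_queue.append((name, fn))

    def run_once(self):
        """Round-robin: give each queued task one turn per 'frame'."""
        results = []
        for _ in range(len(self.run_queue)):
            name, fn = self.run_queue.popleft()
            results.append((name, fn()))
            self.run_queue.append((name, fn))  # back of the line for next frame
        return results

slot = SharedSlot()
slot.spawn("audio",   lambda: "mixed one frame of audio")
slot.spawn("network", lambda: "polled sockets")
slot.spawn("input",   lambda: "scanned pads")
print(slot.run_once())
```

Because these tasks run on human timescales (tens of milliseconds), the software-scheduling overhead of sharing one slot is invisible, which is exactly the trade-off the post describes.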
 
If I understand this, the same unit, e.g. an FPU or integer unit, works part of the time on the data stream from thread 1 and part of the time on the data stream from thread 2, and then back again.
But how is the optimisation of idle units achieved; what is the benefit of all this?
 
But how is the optimisation of idle units achieved; what is the benefit of all this?
When one thread is waiting because of a data/memory dependency, execution can switch to the other thread, thus keeping the execution units busy more of the time.
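A toy utilization model makes the benefit concrete (the work/stall cycle counts are assumed round numbers, not measurements): each thread does a few cycles of work, then stalls on a memory access; the core only idles when *every* hardware thread is stalled at once.

```python
# Toy model of SMT latency hiding: utilization of the execution units
# when threads alternate short bursts of work with long memory stalls.

def utilization(threads, cycles):
    """Fraction of cycles in which at least one thread can issue.

    Each thread repeats: WORK busy cycles, then STALL cycles waiting on
    memory. Threads are offset so their work phases interleave.
    """
    WORK, STALL = 4, 12          # assumed: 4 cycles of work per 12-cycle miss
    period = WORK + STALL
    busy = 0
    for c in range(cycles):
        # position of each thread within its work/stall period
        offsets = [(c + t * period // threads) % period for t in range(threads)]
        if any(o < WORK for o in offsets):
            busy += 1
    return busy / cycles

print(round(utilization(1, 1600), 2))  # one thread:  4/16 busy -> 0.25
print(round(utilization(2, 1600), 2))  # two threads: 8/16 busy -> 0.5
```

With these assumed numbers, the second hardware thread doubles execution-unit utilization without making either thread individually faster, which is exactly the "recover wasted cycles" argument made earlier in the thread.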
 
Nemo80 said:
Almost right, though the latest CELL revision contains two full hardware VMX units, which can be seen in the die shots circulating on the internet. This means the CELL PPE can run 2 independent threads at full speed. The only limitation that counts is the shared L2 cache.

But the situation is even worse on Xenon. Those cores contain only one VMX unit per core, enabling the Xbox 360 to do 3 "real" threads at once, only one more than the CELL PPE can handle. (Of course the Xenon VMX unit can do hyper-threading, which gives a little performance boost, but not much.) But what is much worse is that all these threads on Xenon block each other, because they all share one quite small (for so many threads) 1 meg L2 cache.

Yes, but it's VMX128. The cores in Xenon are considerably more robust than the single core in the PPE.
 
therealskywolf said:
Yes, but it's VMX128. The cores in Xenon are considerably more robust than the single core in the PPE.

How so?

Btw, Shifty: the information so far has been pretty clear about the fact that SPEs can only run in "user mode", so you couldn't actually run an OS on them, afaik.
 