VMX Units on the X360 CPU

twotonfld

Newcomer
Quickie question:

I haven't seen any block diagrams, so I'm a little curious about the implementation of the VMX units on the X360's CPU. Since each PPC core implements one unit, are these units dual-thread capable? What I'm curious about is this: if they're not, and the X360 CPU is using them to generate vertices to feed the GPU, does the vertex-generating thread block execution of the other simultaneous thread on the core while it uses the VMX unit? Or does the architecture split the two threads earlier, giving only half the core access to the VMX unit? Could vertex generation for jobs such as procedural rendering bottleneck other services as a result?

Thx

(also feel free to tell me I'm completely off) ;)
 
Alpha_Spartan said:
Xenon's PPC cores can execute on two threads each for a total of six hardware threads.

I realize this - the question is, since each PPC core implements only 1 VMX unit:

Does the VMX unit process 2 simultaneous threads?
or
Does the path split leaving 1 thread to pass non-VMX while the other hits the VMX?
or
Does the thread block simultaneous execution when the VMX unit of a processor is utilized?

Also - are the VMX units 128 bit? In that case, does it require both thread streams to be used in order to pass down that much data (since if I remember correctly each core is 64 bit)?

(edit - I just found out they are 128 bit).
 
twotonfld said:
Also - are the VMX units 128 bit? In that case, does it require both thread streams to be used in order to pass down that much data (since if I remember correctly each core is 64 bit)?

I think you're confusing datapath width with SIMD width. A Pentium III is a 32-bit machine that can work on 128-bit SIMD. Each instruction is still 32-bit; it's just that the SSE path supports 128-bit registers and calculations. Same case with VMX and PPC. So one thread of instructions can execute 128-bit VMX despite the instructions being only 64-bit wide.
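
To make the distinction concrete, here's a rough Python sketch of what "128-bit SIMD" means. It is only a model, not real VMX or SSE code, and the function name vec_add_4xf32 is made up for illustration. A 128-bit register holds four independent 32-bit float lanes, and a single vector instruction operates on all four at once, regardless of how wide the integer datapath or the instruction encoding is.

```python
import struct

def vec_add_4xf32(a: bytes, b: bytes) -> bytes:
    """Model ONE 128-bit SIMD add: four independent 32-bit float lanes."""
    assert len(a) == 16 and len(b) == 16  # 16 bytes = 128 bits per register
    lanes_a = struct.unpack(">4f", a)     # big-endian, as on PowerPC
    lanes_b = struct.unpack(">4f", b)
    # Each lane is added separately; nothing carries between lanes.
    return struct.pack(">4f", *(x + y for x, y in zip(lanes_a, lanes_b)))

va = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
vb = struct.pack(">4f", 10.0, 20.0, 30.0, 40.0)
result = struct.unpack(">4f", vec_add_4xf32(va, vb))
print(result)
```

One call stands in for one instruction from one thread, so no second thread is needed to "fill" the register.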
 
Doh! (on me) - that makes sense, JF.

That said - do you know the answer to the other question? Are there blocking requirements or simultaneous usage limitations?
 
twotonfld said:
Doh! (on me) - that makes sense, JF.

That said - do you know the answer to the other question? Are there blocking requirements or simultaneous usage limitations?

Sorry, I'm not sure. But just from Microsoft's FLOPS figures, it's clear that no matter how you spin it with multi-threading, you can have only 3 effective VMX units in the same cycle (i.e. no 6 VMX ops per cycle).
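
A back-of-the-envelope sketch of that reasoning: peak vector throughput scales with the number of VMX units, and the hardware thread count never enters the formula. The lane count, the 2 flops per lane (a multiply-add) and the 3200 MHz clock below are illustrative assumptions, not numbers taken from this thread or from Microsoft's spec sheet, and the function names are hypothetical.

```python
def peak_vmx_ops_per_cycle(cores: int, vmx_units_per_core: int,
                           hw_threads_per_core: int) -> int:
    # hw_threads_per_core is deliberately unused: the threads on a core
    # share its single VMX unit, so more threads add no peak throughput.
    return cores * vmx_units_per_core

def peak_mflops(ops_per_cycle: int, lanes: int, flops_per_lane: int,
                clock_mhz: int) -> int:
    return ops_per_cycle * lanes * flops_per_lane * clock_mhz

one_thread = peak_vmx_ops_per_cycle(cores=3, vmx_units_per_core=1,
                                    hw_threads_per_core=1)
two_threads = peak_vmx_ops_per_cycle(cores=3, vmx_units_per_core=1,
                                     hw_threads_per_core=2)
print(one_thread, two_threads)  # same peak either way

# Assumed: 4 float lanes, 2 flops per lane (multiply-add), 3200 MHz clock.
print(peak_mflops(two_threads, lanes=4, flops_per_lane=2, clock_mhz=3200))
```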
 
JF_Aidan_Pryde said:
twotonfld said:
Doh! (on me) - that makes sense, JF.

That said - do you know the answer to the other question? Are there blocking requirements or simultaneous usage limitations?

Sorry, I'm not sure. But just from Microsoft's FLOPS figures, it's clear that no matter how you spin it with multi-threading, you can have only 3 effective VMX units in the same cycle (i.e. no 6 VMX ops per cycle).

Well, that knocks off one option at least - hopefully it doesn't block and you can do 1 VMX and one GP thread per tick.
 
Same case with VMX and PPC. So one thread of instructions can execute 128-bit VMX despite the instructions being only 64-bit wide.

Except that PowerPC instructions are 32 bits long, not 64...

Does the thread block simultaneous execution when the VMX unit of a processor is utilized?

Correct. However this really isn't a big deal since the whole point of SMT is to reduce the amount of pipeline bubbles and mask execution latencies by filling them with the execution stream of another thread...
 
archie4oz said:
Correct. However this really isn't a big deal since the whole point of SMT is to reduce the amount of pipeline bubbles and mask execution latencies by filling them with the execution stream of another thread...

I'm sorry - I'm a little confused.

From what's in the above quote, your response indicates to me that the execution of a thread utilizing the VMX unit blocks the entire core. But, then you move on talking about SMT. Your reference to SMT, are you focusing on the entire system or that single core? If you're focusing on the entire CPU, I follow you. But, if you're focusing on the individual core, that would seem to contradict your initial statement that the VMX request thread fully blocks the core (both thread streams and not just one).

Would you mind clarifying?

Thanks!
 
twotonfld said:
Does the VMX unit process 2 simultaneous threads?
It can't 'run threads', as it's an execution unit, not a processor in and of itself. It handles specific opcodes coming from the instruction stream, that's it.

Does the path split leaving 1 thread to pass non-VMX while the other hits the VMX?
Maybe you are slightly confused about what multithreading really means? A multithreaded processor can't execute more operations per cycle than a single-threaded processor with the same number of execution units, no matter how many hardware threads are implemented in it: one, two, or even more.

It ONLY runs instructions from thread two when thread one hits a pipeline stall (or the other way around, depending on how you look at it), assuming there are instructions and data available to be run for thread two, that is. Hence, the max peak throughput is the same for a multithreaded CPU as for a single-threaded one.
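
A toy simulation of that stall-filling behaviour, as a sketch only. It assumes a single issue slot per cycle and gives thread 0 fixed priority, neither of which is claimed to match the real chip. Each "instruction" is just the number of stall cycles its thread suffers after issuing it (think cache miss).

```python
def run(threads):
    """Simulate one single-issue core; return (cycles, instructions issued)."""
    pc = [0] * len(threads)        # next instruction index for each thread
    ready_at = [0] * len(threads)  # first cycle each thread may issue again
    cycle = issued = 0
    while any(pc[t] < len(threads[t]) for t in range(len(threads))):
        for t in range(len(threads)):  # thread 0 is tried first
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                stall = threads[t][pc[t]]
                pc[t] += 1
                ready_at[t] = cycle + 1 + stall
                issued += 1
                break                  # only ONE issue slot per cycle
        cycle += 1
    return cycle, issued

work = [0, 3, 0, 3]  # stall cycles following each instruction

single = run([work])        # one hardware thread
dual = run([work, work])    # two hardware threads sharing the core
print(single, dual)
```

With two threads the second one slips into the first one's stall cycles, so the core finishes twice the work in only a couple more cycles, yet it still never issues more than one instruction per cycle.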

Does the thread block simultaneous execution when the VMX unit of a processor is utilized?
If the VMX unit is independent of the floating-point processor, it should at least in theory be able to run in parallel. Of course there could be other bottlenecks in the chip, such as instruction decode limits, cache read/write bandwidth limitations, etc. that prevent every execution unit from issuing instructions all in the same clock cycle...

Also - are the VMX units 128 bit? In that case, does it require both thread streams to be used in order to pass down that much data
There is no "both thread streams". There's VERY LITTLE hardware duplication in a multithreaded CPU; in the P4, it was no more than a few percent according to Intel's PR. In essence, just the register file and a few other bits and bobs are duplicated. Everything else, including execution units, data paths, etc., exists in just ONE copy. In-flight instructions are tagged so the processor knows whether they belong to one thread or the other; they don't sit on parallel instruction pipelines in the chip, they both share the same one.
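
A minimal sketch of that arrangement, with entirely hypothetical names (SharedCore, issue, retire_all): the register files are duplicated per hardware thread, the pipeline exists once, and every in-flight instruction carries a thread tag so its result lands in the right register file.

```python
from collections import deque

class SharedCore:
    def __init__(self, hw_threads: int = 2):
        # Duplicated state: one register file per hardware thread.
        self.regs = [dict() for _ in range(hw_threads)]
        # Shared state: a single pipeline used by all threads.
        self.pipeline = deque()

    def issue(self, thread_id: int, dest: str, value: int) -> None:
        # The thread tag travels down the pipeline with the instruction.
        self.pipeline.append((thread_id, dest, value))

    def retire_all(self) -> None:
        while self.pipeline:
            thread_id, dest, value = self.pipeline.popleft()
            self.regs[thread_id][dest] = value  # tag selects the register file

core = SharedCore()
core.issue(0, "r1", 5)  # both threads write "r1" ...
core.issue(1, "r1", 9)  # ... through the same pipeline
core.retire_all()
print(core.regs)
```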
 
Guden Oden said:
twotonfld said:
Does the VMX unit process 2 simultaneous threads?
It can't 'run threads', as it's an execution unit, not a processor in and of itself. It handles specific opcodes coming from the instruction stream, that's it.

That was a bit of a misstatement on my part - I realize it wasn't really valid as worded.


Guden Oden said:
twotonfld said:
Does the path split leaving 1 thread to pass non-VMX while the other hits the VMX?
Maybe you are slightly confused about what multithreading really means? A multithreaded processor can't execute more operations per cycle than a single-threaded processor with the same number of execution units, no matter how many hardware threads are implemented in it: one, two, or even more.

It ONLY runs instructions from thread two when thread one hits a pipeline stall (or the other way around, depending on how you look at it), assuming there are instructions and data available to be run for thread two, that is. Hence, the max peak throughput is the same for a multithreaded CPU as for a single-threaded one.

I admit most of my threading knowledge resides at the application level, but I thought that each core of the 360's CPU processes 2 simultaneous HW threads per cycle, which would mean it does increase the maximum throughput, barring pipeline bubbles that would stall one thread, the other, or both depending on the conditions. Anyway, I think you hit the meaning of this question more in the next quote.

Guden Oden said:
twotonfld said:
Does the thread block simultaneous execution when the VMX unit of a processor is utilized?
If the VMX unit is independent of the floating-point processor, it should at least in theory be able to run in parallel. Of course there could be other bottlenecks in the chip, such as instruction decode limits, cache read/write bandwidth limitations, etc. that prevent every execution unit from issuing instructions all in the same clock cycle...

This is really what I was getting at in the prior question - is the VMX unit separate from the other execution units, or does it block the execution of parallel threads?
 
twotonfld said:
This is really what I was getting at in the prior question - Is the VMX unit separate from the other execution units or does it block the execution of parallel threads.

There's no reason at all why they would design it so it would block the execution of a thread that isn't using the VMX unit. That would be a very silly thing to do.
 
1 ?

What would the real use of the VMX units be? I thought they would (or could) boost all math ops: vertex work, physics, procedural work and so on. But in the other thread it was said that physics, procedural work, AI, etc. are not a big thing for the VMX units.

Can someone explain this to me?

Thanks.
 
aaaaa00 said:
twotonfld said:
This is really what I was getting at in the prior question - Is the VMX unit separate from the other execution units or does it block the execution of parallel threads.

There's no reason at all why they would design it so it would block the execution of a thread that isn't using the VMX unit. That would be a very silly thing to do.

Which is why I'm asking - it never hurts to make sure, and it's not like there haven't been odd HW gaffes in the past.
 
Is the VMX unit separate from the other execution units or does it block the execution of parallel threads.
The two threads alternate - they aren't executed in parallel. So no, they don't increase the max throughput either.

Anyway, VMX is supposed to have two register sets, iirc, so obviously it's intended to run independently on each thread.
 
In that case, if they alternate, are you not really hitting 6 threads per cycle as MS and many other people are suggesting?

Anyway, VMX is supposed to have two register sets, iirc, so obviously it's intended to run independently on each thread.

That's right - MS did say that there were 128 registers per HW thread (oops).
 
twotonfld said:
In that case, if they alternate, are you not really hitting 6 threads per cycle as MS and many other people are suggesting?
Who suggests such a thing? That's completely wrong, and it isn't supported in MS' published specs either (would over-inflate peak performance figures twofold). Like I said, peak throughput is the same because all hardware threads share the same SINGLE set of execution units. If the execution units were doubled, it wouldn't be a three-core chip, but a six-core chip. ;)
 
According to MS's specs there are 2 HW threads per core which the majority of the net is taking to mean 6 threads processed per cycle (I guess assuming that they're using different EUs on the core). I think it's even been mentioned at B3D.
 
twotonfld said:
According to MS's specs there are 2 HW threads per core which the majority of the net is taking to mean 6 threads processed per cycle. I think it's even been mentioned at B3D.

That is where the public has misinterpreted what 6 HW threads means.

XeCPU does have 6 HW threads, but they are not executed on concurrent cycles. Basically, think of it as 3 P4s with hyperthreading: 3 real cores, each core capable of 2 threads in hardware. As Guden said, if it could execute both threads at the same time on the same cycle, it would be a 6-core processor! Basically, threads != cores. There's a lot of confusion right now, so don't worry. After fall it won't matter anymore ;)
 