Xenon VMX units - what have we learned?

BadTB25

Veteran
OK,

I didn't realize that I was posting in an old (and dead) thread (Is the 360 cpu limited?)

The Xbox360 has been out for a while with the developers having a year+ of experience on final dev kits.

How much of the VMX is actually being used and, in comparison to Cell, was the larger register file the right choice?

What are its advantages, if any?

I've been a lurker of these forums for a long time and although Cell is discussed in detail, I haven't seen too much info on the Xenon. (I have read the B3D article)

Is it because it is a more common architecture relative to Cell?
 
Graham said:
A while ago, I posted a thread on XNA performance.

Now take this with a quarry of salt, because there are millions of factors at play (primary one being I don't have access to a 360 devkit... yet). However, given the 360's compact framework has very limited floating point optimisation, has a non-generational garbage collector, very limited (if any) inlining, no ability to do automatic pass by ref instead of pass by value (sort of like inlining), and only gives you access to 4/6 threads... I still managed to get over half the performance of my dual core X2. And I was hardly touching the XNA maths libs, which are apparently optimised to buggery boo.

So assuming the 360's CF is 75% efficient compared to the windows framework (which is crazy generous), and the 2/3 threads, you are looking at something that is *faster* than my X2.
Which at the time I got it was more expensive than the 360 :p

So no, I don't think it's cpu limited. I think it's a very good console cpu. Though it does mean that for single threaded stuff, XNA performance is pants :p
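Graham's back-of-envelope estimate can be written out as a small calculation. This is only a sketch of the arithmetic in the quoted post: the function name and parameters are hypothetical, and the inputs (about 0.5 measured, 75% framework efficiency, 4 of 6 threads) are the rough figures from the post, not measurements.

```c
#include <assert.h>

/* Hypothetical helper: scales a measured performance ratio (managed code on
   the 360 vs. the reference PC) up to what the hardware might deliver if the
   framework were fully efficient and every hardware thread were usable.
   All three inputs are rough assumptions taken from the quoted post. */
double implied_hardware_ratio(double measured_ratio,
                              double framework_efficiency,
                              double thread_fraction)
{
    assert(framework_efficiency > 0.0 && thread_fraction > 0.0);
    return measured_ratio / (framework_efficiency * thread_fraction);
}
```

Calling implied_hardware_ratio(0.5, 0.75, 4.0 / 6.0) gives about 1.0, so any measured ratio above one half implies a result above 1 under these assumptions, which is where the "faster than my X2" conclusion comes from.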

Repost
 
Yes I understand, the thread was closed due to being old (resurrected).

It still doesn't answer the topic questions?
Is the info under NDA or is there nothing really interesting to be gleaned?

Also he states he doesn't have access to a dev kit, so from the devs or others in the know, what has been learned?

I apologize if my query isn't clear... :oops:
 
Everything is under NDA. Don't expect anyone to come out in a public forum and say "well, we used VMX128 in such and such way, and got such and such improvement over plain VMX".
 
Having read the other thread I thought I'd reply with a point which wasn't really raised.
There was a lot of x86 Vs XCPU discussion but they're not directly comparable.

x86 CPUs are designed for the code they run - typically old single threaded, branchy integer stuff.
XCPU was designed for multi-threaded vector stuff. OOO is a big benefit to x86 cores but it would be a disadvantage to the XCPU as it would lower the clock speed.

x86 cores will be good at what they are designed for, XCPU will be good at what it's designed for.

--

As for VMX128 Vs standard VMX (aka AltiVec), the biggest difference will be the extra registers - they mean you can unroll loops further and this can give a hefty performance boost. Also, XCPU has high memory latency as it's working from shared RAM; the additional registers mean it can hide this better than 32 registers would.
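To illustrate why more registers allow deeper unrolling, here is a minimal sketch in plain scalar C. It does not use VMX128 intrinsics (none are documented in this thread), and the function names are made up for the example; the point is only that each extra independent accumulator needs its own register.

```c
#include <assert.h>
#include <stddef.h>

/* Simple sum: every add depends on the previous one, so an in-order core
   waits out the full add latency on each iteration. */
float sum_simple(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Unrolled four ways with independent accumulators. The four adds in the
   loop body do not depend on each other, so they can overlap in the
   pipeline. Each accumulator occupies a register, which is why a larger
   register file permits a larger unroll factor. Assumes n is a multiple
   of 4 to keep the sketch short. */
float sum_unrolled4(const float *x, size_t n)
{
    assert(n % 4 == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    return (a0 + a1) + (a2 + a3);
}
```

Note that the unrolled version adds in a different order, so floating point results can differ slightly from the simple loop on general data.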
 
OOO is a big benefit to x86 cores but it would be a disadvantage to the XCPU as it would lower the clock speed.
This is not categorically true.

Pentium 4, Core 2 and POWER6 all run at 3.2GHz or more quite comfortably with very strong OOO abilities.

Perhaps it would be better to say having OOO would either make the chip larger or reduce the amount of execution hardware if chip size is kept the same.

Peace.
 
Having read the other thread I thought I'd reply with a point which wasn't really raised.
There was a lot of x86 Vs XCPU discussion but they're not directly comparable.

x86 CPUs are designed for the code they run - typically old single threaded, branchy integer stuff.
XCPU was designed for multi-threaded vector stuff. OOO is a big benefit to x86 cores but it would be a disadvantage to the XCPU as it would lower the clock speed.

x86 cores will be good at what they are designed for, XCPU will be good at what it's designed for.

According to that book on designing the 360, MS wanted an OOOE CPU and only didn't get one because IBM failed to deliver. Fair enough, the failure to deliver may have been because such a CPU was impossible given the budget/time/heat/power constraints that the 360 existed under, but nevertheless it seems to me that from a raw performance point of view, all other constraints put aside, a half decent x86 dual core CPU would have been a better performer.

Not to mention a hell of a lot easier to use.
 
This is not categorically true.

Pentium 4, Core 2 and POWER6 all run at 3.2GHz or more quite comfortably with very strong OOO abilities.

The quad core 2 chips are considerably larger than the XCPU and are built in 65nm tech, they run at 2.66GHz. They run 35W hotter than the XCPU.

POWER6 is a (mostly) in-order processor. It runs at a very high clock rate and has VMX but it too is built on 65nm, it's the fastest processor money can buy - but at around 180W it's probably also the hottest...

Perhaps it would be better to say having OOO would either make the chip larger or reduce the amount of execution hardware if chip size is kept the same.

OOO involves highly complex logic and this burns a lot of power as you up the clock.
In the XCPU case they wanted fast VMX execution, it doesn't really benefit from OOO but does from clock speed - so they went for a higher clocked in-order 3 core design instead of an OOO design which would have had to either be clocked a lot lower or use 1 less core. They did this at 90nm, not 65nm.

Sun did something similar with Niagara. It's designed for servers, OOO is useful there as well but having lots of hardware threads is a lot more useful. They built an in-order processor with 8 cores & 32 threads, they also shoved on 4 memory controllers and a load of cache, they managed to keep the power down to 79 Watts but it rather severely restricted the clock speed.

OOO is only one tool CPU designers can use, but like all of them it has pros and cons. It also seems to affect some ISAs more than others; Intel's and IBM's figures for the benefit it gives are wildly different.
 
Would having the different VMX128 (units?) in addition to having 3 compared to 1 VMX for the Cell make it more difficult to develop games from X360 to PS3?

Since each VMX128 is tied to each core of Xenon, are they able to work independently of their cores ala SPU's?

I haven't read that this has really been an issue or that devs are really taking advantage of it. Could MS have gotten away with just 1 instead of 3?

My understanding is that the extra registers allow better performance per cycle.
 
The quad core 2 chips are considerably larger than the XCPU and are built in 65nm tech, they run at 2.66GHz.
2.93GHz is the fastest current model I believe but they can run much faster than that quite easily.

They run 35W hotter than the XCPU.
So? It's a far more complex and powerful CPU. Of course it'll draw more power. This is a red herring.

but at around 180W it's probably also the hottest...
Hottest current chip most likely, as plenty of supercomputer CPUs of the past burned amazing levels of power. But anyway - it's not the transistor count that limits clock rate in and of itself.

When firmer details of the R300 GPU started to leak out and it was said to feature 100+ million transistors some people refused to believe it could run at something as fast as 325MHz. Well - it did! :cool:

In the XCPU case they wanted fast VMX execution, it doesn't really benefit from OOO but does from clock speed
I doubt their actual goal was as single-minded as that. Not knocking VMX or anything but there's a lot more to game code than just VMX.

so they went for a higher clocked in-order 3 core design instead of an OOO design which would have had to either be clocked a lot lower
Again you merely parrot an unconditional statement.

Barring certain restrictions it may or may not be true!

Sun did something similar with Niagara.
The niagara cores were never intended to be high-performance processors on their own. Also the chip is intended for a totally different niche of computing. So I'd say it's not comparable to our situation.


Would having the different VMX128 (units?) in addition to having 3 compared to 1 VMX for the Cell make it more difficult to develop games from X360 to PS3?
The forum generally frowns on versus/comparison threads as they easily degenerate into "yes it is!"/"no it isn't!"/"your console sucks!"/"no your console sucks!" style bickering.

That said.. Since the 360 and PS3 are in and of themselves very different beasts this is but one aspect where they differ. Making any specific claims about how this particular difference affects porting code is beyond my abilities but since nearly everything else is almost polar opposites on the two machines it's but one stumbling block on a very long road.

Since each VMX128 is tied to each core of Xenon, are they able to work independently of their cores ala SPU's?
No they're not separate processors. They're part of the processor itself. If the processor is an assembly line they're one of several robots sticking components onto circuit boards as they zip past. :cool: They can't act independently nor would you perhaps really want them to in a design like the 360.

Peace.
 
POWER6 is a (mostly) in-order processor. It runs at a very high clock rate and has VMX but it too is built on 65nm, it's the fastest processor money can buy - but at around 180W it's probably also the hottest...

Bit off topic but I have major doubts that Power6 would be faster as a desktop (or console) CPU than Core 2.
 
2.93GHz is the fastest current model I believe but they can run much faster than that quite easily.

Ok, for a console they all have to run at a fixed rate so over-clocking is irrelevant.

So? It's a far more complex and powerful CPU. Of course it'll draw more power. This is a red herring.

Power is the most important factor in CPU designs these days, it's even more important in consumer electronics, if a CPU is too hot it can't be used.

Hottest current chip most likely as plenty of supercomputer CPUs of the past burned amazing levels of power..

You have a point, some of those were above 100K Watts...

But anyway - it's not the transistor count that limits clock rate in and of itself.

When firmer details of the R300 GPU started to leak out and it was said to feature 100+ million transistors some people refused to believe it could run at something as fast as 325MHz. Well - it did! :cool:

It's not usually the number of transistors, it's the type. Cache uses low power transistors so you can throw on heaps of them.

I doubt their actual goal was as single-minded as that. Not knocking VMX or anything but there's a lot more to game code than just VMX.

Read some of Mike Acton's comments, IIRC he said the majority of the code actually running is just transformations, that's all VMX stuff.
However just look at the design of the 2 most powerful processors designed for games (XCPU and Cell) - both are heavily orientated towards high speed vector code.

Again you merely parrot an unconditional statement.

Barring certain restrictions it may or may not be true!

Ok, to explain:

OOO hardware is highly complex and gets more complex the more aggressive it gets. It's critical to the performance of the entire chip so it also needs to run fast, that means lots of high power transistors.

The clock speed is limited by the power consumption, you might be able to clock it higher but power goes up rapidly to the point where it becomes difficult or even impossible to reasonably cool.

An in-order design doesn't have all these extra transistors so isn't consuming as much power. Thus at the same power level an in-order design can have a higher clock.
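The power argument above can be sketched with the usual first-order model for CMOS dynamic power, P = a * C * V^2 * f. This is a simplification: it ignores leakage, ignores the fact that higher clocks usually need higher voltage, and says nothing about the critical path. The function names are made up for the example and no real chip figures are implied.

```c
#include <assert.h>

/* First-order dynamic power model: P = a * C * V^2 * f, where a is the
   switching activity factor, C the switched capacitance (which grows with
   the amount of fast logic on the chip), V the supply voltage and f the
   clock. Leakage is deliberately ignored. */
double dynamic_power_watts(double activity, double capacitance_farads,
                           double volts, double hertz)
{
    return activity * capacitance_farads * volts * volts * hertz;
}

/* Highest clock that fits a power budget under the same model. With
   everything else fixed, halving the switched capacitance doubles the
   clock that fits in the budget. */
double max_clock_hz(double budget_watts, double activity,
                    double capacitance_farads, double volts)
{
    assert(activity > 0.0 && capacitance_farads > 0.0 && volts > 0.0);
    return budget_watts / (activity * capacitance_farads * volts * volts);
}
```

Under this model a design with less switched capacitance reaches a higher clock at the same power, which is the claim being made for in-order cores. Whether the clock is actually power-limited rather than limited by the longest logic path is a separate question the model does not answer.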

The results are obvious - An XCPU core uses around 25W at 3.2GHz in 90nm tech, OOO designs at the same clock use at least twice that.

This may or may not be an advantage depending on the type of code being run. For clock sensitive software it's obviously a gain.

MS could have just used G5 cores if they wanted more integer power - but they didn't.

The niagara cores were never intended to be high-performance processors on their own.

Also the chip is intended for a totally different niche of computing. So I'd say it's not comparable to our situation.

I was using it as an example, to show the trade-offs involved in CPU design. They went for low power cores because they wanted lots of them. The design paid off because as benchmarks show - in a server Niagara is a *very* powerful processor.
 
Ok, for a console they all have to run at a fixed rate so over-clocking is irrelevant.
NO, because the argument was that OOOE, because of its transistor count, limits clock speed; obviously it does not!

Power is the most important factor in CPU designs these days, it's even more important in consumer electronics, if a CPU is too hot it can't be used.
It's STILL a red herring because what you just wrote there has nothing to do with OOOE/transistor count limiting clock speed.

The clock speed is limited by the power consumption
I believe clock speed is much more limited by the 'critical path' of a chip but whatever. You're still following what is essentially a red herring argument. If you're arguing that power consumption and heat output are the limit on clock speed - then DO SO. You have to pick one or the other.

OOOE is not a limit to clock speed any more than any other feature of a microchip - properly implemented anyway. This is an entirely different issue compared to power draw.

Peace.
 
Didn't the Cell have 7 VMX units? One per Core, including the SPEs? And didn't the SPEs have 128x128bit registers?
 
The forum generally frowns on versus/comparison threads as they easily degenerate into "yes it is!"/"no it isn't!"/"your console sucks!"/"no your console sucks!" style bickering.

That said.. Since the 360 and PS3 are in and of themselves very different beasts this is but one aspect where they differ. Making any specific claims about how this particular difference affects porting code is beyond my abilities but since nearly everything else is almost polar opposites on the two machines it's but one stumbling block on a very long road.



Peace.


My queries aren't meant to be flamebait.

I have refrained from registering and posting on B3D mainly because the discussions are mostly over my head. I am a fan of these forums as the discussions tend to be more mature. I would not like this thread to degenerate into the usual fanboi drivel.

I am honestly trying to find out more about the VMX units in Xenon. Yes, I know that I am asking for comparisons to Cell, but the intention is to ask what could have been done differently. Would 1 have been enough compared to 3?

I guess because MS wanted symmetrical cores, separating them from 2 of the 3 would've been out of the question.

Comparisons can be useful in this case, because multiplatform games of comparable quality will be represented on the PS3 and X360. Just as some games will look and play differently due to the SPUs in cell, would this not also be the case because of the extra VMXs in X360?

That being said, are they being underutilized the same way that memexport seems to be? I just do not see much discussion relating to Xenon.

Is it because of NDAs or is the chip not that interesting?

BTW, I will pick up a PS3 when it drops to around $350, mainly for Sony exclusives and the PS2's awesome backlog of games, and I am equally interested in Cell as well. There just seems to be so much more information in these forums relating to Cell.
 
Didn't the Cell have 7 VMX units? One per Core, including the SPEs? And didn't the SPEs have 128x128bit registers?
SPUs don't have VMX units, even if SPU ISA borrows from VMX ISA here and there
 
To clarify, VMX units, from an ISA perspective, are much more complete as a multimedia vector processor.

They just aren't processors :) It's plain die logic which is totally controlled by the CPU (and blocks it with every call it makes).
 