Xenon VMX units - what have we learned?

AlNom · Aug 13, 2007

Tap In said:
I think somewhere on here Fran and Joker have both mentioned VMX units and their distinct usefulness in 360 games but they are buried deep in threads. :smile:

It's a good thing we can search their posts and that they have relatively few.

Btw, BadTB25, this is probably the only official public IBM document surrounding Waternoose:

http://www-128.ibm.com/developerworks/power/library/pa-fpfxbox/

I'm sure you've read it or it may have been in the other thread, but it's a nice reminder (for myself at least).

bkilian · Aug 13, 2007

BadTB25 said:
Curiosity really, and not just in the VMX specifically. You are right though: due to my limited knowledge, I was perceiving them to be something more than they are.

I am also very interested in other parts of the X360 architecture, such as Memexport, the EDRAM implementation and Xenos, that have had comparatively less discussion on these boards.

You're forgetting the mysterious tessellation unit too. Has anyone used this? Does it actually do anything useful for a game?

Crazyace · Aug 13, 2007

This is probably useful as well:

http://pc.watch.impress.co.jp/docs/2005/1028/kaigai06.pdf

Talks about 4-cycle permute and 14-cycle dot product latency

http://pc.watch.impress.co.jp/docs/2005/1028/kaigai02.pdf

Has a pipeline diagram (you need Japanese fonts in Adobe Reader)

Asher · Aug 14, 2007

flec04 said:
As for output, the theoretical peak performance of an Intel 3 GHz P4 using SSE instructions is 6 GFLOPS. This provides a ballpark figure for the Xenon's VMX units, considering I was unable to uncover exact figures.

Sorry, this is nowhere close to the performance of VMX128. There are some fundamental differences between the two, for instance in the P4's SSE units, all instructions take at least two cycles vs one cycle in VMX128. That's a huge difference right there -- and there are many, many more.
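For reference, here is a back-of-envelope sketch of how a peak figure like the quoted 6 GFLOPS can be reached. It assumes 4 single-precision lanes per SSE instruction and one instruction completing every two cycles; the function name and parameters are hypothetical, and it counts one floating-point operation per lane, ignoring multiply-add pairing and real-world stalls.

```c
#include <assert.h>

/* Peak GFLOPS = clock (GHz) x lanes per instruction / cycles per instruction.
   peak_gflops is a hypothetical helper for illustration only. */
static double peak_gflops(double clock_ghz, int lanes, int cycles_per_instr) {
    assert(lanes > 0 && cycles_per_instr > 0);
    return clock_ghz * (double)lanes / (double)cycles_per_instr;
}
```

With these assumptions, `peak_gflops(3.0, 4, 2)` gives 6.0, and halving the cycles per instruction doubles the result, which is why the one-cycle versus two-cycle difference matters so much.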

flec04 said:
Each SPE on CELL is a dedicated high speed vector processor, they are not add-on units & they share no resources. They each have 256K of LS available bringing their combined total to 1.792MB (7X 256k).

When you add marketing terms like "high speed", it essentially nullifies what you say afterwards. This board is not about regurgitating marketing, it's about understanding the true nature of the hardware.

flec04 said:
Each SPE achieves around 25 GFLOPS; consider the fact that there are 6x SPEs & it's no surprise MS ignore the SPEs when discussing the vector processing abilities of their 3x add-on VMX units. For vector-based computations the PS3 outdoes the 360 by an order of magnitude.

Order of magnitude? Oh, please.

Yes, the SPUs are faster vector processors than Xenon. But not THAT much faster. A Xenon core can also be much, much faster than the SPUs, another point conveniently left out in your summary. Welcome to the world of different processors with different target applications. One is not inherently better than the other.

liolio · Aug 14, 2007

Those who have real interest can find a description of the xcpu here :
http://www-128.ibm.com/developerworks/power/library/pa-fpfxbox

Tap In · Aug 14, 2007

liolio said:
Those who have real interest can find a description of the xcpu here :
http://www-128.ibm.com/developerworks/power/library/pa-fpfxbox

thanks

While the term VMX is familiar to PowerPC users, the implementation on the Xbox 360 processor is a new design called VMX128 which was specially enhanced to accelerate 3D graphics and game physics.
The number of vector registers was increased from 32 to 128. All 128 registers are directly addressable, and the original 32 registers are mapped to the first 32 entries of 128-entry vector register file, and so are compatible with the original PowerPC ISA. We also added a number of new instructions. Instructions were added to calculate the dot-product of two vectors made from three or four floating point values. Data formatting instructions were added to help improve the processing of data that has been packed into memory to reduce the program size. These include: instructions for rotate and insert operations, pack/unpack instructions for handling Direct3D® data types, and loads and stores for misaligned data. The VMX128 ISA is binary-compatible with a subset of VMX. A few vector floating point and vector integer instructions are no longer supported, and attempting to execute them will result in the system illegal instruction handler being invoked.
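As a plain scalar reference for what a three- or four-component dot product computes, here is a minimal sketch in portable C. It is not VMX128 code: `dot_n` is a hypothetical helper name, and the real instruction works on vector registers rather than arrays in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference for a 3- or 4-component dot product.
   dot_n is a hypothetical helper, not a VMX128 intrinsic. */
static float dot_n(const float *a, const float *b, size_t n) {
    assert(n == 3 || n == 4);   /* only the two sizes the article mentions */
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];     /* multiply matching components, then accumulate */
    }
    return sum;
}
```

The 3-component form is the one you would use for positions and normals, where the fourth lane holds padding or unrelated data.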

joker454 · Aug 14, 2007

Fafalada said:
10x more evil when they come with a latency that makes their advantage moot in cases where it's supposed to be most important (non loopy code).
There's nothing more evil than hw features that have more PR than practical value.

Hmm hold on a sec, from what I'm seeing in my vmx code, it's all scheduled quite cleverly to the point that a lot of the latency seems to be getting absorbed. I'm sure some will bring up a worst-case scenario, but from where I'm standing it's looking pretty good. There are 128 registers that can be used (per core), so just batch up your loops, choose your algorithm carefully and vmx can be quite a hoot. I'm currently doing all manner of stuff via vmx on 360, stuff like calculating predicated tiles for each of a 40000+ crowd, visibility tests for the same 40000+ crowd, particle stuff, etc., and I'm barely scratching the surface of the power of one core. And yes, I am using the dot product instruction to a significant extent.

On paper it may not seem wise (14 cycles) but that can be absorbed. Still got tons of cpu power left.
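To illustrate the "batch up your loops" idea, here is a rough sketch in portable C, not the actual game code and not VMX intrinsics. It assumes a hypothetical `plane` type and a point-versus-plane visibility test; the point is only that each iteration computes four results that do not depend on each other, which is the kind of independence a scheduler needs to overlap a long-latency instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plane: dot(n, p) + d >= 0 means the point is in front. */
typedef struct { float nx, ny, nz, d; } plane;

/* Signed distance of one x,y,z point from the plane. */
static float plane_distance(plane pl, const float *p) {
    return pl.nx * p[0] + pl.ny * p[1] + pl.nz * p[2] + pl.d;
}

/* Counts points in front of the plane, four per iteration.
   d0..d3 are independent, so none has to wait on another's result. */
static size_t count_visible(const float *xyz, size_t count, plane pl) {
    assert(count % 4 == 0);  /* sketch: caller pads the batch to a multiple of 4 */
    size_t visible = 0;
    for (size_t i = 0; i < count; i += 4) {
        const float *p = xyz + 3 * i;
        float d0 = plane_distance(pl, p + 0);
        float d1 = plane_distance(pl, p + 3);
        float d2 = plane_distance(pl, p + 6);
        float d3 = plane_distance(pl, p + 9);
        visible += (size_t)(d0 >= 0.0f) + (size_t)(d1 >= 0.0f)
                 + (size_t)(d2 >= 0.0f) + (size_t)(d3 >= 0.0f);
    }
    return visible;
}
```

With a large register file you could widen the batch further; four is just an arbitrary choice for the sketch.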

Cross platformness doesn't concern me in these cases because the heavy hitting code is written custom for each platform anyways. My PS3 implementation of this same code is completely different.

Hardknock · Aug 14, 2007

joker454 said:
Hmm hold on a sec, from what I'm seeing in my vmx code, it's all scheduled quite cleverly to the point that a lot of the latency seems to be getting absorbed. I'm sure some will bring up a worst-case scenario, but from where I'm standing it's looking pretty good. There are 128 registers that can be used (per core), so just batch up your loops, choose your algorithm carefully and vmx can be quite a hoot. I'm currently doing all manner of stuff via vmx on 360, stuff like calculating predicated tiles for each of a 40000+ crowd, visibility tests for the same 40000+ crowd, particle stuff, etc., and I'm barely scratching the surface of the power of one core. And yes, I am using the dot product instruction to a significant extent. On paper it may not seem wise (14 cycles) but that can be absorbed. Still got tons of cpu power left.

Cross platformness doesn't concern me in these cases because the heavy hitting code is written custom for each platform anyways. My PS3 implementation of this same code is completely different.

You're working on Madden?

joker454 · Aug 14, 2007

Hardknock said:
You're working on Madden?

Oh goodness no

I'm on a baseball game.

patsu · Aug 14, 2007

Asher said:
Yes, the SPUs are faster vector processors than Xenon. But not THAT much faster. A Xenon core can also be much, much faster than the SPUs, another point conveniently left out in your summary. Welcome to the world of different processors with different target applications. One is not inherently better than the other.

Can you give some examples ?

If joker454 is doing visibility tests for 40000+ crowd, particle effects, ... on just 1 VMX, and barely scratching the surface of the power of one core; then I want to know what's possible in a Gen X game when all 3 VMXes and all 7 SPU vector engines are used (keeping the PPU VMX spare)

joker454 · Aug 14, 2007

patsu said:
Can you give some examples ?

I'm curious as well

If someone has a game relevant example, post it!

pjbliverpool · Aug 14, 2007

joker454 said:
Hmm hold on a sec, from what I'm seeing in my vmx code, it's all scheduled quite cleverly to the point that a lot of the latency seems to be getting absorbed. I'm sure some will bring up a worst-case scenario, but from where I'm standing it's looking pretty good. There are 128 registers that can be used (per core), so just batch up your loops, choose your algorithm carefully and vmx can be quite a hoot. I'm currently doing all manner of stuff via vmx on 360, stuff like calculating predicated tiles for each of a 40000+ crowd, visibility tests for the same 40000+ crowd, particle stuff, etc., and I'm barely scratching the surface of the power of one core. And yes, I am using the dot product instruction to a significant extent. On paper it may not seem wise (14 cycles) but that can be absorbed. Still got tons of cpu power left.

Just out of interest, do you think you could achieve the same feats on a Core2's SSE4 (or otherwise)?

BadTB25 · Aug 14, 2007

Thanks All

I've seen some of the other docs (they're appreciated nonetheless), but the Crazyace kaigai PDFs are new to me.

Shifty Geezer said: "The primary reason being there's nothing to discuss! There's no info out there. Devs aren't talking about the hardware, in contrast to PS3 devs who give us things to chew on."

That is frustrating and disappointing. Hopefully with threads like this, a little bit of info can be gleaned. For example, I would not have thought that as many things as joker454 is doing would be possible on just 1 VMX128 (and barely scratching the surface at that). joker454, one of these days you're going to have to tell us which one it is. Hints are great and all, but I'm a little slow.

I'm interested, since he brought it up, in how this is handled with the PS3.

inefficient · Aug 14, 2007

pjbliverpool said:
Just out of interest, do you think you could achieve the same feats on a Core2's SSE4 (or otherwise)?

The addition of the AoS dot product instruction to SSE4 should make that code easier to port over while maintaining similar performance characteristics.

It doesn't strike me as being a feat for either CPU to accomplish this. Like he said, a single core wasn't breaking a sweat even with a suboptimal technique. Unless the feat itself was that suboptimal code still ran well on VMX128.
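For anyone unsure what "AoS dot product" refers to, here is a small sketch in portable C contrasting the two data layouts. The type and function names are hypothetical, and no SSE or VMX intrinsics are used. In the AoS (array of structures) form one vector pair produces one result by summing across its own components; in the SoA (structure of arrays) form each component lives in its own array and many dot products are computed side by side.

```c
#include <assert.h>
#include <stddef.h>

/* AoS: each element holds one complete vector. */
typedef struct { float x, y, z, w; } vec4_aos;

/* SoA: each array holds one component for all vectors. */
typedef struct { const float *x, *y, *z, *w; } vec4_soa;

/* AoS dot: the four products of a single vector pair are summed together. */
static float dot_aos(vec4_aos a, vec4_aos b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* SoA dot: computes n dot products; every step is a lane-wise multiply-add
   across independent vectors, so no sum across a single register is needed. */
static void dot_soa(vec4_soa a, vec4_soa b, float *out, size_t n) {
    assert(out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i]
               + a.z[i] * b.z[i] + a.w[i] * b.w[i];
    }
}
```

A dedicated dot product instruction maps directly onto `dot_aos`, which is why code written around one ports more easily to another CPU that has the same kind of instruction.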

Dr. Nick · Aug 14, 2007

bkilian said:
You're forgetting the mysterious tessellation unit too. Has anyone used this? Does it actually do anything useful for a game?

Yeah, I ask a question about this every so often to find out whether its uses are just theory or whether it has actually been put to real-world use. I would like to know if it is even worth talking about at all.

Asher · Aug 14, 2007

patsu said:
Can you give some examples ?

If joker454 is doing visibility tests for 40000+ crowd, particle effects, ... on just 1 VMX, and barely scratching the surface of the power of one core; then I want to know what's possible in a Gen X game when all 3 VMXes and all 7 SPU vector engines are used (keeping the PPU VMX spare)

(Disclaimer: The last time I used Cell or Xenon hardware or development tools was in ~August 2005.)

I think my comment may have been misinterpreted -- it's genuinely hard to find cases where the VMX units will outperform an SPE. Most of them would be pretty specific cases using D3D formats and specialized instructions, not general purpose vector code.

The cases where Xenon can outperform SPEs rely mostly on branching code of any kind. This isn't a secret, and yes, there are ways to implement such code on SPEs as well, but the performance doesn't come close to Xenon. And yes, this is why there's a PPE in Cell as well. But there's only one PPE.

Cell is stronger with vector processing*, Xenon is stronger with most logic and flow processing*. In general.

* except for exceptions!

Asher · Aug 14, 2007

inefficient said:
The addition of the AoS dot product instruction to SSE4 should make that code easier to port over while maintaining similar performance characteristics.

It doesn't strike me as being a feat for either CPU to accomplish this. Like he said, a single core wasn't breaking a sweat even with a suboptimal technique. Unless the feat itself was that suboptimal code still ran well on VMX128.

Are you guys referring to Core2's SSSE3 (what some mistakenly refer to as SSE4), or SSE4 that'll be in the Penryn cores?

pjbliverpool · Aug 14, 2007

Asher said:
Are you guys referring to Core2's SSSE3 (what some mistakenly refer to as SSE4), or SSE4 that'll be in the Penryn cores?

Sorry, my bad. I guess the question is relevant to both implementations; however, I expect no one has the exposure to answer it for Penryn yet.

Although educated guesses would be welcome!

Dr. Nick · Aug 14, 2007

Asher said:
Are you guys referring to Core2's SSSE3 (what some mistakenly refer to as SSE4), or SSE4 that'll be in the Penryn cores?

Probably whatever is in the Core2 chips right now. If you have any thoughts about how SSE4 performs, we would like to know about them also.

patsu · Aug 14, 2007

Asher said:
(Disclaimer: The last time I used Cell or Xenon hardware or development tools was in ~August 2005.)

I think my comment may have been misinterpreted -- it's genuinely hard to find cases where the VMX units will outperform an SPE. Most of them would be pretty specific cases using D3D formats and specialized instructions, not general purpose vector code.

Yeah... I suspect so, but it's always good to hear alternate views.

I remember you worked for IBM. I have another question regarding IBM's claim of a run-time layer on top of Cell (software cache + automatic code management or something of that sort). But it's off topic here. Let me go gather sufficient material first before I start a new thread. I always wanted to know what kind of overhead we are looking at (in exchange for ease of development).
