Xenon VMX units - what have we learned?

No, it is not a full-out processor. That is not necessarily a disadvantage, and it does not necessarily "block" the CPU either with the 'magic' of SMT.
 
There just seems to be so much more information in these forums relative to Cell.

Yes. That is because IBM put the whole documentation of the Cell in the public domain. :) That certainly helps!

Actually, I was looking for more information on the SPUs just now, and found that they have an additional set of 128 x 128-bit registers, called special purpose registers. Here's IBM's full documentation of the SPUs specifically.

http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44/$file/SPU_ISA_v1.2_27Jan2007_pub.pdf
 
No, because the argument was that OOOE limits clock speed because of its transistor count; obviously it does not!

It's STILL a red herring because what you just wrote there has nothing to do with OOOE/transistor count limiting clock speed.

You've lost me here I'm afraid, I never made that argument. What exactly are you arguing?

I believe clock speed is limited much more by the 'critical path' of a chip, but whatever. You're still following what is essentially a red herring argument. If you're arguing that power consumption and heat output are the limit on clock speed, then DO SO. You have to pick one or the other.

No I don't. There are *several* factors which limit clock speed.

OOOE is not a limit to clock speed any more than any other feature of a microchip - properly implemented anyway. This is an entirely different issue compared to power draw.

It's very closely related to power draw.

You can do OOO in 2 broad ways - the Pentium 4 method or the G4 method.
The Pentium 4 method pairs an OOO processor with a high clock rate. To reach that clock rate, each pipeline stage does relatively little work, which makes the CPU inefficient, just as the Pentium 4 was. The P4 architecture ran into serious power issues and was eventually cancelled.

The other method is what Motorola did with the G4: it is an OOO processor, but it does more per pipeline stage. This limits the clock speed but means the processor uses less power. Intel and AMD both use this method.

In the first case power limited the frequency. In the second case the longer stages limit the clock speed, but the highest clock achievable is also ultimately limited by power.
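One rough way to picture the trade-off between the two approaches is a first-order textbook model: cycle time is the total logic delay divided by the number of pipeline stages, plus a fixed latch overhead per stage. This is only a sketch. The model, the function name `clock_ghz` and every number below are illustrative assumptions, not figures for the Pentium 4 or the G4, and power is not modelled at all.

```python
def clock_ghz(total_logic_ns, stages, latch_overhead_ns):
    """Estimate clock rate in GHz for a pipeline (first-order model, an assumption).

    Each stage does total_logic_ns / stages of work and pays a fixed
    latch overhead, so more stages means a shorter cycle and a higher clock.
    """
    cycle_ns = total_logic_ns / stages + latch_overhead_ns
    return 1.0 / cycle_ns  # 1 / nanoseconds = GHz


# Hypothetical numbers: the same logic split into few long or many short stages.
short_pipe = clock_ghz(total_logic_ns=10.0, stages=10, latch_overhead_ns=0.25)
long_pipe = clock_ghz(total_logic_ns=10.0, stages=20, latch_overhead_ns=0.25)
print(f"10 stages: {short_pipe:.2f} GHz, 20 stages: {long_pipe:.2f} GHz")
```

In this model doubling the stage count raises the clock by less than 2x, because the latch overhead does not shrink with the stages.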

The OOO hardware is complex and needs to be fast, so it is one of the hottest parts of the processor, if not the hottest. In-order designs don't have this, so at the same power limit they can run faster. POWER5+ was a complex OOO machine which ran at up to 2.2GHz; POWER6 is in-order, has 3X more transistors and runs at twice that clock rate.

I don't see why you seem to object to this. Every feature added to a processor has some impact or another; whether it is worth adding depends on the workload.
 
SPUs don't have VMX units, even if the SPU ISA borrows from the VMX ISA here and there.

VMX and the SPU ISA are very similar though (not surprising given the SPU ISA was based on VMX). I ported some code, and the work mostly consisted of adding the framework code to get it working and changing the memory I/O. The actual processing code was almost identical.

The dot product will make a difference as will any other speciality instructions.
 
nAo said:
dot product instructions are evil
10x more evil when they come with a latency that makes their advantage moot in cases where it's supposed to be most important (non loopy code).
There's nothing more evil than hw features that have more PR than practical value.
 
I thought char KenIsGreat[4*1024*1024];

inserted by the SDK tools just to reserve some RAM for the OS was far more evil :p.
 
Would having a different kind of VMX unit (VMX128), and 3 of them compared to 1 VMX for the Cell, make it more difficult to port games from X360 to PS3?

Since each VMX128 is tied to a core of Xenon, are they able to work independently of their cores a la SPUs?

I haven't read that this has really been an issue, or that devs are really taking advantage of it. Could MS have gotten away with 1 instead of 3?

My understanding is that the extra registers allow better performance per cycle.
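One way extra registers can help per-cycle performance on an in-order core is by letting several independent calculations be interleaved, so that one can issue while another waits on a result. The toy simulation below is a hedged sketch of that idea only. It assumes one op issued per cycle, a fixed latency, and hypothetical numbers; `cycles_to_finish` is an illustrative name and nothing here is measured on Xenon. Each independent chain needs its own live registers, which is where a larger register file comes in.

```python
def cycles_to_finish(chains, ops_per_chain, latency):
    """Cycles for an in-order unit issuing one op per cycle (toy model).

    Ops are issued round-robin across `chains` independent dependency
    chains; each op must wait `latency` cycles for the previous op in
    its own chain to produce a result.
    """
    ready = [0] * chains      # cycle at which each chain's next op may issue
    cycle = 0                 # next free issue slot
    for _ in range(ops_per_chain):
        for c in range(chains):
            issue = max(cycle, ready[c])   # stall until the input is ready
            ready[c] = issue + latency
            cycle = issue + 1
    return max(ready)


# Same 8 ops: one long dependent chain versus four interleaved chains.
serial = cycles_to_finish(chains=1, ops_per_chain=8, latency=4)
interleaved = cycles_to_finish(chains=4, ops_per_chain=2, latency=4)
print(serial, interleaved)
```

With these made-up numbers the single chain stalls on every op, while the four interleaved chains keep the issue slot busy.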

The 3 VMX128 units exist to provide some level of vector processing for Xenon's cores. They are add-on units and must share existing resources with each core: 32KB of L1 and the 1MB L2 cache shared between the cores. Each core contains 2 VMX-128 register sets to support both threads on each core. What isn't widely advertised is that each core contains only 1 execution unit, so both threads have to share it.

As for output, the theoretical peak performance of a 3GHz Intel P4 using SSE instructions is 6GFLOPS. This provides a ballpark figure for Xenon's VMX units, considering I was unable to uncover exact figures.

Now there is no question that the 3 VMX-128 units outdo the VMX unit on Cell's PPU, which has only 32 128-bit registers. But I think you'll find that any serious SIMD processing will be carried out on the SPEs. MS deliberately ignores the SPEs and only compares their 3 VMX-128 units to the underdone VMX unit on the PPE.

Each SPE on Cell is a dedicated high-speed vector processor; they are not add-on units and they share no resources. They each have 256K of LS available, bringing their combined total to 1.792MB (7 x 256K).

Each SPE achieves around 25GFLOPS. Consider the fact that there are 6 SPEs, and it's no surprise MS ignores the SPEs when discussing the vector processing abilities of their 3 add-on VMX units. For vector based computations the PS3 outdoes the 360 by an order of magnitude.

Courtesy of Cell Architecture Explained & Ebony’s breakdown of PS3 architecture.
 
dot product instructions are evil, real men don't use them, they use SOA + madds ;)

I'd definitely agree with that for cross-platform code :)

DP can simplify optimizing regular PC-type code though, so it does look like a good design decision on Microsoft's part? It does have latency advantages compared to madds, and it can also be used to implement some non-DP calculations with lower latency.
I suppose how evil it is depends on how much die space it takes (versus beefing up the SOA approach), of which I have no idea.
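For anyone following along, the "SOA + madds" approach mentioned above can be sketched as follows. This is an illustration in plain Python, not VMX or SPU code: lists stand in for SIMD registers, each list comprehension stands in for one vertical instruction, and the function names and sample vectors are made up for the example.

```python
def dot3_aos(a, b):
    """One dot product from two (x, y, z) tuples: a horizontal sum across one vector."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dot3_soa(ax, ay, az, bx, by, bz):
    """Many dot products at once from structure-of-arrays inputs.

    Each list comprehension stands in for one vertical SIMD instruction:
    a multiply followed by two multiply-adds, with one result per lane.
    """
    acc = [x * y for x, y in zip(ax, bx)]                 # mul
    acc = [x * y + r for x, y, r in zip(ay, by, acc)]     # madd
    acc = [x * y + r for x, y, r in zip(az, bz, acc)]     # madd
    return acc


# Four hypothetical vector pairs, stored AoS, then transposed to SoA.
a_aos = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (1, 0, 0)]
b_aos = [(1, 0, 0), (0, 1, 0), (1, 1, 1), (2, 3, 4)]
ax, ay, az = (list(col) for col in zip(*a_aos))
bx, by, bz = (list(col) for col in zip(*b_aos))

# Both layouts give the same four results.
assert dot3_soa(ax, ay, az, bx, by, bz) == [dot3_aos(a, b) for a, b in zip(a_aos, b_aos)]
```

The SoA version needs three instructions for four results and never sums across a register, which is why it needs no dedicated dot product instruction; the cost is keeping the data in that layout.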
 
Yes. That is because IBM put the whole documentation of the Cell in the public domain. :) That certainly helps!

Actually, I was looking for more information on the SPUs just now, and found that they have an additional set of 128 x 128-bit registers, called special purpose registers. Here's IBM's full documentation of the SPUs specifically.

http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44/$file/SPU_ISA_v1.2_27Jan2007_pub.pdf

Thank you very much for that link; I am reading through it now. As I've said, most of it is over my head, but interesting nonetheless.

So from what flec04 said, the difference between the VMX units of the 2 consoles shouldn't be an issue in porting from X360 to PS3 because of the SPEs. I would assume the reverse would not be true, considering the PS3 has the VMX in the PPU and the 6 SPEs to utilize vs the 3 VMX128 units of the X360.

" For vector based computations the PS3 outdoes the 360 by an order of magnitude"

What would be the primary benefit of this? Better physics, particle effects, etc?

Also, I did some googling and have learned a little more about the nature of the VMX128 of the X360. It seems that a few execution instructions were removed to make room for the additional registers. Any idea what was removed? Anything significant?

nAo, Fafalada and others still make my head spin, but I'm trying to keep up.
 
" For vector based computations the PS3 outdoes the 360 by an order of magnitude"

What would be the primary benefit of this? Better physics, particle effects, etc?
Better everything, in theory! All computing is shifting data around in the form of numbers and doing sums on it. The more numbers you can crunch, the more stuff you can work out. There's a complication in the movement of data too though, and if your algorithm uses lots of data that can't be crunched efficiently, the ability to do lots of maths is no good. However, as understanding improves, more and more functions are being mapped onto fast vector processors, such that eventually pretty much all areas should benefit.

Thus every facet of games could see a benefit. Though as a caveat, it's possible that the improvements won't be very detectable due to the principle of diminishing returns.
 
The 3 VMX128 units exist to provide some level of vector processing for Xenon's cores. They are add-on units and must share existing resources with each core: 32KB of L1 and the 1MB L2 cache shared between the cores. Each core contains 2 VMX-128 register sets to support both threads on each core. What isn't widely advertised is that each core contains only 1 execution unit, so both threads have to share it.

The L1 sharing can be deactivated by using cache locking, a thread can then stream directly into it while another uses the L2. Similar idea to what the SPEs do but not quite the same.

As for output, the theoretical peak performance of a 3GHz Intel P4 using SSE instructions is 6GFLOPS.

I would have thought it'd be somewhat higher, at least twice that.

This provides a ballpark figure for Xenon's VMX units, considering I was unable to uncover exact figures.

I don't think it's terribly safe to take numbers from a completely different processor from a completely different manufacturer!

The peak number for the XCPU per core is 25.6 GFLOPS - the same as that for the Cell's PPE or even the SPEs. The peak for the entire processor is 76.8 GFLOPS.
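For reference, the 25.6 and 76.8 figures can be reproduced with simple arithmetic. The sketch below assumes a 3.2GHz clock, 4 single-precision lanes per 128-bit vector, and a multiply-add counted as 2 floating point operations per lane per cycle; those inputs are assumptions added here, not numbers given in this post, and `peak_gflops` is just an illustrative name.

```python
def peak_gflops(clock_ghz, simd_lanes, flops_per_lane_per_cycle):
    """Theoretical peak: every lane retires a multiply-add every cycle."""
    return clock_ghz * simd_lanes * flops_per_lane_per_cycle


# Assumed inputs: 3.2 GHz, 4 x 32-bit lanes, madd = 2 flops.
per_core = peak_gflops(clock_ghz=3.2, simd_lanes=4, flops_per_lane_per_cycle=2)
xcpu_total = 3 * per_core  # three cores

print(f"per core: {per_core:.1f} GFLOPS, whole XCPU: {xcpu_total:.1f} GFLOPS")
```

These are peak numbers only; they say nothing about what the units sustain on real code.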

Now there is no question that the 3 VMX-128 units outdo the VMX unit on Cell's PPU, which has only 32 128-bit registers.

The PPE's VMX unit is actually pretty good. But yes anyone using Cell is more likely to use the SPEs for real work.

Each SPE on Cell is a dedicated high-speed vector processor; they are not add-on units and they share no resources. They each have 256K of LS available, bringing their combined total to 1.792MB (7 x 256K).

They do share some resources, e.g. the PPE's MMU is in sole control of the memory pages and thus has to be used when a page change is needed which an SPU hasn't cached. There are ways around this though.

Each SPE achieves around 25GFLOPS. Consider the fact that there are 6 SPEs, and it's no surprise MS ignores the SPEs when discussing the vector processing abilities of their 3 add-on VMX units. For vector based computations the PS3 outdoes the 360 by an order of magnitude.

The peak is nowhere near an order of magnitude higher; the actual figure (counting 7 SPEs and 1 PPE) is 2.6 times higher. Quite what either will do in practice is dependent on the developers using them, and of course the 7th SPE isn't used directly.
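The ratio is easy to check from the per-unit peak quoted above. A quick sketch, taking 25.6 GFLOPS per Xenon core, PPE or SPE as given in this thread; the helper name `peak_ratio` is made up for the example.

```python
PER_UNIT_GFLOPS = 25.6  # peak per Xenon core / PPE / SPE, as quoted in the thread


def peak_ratio(ps3_units, xenon_cores=3):
    """Ratio of theoretical peaks when every unit has the same per-unit peak."""
    return (ps3_units * PER_UNIT_GFLOPS) / (xenon_cores * PER_UNIT_GFLOPS)


print(round(peak_ratio(8), 2))  # 7 SPEs + 1 PPE
print(round(peak_ratio(7), 2))  # 6 SPEs + 1 PPE, leaving out the 7th SPE
```

Since every unit has the same peak, the ratio is just the unit count: 8/3 is about 2.67 (the "2.6 times" above), and a little lower if the 7th SPE is left out.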

Note: edited, last bit was mine.
 
OK, dumb question then and thanks BTW for breaking it down to something I can easily process.

What is the possibility of something like one PPU and 6 VMX128s designed to function independently a la SPEs?

VMXs are good, although not as good as SPEs, and are according to some articles great for 3D graphics acceleration and physics. What is to prevent IBM (or others) from making a quad-core Power chip with additional VMXs, say 8, 10, etc?
 
The problem then is a data problem, the problem SPEs were designed to get around. VMX units can process sums quickly if they have the data to work on. If they have to wait for the data, they sit there twiddling their thumbs. It's like pie eating. If you have a table with 1 pie eater and 2 pies an hour available, they could eat them all. If you have 12 pies an hour and 1 pie eater, there'll be a bottleneck with the pies piling up. If you have 6 pie eaters and 12 pies an hour, they all get eaten and you have an excellent pie consumption rate. But with 6 pie eaters and only 2 pies an hour, most of the time the pie eaters are sat there hungry.

It's this specific issue that SPEs were designed for: while you can cram more and more execution units into a CPU, providing them with data to work on is hard. Advances in memory are way slower than advances in CPU manufacturing. The inclusion of LS on SPEs means that, with management, they can maintain a far faster supply of data than just a bunch of execution units squeezed onto a CPU.
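The pie analogy boils down to throughput being the smaller of supply and capacity. Here is a minimal sketch of the four cases, reading every supply figure as pies per hour and assuming each eater can manage 2 pies an hour (an assumption chosen to match the first and third cases; the names are made up for the example).

```python
def pie_table(eaters, pies_supplied_per_hour, pies_per_eater_per_hour=2):
    """Return (pies eaten per hour, fraction of eater capacity used, pies piling up per hour)."""
    capacity = eaters * pies_per_eater_per_hour
    eaten = min(pies_supplied_per_hour, capacity)   # limited by supply or by eaters
    utilization = eaten / capacity
    backlog = pies_supplied_per_hour - eaten
    return eaten, utilization, backlog


# The four cases from the analogy: (eaters, pies supplied per hour).
for eaters, supply in [(1, 2), (1, 12), (6, 12), (6, 2)]:
    print(eaters, supply, pie_table(eaters, supply))
```

Swap eaters for execution units and pies for data: adding units only helps while the memory system can keep them fed.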
 
Courtesy of Cell Architecture Explained & Ebony’s breakdown of PS3 architecture.

I can tell that this is more or less in the context of a refutation to the Major Nelson article; please please try to keep that thing out of these discussions - if and when it should come up naturally, it will be suitably straightened out. The problem is that the refutation in this case is really almost as simplistic as the original 'problem' article to begin with.

I'm going to slow you down for a minute BadTB, and ask you rather, why the high interest in these VMX units? I think you might be perceiving them to be something more than they are.
 
Curiosity really, and not just in the VMX specifically. You are right though, due to my limited knowledge, I was perceiving them to be something more than they are.

I am also very interested in other parts of the X360 architecture such as Memexport, the EDRAM implementation and Xenos, which have had comparatively less discussion on these boards. I hope I am contributing in some way by stirring the pot and asking questions (although simplistic) to get the community talking. The VMX128 just happens to be something that I can see that MS and IBM put extra work into. Besides the extra registers, what else did they design into it?

As I think I've said before, I've been a long-time fan of this forum, and so have been a frequent visitor. I just haven't seen that much discussion on the architecture of the X360. Dave's article, while exceptional, leaves me desiring more. One of the things I admire about B3D is that there tends to be less PR and more frank debate minus most of the fanboy agendas. I myself tend to prefer the X360 (for now), but have had most of the consoles dating back to the Odyssey (didn't have an Atari, TG16 or PS2).

That said, I am also interested in the PS3 architecture, as I will be picking this up in the future (when the price is right for me) and read the discussion on it as much if not more.

My tech knowledge is limited to working with XBMC for my modified Xbox, but I am always trying to pick up more as I go along.

As someone else pointed out, there is more info on Cell due to implementation on other devices.
 
I'm going to slow you down for a minute BadTB, and ask you rather, why the high interest in these VMX units? I think you might be perceiving them to be something more than they are.

Why not? There have been many interesting topics about how Cell's SPUs are being used for many things that people initially thought weren't practical. It would be nice to hear if the XCPU's VMX units are being (or can be) used for similar types of things, and to what degree they can use similar code/algorithms as the SPUs. At least, this is how I interpreted BadTB25's question.

There really hasn't been a lot of useful talk about the XCPU, so I commend BadTB25 for trying to initiate some. Especially since MS/IBM obviously felt that a butt load of floating point power was necessary for these consoles. How is it (or can it be) used?
 
I am also very interested in other parts of the X360 architecture such as Memexport, the EDRAM implementation and Xenos, which have had comparatively less discussion on these boards.
The primary reason being there's nothing to discuss! There's no info out there. Devs aren't talking about the hardware, in contrast to PS3 devs who give us things to chew on.

Also all we really have on PS3 is Cell talk. There's nothing about RSX or the direct communication bus AFAIK. A lot of that hardware is in the dark too, on specifics. And Cell is well known because it's open hardware being used all over the place, so we don't just have PS3 devs to feed us tidbits.

It's just a sorry state of affairs that limits info that would feed people's desire for knowledge. NDAs make discussing console hardware a 'black art'! :D
 
I think somewhere on here Fran and Joker have both mentioned VMX units and their distinct usefulness in 360 games but they are buried deep in threads. :smile:
 
Why not? There have been many interesting topics about how Cell's SPUs are being used for many things that people initially thought weren't practical. It would be nice to hear if the XCPU's VMX units are being (or can be) used for similar types of things, and to what degree they can use similar code/algorithms as the SPUs. At least, this is how I interpreted BadTB25's question.

There really hasn't been a lot of useful talk about the XCPU, so I commend BadTB25 for trying to initiate some. Especially since MS/IBM obviously felt that a butt load of floating point power was necessary for these consoles. How is it (or can it be) used?

You're misinterpreting me, I think the VMX discussion has a lot of interesting fruit to bear - I always read these things myself. It just depends on the interpretation of the OP's question I guess.

@BadTB: I'd love for there to be more tiling/rendering/eDRAM and MemExport discussions as well.
 