Xenon VMX units - what have we learned?

Discussion in 'Console Technology' started by BadTB25, Aug 11, 2007.

  1. Asher

    Regular

    Joined:
    Jul 1, 2005
    Messages:
    976
    Likes Received:
    10
    Location:
    Seattle, WA
    No, it is not a full-out processor. That is not necessarily a disadvantage, and thanks to the 'magic' of SMT it does not necessarily "block" the CPU either.
     
  2. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,762
    Likes Received:
    2,639
    Location:
    Maastricht, The Netherlands
    Yes. That is because IBM put the whole documentation of the Cell in the public domain. :) That certainly helps!

    Actually, I was looking for more information on the SPUs just now, and found that they have an additional set of 128 x 128-bit registers, called special purpose registers. Here's IBM's full documentation of the SPUs specifically.

    http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/76CA6C7304210F3987257060006F2C44/$file/SPU_ISA_v1.2_27Jan2007_pub.pdf
     
  3. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    You've lost me here I'm afraid, I never made that argument. What exactly are you arguing?

    No I don't. There are *several* factors which limit clock speed.

    It's very closely related to power draw.

    You can do OOO in 2 broad ways - the Pentium 4 method or the G4 method.
    The Pentium 4 method pairs an OOO processor with a high clock rate; to reach that clock, each pipeline stage does relatively little work. This makes the CPU inefficient, just as the Pentium 4 was. The P4 architecture ran into serious power issues and was eventually cancelled.

    The other method is what Motorola did with the G4: it is an OOO processor, but it does more per pipeline stage. This limits the clock speed but means the processor uses less power. Intel and AMD both use this method.

    In the first case power limited the frequency; in the second case the longer stages limit the clock speed, but the highest clock achievable is also ultimately limited by power.

    The OOO hardware is complex and needs to be fast, so it is one of the hottest parts of the processor, if not the hottest. In-order designs don't have this, so at the same power limit they'll be running faster. POWER5+ was a complex OOO machine which ran at up to 2.2GHz; POWER6 is in-order, has 3X more transistors and runs at twice that clock rate.

    I don't see why you seem to object to this. Every feature added to a processor has some impact or another; whether it is worth adding depends on the workload.
     
  4. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    VMX and the SPU ISA are very similar though (not surprising given the SPU ISA was based on VMX). I ported some code and the work mostly consisted of adding the framework code to get it working and changing the memory I/O. The actual processing code was almost identical.

    The dot product will make a difference as will any other speciality instructions.
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    dot product instructions are evil, real men don't use them, they use SOA + madds ;)
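    A minimal sketch of what "SOA + madds" means here, written in plain Python rather than real VMX or SPU intrinsics. The helper names (madd, dot4_soa, dot_aos) and the sample data are made up for illustration. Each list stands in for one 4-wide SIMD register, and the point is that with structure-of-arrays data four dot products fall out of one multiply and two multiply-adds, with no horizontal add across lanes.

```python
# Structure-of-arrays (SoA) dot products, modelled with plain lists.
# Each 4-element list plays the role of one 4-wide SIMD register.

def madd(a, b, c):
    """Lane-wise multiply-add (a*b + c), standing in for one SIMD madd."""
    return [x * y + z for x, y, z in zip(a, b, c)]

def dot4_soa(ax, ay, az, bx, by, bz):
    """Four 3D dot products at once: one multiply, then two madds."""
    acc = [x * y for x, y in zip(ax, bx)]  # lane-wise multiply of the x components
    acc = madd(ay, by, acc)                # accumulate the y components
    acc = madd(az, bz, acc)                # accumulate the z components
    return acc

def dot_aos(a, b):
    """Reference: one dot product on an array-of-structures (x, y, z) vector."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

# Hypothetical sample data: four vector pairs in AoS form.
a_vecs = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0), (1.0, 0.0, -1.0)]
b_vecs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]

# Transpose AoS -> SoA: one list per component.
ax, ay, az = (list(c) for c in zip(*a_vecs))
bx, by, bz = (list(c) for c in zip(*b_vecs))

soa_result = dot4_soa(ax, ay, az, bx, by, bz)
aos_result = [dot_aos(a, b) for a, b in zip(a_vecs, b_vecs)]
print(soa_result, aos_result)
```

    The cost is the transpose: data has to be laid out (or shuffled) into per-component arrays first, which is why this fits loops over many vectors better than one-off calculations.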
     
  6. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    10x more evil when they come with a latency that makes their advantage moot in cases where it's supposed to be most important (non loopy code).
    There's nothing more evil than hw features that have more PR than practical value.
     
  7. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    I thought char KenIsGreat[4*1024*1024]; inserted by the SDK tools just to reserve some RAM for the OS was far more evil :p.
     
  8. flec04

    Newcomer

    Joined:
    Aug 26, 2006
    Messages:
    17
    Likes Received:
    0
    The 3x VMX128 units exist to provide some level of vector processing for Xenon’s cores. They are add-on units & must share existing resources with each core: the 32KB L1 & the 1MB L2 cache shared between the cores. Each core has 2x VMX-128 register sets to support both of its threads. What isn’t widely advertised is that each core contains only 1 execution unit, so both threads have to share it.

    As for output, the theoretical peak performance of a 3GHz Intel P4 using SSE instructions is 6 GFLOPS. That gives a ballpark figure for Xenon’s VMX units, since I was unable to uncover exact figures.

    Now there is no question that the 3x VMX-128 units outdo the VMX unit on the Cell’s PPU, which only has 32 128-bit registers. But I think you’ll find that any serious SIMD processing will be carried out on the SPEs. MS deliberately ignores the SPEs & only compares its 3x VMX-128 units to the underdone VMX unit on the PPE.

    Each SPE on Cell is a dedicated high-speed vector processor; they are not add-on units & they share no resources. They each have 256KB of LS available, bringing their combined total to 1,792KB (7x 256KB).

    Each SPE achieves around 25 GFLOPS. Consider the fact that there are 6x SPEs & it’s no surprise MS ignores the SPEs when discussing the vector processing abilities of its 3x add-on VMX units. For vector based computations the PS3 outdoes the 360 by an order of magnitude.

    Courtesy of Cell Architecture Explained & Ebony’s breakdown of PS3 architecture.
     
  9. ebola

    Newcomer

    Joined:
    Dec 13, 2006
    Messages:
    99
    Likes Received:
    0
    I'd definitely agree with that for cross-platform code :)

    DP can simplify optimizing regular PC-type code though, so it does look like a good design decision on Microsoft's part? It does have latency advantages compared to madds, and it can also be used to implement some non-DP calculations with lower latency.
    I suppose how evil it is depends on how much die space it takes (versus beefing up the SOA approach), of which I have no idea.
     
    #29 ebola, Aug 13, 2007
    Last edited by a moderator: Aug 13, 2007
  10. BadTB25

    Veteran

    Joined:
    Aug 11, 2007
    Messages:
    2,371
    Likes Received:
    645
    Location:
    Florida
    Thank you very much for that link; I am reading through it now. As I've said, most of it is over my head, but interesting nonetheless.

    So from what flec04 said, the difference between the VMX units of the 2 consoles shouldn't be an issue when porting from X360 to PS3, because of the SPEs. I would assume the reverse would not be true, considering the PS3 has the VMX in the PPU and the 6 SPEs to utilize vs the 3 VMX128 units of the X360.

    "For vector based computations the PS3 outdoes the 360 by an order of magnitude"

    What would be the primary benefit of this? Better physics, particle effects, etc?

    Also, I did some googling and have learned a little more about the nature of the VMX128 of the X360. It seems that a few execution instructions were removed to make room for the additional registers. Any idea what was removed? Anything significant?

    nAo, Fafalada and others still make my head spin, but I'm trying to keep up.
     
  11. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    Better everything, in theory! All computing is shifting data around in the form of numbers and doing sums on it. The more numbers you can crunch, the more stuff you can work out. There's a complication in the movement of data too though, and if your algorithm uses lots of data that can't be crunched efficiently, the ability to do lots of maths is no good. However, as understanding improves, more and more functions are being mapped onto fast vector processors, such that eventually pretty much all areas should benefit.

    Thus every facet of games could see a benefit. Though as a caveat, it's possible that the improvements aren't very detectable due to the principle of diminishing returns.
     
  12. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
    The L1 sharing can be deactivated by using cache locking; a thread can then stream directly into it while another uses the L2. It's a similar idea to what the SPEs do, but not quite the same.

    I would have thought it'd be somewhat higher, at least twice that.

    I don't think it's terribly safe to take numbers from a completely different processor from a completely different manufacturer!

    The peak number for the XCPU per core is 25.6 GFLOPS - the same as that for the Cell's PPE or even the SPEs. The peak for the entire processor is 76.8 GFLOPS.

    [QUOTE]Now there is no question that the 3x VMX-128 units outdo the VMX unit on the Cell's PPU, which only has 32 128-bit registers.[/QUOTE]

    The PPE's VMX unit is actually pretty good. But yes, anyone using Cell is more likely to use the SPEs for real work.

    They do share some resources - e.g. the PPE's MMU is in sole control of the memory pages and thus has to be used when a page change is needed which an SPU hasn't cached. There are ways around this though.

    The peak is nowhere near an order of magnitude higher; the actual figure (counting 7 SPEs and 1 PPE) is about 2.7 times higher. Quite what either will do in practice is dependent on the developers using them, and of course the 7th SPE isn't used directly.
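    As a rough check of the peak figures and the ratio quoted in this post, here is a small sketch. It assumes a 3.2GHz clock and one 4-wide single-precision multiply-add issued per cycle per vector unit (so 8 flops per cycle). Those two inputs are assumptions, not something stated in the thread, chosen because they reproduce the 25.6 GFLOPS per-unit number. The function name peak_gflops is made up for illustration.

```python
# Back-of-envelope peak single-precision throughput.
# Assumptions (not from the thread): 3.2 GHz clock, one 4-wide
# multiply-add per cycle per vector unit, counted as 2 flops per lane.
CLOCK_GHZ = 3.2
LANES = 4
FLOPS_PER_MADD = 2

def peak_gflops(units):
    """Theoretical peak GFLOPS for a given number of vector units."""
    return units * CLOCK_GHZ * LANES * FLOPS_PER_MADD

per_unit = peak_gflops(1)   # one XCPU core, the PPE, or one SPE
xenon = peak_gflops(3)      # 3 XCPU cores
cell = peak_gflops(8)       # 7 SPEs + 1 PPE, as counted in the post
ratio = cell / xenon        # reduces to 8 / 3, independent of the clock

print(f"per unit: {per_unit:.1f} GFLOPS")
print(f"Xenon:    {xenon:.1f} GFLOPS")
print(f"Cell:     {cell:.1f} GFLOPS")
print(f"ratio:    {ratio:.2f}x")
```

    Since both chips get the same per-unit number under these assumptions, the ratio is just the unit count, 8 to 3. These are paper peaks only; they say nothing about how well either chip can be kept fed with data.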

    Note: edited, last bit was mine.
     
    #32 ADEX, Aug 13, 2007
    Last edited by a moderator: Aug 13, 2007
  13. BadTB25

    Veteran

    Joined:
    Aug 11, 2007
    Messages:
    2,371
    Likes Received:
    645
    Location:
    Florida
    OK, dumb question then and thanks BTW for breaking it down to something I can easily process.

    What is the possibility of something like one PPU and 6 VMX128s designed to function independently à la SPEs?

    VMXs are good, although not as good as SPEs, and are according to some articles great for 3D graphics acceleration and physics. What is to prevent IBM (or others) from making a quad-core Power chip with additional VMXs, say 8, 10, etc?
     
  14. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    The problem then is a data problem - the problem SPEs were designed to get around. VMX units can process sums quickly, if they have the data to work on. If they have to wait for the data, they sit there doing nothing, twiddling their thumbs. Like pie eating. If you have a table with 1 pie eater and 2 pies an hour available, they could eat them all. If you have 12 pies an hour and 1 pie eater, there'll be a bottleneck with the pies piling up. If you have 6 pie eaters and 12 pies an hour, they all get eaten and you have an excellent pie consumption rate. But with 6 pie eaters and only 2 pies an hour, most of the time the pie eaters are sat there hungry.

    It's this specific issue that SPEs were designed for: while you can cram more and more execution units into a CPU, providing them with data to work on is hard. Advances in memory are way slower than advances in CPU manufacturing. The inclusion of LS on SPEs means, with management, they can maintain a far faster supply of data than just a bunch of execution units squeezed onto a CPU.
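    The pie analogy can be put as a tiny throughput model. This is only a sketch: the function name pie_table is made up, and the appetite of 2 pies per hour per eater is inferred from the examples in the analogy rather than stated outright. Eaters stand in for execution units and the pie supply stands in for memory bandwidth.

```python
def pie_table(eaters, pies_per_hour, appetite=2):
    """Return (eaten per hour, eater utilization, pies piling up per hour).

    appetite is the assumed number of pies one eater can finish per hour.
    """
    capacity = eaters * appetite            # what the eaters could consume
    eaten = min(capacity, pies_per_hour)    # limited by whichever is smaller
    utilization = eaten / capacity          # fraction of time eaters are busy
    backlog = pies_per_hour - eaten         # uneaten pies accumulating
    return eaten, utilization, backlog

for eaters, supply in [(1, 2), (1, 12), (6, 12), (6, 2)]:
    eaten, util, backlog = pie_table(eaters, supply)
    print(f"{eaters} eater(s), {supply} pies/h: "
          f"eaten={eaten}, busy={util:.0%}, piling up={backlog}")
```

    The last case is the one that matters for adding more vector units: six eaters on a two-pie supply are busy only a sixth of the time, so extra execution units buy nothing unless the data supply grows with them.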
     
  15. Carl B

    Carl B Friends call me xbd
    Legend

    Joined:
    Feb 20, 2005
    Messages:
    6,266
    Likes Received:
    63
    I can tell that this is more or less in the context of a refutation to the Major Nelson article; please please try to keep that thing out of these discussions - if and when it should come up naturally, it will be suitably straightened out. The problem is that the refutation in this case is really almost as simplistic as the original 'problem' article to begin with.

    I'm going to slow you down for a minute BadTB, and ask you rather, why the high interest in these VMX units? I think you might be perceiving them to be something more than they are.
     
  16. BadTB25

    Veteran

    Joined:
    Aug 11, 2007
    Messages:
    2,371
    Likes Received:
    645
    Location:
    Florida
    Curiosity really, and not just in the VMX specifically. You are right though, due to my limited knowledge, I was perceiving them to be something more than they are.

    I am also very interested in other parts of the X360 architecture such as MemExport, the EDRAM implementation and Xenos, which have had comparatively less discussion on these boards. I hope I am contributing in some way by stirring the pot and asking questions (although simplistic) to get the community talking. The VMX128 just happens to be something that I can see that MS and IBM put extra work into. Besides the extra registers, what else did they design into it?

    As I think I've said before, I've been a long-time fan of this forum, and so have been a frequent visitor. I just haven't seen that much discussion on the architecture of the X360. Dave's article, while exceptional, leaves me desiring more. One of the things I admire about B3D is that there tends to be less PR and more frank debate minus most of the fanboy agendas. I myself tend to prefer the X360 (for now), but have had most of the consoles dating back to the Odyssey (didn't have an Atari, TG16 or PS2).

    That said, I am also interested in the PS3 architecture, as I will be picking this up in the future (when the price is right for me) and read the discussion on it as much if not more.

    My tech knowledge is limited to working with XBMC for my modified Xbox, but I am always trying to pick up more as I go along.

    As someone else pointed out, there is more info on Cell due to implementation on other devices.
     
  17. NRP

    NRP
    Veteran

    Joined:
    Aug 26, 2004
    Messages:
    2,712
    Likes Received:
    293
    Why not? There have been many interesting topics about how Cell's SPUs are being used for many things that people initially thought weren't practical. It would be nice to hear if the XCPU's VMX units are being (or can be) used for similar types of things, and to what degree they can use similar code/algorithms as the SPUs. At least, this is how I interpreted BadTB25's question.

    There really hasn't been a lot of useful talk about the XCPU, so I commend BadTB25 for trying to initiate some. Especially since MS/IBM obviously felt that a butt load of floating point power was necessary for these consoles. How is it (or can it be) used?
     
  18. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,106
    Likes Received:
    16,898
    Location:
    Under my bridge
    The primary reason being there's nothing to discuss! There's no info out there. Devs aren't talking about the hardware, in contrast to PS3 devs who give us things to chew on.

    Also all we really have on PS3 is Cell talk. There's nothing about RSX or the direct communication bus AFAIK. A lot of that hardware is in the dark too, on specifics. And Cell is well known because it's open hardware being used all over the place, so we don't just have PS3 devs to feed us tidbits.

    It's just a sorry state of affairs that limits info that would feed people's desire for knowledge. NDAs make discussing console hardware a 'black art'! :D
     
  19. Tap In

    Legend

    Joined:
    Jun 5, 2005
    Messages:
    6,382
    Likes Received:
    65
    Location:
    Gravity Always Wins
    I think somewhere on here Fran and Joker have both mentioned VMX units and their distinct usefulness in 360 games but they are buried deep in threads. :)
     
  20. Carl B

    Carl B Friends call me xbd
    Legend

    Joined:
    Feb 20, 2005
    Messages:
    6,266
    Likes Received:
    63
    You're misinterpreting me, I think the VMX discussion has a lot of interesting fruit to bear - I always read these things myself. It just depends on the interpretation of the OP's question I guess.

    @BadTB: I'd love for there to be more tiling/rendering/eDRAM and MemExport discussions as well.
     
    #40 Carl B, Aug 13, 2007
    Last edited by a moderator: Aug 13, 2007