Xenon VMX units - what have we learned?

I guess the main question this little article brings up is, "Was the updated VMX hardware the major advancement that IBM brought to the table, and did IBM provide MS with almost all of its significant advancements?" This gets more interesting when you add the fact that Toshiba and IBM were bickering over the direction of the Cell, and Sony seemed to side more with Toshiba...
 
"Was the updated VMX hardware the major advancement that IBM brought to the table and did IBM provide MS with almost all of it's significant advancements?"

Brought to the table... for which team? Certainly for the XeCPU the VMX units are significant to the design, and they go above and beyond what's going on in the Cell's PPE. But the SPEs are a whole different ballgame, and although IBM (as a corporate entity) was pushing for a non-heterogeneous, all-POWER approach, the truth is a lot of the prominent engineers on the team were very much behind the idea presented by Sony/Toshiba of high-throughput 'simple' cores. And once the course was settled, IBM played a very significant role in developing even the ISA used by the SPEs, not to mention the work on the Element Interconnect Bus and the actual chip design.

Now I've pointed out in the past the extent to which I believe 'the histories' under-represent Sony and Toshiba's actual hand in Cell development, but certainly IBM was hugely prominent. And to the extent that Cell *is* the SPEs, that's not anything that flowed to MS in any form other than maybe... "these are the performance gulfs to said architecture, and our recommended means of addressing them (VMX expansion)."
 
Then the whole sabotage angle in the WSJ article is mostly sensationalism... Still might be interesting to know what kind of performance gulf there would've been without the VMX augmentation... or if the more general nature of the Xenos GPU would allow it (GPU) to perform some of the missing functions from the VMX... and perhaps tangentially how much physics or other data-parallel work the Xenos GPU could perform in order to keep pace with the SPE units in the Cell...
 
Well in theory the performance gulf between the two is still large, but in practice the Cell's difficulty of approach limits its real-world utilization. That is of course rather ironic given its actual efficiency and ability to approach its ops limits; but a lot of programmers simply have a hard time coding for it, and game code has a lot of moving parts (created by different people) to manage. The 360 definitely has a qualified GPU advantage over the RSX; it's not black and white, but it's versatile, and on top of that the unified RAM pool just makes things easier on the dev as well.
 
I guess either architecture is nearly as good as a developer could hope for; the Cell just has this mystique of challenge to it.
I'd love a next-gen console to have a dedicated Cell-like CPU (for physics), a traditional CPU with decent OOO, a fairly flexible GPU (with nifty custom geometry hardware), and something exotic like a hardware raytracer or neural network acceleration (not a true neural chip, but one that does a good job accelerating neural-network modelling code for things like speech or handwriting recognition).
 
Well you're talking about a lot of people's arbitrary dream console there - though I do want to emphasize 'arbitrary' :) - but no, beyond sounding like too much silicon, I think there would be too much redundancy as well, and we'll leave out the raytracing aspect entirely.

The thing with a presumed PS4 is that a logical extension of Cell, coupled to an extrapolated NVidia GPU from the year 2011, would lead to a lot of functional overlap. So it'll be interesting to see how custom NV is willing to make things - and custom could just as well mean 'dumb' - and how Sony wants to divide/emphasize its transistors. For the [720], MS is less bound by the ties to a former mega-project (Cell) or the ego of NVidia, so they might be able to find a nice efficient middle ground more easily.

http://forum.beyond3d.com/showthread.php?t=31379&page=57

This is the thread for this tangent of discussion though.
 
Right, just dreaming out loud. Redundancy point well taken (as well as the off-topic point), but the dream of specialized hardware for every possible software contingency never dies... Amiga forever!
 
found on N4G
The 6 parallel execution units is a myth. It has 3 cores, 2 HW threads per core, and 3 VMX units. If, and I say if, you can issue 2 parallel ops on one core, you could theoretically execute 6 threads. In the real world this never applies; it is simply not 6 independent execution units. However, people tell me you can get close to 4-5 threads running in parallel if well-optimized code is used (meaning the instruction pipeline can decode/execute 2 ops in parallel because they can be decoded independently). However, you've got 3 VMX units, and that's about it. No HW threading there. That means 3 vector units max, and you need to share these with the general-purpose cores. This is significantly less than 6 fully independent vector cores, each with local memory and a load/store bandwidth of 25 GB/s (plus a local store at the speed of local registers, and an EIB significantly faster than 25 GB/s).

So, yeah, while general-purpose code could possibly compensate for some of these things, nothing in the 360 is suited for the massive amount of deferred rendering KZ2 does on the PS3. All physics and AI in KZ2 run on the PPU.
Any comments?
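Worth a quick illustration of the SMT point in that quote. This is a sketch in plain C for x86/Linux, not Xenon code, and it assumes logical CPUs 0 and 1 are SMT siblings on one physical core (something you'd have to verify on your own box): two threads pinned to the same core share its SIMD execution unit, so their combined throughput doesn't come close to doubling, which is exactly the "6 threads but only 3 VMX units" situation described above.

```c
// Sketch: SMT threads share execution units, so SIMD throughput
// doesn't double. x86/Linux analogue of the Xenon situation;
// assumes logical CPUs 0 and 1 are siblings on one core.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <xmmintrin.h>

static void *simd_work(void *arg) {
    // Pin this thread to the requested logical CPU.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(*(int *)arg, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // Four independent accumulators keep the multiply pipe saturated
    // from a single thread, so a sibling thread gains little.
    __m128 a0 = _mm_set1_ps(1.0f), a1 = a0, a2 = a0, a3 = a0;
    __m128 m  = _mm_set1_ps(1.000001f);
    for (long i = 0; i < 100000000L; ++i) {
        a0 = _mm_mul_ps(a0, m);
        a1 = _mm_mul_ps(a1, m);
        a2 = _mm_mul_ps(a2, m);
        a3 = _mm_mul_ps(a3, m);
    }

    // Store the result so the compiler can't discard the loop.
    float sink[4];
    _mm_storeu_ps(sink, _mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3)));
    printf("cpu %d done (%f)\n", *(int *)arg, sink[0]);
    return NULL;
}

int main(void) {
    int cpus[2] = {0, 1};  // assumed SMT siblings; try {0, 2} for separate cores
    pthread_t t[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, simd_work, &cpus[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```

Time it once pinned to SMT siblings and once to two separate cores; the gap between those runs is the difference between hardware threads and independent execution units.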
 
Everything OK until the sharp non-sequitur turn towards deferred rendering and KZ2. wtf? Are they implying KZ2 does deferred on the SPUs?
 
Everything OK until the sharp non-sequitur turn towards deferred rendering and KZ2. wtf? Are they implying KZ2 does deferred on the SPUs?

But doesn't KZ2 use the SPUs extensively for deferred rendering, i.e. the lighting and particle effects etc.?
 
It's also wrong when it mentions there's no multi-threading on the VMX units. While it's true there are only 3 physical VMX units, it still has the duplicated register sets (6 sets of 128 128-bit registers), which can make quite an impact when juggling more than 3 VMX-happy contexts.
 
It's also wrong when it mentions there's no multi-threading on the VMX units. While it's true there are only 3 physical VMX units, it still has the duplicated register sets (6 sets of 128 128-bit registers), which can make quite an impact when juggling more than 3 VMX-happy contexts.

Maybe they are talking about "real" multithreading.
What the 360 VMX does is more like hyperthreading, because it (is rumored to) lack(s) the HW execution units to execute the commands in parallel (I think Crytek mentioned that some time ago in an interview as a difference between CELL/Xenon).

EDIT: Sorry did not read the quote first :) But it's actually saying the same thing :)

EDIT2: I think there is no duplicate register file after all, just "more" registers. Do you have any source for this claim?
 
Do you have any source for this claim?
[Attached image: figure3.gif]
"VRF 2x128x128b", VRF = Vector Register File

As Asher explained, this makes a lot of sense where context switching is concerned...
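For a sense of scale on that "VRF 2x128x128b", here's a back-of-envelope sketch (plain C, not SDK code): it works out to 2 KB of architectural state per hardware thread, which is exactly what you'd otherwise have to spill and reload when switching between VMX-heavy contexts.

```c
#include <stdio.h>
#include <stdint.h>

// Back-of-envelope: one hardware thread's VMX128 register state.
// 128 architectural vector registers x 128 bits (16 bytes) each.
typedef struct {
    uint8_t vr[128][16];
} vmx128_context;

int main(void) {
    // Without a duplicated register file, switching between two
    // VMX-heavy contexts means spilling and reloading this much
    // architectural state each time; duplication keeps both resident.
    printf("per-thread VRF: %zu bytes\n", sizeof(vmx128_context));           // 2048
    printf("per-core (2 threads): %zu bytes\n", 2 * sizeof(vmx128_context)); // 4096
    return 0;
}
```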
 
Maybe they are talking about "real" multithreading.
What the 360 VMX does is more like hyperthreading, because it (is rumored to) lack(s) the HW execution units to execute the commands in parallel (I think Crytek mentioned that some time ago in an interview as a difference between CELL/Xenon).

EDIT: Sorry did not read the quote first :) But it's actually saying the same thing :)

EDIT2: I think there is no duplicate register file after all, just "more" registers. Do you have any source for this claim?

Another reference is "Xbox 360 System Architecture", by Jeff Andrews and Nick Baker of Microsoft, IEEE Micro, March-April 2006, pp. 25-37.

From that source:
The CPU core has two-per-cycle, in-order instruction issuance. A separate vector/scalar issue queue (VIQ) decouples instruction issuance between integer and vector instructions for nondependent work. There are two symmetric multithreading (SMT), fine-grained hardware threads per core.

And:
Each core also includes the four-way SIMD VMX128 units: floating-point (FP), permute, and simple. As the name implies, the VMX128 includes 128 registers, of 128 bits each, per hardware thread to maximize throughput.

Each core of the CPU is "really" multithreaded, including the VMX128 units. It wouldn't make much sense to do it any other way.

I'm not sure what you think "real" multithreading is. If a single "core" had the execution units, and, more importantly, the instruction issue logic, to process two threads in parallel, then it wouldn't be one core, it would be two. :)
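To make the "nondependent work" part concrete, here's a minimal sketch (generic C, nothing Xenon-specific about it): an in-order, two-issue core can only pair instructions that don't depend on each other, so independent chains are what actually feed dual issue.

```c
// Dependent chain: every multiply needs the previous result, so an
// in-order dual-issue core can't pair consecutive iterations.
float product_dependent(const float *x, int n) {
    float acc = 1.0f;
    for (int i = 0; i < n; ++i)
        acc = acc * x[i];               // serialized on 'acc'
    return acc;
}

// Two independent chains: adjacent multiplies touch different
// accumulators, giving the issue logic nondependent work to pair
// (on Xenon, the separate VIQ likewise lets integer loop overhead
// issue alongside vector work). Note that the reassociation can
// change the floating-point result in the last bits.
float product_independent(const float *x, int n) {
    float a = 1.0f, b = 1.0f;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        a = a * x[i];
        b = b * x[i + 1];
    }
    if (i < n) a = a * x[i];            // odd tail element
    return a * b;
}
```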
 
But doesn't KZ2 use the SPUs extensively for deferred rendering, i.e. the lighting and particle effects etc.?

OT, but I'd guess not, because of the need to sample shadow maps when lighting.

As for VMX units, what we have learned is that not many make good use of them, nor of SSE for that matter on PCs... add in vector scatter/gather and things get much more interesting. The GPU SIMD (vector) model is actually easy to program; the CPU model (VMX, or even SPU) hasn't been. I'm going to have to agree with Carl that in terms of learning for next generation, with DX11 compute shaders more and more parallel work will be GPU-side, so perhaps CPU vector operations, with much more limited functionality (compared to GPU vector processing), become less important. A top-end GPU from NVidia or ATI paired with a low-end CPU (even with only a minor Cell update) might be an easy winner. Guess we aren't doing some things like audio GPU-side yet (as it's not latency friendly), but even that might change at some point.
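On the scatter/gather point, a quick sketch of why its absence hurts: with SSE of that era (and the same holds for VMX/VMX128, just with different intrinsics), an indexed load has to be assembled lane by lane from scalar reads.

```c
#include <xmmintrin.h>

// Gather a[idx[0..3]] into one SIMD register. With no hardware
// gather instruction, each lane is a separate scalar load and the
// vector is packed from them -- the awkwardness described above.
// (_mm_set_ps takes its arguments from the highest lane down.)
static __m128 gather4(const float *a, const int *idx) {
    return _mm_set_ps(a[idx[3]], a[idx[2]], a[idx[1]], a[idx[0]]);
}
```

A GPU expresses the same thing as an ordinary indexed read per lane, which is a big part of why the GPU SIMD model is easier to live with.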
 