Interview with Corrinne Yu, Principal Engine Architect at Halo Team

She's partly right, although it was also often the case that we weren't allowed to push it, to make sure the games ran acceptably on other platforms. That's 50% of the reason I abandoned 360 development a long time ago: not because I didn't want to push the 360, but because we were required to treat it as a PortBox 360 rather than an Xbox 360, so our hands were tied. Games shipped with so much idle GPU time even in 2009 that it was kinda disgusting, really. Dunno if that's changed in 2010.

VMX means just rewriting code to use the 360 CPU's vector units instead of its craptacular PPUs. You can kinda think of the VMX units as mini CPUs that operate on 4 things at a time really well, that can avoid stall penalties like load-hit-store if you keep everything in the VMX registers, and that hide latency nicely if you hand-write everything using VMX intrinsics. I imagine most devs are using VMX; it's hard to imagine living on the 3 PPUs alone at this point.
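For readers who haven't seen intrinsics code, here is a minimal sketch of the 4-wide model being described. VMX itself is PowerPC-only, so this uses SSE intrinsics as a stand-in; the real VMX version would use `vec_ld` / `vec_add` / `vec_st` from `<altivec.h>`:

```c
#include <xmmintrin.h>   /* SSE intrinsics, used here as a stand-in for VMX */

/* Four floats processed per instruction, the same 4-wide model as a
 * VMX register. The data lives in a 128-bit vector register for the
 * whole operation and only touches memory on load and store. */
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);      /* load 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four adds */
    _mm_storeu_ps(out, vc);           /* store 4 results */
}
```

The function names here are illustrative, but the shape is the point: one instruction does the work of four scalar ones, which is where the claimed speedup comes from.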




Yeah, you can break up CPU tasks on the 360 the same way you do on SPUs. On PS3 you had no choice but to granulate your code into many small jobs sooner rather than later, because it simply wouldn't run well on the single PPU. On the 360, the 3 PPUs allowed the old crutch of just splitting code into three main threads, maybe main render, AI and physics. That's easy to get up and running and debug, but it ultimately doesn't max out the CPU. That old model has no choice but to die a painful death, especially if next gen has many more cores. Whoever has code that works as she says, broken up into much smaller parts, will be much better prepared for next gen. In that respect the single PPU + 6 SPU model of the PS3 really forced people to catch up and break away from the 2- or 3-thread model, which long term is a good thing.

When you say that the VMX units operate on 4 things at a time, do you mean that the VMX units have a vector length of 4?
This leads to a potential factor-4 speedup, right?
But how are these VMX units clocked? It is sometimes the case that vector units are clocked rather low (for instance the NEC SX-8 is only 800 MHz, iirc), so you don't get the full speedup compared to a single standard CPU with a high clock rate.
 
They're clocked at the CPU speed, 3.2GHz, and are 4 wide (4 32-bit values), providing very good floating-point performance if used effectively.
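For a rough sense of scale (my arithmetic, not a figure from the thread), VMX has a fused multiply-add (vmaddfp); counting that as two flops per lane gives a theoretical peak per core of

```latex
3.2\,\mathrm{GHz} \times 4\ \mathrm{lanes} \times 2\ \tfrac{\mathrm{flops}}{\mathrm{madd}} = 25.6\ \mathrm{GFLOP/s}
```

or about 76.8 GFLOP/s across the three cores, in the same ballpark as figures commonly quoted for Xenon. Getting anywhere near that peak requires the kind of hand-tuned VMX code discussed below.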
 
They're clocked at the CPU speed, 3.2GHz, and are 4 wide (4 32-bit values), providing very good floating-point performance if used effectively.

Ah, 3.2GHz, this is good, so you get a theoretical speedup of 4 when using these units (4 operations at a time, minus the setup of the vector pipeline, which is typically negligible)!
But I don't understand yet how you can use them or not:
you have 3 PPUs, with 2 threads each. Can you set up some additional threads via the VMX units? Or are there compiler directives where you can mark loops with "you shall get vectorized when I compile you", thus using the VMX units for that part of the code, while the rest of the code gets executed via a PPU thread?
 
Ah, 3.2GHz, this is good, so you get a theoretical speedup of 4 when using these units (4 operations at a time, minus the setup of the vector pipeline, which is typically negligible)!
But I don't understand yet how you can use them or not:
you have 3 PPUs, with 2 threads each. Can you set up some additional threads via the VMX units? Or are there compiler directives where you can mark loops with "you shall get vectorized when I compile you", thus using the VMX units for that part of the code, while the rest of the code gets executed via a PPU thread?
They are not "mini CPUs" at all; VMX is just a SIMD extension to the PPC architecture. The instructions are part of the regular code running on that CPU core. There is no co-processing model here, no extra threads, no separate program counters.

Yes, as with all SIMD extensions, compilers typically don't achieve much even when they do offer auto-vectorization. Loop-vectorization hints can help, but good performance will require explicit assembly-style programming.
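The kind of loop hint being described looks roughly like this in C. I'm using OpenMP's `#pragma omp simd` as an illustrative hint (the 360 toolchain had its own pragmas and intrinsics); `restrict` promises the compiler there's no aliasing, which auto-vectorizers usually require:

```c
#include <stddef.h>

/* A loop the compiler may auto-vectorize into 4-wide code. Whether it
 * actually does is up to the compiler, which is why hand-written
 * intrinsics were the sure path on VMX. The pragma is ignored by
 * compilers built without OpenMP, so the code stays correct either way. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + k * src[i];
}
```

This is exactly the "you shall get vectorized" marking asked about above: the code stays ordinary scalar C, and the hint only changes what machine code comes out.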
 
They are not "mini CPUs" at all; VMX is just a SIMD extension to the PPC architecture. The instructions are part of the regular code running on that CPU core. There is no co-processing model here, no extra threads, no separate program counters.

True, but it's an easy way to think about it. You have to target them, otherwise they idle, and you have to hand-write the code to make the best use of them. I tend to think of them as mini CPUs, as they are fairly powerful when targeted, and the PPU is kinda hopeless on its own.
 
Not the same thing at the technical level, although they both address vector math.

The SPUs are more general (e.g., they can solve tree traversal problems much better than pure VMX), run independently (which is a good security measure, as evident in the GeoHot episode), and have Local Store and DMA to counter memory latency. Due to their simplicity, they run fast and efficiently, and STI could put more cores in the same space.

Any vector instructions bolted onto an existing CPU will be subject to the same run-time hits experienced by the CPU (e.g., memory accesses).
 
So, in the same vein, you'd say SSE is an extra mini-core too? I can't wrap my head around that.

I know it's not really a separate core. The difference is that Intel's main cores are actually good, whereas the PPU in the 360 on its own isn't. The performance difference between going VMX and not going VMX is extreme, hence why I treated them as cores that anything and everything needed to be shifted to.


Any vector instructions bolted onto an existing CPU will be subject to the same run-time hits experienced by the CPU (e.g., memory accesses).

The VMX units have their own registers, so if you hand-write everything you can hide memory and instruction latency. It's really key that when you write stuff for VMX, you stay in VMX to make the best use of those registers. That's part of the idea of treating them as mini cores. If you bounce back and forth between PPU and VMX then your performance will be terrible. Instead, treat its registers and instruction set as all you have and stay within those confines; then performance can actually be really good.
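The "stay in VMX" pattern can be sketched like this (again with SSE as a stand-in for VMX, since VMX intrinsics are PowerPC-only): the running total lives in a vector register for the whole loop, and scalar code touches the result exactly once at the end, rather than extracting lanes every iteration.

```c
#include <xmmintrin.h>
#include <stddef.h>

/* Horizontal sum of n floats, n assumed to be a multiple of 4.
 * The accumulator stays in a vector register across all iterations;
 * bouncing each partial result back to scalar registers per iteration
 * would round-trip through memory and stall. */
float sum4(const float *data, size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));  /* 4 adds at once */

    float lanes[4];
    _mm_storeu_ps(lanes, acc);          /* leave the vector unit once */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The single store at the end is the one sanctioned exit from the vector unit; everything before it stays "inside the confines" of the vector register file, which is the whole point being made above.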
 
A core has its own thread, so they can't be considered cores as such. However, they aren't used as part of the processor by default, at least not very well, so they can't be considered an integral part of the CPU as execution units either. You have to tell the CPU to give over one of its threads to driving the VMX unit (as I understand it, though I could be wrong). Hence it's like a mini core, but one you have to outsource to, with a CPU thread given up to enable the VMX units.
 
The VMX units have their own registers, so if you hand-write everything you can hide memory and instruction latency. It's really key that when you write stuff for VMX, you stay in VMX to make the best use of those registers.

Yes, but you're still subject to other run-time characteristics (caching, fetch; you need to hide larger latencies). On the SPU, the program lives in the Local Store itself. It is rather 'extreme', which makes the SPUs interesting and incredibly quick and efficient when targeted.
 
You have to tell the CPU to give over one of its threads to driving the VMX unit (as I understand it, though I could be wrong). Hence it's like a mini core, but one you have to outsource to, with a CPU thread given up to enable the VMX units.
Nope, no handing over either. It is integrated just like SSE. It's a little harder to move data between VMX and int registers, though, as there is no instruction to do that (unlike SSE's MOVD). Such data passing needs to go through memory (or the L1 cache in practice).
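The MOVD path mentioned here can be shown concretely with SSE2 intrinsics: a 32-bit integer moves directly between a general-purpose register and a vector register, with no memory round trip. The point of the comparison is that VMX has no equivalent, so the same transfer there has to be staged through memory.

```c
#include <emmintrin.h>   /* SSE2: _mm_cvtsi32_si128 compiles to MOVD */

/* Move an int into a vector register, do vector work on it, and move
 * it back out, all through register-to-register MOVD transfers. */
int roundtrip(int x) {
    __m128i v = _mm_cvtsi32_si128(x);        /* int reg -> vector reg */
    v = _mm_add_epi32(v, _mm_set1_epi32(1)); /* vector-side increment */
    return _mm_cvtsi128_si32(v);             /* vector reg -> int reg */
}
```

On VMX, the two `cvtsi` lines would instead become a store to a stack slot followed by a vector load (or vice versa), which is the latency cost being discussed.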
 
Does an open-source counterpart for Silverlight exist? I run Ubuntu 10.04 on my laptop and I can't watch the damned vid.
Otherwise I'll watch it tomorrow at work... :LOL: She is bliss to listen to, so enthusiastic and humble. She has to be a good teacher.
 
Does an open-source counterpart for Silverlight exist? I run Ubuntu 10.04 on my laptop and I can't watch the damned vid.
Otherwise I'll watch it tomorrow at work... :LOL: She is bliss to listen to, so enthusiastic and humble. She has to be a good teacher.

You can download it.

It's right below the description.
 
Does an open-source counterpart for Silverlight exist? I run Ubuntu 10.04 on my laptop and I can't watch the damned vid.
Otherwise I'll watch it tomorrow at work... :LOL: She is bliss to listen to, so enthusiastic and humble. She has to be a good teacher.

Yes, Moonlight, iirc... but it only supports up to Silverlight 3.0 as of yet. But 4.0 is brand new, so I doubt it'll be used much anyway. (I have to learn Silverlight at any rate at the moment... this is just bad... for GUI designing above all!!)
 
What does she mean by "data decomposition"?

I suspect it's the process of breaking datasets into smaller components that can be processed in parallel, without heavy synchronization requirements.

A quick google search reveals:
https://computing.llnl.gov/tutorials/parallel_comp/#Designing

"There are two basic ways to partition computational work among parallel tasks: domain decomposition and functional decomposition."
 