Hard-Tuning the PS2 (great Euro GDC 2002 talk)

duffer

Newcomer
If you're at all interested in the technical details of the PS2, you've got to read this article: "Hard-Tuning the PS2" by a Sony programmer, which can be downloaded from this web page. (You may need to register, but it's free.)

http://www.gamasutra.com/features/index_gdc.htm

The paper gives tons of good info on the inner workings of the PS2 and on which parts games have a hard time using.

Two interesting points that struck me:

+ Apparently the PS2 CPU is difficult to use effectively: he says it's typical for a game to use only about 20% of the total CPU power. (This is 20% of just the MIPS core, not counting VU0 or VU1.)

+ Many game developers avoid using VU0 entirely, cutting themselves off from 30% of the total processing power of the machine.

I'm sure that, thanks to the availability of the Performance Analyzer and of papers like this, we'll see a significant improvement in the power of future PS2 games.

... while you're on that page, check out the "Cross-Platform Console Development" talk and the "What Sells and Where" talk. The other talks are also good, but those were the three I found most interesting.
 
I would hardly call this a paper or an article; it's just a short PowerPoint presentation, and I think most of the things mentioned are already well known among PS2 devs. I was a little bit shocked at seeing how incredibly crappy the compilers are at pairing instructions. Only getting 7% instruction pairing is really bad; hopefully the compilers will improve somewhat, they could hardly do worse than they're doing now.
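
To make the pairing point concrete: the EE core can issue two instructions per cycle, but only if there are two independent chains to draw from. Here's a minimal sketch of the kind of source-level restructuring that helps, assuming a simple float summation (note the split accumulator changes rounding slightly, since float addition isn't associative):

```cpp
// One accumulator = one long dependency chain: consecutive adds can
// never pair on a dual-issue in-order core.
float sum_serial(const float* v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += v[i];          // each add waits on the previous one
    return s;
}

// Two accumulators = two independent chains the compiler (or a hand
// scheduler) can interleave and dual-issue.
float sum_paired(const float* v, int n)
{
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += v[i];         // chain 0
        s1 += v[i + 1];     // chain 1, independent of chain 0
    }
    if (n & 1) s0 += v[n - 1];  // odd-length tail
    return s0 + s1;
}
```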
 
bloody compilers...

I think pairing pales into insignificance compared to the cache-miss penalties; L2 caches are a huge savior of general code.
 
Re: bloody compilers...

Crazyace said:
I think pairing pales into insignificance compared to the cache-miss penalties; L2 caches are a huge savior of general code.

Yeah, that's true, but I was expecting that; I wasn't expecting the compilers to be so bad.

And you can do something about cache misses (by streaming the data you need into the scratchpad RAM, and making sure all loops fit into the instruction cache); there isn't much you can do about crappy compilers, short of writing one yourself (if you actually know how, and have the time to do it) or writing all the critical parts yourself in assembly code.
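
A rough sketch of the double-buffered scratchpad streaming idea; dma_start_to_spr() and dma_wait() are hypothetical placeholders for whatever SDK calls kick and sync a main-RAM-to-scratchpad transfer, not real library functions:

```cpp
// Hypothetical DMA helpers -- placeholders, not real SDK calls.
void dma_start_to_spr(float* spr_dst, const float* main_src, int count);
void dma_wait(int buf);

const int CHUNK = 512;  // elements per scratchpad buffer

// Double-buffered streaming: while we process buffer 'buf', the DMA
// unit fills buffer 'buf ^ 1', so the compute loop only touches hot
// scratchpad data and never stalls on a main-RAM cache miss.
void process_stream(const float* src, float* dst, int n)
{
    static float spr[2][CHUNK];              // stand-in for the 16 KB scratchpad

    int buf = 0;
    dma_start_to_spr(spr[buf], src, CHUNK);  // prefetch the first chunk
    for (int base = 0; base < n; base += CHUNK) {
        dma_wait(buf);                       // current chunk is now resident
        if (base + CHUNK < n)                // kick the next transfer early so
            dma_start_to_spr(spr[buf ^ 1],   // it overlaps the work below
                             src + base + CHUNK, CHUNK);
        for (int i = 0; i < CHUNK && base + i < n; ++i)
            dst[base + i] = spr[buf][i] * 2.0f;  // example work on hot data
        buf ^= 1;                            // swap buffers
    }
}
```

(The sketch assumes the source buffer is padded so the DMA can always read a full chunk.)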
 
Oh, compilers actually tend to be even worse than that. For instance, at least up until the early 2.95 revisions, the compiler appeared to never use the muladd/mulsub etc. instructions, even though they pretty much double the throughput of the core FPU unit (not talking about the VUs here).
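
For illustration, this is the a*b + c shape that maps onto a single madd.s: a dot product written this way is one multiply-add per iteration instead of a separate mul.s and add.s, if the compiler cooperates (a minimal sketch; as noted above, whether a given toolchain actually emits madd.s here is exactly the problem):

```cpp
// The multiply-accumulate pattern the EE's FPU can fold into madd.s.
float dot(const float* a, const float* b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];   // ideally one madd.s per iteration,
                              // not a mul.s followed by an add.s
    return acc;
}
```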

But instruction pairing is pretty much irrelevant speed-wise when compared to what cache misses can cost you, as Ace pointed out, especially in the hybrid-UMA system that the PS2 is, where you always have heavy background bus activity, which amplifies the problem.
And no, it's not particularly easy to do something about it in the more high-level parts of the code either (which suffer from said issue the most). I'd rather argue it's often next to impossible to do anything effective about it... not to mention it can quickly involve really messy rearchitecting of the code, combined with the fact that you have no guarantee it will be a worthwhile speedup.

If we're comparing what is easier to work around, IMO it's actually more feasible to code the critical parts in asm if the compiler's optimizer has problems with them.
 
Interesting, thanks Duffer and others.

I also find the widespread lack of support for VU0 to be odd. Faf, Crazy, Archie, can you discuss personal experiences?

Does it just come down to developers' lack of time and the difficulty, tying in with the compiler problems?
 
Try the playstation2-linux pages...

If you are really interested in the guts of the PS2 you can try the playstation2-linux.com site and the IRC channel. It's slightly more open than the hacker pages... ( Part of the linux kit is a complete set of hardware manuals for the EE and GS, the same as supplied to pro developers... )
 
The VU0 isn't very useful. It doesn't have a way to get data to the GS like the VU1 does, and it's also crippled by the fact that it's only got 4k of data memory (as opposed to the VU1's 16k). To use the VU0 in micromode you gotta get the data to the VU0 and then off the VU0 when you're done. We use it for things like matrix multiplies, sin/cos approximations and some volume intersection code (our use barely registers on the PA).
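
To give a feel for the kind of routine that fits there, here's the sin/cos-approximation flavor in plain C++ rather than VU code (a minimal sketch: the coefficients are just the Taylor-series ones, the input is assumed range-reduced to roughly [-pi/2, pi/2], and a real version would use minimax coefficients):

```cpp
// Short polynomial sine approximation -- small enough, data and all,
// to live comfortably in VU0's 4k of data memory.
// sin(x) ~ x - x^3/3! + x^5/5! - x^7/7!, evaluated in Horner form.
float sin_approx(float x)
{
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f/6.0f
               + x2 * ( 1.0f/120.0f
               + x2 * (-1.0f/5040.0f))));
}
```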
 
Not just slides...

If you look at the notes rather than the slides, there is a lot of extra text...


I'd have to disagree with you, fresh: the VU0 is incredibly useful, especially as you start to use more complex animation and physics, but the problem is that it is difficult to use well.
 
I have the idea that while L2 would help quite a bit, a good OOOe implementation for issue and execute would help quite a lot too ( the author keeps pounding on the fact that we could have executed so many instructions here and there... ). Whether one is easier to implement than the other might be the point of discussion...
 
OOOE isn't that important here

A good compiler would do a far better job...
On the MIPS there are enough registers available to schedule independent operations ( unlike the x86, where there is a lot of register reuse ) and loads are deferred ( so a compiler can insert code between a load and the use of its register ) - see the sketch below.
Memory organisation is the key to good PS2 performance ( it's also important in general computing, but it's difficult to optimise fully on a PC, where you don't know the underlying memory architecture fully ).
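
To make the deferred-load point concrete, a minimal source-level sketch (assuming a simple blend loop; the exact scheduling is of course up to the compiler):

```cpp
// Loads issued early, consumed late: with enough free registers the
// compiler can keep independent work between each load and its first
// use, filling the load-use delay instead of stalling the pipeline.
float blend(const float* a, const float* b, float t, int n)
{
    float acc = 0.0f;
    float w = 1.0f - t;              // loop-invariant, hoisted out
    for (int i = 0; i < n; ++i) {
        float x = a[i];              // load...
        float y = b[i];              // ...second, independent load
        float xt = x * t;            // by the time x is used, it's ready
        acc += xt + y * w;           // and y is ready here
    }
    return acc;
}
```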
 
Well, VU0 isn't exactly intended to do graphics mainly, and the other uses aren't nearly as straightforward. Physics and other crap has a lot of general-purpose code in there that can't all just be stuffed into microcode like shaders on VU1 can be. So it's more likely you end up using VU0 in macro mode most of the time.

fresh,
the DMA controller DOES have a direct path from VU0 memory to the GIF; it just isn't controlled by the VU itself like on VU1, which makes synchronization more complicated.
Still, if you really need to, you can do the calculations and send the results to the GS directly from VU0 local memory, without a round trip through main RAM.

Anyway, as far as our uses go... the matrix and some vector optimization is mainly macro code... also a few other smaller things.
Other than that, we have microcode running collision tests (box to box, triangle to triangle), some renderer matrix-setup routines, and until recently a vectorized matrix decomposition and linear solver (this one I am rather proud of, since it yielded roughly a 10x speed increase over straight C++ code, and it's one of the parts of physics that doesn't have a real hard limit on how much time it can take up, so it can make a really big difference).
 
On the MIPS there are enough registers available to schedule independent operations ( unlike the x86, where there is a lot of register reuse ) and loads are deferred ( so a compiler can insert code between a load and the use of its register )

Yes, the compiler can do a good job... keyword 'can'... and with three-operand instructions, registers can be reused quite a bit too...

I don't fully agree with the argument "we have enough registers", and while I agree that the compiler might be able to fill some of those voids with code... there are several situations ( a certain condition that cannot be resolved until run-time, for example... there was an Alpha vs. IA-64 paper that you all probably read too that talked about this ) in which the compiler simply doesn't know... and an in-order CPU is DEPENDENT on the compiler, more than a good OOOe CPU is ( they are both dependent... think of IA-64 with bad compilers and no FDO, and think of a K7 with a bad compiler... neither of them will do well... but one will be much more distanced from the other... guess which... )...

There are other ways around OOOe, predication, aggressive SMT, etc... but none of those roads was easily applicable to the EE...

an R10K would not have been that bad ;) even with no L2, but with SPRAM and all the modifications the R5900i received...

The problem that brought MIPS, POWER, x86, and Alpha into the warm embrace of OOOe is that a good OOOe implementation can do better than most compilers can... even architectures like IA-64, which were designed with reliance on compilers in mind and heavy support in the ISA itself for helps/hints coming from the compiler to the "dumb" micro-processor... even IA-64 needs run-time profiling ( several runs of the program... Feedback-Directed Optimization, IIRC ) to be fed to the compiler so it can re-compile the code again and improve its efficiency...

Compilers do not know run-time information; the OOOe engines on the CPU do, and that is why they can do, and often do, BETTER... and, btw, FDO/profiling can be used on OOOe architectures to improve their compilers' code generation as well...
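
To make the "compiler simply doesn't know" case concrete, here's a minimal sketch using pointer aliasing as the run-time condition; __restrict is a common compiler extension rather than standard C++, so treat the second version as an assumption about the toolchain:

```cpp
// If dst and src might alias, a static scheduler cannot hoist the next
// iteration's load above this iteration's store, so an in-order core
// eats the load-use stalls. An OOOe core resolves the addresses at
// run time and reorders the memory ops when they don't conflict.
void scale(float* dst, const float* src, float k, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;   // compiler must assume dst[i] may
                               // overwrite src[i+1]
}

// With a restrict-style annotation the programmer supplies that
// run-time fact statically, letting the compiler software-pipeline
// the loop even for an in-order core.
void scale_restrict(float* __restrict dst, const float* __restrict src,
                    float k, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```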
 