PS2 performance analyzer statistics (from Sony)

It was 145K sustained, but not at 60fps. At 60 that works out to roughly 70K polys per frame on average, so 125K at 60 is still the fastest (and those are actual displayed polygons). But those are just numbers; in most games the quality of the picture doesn't depend that much on the number of polygons.
So, would that 125K be the absolute peak the game in question is achieving, or the average number it puts out? If it's not the peak, could you tell us what IS the peak the game in question was rendering? Also, can you disclose what the game in question is? (come on, you told us about LOTR :)

With games like GTA:VC & SSX3 using VU0 to decode 4.0 DTS in real time, how does this appear in the statistics?
Actually, those games are encoding the DTS signal; your receiver is decoding it :)

Is there any way the PS2 could encode DTS 5.1, much the way the Xbox encodes DD, and output the digital signal through the SPDIF port?
Yes, there is at least one game that I know of that supports DTS 5.1 during gameplay (one of the NHL games from EA)
 
Legion said:
Nick Laslett said:
With games like GTA:VC & SSX3 using VU0 to decode 4.0 DTS in real time, how does this appear in the statistics? I assumed that this would use a lot of the capabilities of VU0, or else they would decode 5.1


Is there any way the PS2 could encode DTS 5.1, much the way the Xbox encodes DD, and output the digital signal through the SPDIF port?

I guess you could get DTS 5.1 if you have the performance headroom. 4.0 is "good enough" and probably saves enough performance for other graphical features/framerate stabilisation.

I'm sure getting 5.1 wouldn't be a problem in terms of compatibility, if that's what you're asking. DTS 4.0 is a compromise, much like DPL2 compared to 5.1... At the end of the day it's a third fewer channels to encode on the fly (from 6 down to 4), and that overhead can be used for other things.

And remember that most if not all sound systems interpolate the signals anyway; for example, I still get sound out of the centre speaker and subwoofer in SSX etc., even though it's really 4.0.
 
Mmm... so which is the "best performing" game on PS2? GT3? Since it was created by Sony themselves.
 
One thing I'm curious about is why VU0 is so hard to use.
Is it due to the fact that its memory, IIRC, was cut in half shortly before finalisation of the project, in what looked like a last-minute twitch move, and the whole system became unbalanced by that?

VU0 has access to the 16Kb of SP-RAM, and the R5900 core must also be able to spoon-feed it data and code pretty fast, so how could memory still be a problem?
Several research projects show that scratchpad memories actually give better cost-to-performance results than a cache, especially for stuff like realtime 3D applications.
With 80Kb of low-level memory (roughly equivalent to L1 cache), the EE still has more of it than most other CPUs on the market today.
There is no L2 cache, but then the EE has pretty good bandwidth to main memory, so shouldn't that help a lot to rectify that problem?
 
Squeak said:
Several research projects show that scratchpad memories actually give better cost to performance results, than a cache, especially for stuff like realtime 3d applications.
Well, cache has the disadvantage that it is relatively unpredictable - you don't know, when you go to it, whether it's going to take a short or a long time. If your ALU design assumes the worst case, then it needs a lot of (expensive) latency compensation; if it assumes the best case, then it will starve on misses. Its advantage is where there is semi-predictable reuse, often true for conventional programs but less likely to be so for 3D data, where either the reuse is predictable or there isn't any reuse at all.

This leads into: a scratchpad has consistent, high performance until you run out of it; then either your algorithm becomes impossible to implement, or your performance falls off a cliff. A cache is likely to have more graceful degradation.

And then: developers will avoid the performance falling off a cliff, so they can/will program said hardware more efficiently. Alternatively, they may use it inefficiently if they aren't good enough to make good use of it - or not use it at all.

Of course, scratchpads have problems too - in particular, they are a pretty expensive block of memory if you have a lot of it, and many algorithms don't need more scratchpad - register space is often enough to manipulate data (which is then handed off to another unit for later processing).
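
To illustrate the "consistent until you run out" behaviour: a minimal C sketch of the usual scratchpad double-buffering pattern. The dma_to_spr/dma_wait helpers are hypothetical stand-ins for the EE's DMAC calls, and the sizes are purely illustrative:

[code]
#include <stddef.h>

#define SPR_HALF 8192   /* half of a 16Kb scratchpad: one buffer in flight, one in use */

/* Hypothetical helpers standing in for the real DMAC API. */
void dma_to_spr(void *spr_dst, const void *src, size_t bytes);  /* kick, returns at once */
void dma_wait(void);                                            /* block until done */

/* Stream a large array through scratchpad: while the CPU works on one
 * half, the DMAC fills the other, so latency is flat and predictable -
 * right up until the working set stops fitting the pattern, at which
 * point the scheme falls off the cliff described above. */
void process_stream(const float *src, size_t count, float *dst)
{
    static float spr[2][SPR_HALF / sizeof(float)];  /* stands in for mapped SP-RAM */
    const size_t chunk = SPR_HALF / sizeof(float);
    int cur = 0;

    dma_to_spr(spr[cur], src, SPR_HALF);            /* prime the first buffer */

    for (size_t i = 0; i < count; i += chunk) {
        dma_wait();                                  /* current buffer is ready */
        if (i + chunk < count)                       /* prefetch the next chunk */
            dma_to_spr(spr[cur ^ 1], src + i + chunk, SPR_HALF);

        for (size_t j = 0; j < chunk && i + j < count; j++)
            dst[i + j] = spr[cur][j] * 2.0f;         /* placeholder per-element work */

        cur ^= 1;                                    /* swap halves */
    }
}
[/code]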
 
Squeak said:
One thing I'm curious about is why VU0 is so hard to use.
Is it due to the fact that its memory, IIRC, was cut in half shortly before finalisation of the project, in what looked like a last-minute twitch move, and the whole system became unbalanced by that?

VU0 has access to the 16Kb of SP-RAM, and the R5900 core must also be able to spoon-feed it data and code pretty fast, so how could memory still be a problem?
Several research projects show that scratchpad memories actually give better cost-to-performance results than a cache, especially for stuff like realtime 3D applications.
With 80Kb of low-level memory (roughly equivalent to L1 cache), the EE still has more of it than most other CPUs on the market today.
There is no L2 cache, but then the EE has pretty good bandwidth to main memory, so shouldn't that help a lot to rectify that problem?

VU0 (or VU1) cannot access scratchpad directly. VU0 has got its own 4K of data RAM (and another 4K of program RAM) that can be filled via the DMA controller or spoon-fed by the R5900. The usual way VU0 is used is as a coprocessor to the R5900, which isn't registered in these figures. So it's not necessarily not being used; it's just not being used as a separate processor.

Why not as a separate processor?

It's a 4K-RAM 16-bit integer processor with a massive SIMD FPU attached. It's kind of like programming for a circa-1980 CPU combined with a vertex shader. Unless you have a really good use for it, it's not worth the time, as you get a lot of the benefit in coprocessor mode (which is a lot easier to use).
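
For flavour, coprocessor (macro) mode looks roughly like this - a sketch of the usual broadcast-field matrix-vector transform issued inline from R5900 code, assuming ee-gcc inline assembly and a column-major matrix (register names and syntax follow the common SDK style, so check your toolchain):

[code]
/* Transform one vec4 by a 4x4 matrix using VU0 macro-mode instructions
 * issued from R5900 code. mat points at four quadword-aligned columns,
 * vec/out at one quadword each. Sketch only. */
static inline void vu0_transform(float *out, const float *mat, const float *vec)
{
    asm volatile (
        "lqc2    vf1, 0x00(%1)      \n"  /* load matrix columns into vf1..vf4 */
        "lqc2    vf2, 0x10(%1)      \n"
        "lqc2    vf3, 0x20(%1)      \n"
        "lqc2    vf4, 0x30(%1)      \n"
        "lqc2    vf5, 0x00(%2)      \n"  /* load the input vector */
        "vmulax.xyzw  ACC, vf1, vf5 \n"  /* ACC  = col0 * vec.x */
        "vmadday.xyzw ACC, vf2, vf5 \n"  /* ACC += col1 * vec.y */
        "vmaddaz.xyzw ACC, vf3, vf5 \n"  /* ACC += col2 * vec.z */
        "vmaddw.xyzw  vf6, vf4, vf5 \n"  /* vf6  = ACC + col3 * vec.w */
        "sqc2    vf6, 0x00(%0)      \n"  /* store the result */
        : : "r" (out), "r" (mat), "r" (vec) : "memory");
}
[/code]

No microcode to upload, no kick-and-wait - which is exactly why this mode is so much easier to use.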

On some of your other errors about the EE:

Scratchpad requires explicit control of the memory, therefore it's not as easy to use as cache. 80Kb of RAM - not sure where you get that from, unless you're adding the VU0 and VU1 RAM into the figure, but they shouldn't count, as they're effectively separate processors and need their own RAM.

There isn't much bandwidth between the EE core and main RAM, as it's shared by the entire system and VU1 uses a lot of it to pump out the graphics.

[edit] Silly typo with VU1 and VU0 corrected [/edit]
 
What DeanoC said; I just wanted to add something of my own.
Squeak said:
and also the R5900 core must be able to spoon-feed it data and code pretty fast, so how could memory still be a problem?
One of the major reasons to use micro-mode VU0 is parallel operation with the core. Because we have to spoon-feed the data to VU0 from the R5900, much of this parallelism gets lost. Moreover, VU0 can't output results by itself either, requiring more R5900 assistance to read the results back out.

This interaction between the core and VU0 is also anything but simple to implement efficiently, isn't easy to debug, and doesn't lend itself well to a generic approach that would work for multiple problems.
Macro code takes MUCH less time to write, and if you fail to implement the above interaction efficiently, it stands a good chance of running comparably fast.
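
A rough C sketch of what that spoon-feeding amounts to; every helper name here is hypothetical, standing in for the real VIF/DMAC/COP2 mechanics:

[code]
#include <stddef.h>

/* Hypothetical helpers standing in for the real VIF/DMAC/COP2 calls. */
void vu0_upload_code(const void *micro, size_t bytes);  /* fill 4K program RAM */
void vu0_upload_data(const void *src, size_t bytes);    /* fill 4K data RAM   */
void vu0_start(unsigned entry);                         /* kick the microprogram */
int  vu0_running(void);                                 /* poll the busy flag */
void vu0_read_results(void *dst, size_t bytes);         /* R5900 pulls data back */

extern const unsigned char physics_micro[];             /* assembled microcode */

void solve_batch(const float *in, float *out, size_t bytes)
{
    vu0_upload_code(physics_micro, 4096);
    vu0_upload_data(in, bytes);    /* the R5900/DMAC has to feed the data in... */
    vu0_start(0);
    while (vu0_running())
        ;                          /* ...the parallelism you hoped for leaks away here... */
    vu0_read_results(out, bytes);  /* ...and the core has to pull the results out too */
}
[/code]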

What VU0 lacks to make it truly useful is what APUs promise to give - the ability to feed and output data by themselves.

I'm not just speaking theory on this one, btw - I've written a buttload of macro code as well as several microcoded solutions (mostly related to physics optimisations).


In regards to the cache issue - the most painful cache misses tend to come from higher-level code with lots of random memory accesses. As Dio mentioned, this falls well outside the domain of SPR-assisted algorithms, and, also lacking an efficient cache, the CPU ends up wasting lots of cycles.
 
DeanoC said:
VU0 (or VU1) cannot access scratchpad directly. VU1 has got its own 4K of data RAM (and another 4K of program RAM) that can be filled via the DMA controller or spoon-fed by the R5900. The usual way VU0 is used is as a coprocessor to the R5900, which isn't registered in these figures. So it's not necessarily not being used; it's just not being used as a separate processor.

Why not as a seperate processor?
As I understand it, using VU0 as a coprocessor isn't very good utilization of it; SCEE even went so far as to call it evil in one of their presentations.
It's a 4K-RAM 16-bit integer processor with a massive SIMD FPU attached. It's kind of like programming for a circa-1980 CPU combined with a vertex shader. Unless you have a really good use for it, it's not worth the time, as you get a lot of the benefit in coprocessor mode (which is a lot easier to use).
Don't vertex shaders, like those in the GeForce GPUs, have even less memory?

If it really is a third of the total floating point power of the system, then why not use it to its fullest?

Scratchpad requires explicit control of the memory, therefore it's not as easy to use as cache.
Maybe not as straightforward, but still more cost-effective.
80Kb of RAM - not sure where you get that from, unless you're adding the VU0 and VU1 RAM into the figure, but they shouldn't count, as they're effectively separate processors and need their own RAM.
Well, isn't that almost a question of semantics? A question of how you define a separate processor? Then what about an FPU or AltiVec unit - are they also separate processors?
The VUs couldn't "live on their own", so to speak. For advanced applications they have to be hooked up to some sort of CPU to work, or am I mistaken?
There isn't much bandwidth between EE and main RAM as its shared by the entire system and VU1 uses alot to pump out the graphics.
As far as I've been able to find out, it still has more memory-to-die bandwidth than other CPUs of its time (for example, the xCPU has 1 Gb/s).
 
Squeak said:
As I understand it, using VU0 as a coprocessor isn't very good utilization of it; SCEE even went so far as to call it evil in one of their presentations.
That's their job :) The problem is that nobody (including them) can find a really good, cost-effective use for it in separate mode. We have all got/had ideas for some crazy things, but schedules intrude.
In the constant arms race of games development it will get used more and more, but it basically doesn't lend itself to being used as a separate processor.

Squeak said:
Don't vertex shaders, like those in the GeForce GPUs, have even less memory?
But vertex shaders work on streams, so their RAM doesn't include storing the vertices you're currently working on or the results of the calculations. In a VU, you also have to upload the data into that space (as well as constants etc.), and you may have to double-buffer the output.

VU1 is doing an easier job: it takes a bunch of objects and renders them, and except for some matrix stuff and bounding calculations, it is largely isolated from the game code.
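
To make that memory pressure concrete, here's a purely hypothetical carve-up of VU1's 16Kb data RAM in C; none of the figures come from a real engine, but the shape of the problem is right:

[code]
/* Hypothetical budget for VU1's 16Kb data RAM. Unlike a stream-based
 * vertex shader, the VU must hold inputs, constants AND outputs in the
 * same small memory, double-buffered so DMA overlaps processing. */
enum {
    VU1_DATA_RAM = 16 * 1024,
    CONSTANTS    = 1 * 1024,          /* matrices, lights, clip planes */
    INPUT_BUF    = 5 * 1024,          /* one batch of untransformed verts */
    OUTPUT_BUF   = 2 * 1024 + 512,    /* GS packet being built */
    /* double-buffer both sides so the DMAC and the VU run in parallel */
    TOTAL        = CONSTANTS + 2 * (INPUT_BUF + OUTPUT_BUF)
};

_Static_assert(TOTAL <= VU1_DATA_RAM, "budget doesn't fit in VU1 data RAM");
[/code]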

Squeak said:
If it really is a third of the total floating point power of the system, then why not use it to its fullest?
Because you get a fair whack of the performance in coprocessor mode, and by the time you've sorted out all the issues of using it separately, you've used a lot of dev time and will gain a lot less than the 30% extra FPU power.

Squeak said:
Scratchpad requires explicit control of the memory, therefore it's not as easy to use as cache.
Maybe not as straightforward, but still more cost-effective.
Only in some situations with a fairly ordered access pattern; for things like AI and general game code, a good large cache helps A LOT.

Squeak said:
80Kb of RAM - not sure where you get that from, unless you're adding the VU0 and VU1 RAM into the figure, but they shouldn't count, as they're effectively separate processors and need their own RAM.
Well, isn't that almost a question of semantics? A question of how you define a separate processor? Then what about an FPU or AltiVec unit - are they also separate processors?
The VUs couldn't "live on their own", so to speak. For advanced applications they have to be hooked up to some sort of CPU to work, or am I mistaken?
I define (and I think most people define) a processor as a unit that executes a series of instructions independently of anything else. So FPUs or AltiVec aren't separate - they're coprocessors, as they 'just' add extra instructions to a CPU. Whereas the VUs are completely separate processors (except VU0 in coprocessor mode): they run totally separately from the CPU. Indeed, VU1 is hardly ever touched by the R5900 itself; the only connection is that the R5900 builds a display list that VU1 consumes, but that's only because we want interaction - it's possible for VU1 to work separately.

Physically they're all on one chip (AFAIK - actual physical hardware isn't something I've ever looked at), but logically they (can) all operate separately. Indeed, people have (for fun) run entire little games on VU1... with the only outside access being setting flags for controller input.

Squeak said:
As far as I've been able to find out, it still has more memory-to-die bandwidth than other CPUs of its time (for example, the xGPU has 1 Gb/s).
The need to supply VU1 with data hogs the memory bus A LOT. It's fairly hard for the CPU to get a bus cycle, which, because of its small cache, it needs more often than most other CPUs.

Lots of devs use it as a powerful programmable coprocessor (you can upload mini-programs to it and call them macro-like from the main CPU). It's very good when used like that :)

BTW, your Xbox figure is way out - it's got a total of 6.4Gb/s memory speed. But without understanding the difference in memory architectures it's unfair to compare them (the Xbox shares its bus with framebuffer ops but has more CPU cache etc.).
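
For reference, both figures fall out of simple bus arithmetic; the widths and clocks below are the commonly quoted ones, so treat this as a sanity check rather than gospel:

[code]
#include <stdio.h>

int main(void)
{
    /* peak bandwidth = bus width (bytes) * clock (Hz) * transfers per clock */
    double xbox_mem = 16 * 200e6 * 2;  /* 128-bit DDR @ 200MHz -> 6.4 GB/s   */
    double xbox_fsb =  8 * 133e6 * 1;  /* 64-bit FSB @ 133MHz  -> ~1.06 GB/s */
    printf("Xbox memory: %.1f GB/s, Xbox CPU FSB: %.2f GB/s\n",
           xbox_mem / 1e9, xbox_fsb / 1e9);
    return 0;
}
[/code]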
 
And another VU0 opinion.

My current view is that, for the most part, I'd rather have had better core (non-SIMD) FPU performance or a bigger (more complex) cache on the die than VU0. It would probably have improved overall app performance more.
 
Thanks for all your explanations, DeanoC and Fafalada.
However, I have some doubts about one of DeanoC's comments.
VU0 (or VU1) cannot access scratchpad directly. VU1 has got its own 4K of data RAM (and another 4K of program RAM) that can be filled via the DMA controller or spoon-fed by the R5900. The usual way VU0 is used is as a coprocessor to the R5900, which isn't registered in these figures. So it's not necessarily not being used; it's just not being used as a separate processor.

Just two questions:

1. Wouldn't you mean 16Kb????
2. I have this picture from a PDF made by some Toshiba/Sony engineers that shows this scheme:

EE.jpg

Doesn't this scheme show two paths in the relation between the main core and VU0 ?????
 
ShinHoshi said:
Just two questions:

1. Wouldn't you mean 16Kb????
2. I have this picture from a PDF made by some Toshiba/Sony engineers that shows this scheme:

EE.jpg

Doesn't this scheme show two paths in the relation between the main core and VU0 ???

VU0 has 4Kb, VU1 has 16Kb. Typo in my original post, sorry (corrected now to stop further confusion).

[edit] Can see the picture now, so I can comment ;-) [/edit]

One path is direct from the CPU to VU0; the other is via the system bus, which uses the DMAC to move stuff around. At least that's how I read it...
 
Thank you again, DeanoC and Fafalada, for taking the time to answer the questions. :)

DeanoC said:
BTW Your Xbox figure is way out, its got a total of 6.4Gb/s memory speed.

Oops, of course I meant xCPU. :oops:

ERP said:
And another VU0 opinion.

My current view is that for the most part I'd rather have better core (non SIMD) FPU performance or a bigger (more complex) cache on the die than VU0. It would probably have improved overall app performance more.

I would have thought something like this would be better:
1. Ditch the FPU and instead give VU0 8Kb more, and upgrade the CPU core with a larger cache.
2. Instead of "only" 8Kb of texture cache make it 16Kb.
3. One more block of RDRAM, partly for more storage space, but mainly for more bandwidth. (In that order)

But still, about VU0, as DeanoC wrote:
DeanoC said:
It's kind of like programming for a circa-1980 CPU combined with a vertex shader.
A hypercharged VIC-20, then. But look what could be done with 4Kb back then. :)
As already mentioned, DTS encoding can be done, and various culling schemes could probably be made to work (to save a lot of bandwidth by not rendering invisible geometry), plus data compression, calculation of normals for CLUT-quantised bump maps, and so on.
Of course macro mode is there for a reason, and that's where VU0 will spend most of its time, but for various small, repetitive tasks VU0 would, AFAICS, fit the bill.
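
As a sketch of the sort of small, repetitive batch job meant here: a sphere-vs-frustum cull in plain C, shaped the way a VU0 micro routine would want it (one fixed small kernel, arrays DMA'd in, one flag per sphere out). The data layout is made up for illustration:

[code]
/* Cull a batch of bounding spheres against 6 frustum planes.
 * Plane equations are (nx,ny,nz,d) with inward-facing normals, so a
 * sphere wholly behind any one plane is invisible. */
typedef struct { float x, y, z, r; } Sphere;
typedef struct { float nx, ny, nz, d; } Plane;

void cull_spheres(const Sphere *s, int count,
                  const Plane planes[6], unsigned char *visible)
{
    for (int i = 0; i < count; i++) {
        visible[i] = 1;
        for (int p = 0; p < 6; p++) {
            /* signed distance from the sphere centre to the plane */
            float dist = planes[p].nx * s[i].x
                       + planes[p].ny * s[i].y
                       + planes[p].nz * s[i].z
                       + planes[p].d;
            if (dist < -s[i].r) {   /* wholly behind this plane: cull */
                visible[i] = 0;
                break;
            }
        }
    }
}
[/code]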
 
ShinHoshi said:
Doesn't this scheme show two paths in the relation between the main core and VU0 ?????
There are two paths, of course - one for coprocessor (macro) mode, where the main CPU can access the VU0 registers, and one from main memory to VU0.

[quote="Squeak]I would have thought something like this would be better:
1. Ditch the FPU and instead give VU0 8Kb more[/quote]
More memory wouldn't really help VU0. Like I said, add the ability for it to start DMA transfers in and out, and you've got yourself a nice, usable solution.

Like Deano said, a lot of the time the potential uses don't justify the large amount of extra time spent implementing them. When they do get implemented, they are still quite nice to have.
 
Personally I'd have loved the VUs to have been symmetrical (even if it only meant 8K each), with both coprocessor interfaces and graphics interfaces...
but there seemed to be good reasons for the split.
 
Crazyace said:
Personally I'd have loved the VUs to have been symmetrical (even if it only meant 8K each), with both coprocessor interfaces and graphics interfaces...
but there seemed to be good reasons for the split.

I've done... am doing parallel programming myself, and agree that would be a more elegant programming model. But some hardware requirements would have to be fulfilled for that to be practical, and that may have been too challenging for a home console designed in '98/'99 and priced at USD 299.
 
...

Or do away with both VUs altogether and implement a single 4x4 matrix processor with 16 FMACs. This would have made everyone's life a lot easier...
 
Re: ...

Deadmeat said:
Or do away with both VUs altogether and implement a single 4x4 matrix processor with 16 FMACs. This would have made everyone's life a lot easier...

Would it now? I'm sure some people still would have found something to complain about, hey Deadmeat...
 
4x4 may be a bit of overkill..

Hi DM,

Although a 4x4 matrix multiply with single-cycle throughput would up the FP rating, it would require a lot of extra memory bandwidth (separate load/store ports to memory), and it would also be wasted on many operations (normal transforms are often 3x3, and colour ops 3x4).
It would be similar to the original GTE on the PS1, which could perform the transformation almost faster than it took to set up the parameters and store the results.
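
To put rough numbers on the "wasted" point: a unit that evaluates a full 4x4 transform each cycle does 16 multiply-adds, but the common smaller shapes leave lanes idle. A back-of-envelope sketch in C (the shapes and the 16-FMAC figure are from this thread; the rest is illustrative):

[code]
#include <stdio.h>

/* Rough utilization of a hypothetical single-cycle 4x4 FMAC array
 * (16 multiply-adds per cycle) on common transform shapes. */
int main(void)
{
    const int fmacs = 16;            /* full 4x4 matrix * vec4 per cycle */
    const int shapes[][2] = {
        {4, 4},                      /* full vertex transform */
        {3, 4},                      /* colour op             */
        {3, 3},                      /* normal transform      */
    };

    for (int i = 0; i < 3; i++) {
        int used = shapes[i][0] * shapes[i][1];
        printf("%dx%d transform: %2d/%d FMACs busy (%.0f%%)\n",
               shapes[i][0], shapes[i][1], used, fmacs,
               100.0 * used / fmacs);
    }
    return 0;
}
[/code]

That's 100%, 75% and 56% of the array busy respectively, before you even count the setup and store cycles around each operation.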
 
What's disappointing about those figures is not the numbers themselves, but the fact that so much silicon is apparently sitting unused/wasted.
 