Hironobu Sakaguchi's opinion of cell and PS3

Status
Not open for further replies.
So, I have these four numbers that are sitting in a vmx register and I need to know the above value to pass to the next step. How could I do this faster on xenon without dot product?
In your specific case additions are enough, otherwise you use fmadds in the general case.
In 4 cycles you can do 1 to 4 dot4. It's up to you how many things you want to process at the same time; I usually work with at least 4 entities (vertices, edges, whatever I need to process) at once.
Even if you decide to do a single dot product in 4 cycles, you will probably end up with less latency on the result, so the CPU can schedule an instruction that uses it sooner.
Another advantage of working this way is that your dot products scale linearly with the width of your vectors, so 4 dot3 take 3 cycles and 4 dot2 take 2 cycles.
Overall you have higher throughput and less latency.
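
A minimal sketch of that layout in plain C, assuming the four entities are stored "structure of arrays" style so that each float[4] stands in for one vector register. The names Vec4x4 and dot4_x4 are made up for illustration, and the scalar loop only emulates what the four lanes of a vector multiply-add would do; it is not real VMX code.

```c
#include <stddef.h>

/* Four 4-component vectors in structure-of-arrays form:
   x[i], y[i], z[i], w[i] belong to entity i.
   Each array stands in for one 4-lane vector register (assumption). */
typedef struct {
    float x[4], y[4], z[4], w[4];
} Vec4x4;

/* Computes out[i] = dot4(a_i, b_i) for i = 0..3.
   The four statements in the loop body correspond to one vector multiply
   followed by three vector multiply-adds over all lanes, so no
   across-lane (horizontal) add is ever needed. */
void dot4_x4(const Vec4x4 *a, const Vec4x4 *b, float out[4])
{
    for (size_t i = 0; i < 4; ++i) {
        float acc = a->x[i] * b->x[i];   /* multiply      */
        acc += a->y[i] * b->y[i];        /* multiply-add  */
        acc += a->z[i] * b->z[i];        /* multiply-add  */
        acc += a->w[i] * b->w[i];        /* multiply-add  */
        out[i] = acc;
    }
}
```

Dropping the w line gives four dot3 in three steps, which is the linear scaling mentioned above.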
 

Thanks guys! Good ideas, but I probably should have fleshed out what I was doing a bit more to show how I get those -1 to 0 values. Plus, maybe you guys will have a better way of doing what I need to do ;) Imagine an array of positive numbers, let's say these, the last of which is always infinity:

{ 1000, 5700, 9350, infinity }

I have another number, say 7675, and I want to know which slot it falls just under, and I need that slot number to use later on. It used to be figured out with a branch, which of course is bad. Instead I did this, all with vmx instructions to make sure everything stays in the vmx unit registers:

1) splat the target number across another vector so we have { 7675, 7675, 7675, 7675 }

2) subtract this new vector from the original one, giving: { -6675, -1975, 1675, huge +num }

3) do a max on that vector with -1, giving: { -1, -1, 1675, huge +num }

4) do a min on that vector with 0, giving: { -1, -1, 0, 0 }

5) dot that vector with itself, giving: 2

The value I need is 2, which I use elsewhere for other things. So, the above.....dumb or not dumb? Or maybe there is a better way to do this? Oh ya, I inline the above into a function and call it unrolled in groups of 4, so that the compiler can make great use of the 128 vmx registers and schedule things much better.
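
The five steps above can be sketched in plain C as follows. This is a scalar emulation of the four lanes, not actual VMX code, and the names kExampleLimits and find_slot_clamp are hypothetical. It assumes the differences are either non-negative or at most -1 (true for whole-number inputs), since a difference strictly between -1 and 0 would contribute a fraction to the sum.

```c
#include <math.h>   /* only for the INFINITY macro */

/* The example table from the post; the last entry is always infinity. */
static const float kExampleLimits[4] = { 1000.0f, 5700.0f, 9350.0f, INFINITY };

/* Returns the slot index by counting the limits that lie below target.
   Assumes each difference is >= 0 or <= -1 (see the note above). */
int find_slot_clamp(const float limits[4], float target)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        float d = limits[i] - target;    /* steps 1-2: splat and subtract */
        d = (d > -1.0f) ? d : -1.0f;     /* step 3: max with -1           */
        d = (d < 0.0f) ? d : 0.0f;       /* step 4: min with 0            */
        sum += d * d;                    /* step 5: dot with itself       */
    }
    return (int)sum;
}
```

Calling find_slot_clamp(kExampleLimits, 7675.0f) walks through exactly the intermediate vectors listed in steps 1 to 5.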
 
Plus, maybe you guys will have a better way of doing what I need to do ;) Imagine an array of positive numbers, let's say these, the last of which is always infinity
...(snip)

How big is the array typically?

Once you have the array, how many queries of that sort will you perform on it before rebuilding it?
 
joker454 said:
Or maybe there is a better way to do this?
There is.
(at least on SPE - but IIRC you said you were multiplatform so this should be useful anyway).

1) "CompareGreaterThanWord" resultVector, { 7675, 7675, 7675, 7675 }, { 1000, 5700, 9350, infinity }
2) "GatherBitsFromWords" resultVector, resultVector (combines the rightmost bit of each word into a 4-bit value, zeroes the rest)
3) "CountOnesInBytes" resultVector, resultVector (gets the actual count, in this case 2)

No messy long-latency instructions or dodgy FP arithmetic, it unrolls trivially, and it can be made shorter still with some manual unrolling.
I'll let someone with VMX experience take a gander if equivalent syntax exists there.
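
A rough scalar emulation of those three steps in C, to make the data flow concrete. The helper names are invented for illustration and are not SPU intrinsics, the limits are assumed to be 32-bit signed integers, and INT32_MAX is used as a stand-in for the "infinity" sentinel.

```c
#include <stdint.h>

/* Step 1: per-lane compare, all-ones where a > b, zero otherwise. */
static void cmpgt_word(const int32_t a[4], const int32_t b[4], uint32_t mask[4])
{
    for (int i = 0; i < 4; ++i)
        mask[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
}

/* Step 2: gather the rightmost bit of each word into a 4-bit value
   (lane 0 ends up as the most significant of the four bits). */
static uint32_t gather_bits(const uint32_t mask[4])
{
    uint32_t bits = 0;
    for (int i = 0; i < 4; ++i)
        bits = (bits << 1) | (mask[i] & 1u);
    return bits;
}

/* Step 3: count the set bits in the gathered value. */
static uint32_t count_ones(uint32_t v)
{
    uint32_t count = 0;
    while (v != 0) {
        count += v & 1u;
        v >>= 1;
    }
    return count;
}

/* Slot index = number of limits strictly below target. */
uint32_t find_slot_mask(const int32_t limits[4], int32_t target)
{
    const int32_t splat[4] = { target, target, target, target };
    uint32_t mask[4];
    cmpgt_word(splat, limits, mask);
    return count_ones(gather_bits(mask));
}
```

For the sample table and a target of 7675, the mask is set in the first two lanes, the gathered value is binary 1100, and the count is 2.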

Question, are you absolutely required to use 32bit values to compare with though?
 

Ya, the sample numbers are small for simplicity, but the actual numbers are all over the map, they can get pretty huge. Thanks for that idea, I haven't done the spu version yet but I'll keep your idea in mind when I do this week! I guess I need to see if there are equivalent vmx instructions for what you describe as well.
 
Tim Schaff works on dev tools for Playstation?

The reports seemed to suggest he'd work on whipping up Sony's SW applications, like Connect, up to shape.
 

That is correct, I was using him as an example, not as an explicit case...
 
.....
I inline the above into a function and call it unrolled in groups of 4, so that the compiler can make great use of the 128 vmx registers and schedule things much better.

Under standard VMX (if you want it on the PPE rather than Fafalada's SPE version), vsumsws is probably your friend:

mask = vec_cmpgt( vectorvalue, vec_splat( testvalue ) );
field = vec_sums( mask, 4 );

If you wanted to calculate 2 or 4 in parallel, just pack the bytes and use sum2s or sum4s
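
To show what the compare plus sum-across pair computes, here is a scalar emulation in C. It is only a sketch: cmpgt_lanes, sum_across_sat and find_slot_sums are hypothetical names, the compare is written so that lanes whose limit is below the target are set (an assumption about the intended direction), and the saturating sum follows the vsumsws behaviour of adding all four words of the first operand to the last word of the second.

```c
#include <stdint.h>

/* Emulates a lane-wise signed compare: -1 (all ones) where a > b, else 0. */
static void cmpgt_lanes(const int32_t a[4], const int32_t b[4], int32_t mask[4])
{
    for (int i = 0; i < 4; ++i)
        mask[i] = (a[i] > b[i]) ? -1 : 0;
}

/* Emulates a saturating sum across words: the last lane receives
   saturate(a[0] + a[1] + a[2] + a[3] + b[3]); the other lanes are zeroed. */
static void sum_across_sat(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    int64_t total = (int64_t)a[0] + a[1] + a[2] + a[3] + b[3];
    if (total > INT32_MAX) total = INT32_MAX;
    if (total < INT32_MIN) total = INT32_MIN;
    out[0] = out[1] = out[2] = 0;
    out[3] = (int32_t)total;
}

/* Each set lane contributes -1, so the sum is the negated slot index. */
int32_t find_slot_sums(const int32_t limits[4], int32_t target)
{
    const int32_t splat[4] = { target, target, target, target };
    const int32_t zero[4] = { 0, 0, 0, 0 };
    int32_t mask[4], sums[4];
    cmpgt_lanes(splat, limits, mask);
    sum_across_sat(mask, zero, sums);
    return -sums[3];
}
```

Because the compare mask is -1 per set lane, the raw sum comes out negative; the final negate (or an equivalent subtract) turns it back into a slot number.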
 
Shame that...

However, given the latency of the float commands it still may be worth permuting the compare results and using integer adds. I haven't given too much thought to the 360 VMX though
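
One way to picture "permute plus integer adds" is the usual two-step horizontal sum: rotate by two lanes and add, then rotate by one lane and add. The C below is a scalar emulation with made-up helper names; whether the 360's VMX offers a suitable permute and integer add for this is taken from the post, not verified here.

```c
#include <stdint.h>

/* Emulates a lane rotation: out[i] = in[(i + n) mod 4]. */
static void rotate_lanes(const uint32_t in[4], int n, uint32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = in[(i + n) & 3];
}

/* Emulates a lane-wise modular (wrapping) integer add. */
static void add_lanes(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

/* After two rotate+add steps every lane holds the sum of all four inputs. */
void horizontal_sum(const uint32_t v[4], uint32_t out[4])
{
    uint32_t rot[4], partial[4];
    rotate_lanes(v, 2, rot);        /* {c, d, a, b}            */
    add_lanes(v, rot, partial);     /* {a+c, b+d, a+c, b+d}    */
    rotate_lanes(partial, 1, rot);  /* {b+d, a+c, b+d, a+c}    */
    add_lanes(partial, rot, out);   /* a+b+c+d in every lane   */
}
```

Fed with a compare mask (all ones per set lane), each lane ends up holding the negated count in two's complement, so one more negate gives the slot index.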
 

Indeed... Basically anything that would normally be a VICU-type instruction has been pulled and replaced with a dot product pipe. :(
 
Unfortunately, vsum* aren't supported on the 360 vmx ;(

Plus, there are some odd omissions here and there. For example, the vmx units on the 360 can't do vsum* operations, which as far as I can see are available on the PS3's PPU or SPUs (why?!?).

:p (are they available on PPE/SPE?)
 
So we started with SPUs missing stuff..and we ended with 360 VMX units missing stuff..hilarious
 
Didn't see any bashing (I actually saw positive remarks regarding MS in general) but that's irrelevant.

There is no denying Cell is complex. There have been many detailed articles on that. But saying Cell is low-powered without anything to back it up is more suspicious than anything. Low-powered compared to what, and why? Especially when taking into consideration benchmarks and other developer comments regarding its performance that contradict that statement.

He could be talking mainly about the PS3's CPU minus the SPEs, which barely outperformed the Wii's CPU in benchmark tests.

The CPU performs terribly without the SPEs... well, terribly considering its high clock speed of 3.2GHz.

It uses a PPE core which is stripped down. There isn't a fully featured PPC core in the 360 or the PS3.

The only next-gen system with a fully featured PPC core this time around is the Wii, but it's clocked low... really low.
 
I am not arguing your point, but could you be more specific about what makes a fully featured PPC core? There are quite a few different cores around.

What instruction set do you consider to be fully featured, or do you have some other specific features in mind?
 

Out of Order?

Better performance per clock?

The reality is that the PowerPC unit for Nintendo is more complex than the PPE alone.
 

So any PowerPC core that is more complex than the PPE is a fully featured core? That was indeed very vague.
 
He could be talking mainly about the PS3's CPU minus the SPEs, which barely outperformed the Wii's CPU in benchmark tests.

The CPU performs terribly without the SPEs... well, terribly considering its high clock speed of 3.2GHz.

It uses a PPE core which is stripped down. There isn't a fully featured PPC core in the 360 or the PS3.

The only next-gen system with a fully featured PPC core this time around is the Wii, but it's clocked low... really low.

If that is in fact what he is referring to, then I fail to see how it makes his claim that the CELL is "low-powered" any more valid..

Saying the CELL is "low-powered" because the performance of the chip isn't great when you only use the PPE is just as stupid as saying a Ferrari is "not very fast" because you can't get past 20mph when you only drive it in 1st gear..

The CELL was never meant to be used in a way where you isolate computation to the PPE only and ignore the SPEs.. So if you use a processor in a way it was never intended to be used, then how can you pass judgement on the capabilities of the chip overall??

That's just silly IMO..
 