PS3 and Havok (Famitsu Article)

inefficient · Jan 26, 2007

It is interesting that the VMX128 has a dot product instruction.

I wonder what the fastest way to do a dot product on an SPU would be. The straightforward way would be 5 instructions by my estimation. But there must be a faster way.

mul
rotate quad by 8 bytes (2 words)
add
rotate quad by 4 bytes (1 word)
add
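
The five-step sequence above can be sketched in portable C. This is only an emulation under stated assumptions, not SPU code: `vec4` and the `vec4_*` helpers are hypothetical stand-ins for a 128-bit register and its instructions, and the rotates are taken to be by two and then one 32-bit word (8 and 4 bytes), which is what a register of four floats needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit register holding four floats. */
typedef struct { float f[4]; } vec4;

/* Element-wise multiply (the "mul" step). */
static vec4 vec4_mul(vec4 a, vec4 b) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* Element-wise add (the "add" steps). */
static vec4 vec4_add(vec4 a, vec4 b) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i) r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* Rotate the four lanes left by `words` 32-bit positions. */
static vec4 vec4_rotate_words(vec4 a, size_t words) {
    assert(words < 4);
    vec4 r;
    for (size_t i = 0; i < 4; ++i) r.f[i] = a.f[(i + words) % 4];
    return r;
}

/* mul, rotate by 2 words, add, rotate by 1 word, add.
   Every lane of the result holds the full four-term sum. */
static vec4 vec4_dot_horizontal(vec4 a, vec4 b) {
    vec4 p = vec4_mul(a, b);                        /* (p0, p1, p2, p3)   */
    vec4 s = vec4_add(p, vec4_rotate_words(p, 2));  /* (p0+p2, p1+p3, ...) */
    return vec4_add(s, vec4_rotate_words(s, 1));    /* all lanes: p0+p1+p2+p3 */
}
```

For 3-component vectors the fourth lane of each input would have to be zero so that it does not contribute to the sum.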

Panajev2001a · Jan 26, 2007

inefficient said:
It is interesting that the VMX128 has a dot product instruction.

I wonder what the fastest way to do a dot product on an SPU would be. The straightforward way would be 5 instructions by my estimation. But there must be a faster way.

mul
rotate quad by 8 bytes (2 words)
add
rotate quad by 4 bytes (1 word)
add

R1 = x1x1x1x1 * x2x3x4x5
R2 = y1y1y1y1 * y2y3y4y5
R4 = z1z1z1z1 * z2z3z4z5
R3 = R2 + R1
R6 = x6x6x6x6 * x7x8x9x10
R5 = R3 + R4
R1 = y6y6y6y6 * y7y8y9y10

etc... etc...

This is surely not the fastest SoA dot product you can imagine, although the stalls would shrink as the number of vectors to be dotted grows. The obvious "splatting" I did in that code does take some cycles, but it can be hidden or avoided: we could hide its cost behind the vector loads, we might not need it at all if we are not dotting a single vector with multiple other ones, or we might have, as on PlayStation 2's VUs, a broadcast operator that lets us multiply or add a single field from another vector. Either way, I suspect the main benefit of SoA form can be seen right away: we are processing 4 dot products in parallel.
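
As a rough illustration of the SoA scheme, here is a scalar C sketch that dots one vector against a batch of vectors stored as separate x, y and z arrays. The names `vec3_soa` and `dot_one_against_many` are hypothetical, and each 4-iteration inner loop merely stands in for one 4-wide SIMD instruction; it is not tuned code for any particular CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical SoA batch: one array per component. */
typedef struct {
    const float *x;
    const float *y;
    const float *z;
    size_t count;   /* number of vectors, assumed to be a multiple of 4 */
} vec3_soa;

/* Dot (ax, ay, az) with every vector in the batch, four at a time.
   `out` must have room for batch.count floats. */
static void dot_one_against_many(float ax, float ay, float az,
                                 vec3_soa batch, float *out) {
    assert(batch.count % 4 == 0);
    for (size_t base = 0; base < batch.count; base += 4) {
        float r1[4], r2[4], r4[4];
        /* Multiplying by the scalar plays the role of the splatted register. */
        for (size_t lane = 0; lane < 4; ++lane)
            r1[lane] = ax * batch.x[base + lane];   /* R1 = x splat * x's */
        for (size_t lane = 0; lane < 4; ++lane)
            r2[lane] = ay * batch.y[base + lane];   /* R2 = y splat * y's */
        for (size_t lane = 0; lane < 4; ++lane)
            r4[lane] = az * batch.z[base + lane];   /* R4 = z splat * z's */
        /* R3 = R2 + R1, then R5 = R3 + R4: four finished dot products. */
        for (size_t lane = 0; lane < 4; ++lane)
            out[base + lane] = (r2[lane] + r1[lane]) + r4[lane];
    }
}
```

No rotate or horizontal add appears anywhere: each lane only ever combines values that belong to the same vector.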

inefficient · Jan 26, 2007

Panajev2001a said:
...

Interesting... when the verts are arranged like that (SoA), not having the HW ability to do a sum across a vector doesn't look too bad at all.

I suppose the SoA form will also be used in some Xenon apps, since it's more cache friendly. If so, the usefulness of the dot product instruction is going to be limited.

Fafalada · Jan 26, 2007

Panajev2001a said:
the stalls would reduce

Having data to process in loops is kind of a prerequisite if you are talking about optimization at the level of cycle counting.
Xenon's DP latency is 14 cycles; it's not going to be terribly fast if you don't have lots of DPs or something else to schedule around it to fill the latency gaps.
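
To put a rough number on the latency point, here is a toy cycle model in C. Only the 14-cycle latency comes from the post above; the assumption that independent operations can issue one per cycle into a fully pipelined unit is mine, and the function names are hypothetical.

```c
#include <assert.h>

/* n operations where each one waits for the previous result:
   nothing overlaps, so every operation pays the full latency. */
static unsigned long cycles_dependent(unsigned long n, unsigned long latency) {
    assert(latency > 0);
    return n * latency;
}

/* n independent operations issued back to back, one per cycle
   (assumed fully pipelined unit): the first result arrives after
   `latency` cycles, then one more result arrives each cycle. */
static unsigned long cycles_pipelined(unsigned long n, unsigned long latency) {
    assert(latency > 0);
    return n == 0 ? 0 : latency + (n - 1);
}
```

Under that model a single dot product costs 14 cycles either way, while 100 independent ones cost 113 cycles in total instead of 1400, which is why having a loop full of work to schedule around the latency matters so much.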

Panajev2001a · Jan 26, 2007

inefficient said:
Interesting... when the verts are arranged like that (SoA), not having the HW ability to do a sum across a vector doesn't look too bad at all.

I suppose the SoA form will also be used in some Xenon apps, since it's more cache friendly. If so, the usefulness of the dot product instruction is going to be limited.

The way we are taught in school, AoS form is more intuitive: being able to do horizontal operations across fields makes it easier for us to think about 3D math and vectors.

SoA might be a bit counter-intuitive at first, and it requires you to reorganize your data from beginning to end to get the most benefit out of it (although you could also spend some cycles getting data into and out of SoA form before and after some critical math loops).
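
Getting data into and out of SoA form around a critical loop could look something like the following C sketch. The `vec3` type and both function names are hypothetical, and the copies are plain scalar loops; a real implementation would more likely shuffle whole registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical AoS element: one struct per vector. */
typedef struct { float x, y, z; } vec3;

/* Scatter an array of structs into three component arrays (AoS -> SoA). */
static void aos_to_soa(const vec3 *in, size_t count,
                       float *xs, float *ys, float *zs) {
    assert(in && xs && ys && zs);
    for (size_t i = 0; i < count; ++i) {
        xs[i] = in[i].x;
        ys[i] = in[i].y;
        zs[i] = in[i].z;
    }
}

/* Gather three component arrays back into structs (SoA -> AoS). */
static void soa_to_aos(const float *xs, const float *ys, const float *zs,
                       size_t count, vec3 *out) {
    assert(xs && ys && zs && out);
    for (size_t i = 0; i < count; ++i) {
        out[i].x = xs[i];
        out[i].y = ys[i];
        out[i].z = zs[i];
    }
}
```

These two passes are the "wasted" cycles: they only pay off if the loop in between does enough work on the SoA data.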

Still, when it is the driver and the hardware that do a good chunk of the work for you, SoA form is not so bad... see G80's "scalar" processors.

Panajev2001a · Jan 26, 2007

Fafalada said:
Having data to process in loops is kind of a prerequisite if you are talking about optimization at the level of cycle counting.

True, but I like to cover myself in advance before being hit with the clue-bat(tm) by one of the resident professional developers.

AbbA · Jan 26, 2007

Is "3 PC cores" a veiled term for the X360 CPU?

StefanS · Jan 26, 2007

AbbA said:
Is "3 PC cores" a veiled term for the X360 CPU?

Read the thread ;-) It's supposed to be a quad-core with 1 core idle for other tasks.

liolio · Jan 26, 2007

AbbA said:
Is "3 PC cores" a veiled term for the X360 CPU?

Edit: [strike]Why are you so happy to dismiss the Xenon perf...[/strike] I'm under the impression I misunderstood your post.

Anyway, it doesn't make sense: the PPU thread is 33 ms, and one Xenon core with two threads running and a better AltiVec unit wouldn't be three times faster...

Thanks, Fafalada, for your response.

DieH@rd · Jan 26, 2007

hupfinsgack said:
Picture6:
This is a ragdoll demo called "Beating and Playing" {Comment: I don't know what the real title might be}. The blowing-off (blasting-off) of 200 soldiers’ bodies is simulated by physics operations. In the end up to 800 hundred bodies will be possible

This Heavenly Sword demo was shown at last year's GDC...

-tkf- · Sep 17, 2007

Old?

Intel buys Havok:

http://www.gamasutra.com/php-bin/news_index.php?story=15511

StefanS · Sep 17, 2007

-tkf- said:
Old?

Intel buys Havok:

http://www.gamasutra.com/php-bin/news_index.php?story=15511

There are already several threads about it in the other sections of the B3D forums. Please use those instead.

Gunhead · Sep 20, 2007

There are already several threads about it on other sites too. Please use those instead.

Seriously now, this thread (the Famitsu article) had the interesting bit about Nvidia's cooperation with Havok. I wonder if the new situation means or leads to any increasing coziness between Nvidia and Intel (what with their mutual chipset and IGP war on the one hand, and DAAMIT on the other). Maybe it's just coincidence and has no consequences.

Okay, lockdown! ;-)
