PS3 and Havok (Famitsu Article)

"Havok 4.5 achieves a speed-up of as much as 2x with our optimized next-gen algorithms, and as much as 5-10x speed-up on PS3 thanks to efficient SPU implementation."

All good points Shifty, and it could be implying:

2-3x speed up over our existing multi core engine and 5-10x using the SPUs compared to using the one PPU core only?
 
So Havok 4.5 does improve performance on Xbox 360 as well? Can't wait for the physics extravaganza in Halo 3 then. ;) But I also can't wait to see in action all that extra power of Cell I've been reading about for several years. I'm not being sarcastic, I really want to be amazed by Heavenly Sword or Mercenaries 2. :smile:
 
All good points Shifty, and it could be implying:

2-3x speed up over our existing multi core engine and 5-10x using the SPUs compared to using the one PPU core only?

"Havok architecture now scales strongly across all SPUs and runs between 5 and 10 times faster than Havok 4.0 for a typical game scene on the PS3."

The truth is probably somewhere in between. Can't wait for GDC 2007. They are supposed to have a PS3/Cell track, aren't they?
 
bbot said:
at how much utilization?
Given that they stated timings for the simulation benchmark, utilization in a realtime application is simply (time of benchmark)/(frametime), where frametime is roughly 17ms for a 60fps app.

Whether that PR can be applied literally to the said benchmark is a different question though, for that I guess we'll have to wait until GDC.
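
As a rough sketch of that arithmetic (Python purely for illustration; the function names and the 3.4ms timing are made up for the example and are not figures from the Havok benchmark):

```python
def frame_time_ms(fps):
    """Frame budget in milliseconds for a given target frame rate."""
    return 1000.0 / fps


def utilization(benchmark_ms, fps=60.0):
    """Fraction of one frame that the benchmark's work would occupy."""
    return benchmark_ms / frame_time_ms(fps)


# Hypothetical timing, NOT a number taken from the Havok benchmark.
example_ms = 3.4
print(f"{utilization(example_ms):.1%} of a 60fps frame")
```

Keep in mind the result only describes the cores the benchmark actually ran on, not the whole chip.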
 
2-3x speed up over our existing multi core engine and 5-10x using the SPUs compared to using the one PPU core only?
I don't feel comfortable with that idea. A 2x speedup is substantial, and worth crowing about. If it was anything over a moderate percentage increase (15-20% say) I'd have thought they'd have at least made sure to say that Havok 4.5 is faster on all platforms than Havok 4.0, which they haven't said. The only reference to speed-ups is on Cell. For other platforms, we just have the term 'optimized,' and we don't know how different those optimizations are from Havok 4.0's. v4.0 could already have been good on multicore vector units. For XB360, perhaps some manual cache locking could be added to improve data-flow? I'm having trouble envisaging where XB360 or other architectures could get a large improvement in performance from, unless, as I said, Havok 4.0 had some major inefficiencies.
 
The only possibility of further perf increase for Havok on Xbox 360 is scaling it out onto Xenos, but I doubt its use in an actual game at the cost of graphical appeal.
 
edit - although as I've had pointed out to me, they could have used a quad-core and just run the code on 3 out of the 4 cores..which makes sense.

Which makes sense from a game development standpoint: If you have 4 CPU cores you don't want all being absorbed by your physics engine. You still need to do AI, sound, renderer, and so forth.

Which begs the question: What is the utilization like on those 2 PPE threads? The PPE, compared to SPEs, is a scarce asset which (at this point in time) is pretty important for a number of types of code. It would be interesting to know what the numbers were like with 1 PPE thread and 4 SPEs, although with the re-engineering of the product this is probably moot now.

The only possibility of further perf increase for Havok on Xbox 360 is scaling it out onto Xenos, but I doubt its use in an actual game at the cost of graphical appeal.

You act as if Xenon coding is completely trivial with no room to grow.

Besides VMX libraries, getting a better handle on threading in general as well as the unified cache and further tuning to the code to avoid cache misses and deal with the in order nature of the CPUs are obvious areas where developers will be seeking performance wins all generation.
 
Joshua Luna said:
Which begs the question: What is the utilization like on those 2 PPE threads?
I doubt anything in that benchmark idles for any significant time - they clearly hand picked examples to show great multicore scaling (on both platforms) / max utilization/min execution time.

At any rate, whatever the simulation was - the fast case would have a maximum of ~20% utilization of the USED cores at 60fps, so it's not like you're taking over the majority of the CPU or anything.
 
Can't wait for GDC 2007. They are supposed to have a PS3/Cell track, aren't they?

Havok themselves don't have a PS3/Cell presentation, but Evolution Studios will present on physics in Motorstorm.

There are actually a few quite interesting-looking presentations w.r.t. PS3 (including, in particular, one from Naughty Dog on RSX, with the promise of material on RSX and Cell/RSX that's never publicly been presented before..but I digress).

edit - oh, and good point Faf. It would have been nice to see how the software scales with increasing numbers of objects also.
 
Am I wrong? No engine yet lets the SPUs handle their own work alone; the PPU acts like the chief, keeping things synchronised and running some part of the main game code, even some part of the AI code (HS). If the two PPU hardware threads are used to keep everything under control while running a Havok demo, is there no room left for the game code?
 
Am I wrong? No engine yet lets the SPUs handle their own work alone

Well, we don't know that in the general case..

the PPU acts like the chief, keeping things synchronised and running some part of the main game code, even some part of the AI code (HS). If the two PPU hardware threads are used to keep everything under control while running a Havok demo, is there no room left for the game code?

In this specific case, with the specific workload in the Havok demo, as Faf points out it would only take a quarter of the frame time in a 60fps game. If you wanted to have 2 PPU and 4 SPU threads just working on physics for that kind of period, it would leave a good majority of the rest of your time for other things. If you wanted another PPU thread running concurrently with that computation, you could do so, but the physics would take a little longer to finish.
 
The only possibility of further perf increase for Havok on Xbox 360 is scaling it out onto Xenos.
There's quite a bit of jiggery-pokery that could be used on XB360. Firstly are all the 128 registers being used on the VMX, or does v4.0 use only 32 registers for compatibility? Are they managing the cache with prefetching to keep everything busy? Is there the possibility of allowing MEMEXPORT to supply collision data or somesuch from the rendering process, saving on processing time?

There's probably some speed-ups possible. The only reason to think they're not major in this release, in my mind, is because nothing has been said of them. Some options also might be impossible, such as using MEMEXPORT which requires a non-tiling graphics engine, right?
 
There's quite a bit of jiggery-pokery that could be used on XB360. Firstly are all the 128 registers being used on the VMX, or does v4.0 use only 32 registers for compatibility?
You mean the compatibility with the Mac G5 alpha kit? I guess they tried it 1 year ago already and it was ready in 4.0.
Are they managing the cache with prefetching to keep everything busy? Is there the possibility of allowing MEMEXPORT to supply collision data or somesuch from the rendering process, saving on processing time?
Are they still to be exploited in a significant way after 4.5? I think we can safely extrapolate future physics performance increases for Xbox 360 from the increase in 4.0 -> 4.5, if someone gives us a sample or two. I learned in this forum that gameplay physics is all about referencing pointers, which favors shared memory; without that fundamental challenge it wouldn't be much different from the optimization already tried for general game code on Xbox 360. I doubt Havok hasn't tried these by now for its customers, given that the console has been out longer than the PS3.

BTW, apparently Havok FX only puts effect physics on the GPU, and it's still the CPU that does things like collision detection and ragdoll.
 
You act as if Xenon coding is completely trivial with no room to grow.

Besides VMX libraries, getting a better handle on threading in general as well as the unified cache and further tuning to the code to avoid cache misses and deal with the in order nature of the CPUs are obvious areas where developers will be seeking performance wins all generation.

Marked those issues where the CELL PPU is exactly the same (except for fewer registers, and more shared cache per thread).

Would be interesting to see how a single (or two) SPEs perform in Havok 4.5 compared to one (or all) of the 360 cores - VMX doesn't seem to add much to that performance, if you look at those screenshots (and compare with the "PC" CPU), at least on PS3, and I doubt it will be any different on 360.
 
There's quite a bit of jiggery-pokery that could be used on XB360. Firstly are all the 128 registers being used on the VMX, or does v4.0 use only 32 registers for compatibility?

The compiler will happily use all the 128 registers, and I doubt significant portions of Havok are written in assembly.

Some options also might be impossible, such as using MEMEXPORT which requires a non-tiling graphics engine, right?

MEMEXPORT is perfectly compatible with tiling, just not with AUTOMATIC tiling, and that was several releases of the SDK ago; now even that limitation might have been removed.
 
The compiler will happily use all the 128 registers, and I doubt significant portions of Havok are written in assembly.
Perhaps if they were, they'd get some good performance increases... Which is what I'm getting at here. I don't think there's much scope for large speed improvements, but unlike One I don't think that there's no chance of any speed improvements. I doubt Havok is the perfect implementation of speedy physics on XB360 in the second version of the library ever for that machine.
 
Thanks Titanio (edit: and One), I had misunderstood..
It's clearer now ;)

I want to ask something else.
MS states that VMX128 has some sort of instructions to accelerate graphics and physics work.
We know about the dot product instruction and the ability to change data from AoS to SoA or the other way.
What about the other instructions?
 
liolio said:
MS states that VMX128 has some sort of instructions to accelerate graphics and physics work. We know about the dot product instruction and the ability to change data from AoS to SoA or the other way.
There are no instructions to "change data" from AoS to SoA - and the only commercial CPU that can actually do that (and do it well) is used in a portable.

VMX128 adds dotproduct, and data quantization/packing instructions (the latter has ironically been present in just about every consumer space SIMD implementation to date, except for VMX).
It also removes a fair chunk of integer arithmetic support, so for those keeping notes, it's not backwards compatible with regular VMX.

Lossy data packing is great for graphics processing but will be largely useless in physics. Dotproduct is always nice, but IMO 128 registers will generally overshadow everything else when it comes to performance improvements.
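
For anyone unsure what AoS and SoA mean here, a small sketch in plain Python (no SIMD involved; the function names are my own and this only illustrates the two data layouts, not how VMX128 or any compiler actually handles them). In AoS each vector keeps its x, y, z, w together; in SoA all the x values sit in one array, all the y values in another, and so on, so one pass can produce several dot products at once.

```python
def aos_to_soa(vectors):
    """Transpose a non-empty list of (x, y, z, w) tuples into four component lists."""
    xs, ys, zs, ws = (list(component) for component in zip(*vectors))
    return xs, ys, zs, ws


def dot4_aos(a, b):
    """Dot product of two 4-component vectors stored AoS-style."""
    return sum(p * q for p, q in zip(a, b))


def dot4_soa(a_soa, b_soa):
    """One dot product per index ("lane") from SoA-style component lists."""
    ax, ay, az, aw = a_soa
    bx, by, bz, bw = b_soa
    return [
        ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i] + aw[i] * bw[i]
        for i in range(len(ax))
    ]


vecs_a = [(1, 2, 3, 4), (5, 6, 7, 8)]
vecs_b = [(1, 0, 0, 0), (1, 1, 1, 1)]

# Same results either way; only the memory layout differs.
per_vector = [dot4_aos(a, b) for a, b in zip(vecs_a, vecs_b)]
per_lane = dot4_soa(aos_to_soa(vecs_a), aos_to_soa(vecs_b))
print(per_vector, per_lane)
```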
 