Most games are 80% general purpose code and 20% FP...

seismologist said:
Gubbi said:
FLOPS are virtually free on both CELL and XeCPU. To the point where performance is going to be determined by other factors, like control logic complexity, memory latency etc.

It'll be interesting to see which is better: the traditional load/store architecture, caches and all, from M$, or the radically different Sony system that [b]has[/b] to be programmed with coarser memory transactions.

Cheers
Gubbi

Is this really true? There must be a point where computation becomes a bottleneck. Sure, for modern-day games where 80% of the interactions are scripted (i.e. general purpose code).
But how about when you're running a real-time physics simulation of, say, a plane crashing through a building? Things might start to get bogged down a bit.

Why not try it out yourself and see how it would go? I believe Ageia has some physics demonstration programs that will run on any PC, load one up, find the demo with the many towers each made up of thousands of blocks, and knock them over and see how it performs. Well, that wouldn't really tell you anything other than how well current PCs can handle it. (I'd say multithreaded games are definitely needed for physics involving thousands of objects though, based on the performance I got.)
 
DiGuru said:
Think about this: there are very many successful stream architectures in use (most at the API level) at the moment, while symmetric multiple thread architectures still suffer from a lot of serious problems.

Other than obvious stream-oriented problems (embarrassingly parallel ones), what are stream architectures successful at?

And what are the serious problems of SMP/MT architectures?

Cheers
Gubbi
 
Fox5 said:
Why not try it out yourself and see how it would go? I believe Ageia has some physics demonstration programs that will run on any PC, load one up, find the demo with the many towers each made up of thousands of blocks, and knock them over and see how it performs. Well, that wouldn't really tell you anything other than how well current PCs can handle it. (I'd say multithreaded games are definitely needed for physics involving thousands of objects though, based on the performance I got.)

Yeah, that's the one I had in mind. And also, based on the real-time physics demos Sony was showing at E3, I hope this is the direction games will be heading.

I guess ray tracing is another potential app and I think it's being discussed in another thread.
 
Gubbi said:
Other than obvious stream-oriented problems (embarrassingly parallel ones), what are stream architectures successful at?

And what are the serious problems of SMP/MT architectures?

Cheers
Gubbi

C'mon, Gubbi, we've discussed this multiple times already.

Change it around: tell me what you would rather see, and why. Although we've done that a few times already as well.
 
archie - what kind of speed-ups do you see? (I am currently implementing a min-max cost layout to minimize disk IO for a completely non-game-related application, and it's honestly a PITA to produce such a layout...)

Anyways, for large tree structures, I was thinking that a min-max cost layout with a small page size (say 1 KB) could make depth-first tree traversal on an SPE quite feasible - many small external memory accesses would be replaced with a small number of page-sized DMA transfers. If you modify the depth-first descent to operate on a page-sized sub-tree and queue any page-external children that must be traversed, you could even prefetch the first such child and hopefully hide DMA latency much of the time...
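Something like this is what I have in mind, as a minimal sketch only: the 1 KB page layout, the Node fields and the helper names are all hypothetical, and the only real CELL SDK calls are mfc_get(), mfc_write_tag_mask() and mfc_read_tag_status_all() from spu_mfcio.h.

[code]
#include <spu_mfcio.h>   /* CELL SDK MFC intrinsics (mfc_get etc.) */
#include <stdint.h>

enum { PAGE_SIZE = 1024, MAX_PENDING = 64 };

/* Hypothetical node layout: a child either lives inside the same page
   (stored as a node index) or in another page (stored as an effective
   address in main memory). */
struct Node {
    float    key[4];
    int16_t  child[2];      /* >= 0: node index within this page, -1: external */
    uint64_t external[2];   /* effective addresses of external child pages      */
};

static char     page[2][PAGE_SIZE] __attribute__((aligned(128)));  /* double buffer */
static uint64_t pending[MAX_PENDING];   /* page-external children queued for later  */
static int      npending = 0;

static void fetch_page(int buf, uint64_t ea, unsigned tag)
{
    mfc_get(page[buf], ea, PAGE_SIZE, tag, 0, 0);  /* async DMA into local store */
}

static void wait_tag(unsigned tag)
{
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                     /* block until that DMA is done */
}

static void do_work(const Node* n) { (void)n; /* application-specific visit */ }

/* Depth-first walk of the sub-tree resident in one page; external children
   are queued instead of being followed immediately. */
static void visit_subtree(const char* p)
{
    int16_t stack[PAGE_SIZE / sizeof(Node)];
    int sp = 0;
    stack[sp++] = 0;                                /* root node of this page */
    while (sp > 0) {
        const Node* n = (const Node*)(p + stack[--sp] * sizeof(Node));
        do_work(n);
        for (int i = 0; i < 2; ++i) {
            if (n->child[i] >= 0)
                stack[sp++] = n->child[i];              /* stays inside this page */
            else if (n->external[i] && npending < MAX_PENDING)
                pending[npending++] = n->external[i];   /* fetch via DMA later    */
        }
    }
}

void traverse(uint64_t root_ea)
{
    int cur = 0;
    fetch_page(cur, root_ea, 0);
    wait_tag(0);

    for (;;) {
        /* Kick off the DMA for the next queued page before walking the
           current one, so the transfer overlaps the traversal work. */
        int have_next = (npending > 0);
        if (have_next)
            fetch_page(cur ^ 1, pending[--npending], 1);

        visit_subtree(page[cur]);

        if (have_next) {
            wait_tag(1);
            cur ^= 1;
        } else if (npending > 0) {          /* queued during the walk above */
            fetch_page(cur, pending[--npending], 0);
            wait_tag(0);
        } else {
            break;
        }
    }
}
[/code]

The double buffer is what buys the latency hiding: the DMA for the next queued page runs while the current page's sub-tree is being walked.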
 
Lessard said:
Johnny_Physics said:
So the document really says that the VU1 is used at 56% for T&L (almost) exclusively?
No, the document doesn't say that explicitly, but the T&L is calculated by the VU1 (remember the GPU of the PS2 has no T&L unit).
The other tasks are done by the VU0 (physics ...).
We can conclude the VU1 is used exclusively for T&L, as the VU0 is not used at all.
VU0 is mostly used in macromode, in which it is not wasted: it simply acts as a powerful vector FPU for the MIPS core. But AFAIK a lot of the VU0's special abilities are wasted in macromode, and so is its 8 KB of memory.
The challenge is to split processing time between some macromode work and some MIPS-and-VU0 collaboration as free "individuals". It's such a complicated concert that most devs just default to macromode all the way.
 
The main difference between random memory access and loading blocks through DMA is that you know exactly what the expected access time is for the latter, while it can vary a whole lot for the former, depending on circumstances.
 
DiGuru said:
The main difference between random memory access and loading blocks through DMA is that you know exactly what the expected access time is for the latter, while it can vary a whole lot for the former, depending on circumstances.

DMAs have to deal with bus contention too -- they're not totally predictable either. If you queue a DMA, you don't really know how long it's going to take; it depends on the state of all the other users of the bus and how busy the DMA controller is.

The advantage of DMA is that it's asynchronous and it batches things together. If you can find something else to do while the DMA completes, then you can go do it.

On a cache architecture, you can kind of emulate this by putting prefetches in and locking down cache lines as necessary. Also SMT can help -- when the CPU blocks on one thread, it can try to run things on the second thread.

In both architectures you need to rearrange your data structures for optimal performance. On a cache, try to pack related struct fields on the same cache line for example.
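For example, a rough sketch of both ideas in C++ (the Particle layout, the prefetch distance of 8 and the 64-byte line size are assumptions; __builtin_prefetch is the GCC hint, other compilers have equivalents):

[code]
#include <cstddef>

// Hot fields touched every frame packed together at the front, so they land
// on the same cache line; rarely touched data pushed behind them.
struct alignas(64) Particle {
    float pos[3];
    float vel[3];
    float inv_mass;     // 28 "hot" bytes, well within one 64-byte line
    char  name[32];     // cold, editor/debug only
};

void integrate(Particle* p, std::size_t count, float dt)
{
    for (std::size_t i = 0; i < count; ++i) {
        // Hint the hardware to start pulling in a particle a few iterations
        // ahead: the software analogue of kicking off a DMA early.
        if (i + 8 < count)
            __builtin_prefetch(&p[i + 8]);
        for (int k = 0; k < 3; ++k)
            p[i].pos[k] += p[i].vel[k] * dt;
    }
}
[/code]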
 
Fox5 said:
Why not try it out yourself and see how it would go? I believe Ageia has some physics demonstration programs that will run on any PC, load one up, find the demo with the many towers each made up of thousands of blocks, and knock them over and see how it performs.

Excellent idea. I loaded up the Novodex Rocket Demo in Codeanalyst, started the "building explode" demo and got the following stats.

0xc1 events are retired micro-ops, 0xcb are retired FP ops

So roughly 40% are FP ops on a ~10 GFLOPS A64 3500+. And mind you, 3DNow instructions, a four-wide single-cycle-issue SIMD implementation, should give a lower count of retired FP ops than the equivalent scalar code would.

EDIT: x87->3dnow

Cheers
Gubbi
 
Gubbi said:
Fox5 said:
Why not try it out yourself and see how it would go? I believe Ageia has some physics demonstration programs that will run on any PC, load one up, find the demo with the many towers each made up of thousands of blocks, and knock them over and see how it performs.

Excellent idea. I loaded up the Novodex Rocket Demo in Codeanalyst, started the "building explode" demo and got the following stats.

It's nice to see some hard data. :)
 
jboldiga said:
Apple was pissed with IBM because they wanted faster clock speeds and IBM said they couldn't... then turned around and released a 3.2 GHz PPC to Sony and M$. IBM simply didn't make enough money from Apple to be concerned with them. It is also true that IBM could not fab enough.

Um, I think it was less about clock speed and more about performance. Realistically, a 2.5 GHz G5 should run rings around the cores in both CELL and the Xbox 360 CPU for the majority of code, except for linear algebra.

Aaron Spink
speaking for myself inc.
 
aaaaa00 said:
It's nice to see some hard data. :)

:)

See the edit. It was 3DNow, not x87, and the ratio is 40%, so there's a lot more FP work being done than I first stated. Still, both XeCPU and CELL, with their much higher FP performance, have "free" FP ops, IMO.

cheers
Gubbi
 
Gubbi said:
aaaaa00 said:
It's nice to see some hard data. :)

:)

See the edit. It was 3DNow, not x87, and the ratio is 40%, so there's a lot more FP work being done than I first stated. Still, both XeCPU and CELL, with their much higher FP performance, have "free" FP ops, IMO.

But also this is just in the middle of a hardcore physics engine.

Average it out with the rest of the code in a real game, and I'm pretty sure it drops some more.
 
Gubbi, how about describing algorithms in games *on which they are computationally bound* which are not parallelizable and streamable.

Tree search, physics integration, collision detection, Dijkstra or minimax pathfinding, and per-actor state machines are all parallelizable (they have known, well-performing parallel counterparts). Any algorithm that is tree-based is implicitly streamable/offline as well, which is why large databases are even possible in the first place (e.g. B-trees are amazingly effective at minimizing I/O traffic).

Yes, there are problems for which no good general-purpose parallel versions exist (i.e. ones giving near-linear speedup), but many of the most common divide-and-conquer algorithms do parallelize well, and game programming is nothing but the art of hacking sub-optimal solutions and fooling you into thinking they are optimal.

In short, I think a lot of this complaining is a strawman.
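To make the "embarrassingly parallel per-actor work" point concrete, here's a minimal sketch; the Actor type and step() are placeholders, std::thread is used purely for brevity, and it assumes workers >= 1. Each worker owns a disjoint slice of the actors, so the update itself needs no locking; interactions and constraints get resolved after the join.

[code]
#include <algorithm>
#include <thread>
#include <vector>

struct Actor {
    float pos[3], vel[3];
    void step(float dt) { for (int k = 0; k < 3; ++k) pos[k] += vel[k] * dt; }
};

void update_actors(std::vector<Actor>& actors, float dt, unsigned workers)
{
    std::vector<std::thread> pool;
    const std::size_t chunk = (actors.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t lo = w * chunk;
        const std::size_t hi = std::min(actors.size(), lo + chunk);
        pool.emplace_back([&actors, lo, hi, dt] {
            for (std::size_t i = lo; i < hi; ++i)
                actors[i].step(dt);   // independent work, no shared writes
        });
    }
    for (auto& t : pool)
        t.join();                     // constraints/collisions resolved after this barrier
}
[/code]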
 
Gubbi said:
Fox5 said:
Why not try it out yourself and see how it would go? I believe Ageia has some physics demonstration programs that will run on any PC, load one up, find the demo with the many towers each made up of thousands of blocks, and knock them over and see how it performs.

Excellent idea. I loaded up the Novodex Rocket Demo in Codeanalyst, started the "building explode" demo and got the following stats.

0xc1 events are retired micro-ops, 0xcb are retired FP ops

So roughly 40% are FP ops on a ~10 GFLOPS A64 3500+. And mind you, 3DNow instructions, a four-wide single-cycle-issue SIMD implementation, should give a lower count of retired FP ops than the equivalent scalar code would.

EDIT: x87->3dnow

Cheers
Gubbi

Wow, where'd you get that program?
Hmm, interesting that a demo that exists purely as a physics demonstration is still only 40% FP ops, so that 20% FP / 80% general-purpose number is probably about right then.
BTW, how do you know all the FP instructions were 3DNow and not SSE or generic x87?

Realistically, a 2.5 Ghz G5 should run rings around the core in both cell and xbox for the majority of code except for linear algebra.

I believe Anandtech had a G5 versus Athlon 64 versus P4 comparison, and the G5 was extremely weak in just about everything except SIMD performance, where it was extremely strong. I find that strange; is AltiVec even part of IBM's higher-end POWER processors?

Gubbi, how about describing algorithms in games *on which they are computationally bound* which are not parallelizable and streamable.
...

Good point, the X360 and Cell processors are probably good for more than just FLOPs calculations.
 
Fox5 said:
Wow, where'd you get that program?
here, and it's free.

It's AMD-only; look for VTune if you have an Intel processor.

Fox5 said:
BTW, how do you know all the FP instructions were 3dnow and not SSE or generic x87?

It's possible to mask retired FP instructions by type: x87, 3DNow+MMX, SSE/SSE2 scalar and SSE/SSE2 vector. I ran it four times, and it turns out that Novodex uses 3DNow (and not SSE/SSE2) on AMD processors.

Cheers
Gubbi
 
The Novodex Rocket demo is a port of the ODF Rocket Demo, and it layers Novodex on top of ODF code, so my hunch is that there is a huge overhead in micro-ops from dynamic/indirect dispatch and excessive wrapping/marshalling. I'd like to see a profile of which functions of the Nx physics DLL are chewing up most of the micro-ops.

I bet indirect calls and data marshalling are chewing up a significant fraction of those micro-ops, which won't be the case for statically linked platform versions that don't have to interface with open-source frameworks.

Since the physics PPU isn't likely to be a better general-purpose CPU than AMD's or Intel's, one has to ask: why does Novodex invest in making such a SIMD-heavy FLOPS chip if, theoretically, it won't help applications very much when only 20% of their load, if that, is physics-related SIMD streaming FLOPS? Do they know something you don't?
 
DemoCoder said:
The Novodex Rocket demo is a port of the ODF Rocket Demo, and it layers Novodex on top of ODF code, so my hunch is that there is a huge overhead in micro-ops from dynamic/indirect dispatch and excessive wrapping/marshalling. I'd like to see a profile of which functions of the Nx physics DLL are chewing up most of the micro-ops.

I bet indirect calls and data marshalling are chewing up a significant fraction of those micro-ops, which won't be the case for statically linked platform versions that don't have to interface with open-source frameworks.

Here's a breakdown of the internals of NxPhysics.dll. There's an overall lack of symbols, but the heavy lifting is done by NxBuildSmoothNormals(), which accounts for 60% of the time spent and has an FP ratio of 50%. A lot of the other functions have a significant amount of FP work, so it can't all be indirection and marshalling.

DemoCoder said:
Since the physics PPU isn't likely to be a better general-purpose CPU than AMD's or Intel's, one has to ask: why does Novodex invest in making such a SIMD-heavy FLOPS chip if, theoretically, it won't help applications very much when only 20% of their load, if that, is physics-related SIMD streaming FLOPS? Do they know something you don't?

Eh? Physics is probably fairly special-purpose. Stringing ops together to give free normalisation, swizzling etc., where appropriate, could lower the number of instructions issued. Also, only 3-way SIMD units are needed for physics, not the 2-way or 4-way found in Athlon/P4/XeCPU/CELL.
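For illustration, here's a rough sketch (SSE intrinsics, with the fourth lane assumed zero or ignored) of what just normalising one 3-vector costs on a generic 4-wide SIMD ISA; an ISA with a fused dot/normalise op could issue far fewer instructions for the same work.

[code]
#include <xmmintrin.h>   // SSE

// Normalize the x, y, z lanes of v; the w lane is assumed zero or ignored.
static inline __m128 normalize3(__m128 v)
{
    __m128 sq   = _mm_mul_ps(v, v);                                // x2 y2 z2 w2
    __m128 yzxw = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(3, 0, 2, 1)); // y2 z2 x2 w2
    __m128 zxyw = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(3, 1, 0, 2)); // z2 x2 y2 w2
    __m128 dot  = _mm_add_ps(_mm_add_ps(sq, yzxw), zxyw);          // x2+y2+z2 in lanes 0..2
    __m128 rlen = _mm_rsqrt_ps(dot);                               // approximate 1/sqrt
    return _mm_mul_ps(v, rlen);                                    // seven instructions per normalise
}
[/code]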

Cheers
Gubbi
 
DemoCoder said:
The Novodex Rocket demo is a port of the ODF Rocket Demo, and it layers Novodex on top of ODF code, so my hunch is that there is a huge overhead in micro-ops from dynamic/indirect dispatch and excessive wrapping/marshalling. I'd like to see a profile of which functions of the Nx physics DLL are chewing up most of the micro-ops.

I bet indirect calls and data marshalling are chewing up a significant fraction of those micro-ops, which won't be the case for statically linked platform versions that don't have to interface with open-source frameworks.

Since the physics PPU isn't likely to be a better general-purpose CPU than AMD's or Intel's, one has to ask: why does Novodex invest in making such a SIMD-heavy FLOPS chip if, theoretically, it won't help applications very much when only 20% of their load, if that, is physics-related SIMD streaming FLOPS? Do they know something you don't?

20% is still pretty significant; Creative still makes sound cards, and sound has a far lower impact on the CPU than physics.
There's also the option to do far better physics and interactivity than we have now. People raved about Half-Life 2, but most things in the game are still very static. Hopefully we'll get devs that actually implement better physics, though; systems have had the power to do better than they have for years now. Every once in a while you'll see a game that shines there, but most games have pretty much ignored any significant physics calculations.
 
What does SmoothNormals do? Any Novodex docs around? Sounds like some kind of interpolation or filtering. And why does ReleasePmap or whatever eat up so much? Sounds like inefficient resource/heap management to me.
 