Most games are 80% general purpose code and 20% FP...

The important point of Gubbi's statistic was that 60% of the time was spent running a single function.
 
Well, if Smooth Normals is really where it is eating most of the CPU, that means it is vertex bound, since it looks like the purpose of that function is to calculate vertex normals.

It seems between SmoothNormals and ReleasePMap, most of the time is spent interpolating normals and creating and releasing voxels.
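For reference, "building smooth normals" usually means accumulating each triangle's face normal into the vertices it touches and then normalizing. This is just the textbook algorithm, not NovodeX's actual implementation, but it shows why the function would be vertex bound:

```python
import math

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

def smooth_normals(vertices, triangles):
    """Per-vertex normals as the (area-weighted) average of incident face normals.
    A generic sketch of the usual technique, not NovodeX's code."""
    normals = [[0.0, 0.0, 0.0] for _ in vertices]
    for i0, i1, i2 in triangles:
        # Unnormalized face normal; its length weights the average by triangle area.
        n = cross(sub(vertices[i1], vertices[i0]),
                  sub(vertices[i2], vertices[i0]))
        for i in (i0, i1, i2):
            for k in range(3):
                normals[i][k] += n[k]
    out = []
    for n in normals:
        length = math.sqrt(n[0]**2 + n[1]**2 + n[2]**2) or 1.0
        out.append(tuple(c / length for c in n))
    return out
```

Every vertex and every triangle gets touched, so the cost scales directly with mesh size.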
 
Without real symbols, we can't really tell what's going on inside the DLL; we can only tell what's going on as a whole.

If the profiler is just using export symbols, the function breakdown is probably going to be pretty misleading.

I think the profiler will just assume everything between one exported entry point and the next belongs to one function. Who knows how many internal functions we're missing, and which may not even have been called from the entry point we think they were called from.

I think someone should try profiling a Doom 3 or HL2 timedemo or other modern game. Even without real symbols, getting some more hard data to look at is always a good thing. :)
 
After examining the DLL, it looks like the highest-addressed exported entry in the table is NxBuildSmoothNormals().

So if the profiler is doing as I'm expecting and just using the export symbols, all the code above NxBuildSmoothNormals() in the DLL is being assumed to be part of NxBuildSmoothNormals() -- basically half of the DLL.

So without real symbols, we cannot take the screenshot Gubbi posted as an accurate breakdown of what functions are being called; we can only look at the overall percentage of ops processed during the profile run as accurate.

Code:
C:\Program Files\NovodeX SDK 2.1.2\Bin\win32>link.exe /dump /exports NxPhysics.dll
Microsoft (R) COFF/PE Dumper Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file NxPhysics.dll

File Type: DLL

  Section contains the following exports for NxPhysics.dll

    00000000 characteristics
    41B483E7 time date stamp Mon Dec 06 08:08:07 2004
        0.00 version
           1 ordinal base
          41 number of functions
          41 number of names

    ordinal hint RVA      name

          1    0 0004C5F0 NxBoxBoxIntersect
          2    1 00090EE0 NxBuildSmoothNormals <------------ See the RVA
          3    2 0004A5C0 NxComputeBoxDensity
          4    3 0004A6F0 NxComputeBoxInertiaTensor
          5    4 0004A590 NxComputeBoxMass
          6    5 0004A6D0 NxComputeConeDensity
          7    6 0004A6B0 NxComputeConeMass
          8    7 0004A690 NxComputeCylinderDensity
          9    8 0004A670 NxComputeCylinderMass
         10    9 0004A630 NxComputeEllipsoidDensity
         11    A 0004A5F0 NxComputeEllipsoidMass
         12    B 0004A570 NxComputeSphereDensity
         13    C 0004A740 NxComputeSphereInertiaTensor
         14    D 0004A550 NxComputeSphereMass
         15    E 0007AF20 NxCreatePMap
         16    F 0007A660 NxCreatePhysicsSDK
         17   10 0000B640 NxFluidAssert
         18   11 00039630 NxFluidDebugAABB
         19   12 00039570 NxFluidDebugArrow
         20   13 00039260 NxFluidDebugLine
         21   14 00039470 NxFluidDebugPoint
         22   15 000392D0 NxFluidDebugSphere
         23   16 000391B0 NxFluidDebugTriangle
         24   17 00039610 NxFluidFree
         25   18 000395F0 NxFluidPAlloc
         26   19 000588F0 NxJointDesc_SetGlobalAnchor
         27   1A 00058B20 NxJointDesc_SetGlobalAxis
         28   1B 0004DBE0 NxRayAABBIntersect
         29   1C 0004DDD0 NxRayAABBIntersect2
         30   1D 0004E120 NxRayCapsuleIntersect
         31   1E 0004D780 NxRayOBBIntersect
         32   1F 0004CB10 NxRayPlaneIntersect
         33   20 0004CDE0 NxRaySphereIntersect
         34   21 0004CEB0 NxRayTriIntersect
         35   22 0007AFF0 NxReleasePMap
         36   23 0004DA10 NxSegmentAABBIntersect
         37   24 0004D1C0 NxSegmentBoxIntersect
         38   25 0004D4B0 NxSegmentOBBIntersect
         39   26 0004CBC0 NxSegmentPlaneIntersect
         40   27 0004C040 NxSeparatingAxis
         41   28 0004E7E0 NxSweptSpheresIntersect

  Summary

        8000 .data
       1F000 .rdata
        E000 .reloc
        1000 .rsrc
      102000 .text
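To illustrate the suspected failure mode: a sampling profiler that only has export symbols would presumably credit each sampled address to the nearest exported RVA at or below it. Here's a toy sketch using a few RVAs from the dump above; this mimics the behaviour I'm guessing at, not any specific profiler:

```python
import bisect

# A few (RVA, name) pairs taken from the export dump, sorted by address.
exports = sorted([
    (0x0004C5F0, "NxBoxBoxIntersect"),
    (0x0007AF20, "NxCreatePMap"),
    (0x0007AFF0, "NxReleasePMap"),
    (0x00090EE0, "NxBuildSmoothNormals"),
])
rvas = [rva for rva, _ in exports]

def attribute(sample_rva):
    """Credit a sample to the nearest exported symbol at or below its address."""
    i = bisect.bisect_right(rvas, sample_rva) - 1
    return exports[i][1] if i >= 0 else "<unknown>"
```

Since .text runs to roughly 0x102000 and NxBuildSmoothNormals sits at 0x90EE0, a sample taken anywhere in the top half of the code section, quite possibly inside some internal, non-exported function, gets credited to NxBuildSmoothNormals.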
 
Try running it on the ODE toolkit demos. It's open source, so there won't be any problems getting symbols. Of course, I wouldn't say ODE is the best or most optimized toolkit for testing this hypothesis.
 
The problem with this kind of analysis is, it doesn't necessarily reflect what we're going to see in next-gen games on different hardware.

I guess it's useful to get an idea of what current games might be like, but I wouldn't be comfortable drawing conclusions regarding chip design choices by MS or Sony based on that.

I guess to more specifically address one conclusion that has been raised here, while flops might be considered "free" running something like this on PS3 or X360, I don't think this represents the ultimate in physics ;) I'm sure there'll be things that people will try to do where flops aren't so free anymore.
 
Gubbi said:
DiGuru said:
Think about this: there are very many successful stream architectures in use (most at the API level) at the moment, while symmetric multiple thread architectures still suffer from a lot of serious problems.

Other than obvious stream-oriented problems (embarrassingly parallel), what are stream architectures successful at?

Cheers
Gubbi
Of course stream processors are only good at processing streams :D As I posted elsewhere regarding GAMMA research, I think the key to using stream processors is finding ways to present problems in a stream-friendly way. Computer training to date has been to reduce the number of operations from O(N) to O(log N) and the like, using conditions and branches. We'll probably see a change where these ideas are thrown out the window and the idea becomes how to get O(N) whacked through as quickly as possible.

It's too early to say what can and can't be done on stream processors in the long run. We need brainiacs to come up with new mathematical models to complement the stream architecture.
 
Shifty Geezer said:
Gubbi said:
DiGuru said:
Think about this: there are very many successful stream architectures in use (most at the API level) at the moment, while symmetric multiple thread architectures still suffer from a lot of serious problems.

Other than obvious stream-oriented problems (embarrassingly parallel), what are stream architectures successful at?

Cheers
Gubbi
Computer training to date has been to reduce the number of operations from O(N) to O(log N) and the like, using conditions and branches. We'll probably see a change where these ideas are thrown out the window and the idea becomes how to get O(N) whacked through as quickly as possible.

Well, if your algorithm has significantly higher time complexity it will, eventually, lose.

Converting sequential algorithms into parallel/streaming ones usually consists of converting constant penalties from one sort to another.

I.e., the GPU sort that was mentioned elsewhere uses bitonic sort. Take four different sort algorithms:

1.) heap, O(n log(n)), sequential
2.) merge, O(n log(n)), sequential/stream
3.) bitonic, O(n log²(n)), stream
4.) quick, best/average case O(n log(n)), worst case O(n^2), sequential

Similar time complexity for 1), 2) and 3), but differing constant penalties that vary with problem size because of memory hierarchies being what they are. In heap sort you have a latency penalty for dereferencing each node pointer; in merge sort you stream the data multiple times, busting cache. Bitonic sort requires more work per node. But for large problem sizes, the performance difference will be a constant factor.
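For the curious, here's a minimal bitonic sort (power-of-two input lengths only; a generic sketch, not the GPU implementation being discussed). The point is that every compare-exchange within one (k, j) pass is independent of the others, which is exactly what makes it stream/GPU friendly:

```python
def bitonic_sort(data):
    """In-place bitonic sort of a list whose length is a power of two.
    O(log^2 n) passes of O(n) compare-exchanges = O(n log^2 n) total work.
    All compare-exchanges within one (k, j) pass are independent, so a
    stream processor can execute a whole pass in parallel."""
    n = len(data)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:       # compare-exchange distance within this merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data
```

The inner `for` loop is the "stream": no element's decision depends on another element's result within the same pass.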

Shifty Geezer said:
It's too early to say what can and can't be done on stream processors in the long run. We need brainiacs to come up with new mathematical models to complement the stream architecture.

IMO, that's a risky business model: building something that requires a breakthrough in thinking (which may never materialise) to work. It's not quite as bad as that; as DC has been eager to point out, a whole bunch of problems can be converted to a streaming model. I'm just questioning the benefit of streaming architectures over multithreaded ones, in particular from an ease-of-programming point of view.

Cheers
Gubbi
 
I'm certainly no expert in the field, but where do we have existing models of stream processing versus multithreaded processing of the same applications for comparison? I get the impression that stream processing is an emergent field, though I'm sure there's masses of research history behind it. It'd be nice to see side-by-side comparisons, and I really don't think anyone can say yet which is better. Indeed, there probably isn't a 'better', only a 'different'. I know I found the logic of the finite maths language (SML, I think) very natural, whereas most in my class hated it and couldn't think differently from a C way of doing things. It might be that stream processing appeals to some and the conventional approach to others, with no one way better overall in programming terms. Though of course one may be better in hardware terms, as it requires simpler hardware.
 
jvd said:
No shader performance but stellar performance none the less

Sorry for being late to this discussion inside the discussion:


But with 8x multitexturing and EMBM, this Kyro+Elan system would more or less be as good as the Flipper, but with DOT3 on top.

Going with a 2x Kyro + 1x Elan system would have helped too.
 
mboeller said:
jvd said:
No shader performance but stellar performance none the less

Sorry for being late to this discussion inside the discussion:


But with 8x multitexturing and EMBM, this Kyro+Elan system would more or less be as good as the Flipper, but with DOT3 on top.

Going with a 2x Kyro + 1x Elan system would have helped too.

Hmm, I don't recall any bump-mapped Dreamcast games except for Tomb Raider, and even then the bump mapping wasn't that good. Shading wasn't very good either; I don't ever recall seeing shading beyond what was seen on the N64.
PowerVR did have some very impressive tech demos on their website though, but they also ran perfectly fine on my voodoo 3. In particular I'm thinking of the toon shading one.

DemoCoder said:
Well, if Smooth Normals is really where it is eating most of the CPU, that means it is vertex bound, since it looks like the purpose of that function is to calculate vertex normals.

It seems between SmoothNormals and ReleasePMap, most of the time is spent interpolating normals and creating and releasing voxels.

Also interesting to note is that the performance is directly related to the number of objects on screen and not the number of objects moving. I wonder why; even if it's not taking advantage of most of the graphics card's functions, it certainly doesn't have graphics that should slow down a modern CPU.
 
It looked to me like the performance changed with the number of objects moving. I think the number of objects that are onscreen is constant.

I'm not a software guy, but I'd guess that to program that simulation you'd need to calculate the force applied to each of the vertices.
Maybe that's what smooth normals is doing.
 
Or, as the bricks pile up, the number of interactions increases, and at the same time the order of the space decomposition structure deteriorates.

Cheers
Gubbi
 
GPUs are both streaming and multithread architectures. It's not really either/or. The streaming creates opportunity for massive parallelism, and the multithreading allows hundreds of vertices/pixels to be inflight.


Neither the XBGPU nor CELL is very multithreaded compared to GPUs. A real multithreaded CPU would look something like Sun's Niagara processor. And a stream+multithread design would look like SPE + Niagara.

There are two challenges being mushed together: converting algorithms to be parallel and converting algorithms to be streamable. Usually, if you can convert something to be streamable, it is amenable to multithreading.
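A toy illustration of that last point: once a computation is expressed as an independent per-element kernel (a stream), spreading it across threads is mechanical. The kernel here is an arbitrary placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(v):
    # Per-element "stream kernel": no shared state, no ordering
    # dependency between elements. (Arbitrary example computation.)
    return v * v + 1

def run_stream(data, workers=4):
    """Apply the kernel across a worker pool. Because each element is
    independent, distributing the stream across threads (or SPEs, or
    shader units) needs no synchronization beyond the final gather."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, data))
```

The hard part is never this distribution step; it's restructuring the algorithm so that `kernel` really has no cross-element dependencies in the first place.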
 
Fox5 said:
Hmm, I don't recall any bump mapped dreamcast games except for tomb raider, and even then the bump mapping wasn't that good. Shading wasn't very good either, I don't ever recall seeing shading beyond what was seen on the n64.


We aren't talking about the PVR2DC but about a console with the Kyro2.
The PVR2DC had no real DOT3 but used a simpler form of bump mapping, AFAIR. I'm not sure about EMBM/dependent texture reads either; at least I have found nothing in the PDFs I looked into.
 
DemoCoder said:
GPUs are both streaming and multithread architectures. It's not really either/or. The streaming creates opportunity for massive parallelism, and the multithreading allows hundreds of vertices/pixels to be inflight.


Neither the XBGPU nor CELL is very multithreaded compared to GPUs. A real multithreaded CPU would look something like Sun's Niagara processor. And a stream+multithread design would look like SPE + Niagara.

There are two challenges being mushed together: converting algorithms to be parallel and converting algorithms to be streamable. Usually, if you can convert something to be streamable, it is amenable to multithreading.

Just curious, what would Itanium be?

BTW, what defines what a stream processor is?
 
Fox5 said:
Just curious, what would Itanium be?

BTW, what defines what a stream processor is?

People have talked about the Imagine Stream Processor before, but I found these lecture notes from Stanford's course on stream processors pretty interesting, and perhaps informative if you haven't read through them before:

http://cva.stanford.edu/ee482s/notes.html

Touches on some of the issues brought up here and in other threads too, like the definition of a stream processor, justification for stream processing, what class of programs work well on stream processors and the possibility of general computing on stream processors etc.

There's a bunch of interesting papers linked off from those notes too (including one on polygon rendering here: http://graphics.stanford.edu/papers/prsa/ and a PhD dissertation on computer graphics here: http://graphics.stanford.edu/papers/jowens_thesis/).
 
I found this:

Do we have a real use for 50 billion dot products per second? No, not really. AI is not vector-based. Physics is not vector-based (only a small percentage of it is). Game logic, networking, memory chasing, I/O, none of this is vector-driven.

From here

http://www.psinext.com/index.php?categoryid=4&m_articles_articleid=29

This is the most interesting part IMO; the rest has already been discussed here. Anyway, here is part 1 of the article:

http://www.psinext.com/index.php?categoryid=4&m_articles_articleid=28
 
I found some interesting things myself, specifically regarding physics performance on Cell/PS3. I wasn't sure if this was posted before since it's a little old - a quick search says no - so I'll just put it here and won't risk a new topic in case it's old.

Anyway, nvNews had an interview with AGEIA - the PhysX guys - during E3, and they asked about the consoles. It was a little hard to hear what was being said (video interview), but basically he was saying that for a while after they're released, the X360 and PS3 will be more powerful than the PC, but that PhysX will help prop up the PC. He goes on to say in particular that even a PC with PhysX isn't necessarily going to be "at a PS3 level", but that it'll help to "semi-level" the playing field.

Also interesting is the follow-up discussion to Ryan Shrout's PC Perspective article on PhysX (http://www.pcper.com/article.php?aid=140). Some people asked him about Tim Sweeney's comments re. Cell vs. AGEIA etc., and he promised to look into it with developers. He got back with this:

Ryan Shrout said:
I don't know exactly how the new consoles will perform, but from what I have been hearing from the developers, the PS3's Cell processor in particular has the workings inside to outperform a PC with a PPU.

On the PSINext article, re:

Do we have a real use for 50 billion dot products per second? No, not really. AI is not vector-based. Physics is not vector-based (only a small percentage of it is).

I can't say I'm an authority on this by any means, but I would have thought it would be fairly vector-oriented, given that the input data you're working on pretty much is vectors. And when they talk about percentages, I'd further wonder if they mean in terms of code volume or execution time. On a more general note, there's also the issue of tasks and data which map naturally to vectors versus tasks/data which can be cast to vectors - just because something doesn't inherently map to vectors doesn't necessarily mean it can't (and can't subsequently see a big performance boost).
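On that point, the inner loop of rigid-body collision response really is dot-product heavy. For instance, the textbook frictionless impulse between two bodies (a generic illustration, not from any particular engine) is built around a dot product with the contact normal:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def collision_impulse(v_rel, normal, inv_mass_a, inv_mass_b, restitution):
    """Magnitude of the frictionless impulse along the contact normal.

    v_rel:  relative velocity of body B w.r.t. body A at the contact point
    normal: unit contact normal pointing from A toward B
    Standard impulse formula: j = -(1 + e) * (v_rel . n) / (1/mA + 1/mB)
    """
    v_n = dot(v_rel, normal)           # closing speed along the normal
    if v_n > 0:
        return 0.0                     # bodies already separating
    return -(1.0 + restitution) * v_n / (inv_mass_a + inv_mass_b)
```

Every contact pair in a solver iteration evaluates something like this, so dot products (plus cross products for the rotational terms, omitted here) dominate the arithmetic even before you get to broad-phase geometry tests.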

Anyway, physics performance on these chips is something I'm very interested in, and the persistent suggestion from a number of different sources now is that it's something Cell would/should excel at. It'll be very interesting to see what comes out once games start appearing and devs start talking about them.
 