The physics dilemma
Demirug
15-Jun-2006, 09:39
I know this is the graphics forum, and a thread about physics hardware may seem off topic. But I have a good reason to post it here, and if you read further you will hopefully see why. Also, this is not only a PhysX/PPU problem. Every GPU-based or multi-core-CPU-based physics solution will face it too.
Maybe some of you still remember the PhysX hype and what happened after the launch. That was already some time ago, and the web moves fast. If you do, you should also remember that the PhysX “only” demo of Cell Factor shows some nice additional physics effects, but at low frame rates. Most people could not believe that a new chip, claimed to be much faster than a CPU in this particular field, breaks down so early. They are probably right, because the massive use of new physics effects leads us to another problem.
It’s an old problem that graphics developers should know well, but the physics guys on an engine team may never have heard of it. With the small amount of CPU resources the other guys left them, they never came near this magic limit. I am talking about the draw call overhead that forces graphics developers to reduce the number of draw calls per frame. If you ask developer support at nVidia or ATI, they will tell you as a general rule to stay below 500 calls per frame if you want high frame rates. If you go higher, you waste more and more CPU power and the GPU goes idle.
If we now go back to Cell Factor and run a draw call counter in the background, we will see that the game stays below this limit, but only as long as the new physics effects don’t kick in. With some action on the screen it easily goes beyond 1500 calls per frame. You can’t get good frame rates with such a high call count in real-world situations. I don’t know what percentage of the PhysX performance is used for these scenes, but it doesn’t matter, as the CPU does not have enough power to push the additional objects to the GPU. If the game adds even more physics effects, the frame rate will drop further, even if the PPU is still idle most of the time.
Maybe the whole situation reminds you of the first hardware T&L GPUs. The cards were faster than the CPUs of that time, but you still needed a strong CPU to push the objects to the cards. In the end HT&L allowed us to make the objects on the screen more detailed, but it didn’t give us many more objects.
What should we, and hopefully the game engine developers, learn from this? First, that if you change one part of your engine (physics) it can have an impact on other parts (graphics). But knowing this will not solve the problem. The physics guys need to talk with the graphics guys and find a way to draw objects that are falling apart other than drawing every part on its own.
If you are thinking about instancing now, you are on the right track, but instancing only helps if all your objects are very similar. You may have already wondered why the massive physics demos always use a limited number of different objects. Now you know. But in the real world we don’t want to see a building that breaks into parts as if it were built from Lego bricks.
But there is another solution. If you have done some graphics programming on your own, you may remember a technique called “indexed vertex blending” that was introduced together with HT&L on GPUs. It allows selecting 4 transformation matrices from a set of up to 256 for every single vertex instead of one matrix for all vertices. You can easily rebuild this technique with a vertex shader program. Unfortunately the default constant buffer for a vertex shader can only hold up to 64 matrices. But even with this limitation we could split a large object into up to 64 parts and still render it with only one draw call instead of 64 expensive ones. A building that is broken into 500 parts will eat up all your calls with the classical technique. If you use an indexed vertex blending solution you will need only 8 or fewer.
If we look a little into the future, Direct3D 10 allows us to draw all these parts with only one call, and the overhead for this call will be lower too.
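As a rough illustration of the draw call arithmetic above, here is a minimal sketch that groups the parts of a broken object into batches, one batch per draw call. It assumes each rigid part uses exactly one matrix from a 64-entry palette; the function names `draw_calls_needed` and `build_batches` are hypothetical and not part of any graphics API.

```python
import math

PALETTE_SIZE = 64  # matrices that fit in the vertex shader constants (figure from the post)

def draw_calls_needed(num_parts, palette_size=PALETTE_SIZE):
    """Draw calls required when one call can address `palette_size` part matrices."""
    if num_parts <= 0:
        return 0
    return math.ceil(num_parts / palette_size)

def build_batches(num_parts, palette_size=PALETTE_SIZE):
    """Give every part a (batch, matrix_index) pair.

    In a real renderer each vertex of a part would store matrix_index
    as its blend index; here we only compute the assignment.
    """
    return [(part // palette_size, part % palette_size) for part in range(num_parts)]

parts = 500
print("classical technique:", parts, "draw calls")
print("indexed blending:   ", draw_calls_needed(parts), "draw calls")
```

For the 500-part building this gives 8 batches, the last one only partly filled.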
To come to a final conclusion:
Adding a high-performance physics solution to a game will not make the game better as long as the rest of the game engine cannot handle that much physics.
I would like to thank the German PC Games Hardware magazine, which ran the draw call count test with their PhysX hardware.
Interesting post - I think some of us already knew most of this, but the notion definitely hasn't hit the mainstream yet, and your solution for breaking up objects is definitely interesting.
For particles in general, I'd tend to believe they remain quads, though. So the solution is much easier, you just group them in a single drawcall. If they're not billboards but actual rotating quads, then obviously you'd want to do the matrix transformation on the CPU, considering how cheap that might be compared to sending
For breaking up objects, the modern equivalents of "indexed vertex blending" via the vertex shader definitely sound like the best solution. This would indeed make this particular technique much more interesting. And as you mention, D3D10 should help too...
There are some other factors which I think should also be considered significant, however. You talk of CPU-GPU overhead here, but don't you think there's CPU-PPU overhead, too? Consider that in a highly optimized engine, you could nearly inline some of the things that require a function call for PhysX. Then consider what that function call is going to do, even if the API is Ageia's and not Microsoft's. It also has to pass through PCI drivers etc., which I doubt are the world's most optimized CPU-wise, too. Even less so, too, if it works in "small chunks", information-wise...
Finally, I'd like to highlight I'm tired of this kind of optimization mechanism where you trade performance for a frame of latency. SLI/Crossfire are that, and the PPU is too, compared to a single-core CPU. Compared to a dual-core CPU, it's the same issue, of course.
From a practical point of view, the only "advantage" you get from higher FPS is lower input latency and greater visual smoothness. The second can be obtained through motion blur much more easily. So, if you increase latency in order to improve FPS, the end result is counter-productive. I know it's not the IHV's or Ageia's fault per-se, but in my book, it makes the solution little more than a short-term fix, or perhaps a mid-term one at best.
The best solution then, of course, is to have Unified Shaders GPUs running the current frame's physics and then rendering everything not needing feedback immediately. Also, the whole notion of distributing VS workloads with AFR begins to make less sense in unified architectures (and even more so if the bandwidth/latency between the two boards isn't too high), so you can switch to a SFR-like mode "for free".
All this would put the latency levels back to the traditional level, and at no extra cost, if the hardware supported it. I fail to see how a separate coprocessor, ala Ageia's PPU, can accelerate physics without adding a frame of latency. Heck, I don't even see how a dual-core CPU can. Now, here's hoping things move in that direction...
Uttar
Finally, I'd like to highlight I'm tired of this kind of optimization mechanism where you trade performance for a frame of latency. SLI/Crossfire are that, and the PPU is too, compared to a single-core CPU. Compared to a dual-core CPU, it's the same issue, of course.
This is not quite right.
Considering a dual-core CPU: latency may be lowered, stay approximately the same, or raised by a frame depending on the algorithm and the exact performance; if the operations are naturally parallelisable within a single frame (similar to the classic case described in Amdahl's law) then you get lower latency; if the operations parallelise on a per-frame basis but vsync does not interfere (e.g. vsync is off, or you go from 50fps to 70fps when refresh is 60fps) you will get about the same latency; if you raise your frame rate but do not raise it enough to cross a vsync boundary your latency rises.
As regards SLI/Crossfire, again, the same rule applies; if by doubling your rendering rate you also double your framerate, your latency has not changed. Only if you are overkilling and rendering more frames than you can see has your latency increased. If you had vsync off then latency would be identical to without Crossfire (minus any overheads).
(It's like the old triple buffer argument; does it increase latency, decrease latency or leave latency unchanged? It's possible to make a case for all three - depending on whether you take worst case or best case, and whether you analyse above, around or below vsync rate. I can't remember which I proved last time I looked at it, but I wasn't sure I was right anyway :) ).
Considering a dual-core CPU: latency may be lowered, stay approximately the same, or raised by a frame depending on the algorithm and the exact performance; if the operations are naturally parallelisable within a single frame (similar to the classic case described in Amdahl's law) then you get lower latency; if the operations parallelise on a per-frame basis but vsync does not interfere (e.g. vsync is off, or you go from 50fps to 70fps when refresh is 60fps) you will get about the same latency; if you raise your frame rate but do not raise it enough to cross a vsync boundary your latency rises.
I'm not going to disagree up to that for sure, and yes, that's a quite elegant formalization of the problem.
I should have highlighted the fact, however, that I was specifically speaking of Physics here. The problem is you can't send graphics to the GPU before you've computed the objects' positions. You cannot parallelize both operations without increasing latency, and that's basically "by definition".
You may argue, of course, that I'm assuming that there is no communication to the rendering engine during the actual computation process of physics. Such a scheme is of course possible, but you need to consider the fact that physics is mostly, but not fully, parallel. If you make it fully parallel, you're going to have two objects "entering" each other much too easily.
As for ways to prevent that, it doesn't matter much in our case, because any of them adds the same fundamental problem, which is that you really can't send much to be rendered before everything else is ready, too.
Now, if you look exclusively at non-gameplay-affecting physics (that does not collide with the player or other highly dynamic entities), things might be a bit different, because sometimes your particles might only need to interact with static environment, and as such, as soon as everything has been computed, you can (and should) send it to the GPU to render.
The problem there, though, is that you're only fixing the problem for things that don't matter. The gameplay-affecting entities would still have that frame of latency. Unless, of course, you render that after everything else, to buy you time. So you'd want particles to be sent to the GPU first. Great, only problem is that restricts you to additive blending, unless you're on a deferred renderer. Oopsy.
So fundamentally, to minimize latency, you'd want to have all the gameplay-affecting entities' physics be done ASAP, and with minimum latency. If that's not done on the GPU, you could always feed it with some static environment stuff in the meantime. And then you'd just use a maximal number of threads on the CPU to be done with it ASAP, hoping Amdahl's law is on your side as much as it can be.
But once again, if it was done on the same GPU doing the rendering using the current frame's data, all this would be greatly simplified, and most likely more efficient, too. I'm not saying there aren't limitations to that technique too, but I don't see many, personally.
As regards SLI/Crossfire, again, the same rule applies; if by doubling your rendering rate you also double your framerate, your latency has not changed. Only if you are overkilling and rendering more frames than you can see has your latency increased. If you had vsync off then latency would be identical to without Crossfire (minus any overheads).
So fundamentally, you're not reducing your latency in any case whatsoever. Yet, your FPS is higher, so you'll increase the detail level. As such, you'd most likely, if not certainly, see most people with an SLI/Crossfire system exhibiting higher latency than those without.
I'm not saying there's no advantage to SLI/Crossfire. Quite on the contrary; some extra smoothness at insane detail levels is always welcome. But I feel that for the hardcore FPS player for example, it's not an advantage, at all. Quite on the contrary, too, since you can't expect 2 cards/chips to have exactly 200% of the performance all of the time - or anytime, for that matter. It tends to be a fair bit below that.
If you stopped using AFR and instead went for a 3DFX/3DLabs-like multichip/multiboard configuration (think SAGE+Rampage, or 3DLabs' Wildcat Realizm 800), you could get near-equivalent efficiency at no extra cost. Or rather, no extra cost, IF your hardware was unified in the first place so you don't need multichip configurations.
Furthermore, we obviously don't know how efficient 3DFX's method was (2xSAGE having been cancelled might hint towards "not so good", though). And 3DLabs' solution, well, wasn't efficient at all. That's irrelevant however, because it is easy to conceptualize how such a scheme would work quite easily in a proper unified design.
From my POV, having a unified design is thus an interesting advantage for SLI-like functionality. If anything, this is an opportunity for ATI to get a more interesting position in that market, if R600 is unified and G80 isn't - which remains to be seen, of course.
(It's like the old triple buffer argument; does it increase latency, decrease latency or leave latency unchanged? It's possible to make a case for all three - depending on whether you take worst case or best case, and whether you analyse above, around or below vsync rate. I can't remember which I proved last time I looked at it, but I wasn't sure I was right anyway :) ).
What I'm arguing here is that if the hardware and parallelization algorithms underwent only small and "easy" paradigm changes, latency could easily be decreased further.
VSync is a case where latency is increased (compared to a world where the monitor's image changes instantly when the GPU has finished a frame, at least) because of how monitors operate. The actual "mechanism" behind VSync simply wants to prevent all tearing-like artifacts. Triple Buffering, on the other hand, is there to handle corner cases (and the "general" bad case) in a much more efficient manner, which as you say, can decrease latency in certain circumstances, compared to double buffering.
I would tend to believe however that VSync/Triple Buffering is nothing but an elegant hack. Once again, a very slight paradigm shift in how the monitor communicates with the GPU could most likely reduce latency here. I didn't research this personally, so I won't pretend to know for sure whether it'd work or not, but arjan de lumens suggested such a thing on B3D back in 2005:
http://www.beyond3d.com/forum/showthread.php?t=21731
And to conclude, please don't consider me a latency freak - it does pain me, however, to see that basically nobody thinks about it anymore nowadays. I was amazed to see nobody going "WTF?!" at the "Full AFR" mode of Quad SLI, for example. While interesting for performance, it doesn't look like a good idea though, because of the obvious latency increase you'll suffer for a fixed framerate. And that's just an example among many.
Uttar
Has anyone tried Cell Factor on Vista Beta 2 to see what happens performance-wise?
Well, I'm one of those people who believes anything above 60fps is completely worthless anyway, so sure, I don't understand the mentality of the hardcore FPS brigade :) .
The idea of a floating refresh rate is certainly an interesting one and perhaps more practical with the shift to flatpanels.
For the dual-core doing physics, I think you're still not taking the throughput increase into account in your calculations - sure, relative latencies have risen but absolute latencies may change in any direction. Let me describe a possible dual-core game engine architecture to demonstrate this:
Consider an engine with a master thread and a slave thread. The master thread handles user input and world state including physics. The slave thread handles all rendering operations. The master thread is allowed to freewheel and updates as fast as it can, updating the world from a start state into an end state in a different block of memory. When a rendering operation occurs, the renderer thread locks the current start state and uses that to perform its rendering (the update thread can continue to run by triple buffering).
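One possible reading of the hand-off described above is a three-slot buffer: the update thread always has a free slot to write into, and the render thread always picks up the newest completed state. The sketch below is only an illustration under that assumption; the class `TripleBuffer` and its method names are invented for the example and do not come from any particular engine.

```python
import threading

class TripleBuffer:
    """Three state slots: one being written, one ready, one being read."""

    def __init__(self, initial_state):
        self._slots = [initial_state, initial_state, initial_state]
        self._lock = threading.Lock()
        self._write = 0     # slot owned by the update (master) thread
        self._ready = 1     # most recently completed state
        self._read = 2      # slot owned by the render (slave) thread
        self._fresh = False  # True if _ready holds a state not yet rendered

    def publish(self, state):
        """Called by the update thread when a new world state is complete."""
        self._slots[self._write] = state  # only the updater touches this slot
        with self._lock:
            self._write, self._ready = self._ready, self._write
            self._fresh = True

    def acquire(self):
        """Called by the render thread; returns the newest completed state."""
        with self._lock:
            if self._fresh:
                self._read, self._ready = self._ready, self._read
                self._fresh = False
        return self._slots[self._read]

# Single-threaded walk-through: the updater freewheels and publishes twice
# before the renderer asks for a state, so the renderer skips state 1.
buf = TripleBuffer(initial_state=0)
buf.publish(1)
buf.publish(2)
print(buf.acquire())  # newest completed state
print(buf.acquire())  # nothing new published, same state again
```

Neither thread ever waits for the other to finish a full update or a full frame; the lock only protects the index swap.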
You can see that in this case the latency growth is on average 50% of the time taken to update the world state, with the worst case being just under 100%. *
Now, using both threads should allow you to raise your throughput - let's say to 140%, in the ballpark of Quake4's delta - and the two threads take equal time to run. In this case, the average absolute latency change is 1.5 / 1.4, so there is in fact only a very small latency gain.
Alternatively, consider that the throughput delta is about 140%, but that the update thread takes less time than the rendering thread - say 75%. In this case the latency change is 1.5 * .75 / 1.4 == 80%, so in fact there is a latency reduction in this case. Even the worst case latency is 2*.75/1.4 so only a 7% increase.
(Hopefully I haven't made any major mistakes in that lot... feel free to tell me if I've got it completely wrong!)
* If the game could make an accurate estimate of the update time remaining, this could be halved to 25% (by delaying the rendering thread start if the update thread is over 50% complete). In practice an accurate estimate is unlikely to be available but an approximation should allow some reduction - because I've no accurate quantification I haven't considered this here.
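The three latency figures in this post can be reproduced with a few lines of arithmetic. This is just the calculation as stated above, not a measurement; the function name `relative_latency` and its parameter names are made up for the example.

```python
def relative_latency(wait_factor, update_fraction, throughput):
    """Latency relative to the single-threaded engine.

    wait_factor:     1.5 for the average case, 2.0 for the worst case
    update_fraction: update thread time relative to render thread time
    throughput:      frame rate gain from using both threads (1.4 = 140%)
    """
    return wait_factor * update_fraction / throughput

cases = {
    "equal threads, average":   relative_latency(1.5, 1.00, 1.4),
    "update at 75%, average":   relative_latency(1.5, 0.75, 1.4),
    "update at 75%, worst case": relative_latency(2.0, 0.75, 1.4),
}
for name, value in cases.items():
    print(f"{name}: {value:.0%} of the original latency")
```

Under these assumptions the equal-threads case and the 75% worst case both come out at roughly 107%, and the 75% average case at roughly 80%.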
Doesn't D3 engine have a fixed 60 ticks per second for the gameworld?
Jawed
Well, I'm one of those people who believes anything above 60fps is completely worthless anyway, so sure, I don't understand the mentality of the hardcore FPS brigade :) .
I quite fully agree with that. As I said, the only things a higher FPS will get you, unless the engine's design has serious limitations, are higher smoothness and lower latency. The first can be obtained through other means, and the second varies too much based on other factors.
Still, consider an online game. Extra latency can fundamentally be conceptualized as extra ping. It's not the exact same thing, but it's also much closer than you'd think. And you aren't going to pretend, I hope, that there isn't a very noticeable (at least in some games) difference between having a 60ms and a 100ms ping. Now, conceptualize 2 extra frames of latency at 50FPS, and that's exactly it.
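The ping comparison is simple to check. A minimal sketch, treating frame latency as if it were added directly to network ping, which is the simplification made in this post; the helper `frames_to_ms` is a made-up name.

```python
def frames_to_ms(frames, fps):
    """Time in milliseconds covered by `frames` frames at `fps` frames per second."""
    return frames * 1000.0 / fps

base_ping_ms = 60.0
extra_ms = frames_to_ms(2, 50)  # two extra frames of latency at 50 FPS
print(f"extra latency: {extra_ms} ms")
print(f"equivalent ping: {base_ping_ms + extra_ms} ms")
```

Two frames at 50 FPS are 40 ms, which is exactly the gap between a 60 ms and a 100 ms ping.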
For the dual-core doing physics, I think you're still not taking the throughput increase into account in your calculations
I read through your example, and after realizing the complaint I had wasn't valid, it does seem quite solid. I do have one major complaint with your assumptions, though...
Alternatively, consider that the throughput delta is about 140%, but that the update thread takes less time than the rendering thread - say 75%. In this case the latency change is 1.5 * .75 / 1.4 == 80%, so in fact there is a latency reduction in this case. Even the worst case latency is 2*.75/1.4 so only a 7% increase.
I don't know about you, but if you're doing any interesting physics stuff, and you aren't rendering a zillion different particles, sorting them by hand... Well, I'd tend to believe it's the other way around. And obviously that affects the calculation quite a fair bit.
I obviously don't have anything against multicore physics. I personally believe it's going to increase latency in most (current) cases anyway, but I can live with that kind of increase considering the potential gains. For Ageia's PPU for example though, I'm not aware of any "proper usage" scheme that doesn't add a full frame of latency. It is, at the very least, the mechanism PhysX's docs suggest. Which, sadly, implies it'd work that way on dual-core CPUs too, using their API in such a way. I don't know how realistic it'd be to use PhysX in another way however, and I don't know how Havok works for such a thing either.
When it comes to Havok FX/GPU Physics, I can only hope for the best. But considering the latency heresy that is Quad AFR, I also fear for the worst...
Uttar
Sunrise
15-Jun-2006, 16:55
Doesn't D3 engine have a fixed 60 ticks per second for the gameworld?
Jawed
Yes, the game tic simulation, including player movement etc., if I remember correctly. It solved some issues Carmack had to face developing his older engines, especially Q2, where there were some errors WRT game physics. Players with higher fps (e.g. the famous cl_maxfps "100" bug) had an advantage, which practically was a non-issue back then, because the speed of graphics hardware and the CPU was still quite limited. That 60Hz tic rate made sure that you couldn't do moves which were possible with higher fps, but absolutely impossible with lower fps and less capable hw.
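A fixed simulation tick like this is commonly built around a time accumulator. The sketch below is a generic illustration of that idea and not id Software's code; `run_frames` and `TICK_DT` are names invented for the example, and exact fractions are used only so the tick counts come out deterministic.

```python
from fractions import Fraction

TICK_RATE = 60                    # simulation ticks per second
TICK_DT = Fraction(1, TICK_RATE)  # duration of one tick in seconds

def run_frames(frame_times, tick_dt=TICK_DT):
    """Count how many simulation ticks run for a list of rendered frame times."""
    accumulator = Fraction(0)
    ticks = 0
    for frame_dt in frame_times:
        accumulator += frame_dt
        # Step the game world in fixed increments, however long the frame took.
        while accumulator >= tick_dt:
            accumulator -= tick_dt
            ticks += 1  # a real engine would call its world update here
    return ticks

one_second_at_100fps = [Fraction(1, 100)] * 100
one_second_at_30fps = [Fraction(1, 30)] * 30
print(run_frames(one_second_at_100fps))
print(run_frames(one_second_at_30fps))
```

Both players get the same number of world updates per second, so a faster machine renders more often but cannot simulate movement at a finer step than a slower one.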
You don't need billions of objects to make physics interesting...
You don't need billions of objects to make physics interesting...
I think until you do, it's not much more than a gimmick. Much the way shaders (or EMBM for that matter) were earlier on, where the odd thing in every game level might use the feature, and everything else is static and flat. Especially when we get talking about any sort of fluid dynamics. Looking at PhysX's cottage cheese "fluid" really shows we have a long way to go.
I don't know about you, but if you're doing any interesting physics stuff, and you aren't rendering a zillion different particles, sorting them by hand... Well, I'd tend to believe it's the other way around. And obviously that affects the calculation quite a fair bit.
Sure. But that means you've got the wrong multithreading policy, and you should have parallelised your physics with itself instead of your render pipeline with your physics.
Amdahl's law again; don't shift something that only takes 20% of the time into a parallel thread; parallelise the job that takes 80% of the time.
Soon we'll be on quad cores in the PC space, and how many in the console space? Three threads to run the game state update in parallel and one to run the rendering sounds pretty good if the physics is the dominant load...
Or you can offload some portion of the game state elsewhere ;).
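To put rough numbers on the 20% versus 80% argument, here is an idealised sketch that ignores synchronisation and memory overhead. It compares moving the small job to its own thread with splitting the large job across several threads; the function names and the 20/80 split as a fraction of frame time are assumptions for illustration.

```python
def offload_speedup(small, big):
    """Run the small job on a second thread, fully overlapped with the big job."""
    return (small + big) / max(small, big)

def split_speedup(serial, parallel, workers):
    """Amdahl's law: the parallel part is divided evenly across `workers` threads."""
    return (serial + parallel) / (serial + parallel / workers)

# Frame time split: 20% rendering submission, 80% physics (hypothetical figures).
print(f"offload the 20% job:        {offload_speedup(0.2, 0.8):.2f}x")
print(f"split the 80% job, 2 cores: {split_speedup(0.2, 0.8, 2):.2f}x")
print(f"split the 80% job, 3 cores: {split_speedup(0.2, 0.8, 3):.2f}x")
```

With these figures, offloading the small job gives at most 1.25x, while splitting the large job gives about 1.67x on two threads and about 2.14x on three.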
sonyps35
16-Jun-2006, 08:29
OT but, does Xenos tiled rendering, predicated tiling, increase latency?
Dio: Yup, definitely. My principal fear, however, is that more and more developers begin using 3rd Party APIs, and that simply to look better than the competition performance-wise, they'd rather increase the latency to improve the parallelization potential.
Heck, I wouldn't even trust most game developers to have a clue about how to properly apply Amdahl's law, or know why latency reduction is important to some. And with say, PhysX, suggesting to use a frame of latency... meh.
sonyps35: I don't see why it would. It keeps working on the same frame's data, after all. So, no, it doesn't increase latency whatsoever, unless there's an unrelated strange design choice in there, which I doubt.
Uttar
KingRoLo
17-Jun-2006, 14:50
Why don't developers just make use of the 2nd core of dual core CPUs?
karlotta
17-Jun-2006, 16:03
Why don't developers just make use of the 2nd core of dual core CPUs?
Because it takes more work, and there weren't many dual cores around. Think Dell and Intel... but now, in the next year, there will be many DCs. So soon they will.
Heck, I wouldn't even trust most game developers to have a clue about how to properly apply Amdahl's law, or know why latency reduction is important to some.
I think there'll be a lot of truth in this for the next couple of years. It was telling that at a conference a mate of mine was at a year or so ago, when a Sony rep asked the audience how many of them had multiprocessing experience, about two people raised their hands.
In my experience, even in a CS degree there's not been enough tutoring on even the basics of multiprocessing let alone the stuff like MESI that is critical for optimising performance at the lowest level. (I've no clear idea where I've picked it up along the way, I assume it's my fascination with reading anything on assembly language and CPU architectures).
A lot of the key information is now all in the more familiar places like Intel and AMD's primary optimisation guides, which will help a lot, and everyone working on game engines is going to have to become familiar with it. So I suspect we'll see initial scalings of 25-50% ^ (cores) rising to around 75% ^ (cores) in the longer term, with a few rocket scientists way out ahead of the rest.
russo121
19-Jun-2006, 19:31
This is OT, but why is there not a single word about ATI physics?
I thought ATI physics was closer to market than Nvidia's. All I can tell you is that this review really sucks!
http://www.tomshardware.com/2006/06/19/can_ageia/