Dynamic Branching Benchmark

I guess what I said was wrong, sorry everyone. I recall reading a tutorial about it some years ago and I could have sworn this would work :( I also ran a test just now, and Sleep(0) doesn't reduce the CPU load... At least not with Vista and .NET 2.0. What it does instead is trigger the execution of waiting threads, thus allowing better CPU utilization. Sleep(0) is something like "I am done for now, you should take care of the others". This is pretty useless, however, if your application is the only one running :)

My tests were run on an empty loop, I am not sure if it will behave the same in a 3D demo.

I also tried Sleep(1), but that reduced my performance by a factor of 3. However, this is an empty loop... Humus, if you have some spare time, could you try Sleep(1) in your demo and tell us the result? I am eager to know how much it will reduce the performance.

Sleep(0) : A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run. If there are no other threads of equal priority ready to run, the function returns immediately, and the thread continues execution. (MSDN)

Sleep(1) will wait for at least 1 millisecond (often more, depending on the system timer resolution) each time it is called, etc.

Since you're the one programming it, you can't not know how this affects performance in any given app... ?! right ..?
 
For an empty loop, Sleep(1) is long. For a real rendering loop that takes much time, Sleep(1) can be nothing, while freeing CPU time for other applications. Probably :)
 
On a related note, I've found that turning on vsync gives the driver a better chance of giving some time back to the OS, but even in that case, on some previous (unnamed :)) cards, 100% CPU was still taken.

In any case, this isn't so much of an issue with multi-core CPUs becoming the norm. Vista/DX10 might handle it better as well.
 
For an empty loop, Sleep(1) is long. For a real rendering loop that takes much time, Sleep(1) can be nothing, while freeing CPU time for other applications. Probably :)
Another option is Sleep(0) - that'll misbehave in FreeBSD and perhaps some other OSes, though.


Uttar
 
For an empty loop, Sleep(1) is long. For a real rendering loop that takes much time, Sleep(1) can be nothing, while freeing CPU time for other applications. Probably :)

In that case a single Sleep(1) would have a really minor effect..
Remember, it's the OS that hands out chunks of CPU time to each running process based on its priority. So even if you're really trying hard to get 100% CPU, in reality you cannot if there are other processes running.
(Then again, DX9 code integrates into the kernel itself, thereby maybe complicating things a little further.)
 
Interesting discussion ;)

Just for you guys, here are three variants of the benchmark exe:
- oZone3D_SoftShadows_Benchmark_VSYNC_OFF_Sleep(0).exe : 100% CPU
- oZone3D_SoftShadows_Benchmark_VSYNC_OFF_Sleep(1).exe : 100% CPU
- oZone3D_SoftShadows_Benchmark_VSYNC_ON_Sleep(0).exe : 5% CPU

LINK: http://www.ozone3d.net/public/downloads/SoftShadows_Benchmark_New_Exe.zip
Just put these exes in the benchmark directory.

From MSDN (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/sleepex.asp): a value of Sleep(0) causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run.

For GLQuake, I guess vsync is enabled; that might explain the low CPU usage.

JeGX
 
X1900XTX at 672/859, CAT7.1, E6400@3280
DB OFF 1938, ON 3897
Both cores at ~50%
Shadows are indeed very blocky.
 
hi,

At the beginning of the thread, I wondered why dynamic branching performance on GeForce 7 was worse than on GeForce 6 or 8. I believe I've got the answer: the Forceware drivers. Here are some new results (benchmark v1.5.4), where ratio = Branching_ON / Branching_OFF:

7600GS - Fw 84.21 - Branching OFF: 496 o3Marks - Branching ON: 773 o3Marks - Ratio = 1.5
7600GS - Fw 91.31 - Branching OFF: 509 o3Marks - Branching ON: 850 o3Marks - Ratio = 1.6
7600GS - Fw 91.36 - Branching OFF: 508 o3Marks - Branching ON: 850 o3Marks - Ratio = 1.6
7600GS - Fw 91.37 - Branching OFF: 509 o3Marks - Branching ON: 850 o3Marks - Ratio = 1.6

7600GS - Fw 91.45 - Branching OFF: 509 o3Marks - Branching ON: 472 o3Marks - Ratio = 0.9
7600GS - Fw 91.47 - Branching OFF: 509 o3Marks - Branching ON: 472 o3Marks - Ratio = 0.9
7600GS - Fw 93.71 - Branching OFF: 508 o3Marks - Branching ON: 474 o3Marks - Ratio = 0.9
7600GS - Fw 97.92 - Branching OFF: 505 o3Marks - Branching ON: 478 o3Marks - Ratio = 0.9
7600GS - Fw 100.95 - Branching OFF: 508 o3Marks - Branching ON: 480 o3Marks - Ratio = 0.9

My conclusion is: dynamic branching in OpenGL works fine (i.e. performance is better than without dynamic branching: ratio > 1) for Forceware <= 91.37. For drivers >= 91.45, the ratio drops under 1. It seems a "little" bug slipped into the driver code that manages dynamic branching, starting with Forceware 91.45. I've also done the test with the simple soft shadows demo provided with the NV SDK 9.5. The results are the same. So what do you think of this conclusion? Have I got it completely wrong, or is there really a bug in the Forceware drivers :?:
 
This is very interesting. The question is, how in the world could Nvidia not notice that, especially with all the complaints about DB performance?
 
What if that high performance was due to a bug, i.e. incorrect handling of DB in certain situations?

And after all, how many real-world DB examples do we have?
Maybe NV decided to lower the results and keep it quiet until a real app becomes available, in order to then claim a "driver improvement".
 
What if that high performance was due to a bug, i.e. incorrect handling of DB in certain situations?
Hmm, doesn't that resemble the saying "it's not a bug, it's a feature", then? :LOL:

So, now there are two odd cases with DB on presumably "inferior" NV hardware: the slightly positive ratio on the GF6 series, and now the GF7. Well, they are a far cry from what the R500 chippery is showing, and the G80 is, in my view, fillrate/bandwidth limited in this test.
 
To improve DB performance you need smaller batches; to get smaller batches you need to use more registers... as long as you can still hide texturing latency.
I wouldn't be surprised if the driver artificially inflated the number of registers used by a shader to improve DB performance..
 
I've always wondered how much the fixed batch size (1K fragments) on the GF7 marchitecture contributes over the previous 4K [NV40] one. Not that 1K is much (if any) better than 4K, but having more than one program counter in the fragment core is sure to be helpful. ;)
 