XNA performance (CPU).

*OK, warning first. Keep in mind, this is a very rough comparison of managed code. One system is running a speed-optimised compiler with a generational GC, while the 360 is running what we can assume is a size-optimised, non-generational GC, aka the Compact Framework.
*anyway*

I've just done my first half-baked performance test of XNA on both the 360 and my PC.
The results are surprising, in both good and bad ways.

Firstly I'll state that I expected XNA to run like a dog on the 360.
These are *not* accurate numbers, but give a good rough idea.


The first surprise was that (as I should have realised) there is no thread scheduling across cores. Doh. You have to manually set the thread affinity if you want decent threaded performance. *oops*. Otherwise everything runs on the default core.
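For reference, pinning a thread on the 360 looks something like this. It's just a sketch, assuming the Xbox-only SetProcessorAffinity extension to System.Threading.Thread:

Code:
using System.Threading;

class WorkerLauncher
{
    // Start a worker pinned to a specific hardware thread.
    // SetProcessorAffinity only exists on the 360, hence the #if.
    static Thread StartWorker(int hardwareThread)
    {
        Thread worker = new Thread(delegate()
        {
#if XBOX
            Thread.CurrentThread.SetProcessorAffinity(hardwareThread);
#endif
            // ... worker loop goes here ...
        });
        worker.Start();
        return worker;
    }
}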

Second surprise: of the six 'cores', you have access to only four. Cores 0 and 2 are restricted (for the GC, I'd presume). How much of an effect this has on overall performance is an open question.

The final surprise was a lack of WaitHandle.WaitAny() support. This method wraps the native WaitForMultipleObjects() function, which waits on several events and returns once one of them has been signalled. I had to replace this with a rough equivalent that was effectively while { loop(signals), sleep(0) }. So this *will* have had a performance effect, but not a significant one.
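The stand-in was effectively this sort of polling loop (a sketch, not the exact code; it assumes WaitOne with a zero timeout is available on the Compact Framework):

Code:
using System.Threading;

class WaitAnyWorkaround
{
    // Rough replacement for the missing WaitHandle.WaitAny():
    // poll every handle with a zero timeout, yielding between passes.
    public static int WaitAny(WaitHandle[] handles)
    {
        while (true)
        {
            for (int i = 0; i < handles.Length; i++)
            {
                if (handles[i].WaitOne(0, false))
                    return i;   // this handle was signalled
            }
            Thread.Sleep(0);    // give up the timeslice, then poll again
        }
    }
}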


The program is as follows:

My tree, from the other day, contains a bunch of objects: either 5,000 or 25,000 depending on the test.
One lower-priority thread collects objects from the tree and farms them off in batches of 64 to the worker threads. They do work :) Then the process repeats; while the worker threads are running out of their caches, the low-priority thread is applying position changes to the tree. On top of that there is rendering too, which currently ain't too quick and blocks the updating (I'm getting there...).
So each object gets updated twice per iteration: once to calculate, once to apply changes. For both of these, I 'simulate' complex code with a big bunch of trig in a loop. For the 25k test the trig is removed (so it's really more of a single-threaded test).
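The fake workload is along these lines (a hypothetical sketch; FakeWork and its parameters are illustrative, not the actual code):

Code:
using System;

class Workload
{
    // A burst of trig per object stands in for 'complex' update code.
    // 'iterations' tunes how heavy the fake work is; the result is
    // returned so the loop can't be optimised away.
    public static float FakeWork(float seed, int iterations)
    {
        float v = seed;
        for (int i = 0; i < iterations; i++)
            v += (float)(Math.Sin(v) * Math.Cos(v * 0.5f) + Math.Tan(v * 0.01f));
        return v;
    }
}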

Anyway. Results:

My PC:
Athlon 64 X2 4200+, 2GB RAM, etc.

(results as framerate; ST = single-threaded, MT = multi-threaded)

5,000 items:

Code:
		ST		MT
360:		7		26
PC:		28		51

25k (more single-threaded, no fake workload)


Code:
		ST		MT		MT*
360:		10		16		17
PC:		36		47		40

*only a single worker thread


Overall I expected the 360 to lose out. If anything I expected it to get slaughtered, but it managed about 50% of the performance of my A64, and given it's running the Compact Framework (a lot less optimisation, the GC isn't generational, etc.), I'm actually quite happy.
When the 360 was new, this CPU was the same price as the hot white box, so all things considered I think it did very well.

Perhaps what surprised me the most was the significant jump when using all 4 'cores'. The performance jump *is* consistently higher than 3x with the heavy workload.

Graphics will be interesting.
 
Perhaps what surprised me the most was the significant jump when using all 4 'cores'. The performance jump *is* consistently higher than 3x with the heavy workload.

Graphics will be interesting.

Err, 10 ST -> 16 MT isn't quite 3x here, or did I get something wrong? :rolleyes:
 
Interesting test. Can you force the GC not to run on 360?

If you did that you'd be out of memory in a matter of seconds :)
You can't manually delete or allocate memory; the garbage collector handles all allocation/deallocation/compacting for you. It's pretty critical and generally very fast.
The problem is the Compact Framework GC isn't generational, which is a tradeoff they make for embedded devices.
With a generational GC, very short-lived allocations are generation 0, which gets cleaned up very frequently. Longer-lived objects get promoted through the generations and are cleaned up less often. This is very fast, but uses more memory. The CF GC is fairly brutal and makes sure everything gets cleaned up as often as possible.
The more complex your object structure, the harder the GC needs to work to decide if objects no longer need to be kept alive.
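To make the difference concrete, a tiny illustration (mine, not from the post above):

Code:
using System;

class UpdateExample
{
    // Under a generational GC this per-frame temporary dies cheaply in
    // generation 0; the non-generational CF collector pays full price
    // for it on every collection.
    public void Update()
    {
        float[] scratch = new float[64];   // fresh garbage every frame
        // ... use scratch ...
    }

    // CF-friendly version: allocate once, reuse every frame.
    private readonly float[] reusable = new float[64];
    public void UpdatePooled()
    {
        Array.Clear(reusable, 0, reusable.Length);
        // ... use reusable ...
    }
}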
 
If you did that you'd be out of memory in a matter of seconds :)
Well, I meant reusing a custom object pool while disabling the GC, since in games there are only a limited number of objects. In Java/C# etc. that seems to be the typical way of making games, because the GC can stop a running game for a while at any given time.
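A minimal pool sketch of what I mean (illustrative; this generic Pool is hypothetical, not from any framework):

Code:
using System.Collections.Generic;

// Everything is allocated up front, so steady-state gameplay triggers
// no new allocations and therefore very little GC work.
class Pool<T> where T : class, new()
{
    private readonly Stack<T> free;

    public Pool(int capacity)
    {
        free = new Stack<T>(capacity);
        for (int i = 0; i < capacity; i++)
            free.Push(new T());
    }

    public T Take()
    {
        return free.Count > 0 ? free.Pop() : null;  // null when exhausted
    }

    public void Return(T item)
    {
        free.Push(item);
    }
}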

EDIT: Found the (old) answer
http://blogs.msdn.com/stevenpr/archive/2004/07/26/197254.aspx
An Overview of the .Net Compact Framework Garbage Collector
...
Can I prevent a GC from occurring?

There are no APIs you can use to prevent a GC from occurring. As we’ve seen, the time spent in garbage collection is a function of the number of objects that have been allocated. As such, really the only way to “prevent” a collection is to keep the number of objects you allocate to a minimum. When you consider this, however, be sure to remember that there are subtle scenarios in which allocations may be happening on your behalf that you may not be aware of. The best example of this is boxing. Boxing a value type necessarily creates a reference type in the GC heap, so even though you aren’t explicitly calling new(), you’re still allocating objects. It might be tempting to think that a call to GC.Collect() can be used to “time” collections, but as described above, such calls are likely to negatively impact performance instead of helping consistently.
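The boxing pitfall the article mentions looks like this in practice (illustrative snippet, not from the article):

Code:
class BoxingExample
{
    static void Demo()
    {
        int frame = 42;
        object boxed = frame;                    // hidden heap allocation (boxing)
        string a = string.Format("{0}", frame);  // also boxes 'frame' via the object parameter
        string b = frame.ToString();             // no box: explicit conversion instead
    }
}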
 
Ok I see what you mean now :) Sorry to patronise :p

The big problem will be Managed DirectX allocating, and as that answer alludes to, it is very, very easy to have small, short-lived allocations stack up over time.
I'm doing very few runtime allocations (one, but it's temporary); however, there are probably a few implicit ones being done by methods in the runtime, etc. Once I start doing more complex things it will get more difficult to reduce them.

Even if you could stop the GC (and it does only run periodically or when memory is getting low), you wouldn't free up the locked core(s).

I'd guess that GC activity would be less than 5% anyway.
 
Yup. I've been keeping an eye on the GC through both Windows tools such as the CLR Profiler and the remote performance monitor. It's generally OK right at the moment; however, I'll be keeping a close watch on it.

I have a version of my tree code that uses value-type structs instead of classes. The problem with that is every access to an item becomes an array index instead of pointer-based, and it gets ugly quickly. It also makes for some very tricky situations where you need to expand the tree halfway through an iteration (if you need to expand your node/leaf caches... which means all the nodes/leaves on the stack suddenly become invalid). In the end I found it had worse performance on the PC, but I haven't done a side-by-side on the 360 yet.
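The tradeoff in sketch form (illustrative node layouts, not my actual tree code):

Code:
using Microsoft.Xna.Framework;

// Class-based node: natural reference semantics, but each node is a
// separate GC-tracked object.
class NodeRef
{
    public NodeRef Parent;
    public NodeRef[] Children;
    public Vector3 Position;
}

// Struct-based node: all nodes live in one flat array and links become
// indices. Far fewer GC objects, but every access is an array lookup,
// and growing the array invalidates anything holding on to old elements.
struct NodeValue
{
    public int Parent;       // index into the array, -1 for the root
    public int FirstChild;   // -1 for a leaf
    public int NextSibling;  // -1 for the last sibling
    public Vector3 Position;
}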
If I find there are serious perf problems later on with the GC (e.g. frame skipping) I'll certainly be looking into it. I'm being very careful with how many objects get allocated for each item in the tree, etc. I've also replaced the GameComponent class with my own, which seems to have at least reduced allocation size. The normal GameComponent class in XNA uses 160 bytes, I think, which isn't trivial if the numbers start to bump up into five digits. I'm also doing other little things like using quaternion+position instead of a matrix, etc.

One thing I found utterly killed performance was iterating using yield enumerators. Everything I have read suggests the compiler treats them in a similar way to delegates (perf-wise), yet I found the performance hit was terrible.
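For comparison, the two shapes look roughly like this (hypothetical items/count fields):

Code:
using System.Collections.Generic;
using Microsoft.Xna.Framework;

class IterationExample
{
    private Vector3[] items;   // hypothetical backing array
    private int count;

    // Yield-based: the compiler generates a state-machine class behind
    // this, and foreach makes interface calls (MoveNext/Current) for
    // every element.
    public IEnumerable<Vector3> Items()
    {
        for (int i = 0; i < count; i++)
            yield return items[i];
    }

    // Plain indexed loop: no enumerator object, no per-element dispatch.
    public void UpdateAll()
    {
        for (int i = 0; i < count; i++)
        {
            Vector3 item = items[i];
            // ... work on item ...
        }
    }
}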

I've still got a long way to go, and of course I'll be optimising it further. I'll muck around with the 360 version too. Eventually I plan to make a small game with it, but also to release it as an alternative to GameComponent, etc... with sort-of easy multithreading :)

It can be very surprising where memory can go. I was surprised to see that one of my biggest points of allocation was WaitHandle.WaitAny() calls, which seem to allocate their own array internally. It's not a big amount of memory, but it added up slowly due to the sheer number of calls. Of course it won't affect the 360 version, since WaitAny wasn't supported anyway and I had to write my own :)
 
What kind of "work" are the worker threads actually doing with your objects? You say trig, but is it on single floats or vectors?

Is there a way under XNA/C# to make your work explicitly SIMD? Or does it entirely depend on the library function?
 
What kind of "work" are the worker threads actually doing with your objects? You say trig, but is it on single floats or vectors?

Is there a way under XNA/C# to make your work explicitly SIMD? Or does it entirely depend on the library function?

It's just single maths functions, so it's nothing efficient. However, I'm working on the assumption that the code will not be optimised all that well on either system.
I experimented with increasing/decreasing the complexity, and it did not seem to skew the results much. However, I'll take a closer look tomorrow. It could well be an issue there.

The included maths types, e.g. Vector3, Matrix, Quaternion, etc., all have platform-specific optimisations as far as I know. The only disappointment is the lack of a Vec * Quat op.
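For what it's worth, the missing op can be rolled by hand from the standard identity v' = v + w*t + q_xyz x t, where t = 2*(q_xyz x v). A sketch, assuming only XNA's public Vector3/Quaternion fields and Vector3.Cross:

Code:
using Microsoft.Xna.Framework;

static class QuatMath
{
    // Rotate a vector by a unit quaternion without building a matrix.
    public static Vector3 Rotate(Vector3 v, Quaternion q)
    {
        Vector3 u = new Vector3(q.X, q.Y, q.Z);
        Vector3 t = 2.0f * Vector3.Cross(u, v);
        return v + q.W * t + Vector3.Cross(u, t);
    }
}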

As I said initially, it's a very rough test. It is the first time I've put the two systems head to head. No doubt the numbers will change as the code becomes more real-world :p
I'm just happy because I had this horrible fear it would run at 2fps or something dire like that (* although initially seeing 7 wasn't too thrilling).
 
Second surprise: of the six 'cores', you have access to only four. Cores 0 and 2 are restricted (for the GC, I'd presume).
Hum. I'll see if I can word this properly or not (admittedly not being the sharpest knife in the toolbox hehe :cool:), but is this two "logical" cores spread across two "physical" cores, if I understood correctly how the Xbox 360 processor works? Or is it two "logical" cores on the same "physical" core?

How much of an effect this has on overall performance is an open question.
If this "GC" thing occupies essentially 2/3 of the CPU hardware resources it could be quite a bit could it not?

Peace.
 
I'm seriously tempted to give XNA + IronPython a shot. But last I heard, you can't actually get it running on 360 hardware yet.
 
If this "GC" thing occupies essentially 2/3 of the CPU hardware resources it could be quite a bit could it not?

Peace.

Couldn't that also be the OS-"reserved" cores? IIRC, MS stated that two threads were reserved (5-10% each) for some OS operations going on in the background, but I'm not sure if that's the case here as well.
 
Couldn't that also be the OS-"reserved" cores? IIRC, MS stated that two threads were reserved (5-10% each) for some OS operations going on in the background, but I'm not sure if that's the case here as well.

Yes. When he says "cores" he means virtual CPU cores. Logical cores 0 and 2 would mean just part of two physical cores. The garbage collector is not going to be that CPU-intensive; it just runs in the background freeing up memory that is no longer referenced by the program.

But this is separate from the 5-10% reserved for the OS. And it's only relevant for languages targeting the .NET CLR built with XNA. Commercial-quality games are unlikely to employ a GC-enabled language because of the performance implications.
 
Ok, I'll just stress a point.

The GC is very fast at what it does.
However, if you are not extremely careful about what you do, or allocate too much, you are going to increase its workload significantly, at which point your game is going to run like rubbish. It's just a different way to optimise. In unmanaged code you need to be extraordinarily careful of memory leaks and buffer overflows... tradeoffs.

It's a good tradeoff, especially on the PC. With a full generational GC it's quite possible to see less than ~0.1% of CPU time spent collecting, and the collector usually runs on its own thread to boot.

I personally believe managed languages are the future for games. Consoles are going to take their time transitioning, of course, but I think it will happen. I wouldn't be surprised if .net is a first-class citizen in the next Xbox.


---------------

There are 6 logical cores on the 360, 2 for each physical core. XNA gives you access to 4. This probably means you still have access to all 3 physical cores.
 
But what is the actual benefit of having a separate program root through memory and potentially cause a huge performance loss?

You might say only 0.1%, but that's on top of all the other crap already running on a PC. Many little things have a way of adding up into large things...

Peace.
 
Performance vs. ease of use for the programmer. Memory management is really easy to get wrong, and things like memory leaks and dangling pointers cause all sorts of havoc in C++ for even the best programmers. The idea is that with less time spent dealing with memory problems, more time can be spent solving the problems that are specific to the project being worked on. It may not be as fast at run time (depending on the workload!), but in theory it will be more stable and more secure.
 