nFactor2 - an engine on X360

Titanio · Sep 6, 2005

It's about a week old now, but since it wasn't posted here (I don't think?) I figured it should be:

http://www.famitsu.com/game/news/2005/08/31/103,1125492984,42806,0,0.html

Inis is working on a new engine, for the moment on X360, called nFactor2. They presented it at CEDEC, including some demo footage. The only media available from it are these small pics:

They also gave a breakdown of how they are using threads/cores on X360, which I thought might be particularly interesting given the ongoing debate about CPU usage and how "extra" cores will be used next-gen (well here's one answer anyway):

Roughly translated (by Argyle @ GAF), it's:

Thread 0: main game loop
Thread 1/0: rendering thread
Thread 2: physics simulation
Thread 3: Hair simulation
Thread 4: Audio

Interesting that one hw thread isn't being used - possibly because one of these threads is designed to rarely block, so they've given it a core of its own?
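To make the layout above a bit more concrete, here is a small sketch of how the five listed jobs might sit on Xenon's six hardware threads (3 cores with 2 hardware threads each). The slot numbering and the "slot n lives on core n // 2" mapping are assumptions of mine, not something the slide states, and the names `assignment`, `layout` and `idle_slots` are made up for illustration.

```python
# Xenon exposes 3 cores x 2 hardware threads = 6 hardware-thread slots.
CORES = 3
THREADS_PER_CORE = 2

# Jobs as listed in the translated slide. Treating the five entries as
# slots 0-4 is an assumption; the slide's own numbering is ambiguous.
assignment = {
    0: "main game loop",
    1: "rendering",
    2: "physics simulation",
    3: "hair simulation",
    4: "audio",
}

def layout(assignment):
    """Group slots by core (assumed: slot n sits on core n // 2)."""
    table = {}
    for core in range(CORES):
        first = core * THREADS_PER_CORE
        slots = range(first, first + THREADS_PER_CORE)
        # Unassigned slots show up as None.
        table[core] = [assignment.get(slot) for slot in slots]
    return table

def idle_slots(assignment):
    """Return the hardware-thread slots that have no job assigned."""
    total = CORES * THREADS_PER_CORE
    return [slot for slot in range(total) if slot not in assignment]

if __name__ == "__main__":
    for core, jobs in layout(assignment).items():
        print(core, jobs)
    print("idle:", idle_slots(assignment))
```

Under that assumed mapping the idle slot ends up on the same core as audio, which would fit the guess that one job has been given a core to itself, but the real core assignment isn't known from the slide.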

Guden Oden · Sep 6, 2005

Titanio said:
Interesting that one hw thread isn't being used - possibly because one of these threads is designed to rarely block, so they've given it a core of its own?

I don't know if it's possible to design much of anything on x360 to rarely block unless you write most code to run more or less entirely out of L1 caches. L2's going to thrash if heavily used by more than one core (especially if part of it is locked off for use by xenos).

It might be that the coders simply didn't find anything useful to put on the last thread. It might be that moving something onto the sixth thread thrashed the cache badly enough to lower overall performance compared with running it in sequence after something else. Who can say without actually having been there inside the heads of the programmers?

All we can offer is speculation...

Audio ought to be very cache friendly though. The channel mixer can most certainly fit in 16k L1; I'm not sure about the Dolby Digital encoder, but regardless, the resulting audio data stream is so small (less than 200kB/sec) that it probably doesn't matter at all if it needs a second processing pass to do DD encoding. Besides, even with 100+ sound channels, code and data access patterns ought to be very regular and predictable, so this bit probably isn't a very large hit on CPU performance.
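For a rough sanity check on the "less than 200kB/sec" figure, here is the arithmetic as a quick sketch. The 48 kHz sample rate and 16-bit samples are assumptions of mine (the post doesn't give them), and `pcm_bytes_per_sec` is just an illustrative helper. The 640 kbit/s figure is the maximum bitrate the Dolby Digital (AC-3) format allows.

```python
def pcm_bytes_per_sec(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

# Assumed format: 48 kHz, 16-bit samples.
stereo_pcm = pcm_bytes_per_sec(48_000, 16, 2)     # mixed stereo output
surround_pcm = pcm_bytes_per_sec(48_000, 16, 6)   # 5.1 before encoding

# Dolby Digital tops out at 640 kbit/s; convert to bytes per second.
dd_max = 640_000 // 8

if __name__ == "__main__":
    print("stereo PCM  :", stereo_pcm, "bytes/s")
    print("5.1 PCM     :", surround_pcm, "bytes/s")
    print("DD (maximum):", dd_max, "bytes/s")
```

Under those assumptions a stereo PCM stream comes in just under 200kB/sec and an encoded DD stream is well below that, so the output side really is tiny next to what the L2 moves around.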

one · Sep 6, 2005

Titanio said:
Interesting that one hw thread isn't being used - possibly because one of these threads is designed to rarely block, so they've given it a core of its own?

Maybe processing user input / OS / network?

BTW, before someone points out "hardware threads more than 5 are redundant!! Mwahahaha!!" with Cell in mind, the slide notes (in the sentence above the thread utilization breakdown Titanio posted) that the more hardware threads you have, the more room for performance gain you have.

deathkiller · Sep 6, 2005

one said:
Maybe processing user input / OS / network?

That should be in the main game loop I think.

Gubbi · Sep 6, 2005

Guden Oden said:
Audio ought to be very cache friendly though. The channel mixer can most certainly fit in 16k L1; I'm not sure about the Dolby Digital encoder, but regardless, the resulting audio data stream is so small (less than 200kB/sec) that it probably doesn't matter at all if it needs a second processing pass to do DD encoding.

Since audio is inherently streaming in nature you wouldn't want to pollute your caches with audio data. You'd use non-temporal loads and stores to move data in and out without using any cache for data. Then your only worry is I$ footprint which, as you state, should be fairly modest (and reuse should be very good anyway with 50+ channels).

Cheers
Gubbi

one · Sep 6, 2005

deathkiller said:
That should be in the main game loop I think.

User input can be in there, but how about OS / network? I mean including non-game related things (you can download things in background etc.)

Guden Oden · Sep 6, 2005

The 360 doesn't have an OS per se, and I don't think it supports any "downloads in the background" either (downloads of WHAT exactly?)...

Mefisutoferesu · Sep 6, 2005

Do you think the developers split the cache up to reduce thrashing? Say, they split the L2 into 480KB, 480KB, and 64KB... maybe even less, dunno how much L2 is needed for just audio streaming. That way they can work with the cores somewhat like your standard PC setup.

Might explain some older comments from Monolith about the setup being similar to a PC, and Itagaki saying it's a lot like working on the Saturn (IIRC Saturn had 2 separate CPUs, cache and all).

One thing I'm curious about: I thought you couldn't simply move physics to another thread, since there are so many dependencies between that and game code. Any thoughts on why they're able to separate them so distinctly? Perhaps I'm imagining there being more of a rift between the two than there really is?

Shifty Geezer · Sep 6, 2005

Physics can readily be separated into another thread. Just look at Ageia's PPU!

Mefisutoferesu · Sep 6, 2005

Then what was all that talk about dependencies and such? Granted my knowledge in this department is pretty limited, so I apologize for second guessing you, but surely things like collision detection can't just be whisked off to another thread without a good amount of pain?

Jawed · Sep 6, 2005

I can't find the thread in which this PDF came up, recently:

http://www.ects.com/conference/presentations/leigh_davies.pdf

Note the variety of threading models:

fork join
pipeline
work crew

Jawed
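As a rough illustration of what those three model names usually mean, here is a minimal sketch. It is my own reading of the terms and not taken from the slides, and the function names `fork_join`, `pipeline` and `work_crew` are made up for the example.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def fork_join(items, fn, workers=4):
    """Fork the work across threads, then join by waiting for every result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def pipeline(items, stages):
    """One thread per stage; each stage hands its output to the next via a queue."""
    done = object()  # sentinel marking the end of the stream
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def run_stage(fn, inbox, outbox):
        while True:
            item = inbox.get()
            if item is done:
                outbox.put(done)  # pass the shutdown signal downstream
                return
            outbox.put(fn(item))

    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(done)

    results = []
    while True:
        item = queues[-1].get()
        if item is done:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results

def work_crew(items, fn, workers=4):
    """A fixed crew of threads pulls tasks from one shared queue until it is empty."""
    items = list(items)
    tasks = queue.Queue()
    for index, item in enumerate(items):
        tasks.put((index, item))
    results = [None] * len(items)

    def worker():
        while True:
            try:
                index, item = tasks.get_nowait()
            except queue.Empty:
                return
            results[index] = fn(item)  # each index is written by one worker only

    crew = [threading.Thread(target=worker) for _ in range(workers)]
    for t in crew:
        t.start()
    for t in crew:
        t.join()
    return results
```

The nFactor2 breakdown posted earlier, with one fixed job per hardware thread, looks closest to the pipeline idea, though that is a guess from the slide rather than anything Inis has said.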

Fafalada · Sep 6, 2005

deathkiller said:
That should be in the main game loop I think.

The OS wouldn't be a part of the application, so no.

Guden Oden said:
The 360 doesn't have an OS per se

What makes you so sure about that?

Guden Oden · Sep 6, 2005

Coz it would be kind of overkill? While it's technically a computer, it's not a PC after all. I have a hard time seeing how it could use/need all the baggage of a full OS, with the memory and non-volatile storage requirements that entails...

Fafalada · Sep 6, 2005

Guden Oden said:
Coz it would be kind of overkill? While it's technically a computer, it's not a PC after all.

Tell that to MS and Sony, not me.

I'm not saying you'll get a full Windows XP in there. But if we're talking about something that will always be running in the background, occupying a certain amount of hardware resources (memory and processing power) - then well...

pipo · Sep 6, 2005

Fafalada said:
... a certain amount of hardware resources (memory and processing power) - then well...

Wasn't that 2% processing power (running in one of the threads) or something?

Titanio · Sep 6, 2005

Guden Oden said:
I don't know if it's possible to design much of anything on x360 to rarely block unless you write most code to run more or less entirely out of L1 caches. L2's going to thrash if heavily used by more than one core (especially if part of it is locked off for use by xenos).

I wonder though whether, with some models (a streaming model, for example), even a small amount of "locked" cache for a thread would be enough to avoid blocking? I suppose it would depend on the amount of computation per unit of data.

one said:
BTW, before someone points out "hardware threads more than 5 are redundant!! Mwahahaha!!" with Cell in mind, the slide notes (in the sentence above the thread utilization breakdown Titanio posted) that the more hardware threads you have, the more room for performance gain you have.

I think where you can parallelise, more hardware threads, more execution hardware certainly benefits you, especially if the stuff you're breaking up is computationally expensive. "With cell in mind" I think the usage outlined above actually vindicates/supports an approach like Cell's..a lot of the stuff beyond one core is floating point, streaming. It'd map very well to it indeed, and you could probably have multiple SPEs doing what one thread on a core is doing there (if you mapped the first two threads to the PPE, a whole SPE to audio - probably overkill - that'd leave 6 SPEs for physics and hair modelling, which could probably be parallelised to take advantage of more execution hardware (?))

BTW, was there anything else of interest on the slide?

Jawed · Sep 6, 2005

Titanio wrote: I wonder though whether, with some models (a streaming model, for example), even a small amount of "locked" cache for a thread would be enough to avoid blocking? I suppose it would depend on the amount of computation per unit of data.

In Cell there's 256KB shared between data and code. If that's enough memory to support a streaming algorithm, then Xenon with a 1/3 share of 1MB is going to be fine running the same algorithm, is it not?

Jawed

Qroach · Sep 6, 2005

This is interesting. I wish we had bigger pictures to look at.

Xbox 360 has an OS running in the background. That's how you can jump out of a game, and back to the dashboard. So there will be some overhead required for that functionality.

I see lots of people seem concerned about contention for the L2 cache. Is this really that much of an issue if you can lock a portion of the cache for a specific CPU to use?

Titanio · Sep 6, 2005

Jawed said:
Titanio wrote: I wonder though whether, with some models (a streaming model, for example), even a small amount of "locked" cache for a thread would be enough to avoid blocking? I suppose it would depend on the amount of computation per unit of data.

In Cell there's 256KB shared between data and code. If that's enough memory to support a streaming algorithm, then Xenon with a 1/3 share of 1MB is going to be fine running the same algorithm, is it not?

A thread could take 1/6th if you were splitting things evenly. You could give it one third (or one half or two thirds..), but obviously other threads would be dealing with less then.

I do think it'd be enough for that purpose, although you are taking away from what can be used for cache elsewhere.

scooby_dooby · Sep 6, 2005

It was made explicitly clear in the leaked 360 development document that cache management, and LOCKING cache areas were integral to high performance.

People keep saying the 3 cores using the same cache will cause threads to be overwriting each other's data etc, but wouldn't any dev worth his salt lock certain portions of the cache for each core? It only makes sense.

So it seems like they can choose between 3 cores with about 341KB each, or 2 cores with 512KB each, or even some combination of the above where maybe one core needs very little cache for whatever reason and it could be 400KB/400KB/200KB.

Is this fairly accurate?
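The split sizes floated in this thread are easy to check against the 1MB (1024KB) shared L2. This is only arithmetic on the numbers posted here; whether the hardware actually lets you lock regions of those sizes is not something the thread establishes, and `even_split` and `fits` are illustrative names.

```python
L2_KB = 1024  # shared L2 size, 1MB

def even_split(parts, total_kb=L2_KB):
    """Cache per consumer if the L2 were divided evenly."""
    return total_kb / parts

def fits(split_kb, total_kb=L2_KB):
    """True if the proposed per-core sizes fit within the L2."""
    return sum(split_kb) <= total_kb

# Splits suggested at various points in the thread, in KB.
proposals = {
    "even, per core": [even_split(3)] * 3,
    "two cores only": [512, 512],
    "uneven": [400, 400, 200],
    "audio gets a sliver": [480, 480, 64],
}

if __name__ == "__main__":
    print("per core  :", round(even_split(3), 1), "KB")
    print("per thread:", round(even_split(6), 1), "KB")
    for name, split in proposals.items():
        print(name, split, "fits" if fits(split) else "too big")
```

An even three-way split works out to roughly 341KB per core and about 171KB per hardware thread, which is still less than the 256KB local store of a Cell SPE that came up earlier.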
