A question about tri-core game programming......

Alstrong said:
Is the amount of lockable cache arbitrary?


http://arstechnica.com/articles/paedia/cpu/xbox360-1.ars/5

Xenon's L1 and L2 caches can function in the conventional manner described above, but they can also function quite differently. More specifically, Xenon invests programmers with an unprecedented level of control over how their applications use the caches. Insofar as they can fall under the explicit control of the programmer, the Xenon's caches, and its L2 cache in particular, can function remarkably like the "local storage" that's attached to each of the Synergistic Processing Elements (SPEs) in IBM's Cell processor. (For more on the Cell and its SPEs, see Part I and Part II of my Cell coverage.)....cont'd.....

In the Xenon's L2 cache, an arbitrary number of the sets can be locked and turned into FIFO queues for private, exclusive use by individual data generation threads. The sets that aren't locked look to the Xenon like normal L2 cache. This means that non-write-streaming threads can use the non-locked L2 cache space normally, as can threads that are write-streaming.

A write-streaming data generation thread that has its own private locked set can also access the pool of generally available, non-locked L2 cache just like any other thread, but with the exception that it can't use too much of it. A write-streaming thread is not allowed to get greedy and use too much L2, because the system will restrict its L2 usage so that it plays nicely with the other running threads.

In sum, write streaming allows the programmer to carve up the L2 cache into small chunks of private, local storage shared between each thread and the GPU. This local storage, in the form of a FIFO queue, acts kind of like a pipe that transports data directly through the L2 without allowing that data to spill over and pollute the rest of the L2.
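
A rough way to picture one of those locked-set FIFOs is as a software ring buffer that a data-generation thread pushes into and a consumer (the GPU, in the write-streaming case) drains. A minimal sketch in plain C++ follows; the names and structure are made up for illustration, and the real mechanism is of course handled by the cache hardware rather than by code like this.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a locked L2 set used as a FIFO: one producer
// thread appends fixed-size records, one consumer (e.g. the GPU) drains them.
struct LockedSetFifo {
    static const std::size_t kSlots = 64;     // capacity of the "locked" region
    std::atomic<uint32_t> head{0};            // next slot the consumer reads
    std::atomic<uint32_t> tail{0};            // next slot the producer writes
    float slots[kSlots][4];                   // e.g. one transformed vertex per slot

    // Producer side: returns false when the queue is full (back-pressure).
    bool push(const float v[4]) {
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kSlots)
            return false;                     // consumer hasn't caught up yet
        for (int i = 0; i < 4; ++i) slots[t % kSlots][i] = v[i];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    // Consumer side: returns false when there is nothing to read.
    bool pop(float out[4]) {
        uint32_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;
        for (int i = 0; i < 4; ++i) out[i] = slots[h % kSlots][i];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

The point of the FIFO shape is exactly what the article describes: the data flows straight through a small, fixed region instead of spilling over and polluting the rest of the L2.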
 
ah, thank you :)


So what would be a "common configuration" for each hardware thread or core if you can't really separate game/AI/physics/graphics work because they are linearly dependent :?:
 
scooby_dooby said:
i thought each core could partition some of the cache and use it exclusively.
For use as a FIFO queue for Xenos to read, yes. More than that hasn't been specified, as far as I know.


Shifty Geezer said:
Here's a question : will 6 threads be beneficial?

As I understand it, the point of the dual hardware threads is that while one thread sits idle, the CPU can work on another. But as the cores are in-order and devs are told to be careful how they manage the cache, so we should see fewer cache misses and lower latencies (?), won't there be less thread idling? Won't six threads working on that cache cause more slowdown, due to lack of available data, than three threads? Or will the second thread per core be rigged to use data already available in the cache? Or will the second thread operate fairly transparently and effectively?
If you're going to have 6 threads, you're probably smart enough to make sure some of them don't hit L2 much. XeCPU has read- and write-streaming modes that are ideal for animation: read vertices directly into L1 and bypass L2. Some physics work might also benefit from those modes.

Anyway, even if you split up 1 MB 6 ways, you still have about 167 KB per thread. That's not a ton but it's not bad for a closed system like the 360.
 
What does it mean by this?
L2 cache, an arbitrary number of the sets can be locked and turned into FIFO queues for private, exclusive use by individual data generation threads
Let me make an *educated guess*. The TLB does not touch those regions that are marked 'lock'. So no, you don't want to lock down large regions of L2, because the TLB will have even less space to work with, and the L2 thrashes worse.

The cache locking gives us a very good FIFO implementation. Still does not address my original *guessed problem* of L2 contention. Actually my issue also concerns bandwidth and latency. Put it this way: a CPU with an L2 of its own -> latency between L2 and CPU (or L1?) = ~10-20 cycles? Now think about an L2 that needs to serve 3 CPUs. Every time a CPU issues a request, the L2 may well still be serving requests from the other CPUs.

A reminder that this is all guesswork here. Certainly one needs to work and experiment with the final hardware to see if these are valid concerns.
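
To put rough numbers on that guess (all of them assumed, none measured): if a private L2 answered in ~15 cycles and a shared L2 serviced requests strictly one at a time, a core arriving behind the other two would see something like

$$
t_{\text{effective}} \approx t_{\text{L2}} \times (1 + n_{\text{queued}}) = 15 \times (1 + 2) = 45\ \text{cycles},
$$

though a banked, pipelined L2 would overlap requests and do considerably better than that worst case.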
 
Anyone know some good resources on multithreaded programming and load balancing etc? I've always wanted to know more about this stuff.
 
passby - That's how I see it. I think that the lock-down will see more use turning the L2 into per-core LS and leaving devs to manage data fetching on their own in a lot of instances, e.g. for streaming, say 64 KB is locked: 32 KB in use, 32 KB as a buffer, and the dev issues the data requests. As cache contention can introduce unknown delays, and the cores aren't OOO so a delay will stall the process, it'll make more sense for devs to manage data fetches this way themselves and be sure of what data's available and how it'll impact thread performance. If so, it's very similar to Cell, which wouldn't be surprising, as during design I'm sure IBM were aware of the potential pitfalls of in-order cores hitting the same cache.

Don't know what the cache management is like on XeCPU though to confirm this. And are there any cache systems in place that could negate this problem?
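
A sketch of the double-buffered streaming pattern described above (32 KB being processed while the other 32 KB fills), written as ordinary portable C++; fetch_chunk and process_chunk are hypothetical stand-ins, and a real XeCPU title would use whatever prefetch/locking facilities the SDK exposes rather than std::async.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

// "32 KB in use, 32 KB buffer": process one half of a locked region while
// the next chunk streams into the other half, then swap.
static const std::size_t kHalf = 32 * 1024;

// Stand-in for whatever actually fills the locked region (prefetch, DMA, ...).
void fetch_chunk(const char* src, char* dst, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}

// Stand-in for the real per-chunk work (skinning, decompression, ...).
long process_chunk(const char* data, std::size_t bytes) {
    long sum = 0;
    for (std::size_t i = 0; i < bytes; ++i) sum += data[i];
    return sum;
}

long stream_all(const std::vector<char>& source) {
    std::vector<char> bufA(kHalf), bufB(kHalf);
    char* front = bufA.data();   // half being processed
    char* back  = bufB.data();   // half being filled

    std::size_t offset = 0, total = source.size();
    std::size_t cur = std::min(kHalf, total);
    fetch_chunk(source.data(), front, cur);            // prime the first half
    long result = 0;

    while (cur > 0) {
        std::size_t next = std::min(kHalf, total - (offset + cur));
        // Kick off the next fetch, then work on the current chunk.
        auto pending = std::async(std::launch::async, fetch_chunk,
                                  source.data() + offset + cur, back, next);
        result += process_chunk(front, cur);
        pending.wait();

        std::swap(front, back);
        offset += cur;
        cur = next;
    }
    return result;
}
```

The win is predictability: the thread only ever touches data it asked for in advance, so an in-order core isn't left stalling on a surprise miss.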
 
ShootMyMonkey said:
mistwalk said:
About tri-core game programming......
Could it be programmed to...
CPU0-Gameplay,AI,music
CPU1-Physical calculation
CPU2-Graphic calculation
?
This is kind of a problematic arrangement. Not impossible, but problematic in several ways. You can't separate gameplay and AI from physics and graphics because they're linearly dependent. AI, for example, needs to know the state of the world (where everything is, what it's doing -- physics is kind of a small part of that, wouldn't you say?). You could theoretically do physics for the next frame while doing AI for the current one -- you just need to watch out if you do that, though, because the cache is so small for 3 CPUs to share and you have to be able to store backup world states.

Similarly, both gameplay and AI include things like putting characters (whether player or NPC) into certain states, deciding animations, conditions, etc. That includes data the graphical calculations depend on to get things like which animations to skin to.

My thinking exactly.

And there's more:
Think about what happens if you let the gameplay thread freewheel independent from the graphics thread.
Unless you synchronize the threads (which will eventually come down to idling the cores while they wait for each other), you introduce extra latency.

The latency will be 1 to 2 frames. That's when the gameplay thread runs as fast as the graphics thread. On top of that you get refresh latency.
So a scene at 60fps in this console game will lag 1 to 4 frames behind.
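
For a rough sense of scale, at a 60 fps target:

$$
1\ \text{frame} = \tfrac{1}{60}\ \text{s} \approx 16.7\ \text{ms},
\qquad
1\ \text{to}\ 4\ \text{frames} \approx 17\ \text{to}\ 67\ \text{ms of lag}.
$$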
 
Anyone know how much power the tri-core @ 3.2GHz consumes?

mistwalk said:
"A low-power 970FX consumes between 13W and 16W at frequencies of 1.2GHz, 1.4GHz and 1.6GHz."


So if I read this right, a 970FX at 3.2GHz will consume 32W and at 10GHz will consume 100W. ;) :LOL:
 
Alstrong said:
So what would be a "common configuration" for each hardware thread or core if you can't really separate game/AI/physics/graphics work because they are linearly dependent :?:


So yeah... what's the answer to this? :| Or at least one possible good way to use 6 threads? Everyone seems to be downplaying multicore in this thread.
 
It's just going to be tough at first to find uses for all those 6 threads at once.

Initially, devs will only use one core, and the second one for sound. It will take some time to see fully optimised applications using 6 threads at any one time.

Like it will take some time to see applications using all 7 SPEs on Cell...

New architectures need time to be exploited, saying one architecture is "useless" is a bit dumb in my opinion. Obviously it's not useless otherwise MS wouldn't have gone for it.
 
There's no definitive answer. It all depends on the program. The needs of a fighter game would be very different to the needs of a racer or a sports game, for example. One application might see separate threads for AI, physics and game code; another might use one thread for model synthesis, a second for texture synthesis, and a third for fluid dynamics.

I don't think any comparisons can be drawn with existing single thread consoles and PCs. It's a brave new world and until devs actually get to work on the hardware there's no way of knowing how they'll prefer to use it.
 
mistwalk said:
"A low-power 970FX consumes between 13W and 16W at frequencies of 1.2GHz, 1.4GHz and 1.6GHz."
Part of the reason that's possible is that at those low clock speeds, you can also operate at very low voltages. Xenon's cores, I believe, can reach 3.2 GHz even at a core voltage of something like 1.1 V (possibly even 1.0 V). The 970FX can't do that at all.

Say it could handle 3.2 GHz at 1.5 V (which I know it can't, but hypothetically) -- that would mean one 970FX core is guaranteed to eat up more than double what one X360 core would. Even under the false assumption that the 970FX and X360 are otherwise identical cores, that 0.4 V difference alone accounts for an 86% increase in power.
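
For reference, that figure comes straight from the usual dynamic-power approximation, holding frequency and switched capacitance equal in this hypothetical comparison:

$$
P \propto C V^2 f
\quad\Rightarrow\quad
\frac{P_{1.5\,\text{V}}}{P_{1.1\,\text{V}}} = \left(\frac{1.5}{1.1}\right)^2 \approx 1.86,
$$

i.e. roughly an 86% increase from the voltage difference alone.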

Shifty said:
I don't think any comparisons can be drawn with existing single thread consoles and PCs. It's a brave new world and until devs actually get to work on the hardware there's no way of knowing how they'll prefer to use it.
I guess the point we are seeing is that getting any real performance out of 360 or PS3 will never really happen without sufficient TLP. What with a ~500-cycle penalty for a cache miss on 360, it's as if single-threaded performance will only be seen if you never, ever, for any reason, cross your heart, hope to die, on your mother's grave... access memory.
 
passby said:
Let me make an *educated guess*. The TLB does not touch those regions that are marked 'lock'. So no, you don't want to lock down large regions of L2, because the TLB will have even less space to work with, and the L2 thrashes worse.

The cache locking gives us a very good FIFO implementation. Still does not address my original *guessed problem* of L2 contention. Actually my issue also concerns bandwidth and latency. Put it this way: a CPU with an L2 of its own -> latency between L2 and CPU (or L1?) = ~10-20 cycles? Now think about an L2 that needs to serve 3 CPUs. Every time a CPU issues a request, the L2 may well still be serving requests from the other CPUs.

A reminder that this is all guesswork here. Certainly one needs to work and experiment with the final hardware to see if these are valid concerns.

let me second your guesswork. basically in my ears the XeCPU design screams 'fine-grained', with a brass section on backing vocals. speaking of which, btw, it'd be interesting to see if MS will supply any fine-grain parallelizing tools (i.e. compilers), as that level of parallelism is arguably more remote to the human mindset than the coarse-grain parallelism of independent tasks.
 
squeak, i guess you've read passby's posts earlier on, and you want to formalise things a bit.

coarse-grained parallelism is built around the notion that the work is split into relatively independent 'locales', and the separate work threads need only rendezvous now and then (be that implicitly or explicitly).

so in XeCPU we have a shared l2 with locking; whereas the locking provides a nice fifo mechanism for the data that are meant to cross locale boundaries (i.e. the producer/consumer model), the data not meant to cross locale boundaries only get penalized by the shared l2, due to both a higher eviction rate and longer access latencies (compared to the same l2 not shared), and are penalized even further by the l2 locking scheme, as the locked parts do not participate in the memory caching scheme.

so basically XeCPU's caching scheme benefits inter-locale data flow and penalizes intra-locale data flow. of course it's not a true-false separation, it's just that the strengths of the XeCPU's architecture are well into the fine-grained end of the spectrum.
 
With streams, you basically have a lot of tasks that are all sequentially performed on incoming data. So, a stream performs a number of modifications sequentially on all data structures that travel through it. Like a factory.

Data structures can be added to the queue that is going to perform the next operation. And they can skip parts of it as well. Like a car: you might want to skip adding the air conditioning, and would want to send it to the dryer after applying the paint.

So, you divide your programming model in many independent operations, give them all a queue and a means to define where they have to go next. Essentially, that is the same kind of model as the current one, only it uses data streams to define what actions are to be performed, instead of sequential logic or events (messages).

So, you have many small tasks (threads), that are only activated when there is something waiting in their queue. And one process that manages everything. Like a custom, stream-centered operating system. Because that's what an OS does: manage IO, collect input, manage the memory (and cache) use, and give all the processes (threads) their slices of processing time.

The number of streams and the amount of data in them are mostly limited by memory bandwidth in most next-gen computers.
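
A bare-bones, single-threaded version of that queue-driven model, just to make the idea concrete; the stage names, the routing field and the little scheduler are illustrative, not any particular console SDK.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <string>

// Each "stage" owns a queue of work items and a transform to apply to them.
struct WorkItem {
    std::string data;
    std::string next_stage;   // where the item goes after this stage (empty = done)
};

struct Stage {
    std::deque<WorkItem> queue;
    std::function<WorkItem(WorkItem)> transform;   // e.g. "apply paint", "skip airco"
};

// The "custom, stream-centered operating system": wake a stage only when its
// queue is non-empty, run its transform, and route the result onward.
// Runs until every queue has drained.
void run_until_drained(std::map<std::string, Stage>& stages) {
    bool did_work = true;
    while (did_work) {
        did_work = false;
        for (auto& entry : stages) {
            Stage& s = entry.second;
            if (s.queue.empty()) continue;            // stage stays idle
            WorkItem item = s.queue.front();
            s.queue.pop_front();
            WorkItem out = s.transform(item);
            if (!out.next_stage.empty())
                stages[out.next_stage].queue.push_back(out);
            did_work = true;
        }
    }
}
```

In a real engine each stage would live on its own thread (or hardware thread), and the manager would only wake a stage when something is actually waiting in its queue.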
 
london-boy said:
Initially, devs will only use one core, and the second one for sound. It will take some time to see fully optimised applications using 6 threads at any one time.

They may also lift out physics and drop it on a different core. NovodeX is already set up for this, if you've read their SDK docs. They could also put the animation system on another core, maybe some pathfinding stuff, whatever.

There are a bunch of fairly easy ways you can at least take partial advantage of more than one core on Xenon by lifting out fairly big chunks of engine code and dropping them into a separate thread.

Not saying you will get great utilization of all the cores without something more sophisticated, but for a quick and dirty performance improvement, especially for a tight schedule, I'd think it might be ok.
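
As a sketch of that "lift out a big chunk" approach, here is the shape of running physics for the next frame on its own thread while the main thread works from last frame's results. std::thread and the WorldState struct are purely illustrative; a real engine on that hardware would use its own threading layer and world representation.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

struct WorldState { std::vector<float> positions, velocities; };

// The "big chunk" lifted out of the main loop.
void step_physics(WorldState& world, float dt) {
    for (std::size_t i = 0; i < world.positions.size(); ++i)
        world.positions[i] += world.velocities[i] * dt;
}

void run_frame(WorldState& physics_world, WorldState& render_world) {
    const float dt = 1.0f / 60.0f;

    // Physics for the *next* frame runs on another core...
    std::thread physics(step_physics, std::ref(physics_world), dt);

    // ...while the main thread does gameplay, AI and rendering off a copy of
    // last frame's results, so the two never touch the same data.
    // update_gameplay(render_world);
    // render(render_world);

    physics.join();                 // sync point at the end of the frame
    render_world = physics_world;   // publish the new state for next frame
}
```

Quick and dirty, as said: one join per frame and a copied world state won't saturate three cores, but it buys a chunk of parallelism without restructuring the whole engine.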
 