nFactor2 - an engine on X360

Titanio said:
I think where you can parallelise, more hardware threads, more execution hardware certainly benefits you, especially if the stuff you're breaking up is computationally expensive.
Well, you have to store the code somewhere too, so if the programs that process your streaming data are bulky, they're still going to jostle with each other in the cache... Ideally, then, code would process small or streaming data sets in tight loops. I'm not sure how much actual processing work can be implemented in such a manner, however.
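
To illustrate the kind of tight loop I mean - a handful of instructions resident in the I-cache while the data just flows through once (a made-up sketch in C, not from any real engine):

#include <stddef.h>

/* A tight streaming kernel: the code footprint is tiny, so it stays
   resident in the instruction cache while the data passes through once. */
void scale_bias(float *dst, const float *src, size_t n, float s, float b)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * s + b;  /* data streams, code stays put */
}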

Jawed said:
In Cell there's 256KB shared between data and code. If that's enough memory to support a streaming algorithm, then Xenon with a 1/3 share of 1MB is going to be fine running the same algorithm, is it not?
Except, as you know, SPU RAM isn't cache. With a cache you typically can't be sure that a particular piece of code/data will actually be resident, since it functions transparently and invisibly to the processor... Of course, these days there's automatic and explicit prefetching available, and cache lines can be locked (in some CPUs anyway) so their contents don't get pushed out. It'll be interesting to see what coders come up with to circumvent the limits of each platform.
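
For what it's worth, the explicit-prefetch part can be expressed fairly portably; GCC's __builtin_prefetch is one real example (cache-line locking has no portable equivalent, you need platform intrinsics for that). A minimal sketch, with the prefetch distance an assumption to be tuned:

#include <stddef.h>

float sum_stream(const float *src, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Ask for data 8 iterations ahead so it's (hopefully) in cache
           by the time we get there. 8 is a made-up tuning value. */
        if (i + 8 < n)
            __builtin_prefetch(&src[i + 8], 0 /* read */, 3 /* keep */);
        acc += src[i];
    }
    return acc;
}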

Titanio said:
A thread could take 1/6th if you were splitting things evenly. You could give it one third (or one half, or two thirds...), but obviously the other threads would then be dealing with less.
One thing I've not been able to confirm, though, is whether the xCPU is actually capable of doing this. It can lock off a piece for use by Xenos, and I assume it can lock certain lines as, for example, the R4300i in the N64 (or Gekko in the GC) can. But can it really partition off an (arbitrary) amount for one particular core? For one particular hardware thread, even?

Then again, the ability to partition off pieces of cache is by no means a guarantee of better performance... Look at the Celeron with its crippled cache size; those chips have an uncanny way of sucking, performance-wise. And a low-end, Celeron-sized cache coupled with a high-end clock rate does indeed spell trouble as far as performance is concerned. But like I said, it'll be interesting to see how things work out in the end. I'm optimistic! :D
 
What use would there be for a core without access to an L2 cache? If memory serves, the first Celeron had no L2 and performance tanked... but that's a PC thing. Would anything in a game run well without requiring L2?
 
Guden Oden said:
Well, you have to store the code somewhere too, so if the programs that process your streaming data are bulky, they're still going to jostle with each other in the cache...

I mentioned "seperate execution hardware" for that reason, with seperate memory too..so different threads wouldn't be competing for that memory. Just one thread on one unit/piece of memory.

Guden Oden said:
One thing I've not been able to confirm, though, is whether the xCPU is actually capable of doing this. It can lock off a piece for use by Xenos, and I assume it can lock certain lines as, for example, the R4300i in the N64 (or Gekko in the GC) can. But can it really partition off an (arbitrary) amount for one particular core? For one particular hardware thread, even?

Well I've been assuming so, but I'm not entirely sure. I guessed if you could slice off a piece of the cache for the GPU you could do so arbitrarily also. Can anyone clarify?

Guden Oden said:
Then again, the ability to partition off pieces of cache is by no means a guarantee of better performance... Look at the Celeron with its crippled cache size; those chips have an uncanny way of sucking, performance-wise. And a low-end, Celeron-sized cache coupled with a high-end clock rate does indeed spell trouble as far as performance is concerned. But like I said, it'll be interesting to see how things work out in the end. I'm optimistic! :D

Well, if you lock off part of the cache, I don't think you'd be using it as cache any longer - locking says to me that all automatic management of that memory stops, and it's then up to the coder to explicitly determine what happens. That can be very beneficial depending on what you're doing - you'd probably adopt a model where data comes in and goes out consistently and predictably.
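
Something like a ping-pong scheme over the explicitly managed region, I'd imagine. A rough sketch of the model (the fill/consume helpers are hypothetical stand-ins for whatever the real transfer mechanism is - on real hardware the fill would be an asynchronous transfer so it overlaps the consume):

#include <stddef.h>

#define BUF_WORDS 1024

extern void fill_buffer(float *buf, size_t n);          /* hypothetical */
extern void consume_buffer(const float *buf, size_t n); /* hypothetical */

/* Ping-pong over two halves of an explicitly managed region: while one
   half is consumed, the next chunk is brought into the other half. */
void stream_pingpong(size_t chunks)
{
    static float region[2][BUF_WORDS];  /* stands in for a locked region */
    int cur = 0;

    fill_buffer(region[cur], BUF_WORDS);
    for (size_t c = 0; c < chunks; c++) {
        int nxt = cur ^ 1;
        if (c + 1 < chunks)
            fill_buffer(region[nxt], BUF_WORDS);
        consume_buffer(region[cur], BUF_WORDS);
        cur = nxt;
    }
}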

MoeStooge said:
Hair simulation? I laughed at that.

The dude looks like he has some funky hair :D
 
From a whitepaper leaked WAAAAY back in June 2004:

"Each of the three cores includes a 32-KB L1 instruction cache and a 32-KB L1 data cache. The three cores share a 1-MB L2 cache. The L2 cache can be locked down in segments to improve performance. The L2 cache also has the very unusual feature of being directly readable from the GPU, which allows the GPU to consume geometry and texture data from L2 and main memory simultaneously.
Xenon CPU instructions are exposed to games through compiler intrinsics, allowing developers to access the power of the chip using C language notation."
http://forums.xbox-scene.com/index.php?showtopic=231928
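
To give a flavour of what "compiler intrinsics using C language notation" looks like in practice, here's generic AltiVec/VMX via <altivec.h> - Xenon's VMX128 intrinsics are a superset of this, and the exact devkit names may differ:

#include <altivec.h>  /* compile with -maltivec on a PowerPC target */

/* d[i] = a[i] * b[i] + c[i], four floats per VMX operation.
   vec_ld/vec_st assume 16-byte-aligned pointers and n divisible by 4. */
void vmadd(const float *a, const float *b, const float *c, float *d, int n)
{
    for (int i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, &a[i]);
        vector float vb = vec_ld(0, &b[i]);
        vector float vc = vec_ld(0, &c[i]);
        vec_st(vec_madd(va, vb, vc), 0, &d[i]);
    }
}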

wikipedia:
"When running procedural synthesis algorithms, one of the Xenon CPU's cores may "lock" a portion of the 1 MB shared L2 cache. When locked, a segment of cache no longer contains any prefetched instructions or data for the CPU, but is instead used as output space for the procedural synthesis thread. The Xenos GPU can then read directly from this locked cache space and render the procedurally generated objects. The rationale behind this design is that procedurally synthesized game content can be streamed directly from CPU to GPU, without incurring additional latency by being stored in system RAM as an intermediary step. The downside to this approach is that when part of the L2 cache is locked, there is even less data immediately available to keep the 3 symmetric cores in the Xenon CPU running at full efficiency (1 MB of shared L2 is already a rather small amount of cache for 3 symmetric cores to share, especially considering that the Xenon CPU does not support out-of-order execution to more efficiently use available clock cycles)."
http://en.wikipedia.org/wiki/Xbox_360
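
As an aside: the actual locked-cache FIFO setup isn't public, but the data flow described there is essentially a single-producer ring buffer - the synthesis thread writes into a fixed-size window while the GPU drains it behind. Purely illustrative sketch (the real mechanism is hardware-level, and real code would also need memory barriers):

#include <stddef.h>
#include <stdint.h>

#define RING_BYTES (512 * 1024)  /* stand-in for a locked L2 segment */

struct ring {
    uint8_t buf[RING_BYTES];
    volatile size_t head;  /* advanced by the producer (CPU thread) */
    volatile size_t tail;  /* advanced as the consumer (GPU) drains */
};

/* Producer side: append a procedurally generated blob, spinning if the
   consumer hasn't freed enough space in the window yet. */
void ring_write(struct ring *r, const void *src, size_t n)
{
    while (RING_BYTES - (r->head - r->tail) < n)
        ;  /* wait for the consumer to catch up */
    for (size_t i = 0; i < n; i++)
        r->buf[(r->head + i) % RING_BYTES] = ((const uint8_t *)src)[i];
    r->head += n;
}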

Arstechnica's overview:
"Xenon's scheme has the advantage that it is more dynamically adaptable to the needs of the application, since it's a single store that can be partitioned dynamically. However, what the Xenon gains in adaptability it loses in flexibility, since unlike the SPE local storage, which is just a flat memory space that can be used in any way the programmer sees fit, the Xenon's write buffers can only be configured in one specific way and for one specific purpose (as described above).
Before finishing off the topic of locked sets in the L2 cache, it's important to note that while I may have gleaned this information from a Microsoft patent, a similar technique is already in use in an IBM chip: the Nintendo Gamecube's "Gekko" processor. The Gekko allows a programmer to lock half of the chip's 32KB, 8-way set associative L1 cache to prevent streaming writes from polluting the L1. Data from the locked half of the L1 bypasses the L2 entirely and goes directly to the Gekko's bus interface unit and out onto the memory bus.

The Gekko's L1 cache locking scheme doesn't provide the same degree of control and flexibility as the Xenon's L2 cache locking scheme described above, but it doesn't have to. Gekko is a single-issue, single-threaded design, whereas Xenon can have up to six simultaneous threads running. "
http://arstechnica.com/articles/paedia/cpu/xbox360-1.ars
 
I think that sums it up. It's far from flexible or efficient. That's why MS clearly states in their papers that devs should try hard to avoid cache locks (which is not very easy when using 3 (or 6 virtual) threads at once).

On the other hand, that's also not applicable to CELL. Each of CELL's SPEs has 256KB of LS RAM (not a cache) which it can access independently, without being disturbed or pushed into a wait state the way it happens on the Xbox when the cache is accessed by any of the cores.
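
For contrast, the SPU-side pattern looks roughly like this - a sketch using the Cell SDK's MFC intrinsics from spu_mfcio.h as I understand them; the chunk size and buffer name are made up:

#include <spu_mfcio.h>

#define CHUNK 16384  /* made-up chunk size; must fit in the 256KB LS */

static char ls_buf[CHUNK] __attribute__((aligned(128)));

/* Pull one chunk from main memory into local store via DMA and wait for
   it to complete - no cache involved, no contention with other cores. */
void fetch_chunk(unsigned long long ea)
{
    const unsigned int tag = 0;
    mfc_get(ls_buf, ea, CHUNK, tag, 0, 0);  /* async DMA: EA -> LS */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();              /* block until the DMA is done */
}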
 
Thanks for the links scooby.

This bit seems relevant too:

Read streaming

The other way that the programmer can explicitly control the cache hierarchy is by placing a thread in read streaming mode. Like write streaming, read streaming provides a way of preventing streaming media data that won't be reused from polluting the caches. When a thread is in read streaming mode, it can read data directly from the system bus into either the L1 cache or the registers. In other words, the programmer has the option of bypassing the L2 and streaming data directly into the L1, or of bypassing the entire cache hierarchy and streaming data directly into the registers.

It's a little bit more complex than I thought: you can't just lock a part of the cache and use it arbitrarily, as I had assumed previously.

It seems that you can lock some unknown portion of the cache for writing to only. The same thread can use the rest of the cache as normal (although not "too much"), but you can't use that locked portion of the cache for reading in data, just for writing to (the article even refers to these as "write buffers"). The article also suggests you can only do this for the purposes of allowing the GPU to directly read the data in the locked portion.

For reading, you can't lock a portion of the L2 cache, but you can lock some portion of the L1 cache or you can read data directly into registers.

So it doesn't seem like you can mimic something like the SPU's local sram with the cache locking. It would have been quite useful to be able to lock part of the cache for arbitrary reading and writing.
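
On the read streaming point, the closest portable analogy I can think of is a non-temporal prefetch hint - e.g. GCC's __builtin_prefetch with locality 0, which says the data won't be reused and shouldn't displace useful cache contents. The real Xenon mode is a per-thread hardware state rather than a per-access hint, so this is only an analogy:

#include <stddef.h>

/* Checksum a large one-pass stream. Locality hint 0 marks the data as
   having no reuse, approximating a "read streaming" access pattern. */
unsigned checksum_stream(const unsigned *src, size_t n)
{
    unsigned acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&src[i + 16], 0 /* read */, 0 /* no reuse */);
        acc ^= src[i];
    }
    return acc;
}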
 
Nemo80 said:
I think that sums it up. It's far from flexible or efficient. That's why MS clearly states in their papers that devs should try hard to avoid cache locks (which is not very easy when using 3 (or 6 virtual) threads at once).

Actually, they clearly state that only by making good use of the cache management tools provided will you be able to extract the potential performance from this CPU.

Version... anyway, could you send me that X360 devkit document again? It had some details on cache management, and outlined exactly how integral it was to good performance.
 
I think this direct CPU-GPU connectivity is very interesting for physics; perhaps it could operate the same way procedural synthesis is supposed to: the CPU creates the basic computational instructions for creating the image and then leaves the rest to the GPU.
(by the way, they are not virtual threads, they are real)
 
Titanio said:
Thanks for the links scooby.

This bit seems relevant too:



It's a little bit more complex than I thought: you can't just lock a part of the cache and use it arbitrarily, as I had assumed previously.

It seems that you can lock some unknown portion of the cache for writing to only. The same thread can use the rest of the cache as normal (although not "too much"), but you can't use that locked portion of the cache for reading in data, just for writing to (the article even refers to these as "write buffers"). The article also suggests you can only do this for the purposes of allowing the GPU to directly read the data in the locked portion.

For reading, you can't lock a portion of the L2 cache, but you can lock some portion of the L1 cache or you can read data directly into registers.

So it doesn't seem like you can mimic something like the SPU's local sram with the cache locking. It would have been quite useful to be able to lock part of the cache for arbitrary reading and writing.

I think there're two different types of cache management in the x360.

First, there are the tools to allow them to manage cache between threads. Although we have very few details on this, it's obvious from the MS dev papers that cache management is essential to good performance, so we can assume they're providing the tools to implement it.

The other type of cache management is for procedural synthesis, in this case a portion of the cache is locked so the GPU can write directly to it.

I don't think the two should be confused.

Nemo80 - that's not the document I'm referring to. A couple of months ago they released a doc for people moving from alpha to beta kits (G5s to Xenon); they outlined where performance was better or worse, and what to do to maximize performance, and made several references to cache management as being key, IIRC.
 
Lysander said:
(by the way, they are not virtual threads, they are real)

No, they are not, since there is only one VMX unit per core. Read the dev paper above. Even MS speaks of only 3 threads...

Additionally, there's an interview with the creator of Far Cry speaking about the threading capabilities of Xenon and CELL, and he says something about "1.5 times the performance" (by using 2 threads on one core -> hyperthreading) on Xenon, whereas there is a real doubling of performance on CELL. So these are just virtual threads, just like Intel's Hyperthreading, in contrast to an AMD X2 (simple comparison) :)
 
scooby_dooby said:
I think there're two different types of cache management in the x360.

First, there are the tools to allow them to manage cache between threads. Although we have very few details on this, it's obvious from the MS dev papers that cache management is essential to good performance, so we can assume they're providing the tools to implement it.

The other type of cache management is for procedural synthesis, in this case a portion of the cache is locked so the GPU can write directly to it.

I don't think the two should be confused.

Indeed, I don't think they should be either. The "cache management" they refer to may have nothing to do with what we're discussing here.

Nemo - I asked in another thread, and I'll ask again: what makes you think there's a second VMX unit in the PPE? That would completely change the floating point numbers already publicised by STI. Are you sure it's not two sets of registers, perhaps?
 
No, they are real; software threads are virtual. The X360 CPU has hardware threads like the Power5 chip. Yes, each core has 1 VMX unit but 2 register sets, one for each thread; the VMX unit operates the same way the other execution units do, switching between both threads.
 
scooby_dooby said:
The other type of cache management is for procedural synthesis, in this case a portion of the cache is locked so the GPU can write directly to it.

The GPU cannot write to the cache; it can only read (AFAIK).
 
Lysander said:
No, they are real; software threads are virtual. The X360 CPU has hardware threads like the Power5 chip. Yes, each core has 1 VMX unit but 2 register sets, one for each thread; the VMX unit operates the same way the other execution units do, switching between both threads.

Yes, and each VMX unit on CELL has 32 registers as well, so what's the difference then? BTW, this is what Crytek says (translated from German with Google :)):

If you ask the hardware manufacturers, of course it isn't like that. But if you analyze it [Xenon] as a software developer, it is nothing other than Hyperthreading. That is, you have six threads, but really only three times 1.5 threads. With Cell on the PlayStation 3 it looks different: the main CPU has two threads (somewhat better than Hyperthreading), and on top of that come seven synergistic processors.
 
Nemo80 said:
Yes, and each VMX unit on CELL has 32 registers as well, so what's the difference then? BTW, this is what Crytek says (translated from German with Google :)):

Can you link to that article Nemo?

Also, why isn't the PPE rated at 64Gflops if there are indeed 2 VMX units?
 
Nemo80 said:
The GPU cannot write to the cache; it can only read (AFAIK).

right, my bad.

Titanio, the cache management they speak of may very well be exactly what we're talking about in this thread. Whether or not they can 'partition' the cache as they see fit is the question, is it not?

To what extent can coders control and manage this cache, isn't that what we're trying to find out?

edit - misspelled your name again...
 