nFactor2 - an engine on X360

Nemo80 said:
Maybe this still points to the first CELL revision? As I said, IBM have changed this over time.

Yes, but I'd expect they'd have announced the change. Also, the second revision was known before Sony announced the PS3 spec.


Nemo80 said:
this is the original interview (in german though):

http://www.gamestar.de/dev/pdfs/crytek.pdf

Thanks, he does actually say that. That's quite interesting. I wonder what he means precisely. But the comment on the threading is timely and relevant to this discussion, certainly (and the other thread here).

Haven't read the whole interview yet, but there seems to be quite a lot of talk about the consoles. Good find - surprised it didn't show up here sooner!

scooby_dooby said:
right, my bad.

Titanio, the cache management they speak of may very well be exactly what we are talking about in this thread. Whether or not they can 'partition' the cache as they see fit is the question, is it not?

To what extent can coders control and manage this cache, isn't that what we're trying to find out?

edit - misspelled your name again...

Of course, that's why I hope someone might "clarify" for us re. that other "cache management" ;)
 
What mystifies me is that things like this, about Xenon:

As is widely known by now, the Xenon's VMX unit features 128 registers, each 128 bits in length. This so-called "VMX-128" unit allows for each running thread to use 128 vector registers, which means that Xenon has a total of 256 physical vector registers on the die (128 registers x 2 threads).

http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars/4

should be well known by now. Each Xenon core has a dual-issue VMX core, each "half" of the VMX core having a dedicated set of 128 registers.

Jawed
 
Sorry, my first quote was a little bit cut; here's the whole paragraph (again Google-translated):

The 360 solution resembles Hyper-Threading. In principle there are three CPUs with two hyper-threads each. If you ask the hardware manufacturers, of course it isn't like that. But if you analyse it as a software developer, it is nothing other than Hyper-Threading. That is, you have six threads, but really only three times 1.5 threads. On the PlayStation 3 with Cell it looks different: the main CPU has two threads (somewhat better than Hyper-Threading), and on top of that come seven synergistic processors. The eighth SPU present in the design was omitted.
 
Jawed said:
What mystifies me is that things like this, about Xenon:

should be well known by now. Each Xenon core has a dual-issue VMX core, each "half" of the VMX core having a dedicated set of 128 registers.

Jawed

Dual issue is a bad choice of words...

My guess is Ars Technica did not know it exactly at that time, and they also state that:

The exact number of VMX-128 execution units is not yet known, but it's probably at least two, and possibly three.

... well, actually it's just one. AT guessed two; that's why they speak about 256 (2x128) registers.
 
Nemo80 said:
Sorry, my first quote was a little bit cut; here's the whole paragraph (again Google-translated):

Nemo - that entire interview is very interesting, and worthy of its own thread. There's quite a good bit of decent commentary in there on the next-gen systems.

You don't happen to know when it was published? What issue of GameStar?
 
Nemo, with your posting record I'm afraid you don't have any credibility in my eyes...

Jawed
 
Nemo - The number of VMX units does not tell you how many h/w threads exist. VMX units are execution units - they have very little to do with how many h/w threads exist in a core. There are two hardware threads per core, period.


1.5x performance does not mean that there is only half of a real thread... it means the second thread has to share resources with the first, giving you the benefit of roughly half the performance of a second thread. What's great is that Xenon has a much larger register file per thread than the Cell PPE (64/thread versus 32/thread).

Remember they are execution units only.
 
blakjedi said:
Nemo - The number of VMX units does not tell you how many h/w threads exist. VMX units are execution units - they have very little to do with how many h/w threads exist in a core. There are two hardware threads per core, period.


1.5x performance does not mean that there is only half of a real thread... it means the second thread has to share resources with the first, giving you the benefit of roughly half the performance of a second thread. What's great is that Xenon has a much larger register file per thread than the Cell PPE (64/thread versus 32/thread).

Remember they are execution units only.

Actually it's 128 per thread on Xenon, because MS extended that for more graphics-intensive tasks, much like the 128 registers of a CELL SPE but not identical. The big difference is that the CELL VMX threads don't need to save that register file when switching their hardware threads, which takes up some additional CPU cycles, as Xenon or any other Hyperthreading CPU (P4) has to.

@Jawed: I've been reading these forums long enough to know not to mind anything you say ;) You don't need to believe me, but maybe people at Crytek, or even Microsoft?
 
Should the 128 128-bit registers of each SPE be considered when thinking about data storage as well? Would this not provide a bit more space for code in an SPE's LS? ...or is this not good thinking?
 
Nemo80 said:
Actually it's 128 per thread on Xenon, because MS extended that for more graphics-intensive tasks, much like the 128 registers of a CELL SPE but not identical. The big difference is that the CELL VMX threads don't need to save that register file when switching their hardware threads, which takes up some additional CPU cycles, as Xenon or any other Hyperthreading CPU (P4) has to.

And neither does Xenon. Physically, the VMX register file is most likely laid out as a 128-entry x 128-bit dual-register arrangement, providing 256 physical registers but only 128 available on any given cycle.

Neither CELL's PPE nor X360's core will need to save context when switching between their built-in hardware contexts. Both CELL and X360 will have to save contexts when switching between software contexts within a given hardware context.

AND FYI, Hyperthreading doesn't exist, it is a marketing term; the correct terms are SMT (Simultaneous Multi-Threading), CMT (Concurrent Multi-Threading), and SOEMT (Switch On Event Multi-Threading). It is unlikely that the core designed for either CELL or X360 actually supports SMT and more likely that it supports CMT or SOEMT.

Aaron Spink
speaking for myself inc.
 
scificube said:
Should the 128 128-bit registers of each SPE be considered when thinking about data storage as well? Would this not provide a bit more space for code in an SPE's LS? ...or is this not good thinking?
This has no bearing on multithreading on the PPE core. The SPE's assets are for its own use.
 
Shifty Geezer said:
This has no bearing on multithreading on the PPE core. The SPE's assets are for its own use.

I didn't mean to imply it had anything to do with multi-threading. I wanted to explore whether the 256K LS for each SPE might be a little less limiting for data and code, given there seem to be a lot of wide, unspecialised registers also available to store data.

128 * 128 bits = 16,384 bits = 2,048 bytes = 2KB

I was asking if this could be taken advantage of as storage space along with the LS. If so... maybe the SPEs are in a bit better shape than first thought.
 
aaronspink said:
AND FYI, Hyperthreading doesn't exist, it is a marketing term; the correct terms are SMT (Simultaneous Multi-Threading), CMT (Concurrent Multi-Threading), and SOEMT (Switch On Event Multi-Threading). It is unlikely that the core designed for either CELL or X360 actually supports SMT and more likely that it supports CMT or SOEMT.

Aaron Spink
speaking for myself inc.
Hyperthreading is SMT, right? Or am I getting my acronyms jumbled?
 
3dcgi said:
Hyperthreading is SMT, right? Or am I getting my acronyms jumbled?

The Pentium 4 micro-architecture implements a form of SMT. Hyperthreading per se is a marketing term and may or may not refer to SMT, just like Hypertransport is a marketing term referring to a type of point-to-point interconnect, but could be used in the future to refer to something somewhat different.

I just dislike using marketing terms. They cause confusion and are without strong definition. It's like the crappy SmoothVision and CineFX marketing names ATI and Nvidia try to get away with instead of just saying FSAA and SM2/3.

Aaron Spink
speaking for myself inc.
 
scificube said:
I didn't mean to imply it had anything to do with multi-threading. I wanted to explore whether the 256K LS for each SPE might be a little less limiting for data and code, given there seem to be a lot of wide, unspecialised registers also available to store data.

128 * 128 bits = 16,384 bits = 2,048 bytes = 2KB

I was asking if this could be taken advantage of as storage space along with the LS. If so... maybe the SPEs are in a bit better shape than first thought.
Ummm... what's the point in having those registers in the SPU if they're not going to be used? ;) So yes, they contribute to the total local storage available to the execution units on the SPE, giving roughly a 1% boost on top of that 256KB LS.

I can't see how the LS can be thought of as limited or in bad/not very good shape. It removes a good few barriers to keeping execution units fed compared with a conventional I+D cache system. It has a downside in needing to manage the LS, but not a limitation in usefulness due to size. Again, PS2 programmers are used to working with greatly limited resources and they've still pulled it off.
 
Shifty Geezer said:
Ummm... what's the point in having those registers in the SPU if they're not going to be used? ;) So yes, they contribute to the total local storage available to the execution units on the SPE, giving roughly a 1% boost on top of that 256KB LS.

I can't see how the LS can be thought of as limited or in bad/not very good shape. It removes a good few barriers to keeping execution units fed compared with a conventional I+D cache system. It has a downside in needing to manage the LS, but not a limitation in usefulness due to size. Again, PS2 programmers are used to working with greatly limited resources and they've still pulled it off.

Actually, I'm sorry I brought this up. I fell victim to not thinking with the correct orders of magnitude in mind... bits vs. bytes, etc. Yes, 2 more KB is trivial compared to the 256KB LS. I was trying too hard, I think. I've heard quite a few times that the size of the LS would be limiting, and in my own way I was trying to explore whether it was a problem and how it could be dealt with... without going the other way and thinking about software design.

Again... I was trying too hard and not thinking straight.
 
I think the fear of limits comes down to 'heck, we've got to fit our entire program into 256KB including data! :oops: '. As an initial shock, in an age of as many megabytes of source code as you could wish for, the idea of fitting a program into 256KB + data seems crazy. But when you chop that umpteen megs of source code into individual processes you see that to perform a step of the main program you might well only need a few KB of instructions, and when you consider the LS is doing the same thing as the I and D caches you see it can be used the same way, keeping local what's immediately needed and DMA'ing more data/code as the process progresses.

Coming up with small bite-size processes to implement might be quite a hurdle to begin with though. To date, at least on PC, you just keep throwing instructions at the processor and the cache manages for itself which instructions and data are kept locally.
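
Very roughly, the chunk-at-a-time idea looks something like the sketch below (plain C, and only a sketch: dma_fetch/dma_put are made-up stand-ins for the SPE's MFC DMA calls, not the real SDK names - here they just memcpy so the thing runs):

```c
/* Sketch: stream a big array through a small local-store-sized buffer,
 * one chunk at a time. The DMA helpers are hypothetical stand-ins. */
#include <string.h>
#include <stddef.h>

#define CHUNK_BYTES (16 * 1024)        /* sized to sit comfortably in the 256KB LS */

static char local_buf[CHUNK_BYTES];    /* this would live in local store */

/* Stand-ins: real code would kick off an MFC transfer and wait on its tag. */
static void dma_fetch(void *ls_dst, const char *main_src, size_t n) { memcpy(ls_dst, main_src, n); }
static void dma_put(char *main_dst, const void *ls_src, size_t n)   { memcpy(main_dst, ls_src, n); }

/* Placeholder for the actual work done on one slice of data. */
static void process_chunk(char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] ^= 0x5A;
}

void stream_job(const char *src, char *dst, size_t total)
{
    for (size_t done = 0; done < total; done += CHUNK_BYTES) {
        size_t n = (total - done < CHUNK_BYTES) ? (total - done) : CHUNK_BYTES;
        dma_fetch(local_buf, src + done, n);   /* pull the next slice into LS  */
        process_chunk(local_buf, n);           /* compute entirely out of LS   */
        dma_put(dst + done, local_buf, n);     /* push the results back to RAM */
    }
}
```

Written like that it stalls while each chunk is fetched, of course; overlapping the transfers with the compute is the next step.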
 
I think you're right in that it comes down to how well you optimise for the LS's size, and IBM may be correct in thinking 256KB is fine if you DMA in the manner you speak of... as would make sense even to me. The latency attached to DMA requests should be easy enough to hide behind some intense computation elsewhere, as long as you don't have to go out to main memory all the time.

I guess I should leave this one for people better qualified to talk about the relevant issues. Interesting times are ahead at least.
 
Presumably the "DMA batches of work" concept requires "triple buffering", e.g.:
  1. fetch buffer - e.g. 20KB
  2. working buffer - e.g. 150KB (including 20KB fetched, 80KB output)
  3. output buffer - e.g. 80KB
The working batch, e.g. C, needs to have enough space in its buffer to simultaneously hold the fetched data ("copied" from the fetch buffer - in reality just a simple old<->new switch) plus the "program scratchpad" plus the output data. At the end of working on batch C, the output data is "copied" into the output buffer (using a simple old<->new switch) and the whole thing starts all over again.

So when working on batch C, say, the memory flow controller is organising:
  • batch D data being DMA'd into the fetch buffer 1
  • batch B data being DMA'd out from the output buffer 3 into RAM (or the next SPE, if pipelined)
The tweaky bit, obviously, is tuning the size of the buffers to fit within 256KB of LS, whilst also tuning the program's per-batch memory usage to take account of the conflicting demands of the input dataset size and the output dataset size, whilst trying to ensure that the batch is compute-bound (rather than stalling on fetch or output lag).
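
In rough C the rotation might look something like this - only a sketch: the DMA helpers and their signatures are made up (not the real MFC/SDK calls), real code would issue asynchronous transfers and wait on their tags, and the per-batch scratchpad part of the working buffer is left out to keep it short:

```c
/* Sketch of the fetch / working / output rotation described above.
 * dma_start_fetch/dma_start_put would kick off asynchronous DMA; dma_wait
 * blocks until the transfer with that tag is done. All names are hypothetical. */
#include <stddef.h>

#define FETCH_BYTES  (20 * 1024)   /* example sizes from above */
#define OUT_BYTES    (80 * 1024)

static char in_a[FETCH_BYTES], in_b[FETCH_BYTES];   /* input buffers: fetching + working */
static char out_a[OUT_BYTES],  out_b[OUT_BYTES];    /* output buffers: working + draining */

/* Hypothetical asynchronous DMA helpers. */
void dma_start_fetch(void *ls_dst, unsigned long long ea, size_t n, int tag);
void dma_start_put(unsigned long long ea, const void *ls_src, size_t n, int tag);
void dma_wait(int tag);

/* Placeholder for the per-batch work. */
void process_batch(const char *in, size_t in_n, char *out, size_t out_n);

void run_batches(unsigned long long in_ea, unsigned long long out_ea, int batches)
{
    enum { FETCH_TAG = 0, PUT_TAG = 1 };
    char *fetch = in_a, *work_in = in_b;       /* roles rotate; the memory never moves */
    char *work_out = out_a, *drain = out_b;

    if (batches <= 0)
        return;

    dma_start_fetch(fetch, in_ea, FETCH_BYTES, FETCH_TAG);          /* prime batch 0 */

    for (int b = 0; b < batches; b++) {
        char *t;

        dma_wait(FETCH_TAG);                        /* input for batch b has arrived  */
        t = work_in; work_in = fetch; fetch = t;    /* the old<->new switch on input  */

        if (b + 1 < batches)                        /* start pulling in batch b+1 now */
            dma_start_fetch(fetch, in_ea + (unsigned long long)(b + 1) * FETCH_BYTES,
                            FETCH_BYTES, FETCH_TAG);

        process_batch(work_in, FETCH_BYTES, work_out, OUT_BYTES);   /* compute while DMAs run */

        if (b > 0)
            dma_wait(PUT_TAG);                      /* previous output must have drained */
        t = drain; drain = work_out; work_out = t;  /* the old<->new switch on output    */
        dma_start_put(out_ea + (unsigned long long)b * OUT_BYTES, drain, OUT_BYTES, PUT_TAG);
    }
    dma_wait(PUT_TAG);                              /* flush the final batch's output */
}
```

Tuning FETCH_BYTES/OUT_BYTES against the 256KB LS and the compute time per batch is exactly the tweaky bit.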

Prolly deserves its own thread, this discussion.

Jawed
 