Crytek on PS3/X360 (+ more - great read)

Nemo80 said:
Exactly, that's what I think of. But still, what's the difference to CELL then, making it "better" in this respect, according to CryTek (and a presumable 2nd VMX unit ;) )?

I'm not entirely sure. I haven't been keeping up with all of the information about the PPE, but one of my pet theories (someone will probably prove this entirely false!) is that the PPE is perhaps capable of two hardware threads, and also has two VMX execution units. So the threads share the int unit/fp unit/cache/etc, but each has its own VMX execution unit (or perhaps each thread can issue instructions to both VMX units?).

Nite_Hawk
 
That SMT document is pretty helpful. Thanks for that. I've never actually read about SMT but as I suspected the idea is really nothing new. Conceptually it's the same thing as multitasking in operating systems.
 
seismologist said:
That SMT document is pretty helpful. Thanks for that. I've never actually read about SMT but as I suspected the idea is really nothing new. Conceptually it's the same thing as multitasking in operating systems.

It's similar, but it's more like multitasking limited to two threads which each have dedicated hardware for thread context information so you don't need a context switch. It is a pretty neat idea given the small amount of hardware needed and the improvement in efficiency you can net if the two threads aren't fighting for control of shared resources.

I think the biggest question (at least for me) is what IBM/MS are doing to deal with the logical processors fighting with each other over shared resources. Between the six hardware threads, you've got 3 sets of execution units and a 1MB cache. If you really intend to have 6 hardware threads, that cache is going to be quite the prime commodity.

Nite_Hawk
 
blakjedi said:
Xenon is both SMP and SMT. The Hardware supports six independent hardware threads and sees each thread as its own processor. Programmers can either see them as Six threads or six processors.

It is highly probable that neither Xenon NOR CELL are SMT. The more likely implementation is CMT or SOEMT.

Aaron Spink
speaking for myself inc.
 
blakjedi said:
Almost exactly... there are 128 VMX registers, though some say 256. *shrug*

Um, people without clues shouldn't speak. Or try to correct other people without clues.

Each Core in Xenon implements 2 hardware contexts.

Aaron Spink
speaking for myself inc.
 
blakjedi said:
Taken from IBM's site: Characterization of simultaneous multithreading (SMT) efficiency in POWER5

"In SMT mode, the processor resources—register sets, caches, queues, translation buffers, and the system memory nest—must be shared by both threads, and conditions can occur that degrade or even obviate SMT performance improvement."

If I misunderstand that quote please let me know because I am trying to grasp this whole conversation as best I can.

The Power Architecture has 32 architected integer registers. The Power5 design implements far more than 32 physical registers. There are 2*32 registers that hold the architectural state for the two hardware contexts supported by the Power5 design. In addition, there are a number of extra registers used for the CPU's renaming operation, which disambiguates between the various instantaneous uses of a given architected register identifier within the program flow. This disambiguation of the registers is what allows OoOE. There are multiple ways to implement the register subsystem of an OOO processor; three different implementations are found in the Alpha EV6, the K7/8, and the P6 microarchitectures.

Neither Xenon NOR the PPE being an OOO processor, none of this applies to either of them.

If you would like to learn more, I suggest you pick up a copy of one or both of the Hennessy and Patterson Computer Architecture books.

Aaron Spink
speaking for myself.
 
Nite_Hawk said:
The big downside is that there will be contention for resources, but IBM/MS has to deal with this anyway between the 3 cores, so perhaps they have made some advances beyond the previous SMT implementations.

Quite likely that neither Xenon NOR CELL implement SMT but instead implement either CMT or SOEMT.

Aaron Spink
speaking for myself inc.
 
one said:
So does it always bypass L2? Then what's the use of L2? :???:

Not always; it is as the programmer wishes. E.g. for read streaming, from the X2 patent: "There are different techniques that can be used for loading information into the cpu. In one technique the L1 cache is implemented as an n-way set-associative cache. In this technique, the information is received directly into a locked set of the L1 cache, bypassing the L2 cache. The information can then be transferred from the L1 cache to the registers [of the cpu]. In another technique the information is transferred into a locked set of the L2 cache, and thereafter transferred to the registers. In yet another technique, the information can be streamed into a 2 or more way L1 cache, but with no set locking. [for write stream] CPU can perform write streaming directly to system memory by bypassing both the L1 cache and the L2 cache (non-temporal store operation)." The last operation is more of a red-alert option to prevent thrashing caches that are already full. A write (read) stream could also go through the non-locked part of the L2 cache.
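To see why locking part of a cache for streamed data can matter, here is a rough Python toy, not anything taken from the patent. It assumes a single fully associative LRU cache (a real n-way cache locks per set), and the names `LRU` and `hot_hit_rate` plus all the sizes are made up for illustration. It measures how often a small "hot" working set still hits when a one-touch data stream runs through the same cache, with and without a few lines reserved for the stream.

```python
from collections import OrderedDict


class LRU:
    """Tiny fully associative cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Touch addr; return True on a hit, False on a miss."""
        if addr in self.lines:
            self.lines.move_to_end(addr)      # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used line
        self.lines[addr] = True
        return False


def hot_hit_rate(total_lines, locked_lines, hot_set, rounds, stream_per_round):
    """Hit rate of the hot working set while a stream passes through.

    locked_lines == 0 : stream and hot data compete for every line.
    locked_lines  > 0 : the stream is confined to that many reserved lines
                        (a stand-in for the 'locked set' idea).
    """
    shared = LRU(total_lines - locked_lines)
    stream_buf = LRU(locked_lines) if locked_lines else shared
    hits = accesses = 0
    next_stream = 0
    for _ in range(rounds):
        for addr in range(hot_set):           # reuse the same hot lines
            hits += shared.access(("hot", addr))
            accesses += 1
        for _ in range(stream_per_round):     # every stream line is touched once
            stream_buf.access(("stream", next_stream))
            next_stream += 1
    return hits / accesses


print(hot_hit_rate(8, 0, 6, 10, 8))  # no locking
print(hot_hit_rate(8, 2, 6, 10, 8))  # 2 lines reserved for the stream
```

In this toy the unlocked run keeps evicting the hot lines, while reserving two lines for the stream leaves the hot set resident after the first pass. Real hardware will be far less clear cut.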
 
aaronspink said:
Quite likely that neither Xenon NOR CELL implement SMT but instead implement either CMT or SOEMT.

Aaron Spink
speaking for myself inc.

Sounds like CMT would be a better fit for in-order processors, but can they really expect very dramatic performance gains when switching only happens on high latency events?

Edit: i.e. are the 50% speed increases really feasible? It seems like it performs best when you have cache misses?

Nite_Hawk
 
Nemo80 said:
Exactly, that's what I think of. But still, what's the difference to CELL then, making it "better" in this respect, according to CryTek (and a presumable 2nd VMX unit ;) )?

I feel blakjedi's characterization comes closest to answering this question. The 2nd VMX unit also helps with respect to the PPE, in contrast to the single but more robust VMX unit in a core on MS's chip. It makes sense that the VMX unit is more robust because it is meant to perform some heavy lifting with respect to procedural synthesis etc., whereas on Cell the PPE can help but the real heavy lifting is to be done by an SPE or set of SPEs.

blakjedi said:
No. The processors are logical, not virtual. Virtual processors are software-based threads. The main difference between the independent threads on XeCPU and the SPEs (to follow on with your MP comment) is that on-chip hardware such as registers, FPUs, integer units, etc. has to be shared between the threads. The SPEs each have their own hardware and are single threaded.

The Cell has 9 logical processors with 8+ sets of support hardware.
XeCPU has 6 logical processors and 3 (maybe more) sets of flow control, execution, etc. hardware to support them.

I understand how you are phrasing it, and before I learned more about the topic I would have phrased it that way too, but it's not quite right. BTW the multithreading that is present on both chips is MUCH more than a small benefit.

.
 
aaronspink said:
If you would like to learn more, I suggest you pick up a copy of one or both of the Hennessy and Patterson Computer Architecture books.

I don't even think this is covered in Hennessy and Patterson. He should go pick up an operating systems book and look at some of the scheduling algorithms, which is where all of this originated from in the first place.
SMT, CMT, blah blah blah (basically make up any name for your implementation) are all derived from the same basic concept.
 
Nite_Hawk said:
Sounds like CMT would be a better fit for in-order processors, but can they really expect very dramatic performance gains when switching only happens on high latency events?

Edit: i.e. are the 50% speed increases really feasible? It seems like it performs best when you have cache misses?

Nite_Hawk

Yeah, in some tasks I think it should do that. With the little info we have now, it would seem that the PPE could be the better core, core vs. core, because of the bigger cache.
However, if you only use the PPE and one XCPU core, as many first-gen titles likely will, then it would be the other way around, in theory at least (twice the L2 cache per thread).

But then again, Cell doesn't need to compete for bandwidth with RSX, so it's really hard to speculate, but very interesting.

The more we dig into the systems, the more different they seem to become.
 
In the X2 patent I read that there are 11 registers in each core. Perhaps these registers could somehow be split into 2 sets for 2 threads.
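For what it's worth, the "twice the L2 cache per thread" figure can be checked with some back-of-the-envelope Python. This is only a sketch: it assumes the L2 is split evenly between hardware threads (real caches don't partition like that), it takes the 1MB Xenon figure mentioned earlier in the thread, and the 512KB PPE L2 is my own assumption since nobody in the thread states it. The helper name `l2_per_thread_kb` is made up.

```python
KB = 1024


def l2_per_thread_kb(l2_bytes, hw_threads):
    """Even split of a shared L2 across hardware threads, in KB (idealised)."""
    return l2_bytes / hw_threads / KB


XENON_L2 = 1024 * KB  # 1MB shared L2, as mentioned earlier in the thread
PPE_L2 = 512 * KB     # assumption: not stated anywhere in this thread

# All three Xenon cores busy (6 threads) vs. the PPE's 2 threads.
print(l2_per_thread_kb(XENON_L2, 6))  # roughly 170 KB per thread
print(l2_per_thread_kb(PPE_L2, 2))    # 256 KB per thread

# Only one Xenon core in use: its 2 threads get the whole L2.
print(l2_per_thread_kb(XENON_L2, 2))  # 512 KB per thread
```

So with every core busy the PPE comes out ahead per thread, and with a single Xenon core in use it flips to twice the PPE's share, under these assumptions.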
 
Nite_Hawk said:
Sounds like CMT would be a better fit for in-order processors, but can they really expect very dramatic performance gains when switching only happens on high latency events?

You are thinking of SOEMT. CMT implies a design where more than 1 thread can be active within a pipe but only 1 given thread can be active within a given pipestage.

SOEMT implies a design where only 1 thread can be active within the pipeline (ie, there is a slight pipeline drain between thread switches).

SMT implies that at pretty much any given pipestage there can be more than 1 thread active.
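A rough way to get a feel for the three definitions above is a toy cycle counter in Python. This is a sketch under heavy assumptions, not a model of Xenon, the PPE or any real chip. Each thread is a string of ops, `c` for a 1-cycle op and `m` for an op that stalls its thread for `miss_latency` extra cycles. The function names, the 10-cycle miss and the 3-cycle switch penalty are all invented. The `smt` mode generously lets every ready thread issue in the same cycle.

```python
def serial_cycles(threads, miss_latency=10):
    """Baseline: run each thread to completion, one after another."""
    return sum(1 + (miss_latency if op == "m" else 0)
               for ops in threads for op in ops)


def simulate(threads, mode, miss_latency=10, switch_penalty=3):
    """Cycles to finish all threads under a toy hardware-threading model.

    soemt: one thread owns the pipeline; on a stall, switch to a ready
           thread and pay switch_penalty cycles (pipeline drain).
    cmt:   at most one op issues per cycle, round robin over ready
           threads, with no switch penalty.
    smt:   every ready thread may issue one op in the same cycle.
    """
    n = len(threads)
    pc = [0] * n          # next op index per thread
    ready_at = [0] * n    # cycle at which each thread can issue again
    active = 0            # pipeline owner (soemt only)
    last = 0              # last thread that issued (cmt round robin)
    now = 0

    def can_issue(t):
        return pc[t] < len(threads[t]) and ready_at[t] <= now

    def issue(t):
        op = threads[t][pc[t]]
        pc[t] += 1
        ready_at[t] = now + 1 + (miss_latency if op == "m" else 0)

    while any(pc[t] < len(threads[t]) for t in range(n)):
        if mode == "soemt":
            if not can_issue(active):
                others = [t for t in range(n) if can_issue(t)]
                if others:
                    active = others[0]
                    now += switch_penalty   # drain before the new thread runs
            if can_issue(active):
                issue(active)
        elif mode == "cmt":
            for t in [(last + 1 + i) % n for i in range(n)]:
                if can_issue(t):
                    issue(t)
                    last = t
                    break
        elif mode == "smt":
            for t in range(n):
                if can_issue(t):
                    issue(t)
        else:
            raise ValueError(mode)
        now += 1
    return max(ready_at)


work = ["ccmcc", "ccmcc"]  # two identical threads, one miss each
for mode in ("soemt", "cmt", "smt"):
    print(mode, simulate(work, mode))
print("serial", serial_cycles(work))
```

With this particular made-up workload the ordering comes out as serial slowest, then SOEMT, then CMT, then SMT. How big the gaps are depends entirely on the miss latency and switch penalty you plug in, so it says nothing about the real 35% or 50% figures.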

Edit: i.e. are the 50% speed increases really feasible? It seems like it performs best when you have cache misses?

The N* line of processors from IBM achieves general speed ups in the range of 35%. N* was the first commercially released MT processor in ~97/98. N* implemented SOEMT.

Aaron Spink
speaking for myself inc.
 
seismologist said:
I don't even think this is covered in Hennessy and Patterson. He should go pick up an operating systems book and look at some of the scheduling algorithms, which is where all of this originated from in the first place.
SMT, CMT, blah blah blah (basically make up any name for your implementation) are all derived from the same basic concept.

Um, pretty sure the latest version of H&P covers threading. At least that was the impression I got the last time I was talking to them a couple of years ago.

SMT, CMT, SOEMT are pretty defined terms and you don't just make up any name. The basic concept is multiple hardware contexts which is ancient and actually dates back to the 60's in hardware for some IO controllers.

Anyways, an OS book doesn't have really any direct application to the area of multiple hardware contexts other than talking about OS level software contexts which are an orthogonal issue.

Aaron Spink
speaking for myself inc.
 
Yeah, the idea behind it is ancient. I mentioned operating systems because I think an OS book would take a more comprehensive look at task scheduling, just by the nature of it being done in software, rather than focusing on 1 or 2 specific hardware implementations.

Of course there's always the possibility that I have no clue what I'm talking about :p since my version of H&P doesn't cover hardware multi-threading.
 
aaronspink said:
The N* line of processors from IBM achieves general speed ups in the range of 35%. N* was the first commercially released MT processor in ~97/98. N* implemented SOEMT.

Aaron Spink
speaking for myself inc.

You don't happen to have a reference for this, do you? IBM seemed to enjoy similar improvements with SMT in its Power5 processors, though not in all situations. Shouldn't SMT generally provide greater speed ups?

...IBM claims substantial performance improvements from SMT: ~35% in database transaction processing and Websphere workloads, ~28% in SAP, and ~45% in Domino R6 Mail [12]. IBM also claims SMT increases throughput for SPECint_rate2k and SPECfp_rate2k by about 21% and 10% respectively [13]. This is significantly better than what CMT can do for Montecito because SMT allows for overlap of thread instruction execution time not just long thread stalls.

http://realworldtech.com/page.cfm?ArticleID=RWT100404214638&p=8

Nite_Hawk
 
aaronspink said:
Um, people without clues shouldn't speak. Or try to correct other people without clues.

Thanks for nothing there aaron.:rolleyes: Here is why I linked to the Power5 article: "The Xenon's SMT implementation is probably much, much simpler than that of the Pentium 4 with hyperthreading. SMT on the Xenon will probably look a lot like SMT on the POWER5, where the hardware support comes mainly in the form of duplicated registers."

How about this:

"As is widely known by now, the Xenon's VMX unit features 128 registers, each 128 bits in length. This so-called "VMX-128" unit allows for each running thread to use 128 vector registers, which means that Xenon has a total of 256 physical vector registers on the die (128 registers x 2 threads)."

OK so each hardware thread never has to "share" VMX registers. Cool. Now that you made me look it up, I have the answer to my earlier question.
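The arithmetic in that quote is easy to sanity check. A tiny Python sketch, using only the numbers from the quoted article (128 registers of 128 bits per thread, 2 threads per core); the function name is made up and the byte total is just derived from those numbers, not a figure anyone has published here:

```python
def vmx_register_file(regs_per_thread=128, bits_per_reg=128, hw_threads=2):
    """Return (physical registers, total bytes) for one core's VMX register file."""
    total_regs = regs_per_thread * hw_threads
    total_bytes = total_regs * bits_per_reg // 8
    return total_regs, total_bytes


regs, size = vmx_register_file()
print(regs, "registers,", size, "bytes per core")  # 256 registers, 4096 bytes per core
```

So "128" and "256" are both right, depending on whether you count per thread or per core.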

aaronspink said:
Each Core in Xenon implements 2 hardware contexts.

Aaron Spink
speaking for myself inc.

In other words, that Ars Technica article I was reading is right. That's all you had to say, man.... you're smart and everything when it comes to microprocessors and architectures, but geez... some of us are just soaking this stuff up.

Now what I'm trying to understand is: if a single VMX-128 unit on Xenon can have two sets of registers (one for each context), then why does the Cell PPE have two full VMX units?
 
For twice the VMX performance! XeCPU can have one VMX instruction running at a time, switching between threads. PPE can have 2 instructions running, presumably either one per thread or two dual-issued on the one thread.
 
Shifty Geezer said:
For twice the VMX performance! XeCPU can have one VMX instruction running at a time, switching between threads. PPE can have 2 instructions running, presumably either one per thread or two dual-issued on the one thread.

How sure are you about this? It would be great if true.

Nite_Hawk
 