Crytek on PS3/X360 (+ more - great read)

1) PPE has larger cache per thread than Xbox 360 CPU core
Is that L1 or L2? Cell has 0.5 MB of L2, the X360 CPU has 1 MB of L2; so 0.25 MB per thread in Cell, 0.17 MB per thread in the X360 CPU, with the ability to bypass L2 entirely. The X360 has a bridge inside the GPU.
I thought we were comparing threads in each CPU, not threads with SPEs.
Is "faster memory access" higher BW?
 
blakjedi said:
Xenon is both SMP and SMT. The hardware supports six independent hardware threads and sees each thread as its own processor. Programmers can see them as either six threads or six processors.

Even MP-only systems don't necessarily get the full benefit of the extra processing power (dual Athlons, dual Celerons, and SLI/Crossfire all prove this).

These are six virtual processors which you can use from a programming standpoint, but performance-wise
it behaves as 3-way MP with some small benefit from threading. In that case it's probably better to stick with sequential coding, which is why the guy says that it's closer to a PC than it is to Cell.

I'm not sure what you mean by the second part. I guess it's because those apps don't support multiple processors?
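
To picture what the "six logical processors on three cores" view means in practice, here's a minimal sketch (plain standard C++, not the 360 SDK; the logical-CPU-to-core mapping is an assumption purely for illustration). The scheduler sees six schedulable units, but each pair shares one core's execution units and L1.

Code:
// Minimal sketch: six software threads, one per "logical processor".
// The logical-CPU -> physical-core mapping (pairs share a core) is assumed
// for illustration; the real affinity call is platform-specific and omitted.
#include <thread>
#include <vector>
#include <cstdio>

void worker(int logicalCpu)
{
    int physicalCore = logicalCpu / 2;   // 0,1 -> core 0; 2,3 -> core 1; 4,5 -> core 2
    std::printf("thread on logical CPU %d (shares core %d's execution units)\n",
                logicalCpu, physicalCore);
    // ...pin to logicalCpu with the platform's affinity API, then do real work...
}

int main()
{
    std::vector<std::thread> threads;
    for (int cpu = 0; cpu < 6; ++cpu)      // six logical processors
        threads.emplace_back(worker, cpu);
    for (auto& t : threads)
        t.join();
    return 0;
}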
 
seismologist said:
yeah that's a much better way to schedule things. To maximize throughput so you're not sitting idle waiting for sequential jobs to complete.
Are you being sarcastic?
 
Lysander said:
Is that L1 or L2? Cell has 0.5 MB of L2, the X360 CPU has 1 MB of L2; so 0.25 MB per thread in Cell, 0.17 MB per thread in the X360 CPU, with the ability to bypass L2 entirely. The X360 has a bridge inside the GPU.
I thought we were comparing threads in each CPU, not threads with SPEs.
So does it always bypass L2? Then what's the use of L2? :???:
Lysander said:
Is "faster memory access" higher BW?
I mean latency, but if the test is done under real-world conditions where contention occurs in the UMA, bandwidth matters.
 
seismologist said:
These are six virtual processors which you can use from a programming standpoint, but performance-wise
it behaves as 3-way MP with some small benefit from threading.

No. The processors are logical, not virtual. Virtual processors are software-based threads. The main difference between the independent threads on the XeCPU and the SPEs (to follow on with your MP comment) is that on-chip hardware such as registers, FPU units, integer units, etc. has to be shared between the threads. The SPEs each have their own hardware and are single-threaded.

The Cell has 9 logical processors with 8+ sets of support hardware.
The XeCPU has 6 logical processors and 3 (maybe more) sets of flow control, execution, etc. hardware to support them.

I understand how you are phrasing it, and before I learned more about the topic I would have phrased it that way too, but it's not quite right. BTW, the multithreading that's present on both chips is MUCH more than a small benefit.
 
blakjedi said:
Xenos can read directly from L2 AND main memory simultaneously. Coding with that in mind should reduce latency considerably.

For the CPU? I'm not sure how that would affect the distance between the CPU and memory, so to speak.

one said:
I don't know under what conditions they tested them, but if you allocate a part of the L2 cache as the write buffer for Xenos then you have even less cache. For the memory access, it's just my guess that the RAM is farther from the CPU in the Xbox 360 and a cache miss costs more in overall latency.

I'd forgotten the CPU goes through Xenos's memory controller; that makes even more sense now. Other characteristics like contention, as you pointed out earlier, and perhaps also any differences in XDR behaviour may contribute as well. But I'd agree that main memory access is sufficiently expensive on either chip that you'd want to try and avoid it ;)
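
To make the "avoid main memory" advice concrete, here's a generic sketch (plain C++, nothing console-specific): the same sum over contiguous data versus pointer-chased data. The answer is identical; the memory traffic, and thus the latency cost, is not.

Code:
// Sketch: identical sum, very different memory behaviour. The contiguous
// array is prefetch-friendly and mostly hits cache; the linked list makes
// a dependent (and likely missing) load per element.
#include <vector>
#include <list>
#include <numeric>
#include <cstdio>

int main()
{
    const int N = 1 << 20;

    std::vector<int> contiguous(N, 1); // one cache line covers many elements
    std::list<int>   scattered(N, 1);  // each node is its own heap allocation

    long a = std::accumulate(contiguous.begin(), contiguous.end(), 0L);
    long b = std::accumulate(scattered.begin(),  scattered.end(),  0L);

    std::printf("%ld %ld\n", a, b); // same answer, very different memory traffic
    return 0;
}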
 
blakjedi said:
No. The processors are logical, not virtual. Virtual processors are software-based threads. The main difference between the independent threads on the XeCPU and the SPEs (to follow on with your MP comment) is that on-chip hardware such as registers, FPU units, integer units, etc. has to be shared between the threads. The SPEs each have their own hardware and are single-threaded.

The Cell has 9 logical processors with 8+ sets of support hardware.
The XeCPU has 6 logical processors and 3 (maybe more) sets of flow control, execution, etc. hardware to support them.

Why do I feel like we're saying the exact same thing? What am I missing?

I understand how you are phrasing it, and before I learned more about the topic I would have phrased it that way too, but it's not quite right. BTW, the multithreading that's present on both chips is MUCH more than a small benefit.

I'm not sure how you quantify the benefit from multithreading on shared resources. Its performance will be application-dependent, which may not be ideal for gaming where you need predictable results; that's why I say sequential coding may be the right choice on Xenon.
I'm making some assumptions here, but basically I'm trying to interpret the comment by Crytek stating that Xenon is closer to x86 than it is to Cell.
If you have some other explanation please do tell.
 
seismologist said:
I'm not sure how you quantify the benefit from multithreading on shared resources. Its performance will be application-dependent, which may not be ideal for gaming where you need predictable results; that's why I say sequential coding may be the right choice on Xenon.

That might be a reason why Microsoft only talks about 3 threads in their development paper.

I'm making some assumptions here, but basically I'm trying to interpret the comment by Crytek stating that Xenon is closer to x86 than it is to Cell.
If you have some other explanation please do tell.

Yes, but I would still like to know why Crytek also says that Xenon's "hyperthreading" gives only a "1.5x" performance gain while Cell is better in this respect?
 
seismologist said:
I'm not sure how you quantify the benefit from multithreading on shared resources. Its performance will be application-dependent, which may not be ideal for gaming where you need predictable results; that's why I say sequential coding may be the right choice on Xenon.

I think for some things this may be true. I could easily see a core being reserved for just one thread. If it's a very busy thread it may not even be worth trying to put another one on there.

This also reminded me of:

"With 3 cpu's with 2 hardware threads each (dual core cpu's) it's possible that we are going to scale for 6 threads. Maybe we're not gonna do it though, depending how fast the individual cores or the cpu-threads are running respectively."

Not saying he was necessarily thinking of the same things we are here, but it seems relevant. The issues he raised as being important to multi-threading/multi-processing also seemed to bring to the fore issues of dedicated per-thread resources vs shared per-thread resources. It might be his expectation that some of the tasks he's looking to parallelise won't be wasting too many cycles.
 
blakjedi said:
The main difference between the independent threads on the XeCPU and the SPEs (to follow on with your MP comment) is that on-chip hardware such as registers, FPU units, integer units, etc. has to be shared between the threads. The SPEs each have their own hardware and are single-threaded.

Share REGISTERS between hardware threads? Are you crazy? Where did you get your CS education?
 
ector said:
Share REGISTERS between hardware threads? Are you crazy? Where did you get your CS education?

Well I guess that means there are 2 threads running in software, but since there is only one register set and one execution unit, these threads have to share them. So there has to be some context switching between them (applies to Xenon).
 
Nemo80 said:
Well I guess that means there are 2 threads running in software, but since there is only one register set and one execution unit, these threads have to share them. So there has to be some context switching between them (applies to Xenon).

Almost exactly... are there 128 VMX registers though? Some say 256. *shrug*
 
I think of it the other way: there are 2 sets of registers, one set of execution units (FPU, integer, VMX), and there is time-slice switching of those execution units between the data stream from one register set and the data stream from the second register set, thereby creating two hardware threads.
 
Lysander said:
I think of it the other way: there are 2 sets of registers, one set of execution units (FPU, integer, VMX), and there is time-slice switching of those execution units between the data stream from one register set and the data stream from the second register set, thereby creating two hardware threads.

This is how I understand it to be, but I may be wrong. I think the idea is that context switching should be very fast so keeping the execution units fed is much easier, but you are still limited to performing only as much work as a single set of execution units can do.

I believe the 1.5x thread quote was describing that Xenon can be thought of as utilizing 100% of the execution unit, rather than, say, the 67% of the execution unit that a single hardware thread would allow. So 100%/67% = a 1.5x speed increase over a single thread, or "1.5 threads". It's not really correct, but perhaps a reasonable characterization of the performance increase that is gained by having 2 hardware threads per core.

In this way, Xenon might be thought of as having the same performance as a CPU with 4.5 cores, each with a single hardware thread.

Nite_Hawk
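
For concreteness, here is the same arithmetic as a tiny program. The 67% single-thread utilisation is just the illustrative number from the post above, not a measured Xenon figure.

Code:
// Back-of-the-envelope version of the "1.5 threads" figure.
// Both utilisation numbers are assumptions taken from the post above.
#include <cstdio>

int main()
{
    const double singleThreadUtil = 0.67; // one hardware thread keeps a core ~67% busy (assumed)
    const double smtUtil          = 1.00; // two hardware threads keep it ~fully busy (assumed)

    const double smtSpeedup     = smtUtil / singleThreadUtil; // ~1.5x per core
    const double effectiveCores = 3.0 * smtSpeedup;           // ~4.5 "single-threaded cores"

    std::printf("SMT speedup per core: %.2fx\n", smtSpeedup);
    std::printf("Effective single-threaded cores: %.2f\n", effectiveCores);
    return 0;
}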
 
Nite_Hawk said:
I believe the 1.5x thread quote was describing that Xenon can be thought of as utilizing 100% of the execution unit, rather than, say, the 67% of the execution unit that a single hardware thread would allow. So 100%/67% = a 1.5x speed increase over a single thread, or "1.5 threads". It's not really correct, but perhaps a reasonable characterization of the performance increase that is gained by having 2 hardware threads per core.

In this way, Xenon might be thought of as having the same performance as a CPU with 4.5 cores, each with a single hardware thread.

Nite_Hawk

Exactly, that's what I think too. But still, what's the difference from Cell then, making it "better" in this respect, according to Crytek (and a presumed 2nd VMX unit ;))?
 
ector said:
Share REGISTERS between hardware threads? Are you crazy? Where did you get your CS education?

Since I am new as of this year to understanding much about computer architecture I'll just quote IBM. If my understanding is wrong please enlighten! :D

Taken from IBM's site: Characterization of simultaneous multithreading (SMT) efficiency in POWER5

"In SMT mode, the processor resources—register sets, caches, queues, translation buffers, and the system memory nest—must be shared by both threads, and conditions can occur that degrade or even obviate SMT performance improvement."

If I misunderstand that quote please let me know because I am trying to grasp this whole conversation as best I can.
 
On the POWER5 the register pool is larger than the programmer's view, to cope with register renaming and OOOE. The 2 threads share this common rename pool.
 
blakjedi said:
Since I am new as of this year to understanding much about computer architecture I'll just quote IBM. If my understanding is wrong please enlighten! :D

Taken from IBM's site: Characterization of simultaneous multithreading (SMT) efficiency in POWER5

"In SMT mode, the processor resources—register sets, caches, queues, translation buffers, and the system memory nest—must be shared by both threads, and conditions can occur that degrade or even obviate SMT performance improvement."

If I misunderstand that quote please let me know because I am trying to grasp this whole conversation as best I can.

This is also brought up in the Ars Technica article that was posted in the other thread:

http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars

I'd also like to revise my earlier post a bit. It is not that context switching is fast per se on Xenon; it's more that the scheduler gets operations from both logical processors, which are running concurrently, so there really is no context switching (at least for the 2 hardware threads). The obvious benefit of this is that you don't pay for a context switch and it is faster to access cache (though it is all shared anyway between the 3 cores). The big downside is that there will be contention for resources, but IBM/MS has to deal with this anyway between the 3 cores, so perhaps they have made some advances beyond previous SMT implementations.

I think people are opposed to the "logical processor" definition given that the execution units are shared. "Logical processors" just means that you have two front ends that can both feed the execution units so they don't sit idle as often, and they share cache (for better or worse).

Nite_Hawk
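
One practical takeaway from the contention point: if the two hardware threads on a core run complementary workloads (one compute-bound, one stalled on memory), they fight less over the shared execution units. A rough, generic sketch of that pairing follows (plain C++; pinning both threads to one core is platform-specific and omitted).

Code:
// Sketch: pair a compute-bound thread with a memory-latency-bound thread.
// The idea is that the FPU-heavy thread can fill the pipeline bubbles the
// pointer-chasing thread leaves behind. Generic C++ only; pinning the two
// threads to one core's two hardware threads is platform-specific and left out.
#include <thread>
#include <vector>
#include <cstdio>

// Compute-bound: hammers the FPU with a dependent chain of operations.
double fpuWork(long iterations)
{
    double acc = 1.0;
    for (long i = 0; i < iterations; ++i)
        acc = acc * 1.0000001 + 0.5;
    return acc;
}

// Latency-bound: dependent loads walking a large array, mostly waiting on memory.
long memoryWork(const std::vector<int>& next, long steps)
{
    long idx = 0;
    for (long i = 0; i < steps; ++i)
        idx = next[idx];
    return idx;
}

int main()
{
    std::vector<int> next(1 << 20);
    for (std::size_t i = 0; i < next.size(); ++i)
        next[i] = static_cast<int>((i * 7919) % next.size()); // pseudo-random walk

    double fpuResult = 0.0;
    long   memResult = 0;

    // Ideally these two would share one core's two hardware threads;
    // here they simply run concurrently.
    std::thread a([&] { fpuResult = fpuWork(50000000L); });
    std::thread b([&] { memResult = memoryWork(next, 50000000L); });
    a.join();
    b.join();

    std::printf("%f %ld\n", fpuResult, memResult);
    return 0;
}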
 