Crytek on PS3/X360 (+ more - great read)

MechanizedDeath said:
I think the most interesting thing here is the potential difference between the PPE and the XeCPU cores

Definitely.

Deano alluded to the 2x1.6GHz cores thing in his GDC ppt. Is that clock locked?

I believe that Deano wasn't being literal, but rather illustrating the better way to "look at" the PPE if you want to code well for it.

So yeah, the only information we have right now is that there are supposedly two VMX units on the PPE, so it seems two hardware threads could use VMX simultaneously, for floating-point SIMD code only. Has this been substantiated anywhere? And is there more to it? Does anyone have any other information, rather than just squabbling?
 
http://translate.google.com/transla...&hl=en&ie=UTF-8&oe=UTF-8&prev=/language_tools

Full Summary:

1) Discussion was about multithreading, multicore, multi-CPU, streaming architecture, cross-platform

2) They support SM 2.0 and upwards

3) They scale the modules by separating them into different entities; they support multicore, multithreaded and multi-CPU systems

4) Xbox 360 CPU resembles hyperthreading: in principle 3 CPUs with 2 threads each, making 6 threads, but they call it 1.5 threads instead of 2 threads

5) Playstation 3 doesn't come close to hyperthreading, it's completely different, with the core having 2 proper long threads and SPEs having 7 shorter threads.

6) In terms of final game, the Playstation 3 SPU is not as flexible as a conventional processor, and therefore they scale it to the 7 SPEs

7) Because Playstation 3 architecture is so different, we need to make special adjustments for it; a normal programming policy does not work with Playstation 3.

8) Sony might say they have improved their development environment a lot, but that's their wish and far from the truth; one must do a lot of lower-level work to make things work on Playstation 3.

9) The cost and development effort of converting from single-core to multicore, multithreaded systems is high.

10) When we take single-core game code and run it on multiple threads, it loses efficiency, but when you optimise it, the rewards are much higher

11) The largest problem with moving advancing hardware is cache missing.

12) The difference between an Intel PC and a Xbox 360 is closer because of hyperthreading than is the difference between Xbox 360 and PS3 even though they essentially use the same PPC core.

13) The only common thing between PC, X360 and PS3 is multithreading

14) A single core Xbox 360 processor would be more efficient than the PPC processor of Playstation 3 but that seems to change when we take into account the 7 SPEs.

15) Being that Playstation 3 is much more different than PC and Xbox 360, we have to make a sub engine for the playstation 3.
 
Sounds like PS3 will be champ when it comes to multithreaded apps. PCs will excel at single-threaded, with X360 falling somewhere in between.

"Before we had the PS3 devkits we thought PS3 and Xbox 360 were closer in design than PC and console."

That's odd because I'm still under the impression that X360 was closer to Cell (strong multithreading). I wonder what they saw that made them change their minds.
 
seismologist said:
Sounds like PS3 will be champ when it comes to multithreaded apps. PCs will excel at single-threaded, with X360 falling somewhere in between.

"Before we had the PS3 devkits we thought PS3 and Xbox 360 were closer in design than PC and console."

That's odd because I'm still under the impression that X360 was closer to Cell (strong multithreading). I wonder what they saw that made them change their minds.

I don't get that at all from them saying that. I think you might be reading into it too much.
 
onetimeposter said:


We already have machine and human translations, and you've added a little embellishment to your "summary".

e.g.

onetimeposter said:
5) Playstation 3 doesn't come close to hyperthreading, it's completely different, with the core having 2 proper long threads and SPEs having 7 shorter threads.

They don't characterise the nature of the threads as "short" and "long" here. They just say the PPE has two threads, better than "hyperthreading", and then you have the 7 SPUs.

onetimeposter said:
6) In terms of final game, the Playstation 3 SPU is not as flexible as a conventional processor, and therefore they scale it to the 7 SPEs

It says they scale these differently.

onetimeposter said:
11) The largest problem with moving advancing hardware is cache missing.

Cache misses, not that the cache is missing, in case anyone is confused by that.


onetimeposter said:
12) The difference between an Intel PC and a Xbox 360 is closer because of hyperthreading than is the difference between Xbox 360 and PS3 even though they essentially use the same PPC core.

You're adding things here again, trying to fill in "whys" and so forth. They said the only real point of similarity between X360 and PS3 is that they're both multithreaded. He didn't mention "using the same PPC core" at all (be that actually the case or not). He said that before they had kits they thought the two would be very similar, closer than X360 and PC, but they've turned out to be very different.

onetimeposter said:
14) A single core Xbox 360 processor would be more efficient than the PPC processor of Playstation 3 but that seems to change when we take into account the 7 SPEs.

He said if you looked at them as generic processors, the X360 CPU (not just one core) would be more efficient/powerful than Cell. But when you factor in the SPUs, things change.

This is pointless, people are well able to read the available machine and human translations for themselves.
 
MechanizedDeath said:
Anyway, can we get back on topic? I think the most interesting thing here is the potential difference between the PPE and the XeCPU cores. Is the expansion from DD1 to DD2 the reason behind this claimed ability of the PPE to be able to run two threads concurrently (albeit at effectively halved clockspeed)?
If you take away the overhead then there's not much difference between running two threads on one 3.2 GHz processor and one each on two 1.6 GHz processors. Well, that's assuming both threads are processor bound; if one thread would otherwise be waiting for memory or something, then you can get better efficiency, because the core would otherwise be doing nothing. What DeanoC said seems to be a way to conceptualize that.
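
As a rough illustration of that reasoning (a toy model, not a measurement of either chip): assume each thread needs a fixed number of compute cycles plus a fixed amount of memory-stall time that does not shrink with clock speed, and that a two-thread core can perfectly hide one thread's stalls behind the other's compute. The function names and numbers are made up for the sketch.

```python
def one_fast_core_two_threads(compute_cycles, stall_seconds, clock_hz):
    """Idealized finish time for two identical threads sharing one core.

    Assumption: perfect interleaving, so the result is only a lower bound.
    The core must issue both threads' work, and each thread must still
    sit through its own stalls.
    """
    compute = compute_cycles / clock_hz
    return max(2 * compute, compute + stall_seconds)


def two_half_clock_cores(compute_cycles, stall_seconds, clock_hz):
    """Each thread gets its own core running at half the clock."""
    return compute_cycles / (clock_hz / 2) + stall_seconds


def one_fast_core_one_thread(compute_cycles, stall_seconds, clock_hz):
    """Both workloads run back to back on a single-threaded core."""
    return 2 * (compute_cycles / clock_hz + stall_seconds)


if __name__ == "__main__":
    cycles, clock = 3_200_000, 3_200_000_000
    for stall in (0.0, 0.0005):
        print(
            f"stall={stall * 1e3:.1f} ms",
            f"shared 3.2 GHz: {one_fast_core_two_threads(cycles, stall, clock) * 1e3:.2f} ms",
            f"2 x 1.6 GHz: {two_half_clock_cores(cycles, stall, clock) * 1e3:.2f} ms",
            f"single thread: {one_fast_core_one_thread(cycles, stall, clock) * 1e3:.2f} ms",
        )
```

With no stalls, the shared core and the pair of half-clock cores come out the same in this model; once stalls are added, the shared core pulls ahead because it fills one thread's waiting time with the other's work.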

Months ago, there was discussion about what had changed, and it seemed evident from the die shots that a large portion of the DD1 die had been cloned. But not the full thing. There was speculation about whether or not the VMX had been cloned, and so on. To run two threads concurrently would suggest twice the execution units, correct? But Pana made mention at GAF of shared ALU/FPU/VMX units. How then could you run two threads?
There's still quite a bit about both architectures that hasn't been made public. A big part of the switch to DD2 seems to have been to adopt a more robust core (and one that looks a lot like the one from the X360). There was an additional functional unit in the VMX area of the die, but I don't think you should read that as it now has 2 VMX units. You have the answer to your question in the first part of your post - running two threads at the same time on a 3.2 GHz processor is a lot like running one each on 1.6 GHz ones.

My guess is that the original philosophy of the processor - most code running on the SPEs with the PPE orchestrating things and running the OS - wasn't turning out as ideal for the desired application as they'd hoped and it made sense to beef up the PPE since it'd be doing more heavy lifting (and that's certainly how it turned out).

Cell still performs best when used as envisioned (the ray tracing demo, Toshiba's HD video demo) but there aren't many opportunities to really push it like that.

It would appear that calling XeCPU a 3-PPE chip would be very flawed indeed.
I don't get why you'd say that; the main difference in design seems to be with the VMX unit, and you could practically overlay them otherwise. Maybe I'm expecting too much from the SPEs, but I don't see why you'd bother with VMX on Cell for SIMD use at all, so improvements to it are nice but... Maybe that was part of the plan to entice Apple (who has lots of legacy Altivec code and would have loved a better-running variant than what they got in the 970) away from the dark side. Not that it worked.
 
Compared to the X360 CPU, the P4 core has relatively many tricks to keep it running without too many stalls. Of course the Prescott is not as good at this as the Athlon 64, and that was why AMD did not go down that road, according to F.W.

The point I'm getting to is that Intel has thrown cache at the Prescott as a way to get better HT, although the success of this can be discussed elsewhere.

As ERP mentioned, the X360 CPU would probably gain more from HT, in a relative sense, than the latest PC CPUs, in that smart coding can get more out of the core(s).
The trouble I see is with the cache and whether it will be enough, or rather whether it was the best cost/performance for the budget they set.

As I see it, the Cell CPU with the DD2 revision has 2 VMX units with 32 registers, versus 1 VMX unit per core with 128 registers on the XeCPU. Correct me if I'm wrong.
 
My take on 1.5 threads / 2 threads is

1) PPE has larger cache per thread than Xbox 360 CPU core
2) PPE has faster memory access

and not dependent on multithreading styles. Both are fine-grained MT.
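
To put a rough number on point 1, here is a back-of-the-envelope sketch. It assumes the commonly cited L2 sizes (512 KB for the PPE, 1 MB shared across the three Xbox 360 cores) and an even split between hardware threads, which real caches do not enforce; the helper name is made up.

```python
def l2_per_thread_kb(l2_kb, hardware_threads):
    """Naive even split of a shared L2 cache between hardware threads."""
    return l2_kb / hardware_threads


# Assumed figures: 512 KB L2 for the PPE (2 threads),
# 1 MB shared L2 for the Xbox 360 CPU (3 cores x 2 threads).
ppe_kb = l2_per_thread_kb(512, 2)
xenon_kb = l2_per_thread_kb(1024, 6)

if __name__ == "__main__":
    print(f"PPE: {ppe_kb:.0f} KB per thread")
    print(f"Xbox 360 CPU: {xenon_kb:.0f} KB per thread")
```

On those assumptions the PPE has about 256 KB per thread against roughly 171 KB per thread on the Xbox 360 CPU.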
 
one said:
My take on 1.5 threads / 2 threads is

1) PPE has larger cache per thread than Xbox 360 CPU core
2) PPE has faster memory access

and not dependent on multithreading styles. Both are fine-grained MT.

Heh, yeah, seems so obvious but I hadn't considered it. That could easily just be it.

The threads having more cache to them is clear, but do we know memory access is also faster? Or are we taking that from just an expected better XDR performance, or something else?
 
Hyperthreading isn't true multiprocessing. If you try to run some parallel computation like physics in multiple threads on a single core, they're going to be crashing into each other.
So the way I see it, Xenon is 3 way true multi-processing with some small benefit from hyperthreading while Cell is 8 way mp.

Seems like the way you would code for Xenon is each thread running a different process. Sort of multiprocessing at the job level. i.e. thread 1 = physics, thread 2 = AI, thread 3 = geometry or whatever.

Whereas Cell would be parallel at the algorithm level, i.e. threads 1-4 = physics, threads 5-7 = AI, with each thread running independently on an SPU.
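
A minimal sketch of the two styles, using ordinary Python threads purely to show the structure (it says nothing about real performance on either console, and the physics and AI functions are placeholder toys):

```python
from concurrent.futures import ThreadPoolExecutor


def step_physics(bodies):
    """Toy physics: advance each (position, velocity) pair by one step."""
    return [(pos + vel, vel) for pos, vel in bodies]


def step_ai(agents):
    """Toy AI: bump each agent's state counter."""
    return [state + 1 for state in agents]


def job_level(bodies, agents):
    """One worker per subsystem: physics on one thread, AI on another."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        physics = pool.submit(step_physics, bodies)
        ai = pool.submit(step_ai, agents)
        return physics.result(), ai.result()


def split(items, parts):
    """Cut a list into at most `parts` contiguous chunks."""
    size = max(1, -(-len(items) // parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def algorithm_level(bodies, agents, physics_workers=4, ai_workers=3):
    """Each subsystem is itself split across several workers."""
    with ThreadPoolExecutor(max_workers=physics_workers + ai_workers) as pool:
        # map() submits every chunk immediately, so both subsystems overlap.
        physics_parts = pool.map(step_physics, split(bodies, physics_workers))
        ai_parts = pool.map(step_ai, split(agents, ai_workers))
        physics = [b for part in physics_parts for b in part]
        ai = [a for part in ai_parts for a in part]
    return physics, ai
```

Both functions return the same results as running the steps serially; the difference is only in how the work is cut up.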
 
If the Cell happens to have two VMX units, that doesn't make it more "real" in terms of multithreading. It just means that the core doesn't have to wait for one VMX unit to complete execution before beginning the next SIMD instruction.

In other words, it reduces the latency of filling the pipelines with vector instructions and minimizes bubbles. There have been other PowerPC chips with more than one VMX unit, and this is what they have done in the past.
 
seismologist said:
Hyperthreading isn't true multiprocessing. If you try to run some parallel computation like physics in multiple threads on a single core, they're going to be crashing into each other.
So the way I see it, Xenon is 3 way true multi-processing with some small benefit from hyperthreading while Cell is 8 way mp.

Xenon is both SMP and SMT. The hardware supports six independent hardware threads and sees each thread as its own processor. Programmers can see them either as six threads or as six processors.

Even MP-only systems don't necessarily get the full benefit of the extra processing power (dual Athlons, dual Celerons, SLI/Crossfire all prove this).
 
Titanio said:
Heh, yeah, seems so obvious but I hadn't considered it. That could easily just be it.

The threads having more cache to them is clear, but do we know memory access is also faster? Or are we taking that from just an expected better XDR performance, or something else?
I don't know under what conditions they tested them, but if you allocate a part of the L2 cache as the write buffer for Xenos then you have even less cache. As for the memory access, it's just my guess that the RAM is farther from the CPU in Xbox 360 and a cache miss costs more in overall latency.
 
one said:
I don't know under what conditions they tested them, but if you allocate a part of the L2 cache as the write buffer for Xenos then you have even less cache. As for the memory access, it's just my guess that the RAM is farther from the CPU in Xbox 360 and a cache miss costs more in overall latency.

Xenos can read directly from L2 AND Main memory simultaneously. Coding with that in mind should reduce latency considerably.
 
seismologist said:
Whereas Cell would be parallel at the algorithm level, i.e. threads 1-4 = physics, threads 5-7 = AI, with each thread running independently on an SPU.
Or Cell could be geared over to
SPEs 1-7 = Physics for 4 ms
SPEs 1-7 = AI for 2 ms
SPEs 1-3 = Audio for 2 ms
SPEs 4-7 = Cloth dynamics for 3 ms
SPEs 1-7 = Rendering reflections and post processing for the rest of the frame when they've finished their current task. SPEs 1-3 would start this before SPEs 4-7.

Just want to remind people that an SPE doesn't need to be statically assigned a task but can move between tasks, which might well be beneficial; examples of SPE use have invariably divided the SPEs among static tasks. E.g. you may want to complete the physics simulation before tackling the AI, to see if there's anything an AI entity needs to react to. Running the physics in parallel with the AI would be a bad idea, as the simulation might produce a result that affects an AI entity that has already had its AI processed, such as a rock that starts falling in the simulation AFTER the guy underneath has already had his AI algorithm see that there are no threats. This concept would also fit in better with Cell's scalability.
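
The schedule above can be sketched as a small timeline builder. This is only an illustration of the idea, not how any real SPE job system works: the `Task` type, the barrier flag and the 16 ms frame budget are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    spes: range                    # which SPEs (numbered 1-7) take part
    duration_ms: Optional[float]   # None = run until the end of the frame
    barrier: bool = True           # True = all listed SPEs start together


def build_timeline(tasks, frame_ms, spe_count=7):
    """Return (spe, task, start_ms, end_ms) entries for one frame."""
    free_at = {spe: 0.0 for spe in range(1, spe_count + 1)}
    timeline = []
    for task in tasks:
        # With a barrier, nobody starts until the slowest SPE is free.
        group_start = max(free_at[spe] for spe in task.spes)
        for spe in task.spes:
            start = group_start if task.barrier else free_at[spe]
            end = frame_ms if task.duration_ms is None else start + task.duration_ms
            timeline.append((spe, task.name, start, end))
            free_at[spe] = end
    return timeline


# The example schedule from the post, with physics finishing before AI starts.
FRAME_TASKS = [
    Task("physics", range(1, 8), 4),
    Task("ai", range(1, 8), 2),
    Task("audio", range(1, 4), 2),
    Task("cloth", range(4, 8), 3),
    Task("render", range(1, 8), None, barrier=False),
]

if __name__ == "__main__":
    for spe, name, start, end in sorted(build_timeline(FRAME_TASKS, frame_ms=16.0)):
        print(f"SPE {spe}: {name:8s} {start:4.1f} -> {end:4.1f} ms")
```

In this model SPEs 1-3 pick up rendering at 8 ms and SPEs 4-7 at 9 ms, matching the note that the first group starts before the second.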
 
one said:
I don't know under what conditions they tested them, but if you allocate a part of the L2 cache as the write buffer for Xenos then you have even less cache. As for the memory access, it's just my guess that the RAM is farther from the CPU in Xbox 360 and a cache miss costs more in overall latency.
I've often heard that XDR has lower latencies, plus would the memory controller on Xenos not add a delay? One of the advantages of AMD64 is the on-die memory controller, is it not? Though both machines really need to be avoiding cache misses, and I don't think any advantage PS3 might have in lower latencies would have a noticeable benefit.
 
Shifty Geezer said:
Or Cell could be geared over to
SPEs 1-7 = Physics for 4 ms
SPEs 1-7 = AI for 2 ms
SPEs 1-3 = Audio for 2 ms
SPEs 4-7 = Cloth dynamics for 3 ms
SPEs 1-7 = Rendering reflections and post processing for the rest of the frame when they've finished their current task. SPEs 1-3 would start this before SPEs 4-7.

Just want to remind people that an SPE doesn't need to be statically assigned a task but can move between tasks, which might well be beneficial; examples of SPE use have invariably divided the SPEs among static tasks. E.g. you may want to complete the physics simulation before tackling the AI, to see if there's anything an AI entity needs to react to. Running the physics in parallel with the AI would be a bad idea, as the simulation might produce a result that affects an AI entity that has already had its AI processed, such as a rock that starts falling in the simulation AFTER the guy underneath has already had his AI algorithm see that there are no threats. This concept would also fit in better with Cell's scalability.

Yeah, that's a much better way to schedule things: maximize throughput so you're not sitting idle waiting for sequential jobs to complete.
 