New ITAGAKI interview touches on 360 and PS3 comparison

I think the SPE model is actually easier to get performance out of compared to a classic SMP/T design like the X360.

On the 360 you have to do a lot of synchronization work because 6 threads can be running simultaneously, each taking over some work for the game engine. So everything has to be synchronized carefully so that no thread falls out of step with the rest (too slow or too fast). Additionally, there is very little cache for 6 threads, so the developer has to carefully plan what will be done at which time to avoid running into a cache bottleneck, which also decreases performance.

On the CELL, on the other hand, you have the same problems when looking only at the PPE (although cache is a little less of a problem since there is more per thread than on Xenon). The big difference, however, is that the SPE model is not an SMP/T one at all. It can be thought of as something like a master-slave relationship where the master (PPE) delivers tasks to the individual SPEs, much like simply calling a subroutine. Only that the subroutine is running on an ultra-fast SPE instead of a GP core. This way much less synchronization work is needed, and since each SPE is independent from the others it's also highly unlikely that "cache" stalls can occur (also since SPEs don't have any cache)...
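In rough PPE-side C, that "call a subroutine on an SPE" pattern might look something like the sketch below, using the libspe2 interface from the Cell SDK. The SPE program name and the job struct are made up for illustration, and a real engine would keep SPE contexts alive rather than creating one per call.

Code:
#include <libspe2.h>

extern spe_program_handle_t physics_spu_program;  /* hypothetical embedded SPE ELF */

struct job { float *particles; int count; } __attribute__((aligned(16)));

/* Run one job to completion on a freshly created SPE context. */
int run_job_on_spe(struct job *j)
{
    spe_context_ptr_t spe = spe_context_create(0, NULL);
    if (!spe)
        return -1;

    spe_program_load(spe, &physics_spu_program);

    /* Blocks until the SPE program stops -- from the PPE's point of view
       this really is just calling a subroutine on another core. */
    unsigned int entry = SPE_DEFAULT_ENTRY;
    int rc = spe_context_run(spe, &entry, 0, j, NULL, NULL);

    spe_context_destroy(spe);
    return rc;
}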
 
Nemo80 said:
The big difference, however, is that the SPE model is not an SMP/T one at all. It can be thought of as something like a master-slave relationship where the master (PPE) delivers tasks to the individual SPEs, much like simply calling a subroutine.

It's trivial to construct any form of multithreaded relationship on an SMP machine, because it is the most general and flexible form of multithreading that exists.

If you want, you can easily build a "master - slave" design pattern on SMP by designating a main thread and constructing a job queue for each slave thread you allocate.

The synchronization required for doing this on an SMP is no more and no less than in the SPE Model -- you need some sort of lock to protect each slave thread's job queue from corruption by access from multiple concurrent threads.
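For what it's worth, that per-slave job queue is only a few dozen lines of ordinary pthreads code. A minimal sketch (names are illustrative; overflow handling and shutdown are left out):

Code:
#include <pthread.h>
#include <stddef.h>

#define QUEUE_SIZE 64

typedef void (*job_fn)(void *arg);

struct job       { job_fn fn; void *arg; };
struct job_queue {
    struct job      jobs[QUEUE_SIZE];
    size_t          head, tail;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
};

/* Master side: push a job onto one slave's queue.
   No overflow check: assumes at most QUEUE_SIZE jobs are outstanding. */
void queue_push(struct job_queue *q, job_fn fn, void *arg)
{
    pthread_mutex_lock(&q->lock);
    q->jobs[q->tail % QUEUE_SIZE] = (struct job){ fn, arg };
    q->tail++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Slave side: each worker thread runs this loop. */
void *worker(void *p)
{
    struct job_queue *q = p;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->head == q->tail)
            pthread_cond_wait(&q->not_empty, &q->lock);
        struct job j = q->jobs[q->head % QUEUE_SIZE];
        q->head++;
        pthread_mutex_unlock(&q->lock);
        j.fn(j.arg);        /* run the job outside the lock */
    }
    return NULL;
}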
 
aaaaa00 said:
It's trivial to construct any form of multithreaded relationship on an SMP machine, because it is the most general and flexible form of multithreading that exists.

If you want, you can easily build a "master - slave" design pattern on SMP by designating a main thread and constructing a job queue for each slave thread you allocate.

The synchronization required for doing this on an SMP is no more and no less than in the SPE Model -- you need some sort of lock to protect each slave thread's job queue from corruption by access from multiple concurrent threads.


I believe you are wrong. You're thinking in classic SMP/T terms, and like Nemo80 hinted, the correct way to look at SPE programming is not like this. The key advantages the Cell has here are the DMA memory access model and the fact that each SPE has a local store. In the Cell programming model you would set up a DMA on the SPE and then let it execute/read/write in its own private area.
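A minimal SPE-side sketch of that get/work/put pattern, using the standard spu_mfcio.h DMA calls; the chunk size, the way the effective address is passed in argp, and the work routine are just assumptions for illustration:

Code:
#include <spu_mfcio.h>

#define CHUNK_BYTES 16384   /* transfer size: multiple of 16, max 16KB per DMA */

static char buffer[CHUNK_BYTES] __attribute__((aligned(128)));

/* Placeholder for the real work, done entirely out of local store. */
static void process_buffer(char *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] ^= 0xFF;
}

int main(unsigned long long spe_id, unsigned long long argp, unsigned long long envp)
{
    const unsigned int tag = 0;
    (void)spe_id; (void)envp;

    /* Pull a chunk of main memory into the private local store... */
    mfc_get(buffer, argp, CHUNK_BYTES, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();          /* wait for the transfer to finish */

    /* ...work on it locally, with no cache and no coherency traffic... */
    process_buffer(buffer, CHUNK_BYTES);

    /* ...then push the results back out to main memory. */
    mfc_put(buffer, argp, CHUNK_BYTES, tag, 0, 0);
    mfc_read_tag_status_all();
    return 0;
}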
 
ihamoitc2005 said:
Gubbi said:
It's not that you explicitly have to set up a DMA that is the problem, it's that local stores aren't kept coherent. The lack of memory coherence *is* a bitch. The nuisance of the heterogeneous ISA is minor compared to that.

You didn't read the link.

It starts with ...

The Cell Broadband Engine is a single-chip multiprocessor with nine processors operating on a shared, coherent memory.

and

While each SPE is an independent processor running its own application programs, a shared, coherent memory and a rich set of DMA commands provide for seamless and efficient communications between all Cell processing elements.

DMA transfers are kept coherent, but stores to an SPE's local store are not. This means that the entire local store is part of the SPE's context and has to be saved on a context switch.

So the line from the article (my emphasis):
The SPEs are more adept at compute-intensive tasks and slower at task switching.
is a gross understatement. An SPE having more than 256KB of context, compared to ~1KB for a regular PPE context, means that task switching is impractical.

ihamoitc2005 said:
Also referring to heterogeneous ISA ...

http://domino.research.ibm.com/comm...?Open&printable

Memory access is performed via a DMA-based interface using copy-in/copy-out semantics, and data transfers can be initiated by either the IBM Power™ processor or an SPU. The DMA-based interface uses the Power Architecture™ page protection model, giving a consistent interface to the system storage map for all processor structures despite its heterogeneous instruction set architecture structure.

Sounds pretty straightforward no?

And that has nothing to do with the instruction set architecture. The PPE and SPEs have different ISAs and, worse, different programming models. This means that if your PPE is bogged down with tasks, you can't just push one of them off to one of the SPEs; you actually have to do some real work to do that.

So no, it's not straightforward at all.

Cheers
Gubbi
 
inefficient said:
I believe you are wrong. You're thinking in classic SMP/T terms, and like Nemo80 hinted, the correct way to look at SPE programming is not like this. The key advantages the Cell has here are the DMA memory access model and the fact that each SPE has a local store. In the Cell programming model you would set up a DMA on the SPE and then let it execute/read/write in its own private area.

The only way to look at CELL is to look at it as what it is: one host processor with multiple DSPs. What is novel about CELL is that the DSPs are optimized for floats instead of integers, the level of integration, and the bandwidth of the beast.

Microsoft chose to add DSP-like functionality to their cores: SIMD instructions and lockable cache.
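(As a rough illustration of the SIMD part of that, here is a minimal sketch using the generic AltiVec/VMX intrinsics; Xenon's VMX128 extensions add more registers and a few extra instructions that aren't shown, and the alignment/size assumptions are noted in the comments.)

Code:
#include <altivec.h>

/* Multiply-accumulate two float arrays four lanes at a time: dst += a * b.
   Assumes n is a multiple of 4 and all pointers are 16-byte aligned. */
void madd_arrays(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);
        vector float vb = vec_ld(0, b + i);
        vector float vd = vec_ld(0, dst + i);
        vec_st(vec_madd(va, vb, vd), 0, dst + i);
    }
}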

The fact that you have to explicitly move data around with DMAs is not an advantage. Repeat: NOT... NOT.... N.O.T. an advantage. It was done to remove the complexity of keeping 9 cores coherent.

Cheers
Gubbi
 
expletive said:
Regardless of the actual hardware benefits, a lot of developers who prefer the 360 have commented on the overall dev environment and the tools they can use: debuggers, performance tools, etc. Plus they are all tools that developers who develop for the PC are familiar with already, and those who haven't claim they are easy to use. (And after seeing the PC version of the 360 controller, it's obvious this is a HUGE part of MS' mid- to long-term strategy: one development budget, two platforms.)

That said, parallelism with 3 identical cores and 6 identical threads should be a bit easier than a PPE and SPE design where each has different needs and potentially different roles, shouldn't it? (I have to credit that thought to Carmack though, as he stated it in his QuakeCon address.)

What we have not seen, however, is if the Cell will provide an advantage in the closed-box system known as the PS3, and I think that's what is really on trial in this thread.

J

Almost exactly my thoughts.
 
C/C++ and Intrinsics on EE's VU's... :lol

Seriously,

If you are not counting the "successful" (ahem) VectorC/Codeplay VU compiler, SCE never really endorsed a compiler for the VUs, nor did they ever think about it when designing them with Toshiba. Even VCL was an afterthought, a good one, but still an afterthought IMHO. The ISA, the VU resources, the choice of functional units, etc... it was all chosen with low-level ASM programming in mind, with hand-scheduling of instructions by the programmer.

You cannot really compare the VUs with the SPEs this way unless you are quite bitter about the argument and want to ignore the progress made in the concept and implementation of vector/SIMD processors going from the VUs to the SPEs (although some mistakes were made too, like the big mess about the lack of misaligned loads/stores in the SPEs and how it relates to scalar processing performance).
 
aaaaa00 said:
It's generally easier to write fast code when it's easier to write correct code, since fast but incorrect code is not typically very useful. :devilish:

why is it that i get the feeling you equate 'correct' with 'easy'? i.e. why does the fact that it's relatively easy to write smp multithreaded code (which, btw, many gamedevs i've worked with would outright disagree with) somehow translate into writing correct code? i can pick a rock off the ground and throw it right away at a target - that's toddler-easy. the chances of me hitting a target that way, though, and the chances of me hitting the same target through a scoped rifle, which i took some time to learn to handle, are quite different. see, the former approach is 'easy' whereas the latter is 'correct'.

Correct multithreaded code is much easier to write when you have N identical CPUs all sharing identical access to the same main memory, with a well-ordered memory model and cache coherency guaranteed by the hardware. (Which is pretty much x86 SMP in a nutshell in fact.)

first, you step on the premise that correct smp multithreading is the easiest, most natural and, i get the feeling, magically efficient form of concurrency, whereas in fact it's only the _dominant_ one in the desktop space, for various reasons, the majority of them purely historical and others purely economical. and you're yet to prove that point about easy _correct_ smp code.

Such an architecture is fairly well understood today, and any college concurrent programming textbook will teach you the basics of synchronization objects and have parallel algorithms that work correctly and reasonably well on an SMP.

good luck with being a good concurrent programmer after reading one college textbook on multithreading (that's not necessarily directed at you, that's a general statement).

Each step away you take from such an architecture introduces stuff that makes it more complicated just to ensure code correctness, never mind performance.

that's basically saying that each step away from a place takes us farther from it. yes, what's so fundamentally problematic with that? why should we be so stuck with smp multithreading? because of the availability of college textbooks on the subject? sorry, i fail to see the reason.

The point Carmack is making is that the xbox 360 is already pretty much the best case scenario for multithreaded architectures -- but even there, ensuring code correctness is going to be hard to do before you even start to think about making the performance better.

the point Carmack is making is that the 360 is as close to pc smp as you can get, and _yet_ that gives you zilch in terms of guaranteed performance gains. and he has a gripe with that, regardless of how (un)comfortable he or anybody else may feel about smp multithreading.
 
darkblu said:
regardless of how true his statement is in itself, the question is: what do _you_ read into his statement.

the first paragraph of Carmack's statement basically says: 'it is very easy to spawn a thread and get it running on the 360 - just as easy as it is on your grandma's smp pc'

to which everybody can only nod in agreement, as there's nothing to misunderstand here and that message gets clearly and correctly propagated. now, getting a thread up and running and actually getting efficient parallelism are two entirely different things, as anybody who has ever tackled a single parallelism problem could tell you. so let's see what Carmack says further in his second paragraph.. he says exactly this - 'regardless of how easy it is to tinker with threads (in your grandma's smp way), this still grants you nothing in terms of effective parallelism'.

ok, now that we cleared up the matter with Carmack's statement we can return to the original topic - how much easier it is to achieve _efficient_parallelism_ on the 360 over the cell. and now it's your turn to step in and actually build your argument.

Ok, a couple of things here, and I do appreciate your thoughtful response.

1. My original post was phrased in the form of a question:

"That said, paralleism with 3 identical cores and 6 identical threads should be a bit easier than a PPE and SPE design where each has different needs and potentially different roles shouldnt it?"

So I'm not trying to argue any point, just trying to understand the differences and benefits of each approach.

2. I guess I interpret what is being said by JC slightly differently. My interpretation is he's saying:
a. multithreaded programming is a pain in the ass
b. from an 'ease of use' standpoint, the design of the 360 is the best possible case for a developer to coax performance benefits out of multithreading
c. even in the best possible case, it's very difficult to realize real-world benefits

In my mind, that still doesn't change the fact that regardless of how much absolute performance gain you can wring out of the 360 CPU, at the end of the day it's still easier to get that meager efficiency from the XeCPU than from the Cell.

What does this mean? I don't really know. More relative efficiency on the 360 CPU? Shorter development times? Better games sooner in each console's lifecycle? No idea, and only time will tell.

So in summary, even if the dev-friendly design of the XeCPU gets you nothing, it's easier to get nothing on the 360 than on the PS3. :D

J
 
Panajev2001a said:
Originally Posted by Gubbi
The fact that you have to explicitly move data around with DMAs is not an advantage.


Says you and others... but not everyone dislikes it :).
I've got to say I like the idea of the SPE's forced memory management. I work on high-level PC code and there are often occasions when I WANT to know what's passing through the cache and how far my data is from the processing logic. But then at uni, out of all the languages and programming models, the one I liked most was assembler. I preferred to know exactly what the hardware is doing and to think like a CPU to make the most of it.

I guess I could liken it to explicit variable declarations or not. I'd much rather have the NEED to declare my variables up front than the freedom to make new variables on the fly mid-code, as the ease of the latter produces the risk of errors from using a wrong variable name. Likewise, the NEED to keep an eye on managing memory accesses may be an inconvenience, but it focuses the developer on optimisations and on working WITH the processor. Kind of a Zen thing :mrgreen:
 
Gubbi said:
DMA transfers are kept coherent, but stores to an SPE's local store are not. This means that the entire local store is part of the SPE's context and has to be saved on a context switch.
I can't see why anyone would be talking about context switches on an SPE. Anyone wanting to run two+ concurrent threads on an SPE and switch between them needs their head examining! You set it a task, let it finish, and then move on to another task. When would you not want to work that way on an SPE?
 
Gubbi said:
The fact that you have to explicitly move data around with DMAs is not an advantage. Repeat: NOT... NOT.... N.O.T. an advantage. It was done to remove the complexity of keeping 9 cores coherent.
How about the prospect of a software framework for Cell that can automatically manage/optimize dataflow in this deterministic environment instead of hand optimization by a programmer?
 
Shifty Geezer said:
I've got to say I like the idea of the SPE's forced memory management. I work on high-level PC code and there are often occasions when I WANT to know what's passing through the cache and how far my data is from the processing logic. But then at uni, out of all the languages and programming models, the one I liked most was assembler. I preferred to know exactly what the hardware is doing and to think like a CPU to make the most of it.
For small, trivial things, sure, assembler is great, and understanding exactly how the machine is operating at a low level is a good thing. But aren't we so far beyond the simple, trivial cases that this is just not feasible--except for the highly focused, performance-profile-directed situations?

.Sis
 
Dunno. Large programs are broken into smaller procedures or code segments that make up your engine, and these get pieced together to make the whole program. 256KB of LS for data and code means your program isn't going to be totally massive, and I would guess much smaller than 256KB. Heck, 200KB of assembler isn't a pretty thought! You can achieve a lot in 32KB (whole 8-bit games, even. Imagine how fast the original Elite could run when written for an SPE :oops: ) and I'd expect a process could be broken into manageable and efficient chunks. Seems more a matter of good design being needed rather than mystical programming powers. And note SPEs don't need assembler, so the point's moot anyway. Unless you're still developing for PS2!
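To put a rough shape on "manageable and efficient chunks": the usual approach is to stream a big dataset through the local store in fixed-size pieces, double-buffered so the DMA of the next piece overlaps the work on the current one. A sketch with made-up sizes and a trivial work routine:

Code:
#include <spu_mfcio.h>

#define CHUNK 16384                                  /* bytes per piece */
#define FLOATS_PER_CHUNK (CHUNK / sizeof(float))

static float buf[2][FLOATS_PER_CHUNK] __attribute__((aligned(128)));

/* Trivial stand-in for the real per-chunk processing. */
static void work(float *p, int n) { for (int i = 0; i < n; i++) p[i] *= 2.0f; }

void stream_chunks(unsigned long long ea, int nchunks)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);         /* prime the first buffer */

    for (int c = 0; c < nchunks; c++) {
        int nxt = cur ^ 1;

        /* Kick off the next chunk's fetch; the barrier form (mfc_getb) orders
           it after the still-pending put that used the same tag/buffer. */
        if (c + 1 < nchunks)
            mfc_getb(buf[nxt], ea + (unsigned long long)(c + 1) * CHUNK, CHUNK, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);                /* wait for the current chunk */
        mfc_read_tag_status_all();

        work(buf[cur], FLOATS_PER_CHUNK);            /* compute while the next DMA runs */

        mfc_put(buf[cur], ea + (unsigned long long)c * CHUNK, CHUNK, cur, 0, 0);
        cur = nxt;
    }

    mfc_write_tag_mask(3);                           /* drain the last transfers */
    mfc_read_tag_status_all();
}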
 
expletive said:
b. from an 'ease of use' standpoint, the design of the 360 is the best possible case for a developer to coax performance benefits out of multithreading
c. even in the best possible case, it's very difficult to realize real-world benefits

You misunderstood what was meant by "best possible case". He didn't say the design of the XeCPU = best possible design; what he means is that in programming the XeCPU, one could ideally use it that way (get six threads going) but that in reality it's not like that. You see, he's comparing the ideal world with the real world. Carmack is an expert only in single-core x86, so maybe he's looking for excuses.
 
SPE use

Shifty Geezer said:
Dunno. Large programs are broken into smaller procedures or code segments that make up your engine, and these get pieced together to make the whole program. 256KB of LS for data and code means your program isn't going to be totally massive, and I would guess much smaller than 256KB. Heck, 200KB of assembler isn't a pretty thought! You can achieve a lot in 32KB (whole 8-bit games, even. Imagine how fast the original Elite could run when written for an SPE :oops: ) and I'd expect a process could be broken into manageable and efficient chunks. Seems more a matter of good design being needed rather than mystical programming powers. And note SPEs don't need assembler, so the point's moot anyway. Unless you're still developing for PS2!

Finally someone understands CELL programming. It's not multicore in the way of the XeCPU. It's a PPE running the OS with the ability to run 7 additional threads using specialized hardware, much like a CPU off-loading graphics to the GPU, audio to the sound card, etc... but the advantage is that since all SPEs are identical, whatever SPE is free can do whatever the next required task is, a little like the unified shader idea where the same unit can do different types of tasks.

The key is to write lots of small programs that fit in 256KB. So a new programming model must be internalized, but if small programs are used the result is very fast and efficient load-balanced processing.

People wanting multi-tasking on one SPE don't understand how to use it. It's not even needed. It processes one program at a time in order of queue, like 7 grocery store cashiers, but where customers can move from a long line to a short line as needed.

Also, this is easy to understand, no? Memory on each SPE is coherent but also shared on a very high bandwidth bus, so it should be pretty straightforward. No bandwidth issues, no coherence issues; the only thing is to get the programming model correct and the rest is taken care of.

From the IBM link...
each SPE is an independent processor running its own application programs, a shared, coherent memory
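A PPE-side sketch of that "cashier" arrangement, assuming libspe2 with one pthread per SPE; the embedded SPE program name, the job layout and the counts are made up, and error handling is omitted:

Code:
#include <libspe2.h>
#include <pthread.h>

extern spe_program_handle_t job_spu_program;    /* hypothetical embedded SPE ELF */

#define NUM_SPES 6
#define NUM_JOBS 128

static void *job_args[NUM_JOBS];                /* argp handed to each small job */
static int next_job = 0;
static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;

/* One of these loops runs per SPE: whichever SPE finishes first just
   takes the next job off the shared list -- the "shortest line" wins. */
static void *spe_worker(void *unused)
{
    (void)unused;
    spe_context_ptr_t spe = spe_context_create(0, NULL);
    spe_program_load(spe, &job_spu_program);

    for (;;) {
        pthread_mutex_lock(&job_lock);
        int j = (next_job < NUM_JOBS) ? next_job++ : -1;
        pthread_mutex_unlock(&job_lock);
        if (j < 0)
            break;

        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_run(spe, &entry, 0, job_args[j], NULL, NULL);  /* runs job to completion */
    }

    spe_context_destroy(spe);
    return NULL;
}

The main thread would just fill job_args, spawn NUM_SPES of these workers with pthread_create and join them; whichever SPE empties its "line" first simply serves the next customer.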
 
Questions, questions, questions

ihamoitc2005 said:
People wanting multi-tasking on one SPE don't understand how to use it. It's not even needed. It processes one program at a time in order of queue, like 7 grocery store cashiers, but where customers can move from a long line to a short line as needed.

BS. No way that's true (smiling, hoping that it really is). Are you telling me that you don't have to dedicate a particular SPE to do physics? So why do people always say, "If EA is using 3 SPEs for graphics and the other 4 for physics, sound and AI, then the CPU is being wasted"? I don't get it. So you're saying, using your perfect example, that any SPE could be used for physics at any time? Is it smart to do it that way? So the 3 SPEs that EA was talking about might not always be the exact same SPEs, but will just take 3 SPEs' worth of information at any given time?

And who are you and what do you do? I've never seen you here before. Are you a game developer?
 