Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

DeanoC said:
MS have done work on a C++ variant called Concur, that they are promoting as the basis of the C++ ISO200x standard. The basic idea is, as ERP said, pretty simple: add a few keywords to C++ that indicate parallelism... I have no idea how well it works, but the fact they're at the point of introducing a new variant on C++ is IMHO good. I'm strongly in the 'it's got to be C++ in new clothes' school of language design.
I think it'll take another 5 years or more to be incorporated in the standard. They've been discussing C++0x since 1999 for some new containers and syntactic sugar but still haven't reached a conclusion, while some vendors can't even support template export properly. Besides, it's unlikely that MS will encourage developers to use non-standard language extensions such as Concur in production code, as they can be a source of compatibility problems with future tools. Because it's deeply integrated into the language, unlike simple intrinsics or pragmas, IMHO it affects programming style and software architecture so greatly that code will be hard to convert or reuse when the real standard, which will differ from it, appears.
 
10 years ago my Comp Sci professors were talking about 5th generation languages in development that'd allow programming through near English level, natural language...
 
Shifty Geezer said:
10 years ago my Comp Sci professors were talking about 5th generation languages in development that'd allow programming through near English level, natural language...

I'd suppose he got fired for substance abuse?

Cheers
Gubbi
 
Shifty Geezer said:
10 years ago my Comp Sci professors were talking about 5th generation languages in development that'd allow programming through near English level, natural language...
Too bad that we can't make compilers worthy of these super high level languages. Compilers so far have just been adequate.
 
10 years ago my Comp Sci professors were talking about 5th generation languages in development that'd allow programming through near English level, natural language...

They were talking about 5GLs 10 years before that as well!
Then again COBOL can read almost like English and it's a very old language.

I once used a language called RPG, it's completely un-English (imagine indent sensitive assembly language, in German) but it's amazingly powerful as you can do amazing stuff in only a few lines of code. The best bit is it's getting on for 50 years old!

Something like that in English could be amazingly powerful. It'd be completely useless for games (it's really a database manipulation language) but I guess a games version could be done.
 
Fafalada said:
Maybe by short they mean the C/C++ name for 16bit integers.
The SPE integer ISA doesn't have explicit instructions for multiply/multiply-add on 32-bit operands.

Thanks for the answer Faf, makes sense.
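
To make Faf's point concrete, here is a minimal sketch of how a 32-bit multiply can be assembled from 16-bit multiplies when no native 32-bit multiply exists. The names `mul16` and `mul32_from_16` are hypothetical and only stand in for a 16-bit multiply instruction; this is plain scalar C++ for illustration, not actual SPE code or the sequence any particular compiler emits.

```cpp
#include <cstdint>

// Hypothetical stand-in for a hardware 16x16 -> 32-bit unsigned multiply.
inline std::uint32_t mul16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
}

// Low 32 bits of a 32x32 multiply built from three 16-bit multiplies.
// The a_hi * b_hi product only affects bits 32 and above, so it is dropped.
inline std::uint32_t mul32_from_16(std::uint32_t a, std::uint32_t b) {
    const std::uint16_t a_lo = static_cast<std::uint16_t>(a & 0xFFFFu);
    const std::uint16_t a_hi = static_cast<std::uint16_t>(a >> 16);
    const std::uint16_t b_lo = static_cast<std::uint16_t>(b & 0xFFFFu);
    const std::uint16_t b_hi = static_cast<std::uint16_t>(b >> 16);

    const std::uint32_t low = mul16(a_lo, b_lo);
    // Cross terms; unsigned wrap-around here is harmless because only the
    // low 16 bits of the sum survive the shift below.
    const std::uint32_t cross = mul16(a_hi, b_lo) + mul16(a_lo, b_hi);

    return low + (cross << 16);
}
```

One 32-bit result therefore costs several multiplies and adds instead of a single instruction, which is the overhead being discussed.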
 
Asher said:
You're slightly changing the argument. I have never said SPEs cannot perform integer and logic instructions. They can, but it's a bit like doing a "cone challenge" with a semi-truck. It works, you just gotta go real slow. ;)

"Real slow" as in SEVEN SPEs running at 3.2 GHz? I never claimed an SPE is equal in performance to the PPE in integer code, but claiming the SPEs are slow is nonsense. How do you define slow?
 
DeanoC said:
SPE have a large register context because they need lots of loop unrolling to get decent speed (same for XeCPU, hence VMX128). If the compiler halves the register file it will seriously limit the extent to which the compiler can hide the in-orderness of it and the FLOPs rating will be reduced.

If you have some time DeanoC maybe you could take a look at this.

I'm just trying to understand why you said this a little better.

I would think Xenon's VMX units would have enough register space to handle two threads and I believe that is documented to be the case. However, it's doubtful to me that the VMX units would have 128 128bit registers like an SPE. This would make the SPE a bit more capable of handling the XLC scheme in question no? I mean even if the register space in an SPE was cut in half I would imagine there would still be more registers per thread than what resides in the VMX units.

Also, I thought the VSUs provided for OoOe at their level. So why would the VMX128 in a Xenon core need a large register space to hide the chip's 'in-orderness'? I thought the register space was largely due to the VMX units being made capable of serving two threads simultaneously so as to avoid stalls there. No one has yet explained just what the PPE's VMX unit is like or how it operates. (Look Ma! I'm fishin'!) Unless the VSUs do not provide OoOe at their level, I'm guessing you meant 'in-orderness' was being hidden at a higher level.

I get lost though because I don't know why you mentioned Xenon's VMX unit and not its general purpose registers in a core.

As far as what the XLC scheme is trying to pull off I was under the impression that it was not meant to be a means of maximizing throughput or rather speed but to deal with being memory bound due to a large number of perhaps unpredictable DMA requests. So if the flops should suffer for doing this I would imagine it would be in relation to a more ideal situation than this. What I mean to say is that in a situation where DMA requests are stalling you this scheme would seem to be a way of getting around those stalls and thus if performing flops in between DMA requests flops performance should improve (relatively). Flops performance would be less than a more ideal situation where DMA requests aren't hanging you out to dry.

I'm trying to understand what it is you really said. Do you mean that having to unroll loops would break this 'trick'? Or are you saying this 'trick' should not be a first option as it would adversely affect performance due to loop unrolling? The latter makes sense to me (well maybe not completely) because this 'trick' seems to be for a specific case. The former I don't understand on my own so I'm asking for help. I'm also curious if the scheme has no value at all anywhere else.

I'm lost again (just point and laugh...everyone else does) on loop unrolling in itself....wouldn't this be done at compile time? So then wouldn't unrolled loops affect the size of your code and thus how much space is consumed in an LS or in cache instead of the space in the core's general purpose registers?
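
On the unrolling question, a small sketch may help. Unrolling does happen at compile time (or by hand) and it does grow the code, but to pay off on an in-order core the unrolled copies also need their own registers so they don't wait on each other, which is where the register pressure DeanoC mentions comes from. The two functions below are hypothetical illustrations in scalar C++, not SPE or VMX code.

```cpp
#include <cstddef>

// Rolled loop: a single accumulator, so every add depends on the previous
// add and an in-order core must wait out the full latency each iteration.
float sum_rolled(const float* data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

// Unrolled by four: four independent accumulators let four adds be in
// flight at once. The cost is more code AND more live registers.
float sum_unrolled4(const float* data, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += data[i];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }
    float acc = (acc0 + acc1) + (acc2 + acc3);
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        acc += data[i];
    }
    return acc;
}
```

Halving the registers available to each thread halves how many such independent chains the compiler can keep going, which is one reading of why the FLOPS rating would drop.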

Lost I am. Saving I will need. Hides the truth the dark side does...clouds my judgment...or is that just my pills...nope....I probably don't know what I'm talking about.

---------------------------------------------------------------------------------

Separate questions to anybody:

Why would an SPE's iop performance be less than its flops performance? (Is version's Gints number wrong in post#70...or is this again a special case kind of thing like Cell being able to handle 64 threads)

Why can't a flop be exchanged for an iop? (3D games use flops not iops anyway more often than not no?)
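
As a rough back-of-envelope for the iop versus flop question, peak numbers are usually just lanes times operations per instruction times clock, divided by how many instructions one result costs. The sketch below uses assumed figures that are not taken from this thread: 4 32-bit lanes per 128-bit register, a fused multiply-add counted as 2 flops, one such instruction per cycle at 3.2 GHz, and about 5 instructions for a 32-bit integer multiply built from 16-bit pieces. The function names are hypothetical.

```cpp
// Peak giga-operations per second under the stated assumptions.
constexpr double peak_gops(int lanes, int ops_per_lane, double clock_ghz,
                           int instructions_per_result) {
    return lanes * ops_per_lane * clock_ghz / instructions_per_result;
}

// Assumed: 4 float lanes, multiply-add = 2 flops, 1 instruction per result.
constexpr double spe_peak_gflops() {
    return peak_gops(4, 2, 3.2, 1);
}

// Assumed: 4 int lanes, 1 op per lane, about 5 instructions per 32-bit
// multiply. Simple integer adds and logic ops would not pay this penalty.
constexpr double spe_peak_int32_mul_gops() {
    return peak_gops(4, 1, 3.2, 5);
}
```

Under these assumptions the float figure comes out at about 25.6 per SPE, while 32-bit integer multiplies land roughly ten times lower, so the gap depends heavily on which integer operations are being counted.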
 
Asher said:
Again, you fundamentally misunderstand what multithreading is. It is not just duplicating the register set. The initial implementations many years ago were like that, modern SMT lets multiple threads compete (in real-time) for resources on the same chip.

Thanks, you do agree, that dual threading does not mean dual execution, cause if it did, it would mean any dual pipelined design offers dual threading.

You like to drag in those VMX units to make your argument, but then you undermine your own argument by doing so, as that dual execution will occur only in certain algorithms and will get you nowhere near the 50 percent claim you are making.

You have a fundamental misunderstanding of dual threading.

If you want to claim 50 percent then PROVE IT!!! And it better be overall 50 percent, and not certain instances running a specific program.

All I can say is that the amount of optimization a 360 developer will have to do to get your 50 percent by using VMX code must be a lot more effort than throwing an extra SPE into the mix to get even more work done. Hard to argue against SEVEN SPEs running at 3.2 GHz.

Optimizing your ALUs and VMX units to run at the same time is not easy coding at all!

How do you argue against proof that AGEIA Physics API runs complete on CELL but stripped down for X360's CPU??? CELL is SUPERIOR to X360 CPU.
 
It's not something that can be proven over a message board, and you probably know that... I do not currently have access to either Xenon or Cell, so I can speak only of past experience, which I cannot "prove" to you.

So forget it if you want, keep telling people SMT provides a 10-20% performance boost max...

I will provide a quote for you from the IEEE, which happens to say dual threads --> dual execution, though: http://csdl2.computer.org/persagen/.../01/0990toc.xml&DOI=10.1109/IPDPS.2001.924929
Simultaneous Multithreading (SMT) is a technique that permits multiple threads to execute in parallel within a single processor.
 
All I can say is that the amount of optimization a 360 developer will have to do to get your 50 percent by using VMX code must be a lot more effort than throwing an extra SPE into the mix to get even more work done. Hard to argue against SEVEN SPEs running at 3.2 GHz.
The SPEs have, what, 7 million logic transistors? They're powerful when used correctly, but it's no easy task to use them correctly. They're quite different from what developers are used to. They aren't some magical chip with incredible FLOPS output and no tradeoffs for that.
 
Asher said:
The SPEs have, what, 7 million logic transistors? They're powerful when used correctly, but it's no easy task to use them correctly. They're quite different from what developers are used to. They aren't some magical chip with incredible FLOPS output and no tradeoffs for that.

Yes, 7 million for logic, out of a total of 21 million for the SPEs.

Sure, the SPEs will require work to get the most out of them, but in the long run, which the PS3 can afford with its huge following, developers will exploit that power effectively. If that were not the case, then PS2 games would still look as bad as the first-generation games.

Ageia has already proven, they can get superior performance out of CELL over Xbox 360's CPU even this early in the console race. How much more telling will it be as time goes on? If Ageia can dump their full API running on CELL, how much more does that free the PPE for general purpose stuff? If you have the resources to run your specialized stuff really fast, you have more resources for general purpose stuff.
 
Edge said:
Ageia has already proven, they can get superior performance out of CELL over Xbox 360's CPU even this early in the console race.
They haven't actually proven that as far as I'm aware. They've said Cell is good at physics, and there was a slide saying Cell could do things XeCPU couldn't, but that isn't official yet and we haven't any head-to-head comparisons. I don't think one comment that Ageia have since disputed will count as conclusive proof in any court of law :p
 
Edge, you're so far in over your head it's funny. This guy actually worked on the CELL processor, and you're sitting here arguing with him with numbers based on hyper-threaded P4-style multithreading (I'm assuming).

Or do you have some documentation/studies of the effect of SMT on an In-Order PowerPC core?
 
scooby_dooby said:
...This guy actually worked on the CELL processor...

Well he didn't work on Cell per se, he worked on the XLC project overall - so a substantial difference to take into account, though related in some senses. Not to say that he's not well versed in architectural differences of course. And Asher if I'm selling your 'proximity' to Cell short, do feel free to correct me.
 
How does a single SPE compare in general purpose performance with the MIPS core in the EE? Does the MIPS core have branch prediction?
 
scooby_dooby said:
Edge, you're so far in over your head it's funny. This guy actually worked on the CELL processor, and you're sitting here arguing with him with numbers based on hyper-threaded P4-style multithreading (I'm assuming).

Or do you have some documentation/studies of the effect of SMT on an In-Order PowerPC core?

Too funny: the Xbox fan is cheering for a guy making arguments for specialized coding on the X360 CPU, while downplaying the specialized coding on the SPEs. Sorry, but you should not be dragging in the VMX units when bragging that the PPEs are much better at general purpose code.

And Asher goes on and on about how the SPEs are difficult to program, but you think SMT programming is easy???

Scooby_dooby, if you have nothing to contribute to the discussion besides cheerleading your Xbox boy, then I suggest you stay out of it. You clearly lack an understanding of our discussion, based on the fact you're not contributing to it.

Last time I checked this was an open discussion board? What do we have here, intellectual censorship based on your experience and knowledge??? I graduated in computer science, and have done years of programming. Is that enough credentials to have a discussion here?
 