New ITAGAKI interview touches on 360 and PS3 comparison
ps2xboxcube
24-Sep-2005, 08:21
GI: You’ve said before you want to develop for the most powerful console out there. You must be pretty confident in the Xbox 360.
Itagaki: Yes. I hope so. I think Xbox 360 is the best game console on the earth. It’s better than PlayStation 3.
GI: Can I ask you why?
Itagaki: PS3 has too complicated of architecture.
http://www.gameinformer.com/News/Story/200509/N05.0923.1500.24632.htm
Now I am sure some Sony fans will jump on Team Ninja and say their development team is stupid and they should hire smarter programmers because Cell is too complicated for them.
GI: We were very surprised with the online lobby setup for DOA4 with the Halloween theme. Are there multiple themes? What made you decide to do something like that for the online lobby?
Itagaki: You’ll start out with one that is free, but there will be lobbies that you can purchase from the online marketplace. Even if you don’t purchase any more from the online market place – if you’re a casual gamer – of course you’ll be able to play the game online without spending money. The free set of avatars will be ninja themed. You’ll be a lesser ninja. You’ll have to upgrade to get out of this sort of low class, lower set of ninjas.
GI: Is the Halloween one you showed at the event one going to be the one that is the default free lobby?
Itagaki: No, only one will be given away for free. If Ninja is the free one, the Halloween lobby you’ll be charged for.
GI: Will any of the lobbies or avatars be unlockable in the game?
Itagaki: No, we can’t do that. The story of DOA happens on the other side of that little TV screen. The side in front of the TV [ed: the lobby] isn’t part of the DOA world. I don’t know which side we consider is the virtual world. However, since we don’t allow character customization in the game, we are allowing it in the lobbies.
I think this is F'ING ridiculous. Who the hell wants to pay for a stupid lobby and avatar? I would love to pay to customize all of the hot girls like Tina or Ayane. Hell, every other non-online fighting game (TEKKEN 5, VF4, and SC3) has free character customization. I hate that they are passing up such a great opportunity to extend the life of the game by not having new costumes or accessories to download.
I find it very stupid that they waste time on something that few people would want or pay for instead of what every DOA fan has always wanted and would gladly pay for. He already gave us customization in DOAX. He should at least give us the same amount of customization and swimsuits he gave us in Xtreme Beach Volleyball for download. They would make a total shite load of money if they put those swimsuits and accessories up for download.
GI: You’ve always pushed the hardware you’re working on quite a bit. What have you been able to do on the Xbox 360 that you haven’t been able to do on the first Xbox?
Itagaki: At E3 we were one of the few developers to show their game demo on Xbox 360. Although it wasn’t playable. Most of the games were 15 fps. Our E3 trailer we tried getting it to 60 fps, but it ended up turning out to be around 45. That was E3. Now we’ve brought it up to 60. To be more specific, maybe it’s about 55 fps. From now until launch we’ll bring it up to 60. Other developers are now trying to bring their games up to 30fps. That’s a fact. Can you think of any other games that are running at 60 fps? Every Party? (laughs)
......So of course, for the last 8 years, our focus was to improve the graphics quality, and of course without having enough power, that became a challenge. From this point forward, obviously that won’t be the focus, and we’ll be working more on interactivity – things like this lobby – more about the actual game concept and gameplay. That’s where I think my focus will be on.
I think DOA Ultimate was the best looking game ever until Conker Reloaded came out to take the title for best graphics of all three systems for the current gen. But until we get a direct feed DOA4 trailer of the newest build, I think that even though they're not at 60fps, GOW, PGR3 and MGS4 all look better than DOA4 does. The DOA4 character models even still look the same as DOAU and that really sucks.
I hate how he rushes the game because Microsoft is forcing Tecmo to make the 360 launch. He did the same thing to DOA3 and gave the US an incomplete version that he only fixed for Japan and Europe, a fix he never released for the US. If he doesn't give us Hurricane Pack-like downloads to upgrade DOA4 then I probably will not buy the game.
seismologist
24-Sep-2005, 08:25
notice that he didn't say the 360 is the most powerful ;)
Titanio
24-Sep-2005, 09:21
I remember a time when Itagaki wore his team's ability to harness "complicated" hardware like a badge of honour :P I guess he's just not as hardcore anymore ;)
I wonder if Itagaki is a real fa... console loyalist or did MS just throw insane amounts of money at him. He is a bit crazy that's for sure.
mckmas8808
24-Sep-2005, 09:30
I respect his answer, but we can't act like he's completely innocent here. If this guy is literally blowing up X360 dev kits, then why is the PS3 so hard to program for? Last I heard it is easier to program the PS3 right now than the PS2 is 5 years after its debut.
Acert93
24-Sep-2005, 10:17
Not every developer has the same budget, design goals, abilities, development window, background, or preference.
Devs of different walks are each going to find a console that matches their design goals best within the limitations they have. ALL games have limitations of some type. Limitations being a very vast word for both technical issues, real world issues, and game design issues (one game may be limited on System A and not B, but another game may be limited on System B and not A).
And of course there is politics and money.
All this stuff goes both ways. I think it is sufficient to say developers from all perspectives have spoken up on what they like. This is NOT an issue of lazy, one being better than another, etc...
In the context of the above developer, he gave an answer to his reason (which is much more than most devs will say, and further a more relevant point than 99% of the "my console is better than yours" in online forums). This reason is valid for his team and design objectives. This is not true of every developer, so the limitation is not universal. Obviously there are others who agree with him, and obviously there are some who would say, "Not for us!"
So no point being defensive about it. The game development world is HUGE. All kinds of companies and teams.
Apart from the above, the only thing I can get out of this of any significance is that some developers are attracted to one console over the other due to approachability. This is not an issue for most AAA dev teams, but since the majority of the market are the smaller/medium sized fish this is interesting. It fleshes out the dynamic we are seeing in the market to a degree (at least in the context of the games these teams are trying to make... again EVERY project and EVERY team is different).
Ps- I guess his quote is less offensive than some because it is more subjective within the construct of HIS TEAM. It's not a "X is better than Y absolutely"--although we SHOULD dismiss ANY notions of that--but rather his feelings in the context of his team and project.
Ditto if a PS3 dev said the PS3 was best for their project. The problem arises with absolutes or "console X is just too slow, console Y is just 4x as fast in every way" kind of garbage. When we hear Valve Software, id Software, Team Ninja, etc... speak we should understand what they say from their perspective. So far id and TN have done a good job of that, giving reasons and presenting it within the construct of their design philosophies.
Obviously a 400 person Square-Enix dev team working on a new FF has a different cap than a small 30 person team with 18 months to complete a game.
I guess I take ALL these comments with a grain of salt. TN's opinion is just that. Valid for them, but no more, no less.
Xbox fans: Be happy MS made a dev happy. Don't gloat as if this is universally true--because it of course is not.
Sony fans: meh, one dev--they all have preferences so nothing to worry about. But don't bash him either because it makes you look like a defensive system advocate who can only spam and slam and only look at things through the "my console is best at everything" angle.
This ends Acert's public service announcement.
fireshot
24-Sep-2005, 10:20
There are 101 ways to interpret what the man means. Which side of the fence are you sitting on? ;)
Titanio
24-Sep-2005, 10:30
*snip*
Acert, you make lovely points, but given Itagaki's comments in the past about development difficulty on certain platforms, and insinuations he has made about other dev's capability on such platforms, and given his attitude to "difficulty" and "challenge" in general, his comments are quite amusing ;) The understanding you call for is not understanding Itagaki has shown in the past.
aaronspink
24-Sep-2005, 10:30
notice that he didn't say the 360 is the most powerful ;)
A Cat D10 is more powerful than my GTI, but I'd still rather drive my GTI.
Acert93
24-Sep-2005, 11:20
Acert, you make lovely points, but given Itagaki's comments in the past about development difficulty on certain platforms, and insinuations he has made about other dev's capability on such platforms, and given his attitude to "difficulty" and "challenge" in general, his comments are quite amusing ;) The understanding you call for is not understanding Itagaki has shown in the past.
Never played DoA series and am not familiar with the quotes. Do you have any direct quotes in context of an interview?
You may be right in his case (I have no clue as I have not seen them), but without seeing them I have seen devs appear to "flip flop" on issues when that is not their intent. Some developers like the task of difficult hardware; others like the task of difficult software designs. Those two can overlap or be unrelated... if that makes sense.
He is an Xbox dev after all, he may have liked the "challenge" of making difficult software (with the aid of a console with a balance of power/ease of use). But not knowing much about him I cannot say.
Money changes people too. Not only money from companies, but also budgets. So is this amusing? Yeah. But circumstances of all kinds can put the words we say in tension even when the underlying motivation is the same, or the situation changes and so our stance changes with it. That does not mean our opinion of the former circumstance has changed.
He could be totally bought, but he is not the first to express views on either side of the fence. Easier to see things my way, even if there will be blatant examples of PR, being bought, etc. Let the games give a verdict :D
Welcome to the 60fps club!
Now it's blast from the past...
http://forum.teamxbox.com/showthread.php?t=328909
Titanio
24-Sep-2005, 12:48
Never played DoA series and am not familiar with the quotes. Do you have any direct quotes in context of an interview?
He's talked a lot in the past about power, the importance of power over all else, and about how complex architectures are "easy" for him and how that doesn't matter. Here's one example I was able to find that neatly contrasts with his comments above:
"XBN: Is PS2 harder to develop for than Xbox?
Itagaki: No, not really. PS2 development was easy for us. Do you know something? We converted DOA2 to PS2 in just three months. If your programmers are good enough, PS2 development shouldn’t be anything to complain about."
He has made a point in the past of boasting about his ability to harness machines that others think are difficult and complicated, and has implied less than kind things about other developers who are having difficulty - so when he makes comments like this about PS3, his past attitude should come back to haunt him.
AlStrong
24-Sep-2005, 13:02
Maybe he's just getting "old" like Carmack. :wink:
Mordecaii
24-Sep-2005, 13:17
I also seem to remember him stating that he would always develop for the most powerful console no matter what... Of course we don't know which one that is or even if it will be a clear cut difference, but if it's found out later that PS3 is for some reason more powerful he's going to have to go back on one of his previous stances and either develop for the PS3 or ignore the fact he's not developing for the most powerful platform. :) Of course this is a hypothetical what-if situation just to point out a previous statement from Itagaki.
ecliptic
24-Sep-2005, 13:35
Considering he has both consoles' development kits, I am sure he knows enough about the two to form his opinion.
fireshot
24-Sep-2005, 13:43
Welcome to the 60fps club!
Now it's blast from the past...
http://forum.teamxbox.com/showthread.php?t=328909
What?
Titanio
24-Sep-2005, 13:44
Considering he has both consoles' development kits, I am sure he knows enough about the two to form his opinion.
No one's questioning his opinion. I think everyone agrees, in fact, that PS3 is more difficult to develop for. Just pointing out that his complaint about that is at odds with his previous opinion on such matters (an opinion he was quite forward and "aggressive" with).
I think it is and was a foregone conclusion that Team Ninja will develop only for X360, and that has little to do with technical matters. It's just funny to see interviewers bring such issues up, and now to see his response, and how it contrasts with his stance on these issues before.
Shifty Geezer
24-Sep-2005, 13:47
He has made a point in the past of boasting about his ability to harness machines that others think are difficult and complicated, and has implied less than kind things about other developers who are having difficulty - so when he makes comments like this about PS3, his past attitude should come back to haunt him.
OR...PS3 is sooo hard to develop for even the glorious Itagaki can't manage it!
Shifty Geezer
24-Sep-2005, 13:48
Maybe he's just getting "old" like Carmack. :wink:
And Snake
Titanio
24-Sep-2005, 13:49
OR...PS3 is sooo hard to develop for even the glorious Itagaki can't manage it!
I can scarcely fathom that! Itagaki can't manage something!! No, never! :P
macabre
24-Sep-2005, 14:02
Considering he has both consoles' development kits, I am sure he knows enough about the two to form his opinion.
Where does he say he has both ?
fireshot
24-Sep-2005, 14:05
"XBN: Is PS2 harder to develop for than Xbox?
Itagaki: No, not really. PS2 development was easy for us. Do you know something? We converted DOA2 to PS2 in just three months. If your programmers are good enough, PS2 development shouldn’t be anything to complain about."
He has made a point in the past of boasting about his ability to harness machines that others think are difficult and complicated, and has implied less than kind things about other developers who are having difficulty - so when he makes comments like this about PS3, his past attitude should come back to haunt him.
GI: You’ve said before you want to develop for the most powerful console out there. You must be pretty confident in the Xbox 360.
Itagaki: Yes. I hope so. I think Xbox 360 is the best game console on the earth. It’s better than PlayStation 3.
GI: Can I ask you why?
Itagaki: PS3 has too complicated of architecture.
Right? Right.
I hate jumping the fence, but we can also interpret it as Itagaki thinking PS3 is needlessly complicated (costly) for a system in the same ballpark as X3.. he didn't exactly say he couldn't handle the silver beast..no?
Back up to the fence now.
darkblu
24-Sep-2005, 14:19
I hate jumping the fence, but we can also interpret it as Itagaki thinking PS3 is needlessly complicated (costly) for a system in the same ballpark as X3.. he didn't exactly say he couldn't handle the silver beast..no?
ps2 was "needlessly" complicated compared to the xbox and that didn't seem to bother him. heck, the ps2 was ridiculously complicated compared to the dc and still i don't remember tecmo sticking to the dc either*.
i think it's about time people on this board stop implying technical merits to PR and marketing babble, even when that comes from devs. yes, i know not feeding the little f*boys in every one of us takes some effort but nevertheless we can at least try to pick more serious material to feed them on : )
* and in general, commercial game devs don't pick their platform of choice based on personal preference of architecture, power, ease of programming or color of the box. the criteria are completely different, and if they happen to match their personal preferences then that's very nice. but that's about all there's to it.
fireshot
24-Sep-2005, 14:41
ps2 was "needlessly" complicated compared to the xbox and that didn't seem to bother him. heck, the ps2 was ridiculously complicated compared to the dc and still i don't remember tecmo sticking to the dc either*.
i think it's about time people on this board stop implying technical merits to PR and marketing babble, even when that comes from devs. yes, i know not feeding the little f*boys in every one of us takes some effort but nevertheless we can at least try to pick more serious material to feed them on : )
* and in general, commercial game devs don't pick their platform of choice based on personal preference of architecture, power, ease of programming or color of the box. the criteria are completely different, and if they happen to match their personal preferences then that's very nice. but that's about all there's to it.
Please don't quote me.
I only know too well about f**boys feeding off what Itagaki said. see my first reply to the topic. :)
Johnny Awesome
24-Sep-2005, 15:46
Itagaki-san is full of shit. It's all PR-speak. It may or may not be true that PS3 is too much hassle to bother with, but it won't be because Itagaki-san said so. :)
blakjedi
24-Sep-2005, 15:57
Dag:roll: Ya'll will damn anyone who chooses the power, ease, or architecture of X360 over PS3... Sometimes this place sux. If you've never cared for Itagaki's games anyway why post?
Thegameman
24-Sep-2005, 16:13
He's talked a lot in the past about power, the importance of power over all else, and about how complex architectures are "easy" for him and how that doesn't matter. Here's one example I was able to find that neatly contrasts with his comments above:
"XBN: Is PS2 harder to develop for than Xbox?
Itagaki: No, not really. PS2 development was easy for us. Do you know something? We converted DOA2 to PS2 in just three months. If your programmers are good enough, PS2 development shouldn’t be anything to complain about."
He has made a point in the past of boasting about his ability to harness machines that others think are difficult and complicated, and has implied less than kind things about other developers who are having difficulty - so when he makes comments like this about PS3, his past attitude should come back to haunt him.
In fact they left the PS2 after DOA2 Hardcore because, by their own word, they had maxed out the PS2.
Yet there are games on PS2 that put DOA2 Hardcore to shame. He hates Sony too much, but they like powerful hardware, and in their history you can see that they have always supported the better hardware: they ported DOA1 to the PS because it was the better hardware, they ported DOA2 because the PS2 was the best hardware, and they didn't port DOA3 because the Xbox was the best hardware.
I know that if the PS3 really ends up on top, Team Ninja will fly to the PS3 as well.
Titanio
24-Sep-2005, 16:22
In fact they left the PS2 after DOA2 Hardcore because, by their own word, they had maxed out the PS2.
Yet there are games on PS2 that put DOA2 Hardcore to shame. He hates Sony too much, but they like powerful hardware, and in their history you can see that they have always supported the better hardware: they ported DOA1 to the PS because it was the better hardware, they ported DOA2 because the PS2 was the best hardware, and they didn't port DOA3 because the Xbox was the best hardware.
I know that if the PS3 really ends up on top, Team Ninja will fly to the PS3 as well.
I don't know..they have a very strong business relationship with MS now, something I don't think they had on quite the same level with Sony's competitors in the past. And MS possibly represents a more viable alternative for them than others in the past. So I'm not sure if they'll feel the need to port to Sony systems now. Their animosity with Sony stretches back before Xbox ever existed, IIRC, and MS represents a viable ticket away from Sony for them, I think.
Now I am sure some Sony fans will jump on Team Ninja and say their development team is stupid and they should hire smarter programmers because Cell is too complicated for them.
Itagaki-san is full of shit.
:lol:
I remember a time when Itagaki wore his team's ability to harness "complicated" hardware like a badge of honour :P I guess he's just not as hardcore anymore ;)
Actually not the most complicated but the most powerful (example xbox).. So basically what I think he is saying right now is they are about the same in power but the 360 is easier to program for.... But that has been known all along...
Titanio
24-Sep-2005, 17:01
Actually not the most complicated but the most powerful (example xbox).. So basically what I think he is saying right now is they are about the same in power but the 360 is easier to program for.... But that has been known all along...
The reason he gives for preferring X360 is that PS3 is "too complicated". He's previously questioned the capability of other devs who complained about other, difficult, systems and boasted that his team found them "easy". Taking power comparisons out of this totally, it's simply funny to hear Itagaki speak like this now.
But again, this really has nothing to do with technology.
Actually not the most complicated but the most powerful (example xbox).. So basically what I think he is saying right now is they are about the same in power but the 360 is easier to program for.... But that has been known all along...
No, he sounded like he's been sold out. With all the development tools that the PS3 has today there's no way it's impossible or as complex as the EE+GS. Moreover the Xbox 360 is also quite new. It's just too early to take sides.
I don't respect a person who uses sex elements to sell his games.
The reason he gives for preferring X360 is that PS3 is "too complicated". He's previously questioned the capability of other devs who complained about other, difficult, systems and boasted that his team found them "easy". Taking power comparisons out of this totally, it's simply funny to hear Itagaki speak like this now.
But again, this really has nothing to do with technology.
True
Yes I do remember him saying that and he was talking about the developers that were complaining about the difficulty to program for the next gen game consoles.
But I was only pointing out he always went for the most power no matter what.. Dreamcast, xbox.
Just so you know I have no bias towards any system.. I do think in terms of power both systems will be pretty close this time around.. But I think the specs have been debated enough so I won't comment there.....
Guden Oden
24-Sep-2005, 17:19
So basically what I think he is saying right now is they are about the same in power but the 360 is easier to program for....
Actually he isn't saying that at all. He's just being a good dog and pleasing his master in a PR interview. Team ninja/tecmo has long had its head up microsoft's butt, of course he's going to say stuff like this.
But that has been known all along...
It's long been speculated. Not the same thing.
I still don't understand all the "damage control" going on here. Two-three devs say 360 is more powerful, four-five say PS3 is more powerful. Doesn't mean you need to attack those two-three devs.
I still don't understand all the "damage control" going on here.
And how many of them are actual programmers with their hands into the hardware producing code?
And how many of them are actual programmers with their hands into the hardware producing code?
I don't know, but I know one thing: I won't bash anyone just because he doesn't agree with me.
Titanio
24-Sep-2005, 17:34
Two-three devs say 360 is more powerful, four-five say PS3 is more powerful.
If you're keeping score, who are they?
Actually he isn't saying that at all. He's just being a good dog and pleasing his master in a PR interview. Team ninja/tecmo has long had its head up microsoft's butt, of course he's going to say stuff like this.
It's long been speculated. Not the same thing.
Well then, what do you think of Kojima's comment at the end of the MGS TGS video?
MS ******s are saying the same thing about him being a paid minion of Sony..?
Personally I think Itagaki is too cocky and outspoken to have anyone tell him what to do or say....
If you're keeping score, who are they?
I am not, I'm sure you are. ;)
The point I was making was that I've read more devs making comments that the PS3 has an advantage in processing power; if one dev says that isn't true, it doesn't necessarily warrant an attack on them.
Titanio
24-Sep-2005, 18:12
I am not, I'm sure you are. ;)
I was just curious because I don't think we've had that many comment on the matter. And I can't remember a third party dev outright saying X360 was more powerful than PS3 for example.
The point I was making was that I've read more devs making comments that the PS3 has an advantage in processing power; if one dev says that isn't true, it doesn't necessarily warrant an attack on them.
I don't think anyone is ragging on Itagaki for that, because that is not what he has said. I'm pointing out a shift in his attitude that is rather humorous, and I think it's fair to do so given his arrogance in the past.
I don't think anyone is ragging on Itagaki for that, because that is not what he has said. I'm pointing out a shift in his attitude that is rather humorous, and I think it's fair to do so given his arrogance in the past.
I wasnt referring to you about the "attacking" part. ;)
deathstar121
24-Sep-2005, 19:10
The title of the thread is misleading. All he said was PS3 is a complicated system; he didn't say which was better, and I can bet once he gets comfortable with PS3 he will prefer it over 360. A lot of the Xbox fan-nation don't want to see NG2 on PS3 but it will happen, unless MS pays for the exclusive.
pegisys
24-Sep-2005, 19:30
The title of the thread is misleading. All he said was PS3 is a complicated system; he didn't say which was better, and I can bet once he gets comfortable with PS3 he will prefer it over 360. A lot of the Xbox fan-nation don't want to see NG2 on PS3 but it will happen, unless MS pays for the exclusive.
and you know this how, there hasn't been anything that I have seen that makes me think the PS3 is going to be leagues ahead of the xbox360
in the interview that came on G4 a while back, he said the reason he left the ps2 to develop on the xbox was ease of programming and predefined libraries
and you know this how, there hasn't been anything that I have seen that makes me think the PS3 is going to be leagues ahead of the xbox360
in the interview that came on G4 a while back, he said the reason he left the ps2 to develop on the xbox was ease of programming and predefined libraries
Still too early to say so.
The PS3 isn't PS2. Programming and middleware tools are now more widely available.
pegisys
24-Sep-2005, 19:58
Still too early to say so.
The PS3 isn't PS2. Programming and middleware tools are now more widely available.
but still not on the level with the xbox360 and as far as development environments go I don't think sony will catch up
with the cost of making games going up and the fact he has no real competition for a 3d fighter or NG type games on the xbox I would think it would be the best way to go
Black Dragon37
24-Sep-2005, 19:59
I don't know..they have a very strong business relationship with MS now, something I don't think they had on quite the same level with Sony's competitors in the past. And MS possibly represents a more viable alternative for them than others in the past. So I'm not sure if they'll feel the need to port to Sony systems now. Their animosity with Sony stretches back before Xbox ever existed, IIRC, and MS represents a viable ticket away from Sony for them, I think.
How far? And why?
in the interview that came on G4 a while back, he said the reason he left the ps2 to develop on the xbox was ease of programming and predefined libraries
BS, the reason he left PS2 and went on to develop for Xbox was because there were no competing fighting game franchises on Xbox. DOA was never a big deal before the move, but now it has found an audience on it, and I don't expect him to move from there.
And come on, he deserves being laughed at. After all, he used to be such a "beast tamer", and now PS3 is far too complicated for him? LOL.
compres
24-Sep-2005, 20:29
Still too early to say so.
The PS3 isn't PS2. Programming and middleware tools are now more widely available.
You don't need to be a psychic or a genius to see that, even with improved tools compared to the ps2, the increased parallelism and the new SPEs in the ps3 make it very complicated from a software engineer's standpoint.
I don't know how many of the ps3 fans here are software engineers, but trust me, coding on an unnecessarily complicated platform for no benefit at all in performance (compared to the other options) is neither desirable nor worth cheering for.
Last time around, the ps2 already had a big market share, for many reasons, such as earlier release, included dvd (which I remember was a big part of it, because sometimes sony sold more consoles than software), sony's good brand recognition, among others. For this generation, they are 6+ months behind the release date of their competitor, and M$ is not sega when we talk about cash or global image, so we will have to wait and see. Market share and sony's brand recognition ensured a lot of developers on the ps2; good hardware design did not (in MNSHO).
All of this post is more opinion than fact, but saying that the ps3 is a fine developer platform just because it's less bad than ps2 is wrong. Less bad is not good, especially when the competition has arguably the best developer tools on any 3D platform to date at launch.
Inane_Dork
24-Sep-2005, 20:31
If you're keeping score, who are they?
I think what you meant to say was: if you're keeping score, get a life.
Titanio
24-Sep-2005, 20:56
How far? And why?
I don't know why, but I do remember reading an interview from early PS2/DC days at least that mentioned less-than-happy relations between Tecmo and Sony.
I think what you meant to say was: if you're keeping score, get a life.
Haha, quite.
You don't need to be a psychic or a genius to see that, even with improved tools compared to the ps2, the increased parallelism and the new SPEs in the ps3 make it very complicated from a software engineer's standpoint.
In absolute terms the system is probably more complicated than even PS2, at least in terms of the CPU, but for very different reasons. PS2 was complicated because of very nuts-and-bolts, low level reasons. PS3 has some of that still on the CPU side, but the much bigger challenge is splitting your program up in ways that make sense to take advantage of the parallelism.
That's not a challenge unique to PS3 however - X360 and PCs also present that fundamental challenge. PS3 is just unique in terms of scale and the model used. Relatively speaking I doubt PS3 is as hard to deal with vs X360 as PS2 was vs Xbox. But I suppose it would be a matter of opinion for those dealing with the hardware.
In fact they left the PS2 after DOA2 Hardcore because, by their own word, they had maxed out the PS2.
Yet there are games on PS2 that put DOA2 Hardcore to shame. He hates Sony too much, but they like powerful hardware, and in their history you can see that they have always supported the better hardware: they ported DOA1 to the PS because it was the better hardware, they ported DOA2 because the PS2 was the best hardware, and they didn't port DOA3 because the Xbox was the best hardware.
I know that if the PS3 really ends up on top, Team Ninja will fly to the PS3 as well.
the best hardware excuse my ass
it's all about the $$$$$$
You don't need to be a psychic or a genius to see that, even with improved tools compared to the PS2, the increased parallelism and the new SPEs in the PS3 make it very complicated from a software engineer's standpoint.
So does the Xbox 360. Isn't parallelism evident in Xenon's 3-core/6-thread CPU as well? Don't forget that it comes with the new unified shader architecture, which is flexible but whose efficiency is questionable.
All of these posts are more opinion than fact, but saying that the PS3 is a fine developer platform just because it's less bad than the PS2 is wrong. Less bad is not good, especially when the competition has arguably the best developer tools on any 3D platform to date at launch.
I never said that PS3 is a fine developer platform but the PS3>PS2 anytime.
Black Dragon37
24-Sep-2005, 21:31
Look, Itagaki doesn't like Sony. Plain and simple. He said he's gonna make a game on the DS just because his daughter likes the handheld console.
blakjedi
24-Sep-2005, 21:52
the best hardware excuse my ass
it's all about the $$$$$$
This quote very neatly sums up PS3 fans' views vis-a-vis anything pro-X360.
Quote:
Originally Posted by Titanio
If you're keeping score, who are they?
I think what you meant to say was: if you're keeping score, get a life.
I will definitely say that it's more Sony fans that are keeping "score", as it were, as to "who" says the PS3 is the most powerful platform. In PS2-Xbox days, it was CLEAR. This generation it's not... and anyone willing to say that gets ridiculed.
Just because Cell has double the FP power of XeCPU, doesn't mean PS3 the system is more powerful than x360.
In PS2-Xbox days, it was CLEAR. This generation it's not... and anyone willing to say that gets ridiculed.
Just because Cell has double the FP power of XeCPU, doesn't mean PS3 the system is more powerful than x360.
hmm..
Shifty Geezer
24-Sep-2005, 22:35
You don't need to be a psychic or a genius to see that, even with improved tools compared to the PS2, the increased parallelism and the new SPEs in the PS3 make it very complicated from a software engineer's standpoint.
As Titanio says, parallelism isn't the issue. XB360 has that same issue. Plus the SPEs can be programmed in C/C++, unlike PS2's VUs that were assembler only. PS3's in-order, as is XB360, as is PS2, so in that instance PS3 is no worse off than PS2 or XB360 either.
The only fundamental 'difficulty' of PS3 over XB360 is working in 256kb LS's and managing data structures to fit that.
I don't know how many of the PS3 fans here are software engineers, but trust me, coding on an unnecessarily complicated platform for no benefit at all in performance (compared to the other options) is not desirable nor worth cheering for.
You're another of these 'STI wasted buckets of money on a useless and overly difficult design' subscribers I see! We've no real world gaming examples, but we HAVE seen the advantages of Cell demos over other processors, and we HAVE seen where the SPE's architecture has benefits.
As for PS3 being more complex vs. XB360, the only real difference is SPEs and managing local storage. And by all accounts that's an issue XeCPU shares anyway. The recommendation from MS in its developer paper was that a degree of hand-tuned cache management would be beneficial to some code. Having to manage 1MB between six threads offers its own difficulties.
To me, both sets of hardware are pretty complex to master. Both need parallelism and restructuring of data into float jobs where possible, as both CPUs were designed to be strong on floating point maths. Both provide well known APIs for graphics. Both provide direct CPU>GPU communications. Oh, PS3 has NUMA, which is another thing to 'worry' about. No more than a typical console though.
I'd like to know what you think is so very difficult about PS3 versus XB360, 'coz I'm not seeing it, other than the SPE coding model which is something any capable coder can learn to work with without too much bother I'd have thought. Especially an (allegedly) self-proclaimed coding God like Itagaki ;)
pegisys
24-Sep-2005, 23:55
You're another of these 'STI wasted buckets of money on a useless and overly difficult design' subscribers I see! We've no real world gaming examples, but we HAVE seen the advantages of Cell demos over other processors, and we HAVE seen where the SPE's architecture has benefits.
What exactly have we seen that puts it so far ahead of anything else at doing games? There's barely been anything running in realtime, and the stuff that we have seen can be done on the GPU. I'm not saying the Cell doesn't have its strong points, but I still haven't seen anything that makes me think it was worth the money to put it in a game machine.
msia2k75
24-Sep-2005, 23:57
What exactly have we seen that puts it so far ahead of anything else at doing games? There's barely been anything running in realtime, and the stuff that we have seen can be done on the GPU. I'm not saying the Cell doesn't have its strong points, but I still haven't seen anything that makes me think it was worth the money to put it in a game machine.
And what about waiting just...
... a little huh?
Guden Oden
25-Sep-2005, 00:20
Look, Itagaki doesn't like Sony. Plain and simple.
It's not that simple at all. Sony's the biggest market in the console business, particularly overwhelmingly so in Japan. You don't say no to all of that without figuring you'll make more money overall some other way - such as being paid by the bucketload by MS to go exclusive.
What someone LIKES doesn't count when the bottom line is how much cash you can pull in by making games for platforms X, Y, Z. It's called "capitalism", and fanb0yism certainly doesn't play a factor there, much to the chagrin of some tecmo defenders I'm sure.
Personally the only game team ninja's ever made that I'd be interested in playing on either x360 or PS3 is ninja gaiden, all their other stuff has been sexist misogynist CRAP that the world might just as well be without.
expletive
25-Sep-2005, 00:30
As Titanio says, parallelism isn't the issue. XB360 has that same issue. Plus the SPEs can be programmed in C/C++, unlike PS2's VUs that were assembler only. PS3's in-order, as is XB360, as is PS2, so in that instance PS3 is no worse off than PS2 or XB360 either.
The only fundamental 'difficulty' of PS3 over XB360 is working in 256kb LS's and managing data structures to fit that.
You're another of these 'STI wasted buckets of money on a useless and overly difficult design' subscribers I see! We've no real world gaming examples, but we HAVE seen the advantages of Cell demos over other processors, and we HAVE seen where the SPE's architecture has benefits.
Regardless of the actual hardware benefits, a lot of developers who prefer the 360 have commented on the overall dev environment and the tools they can use: debuggers, performance tools, etc. Plus they are all tools that developers who develop for the PC are familiar with already, and those who aren't claim they are easy to use. (And after seeing the PC version of the 360 controller, it's obvious this is a HUGE part of MS' mid to long-term strategy: one development budget, two platforms.)
That said, parallelism with 3 identical cores and 6 identical threads should be a bit easier than a PPE and SPE design where each has different needs and potentially different roles, shouldn't it? (I have to credit that thought to Carmack though, as he stated it in his Quakecon address.)
What we have not seen, however, is whether the Cell will provide an advantage in the closed-box system known as the PS3, and I think that's what is really on trial in this thread.
J
darkblu
25-Sep-2005, 01:00
That said, parallelism with 3 identical cores and 6 identical threads should be a bit easier than a PPE and SPE design where each has different needs and potentially different roles, shouldn't it? (I have to credit that thought to Carmack though, as he stated it in his Quakecon address.)
i'm not aware of the exact words of Carmack but no, 'parallelism with 3 identical cores and 6 identical threads' would not necessarily be easier than the ps3 model, for a number of different reasons. but of course that's yet to be determined in practice.
Josh378
25-Sep-2005, 01:11
I'm not a fan of his fighting games...but ITAGAKI and his team can create a great action/fighter game....(although you'll be seeing a lot of "patches" in the future).
I really want him to step away from just eye-candy and really put a focus on a complex fighting engine...I would love to see the speed of DoA4 with Virtua Fighter techniques (dreams). But seriously, look at the fighting games on Xbox....only 2 really stand out (DOA and Soul Calibur).
On PS2 there are about 5-15 great fighting games. He made the right choice of system with Xbox vs PS2...Xbox will just give him better sales and recognition. On the other hand...the Action/Adventure genre is wide open for him to take...it does have hard competition like DMC and GoW...but not as bad as the fighter genre. Ninja Gaiden would stand out more on PS3 than DoA would...that's why I say...let his fighters be Xbox 360 exclusive and then make NG Black on Xbox 360 and port it to PS3 with downloadable patches/updates......
Now, i'm off to finish Metroid Prime 2...
MechanizedDeath
25-Sep-2005, 01:42
Itagaki speaks his mind, period. That quote said nothing about power for or against either machine. This thread is unnecessarily long. PEACE.
zRifle1z
25-Sep-2005, 02:13
Itagaki speaks his mind, period. That quote said nothing about power for or against either machine. This thread is unnecessarily long. PEACE.
This thread unfortunately overlooks this statement as well:
"GI: Visually, what do you feel that you could accomplish with the Xbox 360 that you couldn’t with the Xbox? I’ve played Ultimate and DOA 3 on an HDTV and it already looked fantastic.
Itagaki: What you see is what you get. I never felt that I had enough machine power. The more power I get the more I want to do. Even with Xbox 360 there’s never enough. I must tell you an interesting story since you came all the way to Tokyo for me. Of course, as a developer we welcome more power. From a consumer stand point, they must feel that there’s already enough power. Back in 1997 – 8 years ago – I made games. I look back and I surprise myself. I’m impressed with what I see and what I did 8 years ago. Eight years from today I don’t think I’m going to look back at today and feel as impressed as I feel about product from 8 years ago. So of course, for the last 8 years, our focus was to improve the graphics quality, and of course without having enough power, that became a challenge. From this point forward, obviously that won’t be the focus, and we’ll be working more on interactivity – things like this lobby – more about the actual game concept and gameplay. That’s where I think my focus will be on.
expletive
25-Sep-2005, 03:11
i'm not aware of the exact words of Carmack but no, 'parallelism with 3 identical cores and 6 identical threads' would not necessarily be easier than the ps3 model, for a number of different reasons. but of course that's yet to be determined in practice.
"So the returns on multi-core are going to be initially disappointing, for developers or for what people get out of it. There are decisions that the hardware makers can choose on here that make it easier or harder. And this is a useful comparison between the xbox 360 and what we’ll have on the PC spaces and what we’ve got on the PS3.
The xbox 360 has an architecture where you’ve essentially got three processors and they’re all running from the same memory pool and they’re synchronized and cache coherent and you can just spawn off another thread right in your program and have it go do some work.
Now that’s kind of the best case and it’s still really difficult to actually get this to turn into faster performance or even getting more stuff done in a game title. "
There was another interview where he specifically mentions identical cores but I can't seem to track it down.
I don't know of another developer at his level that has said otherwise. I'd be interested to see any who disagree with this statement because it seems to make perfect sense.
J
dantruon
25-Sep-2005, 04:42
Just because Cell has double the FP power of XeCPU, doesn't mean PS3 the system is more powerful than x360.
ahhahah, you being sarcastic right?
I think the reason that Tecmo doesn't want to develop DOA for the PlayStation platform in general is that they would have to compete with Tekken and Virtua Fighter, and I don't think they can beat those two games in terms of sales. If they develop for MS, they have something like a monopoly on that platform. So in my opinion that is one of the main reasons we will never see the DOA series on the PS3, but Ninja Gaiden might have a chance though.
darkblu
25-Sep-2005, 05:11
"The xbox 360 has an architecture where you’ve essentially got three processors and they’re all running from the same memory pool and they’re synchronized and cache coherent and you can just spawn off another thread right in your program and have it go do some work.
Now that’s kind of the best case and it’s still really difficult to actually get this to turn into faster performance or even getting more stuff done in a game title. "
I don't know of another developer at his level that has said otherwise. I'd be interested to see any who disagree with this statement because it seems to make perfect sense.
regardless of how true his statement is in itself, the question is: what do _you_ read into his statement.
the first paragraph of Carmack's statement basically says: 'it is very easy to spawn a thread and get it running on the 360 - just as easy as it is on your grandma's smp pc'
to which everybody can only nod in agreement, as there's nothing to misunderstand here and that message gets clearly and correctly propagated. now, getting a thread up and running and actually getting efficient parallelism are two entirely different things, as anybody who has ever tackled a single parallelism problem could tell you. so let's see what Carmack says further in his second paragraph.. he says exactly this - 'regardless of how easy it is to tinker with threads (in your grandma's smp way) this still grants you nothing in terms of effective parallelism'.
ok, now that we cleared up the matter with Carmack's statement we can return to the original topic - how much easier it is to achieve _efficient_parallelism_ on the 360 over the cell. and now it's your turn to step in and actually build your argument.
lol funny quote I found on GA.
"Does anybody else find it slightly ironic that Itagaki is complaining about developing PS3 games because its too hard and yet he rags on people who find Ninja Gaiden too hard and calls them Ninja Dogs? Who's the ninja dog now Itagaki?"
ecliptic
25-Sep-2005, 06:11
lol funny quote I found on GA.
"Does anybody else find it slightly ironic that Itagaki is complaining about developing PS3 games because its too hard and yet he rags on people who find Ninja Gaiden too hard and calls them Ninja Dogs? Who's the ninja dog now Itagaki?"
That is supposed to be funny?
The idiot is comparing developing a game to playing a game. Playing a game doesn't cost an extra $10 million because the game is harder than another game.
liverkick
25-Sep-2005, 06:39
That is supposed to be funny?
The idiot is comparing developing a game to playing a game. Playing a game doesn't cost an extra $10 million because the game is harder than another game.
Yes it's funny. Because it's a joke. Jokes are meant to be funny and taken in jest. Not everything is a veiled indictment against your favorite brand. Calm down.
Accert, I agree with your first post; you made a lot of good points.
I just hope that TN really polishes up their character detail. I agree with a previous poster that the characters (at this point so far) are not a huge departure from what we got in DOAU. Granted it's a first generation launch title etc., but honestly I did expect a little more from TN, especially given what they produced for the Xbox 1 launch. A little rough, yeah, but impressive for a first gen title on a new system.
Well it's at least near 60fps in 720p (though its animation and gameplay look almost the same as in the previous generation) so if it's that hard to get it on Xbox 360 then Itagaki shouldn't be blamed more than other devs shooting for 30fps...
http://www.gametrailers.com/player.php?id=7796&type=wmv
http://www.gametrailers.com/player.php?id=7797&type=wmv (HD)
ihamoitc2005
25-Sep-2005, 08:46
There was another interview where he specifically mentions identical cores but I can't seem to track it down.
I don't know what he said but 2 identical cores like SLI GPUs: 2x processing power, but real world much less, but with joint L2 cache can be more efficient than GPUs no? I think he says it's not easy to be efficient with multiple threads because of small cache.
SPEs very good at dynamic load balancing if data model is appropriate so full performance possible from all SPEs, do not have to write to any one SPE, only write SPE appropriate code.
Sony very smart in processor design but stupid in assuming developer adoption of recommended programming model. Sony said PS2 performance attainable if followed recommended model and in the end it was true. Same with CELL. Maybe we must wait 4 years before a lot of good CELL using games. Actually maybe much less this time since Sony is making better SDK.
Itagaki is making fun of other x360 developers no? Funny since DOA2 looks bad next to Virtua Fighter 4 & Tekken 5. Maybe PS2 too "complicated" for Itagaki. Not too complicated for Namco no?
aaaaa00
25-Sep-2005, 09:08
ok, now that we cleared up the matter with Carmack's statement we can return to the original topic - how much easier it is to achieve _efficient_parallelism_ on the 360 over the cell. and now it's your turn to step in and actually build your argument.
It's generally easier to write fast code when it's easier to write correct code, since fast but incorrect code is not typically very useful. :twisted:
Correct multithreaded code is much easier to write when you have N identical CPUs all sharing identical access to the same main memory, with a well-ordered memory model and cache coherency guaranteed by the hardware. (Which is pretty much x86 SMP in a nutshell in fact.)
Such an architecture is fairly well understood today, and any college concurrent programming textbook will teach you the basics of synchronization objects and have parallel algorithms that work correctly and reasonably well on an SMP.
Each step you take away from such an architecture introduces stuff that makes it more complicated just to ensure code correctness, never mind performance.
The point Carmack is making is that xbox 360 is already pretty much the best case scenario for multithreaded architectures -- but even there, ensuring code correctness is going to be hard to do before you even start to think about making the performance better.
And if you look at his whole argument, including his example, you can see that this was the point that he was making.
The xbox 360 has an architecture where you’ve essentially got three processors and they’re all running from the same memory pool and they’re synchronized and cache coherent and you can just spawn off another thread right in your program and have it go do some work.
Now that’s kind of the best case and it’s still really difficult to actually get this to turn into faster performance or even getting more stuff done in a game title.
The obvious architecture that you wind up doing is you try to split off the renderer into another thread. Quake 3 supported dual processor acceleration like this off and on throughout the various versions.
It’s actually a pretty good case in point there, where when we released it, certainly on my test system, you could run and get maybe a 40% speed up in some cases, running in dual processor mode, but through no changing of the code on our part, just in differences as video card drivers revved and systems changed and people moved to different OS revs, that dual processor acceleration came and went, came and went multiple times.
At one point we went to go back and try to get it to work, and we could only make it work on one system. We had no idea what was even the difference between these two systems. It worked on one and not on the other. A lot of that is operating system and driver related issues which will be better on the console, but it does still highlight the point that parallel programming, when you do it like this, is more difficult.
Anything that makes the game development process more difficult is not a terribly good thing.
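As a rough illustration of the "split the renderer off into another thread" idea from the quote above, here is a minimal Python sketch. It is not how Quake 3 did it; `simulate`, `render`, and `run_game` are hypothetical names invented for the example, and the "renderer" just builds strings. The bounded queue keeps the game thread from running far ahead of the render thread.

```python
import queue
import threading


def simulate(frame):
    """Hypothetical game update: returns the state to draw for this frame."""
    return {"frame": frame, "player_x": frame * 2}


def render(state):
    """Hypothetical renderer: returns a string instead of drawing anything."""
    return f"frame {state['frame']}: player at x={state['player_x']}"


def render_thread(states, drawn):
    # Consume frame states until the sentinel (None) arrives.
    while True:
        state = states.get()
        if state is None:
            break
        drawn.append(render(state))


def run_game(num_frames):
    # Bounded queue: put() blocks when the renderer falls behind.
    states = queue.Queue(maxsize=1)
    drawn = []
    t = threading.Thread(target=render_thread, args=(states, drawn))
    t.start()
    for frame in range(num_frames):
        states.put(simulate(frame))
    states.put(None)  # tell the render thread to stop
    t.join()
    return drawn


if __name__ == "__main__":
    for line in run_game(3):
        print(line)
```

Even in this toy version the two threads only overlap when both have work to do, which is roughly why the speed-up from this kind of split can come and go.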
aaronspink
25-Sep-2005, 09:23
As Titanio says, parallelism isn't the issue. XB360 has that same issue. Plus the SPEs can be programmed in C/C++, unlike PS2's VUs that were assembler only. PS3's in-order, as is XB360, as is PS2, so in that instance PS3 is no worse off than PS2 or XB360 either.
C/C++ could be used to program the VUs as well, it's called intrinsics, and anyone using C/C++ for the SPUs and getting any performance will be writing in a version of C just above macro assembler.
The only fundamental 'difficulty' of PS3 over XB360 is working in 256kb LS's and managing data structures to fit that.
and the lack of direct memory access, etc.
I'd like to know what you think is so very difficult about PS3 versus XB360, 'coz I'm not seeing it, other than the SPE coding model which is something any capable coder can learn to work with without too much bother I'd have thought. Especially an (allegedly) self-proclaimed coding God like Itagaki ;)
Um, I think you are underestimating the complexity of the Cell programming model. It is different enough from most architectures that it will remain a specialized field; some of those programmers will do a good job, but only after a steep learning curve. The programming model for Xenon will be much more mainstream and already well researched, with a significant amount of development funds spent. You also can't underestimate the amount of skill that will be spent on tackling any remaining issues with the x360 model because of the similarities with the more mainstream PC model.
Aaron Spink
speaking for myself inc.
ihamoitc2005
25-Sep-2005, 09:37
and the lack of direct memory access, etc.
This is from IBM. Seems DMA no problem no?
http://www-128.ibm.com/developerworks/power/library/pa-cbea.html
SPEs access main storage with DMA commands that go between main storage and a private local memory used to store both instructions and data. SPE instruction- fetches and load and store instructions access this private local store rather than shared main storage.
This three-level organization of storage (register file, local store, main storage) -- with asynchronous DMA transfers between local store and main storage -- is a radical break with conventional architecture and programming models because it explicitly parallelizes computation and the transfers of data and instructions.
The reason for this radical change is that memory latency, measured in processor cycles, has gone up several hundredfold in the last 20 years. The result is that application performance is often limited by memory latency rather than peak compute capability or peak bandwidth. When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. Compared with this penalty, the few cycles it takes to set up a DMA transfer for an SPE is quite small. Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available.
In contrast, the explicit DMA model allows each SPE to have many concurrent memory accesses in flight without the need for speculation.
The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE's local store so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data. In several cases, this new approach to accessing memory has led to application performance exceeding that of conventional processors by almost two orders of magnitude, significantly more than anyone would expect from the peak performance ratio (about 10x) between the Cell Broadband Engine and conventional PC processors.
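The overlap described in the IBM text above (process previously transferred data while the next transfer is in flight) can be sketched as double buffering. This is only a Python simulation under stated assumptions: `dma_get` is a hypothetical stand-in for an asynchronous DMA, a single-worker thread pool plays the DMA controller, and `CHUNK` stands in for a slice sized to fit the local store. Real SPE code would be written in C against the Cell SDK, not like this.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # elements per transfer; hypothetical stand-in for a local-store-sized slice


def dma_get(main_storage, start, size):
    """Stand-in for an asynchronous DMA: copy a slice into a 'local store' buffer."""
    return list(main_storage[start:start + size])


def process_stream(main_storage, chunk=CHUNK):
    """Sum of squares over main_storage, fetching chunk N+1 while computing on chunk N."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_get, main_storage, 0, chunk)
        offset = chunk
        while pending is not None:
            local = pending.result()  # wait for the transfer that is in flight
            if offset < len(main_storage):
                # Kick off the next transfer before touching the current data.
                pending = dma.submit(dma_get, main_storage, offset, chunk)
                offset += chunk
            else:
                pending = None
            total += sum(x * x for x in local)  # compute overlaps the next fetch
    return total


if __name__ == "__main__":
    print(process_stream(list(range(10))))
```

The point of the structure is that the compute step never issues a blocking load against main storage; it only waits if the next chunk has not arrived yet.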
The programming model for Xenon will be much more mainstream and already well researched, with a significant amount of development funds spent. You also can't underestimate the amount of skill that will be spent on tackling any remaining issues with the x360 model because of the similarities with the more mainstream PC model.
Xenon isn't x86. I've never underestimated MS's software development skills, but Sony's recent moves of working with software houses such as Epic and using Nvidia's Cg tools show that they are indeed making efforts to make their console easier to develop for this coming gen.
This is from IBM. Seems DMA no problem no?
http://www-128.ibm.com/developerworks/power/library/pa-cbea.html
Of course they'll say that, they built the shite.
It's not that you explicitly have to set up a DMA that is the problem, it's that local stores aren't kept coherent. The lack of memory coherence *is* a bitch. The nuisance of the heterogeneous ISA is minor compared to that.
Cheers
Gubbi
ihamoitc2005
25-Sep-2005, 11:30
Of course they'll say that, they built the shite.
It's not that you explicitly have to set up a DMA that is the problem, it's that local stores aren't kept coherent. The lack of memory coherence *is* a bitch. The nuisance of the heterogeneous ISA is minor compared to that.
Cheers
Gubbi
You didn't read the link.
It starts with ...
The Cell Broadband Engine is a single-chip multiprocessor with nine processors operating on a shared, coherent memory.
and
While each SPE is an independent processor running its own application programs, a shared, coherent memory and a rich set of DMA commands provide for seamless and efficient communications between all Cell processing elements.
Also referring to heterogeneous ISA ...
http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.html?Open&printable
Memory access is performed via a DMA-based interface using copy-in/copy-out semantics, and data transfers can be initiated by either the IBM Power™ processor or an SPU. The DMA-based interface uses the Power Architecture™ page protection model, giving a consistent interface to the system storage map for all processor structures despite its heterogeneous instruction set architecture structure.
Sounds pretty straightforward no?
Black Dragon37
25-Sep-2005, 12:30
Of course they'll say that, they built the shite.
They also built Xenon. That's shite too? ;)
I think the SPE model is actually easier to get performance out of compared to a classic SMP/T design like the X360.
On the 360 you have to do a lot of synchronization work because 6 threads can be running simultaneously, each taking over some work for the game engine. So everything has to be synchronized carefully so that no thread runs out of step (too slow or too fast). Additionally there is very little cache for 6 threads, so the dev has to carefully plan what will be done at which time to avoid running into a cache bottleneck, which also decreases performance.
On the CELL on the other hand you have the same problems when looking only at the PPE (although cache is a little less of a problem since there is more per thread than Xenon has). The big difference, however, is that the SPE model is not an SMP/T one at all. It can be thought of as something like a Master-Slave relationship where the Master (PPE) delivers tasks to the individual SPEs, much like simply calling a subroutine, only that the subroutine is running on an ultra fast SPE instead of a GP core. This way there is much less synchronization work needed, and since each SPE is independent from the others it's also highly unlikely that "cache" stalls can occur (also since SPEs don't have any cache)...
aaaaa00
25-Sep-2005, 13:00
The big difference is however that the SPE Model is not a SMP/T one at all. It can be thought of something like a Master -Slave relationship where the Master (PPE) delivers tasks to the individual SPE, much like simply calling a subroutine.
It's trivial to construct any form of multithreaded relationship on an SMP machine, because it is the most general and flexible form of multithreading that exists.
If you want, you can easily build a "master - slave" design pattern on SMP by designating a main thread and constructing a job queue for each slave thread you allocate.
The synchronization required for doing this on an SMP is no more and no less than in the SPE Model -- you need some sort of lock to protect each slave thread's job queue from corruption by access from multiple concurrent threads.
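The "master - slave" pattern described above can be sketched in a few lines on any SMP machine. This is a minimal Python illustration, not anyone's engine code: `JobQueue`, `worker`, `run_master`, and `square` are hypothetical names, and each worker thread gets its own queue guarded by a lock, as the post describes.

```python
import threading
from collections import deque


class JobQueue:
    """Per-worker job queue; the lock guards it against concurrent access."""

    def __init__(self):
        self._jobs = deque()
        self._cond = threading.Condition(threading.Lock())

    def put(self, job):
        with self._cond:
            self._jobs.append(job)
            self._cond.notify()

    def get(self):
        with self._cond:
            while not self._jobs:
                self._cond.wait()
            return self._jobs.popleft()


STOP = object()  # sentinel telling a worker to exit


def square(x):
    """Hypothetical job body."""
    return x * x


def worker(jobs, results, results_lock):
    while True:
        job = jobs.get()
        if job is STOP:
            break
        func, arg = job
        value = func(arg)
        with results_lock:
            results.append(value)


def run_master(items, num_workers=3):
    """Main thread hands jobs round-robin to worker threads, then collects results."""
    queues = [JobQueue() for _ in range(num_workers)]
    results = []
    results_lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(q, results, results_lock))
        for q in queues
    ]
    for t in threads:
        t.start()
    for i, item in enumerate(items):
        queues[i % num_workers].put((square, item))
    for q in queues:
        q.put(STOP)
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_master(range(10)))
```

Note that the workers here read their inputs straight from shared memory; on a local-store design the master (or the worker) would additionally have to move the job's data in and out explicitly.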
inefficient
25-Sep-2005, 14:14
It's trivial to construct any form of multithreaded relationship on an SMP machine, because it is the most general and flexible form of multithreading that exists.
If you want, you can easily build a "master - slave" design pattern on SMP by designating a main thread and constructing a job queue for each slave thread you allocate.
The synchronization required for doing this on an SMP is no more and no less than in the SPE Model -- you need some sort of lock to protect each slave thread's job queue from corruption by access from multiple concurrent threads.
I believe you are wrong. You're thinking in classic SMP/T terms. And like Nemo80 hinted, the correct way to look at SPE programming is not like this. The key advantages the Cell has here are the DMA memory access model and that each SPE has a local store. In the Cell programming model you would set up a DMA on the SPE and then let it execute/read/write in its own private area.
It's not that you explicitly have to set up a DMA that is the problem, it's that local stores aren't kept coherent. The lack of memory coherence *is* a bitch. The nuisance of the heterogeneous ISA is minor compared to that.
You didn't read the link.
It starts with ...
The Cell Broadband Engine is a single-chip multiprocessor with nine processors operating on a shared, coherent memory.
and
While each SPE is an independent processor running its own application programs, a shared, coherent memory and a rich set of DMA commands provide for seamless and efficient communications between all Cell processing elements.
DMA transfers are kept coherent, but stores to an SPE's local store are not. This means that the entire local store is part of the SPE's context and has to be saved on a context switch.
So the line from the article (my emphasis):
The SPEs are more adept at compute-intensive tasks and slower at task switching.
Is a gross understatement. SPEs having more than 256KB context compared to the ~1KB for a regular PPE context means that task switching is all but impractical.
Also referring to heterogeneous ISA ...
http://domino.research.ibm.com/comm...?Open&printable
Memory access is performed via a DMA-based interface using copy-in/copy-out semantics, and data transfers can be initiated by either the IBM Power™ processor or an SPU. The DMA-based interface uses the Power Architecture™ page protection model, giving a consistent interface to the system storage map for all processor structures despite its heterogeneous instruction set architecture structure.
Sounds pretty straightforward no?
And that has nothing to do with the Instruction Set Architecture. The PPE and SPEs have different ISAs. And, worse, different programming models. This means that if your PPE is bogged down with tasks, you can't just push one of them off to one of the SPEs, you actually have to do some real work to do that.
So no, it's not straight forward at all.
Cheers
Gubbi
They also built Xenon. That's shite too? ;)
IMO, yes. :)
OOO and more cache please.
Cheers
Gubbi
I believe you are wrong. You're thinking in classic SMP/T terms. And as Nemo80 hinted, this is not the correct way to look at SPE programming. The key advantages the Cell has here are the DMA memory access model and the fact that each SPE has a local store. In the Cell programming model you would set up a DMA on the SPE and then let it execute/read/write in its own private area.
The only way to look at CELL is to look at it as what it is: One host processor with multiple DSPs. What is novel about CELL is that the DSPs are optimized for float instead of integers, the level of integration and the bandwidth of the beast.
Microsoft chose to add DSP-like functionality to their cores: SIMD instructions and lockable cache.
The fact that you have to explicitly move data around with DMAs is not an advantage. Repeat: NOT... NOT.... N.O.T. an advantage. It was done to remove the complexity of keeping 9 cores coherent.
Cheers
Gubbi
compres
25-Sep-2005, 15:58
Regardless of the actual hardware benefits, a lot of developers who prefer the 360 have commented on the overall dev environment and the tools they can use: debuggers, performance tools, etc. Plus they are all tools that developers who develop for the PC are already familiar with, and those who aren't claim they are easy to use. (And after seeing the PC version of the 360 controller, it's obvious this is a HUGE part of MS's mid- to long-term strategy: one development budget, two platforms.)
That said, parallelism with 3 identical cores and 6 identical threads should be a bit easier than a PPE and SPE design where each has different needs and potentially different roles, shouldn't it? (I have to credit that thought to Carmack though, as he stated in his Quakecon address.)
What we have not seen, however, is whether the Cell will provide an advantage in the closed-box system known as the PS3, and I think that's what is really on trial in this thread.
J
Almost exactly my thoughts.
Panajev2001a
25-Sep-2005, 16:14
C/C++ and Intrinsics on EE's VU's... :lol
Seriously,
If you are not counting the "successful" (ahem) VectorC/Codeplay VU Compiler, SCE has never really endorsed a compiler for VU's, nor did they ever think about it when designing them with Toshiba. Even VCL was an afterthought, a good one, but still an afterthought IMHO. The ISA, the VU resources and functional units choice, etc... it was all chosen with low-level ASM programming in mind, with hand-scheduling of instructions by the programmer.
You cannot really compare the VU's with the SPE's this way unless you are quite bitter on the argument and you want to ignore the progress made in the concept and implementation of Vector/SIMD processors going from the VU's to the SPE's (although some mistakes were made too, like the big mess about the lack of misaligned load/stores in the SPE's and how it relates to scalar processing performance).
darkblu
25-Sep-2005, 16:18
It's generally easier to write fast code when it's easier to write correct code, since fast but incorrect code is not typically very useful. :twisted:
why is that i get the feeling you equate 'correct' with 'easy'? i.e. why the fact that it's relatively easy to write smp multithreaded code (which, btw, many gamedevs i've worked with would outright disagree with) somehow translates to writing correct code? i can pick a rock off the ground and throw it right away at a target - that's toddler-easy. the chances of me hitting a target that way though and the chances of me hitting the same target through a scope rifle, which i took some time to learn handling, are quite different. see, the former approach is 'easy' whereas the latter is 'correct'.
Correct multithreaded code is much easier to write when you have N identical CPUs all sharing identical access to the same main memory, with a well-ordered memory model and cache coherency guaranteed by the hardware. (Which is pretty much x86 SMP in a nutshell in fact.)
first, you step on the premise that correct smp multithreading is the easiest, most natural and, i get the feeling, magically efficient form of concurrency. whereas in fact it's only the _dominant_ one in the desktop space. for various reasons, the majority of them purely historical and others purely economical. and you're yet to prove that point about the easy _correct_ smp code.
Such an architecture is fairly well understood today, and any college concurrent programming textbook will teach you the basics of synchronization objects and have parallel algorithms that work correctly and reasonably well on an SMP.
good luck with being a good concurrent programmer after reading one college textbook on multithreading (that's not necessarily directed at you, that's a general statement).
Each step away you take from such an architecture introduces stuff that makes it more complicated just to ensure code correctness, never mind performance.
that's basically saying that each step away from a place takes us farther from it. yes, what's so fundamentally problematic with that? why should we be so stuck with smp multithreading? because of the availability of college textbooks on the subject? sorry, i fail to see the reason.
The point Carmack is making is that Xbox 360 is already pretty much the best case scenario for multithreaded architectures -- but even there, ensuring code correctness is going to be hard to do before you even start to think about making the performance better.
the point Carmack is making is that the 360 is as close to pc smp as you can get, and _yet_ that gives you zilch in terms of guaranteed performance gains. and he has a gripe with that, regardless of how (un)comfortable he or anybody else may feel about smp multithreading.
Panajev2001a
25-Sep-2005, 16:19
The fact that you have to explicitly move data around with DMAs is not an advantage.
Says you and others... but not everyone dislikes it :).
expletive
25-Sep-2005, 16:47
regardless of how true his statement is in itself, the question is: what do _you_ read into his statement.
the first paragraph of Carmack's statement basically says: 'it is very easy to spawn a thread and get it running on the 360 - just as easy as it is on your grandma's smp pc'
to which everybody can only nod in agreement, as there's nothing to misunderstand here and that message gets clearly and correctly propagated. now, getting a thread up and running and actually getting efficient parallelism are two entirely different things, as anybody who has ever tackled a single parallelism problem could tell you. so let's see what Carmack says further in his second paragraph.. he says exactly this - 'regardless of how easy it's to tinker with threads (in your grandma's smp way) this still grants you nothing in terms of effective parallelism'.
ok, now that we cleared up the matter with Carmack's statement we can return to the original topic - how much easier it is to achieve _efficient_parallelism_ on the 360 over the cell. and now it's your turn to step in and actually build your argument.
Ok, couple of things here, and i do appreciate your thoughtful response.
1. My original post was phrased in the form of a question:
"That said, parallelism with 3 identical cores and 6 identical threads should be a bit easier than a PPE and SPE design where each has different needs and potentially different roles, shouldn't it?"
So I'm not trying to argue any point, just trying to understand the difference and benefits of each approach.
2. I guess i interpret what is being said by JC slightly differently. My interpretation is he's saying:
a. multithreaded programming is a pain in the ass
b. from an 'ease of use' standpoint, the design in the 360 is the best possible case for a developer to coax performance benefits out of multithreading
c. even in the best possible case, it's very difficult to realize real world benefits
In my mind, that still doesn't change the fact that regardless of how much absolute performance gain you can wring out of the 360 CPU, at the end of the day it's still easier to get that meager efficiency from the XeCPU than the Cell.
What does this mean? I don't really know. More relative efficiency on the 360 CPU? Shorter development times? Better games sooner in each console's lifecycle? No idea, and only time will tell.
So in summary, even if the dev-friendly design of the XeCPU gets you nothing, it's easier to get nothing on the 360 than on the PS3. :D
J
Shifty Geezer
25-Sep-2005, 17:24
Originally Posted by Gubbi: The fact that you have to explicitly move data around with DMAs is not an advantage.
Says you and others... but not everyone dislikes it :).
I've got to say I like the idea of SPE's forced memory management. I work on high level PC code and there are often occasions when I WANT to know what's passing through cache and how far my data is from the processing logic. But then at Uni, out of all the languages and programming models, the one I liked most was assembler. I preferred to know exactly what the hardware is doing and to think like a CPU to make the most of it.
I guess I could liken it to explicit variable declarations or not. I'd much rather have the option to NEED to declare my variables up front than the freedom to make new variables on the fly mid-code, as the ease of the latter produces the risk of errors from using a wrong variable name. Likewise the NEED to keep an eye on managing memory accesses may be an inconvenience but it focusses the developer on optimisations and working WITH the processor. Kind of a Xen thing :mrgreen:
Shifty Geezer
25-Sep-2005, 17:29
DMA transfers are kept coherent, but stores to an SPE's local store are not. This means that the entire local store is part of the SPE's context and has to be saved on a context switch.
I can't see who was talking about context switches on a SPE. Anyone wanting to run two+ concurrent threads on a SPE and switch between them needs their head examining! You set it a task, let it finish it, and then move onto another task. When would you not want to work that way on a SPE?
The fact that you have to explicitly move data around with DMAs is not an advantage. Repeat: NOT... NOT.... N.O.T. an advantage. It was done to remove the complexity of keeping 9 cores coherent.
How about the prospect of a software framework for Cell that can automatically manage/optimize dataflow in this deterministic environment instead of hand optimization by a programmer?
I've got to say I like the idea of SPE's forced memory management. I work on high level PC code and there are often occasions when I WANT to know what's passing through cache and how far my data is from the processing logic. But then at Uni, out of all the languages and programming models, the one I liked most was assembler. I preferred to know exactly what the hardware is doing and to think like a CPU to make the most of it.
For small, trivial things, sure assembler is great, and understanding exactly how the machine is operating at a low level is a good thing. But aren't we so far beyond the simple, trivial cases that this is just not feasible, except for the highly-focused, performance-profile-directed situations?
.Sis
Shifty Geezer
25-Sep-2005, 18:54
Dunno. Large programs are broken into smaller procedures or code segments that make up your engine and these get pieced together to make the whole program. 256 KB LS for data and code means your program isn't going to be totally massive, and I would guess much smaller than 256 KB. Heck, 200 KB of assembler isn't a pretty thought! You can achieve a lot in 32 KB (whole 8-bit games even. Imagine how fast the original Elite could run when written for a SPE :shock: ) and I'd expect a process could be broken into manageable and efficient chunks. Seems more that good design is needed rather than mystical programming powers. And note SPE's don't need assembler so the point's moot anyway. Unless you're still developing for PS2!
ihamoitc2005
25-Sep-2005, 19:15
b. from an 'ease of use' standpoint, the design in the 360 is the best possible case for a developer to coax performance benefits out of multithreading
c. even on the best possible case, its very difficult to realize real world benefits
You misunderstood what was meant by "best possible case". He didn't say design of the XeCPU = best possible design. What he means is that in programming the XeCPU, one could ideally use it that way (get six threads going), but in reality it's not like that. You see, he's comparing the ideal world with the real world. Carmack is an expert only in single-core x86, so maybe he's looking for excuses.
ihamoitc2005
25-Sep-2005, 19:43
Dunno. Large programs are broken into smaller procedures or code segments that make up your engine and these get pieced together to make the whole program. 256 KB LS for data and code means your program isn't going to be totally massive, and I would guess much smaller than 256 KB. Heck, 200 KB of assembler isn't a pretty thought! You can achieve a lot in 32 KB (whole 8-bit games even. Imagine how fast the original Elite could run when written for a SPE :shock: ) and I'd expect a process could be broken into manageable and efficient chunks. Seems more that good design is needed rather than mystical programming powers. And note SPE's don't need assembler so the point's moot anyway. Unless you're still developing for PS2!
Finally someone understands CELL programming. It's not multicore in the way of the XeCPU. It's a PPE running the OS with the ability to run 7 additional threads on specialized hardware, much like a CPU off-loading graphics to the GPU, audio to the sound card, etc. But the advantage is that since all SPEs are identical, whichever SPE is free can do whatever task is required next, a little like the unified shader idea where the same unit can do different types of tasks.
The key is to write lots of small programs that fit in 256 KB. So a new programming model must be internalized, but if small programs are used the result is very fast, efficient, load-balanced processing.
People wanting multi-tasking on one SPE don't understand how to use it. It's not even needed. It processes one program at a time in queue order, like 7 grocery store cashiers, but where customers can move from a long line to a short line as needed.
Also, this is easy to understand, no? Memory on each SPE is coherent but also shared on a very high bandwidth bus, so it should be pretty straightforward. No bandwidth issues, no coherence issues; the only thing is to get the programming model correct and the rest is taken care of.
from below IBM link...
each SPE is an independent processor running its own application programs, a shared, coherent memory
mckmas8808
25-Sep-2005, 21:06
People wanting multi-tasking on one SPE don't understand how to use it. It's not even needed. It processes one program at a time in queue order, like 7 grocery store cashiers, but where customers can move from a long line to a short line as needed.
BS. No way that's true (smiling hoping that it really is). Are you telling me that you don't have to dedicate a particular SPE to do physics? So why do people always say, "If EA is using 3 SPEs for graphics and the other 4 for physics, sound and AI then the CPU is being wasted"? I don't get it. So you're saying, using your perfect example, that any SPE could be used for physics at any time? Is it smart to do it that way? So the 3 SPEs that EA was talking about might not always be the exact same SPEs, yet just will take 3 SPEs worth of information at any given time?
And who are you and what do you do? I've never seen you here before? Are you a game developer?
Black Dragon37
25-Sep-2005, 21:09
Does it matter if he's a game developer or not? :???:
Titanio
25-Sep-2005, 21:11
BS. No way that's true (smiling hoping that it really is). Are you telling me that you don't have to dedicate a particular SPE to do physics? So why do people always say, "If EA is using 3 SPEs for graphics and the other 4 for physics, sound and AI then the CPU is being wasted"? I don't get it. So you're saying, using your perfect example, that any SPE could be used for physics at any time? Is it smart to do it that way? So the 3 SPEs that EA was talking about might not always be the exact same SPEs, yet just will take 3 SPEs worth of information at any given time?
mckmas8808, it's called a job queue, and it certainly can be done that way. People often talk about "reserving" a SPU for a specific task, but you certainly don't have to do that. If a task is going to occupy a SPU for the duration of a frame's processing, however, you could effectively say that it is reserved for that task. But some SPUs may well touch multiple tasks in the duration of one frame's processing.
SCEA's presentation from GDC has more on the job queue model, and others.
edit - SPE = SPU..SPU is the official name now, isn't it?
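To make the job queue idea concrete, here is a minimal sketch in plain Python. It is not Cell code: ordinary threads stand in for the SPUs, and every name in it (`run_jobs`, `NUM_WORKERS`, the job tuples) is hypothetical. It only shows the shape of the model, where no worker is reserved for a task type and whichever worker is free takes the next job.

```python
import queue
import threading

NUM_WORKERS = 7  # ASSUMPTION: threads standing in for the 7 SPUs


def run_jobs(jobs, num_workers=NUM_WORKERS):
    """Run (name, func, arg) jobs on a pool of interchangeable workers.

    Returns (results, ran_on): the value produced by each job and the
    id of the worker that happened to pick it up.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = {}
    ran_on = {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                # Take whatever job is next, regardless of its type.
                name, func, arg = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done for the frame
            value = func(arg)  # run the job to completion, no preemption
            with lock:
                results[name] = value
                ran_on[name] = worker_id

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, ran_on


if __name__ == "__main__":
    frame_jobs = [("physics_%d" % i, lambda x: x * x, i) for i in range(10)]
    frame_jobs += [("audio_%d" % i, lambda x: x + 1, i) for i in range(4)]
    results, ran_on = run_jobs(frame_jobs)
    print(results)
    print(ran_on)
```

Which worker ends up running which job can differ from run to run, which is exactly why "3 SPEs for graphics" need not mean the same three units every frame.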
randycat99
25-Sep-2005, 21:25
I agree.
The significance of this is akin to no longer being restricted to some finite number of process threads that need to be mapped to a specific core or hardware thread, but now each process thread can be further divided into segments. Now you can have any number of segments (1000's of jobs vs. just 1, 2, 6, 9 threads) that can be executed, as appropriate, on whatever resources you have available. It essentially makes the whole worrying about divvying up n-discrete processes (AI, physics, game code, etc) amongst x-processors in some "logical" manner moot (maybe, "almost" moot, if you want to be picky ;) ).
Another way to imagine it is if you literally have 3 spools of thread and 8 sewing machines. If you restrict to the requirement that these threads are indivisible, then most certainly you will only be able to leverage 3 sewing machines to consume the threads. If you open it up to snipping many, many segments from these spools, and then feeding it to your pool of machines, then that opens up all sorts of possibilities of using all 8 sewing machines, or 16, or 100, or n-sewing machines... :shock:
mckmas8808
25-Sep-2005, 21:26
Does it matter if he's a game developer or not? :???:
Of course not it's just that he has seemed to explain himself in such a way that it sounds like he may be. And if so I would be curious if he is being employed to make a current or next generation game. No more, no less. And thanks Titanio. I will use this information for future references. :wink:
ihamoitc2005
25-Sep-2005, 21:26
So the 3 SPEs that EA was talking about might not always be the exact same SPEs, yet just will take 3 SPEs worth of information at any given time?
I don't know the details of EA's work, but I too heard about this distribution of tasks you talked about. Perhaps they are talking in terms of cycles, since number of cycles per unit times number of units is the total cycles available (in the real world, not precisely), and they're using 3 units worth of cycles for graphics tasks and will fill up the rest of the cycles with other tasks. Remember that graphics information is only needed when game control says it is needed, so no more data is processed than needed. It's easy to manage SPE use with the PPE. IBM has much information on this on the web.
mckmas8808
25-Sep-2005, 21:28
I agree.
The significance of this is akin to no longer being restricted to some finite number of process threads, but now each thread can be further divided into segments. Now you can have any number of segments (1000's of jobs vs. just 1, 2, 6, 9 threads) that can be executed, as appropriate, on whatever resources you have available. It essentially makes the whole worrying about divvying up n-discrete processes (AI, physics, game code, etc) amongst x-processors in some "logical" manner moot (maybe, "almost" moot, if you want to be picky ;) ).
Ok, one question for you randy. Why don't more people point that out then? Even here it seems like that info just fell through the cracks. Usually somebody like Acert would have gone on for 4 paragraphs about how great and different that is from past systems.
randycat99
25-Sep-2005, 21:41
This revelation came to me when somebody else posted in another topic not long ago to break my mind from the concept of "threads" (with the implication to think more in terms of "packets")...plus I had to get a biker tattoo, but anyways.
At this point, I agree it is a fine distinction that a lot of people aren't picking up. It all stems from the well-established, classic understanding that multiple processors automatically involve some number of discrete threads. There's nothing wrong with that, as that is essentially how the problem has been approached for a very long time. There is no imperative that there can only be one approach, however. It's quite likely that designing an architecture from the ground up to embrace multiprocessing inherently makes new approaches possible, whereas adapting an existing architecture to simply support multiprocessing may leave you with the classical approach as the only viable one.
aaronspink
25-Sep-2005, 22:20
This is from IBM. Seems DMA no problem no?
http://www-128.ibm.com/developerworks/power/library/pa-cbea.html
In this instance, I was referring to direct memory access as just that: direct access to the memory from within Cell, the ability to load and store directly from the Cell i-stream to memory. Cell only supports access to main memory via a copy engine that realistically must move large chunks of memory at a time to be efficient.
Aaron Spink
aaronspink
25-Sep-2005, 22:23
Xenon isn't x86. I've never underestimated MS's software development skills, but Sony's recent move of going with software houses such as Epic and using Nvidia's Cg tools shows that they are indeed making efforts to make their console easier to develop for this coming gen.
Nor did I say it was x86. The actual instruction set being used is a secondary issue to the overarching programming model. The whole of the mainstream of the computing industry is moving towards a model that is roughly the same as the x360 which will reap significant rewards. The network processor industry is the closest thing to cell and has been riddled with frustrations, bugs, and performance issues due to the complex programming models.
Aaron Spink
speaking for myself inc.
aaronspink
25-Sep-2005, 22:27
On the CELL on the other hand you have the same problems, when looking only at the PPE (although cache is a little less of a problem since there is more per thread than Xenon has). The big difference is however that the SPE model is not a SMP/T one at all. It can be thought of as something like a Master-Slave relationship where the Master (PPE) delivers tasks to the individual SPE, much like simply calling a subroutine. Only that the subroutine is running in an ultra fast SPE instead of a GP core. This way there is much less synchronization work needed and since each SPE is independent from the others it's also highly unlikely that "cache" stalls can occur (also since SPEs don't have any cache)...
Gee, you just rediscovered the producer-consumer model which has been employed on SMP systems in the past.
Suffice to say that this model is certainly going to be a common one in game engines for X360. You kick off a bunch of threads at the start of the game loop to compute geometry, physics, etc for the next frame, and then gather the results together when they are done.
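As a rough illustration of that kick-off-then-gather pattern, here is a minimal Python sketch. The subsystem functions and the `run_frame` helper are hypothetical placeholders, not anything from an actual X360 engine; a real engine would do this with native threads.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical per-frame subsystems; each returns its result for the frame.
def compute_geometry(frame):
    return ("geometry", frame)


def compute_physics(frame):
    return ("physics", frame)


def compute_audio(frame):
    return ("audio", frame)


SUBSYSTEMS = {
    "geometry": compute_geometry,
    "physics": compute_physics,
    "audio": compute_audio,
}


def run_frame(pool, frame):
    """Kick off every subsystem for this frame, then gather the results."""
    # Fork: submit all the work up front so it can run concurrently.
    futures = {name: pool.submit(fn, frame) for name, fn in SUBSYSTEMS.items()}
    # Join: block until each piece is done and collect its output.
    return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=3) as pool:
        for frame in range(3):
            print(run_frame(pool, frame))
```

The gather step is the synchronization point: nothing for the next frame starts until every subsystem has reported back.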
I would suggest that you read up on the various programming models in use today before making any more comments.
Aaron Spink
speaking for myself inc.
Shifty Geezer
25-Sep-2005, 22:29
I posted an explanation but can't think where. There's different programming models that you can use on Cell (and different memory access models too).
One is the traditional 1 thread=1 job which people gravitate towards. They talk of 2 threads for this, 2 for that, 1 for so and so. The problem with this model is inefficiency. When a SPE has finished its job it'll just be waiting around doing nothing. Another problem is data dependency, which I'll explain in a bit.
Another model is distributed computing, processing one task across multiple elements.
So say we have to process on the SPE's rigid body physics, AI, sound, fluid dynamics and texture synthesis. The traditional model might be...
SPEs 1&2 = Rigid body physics
SPE 3 = AI
SPE 4 = sound
SPE 5&6 = fluid dynamics
SPE 7 = texture synthesis
Now consider SPE usage over 15ms (about 1/60th second). If sound for the frame is complete in 4 ms, you've got SPE 4 sitting idle for 11 ms. And if your rigid body physics is getting complicated maybe 2 SPE's won't manage to fulfil it in time. We also have a problem with dependency. AI needs to react to things that are happening, which might be dependent on physics. Audio too needs to know where objects are and where they're colliding. You can't really calculate the audio until after the physics. The distributed computing model might work like this (BTW: my use of terminology is pretty manky. The official term isn't distributed computing but I can't remember what it is. However in this context I just mean distributing the process over several computation devices)
Time     Process
0 ms     SPEs 1-7 process fluid dynamics including objects on surface of water
2 ms     SPEs 1-7 process rigid body physics
8 ms     SPEs 1-7 calculate AI
11 ms    SPEs 1-6 generate textures, SPE 7 generates audio
13 ms    SPEs 1-4 work on post processing effects, SPE 5 processes audio encoding
This way there's less wastage and more flexibility. It's also scalable, which ties in with one of the original concepts for Cell. If your tasks can be compartmentalised, you can spread the workload over as many SPEs as the system has. And if you attach more SPEs by networking up with another Cell device, it can share in the workload.
Context switching has negligible overhead when switching tasks as long as you aren't switching task frequently. Where a PC CPU has to switch between potentially dozens of processes, a SPE doesn't. It can be left to finish the job. If you're having a SPE work on several different tasks and switching between them frequently, you're not making the most of the SPE.
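The numbers in that 15 ms example can be tallied with a few lines of Python. This is only a sketch of the arithmetic: the start times and SPE counts are read off the timeline in the post, while the assumption that the last step runs until the 15 ms mark (and the helper names) are mine.

```python
FRAME_MS = 15
NUM_SPES = 7

# (start_ms, end_ms, busy_spes) for each step of the distributed timeline.
# ASSUMPTION: the final step is taken to run until the end of the frame.
SCHEDULE = [
    (0, 2, 7),    # fluid dynamics on SPEs 1-7
    (2, 8, 7),    # rigid body physics on SPEs 1-7
    (8, 11, 7),   # AI on SPEs 1-7
    (11, 13, 7),  # textures on SPEs 1-6, audio on SPE 7
    (13, 15, 5),  # post processing on SPEs 1-4, audio encoding on SPE 5
]


def busy_spe_ms(schedule):
    """Total SPE-milliseconds of work across all steps."""
    return sum((end - start) * spes for start, end, spes in schedule)


def utilisation(schedule, frame_ms=FRAME_MS, num_spes=NUM_SPES):
    """Fraction of the available SPE time that is actually used."""
    return busy_spe_ms(schedule) / (frame_ms * num_spes)


def idle_ms_dedicated(task_ms, frame_ms=FRAME_MS):
    """Idle time of an SPE reserved for one task that finishes early."""
    return frame_ms - task_ms


if __name__ == "__main__":
    print("dedicated sound SPE idle (ms):", idle_ms_dedicated(4))
    print("distributed busy SPE-ms:", busy_spe_ms(SCHEDULE))
    print("distributed utilisation:", round(utilisation(SCHEDULE), 3))
```

Under these assumptions the distributed timeline keeps the SPEs busy for 101 of the 105 available SPE-milliseconds, while the SPE reserved for sound in the traditional layout sits idle for 11 of its 15 ms.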
aaronspink
25-Sep-2005, 22:30
I believe you are wrong. You're thinking in classic SMP/T terms. And as Nemo80 hinted, this is not the correct way to look at SPE programming. The key advantages the Cell has here are the DMA memory access model and the fact that each SPE has a local store. In the Cell programming model you would set up a DMA on the SPE and then let it execute/read/write in its own private area.
And then it has to utilize the copy engine to move that data back to a memory accessible by the actual CPU and then synchronize with the CPU. It's the same damn model, just with extra steps and complications. It's no different than having a multiple-DSP card in a PC which you copy the data to, tell it to perform the calculations, copy the data back, and at some point synchronize with the main processor.
In the end, it is a producer-consumer model with added complexity.
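The copy-in, compute, copy-out sequence described here can be sketched as follows. This is plain Python standing in for the idea, not SPE code: the slice copies play the role of the DMA transfers, `process_in_chunks` and the 16 KB chunk size are hypothetical, and real code would typically overlap transfers with computation (double buffering), which is not shown.

```python
LOCAL_STORE_BYTES = 256 * 1024  # size of the private memory, as discussed above


def process_in_chunks(main_memory, kernel, chunk_bytes=16 * 1024):
    """Apply kernel to every byte of main_memory via a small private buffer.

    kernel must map an int in 0..255 to an int in 0..255.
    """
    if chunk_bytes > LOCAL_STORE_BYTES:
        raise ValueError("chunk does not fit in the local store")

    local_store = bytearray(chunk_bytes)  # the worker's private area
    for offset in range(0, len(main_memory), chunk_bytes):
        n = min(chunk_bytes, len(main_memory) - offset)

        # Copy in: stand-in for a DMA transfer from main memory.
        local_store[:n] = main_memory[offset:offset + n]

        # Compute: only the private copy is touched here.
        for i in range(n):
            local_store[i] = kernel(local_store[i])

        # Copy out: stand-in for a DMA transfer back to main memory.
        main_memory[offset:offset + n] = local_store[:n]
    return main_memory


if __name__ == "__main__":
    data = bytearray(range(256)) * 200
    process_in_chunks(data, lambda b: (b + 1) % 256)
    print(data[:8])
```

On a shared-memory SMP the two copy steps disappear and the loop works on `main_memory` directly, which is the "extra steps" part of the argument; the counter-argument elsewhere in the thread is that the explicit copies make the data movement visible and predictable.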
Aaron Spink
speaking for myself inc.
Shifty Geezer
25-Sep-2005, 22:32
Nor did I say it was x86. The actual instruction set being used is a secondary issue to the overarching programming model. The whole of the mainstream of the computing industry is moving towards a model that is roughly the same as the x360 which will reap significant rewards.
Though are not processor roadmaps heading towards a Cell-like design? We've got SMP cores for now, but some years down the line Intel will be introducing a Cell structure of core(s)+synergistic processing unit. This change looks set to come sooner or later whether programmers like it or not, no?
aaronspink
25-Sep-2005, 22:34
Says you and others... but not everyone dislikes it :).
Given the choice between an architecture with DMA movement engines or DMA movement engines along with direct access, 9 out of 10 good programmers will prefer the latter. The 10th was just in a car accident and suffered massive brain damage.
Aaron Spink
speaking for myself inc.
Titanio
25-Sep-2005, 22:34
Context switching has negligible overhead when switching tasks as long as you aren't switching task frequently. Where a PC CPU has to switch between potentially dozens of processes, a SPE doesn't. It can be left to finish the job. If you're having a SPE work on several different tasks and switching between them frequently, you're not making the most of the SPE.
Yeah, it should be stressed that the switch would only happen when the task is finished, not on regular blocking conditions. The idea with a SPU is to avoid blocking conditions ;)
The 10th was just in a car accident and suffered massive brain damage.
Necessary? No. Rude? Yes. Well done.
ihamoitc2005
25-Sep-2005, 22:38
In this instance, I was referring to direct memory access as just that: direct access to the memory from within Cell, the ability to load and store directly from the Cell i-stream to memory. Cell only supports access to main memory via a copy engine that realistically must move large chunks of memory at a time to be efficient.
Aaron Spink
True, but this is a very efficient setup. Main memory is very slow and far away; the last thing an execution unit wants to do is waste cycles dealing with it directly for every little thing. This is why CPUs have cache. The cost of a cache miss is massive, even with low-latency XDR.
aaronspink
25-Sep-2005, 22:43
Though are not processor roadmaps heading towards a Cell-like design? We've got SMP cores for now, but some years down the line Intel will be introducing a Cell structure of core(s)+synergistic processing unit. This change looks set to come sooner or later whether programmers like it or not, no?
To my knowledge, neither Intel nor AMD have published any roadmaps or intentions to develop anything like cell. From their published roadmaps, both Intel and AMD appear to be going down the path of multiple homogeneous processors on a die.
Aaron Spink
speaking for myself inc.
aaronspink
25-Sep-2005, 22:45
Necessary? No. Rude? Yes. Well done.
Not rude, it's called humor. I certainly could have said 10 out of 10, but what I said was certainly more humorous. Given the choice, no programmer is going to turn down the option of having direct access to main memory. While there is an advantage to copy engines and private memory, there is a significant disadvantage to giving up direct access.
Aaron Spink
speaking for myself inc.
ihamoitc2005
25-Sep-2005, 22:48
Given the choice between an architecture with DMA movement engines or DMA movement engines along with direct access, 9 out of 10 good programmers will prefer the latter. The 10th was just in a car accident and suffered massive brain damage.
Aaron Spink
speaking for myself inc.
The more processing cores working on small bits of data wanting direct access the more the brain-damaged guy is smartest no? :)
Well, somebody calling the SPEs "DSPs" shouldn't be taken seriously anyway.
Nor did I say it was x86. The actual instruction set being used is a secondary issue to the overarching programming model. The whole of the mainstream of the computing industry is moving towards a model that is roughly the same as the x360 which will reap significant rewards. The network processor industry is the closest thing to cell and has been riddled with frustrations, bugs, and performance issues due to the complex programming models.
The engineers from SCE and Toshiba had sat down to talk with IBM during the early stages of the Cell development. They were in fact aware of multi-core architectures, since IBM had presented its own models for them to choose from. It was after scrutinizing all available options that they decided to go with the master-slave approach. So saying that the Cell had been riddled by bugs is absolutely wrong. It was in fact a better decision that could yield the best performance/ease of development ratio.
aaronspink
25-Sep-2005, 23:05
The more processing cores working on small bits of data wanting direct access the more the brain-damaged guy is smartest no? :)
Nope. The DMA engines either have to be coherent or not. In the cell design they are coherent which implies that they flow into the pipeline a little ahead of the L2 controller and snoop the L2 before queuing in the memory controller.
In the case that they are incoherent, the programmer has the increased complexity of making sure anything that needs to be DMA'd is evicted from the caches.
The primary issue is maintaining the coherence for accesses to main memory from the SPE's and the complexity of it really doesn't change regardless if you are using a DMA copy engine or direct access.
Aaron Spink
speaking for myself inc.
To my knowledge, neither Intel nor AMD have published any roadmaps or intentions to develop anything like cell. From their published roadmaps, both Intel and AMD appear to be going down the path of multiple homogeneous processors on a die.
I don't see an identical tri-core CPU that shares the same pool of L2 cache coming out in the market anytime soon. Cell's master-slave approach is still the best for processors with more than two cores.
Panajev2001a
25-Sep-2005, 23:05
Given the choice between an architecture with DMA movement engines or DMA movement engines along with direct access, 9 out of 10 good programmers will prefer the latter. The 10th was just in a car accident and suffered massive brain damage.
Aaron Spink
speaking for myself inc.
Given the choice between co-processors with lockable cache and simple control on DMA movement (you really ought to play around with low level PlayStation 2 DMAC programming to enjoy it :)) to full blown INDEPENDENT processors with ample DMA control and flexibility... the decision is split ;).
There are many developers out there that loved PS2's VU's, did not like the VFPU of the PSP (yeah Faf, I know... a crime ;)) because of its simple dependent co-processor nature, and are in love with the SPE's because they see them mostly as an evolution of the VU's (with some minor drawbacks of course ;)).
Also, what programmers really would like is a single core, low latency, with bad-ass branch predictor, 2-way SMT, 4-way, OOOe, 5 GHz processor with 256 KB of L1 cache, 2 MB of L2 cache, 16 MB of L3 cache and the best FPU in the world, the FPU+VFPU combo the PSP uses :) along with an uber-optimizing compiler that converts all their single threaded code in perfectly optimized multi-threaded code :).
Seriously, even if the 10th likes the former approach if his code runs faster in the end... well :P.
aaronspink
25-Sep-2005, 23:08
well somebody calling the SPEs "DSPs" shouldnt be taken for serious anyways.
The SPEs have a significant similarity to various DSPs that are available and the overall programming model is also very similar to some DSP plug boards that are available. IMNSHO, it is actually the most apt description of the SPEs available.
I'll put my knowledge of computer architecture, system design, semiconductor physics, and VLSI design against yours any day of the week.
Aaron Spink
speaking for myself inc.
aaaaa00
25-Sep-2005, 23:17
In the cell programming model you would set up a DMA on the SPE and then let it execute/read/write in its own private area.
Please elaborate. How does the SPE know to look for jobs? How does the SPE actually receive the job? How does the PPE know the job is complete? Where does the PPE retrieve the results from?
Saying "you would set up a DMA" is not enough detail. You don't know how long the DMA will take. You don't know how long the SPE will take to execute the job.
How do you synchronize the execution of the SPE to the PPE?
I'll put my knowledge of computer architecture, system design, semiconductor physics, and VLSI design against yours any day of the week.
Aaron Spink
speaking for myself inc.
No sorry, i usually dont argue with college boys. :-)
aaaaa00
25-Sep-2005, 23:26
why is that i get the feeling you equate 'correct' with 'easy'? i.e. why the fact that it's relatively easy to write smp multithreaded code (which, btw, many gamedevs i've worked with would outright disagree with) somehow translates to writing correct code?
I do not equate "correct" with "easy", only that there tends to be a relationship.
If it's harder to write correct code, then it tends to be harder to write fast correct code, since all fast code must be correct code to be useful.
If it's easier to write correct code, it also tends to be easier to write fast correct code.
The point is, the less time you need to ensure your code is correct, the more time you can spend on optimization.
first, you step on the premise that correct smp multithreading is the easiest, most natural and, i get the feeling magically efficient form of concurrency.
I never said it was the most efficient, merely the easiest and most natural. This is simply because it is the most general form of multithreading -- you have N threads, they execute the same kind of opcodes, they share your entire address space, memory ordering is strict, and memory coherency between the threads is enforced.
Every multithreaded design pattern can be constructed from this basic foundation, which is why it's taught in schools and why it's the dominant form on general purpose CPUs.
Each special purpose optimization you add to this (weakly ordered memory model, NUMA, dropping coherency, asymmetry of the threads, private thread address spaces, etc) can improve peak performance, but at the cost of adding things to worry about.
well somebody calling the SPEs "DSPs" shouldnt be taken for serious anyways.
For shits and giggles read this (http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=3495) from 2001, and notice how the structure with one host processor and a bunch of DSP cores with their own SRAM and shared memory resembles the structure of CELL.
As for
I'll put my knowledge of computer architecture, system design, semiconductor physics, and VLSI design against yours any day of the week.
No sorry, i usually dont argue with college boys.
You're pure comedy gold, do you know that?
Involuntarily so, but pure gold nevertheless
Cheers
Gubbi
ihamoitc2005
26-Sep-2005, 00:30
The primary issue is maintaining the coherence for accesses to main memory from the SPE's and the complexity of it really doesn't change regardless if you are using a DMA copy engine or direct access.
No, the primary issue is limiting the negative effect of memory latency on real world performance. As the number of processing units increases, so does the marginal cost of direct access.
Its no surprise that theres 1.8MB of local store for the SPE and another 512kb for the dual-threaded PPE. That works out to 256kb/"thread". Plus its hooked up to very low latency XDR.
XeCPU has 6 threads sharing just 1MB L2, average of 183kb/thread leading to higher latency GDDR3 which sounds like cache-misses waiting to happen. I would be very surprised if anyone got decent performance using all the cores.
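For reference, the per-thread figures in this post can be checked with a few lines of Python. This is only arithmetic on the sizes quoted in the thread (seven SPEs with 256 KB of local store each, a 512 KB PPE L2, and a 1 MB XeCPU L2). The even split across hardware threads is an assumption of the comparison, not something either chip enforces.

```python
# All sizes in KB, taken from the figures quoted in the thread.

# Cell: 7 SPEs with 256 KB local store each, plus a 512 KB L2
# shared by the two PPE hardware threads.
cell_local_store = 7 * 256           # 1792 KB, i.e. 1.75 MB ("1.8MB")
cell_l2 = 512
cell_threads = 7 + 2                 # one thread per SPE, two on the PPE

# XeCPU: 1 MB L2 shared by 3 cores x 2 hardware threads.
xecpu_l2 = 1024
xecpu_threads = 3 * 2

cell_per_thread = (cell_local_store + cell_l2) / cell_threads
xecpu_per_thread = xecpu_l2 / xecpu_threads

print(f"Cell:  {cell_per_thread:.0f} KB per thread")
print(f"XeCPU: {xecpu_per_thread:.0f} KB per thread")
```

With a 1024 KB L2 the even split comes to roughly 171 KB per thread rather than 183 KB, which does not change the direction of the comparison.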
Its no surprise that theres 1.8MB of local store for the SPE and another 512kb for the dual-threaded PPE. That works out to 256kb/"thread". Plus its hooked up to very low latency XDR.
XeCPU has 6 threads sharing just 1MB L2, average of 183kb/thread leading to higher latency GDDR3 which sounds like cache-misses waiting to happen. I would be very surprised if anyone got decent performance using all the cores.
True, if each SPE is doing something completely independent.
However, since the L2 is shared among all three cores, for large, mostly read-only, datastructures you only have one copy, - in the L2. Whereas for the SPEs you'd need a copy in each local store.
Cheers
Gubbi
Acert93
26-Sep-2005, 00:54
XeCPU has 6 threads sharing just 1MB L2, average of 183kb/thread leading to higher latency GDDR3 which sounds like cache-misses waiting to happen. I would be very surprised if anyone got decent performance using all the cores.
Each thread won't necessarily be the same size. You will have small threads and large threads. e.g. with an SPE you get 256K. That is it. So if you have two "applets" -- one 128K and the other 384K, you are in trouble. So the 384K applet does not work and the 128K applet leaves unused memory.
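A small sketch of the point about fixed 256K partitions, using the two hypothetical applet sizes from the post. The helper names are made up for illustration, and the shared-pool case ignores everything else that competes for cache space.

```python
LOCAL_STORE_KB = 256   # per-SPE local store, from the post

def fits_in_local_store(working_set_kb):
    """A job only fits on an SPE if its code + data fit in that SPE's local store."""
    return working_set_kb <= LOCAL_STORE_KB

def wasted_in_local_store(working_set_kb):
    """Local store left unused by a job that does fit (None if it does not fit)."""
    if not fits_in_local_store(working_set_kb):
        return None
    return LOCAL_STORE_KB - working_set_kb

def fits_in_shared_pool(working_sets_kb, pool_kb):
    """In a shared pool only the total matters, not the size of each job."""
    return sum(working_sets_kb) <= pool_kb

applets = [128, 384]
for size in applets:
    print(size, fits_in_local_store(size), wasted_in_local_store(size))

# The same two applets against two local stores' worth of shared space (512 KB).
print(fits_in_shared_pool(applets, 2 * LOCAL_STORE_KB))
```

In practice an oversized job would be split into chunks and streamed through the local store, which is exactly the extra work being argued about here.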
With Xenon you have more flexibility in this regard. Further, cores can easily share and work on the same information. There is also a danger of equating threads with cores. It has been expressed by some developers that the second thread on a core will frequently be of the same nature.
The two philosophies are different. What I have noticed is that there are clear lines drawn... e.g. you stick your nose up at the idea of the Xenon getting decent performance, yet the same thing has been said about CELL.
As a developer told me, multithreading is hard work, and each processor offers its own twist on how to solve that problem. There will be areas where each excels and where each fails, and it won't be easy on either one.
The search feature returns some good information (although far too much to read in even a week and it is cluttered with a lot of trash). But needless to say some devs have spoken up and the CELL model does pose some hurdles and problems (especially with data it will work well with, if the PPE is running your OS and delegating to the SPEs that does not leave a lot of extra room to work with).
I think this thread shows that there are different opinions, and even more that each architecture will benefit certain chores and will favor different programmers. CELL is very nice for a PS2 dev who has had to work hard with EE. Xenon is similar to the model PCs have gone and has more research and will appeal to PC developers.
Ultimately it will come down to tools. Your AAA dev houses have the money, time, people, and tools to crack the case. The real issue is 98% of your developers are NOT AAA guys. They are good--and they make great games!--but they don't have the same advantages as the big guys with 250-400 member teams and 5 development studios to share information, code, and resources. Developers need tools to help them get the most out of BOTH platforms.
So whoever makes their platform most approachable to the most developers, and allows them to get the most performance out of their work, wins a major victory. OBVIOUSLY that answer won't be the same for every dev or every title. The tradeoff of work/power may be great in one title but insufficient in another.
So tools and architecture are both important factors, more so than "peak" performances. Which reminds me of a similar scenario on the PC. Similar architectures, but not the same, can return results that are contradictory to the theoretical performance of a chip. The below quote is a good example of this:
Another way to look at this comparison of flops is to look at integer add
latencies on the Pentium 4 vs. the Athlon 64. The Pentium 4 has two double
pumped ALUs, each capable of performing two add operations per clock, that's
a total of 4 add operations per clock; so we could say that a 3.8GHz Pentium
4 can perform 15.2 billion operations per second. The Athlon 64 has three
ALUs each capable of executing an add every clock; so a 2.8GHz Athlon 64
can perform 8.4 billion operations per second. By this silly console
marketing logic, the Pentium 4 would be almost twice as fast as the Athlon
64, and a multi-core Pentium 4 would be faster than a multi-core Athlon 64.
Any AnandTech reader should know that's hardly the case. No code is
composed entirely of add instructions, and even if it were, eventually the
Pentium 4 and Athlon 64 will have to go out to main memory for data, and
when they do, the Athlon 64 has a much lower latency access to memory than
the P4. In the end, despite what these horribly concocted numbers may lead
you to believe, they say absolutely nothing about performance. The exact
same situation exists with the CPUs of the next-generation consoles; don't
fall for it.
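The arithmetic in the quoted passage is easy to reproduce. The function name below is made up; the inputs are the figures from the quote.

```python
def peak_adds_per_second(alus, adds_per_alu_per_clock, clock_ghz):
    """Theoretical peak integer adds per second, in billions."""
    return alus * adds_per_alu_per_clock * clock_ghz

# Pentium 4: two double-pumped ALUs at 3.8 GHz.
p4 = peak_adds_per_second(alus=2, adds_per_alu_per_clock=2, clock_ghz=3.8)
# Athlon 64: three ALUs, one add per clock each, at 2.8 GHz.
a64 = peak_adds_per_second(alus=3, adds_per_alu_per_clock=1, clock_ghz=2.8)

print(f"Pentium 4: {p4:.1f} billion adds/s")
print(f"Athlon 64: {a64:.1f} billion adds/s")
print(f"ratio: {p4 / a64:.2f}x")
```

The peak number says nothing about how often the ALUs sit idle waiting on memory, which is the point the quote is making.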
ihamoitc2005
26-Sep-2005, 01:03
True, if each SPE is doing something completely independent.
However, since the L2 is shared among all three cores, for large, mostly read-only, datastructures you only have one copy, - in the L2. Whereas for the SPEs you'd need a copy in each local store.
Cheers
Gubbi
Except that all the SPEs can work in parallel or in sequence and can read from each other's local stores so redundant data isnt only unnecessary its silly.
Dunno. Large programs are broken into smaller procedures or code segments that make up your engine and these get pieced together to make the whole program. 256kb LS for data and code means your program isn't going to be totally massive, and I would guess much smaller than 256kb. Heck, 200kb of assembler isn't a pretty thought! You can achieve a lot in 32kb (whole 8 bit games even. Imagine how fast the original Elite could run when written for a SPE :shock: ) and I'd expect a process could be broken into manageable and efficient chunks. Seems more that good design is needed rather than mystical programming powers. And note SPE's don't need assembler so the point's moot anyway. Unless you're still developing for PS2! (I'm only debating this point based on your original statement about fine-tuned control over memory access and use of assembler.)
Of course large programs are broken into smaller chunks--this has been true for how many decades now? The problem is that given a large code base, no one wants to be writing assembler level code for all the little bits. Instead, you code first, then you profile, then you optimize, and it's during the optimization stage that you may prefer to drop down into assembler/C/machine hacks and want fine-grained control over memory.
When you code, you code for correctness first. Things that make that difficult end up jeopardizing both your profiling and optimization stages.
.Sis
ihamoitc2005
26-Sep-2005, 01:15
Each thread won't necessarily be the same size. You will have small threads and large threads. e.g. with an SPE you get 256K. That is it. So if you have two "applets" -- one 128K and the other 384K, you are in trouble. So the 384K applet does not work and the 128K applet leaves unused memory.
Using your method of comparison, if one thread consumed 384k, you have just 5 threads with 128k/thread on XeCPU remaining.
On the other hand, if one thread consumed 384k on CELL, there is still 1 thread with 128k and 7 threads with 256k.
expletive
26-Sep-2005, 01:19
Each thread won't necessarily be the same size. You will have small threads and large threads. e.g. with an SPE you get 256K. That is it. So if you have two "applets" -- one 128K and the other 384K, you are in trouble. So the 384K applet does not work and the 128K applet leaves unused memory.
With Xenon you have more flexibility in this regard. Further, cores can easily share and work on the same information. There is also a danger of equating threads with cores. It has been expressed by some developers that the second thread on a core will frequently be of the same nature.
The two philosophies are different. What I have noticed is that there are clear lines drawn... e.g. you stick your nose up at the idea of the Xenon getting decent performance, yet the same thing has been said about CELL.
Glad to see someone looking for the reason why it will work instead of the reason why it won't. Does everyone really think that MS and IBM couldn't figure out whether XeCPU would be a 'cache mess' before they went to final hardware (or even finished the initial design)?
From what I've read on this forum the whole 360 design seems pretty elegant and well thought-out. All the parts (and their connections) seem to be "just-right" for each other.
J
ihamoitc2005
26-Sep-2005, 01:28
From what I've read on this forum the whole 360 design seems pretty elegant and well thought-out. All the parts (and their connections) seem to be "just-right" for each other.
J
All consoles are designed to have as little extra capacity as possible. This is because of cost. But because it is closed box, developers dont need as much breathing room as on PCs. On the whole they are therefore much more elegant in design than PCs.
From what I've read on this forum the whole 360 design seems pretty elegant and well thought-out. All the parts (and their connections) seem to be "just-right" for each other.
J
Couldn't agree more--but I'd extend it to cover the CELL design as well. Different philosophy in implementation, but still very elegant IMO, just in a different way. I've never been terribly excited by console architecture design and this last generation seemed particularly boring. The next offerings by Sony and MS are very interesting.
.Sis
Except that all the SPEs can work in parallel or in sequence and can read from each other's local stores so redundant data isnt only unnecessary its silly.
They can DMA to and from other SPEs' local stores, into their own. Hence a copy and hence redundant.
Cheers
Gubbi
ihamoitc2005
26-Sep-2005, 01:36
Does everyone really think that MS and IBM couldn't figure out whether XeCPU would be a 'cache mess' before they went to final hardware (or even finished the initial design)?
Yes, this can happen because cost estimates can be wrong and many design decisions arent made only by engineers, but also accountants and marketers. Therefore last minute push to reduce cost can cause compromise in otherwise balanced design. Think of missing HD, originally standard, now $100 option. It is unfortunate but thats how it works.
aaronspink
26-Sep-2005, 01:47
Except that all the SPEs can work in parallel or in sequence and can read from each other's local stores so redundant data isnt only unnecessary its silly.
I don't believe that the SPEs can read each other's local stores. I believe that all access externally to/from a given local store must be done via the DMA copy engine. As such, the CELL has support for the copy engine to copy data from 1 SPE's local store to another SPE's local store.
Also any set of hardware contexts can operate in either parallel or sequential mode. This is pretty basic and not a new invention nor feature of cell.
Aaron Spink
speaking for myself inc.
blakjedi
26-Sep-2005, 01:50
Think of missing HD, originally standard, now $100 option.
Neither accurate nor relevant. HDD is not an engineer/design decision... building a chip is waaaaaaaay different. Also HDD was NEVER standard except on Xbox.
aaronspink
26-Sep-2005, 02:00
No, the primary issue is limiting the negative effect of memory latency on real world performance. As the number of processing units increases, so does the marginal cost of direct access.
The latency is actually going to be higher with the DMA engine.
Its no surprise that theres 1.8MB of local store for the SPE and another 512kb for the dual-threaded PPE. That works out to 256kb/"thread". Plus its hooked up to very low latency XDR.
The only reason there is so much is because it is statically partitioned and allocated which means that all the local store must be sized to cover the maximum amount of memory that will be needed by any given program. In a scenario where the LS was shared this would not be required and the LS would be smaller.
Also, I think your belief in the low latency of XDR is fairly misplaced. The latencies for XDR are roughly equivalent to the latencies for SDR, DDR, DDR2, and DDR3.
Aaron Spink
speaking for myself inc.
aaronspink
26-Sep-2005, 02:05
Using your method of comparison, if one thread consumed 384k, you have just 5 threads with 128k/thread on XeCPU remaining.
On the other hand, if one thread consumed 384k on CELL, there is still 1 thread with 128k and 7 threads with 256k.
To misquote the misquoter, fool me once shame on you, fool me twice, 384K won't fit into a 256kb local store.
We haven't even gotten into cases where multiple threads share, say, a 256 or 384 KB data structure, making the effective size of the cache: 384 * 6 + 1024 - 384 = ~3MB.
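Spelled out, the arithmetic in that last sentence looks like this. It follows the post's own accounting, in which a structure shared by six threads is counted once per thread; that accounting is the poster's, not a standard cache metric.

```python
# All sizes in KB, taken from the post.
shared_structure_kb = 384
threads = 6
l2_kb = 1024

# Each thread "sees" the shared structure, plus whatever L2 is left over.
effective_kb = shared_structure_kb * threads + l2_kb - shared_structure_kb

print(f"{effective_kb} KB = {effective_kb / 1024:.3f} MB")
```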
Aaron Spink
speaking for myself inc.
ihamoitc2005
26-Sep-2005, 02:20
They can DMA to and from other SPEs' local stores, into their own. Hence a copy and hence redundant.
Cheers
Gubbi
It cannot DMA to and from each other's local stores it has its own internal bus that unites all the local-stores at very high speeds.
Each SPE only uses DMA for local-stores to access main memory, not to access shared local-store. Each SPE has complete access to full local-store (1.8MB of coherent shared memory) and writes from any local-store to its register file via an internal bus at very fast speeds.
Furthermore, if the data is split into blocks and loaded in batches almost 1.8MB of local-store is available via the internal bus to all SPEs.
From link posted below:
This three-level organization of storage (register file, local store, main storage) -- with asynchronous DMA transfers between local store and main storage
Each SPE has full access to coherent shared memory, including the memory-mapped I/O space.
The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE's local store so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data. In several cases, this new approach to accessing memory has led to application performance exceeding that of conventional processors by almost two orders of magnitude, significantly more than anyone would expect from the peak performance ratio (about 10x) between the Cell Broadband Engine and conventional PC processors.
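The access pattern described in the last quoted paragraph, transferring the next block while computing on the previous one, is usually called double buffering. Below is a minimal sketch of that pattern in plain Python. It is not SPE code: the single worker thread stands in for the DMA controller, and `dma_get`, `process`, `MAIN_MEMORY` and `CHUNK` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

MAIN_MEMORY = list(range(1000))   # stand-in for data in main storage
CHUNK = 64                        # stand-in for one local-store buffer

def dma_get(offset):
    """Hypothetical 'DMA transfer': copy one chunk out of main memory."""
    return MAIN_MEMORY[offset:offset + CHUNK]

def process(chunk):
    """Hypothetical compute kernel working on data already transferred."""
    return sum(x * x for x in chunk)

def run_double_buffered():
    offsets = list(range(0, len(MAIN_MEMORY), CHUNK))   # the "DMA list"
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma_engine:
        pending = dma_engine.submit(dma_get, offsets[0])        # prefetch first chunk
        for next_offset in offsets[1:]:
            current = pending.result()                          # wait for the transfer
            pending = dma_engine.submit(dma_get, next_offset)   # start the next one
            total += process(current)                           # compute while it runs
        total += process(pending.result())                      # last chunk
    return total

result = run_double_buffered()
print(result == sum(x * x for x in MAIN_MEMORY))
```

The point of the structure is that the compute step never waits on a transfer that could have been started earlier; whether that hides the latency in practice depends on how the transfer time compares with the compute time per chunk.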
darkblu
26-Sep-2005, 02:34
I do not equate "correct" with "easy", only that there tends to be a relationship.
If it's harder to write correct code, then it tends to be harder to write fast correct code, since all fast code must be correct code to be useful.
If it's easier to write correct code, it also tends to be easier to write fast correct code.
ok. here lies the crux of our disagreement - you're saying that smp multithreading allows for 'easier correct' code, and i say that it allows for 'easier' code per se, nothing more nothing less. as in "it won't automatically rendezvous your threads, it won't magically un-race them, it won't do squat more 'bout your code's correctness than the rudimentary mem coherency".
The point is, the less time you need to ensure your code is correct, the more time you can spend on optimization.
sure. it's not clear though why you'd spend less time ensuring correctness of the smp code in comparison to 'cellular' code. a lame question for you: which would be potentially more problematic - priority inversion under xecpu or priority inversion under cell? when thinking about it take into account the number of processing elements, potential use of SMT, etc.
I never said it was the most efficient, merely the easiest and most natural. This is simply because it is the most general form of multithreading -- you have N threads, they execute the same kind of opcodes, they share your entire address space, memory ordering is strict, and memory coherency between the threads is enforced.
<sidenote>
if threads share the same address space then mem coherency _damn_better_ be enforced or you're in deep shit.
</sidenote>
Every multithreaded design pattern can be constructed from this basic foundation, which is why it's taught in schools and why it's the dominant form on general purpose CPUs.
i agree with your statement above up to "that's why it's taught in schools".
"as Charles Babbage's analytical engine was a turing-complete machine (at least as much as the desktop i'm writing this from) and turing machines are fundamental and taught at school, Babbage's analytical engine has been the dominant computer ever since."
see why?
Each special purpose optimization you add to this (weakly ordered memory model, NUMA, dropping coherency, asymmetry of the threads, private thread address spaces, etc) can improve peak performance, but at the cost of adding things to worry about.
and here we totally and utterly disagree. also you may want to give a wakeup call to all those high-performance NUMA architectures vendors about how wrong they are and what they cause to the developers.
darkblu
26-Sep-2005, 02:56
Ok, couple of things here, and i do appreciate your thoughtful response.
1. My original post was phrased in the form of a question:
"That said, parallelism with 3 identical cores and 6 identical threads should be a bit easier than a PPE and SPE design where each has different needs and potentially different roles shouldn't it?"
So i'm not trying to argue any point, just trying to understand the difference and benefits of each approach.
2. I guess i interpret what is being said by JC slightly differently. My interpretation is he's saying:
a. multithreaded programming is a pain in the ass
b. from an 'ease of use' standpoint, the design in the 360 is the best possible case for a developer to coax performance benefits out of multithreading
c. even on the best possible case, it's very difficult to realize real world benefits
In my mind, that still doesn't change the fact that regardless of how much absolute performance gain you can wring out of the 360 CPU, at the end of the day it's still easier to get that meager efficiency from the XeCPU than the Cell.
What does this mean? I don't really know, more relative efficiency on the 360 CPU? Shorter development times? Better games sooner in each console's lifecycle? No idea and only time will tell.
So in summary, even if the dev-friendly design of the XeCPU gets you nothing, it's easier to get nothing on the 360 than on the PS3. :D
J
re point b.
'ease of use' is a tarball in itself (in case you haven't noticed the argument we're having with aaaaa00). what can be said for sure, and would not contradict with what Carmack says either, is that 360's model is very close to the typical smp pc. that would imply ease of porting/translation of programming experience from that same pc domain.
of course, at the end of the day you may be absolutely right - the bulk of console devs may turn out to suck at getting anything decent out of the cell. what you're undoubtedly right about even at this very moment, though, is that time will tell : )
ihamoitc2005
26-Sep-2005, 03:05
To misquote the misquoter, fool me once shame on you, fool me twice, 384K won't fit into a 256kb local store.
I like the misquoting of the misquoter and the sentiment behind it but in this case its misguided.
384k fits into 512kb, which is the size of the L2 cache of the PPE. As I said earlier, that still leaves 128kb for another PPE thread and 1.8MB of local store to be shared by 7 SPEs.
We haven't even gotten into cases where multiple threads share, say, a 256 or 384 KB data structure, making the effective size of the cache: 384 * 6 + 1024 - 384 = ~3MB.
Aaron Spink
speaking for myself inc.
You are talking about cache blocking no? You are misunderestimating the unlikelihood of 6 threads sharing block as well as the cost of cache miss.
More like 1.5 threads per core to be honest.
Bobbler
26-Sep-2005, 03:48
Can't we all just agree that the Cell and XeCPU offer two fairly different answers to the same problem? Both should work well, especially in a closed system.
The problem with cache on the XeCPU shouldn't really pose itself to be a huge problem because of the nature of a closed system -- you know the limitations ahead of time and so you can plan around them. Same with the Cell. You program to the strengths, not weaknesses. Neither CPU has a perfect model for cache (or cache equivalents) either -- It wouldn't be hard to find ten problems with each CPU, but is that really important?
The closer to release of these consoles the less I seem to care about the specs (I realize some people still like to debate things and dig into specs with fine tooth combs, and that's fine -- but I think getting into the fine semantics of things and trying to say one is a patently better solution is a bit overzealous).
aaronspink
26-Sep-2005, 04:08
It cannot DMA to and from each other's local stores it has its own internal bus that unites all the local-stores at very high speeds.
Each SPE only uses DMA for local-stores to access main memory, not to access shared local-store. Each SPE has complete access to full local-store (1.8MB of coherent shared memory) and writes from any local-store to its register file via an internal bus at very fast speeds.
Furthermore, if the data is split into blocks and loaded in batches almost 1.8MB of local-store is available via the internal bus to all SPEs.
You are going to have to document your claim here. there is no documentation that I have seen that allows the SPEs to directly access the local store of other SPEs.
randycat99
26-Sep-2005, 04:08
Can't we all just agree that the Cell and XeCPU offer two fairly different answers to the same problem? Both should work well, especially in a closed system.
Quit being all moderate and even-handed, damn you! :p
aaronspink
26-Sep-2005, 04:10
I like the misquoting of the misquoter and the sentiment behind it but in this case its misguided.
384k fits into 512kb, which is the size of the L2 cache of the PPE. As I said earlier, that still leaves 128kb for another PPE thread and 1.8MB of local store to be shared by 7 SPEs.
But 384K won't fit within an SPE's local store.
You are talking about cache blocking no? You are misunderestimating the unlikelihood of 6 threads sharing block as well as the cost of cache miss.
I believe my assumptions are as perfectly valid as your assumptions if not more so.
Aaron Spink
speaking for myself inc.
Processes can be assigned to any free SPE at any given time dynamically. The SPEs were meant to process smaller chunks of data that are ordered by the PPU which has 2 threads.
Given data as large as 384k, it will most likely be assigned to the PPE. If ever there's a case whereby more than 256k needs to be stored and processed by the SPEs, it's possible that the SPEs can interact with each other via the internal high speed Element Interconnect Bus (EIB).
ihamoitc2005
26-Sep-2005, 04:48
but 384 won't fit within an SPEs local store.
Why should it fit in SPE local-store when it can fit in PPE L2 Cache as I described?
I believe my assumptions are as perfectly valid as your assumptions if not more so.
Everything I say is hardware capability with no complicated programming required, just SPE style "small" programming. No assumptions or hypothetical situations, all real hardware capability.
OTOH, 6 threads on 384kb cache block on just 1MB of cache is pure hypothetical and not realistic since cache-block x threads effectively 230% of physical cache. Even 256kb cache block very risky for 6 threads with only 1MB cache since 150% of physical memory is risky. Too much cache-block x threads over physical memory is disaster waiting to happen. Cache miss after cache miss will be result. Maybe if developer finds opportunity, 6 threads at 192kb possible for effective 2MB total on very rare occasions. Still pushing chances and not recommended if consistent frame-rate matters.
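The oversubscription ratios in this paragraph can be tabulated as below. The function name is made up, and the calculation assumes the scenario the post describes, where each of the six threads has its own private block; if the threads share one block, the footprint is far smaller, which is the other side of this argument.

```python
L2_KB = 1024
THREADS = 6

def oversubscription(block_kb, threads=THREADS, cache_kb=L2_KB):
    """Combined per-thread working sets as a fraction of the physical cache."""
    return block_kb * threads / cache_kb

for block_kb in (384, 256, 192):
    print(f"{block_kb} KB x {THREADS} threads = {block_kb * THREADS} KB, "
          f"ratio {oversubscription(block_kb):.3f}")
```

By this arithmetic the 384 KB case comes to 2.25 times the L2 (the post rounds to 230%) and the 256 KB case to 1.5 times.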
You are going to have to document your claim here. there is no documentation that I have seen that allows the SPEs to directly access the local store of other SPEs.
Yes there is. The process is called stream processing.
Why should it fit in SPE local-store when it can fit in PPE L2 Cache as I described?
Everything I say is hardware capability with no complicated programming required, just SPE style "small" programming. No assumptions or hypothetical situations, all real hardware capability.
OTOH, 6 threads on 384kb cache block on just 1MB of cache is pure hypothetical and not realistic since cache-block x threads effectively 230% of physical cache. Even 256kb cache block very risky for 6 threads with only 1MB cache since 150% of physical memory is risky. Too much cache-block x threads over physical memory is disaster waiting to happen. Cache miss after cache miss will be result. Maybe if developer finds opportunity, 6 threads at 192kb possible for effective 2MB total on very rare occasions. Still pushing chances and not recommended if consistent frame-rate matters.
I am having some slight difficulty understanding your Engrish. You speak Yoda English. Me know you from where.
ihamoitc2005
26-Sep-2005, 05:17
You are going to have to document your claim here. there is no documentation that I have seen that allows the SPEs to directly access the local store of other SPEs.
It's called the Element Interconnect Bus.
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf
aaaaa00
26-Sep-2005, 05:20
ok. here lies the crux of our disagreement - you're saying that smp multithreading allows for 'easier correct' code, and i say that it allows for 'easier' code per se, nothing more nothing less. as in "it won't automatically rendezvous your threads, it won't magically un-race them, it won't do squat more 'bout your code's correctness than the rudimentary mem coherency".
It adds fewer complications than something that requires developers to manage asynchronous DMAs, separate small address spaces, and code overlays on top of whatever you need to do to make multithreading work in the first place.
a lame question for you: which would be potentially more problematic - priority inversion under xecpu or priority inversion under cell? when thinking about it take into account the number of processing elements, potential use of SMT, etc.
Priority inversion occurs when a high priority thread is waiting on a resource locked by a low priority thread, but a third thread is monopolizing CPU resources, starving the low priority thread and preventing it from finishing what it's doing and releasing the lock. This causes the high priority thread to behave as if it was running at low priority, hence "priority inversion".
Priority inversion is only possible when you have a contended resource shared between a high priority thread and a low priority thread AND you are scheduling multiple software threads on one single threaded physical processor -- the OS scheduler must decide which of the threads it can dispatch onto the CPU and which not to.
On xbox 360, for the high performance parts of your engine you are typically going to use exactly as many threads as there are hardware threads, so all of the threads are really running at the same time on the hardware -- there's no software threading going on.
Hence, there should be little possibility for priority inversion because none of the threads should ever get into a CPU starved state. Likewise priority inversion on SPEs is most likely a non-issue.
However, if you start scheduling software threads on one core of the XCPU, then yes, priority inversion will be possible, but that's no different than the same occurring on the CELL PPE. You'd only do software scheduled threads on the xbox 360 for low performance areas of your engine anyway.
and here we totally and utterly disagree. also you may want to give a wakeup call to all those high-performance NUMA architectures vendors about how wrong they are and what they cause to the developers.
I did not say the tradeoffs were never worth it. In fact newer SMP architectures like Opteron and Itanium make various optimizations to improve performance and step away from the classical SMP architecture.
For example, on a multiprocessor Opteron you have NUMA, which forces you to think about which node your allocations are coming from because accessing a data structure on the opposite node will knock your performance down.
Another example is that on a multiprocessor Itanium the memory model is weakly ordered, which means the CPU allows writes to memory to complete out of order from other writes and reads. This means you have to insert memory fence instructions in the right places to ensure coherence before you do anything that involves more than one thread.
The point is, it is a fact that these optimizations do improve performance at the cost of making it more complicated for developers to write correct code.
From the point of view of a software developer, it is still my opinion that the best case scenario is a classical SMP and that classical SMP is the most general and straightforward form of multithreading.
The more you step away from the classical SMP architecture, the more things you have to think about when implementing your design, and at that point, it's all about the tradeoffs.
ihamoitc2005
26-Sep-2005, 05:29
I am having some slight difficulty understanding your Engrish. You speak Yoda English. Me know you from where.
I apologize if my Engrish aint perfect, but I try. Sometimes better than other times no? Regarding cache-blocking ... it is not good for sake of cache miss to have cache-block size x number of threads utilizing cache block exceed total physical memory. If physical memory exceeded, then real chance of too many threads cache miss and fetch from memory. Too many cache miss is very bad and then we must feel ashamed of our failure and hang our heads in shame. I think Aaron "misunderestimates" danger of cache miss and cycles lost. See my Engrish is as good as the president!
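For reference, the oversubscription arithmetic being argued over is easy to check. A minimal sketch, assuming a 1 MB shared L2 and threads whose cache blocks do not overlap at all (the worst case for this argument); the function name is made up for illustration:

```python
def oversubscription(block_kb, threads, cache_kb=1024):
    """Combined per-thread cache blocks as a fraction of total cache capacity.

    Assumes every thread has its own private block (no sharing between threads).
    """
    return block_kb * threads / cache_kb


for block_kb in (384, 256, 192):
    ratio = oversubscription(block_kb, threads=6)
    print(f"6 threads x {block_kb} KB = {block_kb * 6} KB, ratio {ratio} of a 1 MB cache")
```

Under that no-sharing assumption, 6 x 384 KB comes to 2.25 times the cache (roughly the "230%" quoted above) and 6 x 256 KB to 1.5 times. Whether the no-sharing assumption holds is exactly what the two sides disagree about.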
Brodda Thep
26-Sep-2005, 05:51
It's called the Element Interconnect Bus.
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf
I don't see anything in there that says that SPEs can directly access the Local Store of other SPEs.
The relevant quote in that document may be:
Similarly, another SPE can use the DMA controller to move data to an address range that is mapped onto a local store of another SPE or even to itself.
Which is, of course, not direct access.
I think the biggest problem is that SPEs do not mesh well with current popular programming paradigms. Ideally, you would want to load up a small function/executable/whatever into a small part of the local store then churn through a ton of data while issuing new DMAs. Perhaps asking for 96k at a time. Issue a request, work on the other 96k while waiting, and then swapping and issuing another 96k request. But then your memory needs to be put into sequential memory blocks. You certainly don't want the data you need to be working on intermixed with data that has no use in the current thread or spread all over the memory space.
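The "issue a request, work on the other 96k while waiting, then swap" pattern is ordinary double buffering. A minimal sketch in Python, where a one-worker thread pool stands in for the asynchronous DMA engine; `fetch`, `process` and `stream` are hypothetical names, and nothing here is actual Cell SDK code:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 96 * 1024  # bytes per transfer, the figure used in the post


def fetch(source, offset):
    """Stand-in for an asynchronous DMA get: copy one chunk out of 'main memory'."""
    return source[offset:offset + CHUNK]


def process(chunk):
    """Stand-in for the compute kernel; here it just sums the bytes."""
    return sum(chunk)


def stream(source):
    """Process 'source' chunk by chunk, prefetching the next chunk while working."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, source, 0)  # prime the first buffer
        offset = CHUNK
        while pending is not None:
            current = pending.result()  # wait for the in-flight transfer
            # Issue the next request before touching the current buffer.
            if offset < len(source):
                pending = dma.submit(fetch, source, offset)
            else:
                pending = None
            offset += CHUNK
            total += process(current)
    return total


data = bytes(range(256)) * 1024  # 256 KB of contiguous input
print(stream(data) == sum(data))
```

Note that the sketch only works because `data` is one contiguous block, which is the point being made about needing sequential memory layouts.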
I like using object oriented programming techniques myself. And the hoops I would have to go through to get my data in the proper format does not sound fun, but then I haven't done any graphics or physics work worth speaking about. Certainly you won't want to be using anything that has dynamic memory needs, but then I assume that is avoided anyways in the console space as allocating memory tends to be expensive.
At any rate, you won't be putting normal threads on the SPEs without losing a lot of performance. It sounds like you will need a significantly different approach to getting the most out of cell, luckily that approach should work well with XeCPU and PCs. But going the other way is really not an option.
aaronspink
26-Sep-2005, 06:03
Why should it fit in SPE local-store when it can fit in PPE L2 Cache as I described?
Well, then the SPEs are useless, no? The issue is that any given SPE can't work on a data structure that is larger than 256KB (in the real world it's actually much smaller: due to the setup overhead and latency of the DMA engine you'll likely be limited to a real data set of ~64KB at a time to allow upload and download of the previous data set).
OTOH, 6 threads on 384kb cache block on just 1MB of cache is pure hypothetical and not realistic since cache-block x threads effectively 230% of physical cache. Even 256kb cache block very risky for 6 threads with only 1MB cache since 150% of physical memory is risky. Too much cache-block x threads over physical memory is disaster waiting to happen. Cache miss after cache miss will be result. Maybe if developer finds opportunity, 6 threads at 192kb possible for effective 2MB total on very rare occasions. Still pushing chances and not recommended if consistent frame-rate matters.
You apparently don't understand the issues surrounding constructive interference. It is possible and for some data structures likely that a large number of the threads will be referencing it at the same time. In these cases that 256KB or 384KB dataset for instance is all being shared among the threads within the cache resulting in an effective cache size much greater than the actual cache size. This is real and does happen.
Aaron Spink
speaking for myself inc.
aaronspink
26-Sep-2005, 06:05
Yes there is. The process is called stream processing.
Which is all fine and good but the mechanisms which allow this on CELL require using the DMA copy engine to copy a portion of one SPE's local storage to the local storage of another SPE. It doesn't happen auto-magically, or at least there is nothing in the CELL documentation that would allow it to happen auto-magically.
The actual process involves aliasing a portion of one SPE's local store into the global address map and then initiating a DMA copy from another SPE into the mapped global address range, which is translated into the first SPE's local store.
Aaron Spink
speaking for myself inc.
aaronspink
26-Sep-2005, 06:07
It's called the Element Interconnect Bus.
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf
The EIB is merely a transport mechanism and does not facilitate the direct movement of data in 1 SPE to another. All movement into and out of the SPE's local store is handled via the DMA engine (Called MFC in the actual design).
It is readily apparent that you haven't read the actual tech documents and are instead relying on hearsay from analysts at disreputable firms.
Aaron Spink
speaking for myself inc.
ihamoitc2005
26-Sep-2005, 06:27
I don't see anything in there that says that SPEs can directly access the Local Store of other SPEs.
LS is shared memory and EIB connects SPEs to LS. EIB peak throughput 96B/cycle. Each SPE has 16B/cycle connect speed via EIB, same as PPU to L1 cache!
Look at diagram on page 2.
Maybe easier to understand if you read previous link as well which describes 3 level memory architecture.
It doesn't happen auto-magically, or at least there is nothing in the CELL documentation that would allow it to happen auto-magically.
Yes it does with the use of a simultaneous non realtime specialised OS in the background that manages the stream processing. IBM has this virtualisation technology incorporated into the Cell to do this.
aaronspink
26-Sep-2005, 06:51
LS is shared memory and EIB connects SPEs to LS. EIB peak throughput 96B/cycle. Each SPE has 16B/cycle connect speed via EIB, same as PPU to L1 cache!
Look at diagram on page 2.
Maybe easier to understand if you read previous link as well which describes 3 level memory architecture.
Maybe it would be easier to understand if you got a clue. The LS is NOT shared. That's what the friggin LOCAL part of the name LOCAL store means. All access into and out of the LS of a SPE must be done through the DMA copy engine/MFC.
As I said, read the damn architecture documents and come back with a clue.
Aaron Spink
speaking for myself inc.
aaronspink
26-Sep-2005, 06:52
Yes it does with the use of a simultaneous non realtime specialised OS in the background that manages the stream processing. IBM has this virtualisation technology incorporated into the Cell to do this.
Um, in a word, NO. It doesn't happen automagically, it has to be programmer controlled via the MFC/DMA copy engine.
Aaron Spink
speaking for myself inc.
darkblu
26-Sep-2005, 07:10
It adds fewer complications than something that requires developers to manage asynchronous DMAs, separate small address spaces, and code overlays on top of whatever you need to do to make multithreading work in the first place.
"overlapping address spaces with potential context violations, everything ever touched by more than one thread should be thread safe or you should be absolutely sure what you're doing, intra-thread caches constantly stepping on each other toes, etc.". see, everyone can nitpick just for the jist of it.
Priority inversion is only possible when you have a contended resource shared between a high priority thread and a low priority thread AND you are scheduling multiple software threads on one single threaded physical processor -- the OS scheduler must decide which of the threads it can dispatch onto the CPU and which not to.
actually, the part after the AND above is totally superfluous. priority inversion is each situation where a thread of priority N does prevent another thread of priority N + X from running by means of a third thread of priority N - Y, where there exists resource contention between the latter two threads and there's no such between the first thread and any of the other two (where N, X and Y are positives). therefore, the only two threads that should compete for the same cpu are those of priorities N - Y and N. the third one (N + X) may have a whole vacant cpu for itself - it doesn't matter, it still cannot run.
On xbox 360, for the high performance parts of your engine you are typically going to use exactly as many threads as there are hardware threads, so all of the threads are really running at the same time on the hardware -- there's no software threading going on.
sorry, i missed that - why? you have N threads for the high performance parts (N = num hw threads) and an arbitrary number of other 'non-high performance' threads - and you get software threading, just not among the high-performance threads, supposedly.
Hence, there should be little possibility for priority inversion because none of the threads should ever get into a CPU starved state.
after the slight corrections above i don't see why anymore.
not only thread (N) can starve (N - Y) on a scheduling basis, but also (N - Y) can be running and still its SMT 'roommate' can trash the former's cache so badly, that you get a brand new form of 'priority inversion' - one where (N + X) cannot run because a thread of arbitrary low priority (even lower than N - Y) to which the former has no contention relations whatsoever is cache-bullying (N + X)'s lock-keeper (N - Y).
Likewise priority inversion on SPEs is most likely a non-issue.
aside from the 'likewise', yep. as you would not do multithreading on a single SPE (they have a big red flashing sign over them 'not for multithreadin'), and you really don't _need_ to - there are plenty of them.
I did not say the tradeoffs were never worth it.
so i was under the wrong impression up until now : )
ok then, who decides which tradeoffs are and which are not worth it?
ihamoitc2005
26-Sep-2005, 07:50
Maybe it would be easier to understand if you got a clue. The LS is NOT shared. That's what the friggin LOCAL part of the name LOCAL store means. All access into and out of the LS of a SPE must be done through the DMA copy engine/MFC.
As I said, read the damn architecture documents and come back with a clue.
Aaron Spink
speaking for myself inc.
"Say what?" Have you not read about CELL stream processing where one SPE reads from another SPE's local store? Maybe instead of getting hot under your collar you should drink some iced tea and learn about how CELL actually works. It's not as complicated as you like to think.
TrungGap
26-Sep-2005, 08:14
"Say what?" Have you not read about CELL stream processing where one SPE reads from another SPE's local store? Maybe instead of getting hot under your collar you should drink some iced tea and learn about how CELL actually works. It's not as complicated as you like to think.
Aaron knows what he's talking about.
from http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf
Each of the eight SPEs has its own private local store, and the local-store memory is aliased to main memory but does not participate in a cache-coherency protocol. Software must manage the movement of data and instructions in and out of the LS and is controlled by the MFC. The LS has data-synchronization facilities but does not participate in hardware cache coherency. The eight local stores do have an alias in the memory map of the processor, and a PPE can load or store from a memory location that is mapped to the local store (but it’s not a high-performance option). Similarly, another SPE can use the DMA controller to move data to an address range that is mapped onto a local store of another SPE or even to itself.
SPE == SPU + LS + MFC
SPEs and PPE connect to EIB
Eh, just read up on MFC and you'll understand.
edit: fix link
Shifty Geezer
26-Sep-2005, 09:35
Each thread won't necessarily be the same size. You will have small threads and large threads. e.g. with an SPE you get 256K. That is it. So if you have two "applets" -- one 128K and the other 384K, you are in trouble. So the 384K applet does not work and the 128K applet leaves unused memory.
I imagine that's an unlikely scenario, like wanting to run a block 3 megs of executable on a processor with 1 MB cache. The crux of this issue is how large is an appulet? At the end of the day an appulet will be written to fit into the LS along with its data, so you won't ever have a developer trying to squeeze a pint of code into a half-pint store. If necessary they'll have to divide the process into two smaller appulets and maybe run them across two SPEs sharing data.
Shifty Geezer
26-Sep-2005, 09:41
The only reason there is so much is because it is statically partitioned and allocated which means that all the local store must be sized to cover the maximum amount of memory that will be needed by any given program. In a scenario where the LS was shared this would not be required and the LS would be smaller.
And in a scenario where the LS was shared you'd be dividing its access rate across 7 processors, would you not? Where an L2 cache is a store, LS is a working space, I and D cache combined. Having six processors waiting for a seventh to finish working on the LS before they can access it doesn't sound very efficient!
Also, I think your belief in the low latency of XDR is fairly misplaced. The latencies for XDR are roughly equivalent to the latencies for SDR, DDR, DDR2, and DDR3.
Yes, I keep hearing about XDR's low latencies but no-one offers actual numbers. However I glean the reason it's classed as lower latency is because it's clocked so much higher. But I don't really know. It's one of the console hardware myths.
Shifty Geezer
26-Sep-2005, 09:42
(I'm only debating this point based on your original statement about fine-tuned control over memory access and use of assembler.)
Of course large programs are broken into smaller chunks--this has been true for how many decades now? The problem is that given a large code base, no one wants to be writing assembler level code for all the little bits. Instead, you code first, then you profile, then you optimize, and it's during the optimization stage that you may prefer to drop down into assembler/C/machine hacks and want fine-grained control over memory.
When you code, you code for correctness first. Things that make that difficult end up jeopardizing both your profiling and optimization stages.
.Sis
Which is why I guess no-one writes in assembler anymore, and there's no need to anyway except on tiny little code snippets!
Aaron is completely spot on. He knows EXACTLY what he's talking about.
SPEs have 256K of addressable memory (via 32 bit pointer). They have an MFC which can DMA memory from a larger external pool (via 64 bit pointer) into the local memory. It just so happens that the virtual address space includes each SPE's local memory, so that they can DMA to/from each other's LS.
The MFCs are clever enough to take the shortest path when you do this, so it's fast BUT apart from speed it's exactly the same as accessing main memory. It's a DMA and most importantly you have to copy it into local memory first before use (so you can never access more than 256K (including code) at any one time).
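A toy model of the mechanism described above may help: each local store is private, but it also appears at an alias in one global address map, and the only way to get at data is a DMA copy into your own local store. This is a sketch in Python under made-up assumptions; the base address, memory sizes, and every name in it (`Machine`, `ls_alias`, `dma_get`) are hypothetical and are not taken from the Cell documentation:

```python
LS_SIZE = 256 * 1024       # bytes of local store per SPE
LS_BASE = 0x1_0000_0000    # hypothetical base of the aliased local-store window
MAIN_SIZE = 1024 * 1024    # toy main memory, kept small for the sketch


class Machine:
    def __init__(self, n_spes=8):
        self.main = bytearray(MAIN_SIZE)
        self.local = [bytearray(LS_SIZE) for _ in range(n_spes)]

    def ls_alias(self, spe, ls_offset=0):
        """Global address at which the local store of SPE 'spe' appears."""
        return LS_BASE + spe * LS_SIZE + ls_offset

    def _resolve(self, addr, size):
        """Translate a global address into (backing buffer, offset)."""
        if addr >= LS_BASE:
            spe, off = divmod(addr - LS_BASE, LS_SIZE)
            buf = self.local[spe]
        else:
            buf, off = self.main, addr
        if off + size > len(buf):
            raise ValueError("transfer crosses the end of its memory region")
        return buf, off

    def dma_get(self, spe, ls_offset, global_addr, size):
        """Copy 'size' bytes from a global address into the local store of 'spe'."""
        src, off = self._resolve(global_addr, size)
        if ls_offset + size > LS_SIZE:
            raise ValueError("does not fit in local store")
        self.local[spe][ls_offset:ls_offset + size] = src[off:off + size]


m = Machine()
m.local[0][0:5] = b"hello"                 # data sitting in SPE 0's local store
m.dma_get(spe=1, ls_offset=0, global_addr=m.ls_alias(0), size=5)
print(bytes(m.local[1][:5]))               # SPE 1 now works on its own copy
```

In this model SPE 1 never dereferences SPE 0's memory; it only ever reads the copy that landed in its own 256K, which is the distinction between "DMA to/from each other's LS" and "direct access".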
Acert93
26-Sep-2005, 10:06
Using your method of comparison, if one thread consumed 384k, you have just 5 threads with 128k/thread on XeCPU remaining..
On the other hand, if one thread consumed 384k on CELL, there is still 1 thread with 128k and 7 threads with 256k. Aaron pointed out my point pretty good. My method was only to show your comparison is not an equivalent one. You are trying to frame Xenon within the framework of CELL. They both require a different model of approach.
As for your exception, I was pretty clearly talking about the SPEs (as Aaron noted) so your what if does not fit my example very well. I could just counter that your SPE and PPE code are not necessarily interchangeable (not to mention the PPE is going to be doing a lot of stuff related to the OS and delegating tasks to the SPEs, so consuming 384k on one intensive task could be counterproductive).
The point is you are going to have to make your code for the SPEs fit within the 256K block and if your code is only 200K the left overs cannot be realistically counted as "extra cache in the system".
They are different models of use.
Really, CELL and Xenon are different designs requiring a different approach. At this point neither has shown itself to be better or more efficient. Xenon's model is the PC route and has more research behind it and has shown to have some very large hurdles; CELL is a new approach to the problem. Obviously a lot of the batching and queue-related ideas could work on Xenon as well. The difference is it has fewer cores, but they are more "flexible" cores so you don't have as many cores to use with such a model. The initial leak and patents indicate MS is aimed more at a model where a single GPU is dedicated to graphics (renderer, procedural work) and taking the rest from there. On the reverse the method for Xenon is not favorable to the SPEs. You would not want to take the PPE code and run it on a SPE.
So one method won't necessarily work for the other. They are different designs with different needs.
So I don't understand this fixation of denoting how one won't work on a given context. Games that are aimed at exposing the CPUs and maximizing performance are not going to be easily ported because the CELL model and the Xenon model of approach to reach those goals is in conflict.
CELL is a bigger chip (50% bigger) and has a higher peak in floating point, so it should really excel there. You would expect a significantly larger chip to perform better on average.
I imagine that's an unlikely scenario, like wanting to run a block 3 megs of executable on a processor with 1 MB cache. The crux of this issue is how large is an appulet? At the end of the day an appulet will be written to fit into the LS along with its data, so you won't ever have a developer trying to squeeze a pint of code into a half-pint store. If necessary they'll have to divide the process into two smaller appulets and maybe run them across two SPEs sharing data.
Yep. Different needs, different architecture, different approaches. There are solutions for these problems on BOTH models. They are just different. Because one solution does not work on one platform does not mean it is "the suck".
We all have our thinking caps on how to make CELL work, which is good. But this same type of thinking goes into any project regardless of the CPU being used. CELL is just more exciting because it is new, powerful, and with so many cores you can really try some new things. That is a good thing. But that does not mean the model Intel/AMD are using is useless either (not that you are arguing that). IBM thought it was a good enough approach to come to Sony with it first and eventually used this approach for MS.
And I am sure we would be looking at it different if a Pentium D or X2's 2nd core was aimed more at floating point performance. Obviously they did not and games can really use the extra FP performance, so CELL's performance in this area is very exciting!
Acert93
26-Sep-2005, 10:23
Can't we all just agree that the Cell and XeCPU offer two fairly different answers to the same problem? Both should work well, especially in a closed system.
That is what I am saying!
You program to the strengths, not weaknesses. Neither CPU has a perfect model for cache (or cache equivalents) either -- It wouldn't be hard to find ten problems with each CPU, but is that really important? Please, by all means, make yourself at home in the console forum. ;)
We haven't even gotten into cases where multiple threads share, say, a 256 or 384 KB data structure, making the effective size of the cache: 384 * 6 + 1024 - 384 = ~3MB.
Good example. But this analogy does not work if you are counting cache and local store with the goal to get a "total amount" end product. That just does not jive with the Xenon model (large shared resource among a couple cores vs. statically partitioned resources across many cores).
It's kind of like counting peak performance. Yeah, it looks good on paper... but how does that play out in real life? The only fair measurement is to do so in context of the programming model it will employ. In this case CELL and Xenon are different, so trying to cram one approach down the throat of the other and pointing out deficiencies does not really respect their differences.
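The "effective cache size" figure quoted here is simple bookkeeping: the shared structure is counted once per thread that uses it, and the rest of the cache is added on top. A minimal sketch of that accounting, assuming a 1 MB cache and six hardware threads as in the quote; the function name is made up for illustration:

```python
def effective_cache_kb(shared_kb, threads, cache_kb=1024):
    """Shared block counted once per thread, plus the cache left over for other data."""
    if shared_kb > cache_kb:
        raise ValueError("shared block does not fit in the cache")
    return shared_kb * threads + (cache_kb - shared_kb)


for shared_kb in (256, 384):
    kb = effective_cache_kb(shared_kb, threads=6)
    print(f"{shared_kb} KB shared by 6 threads: {kb} KB effective ({kb / 1024} MB)")
```

For the 384 KB case this gives 2944 KB, which is the "~3MB" in the quote. It is an accounting of what each thread can reach without a miss, not a claim that the cache physically holds that much.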
What you do on Xenon is not something you will do on CELL, and vice versa.
Well the topic of conversation, from what I observed in this forum, went from:
Which was more powerful:
RSX vs Xenos to Cell vs Xenon
Now it's which is easier to program for? Cell vs Xenon.
The Cell's SPEs have their fixed 256kb cache limitation but there are goods and bads to it. The bad thing is programmers would have to write code that exactly makes use of the fixed amount of cache when doing subroutines. OTOH if you're going to specifically write specialised code for each individual SPE it's easier to troubleshoot problems and see your code in a neat manner.
The Cell and Xenon are not gaps apart from each other when it comes to programming. For every problem that exists for each CPU there are solutions to overcome them. But the performance crown would still have to go to the Cell. No denying that.
dantruon
26-Sep-2005, 11:27
Well the topic of conversation, from what I observed in this forum, went from:
Which was more powerful:
RSX vs Xenos to Cell vs Xenon
Now it's which is easier to program for? Cell vs Xenon.
The Cell's SPEs have their fixed 256kb cache limitation but there are goods and bads to it. The bad thing is programmers would have to write code that exactly makes use of the fixed amount of cache when doing subroutines. OTOH if you're going to specifically write specialised code for each individual SPE it's easier to troubleshoot problems and see your code in a neat manner.
The Cell and Xenon are not gaps apart from each other when it comes to programming. For every problem that exists for each CPU there are solutions to overcome them. But the performance crown would still have to go to the Cell. No denying that.
well the topic of conversation for this thread is Itagaki, so why on earth they talk about Cell and Xenon
Shifty Geezer
26-Sep-2005, 11:34
Because Itagaki said PS3 was too complicated, so we debate the validity of that statement ;)
well the topic of conversation for this thread is Itagaki, so why on earth they talk about Cell and Xenon
Why? You see, there are some people who take every word that comes out of their favourite game developer as though it were a revelation. In this case Itagaki commented that he preferred the Xbox 360 because it was less complex than the PS3. That sparked off debates that took each and every component of the 2 consoles that they could think of to argue. Me think this is hidden ******ism among some people.
london-boy
26-Sep-2005, 11:39
well the topic of conversation for this thread is Itagaki, so why on earth they talk about Cell and Xenon
Well you know, Cell happens to be the CPU in the PS3, and this Itagaki is talking about how hard PS3 is supposed to be to code for, compared to X360... Do you expect people to discuss the difficulties of coding for PS3 without talking about Cell, and how it compares to Xenon, given that's what the first post was about? :???:
It will only get 'better'... ;)
Itagaki Hardcore Pt. 2 - Itagaki talks Xbox 360, PS3, Revolution, Metal Gear Solid 4 and more.
http://teamninja.1up.com/
london-boy
26-Sep-2005, 11:42
The guy should spend more time making good games (or try anyway), than talking about the competition's work. My opinion obviously...
Acert93
26-Sep-2005, 11:47
The guy should spend more time making good games (or try anyway), than talking about the competition's work. My opinion obviously...
He is a chatty thing, ain't he! He is kind of creepy IMO. He looks like the cross between a rockstar and grunge star. Obviously from a hardcore gamer perspective DoA seems to fall short of Tekken, SC, etc (and yet is a commercial success). But from the reviews it seems Ninja Gaiden is a good game. Maybe he is chatting on DoA dev time :lol:
london-boy
26-Sep-2005, 12:00
He is a chatty thing, ain't he! He is kind of creepy IMO. He looks like the cross between a rockstar and grunge star. Obviously from a hardcore gamer perspective DoA seems to fall short of Tekken, SC, etc (and yet is a commercial success). But from the reviews it seems Ninja Gaiden is a good game. Maybe he is chatting on DoA dev time :lol:
NG is definitely a great game.
I just don't think it's professional to call for press conferences explicitly to talk about the competitor's work. He should stick to what he does. I don't see Bungie calling for conferences "About HL2 and PC gaming". Or Kojima talking about Splinter Cell. They might mention it, if it happens, but they surely don't go all out commenting at great lengths about their competition.
I just don't think it's professional to call for press conferences explicitly to talk about the competitor's work. He should stick to what he does.
Seconded. That's what I don't like about Sony-execs too, and why I think Peter Moore is a nice guy.
Sorry for the OT. :)
Acert93
26-Sep-2005, 13:24
NG is definitely a great game.
I just don't think it's professional to call for press conferences explicitly to talk about the competitor's work. He should stick to what he does. I don't see Bungie calling for conferences "About HL2 and PC gaming". Or Kojima talking about Splinter Cell. They might mention it, if it happens, but they surely don't go all out commenting at great lengths about their competition.
/signed. The fact is there are really talented people in EVERY industry who are JERKS. Look at the forums. You have the same thing. Some drop dead accurate and intelligent posters who seem to, well... follow the path of Itagaki (or whatever his name is!)
Being a jerk does not change the quality of your product (being a nice guy does not equate to more sales, at least not games) but it does rub wrong.
Seconded. That's what I don't like about Sony-execs too, and why I think Peter Moore is a nice guy.
I would agree, but it does not matter. MS likes it, at least to a degree. It means Sony is taking MS seriously.
I don't even bother mentioning people and positions that are irrelevant. In business you mock your competitor because it gives a sense of, "Ewww their stuff must be really good to be so confident" or it says "Ewww they are afraid of this new product".
Consoles are an industry of hype. So many "middle of the road" games have done really well because of style, market position, advertising, and because gamers did not know better. The key is creating a perception and then fulfilling it.
It is a business, one that is very serious in dollars (huge stakes), but is also the same industry closely tied with the lame game magazines. That is changing as the industry matures and the audience expands and becomes older, but due to the nature of the industry you will get unprofessional stuff--only because it appeals to their audience.
Gen X does not want to hear, "Yeah, they have a nice product too, but here is how we are different" but instead they feed on, "They are the suxor version 2! Not even version 2, but suxors version 1.5! My grandma could do better!111"
He is a chatty thing, ain't he! He is kind of creepy IMO. He looks like the cross between a rockstar and grunge star. Obviously from a hardcore gamer perspective DoA seems to fall short of Tekken, SC, etc (and yet is a commercial success). But from the reviews it seems Ninja Gaiden is a good game. Maybe he is chatting on DoA dev time :lol:
He's batshit insane. Not only dresses like a rock star, but acts like one, and one on drugs at that. But his studio makes top notch games.
As for DOA falling short of Tekken: No way. SC - maybe and Virtua Fighter - maybe.
Cheers
Gubbi
london-boy
26-Sep-2005, 13:42
As for DOA falling short of Tekken: No way. SC - maybe and Virtua Fighter - maybe.
I guess taste has a lot to do with that. So it's almost pointless to discuss that. It's undeniable that Tekken5 has been given much higher scores than any DOA ever made, by pretty much every website on the net, for many different reasons. Same for SC and VF really.
But if you don't like Tekken, that's another story.
aaaaa00
26-Sep-2005, 14:10
"overlapping address spaces with potential context violations
If you want you can separate the address spaces with the MMU, then you have things called processes. :wink: On console platforms, they're not generally used, but all modern OSes on SMP architectures support that abstraction if you want.
everything ever touched by more than one thread should be thread safe or you should be absolutely sure what you're doing
This is true of SPEs too. I'm still not clear how a DMA properly synchronizes SPE execution to the PPE. If the PPE wants a job executed, how does it signal to the SPE it wants the job to be done? If the SPE is already busy, how does the PPE add it to the SPE's job queue? How does it get a signal back from the SPE that the job is complete? How does it know where to collect the results from?
All of these operations require two threads to touch some sort of data structure, hence there has to be some sort of locking protecting those data structures, right?
actually, the part after the AND above is totally superfluous.
Consider the case in which you have 3 threads and 1 CPU.
The high priority thread is waiting on a resource held by the low priority thread. There is a medium priority thread consuming all cycles on CPU0. Every timeslice, the OS will examine the ready to run threads, notice that the medium priority thread is ready to run, and dispatch it. Because the low priority thread is never given a timeslice, it will continue to hold the lock, blocking the high priority thread from ever completing.
This is the priority inversion problem.
(Most OSes do some sort of priority inversion detection and/or boost the priority of CPU starved threads to help reduce the impact of priority inversion.)
Now consider the case in which you have 3 threads and 2 CPUs.
The high priority thread is waiting on a resource held by the low priority thread. There is a medium priority thread consuming all cycles on CPU0. Every timeslice, the OS will examine the ready to run threads, notice that the medium priority thread is ready to run, and dispatch it onto CPU0. Then it will see that the low priority thread is ready to run, and dispatch it onto CPU1. Because the low priority thread is given a timeslice, it will run, and will eventually release the lock, unblocking the high priority thread.
There is NO priority inversion problem in this case.
Now consider the case in which you have 3 threads and 3 CPUs.
The thread running on CPU0 is blocked waiting on a resource held by the thread on CPU2. There is a thread consuming all cycles on CPU1. The thread on CPU2 will still execute to completion and release the lock, thus allowing the thread on CPU0 to acquire the resource and continue.
There is NO priority inversion problem in this case.
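The three cases above can be checked with a toy model. This is only a sketch under stated assumptions: a strict-priority scheduler with no boosting, a medium thread that never blocks, and a low thread that needs a made-up 3 timeslices of work before it releases the lock. The function name and the workload numbers are hypothetical, not taken from any real OS.

```python
# Toy strict-priority scheduler (assumption: no priority boosting at all).
PRIORITY = {"high": 3, "medium": 2, "low": 1}

def slices_until_high_unblocks(num_cpus, low_work=3, max_slices=1000):
    """Return the timeslice count after which 'low' releases the lock,
    or None if 'low' never gets enough CPU within max_slices."""
    remaining = low_work
    for t in range(max_slices):
        # 'high' is blocked on the lock, so only these two are ready to run.
        ready = sorted(["medium", "low"], key=PRIORITY.get, reverse=True)
        running = ready[:num_cpus]      # one thread dispatched per CPU
        if "low" in running:
            remaining -= 1
            if remaining == 0:
                return t + 1            # lock released, 'high' can proceed
    return None

one_cpu = slices_until_high_unblocks(1)     # medium hogs the only CPU
two_cpus = slices_until_high_unblocks(2)    # low runs on the spare CPU
three_cpus = slices_until_high_unblocks(3)
```

With one CPU the low thread is never dispatched, so the function gives up; with two or more it finishes its work and the lock is released.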
If you have exactly 6 threads assigned to 6 hardware contexts which are executing concurrently, you will never get a priority inversion problem -- a thread that holds a resource will continue to execute until it releases the resource, allowing the other thread to acquire the resource. There is no starvation precisely because the OS scheduler is not involved and there is no prioritization of threads going on, they're all running concurrently.
sorry, i missed that - why? you have N threads for the high performance parts (N = num hw threads) and an arbitrary number of other 'non-high performance' threads - and you get software threading, just not among the high-performance threads, supposedly.
It's stupid to blow 1000s of cycles on a software context switch for a high performance thread. For high performance threads you want to dedicate a hardware thread to it. For low performance threads, just pack them all onto one hardware thread to avoid scheduling them with your high performance threads and messing them up.
You can create 5 threads, then use SetThreadAffinityMask() to lock them to the 5 hardware threads that the system provides. From then on, those 5 threads will always execute on those 5 hardware threads and the system will never move them to other cores or cause them to preempt each other in software. Then you create all your low performance threads, lock them to the remaining hardware thread, and those get software threaded automatically by the OS scheduler.
You can vary this ratio or arrangement however you like. You can create 3 threads and lock them to the 3 cores and not use SMT or software scheduled threads at all. 4 threads, two on one, and one each on the remaining. Or 10 threads, 5 on the hardware threads, and the remaining 5 software scheduled on the last hardware thread. You have the flexibility to set it up however you want.
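As a rough sketch of the "5 pinned, the rest packed onto the last hardware thread" arrangement: the post names the Win32 call SetThreadAffinityMask(); the code below uses Python instead, and `plan_affinity`, `pin_current_thread` and the thread names are hypothetical. The pinning helper relies on `os.sched_setaffinity`, which is only available on Linux, so it is guarded and not called here.

```python
import os

def plan_affinity(hw_threads, fast, slow):
    """Map each thread name to the set of hardware threads it may run on.
    Fast threads each get a hardware thread to themselves; slow threads
    share whatever hardware threads are left over."""
    if fast > hw_threads or (slow and fast >= hw_threads):
        raise ValueError("not enough hardware threads for this arrangement")
    plan = {f"fast{i}": {i} for i in range(fast)}
    shared = set(range(fast, hw_threads))
    for i in range(slow):
        plan[f"slow{i}"] = set(shared)   # software-scheduled among themselves
    return plan

def pin_current_thread(cpus):
    """Linux-only: restrict the calling thread to the given CPU set."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)

# The "10 threads" arrangement from the post, assuming 6 hardware threads.
plan = plan_affinity(hw_threads=6, fast=5, slow=5)
```

Each worker would call `pin_current_thread(plan[name])` at startup; the plan itself is just data, so the ratio can be changed without touching the workers.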
after the slight corrections above i don't see why anymore.
not only thread (N) can starve (N - Y) on a scheduling basis, but also (N - Y) can be running and still its SMT 'roommate' can trash the former's cache so badly, that you get a brand new form of 'priority inversion' - one where (N + X) cannot run because a thread of arbitrary low priority (even lower than N - Y) to which the former has no contention relations whatsoever is cache-bullying (N + X)'s lock-keeper (N - Y).
Regarding cache line evictions, remember, going out to main memory is going to cost 100s of CPU cycles. So the first cache eviction that thread A hits will cause it to block, and allow B to execute -- remember it will be 100s of CPU cycles before the data comes back from memory and actually kicks out the cache line that B needs -- even if it does, which isn't likely -- because the cache will probably choose a colder cache line to evict than the one that B just spent a hundred CPU cycles working with.
In any case, cache eviction is a performance issue, not a correctness issue. You don't have to worry about this when you're just trying to get your multithreaded code to work -- it will run just fine, but slowly.
You can always go back later and clean it up and optimize it and rearrange your data structures, add the prefetching, and cache locking and whatever, whereas with stuff like DMAs and LS and asymmetric threads, you have to be upfront and all of it has to be right and working before your program can start doing any useful work.
Which is why I guess no-one writes in assembler anymore, and there's no need to anyway except on tiny little code snippets!
:| But that's what I originally said... and the way you determine when to write these code snippets is through some performance-profiled directed effort. The rest of the time you want the system and coding to be as easy to use as possible.
.Sis
He's batshit insane. Not only dresses like a rock star, but acts like one, and one on drugs at that. But his studio makes top notch games.
As for DOA falling short of Tekken: No way. SC - maybe and Virtua Fighter - maybe.
Cheers
Gubbi
Wow. When did this turn into a character assassination thread? Maybe I missed a report or something. Do you have links to back up this claim of drug use?
Itagaki's arrogance has, to me, always come across as an act. He seems to enjoy having fun in interviews.
.Sis
london-boy
26-Sep-2005, 15:37
Wow. When did this turn into a character assassination thread? Maybe I missed a report or something. Do you have links to back up this claim of drug use?
Itagaki's arrogance has, to me, always come across as an act. He seems to enjoy having fun in interviews.
.Sis
He said he looks and acts like a rockstar on drugs, not that the guy's taking drugs!!
He said he looks and acts like a rockstar on drugs, not that the guy's taking drugs!!
:lol: My bad, Gubbi! I need to parse the grammar a little better, methinks.
.Sis
Acert93
26-Sep-2005, 15:41
Wow. When did this turn into a character assassination thread? Maybe I missed a report or something. Do you have links to back up this claim of drug use?
Gubbi said he looks and acts like a rock star on drugs--not that he used drugs.
While not flattering comments, I must say when I saw him he looked like a low class rock star. Growing up in the Seattle area during the "Grunge Scene" (although I have never been one much for music at all) the way he dresses does remind me of the rock stars strung out on heroin. It may all be an act as he tries to portray himself as a rockstar (which is his goal as I have heard) but he does look, well, like a rockstar (and many of those are big druggies).
He is eccentric. Gubbi is just commenting on that. And I must say the look he is aiming for made me think of the same thing. But I think we all know just because you LOOK a part does not mean it is so. There are typical Joes who are big crack users and there are rock stars who are all organic and would never think of drug use. But there is a stigma due to the popularity of certain things in different circles.
Shifty Geezer
26-Sep-2005, 15:44
:| But that's what I originally said... and the way you determine when to write these code snippets is through some performance-profiled directed effort. The rest of the time you want the system and coding to be as easy to use as possible.
.Sis
The topic of assembler only came up in reference to Cell SPE programming and low-level data management. You don't NEED to write in assembler for the low-level data management of Cell though. You can write in C/C++. And maybe other languages would be well suited to SPEs, like Forth?
blakjedi
26-Sep-2005, 16:13
To be fair, it seems as if folks are seeking out TI. He's not calling them like "Hey 1UP! Do I have a story for you!" :???:
Good info in this thread. Learned a lot.
Just for the record: I love the guy, very colourful.
Cheers
Gubbi
The topic of assembler only came up in reference to Cell SPE programming and low-level data management. You don't NEED to write in assembler for the low-level data management of Cell though. You can write in C/C++. And maybe other languages would be well suited to SPEs, like Forth?
Forth would probably be a particularly bad choice since it centers around a stack. False dependencies on the TOS would choke the in-order cores of next-gen consoles.
C/C++ is as good as any other IMO. Maybe with some CSP intrinsics (or a good class library).
Cheers
Gubbi
darkblu
26-Sep-2005, 17:02
If you want you can separate the address spaces with the MMU, then you have things called processes. :wink: On console platforms, they're not generally used, but all modern OSes on SMP architectures support that abstraction if you want.
set up separate address spaces and then use IPC? sorry, but you just blew a chunk of the potential advantage smp had over the cellular design - the oh-so-precious total memory coherency. congratulations.
This is true of SPEs too. I'm still not clear how a DMA properly synchronizes SPE execution to the PPE. If the PPE wants a job executed, how does it signal to the SPE it wants the job to be done? If the SPE is already busy, how does the PPE add it to the SPE's job queue? How does it get a signal back from the SPE that the job is complete? How does it know where to collect the results from?
i'm not aware of the actual implementation either but i think we can safely assume hardware-run queues and interrupt notifications, can't we?
All of these operations require two threads to touch some sort of data structure, hence there has to be some sort of locking protecting those data structures, right?
yes, with locking most likely between _two_ threads running on the PPE. as opposed to many more on the xecpu.
Consider the case in which you have 3 threads and 1 CPU.
The high priority thread is waiting on a resource held by the low priority thread. There is a medium priority thread consuming all cycles on CPU0. Every timeslice, the OS will examine the ready to run threads, notice that the medium priority thread is ready to run, and dispatch it. Because the low priority thread is never given a timeslice, it will continue to hold the lock, blocking the high priority thread from ever completing.
This is the priority inversion problem.
thanks for the lecturing. now re-consider again what i wrote to you in my previous post and try to comprehend it (because if you had done this the first time you wouldn't have written the above).
Now consider the case in which you have 3 threads and 2 CPUs.
The high priority thread is waiting on a resource held by the low priority thread. There is a medium priority thread consuming all cycles on CPU0. Every timeslice, the OS will examine the ready to run threads, notice that the medium priority thread is ready to run, and dispatch it onto CPU0. Then it will see that the low priority thread is ready to run, and dispatch it onto CPU1.
says who? who said the lower priority thread was cpu-agnostic? you assume too much. how about cpu affinities? or you conveniently exclude them out of the scheme?
Because the low priority thread is given a timeslice, it will run, and will eventually release the lock, unblocking the high priority thread.
no. the only condition stated in the problem was that the middle priority thread competes for cpu with the lower priority one, you cannot assume there's conveniently a spare processor where you can run the latter one. please try to look more seriously at this problem.
If you have exactly 6 threads assigned to 6 hardware contexts which are executing concurrently, you will never get a priority inversion problem -- a thread that holds a resource will continue to execute until it releases the resource, allowing the other thread to acquire the resource. There is no starvation precisely because the OS scheduler is not involved and there is no prioritization of threads going on, they're all running concurrently.
yes, in the best possible theoretical setup. in practice, though, those 6 threads will have _some_ contention somewhere among themselves, thus some of them will be blocked now and then, thus you may want to get _some_ lower-priority work done on those hw contexts during that time (only if you care about good cpu utilization, of course), thus you will get the os scheduler involved, thus we get back to the place where we started from with this problem.
It's stupid to blow 1000s of cycles on a software context switch for a high performance thread. For high performance threads you want to dedicate a hardware thread to it. For low performance threads, just pack them all onto one hardware thread to avoid scheduling them with your high performance threads and messing them up.
alright. so it turns out you did not totally neglect the thread affinities. that's actually very good. now you can get back to the priority inversion problem and re-consider it.
You can create 5 threads, then use SetThreadAffinityMask() to lock them to the 5 hardware threads that the system provides. From then on, those 5 threads will always execute on those 5 hardware threads and the system will never move them to other cores or cause them to preempt each other in software. Then you create all your low performance threads, lock them to the remaining hardware thread, and those get software threaded automatically by the OS scheduler.
well, good luck with parallelizing those 5 threads so that they never rendezvous. chances are they will, in which case you may want to utilize their hw context somehow.
You can vary this ratio or arrangement however you like. You can create 3 threads and lock them to the 3 cores and not use SMT or software scheduled threads at all. 4 threads, two on one, and one each on the remaining. Or 10 threads, 5 on the hardware threads, and the remaining 5 software scheduled on the last hardware thread. You have the flexibility to set it up however you want.
you have the flexibility, yes. our original argument was about 'ease' and 'correctness' and their relation, though. we never questioned the flexibility of smp multithreading - it's flexible, alright.
Regarding cache line evictions, remember, going out to main memory is going to cost 100s of CPU cycles. So the first cache eviction that thread A hits will cause it to block, and allow B to execute -- remember it will be 100s of CPU cycles before the data comes back from memory and actually kicks out the cache line that B needs -- even if it does, which isn't likely -- because the cache will probably choose a colder cache line to evict than the one that B just spent a hundred CPU cycles working with.
In any case, cache eviction is a performance issue, not a correctness issue. You don't have to worry about this when you're just trying to get your multithreaded code to work -- it will run just fine, but slowly.
sorry, i thought it was you who brought to this argument the many things you have to worry about with smp 'deviations', say, accessing memory locales under numa.
You can always go back later and clean it up and optimize it and rearrange your data structures, add the prefetching, and cache locking and whatever, whereas with stuff like DMAs and LS and asymmetric threads, you have to be upfront and all of it has to be right and working before your program can start doing any useful work.
aha. same with smp. try doing useful work with priority inversions ; )
aaaaa00
26-Sep-2005, 22:16
set up separate address spaces and then use IPC? sorry, but you just blew a chunk of the potential advantage smp had over the cellular design - the oh-so-precious total memory coherency. congratulations.
I've been arguing about having the flexibility to construct the same abstractions, not about performance. (Besides, you can have threads where a large piece of the address space is shared and there's a private piece of address space for the thread -- most OSes provide the notion of shared chunks of address space between processes.)
i'm not aware of the actual implementation either but i think we can safely assume hardware-run queues and interrupt notifications, can't we?
If it's implemented in hardware, then it's not as flexible or general. My whole point all along is that SMP is the most general and flexible form of multithreading, which is why it's the dominant form on general purpose CPUs.
You can construct the software equivalent of a "run queue" and "interrupt notification" from the building blocks present on an SMP and its OS.
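A minimal sketch of what such a software "run queue" and "interrupt notification" might look like, using ordinary threads. The `Job` class and `worker` function are hypothetical names for illustration; the queue does its own locking, the event stands in for the completion interrupt, and the result is stored on the job object so the submitter knows where to collect it.

```python
import queue
import threading

class Job:
    """One unit of work plus the place where its result will land."""
    def __init__(self, func, *args):
        self.func, self.args = func, args
        self.result = None
        self.done = threading.Event()    # software stand-in for an interrupt

def worker(run_queue):
    """Pull jobs until the shutdown sentinel (None) arrives."""
    while True:
        job = run_queue.get()            # blocks while the queue is empty
        if job is None:
            break
        job.result = job.func(*job.args)
        job.done.set()                   # notify whoever is waiting

run_queue = queue.Queue()                # thread-safe; locking is internal
t = threading.Thread(target=worker, args=(run_queue,))
t.start()

# The submitter can enqueue more jobs while the worker is still busy.
jobs = [Job(pow, 2, n) for n in range(4)]
for job in jobs:
    run_queue.put(job)
for job in jobs:
    job.done.wait()                      # wait for each completion signal
results = [job.result for job in jobs]

run_queue.put(None)                      # ask the worker to exit
t.join()
```

The only shared, lock-protected structure is the queue itself; the jobs never touch each other's data, which is the Cell-style restriction enforced here by convention rather than by hardware.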
The point is, any piece of code that you can come up with on a CELL architecture, I can probably convert to run on an SMP without nearly as much trouble as going in the reverse direction. Perhaps it won't run as efficiently or as fast, but it will run.
says who? who said the lower priority thread was cpu-agnostic? you assume too much. how about cpu affinities? or you conveniently exclude them out of the scheme?
If you explicitly lock the lower priority thread to CPU0, we can ignore CPU1 entirely and the problem reduces to the first example. If you don't, the OS scheduler will decide it's ok to put it on CPU1 at some point. This will release the priority inversion and let the high priority thread continue. Sure, it's not optimal, but it does resolve the priority inversion.
I also pointed out that most OSes have a priority inversion detection or mitigation system built into their scheduler, which will try to look for these and resolve them by boosting the priority of the low thread temporarily.
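One common form of that mitigation is priority inheritance: a lock holder temporarily runs at the priority of the highest-priority thread waiting on it. The sketch below is an illustration only, not how any particular OS implements it; `effective_priority` and the data layout are hypothetical, and it handles a single level of waiting, not chains of locks.

```python
def effective_priority(thread, base, waiting_on, holder):
    """base: name -> priority; waiting_on: name -> lock it is blocked on;
    holder: lock -> name of the thread currently holding it."""
    inherited = [base[w] for w, lock in waiting_on.items()
                 if holder.get(lock) == thread]
    return max([base[thread]] + inherited)

base = {"high": 3, "medium": 2, "low": 1}
waiting_on = {"high": "L"}               # high is blocked on lock L
holder = {"L": "low"}                    # low currently holds lock L

ready = ["medium", "low"]                # high is blocked, so not ready
next_to_run = max(
    ready, key=lambda t: effective_priority(t, base, waiting_on, holder))
```

With the boost applied, the low thread outranks the medium one on the single CPU, gets its timeslice, and can release the lock.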
well, good luck with parallelizing those 5 threads so that they never rendezvous. chances are they will, in which case you may want to utilize their hw context somehow.
CELL doesn't so much solve these problems as avoid them, by enforcing restrictions in hardware on how jobs and SPEs interact.
You can enforce the same restrictions in software with your architecture on SMP. If you feed your threads CELL structured small independent jobs, they won't block on each other as far as locking and scheduling are concerned.
If you set up the system with 5 HW threads, with the sixth running regular software threads, or some such similar arrangement, you can make it look no different than a smaller and slower CELL, with the queues and interrupts implemented in software.
You have the flexibility to do any type of architecture on an SMP, which has been my point from the beginning, because it is the most general form of multithreading.
you have the flexibility, yes. our original argument was about 'ease' and 'correctness' and their relation, though. we never questioned the flexibility of smp multithreading - it's flexible, alright.
Thank you.
sorry, i thought it was you who brought to this argument the many things you have to worry about with smp 'deviations', say, accessing memory locales under numa.
Well, not all of those issues are ones you have to worry about right away when coding.
My two examples were Opteron and Itanium.
On Itanium, the weakly ordered memory model is definitely a correctness problem: if you don't nail it right away, your program won't work.
On Opteron, NUMA is a performance problem -- you can still transparently get access to the other node's memory, it's just slower.
Both these designs are SMP with some deviation. Each deviation adds things to think about when writing code on top of the basic multithreading issues. CELL is an extreme deviation from the classical SMP design, so it adds a lot of extra stuff to think about.