Will Microsoft trump CELL by using Proximity Communication in XB720 CPU?

I'm enjoying reading this thread, as I suspect it will be about as applicable as speculation about the PS3 back when the PS2 was released. ;) Can I play, too?

If there are still gamedevs reading this, I wonder which would be better received:
more memory?
more bandwidth to the GPU?
more bandwidth to Cell?
more bandwidth between Cell and GPU?
more GPU computational resources?
more Cell computational resources?

[y'all can pretend that's divided up in a sensible way, too -- include an increased ROP count with increased GPU bandwidth, that sort of thing]

I'm guessing that the balance between bandwidth and flops in the GPU already favors flops, and that the OS-reserved memory size might be a source of obvious improvement.... [or, to make it more applicable to the conversation underway, that the gpu probably does deserve a crack at improvement as Cell is already pretty darned good, but that the improvement is likely to be less about computational ability and more about bandwidth]

I vote for more of all!!! ;)
 
We've seen plenty.. The Warhawk demonstration at GDC last year, the Edge demonstration this year.. Even some of the computer vision stuff that some of the guys I'm working with have been doing is pretty darn incredible, to say the least..

If you haven't seen much then you haven't really been following all that well..

Exactly why are any of those things not possible on a dual-core CPU? I hate to bring it up because it's a bit of a cliché now, but Crysis still has the best physics I have seen implemented in any current or upcoming game. No demos that I have seen on Cell go beyond that (I'm thinking of the nuke explosion in particular), and that's running on a dual-core x86.

It's no good holding up these things that Cell demonstrates as evidence of its superiority when there is something equally good or better out there on x86.

Like I said, perhaps one day they will bring something out that makes me think "wow, that blows away Crysis's physics" or "man, my dual core could never handle that" but so far, no-one has.


Cell has already surpassed modern x86 dual-core CPUs in so many areas directly and indirectly related to games.. There are many processes that allow Cell to outmatch 4- and 8-core systems purely on the scale of concurrent threads available.. The evidence is here and it's compelling.. If you're waiting for a game that showcases dramatically sophisticated processing at a capacity only Cell could give, then you're in for a pretty long wait, because we're only at the start of the learning curve.. But then again, if you're looking for the same thing from a 4-core x86 CPU, the same applies..

This is pretty much an extension of the above but again, what exactly have they shown that's directly applicable to a real-world gaming situation and that exceeds what we have already seen x86 CPUs to be capable of? All I have seen is scientific applications and game/tech demos which don't exceed what we have seen on the PC (or 360). And you're right about them being at the start of the learning curve for multiple cores as well as Cell. If they are doing Crysis-level physics on a 2-core Conroe, what will they be able to do if they really push an 8-core Barcelona, for example? I know it's not out yet, but it's really just over the horizon and, most importantly, it's x86-based and thus supposedly massively inferior to Cell.

It's funny how some people seem to represent this attitude of impatience.. It's like "yeah, we know Cell can be used to process thousands and thousands of rigid bodies in real time... and yes, we know that Cell is capable of processing 700+ independent animation systems in a scene concurrently.. and yes, we know that Cell can process sophisticated computer vision algorithms whilst handling a multitude of other tasks on the go, but that doesn't mean Cell is better than my AX264 until I see what it can do in a game..!!"
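
To make the rigid-body point concrete: the reason those numbers scale on Cell is that the per-body work is independent and batches cleanly. Here's a rough structure-of-arrays sketch of the kind of integration step you'd hand an SPE (my own illustration, not code from any of those demos):

[code]
#include <cstddef>
#include <vector>

// Structure-of-arrays layout: each field is contiguous, so a batch of
// bodies streams through a SIMD unit (or into an SPE's local store via
// DMA) with no pointer chasing.
struct RigidBodies {
    std::vector<float> px, py, pz;   // positions
    std::vector<float> vx, vy, vz;   // velocities
    std::vector<float> invMass;      // 1/mass (0 = static/immovable)
};

// Semi-implicit Euler step over one batch. On Cell you'd DMA a slice of
// these arrays into an SPE's 256 KB local store, run the loop with SIMD
// intrinsics, and DMA the results back while the next slice streams in.
void integrate(RigidBodies& b, float fx, float fy, float fz, float dt)
{
    const std::size_t n = b.px.size();
    for (std::size_t i = 0; i < n; ++i) {
        b.vx[i] += fx * b.invMass[i] * dt;
        b.vy[i] += fy * b.invMass[i] * dt;
        b.vz[i] += fz * b.invMass[i] * dt;
        b.px[i] += b.vx[i] * dt;
        b.py[i] += b.vy[i] * dt;
        b.pz[i] += b.vz[i] * dt;
    }
}
[/code]

3000 bodies split across six SPEs is just 500 each; the expensive part is the collision broad phase, not the integration, but that partitions the same way.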

Claims are all well and good, but until we see them put into practice I see little point in drawing your conclusions and closing the book. It's not as simple as "which architecture can theoretically process more rigid bodies at once", so until it's demonstrated that it's possible to do this with real game assets alongside all the other tasks the CPU has to deal with during a game, I don't see how a conclusion can even be drawn. The EE is a good example of the difference between what it's supposed to be able to do and what it actually ended up being capable of.
 
This is pretty much an extension of the above but again, what exactly have they shown that's directly applicable to a real-world gaming situation and that exceeds what we have already seen x86 CPUs to be capable of? All I have seen is scientific applications and game/tech demos which don't exceed what we have seen on the PC (or 360).
Then you're looking in the wrong place! We've seen plenty of scientific applications where x86 or similar CPUs are eclipsed by Cell. Folding@home, anyone? For games, there's obviously not much out yet. But among tech demos, Crysis doesn't strike me as the most impressive demo of physics so far. The Rubber Ducky E3 demo was more impressive. The DMM from LucasArts too (we don't know what platform that was, and may well be PC from what's shown). We also have developers of physics engines telling us they run better on Cell. And we have the Maya cloth demo. There's no reason to doubt Cell's abilities. You can doubt they'll make it into games, but you can't discount the potential, given the existing demonstrations that have shown the marked improvements possible on Cell.

The EE is a good example of the difference between what it's supposed to be able to do and what it actually ended up being capable of.
Where did the EE fail? When used effectively (which was hard) it managed amazing things. Remember it's handling a lot of the rendering too.
 
Exactly why are any of those things not possible on a dual-core CPU? I hate to bring it up because it's a bit of a cliché now, but Crysis still has the best physics I have seen implemented in any current or upcoming game. No demos that I have seen on Cell go beyond that (I'm thinking of the nuke explosion in particular), and that's running on a dual-core x86.

It's no good holding up these things that Cell demonstrates as evidence of its superiority when there is something equally good or better out there on x86.

Like I said, perhaps one day they will bring something out that makes me think "wow, that blows away Crysis's physics" or "man, my dual core could never handle that" but so far, no-one has.
This is exactly the problem that I have with your argument.. Tell me, how many Crysis videos have you seen with more than around 200-500 rigid bodies in motion at one time? (And the nuke explosion doesn't have any more than that.. there's the wooden building and the rest is particles..)

Just because you've seen physics in action "in a context that makes sense", you say it's the most impressive demonstration of physics ever!!!111...

Sure it is from an artistic stance but not significantly from any technical one..

Like I said before.. You'll get your Cell-centric-physics-heavy-game soon enough.. Just be patient..

This is pretty much an extension of the above but again, what exactly have they shown that's directly applicable to a real-world gaming situation and that exceeds what we have already seen x86 CPUs to be capable of?
I'm not going to waste my time iterating on this point.. If you haven't seen anything then go search for it.. There's plenty of tech demos and research work out there (as well as the kind of stuff I see every day at work in the studio, which I'm not at liberty to talk about) which tells me clearly that Cell outperforms the competition.. Again I reiterate: if you're looking for a game, you're going to have to wait..

Claims are all well and good, but until we see them put into practice I see little point in drawing your conclusions and closing the book. It's not as simple as "which architecture can theoretically process more rigid bodies at once", so until it's demonstrated that it's possible to do this with real game assets alongside all the other tasks the CPU has to deal with during a game, I don't see how a conclusion can even be drawn.

Who's talking about theoretics?

You seem to have this strange inability to look at all of the Cell tech demos shown so far for what they are...!?

They aren't just some "theoretical" pre-rendered "artist illustrations" of what may be computationally feasible on Cell..
They ARE real-world demonstrations of technology designed to work in a game environment, handling game-specific processing (optimised for Cell) under the general constraints of a video game's run-time processing load.. For example, the physics demo could easily be scaled down to handle around 800 rigid bodies (as opposed to the 3000+ demonstrated) processed concurrently, leaving enough room for Cell to handle the rest of the game code comfortably.. Edge demonstrates middleware which can effectively be used to schedule threads optimally and spread the workload across the PPE & SPEs in a balanced manner whilst maintaining very high processor utilisation 99% of the time..

These are not "ideas" to provide a ball-park estimate of what's possible; they are real-world systems developed to be integrated into a game engine..
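
Edge's actual scheduler obviously isn't something I can post, but the pattern it implements is the classic job queue: break the frame into small independent jobs and let each SPE pull the next one the moment it finishes, so nothing sits idle. A minimal sketch of the idea, using standard threads to stand in for SPEs:

[code]
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// A frame stage's worth of independent jobs (physics batches, animation
// poses, vertex work...). Workers pull jobs until the queue drains, which
// is what keeps utilisation high: no worker idles while work remains.
void runJobs(const std::vector<std::function<void()>>& jobs,
             unsigned workerCount)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);
                if (i >= jobs.size()) return;  // queue drained
                jobs[i]();
            }
        });
    }
    for (auto& t : workers) t.join();  // barrier at the end of the stage
}
[/code]

On the real hardware the jobs also carry DMA lists, so an SPE can fetch the next job's data into local store while crunching the current one; that double buffering is where the very high utilisation comes from.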

Seriously you need to do your research before you try promoting such ideas..

The EE is a good example of the difference between what its supposed to be able to do and what it actually ended up being capable of.

Yes, and when you look at games like GOW, ZOE2 and SOTC you can clearly see that the EE definitely exceeded all expectations in that regard..
 
Except that you can't even mention "conservative" and Northwood in the same sentence. :p The P4 line was an engineering freakshow and might as well have been put together by magic. Intel spent billions and billions of dollars getting the P4 line to the clock speeds it did, until they finally hit the genuine clock-speed brick wall in Prescott. IBM can't do that, or even think about doing that. They don't have a tiny fraction of the kind of resources Intel threw at the P4 line.
The Pentium 4 Northwood core had a pipeline roughly as long as the PPE's.
It was probably 2-3 times wider internally than the PPE.
The P4 had a lot of custom circuitry work, and its pipeline was designed to hit high clock speeds.

The PPE has a similar philosophy, but to an even greater degree. IBM's been harping on its great circuit design techniques that enable high clock speeds.

The P4 hit 3.8 GHz at 130nm.

IBM with the PPC 970 at 130 nm hit 2.2 GHz. That core was roughly twice as wide as the P4, and it had a shorter pipeline.

Are you trying to say that a more conservative OoO core--narrower than the P4, half as wide, less aggressive, yet given the exact same long pipeline as the PPE, and made at 90nm--couldn't hit 3+ GHz?

Totally agreed except for the "conservative" OoO CPU. Not going OoO was in their better interest for the Xbox 360.

My argument is that there is no technical reason why it couldn't be done. Microsoft's more pressing constraints were the earlier release date and their price priorities.

Those are not technical reasons.

edit:

Is there much point to conservative OoO though? If it's slimmed down, its gains will be slimmed down too, perhaps to the point of not benefitting much. We hear Xenon hasn't the greatest implementation of features like branch prediction. To create a proper, well-rounded OoO processor that benefits from the OoO features, you'd be looking at bigger cores. I'd be surprised at more than dual-core in that case, which, if games do become vector-heavy, would put the CPU at a considerable disadvantage.

It's difficult to say. The problem with finding examples is that no major designs that introduced OoOE had matching in-order counterparts.
Every time a manufacturer transitioned to an OoO core, it also widened the chip, upped the cache, and added a lot of other complex features.

The Pentium vs Pentium Pro is an example of this.

4.5 million transistors for the first, and 5.5 for the second.

With 32 KB of cache for the first and 16 KB for the second (at 6 transistors per bit, I'm guesstimating 32 KB × 8 bits × 6 ≈ 1.5 million transistors in cache for the Pentium, and half that for the Pro), the logic section is roughly 3 million transistors for the Pentium and 4.5 million for the PPro.

The PPro is ~1.5 times larger than the Pentium. It also tended to get 1.5x+ the performance in SPEC95.
The gains were variable.

However, at the same time, the Pentium Pro significantly widened the core and added a complex decoding scheme.

For an ISA not as burdened by CISC decoding and implemented less aggressively, I feel a good portion of the low-hanging fruit could be captured without expanding the core as much as the PPro did.
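
To put the low-hanging fruit in code terms (a toy example of mine, not anything from the chips in question): on an in-order core, a loop with a single dependency chain stalls for the full operation latency on every iteration, and it's the programmer or compiler that has to break the chain. An OoO core finds the same overlap on its own.

[code]
// Naive reduction: every add depends on the previous one, so an in-order
// core is limited by the add latency, not by its issue width.
float sumNaive(const float* x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += x[i];
    return s;
}

// Hand-scheduled version: four independent chains hide the latency.
// This is exactly the reordering an OoO core performs in hardware; on an
// in-order core you write it yourself (n assumed divisible by 4 to keep
// the sketch short).
float sumUnrolled(const float* x, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
[/code]

Even a narrow OoO window captures cases like this automatically, which is the sense in which the fruit hangs low.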
 
The DMM from LucasArts too (we don't know what platform that was, and may well be PC from what's shown).

FWIU the Star Wars tech is headed to the xb360 (and likely PC) as well. For another example of similar, but probably not as robust, tech, check out BFBC. Also coming to xb360. ;)

I thought Mercs2 would be a showpiece for tech not possible on the xb360, one that would truly take advantage of Cell and show what all the talk about physics and Cell was about, but alas, it's another xb360 game as well.

I'm truly curious to see a game that will strut its stuff on Cell via crazy physics calcs, but nothing seems to be on the horizon... yet.

/offtopic

Can anyone update me on the progress of running physics calcs on today's GPUs and how they fare?
 
So the scientific apps and demos shown on Cell haven't exceeded what's doable on x86 PCs?

I haven't paid very close attention but can you point one out to me?

I've seen the Duck demo but I'm not sure it is something that couldn't be done on x86.
 
The Pentium 4 Northwood core had a pipeline roughly as long as the PPE's.
It was probably 2-3 times wider internally than the PPE.
The P4 had a lot of custom circuitry work, and its pipeline was designed to hit high clock speeds.

The PPE has a similar philosophy, but to an even greater degree. IBM's been harping on its great circuit design techniques that enable high clock speeds.

This is incorrect. IBM relies heavily on automated circuit design techniques. It can hit its high clock speed due to the relaxed nature of the circuit design: in other words, they kept the amount of "work" done per cycle as low as possible, and as a consequence the pipeline is stretched as long as possible.

The P4 doesn't just have "a lot" of custom circuit design; it is a fully custom circuit design. No CPU made by any other company has anything near the amount of optimization the P4 has.

The P4 hit 3.8 GHz at 130nm.

Incorrect. It hit 3.4 GHz at 130nm, and 3.8 GHz at 90nm.

IBM with the PPC 970 at 130 nm hit 2.2 GHz. That core was roughly twice as wide as the P4, and it had a shorter pipeline.

The PPC 970 is wider only in the sense that it could issue more instructions per cycle. However, you're dealing with CISC vs. RISC instructions, and the PPC 970 probably does the same or less work per cycle on average. Plus, what you get at 130nm is irrelevant to what you would get at 90nm. Clock speeds hit a serious wall at 90nm, as the 90nm PPC 970 could only reach about 2.5-2.7 GHz.

Are you trying to say that a more conservative OoO core--narrower than the P4, half as wide, less aggressive, yet given the exact same long pipeline as the PPE, and made at 90nm--couldn't hit 3+ GHz?

There's quite likely a serious floor to how "conservative" you can make an OoOE engine. In any OoO processor you need some sort of instruction queue and a reorder buffer, both of which require a minimum number of transistors and a minimum power budget. Plus, much of the advantage of OoO is that you can make the CPU much wider than an in-order core, an advantage that would be lost if you made the core that narrow. It is quite reasonable that IBM decided OoO was all but impractical for a triple-core CPU running at 3.2 GHz.

My argument is that there is no technical reason why it couldn't be done. Microsoft's more pressing constraints were the earlier release date and their price priorities.

Those are not technical reasons.

That could be said of any feature of any product that doesn't make it into the final release. With unlimited budgets, almost any goal can be reached. In this case, though, they weren't even close to reaching that goal.

One thing you're completely neglecting is the power consumption problem: the P4 was a massive power hog. Even the 130nm Northwood needed 80W+ per core. Prescott was much worse. All three cores of Xenon together used up about that much power, so an OoOE version would have to be much more than just "conservative."
 
Next-generation consoles won't enter the teraflop space on the CPU side. If you look right now, GPUs are miles ahead of CPUs in flop power. I kind of doubt we will see 1-teraflop CPUs in the next consoles. Instead of working on raw computing power, I believe we will instead see CPUs that are more efficient.


http://blogs.mercurynews.com/aei/2006/10/the_playstation.html

Dean Takahashi: It seems like you finished the Cell chip designs early. The first prototypes came out in 2004 and this is 2006. Did you still need a lot of development time after that first tape out?

Jim Kahle: We used that first tape out to get the initial software up and running. There were modifications we did to the chip over time. The design center is still active and participating. Our roadmap shows we are continuing down the cost reduction path. We have a 65 nanometer part. We are continuing the cost reductions. We have another vector where we are going after more performance. We have talked about enhanced double-precision chips. Architecturally we have double precision but we will fully exploit that capability from a performance point of view. That will be useful in high-performance computing and open another set of markets.

Dean Takahashi: That sounds like it’s not a PlayStation 3 chip?

Jim Kahle: Yeah, it is a different vector. For us to extrapolate. We will push the number of special processing units. By 2010, we will shoot for a teraflop on a chip. I think it establishes there is a roadmap. We want to invest in it. For those that want to invest in the software, it shows that there is life in this architecture as we continue to move forward.

Dean Takahashi: Right now you’re at 200 gigaflops?


Jim Kahle: We’re in the low 200s now.

Dean Takahashi : So that is five times faster by 2010?

Jim Kahle: Four or five times faster. Yes, you basically need about 32 special processing units.


Dean Takahashi: AMD bought ATI Technologies and they signaled that a combined CPU and graphics processor is not so far off. They are going to do an initial crack at it for emerging markets in 2007. Is that something you see coming and is Cell anticipating this world already?

Jim Kahle: If you look at a gaming system, there is obviously a close relationship between graphics and the main processing elements. Over time we will look to see how effectively we can make the main processor and graphics tie together. I won’t go beyond that.




While GPUs are ahead of CPUs in complexity and floating-point performance, next-gen console CPUs will no doubt break the teraflop barrier.

The Xenon/Xbox 360 CPU is already over 100 GFLOPS; the Cell/PS3 CPU is already over 200 GFLOPS. (For Cell that's easy to sanity-check: each SPE's four-wide single-precision FMA pipe gives 8 flops per cycle, or 25.6 GFLOPS at 3.2 GHz, so seven SPEs plus the PPE lands in the low 200s, matching Kahle's figure above.)

It should be no problem for both Microsoft and Sony to have CPUs in the 1-2 TFLOP (single-precision) ballpark.

That said, I know FLOPS is not the only measuring stick of CPU performance. Other types of performance measurements might become more important in the next generation of consoles, though FLOPS will still be important.
 
I haven't paid very close attention but can you point one out to me?
Scientific apps: we have Folding@Home and Mercury's imaging demos (not x86, but PPC).

I've seen the Duck demo but I'm not sure it is something that couldn't be done on x86.
That many mesh collisions is way beyond anything I've seen on PC, plus it's on a fluid bed. They weren't bounding boxes/spheres colliding. This obviously is a best-case scenario, as you're testing the model against itself so there's minimal RAM thrashing from testing against different objects. It still shows something Cell can do that x86 can't, because of its architecture.

I'm a bit muddled how people can be on this forum and not know about the test cases demonstrating Cell. They're the ones where, every time Cell does well, it's claimed the competing code wasn't optimized...

I think there's a problem with the arguments too. "I haven't seen a duck demo with boats, cloth, water, and lots of collision on PC, so how can I know it's not possible?" Well, if it was possible, why isn't it being done? Go find the very best physics demonstrations you can, and then show us that x86 can do all that. Otherwise you're basing faith in a system on the fact it hasn't been shown to fail. A bit like saying "I could win an Olympic gold medal in the triathlon, but I just don't want to enter." From that POV, no-one can prove you can't win a gold medal in the triathlon. The best they can argue is that you're seen to be out of breath whenever you have to chase 30 yards after the dog.
 
Breadth-first search is used in AI/pathfinding, something for which there is a paper out there with comparisons to x86 :p (Someone can correct me if I'm wrong, but it and algorithms like it are also fairly fundamental and underlie many data structures used in all sorts of applications, including games).
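
For anyone who hasn't run into it: the algorithm itself is tiny, and the interesting part on Cell is partitioning the frontier across SPEs. The serial version below shows what's actually being measured (a generic sketch, not the code from the paper):

[code]
#include <queue>
#include <vector>

// Level-synchronous BFS over an adjacency list. Returns the hop count
// from `start` to every node (-1 if unreachable). In a game, `adj` would
// be a waypoint or nav-mesh graph and the distances feed pathfinding.
// The Cell work referenced above parallelises this by splitting each
// frontier level across the SPEs.
std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int start)
{
    std::vector<int> dist(adj.size(), -1);
    std::queue<int> frontier;
    dist[start] = 0;
    frontier.push(start);
    while (!frontier.empty()) {
        const int u = frontier.front();
        frontier.pop();
        for (int v : adj[u]) {
            if (dist[v] == -1) {        // first visit = shortest hop count
                dist[v] = dist[u] + 1;
                frontier.push(v);
            }
        }
    }
    return dist;
}
[/code]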

In terms of 'scientific' code that's been written up and compared to other processors, I've seriously lost count at this point of the papers that are out there. There's been a lot, and many have been posted here.

There are a number of other things that have been implemented and documented on Cell which could find its way into game code, but the problem with asking for demonstrations/comparisons of game code is that game companies tend not to disclose that kind of information in order to keep platform holders happy. Any time we have seen specifics discussed it has always only addressed one system (e.g. Havok published performance numbers for PS3 only), and any time even a remote comparison has been made, mouths were swiftly shut by PR people. It's in the academic/scientific community that things are transparent and get written up and published, so we'll just have to deal with what they're interested in.
 
Breadth-first search is used in AI/pathfinding, something for which there is a paper out there with comparisons to x86 :p (Someone can correct me if I'm wrong, but it and algorithms like it are also fairly fundamental and underlie many data structures used in all sorts of applications, including games).

To be clear, breadth-first search can be (and has been!) done on x86 easily... but for a large problem size, the Cell variant outperformed a 128-CPU BlueGene node. I wouldn't expect an x86 to do that today.
 
To be clear, breadth-first search can be (and has been!) done on x86 easily... but for a large problem size, the Cell variant outperformed a 128-CPU BlueGene node. I wouldn't expect an x86 to do that today.

Single Cell or a Cell cluster?

BTW, for people who doubt Cell: go read some research papers on stream computing. It has a lot of potential to be the ideal processing paradigm for quite a few (possibly most) different types of workloads.
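
The one-paragraph version of the paradigm: data flows through a chain of small kernels, each reading a block, computing, and writing a block, with no shared mutable state to fight over, so transfers and compute overlap naturally. A toy sketch of the shape (the kernels and names are mine):

[code]
#include <cstddef>
#include <vector>

using Block = std::vector<float>;

// A stream "kernel" maps an input block to an output block with no side
// effects; that independence is what lets hardware like the SPEs (or a
// GPU) run many blocks concurrently and double-buffer the transfers.
Block scaleKernel(const Block& in, float k)
{
    Block out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] * k;
    return out;
}

Block clampKernel(const Block& in, float lo, float hi)
{
    Block out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] < lo ? lo : (in[i] > hi ? hi : in[i]);
    return out;
}

// Streaming pipeline: blocks flow kernel to kernel. On Cell each kernel
// could sit on its own SPE, with DMA moving blocks between local stores
// while the previous block is still being processed.
void process(std::vector<Block>& stream)
{
    for (auto& block : stream)
        block = clampKernel(scaleKernel(block, 2.0f), 0.0f, 1.0f);
}
[/code]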
 
To be clear, breadth-first search can be (and has been!) done on x86 easily... but for a large problem size, the Cell variant outperformed a 128-CPU BlueGene node. I wouldn't expect an x86 to do that today.


Sure, and I'm not even saying this is something that typically needs to be done faster or sped up significantly, etc., or what the merit in doing so is; just that, as an algorithm relevant to games where Cell and x86 have been compared, it fits the bill.

Fox5 said:
Single Cell or a Cell cluster?

Single Cell. I don't know HOW patsu's download link works (?), but there's also a Dr. Dobb's article on it here: http://www.ddj.com/dept/64bit/197801624
 
Sure, and I'm not even saying this is something that typically needs to be done faster or sped up significantly, etc., or what the merit in doing so is; just that, as an algorithm relevant to games where Cell and x86 have been compared, it fits the bill.



Single Cell. I don't know HOW patsu's download link works (?), but there's also a Dr. Dobb's article on it here: http://www.ddj.com/dept/64bit/197801624


You're right. My link was screwed up :(
I have fixed it to include Petrini's original presentation at Cell Summit 2006.
 