Could PS3/Cell be another Itanium or Talisman?

Completely agree with Vince and Tuttle in this thread. CELL is not some gimmicky architecture that is just going to go away. It will be in an award-winning piece of hardware (PS3) and will be in many different IBM workstations. IBM may even put it into supercomputers that it can sell at a huge profit yet at a lower cost than its Power-based supercomputer processors.
 
A Japanese company committing to a risky technology is no guarantee of success. Remember the Fifth Generation Project? (For people not alive in the 80s, it was a grand project by the Japanese government and industry to use artificial intelligence languages and technologies to leapfrog the US in software design. It didn't work out very well.)

IBM putting thousands of developers on a project is no guarantee of success. They did that on OS/2 as well, back in the day.

Also, I wonder who's paying the IBM developers' salaries? IBM may just be doing work-for-hire for Sony. They may not have any long-term investment in the product's success.

Of course IBM would love to create a new CPU architecture for the world, but it seems odd to start out with the PS3, rather than some super-computer or web-server-based systems. Cell just feels like a research project that nobody else wanted.

Building a new fab is no indication of commitment, either. New fabs are always being built, and should PS3 not work out IBM can certainly use the fab space for something else.

Vince said:
There's a paper by two Cell researchers on Inherently Lower-Complexity Architectures and their ability to have vastly better time-to-market results due to their modular nature and the necessity of being on time. Perhaps you should read it.

Am I the only person to find it amusing to see the phrase "Lower-Complexity Architecture" in conjunction with an architecture that requires distributed programming techniques, and the phrase "better time-to-market" in conjunction with a project that has taken five years and hasn't shipped yet?

Oh, wait, maybe they're arguing against using Cell. ;)

Probably not, though. Probably they're arguing for Cell because they're suffering from near-sightedness. From their point of view they've improved things by pushing the complexity out of their part of the system (the CPU) into the developer's lap. It's true that the CPU is simpler, but the system as a whole seems to have gotten more complex, not less.

Such an approach might make sense -- if the performance benefits are large enough. I guess time will tell.
 
FatherJohn said:
IBM putting thousands of developers on a project is no guarantee of success. They did that on OS/2 as well, back in the day.
Technologically, OS/2 was a success. The fact that it failed in the consumer market had nothing to do with the tech not delivering, and everything to do with IBM's complete inability to market a consumer-level product.

Personally, I put half of the blame for the world getting stuck with MS's sorry excuse for an OS back then on IBM's marketing department.
 
FatherJohn, you are very wrong in what you say. You use very well-written posts to convey a point that isn't completely in touch with reality. You compare CELL to a project from the 80s, and the only thing you point out that they have in common is that something Japanese will use it. Sony is not the Japanese government.

IBM is not doing work-for-hire as you say. It is clear to me and many others on this board that IBM has big plans for the architecture and will be implementing it in future computers. It has been confirmed, so you are wrong on that account as well.

And the PS3 is not going to be the initial platform it is tested on. It will be CELL workstations built by IBM themselves. That discredits your statements even further.

CELL may feel like a research project you wouldn't want to work on, since you seemingly cannot grasp the concept behind it: simple units that can be thrown together onto one big, massive chip. When these simple units are on one chip the performance can be quite staggering. I'm pretty sure IBM knows what it's doing and will be able to get the efficiency of CELL to great levels.
 
Itanium itself has good performance and technically it's a success; it's failing only for cost reasons. Itanium 2 runs x86 code in software, so the faster it gets, the faster it runs x86 binaries, without messing up and complicating the fixed hardware. OTOH, Cell has nothing to do with supporting legacy x86 binaries. Besides, it has outlet channels for reducing its cost by achieving economies of scale which are not available to Itanium. Who puts an Itanium in a TV set? :LOL:

On a related note, there are recent interviews by H. Goto @ Impress PC Watch with Intel Chief Technology Officer, Pat Gelsinger
http://pc.watch.impress.co.jp/docs/2004/1112/kaigai133.htm (machine translation by Excite)
and Gelsinger confirms that future computing is heading toward vastly parallel architectures, as most application code is becoming amenable to parallelism, just like ray tracing, and Intel is now working toward developing a 'many-core' CPU. He even says that Intel aims at the first 'teraflop' machine with a 'many-core' CPU beyond dual-/quad-core CPUs. Then, inevitably, Goto brings up 'Cell' as another example of such an attempt, and how do you think Gelsinger answered? While he admits that Cell is an innovative new architecture, the difference between Intel's 'many-core' CPU and Cell is that the former has software compatibility with all legacy code, while the latter is compatible with nothing. (At least it seems Gelsinger thinks Cell is not a PPC variant.)
 
Cell is a much more useful chip for IBM than for Sony. Scientific computing in general parallelizes nicely; gaming really doesn't (I guess you can multithread your AI and physics engines, but doing it completely without memory clashing will be a pain; I've found on simulations I've done that simple parallelizing schemes like OpenMP can result in big performance degradation in those situations).

Really, unless they come up with the magical distributed compiler it's going to be one PITA to program for, is all I have to say (and something tells me they aren't going to advance parallel compilers that much). Even very parallelizable problems can often not work as well as you think they would. But only time will tell.
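
To give a toy example of the kind of degradation I mean (purely illustrative, plain OpenMP on a generic multi-core box, nothing Cell-specific): a naive parallel sum where the per-thread accumulators sit on the same cache line, so the cores spend their time bouncing that line between caches instead of doing work.

Code:
// Toy illustration: false sharing in a naive OpenMP parallelization.
// The per-thread accumulators are adjacent floats, i.e. one cache line,
// so every update invalidates that line for the other cores.
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 24;
    std::vector<float> data(N, 1.0f);

    const int nthreads = omp_get_max_threads();
    std::vector<float> partial(nthreads, 0.0f);   // accumulators packed together

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < N; ++i)
            partial[t] += data[i];                // contended cache line
    }

    float sum = 0.0f;
    for (float p : partial) sum += p;
    std::printf("sum = %f\n", sum);
    // Padding each accumulator to its own cache line, or just using
    // "#pragma omp parallel for reduction(+:sum)", removes the contention,
    // but you usually only find that out after profiling.
}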
 
FatherJohn said:
Am I the only person to find it amusing to see the phrase "Lower-Complexity Architecture" in conjunction with an architecture that requires distributed programming techniques, and the phrase "better time-to-market" in conjunction with a project that has taken five years and hasn't shipped yet?

I think you are; perhaps you should read the paper before commenting? Cell, from everything I've seen and heard, is the embodiment of this concept: from their unified datapaths to the modular construction. Where, exactly, isn't the chip an aggregate of Lower-Complexity Units compared with anything else in the mainstream consumer marketplace embodied by Intel and AMD?

And TTM implies that you meet the deadline, whatever arbitrary deadline it is. Just because STI has spent 5 years doing research into the various aspects of the architecture doesn't mean that the last 2-3 years, in which the design is laid out, synthesized and taped out, are irrelevant. You can spend 5 years doing fundamental research on an architecture and design with whatever extrapolation you have in mind (e.g. 65nm sSOI), but unless you do research into how to actually get the thing synthesized and functional to meet the goals once the process technology comes online, you're fucked.

I forget your name said:
Probably not, though. Probably they're arguing for Cell because they're suffering from near-sightedness. From their point of view they've improved things by pushing the complexity out of their part of the system (the CPU) into the developer's lap. It's true that the CPU is simpler, but the system as a whole seems to have gotten more complex, not less.

How can you make such a comment based on the publicly available information? You have no idea where the so-called 'complexity' will reside, nor do you know what programming model will be utilized.

And near-sighted is the exact opposite of what comes to mind. If anything, it's too far-sighted a design and will suffer from what happens when you give a group of thinkers a blank sheet of paper.
 
Cryect said:
Cell is a much more useful chip for IBM than for Sony. Scientific computing in general parallelizes nicely; gaming really doesn't (I guess you can multithread your AI and physics engines, but doing it completely without memory clashing will be a pain; I've found on simulations I've done that simple parallelizing schemes like OpenMP can result in big performance degradation in those situations).

I have a question. People keep talking about how certain tasks aren't very easily parallelizable (help me on the wording, heh), but why would you want to? Why do you want to speed up a single process? Can't a single process be computed in enough time on any NG console element, or be reduced to the point where it can be? Why not just compute a plurality of objects, things, entities, etc. in the same length of time?

World simulation and having worlds inhabited by GTA-esque cities that are busy, from my perspective, is much more ambitious and opens more possibilities as a game player than trying to run ever-longer microprograms faster on a few objects.
 
While it sounds like that works well, it doesn't work quite as well as you would think at first glance. Monte Carlo molecule simulations work similarly to that, and the issue that limits how parallel you can be is the number of interactions.

If you can avoid interactions between items, great; it really comes down to a smart scheduler controlling which entities are doing what. This way you can try to avoid having two entities which are interacting with each other both executing their programs at the same time.

Event schedulers will be the main thing I see programmers having to get used to initially. A really good scheduler, assuming you have enough independent discrete tasks, should make it feasible to get a good chunk of the theoretical performance; early on, though, expect say 10% at most to be achieved.
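
As a rough sketch of the kind of scheduling I mean (the entity struct and the interaction test here are made-up placeholders, not anything from a real engine): greedily group entities into batches so that no two entities in the same batch interact, then each batch can be dispatched across the parallel units without the entities stepping on each other.

Code:
// Hypothetical sketch: greedy batching of entities so that no two entities
// that interact end up in the same parallel batch.
#include <vector>

struct Entity { int id; float x, y; };

// Toy broad-phase test: two entities "interact" if they are within a small radius.
bool interacts(const Entity& a, const Entity& b) {
    float dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy < 4.0f;
}

std::vector<std::vector<Entity>> buildBatches(const std::vector<Entity>& all) {
    std::vector<std::vector<Entity>> batches;
    for (const Entity& e : all) {
        bool placed = false;
        for (auto& batch : batches) {
            bool conflict = false;
            for (const Entity& other : batch)
                if (interacts(e, other)) { conflict = true; break; }
            if (!conflict) { batch.push_back(e); placed = true; break; }
        }
        if (!placed) batches.push_back({e});   // nothing fit, open a new batch
    }
    return batches;   // entities within one batch can run concurrently
}

The batching itself is serial work, so the fewer interactions there are, the more of the theoretical performance you actually keep.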
 
Vince said:
World simulation and having worlds inhabited by GTA-esque cities that are busy, from my perspective, is much more ambitious and opens more possibilities as a game player than trying to run ever-longer microprograms faster on a few objects.

This is a problem very close to my heart for a number of reasons.

This is exactly the sort of thing that Cell-like architectures do not do well.

The problem is that systems like this rely on single large databases, and every "thread" in an environment like this requires close to random access to the central data.

Cell is clearly engineered for tasks that allow clear segmentation of the code and data resources (graphics leaps to mind).

The reason that certain tasks cannot be trivially converted to hundreds of simple parallel threads is data dependencies; in some cases different approaches can yield a good degree of parallelism, but this is not true in the general case.
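
To make the dependency problem concrete, here's a toy sketch (the world database and the access pattern are invented for illustration, not how any real engine is laid out): if every AI "thread" has to go through one shared database with near-random reads and writes, the threads end up serialized no matter how many processing units you have.

Code:
// Toy sketch: one shared world database behind a single lock. With
// near-random reads and writes from every thread, the lock (or, on real
// hardware, the memory traffic behind it) serializes the work.
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

struct WorldDB {
    std::unordered_map<int, float> state;   // per-object data
    std::mutex lock;

    float read(int key) {
        std::lock_guard<std::mutex> g(lock);
        return state[key];
    }
    void write(int key, float v) {
        std::lock_guard<std::mutex> g(lock);
        state[key] = v;
    }
};

void aiThread(WorldDB& db, int start) {
    for (int i = 0; i < 100000; ++i) {
        int key = (start + i * 7919) % 5000;   // scattered, hard-to-predict keys
        db.write(key, db.read(key) + 1.0f);    // everything funnels through one lock
    }
}

int main() {
    WorldDB db;
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back(aiThread, std::ref(db), t * 1000);
    for (auto& th : threads) th.join();
    // Adding more threads barely helps: the shared database, not the number
    // of processing units, is the bottleneck.
}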
 
Talking about the vision of Intel and Gelsinger again, this section at the Intel website titled 'Architecting the Era of Tera', based on the Spring 2004 IDF keynote speech by Gelsinger, shares a strikingly large amount with what Cell is supposed to do.

Tomorrow's Computing Workloads
To develop this architectural paradigm requires understanding the workloads of tera-era computing.

Tera-level computing involves three distinct types of workloads, or computing capabilities:
# Recognition: the ability to recognize patterns and models of interest to a specific user or application scenario.
# Mining: the ability to mine large amounts of real-world data for the patterns or models of interest.
# Synthesis: the ability to synthesize large datasets or a virtual world based on the patterns or models of interest.

Today, we have application-specific architectures optimized for a single workload. We don't expect an enterprise server to do real-time rendering, and we don't use 3D graphics engines to do database sweeps.

In the future, as the amount of data continues to grow, the computing capabilities for recognition, mining and synthesis (RMS) workloads are converging. Each type of workload, while very different, will require teraflops of processing capability applied to massive data streams. This convergence enables the creation of a new architecture that can meet all three workload requirements — recognition, mining, and synthesis — on a single platform.

Gelsinger said:
(http://www.intel.com/pressroom/archive/speeches/gelsinger20040219.htm)
And beyond that, we say what's next after Hyper-Threading Technology is multiple cores per die. Where, rather than big, full-chip implementations we now have localized implementations in each core and effective relationships with the "nth" level, second and third level caches.

But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations into the future.

These implementations won't just be at the chip level. We need to rearchitect the entire platform to address such massively multicore architectures. We need to redesign the CPUs, the memory interfaces, the caches, and the interconnects. We need to redesign the entire platform for these characteristics of scaling into the future.

And these will be the architectures of the era of tera of tomorrow. We've been analyzing this, and with much work, we're just at the beginning phases of looking at threading designs. And as we look at threading, we see and have been exploring the basic notions of how you make that operate in the core.

Also, Gelsinger talks about a new Intel compiler which can create helper threads to hide memory latency or to process network packets, so the entire platform, including software and hardware for processing massive datasets, is a research target for Intel.
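
Roughly, the helper-thread idea in hand-rolled form (Intel's version would be generated by the compiler; this is just the concept, written with ordinary threads): a second thread walks the data slightly ahead of the main one and touches it, so it is already warm in the cache when the main thread arrives.

Code:
// Hand-rolled sketch of a "helper thread" that runs slightly ahead of the
// worker and touches data so it is warm in cache by the time the worker
// reaches it. Purely conceptual; a compiler-generated helper thread would
// prefetch based on the real program's access pattern.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t N = 1 << 24;
    std::vector<int> data(N, 1);
    std::atomic<std::size_t> cursor{0};     // how far the main thread has gotten

    long long touched = 0;                  // only read after join(), so no data race
    std::thread helper([&] {
        const std::size_t lead = 4096;      // stay at most this far ahead
        for (std::size_t i = 0; i < N; ++i) {
            while (i > cursor.load(std::memory_order_relaxed) + lead)
                std::this_thread::yield();  // throttle: only run slightly ahead
            touched += data[i];             // the touch pulls the element into cache
        }
    });

    long long sum = 0;
    for (std::size_t i = 0; i < N; ++i) {
        sum += data[i];                     // the "real" computation
        cursor.store(i, std::memory_order_relaxed);
    }
    helper.join();
    std::printf("sum = %lld (helper touched %lld)\n", sum, touched);
}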

But Intel may be a bit too late in turning in this direction, as even AMD started earlier with multicore designs, not to mention Cell, which has been in development since around 2000.
 
ERP said:
The problem is that systems like this rely on single large databases, and every "thread" in an environment like this requires close to random access to the central data.

In timestep simulations they mostly need read access though, which is only a problem as far as bandwidth is concerned ... but it doesn't limit parallelism otherwise.

The kind of timesteps you can use in a game are pretty coarse; this ain't molecular dynamics.
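
Something like this toy double-buffered timestep, say (the update rule is made up, and there is nothing Cell-specific here): each entity reads only the previous frame's state and writes its own slot in the next buffer, so the updates are embarrassingly parallel and the shared data is read-only within a frame.

Code:
// Toy double-buffered timestep: every update reads only last frame's state
// and writes its own slot, so iterations are independent and parallelize
// without locking; within a frame the shared data is read-only.
#include <cmath>
#include <utility>
#include <vector>

struct State { float x, y, vx, vy; };

void step(const std::vector<State>& prev, std::vector<State>& next, float dt) {
    #pragma omp parallel for              // no write conflicts: next[i] is private to i
    for (int i = 0; i < (int)prev.size(); ++i) {
        State s = prev[i];
        // made-up "AI": drift toward where entity 0 was last frame
        float dx = prev[0].x - s.x, dy = prev[0].y - s.y;
        float len = std::sqrt(dx * dx + dy * dy) + 1e-6f;
        s.vx += dt * dx / len;
        s.vy += dt * dy / len;
        s.x += dt * s.vx;
        s.y += dt * s.vy;
        next[i] = s;
    }
}

int main() {
    std::vector<State> prev(10000, State{0.0f, 0.0f, 0.0f, 0.0f}), next(prev.size());
    for (std::size_t i = 0; i < prev.size(); ++i) prev[i].x = (float)i;

    for (int frame = 0; frame < 60; ++frame) {
        step(prev, next, 1.0f / 60.0f);
        std::swap(prev, next);            // the new frame becomes the read-only input
    }
}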
 
Yes, you're right; for the most part the simulation can rely on the previous frame's results.

The issue with Cell is more the size of the database you need access to, not the data dependency. My understanding (at least at the moment) is that each APU will only have direct access to a 128K chunk of memory.
 
I personally don't like the Cell architecture as described in the patents. I would much rather have had cores with mundane caches, using vertical multithreading to overcome memory latency when accessing non-local memory.
 
ERP said:
My understanding (at least at the moment) is that each APU will only have direct access to a 128K chunk of memory.

My current understanding is that each A|SPU can directly access the shared memory by sending a DMA request to the DMAC which translates the virtual address to a physical one via the TLB, and if it doesn't exist it looks in a file in the shared memory and then arbitrates the transaction.

Also, there has been talk of each A|SPU having a local store and a flow controller. The thinking goes that the aggregate of A|SPUs then shares their own L1 cache, which stores data based on the load-access patterning of the DMAC's previous DMA requests before hitting the system memory. Ergo, data is predicted and moved autonomously based on this patterning into the L1 outside of, and prior to, a formal DMA request. If there's a miss off the patterns, it DMAs it from the system memory directly into the LS. I would imagine that for streamed, sequential data that's logically addressed this would be quite effective. Oh, and each PU has an L1 and L2 -- or so it's said.
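
If that description is anywhere near right, I'd guess the programming model ends up looking something like the sketch below. To be clear, dma_get/dma_wait, the chunk size and the buffer layout are placeholders I've invented from the patent talk, not any published Cell API; the 128K local store figure is just the rumored one from above.

Code:
// Hypothetical sketch of working out of a small local store with explicit
// DMA, double-buffered so the next chunk streams in while the current one
// is being processed. dma_get/dma_wait and the sizes are made up.
#include <cstddef>
#include <cstdint>

constexpr std::size_t LOCAL_STORE_SIZE = 128 * 1024;   // rumored per-APU local store
constexpr std::size_t CHUNK = 16 * 1024;               // working chunk per DMA
static_assert(2 * CHUNK <= LOCAL_STORE_SIZE, "both buffers must fit in the local store");

// Assumed primitives: issue an asynchronous copy from shared memory into the
// local store, tagged so it can be waited on later. Stand-ins, not real calls.
void dma_get(void* local_dst, std::uint64_t shared_src, std::size_t bytes, int tag);
void dma_wait(int tag);

void process(float* chunk, std::size_t count);          // the actual per-chunk work

// Assumes totalBytes is a non-zero multiple of CHUNK, for brevity.
void streamFromSharedMemory(std::uint64_t shared_base, std::size_t totalBytes) {
    alignas(128) static float buf[2][CHUNK / sizeof(float)];  // two buffers in the local store

    dma_get(buf[0], shared_base, CHUNK, /*tag=*/0);      // prime the first buffer
    std::size_t offset = CHUNK;
    int cur = 0;

    while (offset < totalBytes) {
        int next = cur ^ 1;
        dma_get(buf[next], shared_base + offset, CHUNK, next);  // fetch ahead
        dma_wait(cur);                                   // wait for the current buffer...
        process(buf[cur], CHUNK / sizeof(float));        // ...and work on it
        offset += CHUNK;
        cur = next;
    }
    dma_wait(cur);
    process(buf[cur], CHUNK / sizeof(float));            // last chunk
}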
 
The sort of databases used for large-scale AI don't have predictable fetch patterns. They are, however, relatively coherent from frame to frame, so a reasonably sized traditional L2 cache is a big win.

My information is hardly accurate at this point, but I don't expect to be able to do random access to any large scale databases on an APU. My hope at the moment is that the central PE core is a usable fast processor and not crippled in some way.

I do not expect from what I've read that Cell will be a good general purpose processor, but then I don't believe it's designed to be such.
 
Fafalada said:
FatherJohn said:
IBM putting thousands of developers on a project is no guarantee of success. They did that on OS/2 as well, back in the day.
Technologically, OS/2 was a success. The fact that it failed in the consumer market had nothing to do with the tech not delivering, and everything to do with IBM's complete inability to market a consumer-level product.

Personally, I put half of the blame for the world getting stuck with MS's sorry excuse for an OS back then on IBM's marketing department.

Actually, it's not that cut and dried.

OS/2 1.x was designed for the 286, when the 386 was already available (and hence had the brain dead "DOS penalty box"). It didn't even have a GUI subsystem for the first two versions. It had no apps, so no one ran it anyway.

The OS/2 2.x and newer kernels retained a lot of 16-bit turdage and brain damage of the 1.x kernel -- most of the core kernel was actually still 16-bit and written in assembler. The upper layers of the system were 32-bit and written in C.

In this respect the architecture was actually a lot like Windows 9x.

Most OS/2 physical device drivers had to be written in 16-bit assembler. The entire GUI subsystem could be frozen by a single app not servicing its message queue in a timely fashion, kind of like similar weaknesses in Win9x. OS/2's graphics engine didn't go fully 32-bit until OS/2 2.1, and despite the marketing verbiage, I don't think OS/2 ever actually went fully 32-bit.

OS/2 2.0 competed with Windows 3.1 (which it was indeed immeasurably superior to), but it had a lot of rough edges and required a much bigger machine than most people had at the time -- I remember running it on an 8 MB 486, which was a huge machine for the time, and enduring all sorts of weird behavior from the GUI shell, which was called the WPS (and was rather neat, but really slow).

By the time machines got big enough and OS/2 got enough polish to really compete (OS/2 3.0 Warp), it was up against NT3.5 -- which pretty much handed OS/2 its ass in about every objective measure I can think of except the GUI shell.

I ran OS/2 on my machines all the way up to OS/2 4.0, (code name Merlin), until I gave up and just switched to NT 4.0 -- there was basically nothing significant that OS/2 did that NT4 didn't do better.
 
ERP said:
The sort of databases used for large-scale AI don't have predictable fetch patterns. They are, however, relatively coherent from frame to frame, so a reasonably sized traditional L2 cache is a big win.
Assuming you don't want more AI entities than the L2 cache can hold, given the amount of cache each one of them needs.

And the issue, as I said before, even with a timestep simulation (which is what I assume most games will use), is making sure you don't have all of them trying to access the same area of memory. Assuming there is any sort of crossbar (about any system has one, so I would assume Cell has one, but I'm not fully certain), it's preferable that you don't have everything accessing similar spots in memory. If each of your processes is using the same node, the perceived memory bandwidth is cut to 1/n, where n is the number of nodes in the crossbar.

Then again, it of course depends on how the memory crossbar is implemented. In general, though, in a system where multiple items (APUs/GPUs/MPUs etc.) are accessing memory at the same time, you want each of them to use a single node at a time so other devices can access memory as well (the more nodes in the crossbar, the more devices can potentially use the RAM at the same time, but of course the memory bandwidth per node is less; in a streaming system it's probably preferable to have a high number of nodes).

Though if each of the programs needs to access the same memory, any caching system should of course take care of the issue. The problem is when they need to access memory in a similar area but not actually the same memory (which is quite likely if they are similar programs, since they will be accessing the same arrays).
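
A toy way to see the effect (the bank count and the interleaving granularity here are made-up numbers, not anything known about Cell's memory system): with address-interleaved banks, several units walking arrays with the same stride and the same alignment keep landing on the same bank, so they end up sharing that one bank's bandwidth.

Code:
// Toy model of bank contention: addresses interleave across NUM_BANKS at
// LINE_BYTES granularity. Units that walk arrays with the same stride and
// the same starting alignment keep hitting the same bank.
#include <cstdint>
#include <cstdio>

constexpr int NUM_BANKS = 8;      // made-up interleave, for illustration only
constexpr int LINE_BYTES = 128;

int bankOf(std::uint64_t addr) {
    return static_cast<int>((addr / LINE_BYTES) % NUM_BANKS);
}

int main() {
    // Four "units" each walking their own array with the same 4 KB stride,
    // all starting 1 MB apart: every access maps to bank 0.
    for (int unit = 0; unit < 4; ++unit) {
        std::uint64_t base = 0x100000ull * unit;
        std::printf("unit %d: banks", unit);
        for (int step = 0; step < 4; ++step)
            std::printf(" %d", bankOf(base + step * 4096ull));
        std::printf("\n");
    }
    // Each unit prints "banks 0 0 0 0": all the traffic lands on one bank,
    // so each unit only sees about a quarter of that bank's bandwidth.
}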

ERP said:
My information is hardly accurate at this point, but I don't expect to be able to do random access to any large scale databases on an APU. My hope at the moment is that the central PE core is a usable fast processor and not crippled in some way.

Agreed
 
the only conceivable way of achieving their performance goals was to adopt a revolutionary strategy.

Who ever said this strategy was revolutionary?

"But not everyone thinks this approach is groundbreaking, given that some processors already use inter-chip multiprocessing. "I just don't see that Cell is revolutionary, except in its marketing impact," Glaskowsky said."

Glaskowsky obviously doesn't think so. :?

That would be Nintendo, nobody else is even close to the total success of Ninty.

That depends on how you define their success. If memory serves me right, Sony has not only taken more market share than Nintendo, but has also managed to accumulate a larger profit since the very beginning of its run as a console leader. I believe it was their $36 billion to Nintendo's $34 billion, as of a few weeks to a month ago. ;)
 