Cell

In the "professional visualization" industry, using expensive SGI systems, the parallelization has been known (and used) for at least 10-15 years. Already at the Performer (an scene-graph API, if you didn't know about it) level you have multi-CPU-support in the form of App-Cull-Draw running in parallel on separate CPUs. This, of course, introduces a 2 frame latency which may be acceptable at 60 fps (33 ms extra latency) or higher (lower latency).

With an even higher-level scene-graph API, like Vega, you can have even more separate processes running other tasks, like collision detection, without you as a programmer needing to think about how to parallelize it.

As a programmer, you then of course have additional ways of extending the parallel nature of the whole system. The application could be divided into separate processes running the flight-model simulation, radar simulation, communication, etc. - an example from flight simulators, which is what I used to work on.


The problem with this approach is that some task running as a single process (and therefore one that can't be subdivided) may be much more demanding than the other tasks, making it the bottleneck in the system. No matter how many extra CPUs you throw in, you won't get an improvement. And this is where the main problem of parallelization lies: how do you parallelize a task that doesn't have a parallel nature?

/Per
 
Per said:
The problem with this approach is that some task running as a single process (and therefore one that can't be subdivided) may be much more demanding than the other tasks, making it the bottleneck in the system. No matter how many extra CPUs you throw in, you won't get an improvement. And this is where the main problem of parallelization lies: how do you parallelize a task that doesn't have a parallel nature?

Simple: let that one do its thing and do something else in the meantime. If you're only running one application on a system, that's really a special case. As for Sony, they want to push themselves as the desktop replacement; Cell is a step in that direction, or so I gather.
 
That's fine and dandy, but most people want to run games on a console. There are only a couple of tasks which take a lot of processor power: rendering, physics and AI/game logic... I think all of those can be parallelized on one level or another, though I doubt it will be particularly easy. If all that developers get to help them along is multithreading, debugging will become a nightmare, for one thing.
 
MfA said:
If all that developers get to help them along is multithreading, debugging will become a nightmare, for one thing.

I get the impression multi-thread debugging gets somewhat overrated here. Yes, it's more complicated than single-thread, but it's as easy or as difficult as your debugging tools make it - just the same as with single-thread debugging. Apparently some developers need to step out of the windblows mindset and get their bottoms exposed to a pervasive multi-threading environment for a while.
 
MfA said:
Rendering, physics and AI/game logic... I think all of those can be parallelized on one level or another, though I doubt it will be particularly easy. If all that developers get to help them along is multithreading, debugging will become a nightmare, for one thing.

Again, with examples from developing military flight simulators on high-end systems, the solution for this is running them as separate processes (basically programs that can be developed and debugged separately) and having them exchange data through shared memory, message queues and/or similar. However, running separate processes introduces full-blown context switches, which is OK on multi-CPU solutions with large caches. I guess you could implement something like this on XBox, but I have no idea at all about PS2 and GC.

I'm not saying that this method is feasible on current or next-gen consoles; I just wanted to provide an example of how high-level parallelism (maybe 100+ processes on a 2- or 4-CPU machine) is used on high-end systems. It would probably be very inefficient on a console.
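
For readers who haven't seen the separate-process, shared-memory style described above, here is a rough POSIX sketch (fork plus an anonymous shared mapping, chosen only for illustration; the FlightState contents and process roles are made up). One "flight model" process publishes state into shared memory and a "radar" process reads it; a real system would use proper message queues or semaphores instead of busy-waiting.

Code:
// Rough sketch of two cooperating processes exchanging data through shared memory.
// POSIX mmap/fork used only for illustration; the FlightState contents are invented.
#include <atomic>
#include <cstdio>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct FlightState {
    std::atomic<int> frame;                   // written by the flight model, read by the radar
    std::atomic<int> ack;                     // written back by the radar
    double altitude;
};

int main() {
    // Anonymous shared mapping: visible to both the parent and the forked child.
    void* mem = mmap(nullptr, sizeof(FlightState), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    auto* state = new (mem) FlightState{{0}, {0}, 0.0};

    if (fork() == 0) {                        // child: the "flight model" process
        for (int i = 1; i <= 3; ++i) {
            state->altitude = 1000.0 * i;                              // publish new data...
            state->frame.store(i, std::memory_order_release);          // ...then signal it is ready
            while (state->ack.load(std::memory_order_acquire) != i) {} // wait for the radar
        }
        _exit(0);
    }

    for (int expected = 1; expected <= 3; ++expected) {                // parent: the "radar" process
        while (state->frame.load(std::memory_order_acquire) != expected) {}
        std::printf("radar sees frame %d, altitude %.0f\n", expected, state->altitude);
        state->ack.store(expected, std::memory_order_release);         // let the flight model continue
    }
    wait(nullptr);
    munmap(mem, sizeof(FlightState));
}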

/Per
 
First of all, I don't think there's any such thing as a highly complex algorithm that cannot be split up into multiple processes in one way or another. Granted, it's not always easy to do so, but hopefully we'll have developer tools that make it easier as parallelism in CPUs increases.

Regardless, the best thing about designing multiple cores into a single die is that the dies can use current instruction sets (and therefore have backwards compatibility), but use compiler-level or developer optimizations designed to leverage the massive parallelism.

Why is this good? Parallelism in most current CPUs is usually wasted, simply because it's up to the CPU itself to decide how to use parallelism, on the fly. Why do you think the Pentium 4, with its one floating-point execution unit, stands a chance against the Athlon with its three floating-point execution units?

But when you put the parallelism control in the hands of the software (either at the compiler level or the programming level), you can spend as much time as you want deciding how to split up a process into multiple threads for execution. Attempting to decide how to parallelize in the processor means there is an extremely limited amount of time to figure out how to do it, which means you just can't make good use of the parallelism.
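
As a small illustration of "the software decides the split" (std::thread used as a stand-in, and the particle update is made up): the programmer already knows the slices are independent, so nothing has to be discovered at run time.

Code:
// Sketch of parallelism decided in software: the programmer splits the work into
// disjoint slices, instead of the CPU trying to discover independence on the fly.
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::vector<float> particles(1000000, 1.0f);
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> workers;
    const std::size_t chunk = particles.size() / n;
    for (unsigned t = 0; t < n; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = (t == n - 1) ? particles.size() : begin + chunk;
        // Each thread owns a disjoint slice, so no locking is needed.
        workers.emplace_back([&particles, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                particles[i] = particles[i] * 0.99f + 0.01f;    // made-up update rule
        });
    }
    for (auto& w : workers) w.join();
    std::printf("updated %zu particles on %u threads\n", particles.size(), n);
}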

The main benefit of a cell-like design as opposed to normal multiprocessing is cost. There's also the added benefit of making communication between the different processors much easier. The main drawback is that all of the processors on one die have to share one front-side bus.
 
Why is this good? Parallelism in most current CPUs is usually wasted, simply because it's up to the CPU itself to decide how to use parallelism, on the fly. Why do you think the Pentium 4, with its one floating-point execution unit, stands a chance against the Athlon with its three floating-point execution units?

Either you chose a really bad example or you're not aware of what FP code actually looks like. IIRC, FP-intensive code is about 20% FP instructions. On top of that, it's important to note that the Athlon's FPU is grossly imbalanced with its memory bandwidth. Furthermore, it's what you pipeline that's important, and then there is SSE2.
 
Saem said:
Either you chose a really bad example or you're not aware of what FP code actually looks like. IIRC, FP-intensive code is about 20% FP instructions. On top of that, it's important to note that the Athlon's FPU is grossly imbalanced with its memory bandwidth. Furthermore, it's what you pipeline that's important, and then there is SSE2.

First of all, I find that 20% number very hard to believe. If it were true, our processors would have integer circuits approximately four times as fast as the floating-point units, and we don't see that happening. Additionally, though I haven't examined the compiled assembly of my programs, in those where I do use lots of floating point, the only calculations that aren't FP are loop counters and logical ops, which are rare compared to the number of FP calculations I do (I would estimate 80%-95% FP calcs from looking at the C++ code).
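
To make the kind of code being argued about concrete, a small invented example: at the C++ level it looks almost entirely like FP math, though the compiled loop also carries the integer counter, compare and addressing work that an instruction-mix count would include.

Code:
// At the source level nearly everything in this loop is floating point; the compiled
// code also contains integer loop, compare and addressing work, which changes the mix.
#include <cstddef>
#include <cstdio>
#include <vector>

void scale_and_accumulate(float* out, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)   // integer: increment, compare, address arithmetic
        out[i] += a[i] * b[i];            // floating point: multiply and add
}

int main() {
    std::vector<float> a(1024, 2.0f), b(1024, 3.0f), out(1024, 0.0f);
    scale_and_accumulate(out.data(), a.data(), b.data(), out.size());
    std::printf("out[0] = %f\n", out[0]);
}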

As for SSE2, I was attempting to make a comparison of pure FPU performance (where the Athlon does win, but I don't think it's close to three times as fast per clock).

Unfortunately, I can't say that I've seen any real data showing how memory-bandwidth-intensive today's FP programs are.
 
SPECfp is a great indicator. It uses a lot of heavily used, real algorithms. It is double precision, but it shows a quasi-real-world example. Before we start getting into "it's a compiler benchmark" and "it's too memory-bandwidth intensive": I'll simply dismiss those arguments - MPUs don't execute C and Fortran natively, you HAVE to compile code, so why not do it well; and the data sets don't fit into cache, and seeing as SPEC is a system-level benchmark, their choice of workloads is fine.
 
I am beginning to think this uni- vs. multi-processor issue is much like CISC vs. RISC. On one hand you have this one "beast" uniprocessor with its memory system; on the other hand you have many simpler "ant" processors, each with its own memory system.

The cellular arrangement is actually in between those two extremes. Instead of a uniprocessor with its own memory, or multiple processors each having its own memory system, they suggest multiple processors with shared memory.

But increasing the number of processors on the same shared memory system increases the chance of contention, so there has to be a limit to the number of processors you can add. So instead of adding more processors to the same shared memory, they make a new cluster of the same thing; they call each cluster a Cell, and the collection of these cells, Cellular.

So I think that's what Cell is, nothing more.

As for 1 TFLOPS on a single die: that's still determined by the process they use and the number of transistors, but also by how efficiently they use those transistors for the actual operations instead of other things.
 
darkblu -

I have to disagree. Writing multi-threaded programs is hard, even in a language like Java, which has some language support for threads.

I would say it is inherently harder than writing single-threaded programs, and it requires a mental paradigm shift for coders, who are mostly used to writing single-threaded applications (IMO people mostly think in a "single-threaded" fashion as well). C, C++ and even Java do nothing to clearly express parallelism, multi-threaded program behaviour, or the effects of having multiple threads running the same code in the source text. They do not provide ways of managing/scheduling parallel subtasks. They do not address issues like exception handling in a multi-threaded program, where the program state at any given time is non-deterministic, or how threads communicate and pass data between each other.

That, IMO, is what makes multi-threaded debugging very hard, regardless of what tool you are using. I think someone needs to produce a language which enables a coder to specify parallelism in the language itself:
i.e. to be able to translate "I want calculation X to be performed for objects Y, and each calculation is independent of anything else going on" into clean code.
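
Something fairly close to that "do X for every Y, all independent" statement can already be written with existing extensions; an OpenMP sketch is below (the loop body is invented, and it needs an OpenMP-aware compiler - otherwise the pragma is ignored and the loop just runs serially).

Code:
// One existing way to state "do this for every object, iterations are independent":
// the pragma is the programmer's declaration of independence; the compiler/runtime
// decides how to split the iterations across threads. The loop body is made up.
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> objects(100000, 1.0f);

    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(objects.size()); ++i)
        objects[i] = objects[i] * 2.0f + 1.0f;

    std::printf("first object: %f\n", objects[0]);
}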


Serge
 
Serge said:
That, IMO, is what makes multi-threaded debugging very hard, regardless of what tool you are using. I think someone needs to produce a language which enables a coder to specify parallelism in the language itself:
i.e. to be able to translate "I want calculation X to be performed for objects Y, and each calculation is independent of anything else going on" into clean code.

I have often given thought to such a language, but haven't gotten far in terms of figuring out how to go about it. I wonder, what about yourself?

One thing that did cross my mind was the OpenBeOS creator's philosophy of "each object is a process". Of course, I wouldn't say EVERY object would be, but something like that.
 
Why would you want Cell to be VLIW? VLIW, from what I understand, is designed to allow for optimal usage of lots of parallel execution units within a single CPU. It seems to me that the two technologies are different means of tackling the same problem, and I see little reason to use something like Cell if you're already using VLIW, or vice versa.
 
Chalnoth said:
Why would you want Cell to be VLIW? VLIW, from what I understand, is designed to allow for optimal usage of lots of parallel execution units within a single CPU. It seems to me that the two technologies are different means of tackling the same problem, and I see little reason to use something like Cell if you're already using VLIW, or vice versa.

Well isn't CELL essentially one big CPU with many small simple execution units?
 
Well isn't CELL essentially one big CPU with many small execution units?

Actually, it's all about the level of abstraction at which you're making that statement. From what I gather, this is what I have in mind when I picture Cell.

There are simple MPs (most likely in-order machines) that can do int, FP, branch and load/store within their cache(s), plus a vector unit - and whatever else I left out. Each MP is connected to a control unit. This control unit assigns threads/processes - whatever you wish to call them. Another unit handles and fulfills out-of-cache access requests and helps stave off data corruption. One could also assign each processor a memory space composed of many smaller spaces - possibly one unshared and one or more shared (for IPC, communicating with devices...). That way each MP handles its own memory space and hands the rest of the task to the "main" memory control/interface unit.

Basically, from the control block's point of view, each processor is a functional unit to which it assigns tasks (threads). On a lower level of abstraction, I'd say no to your question.

As for what instructions the code compiles to, VLIW doesn't sound too hot. Actually, maybe (and I'm going to get nasty replies for this) a CISC instruction set might be in order - no pun intended. Of course, I'm not talking about the number of registers and so on, merely more meaningful/powerful instructions. This would add more decoding logic at the individual MP level, but there should be significant savings in the amount of bandwidth needed to deliver code to each MP, and the I-cache - which is going to be small, I assume, due to the limited transistor budget - would benefit from the "compression".
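
In software terms, the control-unit-plus-MPs picture above is roughly a dispatcher feeding workers that each keep a private scratch area. A loose thread-pool analogy is sketched below - this is not a claim about the real Cell hardware, just the shape of the idea.

Code:
// Loose thread-pool analogy for the control-unit-and-MPs picture above (not a claim
// about real Cell hardware): a shared task queue plays the control unit, and each
// worker "MP" keeps its own local scratch buffer, touching shared data only to fetch work.
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    std::queue<std::function<float(std::vector<float>&)>> tasks;   // the "control unit's" work list
    for (int i = 0; i < 8; ++i)
        tasks.push([i](std::vector<float>& scratch) {              // made-up work item
            scratch.assign(256, static_cast<float>(i));
            float sum = 0.0f;
            for (float v : scratch) sum += v;
            return sum;
        });

    std::mutex m;
    std::vector<std::thread> mps;
    for (unsigned id = 0; id < 4; ++id)
        mps.emplace_back([&, id] {
            std::vector<float> localStore(256);                    // the MP's private scratch
            for (;;) {
                std::function<float(std::vector<float>&)> task;
                {
                    std::lock_guard<std::mutex> lk(m);
                    if (tasks.empty()) return;                     // no work left for this MP
                    task = std::move(tasks.front());
                    tasks.pop();
                }
                std::printf("MP %u got result %f\n", id, task(localStore));
            }
        });

    for (auto& t : mps) t.join();
}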

As an aside, some of you might be interested in the following: http://www.opencores.com/projects/hmta/
 
At what level are you speaking about hyperthreading?

The overall processor or its processing elements?
 
I was envisioning something on the front end of the chip that could whip out instruction fragments to all of the different cores.

I don't know anything about this stuff, obviously, but I thought I would just chime in with an oddball idea. :D
 
A few things you have to realize:

There are significant problems when attempting massive parallelization in a CPU. The main reason is that it is rather common for an instruction ten instructions from now to depend on the result of the code executing now.

So it is far from trivial to just split up a program among different execution units or CPUs in hardware. The hardware must be able to determine that the various pieces of code are totally independent of each other in order to split them up and execute them separately. It turns out that, given the limited amount of time and silicon the processor can spend figuring this out, extra execution units often go unused.
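
A toy example of why this is hard for the hardware alone (invented arithmetic, just to show the shape of the problem): in the first loop every step depends on the previous one, so extra execution units sit idle; in the second, the sum has been split by hand into independent partial sums that can proceed in parallel.

Code:
// First loop: a serial dependency chain; iteration i needs the result of iteration i-1,
// so no amount of extra execution units helps. Second loop: the same kind of work split
// by the programmer into four independent chains that hardware (or threads) can overlap.
#include <cstdio>

int main() {
    const int N = 1 << 20;

    double chained = 0.0;
    for (int i = 0; i < N; ++i)
        chained = chained * 0.5 + i;          // depends on the previous iteration

    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;    // four independent accumulators
    for (int i = 0; i < N; i += 4) {
        s0 += i;
        s1 += i + 1;
        s2 += i + 2;
        s3 += i + 3;
    }
    std::printf("%f %f\n", chained, s0 + s1 + s2 + s3);
}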

The main reason to have what are effectively separate CPUs in a single die is so that the software is able to tell the CPU what pieces of the program can be separated among execution units (this is also the reason for hyperthreading, btw...). In many situations, it really requires programmer input for maximum parallelism, but compilers can, potentially, also put a fair amount in completely on their own.

Regardless, I really hope that some compilers are released in the near future that really work toward easy multithreaded programming and debugging. I wouldn't be surprised at all if Cell-like designs made their way into home systems within two to three years (granted, they wouldn't have 16 CPUs... more like two to four, at first).
 