The question is: is CELL right for a game console?

Should CELL be used in a GAME CONSOLE?

  • Yes, because it's the future technology for consoles (Votes: 0, 0.0%)
  • Other thoughts (post below) (Votes: 0, 0.0%)

  Total voters: 207
I think Cray just made his stuff way too expensive to be used in the consumer market (dipping transistors in gallium arsenide or some other ridiculously expensive process). Cray designs are too specialized. Cray was after speed at any cost... where cost is something no one can afford to ignore in the consumer market.

It's a circular argument, no?

GHz drives sales... now core counts will.

Don't get me wrong, though: I do agree that desktops haven't been pushed for more than single-task efficiency until recently, and that is of course a big factor.

I just wanted to say...something.
 
Also bear in mind that, of the additional millions of transistors put into the next great single CPU, fewer and fewer actually go toward boosting performance; instead they go toward the extra pipeline stages that allow the thing to function at such elevated clock rates. In simpler terms, the single CPU is not necessarily getting faster in proportion to the additional transistors that are put in. Efficiency is dropping, and it is only by brute clock rate that the thing even manages to come out ahead for the transistor increase. Looking ahead, the areas of improvement are rapidly running thin given the strategy of simply making a bigger single CPU. This may not be true forever, but it is true of the technology we have currently. Hence the current trend to simply augment cache sizes on existing CPUs, or to employ multiple cores to use up die space, rather than come up with an even larger CPU.
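To put a rough number behind that efficiency argument (a textbook-style approximation on my part, not data for any specific chip): single-thread performance is roughly

\[
\text{Perf} \approx \frac{f}{\text{CPI}}, \qquad
\text{CPI} \approx \text{CPI}_{\text{ideal}} + f_{\text{branch}} \cdot m_{\text{mispredict}} \cdot c_{\text{flush}}
\]

and the flush penalty c_flush grows roughly with pipeline depth, so the extra stages that buy you a higher f also push CPI up. That is why the added transistors buy less performance than the clock bump alone would suggest.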

So it is not really a case of parallel processing units being proposed as an optional alternative to the single CPU. The single CPU route is clearly topping out, so the only logical mode of growth is to go parallel. Where unavoidably sequential tasks are concerned, a single node of these parallel setups is usually quite comparable to single CPU performance anyway, so the point is largely moot.
 
andypski said:
I believe that the statements that I made are true, in the sense that they were not specified in absolute terms. If you take an inherently serial task, and run it on a single core that is dedicated to running serial tasks quickly, and then you have a dual core with the same number of transistors running that same serial task then, other aspects of design being equal, I would expect the single core to win.
Of course, from a theoretical point of view your statements are correct. A hypothetical single-core CPU with x million transistors running at f GHz should, in a lot of cases, outperform a multi-core CPU with x million transistors running at f GHz.

andypski said:
Problems with pushing clock speeds and higher transistor densities are a separate issue.
They're not a separate issue, since those problems are due to a physical limit that Intel and others are starting to reach. That means we cannot, or more exactly should not, discuss a subject like single-core versus multi-core in abstraction from physical realities.

Saying that today, and tomorrow, a multi-core will pack more raw power than a single-core CPU is, for me, stating the obvious. Now, will this raw power be useful in every scenario? That I honestly can't answer in one simple sentence, and it's also a completely different issue.

andypski said:
But with the same transistor budget I see no reason why they couldn't design a single core that would still outperform their dual core on serial tasks.
I have to agree that IF Intel weren't driven by market realities, they might try to "push" the physical barrier a little further before going the MC route, that's true. But what's also true is that the physical barrier will soon be completely impossible to push any further, so sooner or later they'd end up doing what they're actually doing.

andypski said:
Why? Are consoles magic?

I would actually have said that in terms of multiple core processors, where each core is an equal citizen, the PC market actually makes a lot of sense. In the PC space you have complex operating systems running multiple execution threads for entirely separate processes that may never need to synchronise or communicate. Bingo, big speedup in these cases by having multiple cores.
Actually, a game has a lot of threads: TnL, 3D sound, complex physics, complex AI, etc. Therefore a CPU with more than two cores makes a lot more sense in a console than the same multi-core does in a PC, since a PC would very rarely have more than two resource-demanding threads running at the same time (except PC games... if they were optimized, that is).
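To picture that split in code, here's a minimal sketch (the subsystem names and empty workloads are mine, not from any real engine): one thread per heavyweight game subsystem.

Code:
#include <thread>

// Hypothetical subsystem functions; real work would go in the bodies.
void transform_and_light() { /* vertex TnL */ }
void mix_3d_sound()        { /* 3D sound mixing */ }
void update_physics()      { /* rigid bodies, collisions */ }
void update_ai()           { /* pathfinding, decisions */ }

int main() {
    std::thread tnl(transform_and_light);
    std::thread sound(mix_3d_sound);
    std::thread physics(update_physics);
    std::thread ai(update_ai);

    // On a console CPU with four or more cores each of these could map onto
    // its own core; on a typical desktop most would contend for one or two.
    tnl.join();
    sound.join();
    physics.join();
    ai.join();
}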

andypski said:
Anyway, my purpose is not to argue that either multiple cores or single cores are 'superior' in some way. I was just pointing out that I don't believe the original statement...

"A single processor system will never reach the performance of a multi-CPU system."

...is one that is fundamentally true in any way.
As I said, theoretically and in the absolute you're right. But personally, I was arguing the subject in the context of today's market and today's R&D problems.

So, in other words, we agree, we're just discussing the subject from different POV. :D
 
Vysez said:
Actually, a game has a lot of threads: TnL, 3D sound, complex physics, complex AI, etc. Therefore a CPU with more than two cores makes a lot more sense in a console than the same multi-core does in a PC, since a PC would very rarely have more than two resource-demanding threads running at the same time (except PC games... if they were optimized, that is).
Yes a game has a lot of potential threads, but they are not independent - they are synchronous.

In the PC example that I gave I am running DVD decoding on one CPU, and running something like IE with a flash animation on the other. Neither of these has any dependency on the other, hence they can execute completely asynchronously.

In the case of TnL, 3D sound, etc. they all have to operate in a synchronous way. In the most basic example of this, my sound has to be synchronised with the action on screen (i.e. the TnL and graphics).

One result of this is that I can't process an arbitrary amount into the future to use my execution resource fully - the total amount of sound processing that I can do is limited by what has happened in the game, which is running at a fixed rate.

What's the result of this? Well, let's say I make the simplest split of work possible - sound on one processing unit (PU), TnL on another, physics on another, etc. We can quickly see that my overall throughput may well become tied to the longest processing time for one task on any single PU - the other PUs are relying on that task finishing to generate the data for the next set of tasks they have to do (I have to know where my monsters will be and what they are doing before I can decide what sounds occur etc.)
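To make the gating concrete, here's a toy sketch (the millisecond costs are invented): the frame forks one task per PU and then waits for all of them, so the frame time ends up being roughly the longest single task rather than the total work divided by the number of PUs.

Code:
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

// Toy stand-in for a per-frame task; the cost just burns time.
static void run_task(int cost_ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(cost_ms));
}

int main() {
    const auto start = std::chrono::steady_clock::now();

    // "Simplest split": one task per PU, all forked at frame start.
    auto tnl     = std::async(std::launch::async, run_task, 4);  // 4 ms
    auto physics = std::async(std::launch::async, run_task, 9);  // 9 ms - the gating task
    auto sound   = std::async(std::launch::async, run_task, 2);  // 2 ms
    auto ai      = std::async(std::launch::async, run_task, 5);  // 5 ms

    // The frame can't complete until every task has finished...
    tnl.get(); physics.get(); sound.get(); ai.get();

    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::steady_clock::now() - start).count();
    // ...so it takes ~9 ms (the longest task) while the other PUs idle for 4-7 ms each.
    std::printf("frame time: ~%lld ms\n", static_cast<long long>(ms));
}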

So this is inherently inefficient - some of my PUs will finish ahead of others and start to starve, which gives me poor utilisation. If my longest task is itself parallelisable then I might be able to mitigate this and rebalance things by using multiple PUs for this task, but if the gating task is serial then I can't.

I might want to dynamically redistribute tasks to get a better execution balance, but now I run into the communication issue - to switch tasks around between PUs I have to communicate the appropriate state for each task around, redistributing information. The amount of information this requires may vary from small to very large, depending on the task. If it's very large then I can't redistribute it very often (or maybe not at all).

If all my state for all tasks is static, and I have a large local store on each PU, then I may be able to keep the state for all my different tasks on every PU locally, and switching overhead is low. If my state is dynamic and changing then I may not be able to get around the overhead of copying data around, because I can't keep an up-to-date copy on all my PUs all the time.

If my total state is too large to be held in local memory then I will need to swap out the state for one task to swap in the state for another - and what happens if later I decide I want to run that first task again? Yup - I need to swap it back in. So I may need to add some sort of hysteresis to my distribution of tasks to avoid thrashing back and forth between states, causing unnecessary copies.
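For the hysteresis idea, a crude sketch (the cost model and all the numbers are entirely invented): only move a task to another PU when the expected gain comfortably beats the cost of copying its state, so small load fluctuations don't bounce tasks and their state back and forth.

Code:
#include <cstdio>

// Invented cost model, for illustration only.
struct Task {
    double run_ms;    // time this task takes per frame
    double state_kb;  // working state that would have to be copied to move it
};

// Decide whether moving a task from a busy PU to a less loaded one is worth it.
// copy_cost_ms_per_kb and hysteresis_factor are made-up tuning knobs.
bool should_migrate(const Task& t,
                    double src_pu_load_ms,
                    double dst_pu_load_ms,
                    double copy_cost_ms_per_kb = 0.01,
                    double hysteresis_factor   = 2.0)
{
    const double gain_ms = src_pu_load_ms - (dst_pu_load_ms + t.run_ms);
    const double copy_ms = t.state_kb * copy_cost_ms_per_kb;
    // Require the gain to beat the copy cost by a clear margin; otherwise the
    // scheduler thrashes the task (and its state) between PUs.
    return gain_ms > hysteresis_factor * copy_ms;
}

int main() {
    const Task physics{9.0, 512.0};  // 9 ms/frame, 512 KB of state (made up)
    std::printf("migrate physics? %s\n",
                should_migrate(physics, /*src load*/ 16.0, /*dst load*/ 3.0) ? "yes" : "no");
}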

This is why the case of a multi-threaded, multi tasking OS can map well to multiple cores, whereas trying to parallelise what, in essence, is a single task - "Run a game" - can cause a lot more headaches.

As I said, theoretically and in the absolute you're right. But personally, I was arguing the subject in the context of today's market and today's R&D problems.
So, in other words, we agree, we're just discussing the subject from different POV. :D

Agreed. (At least to some extent ;))

The types of opportunities for multiple cores on the PC desktop are very different from those in consoles, and getting good performance boosts is (relatively speaking) a simple thing for multiple independent tasks. So today's market and R&D problems are definitely pushing towards a migration to multiple cores on the modern PC desktop.

This is not at all the same thing as the expected multi-processing in a console environment, and I don't think the pressures are the same, which is why I thought the reference to Intel's current plans was not necessarily valid in the space of this debate.

- Admitting that you can get better performance running independent applications on a PC desktop with multiple cores is one thing - here multiple cores have many advantages, not least that they can actually _avoid_ the context switches that need to occur on a single core when moving from task to task.

- Saying that you can get better performance from multiple cores by extracting parallelism from a single task with multiple dependent communicating subtasks is a different matter, and far more complex. While it's certainly possible, it's not easy. :)
 
andypski said:
It has been for many years in many fields - why aren't all desktop processors massively parallel, if your statement is true? Why don't all desktop processors have a Cray-style massively parallel design, with vector processing units all over the place?
As I understand things, given a tranny count of 1 million, you'll get better performance from a single 1-megatranny core than from 5x200,000-tranny cores. At 100 million trannies, you'll maybe get pros and cons between a 1x100-megatranny core and 5x20-megatranny cores. But as size increases, those extra trannies are gonna bomb out for a single core. Do you think a linear single-core CPU could make use of 1,000 million transistors?

Up 'til now, CPUs have been at the lower sizes where a single core was more efficient. Now we're closing on the limits of single-core size, where bigger means less benefit, and by spending those trannies on going dual-core we get speed boosts. That's why there's talk of multi-core CPUs appearing in desktops now and not 10 years ago, even though the desktop computing model already merited multi-core processing back then. A Pentium could perform more in one large core than with those same transistors spread over many cores.

I'm no expert, but I've never heard of a single-core CPU pushing 250 GFLOPS, where multicore Cells and GPUs can. For more performance from increasing chip sizes, you have to go multicore (or maybe have so many megabytes of cache that entire programs are loaded into it instead?!)

andypski said:
What's the result of this? Well, let's say I make the simplest split of work possible - sound on one processing unit (PU), TnL on another, physics on another, etc. We can quickly see that my overall throughput may well become tied to the longest processing time for one task on any single PU - the other PUs are relying on that task finishing to generate the data for the next set of tasks they have to do (I have to know where my monsters will be and what they are doing before I can decide what sounds occur etc.)

This is true of a single core processor, where it will have to wait for the TnL, AI etc. to finish before processing the sound.

So this is inherently inefficient - some of my PUs will finish ahead of others and start to starve, which gives me poor utilisation. If my longest task is itself parallelisable then I might be able to mitigate this and rebalance things by using multiple PUs for this task, but if the gating task is serial then I can't.

Yes, it's inefficient, but it's still faster! A multicore system will be tied to the slowest process; a single core will be tied to how quickly it can process everything. I don't think anyone's expecting Cell to run at anywhere near 100% efficiency. I don't know of any single-core solution that could approach the processing grunt of Cell or another multicore setup (dual PPC, for example), unless the parallelism is so inefficient that more time is spent waiting than processing.

The only real-world example I can think of, being no expert, is the Amiga computer. During the '80s the PC's single CPU was vastly more powerful than the Amiga's CPU and gubbins, but by having multiple processors running in parallel, the Amiga outclassed the PC in pretty much everything except raw processing (raytracing, for example) for many years. Parallel has design problems you have to work around, but those don't instantly mean the design will be running choked most of the time. Though Cell may have occasions where PUs are sitting idle, the majority of the time they'll all be busy, and that means more work is being done for the transistor count than spending it on a more efficient but slower single-core design.
 
Shifty Geezer said:
I'm no expert, but I've never heard of a single-core CPU pushing 250 GFLOPS, where multicore Cells and GPUs can. For more performance from increasing chip sizes, you have to go multicore (or maybe have so many megabytes of cache that entire programs are loaded into it instead?!)
Yes - single CPU cores are not GFlop monsters - they have never been designed to be, and also it is easier to get a large raw GFlop number to quote by going wider rather than deeper. Of course, a raw GFlop number is not the be-all and end-all of performance, depending on what task you are trying to perform. GFlops might tell you how fast you can (theoretically) crunch numbers, but number crunching isn't everything.

If you were to take an original Pentium (a dual-issue in-order CPU) and modify it to run at 3GHz with 8 4-D vector coprocessors (that's 96 GFlops) and then run Windows XP on it, and then do the same thing on a P4 at 3GHz, which do you think would give you the better experience?
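For what it's worth, that 96 GFlops figure works out if each of the 8 vector units is assumed to retire one 4-wide single-precision operation per cycle (my assumption; count a fused multiply-add as two flops and the number doubles):

\[
8 \text{ units} \times 4 \text{ lanes} \times 1 \text{ flop/lane/cycle} \times 3\,\text{GHz} = 96 \text{ GFlops}
\]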

I believe it would be the P4. The multi-core, Cell-style Pentium would be far less efficient at executing code in this environment (partially because its extra cores are not additional Pentiums, but specialised vector maths units).

So now give them both a fast GPU and then run games - which will give you a better experience?

I'm really not sure - I would expect the P4 to scream through many of the housekeeping tasks of running a game far more quickly than the Pentium - the P4 is much better at general code execution. The Pentium with clever coding to use its co-processors may then massively outperform the P4 on other tasks.

The point is that I don't think it's clear-cut in this case where the transistors were best spent. It's not a slam dunk.

I'm really not against parallel processing at all - I think it's very exciting getting high speed vector processing into our hands, and it looks like it's the future, but it's not magic.
andypski said:
What's the result of this? Well, let's say I make the simplest split of work possible - sound on one processing unit (PU), TnL on another, physics on another, etc. We can quickly see that my overall throughput may well become tied to the longest processing time for one task on any single PU - the other PUs are relying on that task finishing to generate the data for the next set of tasks they have to do (I have to know where my monsters will be and what they are doing before I can decide what sounds occur etc.)
This is true of a single core processor, where it will have to wait for the TnL, AI etc. to finish before processing the sound.
Yes, but the single-core may run each individual task significantly faster than the multi-core. Despite the lack of massive GFlops there is still branching, general code execution and data efficiency to consider. The serial processor, if running at high efficiency, can still do better than the multi-core if the multi-core is running inefficiently.

If the serial core runs the longest individual (unparallelisable) thread in half the time of the multi-core then that's the amount of time the single core still has in which to execute the other tasks in order to come out even with the multi-core.

The more tasks there are that can be parallelised, and the larger each individual task, the less likely the single core is to win. So the multi core will certainly win comfortably on many occasions, but not all.
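Put crudely (my own framing of the trade-off, not andypski's exact numbers): if the tasks take times t_1, ..., t_n on one of the parallel PUs, and the big serial core runs each of them k times faster, then the multi-core frame is gated by the longest task while the serial core needs the scaled sum, so the serial core still comes out ahead whenever

\[
\frac{1}{k}\sum_{i=1}^{n} t_i \;<\; \max_i\, t_i
\]

which tends to happen when one serial task dominates; spread the work across many similar-sized, parallelisable tasks and the inequality flips in the multi-core's favour.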

So this is inherently inefficient - some of my PUs will finish ahead of others and start to starve, which gives me poor utilisation. If my longest task is itself parallelisable then I might be able to mitigate this and rebalance things by using multiple PUs for this task, but if the gating task is serial then I can't.
Yes, it's inefficient, but it's still faster!
No! Not necessarily - it is task dependent as to whether it is faster or not

The only real-world example I can think of, being no expert, is the Amiga computer. During the '80s the PC's single CPU was vastly more powerful than the Amiga's CPU and gubbins, but by having multiple processors running in parallel, the Amiga outclassed the PC in pretty much everything except raw processing (raytracing, for example) for many years
Now that's certainly an interesting example to pick, because I think there's a much closer parallel to choose. The Amiga was around at about the same time as the Atari ST, which used the same CPU, but running slightly faster. The Amiga had very useful and versatile custom chips for processing many graphics and sound tasks and the ST did not. The Amiga certainly outclassed the ST for graphics, but not by as wide a margin as you might expect from raw specs, because a lot of graphics tasks were memory throughput bound, and since neither CPU had any caches you couldn't really get that much parallelism occurring. As such (with some reasonably clever programming) you could do a lot of stuff on the ST that was equivalent to the Amiga, even without custom coprocessing to help.

Now, if you took the transistor budget for all the custom chips, and gave the ST, say, a 68020 processor, or even maybe just a 16MHz 68000 then I would have backed the ST to perform better than the Amiga at pretty much everything, and I guess that's the point. Inefficient and parallel is not necessarily better than efficient and serial.
 
This is like RISC v. CISC all over again. The truth will probably lie somewhere in the middle (dependent on the task). I'm sure no one has modeled where the sweet spot would lie for games (especially if your GPU branches efficiently). ;)
 
andypski said:
No! Not necessarily - it is task dependent as to whether it is faster or not
And by your reckoning, in a console the environment is more suited to a single core (seeing as this is about whether Cell's suited to the PS3)? Having discussed the pros and cons, where do you see Cell gaming going? Do you think the apulet model is going to suffer horrendous bottlenecks that limit performance to 50% or less?

From my limited perspective I find this hard to imagine and can only think the tranny budget couldn't have been put to better use on a single core. I can't get over the idea of a massive single-core processor... :p

As such (with some reasonably clever programming) you could do a lot of stuff on the ST that was equivalent to the Amiga, even without custom coprocessing to help.
Oooo, the old Amiga vs. ST debate! Better hide this quick!

Now, if you took the transistor budget for all the custom chips, and gave the ST, say, a 68020 processor, or even maybe just a 16MHz 68000 then I would have backed the ST to perform better than the Amiga at pretty much everything, and I guess that's the point. Inefficient and parallel is not necessarily better than efficient and serial.
I disagree. The Mac had higher-spec CPUs than the Amiga but couldn't match it in performance. I had a friend with a 120MHz 486 DX4, vastly more CPU grunt, and it couldn't handle much of the Amiga's capabilities. This is getting into hazy ground as there's system architecture at work, and we all know the original IBM AT wasn't the best starting block for efficient computing, but throughout computing I've seen multiple parallel (albeit specialised) coprocessors as more efficient than a honking great CPU-does-everything model.

Anyway, we can agree that multi-processor vs. single-processor depends on the task, but in the context of the PS3 you haven't said whether you think Cell could work or not. I'm curious about your take on this matter: do you think a honkin' great single-core CPU would be better?
 
nelg said:
This is like RISC v. CISC all over again.
Ah, I'm glad someone else noticed. :)

The "make it simple, make it fast" mantra hasn't taken over anything yet, and I doubt it will this time either.
 
I remain sceptical about those bold numbers, especially when I remember the Sega Saturn: a similar bunch of processors, extremely difficult to program.
 
Shifty Geezer said:
andypski said:
No! Not necessarily - it is task dependent as to whether it is faster or not
And by your reckoning, in a console the environment is more suited to a single core (seeing as this is about whether Cell's suited to the PS3)? Having discussed the pros and cons, where do you see Cell gaming going? Do you think the apulet model is going to suffer horrendous bottlenecks that limit performance to 50% or less?

From my limited perspective I find this hard to imagine and can only think the tranny budget couldn't have been put to better use on a single core. I can't get over the idea of a massive single-core processor... :p
I think that the APUlet model is certain to have widely varying efficiencies during different parts of the work. Sometimes it may be running with a high degree of efficiency (if it is used for complex vertex shading, for instance, I could see it getting quite high utilisation). At other times I expect the APUlets could well get 0 utilisation (or close to it). The part that I'm not at all certain about is whether in general it will tend towards the higher or lower end of the scale. My suspicion would be towards the lower end due to the inherent difficulties in parallelising many tasks.

Even tending towards the lower efficiencies it would still be a potent calculating engine, just not as mindblowing as the numbers like to suggest.

Let's get this straight - fundamentally the situation could be that the transistor budget is well spent - no-one knows for sure as yet. I certainly don't. Once we start seeing applications then we will have a better idea - if utilisation of the APUlets is poor then clearly they weren't transistors well spent.

And I don't understand why you have difficulties understanding the idea of using a massive single-core processor. Today that is what most of us have in our machines on our desktops - they're all around us. Such a processor would represent a fairly massive performance increase over the last generation of consoles, on the order of 5->10x, with that level of performance immediately achievable in real-terms with no extra programming effort or paradigm shift.
Now, if you took the transistor budget for all the custom chips, and gave the ST, say, a 68020 processor, or even maybe just a 16MHz 68000 then I would have backed the ST to perform better than the Amiga at pretty much everything, and I guess that's the point. Inefficient and parallel is not necessarily better than efficient and serial.
I disagree. The Mac had higher-spec CPUs than the Amiga but couldn't match it in performance.
We weren't talking about the Mac - it's hardly fair to compare a system where typically nothing was ever really coded to the metal with one like the Amiga or ST where everything was coded to the metal.

The Amiga vs. ST comparison is far more valid.

I had a friend with a 120MHz 486 DX4, vastly more CPU grunt, and it couldn't handle much of the Amiga's capabilities.
I don't doubt that there are some effects that you saw on the Amiga that were difficult to reproduce on the PC of the time, but these are far more likely to be down to other architectural limitations in the PC than any inability of the 486 to emulate pretty much any effect that the Amiga's coprocessors could do. And at a much higher speed to boot.

This is getting into hazy ground as there's system architecture at work, and we all know the original IBM AT wasn't the best starting block for efficient computing, but throughout computing I've seen multiple parallel (albeit specialised) coprocessors as more efficient than a honking great CPU-does-everything model.

Yes - when the parallel, specialised coprocessor is something like a GPU - tightly engineered to a specific task, it does it very fast. Far faster than a CPU at the same job, no doubt, but multiple cores in the sense that we are talking about are not truly specialised, they are instead rather generalised.

Without tailoring the hardware to a specific task you throw away a huge amount of efficiency. You then have to try to get back as much as you can on a specific task with clever programming.

Anyway, we can agree that multi-processor vs. single-processor depends on the task, but in the context of the PS3 you haven't said whether you think Cell could work or not. I'm curious about your take on this matter: do you think a honkin' great single-core CPU would be better?
Of course I think Cell may work - it would be silly to dismiss it given the large amounts of raw power there to be tapped. It's how well it can be tapped that I am unsure of. And yes, I also think that for many things a honkin' great single core CPU might be better. There's simply no way to say definitively as yet - it was only my objection to claims that it is possible to be definitive about such things that got me started on this debate in the first place.
 