ISSCC 2005

nelg said:
DemoCoder said:
Cell will accelerate vector and multithreaded oriented tasks -- coding/decoding, rendering, compression/decompression, speech and handwriting recognition, sound and some device drivers, CAD, digital content creation, simulation, games, and other desktop oriented tasks.

It won't accelerate MS Office tasks, but scalar CPUs have diminishing returns in that area anyway. It won't accelerate server tasks like Web serving, database execution, socket servers, etc., because frankly those tasks are more amenable to a granular multithreaded approach, they are not stream oriented, and are predominantly scalar. On the other hand, I think Intel and AMD are going to get their ass handed to them in the server arena in the future, because cheap, low-power, less complex chips can be built to handle server tasks, and Linux and open source mean most applications can be easily recompiled for these commodity server-tuned systems, so there's no Microsoft lock-in effect. It may take a few years, but in the server space I think the pendulum is going to swing back the other direction, away from x86 towards other architectures.

Would an x86 CPU plus an R5xx (assuming good branching), able to communicate as effectively as a Cell processor + a GPU, be any better or worse? Is there something specific about Cell that would make it better for "vector and multithreaded oriented tasks -- coding/decoding, rendering, compression/decompression, speech and handwriting recognition, sound and some device drivers, CAD, digital content creation, simulation, games, and other desktop oriented tasks" than using a GPU for such tasks?

There is no x86 CPU with 70-80 GB/s of off-chip, non-memory communication bandwidth. The best, the full-blown Opterons, have less than a tenth of that. There is no x86 CPU with an on-chip communication fabric dedicated to interconnecting the different A/SPUs.

In short, there is no way in hell an x86 processor comes even close to offering what this first Cell prototype does, with or without an R500 GPU. That is not to say that an x86 processor + R500 GPU won't be able to produce some excellent games, or that there won't be specific things it might even be better suited for - running x86 Windows software being the most prominent, of course.

As has been pointed out, this first sample is significant not only for what it offers in and of itself, but also for the simplicity with which it can be both extended and simplified. It's a new approach, and arguably a better one. It doesn't have the awesome inertia of the Wintel infrastructure behind it though, and IMHO it would be silly to see it as a threat to the Wintel hegemony. You need only look at how long IBM's 360 architecture maintained its hold in its field to see how little architectural merit has to do with market dominance. The Wintel stronghold is administrative computing, and administrations care very little about entertainment performance. x86 evolution lately hasn't been driven by the interests of administrative use though, so it will be interesting to see where it goes next. Intel's adaptation of laptop CPUs for the desktop makes a lot of short-term sense, and focusing on ergonomics, security and communication seems a reasonable direction for x86's future.
 
Entropy, they could easily make an Athlon 64 with a new memory controller on die and get 70-80 GB/s of bandwidth.

They don't do that because of the costs in a PC, and the fact that the CPUs don't really need that much bandwidth for what they need to do.

Cell is a nice chip, but it isn't as amazing as some think and it won't replace x86 on the desktop. The only thing that will replace x86 is x86-64.

The x86 chips will go multi-core and then adopt some of the things in Cell and other tech to make them faster and better.
 
DemoCoder said:
I think Intel and AMD are going to get their ass handed to them in the server arena in the future, because cheap, low-power, less complex chips can be built to handle server tasks, and Linux and open source mean most applications can be easily recompiled for these commodity server-tuned systems, so there's no Microsoft lock-in effect. It may take a few years, but in the server space I think the pendulum is going to swing back the other direction, away from x86 towards other architectures.

This remains to be seen. I'm only aware of two efforts that are specifically trying to design for massively multicore devices.

Sun's Niagara and Azul Systems' Vega. Niagara is an eight-core device, each core being 4-way multithreaded; it won't run much above current US-III speeds (according to Chris at Ace's) and will take up 340 mm^2 in a 90 nm process. Niagara is a single-socket design, i.e. it's one chip, one system.

However, these are not likely to be simple, cheap and low power. Their only chance to get a foothold is to have superior performance.

IMO Niagara is stillborn: 4-core Opterons will be around 400 mm^2 and most probably faster than one Niagara chip, and furthermore you can basically add as many as you like to a system. I'm sure Intel will have something similar, and the x86 juggernaut has a huge economy-of-scale advantage. But the Niagara chip is probably built more as a proof of concept than anything else.

Azul's effort is much more promising: 24 cores in one device (the Vega chip), and it's possible to connect up to 16 sockets coherently for 384-core systems.

Cheers
Gubbi
 
Gubbi said:
DemoCoder said:
I think Intel and AMD are going to get their ass handed to them in the server arena in the future, because cheap, low-power, less complex chips can be built to handle server tasks, and Linux and open source mean most applications can be easily recompiled for these commodity server-tuned systems, so there's no Microsoft lock-in effect. It may take a few years, but in the server space I think the pendulum is going to swing back the other direction, away from x86 towards other architectures.

This remains to be seen. I'm only aware of two efforts that are specifically trying to design for massively multicore devices.

Sun's Niagara and Azul Systems' Vega. Niagara is an eight-core device, each core being 4-way multithreaded; it won't run much above current US-III speeds (according to Chris at Ace's) and will take up 340 mm^2 in a 90 nm process. Niagara is a single-socket design, i.e. it's one chip, one system.

However, these are not likely to be simple, cheap and low power. Their only chance to get a foothold is to have superior performance.

IMO Niagara is stillborn: 4-core Opterons will be around 400 mm^2 and most probably faster than one Niagara chip, and furthermore you can basically add as many as you like to a system. I'm sure Intel will have something similar, and the x86 juggernaut has a huge economy-of-scale advantage. But the Niagara chip is probably built more as a proof of concept than anything else.

Azul's effort is much more promising: 24 cores in one device (the Vega chip), and it's possible to connect up to 16 sockets coherently for 384-core systems.

Big iron servers aren't terribly sensitive to CPU cost - they have associated costs that render savings on the CPU portion pretty irrelevant - which is why designing CPUs for server use makes some sense: you can charge good money for them. The drawback of course is that you have to show a substantial benefit over commodity parts, and that there just aren't all that many high-end servers sold, compared to, for instance, small rackmounts or Dell ultracheapos.

It's a reasonable assumption that x86 will continue to rule the low end. The Opteron is a nice chip, but it hasn't been terribly successful in terms of high-end servers - yet. And I'm not sure it will be, because again, high-end servers depend on so much more than just a bang-for-buck CPU. (The Opterons deserve to wipe the floor with the Xeons, though. But note that they don't in the marketplace, despite rather obvious technical superiority.)

While the assumption DemoCoder makes seems reasonable, note that IBM has instead successfully sold multiple complex cores on MCMs, combining maximum per-thread performance with high levels of core integration. I think they know their customers, and this indicates that IBM's customers still want as much performance per thread as they can get. The niche I can see for a relatively cheap, multicore, server-oriented CPU is the mid-range, now dominated by 4-8-way systems. The Opterons would seem worthy opponents there, but Intel has nothing particularly attractive IMHO. It would be interesting if IBM, for instance, produced, say, 4- or 8-core CPUs descended from the PE seen in the CELL prototype for just this niche. I have heard no such rumours though, but then business servers interest me only tangentially.

That the future is multiple cores per chip is pretty unquestionable. IBM has already been there for a couple of generations, Sun is moving there and has made ambitious (if perhaps somewhat flawed) designs, and Intel is going there (and both Intel and AMD in x86 space). With increasing levels of integration, could it ever be otherwise for servers?

The question is rather to what extent the underlying architecture is modified to take advantage of the knowledge that there are going to be several cores on the same piece of silicon.
 
jvd said:
Entropy, they could easily make an Athlon 64 with a new memory controller on die and get 70-80 GB/s of bandwidth.

No, they couldn't. (And the memory controller only talks to memory, not to associated logic chips.) If it weren't difficult, there would have been no need for Sony to pay license fees to RAMBUS. Packaging, maintaining signal integrity, pin counts, et cetera ad nauseam - bandwidth costs real money. Which brings us to...

They don't do that because of the costs in a PC and the fact that the CPUs don't really need that much bandwidth for what they need to do.
True. The PC we know today wouldn't benefit greatly from that kind of bandwidth for the tasks it is typically asked to do. And supplying it with such capabilities would increase costs for everybody - not good for an industry that needs growth in less affluent countries, and needs western administrations to keep replacing their hardware at the current pace.

Cell is a nice chip, but it isn't as amazing as some think and it won't replace x86 on the desktop. The only thing that will replace x86 is x86-64.
Well, even this first Cell chip is (IMHO) amazing - but not for all applications.
Regarding replacing x86 - it depends on what you're talking about. Underneath your Windows desktop - probably never, that is indeed x86-64 land.

The x86 chips will go multi-core and then adopt some of the things in Cell and other tech to make them faster and better.
Of course.
But conversely, they will never go the full distance. x86 is an architecture developed almost 30 years ago that is asked to do many jobs, and the really important one of those is keeping administrations going. You can't expect it to evolve into something optimized for gaming/media, because that is simply not the main purpose of the infrastructure that keeps it going.
Not to mention that rebuilding x86 with Cell capabilities would require new programming methods, new compilers and tools, new or rewritten software to take advantage of the new features... You would have little to no benefit from your x86 legacy. So why would you insist on hobbling your spanking new design by keeping crufty old x86 at its center? It would make no sense. You would be better off emulating x86 if need be during a transitional period. (And I've heard Intel representatives express some fear of this.)
But again, such a rebuilding of x86 would be completely unjustified for the vast majority of x86 users. What interest would they have in paying for such chip capabilities? Personally, I see the move towards multi core as the latest attempt from Intel to keep ASPs up. It remains to be seen if the marketplace will follow the performance bait one more turn in the dance, or if the movement towards lower desktop ASPs is inexorable. Dual core - perhaps, eventually. Or not. Quad core - I just can't see the value ever being there, compared to the money you'll save by staying with the corresponding dual core. Adding cores is extremely unlikely to go far in PC space. IMHO.
 
Entropy said:
Well, even this first Cell chip is (IMHO) amazing - but not for all applications.
Regarding replacing x86 - it depends on what you're talking about. Underneath your Windows desktop - probably never, that is indeed x86-64 land.

Until we know more about the CELL programming model we can't really say. One problem I see with CELL is that the local storage scratchpad memory isn't kept coherent with the rest of the memory system. This means that the local storage itself is part of the SPU's internal state and has to be saved on a context switch. The problem is compounded by the fact that each SPU can DMA to/from other SPUs' local storage.

So IBM had better come through with a clever way of virtualising SPUs, or CELL will be disastrously difficult to program in a multiprocessing environment (like, say, Linux or Windows).

Unless of course you're not allowed to program the SPUs directly. IMO this is the most likely situation: SPUs will either only run managed code, or their capabilities will only be exposed through libraries.

The situation is likely to be different for PS3; developers are likely to be allowed to program to the metal (or Faf will be disappointed).

As for CELL replacing x86 on the desktop, won't happen. Ever. It's not about hardware, it's about software and infrastructure.

Cheers
Gubbi
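
To put rough numbers on the coherency point above: going by the figures disclosed at ISSCC (a 256 KB local store and a 128-entry, 128-bit register file per SPE), a full SPE context switch would have to move roughly the amount of state computed in this back-of-the-envelope C sketch (those two figures are the only inputs; the rest is arithmetic):

    #include <stdio.h>

    int main(void)
    {
        const unsigned local_store   = 256 * 1024;  /* bytes of local store per SPE */
        const unsigned register_file = 128 * 16;    /* 128 registers x 16 bytes each */
        const unsigned per_spe       = local_store + register_file;

        /* a context switch has to save the outgoing context and restore the incoming one */
        printf("state per SPE context: %u KB\n", per_spe / 1024);
        printf("moved per switch (save + restore): %u KB\n", 2 * per_spe / 1024);
        return 0;
    }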
 
nAo said:
Thank you rendezvous, I missed Danack's question and your answer, but I'm still not that convinced.
There is at least one vector ISA out there that uses a prefix stack to address this issue.
You get the trivial stuff like broadcasts through explicit instructions, and add all the complex stuff through decorators (swizzling, absolutes/negates, constant replacement, saturation, etc.).
Now - yes, this adds extra instructions to push decorators onto the stack, but they are no-latency, no-dependencies instructions, and moreover they would be a perfect fit for the odd pipeline.

Basically it's similar to what Deano described, just quite a lot more powerful than a simple permute, and with fewer limitations on execution order.

Now, of course I don't know if something like this is actually used in SPU - but it would make sense to me. The prefix management for all I described above can be done in as little as 2-3 extra instructions in the ISA...
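
For illustration, here is a small toy model in C of the prefix/decorator idea described above. It is purely hypothetical - it models the mechanism (cheap prefix instructions that set swizzle/negate modifiers consumed by the next vector op), not any documented SPU instruction, and all names are made up:

    /* Toy model of a "prefix stack" vector ISA: cheap decorator instructions
     * set source modifiers (swizzle, negate, ...) that apply to the next
     * vector op.  Hypothetical throughout. */
    #include <stdio.h>

    typedef struct { float x, y, z, w; } vec4;

    typedef struct {
        int swizzle[4];   /* source component selection; identity by default */
        int negate;       /* negate the decorated source */
    } decorator;

    static decorator pending = { {0, 1, 2, 3}, 0 };   /* the "prefix" state */

    /* decorator "instructions": no latency, no data dependencies of their own */
    static void dec_swizzle(int a, int b, int c, int d)
    {
        pending.swizzle[0] = a; pending.swizzle[1] = b;
        pending.swizzle[2] = c; pending.swizzle[3] = d;
    }
    static void dec_negate(void) { pending.negate = 1; }

    /* apply and clear the pending decorators on a source operand */
    static vec4 apply(vec4 v)
    {
        float c[4] = { v.x, v.y, v.z, v.w };
        vec4 r = { c[pending.swizzle[0]], c[pending.swizzle[1]],
                   c[pending.swizzle[2]], c[pending.swizzle[3]] };
        if (pending.negate) { r.x = -r.x; r.y = -r.y; r.z = -r.z; r.w = -r.w; }
        pending = (decorator){ {0, 1, 2, 3}, 0 };
        return r;
    }

    /* an ordinary vector op; the pending decorators modify operand b "for free" */
    static vec4 vadd(vec4 a, vec4 b)
    {
        b = apply(b);
        return (vec4){ a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    }

    int main(void)
    {
        vec4 a = { 1, 2, 3, 4 }, b = { 10, 20, 30, 40 };
        dec_swizzle(3, 3, 3, 3);        /* broadcast b.w ... */
        dec_negate();                   /* ... and negate it */
        vec4 r = vadd(a, b);            /* computes a + (-b.wwww) */
        printf("%g %g %g %g\n", r.x, r.y, r.z, r.w);
        return 0;
    }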
 
Gubbi said:
Until we know more about the CELL programming model we can't really say. One problem I see with CELL is that the local storage scratchpad memory isn't kept coherent with the rest of the memory system. This means that the local storage itself is part of the SPU's internal state and has to be saved on a context switch. The problem is compounded by the fact that each SPU can DMA to/from other SPUs' local storage.

So IBM had better come through with a clever way of virtualising SPUs, or CELL will be disastrously difficult to program in a multiprocessing environment (like, say, Linux or Windows).

Unless of course you're not allowed to program the SPUs directly. IMO this is the most likely situation: SPUs will either only run managed code, or their capabilities will only be exposed through libraries.

I think you pretty much answered this question - I don't think SPUs will be subject to context switches/task states; they will be more like 'devices' in multi-tasking environments - CPU threads will be able to block waiting for I/O from an SPU, but the state of the SPU will not be the CPU scheduler's business.
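
As a sketch of what that 'SPU as a device' model might look like from a PPE thread's point of view, consider the hypothetical C below. spu_submit and spu_wait are stand-ins, not a real CELL API; the stub bodies just narrate what a driver-style implementation would do:

    #include <stdio.h>
    #include <stddef.h>

    typedef struct {
        const void *input;     /* effective address of the input data */
        void       *output;    /* effective address of the result buffer */
        size_t      size;
    } spu_job;

    /* Hypothetical driver-style entry points.  A real implementation would load
     * an apulet into local store, kick off the DMA, and block on a completion
     * signal; these stubs only print what they would do. */
    static int spu_submit(int spu_id, const spu_job *job)
    {
        printf("submit %zu-byte job to SPU %d\n", job->size, spu_id);
        return 1;                      /* pretend request handle */
    }
    static int spu_wait(int handle)
    {
        printf("PPE thread sleeps on handle %d until the SPU signals done\n", handle);
        return 0;                      /* completion status */
    }

    int main(void)
    {
        char in[256], out[256];
        spu_job job = { in, out, sizeof in };

        int h = spu_submit(0, &job);   /* like issuing a request to a disk or GPU */
        return spu_wait(h);            /* the SPU's internal state is never part of
                                          this thread's context, so the scheduler has
                                          nothing extra to save or restore */
    }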
 
darkblu said:
I think you pretty much answered this question - I don't think SPUs will be subject to context switches/task states; they will be more like 'devices' in multi-tasking environments - CPU threads will be able to block waiting for I/O from an SPU, but the state of the SPU will not be the CPU scheduler's business.

Yeah, I read in the other thread that the PPE can't preempt the SPUs. CELL systems will be relying on cooperative multiprocessing for the SPU part. It's like the 1970s all over again :(

Looks more and more like a hardware engineer's wet dream and a software engineer's nightmare

Cheers
Gubbi
 
Gubbi said:
Yeah, I read in the other thread that the PPE can't preempt the SPUs. CELL systems will be relying on cooperative multiprocessing for the SPU part. It's like the 1970s all over again :(

Looks more and more like a hardware engineer's wet dream and a software engineer's nightmare

Now, now. Do you look the same way at your HDD or GPU? 'Cause the CPU can't preempt them either : )
 
Gubbi said:
darkblu said:
I think you pretty much answered this question - I don't think SPUs will be subject to context switches/task states; they will be more like 'devices' in multi-tasking environments - CPU threads will be able to block waiting for I/O from an SPU, but the state of the SPU will not be the CPU scheduler's business.

Yeah, I read in the other thread that the PPE can't preempt the SPUs. CELL systems will be relying on cooperative multiprocessing for the SPU part. It's like the 1970s all over again :(

Looks more and more like a hardware engineer's wet dream and a software engineer's nightmare

Cheers
Gubbi

The PPE can block the SPUs; even in DRM mode it can kill their execution and reset them. And I'd think the theory of ARPC (APU RPCs) is still there, which means that in non-DRM/secure mode the PPE can cause an Apulet to be loaded by particular SPUs, and you can code into the Apulet a stop based on the PPE sending another particular Apulet to the SPU for preemption purposes.

Tons of good software engineers from IBM, SCE and Toshiba worked on the CELL project, not only hardware guys: so far they seem to have done an overall great job on the CELL architecture and on the first CELL CPU, and there are still some surprises related to the PPE and the SPE (like the compiler and the CELL OS, etc.) that we have yet to see. I would think they know what they are doing, and IMHO developers' concerns have been on their minds since the moment they began working on the CELL architecture, and on PlayStation 3 in particular.
 
Panajev2001a said:
The PPE can block the SPUs; even in DRM mode it can kill their execution and reset them. And I'd think the theory of ARPC (APU RPCs) is still there, which means that in non-DRM/secure mode the PPE can cause an Apulet to be loaded by particular SPUs, and you can code into the Apulet a stop based on the PPE sending another particular Apulet to the SPU for preemption purposes.
Yeah, the PPE can send a message to an SPU asking it to stop, but it can't force it to. If the SPU ignores it (either because it simply hasn't finished its task yet, or because of a bug, or because a malicious user has made it crash on purpose), the only option is to kill the SPU.

That's not preemption at all. And the programming model isn't multithreaded at all; it's batch processing more than anything else.

The consequence of this under a multiprocessing OS is that the SPUs will have to be treated as devices, accessed through an API, and only running managed code (to ensure malicious users can't crash the machine on purpose). This will severely limit its appeal in spreading out into other markets, IMO.

Cheers
Gubbi
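
A minimal sketch of that cooperative model, seen from the SPU side: since the PPE can only ask for a stop (here modelled as a mailbox word), well-behaved SPU code has to poll for the request at convenient points. The mailbox and helper functions below are simulated stand-ins, not real intrinsics:

    #include <stdint.h>
    #include <stdio.h>

    #define MSG_STOP 0xDEADu

    static volatile uint32_t mailbox;   /* stand-in for the SPU's inbound mailbox */

    static uint32_t read_inbound_mailbox(void) { return mailbox; }
    static void process_chunk(int i)        { printf("processing chunk %d\n", i); }
    static void save_partial_results(void)  { puts("DMA partial results to main memory"); }

    static void spu_main_loop(int n_chunks)
    {
        for (int i = 0; i < n_chunks; i++) {
            process_chunk(i);                        /* the real work */

            /* cooperative preemption point: if the code never checks, the
             * PPE's only recourse is to kill and reset the SPU, losing all
             * local-store state */
            if (read_inbound_mailbox() == MSG_STOP) {
                save_partial_results();
                return;
            }
        }
    }

    int main(void)
    {
        mailbox = MSG_STOP;    /* simulate the PPE requesting a stop */
        spu_main_loop(16);     /* exits cleanly after the first chunk */
        return 0;
    }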
 
IMO, it's a common error to assume that none of what we're talking about has been thought through and thoroughly discussed by the STI engineers and designers. It's an even bigger mistake to start making all these fatalistic assumptions with so little information to go by.

Cell is not some new crazy project envisioned by Ken Kutaragi; it's the result of years of study from Sony, which must have matured a lot since PS2, and years of study from IBM, with its whole history of research in multiprocessing and supercomputing. I wouldn't write off the researchers in the CELL project as adventurers or incompetents. Let's wait and see, people!
 
That said, people do make mistakes and overlook important stuff. It wouldn't be out of the question for there to be real problems associated with developing for Cell.
 
Alejux said:
Cell is not some new crazy project envisioned by Ken Kutaragi; it's the result of years of study from Sony, which must have matured a lot since PS2, and years of study from IBM, with its whole history of research in multiprocessing and supercomputing. I wouldn't write off the researchers in the CELL project as adventurers or incompetents. Let's wait and see, people!

I'm sorry if I've come across as dismissing the Sony/IBM feat as incompetent. It most certainly isn't!!! It's an amazing feat, in particular in IC integration.

My main peeve is with the programming model. Coherency is inherently expensive. I can understand why the CELL designers wanted to move as much complexity to software as possible. And for PS3 it even makes sense; people are smarter than machines after all, and people will find a way to use CELL effectively ;)

But for CELL to move into other markets it has to be a target of mainstream tools, mainstream programming languages and a mainstream programming model, which it isn't.

All IMO.

Cheers
Gubbi
 
version said:
The SPE has its own MMU and DMA units; is it possible to use the local storage as a cache?
The ISSCC presentations specifically said that no hardware coherency was done between LS and main memory, which is why I've been whining for the past few days.

Cheers
Gubbi
 
Gubbi said:
version said:
The SPE has its own MMU and DMA units; is it possible to use the local storage as a cache?
The ISSCC presentations specifically said that no hardware coherency was done between LS and main memory, which is why I've been whining for the past few days.

Cheers
Gubbi


What? The SPE can read from and write to main memory in blocks (16 KB).
We can write a software cache routine, with a 16 KB cache line :)
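
A minimal sketch of that software-cache idea in plain C: a direct-mapped cache of 16 KB lines held in local store, filled on demand from main memory, with memcpy standing in for the DMA engine so the sketch compiles anywhere. All sizes and names are illustrative:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE_SIZE  (16 * 1024)           /* one DMA block = one cache line */
    #define NUM_LINES  8                     /* 128 KB of the 256 KB local store */

    static uint8_t  main_memory[1 << 20];    /* stand-in for external DRAM */
    static uint8_t  ls_lines[NUM_LINES][LINE_SIZE];   /* lives in local store */
    static uint64_t ls_tags[NUM_LINES];
    static int      ls_valid[NUM_LINES];

    /* Return a local-store pointer to the byte at "effective address" ea,
     * fetching the surrounding 16 KB block if it is not already resident. */
    static uint8_t *sw_cache_lookup(uint64_t ea)
    {
        uint64_t block = ea / LINE_SIZE;
        int      line  = (int)(block % NUM_LINES);    /* direct mapped */

        if (!ls_valid[line] || ls_tags[line] != block) {
            /* miss: pull the whole 16 KB block into local store
             * (a real version would also write back a dirty line first) */
            memcpy(ls_lines[line], &main_memory[block * LINE_SIZE], LINE_SIZE);
            ls_tags[line]  = block;
            ls_valid[line] = 1;
        }
        return &ls_lines[line][ea % LINE_SIZE];
    }

    int main(void)
    {
        main_memory[40000] = 123;
        printf("%d\n", *sw_cache_lookup(40000));  /* miss, fetches block 2, prints 123 */
        return 0;
    }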
 
Gubbi said:
My main peeve is with the programming model. Coherency is inherently expensive. I can understand why the CELL designers wanted to move as much complexity to software as possible. And for PS3 it even makes sense; people are smarter than machines after all, and people will find a way to use CELL effectively ;)

I imagine they must be putting a lot of effort into a compiler, so that most of these issues do not become relevant to the majority of programmers.

Ever since I first read about CELL and its software-cell structure, I've envisioned them creating a radically new programming model designed to take advantage of parallel and distributed programming, while leaving the lower-level layers to take care of all the problems related to task concurrency and other thorny issues.

I'm very curious to see what they'll come up with, and I confess I'll be very disappointed if their programming model is just some APU-spawning libraries with no automatic protection against the problems discussed in this thread.
 