A new x86 architecture with Cell-like tech?

But it has nothing to do with consoles; it's for PC servers.

There was some talk of this earlier, around when Clearspeed first launched that 96-core thing, IIRC. But it's good to see that AMD has finally decided to really focus on technologies to enhance server performance. While the first design will probably be launched as a PCI card (since it's already designed for that), they'll probably want to change the interface to a HyperTransport link further on.
 
This surprises me. The ix86 platform is principally for running Windows, so binary compatibility with code for earlier ix86 processors is essential. That need for binary compatibility means techniques like in-order processors relying on smart compilers to optimise the code can't be used, since the code would need to be recompiled specifically for the processor being used. Windows and most Windows applications, being closed source, don't allow this.

Windows ports to other processors have been tried but have failed miserably because of the lack of binary compatibility (e.g. MIPS, Alpha, Itanium, and all other attempts to move Windows to another processor). Linux, on the other hand, has been very successful on these and many other processors, because it and most of its applications are distributed as source code, allowing either the distributor or the end user to recompile for other platforms.

So why is AMD going for a Cell-type co-processor that most Windows apps won't use, rather than traditional SMP? It has to be either because AMD is looking at Linux for server use, or because they intend their Cell-like co-processor to replace embedded multimedia hardware. For example, AMD could supply Windows drivers that would allow it to emulate an MPEG decoder, do accelerated graphics on an unaccelerated chipset, and emulate a very fast FP unit.
 
SPM, I think you're underestimating Linux in the server space to begin with; it's big. Also note that Sun is a big Opteron supporter and AMD is pretty much the poster child for x86 Solaris. I'd be surprised if the Clearspeed inclusion had much to do with anything other than targeting an increased range of customers with Opteron - maybe trying to torpedo the ever-listing Itanium.

Clearspeed's products have been discussed here before, but the hype kind of died down. I read a Clearspeed/AMD article a couple of days ago and I admit I was pretty intrigued. I didn't do more than skim it though, so I didn't get a sense of the size of the Clearspeed chip going into the package. Was it mentioned?
 
The talks may be the result of an awareness of the stunning performance of the IBM/Sony/Toshiba microprocessor Cell.
This writer is making assumptions on his own. How he arrived at Cell is not understandable.

A future product for AMD is QuadCore, due out next year, a four-core microprocessor delivering double the performance of AMD’s current two-core product, DualCore

That's AMD's future. I think devs don't like (hardware) hierarchical processors.
 
This writer is making assumptions on his own. How he arrived at Cell is not understandable.

How about that statement made by one of AMD's head honchos saying the Cell is a wake-up call for them?

A future product for AMD is QuadCore, due out next year, a four-core microprocessor delivering double the performance of AMD’s current two-core product, DualCore

Good for AMD, having a quad-core AMD 64, but that still won't come close to even the Xenon, much less the Cell.

Just to give you guys an example of the power of the Cell.
That 96-core PCI-E expansion thing can only deliver 25 GFlop/s with 96 cores, while 1 SPE can deliver those same 25 GFlop/s on its own.
A dual-core P4 or AMD 64 can only do about 8-12 GFlop/s at best, while 1 SPE can do 25 GFlop/s.
The difference in power is huge when you're using all 7 or 8 SPEs.

That's AMD's future. I think devs don't like (hardware) hierarchical processors.

Correct. I think it's pointless to speculate about what might be coming from AMD, Intel, or IBM in the future.
The point is what's here now or in the next six months, and take a guess: there's going to be nothing that can get close to the Xenon, and especially the Cell.
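For what it's worth, those peak figures are easy to reproduce on the back of an envelope. A quick sketch - the 3.2 GHz clock and 4-wide fused multiply-add per cycle are my assumptions about the SPE, not something from this thread, and these are theoretical single-precision peaks, not measured numbers:

```c
/* Back-of-envelope check of the peak figures quoted in the thread.
 * Assumptions (mine, not the thread's): 3.2 GHz SPE clock, 4-wide
 * single-precision SIMD, fused multiply-add counted as 2 flops per
 * lane per cycle. */

double peak_gflops(double clock_ghz, int lanes, int flops_per_lane, int cores)
{
    /* peak GFLOP/s = clock (GHz) x SIMD lanes x flops per lane x cores */
    return clock_ghz * lanes * flops_per_lane * cores;
}

/* peak_gflops(3.2, 4, 2, 1) -> 25.6 for one SPE, matching the
 * "25 GFlop/s per SPE" figure above;
 * peak_gflops(3.2, 4, 2, 7) -> ~179 for seven usable SPEs. */
```

Note these are marketing-style peaks: every slot filled with an FMA, no stalls, no memory traffic.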
 
Guilty Bystander said:
How about that statement made by one of AMD's head honchos saying the Cell is a wake-up call for them?

Good for AMD, having a quad-core AMD 64, but that still won't come close to even the Xenon, much less the Cell.

Just to give you guys an example of the power of the Cell.
That 96-core PCI-E expansion thing can only deliver 25 GFlop/s with 96 cores, while 1 SPE can deliver those same 25 GFlop/s on its own.
A dual-core P4 or AMD 64 can only do about 8-12 GFlop/s at best, while 1 SPE can do 25 GFlop/s.
The difference in power is huge when you're using all 7 or 8 SPEs.

Correct. I think it's pointless to speculate about what might be coming from AMD, Intel, or IBM in the future.
The point is what's here now or in the next six months, and take a guess: there's going to be nothing that can get close to the Xenon, and especially the Cell.

The 25 GFLOPs for the Clearspeed processor is real-world double-precision GFLOPs. An SPE would be lucky to achieve 2.

Xenon and Cell beat AMD/Intel in theoretical peak single-precision GFLOPs only. In both double precision and real-world FLOPs, the performance difference isn't remotely as large, and is probably far below a quad-core AMD or Intel x86, especially Conroe and the rumored upgraded Opterons/Athlons with twice the FPUs.

Then consider everything that isn't GFLOP-related, and a single-core high-end Intel/AMD would likely walk all over Xenon and Cell in real-world performance, never mind a dual core. I'm not even going to acknowledge a comparison with a quad core.
 
Guilty Bystander said:
How about that statement made by one of AMD's head honchos saying the Cell is a wake-up call for them?


I don't remember that news. However, I do recall AMD criticizing the asymmetric design of CELL and saying it's not right for AMD.

AMD sees their version of a CELL-style architecture as symmetric.

The next problem that Weber touched on was the Cell approach to a heterogeneous multi-core microprocessor. To Fred Weber, a heterogeneous multi-core microprocessor is one that has a collection of cores, each one of which can execute the same code, but some can do so better than others - the decision of which to use being determined by the compiler. Weber referred to his version of heterogeneous multi-core as symmetric in this sense. Cell does not have this symmetric luxury; instead, all of their cores are not equally capable and thus, in Weber's opinion, Cell requires that the software needs to know too much about its architecture to perform well. The move to a more general purpose, symmetric yet heterogeneous array of cores would require that each core on Cell must get bigger and more complex, which directly relates back to Weber (and our) first problem with Cell that it is too far ahead of its time from a manufacturing standpoint.

http://arstechnica.com/news.ars/post/20050331-4763.html
 
pjbliverpool said:
The 25 GFLOPs for the Clearspeed processor is real-world double-precision GFLOPs. An SPE would be lucky to achieve 2.

Xenon and Cell beat AMD/Intel in theoretical peak single-precision GFLOPs only. In both double precision and real-world FLOPs, the performance difference isn't remotely as large, and is probably far below a quad-core AMD or Intel x86, especially Conroe and the rumored upgraded Opterons/Athlons with twice the FPUs.

Then consider everything that isn't GFLOP-related, and a single-core high-end Intel/AMD would likely walk all over Xenon and Cell in real-world performance, never mind a dual core. I'm not even going to acknowledge a comparison with a quad core.

Cell's DP performance is actually pretty good - pretty close to that 25 GFlops. Sure, it doesn't even approach its SP capabilities, but at the same time it's decent DP from a standalone chip all the same.
 
SPM said:
This surprises me. The ix86 platform is principally for running Windows, so binary compatibility with code for earlier ix86 processors is essential. That need for binary compatibility means techniques like in-order processors relying on smart compilers to optimise the code can't be used, since the code would need to be recompiled specifically for the processor being used. Windows and most Windows applications, being closed source, don't allow this.

Windows ports to other processors have been tried but have failed miserably because of the lack of binary compatibility (e.g. MIPS, Alpha, Itanium, and all other attempts to move Windows to another processor). Linux, on the other hand, has been very successful on these and many other processors, because it and most of its applications are distributed as source code, allowing either the distributor or the end user to recompile for other platforms.

So why is AMD going for a Cell-type co-processor that most Windows apps won't use, rather than traditional SMP? It has to be either because AMD is looking at Linux for server use, or because they intend their Cell-like co-processor to replace embedded multimedia hardware. For example, AMD could supply Windows drivers that would allow it to emulate an MPEG decoder, do accelerated graphics on an unaccelerated chipset, and emulate a very fast FP unit.

But... what is the difference between such processors that use a different instruction set and all the current x86 processors, which all translate those instructions to their own microcode before executing them anyway? I don't think there is any x86 processor nowadays that runs those instructions natively. They all use their own internal (RISC-like) instruction set.

Java and .NET both use Just-In-Time compilers as well; they don't compile their executables to machine code up front either. The x86 instruction set is just one of multiple representations you can use to create an executable binary.
 
DiGuru said:
But... what is the difference between such processors that use a different instruction set and all the current x86 processors, which all translate those instructions to their own microcode before executing them anyway? I don't think there is any x86 processor nowadays that runs those instructions natively. They all use their own internal (RISC-like) instruction set.

Java and .NET both use Just-In-Time compilers as well; they don't compile their executables to machine code up front either. The x86 instruction set is just one of multiple representations you can use to create an executable binary.

Yes, but optimising compilers for in-order processors create code for a very specific processor, since the optimisations are tied to the internal architecture of that specific processor. So performance for binaries that need to run on several processors with slightly different internal architectures - like Windows running on a generic ix86 processor - will always be poor on in-order architectures. This is probably why in-order architectures haven't caught on for Windows, but dominate non-ix86 RISC designs.

Hence the dictum that Cell-type in-order architectures are bad for OS performance probably does hold true for Windows, but not necessarily for Linux, if it is recompiled for the specific processor with smart compiler optimisations to overcome the limitations of in-order execution. A Cell-based PC might therefore run the Linux OS very well.

As for reasons AMD might have for producing a Cell-like design: massive floating-point performance in servers is only required for supercomputing clusters, which is a very niche market. IBM already has this cornered with the forthcoming double-precision version of Cell, and I don't think it is worth AMD spending a lot of money trying to compete there. This leaves using the SPE-like cores on desktop PCs to emulate hardware via OS drivers as the only way to exploit them. Besides, what do you really use such high floating-point performance for? Answer - (besides supercomputing) multimedia and 3D graphics acceleration, both of which can be implemented via drivers called by generic Windows code. Maybe combining the AMD processor with a cheap integrated 2D graphics chipset will give you accelerated 3D graphics, MPEG decoding and sound more cheaply than with dedicated hardware.
 
SPM said:
Yes, but optimising compilers for in-order processors create code for a very specific processor, since the optimisations are tied to the internal architecture of that specific processor. So performance for binaries that need to run on several processors with slightly different internal architectures - like Windows running on a generic ix86 processor - will always be poor on in-order architectures. This is probably why in-order architectures haven't caught on for Windows, but dominate non-ix86 RISC designs.

Hence the dictum that Cell-type in-order architectures are bad for OS performance probably does hold true for Windows, but not necessarily for Linux, if it is recompiled for the specific processor with smart compiler optimisations to overcome the limitations of in-order execution. A Cell-based PC might therefore run the Linux OS very well.

As for reasons AMD might have for producing a Cell-like design: massive floating-point performance in servers is only required for supercomputing clusters, which is a very niche market. IBM already has this cornered with the forthcoming double-precision version of Cell, and I don't think it is worth AMD spending a lot of money trying to compete there. This leaves using the SPE-like cores on desktop PCs to emulate hardware via OS drivers as the only way to exploit them. Besides, what do you really use such high floating-point performance for? Answer - (besides supercomputing) multimedia and 3D graphics acceleration, both of which can be implemented via drivers called by generic Windows code. Maybe combining the AMD processor with a cheap integrated 2D graphics chipset will give you accelerated 3D graphics, MPEG decoding and sound more cheaply than with dedicated hardware.


OK, I'm going to repeat myself here...

I have yet to see a compiler, including ones targeted at in-order designs, do an even vaguely decent job of scheduling to hide memory and instruction latencies.

Even if such a compiler did exist, an OOO core could, and in general would, ALWAYS do a better job.

The reason OOO cores are prevalent in the desktop space is simple: it's been the best way to improve performance on existing applications. OOO designs will display considerably higher sustainable IPCs in real applications than in-order designs, simply because they can mask latency where in-order designs can't.

In Cell and X360, IBM is trading off the execution benefits of OOO designs to get back transistors to throw at FP performance; they obviously believe that this is the primary performance bottleneck in games. It's a very radical trade-off. Trust me, you'd be surprised how poorly Cell or the X360 CPU would run apps not tailored to their architectures.

I should start a poll on roughly how many instructions per clock a dev ought to expect out of these in-order architectures in a real application. I'll give you a clue: it's a LOT lower than the peaks people throw around on BBSs.
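To make the latency-masking point concrete, here's a toy C sketch (my own illustration, not anything from the post above): the first loop is one long dependency chain, so an in-order core stalls on every add waiting for the previous one; the second keeps four independent accumulators in flight, which is roughly the kind of reordering an OOO core does for you at run time.

```c
/* Hypothetical illustration of latency hiding. Both functions compute
 * the same sum; they differ only in how much independent work is
 * available to the pipeline at any moment. */
#include <stddef.h>

float sum_naive(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];              /* one serial dependency chain */
    return s;
}

float sum_unrolled(const float *a, size_t n)
{
    /* Four independent accumulators: adds and loads can overlap even
     * on an in-order core, because no chain waits on another. */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

On an OOO core the two versions tend to perform similarly, because the hardware finds the independent adds itself; on an in-order core the compiler or programmer has to do this interleaving explicitly, which is exactly the scheduling job compilers rarely do well.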
 
xbdestroya said:
Cell's DP performance is actually pretty good - pretty close to that 25 GFlops. Sure, it doesn't even approach its SP capabilities, but at the same time it's decent DP from a standalone chip all the same.

The amount of effort you'd need to put into Cell to get decent DP performance compared to a processor such as the Athlon X2 is insane. That's one thing you need to keep in mind.
 
Asher said:
The amount of effort you'd need to put into Cell to get decent DP performance compared to a processor such as the Athlon X2 is insane. That's one thing you need to keep in mind.

That's very true, but I was just addressing the notion that Cell is 'weak' in the DP area. I didn't mean to imply it isn't more work to extract that DP performance.
 
Shifty Geezer said:
Is any multicore architecture now going to be referred to as a Cell-like design?

That's what I was wondering. But I guess it's the usual marketing speak, with the "Mario killer", "GT killer", "Cell killer" kind of mentality, where everything new is compared to and expected to be a "killer" of what's been done before, without being evaluated in its own right.
 
From what I understand, AMD's accelerator will be nothing like CELL.

It is supposed to plug into an AM2 socket (or similar: S940, S939, whatever). It's going to use regular cache-coherent HyperTransport links to communicate with the host processor(s). That alone makes it completely different from CELL.

The architecture of the processors themselves is unknown right now. Some have speculated that it's a vector-processor bolt-on (a real one, not SIMD extensions), others that it's SPE-like DSPs (but memory coherent).

All in all I see a fairly limited use in commercial servers. The need for fast floating point in most server apps is already more than well enough served by current processors. The acceleration for stuff like SSL encryption doesn't take a stand-alone processor.

So it looks like this is directed at the technical high performance market (supers).

Cheers
 
Asher said:
The amount of effort you'd need to put into Cell to get decent DP performance compared to a processor such as the Athlon X2 is insane. That's one thing you need to keep in mind.
Yes, as long as you're talking about a single core. But everyone is going multi-core anyway.

So, as long as you have enough separate tasks, Cell has no competition at the moment, whatever the workload.
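A minimal sketch of what "enough separate tasks" means in practice - a hypothetical toy in plain pthreads, nothing Cell-specific (on Cell the worker would be an SPE program fed by DMA rather than a host thread):

```c
/* Toy task parallelism: two independent halves of an array are summed
 * concurrently, one on a worker thread and one on the calling thread.
 * Illustrative only; names and structure are my own. */
#include <pthread.h>
#include <stddef.h>

struct chunk {
    const double *data;
    size_t n;
    double sum;
};

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    c->sum = 0.0;
    for (size_t i = 0; i < c->n; i++)
        c->sum += c->data[i];
    return NULL;
}

double parallel_sum(const double *data, size_t n)
{
    struct chunk lo = { data, n / 2, 0.0 };
    struct chunk hi = { data + n / 2, n - n / 2, 0.0 };
    pthread_t worker;

    pthread_create(&worker, NULL, sum_chunk, &lo); /* first half elsewhere */
    sum_chunk(&hi);                                /* second half here */
    pthread_join(worker, NULL);

    return lo.sum + hi.sum;
}
```

The point is that the two chunks share nothing while they run, so adding cores (or SPEs) scales the work directly; workloads that can't be carved up this way don't get that benefit.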
 
ERP said:
OK, I'm going to repeat myself here...

I have yet to see a compiler, including ones targeted at in-order designs, do an even vaguely decent job of scheduling to hide memory and instruction latencies.

Even if such a compiler did exist, an OOO core could, and in general would, ALWAYS do a better job.

The reason OOO cores are prevalent in the desktop space is simple: it's been the best way to improve performance on existing applications. OOO designs will display considerably higher sustainable IPCs in real applications than in-order designs, simply because they can mask latency where in-order designs can't.

In Cell and X360, IBM is trading off the execution benefits of OOO designs to get back transistors to throw at FP performance; they obviously believe that this is the primary performance bottleneck in games. It's a very radical trade-off. Trust me, you'd be surprised how poorly Cell or the X360 CPU would run apps not tailored to their architectures.

I should start a poll on roughly how many instructions per clock a dev ought to expect out of these in-order architectures in a real application. I'll give you a clue: it's a LOT lower than the peaks people throw around on BBSs.

Maybe you are repeating yourself, stating the obvious, but most people don't get it. A lot of the reactions even run along the lines of "these whining developers should do a better job / will have to spend more time optimizing their code" - as if they'd be able to spend an awful lot of time hand-optimizing ASM so it runs better on in-order cores.
But that wouldn't be possible anyway: as you say, OOO masks latency, and that's an optimization which can only be done at execution time, on the CPU.

Itanium was a failed attempt to move some of those optimizations into the compiler.

As for a dev's estimate, I've got one: Carmack stated in his 2005 QuakeCon speech that on an X360 core, code runs at roughly half the speed of a PC CPU core.
 