A new x86 architecture with Cell-like tech?

But it has nothing to do with consoles; it's for PC servers.

There was some talk of this earlier, around when Clearspeed first launched that 96-core thing, IIRC. But it's good to see that AMD has finally decided to really focus on technologies to enhance server performance. While the first design will probably be launched as a PCI card (since it's already designed for that), they'll probably want to change the interface to a HyperTransport link further on.
 
This surprises me. The ix86 platform is principally for running Windows, so binary compatibility with code for earlier ix86 processors is essential. That need for binary compatibility means techniques like in-order processors relying on smart compilers to optimise the code can't be used, since the code would need to be recompiled specifically for the processor being used. Windows and most Windows applications, being closed source, don't allow this.

Windows ports to other processors have been tried but have failed miserably because of the lack of binary compatibility (e.g. MIPS, Alpha, Itanium, and all other attempts to move Windows to another processor). Linux, on the other hand, has been very successful on these and many other processors, because it and most of its applications are distributed as source code, allowing either the distributor or the end user to recompile for other platforms.

So why is AMD going for a Cell-type co-processor that most Windows apps won't use, rather than traditional SMP? It has to be either because AMD is looking at Linux for server use, or because they intend their Cell-like co-processor to replace embedded multimedia hardware. For example, AMD could supply Windows drivers that would allow it to emulate an MPEG decoder, do accelerated graphics on an unaccelerated chipset, and emulate a very fast FP unit.
 
SPM, I think you're underestimating Linux in the server space to begin with; it's big. Also note that Sun is a big Opteron supporter and AMD is pretty much the poster child for x86 Solaris. I'd be surprised if the Clearspeed inclusion had much to do with anything other than targeting an increased range of customers with Opteron - maybe trying to torpedo the ever-listing Itanium.

Clearspeed's products have been discussed here before, but the hype kind of died down. I read a Clearspeed/AMD article a couple of days ago and I admit I was pretty intrigued. I didn't do more than skim it though, so I didn't get a sense of the size of the Clearspeed chip going into the package. Was it mentioned?
 
The talks may be the result of an awareness of the stunning performance of the IBM/Sony/Toshiba microprocessor Cell.
This writer is making assumptions on his own. How he arrived at Cell is not understandable.

A future product for AMD is QuadCore, due out next year, a four-core microprocessor delivering double the performance of AMD’s current two-core product, DualCore

That's AMD's future. I think devs don't like (hardware) hierarchical processors.
 
This writer is making assumptions on his own. How he arrived at Cell is not understandable.

How about that statement made by one of AMD's head honchos saying the Cell is a wake-up call for them?

A future product for AMD is QuadCore, due out next year, a four-core microprocessor delivering double the performance of AMD’s current two-core product, DualCore

Good for AMD, having a quad-core AMD 64, but that still won't come close to even the Xenon, much less the Cell.

Just to give you guys an example of the power of the Cell.
That 96-core PCI-E expansion thing can only deliver 25 GFlop/s with 96 cores, while 1 SPE can deliver those same 25 GFlop/s on its own.
A dual-core P4 or AMD 64 can only do about 8-12 GFlop/s at best, while 1 SPE can do 25 GFlop/s.
The difference in power is huge when you're using all 7 or 8 SPEs.

That's AMD's future. I think devs don't like (hardware) hierarchical processors.

Correct. I think it's pointless to speculate about what might be coming from AMD, Intel, or IBM in the future.
The point is what's here now or in the next six months, and take a guess: there's going to be nothing that can get close to the Xenon, and especially the Cell.
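For what it's worth, those peak figures are easy to reproduce on the back of an envelope. A quick sketch - the 3.2 GHz clock and 4-wide fused multiply-add per cycle are my assumptions about the SPE, not something from this thread, and these are theoretical single-precision peaks, not measured numbers:

```c
/* Back-of-envelope check of the peak figures quoted in the thread.
 * Assumptions (mine, not the thread's): 3.2 GHz SPE clock, 4-wide
 * single-precision SIMD, fused multiply-add counted as 2 flops per
 * lane per cycle. */

double peak_gflops(double clock_ghz, int lanes, int flops_per_lane, int cores)
{
    /* peak GFLOP/s = clock (GHz) x SIMD lanes x flops per lane x cores */
    return clock_ghz * lanes * flops_per_lane * cores;
}

/* peak_gflops(3.2, 4, 2, 1) -> 25.6 for one SPE, matching the
 * "25 GFlop/s per SPE" figure above;
 * peak_gflops(3.2, 4, 2, 7) -> ~179 for seven usable SPEs. */
```

Note these are marketing-style peaks: every slot filled with an FMA, no stalls, no memory traffic.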
 
Guilty Bystander said:
How about that statement made by one of AMD's head honchos saying the Cell is a wake-up call for them?

Good for AMD, having a quad-core AMD 64, but that still won't come close to even the Xenon, much less the Cell.

Just to give you guys an example of the power of the Cell.
That 96-core PCI-E expansion thing can only deliver 25 GFlop/s with 96 cores, while 1 SPE can deliver those same 25 GFlop/s on its own.
A dual-core P4 or AMD 64 can only do about 8-12 GFlop/s at best, while 1 SPE can do 25 GFlop/s.
The difference in power is huge when you're using all 7 or 8 SPEs.

Correct. I think it's pointless to speculate about what might be coming from AMD, Intel, or IBM in the future.
The point is what's here now or in the next six months, and take a guess: there's going to be nothing that can get close to the Xenon, and especially the Cell.

The 25 GFLOPs for the Clearspeed processor is real-world double-precision GFLOPs. An SPE would be lucky to achieve 2.

Xenon and Cell beat AMD/Intel in theoretical peak single-precision GFLOPs only. In both double precision and real-world FLOPs, the performance difference isn't remotely as large, and is probably far below a quad-core AMD or Intel x86, especially Conroe and the rumored upgraded Opterons/Athlons with twice the FPUs.

Then consider everything that isn't GFLOP-related, and a single-core high-end Intel/AMD would likely walk all over Xenon and Cell in real-world performance, never mind a dual core. I'm not even going to acknowledge a comparison with a quad core.
 
Guilty Bystander said:
How about that statement made by one of AMD's head honchos saying the Cell is a wake-up call for them?


I don't remember that news. However, I do recall AMD criticizing the asymmetric design of CELL and saying it's not right for AMD.

AMD sees their version of a CELL-style architecture as symmetric.

The next problem that Weber touched on was the Cell approach to a heterogeneous multi-core microprocessor. To Fred Weber, a heterogeneous multi-core microprocessor is one that has a collection of cores, each one of which can execute the same code, but some can do so better than others - the decision of which to use being determined by the compiler. Weber referred to his version of heterogeneous multi-core as symmetric in this sense. Cell does not have this symmetric luxury; instead, all of their cores are not equally capable and thus, in Weber's opinion, Cell requires that the software needs to know too much about its architecture to perform well. The move to a more general purpose, symmetric yet heterogeneous array of cores would require that each core on Cell must get bigger and more complex, which directly relates back to Weber (and our) first problem with Cell that it is too far ahead of its time from a manufacturing standpoint.

http://arstechnica.com/news.ars/post/20050331-4763.html
 
pjbliverpool said:
The 25 GFLOPs for the Clearspeed processor is real-world double-precision GFLOPs. An SPE would be lucky to achieve 2.

Xenon and Cell beat AMD/Intel in theoretical peak single-precision GFLOPs only. In both double precision and real-world FLOPs, the performance difference isn't remotely as large, and is probably far below a quad-core AMD or Intel x86, especially Conroe and the rumored upgraded Opterons/Athlons with twice the FPUs.

Then consider everything that isn't GFLOP-related, and a single-core high-end Intel/AMD would likely walk all over Xenon and Cell in real-world performance, never mind a dual core. I'm not even going to acknowledge a comparison with a quad core.

Cell's DP performance is actually pretty good - pretty close to that 25 GFlops. Sure, it doesn't even approach its SP capabilities, but at the same time it's decent DP from a standalone chip all the same.
 
SPM said:
This surprises me. The ix86 platform is principally for running Windows, so binary compatibility with code for earlier ix86 processors is essential. That need for binary compatibility means techniques like in-order processors relying on smart compilers to optimise the code can't be used, since the code would need to be recompiled specifically for the processor being used. Windows and most Windows applications, being closed source, don't allow this.

Windows ports to other processors have been tried but have failed miserably because of the lack of binary compatibility (e.g. MIPS, Alpha, Itanium, and all other attempts to move Windows to another processor). Linux, on the other hand, has been very successful on these and many other processors, because it and most of its applications are distributed as source code, allowing either the distributor or the end user to recompile for other platforms.

So why is AMD going for a Cell-type co-processor that most Windows apps won't use, rather than traditional SMP? It has to be either because AMD is looking at Linux for server use, or because they intend their Cell-like co-processor to replace embedded multimedia hardware. For example, AMD could supply Windows drivers that would allow it to emulate an MPEG decoder, do accelerated graphics on an unaccelerated chipset, and emulate a very fast FP unit.

But... what is the difference between such processors that use a different instruction set and all the current x86 processors, which all translate those instructions to their own microcode before executing them anyway? I don't think there is any x86 processor nowadays that runs those instructions natively. They all use their own internal (RISC-like) instruction set.

Java and .NET both use Just-In-Time compilers as well; they don't compile their executables to machine code up front either. The x86 instruction set is just one of multiple representations you can use to create an executable binary.
 
DiGuru said:
But... what is the difference between such processors that use a different instruction set and all the current x86 processors, which all translate those instructions to their own microcode before executing them anyway? I don't think there is any x86 processor nowadays that runs those instructions natively. They all use their own internal (RISC-like) instruction set.

Java and .NET both use Just-In-Time compilers as well; they don't compile their executables to machine code up front either. The x86 instruction set is just one of multiple representations you can use to create an executable binary.

Yes, but optimising compilers for in-order processors create code for a very specific processor, since the optimisations are tied to the internal architecture of that specific processor. So performance for binaries that need to run on several processors with slightly different internal architectures - like Windows running on a generic ix86 processor - will always be poor on in-order architectures. This is probably why in-order architectures haven't caught on for Windows, but dominate non-ix86 RISC designs.

Hence the dictum that Cell-type in-order architectures are bad for OS performance probably does hold true for Windows, but not necessarily for Linux, if it is recompiled for the specific processor with smart compiler optimisations to overcome the limitations of in-order execution. A Cell-based PC might therefore run the Linux OS very well.

As for reasons AMD might have for producing a Cell-like design: massive floating-point performance in servers is only required for supercomputing clusters, which is a very niche market. IBM already has this cornered with the forthcoming double-precision version of Cell, and I don't think it is worth AMD spending a lot of money trying to compete there. This leaves using the SPE-like cores on desktop PCs to emulate hardware via OS drivers as the only way to exploit them. Besides, what do you really use such high floating-point performance for? Answer - (besides supercomputing) multimedia and 3D graphics acceleration, both of which can be implemented via drivers called by generic Windows code. Maybe combining the AMD processor with a cheap integrated 2D graphics chipset will give you accelerated 3D graphics, MPEG decoding and sound more cheaply than with dedicated hardware.
 
SPM said:
Yes, but optimising compilers for in-order processors create code for a very specific processor, since the optimisations are tied to the internal architecture of that specific processor. So performance for binaries that need to run on several processors with slightly different internal architectures - like Windows running on a generic ix86 processor - will always be poor on in-order architectures. This is probably why in-order architectures haven't caught on for Windows, but dominate non-ix86 RISC designs.

Hence the dictum that Cell-type in-order architectures are bad for OS performance probably does hold true for Windows, but not necessarily for Linux, if it is recompiled for the specific processor with smart compiler optimisations to overcome the limitations of in-order execution. A Cell-based PC might therefore run the Linux OS very well.

As for reasons AMD might have for producing a Cell-like design: massive floating-point performance in servers is only required for supercomputing clusters, which is a very niche market. IBM already has this cornered with the forthcoming double-precision version of Cell, and I don't think it is worth AMD spending a lot of money trying to compete there. This leaves using the SPE-like cores on desktop PCs to emulate hardware via OS drivers as the only way to exploit them. Besides, what do you really use such high floating-point performance for? Answer - (besides supercomputing) multimedia and 3D graphics acceleration, both of which can be implemented via drivers called by generic Windows code. Maybe combining the AMD processor with a cheap integrated 2D graphics chipset will give you accelerated 3D graphics, MPEG decoding and sound more cheaply than with dedicated hardware.


OK, I'm going to repeat myself here...

I have yet to see a compiler, including ones targeted at in-order designs, do an even vaguely decent job of scheduling to hide memory and instruction latencies.

Even if such a compiler did exist, an OOO core could, and in general would, ALWAYS do a better job.

The reason OOO cores are prevalent in the desktop space is simple: it's been the best way to improve performance on existing applications. OOO designs will display considerably higher sustainable IPCs in real applications than in-order designs, simply because they can mask latency where in-order designs can't.

In Cell and X360, IBM is trading off the execution benefits of OOO designs to get back transistors to throw at FP performance; they obviously believe that this is the primary performance bottleneck in games. It's a very radical trade-off. Trust me, you'd be surprised how poorly Cell or the X360 CPU would run apps not tailored to their architectures.

I should start a poll on roughly how many instructions per clock a dev ought to expect out of these in-order architectures in a real application. I'll give you a clue: it's a LOT lower than the peaks people throw around on BBSs.
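To make the latency-masking point concrete, here's a toy C sketch (my own illustration, not anything from the post above): the first loop is one long dependency chain, so an in-order core stalls on every add waiting for the previous one; the second keeps four independent accumulators in flight, which is roughly the kind of reordering an OOO core does for you at run time.

```c
/* Hypothetical illustration of latency hiding. Both functions compute
 * the same sum; they differ only in how much independent work is
 * available to the pipeline at any moment. */
#include <stddef.h>

float sum_naive(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];              /* one serial dependency chain */
    return s;
}

float sum_unrolled(const float *a, size_t n)
{
    /* Four independent accumulators: adds and loads can overlap even
     * on an in-order core, because no chain waits on another. */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

On an OOO core the two versions tend to perform similarly, because the hardware finds the independent adds itself; on an in-order core the compiler or programmer has to do this interleaving explicitly, which is exactly the scheduling job compilers rarely do well.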
 
xbdestroya said:
Cell's DP performance is actually pretty good - pretty close to that 25 GFlops. Sure, it doesn't even approach its SP capabilities, but at the same time it's decent DP from a standalone chip all the same.

The amount of effort you'd need to put into Cell to get decent DP performance compared to a processor such as the Athlon X2 is insane. That's one thing you need to keep in mind.
 
Asher said:
The amount of effort you'd need to put into Cell to get decent DP performance compared to a processor such as the Athlon X2 is insane. That's one thing you need to keep in mind.

That's very true, but I was just addressing the notion that Cell is 'weak' in the DP area. I didn't mean to imply it isn't more work to extract that DP performance.
 
Shifty Geezer said:
Is any multicore architecture now going to be referred to as a Cell-like design?

That's what I was wondering. But I guess it's the usual marketing speak, with the "Mario killer", "GT killer", "Cell killer" kind of mentality, where everything new is compared to and expected to be a "killer" of what's been done before, without being evaluated in its own right.
 
From what I understand, AMD's accelerator will be nothing like CELL.

It is supposed to plug into an AM2 socket (or similar: S940, S939, whatever). It's going to use regular cache-coherent HyperTransport links to communicate with the host processor(s). That alone makes it completely different from CELL.

The architecture of the processors themselves is unknown right now. Some have speculated that it's a vector-processor bolt-on (a real one, not SIMD extensions), others that it's SPE-like DSPs (but memory coherent).

All in all I see a fairly limited use in commercial servers. The need for fast floating point in most server apps is already more than well enough served by current processors. The acceleration for stuff like SSL encryption doesn't take a stand-alone processor.

So it looks like this is directed at the technical high performance market (supers).

Cheers
 
Asher said:
The amount of effort you'd need to put into Cell to get decent DP performance compared to a processor such as the Athlon X2 is insane. That's one thing you need to keep in mind.
Yes, as long as you're talking about a single core. But everyone is going multi-core anyway.

So, as long as you have enough separate tasks, Cell has no competition at the moment, whatever the workload.
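A minimal sketch of what "enough separate tasks" means in practice - a hypothetical toy in plain pthreads, nothing Cell-specific (on Cell the worker would be an SPE program fed by DMA rather than a host thread):

```c
/* Toy task parallelism: two independent halves of an array are summed
 * concurrently, one on a worker thread and one on the calling thread.
 * Illustrative only; names and structure are my own. */
#include <pthread.h>
#include <stddef.h>

struct chunk {
    const double *data;
    size_t n;
    double sum;
};

static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    c->sum = 0.0;
    for (size_t i = 0; i < c->n; i++)
        c->sum += c->data[i];
    return NULL;
}

double parallel_sum(const double *data, size_t n)
{
    struct chunk lo = { data, n / 2, 0.0 };
    struct chunk hi = { data + n / 2, n - n / 2, 0.0 };
    pthread_t worker;

    pthread_create(&worker, NULL, sum_chunk, &lo); /* first half elsewhere */
    sum_chunk(&hi);                                /* second half here */
    pthread_join(worker, NULL);

    return lo.sum + hi.sum;
}
```

The point is that the two chunks share nothing while they run, so adding cores (or SPEs) scales the work directly; workloads that can't be carved up this way don't get that benefit.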
 
ERP said:
OK, I'm going to repeat myself here...

I have yet to see a compiler, including ones targeted at in-order designs, do an even vaguely decent job of scheduling to hide memory and instruction latencies.

Even if such a compiler did exist, an OOO core could, and in general would, ALWAYS do a better job.

The reason OOO cores are prevalent in the desktop space is simple: it's been the best way to improve performance on existing applications. OOO designs will display considerably higher sustainable IPCs in real applications than in-order designs, simply because they can mask latency where in-order designs can't.

In Cell and X360, IBM is trading off the execution benefits of OOO designs to get back transistors to throw at FP performance; they obviously believe that this is the primary performance bottleneck in games. It's a very radical trade-off. Trust me, you'd be surprised how poorly Cell or the X360 CPU would run apps not tailored to their architectures.

I should start a poll on roughly how many instructions per clock a dev ought to expect out of these in-order architectures in a real application. I'll give you a clue: it's a LOT lower than the peaks people throw around on BBSs.

Maybe you are repeating yourself, stating the obvious, but most people don't get it. A lot of the reactions even run along the lines of "these whining developers should do a better job / will have to spend more time optimizing their code" - as if they'd be able to spend an awful lot of time hand-optimizing ASM so it runs better on in-order cores.
But that wouldn't be possible anyway: as you say, OOO masks latency, and that's an optimization which can only be done at execution time, on the CPU.

Itanium was a failed attempt to move some of those optimizations into the compiler.

As for a dev's estimate, I've got one: Carmack stated in his 2005 QuakeCon speech that on an X360 core, code runs at roughly half the speed of a PC CPU core.
 