Cell vs Tesla 10 vs Firestream 9250
randomhack
22-Jun-2008, 04:59
Pardon my ignorance wrt Cell but I wonder why the cell based accelerator boards are so damned expensive compared to the GPU based boards? I also wonder whether Cell based systems offer any performance advantage over the GPU boards.
Is it the power draw that's attractive on the Cell?
Shifty Geezer
22-Jun-2008, 11:22
I think because Cell is out and has been out for a couple of years, and the GPU boards aren't. ;) Prior to these latest developments there was nothing to challenge Cell's sustained performance. Now that the GPU IHVs are adjusting their designs for general purpose processing, and competing with each other, they are clearly in a price struggle to attract performance dollars - if Firestream was the only accelerated option out there, do you think it'd remain so 'cheap'!
This is something of a make-or-break point for Cell I think. It hasn't managed to take off and establish itself in a broad market or the CE space. Its current niche of performance computing is being challenged by the GPU guys. The advantage to Cell at the moment is in performance per watt, but in specialist applications custom ASICs do a good job already. In the PC space, such as Toshiba's laptop venture, Cell is likely to be surpassed by integrated GPU hardware that can transition between jobs. Functioning as standard PC GPUs, they'll get mainstream adoption, and with all these GPUs present, people will actually start targeting them, and unlike Cell, they'll get the software that has the performance advantages.
I'm not feeling optimistic for Cell at the moment :(
For CE space, what are the performance figures for NEC's and Broadcom's SoCs for Blu-ray players ? These devices need to run general programs on HD assets. They would be Cell challengers on another front.
For specialized business and scientific applications, the GPGPUs already churn out impressive numbers for work that matches well to their architecture. I am also waiting for some comparison numbers from Folding@Home (There will be a new PS3 client too according to Vijay Pande (http://foldingforum.org/viewtopic.php?f=3&t=2756&sid=670be1952013467bc61d9e0386727653#p26167)). The problem is these SIMD nodes may have a higher FLOP count, but the effective work done may not be as great (compared to the FLOP count). For F@H, it is said that the GPUs need to recalculate some values a few times because they can't store these intermediate values.
The current incarnation of GPGPUs also has limited application areas. Nonetheless, the sheer number of cores may still provide the necessary performance advantage despite the wastage. I also believe that integrated GPUs will likely kill any prospect for SpursEngine and Cell accelerator boards here.
For supercomputing, power consumption has become one of the key limiting factors. Does Cell still offer better performance/watt compared to GPUs ? If true, they will hold a strategic advantage there (since they can add more nodes to improve performance and likely still achieve good efficiency due to Cell's software design). I know of clusters that caused frequent brown-outs in their neighbourhood in Asia.
In general, energy use will become more and more problematic moving forward.
Pardon my ignorance wrt Cell but I wonder why the cell based accelerator boards are so damned expensive compared to the GPU based boards? I also wonder whether Cell based systems offer any performance advantage over the GPU boards.
Is it the power draw that's attractive on the Cell?
Cell offers a number of advantages, especially in the supercomputing space. That's due mainly to interconnects; where GPUs are slaved to the PCIe slots, Cell is not, and thus communication node-to-node and with the 'system' in general is at a much lower latency than with GPU-based systems. Cell itself as an architecture reflects a number of supercomputing concepts, and the idea of 'interconnects' is at the core of it... the EIB and the low-latency LS.
GPUs on a targeted level are of course great performers, and I certainly don't think the average person should go out and buy a Cell add-in board when they can work with a GPU instead. But for scaled HPC and supercomputing solutions on the enterprise scale, the combination of IBM's reputation and BladeCenter system, the improving SDK tools and 'transparency' offered in heterogeneous arrangements, and actual real performance advantages in certain situations keeps Cell quite competitive in the present term, and I think that IBM is well-positioned at the moment with it.
How approachable the Firestream is I'm not sure myself; certainly its DP performance is quite strong, but against the Tesla 10, Cell compares favorably on both DP performance and wattage. NVidia's rack-mount server offering at 700 watts would put the QS22 at rough parity in terms of DP performance for just a couple thousand more while consuming half the watts, for example.
RudeCurve
24-Jun-2008, 02:50
Prior to these latest developments there was nothing to challenge Cell's sustained performance.
I believe Clearspeed's accelerator boards existed long before CELL acceleration boards and they've proven themselves in Linpack sustained performance testing too. In fact they were added to the TSUBAME supercomputer at Tokyo Institute of Technology in Japan to boost sustained performance back in 2006 while only consuming a small amount of additional power. Clearspeed has a new processor out now, the CSX700, with 96 GFLOPS of performance, yet it only consumes 25W of power. That is pretty darn amazing.
http://www.clearspeed.com/docs/resources/ClearSpeed_e710_Marketing_Brochure_5-08.pdf
What is CELL's performance/watt?
Clearspeed's application utility is limited in the HPC space, in part due again to the PCIe limitations, and in part simply due to its architecture; something that has been discussed many a time on these forums, especially around the time of Cell's original introduction. At the same time though obviously, Clearspeed only has utility in the HPC space... so its function is more as a very efficient situational boost computation-wise.
It is a very good Flop/watt performer, but I think the fact that no system of note uses it at its core should tell you all you need to know. Along with the PCI-express hurdles of both GPGPU and Clearspeed solutions, it should be noted that on the memory front the PowerXCell 8i is in a much better position as well. The QS22 accommodates up to 32GB of RAM per blade, where the Clearspeed card is limited to 2GB per card.
RudeCurve
24-Jun-2008, 05:31
The QS22 accommodates up to 32GB of RAM per blade, where the Clearspeed card is limited to 2GB per card.
You are comparing a blade with a card?
CATS 700 1U blade with 24GB of ECC RAM, 1.1TFLOPS double precision, 500Watts power draw.
http://www.clearspeed.com/docs/resources/ClearSpeed_CATS_700_5-08.pdf
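Since performance per watt keeps coming up, here is a minimal sketch of the arithmetic using only the vendor peak figures quoted in this thread (96 GFLOPS at 25W for the CSX700, 1.1 TFLOPS at 500W for the CATS 700). The function name is just an illustration, and these are rated peaks, not measured Linpack results; host CPU power is not included.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Peak GFLOPS delivered per watt of rated power draw."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return gflops / watts


# Vendor peak figures as quoted in the thread (not measured results).
csx700 = gflops_per_watt(96.0, 25.0)       # single CSX700 processor
cats700 = gflops_per_watt(1100.0, 500.0)   # CATS 700 1U unit, 1.1 TFLOPS

print(f"CSX700:   {csx700:.2f} GFLOPS/W")
print(f"CATS 700: {cats700:.2f} GFLOPS/W")
```

The gap between the two numbers suggests the 1U unit spends a good part of its power budget on something other than the processors themselves, though the brochure figures alone don't say where.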
Who's using Clearspeeds?
http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_BAE_Agreement_070904.php
http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_Meraka_Institute_Feb5_07.php
http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_2006_11_13.php
http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_Tao.php
http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_Warwick.php
http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_Sun.php
http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_HP_approved_071113.php
http://www.clearspeed.com/newsevents/news/pressreleases/ClearSpeed_CATS_071113.php
CELL isn't being used everywhere, it's being used in HPC and PS3. It's only being used by Toshiba because they've invested so much money in it and want to get something tangible out of it same with CELLs in supposedly SONY tvs. CELL is not all that convincing in the CE space. Toshiba and SONY promised to use them in their CE devices because...well they developed it along with IBM. Do you see anybody else using CELLs in their CE devices? No, everybody else is using ASICs.
Impressive FLOP count and energy efficiency !!! Admittedly, I have not tracked ClearSpeed closely.
According to the latest top 500 supercomputer list (http://www.top500.org/list/2008/06/100), the highest ranked ClearSpeed-equipped system placed at #24 (with 12344 nodes). Ranked above it (#23) is a "Dell" Xeon cluster with 9600 nodes on infiniband. With such superior specs, how did ClearSpeed lag behind a smaller Xeon cluster ? Perhaps the host system limited ClearSpeed's advantages in some ways ? Did PCIe cap the overall performance ?
There was also no power rating numbers given. Why not flaunt it ?
* ClearSpeed is a co-processor while Cell is a general purpose CPU. So the application area is somewhat different.
* Is ClearSpeed based on a SIMD programming model ? (Their whitepaper seems to imply so.) If true, it will have the same specialization/limitation as GPGPU, but with a much better performance/watt rating. OTOH, GPGPU may come "free/integrated" one day, potentially squeezing Cell and ClearSpeed accelerators out. The current Cell prospect indeed lies within IBM, Toshiba and Sony until they evolve the business further. Where does ClearSpeed intend to go from here onwards ? Who do you think is likely to acquire ClearSpeed (the company) ?
* What is the up-time for a typical ClearSpeed system ? Is it robust enough for mission critical applications ?
* Are they expensive ?
randomhack
24-Jun-2008, 07:44
About clearspeed this article provides some details about prices etc :
http://www.hpcwire.com/features/ClearSpeed_Puts_the_Pedal_to_the_Metal.html
Price quoted is roughly $3000-$3600.
That's very cheap.
I don't know how the Cell vendors are going to position their solutions, but it looks like ClearSpeed competes more with GPGPU and Cell co-processor boards only. A ClearSpeed node will require a host CPU (so the total power consumption is the sum of the host CPU's and the ClearSpeed coprocessor's draw). It will also require another interconnect to hook the host CPUs together, in addition to ClearSpeed's PCIe interface. These are all extra costs, space and power.
If necessary, a Cell node may in fact run on its own without a host. Its MIMD + SIMD design also allows it to speed up a wider range of problems. However, the DP performance/watt is lower than ClearSpeed's.
Would be interesting to see the 2009 Top 500 list. After dropping from #5 to #24, it may be time for ClearSpeed to gun for top 5 again.
RudeCurve
24-Jun-2008, 09:11
The IBM Roadrunner achieved a petaflop machine with Cell accelerators by mapping each Cell processor to an Opteron core, minimizing the CPU portion of the machine. But ClearSpeed estimates that just 50 racks of CATS-700/x86 servers will deliver a petaflop machine, and use only one fifth the power (750 KW) of the Roadrunner.
Oh my! Clearspeed are only using 90nm accelerator chips too!
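For anyone wanting to sanity-check that estimate, here is a rough sketch using only the figures quoted above (50 racks, 750 KW, "one fifth the power") plus the CATS 700 numbers from earlier in the thread (1.1 TFLOPS, 500W per 1U unit). Everything derived here is back-of-the-envelope arithmetic on peak figures, and the variable names are just for illustration.

```python
import math

PETAFLOP_IN_TFLOPS = 1000.0
RACKS = 50                 # from the quoted estimate
TOTAL_KW = 750.0           # from the quoted estimate
CATS_TFLOPS = 1.1          # CATS 700 peak, per 1U unit
CATS_WATTS = 500.0         # CATS 700 power draw, per 1U unit

tflops_per_rack = PETAFLOP_IN_TFLOPS / RACKS
kw_per_rack = TOTAL_KW / RACKS

# Number of CATS 700 units needed to reach a peak petaflop.
cats_units = math.ceil(PETAFLOP_IN_TFLOPS / CATS_TFLOPS)
cats_kw = cats_units * CATS_WATTS / 1000.0
units_per_rack = cats_units / RACKS

# "One fifth the power" implies this figure for Roadrunner.
implied_roadrunner_kw = TOTAL_KW * 5

print(f"{tflops_per_rack:.0f} TFLOPS and {kw_per_rack:.0f} KW per rack")
print(f"{cats_units} CATS units ({units_per_rack:.1f} per rack), {cats_kw:.0f} KW for accelerators")
print(f"implied Roadrunner power: {implied_roadrunner_kw:.0f} KW")
```

On these numbers the accelerators account for well under the full 750 KW, so the remainder would presumably be the x86 hosts and interconnect. Note this is a peak petaflop, not a sustained Linpack one.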
According to the latest top 500 supercomputer list (http://www.top500.org/list/2008/06/100), the highest ranked ClearSpeed-equipped system placed at #24 (with 12344 nodes). Ranked above it (#23) is a "Dell" Xeon cluster with 9600 nodes on infiniband. With such superior specs, how did ClearSpeed lag behind a smaller Xeon cluster ? Perhaps the host system limited ClearSpeed's advantages in some ways ? Did PCIe cap the overall performance?
Well here is the reason.
The ClearSpeed accelerated result of 47 TeraFLOPS is a 24 percent performance boost from the non-accelerated performance of 38 TeraFLOPS published in the June 2006 TOP500 list. The increased performance, which is delivered with only a one percent increase in energy consumption, has enabled the TSUBAME supercomputer to achieve a ninth place ranking on the TOP500 list published this week at the 2006 Supercomputing conference in Tampa, Florida.
This upgrade consisted of only 360 Clearspeed boards or 720 Clearspeed processors. In other words they were able to achieve an additional 9TFLOPS from just 720 CS processors. So basically the Clearspeed processors accounted for only about 10% -20% of the total number of processors that make up TSUBAME. Also keep in mind this was back in 2006 when Dell's ABE didn't even exist on the top 10. With that said it's not really surprising that ABE barely beat out TSUBAME. If you were to build a supercomputer today from the ground up with the new Clearspeed processors, you'd be able to build a PFLOPS machine very easily and cheaply and only consume a fraction of the power and space as existing systems.
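The TSUBAME numbers above can be checked with a few lines. This is only a sketch of the arithmetic on the figures quoted from the press release (38 and 47 TeraFLOPS, 360 boards, 720 processors); the variable names are illustrative.

```python
before_tflops = 38.0   # non-accelerated result, June 2006 list
after_tflops = 47.0    # ClearSpeed-accelerated result
boards = 360
processors = 720

gain_tflops = after_tflops - before_tflops
boost_pct = 100.0 * gain_tflops / before_tflops

# Sustained contribution attributable to the accelerators, in GFLOPS.
gflops_per_board = gain_tflops * 1000.0 / boards
gflops_per_processor = gain_tflops * 1000.0 / processors

print(f"gain: {gain_tflops:.0f} TFLOPS ({boost_pct:.1f}% boost)")
print(f"{gflops_per_board:.1f} GFLOPS per board, {gflops_per_processor:.1f} per processor")
```

The boost works out to just under 24 percent, which matches the press release once rounded.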
ClearSpeed is a co-processor while Cell is a general purpose CPU. So the application area is somewhat different.
Yes but in the HPC sector the CELLs are being used as coprocessors, they still need host CPUs. Roadrunner uses dual-core Opterons...thousands of them. The Power core in the CELL as used in Roadrunner is used for other needed duties, it doesn't just sit idle.
This upgrade consisted of only 360 Clearspeed boards or 720 Clearspeed processors. In other words they were able to achieve an additional 9TFLOPS from just 720 CS processors. So basically the Clearspeed processors accounted for only about 10% -20% of the total number of processors that make up TSUBAME. Also keep in mind this was back in 2006 when Dell's ABE didn't even exist on the top 10. With that said it's not really surprising that ABE barely beat out TSUBAME. If you were to build a supercomputer today from the ground up with the new Clearspeed processors, you'd be able to build a PFLOPS machine very easily and cheaply and only consume a fraction of the power and space as existing systems.
If the new processor performs as expected, I'd imagine ClearSpeed aims to take the #1 spot in 2009 (or 2010).
Yes but in the HPC sector the CELLs are being used as coprocessors, they still need host CPUs. Roadrunner uses dual-core Opterons...thousands of them. The Power core in the CELL as used in Roadrunner is used for other needed duties, it doesn't just sit idle.
In the RoadRunner configuration, there are about 2 Cells to 1 Opteron node. The latter feeds data and runs the network. For ClearSpeed, I'd imagine you need at least 1 to 1 to keep the coprocessor active.
For smaller scale deployment, the users should be able to use a cluster of Cell servers "as is" though (e.g., I believe Georgia Tech, U. of Maryland have these).
In any case, Cell is still a more general architecture with a wider range of applications, but your mileage will vary.
randomhack
24-Jun-2008, 10:13
This upgrade consisted of only 360 Clearspeed boards or 720 Clearspeed processors. In other words they were able to achieve an additional 9TFLOPS from just 720 CS processors.
Hmm that gives roughly 25 gflops per board measured versus 96 gflops peak? Or were these numbers using some older board or something? If these are with the current boards, then I am not terribly impressed.
If 25 gflops are with the current board, then that's roughly $120/gflop, not a terribly attractive cost even if the power requirement is only 1W/gflop. On the cost side, you could do a lot better by buying a generic intel or amd quad-core part. On the power side, you also have to include some power on the host side when using clearspeed.
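For reference, the cost arithmetic can be sketched as follows. It uses the $3000-$3600 price range quoted earlier in the thread, the 25 gflops per board sustained figure derived from the TSUBAME upgrade, and the 96 gflops / 25W peak rating. Whether the sustained figure applies to the current board is exactly the open question, so treat the sustained rows as an assumption rather than a measurement.

```python
def dollars_per_gflop(price_usd: float, gflops: float) -> float:
    """Board price divided by the GFLOPS it delivers."""
    return price_usd / gflops


def watts_per_gflop(watts: float, gflops: float) -> float:
    """Board power divided by the GFLOPS it delivers (host power excluded)."""
    return watts / gflops


PRICES_USD = (3000.0, 3600.0)   # price range quoted in the thread
SUSTAINED_GFLOPS = 25.0         # per board, assumed from the TSUBAME upgrade
PEAK_GFLOPS = 96.0              # vendor peak rating
BOARD_WATTS = 25.0              # vendor power rating

for price in PRICES_USD:
    sustained = dollars_per_gflop(price, SUSTAINED_GFLOPS)
    peak = dollars_per_gflop(price, PEAK_GFLOPS)
    print(f"${price:.0f}: ${sustained:.2f}/gflop sustained, ${peak:.2f}/gflop peak")

print(f"{watts_per_gflop(BOARD_WATTS, SUSTAINED_GFLOPS):.2f} W/gflop at the sustained rate")
```

The spread between the sustained and peak rows shows how much the conclusion depends on which figure the current board actually achieves.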
RudeCurve
24-Jun-2008, 13:20
Hmm that gives roughly 25 gflops per board measured versus 96 gflops peak? Or were these numbers using some older board or something? If these are with the current boards, then I am not terribly impressed.
If 25 gflops are with the current board, then that's roughly $120/gflop, not a terribly attractive cost even if the power requirement is only 1W/gflop. On the cost side, you could do a lot better by buying a generic intel or amd quad-core part. On the power side, you also have to include some power on the host side when using clearspeed.
Yes those were the older boards based on the CSX600 processors, which get 25 GFLOPS per processor. The new boards have only 1 processor and are rated at 96 GFLOPS. These new boards themselves don't even have fans because of the low power consumption.
You are comparing a blade with a card?
CATS 700 1U blade with 24GB of ECC RAM, 1.1TFLOPS double precision, 500Watts power draw.
That blade is just a bunch of cards; it doesn't address any of the problems I brought up with the solution (PCI-express, memory per chip), just provides them in a denser footprint.
CELL isn't being used everywhere, it's being used in HPC and PS3. It's only being used by Toshiba because they've invested so much money in it and want to get something tangible out of it same with CELLs in supposedly SONY tvs. CELL is not all that convincing in the CE space. Toshiba and SONY promised to use them in their CE devices because...well they developed it along with IBM. Do you see anybody else using CELLs in their CE devices? No, everybody else is using ASICs.
No, Cell isn't being used everywhere. But its impact on HPC - and architecting in general - has been more significant than that of Clearspeed, and the fact is that a system based on Cell is simply more versatile than one based on Clearspeed in terms of the workloads it can address.
As for Toshiba and the SpursEngine, I believe that the chip offers some very tangible differentiation. Since it's at the heart of their post-HD DVD media strategy as well, it obviously has merit on the performance level vs competing available ASICs to boot.
RudeCurve
24-Jun-2008, 16:23
That blade is just a bunch of cards; it doesn't address any of the problems I brought up with the solution (PCI-express, memory per chip), just provides them in a denser footprint.
Who in the HPC sector said it's a problem? Show me one scientific problem that the Clearspeeds couldn't solve with a high degree of efficiency due to its *memory and PCIe problems*.
By the way having more memory per chip doesn't outweigh CELL's disadvantages, which are heat, power consumption, and footprint. A 1U Clearspeed blade provides over 1TFLOPS of compute power. A 1U CELL blade could only provide a fraction of that. If I could run my molecular simulations with 1TFLOPS of compute power and 24GB of memory with a high degree of efficiency, why would I want to run it on a CELL blade? What advantage does a CELL blade offer? More memory? Have you ever thought that maybe CELL needs more memory to be as efficient? Is the higher memory capacity per CELL chip a tangible feature or is it just a marketing bullet point?
No, Cell isn't being used everywhere. But its impact on HPC - and architecting in general - has been more significant than that of Clearspeed, and the fact is that a system based on Cell is simply more versatile than one based on Clearspeed in terms of the workloads it can address.
That's like claiming basketball player A is taller than basketball player B...by 1 inch...hardly convincing.
As for Toshiba and the SpursEngine, I believe that the chip offers some very tangible differentiation. Since it's at the heart of their post-HD DVD media strategy as well, it obviously has merit on the performance level vs competing available ASICs to boot.
Toshiba is using CELL to do upscaling and image enhancement/processing in their CE devices, that's not a tangible differentiation over an ASIC designed to do the same thing. Now if Toshiba releases a TV where you could use it to surf the internet, that would be a tangible difference.
Who in the HPC sector said it's a problem? Show me one scientific problem that the Clearspeeds couldn't solve with a high degree of efficiency due to its *memory and PCIe problems*.
By the way having more memory per chip doesn't outweigh CELL's disadvantages, which are heat, power consumption, and footprint. A 1U Clearspeed blade provides over 1TFLOPS of compute power. A 1U CELL blade could only provide a fraction of that. If I could run my molecular simulations with 1TFLOPS of compute power and 24GB of memory with a high degree of efficiency, why would I want to run it on a CELL blade? What advantage does a CELL blade offer? More memory? Have you ever thought that maybe CELL needs more memory to be as efficient? Is the higher memory capacity per CELL chip a tangible feature or is it just a marketing bullet point?
Memory addressability directly relates to the nature and scope of the problems a chip is able to work on, as it provides a direct limit as to the amount of data in play. Your bringing up the 24GB of memory again vs the obviated 2GB per chip/card within the server itself makes me wonder whether you're grasping the point here. With the QS22, it's 16GB per chip; with the CATS 700, it's 2GB.
Further as the system scales out across several nodes (or even intra-node), the PCI bus becomes a crucially limiting factor as the latency comes into play, and the latency is enormous - over a hundred times greater. HPC and supercomputing on a large scale are as much about the interconnects as they are about the chips themselves, and managed communication across the system as a whole.
Toshiba is using CELL to do upscaling and image enhancement/processing in their CE devices, that's not a tangible differentiation over an ASIC.
The results are tangibly different, that's for sure, or they would just be using an ASIC. The magic mirror demo, the MPEG-2 tiling, and the super-upscaling are all applications that I've not seen replicated elsewhere on a cheap ASIC, and I'll point out what should be the obvious point: if Toshiba could achieve the same result outside of the SpursEngine using cheaper hardware, that's what they'd be doing.
That's like claiming basketball player A is taller than basketball player B by 1 inch...big deal and not convincing at all.
I'll end by addressing this. If you're here to bash Cell, you're in the wrong place. This sub-forum is explicitly for the discussion of Cell in terms of its ecosystem and programming. This thread itself is 50/50 in terms of whether it should be here or not, but I opted to let it stay for comparative purposes. What seems to be happening though is that someone with a seeming derisive view of the architecture is taking this as an opportunity to rail against it instead. Your admiration for Clearspeed is noted; indeed it's an admirable architecture. But you need to either a) change your tone within this sub-forum, or b) take the rant to a different sub-forum.
I think ClearSpeed's strategy ties in well with the rise of the "generic node" (off-the-shelf) supercomputers. The exceptionally high FLOP count is a hallmark of SIMD processors, and complements the host processor's generality.
The power rating is impressive too, but it is in fact diluted by the typically hungry host processor. Nonetheless, the averaged performance/watt should still be attractive (Otherwise, life will be tough).
Its real world impact depends on various factors such as effectiveness (as with GPGPUs), robustness, adoption, etc. If it is as good as advertised, we should see some shuffling in the Top 500 list in 2009 and 2010.
At the moment, I think their key problem is business execution, because they have had 2 years since their first breakthrough product launched but have not made a major impact yet.
Cell is a different animal because its vision is based on the message passing model. Being a self-contained, predictable, power and space efficient compute element, it can scale from CE devices to a server node amazingly well. It has also been used to speed up assorted problems, from tree traversals to number crunching. Unfortunately, this generalization also makes the basic unit less performant (or too expensive) compared to highly specialized devices for a given task.
I think both ClearSpeed and Cell will need at least two to three more years before we can conclude their success in their intended use.
I'd say for now Sony looks the most committed and well-positioned to take advantage of Cell (There are more Cell software/services to come ;-) ). Toshiba is just starting (since they have always wanted to remove the PPE and have finally done so). IBM is probably pushing mini-RoadRunners aggressively to other installations as we speak, and also re-grouping/re-evaluating their position.
Because Cell is actively deployed and used on a large scale, it can bring interesting dynamics to the ecosystem. Today, the Folding@Home project is an interesting exercise in a public, fully distributed Cell network.
Toshiba is using CELL to do upscaling and image enhancement/processing in their CE devices, that's not a tangible differentiation over an ASIC designed to do the same thing. Now if Toshiba releases a TV where you could use it to surf the internet, that would be a tangible difference.
I think Toshiba mentioned that Cell allows them to do Super-Upconversion. Indeed more features are coming soon. Surfing the net and video conferencing are just 2 examples highlighted. If you look at the bundled software on their SpursEngine laptop, you will also see potential in UI breakthroughs and other home media server uses.
In Sony's case, they are also starting to add tru2way capability to HD TVs (i.e., built-in Java set-top boxes). This is similar to PS3's BD-Live foundation and its PlayTV accessory.
I think a closed P2P (Cell) network for content distribution is interesting too.
I think ClearSpeed's strategy ties in well with the rise of the "generic node" (off-the-shelf) supercomputers. The exceptionally high FLOP count is a hallmark of SIMD processors, and complements the host processor's generality.
The power rating is impressive too, but it is in fact diluted by the typically hungry host processor. Nonetheless, the averaged performance/watt should still be attractive (Otherwise, life will be tough).
Its real world impact depends on various factors such as effectiveness (as with GPGPUs), robustness, adoption, etc. If it is as good as advertised, we should see some shuffling in the Top 500 list in 2009 and 2010.
At the moment, I think their key problem is business execution, because they have had 2 years since their first breakthrough product launched but have not made a major impact yet.
Clearspeed's core market is the same as the market that the GPGPU makers are initially playing in; the workstation-as-supercomputer market, for lack of a better term. When a single add-in card is all you need to really manage on a system level, what you find yourself with is an incredible price/performance value. Certainly the compute resources of a Clearspeed product or a Firestream or a Tesla achieved through 'traditional' means would be much more expensive and require a much larger footprint. What is essentially created is a thriving cottage industry where players in the HPC space who would prior have found equivalent systems cost-prohibitive are now able to do serious work on cheap setups.
But as we go into the massive rack-server environments, the solutions that on the desktop provide an unequaled value begin to strain in areas that are important at scale. Utility is still there to be had, but it becomes constrained in relation to theoretical non-board bound implementations of those same architectures, and their competition.
Beyond that, the fact is that a lot of additional factors play into the appeal of an architecture.
Each of the four architectures being discussed here (Cell, Radeon, GeForce, Clearspeed) has a clear set of differentiating positive qualities, but some qualities can be weighted as more important than others, and some have a longer list of such qualities than others.
To its benefit, the fact is that Cell has the weight of the preeminent supercomputing manufacturer behind it (and the support/service that brings), an evolving SDK that has made code porting increasingly simple, an out-of-the-box heterogeneous support structure, BladeCenter inclusion, and the fact that the PS3 provides a full cheap work environment... all are factors that are playing in the PowerXCell 8i's favor right now in the HPC space, and certainly the chip has gained a lot of momentum.
For NVidia, the deal is that it's cheap, accessible, and CUDA has leveraged those facts to become a respectably established tool in a very short amount of time. CUDA essentially is the face of GPGPU at the moment, and the scope of individuals who have delved into it speaks to its promise as a field.
AMD doesn't have the established programmability head-start that NVidia does, but the DP performance is stellar, wattage is under control, and everyone knows what the deal is essentially.
Clearspeed has been brought up time and again around here, and the fact is that for DP performance they are the Flop/watt leader. But the constraints of the board model and the disadvantages in proliferation and programmability they face when compared to the above players put them at a comparative disadvantage in terms of "taking off." This new card and a supposedly much-improved SDK may be what they need to break out; we'll see what happens.
But with the above players and the arrival of Larrabee in the near future, there is no question the Top500 list a couple of years from now is going to look wholly changed.
I think both ClearSpeed and Cell will need at least two to three more years before we can conclude their success in their intended use.
I think both can already be qualified as successes, industry-dependent. Clearspeed has always occupied a certain niche which it certainly performs well in; whether GPGPU squeezes it or not, I think it will have to be considered as a 'success' at what it did. For Cell, frankly the gains in HPC I think are at or beyond what many cynically believed would be the case when the architecture launched. In the consumer-electronics industry though, it's not looking very prolific at all.
In Sony's case, they are also starting to add tru2way capability to HD TVs (i.e., built-in Java set-top boxes)
I don't think that has anything to do with Cell though. :)
RudeCurve
24-Jun-2008, 17:49
Memory addressability directly relates to the nature and scope of the problems a chip is able to work on, as it provides a direct limit as to the amount of data in play. Your bringing up the 24GB of memory again vs the obviated 2GB per chip/card within the server itself makes me wonder whether you're grasping the point here. With the QS22, it's 16GB per chip; with the CATS 700, it's 2GB.
I understand the concept just fine; the "more memory is better" rule also has limitations, ever heard of diminishing returns? On the other hand you don't seem to be understanding my point and you still haven't offered any proof to support your claims. If 2GB per processor is a problem then show me some real-world HPC problems that perform poorly with your invented memory capacity and PCIe bottleneck.
Further as the system scales out across several nodes (or even intra-node), the PCI bus becomes a crucially limiting factor as the latency comes into play, and the latency is enormous - over a hundred times greater. HPC and supercomputing on a large scale are as much about the interconnects as they are about the chips themselves, and managed communication across the system as a whole.
Another basic concept that has no real-world evidence with respect to Clearspeed's design implementation. Show me where PCIe has dampened Clearspeed's real-world performance. I'd like to see some numbers.
The results are tangibly different, that's for sure, or they would just be using an ASIC. The magic mirror demo, the MPEG-2 tiling, and the super-upscaling are all applications that I've not seen replicated elsewhere on a cheap ASIC, and I'll point out what should be the obvious point: if Toshiba could achieve the same result outside of the SpursEngine using cheaper hardware, that's what they'd be doing.
The magic mirror is a niche application. Why design an ASIC for such a niche market? It's not worth it. MPEG2 tiling..how is that tangible? Wouldn't a consumer need multiple input sources running at the same time to be able to feed this MPEG2 tiling engine? Why would a normal person have such a setup?
Super Upscaling...let's see the results first before we claim it's something only CELL could provide. Again for whatever reason you keep ignoring my point. Toshiba is not using ASICs for these things because they're niche applications; it's neither practical nor economical to design different ASICs for niche applications that have limited market value. It's not tangible.
The only application that will see volume is Super Upscaling; again, let's see the results first to see how it compares to the high-end upscaling ASICs already out on the market. If it ends up being the same then it's just reinventing the wheel.
I'll end by addressing this. If you're here to bash Cell, you're in the wrong place. This sub-forum is explicitly for the discussion of Cell in terms of its ecosystem and programming. This thread itself is 50/50 in terms of whether it should be here or not, but I opted to let it stay for comparative purposes. What seems to be happening though is that someone with a seemingly derisive view of the architecture is taking this as an opportunity to rail against it instead. Your admiration for Clearspeed is noted; indeed it's an admirable architecture. But you need to either a) change your tone within this sub-forum, or b) take the rant to a different sub-forum.
I'm just responding to uninformed statements that were made about CELLs supposed advantages. The CELL boards are competing in the same market as the Clearspeed and GPGPU boards whether we want to accept it or not. If you think the CELL blades are not competing for the same HPC dollars as the Clearspeed blades then you're in for a rude awakening. Finally if we don't have real numbers to verify supposed advantages/disadvantages it's all FUD.
I understand the concept just fine, the more memory is better rule also has limitations, ever heard of diminishing returns? On the other hand you don't seem to be understanding my point and you still haven't offered any proof to support your claims. If 2GB per processor is a problem then show me some realworld HPC problems that perform poorly with your invented memory capacity and PCIe bottleneck.
Another basic concept that has no realworld evidence with respect to Clearspeeds design implementation. Show me where PCIe has dampened Clearspeeds realworld performance. I'd like to see some numbers.
There is no 'proof' required. The nuclear simulations that Roadrunner will be working on are just one obvious example of specific applications for which memory availability provides a very real and tangible barrier, due to their size and scope. The PCI bottleneck, on the micro scale, comes in when the chip has to swap out data in its already constrained memory footprint. Consider then the effects of several nodes feeding one another processed data and the breakdown that occurs when you have processing stalls on an ever compounded level. Think of a graphics card needing to go out to main memory because it couldn't fit all data into the on-card RAM, and you get a very clear idea of what we're discussing here. Doesn't matter that the chip itself is capable of a certain level of performance, as soon as it needs to cross that PCI-e bus, performance goes to hell compared to the theoreticals. Supercomputers are built around their interconnects as much as their architectures; this isn't anything that I'm making up here, this is established.
The magic mirror is a niche application. Why design an ASIC for such a niche market? It's not worth it. MPEG2 tiling..how is that tangible? Wouldn't a consumer need multiple input sources running at the same time to be able to feed this MPEG2 tiling engine? Why would a normal person have such a setup?
I'm not saying that a consumer 'needs' it. I'm not saying either that the SpursEngine is the result of a wise economic decision. What I am saying is that the SpursEngine is more powerful, flexible, and capable than most ASICs across a number of tasks.
I'm just responding to uninformed statements that were made about CELLs supposed advantages. The CELL boards are competing in the same market as the Clearspeed and GPGPU boards whether we want to accept it or not. If we don't have real numbers to verify supposed advantages/disadvantages it's all FUD.
Yes, the Cell competes in the same markets, obviously. It also obviously has certain advantages from an HPC perspective, advantages that I certainly don't have documents on tap to 'prove' anything to you with, but advantages that an understanding of system architecting and needs-profiling should lay bare as obvious to anyone. Your refusal to accept the memory and PCI-e constraints as 'real' sans evidence speaks, frankly, to either a naivete or lack of understanding on your part wrt the issue.
Clearspeed's core market is the same as the market that the GPGPU makers are initially playing in; the workstation-as-supercomputer market, for lack of a better term. When a single add-in card is all you need to really manage on a system level, what you find yourself with is an incredible price/performance value. Certainly the compute resources of a Clearspeed product or a Firestream or a Tesla achieved through 'traditional' means would be much more expensive and require a much larger footprint. What is essentially created is a thriving cottage industry where players in the HPC space who would prior have found equivalent systems cost-prohibitive are now able to do serious work on cheap setups.
But as we go into the massive rack-server environments, the solutions that on the desktop provide an unequaled value begin to strain in areas that are important at scale. Utility is still there to be had, but it becomes constrained in relation to theoretical non-board bound implementations of those same architectures, and their competition.
Carl, if they have been in the market for 2 years plus a product rev, then I assume they have learned their lessons even if they went in with a mismatched solution originally. Their first supercomputer benchmark was also a testament to their applicability for problems related to the benchmarks. This is not to say that ClearSpeed will be suitable for all supercomputing applications. To me, it seems to be targeting a very specific and real need within this space. OTOH, Cell (and hence RoadRunner) is more general. We have all seen papers on 50 times speed up from Cell performing tree searches. On the number crunching front, even though it may not perform as well as specialized hardware, it is still applicable for a large number of math problems.
I don't think that has anything to do with Cell though. :)
Yeah not directly. There are embedded processors supporting "Java instructions" but Cell's ability to run Java programs together with other media processing will help in these hardware units.
The only application that will see volume is Super Upscaling, again let's see the results first to see how it compares to the highend upscaling ASICs already out on the market. If it ends up being the same then it's just reinventing the wheel.
Toshiba already demo'ed SUC to a panel of journalists in the last few weeks. The differences are noticeable but naturally they did not compare the superupscaled picture with a true 1080p original image.
MPEG2 tiling..how is that tangible? Wouldn't a consumer need multiple input sources running at the same time to be able to feed this MPEG2 tiling engine? Why would a normal person have such a setup?
I think someone used this as an example for nextgen TV or video library UI (where you can see live thumbnails of channels during navigation).
Carl, if they have been in the market for 2 years plus a product rev, then I assume they have learned their lessons even if they went in with a mismatched solution originally.
Well that's what I'm saying though; who says their initial product was mismatched? It served the market of add-in HPC quite admirably, which was the market they were targeting, and why I think it's easy enough to dub them a qualified 'success.'
Their first supercomputer benchmark was also a testament to their applicability for general benchmarking. This is not to say that ClearSpeed will be suitable for all supercomputing applications. To me, it seems to be targeting a very specific and real need within this space. OTOH, Cell (and hence RoadRunner) is more general. We have all seen papers on 50 times speed up from Cell performing tree searches. On the number crunching front, even though it may not perform as well as specialized hardware, it is still applicable for a large number of math problems.
I think it's wrong to reduce it to 'general' vs 'specialized' though. More accurate to say simply that each architecture has certain strengths and weaknesses. Roadrunner was built for a specific purpose; the fact that it is also broad/general in the number of tasks it can address is a secondary benefit rather than the primary driver. And for that primary purpose though, Clearspeed's latency and memory disadvantages would have ruled it out as a contender before the questions of programming/generality ever even arose.
We'll be seeing a supercomputer based on GPGPU in the not-too-distant future, and it should serve as a good test-bench for a lot of the angles we're discussing here.
http://www.beyond3d.com/content/news/632
I have a feeling that the GPU's here are going to be acting as selective turbo-chargers, however, with a substantial amount of the work still done on the Intel CPUs. It's worth noting that in Roadrunner, almost all generated Flops come from the SPE's, with the Opterons serving almost entirely in an I/O role.
Yeah not directly. There are embedded processors supporting "Java instructions" but Cell's ability to run Java programs together with other media processing will help in these hardware units.
But I'm not sure we'll be finding it in any of those hardware units except for the PS3 itself, unless Spurs makes its way as well. Cell is obviously 'better' from a capabilities standpoint, but the Tru2Way manufacturers will opt for lower-power and price savings where possible I think.
Well that's what I'm saying though; who says their initial product was mismatched? It served the market of add-in HPC quite admirably, which was the market they were targeting, and why I think it's easy enough to dub them a qualified 'success.'
I think it's wrong to reduce it to 'general' vs 'specialized' though. More accurate to say simply that each architecture has certain strengths and weaknesses. Roadrunner was built for a specific purpose; the fact that it is also broad/general in the number of tasks it can address is a secondary benefit rather than the primary driver. And for that primary purpose though, Clearspeed's latency and memory disadvantages would have ruled it out as a contender before the questions of programming/generality ever even arose.
I see what you're saying (although my definition of 'success' is different but I'll ignore that to keep the discussion clean). Yes, the customers will want to make sure that "all" their needs are met adequately before committing to a spanky supercomputer.
If we get down to this level, I think without concrete examples and customers, it's hard to argue about ClearSpeed's (and Cell's) suitability or unsuitability.
But I'm not sure we'll be finding it in any of those hardware units except for the PS3 itself, unless Spurs makes its way as well. Cell is obviously 'better' from a capabilities standpoint, but the Tru2Way manufacturers will opt for lower-power and price savings where possible I think.
Point taken. I think it also depends on the business model. I was told 2Wire's BOM cost is prohibitive, but AT&T chose it eventually because of specific built-in customer support features. These advanced features allowed them to conduct their business more cost effectively compared to the usual brands.
So yes, Cell is expensive. But it depends on what else is in the system and *if* Cell enables/differentiates them. If not, then everyone will go the low cost route.
RudeCurve
25-Jun-2008, 06:27
There is no 'proof' required. The nuclear simulations that Roadrunner will be working on are just one obvious example of specific applications for which memory availability provides a very real and tangible barrier, due to their size and scope. The PCI bottleneck, on the micro scale, comes in when the chip has to swap out data in its already constrained memory footprint. Consider then the effects of several nodes feeding one another processed data and the breakdown that occurs when you have processing stalls on an ever compounded level. Think of a graphics card needing to go out to main memory because it couldn't fit all data into the on-card RAM, and you get a very clear idea of what we're discussing here. Doesn't matter that the chip itself is capable of a certain level of performance, as soon as it needs to cross that PCI-e bus, performance goes to hell compared to the theoreticals. Supercomputers are built around their interconnects as much as their architectures; this isn't anything that I'm making up here, this is established.
You've been speaking in generalities yet at the same time making specific claims. Anyone can claim "more memory is better" or "more bandwidth is better", but you've taken it a step further by claiming 2GB/processor is a "problem" and 16GB/processor is suddenly "not a problem". You know this how? Have you run any real scientific problems using Clearspeed accelerator cards to come to this conclusion? Or are you simply assuming that since 16GB>2GB the Clearspeeds must not be able to do real problem solving at the same level as CELL due to its 2GB/processor "problem"? Since you've already determined that 2GB/processor is a problem, let me ask you this, where is the point of diminishing returns wrt memory capacity per accelerating processor? Is it more/less than 16GB?
What I am saying is that the SpursEngine is more powerful, flexible, and capable than most ASICs across a number of tasks.
Well that's a reasonable assumption. A DSP is more flexible than an ASIC yes.
You've been speaking in generalities yet at the same time making specific claims. Anyone can claim "more memory is better" or "more bandwidth is better", but you've taken it a step further by claiming 2GB/processor is a "problem" and 16GB/processor is suddenly "not a problem". You know this how? Have you run any real scientific problems using Clearspeed accelerator cards to come to this conclusion? Or are you simply assuming that since 16GB>2GB the Clearspeeds must not be able to do real problem solving at the same level as CELL due to its 2GB/processor "problem"? Since you've already determined that 2GB/processor is a problem, let me ask you this, where is the point of diminishing returns wrt memory capacity per accelerating processor? Is it more/less than 16GB?
I didn't say that 2GB is a "problem," I said it was a problem for the computations Roadrunner is targeted towards. This entire concept of 'diminishing returns' of yours works under the premise that there exists only one class/size/scope of scientific computing, and that as such we are on a graded performance slope. For certain problems, 2GB is more than enough. For others, it is not enough. It is not about some linear or asymptotic performance gain here related to memory, latency, and/or bandwidth; it is about a performance cliff that these architectures will fall off of when they are significantly constrained in any one of these areas.
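To make the 'cliff' argument concrete, here is a toy model, not a benchmark. It assumes that any part of the working set that does not fit in on-board memory has to be streamed across the host bus once per pass, and that transfer and compute do not overlap. The function name and every number in it (peak rate, flops per byte, bus bandwidth) are hypothetical values picked for illustration; none of them are measured figures for Clearspeed, Cell, or any GPU board.

```python
def effective_gflops(working_set_gb, local_mem_gb, peak_gflops,
                     flops_per_byte, bus_gb_per_s):
    """Toy estimate of sustained GFlops for one pass over a working set.

    Assumptions (all hypothetical): data resident in on-board memory is
    processed at peak rate; any overflow must cross the host bus once,
    and that transfer is not overlapped with computation.
    """
    total_flops = working_set_gb * 1e9 * flops_per_byte
    compute_s = total_flops / (peak_gflops * 1e9)
    # Only the part that does not fit on the board pays the bus cost.
    overflow_gb = max(0.0, working_set_gb - local_mem_gb)
    transfer_s = overflow_gb / bus_gb_per_s
    return total_flops / (compute_s + transfer_s) / 1e9


if __name__ == "__main__":
    # Illustrative parameters only: 100 GFlops peak, 10 flops per byte,
    # 4 GB/s host bus, comparing a 2 GB board against a 16 GB board.
    for size_gb in (1, 2, 4, 8, 16):
        small = effective_gflops(size_gb, 2, 100, 10, 4)
        large = effective_gflops(size_gb, 16, 100, 10, 4)
        print(f"{size_gb:>2} GB working set: "
              f"2GB board {small:6.1f} GFlops, 16GB board {large:6.1f} GFlops")
```

Under these assumptions both boards sit at peak until the working set exceeds on-board memory, after which the smaller board's sustained rate drops while the larger one is unaffected. A real system would also pay per-transfer latency and could hide some cost by overlapping transfers with compute, so treat the output as a shape, not a prediction.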
So, for the QS22, it simply is able to handle a larger scope of problems before hitting such a barrier. For the Clearspeed card, that barrier is lower. Now - the majority of simulations one is likely to run would be fine in a 2GB environment, but when we're talking about truly massive data sets... and for Los Alamos' purposes they specifically requested more. We're talking about simulating the direct and secondary effects of a nuclear explosion on surrounding matter and environment here on a second-to-second level. You're being a bit flippant in terms of what you are willing to toss aside in your consideration, to the extent that you don't acknowledge a very material spread in a key point of capacity.
I'm personally inclined to believe that your thinking on the matter is a product of the quasi-technical articles that float around the Internet, that dismiss 'hyped' specs at the same time as they hail them when convenient. Case in point, the whole notion that Clearspeed has superior DP Flop and wattage counts... and thus these matter... but Cell having greater memory access and bandwidth must conversely mean nothing. Your rather snide 'DSP' remark makes me think you are likely one of those for whom, back in the day when Cell was touting Flop and watt numbers, those numbers were in turn reduced to 'hype'.
I'm a fan of Clearspeed and what they're looking to achieve in terms of a performance/value proposition, just as I'm a fan of the disruptive power of GPGPU computing and its accessibility. But the Cell-hating gets old, and is wholly unsuited to this sub-forum. The Cell architecture doesn't suck, it's actually quite forward looking, and it's gaining good traction in the HPC space right now. If I've interpreted your tone/purpose here incorrectly, I do apologize, but realize that this sub-forum is an extension of this site here (http://www.cellperformance.com/articles/) rather than a general sub-forum of Beyond3D, and as such there really is no room for tolerance here in terms of either real or perceived Cell-related trolling. As I mentioned before, this thread itself was a questionable one in terms of whether it should remain here or be moved elsewhere, but... well it's here for now.
Shingoshi
29-Sep-2008, 19:00
I found this as a result of a search on Google to see if AMD/ATI were yet producing a rackmount product similar to the Nvidia Tesla project. When I first became acquainted with the existence of the Cell, I was rather impressed. But I was confused, even from the beginning as to why the price was so high for it (the uninformed perception I speak of below). I'm speaking specifically of the Mercury Cell Board, at a then $8000 MSRP. So when the Nvidia came along, and now the AMD Firestream, just from my limited understanding of all the specifics involved here, it seems indeed gloomy.
This morning I was just playing with the idea (dream) of building a compute engine inside of a 16U case. I had been looking at different options with regard to backplanes to achieve the highest level of processor density possible. I came across the ClearSpeed, but didn't know anything about it. At first, I saw the 96GFlop number and thought wow! But I wasn't certain if that was impressive or not, because I hadn't really investigated this enough. However, your spirited discussion here leads me to pursue this further to understand more fully the benefits each has to offer.
Clearly, right now, at $1000 per Firestream 9250 card, it seems very attractive. And based on the fact that it offers a lower initial cost for installation, many people are going to be attracted to it, and the Nvidia. And the problem for these other two is that most of the public on which this competition will depend, will find themselves looking at cheaper solutions. I was thinking that Physx had missed a great opportunity by limiting the number of cards that could be used, thinking what a difference they could make for small projects. I realize that Supercomputing may have specific needs, but those are beyond the requirements of most small projects. And as those small projects gain popularity, the companies (Nvidia/AMD) will enjoy an increasing public perception of the presumed superiority, regardless of that being true or not.
The Clearspeed/Cell front (consumer base) will likely diminish in size, even though larger sums will be spent on each installation. The problem is that most of the money will be made (by their competition) in the volume of sales acquired from smaller clients, and they will likely ultimately set the trend for what's to come. And additionally, Nvidia/AMD will have the assets to make whatever adaptation they need to compete with the larger installations. They're not going to sit back and say to themselves, "that's just too large for us to consider."
I guess what I'm trying to say is that all of this may become moot. Because if more equity is eventually dumped into the cheaper solutions, the larger projects will find fewer opportunities to grow into. They have a very limited set of clientele (for supercomputing), compared to the masses which will adopt the products of the younger upstarts. And then the accountants will of course be involved in decisions which have everything to do with money, and less with performance.
So time will prove what will come of all this. I do know that a few of those mentioned here will change their models and strategies as required by economic conditions. And whoever makes the best decisions in that regard will be the winner(s).
Just my uninformed observations.
Shingoshi
Shingoshi
30-Sep-2008, 03:07
http://www.clearspeed.com/acceleration/reliability/ecc/
Based on this, I feel justified in thinking that all memory should be ecc. Presently, such a move would not be well accepted. But if the computational markets become more of an asset to those manufacturing GPGPUs, ecc will likely become a more standardized feature in more systems. Add to this that if the reasoning given by Clearspeed, concerning the increasingly smaller size of chips and the speed at which they operate, holds, ecc may become a necessity for everyone, regardless of their activities. And when you consider how relatively inexpensive it would be for GPGPUs to add ecc to their foundation, the present advantage of this for Clearspeed and anyone else currently using ecc, will evaporate. Adding ecc would just be too simple to pass up. Just think of a Firestream/Tesla with either ecc registered or FBDIMMs. Can you see four 4GB DIMMs per card? That would significantly diminish any argument as previously mentioned.
There's another thing to think about. Any changes in the systems of Nvidia/AMD will generally be seen as advancements, by increasing the functionality of their units. However, I don't think their competition will be viewed so favorably, if they now go back and attempt to introduce features to compete with Nvidia/AMD. Maybe I'm wrong here and this might not make any sense. But I do think it will cost less for Nvidia/AMD to enhance their products, than it would for the others. And then, there's the all important factor of name recognition. Very few people know who the other companies are. Yes, they're each major names in their own right, at least for the Cell group. But they're attempting to convince the public to accept platforms with which the masses are unfamiliar. That's not true for Nvidia/AMD. Their collective presence is about as ubiquitous as one could hope for. They don't have to sell themselves to the public, that's already been accomplished.
I guess I'm thinking that performance will be measured by who has the largest market share. Not who is more efficient at work.
Maybe? Time will tell.
Shingoshi