Playstation 3: Hardware Info and Price

Gubbi said:
Outside of compute farms you are not going to see CELL in servers, last of all in webservers. Context switching SPEs is extremely expensive, so virtualizing them is, for all practical purposes, impossible. That makes them 100% unfit for server workloads.
What about servers providing encrypted content, where the encryption programs can remain static in the SPEs' LS? I can also think of cases where images are generated on demand where the SPEs would help out nicely.

Or are you just talking about legacy J2EE style programs?
 
Tahir2 said:
Umm it's not quite as simple as that.. we use networking to load OS's onto customer machines using the Windows OPK software and Ghost.

The specs vary widely - the Gigabit controllers are generally faster. Today I even used an old 10Mbit card; that was a bit painful when dealing with such large transfers.
Right, my assumption is that any machine with a 10Mbit card is going to have other slower components as well. Something with a gigabit card would have better specs. I'm also assuming that file I/O is the limiter in this regard (concerning the PS3). But I don't doubt there are circumstances where a gigabit network is appropriate and useful.
 
Tahir2 said:
No - your assumption is incorrect.
Can you elaborate at all? This is counter to my experience, where I could have two fairly powerful machines transfer files, both sitting behind a simple switch, and they are file I/O limited rather than network bound.

I'm just trying to understand the scenario where you see network bandwidth as the limiter.
 
Without going into specifics, the 10Mbit card was on a new machine that did not have drivers in the OPK image, so we couldn't use its GLAN or LAN ports and I had to use an old network card to get it onto our network. This is pre-OS, using a modified OPK disc.

Ghosting over the Gigabit network using that dinosaur 10Mbit card takes about 30-40 minutes for a 2GB image.
Going to a Marvell network card on the other end using a 32bit version of a Ghost EXE takes approximately 3 minutes on a good day and typically 5 minutes.
Using 100Mbit cards takes longer, but nowhere near 10x slower.

So there is some difference in the implementations and it is related to the hardware in the network cards used.

That is as specific as I can be....
 
Gubbi said:
Outside of compute farms you are not going to see CELL in servers, last of all in webservers. Context switching SPEs is extremely expensive, so virtualizing them is, for all practical purposes, impossible. That makes them 100% unfit for server workloads.

Cheers

+ For Cell servers they are working with IBM, and SCE may release a game server. Kutaragi wants game servers such as Polyphony Digital's to be Cell-based. In the future he wants to create a cyber world by stacking up thousands of Cell servers.

source: http://www.beyond3d.com/forum/showthread.php?t=31328

@Sis: copying images sure does max out our 100Mbit connections. We're moving to a new building in a few weeks' time, where we will have 1Gbit all round - completely new and freshly installed by HP. If you're not maxing out your 100Mbit connections, one of the most common reasons is that your network isn't set up all that efficiently, or that there is simply too much traffic on it already. We had our 100Mbit network completely overhauled several years ago, replacing hubs with Cisco routers, and that boosted our network performance and reliability considerably.

Another area where I think the 1Gbit connection is going to help is live multiplayer competitions on LAN parties.

Also, don't underestimate how heavy webpages are becoming. Once you start to reach higher download speeds, or you combine Flash and other video elements, music streaming and whatnot, a Cell is going to help out. Web browsers are also quite good at building up a page asynchronously, and switching between a lot of threads on a machine without hardware multithreading is very inefficient.

Having said that, you are probably right that it won't matter much for most webpages right now. But take for instance something like the Opera browser, which can remember the pages you had open. When you boot it up, these are automatically refreshed, something which could be done faster on a multicore machine.
 
Sis said:
Ah, ok. I thought Arwin was using that to back up the statement that Cell will be good at multithreaded tasks. I didn't realize this thread had turned into a discussion of the viability of using the PS3 as a web server.

Cell (at least the SPEs) isn't good at multi-threading - it is good at parallel processing. This is because context switches are expensive on the SPE, due to the huge local store compared to the register set of most CPUs, and because to get the best efficiency out of the SPE you need to load the code into the local store and run it from there. Cell needs to stream processes, or run one process, unload it, and run another in sequence, rather than switch from one process to another while both are active.
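To put that run-to-completion model in concrete terms, a PPE-side job launcher with libspe2 looks roughly like this (a sketch only - the program name and job layout are made up, and error handling is minimal):

Code:
#include <libspe2.h>

/* SPU program embedded in the PPE binary; the name is hypothetical. */
extern spe_program_handle_t process_chunk_spu;

int run_job(void *job_data)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t spe = spe_context_create(0, NULL);
    if (spe == NULL)
        return -1;

    /* Load the SPU code into the local store once... */
    spe_program_load(spe, &process_chunk_spu);

    /* ...then run the job to completion. No preemption, no LS swap:
       the SPE finishes this work item before anything else touches it. */
    spe_context_run(spe, &entry, 0, job_data, NULL, NULL);

    spe_context_destroy(spe);
    return 0;
}

To process many jobs you call this in sequence (or give each of several SPEs its own queue of jobs), rather than time-slicing live processes across them.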
 
Gubbi said:
Outside of compute farms you are not going to see CELL in servers, last of all in webservers. Context switching SPEs is extremely expensive, so virtualizing them is, for all practical purposes, impossible. That makes them 100% unfit for server workloads.

Cheers

I agree, but I don't think this is the biggest issue. The biggest issue is that web servers aren't CPU intensive, they are I/O bound. The CPU in a web server spends most of its time caching or buffering data from hard drive, network, SAN/NAS etc. to/from RAM. You can't really use the SPEs effectively to do this, so in a Cell-based web server you will not make much use of the SPEs.

There are applications where Cell can and will be used in servers accessible on the web though:
1) In combination with a farm of conventional web servers, to do things like encryption or compression of the data streams produced by the conventional servers.
2) Future compute-intensive server applications - like telephone voice recognition, image recognition or biometric security processing (probably in combination with conventional servers).
3) Serving compute-intensive applications like multi-player distributed processing and server-based games.
 
SPM said:
I agree, but I don't think this is the biggest issue. The biggest issue is that web servers aren't CPU intensive, they are I/O bound. The CPU in a web server spends most of its time caching or buffering data from hard drive, network, SAN/NAS etc. to/from RAM. You can't really use the SPEs effectively to do this, so in a Cell-based web server you will not make much use of the SPEs.

There are applications where Cell can and will be used in servers accessible on the web though:
1) In combination with a farm of conventional web servers, to do things like encryption or compression of the data streams produced by the conventional servers.
2) Future compute-intensive server applications - like telephone voice recognition, image recognition or biometric security processing (probably in combination with conventional servers).
3) Serving compute-intensive applications like multi-player distributed processing and server-based games.

I'm sorry, but you are completely wrong.

Most webpages today are generated dynamically. Generally there are two ways of doing things: 2-tier systems (frontend + DBMS) or 3-tier systems (frontend + application server + DBMS). Popular frameworks/programming languages for the frontend include PHP, .NET and Java; for the application server tier, .NET and Java. The DBMS can be anything: Oracle, MS SQL, PostgreSQL, MySQL, etc.

The frontend often has 150 threads to serve requests, the application server layer can have any number of threads (1 to umpteen hundred), and the DBMS often has at least 32 threads.

All are CPU intensive, and often all of this is residing on the same 1, 2 or 4 CPU box.

gzip compression of the output is trivial in comparison to the generation of the pages.

As for your list of possible applications of CELL in a server environment:
1.) Using CELL to only do encryption is insanely un-economic. Modern CPUs easily sustain >100Mbit 3DES encryption bandwidth, which means fewer than three 4-way boxes will do >1Gbit/s (quick arithmetic after this list). If you need more than that, there are NICs with encryption support that are much cheaper than dishing out for a CELL server.
2.) People who think face recognition will be everywhere have seen Minority Report too many damn times. There may be demand for servers to do these tasks, but that demand will be small, and regular server CPUs won't be that far behind CELL in data processing performance.
3.) MoM workloads seem particularly unfit for CELL: huge datasets that are not easily partitioned into chunks that fit in LS.
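
Spelling that out (taking the >100Mbit/s of 3DES per CPU figure at face value):

3 boxes x 4 CPUs per box x ~100 Mbit/s of 3DES each = roughly 1.2 Gbit/s aggregate

so three commodity 4-way machines already clear the 1 Gbit/s mark without any exotic hardware.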

Cheers
 
Gubbi said:
1.) Using CELL to only do encryption is insanely un-economic. Modern CPUs easily sustain >100Mbit 3DES encryption bandwidth, which means fewer than three 4-way boxes will do >1Gbit/s. If you need more than that, there are NICs with encryption support that are much cheaper than dishing out for a CELL server.
Don't you find it strange that both IBM and Sun are incorporating dedicated hardware for encryption in their server CPUs, if it's so much better/cheaper to let a GP CPU core do it or to have it in the NICs?
Gubbi said:
2.) People who think face recognition will be everywhere have seen Minority Report too many damn times. There may be demand for servers to do these tasks, but that demand will be small, and regular server CPUs won't be that far behind CELL in data processing performance.
Do you have any performance/price and performance/wattage perspective in your predictions? Or do you see these things happening so far into the future that the OoO CPUs from Intel will have started incorporating dedicated helper CPUs in a similar fashion to the SPUs? ;)
 
Gubbi said:
I'm sorry, but you are completely wrong.

Most webpages today are generated dynamically. Generally there are two ways of doing things: 2-tier systems (frontend + DBMS) or 3-tier systems (frontend + application server + DBMS). Popular frameworks/programming languages for the frontend include PHP, .NET and Java; for the application server tier, .NET and Java. The DBMS can be anything: Oracle, MS SQL, PostgreSQL, MySQL, etc.

The frontend often has 150 threads to serve requests, the application server layer can have any number of threads (1 to umpteen hundred), and the DBMS often has at least 32 threads.

All are CPU intensive, and often all of this is residing on the same 1, 2 or 4 CPU box.

I can't agree with any of this, and I will explain why.

The bottleneck with web servers and database servers is not the CPU, it is the hard drive. Efforts to improve web or database server performance are based on boosting storage performance. This can be done in a number of ways -

1) Using server clusters (load-balancing server farms, not HPC clusters), where you duplicate identical servers complete with hard drives and content, put a router between the Internet and the servers, and have it allocate connection requests in a round-robin fashion. Here you parallelize the entire server.
2) Using RAID or SANs where you parallelize the storage with special hardware and arrays of hard drives.
3) Using the CPU to try to maximise performance by queueing, caching, and buffering data in RAM.

Cell is designed to parallelize CPU tasks. Web servers and database servers require parallelized storage, not parallelized CPU tasks, so Cell does nothing to improve web server/database server performance.

The applications you quote that run on web or database servers do not involve any intensive processing (floating point performance is unimportant), and the nature of the work required of the CPU is the same - for example, a database server will read data off a hard drive and cache and index it to try to improve speed. As for dynamic web content from databases, PHP, Java etc., the processing required is hardly CPU intensive - data streams are read from storage, and a simple translation transforms them by inserting text data from the database, PHP or Java application. The main demand on the CPU is still queueing, caching, and buffering data in RAM, and this is not the task that Cell was designed for.

Another important thing going against Cell is that server applications require heavy multi-threading. The majority of server applications work by having a server daemon process like inetd handle the establishment of connections and then spawn off separate server threads for each connection, thus allowing multiple connections on a single server port. Java and PHP applications also spawn off a large number of threads. The problem with Cell is that a context change on the SPEs is expensive, so this computing workload will be done on the PPE. It may be possible to rewrite some applications so that the PPE manages the multi-tasking and hands off some of the work to SPEs in a manner that resembles multi-tasking on the PPE, but why bother? That requires a complete rewrite of the server application, and it is unlikely that anybody will bother.
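
For reference, the thread-per-connection pattern described above looks something like this in plain POSIX C (a bare sketch - handle_connection is a made-up handler and error handling is omitted):

Code:
#include <pthread.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical per-connection handler: parses the request, builds the
   reply, then closes the socket. The fd is passed as a pointer-sized int. */
extern void *handle_connection(void *client_fd);

void serve(int port)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(port);
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 128);

    for (;;) {
        long client = accept(listener, NULL, NULL);
        pthread_t tid;

        /* One thread per connection: hundreds of these can be alive at
           once, most of them blocked on I/O at any given moment. */
        pthread_create(&tid, NULL, handle_connection, (void *)client);
        pthread_detach(tid);
    }
}

Every one of those threads is a candidate for a context switch, which is exactly the pattern that suits a conventional SMP CPU rather than the SPEs.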

The bottom line is that in a web server or a database server the SPEs will be underutilized, and Cell will perform no better than an x86 with a similar cache and a similar in-order architecture. It is therefore better to go for a conventional CPU with more cache and out-of-order execution (and an SMP architecture or multiple cores if a heavily loaded J2EE server is required).

gzip compression of the output is trivial in comparison to the generation of the pages.

This is just plain wrong. There is an overhead in terms of memory and resources in running Apache or a J2EE, JSP, PHP or database server, but the generation of the web pages themselves is much less CPU intensive than encrypting or compressing a file of similar size.

As for your list of possible applications of CELL in a server environment:
1.) Using CELL to only do encryption is insanely un-economic. Modern CPUs easily sustain >100Mbit 3DES encryption bandwidth, which means fewer than three 4-way boxes will do >1Gbit/s. If you need more than that, there are NICs with encryption support that are much cheaper than dishing out for a CELL server.
2.) People who think face recognition will be everywhere have seen Minority Report too many damn times. There may be demand for servers to do these tasks, but that demand will be small, and regular server CPUs won't be that far behind CELL in data processing performance.
3.) MoM workloads seem particularly unfit for CELL: huge datasets that are not easily partitioned into chunks that fit in LS.

1) I am talking about applications like telephone carriers tunneling thousands of VOIP calls through a backbone connection, or a neighbourhood watch website streaming out live MPEG-encoded data from a number of CCTV cameras. One Cell server blade will handle the output for hundreds of VOIP channels, or many CCTV cameras.
2) I am talking about police database servers that match suspects with photo or fingerprint records, and intelligent computerised telephone answering and response machines that can receive voice input or recognise individuals from voice prints. These are all applications that are being actively developed at the moment. Who knows, in future you may be able to do a Google search for people on the Internet by uploading a photo.
3) I don't understand what you are saying. I am talking about multi-player games for which the local compute-intensive parts run on PS3 clients, with the group-interaction AI running on a Cell SMP server or cluster on a website, and a high-level communication protocol over the Internet between the two. The huge dataset doesn't have to fit into the LS, only the chunk that the SPE is processing, the same as any game running on the PS3.
 
Crossbar said:
Don't you find it strange that both IBM and Sun are incorporating dedicated hardware for encryption in their server CPUs, if it's so much better/cheaper to let a GP CPU core do it or to have it in the NICs?

I'm not saying there aren't solutions out there that require dedicated hardware to accelerate specific tasks.

Just that the vast majority (as in >95%) don't.

Crossbar said:
Do you have any performance/price and performance/wattage perspective in your predictions? Or do you see these things happening so far into the future that the OoO CPUs from Intel will have started incorporating dedicated helper CPUs in a similar fashion to the SPUs? ;)
SPUs are good for very specific tasks. Tasks that can take advantage of the high SIMD FP throughput, the large LS bandwidth and the low LS latency. Something like Intel's upcoming Woodcrest is just good at everything.
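
To give a feel for the kind of task that is, here is a rough sketch using the SPU SIMD intrinsics - each spu_madd does four single-precision multiply-adds on data already sitting in the LS (illustrative only, assumes n is a multiple of 4):

Code:
#include <spu_intrinsics.h>

/* y[i] = a * x[i] + y[i] over buffers that have already been DMA'd into
   the local store. Each vector float holds four floats, so one spu_madd
   retires four multiply-adds. */
void saxpy_ls(float a, vector float *x, vector float *y, int n)
{
    vector float va = spu_splats(a);
    int i;

    for (i = 0; i < n / 4; i++)
        y[i] = spu_madd(va, x[i], y[i]);
}

Dynamic page generation has very little of that regular, predictable structure, which is the point.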

Cheers
 
SPM said:
The bottleneck with web servers and database servers is not the CPU, it is the hard drive. Efforts to improve web or database server performance are based on boosting storage performance. This can be done in a number of ways -

Bzzzt. Wrong!

Most DBMS workloads for webservers have very high hit rates in RAM. If the hit rate is low, you simply add more RAM; it is as simple as that. Most disk traffic is from updates that require the DBMS to write data all the way to disk. Disk controllers with battery-backed cache RAM work wonders to eliminate the latency incurred by this.

SPM said:
Cell is designed to parallelize CPU tasks.
Yeah but one task at a time. CELL is a pest to virtualize. Try multiplexing >200 threads on CELL and see performance tank.

SPM said:
The applications you quote that run on web or database servers do not involve any intensive processing (floating point performance is unimportant), and the nature of the work required of the CPU is the same - for example, a database server will read data off a hard drive and cache and index it to try to improve speed. As for dynamic web content from databases, PHP, Java etc., the processing required is hardly CPU intensive - data streams are read from storage, and a simple translation transforms them by inserting text data from the database, PHP or Java application. The main demand on the CPU is still queueing, caching, and buffering data in RAM, and this is not the task that Cell was designed for.

Your ignorance in this field is showing. Significant work is done just to set up a request execution environment. Even something as simple as echoing "hello world" to the client involves significant work. In the case of PHP you have the overhead of the interpreter, which is nontrivial. On Java and .NET solutions you still need to initialize an execution environment and JIT compile/optimize various sections of code. Then you have all the business logic that is actually programmed on these systems, which again can be very non-trivial (e.g. >2M lines of code).

As for the DBMS backend: each query submitted has to be parsed, optimized, scheduled for execution, and executed, which in turn can involve scanning multiple indices and tables with triggers and stored procedures, themselves written in procedural languages like Perl or Java or whatnot. There is a TON of CPU work going on in a DBMS. Luckily it parallelizes well, so you can actually throw CPUs at this part of the system to good effect, and of course multiplex a much greater number of threads on those physical CPUs.

But if you don't believe me, I'm ready to bet a month's wages on CELL not taking over the server world in the next 10 years.

Cheers
 
Gubbi said:
Yeah but one task at a time. CELL is a pest to virtualize. Try multiplexing >200 threads on CELL and see performance tank.
Do you have hard data for this? It seems odd to make a highly parallel processor that's inefficient at switching threads. How many cycles are you talking about for a context change?

Well, I guess sometimes it's not too hard to partition your work without actually needing to change threads.
 
Mintmaster said:
Do you have hard data for this? It seems odd to make a highly parallel processor that's inefficient at switching threads. How many cycles are you talking about for a context change?

Here. Some way down in that Q&A it is said that a context switch is 20us. Primarily from swapping LS content.

But even that is misleading, since it can only be achieved if there is no contention for main memory bandwidth. The 20us figure translates into 50,000 context switches per second in total, or roughly 7,000 per SPE, all limited by bandwidth. And you really don't want to be wasting your bandwidth on switching contexts.
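
For those checking the numbers, the back-of-the-envelope version (assuming a 256 KB local store and roughly 25.6 GB/s of XDR bandwidth) is:

1 / 20us = 50,000 context switches per second
each switch moves 256 KB out + 256 KB in = 512 KB
50,000 x 512 KB = roughly 25.6 GB/s, i.e. essentially the whole memory bandwidth
50,000 / 7 SPEs = roughly 7,000 switches per SPE per second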

Mintmaster said:
Well, I guess sometimes it's not too hard to partition your work without actually needing to change threads.

Well, we were discussing how CELL would fit a typical webserver setup, so consider a typical webserver setup:
1. Apache webserver
2. Tomcat/Resin java application server (can also be used as a frontend)
3a. DBMS (Oracle)
3b. Directory server (Sun's, Tivoli or OpenLDAP)

Apache acts as a frontend and serves static HTML and images. It also handles execution of PHP and Perl scripts. If a request comes in for a .jsp page, the request is sent on to the second tier. Here you can execute all kinds of code and send requests off to the third tier. Every time you fire off an SQL or LDAP request you have a potential context switch.

This means that a request for a non-trivial page can result in multiple (2 to hundreds) context transitions.

I don't really think poor context switch performance is a fault of CELL; it was never designed to do stuff like this. It's designed to run one program that can be broken down into a fixed number of sub-tasks, which can then be executed at high speed.

Cheers
 
Gubbi said:
Here. Some way down in that Q&A it is said that a context switch is 20us. Primarily from swapping LS content.

But even that is misleading, since it can only be achieved if there is no contention for main memory bandwidth. The 20us figure translates into 50,000 context switches per second in total, or roughly 7,000 per SPE, all limited by bandwidth. And you really don't want to be wasting your bandwidth on switching contexts.
If you use a software cache you can greatly reduce the time needed, since you won't have to swap the whole LS content.

If you virtualize the SPUs to use them in the JVM, for example, you need to use a software cache, since it is the only way to execute code that was not written for the SPUs.
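
For what it's worth, the SPU side of such a software cache looks roughly like this (a simplified, read-only, direct-mapped sketch - the names and sizes are made up): on a hit the data is already in the LS, on a miss a small DMA get pulls in a single line instead of swapping the whole local store.

Code:
#include <spu_mfcio.h>
#include <stdint.h>

#define LINE_SIZE 128              /* bytes per software cache line      */
#define NUM_LINES 256              /* 32 KB of the LS spent on the cache */

static uint8_t  cache_data[NUM_LINES][LINE_SIZE] __attribute__((aligned(128)));
static uint64_t cache_tag[NUM_LINES];  /* effective address of each line;
                                          a real cache would mark lines invalid */
static const unsigned int dma_tag = 1;

/* Return a local-store pointer for effective address ea, fetching the
   line from main memory on a miss. */
void *swcache_lookup(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)(LINE_SIZE - 1);
    int set = (int)((line_ea / LINE_SIZE) % NUM_LINES);

    if (cache_tag[set] != line_ea) {
        /* Miss: DMA one line in and wait for the tag group to complete. */
        mfc_get(cache_data[set], line_ea, LINE_SIZE, dma_tag, 0, 0);
        mfc_write_tag_mask(1 << dma_tag);
        mfc_read_tag_status_all();
        cache_tag[set] = line_ea;
    }
    return cache_data[set] + (ea & (LINE_SIZE - 1));
}

Every load that goes through a lookup like this is where the extra load-to-use latency comes from.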
 
deathkiller said:
If you use a software cache you can greatly reduce the time needed, since you won't have to swap the whole LS content.

Not really an option, is it? Having an average load-to-use latency of 20 cycles on an in-order CPU is not going to do anything but embarrass you performance-wise.

Cheers
 
Gubbi said:
Not really an option, is it? Having an average load-to-use latency of 20 cycles on an in-order CPU is not going to do anything but embarrass you performance-wise.

Cheers
Yep, looking at IBM documents you would reduce the speed of the code to one third versus manually optimized memory transfers, but it is the only option unless you want to rewrite all the code to optimize it for the SPUs.

If you are going to do that, why not use an alternative predictive scheduling scheme, so that every thread decides what it needs to swap? You could also halve the SPU resources per thread and keep two threads in the LS at all times.
 
SPM said:
The bottleneck with web servers and database servers is not the CPU, it is the hard drive. Efforts to improve web or database server performance are based on boosting storage performance. This can be done in a number of ways -

Gubbi said:
Bzzzt. Wrong!

Most DBMS workloads for webservers have very high hit rates in RAM. If the hit rate is low, you simply add more RAM; it is as simple as that. Most disk traffic is from updates that require the DBMS to write data all the way to disk. Disk controllers with battery-backed cache RAM work wonders to eliminate the latency incurred by this.

Bzzt my ass. As I said, heavy caching of storage into RAM is used in database servers, and most hard drives and storage controllers also do heavy caching to RAM. The mere fact that CPU time and RAM transfers are expended in an effort to speed up disk access shows where the real bottleneck is - in the disk access, not in RAM or the CPU. Caching improves disk access speed, but it does not change the fact that hard drive storage remains the bottleneck - not even close. As for adding RAM, most production web/database servers serving the Internet right now are limited to 4GB to 8GB of RAM, which is small compared to the hard drive storage being served. The idea that this amount of RAM will eliminate disk latency issues is a joke.

SPM said:
Cell is designed to parallelize CPU tasks.

Gubbi said:
Yeah but one task at a time. CELL is a pest to virtualize. Try multiplexing >200 threads on CELL and see performance tank.

Did I say anything different? I said Cell is no good for multi-tasking because of the high cost of context switching due to the big local store. What you can do is dedicate an SPE to a particular task - for example handling a network connection, use the PPE to handle multi-tasking, and have it hand off the actual job of handling the network connection to the SPE dedicated to that task. The SPE dedicated to that task would already have the code loaded, and to change context between two network connection threads would therefore only require use of DMA to transfer and restore the data in the local store associated with a particular network connection object while leaving the rest of the local store intact. This is a small overhead. True multi-tasking on SPEs (ie. have the context change to any process) is expensive, because the whole local store would have to be saved and restored.
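
As a rough sketch of that hand-off (illustrative only - the names are made up, and it assumes the libspe2 mailbox interface): the SPE keeps its connection-handling code resident, and the PPE just posts it the address of the per-connection state, which the SPE then DMAs in and out itself.

Code:
#include <libspe2.h>

/* ctx is a long-lived SPE context that already has the connection-handling
   code loaded. conn_ea is the main-memory address of the per-connection
   state block (a few KB, not the whole 256 KB local store). */
int hand_off_connection(spe_context_ptr_t ctx, unsigned int conn_ea)
{
    /* Post the address to the SPE's inbound mailbox; the resident SPE code
       DMAs the connection state in, works on it, and DMAs the results back.
       Returns 1 (the number of words written) on success. */
    return spe_in_mbox_write(ctx, &conn_ea, 1, SPE_MBOX_ALL_BLOCKING);
}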

SPM said:
The applications you quote that run on web or database servers do not involve any intensive processing (floating point performance is unimportant), and the nature of the work required of the CPU is the same - for example, a database server will read data off a hard drive and cache and index it to try to improve speed. As for dynamic web content from databases, PHP, Java etc., the processing required is hardly CPU intensive - data streams are read from storage, and a simple translation transforms them by inserting text data from the database, PHP or Java application. The main demand on the CPU is still queueing, caching, and buffering data in RAM, and this is not the task that Cell was designed for.

Gubbi said:
Your ignorance in this field is showing. Significant work is done just to set up a request execution environment. Even something as simple as echoing "hello world" to the client involves significant work. In the case of PHP you have the overhead of the interpreter, which is nontrivial. On Java and .NET solutions you still need to initialize an execution environment and JIT compile/optimize various sections of code. Then you have all the business logic that is actually programmed on these systems, which again can be very non-trivial (e.g. >2M lines of code).

Err.. it isn't my ignorance that is showing here. Like you, a lot of people run a Java application like "hello world", find that it runs slowly, and wrongly conclude that it is CPU intensive. The reason it runs slowly is that it takes a long time to load the JVM from the hard drive into memory. It is not the "hello world" program or its interpreter that is CPU intensive, but the loading of the JVM, which is disk I/O intensive. Having to load millions of bytes of code into memory first makes it memory intensive, not CPU intensive. If you run an application like that on a J2EE server with a JIT compiler, where the JVM and the JIT-compiled code have already been loaded, you will find it typically runs only about 1.2 to 2 times more slowly than natively compiled C code.

Gubbi said:
As for the DBMS backend: each query submitted has to be parsed, optimized, scheduled for execution, and executed, which in turn can involve scanning multiple indices and tables with triggers and stored procedures, themselves written in procedural languages like Perl or Java or whatnot. There is a TON of CPU work going on in a DBMS. Luckily it parallelizes well, so you can actually throw CPUs at this part of the system to good effect, and of course multiplex a much greater number of threads on those physical CPUs.

An interpreter has to do work, yes, but this is memory intensive, not CPU intensive. High Performance Computing or supercomputing refers to something very specific - number crunching (more accurately, data crunching). Interpreters, web servers, and database servers shift or look up strings and blocks of data between I/O ports and RAM, but perform only very simple operations on those strings and blocks. A cluster of web servers may work very hard at this, but that does not make it a supercomputer or an HPC machine; it is called a cluster of web servers or a server farm, not a supercomputer. The point is that Cell is not designed to perform well for I/O-intensive usage, while a conventional CPU with lots of cache and out-of-order execution is. Having said this, the fast local store, which can hold the code of an interpreter, will count in favour of an SPE running an interpreter, while the need to DMA whole blocks of interpreter bytecode and data from main RAM will count against it.

Gubbi said:
But if you don't believe me, I'm ready to bet a month's wages on CELL not taking over the server world in the next 10 years.

What are you on about man? Didn't I say exactly the same thing:

Web servers/database servers --- Cell BAD

Specialist applications like HPC/Supercomputing clusters, encryption, compression, image processing, voice processing, video streaming server applications etc. ---- Cell GOOD.

I don't want your money anyway.
 