First Cell Benchmarks
Supernatural
25-Nov-2006, 08:50
Lock if old:
・Dhrystone v2.1
PS3 Cell 3.2GHz: 1879.630
PowerPC G4 1.25GHz: 2202.600
PentiumIII 866MHz: 1124.311
Pentium4 2.0AGHz: 1694.717
Pentium4 3.2GHz: 3258.068
・Linpack 100x100 Benchmark In C/C++ (Rolled Double Precision)
PS3 Cell 3.2GHz: 315.71
PentiumIII 866MHz: 313.05
Pentium4 2.0AGHz: 683.91
Pentium4 3.2GHz: 770.66
Athlon64 X2 4400+ (2.2GHz): 781.58
・Linpack 100x100 Benchmark In C/C++ (Rolled Single Precision)
PS3 Cell 3.2GHz: 312.64
PentiumIII 866MHz: 198.7
Pentium4 2.0AGHz: 82.57
Pentium4 3.2GHz: 276.14
Athlon64 X2 4400+ (2.2GHz): 538.05
source: http://rian.s26.xrea.com/nicky.cgi?DT=20061121A#20061121A
Are spe's being used or what ?
Mefisutoferesu
25-Nov-2006, 09:14
The PPE seems pretty decent. Not sure how literally you can take those numbers in terms of optimizations between PPC and the PPE, but it looks like the PPE is pretty good in its own right.
P.S. It's GCC at -O3 optimization only... I would have liked to see it with unrolled loops, but I don't think the benchmarks mean much of anything anyway...
nonamer
25-Nov-2006, 09:25
Are spe's being used or what ?
No.
That said, it's mostly good news. It's pretty bad at integer and double precision stuff as expected, but single precision is pretty good for an in-order chip. Better than a P4 at least (although P4's do suck).
No.
That said, it's mostly good news. It's pretty bad at integer and double precision stuff as expected, but single precision is pretty good for an in-order chip. Better than a P4 at least (although P4's do suck).
Not bad then. From all the negative comments I heard from fafalad and other devs, I expected the PPE to be extremely bad.
Now I hope someone benchmarks the SPEs.
Are spe's being used or what ?
No, this will be for the PPE only. The single and double precision performance will be an order of magnitude larger if the SPEs are used. Also, it is likely not using an optimising compiler, which would boost performance on an in-order CPU like the PPE.
The Dhrystone results again confirm that the PPE on its own, running general purpose integer, data processing and branching code without code optimisation (what the Cell is worst at), is roughly equivalent to a 2.5GHz P4. If the SPEs can be used to run some of this integer code, even non-optimally, then the SPEs will be a lot faster for this general purpose code than a 2.5GHz P4, although this can only be done in programs specially written for Cell.
The FLOPS rating for a 3.2 GHz P4 extreme edition is about 3.3GFlops I believe, while the Cell including the SPEs is about 218GFlops, a factor of 66 times as fast, so for media acceleration, ray tracing etc. there is no contest.
Cell isn't a bad processor for general purpose desktop computer use - not the best at everything, but pretty decent nevertheless.
Shifty Geezer
25-Nov-2006, 10:40
Cell isn't a bad processor for general purpose desktop computer use - not the best at everything, but pretty decent nevertheless.
I dare say that it might actually be the best choice. It may be pretty 'average' on a lot of the simpler, more generic tasks, but those aren't demanding and 'average' suffices. And when it comes to the intensive media tasks, it'll shine. Given a choice between one CPU that eats through media processing and muddles along at an okay speed in typing docs and browsing webpages, and a CPU that's quick at typing docs and browsing webpages but much slower at media work, I'd choose the former. Most of the time I spend waiting for my computer to do things, it's in the media data processing departments. Heck, I would have stuck with my 800MHz PIII if it wasn't for the image processing and similar tasks I was doing! That's plenty fast enough for mundane tasks, and the performance increase in Word is negligible.
I think Cell was a smart move: developing a processor better suited to the needs of the modern computer, balancing the performance asymmetrically because the workloads people want to handle are asymmetric in their demands. Given these unoptimized values as rough comparators, it seems Cell pretty much hit the mark.
london-boy
25-Nov-2006, 12:22
It's true that Cell is good enough for "Word" and a beast for multimedia tasks, but let's not forget that the whole group of tasks categorised under "General purpose" is not just Word.
Personally, my "general purpose" usage would be internet browsing, typing emails and reports using Word, Excel, Powerpoint and most other Office apps. Cell is more than fast enough at doing these kinds of tasks, and the speed you get in the media-related tasks is stunning. So for me, Cell would be very nice.
I'm sure that other people have other needs though, seeing how the "general purpose" group of tasks seems to include anything that doesn't deal with video and sound.
Now, after reading this, I cannot help but wonder whether any of you ever did some word processing on, say, a short but mildly complex document (even 10 pages, with multiple fonts, embedded graphs, macros, etc.) on an 866MHz computer. Word processing certainly isn't as trivial as some make it out to be here.
...and regarding the benchmarks, I have a hard time believing that a P4 clocked at 2GHz is nearly equal to a P4 at 3.2GHz at double precision math, but that when you move to SP it is, first of all, 8 times slower than at DP and, second of all, nearly 4 times slower than when clocked 60% higher.
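For anyone who wants to check those ratios against the table in the first post, here is a minimal Python sketch. It only re-does the arithmetic on the posted Linpack scores; the dictionary layout and variable names are my own, and it says nothing about why the numbers come out this way.

```python
# Linpack 100x100 scores copied from the first post (higher is better).
# The dictionary layout and names are illustrative, not from the source.
linpack = {
    "P4 2.0A GHz": {"clock_ghz": 2.0, "dp": 683.91, "sp": 82.57},
    "P4 3.2 GHz": {"clock_ghz": 3.2, "dp": 770.66, "sp": 276.14},
}

slow = linpack["P4 2.0A GHz"]
fast = linpack["P4 3.2 GHz"]

clock_ratio = fast["clock_ghz"] / slow["clock_ghz"]  # how much higher the clock is
dp_ratio = fast["dp"] / slow["dp"]                   # double precision gain
sp_ratio = fast["sp"] / slow["sp"]                   # single precision gain
dp_over_sp_slow = slow["dp"] / slow["sp"]            # DP vs SP on the 2GHz part

print(f"clock ratio:        {clock_ratio:.2f}x")
print(f"DP score ratio:     {dp_ratio:.2f}x")
print(f"SP score ratio:     {sp_ratio:.2f}x")
print(f"2GHz P4, DP vs SP:  {dp_over_sp_slow:.2f}x")
```

So the 60% clock difference, the roughly 8x gap between DP and SP on the 2GHz part, and the large SP gap between the two P4s are all there in the posted figures.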
Now, after reading this, I cannot help but wonder whether any of you ever did some word processing on, say, a short but mildly complex document (even 10 pages, with multiple fonts, embedded graphs, macros, etc.) on an 866MHz computer. Word processing certainly isn't as trivial as some make it out to be here.
I set up a dissertation of about 200 pages using tables, graphs, multiple fonts on less. No problem.
LunchBox
25-Nov-2006, 15:10
I'm quite impressed with what the PPE in Cell can do.
It's actually performing better than I was led to expect, when most people were saying how lackluster the PPE's performance was...
Not trying to open anything up or maliciously derail the thread...
I just wanted to ask it here rather than make a new thread for it...
I read here before that the PPE in Cell is similar in design to the XCPU PPE cores (I think it was discussed heavily and proficiently on B3D before the thread went steeply downhill)...
Would those numbers give a ballpark figure as to what the XCPU PPE could do as well?
Or will their differences in cache or VMX unit registers vary enough to show otherwise?
Shifty Geezer
25-Nov-2006, 15:11
I'm sure that other people have other needs though, seen how the "general purpose" group of tasks seems to include anything that doesn't deal with video and sound.
I've used a PC for pretty much everything a PC can do, including raytracing, music sequencing, programming, DTP and photo editing. The only area I think probably can't be optimized for SPEs, and will be slow enough on the PPE that you'd want it to run faster, is compiling. Everything else should either get sped up a lot by SPEs, or run at a satisfactory speed on the PPE (going by these benchmarks, which of course are preliminary and subject to change without notice).
Now, after reading this, I cannot help but wonder whether any of you ever did some word processing on, say, a short but mildly complex document (even 10 pages, with multiple fonts, embedded graphs, macros, etc.) on an 866MHz computer. Word processing certainly isn't as trivial as some make it out to be here.
Yep. I've done quite complex DTP on an 800 MHz PIII. It's not instantaneous, but it's certainly useable. And for a Cell optimized DTP package, if one appears, the graphics and font rendering, which is a demanding aspect, should be very fast. E.g. most TT fonts will fit snugly into an SPE's LS for very fast text creation.
archie4oz
25-Nov-2006, 17:35
Now, after reading this, I cannot help but wonder whether any of you ever did some word processing on, say, a short but mildly complex document (even 10 pages, with multiple fonts, embedded graphs, macros, etc.) on an 866MHz computer. Word processing certainly isn't as trivial as some make it out to be here.
Actually until May, I was using an 867MHz G4 laptop as my most heavily used, day-2-day machine. Your problem isn't your machine, it's your software (namely "Word")... :) Granted my G4 was no barn-burner but it was quite workable. Like Shifty mentions, it's the multimedia apps that push the drive to beefier machines (The other being bloated apps that either run under slow run-times (Java, .NET), or apps with more features than sense).
...and regarding the benchmarks, I have a hard time believing that a P4 clocked at 2GHz is nearly equal to a P4 at 3.2GHz at double precision math, but that when you move to SP it is, first of all, 8 times slower than at DP and, second of all, nearly 4 times slower than when clocked 60% higher.
The benchmarks aren't really indicative of much, other than how well existing processors run existing codebases with basic compilation effort. GCC doesn't do a particularly great job with G4s (or G3s), however they can get away with a bit because of their relatively low missed branch penalties. As for the 2 P4s, I'm willing to venture that the 2GHz is a Northwood or Willamette and the 3.2 is a Prescott.
The FLOPS rating for a 3.2 GHz P4 extreme edition is about 3.3GFlops I believe, while the Cell including the SPEs is about 218GFlops, a factor of 66 times as fast, so for media acceleration, ray tracing etc. there is no contest.
I'm sure Intel would rate a P4 3.2GHz closer to 12.8Gflops/s. The 218Gflops/s number for CBE isn't a real-world figure either, so that'd actually be fair game.
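To put numbers on that, here is a small Python sketch of the arithmetic. The 218, 3.3 and 12.8 figures are the ones quoted in this thread; the 4 flops per cycle is only what 12.8 at 3.2GHz implies (12.8 / 3.2), so treat it as an assumption about a theoretical peak rather than anything measured.

```python
# Figures quoted in the thread, in GFLOPS.
cell_peak = 218.0   # claimed figure for Cell including the SPEs
p4_quoted = 3.3     # the "I believe" figure for a 3.2 GHz P4

# Assumption: the 12.8 figure is a theoretical peak, i.e. clock rate
# times 4 single-precision flops per cycle (12.8 / 3.2 = 4).
clock_ghz = 3.2
flops_per_cycle = 4
p4_peak = clock_ghz * flops_per_cycle

ratio_vs_quoted = cell_peak / p4_quoted  # the "66 times" comparison
ratio_vs_peak = cell_peak / p4_peak      # peak-vs-peak comparison

print(f"P4 theoretical peak:      {p4_peak:.1f} GFLOPS")
print(f"Cell vs quoted P4 figure: {ratio_vs_quoted:.0f}x")
print(f"Cell vs P4 peak figure:   {ratio_vs_peak:.0f}x")
```

Comparing peak against peak brings the gap down from about 66x to about 17x, which is still large but a fairer like-for-like figure.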
But on the whole the comparison to the P4 isn't that useful. The P4 is so dinky and broken in so many ways, OTOH so much effort has been put in compiler optimizations for the idiosyncrasies of this turd. As far as I'm concerned the numbers can be fun to look at but any inference about "general performance" in relation to saner architectures should be avoided.
*ahem*
I'd like to direct your attention more to the PIII and Athlon X2 numbers.
IIRC the first (http://linuxps3.net/index.php?option=com_content&task=view&id=34&Itemid=33) benchmarks were actually Dhrystone in an emulated x86 Windows environment and Geekbench for PPC Linux. But these just confirm what those have been saying; PPC - better than we thought! :)
But hell, I've been running Fedora on this thing since launch day and I could've told you that.
It's not like the lack of optimization (and memory bottleneck) doesn't show in the time it takes certain apps to open, but once opened everything runs quite smoothly and comfortably. I'd really like to see PS3 take off as a standardized platform for Linux, both usage and homebrew development-wise - I think we could see great things a couple of years out from now.
Truly, if it had an internal burner and RSX was 'open,' there really wouldn't be any sort of barrier in my mind whatsoever. Even as it stands, those are very liveable limitations. For being an above-average first-gen Blu-ray player, a top-of-the-line games console, and a capable-if-quirky Linux PC all rolled into one... I mean... well I'll say I'm very pleased so far with my PS3 experience!
Naboomagnoli
26-Nov-2006, 00:40
A question, preceded by a lot of unnecessary beating-about-the-bush (aka context):
Someone on PS3Forums (not as technical as B3D obviously, but you still get the odd developer/hobbyist/person with an iota of intelligence over there) said that making the RSX - and I would assume SPE's - available to Linux wouldn't happen because it'd mean that developers would be able to skirt around the issue of paying Sony royalties for games that are made, even if the maximum performance available to these developers is lessened to something along the lines of a PS3 "arcade" game.
Would this be an accurate thing to say? Or would Sony be able to offer use of the RSX in Linux only to people who buy a License from Sony, therefore stopping Sony from losing money they rely on to claw some money back from hardware costs? Maybe Sony would expect this market to be a lot smaller than the market for orthodox PS3 games and just look the other way?
Or "Other" (please describe ______________________________________________ ) :)
Access to the SPEs is already possible AFAIK, and I wouldn't be surprised if an OpenGL driver is released someday.
So why should a gamedev still release "PS3 games" and not "Linux" games that the PS3 can also play? Given the amount of money necessary to develop a next-gen title, publishers will opt for the largest audience - that still being direct PS3 games.
Add to that the fact that PS3 games can have some level of copy protection that's impossible on an open Linux platform, plus the complications that could arise from different distros/installations (amount of free mem?) compared to a guaranteed common platform with automatic tie-ins to Sony's online platform, and you see that it won't be a viable alternative to native PS3 games.
edit: I'm talking about full-scale titles; there could well be a market for cheap/free/shareware games.
I don't expect that Sony has much incentive to open up the details of the RSX to the public; I personally am not expecting them to do so. But then again, not that I really care either. I can fully deal with the forced differentiation of PS3 general computing device, and PS3 games device. If Sony wants control of the latter, such is their right IMO. And if they do open up the RSX later... hey, bonus.
(But yeah, like NPL said the SPEs are already open Naboomagnoli - they show up under 'devices' in the system overview display)
I don't expect that Sony has much incentive to open up the details of the RSX to the public; I personally am not expecting them to do so. But then again, not that I really care either. I can fully deal with the forced differentiation of PS3 general computing device, and PS3 games device. If Sony wants control of the latter, such is their right IMO. And if they do open up the RSX later... hey, bonus.
What do you mean by "open up details"? I mean they provide drivers for Linux for their GPUs, I don't think they care about that kind of "openness". It's not like we would need to poke RSX registers directly to be happy ;) I, at least, would be very content with an OpenGL/ES driver & accelerated desktop
Looking at it the other way round, RSX may become more open under Linux, as SCE doesn't have to worry about backward compatibility in the PS4 for Linux apps :idea: :wink:
What do you mean by "open up details"? I mean they provide drivers for Linux for their GPUs, I don't think they care about that kind of "openness". It's not like we would need to poke RSX registers directly to be happy ;) I, at least, would be very content with an OpenGL/ES driver & accelerated desktop
'They' is NVidia though... we're not dealing with NVidia here, but Sony - they own this chip design. I mean if they do or if they don't, I won't really be surprised either way. I just understand from an economic standpoint why they might wish to emphasize their own licensed games in all matters gaming-related.
EDIT: Ok you know what? Drivers, no drivers... we're just going to have to wait and see. Let's just all agree that whatever ends up happening, Linux on PS3 rules. :cool:
'They' is NVidia though... we're not dealing with NVidia here, but Sony - they own this chip design. I mean if they do or if they don't, I won't really be surprised either way. I just understand from an economic standpoint why they might wish to emphasize their own licensed games in all matters gaming-related.
EDIT: Ok you know what? Drivers, no drivers... we're just going to have to wait and see. Let's just all agree that whatever ends up happening, Linux on PS3 rules. :cool:
Sure thing, but was that ever the point of the discussion? I could understand the lack of 3D acceleration, but having no acceleration for simple stuff like line-drawing, blits, etc. would really suck.
I'd just love to hear a definite word either way already. :cry:
Sure thing, but was that ever the point of the discussion? I could understand the lack of 3D acceleration, but having no acceleration for simple stuff like line-drawing, blits, etc. would really suck.
I'd just love to hear a definite word either way already. :cry:
Sony doesn't have to provide full access to RSX from Linux. They can limit Linux-based games from competing with PS3 games by providing an X-server operating under the DRMed GameOS, to which Linux applications can connect. They could and should provide 3D acceleration via the RSX, but they could limit acceleration to the OpenGL functions that the Beryl/Compiz 3D desktop uses. This could also allow BD/DVD movies and XMB applications to be viewed from a window in the Linux desktop.
The other alternative is to use the SPEs to do 2D and 3D acceleration.
I'd like to see some comparable benchmarks using the SPEs. These benchmarks undervalue the Cell's real performance.
The other alternative is to use the SPEs to do 2D and 3D acceleration.
That's what I've been thinking will happen. But in the end, would it matter much? I thought SPEs are actually very well suited to graphics and media duties.
theteamaqua
26-Nov-2006, 02:58
Wow, not impressive at all. Maybe because those apps were written for x86?
Wow, not impressive at all. Maybe because those apps were written for x86?
You're looking at it the wrong way. It's precisely because these benchmarks do nothing to reflect/tap the 'power' of Cell that they are making so many of us happy... because even under these 'worst-case' conditions... Cell's actually not doing that bad at all! Certainly well enough to run Firefox and OpenOffice, y'know?
It's all about perspective; this is a great starting point for Cell performance in general computing/Linux tasks.
Shifty Geezer
26-Nov-2006, 10:45
Sony doesn't have to provide full access to RSX from Linux. They can limit Linux-based games from competing with PS3 games by providing an X-server operating under the DRMed GameOS, to which Linux applications can connect...
I'm hoping, and maybe even expecting, that YDL has RSX optimizations. If Terrasoft have had PS3s for a while to develop a custom Linux, and yet Fedora runs out of the box, what exactly have Terrasoft been doing with their development kit, and why did Sony think it important to partner with them? It could just be Cell optimized apps, but I feel it's more likely the OS has been properly system-optimized, rather than being just a PPC OS.
If not, E17 is going to burn cycles needlessly!
I'm hoping, and maybe even expecting, that YDL has RSX optimizations. If Terrasoft have had PS3s for a while to develop a custom Linux, and yet Fedora runs out of the box, what exactly have Terrasoft been doing with their development kit, and why did Sony think it important to partner with them? It could just be Cell optimized apps, but I feel it's more likely the OS has been properly system-optimized, rather than being just a PPC OS.
If not, E17 is going to burn cycles needlessly!
If YDL has RSX optimisations then it would be present as a driver, and FC6 and other distros would also be able to use it if included. Open source is modular and portable so whatever work YDL does on open source drivers would be available on other distros as well.
To develop a third-party open source driver requires time and access to hardware, hardware specs and documentation. It is unlikely that one would have been developed so soon after release when everyone here is not even clear on RSX's specifics. If the hardware specifics are not made available, then an open source driver may never happen - as with ATI and nVidia graphics cards. The graphics card manufacturers keep the specifics required to write a driver secret in order to maintain an additional lock against people copying features of their hardware. Hence all accelerated nVidia and ATI Linux drivers are proprietary and are written by the manufacturers.
I rather think this is what will happen with the RSX - either Sony will have to provide a proprietary accelerated hardware driver, or we won't have an accelerated driver at all, since Sony won't want to reveal the secrets of RSX to the open source community. There is also another complication. Sony, reasonably in my opinion, won't want to release an accelerated driver that allows games to move off the PS3 franchise onto PS3 Linux. The way around this might be to provide a proprietary driver that just accelerates 2D and the limited 3D OpenGL functions that Compiz/Beryl 3D desktops use. This will prevent serious commercial games use, but will allow less graphics-intensive games to be developed. Another option is not to provide a driver at all, but to have a proprietary X-Server with similar restrictions run under the hypervisor, to which PS3 Linux can connect. This would have the advantage of being able to display everything - Linux applications, movie display, Game OS applications etc. - in windows on PS3 Linux, but would use up more hypervisor system resources. Sony should also provide RSX hardware acceleration for sound and playing video codecs.
The bottom line is that Sony will have to provide all this not YDL, just the same as ATI and nVidia not Microsoft have to provide Windows drivers for their cards. We need to start lobbying Sony to provide some kind of drivers to accelerate Linux 3D desktops compiz/beryl, sound and movie codecs. It really is down to Sony and no one else.
As far as YDL work on PS3 goes, I think that is mainly on including the SPE libraries, which allow access to and control of the SPEs using device drivers. This code is released by IBM and is also available to other distributions or as a separate download if required, but that may involve more inconvenience. Also, YDL binaries may be compiled with Cell optimisations while FC6 would be compiled for a generic PowerPC. I doubt if any libraries (media, graphics, font etc.) have been re-written yet for SPE acceleration - that will take time.
Personally I prefer to use FC6 because I use FC6 on a PC, and it just makes it easier to remember configuration, file locations etc. if you use one distro on all your machines. FC6 with the extra repository and the livna.org non-free repository, plus Flash, Real, Mplayer codecs etc., makes for a really good, solid and up-to-date distro - better than Ubuntu and certainly better than SuSE at the moment in my opinion. If you intend to do development work on the PS3 or extract every bit of performance out of the PS3 using Cell-specific compiled code, you could try YDL or Gentoo instead.
A non-free distro with all the proprietary stuff bundled in and preconfigured, supplied by Sony or a commercial vendor is a must for mass market PS3 use as a computer though, since downloading drivers, codecs and non-free media players is too complicated for joe/jane average.
bobthebub
26-Nov-2006, 14:12
I hoping, and maybe even expecting, YDL has RSX optimizations.
I had been hoping for that too but the following would seem to rule it out
http://www.terrasoftsolutions.com/products/faq/ps3/devel.shtml
QUESTION: What level of graphics support is available?
At this point in time, YDL runs in framebuffer mode on the PS3, meaning there is no 2D nor 3D accelleration nor support for OpenGL. The x.org driver is fully functional in its framebuffer mode, offering quality support for a wide variety of hi-def televisions and computer monitors that comply with the PS3 video output signals.
"at this point in time" leaves some hope for future developments though.
Shifty Geezer
26-Nov-2006, 14:27
As far as YDL work on PS3 goes, I think that is mainly on including the SPE libraries, which allow access to and control of the SPEs using device drivers. This code is released by IBM and is also available to other distributions or as a separate download if required, but that may involve more inconvenience. Also, YDL binaries may be compiled with Cell optimisations while FC6 would be compiled for a generic PowerPC. I doubt if any libraries (media, graphics, font etc.) have been re-written yet for SPE acceleration - that will take time.
Maybe not, but remember this YDL isn't just for PS3, but also for Cell servers such as Mercury. They're going to want proper use of their hardware I'm sure. This isn't a Linux distro compiled in a month or so. It's been worked on for something like a year (http://www.terrasoftsolutions.com/news/2005/2005-11-15.shtml) at least.
Well, that said, the FAQs from Terrasoft seem disappointing...
QUESTION: What level of graphics support is available?
At this point in time, YDL runs in framebuffer mode on the PS3, meaning there is no 2D nor 3D accelleration nor support for OpenGL. The x.org driver is fully functional in its framebuffer mode, offering quality support for a wide variety of hi-def televisions and computer monitors that comply with the PS3 video output signals.
Also
QUESTION: Does it overwrite the GameOS?
YDL runs on top of the GameOS much in the same way that Linux runs on top of firmware on Power or BIOS on x86.
Which doesn't mention 'at this time,' suggesting multitasking of Linux and GameOS (for messaging, downloads, etc.) won't happen :(
Still, go to Terrasoft's homepage (http://www.terrasoftsolutions.com/) and you'll see what they think of PS3 :wink4:
Shifty Geezer
26-Nov-2006, 14:30
"at this point in time" leaves some hope for future developments though.
If Sony are going to want proper Cell workstation applications running, I think OpenGL support is necessary. Maya on a Cell workstation seems like something they're aiming for, for example.
Maybe not, but remember this YDL isn't just for PS3, but also for Cell servers such as Mercury. They're going to want proper use of their hardware I'm sure. This isn't a Linux distro compiled in a month or so. It's been worked on for something like a year (http://www.terrasoftsolutions.com/news/2005/2005-11-15.shtml) at least.
YDL isn't just for PS3, but RSX is. It is down to Sony as hardware manufacturers to develop the drivers for RSX or pay nVidia to do it.
They have been working on Cell clusters. I doubt if Sony has paid them anything to develop YDL, and certainly not for accelerated X drivers for RSX, which I would expect nVidia, not YDL, to be paid to do.
Well, that said, the FAQs from Terrasoft seem disappointing...
Also
Don't blame YDL, YDL are giving the thing away free as a fully open source distribution.
It is Sony I am disappointed with. Even if YDL wanted to, they can't develop accelerated drivers without access to the RSX hardware DRM and documentation, which I doubt they have been given. Sony should have paid YDL to produce a non-free distribution with all the proprietary multimedia players, codecs, drivers and bells and whistles integrated and pre-configured to run out of the box, and bundled it with the PS3 as standard. I don't think Sony fully understands the potential of a properly integrated mass market Linux to boost the PS3.
Which doesn't mention 'at this time,' suggesting multitasking of Linux and GameOS (for messaging, downloads, etc.) won't happen :(
Still, go to Terrasoft's homepage (http://www.terrasoftsolutions.com/) and you'll see what they think of PS3 :wink4:
Linux does not need to replace GameOS to multi-task with it. It can communicate with GameOS on the loopback network interface and interact with it that way. Sound can be streamed from GameOS to Linux applications and vice versa. The Linux GUI (X Window System) is a client-server architecture and always runs this way. You can display the PS3 Linux session on a remote PC X-server display connected by a network cable, on a display server running under the same PS3 Linux connected via a loopback interface, or on an X-server display running in GameOS. What is more, wherever the X-server (the display) is running, you can have X-clients (X applications) from anywhere (the PC, Linux on the PS3, and GameOS) displaying on it at the same time. This is all built into the X Window System and only requires a few minor configuration changes to implement. Sony will have to write the X-client and X-server parts for GameOS though, although because it is a case of reusing existing open source software for the most part, it should not be difficult at all.
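As a rough illustration of the client-server point, an X client picks its display server from the DISPLAY environment variable, written as host:display. The Python sketch below only builds that environment; the helper name `x_client_env` and the address 192.168.0.10 are made up for the example, and nothing here is specific to GameOS or to anything Sony has announced.

```python
import os


def x_client_env(server_host, display=0, base_env=None):
    """Return an environment in which X clients target the given X server.

    Hypothetical helper for illustration. server_host is the machine
    running the X server (the display); an empty string means the
    local machine.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = f"{server_host}:{display}"
    return env


# An X application on the PS3 drawing on a PC's X server
# (the address is a made-up example).
remote = x_client_env("192.168.0.10", 0, base_env={})

# The same application drawing on a display server on the same machine.
local = x_client_env("", 0, base_env={})

print(remote["DISPLAY"])
print(local["DISPLAY"])

# To actually launch a client you would pass the environment along, e.g.
#   subprocess.Popen(["xterm"], env=remote)
# which needs an X server at that address that accepts the connection.
```

The same application binary runs unchanged in both cases; only the place where it draws differs, which is why the choice of where the X-server lives is mostly a configuration matter.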
Basically it is all down to Sony. If Sony want to do it they can very easily and cheaply. As I said I don't think Sony understands the potential of the computer feature to allow the PS3 to take over a big chunk of the home computer/media center market.
If Sony are going to want proper Cell workstation applications running, I think OpenGL support is necessary. Maya on a Cell workstation seems like something they're aiming for, for example.
It would be great, but I think there is zero chance Sony will allow this for fear of losing game sales. Sony is subsidising the PS3 and hope to recover the subsidies through game sales. It wouldn't make sense to allow Linux games to run on PS3 outside their franchise.
The best we can hope for is RSX drivers for 2D acceleration, video and sound acceleration, and OpenGL drivers that work at full speed for the operations that are used by 3D desktops like Compiz/Beryl, with crippled (slowed) operations for the rest. This would allow Linux to run very fast on general and multi-media applications, and would allow users to write and run OpenGL games on the PS3, but they wouldn't run fast enough to make them commercially competitive with native PS3 games. I would be happy with this. It is better to ask Sony for this and get something rather than ask Sony for the world and get nothing.
Another alternative is to use the SPEs for open source Linux drivers and media acceleration independently of Sony. Again this will mean PS3 Linux games will not be competitive with PS3 native games, but at least it will give the people developing these experience with Cell. Since the PS3 PPE runs pretty well on its own without compiler optimisation or accelerated drivers, it should zip along for typical desktop and multi-media applications with SPE acceleration (although it won't be a top notch games machine). But then again, you have PS3 games for that.
One thing I don't get...
If Linux is running as an application under GameOS as regular games would, why would RSX be locked down for one application? Of course you need a driver to run OpenGL, but if that's the case it would just be a matter of time before someone releases a driver.
It would be great, but I think there is zero chance Sony will allow this for fear of losing game sales. Sony is subsidising the PS3 and hope to recover the subsidies through game sales. It wouldn't make sense to allow Linux games to run on PS3 outside their franchise.
As soon as there are no shortages and the hardware isn't sold at a loss, Sony isn't losing anything if people use PS3s without buying PS3 games (i.e. use it as a media PC, for folding, as a Linux box, whatever). At least not directly.
It could affect them if people buy or download Linux games when they would have bought PS3 games had that option not been available. I don't see that happening, as I don't expect Linux games to be on the same level as PS3 games. (I did a post on that earlier.)
However, it could compete with their marketplace games, but I'm still unsure of Sony's scope on that (so far it seems they will be short, cheaper, but still commercial quality).
No matter how you turn it, a full-blown OpenGL driver would significantly add value to PS3 Linux for many people. Once they have that PS3 in front of them they would have to buy at least one BRD movie and one PS3 game if they are anything like me ;)
Now, something completely different (but still fitting the topic):
I'm kind of shocked the PS3 draws 180 watts idling in the XMB (XBM?). I would expect it to be way below 100 watts (40-60 watts) when doing nothing useful. IBM touted Cell's power-saving features and I'm sure the RSX has some of those too. Is power saving not enabled in the firmware, or is that the best it can do?
It would also be interesting if someone could take a measurement running Linux, to see if that OS does a better job of power saving.
One thing i don't get..
If linux is running as an application under GameOS as regular games would, why would rsx be locked down from one application ?.. Of course you need a driver to run OpenGL but if that's the case it would just be a matter of time when someone release a driver.
the Linux-driver would run atleast partially through the hypervisor to allow sharing between GameOS/Linux.
For Games, those could use a more liberal setup, allowing more hardware-access (still needs to be shared with GameOS). Just because Games and Linux run beside the same GameOS doesnt mean they get treated the same.
Habbe it's not that simple, mainly due to who the 'someone' has to be.
Anyway I'm not as upset as many are on this issue; Sony didn't have to give us anything on this front - no Linux, no nothing, but they've gone so far as to make the platform completely open to third-party software and that's going a hell of a ways IMO.
It's true that the way to really turn PS3 into a PC competitor in the living room would have been to create a custom Linux distro installed on every unit that was easy as hell to use, but... I mean Sony is in the midst of quite the war in terms of software build-out in other areas of the Playstation effort (dev tools development, online infrastructure, etc..) and it's understandable that the Linux effort may have only been pushed through as it was due to Kutaragi and other high-minded execs saying 'do it.'
TerraSoft is managing both of Sony's own supercompute Cell clusters however, so indeed if there is going to be native SPE acceleration anywhere in the next month available to the consumer, I expect it out of YDL.
On the side I think video acceleration on Linux will definitely be an SPE effort rather than a GPU-side effort, and that makes sense enough anyway. Frankly I feel the SPEs will be what people turn to in the absence of RSX access, and this in all honesty is better for Sony, better for the knowledge-base of the development community (both professional and otherwise), and better for the Cell architecture as a whole.
the Linux-driver would run atleast partially through the hypervisor to allow sharing between GameOS/Linux.
For Games, those could use a more liberal setup, allowing more hardware-access (still needs to be shared with GameOS). Just because Games and Linux run beside the same GameOS doesnt mean they get treated the same.
Ok. How are the devs accessing RSX ? straight down to the metal or thru some API included in GameOS ?.
Anyway have to agree with xbdestroya, RSX/OpenGL in linux isn't a big issue though i would do my gaming in Ps3 and not in linux... A 3d desktop is nice but i can live without it.
Ps. if there is access to RSX from linux my guess is that the G70 drivers that are released on linux would give that "someone" a nice start.
As soon as there are no shortages and the hardware aint sold at loss, Sony aint losing anything if people use PS3`s without buying PS3-Games (ie use it as media-pc, for folding, as linux-box, whatever). At least not directly.
I agree. My bet is no RSX access (*at most* software OpenGL using SPEs) until the PS3 game and entertainment business mature.
Furthermore, getting the community to focus on Cell is a plus for them. RSX looks easy for the devs to exploit for the moment.
At this point, I'm wondering whether the GameOS is also based on some sort of embedded Linux, or something from scratch.
Ok. How are the devs accessing RSX ? straight down to the metal or thru some API included in GameOS ?.
OpenGL/ES. But even if you take that driver and adapt it for Linux it won't do anything. The hypervisor is shutting off everything. In other words, you don't need a driver for RSX, but a driver to talk to the hypervisor. (Edit: ) And the hypervisor needs to make available all the features you want to support.
Anyway have to agree with xbdestroya, RSX/OpenGL in linux isn't a big issue though i would do my gaming in Ps3 and not in linux... A 3d desktop is nice but i can live without it.
Screw 3d-desktop, I'd be happy if that nonsense would go away overnight. It's just that having a powerful GPU and using it as a framebuffer is heresy.
But xbdestroya is right: Sony has an awful lot of work ahead of them and Linux wont be far up their list.
Shifty Geezer
26-Nov-2006, 17:56
Furthermore, getting the community to focus on Cell is a plus for them. RSX looks easy for the devs to exploit for the moment.
That's very true. Demo coders will have to do everything on Cell (as well as serious application writers!) which means developing experience and techniques for making those SPE's do all sorts of things. That'll definitely be a plus for the Cell platform and future - eg. Mobile devices with just a Cell including software graphics rendering would benefit from the tricks pioneered on open Cell platforms. It'll be interesting how much of a software renderer is produced in 5 years time. I expect some entrepreneurs will produce a pretty amazing renderer. Maybe not heavy on the texturing, but some shading effects and procedural shading is likely to appear.
Ok. How are the devs accessing RSX ? straight down to the metal or thru some API included in GameOS ?.
Anyway have to agree with xbdestroya, RSX/OpenGL in linux isn't a big issue though i would do my gaming in Ps3 and not in linux... A 3d desktop is nice but i can live without it.
Ps. if there are access to RSX from linux my guess was that the G70 drivers that are released on linux would give that "someone" i nice start.
The hypervisor allows multiple OSes/applications to run on top of it. The hypervisor can restrict access that a guest OS (Linux) has to the bare hardware.
The hypervisor allows multiple OSes/applications to run on top of it. The hypervisor can restrict access that a guest OS (Linux) has to the bare hardware.
Yepp i get your points and i agree,i'm not arguing against them but is there any indication that "other OS" will have restrictions when it comes to Cell and RSX ?.
StefanS
26-Nov-2006, 18:31
I have the slight suspicion that Sony blocks access to RSX to keep tabs on the homebrew scene. They might later release something similar to MS XboxLive Arcade dev thingy for a few bucks.
Titanio
26-Nov-2006, 18:39
Their history doesn't suggest a fear of the homebrew community matching or outdoing professional developers (they gave unfettered access with previous systems). There's a much wider gap there than being able to use a GPU or not (just sheer production/scale).
Kutaragi even addressed GPU programming before when talking about PS3 Linux:
This will be the first form that [the Cell] will be spread. It can connect a keyboard, and it has all the necessary interfaces. It can run media, and it can run on a network. It's got such an all-around purpose, and it's open. It will become completely open if we equip it with Linux, and programmers will be able to do anything with it. It's the same thing with the graphics, since it has shaders.
It is possible there was a policy change since then, but I think a far more likely explanation is simply current driver availability.
Time will tell. I do think there will be an endorsed homebrew game development scene at some point from Sony, but I also think they're going to want to monetize that at least nominally. We'll just see what happens I guess. For now there's enough 'other' stuff to keep the mind occupied on PS3/Linux. :)
Titanio
26-Nov-2006, 18:51
Time will tell. I do think there will be an endorsed homebrew game development scene at some point from Sony
This is it, I think..but yeah, time will tell re. GPU access.
Yeah well... they will have to be careful about 2 things:
(A) Operationally, they don't seem to have sufficient resources to flesh-out their game business on-time. Mucking around with more open source stuff now is going to worsen the situation (and developer relationships !). It is imperative that they fix and enhance the GameOS side of fence first.
So I'm guessing that the Linux side has been passed to partners to execute (Sony is hands-off for now).
(B) Business and marketing-wise, what is the revenue model, and how does it compare to and/or complement the existing game business ? Without addressing this point, I don't see how and why Sony management will open up Linux fully _today_. There has to be a tie-in somehow. I don't see any yet except for Sony to build up expertise for the Cell platform.
In this case, I believe hupfinsgack's XNA Express suggestion is more likely (because it has a business element). However I think the effort will be more than just for games, and the $$$/upside involved for both Sony and the participants are much greater. It's for building out a plethora of Cell platforms for assorted devices (including PS3).
EDIT:
Titanio and xbdestroya, what do you mean by homebrew game development ? Does Sony's new "Beyond" programme count as "home brew" in your world view ?
Titanio and xbdestroya, what do you mean by homebrew game development ? Does Sony's new "Beyond" programme count as "home brew" in your world view ?
Patsu I've been out of it lately - fill me in here! :)
What's the 'Beyond' program?
(and I agree with your other points btw)
StefanS
26-Nov-2006, 19:18
In this case, I believe hupfinsgack's XNA Express suggestion is more likely (because it has a business element).
That's what it's called. My description was very homeresque :lol:
Emm... it's just the eDistribution initiatives for PS3 devs. I don't know what tools they use now. Also don't know whether people can whip up some demo on PS3 Linux as part of the application process.
http://us.playstation.com/beyond/welcome.html
Shifty Geezer
26-Nov-2006, 20:13
It is possible there was a policy change since then, but I think a far more likely explanation is simply current driver availability.
How hard can the drivers be though? RSX can surely take the existing PPC nVidia drivers for G71 with little more than some tweaks.
I have the slight suspicion that Sony blocks access to RSX to keep tabs on the homebrew scene. They might later release something similar to MS XboxLive Arcade dev thingy for a few bucks.
Isn't the concern here that if Linux is open enough, professional developers will be able to produce professional games for sale without having to pay Sony money? An XNA Express solution won't avoid that. At best, you could charge a once-off fee for PS3 owners to download some drivers that'll enable full PS3-Brew Linux games, and then miss out on royalties for every PS3-Brew game sold. What Sony really want is every professional title to pay them a slice, which is going to be hard to regulate on an open Linux.
I think if that's the concern of Sony, the open hardware will either come when games aren't the fundamental source of income and people buy PS3 for other tasks, or when Sony feel safe that Linux games won't eat into their official game sales.
crazygambit
26-Nov-2006, 22:04
I think if that's the concern of Sony, the open hardware will either come when games aren't the fundamental source of income and people buy PS3 for other tasks, or when Sony feel safe that Linux games won't eat into their official game sales.
I truly don't see the likes of EA and others releasing their games on Linux on the PS3. Simply because it's not standard they'd be needlessly reducing their target audience. It's the same reason you said they wouldn't support a peripheral like the Fusion. In this case I don't see the cost savings in royalties to be worth the lost revenue in sales.
I truly don't see the likes of EA and others releasing their games on Linux on the PS3. Simply because it's not standard they'd be needlessly reducing their target audience. It's the same reason you said they wouldn't support a peripheral like the Fusion. In this case I don't see the cost savings in royalties to be worth the lost revenue in sales.
More games could be released for Linux, not just for PS3 specifically, if PS3 manages to popularise Linux. It remains to be seen if Sony will expose RSX to third party operating systems - if only Sony would recognise the potential.
I truly don't see the likes of EA and others releasing their games on Linux on the PS3. Simply because it's not standard they'd be needlessly reducing their target audience. It's the same reason you said they wouldn't support a peripheral like the Fusion. In this case I don't see the cost savings in royalties to be worth the lost revenue in sales.
Actually Linux is 100% standard as far as EA would be concerned, and certainly much more standard than Windows, because EA can supply their own embedded version of Linux for free on the CD along with any PS3 drivers their game needs on the game DVD/BD itself. An embedded version of Linux should only take up about 100MB of space, maybe less.
crazygambit
27-Nov-2006, 02:00
Actually Linux is 100% standard as far as EA would be concerned, and certainly much more standard than Windows, because EA can supply their own embedded version of Linux for free on the CD along with any PS3 drivers their game needs on the game DVD/BD itself. An embedded version of Linux should only take up about 100MB of space, maybe less.
Ok fair enough. But do you really think that's an option? Which version of Madden do you think will sell more? The X360 one you just pop in and play or the PS3 one where you have the hassle of installing a whole operating system. Plus would publishers take the risk of less copy protection? What's stopping you from renting that Linux version of Madden and installing it on your HDD?
Like I said, I just don't think it's worth it for big time publishers to go the Linux route on PS3, even if they had complete access to the hardware.
To make a little summary about the PS3 Linux environment:
- Only 198 MB memory out of 256
- 6/5 SPEs out of 7
- Needs HDCP-display (no ordinary monitors)
- No gfx hw support
- No Flash support
- No playing/ripping CDs
At the moment, the whole YDL+PS3 combo really needs improving A LOT. What makes me irritated is that this is so typical, a great product with really lackluster final delivery...
To make a little summary about PS3 Linux-enviroment:
- Only 198 MB memory out of 256
- 6/5 SPEs out of 7
- Needs HDCP-display (no ordinary monitors)
- No gfx hw support
- No Flash support
- No playing/ripping CDs
At the moment, the whole YDL+PS3 combo really needs improving A LOT. What makes me irritated is that this is so typical, a great product with really lackluster final delivery...
6 SPEs are fully available and it's the same as official games. 198MB RAM, I guess some part of the missing 60MB is reserved for the virtual frame buffer, if RSX becomes open it's likely to be reduced. "No playing/ripping CDs", what are you talking about?
EDIT: too much overlap with the other topic it seems
http://www.beyond3d.com/forum/showthread.php?t=35590
Actually Linux is 100% standard as far as EA would be concerned, and certainly much more standard than Windows, because EA can supply their own embedded version of Linux for free on the CD along with any PS3 drivers their game needs on the game DVD/BD itself. An embedded version of Linux should only take up about 100MB of space, maybe less.
But WHY would they do this? To expose themselves to hacks, cheats, "backups" etc?
6 SPEs are fully available and it's the same as official games. 198MB RAM, I guess some parts of the 60MB is reserved for the virtual frame buffer, if RSX becomes open it's likely to be reduced. "No playing/ripping CDs", what are you talking about?
EDIT: too much overlap with the other topic it seems
http://www.beyond3d.com/forum/showthread.php?t=35590
Yellow Dog Linux, released today, does not have CD playing/ripping support or Flash support.
deathkiller
27-Nov-2006, 11:33
Yellow dog Linux, released today do not have CD playing/ripping support or flash support.
It has CD playing/ripping software; this is an example: http://www.terrasoftsolutions.com/products/ydl/included/details.php?app_id=1857&var=description&find=CD
You can even read a PS3 BD Game disc from linux (you can't play it of course).
It have CD playing/ripping software this is a example: http://www.terrasoftsolutions.com/products/ydl/included/details.php?app_id=1857&var=description&find=CD
You can even read a PS3 BD Game disc from linux (you can't play it of course).
Yes. But at the moment there is no kernel/hw level support for that:
(and no flash because there is no PS3 powerpc port for it... )
"
Supported Computers
PS3 2006
* With YDL v5.0, the PS3 is fully supported with the exception of the NVIDIA graphics card (see note on Graphics).
** The Nvidia graphics card is not supported beyond framebuffer mode. This does does not reduce the quality of the image, but does not provide accelerated video nor OpenGL support.
*** At this time, system sound and .ogg files are fully functional. However, neither playing nor ripping audio CDs functions. Stay tuned for updates ...
"
Yellow dog Linux, released today do not have CD playing/ripping support or flash support.
USB flash memory support is normally built into Linux distros, if it isn't it can be added. You just have to enable it (usually under System> Preferences > Removable Drives and Media in Gnome).
He's talking about Adobe Flash... which eSa, there *is* support for. It's Flash 9 that there isn't current support for.
Okie... let me understand this correctly.
Audio CD playback - Optional software (done by PPU) ? . What does it mean when YDL says "Stay tuned for update" if PS3 can play CD ? Or it can't ?
CD Ripping - Need USB CD ROM but optional software can be installed.
Flash - Flash 8/7 using Adobe or Open Source alternatives respectively.
Are the above descriptions accurate ?
I expect that the BluRay driver for Linux isn't completely finished, insomuch that access to Audio CDs is currently not possible. A lot of BluRay devices apparently can't read Audio CDs, but the PS3 can. But I'm thinking that support for this feature isn't present yet in the Linux driver, maybe because it requires a small kernel fix or something similar.
I'm expecting that this problem will be solved quickly.
Alright, I think I get it now. Thanks !
Dave Glue
28-Nov-2006, 02:20
Certainly better than I expected - roughly 2.4ghz P4 speed, which is very usable.
Now, this translating into a successful standardized Linux platform down the line? I just don't see it. Hobbyists? Sure - there are people running AmigaOS of course. :) But will the PS3 be significantly cheaper in 2 years time, and where will a $600-$800 PC be technically? Budget GPUs at the time will likely be above the RSX, and physics/media/encoding jobs could be significantly sped up by running on the GPU, or perhaps even a motherboard GPU offering. It doesn't have to be "as good as" or "much better" than a PC to offset all the inherent advantages of a truly open platform - it has to completely decimate it in certain tasks to gain any mindshare as anything more than a hacker's tool.
There are huge obstacles to overcome. I would just focus on the fact that this means ported code may not be a disaster (although that doesn't explain COD3's performance on the PS3...)
mohibbur
28-Nov-2006, 02:25
i have decided to post the official benchmarks released by the tech giants.
first from IBM
http://www-128.ibm.com/developerworks/power/library/pa-cellperf/
Clearly the G5 is trashed... each SPE outguns just about any processor. Without optimizers released by my university (www.mcmaster.ca), each SPE surely would beat quad core by a margin of 2:1. NOTE: the CELL is way more efficient than normal PC cores.
http://www.mc.com/literature/literature_files/Cell-Perf-Simple.pdf
opteron 275(dual core),Pentium 4 XEON 3.6ghz, g5 2.0ghz all gets murdered by the CELL by a minimum margin of 12:1.
NOTE : as for complex FFT each SPE outguns the pc cores by a minimum margin of 2:1.
what a shame some guys were comparing CELL with P4...........the PPE is the only dual core processor that can handle 2 full threads(CRYTEK).
now an article from MIT.
http://cag.csail.mit.edu/crg/papers/eichenberger05cell.pdf
indicates that SPEs are fully scalable and flexible........researchers there have also found out that with proper optimizers the performance of each SPE can improve tenfold!!!!!
FLOP count : XENON vs CELL
http://www.forbes.com/home/free_forbes/2006/0130/076.html
Speed Thrills: Cell Zaps Rivals
Processor (maker) | Transistors (mil) | Performance (gigaflops)
Cell (IBM, Sony, Toshiba) | 234 | 230
Xbox 360 processor (IBM) | 165 | 77
Pentium 4 Extreme Edition 840 (Intel) | 250 | 26
Sources: Microprocessor Report; IBM.
NOTE: IBM has officially stated that XENON could do 77gflops(115 was made up by MS) and it is they who reinstated the 230gflops figure for CELL.
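For what it's worth, the figures in that Forbes table can be normalised by transistor count. A quick Python sketch, using only the numbers quoted above; keep in mind these are vendor-stated peak figures, not measured results, and the helper name is just for illustration:

```python
# Figures copied from the Forbes table quoted above:
# name -> (transistors in millions, peak gigaflops as stated there).
chips = {
    "Cell": (234, 230),
    "Xbox 360 processor": (165, 77),
    "Pentium 4 Extreme Edition 840": (250, 26),
}


def gflops_per_million_transistors(transistors_mil, gflops):
    """Peak gigaflops divided by transistor count in millions."""
    return gflops / transistors_mil


ratios = {
    name: gflops_per_million_transistors(t, g)
    for name, (t, g) in chips.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} GFLOPS per million transistors")
```

On these numbers Cell comes out at roughly 0.98, the Xbox 360 processor at about 0.47 and the Pentium 4 EE 840 at about 0.10, which says something about peak throughput per transistor and nothing about how general-purpose code actually runs.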
FROM UNIVERSITY OF CALIFORNIA@Berkeley(super article)
as anticipated the CELL eats away the pc cores (including itanium ,opteron) and X1E superscalar (also termed as supercomputer on some sites ) processor by huge margins.
In double precision apps CELL beats X1E by a margin of 3:1 .
In single precision apps competition doesnt exist!
after seeing the biased trash-talk about the CELL by x360 ******s and others i have decided to post these official benchmarks from IBM,MC,MIT,UC.
http://www.cs.berkeley.edu/~samw/projects/cell/LBLTalk.pdf ..............this article should give everybody an overview about why CELL is so powerful.
i am confident that with optimizers CELL alone would be able to handle the graphical tasks besides the other computational tasks....CELL is very adept at raycasting and raytracing.IF it performs so well in heavy computational area then i have no doubt that it would perform spectacularly in less intensive graphical tasks too....
Your spirit is appreciated mohibbur, but just so you know we've seen all of those before. ;) (for example being the most recent, the McMaster paper is located just a couple of threads down from here)
These benchmarks are interesting in their own right for different reasons - don't sell their value short. No one here is comparing Cell to the P4, we're comparing the PPE core to a P4 on unoptimized code in a real world environment under real world GP workloads. I think you'll agree that's fair. And as has been mentioned, we're pleasantly surprised with the results.
RollingBalls
28-Nov-2006, 05:05
Now after reading this I cannot help myself, but wonder whether any of you ever did some word processing on say a short but mildly complex (say even 10 pages, multiple fonts, embedded graphs, macros, etc.etc.) document on a 866mhz computer. Word processing certainly isn't as trivial as some make it out to be here.
You're young.
Certainly better than I expected - roughly 2.4ghz P4 speed, which is very usable.
Now, this translating into a succesful standardized Linux platform down the line? I just don't see it. Hobbyists? Sure - there are people running AmigaOS of course. :)
It depends on Sony. If they deliver a packaged and pre-configured Linux DVD with all drivers, non-free codecs and non-free programs (like flashplayer, realplayer, Adobe PDF reader etc.) licensed and bundled, included with every PS3, then PS3 Linux will be a mainstream OS, much, much bigger and much more mainstream than Apple's OSX for example, and every bit as easy to use. The problem with Linux as opposed to Windows for joe/jane average is that 1)Linux isn't preinstalled, 2) Linux non-free codecs, non-free drivers and non-free applications cannot be bundled with the OS. It requires someone with considerable knowledge to find these on the Internet and install and configure them correctly. Multi-media codecs in particular are difficult to get working properly.
But will the PS3 be significantly cheaper in 2 years time, and where will a $600-$800 PC be technically? Budget GPU's at the time will likely be above the RSX, and physics/media/encoding jobs could be significantly sped up by running on the GPU, or perhaps even a motherboard GPU offering. It doesn't have to be "as good as" or "much better" than a PC to offset all the inherent advantages of a truly open platform - it has to completely decimate it in certain tasks to gain any mindshare as anything more than a hackers tool.
If you look at consoles carefully, you will find that the cost drops continuously as they are re-engineered. The PS3 will be significantly cheaper in 2 years time. The advance of technology allows the fixed specification console to get cheaper and the non-fixed spec PC to become more powerful.
Naah! To be a hackers tool, it has to completely decimate the PC in certain tasks - which it does in HPC. To be a successful mass market OS, it has to come bundled and preconfigured so it can run out of the box without any effort on the end users part.
There are huge obstacles to overcome. I would just focus on the fact that this means ported code may not be a disaster (although that doesn't explain COD3's performance on the PS3...)
I am not sure what you are trying to say here. The console code base will have the cutting edge games. Porting from the PS3 / Xbox 360 / Wii to the PC or between PS3 / Xbox 360 / Wii is the way things will generally happen - at least for cutting edge games. In addition, because PCs have various CPUs, graphics cards, and memory busses, unlike consoles, PC games have to code for the lowest common denominator regardless of what the highest performance CPU, graphics card etc. available is.
Rangers
28-Nov-2006, 13:21
Yeah that's gonna be a problem with the PS3 computer going forward. It's going to be left behind in specs, particularly default RAM, by even budget computers, fast.
Imagine using your 64 MB Xbox for a PC, 2-3 years later. It might have been decent in 2001...in 2003+, not so much.
Shifty Geezer
28-Nov-2006, 13:31
Yeah that's gonna be a problem with the PS3 computer going forward. It's going to be left behind in specs, particularly default RAM, by even budget computers, fast.
It'll only become a problem if the applications need more RAM, which most don't. If you can word process, web browse, play music etc. now on a 256 MB machine, you'll still be able to do those same tasks when PCs are shipping with 4GB of RAM, unless there's a dramatic change in the way data is stored so a Word doc or a view of IGN takes 512 MB.
There are lots of things an Amiga can't do now, but in its time it was covering most of the bases for 5+ years. The areas of most rapid growth in resource demands are games, image+movie editing, modelling and raytracing apps, and music sequencing, all of which need loads of data on hand. Anyone not doing these things could probably get by on 150 MB of available RAM without ever knowing they're limited. Anyone wanting to do those things won't just find PS3's RAM limiting in 5 years, but will find it potentially limiting even now. However, the PS3 hasn't the software to run these tasks at the moment, so no-one should be looking at it to do them professionally.
Yeah that's gonna be a problem with the PS3 computer going forward. It's going to be left behind in specs, particularly default RAM, by even budget computers, fast.
Imagine using your 64 MB Xbox for a PC, 2-3 years later. It might have been decent in 2001...in 2003+, not so much.
Sony can always sell a multi-media computer version with more RAM when RAM prices come down. The basic Internet PC functions ie. web browsing, music, photo viewing don't need to take up any more space in 5 years time than the 197MB plus say 197MB of ramdisk swap space (if GDDR is used that way) that there is now. Linux is different from Windows (where only Microsoft can remove stuff causing bloat) - you don't need to include things that you don't need, which is why Linux is used in embedded devices like smart mobile phones, routers, firewalls, NAS servers etc.
The PS3 can also act as a display device for a network connected Linux PC, allowing multiple users to log in and run independent sessions on the Linux PC and sharing data. This requires minimal RAM. 32MB RAM is sufficient if you are running a thin client with just an X-server and no local applications. You could also run the browsing and movie playing etc. as local applications and others as remote applications on the same desktop.
One interesting possibility for using the PS3 at the latter part of its life when the hardware is cheap is to embed it into an HDTV to give you a multimedia PC. Toshiba announced it will embed Cell into every high end HDTV it manufactures. This provides facilities like decoding several video channels simultaneously and displaying them as live thumbnail screens at the sides of the full screen you are watching, which you can switch to by clicking on them. It would also allow things like instant replay. Sony could do the same but go one step further by embedding a PS3 into an HDTV to give movie playback, PVR capability on the hard drive, game play capability, and buying a wireless keyboard and mouse would give computer capability as well, plus you can run remote sessions on another Linux PC over wireless or the network.
I think as a media/home PC, PS3 has lots of possibilities, if Sony wants to exploit them.
Rangers
28-Nov-2006, 14:19
Sony can always sell a multi-media computer version with more RAM when RAM prices come down. The basic Internet PC functions ie. web browsing, music, photo viewing don't need to take up any more space in 5 years time than the 197MB plus say 197MB of ramdisk swap space (if GDDR is used that way) that there is now. Linux is different from Windows (where only Microsoft can remove stuff causing bloat) - you don't need to include things that you don't need, which is why Linux is used in embedded devices like smart mobile phones, routers, firewalls, NAS servers etc.
The PS3 can also act as a display device for a network connected Linux PC, allowing multiple users to log in and run independent sessions on the Linux PC and sharing data. This requires mimimal RAM. 32MB RAM is sufficient if you are running a thin client with just an X-server and no local applications. You could also run the browsing and movie playing etc. as local applications and others as remote applications on the same desktop.
One interesting possibility for using the PS3 at the latter part of it's life when the hardware is cheap is to embed it into an HDTV to give you a multi-media PC. Toshiba announced it will embed Cell into every high end HDTV it manufactures. This provides facilities like decoding several video channels simultaneously and displaying them as live thumb screens in sides of the full screen you are watching, which you can switch to by clicking on them. It would also allow things like instant replay. Sony could do the same but go one step further by embedding a PS3 into an HDTV to give movie playback, PVR capability on the hard drive, game play capability, and buying a wireless keyboard and mouse would give computer capability as well, plus you can run remote sessions on another Linux PC over wireless or the network.
I think as a media/home PC, PS3 has lots of possibilities, if Sony wants to exploit them.
These are good points but I think that the tendency is toward big, memory-hog programs (heh, maybe a backhanded plug for Blu Ray). And those big flashy..whatever they are windows proggies are going to do loads better on your 1024 MB <$500 cheapo tower (which'll also sport a super zippy pentium CPU and big ol HDD).
Of course these limitations can and will be worked around, it's just thinking about the forward looking limiters that will likely present themselves. But yeah, there are a LOT of possibilities here that need to be explored. The one plus we have is a hugely better level of baseline graphical capabilities. That is what needs to be worked on leveraging imo.
Infinisearch
28-Nov-2006, 14:49
In regards to opengl, as already stated expose it using the flash rom OS except cripple it. Limit the number and size of VBO,PBO,textures... allowed, this way for simpler things you get fully accelerated 3d but it would limit the ability to make full fledged games. On a related note you could also disable compressed textures, given the limited ram the system has, it would also hamper full fledged game development.
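To make the idea a bit more concrete, here is a rough Python sketch of what such a capped resource policy could look like. Everything in it is hypothetical: the class name, the limits and the method are made up for illustration and do not correspond to any real Sony, hypervisor or OpenGL interface.

```python
from dataclasses import dataclass


@dataclass
class GpuQuota:
    """Hypothetical per-guest caps; all numbers are invented for illustration."""
    max_textures: int = 16
    max_texture_bytes: int = 1 * 1024 * 1024
    allow_compressed: bool = False
    textures_in_use: int = 0

    def request_texture(self, size_bytes: int, compressed: bool = False) -> bool:
        """Return True and count the texture if the request fits the caps."""
        if compressed and not self.allow_compressed:
            return False  # compressed textures disabled outright
        if size_bytes > self.max_texture_bytes:
            return False  # single allocation too large
        if self.textures_in_use >= self.max_textures:
            return False  # object count exhausted
        self.textures_in_use += 1
        return True


quota = GpuQuota(max_textures=2)
print(quota.request_texture(64 * 1024))                   # small texture fits
print(quota.request_texture(4 * 1024 * 1024))             # over the size cap
print(quota.request_texture(64 * 1024, compressed=True))  # compression disabled
```

The same pattern would extend to VBOs and PBOs; simple desktop effects stay within the caps while a full game's asset load would not.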
Shifty Geezer
28-Nov-2006, 15:08
These are good points, but I think the tendency is toward big, memory-hog programs (heh, maybe a backhanded plug for Blu-ray). And those big flashy... whatever they are... Windows proggies are going to do loads better on your 1024 MB <$500 cheapo tower (which'll also sport a super zippy Pentium CPU and big ol' HDD).
Of course these limitations can and will be worked around; I'm just thinking about the forward-looking limits that are likely to present themselves.
Like what exactly? Even with totally over-the-top user interfaces gobbling resources, you're going to be hard pushed to create apps that consume that much RAM, at least for a useful purpose. By far the greatest memory hog is assets. Second to that come expansive data structures for processing, such as trees. There's little reason to go that route for data. E.g. it may be that instead of storing your Word document as a string of character values, future WPs store every letter in a relational database, needing loads of RAM. But why? And even if they do that, PS3 or other RAM-constrained platforms can stick to the current methods that don't need buckets of RAM and are still very usable.
I myself can't see any direction applications can go that'll see them struggling on 150+ MB. If UI's are to become more flashy, they would be better off being vector based, actually reducing RAM footprint.
There is a possibility, now I think about it, of some data types being stored in more expansive structures, such as not storing images in 2D arrays of colour values. Again though, this comes down to data-intensive tasks. For the ordinary activities, whatever happens in the PC space, PS3 will still be a competent machine. Putting it another way, I'm on an Athlon XP2500 with Ti4200 here. It does everything I want at an okay speed, so I won't be upgrading. The only incentive I have to upgrade is to play games or raytrace faster (the reason I upgraded last time) - neither of which I'm doing much of these days. The fact that new PCs are much, much faster than my PC doesn't make my PC start working slower! In 5 years' time when PCs are even more super-duper, I'll still be able to do the same tasks in the same way, except if the OS gets dropped from support and I have to upgrade to a new OS requiring a new machine. And most of the tasks that could do with a speed-up would benefit from processor upgrades, and not more RAM.
Like what exactly? Even with totally over-the-top user interfaces gobbling resources, you're going to be hard pushed to create apps that consume that much RAM, at least for a useful purpose. By far the greatest memory hog is assets.
Just browsing a thread at these forums takes up 60MB in a browser (any browser). Linear thread view-> big canvas -> loads of memory allocated.
Browse 3-4 different websites, run an email client, run a couple of IM clients and you're way past the 256MB mark.
And most of the tasks that could do with a speed-up would benefit from processor upgrades, and not more RAM.
Emphatically wrong. I'd much rather have a P-3 833MHz with 1GB ram than a dual core A64 X2 with 256MB.
Cheers
fireshot
28-Nov-2006, 15:39
<200MB of RAM is pushing it; home computing is designed around Windows. Ideally 512MB could suffice for PS3 Linux users for a good 4-5 years. Another victory for unified memory?
Is it possible to open up RSX memory for "Other OS" needs? IMHO the execution side of PS3 has, on the whole, "under-delivered".
These are good points, but I think the tendency is toward big, memory-hog programs (heh, maybe a backhanded plug for Blu-ray). And those big flashy... whatever they are... Windows proggies are going to do loads better on your 1024 MB <$500 cheapo tower (which'll also sport a super zippy Pentium CPU and big ol' HDD).
Of course these limitations can and will be worked around; I'm just thinking about the forward-looking limits that are likely to present themselves. But yeah, there are a LOT of possibilities here that need to be explored. The one plus we have is a hugely better baseline of graphical capability. That is what we need to work on leveraging, imo.
I am not sure whether the majority of Internet PC programs will follow the same path to bloat. The concept of Internet hosted services and portals and the spread of mobile devices on which you can read email and email attachments or browse the web, is driving file sizes the other way. OSes and locally hosted Office suites have been getting bloated, but Internet related applications haven't. For example Firefox is lightweight compared to last gen browsers because it needs to be downloaded from the web. A similar thing will happen with client side programs for Internet hosted services. Lightweight viewers and office document readers will also be required by smartphones etc. which will push file sizes down. The PS3 would make a very good platform for using online Internet application services like Google apps, Yahoo, Hotmail, etc. - which would also be the ideal way for most users to use home PCs.
Shifty Geezer
28-Nov-2006, 16:11
Just browsing a thread at these forums takes up 60MB in a browser (any browser). Linear thread view-> big canvas -> loads of memory allocated.
Browse 3-4 different websites, run an email client, run a couple of IM clients and you're way past the 256MB mark.
Huh? 60 MB to browse a webpage?!?! How does PSP do it then? :p
Seriously, I can start with 240 MB in use (music player and Messenger running, plus other misc crap), open this forum, IGN and Eurogamer to get to 300 MB, then run Word with a 20k document (312 MB) and music playback (320 MB), and have consumed 80 MB.
Now of course there will be occasions where you could hit the RAM limit (ignoring the availability of virtual RAM), but mostly, for general operation, it shouldn't be a problem, especially when PS3-targeted apps (should) make an effort to use minimal RAM. A person who hits the RAM limit because they've opened so many windows, browsers, and apps probably needs to work on their workflow before adding more RAM :D
Emphatically wrong. I'd much rather have a P-3 833MHz with 1GB ram than a dual core A64 X2 with 256MB.
Sure, because RAM is a limit, especially for Windows. But would you rather have a PIII with 8 GB over an A64 X2 with 1 GB? Once the RAM becomes 'enough', adding more has no advantage. Windows is a RAM hog - always has been. Apps don't need to be. The OS doesn't need to be either. If someone releases a Linux for PS3 that does like Windows and consumes 200+ MB in just OS, it just won't get installed!
Gubbi is your browser bogged down with toolbars or something?
Whatever the case, in my real-world usage of Linux on PS3, I had OpenOffice open, a Firefox window with several tabs, that 'Asteroids' knock-off game open, and still had a decent buffer of memory left available. Once an application opens, the dynamic allocation between the HDD swap space and the XDR seems pretty intelligently and seamlessly handled. Granted we're talking about 5400RPM drives here... but the fact is the PS3 is quite livable as an Internet PC.
Anyway I've installed a new hard drive as of yesterday (120GB 8MB cache up from the 20GB 2MB cache), so after I get Linux back on today (seeing about Fedora again vs YDL) I'll open some junk up and take a screen capture of the system resources monitor to get a physical piece of evidence.
In regards to OpenGL, as already stated: expose it using the flash ROM OS, except cripple it. Limit the number and size of VBOs, PBOs, textures... allowed; this way for simpler things you get fully accelerated 3D, but it would limit the ability to make full-fledged games. On a related note you could also disable compressed textures; given the limited RAM the system has, that would also hamper full-fledged game development.
Even if full access was granted the chances of someone, other than a homebrew coder, making a fully fledged game using linux is remote and the chances of it being anywhere near the level of "PS3" games even harder to imagine.
Not only will it have the PS3s OS running in the background using 60+ mb of memory, but it will also have whatever resources Linux needs to run also. I'm guessing that will leave very little space for a game.
I suppose it will also be difficult to use RSX to access whatever amount of XDR ram is left without the full blown RSX dev kit.
Interesting discussions.
On the issue of limited memory in the future... Kutaragi has mentioned that Sony may release a better I/O and memory spec'ed PS3 later in its lifecycle for PC (or other) usage scenarios.
Comparing PS3 Linux and PC Vista resource usage is apples-vs-oranges though; the latter consumes much CPU, GPU and memory when running general applications. Linux should be more resource friendly for some time to come (eagerly awaiting xbdestroya's next post).
If the PS3 Linux folks (e.g., YDL) want to succeed, they will be better off focusing PS3 Linux on UI/usability and certain classes of applications (e.g., media related software, PS3 development tools, or a new genre of living room apps). Office applications are "just" hygiene factors. Most people don't use all the features of Word anyway, but a properly optimized Excel or statistics package would be nice.
Just browsing a thread at these forums takes up 60MB in a browser (any browser). Linear thread view-> big canvas -> loads of memory allocated.
Browse 3-4 different websites, run an email client, run a couple of IM clients and you're way past the 256MB mark.
I have run Linux on 192MB RAM and I often opened more than 3-4 websites plus one with embedded video, email client with several email windows open, plus an IM client, and an mp3 player playing in the background, plus OpenOffice, and I haven't had any problems. Maybe there is a problem on Windows due to the OS hogging more space, but 192MB is perfectly usable for this type of workload on Linux. Also note that on Linux, running a browser or OpenOffice means that the loaded applications and any dynamic libraries are shared and only get loaded once. Bear in mind also that you won't be typing into all the applications simultaneously - in most cases some or most of the applications you have open are idle, and only kept open because you want to come back to those later, and so if they get paged to hard drive, you won't really notice.
Emphatically wrong. I'd much rather have a P-3 833MHz with 1GB ram than a dual core A64 X2 with 256MB.
Cheers
Depends what OS you are running I guess. For Linux I would rather have an A64 X2 with 256MB RAM than a P3 833MHz with 1GB RAM, because the A64 X2 is faster.
On the PS3, even if the 256MB GDDR isn't directly accessible by Cell, if it can be used as 197MB ramdisk swap space by having RSX fetch on request, that would speed things up to close to what you would get with 394MB ram. On Linux you usually allocate swap space equal to RAM size anyway, so this should be fine for normal usage and would allow some GDDR space for display requirements.
aaaaa00
28-Nov-2006, 21:03
I myself can't see any direction applications can go that'll see them struggling on 150+ MB. If UI's are to become more flashy, they would be better off being vector based, actually reducing RAM footprint.
Everyone is switching to true buffered rendering on desktop windows to enable all the fancy 3D desktop effects that are in vogue right now. OSX has had it for years, Vista has it when you have Glass turned on, and the fancy new OpenGL window managers for Linux have it.
That means the OS maintains a bitmap for every window you have open -- basically apps render to a texture that represents the window, and your 3D card composites all the textures together to render your desktop.
So at 1920x1080x4 bytes per window, that's an 8 MB texture per maximized window.
If you open ten fullscreen maximized windows, that's 80MB of backing store you need. Open 100 windows, and you need 800MB.
There's stuff you can do to minimize this (compression, discarding back buffers, etc), but the bottom line is that GUIs are going to get more memory hungry if you want fancy pretty desktops.
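To put numbers on the arithmetic above, here is a minimal Python sketch of the backing-store estimate. It assumes one uncompressed 32-bit buffer per maximized 1920x1080 window and ignores the compression and buffer-discarding tricks just mentioned; the function name is made up for illustration.

```python
def backing_store_bytes(width, height, bytes_per_pixel=4, windows=1):
    """Uncompressed backing store for `windows` windows of the given size."""
    return width * height * bytes_per_pixel * windows

MB = 1000 * 1000  # decimal megabytes, matching the rough figures in the post

# Print the estimate for one, ten and a hundred maximized windows.
for n in (1, 10, 100):
    size = backing_store_bytes(1920, 1080, windows=n)
    print(f"{n:>3} window(s): {size / MB:.1f} MB")
```

That works out to roughly 8.3 MB, 83 MB and 829 MB, which is where the rounded 8/80/800 figures come from.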
Aren't we all agreed that such GUI interfaces are superfluous bloat, however?
jonabbey
28-Nov-2006, 21:50
On the PS3, even if the 256MB GDDR isn't directly accessible by Cell, if it can be used as 197MB ramdisk swap space by having RSX fetch on request, that would speed things up to close to what you would get with 394MB ram. On Linux you usually allocate swap space equal to RAM size anyway, so this should be fine for normal usage and would allow some GDDR space for display requirements.
Yeah it'll be interesting to see whether anyone gets swapping to GDDR going. With a read speed of 16MB/sec (and a much higher write speed), it should be fast enough to make a useful swap pool, though nothing comparable to having that much more main XDRAM on hand.
Yeah it'll be interesting to see whether anyone gets swapping to GDDR going. With a read speed of 16MB/sec (and a much higher write speed), it should be fast enough to make a useful swap pool, though nothing comparable to having that much more main XDRAM on hand.
Erm, all the talk about using it as swap obviously would get the RSX to initiate the request when writing to the XDR e.g. swapping back in, at full speed (that's a few GB/s).
Erm, all the talk about using it as swap obviously would get the RSX to initiate the request when writing to the XDR e.g. swapping back in, at full speed (that's a few GB/s).
Besides swapping doesn't happen all the time, otherwise doing swap on a hard drive would slow things down to a crawl.
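For a rough sense of the gap between the two read figures quoted above, here is a small Python sketch. The 197MB pool size and the 16MB/sec figure come from the thread; the 2000MB/sec value is only an assumed stand-in for "a few GB/s", not a measured number, and the function name is hypothetical.

```python
def transfer_seconds(size_mb, rate_mb_per_s):
    """Time to move size_mb megabytes at a sustained rate_mb_per_s."""
    return size_mb / rate_mb_per_s

SWAP_MB = 197       # swap pool size suggested in the thread
CPU_READ = 16       # MB/s, the Cell-side read figure quoted in the thread
RSX_PUSH = 2000     # MB/s, ASSUMED placeholder for "a few GB/s"

for label, rate in (("Cell reads GDDR", CPU_READ), ("RSX writes to XDR", RSX_PUSH)):
    print(f"{label}: {transfer_seconds(SWAP_MB, rate):.2f} s for the whole pool")
```

Swapping in the whole pool is the worst case; a single page fault moves far less data, so the per-fault cost is much smaller either way.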
crazygambit
29-Nov-2006, 02:11
Just browsing a thread at these forums takes up 60MB in a browser (any browser). Linear thread view-> big canvas -> loads of memory allocated.
Browse 3-4 different websites, run an email client, run a couple of IM clients and you're way past the 256MB mark.
Emphatically wrong. I'd much rather have a P-3 833MHz with 1GB ram than a dual core A64 X2 with 256MB.
Cheers
Well... I'm on my trusty 500 MHz PIII with 256MB of the slowest available RAM (which includes 16MB for the embedded video card :lol: ), and not only do I have 11 threads from this very forum open in Firefox tabs, I also have some MSN windows and some other crap open. And I'm using probably the most bloated version of XP, and it runs fine. I'm certain the PS3 can handle it better.
Plus I see no reason they won't release a PS3 with more memory down the line for Linux if it does well. And it will certainly do better than PS2 Linux, of that I'm sure.
inefficient
29-Nov-2006, 02:54
Kids today with their multi-GHz multi-core multi-gigs-of-RAM PCs are spoiled!
I remember doing DTP and photo editing on a "workstation" class machine that only had 8MB of RAM. And we got the job done.
The problem is that user perceptions of what is a "reasonable" time for an operation to complete keep getting shorter and shorter. But the fact is you can get quite a lot done with only 256MB of RAM. And as for browsing 3-4 websites at a time: Jesus Christ, of course you can!
aaaaa00
29-Nov-2006, 03:27
Aren't we all agreed that such GUI interfaces are superfluous bloat, however?
What can I say, people seem to love pretty 3D superfluous bloat. :wink:
GregLee
29-Nov-2006, 03:40
Even if full access was granted the chances of someone, other than a homebrew coder, making a fully fledged game using linux is remote and the chances of it being anywhere near the level of "PS3" games even harder to imagine.
Not only will it have the PS3s OS running in the background using 60+ mb of memory, but it will also have whatever resources Linux needs to run also. I'm guessing that will leave very little space for a game.
Where is the big space disparity between Sony hypervisor + GameOS + PS3-game, on the one hand, and Sony hypervisor + Linux + Linux-game, on the other? You must think Linux uses space much less economically than the GameOS, right? That could be true, I suppose, but why do you think that?
Making a fully fledged game using Linux is very difficult for reasons unrelated to computational power and space. Finding a team of talented coders to make a free game engine is lots easier than gathering a team of talented artists to create scenery and models.
Greg
inefficient
29-Nov-2006, 04:01
Making a fully fledged game using Linux is very difficult for reasons unrelated to computational power and space. Finding a team of talented coders to make a free game engine is lots easier than gathering a team of talented artists to create scenery and models.
Greg
Damn people wanting to get paid for art! Whatever happened to "starving artists"?!
Everyone is switching to true buffered rendering on desktop windows to enable all the fancy 3D desktop effects that are in vogue right now. OSX has had it for years, Vista has it when you have Glass turned on, and the fancy new OpenGL window managers for Linux have it.
I don't think so. Glass + Vista (and all the fancy 3D desktop) requires pretty hefty system specs (min. 1 GB RAM, recommended 2 GB). Many of my friends won't be turning it on.
It's still back to what niche PS3 finds. Speed and simplicity are valid features too (see Google and Apple).
Fafalada
29-Nov-2006, 06:57
What can I say, people seem to love pretty 3D superfluous bloat.
I think 3D IS entirely redundant there - people love the added responsiveness and prettied-up look of an accelerated desktop - if you throw them into a GUI where you actually have to navigate in 3D, most of them will be crying for their mommy.
Granted you are getting the memory bloat regardless, then again after being stuck in GDI world for 12 years most of us will gladly pay the extra memory price to get the hell out.
Shifty Geezer
29-Nov-2006, 08:28
What can I say, people seem to love pretty 3D superfluous bloat. :wink:
OS bloat can be extreme, but that doesn't give an idea of applications needing more and more memory. If I have a paint package that requires 20 MB of RAM to run, it'll require that on Windows 98, Win 2k or XP or Vista, whether the OS takes up 16 MB RAM or 512 MB. The OS can use up whatever resources it wants, but to preserve memory on a memory-limited box, you'd be just plain stupid to take an OS that lets you run PS3Paint in 20 MB, and require 80 MB to run the same program with the same features. (I'd even say you'd be crazy to do that in a non-RAM constrained box, but people need some reason to upgrade their RAM ;))
Also, is 3D UI rendering going to matter to PS3 in 5 years' time? Surely it'll remain with 2D (E17 etc.) interfaces. It may not look as fancy as modern PCs at the time if they go the memory-guzzling route, but it'll still do the same work and run the same apps.
With Linux, you can cut the bloat by removing stuff which isn't necessary. Take a look at what can run from 84MB of disk space.
http://www.tuxmachines.org/node/6384
Maybe there is a problem on Windows due to the OS hogging more space, but 192MB is perfectly usable for this type of workload on Linux. Also note that on Linux, running a browser or OpenOffice means that the loaded applications and any dynamic libraries are shared and only get loaded once. Bear in mind also that you won't be typing into all the applications simultaneously - in most cases some or most of the applications you have open are idle, and only kept open because you want to come back to those later, and so if they get paged to hard drive, you won't really notice.
Windows isn't any worse in my experience than Linux. Yeah, it uses more when you boot up, but once you start to use a bunch of applications Linux quickly catches up. Few software systems are more bloated than X-windows. As for shared libraries, Linux is usually worse than Windows because, while they both use shared libraries, different applications on Linux can each run with a specific version of a library. Try looking at just about every Gnome app out there.
Depends what OS you are running I guess. For Linux I would rather have an A64 X2 with 256MB RAM than a P3 833MHz with 1GB RAM, because the A64 X2 is faster.
Your computer won't be any faster when it spends 90% of its time swapping.
The amount of RAM in a PC shapes the way you use your PC. We've all had PCs with 256MB ram in them and we got work done. You get real good at running one or two apps at a time, closing apps you won't use the next 60 seconds, close windows in your browser etc. etc.
However, once you get more RAM in your PC you use more RAM because your workflow changes: you have more apps running, you have more websites open in your browser - and you get more work done because you don't waste time restarting apps. Once you're used to this kind of behaviour it is really hard to go back.
So while a PS3 with 256MB ram might be plenty good if you come from a 256MB PC experience it will be an exercise in self-control (patience) coming from a 1GB PC experience.
Cheers
Windows isn't any worse in my experience than Linux. Yeah, it uses more when you boot up, but once you start to use a bunch of applications Linux quickly catches up. Few software systems are more bloated than X-windows. As for shared libraries, Linux is usually worse than Windows because, while they both use shared libraries, different applications on Linux can each run with a specific version of a library. Try looking at just about every Gnome app out there.
But actually that's very similar these days for Windows. Each application will typically be using its own version of a dll, if versions aren't of the same type. This was actually the solution to the pre-W2K's 'dll hell'.
The amount of RAM in a PC shapes the way you use your PC. We've all had PCs with 256MB ram in them and we got work done. You get real good at running one or two apps at a time, closing apps you won't use the next 60 seconds, close windows in your browser etc. etc.
All my PCs at home have 256mb RAM. For almost everything this is a complete non-issue.
However, once you get more RAM in your PC you use more RAM because your workflow changes: you have more apps running, you have more websites open in your browser - and you get more work done because you don't waste time restarting apps. Once you're used to this kind of behaviour it is really hard to go back.
So while a PS3 with 256MB ram might be plenty good if you come from a 256MB PC experience it will be an exercise in self-control (patience) coming from a 1GB PC experience.
Cheers
This is not my experience. At work I have 2 PCs with 1Gb and 2Gb respectively, and right now, it's tremendous overkill.
You also have to remember that the PS3 does have a large amount of video memory, and that graphics are still major RAM consumers. Ideally, E17 would manage to keep all the graphics on GDDR3's 256mb memory. I don't know if that works at this time, but it should be possible.
All my PCs at home have 256mb RAM. For almost everything this is a complete non-issue.
This is not my experience. At work I have 2 PCs with 1Gb and 2Gb respectively, and right now, it's tremendous overkill.
Exactly opposite for me. My home PC has 2GB, PCs at work have 1GB, all are dual cores.
Neither are overkill. Work PC hovers around 700MB used. Home PC is usually well above 1GB, I only close stuff when I start gaming.
Cheers
Neither are overkill. Work PC hovers around 700MB used. Home PC is usually well above 1GB, I only close stuff when I start gaming.
Cheers
If you still have 300MB unallocated, then you're barely using any memory with your applications, and instead Windows is eating up everything. Just look at your task-manager and tell me how much your main processes are consuming. Or just start up Windows, don't open any applications, and then tell me how much Windows has allocated.
If you still have 300MB unallocated, then you're barely using any memory with your applications, and instead Windows is eating up everything. Just look at your task-manager and tell me how much your main processes are consuming. Or just start up Windows, don't open any applications, and then tell me how much Windows has allocated.
After boot up 220MB is used, ~100MB of these are IM clients (in tray), AV service+client and other stuff that runs at windows start up.
A clean Windows install usually uses ~110MB after start up, that's for the OS, services and user-interface (explorer.exe). You can get this down to ~85MB if you disable all services not strictly needed.
Windows doesn't "eat up" RAM, apps do (and explorer.exe is an app.)
Cheers
Agree that people with high-end PC will have to adjust their working style on a PS3 Linux.
People will have to re-discover an optimal way to run their lives on a PS3 Linux. When I get my unit, I'll try a combination of Google apps (e.g., Email, Calendaring), Remote Desktop (Windows native stuff), local servers (to serve out services to Windows), and clients (PS3 homebrew development).
As for resource usage on Windows before Vista, we also need to account for Windows Services, Apps hooking themselves into the Shell, the web browser or/and all the running processes to perform assorted functions (e.g., Anti-virus), different managed and unmanaged run-times (say, COM, MFC, ...), large registry after years of use, etc. ... All these contribute to my laptop's weight today. I don't think I can/want to remove all of them (e.g., AV and Personal Firewall).
After boot up 220MB is used, ~100MB of these are IM clients (in tray), AV service+client and other stuff that runs at windows start up.
A clean Windows install usually uses ~110MB after start up, that's for the OS, services and user-interface (explorer.exe). You can get this down to ~85MB if you disable all services not strictly needed.
Windows doesn't "eat up" RAM, apps do (and explorer.exe is an app.)
Cheers
Windows XP isn't bad in terms of memory use, but you have to remember that it keeps memory allocated even if it isn't strictly needed, for all sorts of caching purposes. I just booted up my dual core PC (which I don't use much yet), and it has a 330M Commit Charge at startup. 180M of that is actually used up by processes, which include svchost, explorer, OfcPfwSvc, TmListen, ShellKer, spoolsv, services, and a few more svchosts.
But we'll see. The strength of YDL will have to be that it is optimised for the PS3 ...
~100MB of these are IM clients (in tray)
Wow. I use Miranda for all IM protocols, it takes 600kB of RAM sitting in the tray. Talk about bloat.
No one will argue that 192MB are plenty, but if you use decently optimized apps and no extreme memory hogs you can get quite far with that.
Shifty Geezer
29-Nov-2006, 14:37
Neither are overkill. Work PC hovers around 700MB used. Home PC is usually well above 1GB, I only close stuff when I start gaming.
Can you give an example of the breakdown of apps you have open that use up a GB of RAM? I've a 1 GB machine here and I'd like to recreate it, because at this point, with my experience, the number of open applications I'd need to use up my RAM is way, way more than I'd be comfortable using. I can't understand any activities that'd need 8 browser windows, 3 office apps, and a load of other programs running.
I'd also ask, do you think your usage is similar to most people's? And will those most people find the RAM limit of PS3 constricting if it only allows for half a dozen open applications (+media playback for listening to music while you work)?
Exactly opposite for me. My home PC has 2GB, PCs at work have 1GB, all are dual cores.
Neither are overkill. Work PC hovers around 700MB used. Home PC is usually well above 1GB, I only close stuff when I start gaming.
Cheers
All I can say is I haven't ever had your problems with 256MB RAM on Linux, and on Windows XP running on 512MB at work, I have only had such problems on a spyware infected machine. Erasing the hard drive and reinstalling Windows fixed that.
Maybe you need to check your PC for Windows spyware/trojan applications, which would explain the sort of problems you are having. Running too many background applications, like interactive virus checkers, intelligent firewalls etc., might also be the cause.
As I said before, Windows and Linux are very different. With Linux you can either recompile the kernel to leave out drivers you don't need, or build them as modules and load only the ones you need. You also have full control over what processes are started on bootup, and with Linux, any shared libraries that are loaded are loaded just once (even if several different users use the same library it is still only loaded just once). This means the OS and application memory overhead can be very much lower on Linux than Windows.
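One way to check the shared-versus-private split for yourself on Linux is to add up the counters the kernel reports per process. A minimal Python sketch, assuming the usual `Shared_Clean`/`Shared_Dirty`/`Private_Clean`/`Private_Dirty` lines of `/proc/<pid>/smaps`; the function name and the sample numbers are made up for illustration, not measurements from a PS3.

```python
import re

def summarize_smaps(text):
    """Sum the shared and private kB counters found in smaps-style text."""
    totals = {"shared_kb": 0, "private_kb": 0}
    pattern = re.compile(r"^(Shared|Private)_(?:Clean|Dirty):\s+(\d+) kB$", re.M)
    for kind, kb in pattern.findall(text):
        totals[kind.lower() + "_kb"] += int(kb)
    return totals

# Made-up sample in the /proc/<pid>/smaps layout, not real measurements.
SAMPLE = """\
Shared_Clean:       1200 kB
Shared_Dirty:          0 kB
Private_Clean:        64 kB
Private_Dirty:       300 kB
"""

print(summarize_smaps(SAMPLE))
```

On a Linux box you could feed it the contents of `/proc/self/smaps` (or another process's file) instead of the sample. Pages counted as shared are mapped by more than one process, so they are only paid for once.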
Shompola
29-Nov-2006, 15:51
Huh? 60 MB to browse a webpage?!?! How does PSP do it then? :p
Come on now... I get a lot of out of memory messages when I visit random web-sites with my PSP. I can't be the only one getting that?
60MB might be exaggerating... but modern web browsers are notorious, and so is the PSP one. You can multi-task with 256MB of RAM, or even 128MB... but it is very annoying.
GregLee
29-Nov-2006, 16:26
As for shared libraries, Linux is usually worse than Windows because while they both use shared libraries, different applications on Linux can run with specific version of a library. Try looking at just about every Gnome app. out there.
For a given major version of Gnome, I don't believe it ever actually happens that different applications require more than one version of a shared library to be memory resident. Gnome applications require minimum versions of the libraries, but not specific versions. (And the reason they do that is developers don't want to bother testing their applications against earlier versions of libraries they personally no longer use.)
Greg
Interesting PS3 Linux kernel and programming details (If not already posted):
http://felter.org/wesley/files/ps3/linux-20061110-docs/
I'm glad that the IBM "SPU virtual file system" paper got implemented and deployed :D
Fafalada
30-Nov-2006, 02:41
Windows doesn't "eat up" RAM, apps do (and explorer.exe is an app.)
Well everything evolves - back in the day I could work in WinNT with 128MB & a 133MHz P5, and it was MORE responsive than current 1-2GB machines when using development studio, a couple of browser windows and an IM program or two.
Granted - current versions of MSDev take like 130MB for themselves on startup alone, but I'm pretty sure it's not just the apps that got fatter. The nasty GDI extensions that serve no purpose other than to make the OS look uglier must have something to do with it too.
Well everything evolves - back in the day I could work in WinNT with 128MB & a 133MHz P5, and it was MORE responsive than current 1-2GB machines when using development studio, a couple of browser windows and an IM program or two.
I call rose colored spectacles on this one.
I recently booted a PPro with 128MB running NT to recover some files and I was stunned how much slower it actually was to use than my current desktop, because like you I remembered it being faster.
aaaaa00
30-Nov-2006, 03:27
I call rose colored spectacles on this one.
I recently booted a PPro with 128MB running NT to recover some files and I was stunned how much slower it actually was to use than my current desktop, because like you I remembered it being faster.
I ditto that. I keep around an old 300 MHz laptop with 128MB RAM just to keep perspective -- and I would never want to trade my dual-core AMD64 2GB Vista box for it -- ever.
Different CPU, different OS, different app, different hardware from your old 300MHz PC. :lol2:
Back on topic... The benchmark and the performance in the FC6 video still surprise me. Hopefully someone can give some impression when (free) YDL + E17 is released.
I would swap for a faster hard disk for heavy use though.
I don't expect that Sony has much incentive to open up the details of the RSX to the public; I personally am not expecting them to do so. But then again, not that I really care either. I can fully deal with the forced differentiation of PS3 general computing device, and PS3 games device. If Sony wants control of the latter, such is their right IMO. And if they do open up the RSX later... hey, bonus.
(But yeah, like NPL said the SPEs are already open Naboomagnoli - they show up under 'devices' in the system overview display)
Yes, it's interesting that the SPEs can also be used as standard. In fact I think it's Lu_ZERO who has submitted some code to allow SPE testing:
http://overlays.gentoo.org/dev/lu_zero/wiki/CellTutorial
http://planet.gentoo.org/developers/lu_zero
The thing that seems logical is that it's possible to use some or all of these SPEs as a form of GPU, then blit the result to the frame buffer?
That being the case, I want to know if some devs here are working on anything SPE/AltiVec OpenGL related. I wonder if this thread http://www.khronos.org/message_boards/viewtopic.php?t=673&highlight=ps3 might point to some ideas to begin such an option.
http://www.khronos.org/opengles/spec/
There's also the FreeVec AltiVec library, if SPE programming seems too hard for the new coders reading.
http://freevec.org/
"About libfreevec
libfreevec is a free (LGPL) library with hand-optimized replacement routines for GLIBC, such as memcpy(), strlen(), etc. These routines have been written specifically to take advantage of the AltiVec unit (a.k.a Velocity Engine or VMX), and will only work on processors that include this unit. This means they will not work on older processors, such as 603, 604, 750 (G3) or the POWER family of CPUs.
Check the Features page for more details on the exact set of functions that are included, with comments and rough speed gain estimates, and the FAQ for additional information about using the library itself."
After boot up 220MB is used, ~100MB of these are IM clients (in tray), AV service+client and other stuff that runs at windows start up.
A clean Windows install usually uses ~110MB after start up, that's for the OS, services and user-interface (explorer.exe). You can get this down to ~85MB if you disable all services not strictly needed.
Windows doesn't "eat up" RAM, apps do (and explorer.exe is an app).
Cheers
Actually Windows does eat up RAM - it just doesn't show up on the Windows Task Manager. For example besides the OS system components, components of IE, MS Office preloader etc. don't show up on Task Manager, although they do take up RAM space.
With Linux you can pretty well get rid of any bloat you don't need, and a lot of what is normally loaded, including kernel modules is never used if you are using it on one platform with a fixed hardware spec and fixed usage like the PS3 used as a desktop. Linux pre-compiled kernels that come with distributions like Fedora (and YDL) are generic and need to cover all possibilities, and so include all sorts of kernel drivers and kernel modules that might be used in server applications and for many different types of hardware. The Linux OS can be made to run on 4-16 MB RAM for a non-GUI server by the way, and 16-32MB RAM for full GUI. Applications are on top of that.
inefficient
07-Dec-2006, 14:09
Got YDL running on the machine nice and configured now.
Getting VideoLAN/VLC running was the hardest part. The OS itself was a piece of cake - anyone could do it. I had to manually hunt down and install 37 rpms. So it was some trial and error.
http://img387.imageshack.us/img387/6864/ps3loadeddesktopshothv1.th.jpg (http://img387.imageshack.us/my.php?image=ps3loadeddesktopshothv1.jpg)
In the above screenshot I've got 2 Firefox browsers open with 6 tabs total with a webpage loaded in each. A terminal session with 2 tabs. VLC playing an Xvid video. Gaim IM client loaded and connected to MSN. System monitor. And this is about the point where it is still fairly comfortable and responsive. Any more apps and swapping will become noticeable.
1080p weirdness - I got 1080p working by using mode 5 but the resolution is 1688x964 and does not actually fill my TV at the default setting.
VLC is buggy - at least the one hacked together is. Full screen mode in particular seems to cause instability. There is no filtering or post processing. And there will be some screen tearing sometimes. I look forward to someone putting together an actual official PS3 release version of this app. But it's cool that it works at all.
Next stop: Xmame!
StefanS
07-Dec-2006, 14:14
Hi inefficient!
Would you give the Linux command line clients of Folding@Home, Cure@Home, etc. a try? A few Beyond3d staffers are interested if it actually works, even if there's no specific PS3 version...
inefficient
07-Dec-2006, 15:17
Hi inefficient!
Would you give the Linux command line clients of Folding@Home, Cure@Home, etc. a try? A few Beyond3d staffers are interested if it actually works, even if there's no specific PS3 version...
There are no Linux PPC binaries for Folding@Home. Only OSX PPC versions. And it's not open source either so we can't compile it.
The only grid based project I can find a ppc version of is distributed.net
StefanS
07-Dec-2006, 15:22
There are no Linux PPC binaries for Folding@Home. Only OSX PPC versions. And it's not open source either so we can't compile it.
The only grid based project I can find a ppc version of is distributed.net
:oops: I didn't realize there was no Linux PPC version. That really blows... Anyway, try the distributed one, if you got some time. :cool2:
Come on now... I get a lot of out of memory messages when I visit random web-sites with my PSP. I can't be the only one getting that?
60MB might be exaggerating... but modern web browsers are notorious, and so is the PSP one. You can multi-task with 256MB of RAM, or even 128MB... but it is very annoying.
My PDA is fine with very low ram. It's got 128MB, but I use most of it for storage. Even still, the most memory intensive browser on PDA is mobile firefox at about 10MB, and there's probably less than 2MB of stored files on any website after that. And mobile opera is quite good at internet browsing in limited memory, though it could definitely use more cpu power.
BTW, I find Vista RC2 much more responsive for multitasking than any previous version of Windows.
Got YDL running on the machine nice and configured now.
Getting VideoLAN/VLC running was the hardest part. The OS itself was a piece of cake - anyone could do it. I had to manually hunt down and install 37 rpms. So it was some trial and error.
http://img387.imageshack.us/img387/6864/ps3loadeddesktopshothv1.th.jpg (http://img387.imageshack.us/my.php?image=ps3loadeddesktopshothv1.jpg)
In the above screenshot I've got 2 Firefox browsers open with 6 tabs total with a webpage loaded in each. A terminal session with 2 tabs. VLC playing an Xvid video. Gaim IM client loaded and connected to MSN. System monitor. And this is about the point where it is still fairly comfortable and responsive. Any more apps and swapping will become noticeable.
1080p weirdness - I got 1080p working by using mode 5 but the resolution is 1688x964 and does not actually fill my TV at the default setting.
VLC is buggy - at least the one hacked together is. Full screen mode in particular seems to cause instability. There is no filtering or post processing. And there will be some screen tearing sometimes. I look forward to someone putting together an actual official PS3 release version of this app. But it's cool that it works at all.
Next stop: Xmame!
For VLC compilation, did you have to change the source? Or "just" hunt down and experiment with 37 RPMs? (I know it's hard work nonetheless.)
I have finally set up my PS3 with an HDCP-compliant monitor for desktop work. Did you pay for the YDL distribution, or is there a free 5.0 download already (it was supposed to be mirrored publicly around Christmas)?
inefficient
08-Dec-2006, 03:16
For VLC compilation, did you have to change the source? Or "just" hunt down and experiment with 37 RPMs? (I know it's hard work nonetheless.)
I have finally set up my PS3 with an HDCP-compliant monitor for desktop work. Did you pay for the YDL distribution, or is there a free 5.0 download already (it was supposed to be mirrored publicly around Christmas)?
Linux is actually not completely backwards anymore when it comes to installing software. YDL5 comes with tools for updating or installing new software stored in their rpm repositories. There is of course yum, and there is a GUI-based app (I think just a front end for yum). Ideally, the tools not only download and install the package but all its dependencies, making life easier.
Right now the official YDL5 repository is only for paying customers. But it should be mirrored on the free sites shortly.
Anyway... the problem with VLC is that the retail channels cannot officially distribute components required by VLC because of legal issues. For example the mp3 decoder, mpeg2 decoder, dvd playback, etc, etc. The reasons should be obvious. When you try to install the VLC rpm it will just dump out a list of missing dependencies.
There doesn't seem to be any single good repository for ppc rpms. But you can find them scattered in various places.
Just today I found this post where some other guy was nice enough to actually list the sources for all the rpms he downloaded. http://forums.qj.net/f-ps3-linux-283/t-how-to-play-videos-on-yellow-dog-50-81670.html
It seems he used 40 rpms. His sources are not exactly the same as mine. But he used a blend of yd4, fc4 and fc5 rpms like me.
Linux is actually not completely backwards anymore when it comes to installing software. YDL5 comes with tools for updating or installing new software stored in their rpm repositories. There is of course yum, and there is a GUI-based app (I think just a front end for yum). Ideally, the tools not only download and install the package but all its dependencies, making life easier.
Right now the official YDL5 repository is only for paying customers. But it should be mirrored on the free sites shortly.
Anyway... the problem with VLC is that the retail channels cannot officially distribute components required by VLC because of legal issues. For example the mp3 decoder, mpeg2 decoder, dvd playback, etc, etc. The reasons should be obvious. When you try to install the VLC rpm it will just dump out a list of missing dependencies.
There doesn't seem to be any single good repository for ppc rpms. But you can find them scattered in various places.
Just today I found this post where some other guy was nice enough to actually list the sources for all the rpms he downloaded. http://forums.qj.net/f-ps3-linux-283/t-how-to-play-videos-on-yellow-dog-50-81670.html
It seems he used 40 rpms. His sources are not exactly the same as mine. But he used a blend of yd4, fc4 and fc5 rpms like me.
I am not sure about YDL, but for FC6, which is what I use, I suggest enabling the following repositories:
fedora-core.repo
fedora-extras.repo
fedora-updates.repo
fedora-legacy.repo
fedora-development.repo
fedora-extras-development.repo
fedora-updates-testing.repo
Also for the potentially restricted stuff like DeCSS, multi-media codecs etc:
livna.repo
livna-devel.repo
livna-testing.repo
I also found mplayer to be better than vlc with the FC6+livna repositories, but you still need to download the codecs package from mplayerhq. Also, if you have SELinux turned on, it blocks certain insecure things done by various multi-media libraries, such as requiring text relocation or changing the access protection of memory on the heap, so you will need to either disable SELinux or install setroubleshoot, which will prompt you and tell you the command to run to allow permission for that library.
You also need to download and install non-freely distributable stuff like flashplayer, realplayer etc. separately. You can use
yum localinstall <rpm package pathname>
yum localupdate <rpm package pathname>
to install or update downloaded third party packages taking care of dependencies automatically.
As for backward compatibility, you can install the appropriate old libraries, but it is a pain in the butt to locate the old library sources and compile them, and in most cases, if the distro no longer supports the old libraries, it isn't worth the effort. Sometimes binaries that won't work with the currently installed libraries will work if their source is compiled against the current ones. I have a very old Linux mail server running RedHat 5.0 which nobody wants to replace because the damn thing just keeps running. It has been running 24/7 since 1998, and has never once crashed or had a security issue in the 8 years it has been running. The only port exposed to the Internet is the SMTP port, so the only maintenance required now is to update sendmail every time there is a serious security flaw. Recompiling recent versions of sendmail against the very old libraries still works. Thankfully we are going to replace the server soon, because it will probably become a serious problem if that stops working.
Heinrich4
08-Dec-2006, 14:43
Forgive me for off topic.
I have a .ppt file of a talk by Deano Calver (from 2005) saying that the Cell's PPE would function almost "physically" as two 1.6GHz processors rather than as one CPU with threads (if I understood correctly).
Is Fedora using only one of these for processing in the comparison with the PowerPC?
inefficient
10-Dec-2006, 08:29
Here is some output from the python script benchmark pybench. It makes for a nice example of general purpose code.
Average round time: 10729.00 ms
For comparison. On my AMDx64 X2 3800@2ghz it scores 4113.59 ms
I was expecting much worse. Only being about 2.5x slower compared to a traditional desktop CPU for this type of code seems pretty reasonable.
Tests:                         per run   per oper.   overhead
------------------------------------------------------------------------
BuiltinFunctionCalls:        143.60 ms     1.13 us    0.50 ms
BuiltinMethodLookup:         237.35 ms     0.45 us    0.50 ms
CompareFloats:               135.45 ms     0.30 us    0.50 ms
CompareFloatsIntegers:       220.45 ms     0.49 us    0.50 ms
CompareIntegers:             148.70 ms     0.17 us    1.00 ms
CompareInternedStrings:      133.90 ms     0.27 us    2.00 ms
CompareLongs:                137.25 ms     0.30 us    1.00 ms
CompareStrings:              155.95 ms     0.31 us    2.00 ms
CompareUnicode:              147.55 ms     0.39 us    1.50 ms
ConcatStrings:               317.75 ms     2.12 us    1.50 ms
ConcatUnicode:               279.70 ms     1.86 us    1.50 ms
CreateInstances:             233.40 ms     5.56 us    0.50 ms
CreateStringsWithConcat:      79.30 ms     0.40 us    0.50 ms
CreateUnicodeWithConcat:     312.15 ms     1.56 us    1.00 ms
DictCreation:                139.40 ms     0.93 us    0.50 ms
DictWithFloatKeys:           392.95 ms     0.65 us    2.00 ms
DictWithIntegerKeys:         216.05 ms     0.36 us    2.00 ms
DictWithStringKeys:          171.05 ms     0.29 us    2.00 ms
ForLoops:                    154.90 ms    15.49 us    0.00 ms
IfThenElse:                  119.45 ms     0.18 us    2.00 ms
ListSlicing:                  55.60 ms    15.89 us    0.50 ms
NestedForLoops:              116.85 ms     0.33 us    0.50 ms
NormalClassAttribute:        244.00 ms     0.41 us    1.00 ms
NormalInstanceAttribute:     212.00 ms     0.35 us    1.00 ms
PythonFunctionCalls:         188.40 ms     1.14 us    0.50 ms
PythonMethodCalls:           187.85 ms     2.50 us    0.00 ms
Recursion:                   135.50 ms    10.84 us    0.00 ms
SecondImport:                129.80 ms     5.19 us    0.00 ms
SecondPackageImport:         128.80 ms     5.15 us    0.00 ms
SecondSubmoduleImport:       155.75 ms     6.23 us    0.00 ms
SimpleComplexArithmetic:     129.60 ms     0.59 us    0.50 ms
SimpleDictManipulation:      134.45 ms     0.45 us    1.00 ms
SimpleFloatArithmetic:       170.00 ms     0.31 us    1.00 ms
SimpleIntFloatArithmetic:    144.65 ms     0.22 us    1.00 ms
SimpleIntegerArithmetic:     146.70 ms     0.22 us    1.50 ms
SimpleListManipulation:      102.40 ms     0.38 us    1.00 ms
SimpleLongArithmetic:         99.60 ms     0.60 us    0.50 ms
SmallLists:                  203.90 ms     0.80 us    1.00 ms
SmallTuples:                 187.25 ms     0.78 us    1.00 ms
SpecialClassAttribute:       237.95 ms     0.40 us    1.00 ms
SpecialInstanceAttribute:    254.00 ms     0.42 us    1.00 ms
StringMappings:              232.90 ms     1.85 us    1.50 ms
StringPredicates:            191.20 ms     0.68 us    3.00 ms
StringSlicing:               226.70 ms     1.30 us    1.00 ms
TryExcept:                   296.05 ms     0.20 us    2.00 ms
TryRaiseExcept:              159.45 ms    10.63 us    0.50 ms
TupleSlicing:                176.75 ms     1.68 us    0.00 ms
UnicodeMappings:             375.50 ms    20.86 us    2.50 ms
UnicodePredicates:           211.65 ms     0.94 us    3.50 ms
UnicodeProperties:           218.60 ms     1.09 us    3.50 ms
UnicodeSlicing:              228.60 ms     1.31 us    1.00 ms
------------------------------------------------------------------------
Average round time: 10729.00 ms
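To put the two round times side by side, here is a small Python sketch of the arithmetic. The round times and clock speeds are the ones quoted in this post (3192 MHz for the PS3, 2 GHz for the X2 3800+). The per-clock figure is only a rough normalisation, and it assumes both runs used the same pybench version and settings, which the post does not state.

```python
# Round times as posted above (milliseconds per pybench round).
ps3_ms = 10729.00
amd_ms = 4113.59

# Clock speeds as posted: 3192 MHz for the PS3, "@2ghz" for the X2 3800+.
ps3_ghz = 3.192
amd_ghz = 2.0

# Wall-clock ratio: how many times longer the PS3 takes per round.
slowdown = ps3_ms / amd_ms

# Rough per-clock ratio: clock cycles spent per round on each machine.
cycles_ratio = (ps3_ms * ps3_ghz) / (amd_ms * amd_ghz)

print(f"wall-clock slowdown: {slowdown:.2f}x")
print(f"cycles per round:    {cycles_ratio:.2f}x")
```

The wall-clock ratio works out to roughly 2.6x, so the "about 2.5x" figure holds up; per clock cycle the gap is wider because the PS3 is clocked higher.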
BTW, if there was still any question about the clock speed of the ps3.
[Jeremy@playstation pybench]$ cat /proc/cpuinfo
processor : 0
cpu : Cell Broadband Engine, altivec supported
clock : 3192.000000MHz
revision : 5.1 (pvr 0070 0501)
processor : 1
cpu : Cell Broadband Engine, altivec supported
clock : 3192.000000MHz
revision : 5.1 (pvr 0070 0501)
timebase : 79800000
machine : PS3PF
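For anyone who wants to pull those numbers out programmatically, here is a minimal Python sketch that parses the output above. The sample text is copied from this post; the function name `parse_cpuinfo` is made up for illustration, and treating `timebase` and `machine` as machine-wide fields is an assumption based on where they appear in this particular listing.

```python
# Sample text copied from the cat /proc/cpuinfo output above.
CPUINFO = """\
processor : 0
cpu : Cell Broadband Engine, altivec supported
clock : 3192.000000MHz
revision : 5.1 (pvr 0070 0501)
processor : 1
cpu : Cell Broadband Engine, altivec supported
clock : 3192.000000MHz
revision : 5.1 (pvr 0070 0501)
timebase : 79800000
machine : PS3PF
"""

def parse_cpuinfo(text):
    """Split 'key : value' lines into per-processor dicts and machine-wide fields."""
    cpus, machine = [], {}
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "processor":
            # A new 'processor' line starts a new per-CPU record.
            current = {"processor": int(value)}
            cpus.append(current)
        elif key in ("timebase", "machine"):
            machine[key] = value
        elif current is not None:
            current[key] = value
    return cpus, machine

cpus, machine = parse_cpuinfo(CPUINFO)
clock_hz = float(cpus[0]["clock"].replace("MHz", "")) * 1e6
timebase_hz = int(machine["timebase"])
print(len(cpus), "logical processors at", clock_hz / 1e9, "GHz")
print("core clocks per timebase tick:", clock_hz / timebase_hz)
```

The two `processor` entries are the two hardware threads of the single PPE, not two separate cores, and none of the SPEs show up in this listing.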
pjbliverpool
11-Dec-2006, 16:37
Here is some output from the python script benchmark pybench. It makes for a nice example of general purpose code.
Average round time: 10729.00 ms
For comparison. On my AMDx64 X2 3800@2ghz it scores 4113.59 ms
I was expecting much worse. Only being about 2.5x slower compared to a traditional desktop CPU for this type of code seems pretty reasonable.
Sorry if this is a dumb question but is that benchmark single or multithreaded?
Here is some output from the python script benchmark pybench. It makes for a nice example of general purpose code.
Average round time: 10729.00 ms
For comparison. On my AMDx64 X2 3800@2ghz it scores 4113.59 ms
I was expecting much worse. Only being about 2.5x slower compared to a traditional desktop CPU for this type of code seems pretty reasonable.
It seems interesting given the fact that Linux doesn't take advantage of AltiVec in any way, whereas x86 Linux (assuming your pybench test was running on x86 Linux?) seems to take full advantage of the equivalent MMX instructions.
Mac OS X does however take full advantage and has many AltiVec optimisations. Is anyone able to run a test there to compare its result?
There's always libfreevec http://freevec.org/ that might even out the scores and improve overall general throughput for all apps, if someone's willing to take the time to include it as a PPC GLIBC replacement.
"libfreevec is a free (LGPL) library with hand-optimized replacement routines for GLIBC, such as memcpy(), strlen(), etc. These routines have been written specifically to take advantage of the AltiVec unit (a.k.a Velocity Engine or VMX), "
"For example, did you know that you can do byte swapping with AltiVec 7 times faster than with scalar code? Or that it is possible to sort integers and floats 4 times faster with the help of AltiVec? Were you aware that it helps to do string searching faster? Memory hashing gets upto 7 times faster. The list could just go on and on... "
http://www.powerdeveloper.org/forums/viewtopic.php?t=387&postdays=0&postorder=asc&start=0
Rainbow Man
12-Dec-2006, 04:35
I find all of this utterly fascinating. It saddens me I cannot vote for any thread on this site, or any of the amazing posters in it.. :(
One unhappy camper unfortunately.
Maybe consoles are starting to mature now. It certainly seems that way. Home hacking of consoles has mostly been confined to dark cellars and closed inner circles and kept very low-key really, except maybe for the original xbox - which you needed a modchip for anyway.
Now with ps3, and performance not actually sucking all that much it seems.. Who can say where it'll end.. fascinating, fascinating! I'm sure many PS3 websites dedicated to homemade programs under linux will pop up in the coming year. Maybe linux as a whole will get an upswing because of this?
Peace.
Cell Benchmarks (http://www-128.ibm.com/developerworks/power/library/pa-cellperf/#sec3)
Hi, I was looking for some PS3 reviews on the internet and I found this:
In IBM’s controlled testing environment, their optimized code on 8 SPE only yielded a performance number of 155.5GFLOPS. If it took 8 SPE to achieve that, no way 6 will be able to and that testing was done in a fashion that didn’t model all the complexities of DMA and the memory system. Using a 1Kx1K matrix and 8 SPE they were able to achieve 73.4GFLOPS, but the PS3 uses 6 SPE for games and these tests were done in controlled environments. So going on this information, even 73.4GFLOPS is seemingly out of reach, showing us that Sony didn’t necessarily lie about the cell’s performance as they made clear the 218GFLOPS was “theoretical.” But just like Microsoft they definitely wanted you to misinterpret these numbers into believing they were achievable.
Full Review (http://dpad.gotfrag.com/portal/story/35372/?spage=4)
So does this mean that the PS3's Cell is only able to achieve 73.4 GFLOPS? I don't really know how this benchmarking works, but I saw that when they used a larger matrix, they obtained more performance from the SPUs. Could someone explain to me how this benchmarking works, and whether the information in the review is true and Cell is only able to achieve 73.4 GFLOPS?
Thanks.
inefficient
12-Dec-2006, 07:15
Cell Benchmarks (http://www-128.ibm.com/developerworks/power/library/pa-cellperf/#sec3)
So does this mean that the PS3's Cell is only able to achieve 73.4 GFLOPS? I don't really know how this benchmarking works, but I saw that when they used a larger matrix, they obtained more performance from the SPUs. Could someone explain to me how this benchmarking works, and whether the information in the review is true and Cell is only able to achieve 73.4 GFLOPS?
Thanks.
That's like buying a car that says its max theoretical speed is 218mph, but then becoming disappointed later when you realize that for everyday driving through the city you go much slower. And even on the highway you can only manage around 73mph because of traffic problems.
Anyway those benchmarks are taken on a Cell simulator not a real machine. On a real Cell device like the ps3 you would likely run into other bottlenecks not accounted for in the simulation.
All that particular test says is that they were able to get x code running at x simulated speed. You can't come to a conclusion about what the actual maximum achievable speed is, much less conclude that Sony is a bunch of liars.
Using a 1Kx1K matrix and 8 SPE they were able to achieve 73.4GFLOPS etc. etc.
So does this mean that the PS3's Cell is only able to achieve 73.4 GFLOPS? I don't really know how this benchmarking works, but I saw that when they used a larger matrix, they obtained more performance from the SPUs. Could someone explain to me how this benchmarking works, and whether the information in the review is true and Cell is only able to achieve 73.4 GFLOPS?
If they used a larger matrix and obtained a higher number, it means the system has more room to go beyond the 1K x 1K matrix test (e.g., 2K x 2K, 4K x 4K, ...). So 73.4 GFLOPS for that benchmark is just a non-peak data point.
I'm not familiar with the benchmark to comment on what it does.
EDIT: Oops.... didn't see the link to the benchmark :embarrased:. Gubbi and Shifty have answered the question well below.
That's like buying a car that says its max theoretical speed is 218mph, but then becoming disappointed later when you realize that for everyday driving through the city you go much slower. And even on the highway you can only manage around 73mph because of traffic problems.
Anyway those benchmarks are taken on a Cell simulator not a real machine. On a real Cell device like the ps3 you would likely run into other bottlenecks not accounted for in the simulation.
All that particular test says is that they were able to get x code running at x simulated speed. You can't come to a conclusion about what the actual maximum achievable speed is, much less conclude that Sony is a bunch of liars.
Having said this, the fact is that because the SPEs can run code and access data independently of each other and of memory and intercommunication busses, Cell can get closer to the theoretical maximum FP performance than conventional processors.
But the 1kx1k means 1000x1000 pixels, right? So if no TV is bigger than 1920 x 1080 pixels, does that mean that the GFLOPS of the 8 SPUs will be limited to something around 80 GFLOPS?
jonabbey
12-Dec-2006, 14:05
But the 1kx1k means 1000x1000 pixels, right? So if no TV is bigger than 1920 x 1080 pixels, does that mean that the GFLOPS of the 8 SPUs will be limited to something around 80 GFLOPS?
When running that algorithm presumably, yeah.
When running that algorithm presumably, yeah.
But, can they improve it?
But the 1kx1k means 1000x1000 pixels, right? So if no TV is bigger than 1920 x 1080 pixels, does that mean that the GFLOPS of the 8 SPUs will be limited to something around 80 GFLOPS?
:lol:
Wait, are you for real?
mr_arcam
12-Dec-2006, 14:39
he is taking the piss....... i hope :P
:lol:
Wait, are you for real?
I don't really know much about this, just asking if I am right... so, is it like I said or not?
Shifty Geezer
12-Dec-2006, 14:51
So does this mean that the PS3's Cell is only able to achieve 73.4 GFLOPS?
In that one implementation of that one task, yes. The efficiency of a processor, the number of obtained GFlops of applied performance, depends on the task and the implementation and their suitability to the processor. In some tasks, Cell performs much closer to its theoretical maximums. If you really wanted, you could just loop through vector adds and obtain all 25.6 GFlops per SPE, so the peak rate is certainly obtainable. Getting real work done is a different matter, and that will fall some way short on all processors. E.g. if Cell only manages 70 GFlops out of 150 peak, you can be sure a CPU with a peak of 20 is more likely hitting 10 GFlops, though a rough average like that is utterly useless because so much depends on system architecture.
Something that bothers me about the article you linked to, which is common in internet debates, is in comparing Cell to XeCPU, Cell's attained efficiency is pointed out as lowering the GFLop count of PS3, but efficiency of XeCPU seems to be ignored. The same process hasn't been benchmarked on XeCPU. How does XeCPU cope with Linpack in a 1k x 1k matrix? Furthermore, solving equations (what Linpack is) isn't an ideal use of Cell in PS3. In other tasks like transforming vectors, efficiency will be much higher.
At the end of the day, the peak numbers are just that. What matters is how the processor works in real-world applications, which isn't just a matter of being able to do lots of sums quickly. Benchmarks can be useful for getting a rough idea of a processor's performance, as long as you read the small print and know what's being benchmarked. They're no good for comparing different processors unless you run the benchmarks on both processors, and can be sure they are equally optimized. In this case, the performance of Cell determined from these benchmarks should not be posted in the same article as a comparison of the two system processors, because no comparable benchmark is provided for XeCPU (of which we have no benchmarks. Hopefully XNA devs will get something working there.). For comparing the processors it's useless as we don't know if XeCPU is as efficient, more efficient, or less efficient than Cell at Linpack. And furthermore that benchmark is pretty useless for console anyway as solving equations isn't the main use of a CPU and Linpack is not representative of the workload.
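As a rough illustration of how the peak figures relate to the measured ones, here is a short Python sketch. The 25.6 GFlops per SPE figure is from the post above; breaking it down as 3.2 GHz x 4 single-precision lanes x 2 flops per multiply-add is an assumption about how that number is derived, and only the SPEs are counted (no PPE contribution). The 73.4 and 155.5 GFLOPS results are the 8-SPE figures quoted from the review.

```python
CLOCK_GHZ = 3.2          # SPE clock used throughout the thread
SIMD_LANES = 4           # assumption: four single-precision lanes per vector op
FLOPS_PER_FMA = 2        # assumption: a multiply-add counts as two flops

peak_per_spe = CLOCK_GHZ * SIMD_LANES * FLOPS_PER_FMA  # GFLOPS

def peak(spes):
    """Theoretical single-precision peak in GFLOPS for a number of SPEs."""
    return peak_per_spe * spes

def efficiency(measured_gflops, spes):
    """Fraction of the theoretical peak that a measured result reaches."""
    return measured_gflops / peak(spes)

print(f"per SPE: {peak_per_spe:.1f} GFLOPS")
print(f"8 SPEs:  {peak(8):.1f} GFLOPS, 6 SPEs: {peak(6):.1f} GFLOPS")
# Figures quoted from the review above, both for 8 SPEs.
print(f"73.4 GFLOPS  -> {efficiency(73.4, 8):.1%} of peak")
print(f"155.5 GFLOPS -> {efficiency(155.5, 8):.1%} of peak")
```

The same two lines of arithmetic would be needed for any other processor before an efficiency comparison means anything, which is the point about the missing XeCPU benchmark.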
But the 1kx1k means 1000x1000 pixels, right? So if no TV is bigger than 1920 x 1080 pixels, does that mean that the GFLOPS of the 8 SPUs will be limited to something around 80 GFLOPS?
No, no, no, no, no! 1k x 1k means a system of 1,000 linear equations in 1,000 unknowns (a matrix of 1 million coefficients) to be solved. Linpack is a 'supercomputer' benchmark. It has no bearing at all on console game resolution. The GFlops of Cell are limited to 156 peak or whatever it is, with the actual number of sums performed per second being between 0 and 156 billion depending on the code being executed.
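To make the problem size concrete, here is a minimal Python sketch of what a dense n x n Linpack run involves. It uses the conventional Linpack operation count of 2n^3/3 + 2n^2 flops; the function names are made up for illustration, and the 73.4 GFLOPS rate is simply the figure quoted in this thread, not something measured here.

```python
def linpack_problem(n):
    """Size of a dense n x n Linpack problem, using the conventional flop count."""
    return {
        "equations": n,
        "unknowns": n,
        "coefficients": n * n,
        # Conventional operation count for LU factorisation plus the solve.
        "flops": 2 * n**3 / 3 + 2 * n**2,
    }

def seconds_at(gflops, n):
    """Time to solve one n x n system at a sustained rate given in GFLOPS."""
    return linpack_problem(n)["flops"] / (gflops * 1e9)

p = linpack_problem(1000)
print(p["equations"], "equations,", p["coefficients"], "matrix entries")
print(f"{p['flops']:.3e} flops, {seconds_at(73.4, 1000) * 1e3:.2f} ms at 73.4 GFLOPS")
```

None of these quantities are pixels, so there is no connection to TV resolution.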
So does this mean that the PS3's Cell is only able to achieve 73.4 GFLOPS? I don't really know how this benchmarking works, but I saw that when they used a larger matrix, they obtained more performance from the SPUs. Could someone explain to me how this benchmarking works, and whether the information in the review is true and Cell is only able to achieve 73.4 GFLOPS?
Thanks.
That's the results for one benchmark, other benchmarks will give different results.
If you look at the result for 4K x 4K matrix multiply the result was 200 GFLOPS.
It's also a year old, so it used an older version of Cell. I don't know if Linpack was affected but I know FFTs (which were already fast) got quite a boost from the addition of "huge pages".
Why not?
Linpack is a synthetic linear algebra test, not a game or rendering engine.
Cheers
inefficient
13-Dec-2006, 16:28
I installed the GNU toolchains for Cell development this evening and I'm having too much fun!
I wanted to run some toy code I know is already fairly optimized for the gcc compiler so I've run the following code on it so far.
Fibonacci number generator
http://dada.perl.it/shootout/fibo.gcc.html
Mandelbrot generator
http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=gcc&id=2
Fibonacci results are straight forward. It's ~140% faster on the SPU.
fib (PPU) 0.130743 secs
fib (SPU) 0.093876 secs
Mandelbrot results are more interesting. Output size is 1000x1000. Hmmm... it runs faster on the PPU than the SPU.
mandelbrot (PPU) 2.595214 secs
mandelbrot (SPU) 7.623138 secs
Checked the code and saw that it is using doubles instead of floats. Changed the type from double to float and got this result.
mandelbrot (PPU) 2.715478
mandelbrot (SPU) 3.542034
better...
I installed the GNU toolchains for Cell development this evening and I'm having too much fun!
I wanted to run some toy code I know is already fairly optimized for the gcc compiler so I've run the following code on it so far.
Fibonacci number generator
http://dada.perl.it/shootout/fibo.gcc.html
Mandelbrot generator
http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=gcc&id=2
Fibonacci results are straight forward. It's ~40% faster on the SPU.
fib (PPU) 0.130743 secs
fib (SPU) 0.093876 secs
Mandelbrot results are more interesting. Output size is 1000x1000. Hmmm... it runs faster on the PPU than the SPU.
mandelbrot (PPU) 2.595214 secs
mandelbrot (SPU) 7.623138 secs
Check the code and see that it is using doubles instead of floats. Change the type from double to float and get this result.
mandelbrot (PPU) 2.715478
mandelbrot (SPU) 3.542034
better...
just a slight error ;)
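For anyone who wants to check the percentages, a minimal sketch of the arithmetic (the helper name is mine, not from the benchmark code):

```c
/* Percentage by which the `fast` time beats the `slow` time (both in seconds). */
static double percent_faster(double slow, double fast)
{
    return (slow / fast - 1.0) * 100.0;
}
```

With the fib timings above, percent_faster(0.130743, 0.093876) comes out at a little over 39, so "~40% faster" is the right reading. The SPU takes about 72% of the PPU's time.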
inefficient
13-Dec-2006, 16:42
Forgot to add this:
Compiled with -O2
fib (PPU) 0.035347
fib (SPU) 0.035380
mandelbrot (PPU) 0.777137
mandelbrot (SPU) 1.593385
:shock:
But this might not be considered fair since the PPU instruction set is well known and compiler optimizations are better.
The Mandelbrot program writes to stdout, this causes the SPE to contact the PPE as it handles stdout, that'll hurt it.
putc(byte_acc,stdout);
If those lines are removed from both versions the comparisons should be better.
Also, try it compiled with XLC using -O5, that'll auto vectorise both versions and that should give a much better view of potential performance.
Forgot to add this:
Compiled with -O2
fib (PPU) 0.035347
fib (SPU) 0.035380
For what argument ?
Cheers
RollingBalls
14-Dec-2006, 13:03
This is completely out of the blue, but has anyone tried printing with CUPS?
inefficient
14-Dec-2006, 13:05
The Mandelbrot program writes to stdout, this causes the SPE to contact the PPE as it handles stdout, that'll hurt it.
putc(byte_acc,stdout);
If those lines are removed from both versions the comparisons should be better.
Also, try it compiled with XLC using -O5, that'll auto vectorise both versions and that should give a much better view of potential performance.
Compiled with -O5 there was no improvement with the Mandelbrot. Some improvement with Fibonacci. It suggests the auto-SIMDfication does not really do anything here. But it does do something to speed up recursive functions.
fib (PPU) 0.027056
fib (SPU) 0.028134
mandelbrot (PPU) 0.786111
mandelbrot (SPU) 1.593085
Initially I thought it was the IO functions that were slowing down the code as well (actually I modified the code to write to a file, not stdout). But even after completely taking out the file IO parts of the routine there is only a very small performance increase.
I'm guessing either the spu stdio library is set up to buffer IO writes to LS so it has very little impact, or they set up some kind of virtual write-back cache in the LS.
For what argument ?
30
I'm surprised so far how well the spu-gcc compiler hides the LS from you. It's not really obvious to me how much communication is going on between the LS and main memory during actual code execution.
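For reference, a sketch of the kind of naive recursive Fibonacci being timed here. This assumes the linked shootout code uses the fib(0) = fib(1) = 1 convention, which is consistent with the 1346269 printed for argument 30 in the P4 run in this thread; it is not copied from the original source.

```c
/* Naive doubly recursive Fibonacci with fib(0) = fib(1) = 1.
   Deliberately unoptimized: the benchmark measures call overhead. */
static unsigned long fib(unsigned long n)
{
    return (n < 2) ? 1 : fib(n - 2) + fib(n - 1);
}
```

There is no data to speak of, just a few million function calls, so this mostly exercises branch and call/return handling rather than anything the LS or SIMD units would help with.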
Compiled with -O5 there was no improvement with the Mandelbrot. Some improvement with Fibonacci. It suggests the auto-SIMDfication does not really do anything here. But it does do something to speed up recursive functions.
fib (PPU) 0.027056
fib (SPU) 0.028134
mandelbrot (PPU) 0.786111
mandelbrot (SPU) 1.593085
<snip>
I'm surprised so far how well the spu-gcc compiler hides the LS from you. It's not really obvious to me how much communication is going on between the LS and main memory during actual code execution.
Both programs have very small footprints, so it's no surprise that they can be contained in cache/LS.
For a P4 2.8GHz:
> time ./fib 30
1346269
real 0m0.011s
user 0m0.008s
sys 0m0.004s
> time ./frac 1000 >dummy
real 0m0.273s
user 0m0.272s
sys 0m0.004s
edit: both compiled with gcc -O2 on FC5 Linux
Cheers
inefficient
14-Dec-2006, 14:49
Vectorized version:
mandelbrot (PPU) 1.302384
mandelbrot (SPU) 0.785173
Slower on the PPU faster on the SPU.
#include <stdio.h>
#include <stdlib.h>
typedef float v2sf __attribute__ ((vector_size(8)));
int mandelbrot (int w,int h, FILE *out)
{
    int bit_num = 0;
    char byte_acc = 0;
    int i, iter = 50;
    float x, y, limit_sqr = 4.0;
    v2sf Zrv, Ziv, Crv, Civ, Trv, Tiv;
    v2sf zero, one, _1p5, two;
    float *Zr = (float*)&Zrv, *Zi = (float*)&Ziv,
          *Cr = (float*)&Crv, *Ci = (float*)&Civ,
          *Tr = (float*)&Trv, *Ti = (float*)&Tiv;
#define initv(name, val) *((float*)&name) = (float) val; \
    *((float*)&name+1) = (float) val
    initv(zero,0.0); initv(one,1.0); initv(_1p5,1.5); initv(two,2.0);
    fputs("P4\n1000 1000\n", out);
    for(y=0;y<h;++y)
    {
        for(x=0;x<w;x+=2)
        {
            Zrv = Ziv = Trv = Tiv = zero;
            *Cr = x/w; *(Cr+1) = (x+1.0)/w;
            *Ci = y/h; *(Ci+1) = *Ci;
            Crv = two * Crv - _1p5;
            Civ = two * Civ - one;
            for (i=0;i<iter && ((*Tr + *Ti <= limit_sqr) || (*(Tr+1) + *(Ti+1) <= limit_sqr)); ++i)
            {
                Ziv = two*Zrv*Ziv + Civ;
                Zrv = Trv - Tiv + Crv;
                Trv = Zrv * Zrv;
                Tiv = Ziv * Ziv;
            }
            byte_acc <<= 2;
            if(*Tr + *Ti <= limit_sqr)
                byte_acc |= 0x02;
            if((*(Tr+1) + *(Ti+1) <= limit_sqr))
                byte_acc |= 0x01;
            bit_num+=2;
            if(bit_num == 8)
            {
                putc(byte_acc,out);
                byte_acc = 0;
                bit_num = 0;
            }
            else if(x == w-1)
            {
                byte_acc <<= (8-w%8);
                putc(byte_acc,out);
                byte_acc = 0;
                bit_num = 0;
            }
        }
    }
}
int
main(unsigned long spuid) {
    FILE *f = fopen("image_spu.pbm", "wb");
    mandelbrot(1000,1000,f);
    return(0);
}
Vectorized version:
mandelbrot (PPU) 1.302384
mandelbrot (SPU) 0.785173
Slower on the PPU faster on the SPU.
Sorry, aren't you only testing the first 2 elements in each 8-vector for escape conditions ?
Edit: Also you step through with x+=2.
Anyway, you linked to a 2-way packed double version here (http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=gcc&id=3), which yields the following on a 2.8GHz P4:
> time ./mandelbrot.gcc-3.gcc_run 1000 >dummy
real 0m0.202s
user 0m0.200s
sys 0m0.000s
I think it's safe to say that getting performance out of CELL (and XCPU for that matter) takes a bit of work.
Cheers
inefficient
14-Dec-2006, 15:30
This was buggy - output was wrong
typedef float v2sf __attribute__ ((mode(V2SF)));
And this didn't even show correct results on the PPU with doubles
typedef double v2df __attribute__ ((mode(V2DF)));
So I changed it to this.
typedef float v2sf __attribute__ ((vector_size(8)));
I assumed vector_size is in bytes not number of elements, but I could be wrong. Two floats would be 8 bytes. And this seems to work.
I didn't write the original, but stepping through with x+=2 actually makes sense if you read the code. It's actually calculating 2 bits per iteration. That's the part that is vectorized.
I assumed vector_size is in bytes not number of elements, but I could be wrong. Two floats would be 8 bytes. And this seems to work.
Hmm, I'd expect it to work on 8 element vectors.
It would still work if you step through 2 units at a time, you just redo the calculations four times.
Could you try to set vectorsize to 4 and make two extra escape conditions, and step through with x+=4
Something like this:
#include <stdio.h>
#include <stdlib.h>
typedef float v2sf __attribute__ ((vector_size(4)));
int mandelbrot (int w,int h, FILE *out)
{
int bit_num = 0;
char byte_acc = 0;
int i, iter = 50;
float x, y, limit_sqr = 4.0;
v2sf Zrv, Ziv, Crv, Civ, Trv, Tiv;
v2sf zero, one, _1p5, two;
float *Zr = (float*)&Zrv, *Zi = (float*)&Ziv,
*Cr = (float*)&Crv, *Ci = (float*)&Civ,
*Tr = (float*)&Trv, *Ti = (float*)&Tiv;
#define initv(name, val) *((float*)&name) = (float) val; \
*((float*)&name+1) = (float) val
initv(zero,0.0); initv(one,1.0); initv(_1p5,1.5); initv(two,2.0);
fputs("P4\n1000 1000\n", out);
for(y=0;y<h;++y)
{
for(x=0;x<w;x+=4)
{
Zrv = Ziv = Trv = Tiv = zero;
*Cr = x/w; *(Cr+1) = (x+1.0)/w;
*Ci = y/h; *(Ci+1) = *Ci;
Crv = two * Crv - _1p5;
Civ = two * Civ - one;
for (i=0;i<iter && ((*Tr + *Ti <= limit_sqr) || (*(Tr+1) + *(Ti+1) <= limit_sqr)); ++i)
{
Ziv = two*Zrv*Ziv + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;
}
byte_acc <<= 4;
if(*Tr + *Ti <= limit_sqr)
byte_acc |= 0x08;
if((*(Tr+1) + *(Ti+1) <= limit_sqr))
byte_acc |= 0x04;
if(*(Tr+2) + *(Ti+3) <= limit_sqr)
byte_acc |= 0x02;
if((*(Tr+1) + *(Ti+1) <= limit_sqr))
byte_acc |= 0x01;
bit_num+=4;
if(bit_num == 8)
{
putc(byte_acc,out);
byte_acc = 0;
bit_num = 0;
}
else if(x == w-1)
{
byte_acc <<= (8-w%8);
putc(byte_acc,out);
byte_acc = 0;
bit_num = 0;
}
}
}
}
int
main(unsigned long spuid) {
FILE *f = fopen("image_spu.pbm", "wb");
mandelbrot(1000,1000,f);
return(0);
}
Cheers
Hmm, seems you're right, vectorsize is in bytes..
And forget the code I pasted, it's full of bugs.
Cheers
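To make the bytes-versus-elements point concrete, a small sketch using the GCC vector extension. The type and function names here are illustrative only; the thread's code keeps calling its type v2sf even for four lanes.

```c
/* GCC vector extension: vector_size is given in bytes, not elements. */
typedef float v2sf_t __attribute__((vector_size(8)));   /* 2 floats */
typedef float v4sf_t __attribute__((vector_size(16)));  /* 4 floats */

/* Number of float lanes in each type, derived from sizeof. */
enum {
    V2SF_LANES = sizeof(v2sf_t) / sizeof(float),
    V4SF_LANES = sizeof(v4sf_t) / sizeof(float)
};

/* Element-wise multiply-add on four lanes: a * b + c. */
static v4sf_t madd4(v4sf_t a, v4sf_t b, v4sf_t c)
{
    return a * b + c;
}
```

So vector_size(4) would be a "vector" of a single float, and a 4-way float version needs vector_size(16).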
Alright, this works for 4-way floats:
#include <stdio.h>
#include <stdlib.h>
typedef float v2sf __attribute__ ((vector_size(16)));
int mandelbrot (int w,int h, FILE *out)
{
    int bit_num = 0;
    char byte_acc = 0;
    int i, iter = 50;
    float x, y, limit_sqr = 4.0, recpw=1.0/w;
    v2sf Zrv, Ziv, Crv, Civ, Trv, Tiv;
    v2sf zero, one, _1p5, two;
    float *Zr = (float*)&Zrv, *Zi = (float*)&Ziv,
          *Cr = (float*)&Crv, *Ci = (float*)&Civ,
          *Tr = (float*)&Trv, *Ti = (float*)&Tiv;
#define initv(name, val) *((float*)&name) = (float) val; \
    *((float*)&name+1) = (float) val; \
    *((float*)&name+2) = (float) val; \
    *((float*)&name+3) = (float) val
    initv(zero,0.0); initv(one,1.0); initv(_1p5,1.5); initv(two,2.0);
    fputs("P4\n1000 1000\n", out);
    for(y=0;y<h;++y)
    {
        for(x=0;x<w;x+=4)
        {
            Zrv = Ziv = Trv = Tiv = zero;
            *Cr = x*recpw; *(Cr+1) = (x+1.0)*recpw; *(Cr+2) = (x+2.0)*recpw; *(Cr+3) = (x+3.0)*recpw;
            *Ci = y/h; *(Ci+1) = *Ci; *(Ci+2) = *Ci; *(Ci+3) = *Ci;
            Crv = two * Crv - _1p5;
            Civ = two * Civ - one;
            for (i=0;i<iter && ((*Tr + *Ti <= limit_sqr) || (*(Tr+1) + *(Ti+1) <= limit_sqr) || (*(Tr+2) + *(Ti+2) <= limit_sqr) || (*(Tr+3) + *(Ti+3) <= limit_sqr)); ++i)
            {
                Ziv = two*Zrv*Ziv + Civ;
                Zrv = Trv - Tiv + Crv;
                Trv = Zrv * Zrv;
                Tiv = Ziv * Ziv;
            }
            byte_acc <<= 4;
            if(*Tr + *Ti <= limit_sqr){
                byte_acc |= 0x08;
            }
            if((*(Tr+1) + *(Ti+1) <= limit_sqr)){
                byte_acc |= 0x04;
            }
            if((*(Tr+2) + *(Ti+2) <= limit_sqr)){
                byte_acc |= 0x02;
            }
            if((*(Tr+3) + *(Ti+3) <= limit_sqr)){
                byte_acc |= 0x01;
            }
            bit_num+=4;
            if(bit_num == 8)
            {
                putc(byte_acc,out);
                byte_acc = 0;
                bit_num = 0;
            }
            else if(x == w-4)
            {
                byte_acc <<= (8-w%8);
                putc(byte_acc,out);
                byte_acc = 0;
                bit_num = 0;
            }
        }
    }
}
int
main() {
    FILE *f = fopen("image_spu.pbm", "wb");
    mandelbrot(1000,1000,f);
    return(0);
}
Please note the -ffast-math switch fucks up precision and gives wrong results.
The above was compiled with:
gcc -D_ISOC9X_SOURCE -O3 -mfpmath=sse -msse2 -march=pentium4 -funroll-loops -o mandelbrot.gcc-3.gcc_run mandelbrot.c -lm
2.8GHz, P4:
> time ./mandelbrot.gcc-3.gcc_run
real 0m0.132s
user 0m0.132s
sys 0m0.000s
Cheers
Rainbow Man
14-Dec-2006, 17:58
I don't really understand much about code (I just read the time figures when running it hehe), but PS3 has 6 free SPUs, does it not? So can't a mandelbrot be split up into 6 pieces calculated individually..?
Anyone tried that yet? :cool:
Shifty Geezer
14-Dec-2006, 20:57
You could probably split it in several ways, but the easiest I think is to render one pixel on each SPE. Each pixel is autonomously calculated so you could parallelize that way. That'd get a 6x speed up. Not very exciting though. I did think SPEs would be very fast at Mandelbrot, but of course with the simplicity of the code, a standard processor's L1 cache probably does a pretty good job of keeping the execution units occupied, so the SPE's LS might not be such a huge boon.
When it comes to raytracing a 3D Mandelbrot terrain though... ;)
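As a rough illustration of the splitting idea, here is a sketch that hands out whole rows rather than single pixels (a coarser split than the one suggested above, chosen to keep the bookkeeping small). The struct and function names are hypothetical, and real SPE code would also need to start the SPE contexts and DMA the results back, which is not shown.

```c
/* Half-open range of image rows [begin, end) assigned to one worker. */
struct row_range { int begin; int end; };

/* Split `rows` rows across `workers` workers as evenly as possible
   (hypothetical helper, plain C with no SPE runtime calls). */
static struct row_range rows_for_worker(int rows, int workers, int id)
{
    struct row_range r;
    int base = rows / workers;      /* rows every worker gets */
    int extra = rows % workers;     /* first `extra` workers get one more */
    r.begin = id * base + (id < extra ? id : extra);
    r.end = r.begin + base + (id < extra ? 1 : 0);
    return r;
}
```

For 1000 rows on 6 SPEs that gives four workers 167 rows and two workers 166. Rows near the middle of the set take more iterations than rows near the edge, so an even row count does not guarantee an even workload; interleaving rows would probably balance better.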
Hi inefficient, where did you get the vectorized source from ?
Vectorized version:
mandelbrot (PPU) 1.302384
mandelbrot (SPU) 0.785173
Slower on the PPU faster on the SPU.
I did think SPEs would be very fast at Mandelbrot, but of course with the simplicity of the code, a standard processor's L1 cache probably does a pretty good job of keeping the execution units occupied, so the SPE's LS might not be such a huge boon.
This computation doesn't use many registers so I don't think it'll even hit the L1 (at least on PowerPCs). There is however room for vectorising, but even then there are dependencies causing stalls.
Loop unrolling should help a lot with this, the advantage the SPEs have is they'll not run out of registers, everything else will.
The program itself will easily fit in L1 / LS.
I'm gonna have a bash at vectorising it tomorrow.
Results for G4 @ 1.33GHz / gcc version 4.0.1 (Apple)
note: "putc"s removed
gcc -Wall -fast -mcpu=7450 -mtune=G4 -faltivec -mno-multiple mandelbrot2.c
N=1000
0.45644999 Seconds
Fafalada
15-Dec-2006, 03:01
Vectorized version:
Rather than vectorizing - can you try something really simple?
Take the scalar version and align all the variables to 16 bytes ( "__attribute__((aligned(16)))" ) - this is relevant for SPU only.
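A minimal sketch of what that alignment might look like for the scalar working set. The struct and field names are invented for illustration; the idea, presumably, is that each scalar then sits at the start of its own 16-byte quadword, which is the unit the SPU loads and stores.

```c
#include <stdint.h>

/* Scalar working set, each variable placed on its own 16-byte boundary. */
struct mandel_state {
    float zr __attribute__((aligned(16)));
    float zi __attribute__((aligned(16)));
    float cr __attribute__((aligned(16)));
    float ci __attribute__((aligned(16)));
};

/* Returns 1 if the pointer sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

The cost is space: four floats now occupy 64 bytes instead of 16, which is a fair trade for a handful of loop variables but not for large arrays.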
You could probably split it in several ways, but the easiest I think is to render one pixel on each SPE. Each pixel is autonomously calculated so you could parallelize that way. That'd get a 6x speed up. Not very exciting though. I did think SPEs would be very fast at Mandelbrot, but of course with the simplicity of the code, a standard processor's L1 cache probably does a pretty good job of keeping the execution units occupied, so the SPE's LS might not be such a huge boon.
When it comes to raytracing a 3D Mandelbrot terrain though... ;)
It seems that the RPM, cell-sdk-lib-samples-1.1-10.noarch.rpm, contains a ray tracer for Cell.
Quaternion Julia Set Ray-tracing Sample
This sample was inspired by Keenan Crane’s work to implement Julia Set ray-tracers on GPUs using Nvidia’s Cg language. Procedurally generated surfaces are a hot topic relative to the Cell processor due to their dynamic resolution independent nature, high computational intensity, and low memory foot print. This sample attempts to preserve Keenan’s Cg coding style while showing the flexibility of Cell’s programming models. The Julia sample demonstrates several Cell software technologies including a SPE centric load balancing frame work first written for our Terrain Rendering Engine (TRE) and a software cache optimized for SIMD texture performance. The sample also provides a demonstration platform for the advantage of structure of array (SOA) over array of structures (AOS) data organization when running on SIMD architectures.
You can find the RPM here: http://www.alphaworks.ibm.com/tech/cellsw/download
You could probably split it in several ways, but the easiest I think is to render one pixel on each SPE. Each pixel is autonomously calculated so you could parallelize that way. That'd get a 6x speed up. Not very exciting though. I did think SPEs would be very fast at Mandelbrot, but of course with the simplicity of the code, a standard processor's L1 cache probably does a pretty good job of keeping the execution units occupied, so the SPE's LS might not be such a huge boon.
As ADEX said, it has around zilch data-cache (or LS) foot-print.
The SIMD nature of the SPEs dictates that you would want to do at least 4 pixels at a time (like the above code). Optimally you'd want to do 2x2 pixels for better branch coherence (ie. better coherence in the escape condition between pixels), and not the 4x1 arrangement above.
Also, instead of the scalar escape condition test above, You would want to compute a parallel (4-way) predicate on the escape condition directly into a bitmask (and remove all branches, I have low confidence in compilers being able to do this on their own).
Then you want to unroll it with at least 4 to cover the sizeable execution latencies of the SPEs. So you end up doing 4x4 pixels at a time. This is almost turning into GPGPU programming at this stage, because we rely on branch coherence to get good performance.
Doing the above, I'd expect to see SPE performance speed up >10 times compared to the above and be measurably faster than a 2.8GHz P4, and certainly much faster than said P4 running the code above.
I don't think there's any doubt mandelbrots could be computed faster on a SPE than on a PC processor. It just takes (significantly) more work than copy&pasting some code off a website and compiling it with standard switches.
Cheers
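To show what "a predicate computed directly into a bitmask" could look like, here is a portable scalar sketch. It uses no SPU intrinsics, and whether a given compiler turns it into a SIMD compare is not something I'd count on; the function name is made up, and the bit order follows the 0x08/0x04/0x02/0x01 layout of the code above.

```c
/* Build a 4-bit mask, one bit per pixel, set where |z|^2 <= limit.
   tr[k] and ti[k] hold the squared real and imaginary parts of lane k. */
static unsigned inside_mask4(const float tr[4], const float ti[4], float limit)
{
    unsigned mask = 0;
    int k;
    for (k = 0; k < 4; ++k) {
        /* The comparison yields 0 or 1, so no branch is needed. */
        mask |= (unsigned)(tr[k] + ti[k] <= limit) << (3 - k);
    }
    return mask;
}
```

The inner loop can then test `mask != 0` once instead of four chained comparisons, and the final mask drops straight into byte_acc with a shift and an OR.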
inefficient
15-Dec-2006, 11:33
Alright, this works for 4-way floats:
Please note the -ffast-math switch fucks up precision and gives wrong results.
The above was compiled with:
gcc -D_ISOC9X_SOURCE -O3 -mfpmath=sse -msse2 -march=pentium4 -funroll-loops -o mandelbrot.gcc-3.gcc_run mandelbrot.c -lm
2.8GHz, P4:
> time ./mandelbrot.gcc-3.gcc_run
real 0m0.132s
user 0m0.132s
sys 0m0.000s
Cheers
4-way on the Cell. -funroll-loops -O5 (file IO is still included)
mandelbrot (PPU) 0.413113
mandelbrot (SPU) 0.160597
Not bad, already a 10x increase from the original (1.593085). And not far off from your score on a P4@2.8ghz.
4-way on the Cell. -funroll-loops -O5 (file IO is still included)
mandelbrot (PPU) 0.413113
Craptacular
mandelbrot (SPU) 0.160597
Very good, better than expected.
Cheers
Why is there such a difference between the PPU and the SPU when they run the vectorised, customized code?
The AltiVec unit has 128 registers, runs at the same speed, etc...
How do you explain the 4X difference in speed?
Rainbow Man
15-Dec-2006, 13:30
Not bad, already a 10x increase from the original (1.593085). And not far off from your score on a P4@2.8ghz.
Very nice as far as performance goes! How deep can it zoom before the image turns to mush?
Peace.
Why is there such a difference between the PPU and the SPU when they run the vectorised, customized code?
The AltiVec unit has 128 registers, runs at the same speed, etc...
How do you explain the 4X difference in speed?
OK, I think I found the answer myself.
The GCC compiler ignores the AltiVec unit, correct me if I'm wrong ;)
If true, the result of PPU vs SPU is irrelevant.
OK, I think I found the answer myself.
The GCC compiler ignores the AltiVec unit, correct me if I'm wrong ;)
If true, the result of PPU vs SPU is irrelevant.
Don't think so. The vectorized version is still twice as fast as the scalar one.
Of course, it would be interesting to see the assembler output from gcc using the -S option (hint: Inefficient ;) )
Cheers
Titanio
15-Dec-2006, 16:34
Why is there such a difference between the PPU and the SPU when they run the vectorised, customized code?
The AltiVec unit has 128 registers, runs at the same speed, etc...
How do you explain the 4X difference in speed?
The PPE VMX unit has 32 registers, IIRC.
I'm sure it's a number of things contributing to the performance delta.
It is nice to see that relatively untouched code can run decently on the SPU predominantly using compiler optimisations, though. Not that I'd expect that to hold in general, but still, even just in this case..
inefficient
15-Dec-2006, 17:04
Why is there such a difference between the PPU and the SPU when they run the vectorised, customized code?
The AltiVec unit has 128 registers, runs at the same speed, etc...
How do you explain the 4X difference in speed?
The likely explanation is it's actually not generating vectorized code for the PPU.
Maybe I'm not using the right compiler switches to get the vector extensions to kick in. -maltivec and -mabi=altivec don't seem to do anything.
Edit: The asm seems to confirm this.
My bad, Titanio. I was thinking the PPU has the same AltiVec unit as the VMX128 in Xenon.
inefficient
15-Dec-2006, 17:22
Don't think so. The vectorized version is still twice as fast as the scalar one.
Of course, it would be interesting to see the assembler output from gcc using the -S option (hint: Inefficient ;) )
Cheers
Generated with -S -O5
PPU asm: http://paste.uni.cc/12255
SPU asm: http://paste.uni.cc/12256
4-way on the Cell. -funroll-loops -O5 (file IO is still included)
mandelbrot (PPU) 0.413113
mandelbrot (SPU) 0.160597
Not bad, already a 10x increase from the original (1.593085). And not far off from your score on a P4@2.8ghz.
the CFLAGS seem to make a difference in lots of cases, see DHolm's post for an idea
http://www.powerdeveloper.org/forums/viewtopic.php?t=252&highlight=-O3+-fexpensive-optimizations+-funroll-loops+-fomit-frame-pointer+-mcpu%3D7450+-maltivec#873
"There is really no such thing as correct USE-flags. The whole point of the USE-flag system is that you can adapt your Linux system to suit your personal needs. If your problems really are caused by a USE-flag (which is probably not the case) then it is a bug in Gentoo and not your fault.
Regarding CFLAGS, this is what I use on my G4 system: "-mcpu=7450 -mtune=7450 -O2 -maltivec -mabi=altivec -fno-strict-aliasing -fsigned-char -pipe"
The most important flag here is the -fno-strict-aliasing which is required with newer GCCs because not all software supports strict-aliasing yet.
-mcpu=7450 tells GCC to support all instructions available on the ppc7450-series (the 7447 which you find in the Pegasos is 7450-compatible, the main difference between the 744x and 745x is that the latter supports L3-cache).
-mtune=7450 tells GCC to optimize for the 7450
You could for instance use "-mcpu=G3 -mtune=7450" which would mean that it should only use instructions which are known to work on G3 and newer CPUs but still optimize for a 7450 (G4).
You can replace 7450 with G4, but then it will not optimize for the newer G4 models.
-O2 is the optimization level. 3 is the maximum but it is broken if you use mtune/mcpu 7450.
-maltivec enables the use of AltiVec instructions.
-mabi=altivec adds some extra code to your software to make it AltiVec-safe, it is not a good idea to use -maltivec without this option unless you really know what you are doing.
-fsigned-char is required by very few applications, I don't even remember which ones, and I don't know exactly why it is required. Someone else can probably answer this.
-pipe will tell GCC to pipe its output to the assembler, the default method is to create a temporary file in /tmp which the assembler reads from, but piping the data is normally faster.
Set CXXFLAGS="${CFLAGS}" which just tells GCC to use the same flags for C++ as for C.
If you are missing "-fno-strict-aliasing" then I would guess that is what is causing it to fail Xorg compilation. Although if this is not the case it would help if you could paste some output from the error here.
One way of logging the entire compilation process is to run something like:
emerge xorg-x11 &> /root/xorg-compile.log &
sleep 10; tail -f /root/xorg-compile.log
After the compilation breaks you will have the entire output of the compile process in /root/xorg-compile.log."
Did you look at freevec and talk to the AltiVec and SPU lads in the AltiVec section yet, to see how you might match the x86 MMX switches/settings?
Rather than vectorizing - can you try something really simple?
Take the scalar version and align all the variables to 16 bytes ( "__attribute__((aligned(16)))" ) - this is relevant for SPU only.
Why do you think it's only relevant for SPU? Apparently it's a very old trick to make vector code far better.
Again, see the AltiVec section of http://www.powerdeveloper.org/forums/viewtopic.php?t=58&highlight=__attribute__ from way back in 2004.
http://www.powerdeveloper.org/forums/viewforum.php?f=23 is the best open place to ask your questions about that, as the other place is, for now, closed access.
Why do you think it's only relevant for SPU? Apparently it's a very old trick to make vector code far better.
It's not only relevant for SPUs, but it's more important on SPUs than it is on other processors. Check the SPU documentation and how they access their local memory.
The likely explanation is it's actually not generating vectorized code for the PPU.
Maybe I'm not using the right compiler switches to get the vector extensions to kick in. -maltivec and -mabi=altivec don't seem to do anything.
Edit: The asm seems to confirm this.
Hmm, what version of GCC are you using? I seem to remember that some versions don't do too well (if anything) with auto-vectorising for PPC/AltiVec. I seem to remember inlining works, perhaps?
I can't see a relevant thread over there, so perhaps ask powerdev yourself, to clarify it here.
It's not only relevant for SPUs, but it's more important on SPUs than it is on other processors. Check the SPU documentation and how they access their local memory.
Yes, I was wondering about that.
The main goal should be to get the most useful basic facts out there, so it's easier to find the best overall compile options for a fair general PPC/AltiVec/MMX test, given the PS3 has a clock speed on par with most x86 chips today. The general code example posted by 'Hobold' (in fact it was him, not DHolm) seems to explain a lot.
http://www.powerdeveloper.org/forums/viewtopic.php?t=58&highlight=__attribute__#121
"/* All AltiVec loads and stores silently round down their addresses to a
multiple of 16. Handling of arbitrary string buffers must take that into
account.
general case: partial vector at start and end of string
|.....--+-------+-------+-----..|
so we must handle a partial vector first, then a number of whole vectors,
and another partial vector at the end
special case: first and last char of string within same vector
|.----..|
partial vectors are handled by loading the original value, merging in the
modified bytes with vec_sel, and storing back the modified vector
*/
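The layout in that comment can be sketched in plain C as a split of a byte range into a partial head, a run of whole 16-byte vectors, and a partial tail. This is only the address arithmetic, with no AltiVec loads or vec_sel merging, and the names are my own rather than Hobold's.

```c
#include <stddef.h>
#include <stdint.h>

/* How a byte range [addr, addr + len) splits around 16-byte boundaries. */
struct vec_split {
    size_t head;   /* bytes before the first 16-byte boundary */
    size_t body;   /* bytes covered by whole 16-byte vectors */
    size_t tail;   /* bytes after the last whole vector */
};

static struct vec_split split16(uintptr_t addr, size_t len)
{
    struct vec_split s;
    size_t to_boundary = (size_t)((16 - (addr & 15)) & 15);
    s.head = to_boundary < len ? to_boundary : len;
    s.body = (len - s.head) & ~(size_t)15;
    s.tail = len - s.head - s.body;
    return s;
}
```

A 40-byte string starting 3 bytes into a vector comes out as 13 + 16 + 11, which is the "general case" picture. A 4-byte string starting 2 bytes into a vector comes out as head 4 with nothing else, which is the "special case" where the first and last characters share one vector.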
"
macabre
15-Dec-2006, 19:16
IBM has just released a new version of the Cell SDK, in case someone is interested:
http://alphaworks.ibm.com/tech/cellsw
YES !!! I'm so glad the ball continues to roll.
Linux kernel upgraded to 2.6.18; performance enhancements added; support added for a combined PPU and SPU Kernel debugger
GNU GCC tools upgraded to Version 4.1.1 and XL C/C++ compiler to Version 0.8.1
SPU debugger improved and support added for a combined Power Processing Unit (PPU) and Synergistic Processing Unit (SPU) debugger
addition of programming model frameworks, including SPU code overlays, an accelerator framework for offloading work to SPUs, and software managed cache
addition of SIMD Math library for PPU and SPU; revamping of LibC library for SPU; addition of MASS and MASS/V libraries for PPU
simulator support for performance modeling of memory subsystem components and interactions
addition of Cell BE-specific, post-link code optimization tool
addition of Eclipse Integrated Development Environment (IDE) support for building, compiling, and debugging Cell BE applications. The IDE uses the underlying SDK tools, including compilers, debugger, and system simulator.
The compiler in the new SDK will auto-vectorise, previous versions of gcc don't do this.
Also, that code has a few issues, some can be (hopefully) fixed by the compiler but others are just made invisible - e.g. there are integer and floating point variables intermixed. This means conversions everywhere which won't exactly help.
I've half done a vectorised version which fixes this and the speedup is quite hefty. I've not debugged it yet so I don't know if this will remain the case.
BTW AltiVec has 32 registers, only XBox360 has more than this.
The compiler in the new SDK will auto-vectorise, previous versions of gcc don't do this.
Also, that code has a few issues, some can be (hopefully) fixed by the compiler but others are just made invisible - e.g. there are integer and floating point variables intermixed. This means conversions everywhere which won't exactly help.
I've half done a vectorised version which fixes this and the speedup is quite hefty. I've not debugged it yet so I don't know if this will remain the case.
BTW AltiVec has 32 registers, only XBox360 has more than this.
Apparently the 360's enhanced AltiVec still has all 32 of the old AltiVec registers mapped to the beginning, so as long as you keep to the standard it should work fine. But then you can't get Linux on it, or at least not easily, can you? So it doesn't matter right now. Can you even run any 3rd party apps on there right now using built-in OS options and no developer licence?
When you say 'that code has a few issues', are you referring to my pointer thread? I must point out (I couldn't edit the original, ohh, I can now, thanks) that it was a typing error, and in fact 'Hobold' was the person who posted that code way back on 'Nov 01, 2004'.
If you are referring to that code, then it might illustrate the fact that, as you begin trying to learn the AltiVec/scalar ways, it's good that people like you, ADEX, etc. can help clear the path and point people in the right direction.
Regarding 'The compiler in the new SDK will auto-vectorise, previous versions of gcc don't do this.':
it might be fun to see a real-life test with results for AVC/H.264/x264 video encoding (you can't use VC-1 right now as there's no open Linux VC-1 encoder) using this pure HD source video: '[HD-DVD Challenge] MPEG2, VC-1 and H264 with real uncompressed source movie'
http://forum.doom9.org/showthread.php?t=114928
Lu_zero, alongside his many other contributions
http://overlays.gentoo.org/dev/lu_zero/wiki/CellImage
http://overlays.gentoo.org/dev/lu_zero
http://planet.gentoo.org/developers/lu_zero
has done many AltiVec optimizations on the PPC FFmpeg codebase, so it can be used for any PPC/AltiVec testing you might want to try. Perhaps with the newest gcc options you might even be able to recompile or modify it to take advantage of one or several of the SPUs.
I'm not sure if x264 has, or can take advantage of, the AltiVec optimisations yet, so it might be interesting to find that out too, as x264 seems to surpass all the other AVC encoders, at least on the x86 side, according to the expert doom9 posters/coders.
This article, 'Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance', might give some interesting options.
http://www-128.ibm.com/developerworks/power/library/pa-celltips1/
Naboomagnoli
16-Dec-2006, 04:03
edit: deleted to keep focus on-topic.
Mostly off-topic I know, but does anyone have the link to the article by a former Microsoft guy (Japanese name, I believe) about Cell + a graphics card displaying thousands of 320 x 240 video streams at once in realtime? I can't remember the number (it was on a 4k projector, IIRC) or the name of the ex MS guy, which is making Googling a fruitless exercise.
It's a Toshiba demo, and it's not a 'graphics card,' but the Super Companion chip I believe that they were demonstrating.
(Thousands?)
Friend, use Google - Toshiba, MPEG-2, Demo - you'll find your answers...
(and yes it's very off-topic!) ;)
By the way Inefficient I love your efforts here - it's awesome to read the progress.
Naboomagnoli
16-Dec-2006, 05:02
edit: found it in another forum I had posted it in:
Here for off-topic goodness (shame on you if you click!) (http://translate.google.com/translate?u=http%3A%2F%2Ffurukawablog.spaces.live.com%2FBlog%2Fcns%21156823E649BD3714%213406.entry&langpair=ja%7Cen&hl=en&ie=UTF-8&oe=UTF-8&prev=%2Flanguage_tools)
I'd also like to echo your sentiments towards Inefficient's work - nice to see someone testing Cell more rigorously than mere graphics comparisons or comparing OS performance to that of more conventional computers.
inefficient
16-Dec-2006, 13:44
Installed the new 2.0 SDK. Compiled the code and found that "putc" no longer seems to work at all. Switched to fputc to write the output bytes: HUGE slowdown for the SPU code! No change for the PPE. So I decided to write the bytes to a buffer, then just output everything at the end with one fwrite. Performance is good on both PPE and SPU after this change.
mandelbrot (PPU) 0.360616
mandelbrot (SPU) 0.153928
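A minimal sketch of the buffering change described above, in portable C. The helper name write_bytes_buffered and the demo_byte generator are made up for illustration and are not part of the benchmark; the point is only that the hot loop touches memory and the C library is called once at the end.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from the benchmark): collect every output byte
 * in memory, then pass the whole buffer to the C library with a single
 * fwrite() instead of one putc()/fputc() call per byte.
 * Returns the number of bytes written, or 0 if allocation fails. */
size_t write_bytes_buffered(FILE *out, size_t count,
                            unsigned char (*next_byte)(size_t index))
{
    unsigned char *buf = malloc(count ? count : 1);
    size_t i, written;

    if (buf == NULL)
        return 0;
    for (i = 0; i < count; ++i)
        buf[i] = next_byte(i);            /* no I/O inside the hot loop */
    written = fwrite(buf, 1, count, out); /* one library call at the end */
    free(buf);
    return written;
}

/* Illustrative byte generator standing in for the real pixel loop. */
unsigned char demo_byte(size_t index)
{
    return (unsigned char)(index & 0xff);
}
```

Something like write_bytes_buffered(f, w * h / 8, next_byte) would then replace the per-byte putc calls; whether this helps as much on other C libraries as it did with the SPU newlib here is not something the sketch can show.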
Next I decided to manually vectorize the PPE code to see if I could speed it up. Never used altivec instructions before but I grabbed a reference here and started by converting the comparison in the innermost loop. http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf
I got a BIG speed up for the PPE! And the output is still accurate. Now performance between PPE and SPE is much closer.
-O5 -maltivec -mabi=altivec -funroll-loops
mandelbrot (PPU) 0.175734
mandelbrot (SPU) 0.154160
Just changing the line below resulted in a 2x speedup.
From:
for (i=0;i<iter && ((*Tr + *Ti <= limit_sqr) || (*(Tr+1) + *(Ti+1) <= limit_sqr) || (*(Tr+2) + *(Ti+2) <= limit_sqr) || (*(Tr+3) + *(Ti+3) <= limit_sqr)); ++i)
To:
for (i=0;i<iter && vec_any_le(vec_add(Trv, Tiv), vlimit); ++i)
#include <altivec.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memcpy */
//typedef float v2sf __attribute__ ((vector_size(16)));
typedef vector float v4sf;
int mandelbrot (int w,int h, FILE *out)
{
int bit_num = 0;
char byte_acc = 0;
int i, iter = 50;
float x, y, limit_sqr = 4.0, recpw=1.0/w;
v4sf Zrv, Ziv, Crv, Civ, Trv, Tiv;
v4sf zero, one, _1p5, two, vlimit;
float *Zr = (float*)&Zrv, *Zi = (float*)&Ziv,
*Cr = (float*)&Crv, *Ci = (float*)&Civ,
*Tr = (float*)&Trv, *Ti = (float*)&Tiv;
#define initv(name, val) *((float*)&name) = (float) val; \
*((float*)&name+1) = (float) val; \
*((float*)&name+2) = (float) val; \
*((float*)&name+3) = (float) val
initv(zero,0.0); initv(one,1.0); initv(_1p5,1.5); initv(two,2.0); initv(vlimit, 4.0);
char *img = malloc((w*h/8) + 13);
int img_pos = 13;
memcpy(img, "P4\n1000 1000\n", 13);
for(y=0;y<h;++y)
{
for(x=0;x<w;x+=4)
{
Zrv = Ziv = Trv = Tiv = zero;
*Cr = x*recpw; *(Cr+1) = (x+1.0)*recpw; *(Cr+2) = (x+2.0)*recpw; *(Cr+3) = (x+3.0)*recpw;
*Ci = y/h; *(Ci+1) = *Ci; *(Ci+2) = *Ci; *(Ci+3) = *Ci;
Crv = two * Crv - _1p5;
Civ = two * Civ - one;
//for (i=0;i<iter && ((*Tr + *Ti <= limit_sqr) || (*(Tr+1) + *(Ti+1) <= limit_sqr) || \
// (*(Tr+2) + *(Ti+2) <= limit_sqr) || (*(Tr+3) + *(Ti+3) <= limit_sqr)); ++i)
for (i=0;i<iter && vec_any_le(vec_add(Trv, Tiv), vlimit); ++i)
{
Ziv = two*Zrv*Ziv + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;
}
byte_acc <<= 4;
if(*Tr + *Ti <= limit_sqr){
byte_acc |= 0x08;
}
if((*(Tr+1) + *(Ti+1) <= limit_sqr)){
byte_acc |= 0x04;
}
if((*(Tr+2) + *(Ti+2) <= limit_sqr)){
byte_acc |= 0x02;
}
if((*(Tr+3) + *(Ti+3) <= limit_sqr)){
byte_acc |= 0x01;
}
bit_num+=4;
if(bit_num == 8)
{
img[img_pos++] = byte_acc;
byte_acc = 0;
bit_num = 0;
}
else if(x == w-4)
{
byte_acc <<= (8-w%8);
img[img_pos++] = byte_acc;
byte_acc = 0;
bit_num = 0;
}
}
}
fwrite(img, 1, img_pos, out);
free(img);
return 0; /* function is declared int, so return a status */
}
int
main() {
FILE *f = fopen("image_ppu.pbm", "wb");
if (f == NULL) return 1; /* could not open the output file */
mandelbrot(1000,1000,f);
fclose(f);
return(0);
}
This Mandelbrot generator (the original one from Debian) is deceptively complex; there are a lot of subtle potential bugs.
e.g. The individual elements in the vector can have different exit conditions from the main loop, so continuing to iterate after one element is complete will mean that element will be wrong. The different elements may then also behave differently depending on the operations below.
The vectorisation is relatively simple, however doing that and branch removal is an exercise in hair pulling! Branches are bad enough in AltiVec but they are _the enemy_ on SPEs so I'm trying to remove all the branches. You usually use vec_sel to replace 4 branches in one go but in this case it requires some extra imagination.
I've looked at other Mandelbrot implementations and none are as complex as this example, so I may end up using a different one.
It's a useful exercise though as you really have to think about it to be accurate.
The individual elements in the vector can have different exit conditions from the main loop, so continuing to iterate after one element is complete will mean that element will be wrong.
The exit condition of Mandelbrot is when the complex square iteration diverges (to infinity). Iterating diverging pixels past the initial exit condition should not give wrong results *if* the floating point logic supports inf. Even if it doesn't and just saturates, the probability of a wrong result is very small.
But I agree with you that from a functional standpoint as well as a performance standpoint, computing a predicate mask for Z>2 would be nicer, then you just loop until your predicate mask is all zero or the maximum number of iterations has been reached.
This should be fairly easy to do with Altivec intrinsics.
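A rough sketch of the predicate-mask idea in plain C, modelling the four vector lanes with small arrays so it runs anywhere. The function name mandel4_mask is made up for illustration; a real AltiVec version would keep the mask in a vector register and use compares and selects instead of the per-lane if.

```c
#define LANES 4

/* Scalar model of a 4-wide vector iteration of z = z*z + c.
 * Bit k of the returned mask is set if lane k never left |z|^2 <= 4
 * within max_iter iterations. A lane's bit is cleared once and never set
 * again, and an escaped lane is no longer updated, so it cannot overflow. */
unsigned mandel4_mask(const float cr[LANES], const float ci[LANES],
                      int max_iter)
{
    float zr[LANES] = {0}, zi[LANES] = {0};
    unsigned active = (1u << LANES) - 1u;   /* all lanes start active */
    int i, k;

    /* Loop until every lane has escaped or the iteration budget is spent. */
    for (i = 0; i < max_iter && active != 0; ++i) {
        for (k = 0; k < LANES; ++k) {
            unsigned bit = 1u << k;
            if (active & bit) {
                float nzr = zr[k] * zr[k] - zi[k] * zi[k] + cr[k];
                float nzi = 2.0f * zr[k] * zi[k] + ci[k];
                zr[k] = nzr;
                zi[k] = nzi;
                if (nzr * nzr + nzi * nzi > 4.0f)
                    active &= ~bit;         /* lane k has escaped */
            }
        }
    }
    return active;
}
```

For c = 0, 2, -1 and 1 on the real axis, lanes 0 and 2 stay bounded and lanes 1 and 3 escape, so the mask comes back as binary 0101.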
Cheers
inefficient
18-Dec-2006, 03:17
This Mandelbrot generator (the original one from Debian) is deceptively complex; there are a lot of subtle potential bugs.
e.g. The individual elements in the vector can have different exit conditions from the main loop, so continuing to iterate after one element is complete will mean that element will be wrong. The different elements may then also behave differently depending on the operations below.
The vectorisation is relatively simple, however doing that and branch removal is an exercise in hair pulling! Branches are bad enough in AltiVec but they are _the enemy_ on SPEs so I'm trying to remove all the branches. You usually use vec_sel to replace 4 branches in one go but in this case it requires some extra imagination.
I've looked at other Mandelbrot implementations and none are as complex as this example, so I may end up using a different one.
It's a useful exercise though as you really have to think about it to be accurate.
I'm not a mathematician, and someone might have to correct me on this. But my current understanding of the Mandelbrot function is that you can't iterate "too many times".
A position is either in the Mandelbrot set or out of it. You can iterate 5000000 times or 50 times. The end result is that the position either heads to infinity or stays bounded (here, the squared magnitude stays at or below the limit of 4).
Of course for computing purposes, you want to get an early exit for pixels (positions) you know are just going to head to infinity. You don't need to iterate an infinite number of times to know that.
On a color Mandelbrot they just use the speed value (how quickly a position is heading to infinity) to index a color value. In that case, you want some accuracy about whether it took 5 iterations or 50 iterations to go out of bounds of the set. But since this test is monochrome, and each pixel is only either black or white, it's not necessary to know this.
So as I understand it, you can "keep iterating after one element is complete" and still be correct.
Actually the code I posted will only exit early if all 4 pixels in the vector are > 4. There are going to be some worst-case scenarios for this method. But on average it's not bad because it takes advantage of the fact that adjacent pixels are usually going to exit at roughly the same time.
Also I found this document. SPU_language_extensions_2.1.pdf (http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E/$file/SPU_language_extensions_2.1.pdf) The extensions seem very similar to the AltiVec ones but not identical.
At least for vector instructions, it would be nice if there was a common language for both the PPE and SPU so you could compile/run identical C code on either and not have to do your manual optimizations for just one or the other.
Fafalada
18-Dec-2006, 03:39
Why do you think it's only relevant for SPU? Apparently it's a very old trick to make vectorised code far better.
Because I wasn't talking about vectorized code - scalars can run dramatically faster on SPU if you align them, this is what IBM coined the fancy "preferred slot paradigm" for.
My suggestion wasn't related to the search for peak performance on this benchmark - just as a comparison of scalar performances of two cores (relative to original unmodified run).
It would be nice if there was a common language extension for both the PPE and SPU so you could compile/run identical C code on either and not have to do your manual optimizations for just one or the other.
That'd defeat the purpose of intrinsics though.
A language extension like you propose would make sense as something native to the compiler - it's pretty "fascinating" how after a decade of SIMD in consumer devices (and much longer in other areas) we are still stuck writing decorated asm opcodes to use the extra math power instead of seeing language support for SIMD types.
Or in the few cases where efforts are being made to do the latter, they tend to be groundbreakingly awful (in terms of generating stable code - referring to C++/C compilers here).
inefficient
18-Dec-2006, 04:16
That'd defeat the purpose of intrinsics though.
A language extension like you propose would make sense as something native to the compiler - it's pretty "fascinating" how after a decade of SIMD in consumer devices (and much longer in other areas) we are still stuck writing decorated asm opcodes to use the extra math power instead of seeing language support for SIMD types.
Or in the few cases where efforts are being made to do the latter, they tend to be groundbreakingly awful (in terms of generating stable code - referring to C++/C compilers here).
Actually, reading the reference document a bit more, I found there is a vmx2spu.h and a spu2vmx.h that cover all the intrinsics that have a 1 to 1 mapping. So this is really all I was looking for. And this is a good thing.
...this Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance might give some interesting options.
http://www-128.ibm.com/developerworks/power/library/pa-celltips1/
Another document detailing Cell's programming gotchas. This is one of my favorites:
http://www.cs.utk.edu/~dongarra/cell2006/cell-slides/18-Michael-Perrone.pdf
This one is interesting too (One technique for optimizing branchy, unpredictable code on Cell):
http://www.cs.utk.edu/~dongarra/cell2006/cell-slides/12-Virat-Agarwal.pdf
More (benchmarks, utilities, applications) papers @ http://www.cs.utk.edu/~dongarra/cell2006/.
EDIT: To inefficient, thanks for all the benchmark effort in this thread :) and PMs to keep me on the right path (I was going to recompile the 2.6.18 kernel but apparently I can just use the provided binaries).
I tried writing my own Mandelbrot generator using the algorithm from Wikipedia. It's pretty much the same as the other one, but instead of messing around with bits this just saves the number of iterations as a byte value. This is then used as a pixel value.
The bit manipulation just complicated vectorisation as it added branches. Using bytes was a lot easier as you can just use a few selects.
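To illustrate what "a few selects" can look like, here is a scalar sketch in portable C. The names sel_u32 and mandel_count are made up for illustration, and this is not the code from MandelbrotTest.c; sel_u32 mirrors the bitwise behaviour of AltiVec's vec_sel on a single 32-bit value, and the iteration count is updated through it rather than through an escape branch.

```c
#include <stdint.h>

/* Bitwise select in the style of vec_sel: take bits from b where mask is 1
 * and from a where mask is 0. */
static uint32_t sel_u32(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Number of iterations (clamped to 255) for which z = z*z + c stayed
 * within |z|^2 <= 4. 'inside' is all-ones while the point is still bounded
 * and becomes a sticky zero once it escapes. */
uint8_t mandel_count(float cr, float ci, int max_iter)
{
    float zr = 0.0f, zi = 0.0f;
    uint32_t count = 0;
    uint32_t inside = 0xFFFFFFFFu;
    int i;

    for (i = 0; i < max_iter; ++i) {
        float nzr = zr * zr - zi * zi + cr;
        float nzi = 2.0f * zr * zi + ci;

        /* 0 - 1 wraps to all-ones, 0 - 0 stays zero */
        inside &= 0u - (uint32_t)(nzr * nzr + nzi * nzi <= 4.0f);
        count = sel_u32(count, count + 1u, inside);

        /* Freeze z after escape; a vector version would use a float
         * select here instead of the ternary. */
        zr = inside ? nzr : zr;
        zi = inside ? nzi : zi;
    }
    return (uint8_t)(count > 255u ? 255u : count);
}
```

With this definition c = 0 gives the full max_iter count, while c = 1 stays bounded for two steps (z = 1, then z = 2) before escaping.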
Performance so far is pretty good but it is not directly comparable as the output is different :
gcc -Wall -fast -mcpu=7450 -mtune=G4 -faltivec -mno-multiple -Os -mabi=altivec -funroll-loops mandelbrot2.c
G4 1.33GHz N=1000 maxiterations = 64
Scalar version: 0.68829417 Seconds
Vector1 version: 0.14070702 Seconds
The vectorised version is about 5.5 times faster (measured with a bigger N value).
This should run 2-3 times faster on the PPE / SPE going by clock alone.
I'll unroll next, after I've got it all going I'll post the code somewhere.
Actually, reading the reference document a bit more, I found there is a vmx2spu.h and a spu2vmx.h that cover all the intrinsics that have a 1 to 1 mapping. So this is really all I was looking for. And this is a good thing.
Yes, trevor_smigiel (sony dot com) just put up another big patch on gcc-patches (patch 4) with reference to the SPU; it mentions that vmx2spu.h is one of several new files, and the patch content seems interesting.
They sound like a simple wrapper; perhaps that idea might be expanded for the general compile option?
http://gcc.gnu.org/ml/gcc-patches/2006-11/msg01221.html
and read the thread.
Given that, as already stated, PPC Linux doesn't currently use AltiVec, did anyone try taking any generic code and using libfreevec (http://freevec.org/), the AltiVec-vectorised GLIBC replacement, to compile apps and compare?
I'd love to know if it currently makes any difference, as a general rule, to the speed of generic PPC Linux code/apps.
The compiler in the new SDK will auto-vectorise, previous versions of gcc don't do this.
Also, that code has a few issues, some can be (hopefully) fixed by the compiler but others are just made invisible - e.g. there are integer and floating point variables intermixed. This means conversions everywhere which won't exactly help.
I've half done a vectorised version which fixes this and the speedup is quite hefty. I've not debugged it yet so I don't know if this will remain the case.
BTW AltiVec has 32 registers, only XBox360 has more than this.
ADEX do you know if the code generation has improved at all?, see Krashan's (aka Grzegorz Kraszewski) post and paper.
http://www.powerdeveloper.org/forums/viewtopic.php?t=892#6071
"Krashan said: We should separate two things here, the first one is what GCC 4 does with hand-tuned code (an this is the subject of the paper below), the second one (beyond my scope for now, as autovectorisation is still poor compared to a code written specifically for SIMD operators) is autovectorisation of scalar code.
Anyway here it is: GCC 2 versus GCC 4 compiling AltiVec code, a detailed discussion of a case. Note however this is not a scientific paper in the strict sense - it hasn't undergone peer-review process, it will also not be published neither on any conference, nor journal."
http://teleinfo.pb.edu.pl/~krashan/altivec/gccbenchmark/GCC2_vs_GCC4_and_AltiVec.pdf
"An official compiler for MorphOS operating system is still GCC 2.95.3. It is considered outdated by
many people, and lack of newer GCC 3 or GCC 4 compilers is a reason for complaints.
As some unofficial ports of GCC 3 and 4 appeared, there is an opportunity to test them and compare generated code. My main point of interest is AltiVec, so I've grabbed a port of GCC 4.0.3 done by Marcin "Morgoth" Kurek, and have given it a try with a Reggae class, fir.filter namely. For those of you not familiar with digital signal processing, FIR filtering is nothing more than doing a lot of MAC (multiply and accumulate) operations in a loop, so AltiVec is just what is needed to do it really fast. I've published a theory behind SIMD-optimized FIRs in [1] and [2]. I've just compiled the class with GCC 4, and ran some tests.
You may imagine how much I've been surprised when it turned out that GCC 4.0.3 generated code is 5 to 15% slower compared to GCC 2.95.3. I've extracted the important code from the class and written a testcase – still the same result. What is going on? The full source code of my benchmark is available in [3], the important part of the source is repeated here. I've compiled it as follows:"
his other official papers as mentioned here
http://www.powerdeveloper.org/forums/viewtopic.php?t=835
are also an interesting and practical Altivec read.
"References to the source code and benchmarks are placed inside papers. For long filters (1000 taps or more) I've achieved 2.5 Gtaps per second for floating point filter and 6.5 Gtaps per second for integer filter. All this on 1.0 GHz 7447 unit and Pegasos II. Presented ideas are implemented in a class of Reggae multimedia framework for MorphOS."
It's good enough to pique Markos's (freevec) interest anyway :-D .
Markos said:
"Krashan,
These are excellent papers and the results are astonishing!!! I am very interested to study the source code in detail, as soon as I'm "free" again (a few more days, just a few :-D) from my military service!
What I find very intriguing is the possibility that these algorithms may be applied to image processing applications as well, like GIMP, in particular the digital convolution filters, which I believe work in the same or similar way. Esp, since your code applies to "real" data -ie big data and not just "Proof of Concept" lab tests.
Keep them coming!
Konstantinos"
perhaps that referenced source code/benchmark might prove useful?.
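For anyone unfamiliar with the FIR filtering mentioned in Krashan's paper above, the "lots of MAC operations in a loop" can be sketched in plain scalar C as below. This is only a textbook direct-form filter with a made-up function name (fir_filter), not Krashan's AltiVec code; samples before the start of the input are treated as zero.

```c
#include <stddef.h>

/* Direct-form FIR: y[i] = sum over k of h[k] * x[i - k],
 * with x[] taken as zero before index 0. */
void fir_filter(const float *x, size_t n,
                const float *h, size_t taps, float *y)
{
    size_t i, k;

    for (i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (k = 0; k < taps && k <= i; ++k)
            acc += h[k] * x[i - k];   /* one multiply-accumulate per tap */
        y[i] = acc;
    }
}
```

The inner loop is the part a SIMD unit speeds up, since several taps (or several output samples) can be multiplied and accumulated per instruction.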
Fafalada
20-Dec-2006, 03:09
ADEX do you know if the code generation has improved at all?, see Krashan's (aka Grzegorz Kraszewski) post and paper.
Having just read the paper, I find it a bit funny that the author's biggest complaint is GCC4.x auto-reordering of intrinsic code, which can actually be turned off if you dislike its effects.
Anyway - having used it for (too) many years, I can safely say the reason GCC 2.95.x sucks is not in how it treats hand-scheduled intrinsics code (in fact, that's just about the only thing it does well).
I can't speak for scientific apps, but the bulk of code we write in games will not be hand tuned like this, and it will still take (very)significant amounts of processing time, and that's where smarter C++ compiler really comes into play.
...
Also I found this document. SPU_language_extensions_2.1.pdf (http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E/$file/SPU_language_extensions_2.1.pdf) The extensions seem very similar to the AltiVec ones but not identical.
At least for vector instructions, it would be nice if there was a common language for both the PPE and SPU so you could compile/run identical C code on either and not have to do your manual optimizations for just one or the other.
JFR
SPU language extensions: you can't link directly to the PDF, you need to use this link to reference it.
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E
Thanks for keeping this interesting thread going guys, I'm sure new readers/coders are learning lots and looking forward to some fun/informative coding/benchmark samples.
and that's where smarter C++ compiler really comes into play
I don't know why I didn't realise it until you said it, but does that 'smarter' include the IBM XL C/C++ Alpha Edition compilers?
It seems that most people assume Apple when you ask about PPC machines (understandable) and don't yet realise that Cell is part of that CPU family (there's also the Genesi PPC-based kit). The question is: is XL a specially coded front/back end for gcc that removes many or all of the general problems with autovectorising etc. already mentioned, so you don't need to be a PhD-level gcc core coder to know how to get around all the current PPC gcc problems and gotchas?
Rather like the AltiVec back end Apple wrote to remove general gcc shortcomings (it would have been better if they had fixed the core PPC gcc for everyone...).
For instance, from http://alphaworks.ibm.com/tech/cellcompiler:
"The compilers include five optimization levels, which allow moderate to significant performance improvements with relatively little development effort:
-O0: some basic and minimum optimization
-O2: strong, low-level optimization that benefits most programs
-O3: intense, low-level optimization analysis with basic loop optimization
-O4: all of -O3, detailed loop analysis, and good whole-program analysis at link time
-O5: all of -O4 and detailed whole-program analysis at link time. "
Standard gcc only goes up to -O3, and that was still broken last I heard; can anyone clarify please?
Is XL good for general PPC (G3/G4/G5) CPUs, and if it is better, in what way?
XL is the fastest compiler by far for general PowerPC chips. The last time I've seen it compared with GCC on Cell was over a year ago, but it was significantly more advanced.
There are some optimizations in the XL family that do not exist in GCC that are platform independent, and there was an aggressive push for more backend optimizations for Cell alone. There are a LOT of special cases designed specifically for SPE and PPE workload in the compilers.
-O4 and -O5 invoke another optimizing engine that does IPA (interprocedural analysis). If you use -qpdf on top of this, you're maxing out the optimization capabilities of the compiler.
Titanio
20-Dec-2006, 14:18
A slight aside, but XL isn't free to use, is it?
OK, got my version done.
Results (including file O/P)
G4 @ 1.33GHz:
gcc -Wall -faltivec -mno-multiple -mabi=altivec MandelbrotTest.c
Width = 10000 Height = 10000
----------------------------------------------------------------------
Scalar version: 180.22247601 Seconds
----------------------------------------------------------------------
Vector1 version: 91.48115182 Seconds
----------------------------------------------------------------------
Vector2 version: 57.10034990 Seconds
----------------------------------------------------------------------
Vector4 version: 47.97465992 Seconds
----------------------------------------------------------------------
gcc -Wall -fast -mcpu=7450 -mtune=G4 -faltivec -mno-multiple -mabi=altivec -funroll-loops MandelbrotTest.c
Width = 10000 Height = 10000
----------------------------------------------------------------------
Scalar version: 59.01638985 Seconds
----------------------------------------------------------------------
Vector1 version: 19.64178491 Seconds
----------------------------------------------------------------------
Vector2 version: 11.75537205 Seconds
----------------------------------------------------------------------
Vector4 version: 10.65353203 Seconds
----------------------------------------------------------------------
The compiler clearly has a very big effect especially on the vectorised versions.
Vectorising the loop had a big effect, unrolling the loop almost doubled the speed but a second unroll didn't make much difference at all after the compiler hit it.
The code is here:
http://www.blachford.info/computer/The_Source/MandelbrotTest.c
http://www.blachford.info/computer/The_Source/MandelbrotTest.h
It generates a raw 8bit image of a mandelbrot set. It won't compile on a PC but the scalar version should be OK if you remove the vectorised versions.
This should work on the PPE without any problems, the vector stuff is AltiVec though so I don't know what a SPU compiler will make of it.
I'd like to know how well the unrolling works on the SPE, it may work there a lot better.
I'll see if I can get it to work on the Cell-Sim next...
Entropy
21-Dec-2006, 15:16
A slight aside, but XL isn't free to use, is it?
No.
CellDude
22-Dec-2006, 00:24
There is a high likelihood that any conventional benchmark simply run on Linux on the PS3 does not take proper advantage of the multiple hardware threads. As a result, such benchmarks will likely run in one thread at one half the possible dispatch rate, in addition to the many compiler optimizations concerns being mentioned here.
inefficient
22-Dec-2006, 02:35
There is a high likelihood that any conventional benchmark simply run on Linux on the PS3 does not take proper advantage of the multiple hardware threads. As a result, such benchmarks will likely run in one thread at one half the possible dispatch rate, in addition to the many compiler optimizations concerns being mentioned here.
I feel one of the "proper advantages of the multiple hardware threads" has to be not having to design around multiple hardware threads. So that when you have an already multi-threaded application and run it you can say "oh great I got this 5-10% improvement for free".
Time spent actively optimizing for SMT is probably better spent optimizing elsewhere, because you can get much bigger gains there. Realistically you're only going to see small improvements with SMT except in pathological cases.
I see it more as a convenience feature than an actual key to unlocking more performance.
And remember, when you are running these benchmarks there is a whole OS running in the background. So SMT is already being taken advantage of in one form right there.
The code is here:
http://www.blachford.info/computer/T...ndelbrotTest.c
http://www.blachford.info/computer/T...ndelbrotTest.h
I'll try to run that later today on the Cell for you.
Barbarian
22-Dec-2006, 04:40
There is a high likelihood that any conventional benchmark simply run on Linux on the PS3 does not take proper advantage of the multiple hardware threads. As a result, such benchmarks will likely run in one thread at one half the possible dispatch rate, in addition to the many compiler optimizations concerns being mentioned here.
As long as there is an idle thread that implements a special NOP instruction, the primary thread should run at almost 100% dispatch rate. Actually Intel's HT processors implement something very similar, for details look up _mm_pause.
inefficient
22-Dec-2006, 16:08
I had to modify many lines just to get it to compile with ppu-gcc without errors. You can diff against this.
MandelbrotTest.c: http://pastebin.4programmers.net/1296
When I ran it, it was EXTREMELY slow. It took 381 secs for the scalar one to run with N=10000. This is because putc with that many output bytes is just a disaster with this compiler. I commented out the lines with putc and fwrite (your last 2 tests were only doing IO every 4 lines, but I removed it everywhere just to be fair to the first 2 tests).
Results on PS3:
ppu-gcc -O5 -maltivec -mno-multiple -mabi=altivec -funroll-loops MandelbrotTest.c -o ppu_man
Width = 10000 Height = 10000
----------------------------------------------------------------------
Scalar version: 87.60755396 Seconds
----------------------------------------------------------------------
Vector1 version: 21.07550097 Seconds
----------------------------------------------------------------------
Vector2 version: 17.48154807 Seconds
----------------------------------------------------------------------
Vector4 version: 12.16985607 Seconds
----------------------------------------------------------------------
I had to modify many lines just to get it to compile with ppu-gcc without errors. You can diff against this.
MandelbrotTest.c: http://pastebin.4programmers.net/1296
Thanks.
When I ran it, it was EXTREMELY slow. It took 381 secs for the scalar one to run with N=10000. This is because putc with that many output bytes is just a disaster with this compiler. I commented out the lines with putc and fwrite (your last 2 tests were only doing IO every 4 lines, but I removed it everywhere just to be fair to the first 2 tests).
I wonder does disc access go via the hypervisor, in that case it would hurt...
That said my tests were on a different OS (OS X 10.4.8).
I'll change it so they all write 4 lines in the next version, it'll also be in colour.
The results are a bit disappointing, it's either stalling somewhere or the compiler is not that great. I suspect some combination of both.
It might be worth messing around with different optimisation levels (even -Os); on my machine the -fast, -mcpu and -mtune options seem to have a much bigger effect.
Perhaps you can use one of the PPC Linux LiveCDs to do a quick boot of your Mac, compile and run your binary there, and see if/how you might optimise it for your 1.3(?) G5 under that, but I seem to remember you need to treat the G5 AltiVec slightly differently from the G4 AltiVec to get the best out of it!
It's been a while since I looked for a PPC LiveCD and I don't know what kernel they might have as standard, but Inefficient's using *.16 for his PS3, isn't he?
I believe he recently upgraded PS3 kernel to 2.6.18.
good morning people, Merry Christmas...
the Gentoo G5/PS3 liveCD beta has now been added
http://www.powerdeveloper.org/forums/viewtopic.php?p=6474
"Lu-zero said:
We need some testers please follow the ranger's blog post for more instructions
http://planet.gentoo.org/developers/ranger/2006/12/24/gentoo_ppc64_ps3_livecd_is_in_beta
More information (and goods) will follow.
lu"
"Brent BAUDE said:
Please feel free to download and test the LiveCD; however, do not, under any circumstances, report bugs via Gentoo's official bugzilla mechanism. I'd prefer join us on #gentoo-ppc64 to report feedback."
http://www.gentoo.org/proj/en/base/ppc64/ps3/beta.jpg
http://www.gentoo.org/main/en/mirrors.xml
This should make life a little easier now 8)
Sorry, never answered this:
ADEX do you know if the code generation has improved at all?, see Krashan's (aka Grzegorz Kraszewski) post and paper.
http://www.powerdeveloper.org/forums/viewtopic.php?t=892#6071
gcc used to be completely brain dead on PPC, Apple did a lot of work on it on their own branch and their enhancements have been working their way into the mainline gcc.
Autovectorisation is pretty new though so don't expect too much of it.
Also there could be specific reasons the example you mention didn't do well.
perhaps that referenced source code/benchmark might prove useful?.
Possibly, but it's probably been done on Cell already a zillion times. In fact there are some in the Cell SDK.
BTW I know many of the MorphOS devs, I used to work for the same company.
I tried out gentoo on my PS3. I like GNOME better than Enlightenment. It reminds me of Mac OS 7 and then some.
At first, YDL prevented it from loading (Due to existing parameters in /etc/kboot.conf). The workarounds are here: http://planet.gentoo.org/developers/ranger/2006/12/24/gentoo_ppc64_ps3_livecd_is_in_beta#c18583
Crazyace
30-Dec-2006, 14:50
Played around with the mandelbrot code yesterday.. Still using double precision on ppu
timed with time ./a.out 10000 >a
Initial C code ( 64 iterations )
On 3.6GHz P4 ( gcc -O3 ) : 32.52s
On dual 2GHz G5 ( gcc -O3 ) : 30.91s
On PS3 linux ( gcc -O3 ) : 63.9s
This was pretty slow, mainly due to the long in-order floating point pipeline, so I rewrote the assembly slightly..
Assembly version (64 iterations)
On dual 2GHz G5 : 19.14s
On PS3 linux : 40s
There were still loads of stall cycles so I unrolled the loop once :)
Unrolled assembly version (64 iterations)
PS3 linux : 23.65s
Still quite a bit of room for optimisation left - maybe after dinner :)
Compile with gcc -O3 -mregnames on PS3 Linux:
#include<stdio.h>
#include<stdlib.h>
int main (int argc, char **argv)
{
int w, h;
int i,c,c1;
double x, y;
double Zr, Zi, Cr,Cr1, Ci, Tr, Ti;
double rw,rh;
w = h = atoi(argv[1]);
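void *out = malloc( w*h );
// Each 16-bit store holds two packed 8-bit counts in the unrolled asm version (one byte per pixel).
// Caution: the single-pixel paths also store 16 bits per pixel and would overrun this w*h byte buffer.
unsigned short *ptr = out;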
rw = 2.0/w;
rh = 2.0/h;
printf("P5\n%d %d 255\n",w,h);
#if 1 // Asm versions
#if 1 // Asm version unrolled twice
for(y=0;y<h;++y)
{
Ci = (y*rw - 1.0);
for(x=0;x<w;x+=2)
{
Cr = (x*rw - 1.5);
Cr1 = ((x+1)*rw - 1.5);
__asm__ volatile(
// Core calculation..
"fmr 1,%1 \n" // Zr = Tr
"fmr 2,%2 \n" // Zi = Ti
"crclr 4*cr1+lt \n" // Clear initial test
"fmr 11,%5 \n" // Zr1 = Tr1
"fmr 12,%2 \n" // Zi1 = Ti
"crclr 4*cr2+lt \n" // Clear initial test
"crclr 4*cr3+lt \n" // Clear initial test
"crclr 4*cr4+lt \n" // Clear initial test
"addi %0,0,64 \n"
"mtctr %0 \n"
"addi %0,0,0 \n"
".p2align 4,,15 \n"
"0: \n"
"fnmsub 5,1,1,%4 \n" // x = 4-Zr*Zr
"fmul 6,2,2 \n" // y = Zi*Zi
"fmadd 3,1,1,%1 \n" // r = Zr*Zr+Tr
"fmul 4,1,2 \n" // i = Zr*Zi
"fnmsub 15,11,11,%4 \n" // x1 = 4-Zr1*Zr1
"fmul 16,12,12 \n" // y1 = Zi1*Zi1
"fmadd 13,11,11,%5 \n" // r1 = Zr1*Zr1+Tr
"fmul 14,11,12 \n" // i1 = Zr1*Zi1
"crand cr0*4+lt,cr1*4+lt,cr2*4+lt \n" // Early exit only when both finished
"cror cr3*4+lt,cr3*4+lt,cr1*4+lt \n" // Make sticky
"cror cr4*4+lt,cr4*4+lt,cr2*4+lt \n" // Make sticky
"blt- cr0,3f \n" // Branch on count and compare result false..
"blt cr3,1f \n" // Only increment if valid
"addi %0,%0,0x100 \n"
"1: \n"
"blt cr4,2f \n" // Only increment if valid
"addi %0,%0,0x001 \n"
"2: \n"
// Stall cycles here
"fcmpu cr1,5,6 \n" // Check 4-Zr*Zr >Zi*Zi - or 4>(Zr*Zr+Zi*Zi)
"fnmsub 1,2,2,3 \n" // Zr = r-Zi*Zi
"fmadd 2,4,%3,%2 \n" // Zi = 2*i+Ti
"fcmpu cr2,15,16 \n" // Check 4-Zr1*Zr1 >Zi1*Zi1
"fnmsub 11,12,12,13 \n" // Zr1 = r1-Zi1*Zi1
"fmadd 12,14,%3,%2 \n" // Zi1 = 2*i1+Ti
// Stall cycles
"bdnz+ 0b\n"
"3: \n"
: "=b"(i) : "f"(Cr),"f"(Ci),"f"(2.0),"f"(4.0),"f"(Cr1) :
"f1","f2","f3","f4","f5","f6","cr1","cr3",
"f11","f12","f13","f14","f15","f16","cr2","cr4",
"cr0","ctr" );
*ptr++ = i;
}
}
#else // Single asm version
for(y=0;y<h;++y)
{
Ci = (y*rw - 1.0);
for(x=0;x<w;++x)
{
Cr = (x*rw - 1.5);
__asm__ volatile(
// Core calculation..
"fcmpu cr1,%3,%3 \n" // Clear initial test
"fmr 1,%1 \n" // Zr = Tr
"fmr 2,%2 \n" // Zi = Ti
"addi %0,0,64 \n"
"mtctr %0 \n"
"addi %0,0,0 \n"
".p2align 4,,15 \n"
"0: \n"
"blt cr1,1f \n" // Branch on count and compare result false..
"fnmsub 5,1,1,%4 \n" // t = 4-Zr*Zr
"fmul 6,2,2 \n" // t1 = Zi*Zi
"fmadd 3,1,1,%1 \n" // r = Zr*Zr+Tr
"fmul 4,1,2 \n" // i = Zr*Zi
// Stall cycles here
"fcmpu cr1,5,6 \n" // Check 4-Zr*Zr >Zi*Zi - or 4>(Zr*Zr+Zi*Zi)
"fnmsub 1,2,2,3 \n" // Zr = r-Zi*Zi
"fmadd 2,4,%3,%2 \n" // Zi = 2*i+Ti
"addi %0,%0,1 \n"
// Stall cycles
"bdnz 0b\n"
"1: \n"
: "=b"(i) : "f"(Cr),"f"(Ci),"f"(2.0),"f"(4.0) : "f1","f2","f3","f4","f5","f6","cr1","ctr" );
*ptr++ = i;
}
}
#endif
#else // Reference C version..
for(y=0;y<h;++y)
{
Ci = (y*rw - 1.0);
for(x=0;x<w;++x)
{
Zr = Zi = Tr = Ti = 0.0;
Cr = (x*rw - 1.5);
for (i=0;i<64 && (Tr+Ti <= 4.0);++i)
{
Zi = 2.0*Zr*Zi + Ci;
Zr = Tr - Ti + Cr;
Tr = Zr * Zr;
Ti = Zi * Zi;
}
*ptr++ = i;
}
}
#endif
fwrite( out,w,h,stdout );
}
inefficient
31-Dec-2006, 17:29
Color version. 64 iterations max. 10,000 x 10,000. File IO time is included. 32bit float vectors not doubles.
Output file is 97.6MB bmp (hello little endian) file (256 color indexed, 1 byte per pixel)
ppu-gcc -O5 -maltivec -mabi=altivec -funroll-loops
spu-gcc -funroll-loops -O5
PS3 Linux results
mandelbrot (PPU) 37.467187
mandelbrot (SPU) 17.743308
Only very light optimization with intrinsics so far. PPU and SPU versions are almost identical, but the SPU runs it faster. The previous optimizations didn't help after switching to color, and it's even more branchy.
Obviously the whole output image is too big to post. Here is a little piece for anyone curious.
http://g3torrent.sourceforge.net/bimage_spu.png
Crazyace
01-Jan-2007, 22:48
This runs in around 9 seconds... ( difficult getting timing as something called beagle seems to be running and thrashing the HDD )
gcc -O3 -mregnames -maltivec
./a.out 10000 >a
It vectorises 8 elements per loop iteration.
#include <stdio.h>
#include <stdlib.h>
#include <altivec.h>
int main (int argc, char **argv)
{
int w, h;
int i,c,c1;
double x, y;
double Zr, Zi, Cr, Ci, Tr, Ti;
double rw,rh;
w = h = atoi(argv[1]);
void *out = malloc( w*h );
printf("P5\n%d %d 255\n",w,h);
unsigned int *ptr = out;
rw = 2.0/w;
rh = 1.0/h; // Calculate half vTi instead
vector float vrw = (vector float){rw,rw,rw,rw};
vector float vrh = (vector float){rh,rh,rh,rh};
vector float vTr,vTr1,vTi;
for(y=0;y<h;++y)
{
vTi = (vector float){y,y,y,y};
vTi = vec_madd( vTi,vrh,(const vector float){-0.5f,-0.5f,-0.5f,-0.5f} );
for(x=0;x<w;x+=8)
{
vTr = (vector float){x,x+1,x+2,x+3};
vTr1 = (vector float){x+4,x+5,x+6,x+7};
vTr = vec_madd( vTr,vrw,(const vector float){-1.5f,-1.5f,-1.5f,-1.5f} );
vTr1 = vec_madd( vTr1,vrw,(const vector float){-1.5f,-1.5f,-1.5f,-1.5f} );
asm __volatile__(
"vor 1,%1,%1 \n" // vZr=vTr
"vor 11,%3,%3 \n" // vZr1=vTr1
"vaddfp 2,%2,%2 \n" // vZi=2*(vTi/2)=vTi
"vaddfp 12,%2,%2 \n" // vZi1=2*(vTi/2)=vTi
"crclr 4*cr6+lt \n" // Clear initial test
"vxor 0,0,0 \n" // v0 is 0 constant
"vspltisw 10,4 \n"
"vcfux 10,10,0 \n" // v10 is 4.0f constant
"vxor 9,9,9 \n" // Counters reset to 0
"vspltisw 8,1 \n" // All valid initially
"vspltisw 7,0 \n"
"vspltisw 17,0 \n"
"addi 0,0,64 \n"
"mtctr 0 \n"
".p2align 4,,15 \n"
"0: \n"
"vpkuwum 7,7,17 \n" // Pack 2 results..
"vmaddfp 5,1,1,%1 \n" // vr = vZr*vZr+vTr
"vmaddfp 6,1,2,%2 \n" // vi = vZr*vZi+vTi/2
"vmaddfp 15,11,11,%3 \n" // vr1 = vZr1*vZr1+vTr1
"vmaddfp 16,11,12,%2 \n" // vi1 = vZr1*vZi1+vTi/2
"vnmsubfp 3,1,1,10 \n" // vx = 4-vZr*vZr
"vmaddfp 4,2,2,0 \n" // vy = vZi*vZi+0
"vnmsubfp 13,11,11,10 \n" // vx1 = 4-vZr1*vZr1
"vmaddfp 14,12,12,0 \n" // vy1 = vZi1*vZi1+0
"vcmpgtuh. 17,7,0 \n" // Set early out check..
"vandc 8,8,7 \n" // Zero increments if fail
"vnmsubfp 1,2,2,5 \n" // vZr = vr-vZi*vZi
"vaddfp 2,6,6 \n" // Vzi = vi*2
"vnmsubfp 11,12,12,15 \n" // vZr1 = vr1-vZi1*vZi1
"vaddfp 12,16,16 \n" // Vzi1 = vi1*2
"vadduwm 9,9,8 \n" // Counters
"blt- cr6,1f \n" // Early out..
"vcmpgtfp 7,4,3 \n" // Out of range check
"vcmpgtfp 17,14,13 \n" // Out of range check
"bdnz+ 0b \n"
"1: \n"
//"vpkuwum 9,9,9 \n" // Pack xyzw (32bit) to xyzwxyzw(16bit)
"vpkuhum 9,9,9 \n" // Pack xyzwxyzw(16bit) to xyzwxyzwxyzwxyzw(8bit)
"addi 0,0,4 \n"
"stvewx 9,0,%0 \n" // Store single 32bit word ( containing 4 values ) to memory
"stvewx 9,%0,0 \n" // and 2nd word..
: : "r"(ptr),"v"(vTr),"v"(vTi),"v"(vTr1) : "r0","v0","v1","v2","v3","v4","v5","v6","v7","v8","v9","v10","v11","v12","v13","v14","v15","v16","v17"
);
ptr+=2;
}
}
fwrite( out,w,h,stdout );
}
Rainbow Man
01-Jan-2007, 23:29
I was wondering.. I hear about "brute-forcing" encrypted stuff and it seems they mean testing every key to find the right one, sort of like Matt Broderick did with phone numbers in that old movie "WarGames" (showing my age eh! :cool:)
How would that sort of thing work on the PS3 with the SPU thingys?
Could they brute-force any modern encryption scheme at roughly 6x the speed of a normal CPU, or would the on-chip memory not be enough for something like that, or are the numbers simply too big for the chip to handle?
I read about people using 4000 bit keys and such and I tried to find out simply how big a number that would be; gave up very very quickly.
"Too big to comprehend" was my final conclusion.
Could an army of PS3s maybe be a hacker's best friend..? (hacker with a big pocketbook at $500 a pop even for the cheaper ones) Please help a geezer figure this stuff out eh?
Peace.
Onlooker1
02-Jan-2007, 00:45
If you add 3 bits to a key length, then a brute force search takes 8 times longer. So if key length X was not crackable by a single CPU, then key length X+3 is not crackable by cell, and X+11 is not crackable by a network of 256 PS3s all working on the same problem. Every time the encryption geeks get worried about NSA or whatever they go up in key length by 128, 1024 or more bits. So I think if they were safe before, they are still safe :)
A guy on #gentoo-ppc64 is writing/testing a WEP key cracker and reported great success and VERY fast speeds. I can't remember the number now, and don't have the log saved, but it was fast and not even fully optimised for using the SPEs at the time (last week).
Rainbow Man
02-Jan-2007, 11:03
If you add 3 bits to a key length, then a brute force search takes 8 times longer. So if key length X was not crackable by a single CPU, then key length X+3 is not crackable by cell, and X+11 is not crackable by a network of 256 PS3s all working on the same problem.
That's merely a function of time is it not? Assume we're a really patient bunch when we network our 256 PS3s (though I can't afford one yet meh)
There's nothing technical stopping us is there? The keycrack program can fit in the internal memories of cell right?
Peace.
That's merely a function of time is it not? Assume we're a really patient bunch when we network our 256 PS3s (though I can't afford one yet meh)
There's nothing technical stopping us is there? The keycrack program can fit in the internal memories of cell right?
Peace.
A function of time, meaning possibly years, yes.
Found this Master Thesis while hunting for another:
http://www.cs.umu.se/education/examina/Rapporter/NilsHjelte.pdf
Smoothed Particle Hydrodynamics (SPH) is a method used mainly to simulate complex
materials, such as water. It would be of great benefit for many application areas to be
able to run large SPH systems at interactive speeds.
We have looked at ways to improve the performance of a SPH fluid simulation by
designing algorithms for parallel execution. These have been used to develop an implementation
for the target platform, the Cell Broadband Engine processor. Experimental
results show great potential with linear performance scaling for increasing problem sizes,
and excellent parallel efficiency.
inefficient
18-Jan-2007, 04:13
Found this Master Thesis while hunting for another:
http://www.cs.umu.se/education/examina/Rapporter/NilsHjelte.pdf
That is excellent work for a Master's thesis. I just hope whatever professor put it up on the website actually got his permission to do so! It looks like an accidentally exposed directory containing a bunch of students' submitted theses.
Since it was completed in June '06, I wonder if this guy has a job in the video game industry by now, or at least is getting to do real-world solutions on Cell systems. It would be a shame if he was still in school working on a PhD or whatever and poking around with simulators.
inefficient
20-Jan-2007, 08:12
I got around to playing with the DMA engine to do memory transfers between SPU LS and system memory, and finally I have something working. I also got multiple SPEs running at once and returning their results to system memory via DMA transfers.
N=10008 - This had to be divisible by 6 to keep it simple
No File IO - Just get the final image in main memory and we are done
mandelbrot (1 SPU) 6.878862 secs
mandelbrot (6 SPU) 1.342998 secs
With File IO - Writing almost 100MB to disk is a bit of overhead
mandelbrot (1 SPU) 9.141962 secs
mandelbrot (6 SPU) 2.933930 secs
On the PPE, despite my best efforts, I haven't been able to get it running with the same workload in under 20 seconds.
The workload isn't divided among them in any smart way. The problem is just divided along the y axis into tiles, and each SPE works on one, then returns the result to main memory. The amount of work required is not evenly distributed per tile, and I'm sure the way I am managing DMA transfers could be done a lot better. But even so, the code is about 5x faster on 6 SPEs than on 1 (6.88 s vs 1.34 s without file IO).
Just the code that runs on the SPU is pretty simple:
#include <stdio.h>
#include <stdlib.h>
#include <spu_mfcio.h>
#include <vec_types.h>
#include <spu_intrinsics.h>
#include "simpleDMA.h"
#define TILESIZE (128*1024)
#define DMABLOCKSIZE 16384
#define DMATAG 31
// A fast way to test if all 4 elements in a vector are gt 0
inline uint32_t test_all_gt0 (vector unsigned int pointv)
{
vector unsigned int mask0v = spu_cmpgt(pointv, 0);
vector unsigned int mask1v = spu_gather(mask0v);
return (spu_extract(mask1v, 0) == 15);
}
// DMA data from LS to XDR in 16 KB chunks (DMABLOCKSIZE, the largest single MFC transfer)
uint32_t put_data(void *ls, uint32_t ea, uint32_t size)
{
uint32_t j;
uint32_t tail = size % DMABLOCKSIZE;
for (j=0; j<(uint32_t)(size/DMABLOCKSIZE); j++) {
mfc_put(ls+(j*DMABLOCKSIZE), ea+(j*DMABLOCKSIZE), DMABLOCKSIZE, DMATAG, 0, 0);
}
if (tail) {
mfc_put(ls+(j*DMABLOCKSIZE), ea+(j*DMABLOCKSIZE), tail, DMATAG, 0, 0);
}
mfc_read_tag_status_all();
return size;
}
// Main function
uint32_t mandelbrot (const uint32_t y_start, const uint32_t y_stop, const uint32_t w, const uint32_t h, uint32_t ea)
{
uint32_t i, iter = 64, tile_num = 0;
float x, y, recpw=1.0/w, tmpf;
vector float Zrv, Ziv, Crv, Civ, Trv, Tiv, xplusNv;
vector unsigned int pointv, mask0v, mask1v, iv;
float *Zr = (float*)&Zrv, *Zi = (float*)&Ziv,
*Cr = (float*)&Crv, *Ci = (float*)&Civ,
*Tr = (float*)&Trv, *Ti = (float*)&Tiv;
uint32_t *point = (uint32_t*)&pointv;
const vector float zero = (vector float){0.0,0.0,0.0,0.0};
const vector float one = (vector float){1.0,1.0,1.0,1.0};
const vector float _1p5 = (vector float){1.5,1.5,1.5,1.5};
const vector float two = (vector float){2.0,2.0,2.0,2.0};
const vector float limitv = (vector float){4.0,4.0,4.0,4.0};
const vector float recpwv = (vector float){recpw,recpw,recpw,recpw};
const vector unsigned int zero_i = (vector unsigned int){0,0,0,0};
const vector unsigned int one_i = (vector unsigned int){1,1,1,1};
char img[TILESIZE];
uint32_t img_pos = 0;
uint32_t i_accu = 0;
for(y=y_start;y<y_stop;++y)
{
tmpf = 2.0 * (y/h) - 1.0;
Civ = spu_splats(tmpf);
for(x=0;x<w;x+=4)
{
pointv = zero_i;
xplusNv = (vector float){x+1.0,x+2.0,x+3.0,x+4.0};
Zrv = Ziv = Trv = Tiv = zero;
Crv = two * (recpwv * xplusNv) - _1p5;
iv = zero_i;
// Inner most loop tuned for SIMD throughput
for (i=0;i<iter;++i)
{
iv = spu_add(iv, one_i);
mask0v = spu_cmpgt(Trv + Tiv, limitv);
mask1v = spu_cmpgt(pointv, 0);
pointv = spu_sel(pointv, iv, spu_andc(mask0v, mask1v));
if (__builtin_expect(test_all_gt0(pointv), 0)) {
break;
}
Ziv = two*Zrv*Ziv + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;
}
i_accu += i;
img[img_pos] = (char)*(point);
img[img_pos+1] = (char)*(point+1);
img[img_pos+2] = (char)*(point+2);
img[img_pos+3] = (char)*(point+3);
img_pos += 4;
// Once tile buffer is full, write the tile to XDR memory
if (img_pos == TILESIZE) {
put_data(img, ea+(tile_num * TILESIZE), TILESIZE);
img_pos = 0;
tile_num++;
}
}
}
// Write the final tile to system memory
put_data(img, ea+(tile_num * TILESIZE), img_pos);
return i_accu;
}
int main(uint32_t spuid) {
uint32_t *mbox_data;
// The control block is a 32byte struct that contains all the program arguments
// including the effective address in main memory to put the results in
control_block cb __attribute__ ((aligned (128)));
// I use the "mail box" functionality to read the address of the
// "control block" from main memory. Then do a DMA get to pull that into the LS
mbox_data = (uint32_t*)spu_read_in_mbox();
mfc_get(&cb, mbox_data, sizeof(control_block), DMATAG, 0, 0);
// Wait for the DMA get transaction to complete...
// Actually I don't fully understand DMA tag groups. Using 31 seems to work fine.
mfc_write_tag_mask(1<<DMATAG);
mfc_read_tag_status_all();
// Now that the cb struct is in LS, the main program can be executed using its arguments
mandelbrot(cb.y_start, cb.y_stop, cb.w, cb.h, (uint32_t)cb.addr.p);
return(0);
}
Rainbow Man
20-Jan-2007, 09:57
The work load isn't divided among them in any smart way. The problem is just divided along the y axis into tiles.
Okay you're probably smarter than me and maybe you've tried this already, but what if you divide the screen into lots of smaller tiles and maybe even distribute them out randomly..
Then the varying workload of each tile would average out better than if you simply divide up into 6 tiles and send each to an SPU.
Just a thought. :cool:
Anyway. I'm fascinated and impressed by your efforts. Keep up the good work! :cool:
Peace.
Shifty Geezer
20-Jan-2007, 10:10
Okay you're probably smarter than me and maybe you've tried this already but if you divide the screen into lots of smaller tiles and maybe even distribute them out randomly..
You could do them in order, which would be simpler (you wouldn't need to keep a record of what's been done before) and wouldn't slow down the process any more. Smaller tiles would help with keeping the SPEs populated - you won't have one SPE finishing its easy sixth of the Mandelbrot way ahead of the others and sitting there idle the rest of the time.
Onlooker1
20-Jan-2007, 14:14
mandelbrot (6 SPU) 1.342998 secs
Since this is a 100MB frame, you could probably write a really nice Mandelbrot explorer, one that does 60fps. Calc a 1MB frame (your current view) and then, while the user is deciding whether to scroll, zoom in or zoom out, calc the 8 surrounding tiles and the fractionally zoomed-in and zoomed-out views, then wait for the DualShock to decide where it wants to go. The SPEs would all get super busy when the viewer decided to zoom in at high speed, and you could then claim absolutely that no other consumer PC or console out there manages 1/5th of the speed exhibited.
Anyway, that's what I'd do if I'd got to where you are now, had your knowledge, and had more free time :)
// Inner most loop tuned for SIMD throughput
for (i=0;i<iter;++i)
{
iv = spu_add(iv, one_i);
mask0v = spu_cmpgt(Trv + Tiv, limitv);
mask1v = spu_cmpgt(pointv, 0);
pointv = spu_sel(pointv, iv, spu_andc(mask0v, mask1v));
if (__builtin_expect(test_all_gt0(pointv), 0)) {
break;
}
Ziv = two*Zrv*Ziv + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;
}
Does that even produce correct results? It looks like all samples in a SIMD block (4x1) will always be equal. You always increment all four elements of iv together, so how could they be different in the end?
What you could do there is to copy the one_i mask, and, in the copy, eliminate (with bitwise AND) the sample positions that are "done". Thus you still have ones on the sample positions that you still must count up, and zero on the finished positions. Adding zero to the finished results is safe.
The next significant optimization is unrolling the loop. With the masked vector increments, there is no longer a problem with running "slightly too many" iterations. The check for the end condition is relatively expensive, so just do it once every eight or 16 iterations and your throughput should rise.
I've written something like that a few years ago for SSE. Looking forward to playing with those SPEs as soon as Amazon delivers my PS3.
edit: Link to a forum post with the SSE code (http://www.forum-3dcenter.org/vbulletin/showthread.php?p=993203#post993203). Can't seem to find it anywhere else (not in the archive linked in that post anyway), I might have to dig through backups to find it, but it really should be everything you need.
inefficient
21-Jan-2007, 03:47
Does that even produce correct results? It looks like all samples in an SIMD block (4x1) will always be equal. You always increment all four elements of iv together, so how could they be different in the end?
What you could do there is to copy the one_i mask, and, in the copy, eliminate (with bitwise AND) the sample positions that are "done". Thus you still have ones on the sample positions that you still must count up, and zero on the finished positions. Adding zero to the finished results is safe.
The next significant optimization is unrolling the loop. With the masked vector increments, there is no longer a problem with running "slightly too many" iterations. The check for the end condition is relatively expensive, so just do it once every eight or 16 iterations and your throughput should rise.
I've written something like that a few years ago for SSE. Looking forward to playing with those SPEs as soon as Amazon delivers my PS3.
It is correct. It just might not be obvious because branches were optimized out.
We only want to give point a value if the condition is met. Otherwise it should stay 0. If it was just one pixel at a time, below would be the pseudo code.
if (point == 0 and Tr + Ti > limit) then
point = i \\ loop counter
else
point = point \\ change nothing
end if
The problem with the above code is having to evaluate that branch 4x every iteration of the innermost loop. The branch can be eliminated using masking. Plus it can all be done in SIMD so you can do all 4 pixels at the same time.
iv = spu_add(iv, one_i);
\\ iv is just a vector with the loop counter {i, i, i, i} it gets incremented every loop cycle
mask0v = spu_cmpgt(Trv + Tiv, limitv);
\\ Mask0 holds the "if (Tr + Ti > limit)" condition for each pixel.
\\ If only the first and third elements are true it would look like this (shortened to 8 bits just for explanation purposes):
Mask0 = {11111111, 00000000, 11111111, 00000000}
mask1v = spu_cmpgt(pointv, 0);
\\ Mask1 holds the condition "if (point > 0)" condition for each pixel.
\\ We don't want to re-assign a value to a pixel if it already has been assigned.
\\ Let's say the first position has already been assigned a value
Mask1 = {11111111, 00000000, 00000000, 00000000}
pointv = spu_sel(pointv, iv, spu_andc(mask0v, mask1v));
\\ This line is 2 instructions. First is ANDC (and with complement)
\\ See, this is why element 1 will not get re-assigned
Mask0 & ~Mask1 = {00000000, 00000000, 11111111, 00000000}
\\ Finally we use the Select Bits instruction to pick the right path out of 2 options.
If the matching element in "Mask0 & ~Mask1" is all 1's, that element in pointv will be assigned the value of the loop counter, otherwise it is assigned itself and there is no change.
Crazyace
21-Jan-2007, 11:50
A slightly faster version may be to mask the increment, rather than the count..
ie:
vec count = {1,1,1,1}
while (loop) {
count = spu_andc( count,spu_cmpgt(Tr+Ti,limitv) );
pointv = spu_add(pointv,count);
}
Of course - on the SPU there are no NaNs - so you are probably OK with the observation that when you escape the circle the sequence tends to infinity - so the compare will always fail. ( On the VMX it's possible that NaNs could creep in from infinity-infinity unless you're careful )
Then you could just use
pointv = vec_sub( pointv,spu_cmpgt(Tr+Ti,limitv) );
Didn't see this posted:
IBM talks about SPU programming (Part II, published on 12th Feb 2007).
http://www-128.ibm.com/developerworks/power/library/pa-linuxps3-2/
ATI-liens
21-Feb-2007, 22:05
I saw the benchmarks almost a month ago, the PPE was slightly out performed by a 1.6Ghz G5.
Remember each SPE is nothing like the PPE. They don't use VMX and are not great at double precision.
Cell is a big hype.
I saw the benchmarks almost a month ago, the PPE was slightly out performed by a 1.6Ghz G5.
Remember each SPE is nothing like the PPE. They don't use VMX and are not great at double precision.
Cell is a big hype.
Borderline spam/troll post :cry:
ATI-liens
21-Feb-2007, 22:23
Borderline spam/troll post :cry:
Not really, it's a fair opinion. Just like some people think the Wii is a novelty.
I think the Cell is powerful. I'm not biased, I just don't think it's what most people are making it out to be (when considering gaming).
The way it is being portrayed is as if we will see movie-like graphics soon. I think we will see very good graphics out of the PS3 but I wouldn't get too over-hyped. I'm not convinced that it will exceed the X360's visual potential.
Not really, it's a fair opinion. Just like some people think the Wii is a novelty.
I think the Cell is powerful. I'm not biased, I just don't think it's what most people are making it out to be (when considering gaming).
The way it is being portrayed is as if we will see movie-like graphics soon. I think we will see very good graphics out of the PS3 but I wouldn't get too over-hyped. I'm not convinced that it will exceed the X360's visual potential.
You're bringing in elements that have nothing to do with this thread and attempting to start implications that have nothing to do with this particular section of the console forum. Please refrain :!:
My opinion:
Cell like architectures are a clear conceptual win over what's come before.
As far as Cell itself goes, I'll wait until devs have more time with it before I declare it a performance success, failure, or break-even (with a $1 billion investment). Certainly there are loads that Cell does very well with, but generally they're the same types of loads a GPU does well with; GPGPU may win out over Cell or vice versa. Cell in PS3 as opposed to some other CPU will probably end up being the better choice for the PS3, but I'm more concerned about what gets adopted by the scientific research community (since the PS3 design decisions are already done, but where computing as a whole goes in the next few years is still unknown).
I saw the benchmarks almost a month ago, the PPE was slightly out performed by a 1.6Ghz G5.
Remember each SPE is nothing like the PPE. They don't use VMX and are not great at double precision.
Cell is a big hype.
In general, the PPE is similar in performance to 1 Xenon core. I believe the latter has some extended VMX instructions, 128 registers but more limited hardware threading (Someone please correct me if I'm wrong).
The SPE has a SIMD engine strapped to it, and you don't need double precision for gaming? It's faster than the PPE when the local store is used effectively, and there are 7 of them in PS3. Use the search to find comments from the devs on SPE.
The_legend_of_drtre
22-Feb-2007, 00:48
Lock if old:
・Dhrystone v2.1
PS3 Cell 3.2GHz: 1879.630
PowerPC G4 1.25GHz: 2202.600
PentiumIII 866MHz: 1124.311
Pentium4 2.0AGHz: 1694.717
Pentium4 3.2GHz: 3258.068
・Linpack 100x100 Benchmark In C/C++ (Rolled Double Precision)
PS3 Cell 3.2GHz: 315.71
PentiumIII 866MHz: 313.05
Pentium4 2.0AGHz: 683.91
Pentium4 3.2GHz: 770.66
Athlon64 X2 4400+ (2.2GHz): 781.58
・Linpack 100x100 Benchmark In C/C++ (Rolled Single Precision)
PS3 Cell 3.2GHz: 312.64
PentiumIII 866MHz: 198.7
Pentium4 2.0AGHz: 82.57
Pentium4 3.2GHz: 276.14
Athlon64 X2 4400+ (2.2GHz): 538.05
source: http://rian.s26.xrea.com/nicky.cgi?DT=20061121A#20061121A
Wait, so the Wii's CPU is comparable to a 2 GHz Pentium 4?
Its Dhrystone was over 1600 DMIPS as well.
Sorry to go off topic..
I saw the benchmarks almost a month ago, the PPE was slightly out performed by a 1.6Ghz G5.
Using benchmarks which were a lot better tuned for the G5...
e.g. Look at the stream figures, Cell is capable of a *lot* better and has been measured as such.
Remember each SPE is nothing like the PPE.
Correct, they are also 1/4 of the size, faster and use at most 1/4 of the power.
They don't use VMX and are not great at double precision.
They use an ISA remarkably like VMX (not surprising given VMX was their starting point).
Theoretical double precision isn't that exciting but it's still comparable to other processors.
The difference is in actual problems: Cell can get a lot closer to its theoretical maximum than other processors.
Cell is a big hype.
I wrote a paper on Cell a couple of years back and was accused of hyping Cell with made up benchmarks (actually they were estimates based on the theoretical figures). The umpteen research papers that have come out about Cell have subsequently confirmed what it is actually capable of - It actually exceeded all my "hype" estimates, in some cases by several times.
inefficient
22-Feb-2007, 01:43
Wait so the Wii's CPU comparable to a 2 ghz Pentium 4?
It's drystone was over 1600 dmips as well.
Sorry to go off topic..
That Wii Dhrystone number was obviously just fabricated by scaling the GameCube CPU's Dhrystone number by clock speed: 1125 DMIPS @ 485MHz = 1691 DMIPS @ 729MHz.
Dhrystone numbers only tell us one small part of the performance story. They only tell us about integer performance - there are no floating point operations in that benchmark. Wii's CPU should theoretically show good results in Dhrystone thanks in part to the fact that it's an out-of-order CPU.
But PS3 has 8 cores, 360 has 3, Wii only has 1. The PS3 Dhrystone number above is just the PPU's Dhrystone number - only a small fraction of the full capability of the chip. And of course when it comes to floating point performance, Wii's CPU performance is completely dwarfed by the single-core performance of either a Cell or Xenon CPU.
The_legend_of_drtre
22-Feb-2007, 01:45
That Wii Dhrystone number was obviously just fabricated by scaling the GameCube CPU's Dhrystone number by clock speed: 1125 DMIPS @ 485MHz = 1691 DMIPS @ 729MHz.
Dhrystone numbers only tell us one small part of the performance story. They only tell us about integer performance - there are no floating point operations in that benchmark. Wii's CPU should theoretically show good results in Dhrystone thanks in part to the fact that it's an out-of-order CPU.
But PS3 has 8 cores, 360 has 3, Wii only has 1. The PS3 Dhrystone number above is just the PPU's Dhrystone number - only a small fraction of the full capability of the chip. And of course when it comes to floating point performance, Wii's CPU performance is completely dwarfed by the single-core performance of either a Cell or Xenon CPU.
Thanks but that's not what I asked. :grin:
I asked: is the Wii's CPU comparable to a 2 GHz Pentium 4?
Oh and why would you think the Wii DMIPS number would be made up?
BTW, I was already aware that floating point calculations are not included in Dhrystone and that one of the main reasons why the PS3 uses SPEs is because they inflate the system's floating point numbers dramatically.
I just wanted to know how it stacks up to the Pentium 4 at 2 GHz and the main PS3 CPU without the SPEs in general performance, but I seem to have answered my own question.
inefficient
22-Feb-2007, 01:58
Thanks but that's not what I asked. :grin:
I asked: is the Wii's CPU comparable to a 2 GHz Pentium 4?
No.
Oh and why would you think the Wii DMIPS number would be made up?
Because no one has ever actually run and published a Dhrystone score for the Wii CPU. People with access to the machines are under NDA. The GameCube, on the other hand, people have run Linux on, and people have benchmarked it.
BTW, I was already aware that floating point calculations are not included in Dhrystone and that one of the main reasons why the PS3 uses SPEs is because they inflate the system's floating point numbers dramatically.
SPUs also "inflate" integer performance by the same logic. Thinking that SPUs only are good at floating point operations is a misunderstanding.
I wrote a paper on Cell a couple of years back and was accused of hyping Cell with made up benchmarks (actually they were estimates based on the theoretical figures). The umpteen research papers that have come out about Cell have subsequently confirmed what it is actually capable of - It actually exceeded all my "hype" estimates, in some cases by several times.
Is your paper online? If so or if not, please share :!: