Confused about GFLOPS ratings

Laa-Yosh

There are a lot of GFLOPS ratings thrown around nowadays, but it seems that marketing has the upper hand and it's not that easy to actually compare them...

For example, we've heard the 256 GFLOPS value for the Cell CPU. But it turned out that this is for single precision FP operations, and once you switch to double precision, it sinks to ~25 GFLOPS.

Now, IBM is said to have plans to build CELL workstations and renderfarms for users like the CG industry. But most rendering applications use double precision AFAIK, and thus the advertised figures (like the 16 TFLOPS rack) suddenly don't look that good.

So my question is, how do the actual performances compare? How many FLOPS can a 3 GHz P4 Xeon do, and is that single or double precision? Are supercomputers rated for single or double precision? Am I right that scientific applications and CG require double precision? Can SSE accelerate double precision?
 
If the Emotion Engine's double precision FLOPS rating were listed, it would be far below 6.2 GFLOPS, probably less than 1 GFLOP DP. Same with the Gamecube's 1.9 GFLOPS for the CPU or ~10.5 GFLOPS total for the entire system, the Xbox CPU's ~3+ GFLOPS, and the Dreamcast's 1.4 GFLOPS / 900 MFLOPS. All would be much lower if listed in DP.

I don't know the SP or DP FLOPS figures for other Intel Pentium or Xeon processors.
 
This is a good question. I know games rarely need double precision, but some non-realtime renderers use it a lot. Still, 25~30 GFLOPS is about 10x more than a PC.

Excerpt from RTW article said:
Floating Point Capability

As described previously, the prototype CELL processor’s claim to fame is its ability to sustain a high throughput rate of floating point operations. The peak rating of 256 GFlops for the prototype CELL processor is unmatched by any other device announced to date. However, the SPE’s are designed for speed rather than accuracy, and the 8 floating point operations per cycle are single precision (SP) operations. Moreover, these SP operations are not fully IEEE754 compliant in terms of rounding modes. In particular, the SP FPU in the SPE rounds to zero. In this manner, the CELL processor reveals its roots in Sony's Emotion Engine. Similar to the Emotion Engine, the SPE’s single precision FPU also eschewed rounding mode trivialities for speed. Unlike the Emotion Engine, the SPE contains a double precision (DP) unit. According to IBM, the SPE’s double precision unit is fully IEEE854 compliant. This improvement represents a significant capability, as it allows the SPE to handle applications that require DP arithmetic, which was not possible for the Emotion Engine.

Naturally, nothing comes for free and the cost of computation using the DP FPU is performance. Since multiple iterations of the same FPU resources are needed for each DP computation, peak throughput of DP FP computation is substantially lower than the peak throughput of SP FP computation. The estimate given by IBM at ISSCC 2005 was that the DP FP computation in the SPE has an approximate 10:1 disadvantage in terms of throughput compared to SP FP computation. Given this estimate, the peak DP FP throughput of an 8 SPE CELL processor is approximately 25~30 GFlops when the DP FP capability of the PPE is also taken into consideration. In comparison, Earth Simulator, the machine that previously held the honor as the world’s fastest supercomputer, uses a variant of NEC’s SX-5 CPU (0.15um, 500 MHz) and achieves a rating of 8 GFlops per CPU. Clearly, the CELL processor contains enough compute power to present itself as a serious competitor not only in the multimedia-entertainment industry, but also in the scientific community that covets DP FP performance. That is, if the non-trivial challenges presented by the programming model of the CELL processor can be overcome, the CELL processor may be a serious competitor in applications that its predecessor, the Emotion Engine, could not cover.
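To make the excerpt's arithmetic concrete, here's a rough sketch in Python. Assumptions: the oft-quoted 4 GHz prototype clock and 8 SP FLOPs per SPE per cycle, with the ~10:1 SP:DP ratio being IBM's ISSCC estimate quoted above; `peak_gflops` is just an illustrative helper, not anything official.

```python
def peak_gflops(units, flops_per_unit_per_cycle, clock_ghz):
    # Peak = execution units x FLOPs per unit per cycle x clock in GHz
    return units * flops_per_unit_per_cycle * clock_ghz

# 8 SPEs, 8 SP ops/cycle each, at the 4 GHz prototype clock -> 256 GFLOPS SP
sp_peak = peak_gflops(8, 8, 4.0)

# Apply the article's ~10:1 SP:DP throughput estimate for the SPEs
spe_dp_peak = sp_peak / 10   # ~25.6 GFLOPS before counting the PPE's DP FPU
```

Adding the PPE's own DP FPU on top of the ~25.6 GFLOPS from the SPEs is how you land in the 25~30 GFLOPS band the article mentions.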


I'm sure the FLOPS in the Top 500 list are double precision. DP is absolutely necessary for most scientific needs. But then, even the 500th machine on that list must cost hundreds of times more than a CELL. I bet most of these supercomputer people are dying to get their hands on the chip.
 
From what I've been told, the PS2 supported double floats only in software, making them practically useless.
 
Laa-Yosh said:
...
For example, we've heard the 256 GFLOPS value for the Cell CPU. But it turned out that this is for single precision FP operations, and once you switch to double precision, it sinks to ~25 GFLOPS.
...

FLOPS are usually stated as SP unless explicitly stated as DP. AFAIK, all the speculation leading up to ISSCC was always for SP FLOPS. Also, when comparing FLOPS you might want to consider whether they're IEEE compliant, which CELL isn't for SP but is for DP. This isn't critical for a console, as the EE isn't IEEE compliant either. But existing code may need checking before compiling for CELL if it relied on IEEE compliant SP and DP.

Laa-Yosh said:
...
Now, IBM is said to have plans to build CELL workstations and renderfarms for users like the CG industry. But most rendering applications use double precision AFAIK, and thus the advertised figures (like the 16 TFLOPS rack) suddenly don't look that good.
...

Again, these would be extremely good FLOPS for SP, and I wasn't expecting those FLOPS for DP. Why would rendering apps need DP? AFAIK, they would be fine with SP unless you're doing high precision (DP) physics modelling / visual simulation or something...

Laa-Yosh said:
...
So my question is, how do the actual performances compare? How many FLOPS can a 3 GHz P4 Xeon do, and is that single or double precision? Are supercomputers rated for single or double precision? Am I right that scientific applications and CG require double precision? Can SSE accelerate double precision?

http://www.top500.org

Supercomputers have to meet certain criteria and are ranked with standard benchmark tools.

http://www.spec.org/

The SPEC marks are another way to compare processors, though strictly speaking they benchmark the whole system.

Most processor Flops are stated as SP and are IEEE compliant on the desktop arena.

Typical G5 @ 3 GHz ~ 35 GFLOPS (VMX + FPU)

http://en.wikipedia.org/wiki/Flops

Science, engineering, simulations etc. may need DP, where 64-bit accuracy is necessary and critical, but for CG, SP should be fine, AFAIK. Why would you need 64-bit DP floats for rasterising, or even raytracing?
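One way to actually check how much precision SP gives at a given magnitude: the sketch below uses Python's `struct` round-trip as a stand-in for an SP FPU, and the "100 km in metres" interpretation is just an illustrative assumption.

```python
import struct

def f32(x):
    # Round a Python double to the nearest IEEE 754 binary32 value
    return struct.unpack('f', struct.pack('f', x))[0]

def f32_ulp(x):
    # Spacing between x and the next representable binary32 value
    bits = struct.unpack('I', struct.pack('f', x))[0]
    return struct.unpack('f', struct.pack('I', bits + 1))[0] - f32(x)

print(f32_ulp(1.0))       # ~1.19e-07: plenty of precision near the origin
print(f32_ulp(100000.0))  # 0.0078125: ~8 mm steps 100 km out, if units are metres
```

So SP is ample for most shading math, but absolute precision degrades with distance from zero, which is one reason "global world size" comes up in DP discussions.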
 
Considering that most of today's rendering software was developed to run on an x86 processor without any extensions, they are certainly using the default 80-bit FP precision. There has been some SSE optimization for some renderers (like the one in 3ds max) but it has not produced any considerable speedup, which suggests that most of the calculations are not using 32-bit precision.
I could list a bunch of reasons why I would want greater precision, like color accuracy or global world size, but I'm sure that there are some users with more technical knowledge around here. I've heard that Lightwave's renderer is actually working with 160 bits of precision; tolerances are that much tighter in offline CG compared to realtime applications. And I'd also like to add that a very important part of offline CGI is actually about running physical simulations for cloth, hair and rigid body dynamics.

But I might be wrong about the double precision issue..
 
The current DP champions in personal computing are the Apple G5s.
Each 970 has dual 64-bit FPUs, FMADD capable, ticking along at 2.5 GHz, giving the PowerMac 2 (FPUs) x 2 (CPUs) x 2.5 (GHz) x 2 (FMADD) = 20 DP GFLOPS.

Not too shabby.

Of course, being personal computers, they only have 128 bits worth of memory bus running at 400 MHz to feed the FPUs, reducing that nominal FPU performance to no better than a P4 or Opteron for many (most?) double precision codes, with data sets that correspond to relevant problems. Still, for codes and data sets that can be made to fit the memory hierarchy, the performance is there and is quite good.
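The memory-bus point can be made concrete with a back-of-the-envelope sketch; this is a deliberately crude model that ignores caches and latency entirely, and `flops_per_operand` is just a made-up helper name.

```python
def flops_per_operand(peak_gflops, bus_bits, bus_mhz):
    # DP FLOPs that must be performed per 64-bit operand fetched from RAM
    # for the FPUs to stay at nominal peak (crude model, no caches)
    bytes_per_sec = (bus_bits / 8) * bus_mhz * 1e6
    doubles_per_sec = bytes_per_sec / 8
    return peak_gflops * 1e9 / doubles_per_sec

# Dual 2.5 GHz G5: 20 DP GFLOPS peak vs a 128-bit, 400 MHz memory bus
print(flops_per_operand(20, 128, 400))   # -> 25.0
```

In other words, unless a code performs on the order of 25 DP operations per double it streams from main memory, the G5's FPUs sit idle waiting on the bus, which is exactly why cache-friendly data sets are the exception where the nominal performance shows up.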
 
Entropy said:
The current DP champions in personal computing are the Apple G5s.
Each 970 has dual 64-bit FPUs, FMADD capable, ticking along at 2.5 GHz, giving the PowerMac 2 (FPUs) x 2 (CPUs) x 2.5 (GHz) x 2 (FMADD) = 20 DP GFLOPS.

Which nicely explains why a G5-based cluster is on the Top 500 list of the supercomputers. Thanks for the post.


Anyway, we can now see that Cell won't be a supercomputer-on-a-chip, with only 25-30 GFLOPS of comparable DP performance... But it's fast and nice nevertheless, especially for the purpose it has been engineered for: the PS3.
 
Well... considering the 500th computer in the Top500 costs close to US$200,000, I think that if the $300 PS3 is 30x slower in DP FLOPS, it's still a remarkable feat. :)

Personally, I'm more concerned with game performance, so I'm wondering about the 256 SP GFlops and how much of those will really be sustainable.
 
Laa-Yosh said:
Considering that most of today's rendering software was developed to run on an x86 processor without any extensions, they are certainly using the default 80-bit FP precision. There has been some SSE optimization for some renderers (like the one in 3ds max) but it has not produced any considerable speedup, which suggests that most of the calculations are not using 32-bit precision.
...

Even though an x86 FP unit may be capable of 80-bit precision, the rendering software would be recompiled for the appropriate architecture if ported, and the default would be SP unless DP is explicitly required.

Laa-Yosh said:
...
I could list a bunch of reasons why I would want greater precision, like color accuracy or global world size, but I'm sure that there are some users with more technical knowledge around here. I've heard that Lightwave's renderer is actually working with 160 bits of precision; tolerances are that much tighter in offline CG compared to realtime applications. And I'd also like to add that a very important part of offline CGI is actually about running physical simulations for cloth, hair and rigid body dynamics.
...

I'm not familiar with the specifics you mention above, especially the 160-bit Lightwave renderer (160 bits of what?), but you should be aware that 'bitage' is often quoted as an aggregate and is not comparable to calculation precision. E.g. 24-bit color is an aggregate of 3 channels, RGB, where each channel is only 8 bits and the calculations are done on 8-bit integers etc. (Using one CELL @ 4 GHz ~ 1 TOPS peak on 8-bit integers.)
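A tiny sketch of the aggregate-bits point; `pack_rgb24` is just an illustrative helper, not any real API:

```python
def pack_rgb24(r, g, b):
    # "24-bit color" is an aggregate of three independent 8-bit channels;
    # arithmetic on each channel still happens at 8-bit precision.
    assert all(0 <= c <= 255 for c in (r, g, b))
    return (r << 16) | (g << 8) | b

print(hex(pack_rgb24(255, 255, 255)))   # 0xffffff: 24 bits total, 8 per channel
```

So a "160-bit" figure could just as easily mean several stacked channels or accumulators as one 160-bit-precision number.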

So these 'bit' numbers flying about without context are not necessarily comparable to SP or DP accuracy. I'd also be interested to know what part of a rendering pipeline would actually benefit from DP, be it offline or real-time?
 
Alejux said:
This is a good question. I know games rarely need double precision, but some non-realtime renderers use it a lot. Still, 25~30 GFLOPS is about 10x more than a PC.

I recently read that a top-of-the-line P4 CPU does close to 30 GFLOPS, so CELL would be about 8-10x more FLOPS. And if we are to expand this to "a PC" compared to CELL, you can throw in some of the other hardware.

FLOPS are nice, but core logic is important also. If you compare a P4 chip you have to remember that it has extensions like MMX, SSE, and others, but it is not an FP-focused chip like CELL. They have different audiences (at this point) and different design goals.

I am actually curious about the CELL, and maybe someone can answer. My basic understanding is that the PPC core (PE) will be feeding the SPEs (APUs). I am not a programmer, designer, or engineer, but it would seem to me that the PE would have to spend a bit of time keeping all 8 SPEs full. Which brings me to my question about general application logic: will the PE have enough resources to run this, or are programmers going to have to find ways to get these types of tasks to run on the SPEs?

What kind of hurdles, if any, are programmers going to have getting game code to run on the SPEs (which appear to be very math centric... good for Physics and Geometry... but how good at other tasks?)

Anyone familiar with these areas willing to take a stab at this? Obviously CELL is insanely powerful. While it has a lot of power, is that power approachable for what programmers are doing today in games? How will this all work? I am sincerely interested. CELL is a very unique design, and I would like to hear how it will be used, given that not all code is like physics or geometry. So I am wondering if this is an issue or if CELL has some design features that make it pretty straightforward... again, I am not a programmer so be nice ;)

EDIT: The pic on this page answers some questions. Still wondering what type of applications can, and cannot, run on the SPEs.
 
The ClearSpeed CSX600, operating at a measly 250 MHz and consuming less than 5 W, can sustain 25 GFLOPS DP and has a peak of 50 GFLOPS DP. It's mainly designed as a coprocessor, but can operate as a separate CPU too. The CELL is not a supercomputer-on-a-chip like the marketing people would like you to believe. However, if you view it like a G5-based supercomputer, where many G5s are linked together, then you can link many CELLs together to form a supercomputer too; but it's no supercomputer on a chip. The 16 TFLOPS workstation that they keep talking about needs many, many racks of CELL boards, with each board containing no fewer than 32 CELLs.
 
PC-Engine said:
...
The 16 TFLOPS workstation that they keep talking about needs many, many racks of CELL boards, with each board containing no fewer than 32 CELLs.

I'm pretty sure that 16 TFLOPS is SP peak and for 2nd gen CELL processors.

Even with first gen CELL processors, 256 GFLOPS * 42 (1 rack = 42U) ~ 10.75 TFLOPS peak, single precision, for a 42U rack with one CELL per 1U.
 
ClearSpeed CSX600
A browse of the official site reveals that it is quite a specialised unit: an array processor with 96 processing elements, but each PE only has 6 KB of memory. A 128 KB scratchpad sits on the chip, and I don't see any presence of a "general-purpose type CPU" for dealing with applications that do not map to array processing. Of course, that is why it is also positioned as a coprocessor to the "normal" Pentium/Athlon!

It's wonderful for certain applications, but for many other things we're better off relying on our "miserable" years-old pentiums and athlons.

EDIT: With 96 PEs it's clear how they arrived at those GFLOPS numbers at that clock speed. It's also quite clear that only some specialised applications can benefit from being split into 96 threads limited to 6 KB each! (refer to Deano's post on Amdahl's Law)
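For reference, the Amdahl's Law ceiling mentioned above fits in a few lines; the 95% parallel fraction below is just an illustrative assumption, not a measured figure for any real workload.

```python
def amdahl_speedup(parallel_fraction, n_units):
    # Amdahl's law: the serial fraction caps overall speedup regardless
    # of how many processing elements attack the parallel part.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

# Even 95%-parallel code gets nowhere near 96x on 96 PEs:
print(amdahl_speedup(0.95, 96))   # ~16.7
```

The remaining 5% serial work dominates once the parallel part has been split 96 ways, which is why only embarrassingly parallel problems see anything close to the headline numbers.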
 
Yes, it's a specialized unit; that's why I brought it up with regards to supercomputing, which IS specialized. I didn't say anything about why it would be a good general purpose processor, which it wasn't designed to be, nor did I say it would be a good processor for a game machine.

The CSX600 has a simple programming model and is supported by a fully-featured Software Development Kit (SDK) based around an optimizing C compiler. The SDK is available for Linux and Microsoft Windows platforms.

The key features of the SDK are:

• Optimizing C compiler based on ANSI C with simple
extensions to support MTAP architecture.
• Standard libraries optimized for MTAP architecture.
• Graphical debugger supports all features required by professional
software developers: simple & conditional breakpoints
and watchpoints, single-stepping, symbolic and source-level
debugging. All features are available equally with both the
simulators and the target hardware.
• Macro assembler for assembling the code generated by the
compiler or hand-written assembler source.
• Pre-processor which does macro substitution on C and
assembler source files.
• Instruction set simulator allows application code to be run
and debugged in the absence of hardware.
• Linker for combining object files and libraries into an executable
program.
• Archiver / Librarian for building libraries of object code files.
• Object code dump tool for examining the contents of object
files, libraries and executables.
Technology

The multi-threaded array processor architecture provides
an exceptionally powerful and scalable processing solution,
based on an array of tens to thousands of Processing
Elements (PEs). Each PE has its own local memory and
I/O capability, making the architecture ideally suited for
applications which have high processing and/or bandwidth
requirements. The inherently scalable array architecture is
also highly area and power efficient.

The CSX600 can serve either as a co-processor sitting
alongside an Intel or AMD CPU within a high performance
workstation, blade server or cluster configuration, or as
a standalone processor for embedded DSP applications
like radar pulse compression or image processing. In
applications where the CSX600 is acting as a co-processor,
dynamic libraries off-load an application’s inner loops to it.
These inner loops typically make up only a small portion of
the source code, but are responsible for the vast majority of
the application’s running time.

Although the CSX600 itself is programmed in C, host libraries
can be provided to allow C, C++ or FORTRAN applications
to communicate with CSX co-processors.
 
FWIW, the only people who are shocked by the notion of SP floats are those who don't really understand what the SIMD/GFLOP scene is in the first place. From the very first appearance of vector engines on desktop CPUs, the understanding has always been that the speed is obtained primarily from the SP operations. This is not to say that said vector units did not have a DP mode, but that isn't where the greatest speed benefits would be attained, anyway. So it's not like Sony was trying to "cheat" with SP performance. SP vector design has been the practice for a VERY long time.

The reason that most consumer, and even pro, apps today happen to use DP is not because they need it, but because the software was designed to use the default FPU that will most assuredly exist on the target CPU. On x86 platforms, that happens to be a very standard 80-bit processing unit (which has been around for a loooooong time), and on RISC setups, it is a 64-bit processing unit. I'm thinking a lot of people here that see the SP thing as some kind of conspiracy genuinely have no grasp of the ridiculously huge number range a SP float encompasses, or the sheer precision offered within 32 bits. It's pretty ridiculous to complain about this "shortcoming" when we are working with 16/24-bit audio and sub-32-bit video (in some cases even sub-24-bit, in the case of the LCD panels that so many people are ga-ga about) in the consumer space. The whole proliferation of speedy SP designs is because designers realized that the precision was quite adequate for just about anything "most" people will do on a desktop, and that building a faster DP FPU comes up as a waste of resources more often than not.
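To put some rough numbers on "the sheer precision offered within 32 bits", here are the standard IEEE 754 binary32 constants, nothing CELL-specific:

```python
# An IEEE 754 binary32 has a 24-bit significand (~7 decimal digits)
# and an enormous dynamic range.
F32_MAX = (2 - 2**-23) * 2**127    # ~3.4e38, largest finite binary32
F32_EPS = 2**-23                   # relative step size, ~1.19e-7

# The 24-bit significand vs the consumer formats mentioned above:
print(2**24 // 2**16)   # 256x more levels than a 16-bit audio sample
```

Against 8-bit-per-channel video and 16-bit audio, a single precision float is comfortably over-specified for most consumer work.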

So things are progressing exactly as they should be in floating-point land: no shenanigans. That is not to say there aren't the usual suspects in aerospace and nuclear energy research, for example, that really do require the best precision that technology can offer. Their computing needs will be served by very specific products, not any kind of Pentium, PowerPC, or BBE that "we" would be ga-ga over. That's my 2 cts.

As has been stated ad nauseam, Cell is an architecture rather than an end-product design. A "Cell" processor could be conjured using any sort of CPU design or processing unit as its base. If somebody needed it and was willing to pay for it, there is absolutely nothing stopping a scalable processor product that does incorporate DP floats to meet whatever GFLOP target. Make no mistake that when you are talking hundreds or thousands of GFLOPS or more, it will be a far more plausible proposition to come up with a Cell solution than to ask Intel or even IBM for a single, monolithic CPU design that can do that.
 
randycat99 said:
That's my 2 cts.

Good points.

The only thing I want to comment on is the "CELL architecture". Yeah, CELL is an architecture, but don't put your head in the sand either: it is also a chip. And until specific configurations of a CELL chip get a name, there is no point referring to it as anything different, as long as people know what we have seen may not be what ends up in HDTVs, PS3s, or workstations.

The fact that a Pentium can have different speeds, cache sizes, and other features (MMX, SSE) does not change that it is a Pentium and x86. Yes, CELL is highly configurable (but that also comes at a cost), but that does not magically make CELL different from everything else. A GPU is a good comparison. A Radeon 9600 and 9800 are the same basic architecture, but one has more pipelines (in the same way a CELL chip may have more or fewer PEs or SPEs). It does not prevent us from talking about a Radeon family chip or a 95xx-98xx series chip. And this did not prevent the chip from being expanded with even more features, as in the R420/X800 series chips, which doubled the pipelines of the 9800 and added new functionality.

So while there may not be "The CELL Processor" (just like there is no "The Pentium" or "The Radeon"), we can talk about "A CELL Processor". And until we start seeing an array of different consumer level devices that are relevant to consumers, it is kinda pointless to continue beating this point to death. Yes, it is customizable, but so far we have only seen 1 iteration of it. And unless we see some substantial changes (and I am not talking about 2 PEs or a 1:12 PE:SPE configuration, which is pretty much akin to nVidia first showing a 4-pipeline part and then upping it to 6), a CELL chip is still a CELL chip until they show us something really different.

I may be wrong, but up to this point I see the whole "CELL is an architecture not a chip" as playing on semantics. Not that you are doing this, but the ones I see beating this dead mule the most are those who are fixated on "more power" or 1TFLOPs.
 
The point was, it is a technique in the larger picture. We are seeing one implementation of it. Granted, it is referred to more casually as the Cell processor, as well it should be, since we will only see one version as far as game consoles are concerned. It is the only one around. Maybe there will be more varieties, maybe not (depends on the success of this venture, I'd guess). The point is, the feature set we see today is far from locked in stone. The fundamentals are in place. If there are business customers in the future that require certain functionalities, the Cell architecture is adaptable to include whatever functional units are required and retain its scalable nature (don't take that as me saying new versions would be trivial to create; just not a flatly prohibitive proposition). It need not even be based on a different CPU ISA if the existing structures are adequate for the job (I'm sure doing so would be considerable work, so there would have to be a good reason to bother). You want DP FPU arrays? Add them in. Apple decides they want an army of VMXs? They could get them, if they were willing to buy.

I don't think it is necessary to believe that such variations cannot exist simply because none exist today (though no one is stopping you if you would like to, and it doesn't really change either of our universes either way we choose), and there is no talk of any other versions. Indeed, maybe there won't be, ultimately. That does not mean it was not possible. It just seemed obvious to me that the multiplicity part has a drop-in nature to it. It was coming up with a standard to tie it all together in a scalable fabric that is the big revelation here, imo.
 
I'll ramble a bit here. Sorry about that. It's Sunday, after all.

PC-Engine said:
Yes it's [the CSX600] a specialized unit, that's why I brought it up with regards to supercomputing which IS specialized.
Weeeell - yes and no. To a large extent, supercomputing simply means that you can afford to pay more than what is considered "normal" for a computer to get a better tool for attacking your problems.
It used to be (just before I got into computing) simply very fast systems, then it was vector processors, then parallel vector processors (my first association on seeing early Cell sketches was the Cray X-MP), then gradually massively parallel processors or VLIW or multithreaded processors and a bunch of other ideas.

True, generally speaking you could say that the further we've walked down this path, the narrower the problem set where we get some proportional benefit for our effort/money, and the more specialized the computers have become. (However, one of the more popular ways to get additional computing power is simply to use a lot of small cheap ones working on the same problem, so it's not true that the hardware necessarily gets more exotic.) And this narrowing of the set of tasks that can be effectively addressed with higher-priced hardware is a very serious problem because of Amdahl's law, and because there are a lot of (most?) very worthy problems that just don't benefit much from today's custom hardware.

A supercomputer that was simply a very fast general purpose CPU would still sell like hotcakes (very expensive but good ones, that is :)) but they just aren't on offer. For some time, personal computing CPUs have done the job too well and too cheaply, for alternatives to crop up.

randycat99 said:
FWIW, the only people who are shocked by the notion of SP floats are those who don't really understand what the SIMD/GFLOP scene is in the first place. From the very first appearance of vector engines on desktop CPUs, the understanding has always been that the speed is obtained primarily from the SP operations. This is not to say that said vector units did not have a DP mode, but that isn't where the greatest speed benefits would be attained, anyway. So it's not like Sony was trying to "cheat" with SP performance. SP vector design has been the practice for a VERY long time.
Weeeell - yes and no.
Add-on vector processors have been manufactured for many systems, from the top end to PCs. I have some personal experience at the high end with the IBM 3090VF, there were vector processors made for the DEC VAXen, and we have had a number made for PCs. (And some parallel add-ons; remember the transputer boards?) They have never caught on, even in computational sciences, and for good reason.

While SP has typically yielded the highest numbers, DP has still been emphasized in the marketing brochures of those add-on processors that were good at it.

Pure FLOPS numbers have always been a marketing tool, and have never been a useful measure of anything. Even a couple of decades ago, I looked at the supporting memory hierarchy and snickered at FLOPS claims, as did all who actually used these computers. Note though that these manufacturer claims were useful in getting grants/funding from the beancounters who held the money so while we may have rolled our eyes, I don't know that anybody made much noise about the outrageousness of the claims. :D Ten times the performance is still pretty damn useful, as long as you don't have to foot the 500 times as large bill yourself.
Computing needs to have its architectural experiments funded somehow.

For instance, in the context of this forum, otherwise our gaming hardware would have developed at a much slower pace.
And then, where would we be?
 
Is the CELL architecture really that customizable, though, that it can support different SPU designs and still work? If you replace the SPUs with some specialized pixel pipelines, Cell code running apulets will keel over, as the specialist units won't be able to run them.

I dunno. I guess the SPUs could easily accommodate a design change from SP to DP for more scientific purposes, but I can't see Cell working as a system that seamlessly glues different techs together.

On another note, are there any instances where existing supercomputers use SP? Will Cell find its way to the top of the Top500 list, or will it be consigned to mainstream uses only?
 