Cell benchmarked

london-boy said:
Very nice, lots of detail especially about efficiency!!

A lot of it went over my head, but you can see that there's a wealth of information there.

Off topic: Nice Red Dots L-B!
 
SPE performance

The SPE is a modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). An SPU is a compute engine with SIMD support and 256KB of dedicated local storage. The MFC contains a DMA controller with an associated MMU, as well as an Atomic Unit to handle synchronization operations with other SPUs and the PPU.

An SPU is a dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations. The SPU operates directly on instructions and data from its dedicated local store, and relies on a channel interface to access the main memory and other local stores. The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with the program execution.


Figure 3. SPE block diagram



The SPU's SIMD support can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6GFLOPs in single precision. Figure 3 shows the main functional units in an SPU: (1) an SPU floating-point unit for single-precision, double-precision, and integer multiplies; (2) an even SPU fixed-point unit for arithmetic, logical operations, and word shifts; (3) an odd SPU fixed-point unit for permutes, shuffles, and quadword rotates; (4) an SPU control unit for instruction sequencing and branch execution; (5) an SPU local store unit for loads and stores, which also supplies instructions to the control unit; (6) an SPU channel/DMA transport, which is responsible for controlling input and output through the MFC.

As Figure 3 shows, each functional unit is assigned to one of the two execution pipelines. The floating-point and even fixed-point units are on the even pipeline, while the rest of the functional units are on the odd pipeline. The SPU can issue and complete up to two instructions per cycle, one on each of the execution pipelines. A dual issue occurs when a group of fetched instructions has two issuable instructions, one of which is executed by a unit on the even pipeline and the other by a unit on the odd pipeline.

There are three types of instruction fetches: flush-initiated fetches, inline prefetches, and hint fetches. To fetch instructions, the instruction fetch logic reads 32 instructions at a time into its instruction line buffer (ILB), from which two instructions at a time are sent to the issue logic. When the operands are ready, the issue logic sends the instructions to the functional units for execution. Functional unit pipelines vary from two to seven cycles. Hint instructions can preload instructions into the ILB.

Features such as a deterministic LS access time, simple issue rules, software-inserted branch hints, a large register file, and so on, are exposed to the compiler and applications for performance tuning. With some tuning effort, we have seen a wide variety of applications approach the theoretical IPC of 2 in the SPU. The SPEs' DMA support delivers data-streaming bandwidth much higher than that of many modern processors.
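The overlap of DMA and computation described above is usually exploited with double buffering: compute on one local-store buffer while the next chunk is in flight. A minimal sketch of that pattern, with plain slice copies standing in for the asynchronous mfc_get/mfc_put DMA commands a real SPE program would issue (the buffer size here is illustrative):

```python
# Double-buffered streaming over "main memory" in local-store-sized chunks.
# On a real SPE the copies would be asynchronous MFC DMA commands tagged
# and waited on via the channel interface; slice copies stand in here so
# the sketch runs anywhere.
CHUNK = 4096                     # bytes per chunk (the full LS is 256KB)
N = CHUNK * 8                    # total bytes to stream

def stream_sum(main_mem: bytes) -> int:
    ls = [bytearray(CHUNK), bytearray(CHUNK)]   # two local-store buffers
    cur = 0
    ls[cur][:] = main_mem[:CHUNK]               # prefetch chunk 0
    total = 0
    for chunk in range(N // CHUNK):
        nxt = cur ^ 1
        if chunk + 1 < N // CHUNK:              # start "DMA" of the next chunk
            ls[nxt][:] = main_mem[(chunk + 1) * CHUNK:(chunk + 2) * CHUNK]
        total += sum(ls[cur])                   # compute on the current chunk
        cur = nxt                               # swap buffers
    return total

print(stream_sum(bytes([1]) * N))               # N bytes of value 1 -> 32768
```

The point of the pattern is that the compute step never waits on the transfer it overlaps with, which is exactly what the independent MFC makes possible.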

Arguments about the "real" efficiency of the SPEs in 3... 2... 1...
 
Where can I find Sisoft Sandra Dhrystone MIPS for Cell and SYSMark2004 Office Productivity for Cell? GPP FTW! ;)
 
This article shows how optimizations play a significant role, such as the improvement found from loop unrolling in the T&L tests, and how the 'real-world' performance advantage of Cell varies considerably from task to task. For 'media' activities, a rough guide seems to be that Cell's 8 SPUs are together about 10x the applied float performance of a similarly clocked PPC970, with much greater gains for well-suited SIMD material, and, at roughly 1/10th the double-precision power, a negligible advantage in those tasks.
 
Interesting that we finally got some absolute figures for the Terrain Renderer. 30fps on Cell, 0.85fps on a 2.7GHz G5/VMX (or if you scaled up linearly to 3.2GHz, 1fps ;)).

Nice to have some "tested" vertex transform rates too. 7 SPUs = 1.52bn verts/sec?
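As a rough sanity check on that figure (the 7-SPU split and the 3.2GHz clock are taken from the thread; the arithmetic below is only illustrative):

```python
# Back-of-the-envelope per-SPU vertex rate from the claimed aggregate.
total_rate = 1.52e9               # claimed vertices/sec across 7 SPUs
per_spu = total_rate / 7
clock = 3.2e9                     # SPU clock in Hz
print(f"{per_spu / 1e6:.1f}M verts/sec per SPU")   # ~217.1M
print(f"{clock / per_spu:.1f} cycles per vertex")  # ~14.7
```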
 
Good read and good find.

Whatever ends up happening with Cell's push for recognition, I think it likely that it will gain traction in at least a few fields. With several to choose from in which it offers superior performance (and performance per watt I'm sure) to the traditional options, it'll likely stick in one or more places.

Cryptography is one area I hadn't really thought of before.
 
Titanio said:
Interesting that we finally got some absolute figures for the Terrain Renderer. 30fps on Cell, 0.85fps on a 2.7GHz G5/VMX (or if you scaled up linearly to 3.2GHz, 1fps ;)).

I find it hard to believe.

So, with 8 cores we achieve a 30 times increase for the same program?
Each of the 8 cores being an extremely simplified version of the compared core.

Right, so 1 SPU with 20 million transistors (including local storage) is performing significantly better (4 times) than cores with several times the logic and storage.

Sure!!!
How was it? The SPE code was heavily optimized at the assembler level by several engineers, while the 970 code was developed in Java by a junior programmer?

I am not bitching about Cell. I like it. But I am pretty tired of IBM/Sony/MS/nVIDIA/ATI (in this order) treating us like a bunch of idiots.
 
DarkRage said:
I find it hard to believe.

So, with 8 cores we achieve a 30 times increase for the same program?
Each of the 8 cores being an extremely simplified version of the compared core.

Right, so 1 SPU with 20 million transistors (including local storage) is performing significantly better (4 times) than cores with several times the logic and storage.

Sure!!!
How was it? The SPE code was heavily optimized at the assembler level by several engineers, while the 970 code was developed in Java by a junior programmer?

I am not bitching about Cell. I like it. But I am pretty tired of IBM/Sony/MS/nVIDIA/ATI (in this order) treating us like a bunch of idiots.

Software explanations aside, just looking at hardware, performance isn't just a function of how many cores you have and how many FLOPS they offer. There are other architectural concerns that can have a big impact on performance. If you read the paper, the first claim they make relates not to the multiplicity of cores or floating-point performance at all. As much as some would like to think otherwise, Cell's distinguishing claims aren't limited just to FP and the scale of the parallelism on offer. Other things could widen the gap with certain applications much further than the paper differences in those areas suggest.
 
DarkRage said:
Right, so 1 SPU with 20 million transistors (including local storage) is performing significantly better (4 times) than cores with several times the logic and storage.
Yes, because it works differently and is optimized for this sort of work. In exactly the same way, 20 million GPU transistors can eclipse 20 million CPU transistors in graphics work, while 20 million CPU transistors can eclipse 20 million GPU transistors in running an OS.
I am not bitching about Cell. I like it. But I am pretty tired of IBM/Sony/MS/nVIDIA/ATI (in this order) treating us like a bunch of idiots.
Except that they're not treating us like idiots. If you don't understand the technology, how it works, and how these real-world performances are achieved, it's your fault for not acquiring a suitable education. This isn't imaginary numbers at work here. These are targeted design decisions being used as they're supposed to be (and of course the test subjects are weighted in the SPEs' favour by applying them to the sorts of tasks they're good for. I'm sure there are plenty of benchmarks that would show where SPEs are inferior to typical CPUs.)
 
The SPU's SIMD support can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6GFLOPs in single precision.

ermh...
3.2GHz * 4 FLOPS = 12.8GFLOPS

Should I conclude that they suck at maths or that they don't reread their papers?
 
Ingenu said:
ermh...
3.2GHz * 4 FLOPS = 12.8GFLOPS

Should I conclude that they suck at maths or that they don't reread their papers?

They're including MADDs.

3.2 * 4 * 2 = 25.6
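A quick way to check the arithmetic, with the lane counts from the article and the convention that a multiply-add counts as two FLOPs:

```python
clock = 3.2e9                 # SPU clock (Hz)
lanes32 = 4                   # 32-bit lanes in a 128-bit register
flops_per_madd = 2            # a multiply-add counts as two FLOPs
gflops = clock * lanes32 * flops_per_madd / 1e9
int8_ops = clock * 16 / 1e9   # sixteen 8-bit lanes, one op each per cycle
print(gflops, int8_ops)       # 25.6 51.2
```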
 
OK, so it looks like all of you believe that this comparison is fair? That all the compilation options and code have received the same attention? Even in the documents, IBM states the effort needed to extract as much performance as possible from the SPEs, but gives no details on the implementation on the 970, for example.

For example, does anyone have a valid explanation, from an architectural point of view, for why a single SPE is 4 times more efficient at ray tracing than a 970+Altivec?
Please, go ahead, including Shifty Geezer. Maybe I have acquired a suitable education to understand you ;-) .

A couple of additional questions:
- why do they not publish SpecFP/SpecInt or any other public benchmark to compare with?
- why are they not using real hardware in so many cases?
 
Let me ask you, DarkRage: what in the world would be IBM's motive for talking down the 970? ;)

I mean, no doubt Cell got some 'best-case' stuff, but I don't think anyone went so far as to artificially cripple the competition. Plus, honestly, a lot of this testing seems to be a lot closer to 'real world' than we've gotten in the past as far as Cell goes.

Did you see the Mercury Systems demo? You would have to be quite the cynic to think that Cell won't show superior performance to a GP core in at least *some* tasks.

And IBM is highlighting those tasks for the purpose of these benchmarks.
 
DarkRage said:
I find it hard to believe.

So, with 8 cores we achieve a 30 times increase for the same program?
Each of the 8 cores being an extremely simplified version of the compared core.

Right, so 1 SPU with 20 million transistors (including local storage) is performing significantly better (4 times) than cores with several times the logic and storage.

Sure!!!
How was it? The SPE code was heavily optimized at the assembler level by several engineers, while the 970 code was developed in Java by a junior programmer?

I am not bitching about Cell. I like it. But I am pretty tired of IBM/Sony/MS/nVIDIA/ATI (in this order) treating us like a bunch of idiots.

They explained the 970's results in another paper (I don't remember which one), but the explanation is that the 970 is not FP-bound in the TRE program, it is memory-bound: a lot of cycles are wasted just waiting for memory. The SPEs, thanks to their fast local stores, are more capable of approaching their peak performance.
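That memory-bound-versus-compute-bound argument can be sketched with a simple roofline-style calculation. The 25.6 GFLOPS peak is from the article; the bandwidth and intensity figures below are purely illustrative stand-ins, not measurements:

```python
def attainable(peak_gflops, bw_gb_s, flops_per_byte):
    # Roofline model: throughput is capped by the lesser of peak compute
    # and what the memory system can feed (bandwidth * arithmetic intensity).
    return min(peak_gflops, bw_gb_s * flops_per_byte)

# A kernel doing ~1 FLOP per byte fetched (illustrative numbers):
print(attainable(25.6, 51.2, 1.0))   # fed from a fast local store -> 25.6
print(attainable(25.6, 3.0, 1.0))    # stalling on far memory -> 3.0
```

The same core can thus sit far below its FP peak purely because of where its data lives, which is the gap the SPE's local store is designed to close.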
 