london-boy said: Very nice, lots of detail, especially about efficiency!!
BlueTsunami said: Off topic: Nice Red Dots L-B!
SPE performance
The SPE is a modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). An SPU is a compute engine with SIMD support and 256KB of dedicated local storage. The MFC contains a DMA controller with an associated MMU, as well as an Atomic Unit to handle synchronization operations with other SPUs and the PPU.
An SPU is a dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations. The SPU operates directly on instructions and data from its dedicated local store, and relies on a channel interface to access the main memory and other local stores. The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with the program execution.
Figure 3. SPE block diagram
The SPU's SIMD support can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6GFLOPs in single precision. Figure 3 shows the main functional units in an SPU: (1) an SPU floating point unit for single-precision, double-precision, and integer multiplies; (2) an even SPU fixed-point unit for arithmetic, logical operations, and word shifts; (3) an odd SPU fixed-point unit for permutes, shuffles, and quadword rotates; (4) an SPU control unit for instruction sequencing and branch execution; (5) an SPU local store unit for loads and stores, which also supplies instructions to the control unit; (6) an SPU channel/DMA transport, which controls input and output through the MFC.
As Figure 3 shows, each functional unit is assigned to one of the two execution pipelines. The floating point and fixed point units are on the even pipeline while the rest of the functional units are on the odd pipeline. The SPU can issue and complete up to two instructions per cycle, one on each of the execution pipelines. A dual issue occurs when a group of fetched instructions has two issuable instructions, one of which is executed by a unit on the even pipeline and the other executed by a unit on the odd pipeline.
There are three types of instruction fetches: flush-initiated fetches, inline prefetches, and hint fetches. To fetch instructions, the instruction fetch logic reads 32 instructions at a time into its instruction line buffer (ILB), from which two instructions at a time are sent to the issue logic. When the operands are ready, the issue logic sends the instructions to the functional units for execution. Functional unit pipelines vary from two to seven cycles. Hint instructions can preload instructions into the ILB.
Features such as a deterministic LS access time, simple issue rules, software-inserted branch hints, a large register file, and so on, are exposed to the compiler and applications for performance tuning. With some tuning effort, we have seen a wide variety of applications approach the theoretical IPC of 2 in the SPU. The SPEs' DMA support delivers data-streaming bandwidth much higher than that of many modern processors.
macabre said: Nice!
Titanio said: Interesting that we finally got some absolute figures for the Terrain Renderer: 30fps on Cell, 0.85fps on a 2.7GHz G5/VMX (or, scaled up linearly to 3.2GHz, about 1fps).
DarkRage said: I find it hard to believe.
So, with 8 cores we achieve a 30-times increase for the same program?
Each of the 8 cores being an extremely simplified version of the compared core.
Right, so 1 SPU with 20 million transistors (including local storage) is performing significantly better (4 times) than cores with several times the logic and storage.
Sure!!!
How was it? Code for the SPE was heavily optimized at assembly level by several engineers, and code for the 970 chip was developed in Java by a junior programmer?
I am not bitching about Cell. I like it. But I am pretty tired of IBM/Sony/MS/nVIDIA/ATI (in this order) treating us like a bunch of idiots.
DarkRage said: Right, so 1 SPU with 20 million transistors (including local storage) is performing significantly better (4 times) than cores with several times the logic and storage.
Yes, because it works differently and is optimized for this sort of work. In exactly the same way, 20 million GPU transistors can eclipse 20 million CPU transistors in graphics work, while 20 million CPU transistors can eclipse 20 million GPU transistors in running an OS.
DarkRage said: I am not bitching about Cell. I like it. But I am pretty tired of IBM/Sony/MS/nVIDIA/ATI (in this order) treating us like a bunch of idiots.
Except that they're not treating us like idiots. If you don't understand the technology, how it works, and how these real-world performances are achieved, it's your fault for not acquiring a suitable education. These aren't imaginary numbers at work here. These are targeted design decisions being used as they're supposed to be (and of course the test subjects are weighted in the SPEs' favour by applying them to the sorts of tasks they're good at; I'm sure there are plenty of benchmarks that'd show where SPEs are inferior to typical CPUs).
The SPU's SIMD support can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6GFLOPs in single precision.
Ingenu said: ermh...
3.2GHz * 4 FLOPS = 12.8GFLOPS
Should I conclude that they suck at maths or that they don't reread their papers?