So, I did remember correctly:
If we consider the DMA transfers (memory-LS communication) together with the computation, the theoretical speed-up should be the same as when the communication is not considered. But this is not the case, as the PPE cannot feed all the SPEs with the required input and write the output results back to the global main memory fast enough.
and
The expected performance gain was twice what we are actually getting. The reason is the idle time during which the SPEs are waiting for data from the PPE. The bottleneck is reordering data for the gather/scatter task. The causes are probably the low peak performance of the PPE (IPC throughput) and the number of L1/L2 cache misses (the task is memory bound).
5.1.5 Conclusions
The major constraint is the PPE's performance. The PPE is an in-order processor with very few functional units, which makes it difficult to get good performance for the critical task that runs on it. We cannot rely on executing critical code on such a processing unit; where possible or feasible, it should be ported to the SPEs.
The data we need for the element-loop computation is scattered sparsely across main memory and is non-trivial to access from the SPEs. To help with that task, the PPE gathers/scatters the required data into a sequential buffer. The main unsolved problem is that this PPE-driven gather/scatter approach is not efficient: the SPEs are only working 60% of the time in the element-loop computation and spend the rest idle, waiting for data from the PPE. Had we started a new code from scratch, we would have changed the data layout noticeably; the data layout can be an important handicap for DMA transfers to the SPEs.
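To make the quoted bottleneck concrete: an illustrative sketch (names and layout are hypothetical, this is not the project's code) of what "PPE-driven gather" amounts to. The PPE runs a scalar copy loop like this before each chunk of work, so the SPEs can pull one contiguous block instead of many scattered pieces:

/* PPE-side sketch of the gather stage (hypothetical names/layout).
   The PPE walks the scattered mesh data and packs what the next SPE work unit
   needs into a contiguous, DMA-friendly staging buffer.  This copy loop is
   plain scalar, in-order work on the PPE, which is why it becomes the bottleneck. */
void ppe_gather(const double *mesh_data,     /* scattered in main memory            */
                const int    *needed_idx,    /* indices this work unit requires     */
                double       *staging_buf,   /* contiguous buffer the SPE will DMA  */
                int           n)
{
    for (int i = 0; i < n; i++)
        staging_buf[i] = mesh_data[needed_idx[i]];
}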
The solution could be to include a stronger PPE (a Power6 or Power7 core) in the Cell/B.E. chip, or to implement the gather/scatter task on the SPEs using DMA lists (SPE-driven). In fact, the SPE-driven approach to gather/scatter can be implemented on the current hardware, but it requires work on DMA lists and possibly some reordering of the mesh data at start-up. It would also help if future Cell/B.E. implementations did not place constraints on DMA-list transfers of less than 16 bytes.
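As a rough illustration of that SPE-driven alternative (a minimal sketch, not the paper's code; the piece count, piece size, and the assumption that the PPE has already shipped over the low 32 bits of each piece's address are all made up for the example), the Cell SDK lets an SPE build a DMA list in local store and fetch all the scattered pieces with a single mfc_getl command:

#include <stdint.h>
#include <spu_mfcio.h>   /* Cell SDK: MFC DMA intrinsics and mfc_list_element_t */

#define N_PIECES   64
#define PIECE_SIZE 128   /* bytes per piece; multiples of 16 avoid the <16-byte list constraints */

/* DMA lists must be 8-byte aligned in local store; the target buffer is kept 128-byte aligned. */
static mfc_list_element_t dma_list[N_PIECES] __attribute__((aligned(8)));
static char gather_buf[N_PIECES * PIECE_SIZE] __attribute__((aligned(128)));

/* Gather n scattered pieces from main memory into one contiguous local-store buffer.
   ea_base supplies the effective address whose upper 32 bits are shared by all pieces;
   piece_eal[] holds the lower 32 bits of each piece (hypothetical layout). */
void gather_elements(uint64_t ea_base, const uint32_t *piece_eal, unsigned int n)
{
    const unsigned int tag = 0;

    for (unsigned int i = 0; i < n; i++) {
        dma_list[i].notify = 0;
        dma_list[i].size   = PIECE_SIZE;
        dma_list[i].eal    = piece_eal[i];
    }

    /* One list command replaces n separate PPE-side copies. */
    mfc_getl(gather_buf, ea_base, dma_list, n * sizeof(mfc_list_element_t), tag, 0, 0);

    /* Block until the whole gather has landed before computing on gather_buf. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}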
We have moved the element-loop computation onto the SPEs; the second remaining computational part is the solver. Most of the time the solver executes SpMV (sparse matrix-vector multiplication). To optimize the program further, we could provide an internal/external library with such linear algebra operations on the SPEs.
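For reference, the SpMV the solver spends its time in is just the textbook CSR kernel below (a sketch of the operation such a library would provide, not the paper's implementation; an SPE version would additionally tile rows into local-store-sized blocks and double-buffer the DMA transfers):

/* y = A * x for a sparse matrix A stored in compressed sparse row (CSR) form. */
void spmv_csr(int n_rows,
              const int    *row_ptr,   /* length n_rows + 1: start of each row in val/col_idx */
              const int    *col_idx,   /* length nnz: column index of each non-zero           */
              const double *val,       /* length nnz: non-zero values                         */
              const double *x,         /* input vector                                        */
              double       *y)         /* output vector                                       */
{
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}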
Mike Acton also reached this conclusion and discussed it on these forums: the PPE feeding bottleneck could be overcome by having the SPEs do all of this themselves.
Note, by the way, that only three of the projects were actually optimised to use the SPEs at all, as discussed in section 6-5 (Alya, BSIT and Siesta). Some of the others were ported but only used the PPE.
SIMD (Single Instruction, Multiple Data) is processing in which a single instruction operates on multiple data elements that make up a vector data type. SIMD instructions are also called vector instructions. This style of programming implements data-level parallelism. Most processor architectures available today have been augmented, at some point in their design evolution, with short-vector SIMD extensions. Examples include Streaming SIMD Extensions (SSE) for x86 processors and PowerPC's AltiVec (also known as the Velocity Engine or VMX). The different architectures exhibit large differences in their capabilities. The vector size is either 64 bits or, more commonly, 128 bits. The register file size ranges from just a few to as many as 256 registers. Some extensions only support integer types, others operate on single-precision floating-point numbers, and yet others process double-precision values. Today, the Synergistic Processing Element (SPE) of the CELL processor can probably be considered the state of the art in short-vector SIMD processing. Possessing 128-bit-wide registers (a 128-entry register file) and a fully pipelined fused multiply-add instruction, it is capable of completing as many as eight single-precision floating-point operations each clock cycle.
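To make the quoted description concrete, here is the same multiply-add written scalar and then with the SSE extension the excerpt mentions (an illustrative sketch only; note that SSE needs separate multiply and add instructions, whereas the SPE's fused multiply-add on a 4-wide vector is what gets it to eight single-precision flops per cycle):

#include <xmmintrin.h>   /* SSE: 128-bit vectors holding four single-precision floats */

/* c[i] += a[i] * b[i], one element per iteration. */
void madd_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] += a[i] * b[i];
}

/* Same computation, four elements per instruction using 128-bit SSE registers. */
void madd_sse(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(c + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
    for (; i < n; i++)        /* scalar tail for leftover elements */
        c[i] += a[i] * b[i];
}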
This was in 2009, but I don't think this part changed from the original Cell, so their SIMD has at least managed to stay 'state of the art', according to this paper.
Tool-wise, Cell never really has had a good set. One thing you notice pretty quickly is that anything you want to optimize kernel-wise almost always has a call/function within MKL. Cell's tools and libraries are severely lacking even today compared to GPGPU. A large part of Cell's issues stem from it being a relatively exotic architecture with very poor tool support. It is interesting to contrast Cell in that regard with the data that came out of the TACC workshop on Intel's MIC, which had large, full scientific workloads ported in days and with good scaling.
I don't think many people disagree with you on this one. As the article itself says, 'the control is put back into the hands of the programmer', with the downside that you have to do a lot of work you otherwise wouldn't have. Even with some of the tools maturing and becoming quite good (and the Cell simulator is pretty helpful too, as it lets you follow your data exactly), Cell remains too much of a specialist device to become very mainstream. But it is still a really, really cool design and chip to have in a console, and of all the things that have gotten in the way of the PS3, I'd say the blue laser was a far bigger issue (delaying its launch), along with the company that made the device being crap at the software side of things.
Thanks for digging up these papers by the way.
EDIT: and respect to the guys working on this. They really know their stuff and it is surprising how far they got with the amount of time they've spent.