aaronspink
The problem is that the SPEs are basically limited to the same or less programmable functionality of a modern GPU core while being much larger and power hungry.
Hmm, can you generate the code from your compute shader and run it immediately without a glitch?
Still, this "power hungry" CELL was unreachable by other CPUs a few years ago and was the #1 CPU on the Green500 list.
http://www.realworldtech.com/compute-efficiency-2012/
It took Intel (and IBM) 5 years to come close to CELL in performance. Yeah, this is a bad architecture. LOL
I mean SP of course, but GPUs are not a DP powerhouse either :smile:
And they significantly lack the flexibility of a modern SIMD CPU core while being not much smaller.
So how can a GP core that is actually more flexible than any "classic" CPU core be less programmable than a GPU?
You can use a cache where needed, with any cache policy you like, and prefetch the data you need in advance. None of this is possible on "classic" CPUs.
Program flow on an SPE is "scalar"; I'm not sure how better to describe it.
GPUs are SPMD machines and waste resources with their wide warps/wavefronts.
The SPE is predictable and can entirely hide data-transfer overhead on streaming workloads.
The "advanced" prefetchers that Intel uses are a joke compared to explicitly programmable DMA.
An effective ISA plus DMA allows 90%+ compute and bandwidth utilisation on a wide range of workloads.
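Here is roughly what I mean, as a minimal double-buffering sketch using the standard spu_mfcio.h MFC intrinsics. CHUNK, process_chunk() and the alignment/size assumptions are mine, just for illustration: while the SPU crunches one buffer, the MFC is already streaming in the next one, so the transfer latency disappears behind the computation.

    /* Double-buffered streaming on an SPU (sketch). Assumes ea and bytes
     * are 128-byte aligned and bytes is a multiple of CHUNK.
     */
    #include <spu_mfcio.h>

    #define CHUNK 16384   /* 16 KiB, the maximum size of a single DMA */

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    static void process_chunk(char *p, unsigned int n)
    {
        (void)p; (void)n;  /* real work goes here */
    }

    void stream(unsigned long long ea, unsigned long long bytes)
    {
        int cur = 0;
        /* kick off the first transfer on tag 0 */
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

        for (unsigned long long off = 0; off < bytes; off += CHUNK) {
            int next = cur ^ 1;
            /* prefetch the next chunk on the other tag while we work */
            if (off + CHUNK < bytes)
                mfc_get(buf[next], ea + off + CHUNK, CHUNK, next, 0, 0);

            /* wait only for the current buffer's tag, then process it */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            process_chunk(buf[cur], CHUNK);

            cur = next;
        }
    }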
The unified register file and the lack of condition flags eliminate a lot of the stalls and issues that PPC/ARM/x86 have when moving data between different computation units.
You still want the capability to do coherent transaction/message passing in order to co-ordinate and interact. But this isn't how the SPEs were architected.
The SPE has message-passing channels, as well as the SL1 cache for fast inter-SPE communication via atomic DMA.
DMA traffic is cache-coherent, so it is perfectly synchronised with the PPU. Coherence is not required for the actual data processing, because each core works on its own separate set of data.
All the temporary data is also irrelevant to the other cores. "Classic" CPUs just waste energy here on useless work.
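To make the channel part concrete, here is a minimal SPU-side sketch using the spu_mfcio.h mailbox calls; the message values and the handshake itself are invented for illustration, not taken from any real code.

    /* Minimal SPU-side mailbox handshake: tell the PPU a work item is
     * done and block until the PPU mails back the next command.
     */
    #include <spu_mfcio.h>

    #define MSG_DONE      0x1u
    #define CMD_SHUTDOWN  0xFFFFFFFFu

    unsigned int report_and_wait(void)
    {
        /* blocks if the outbound mailbox is full */
        spu_write_out_mbox(MSG_DONE);

        /* blocks until the PPU writes to our inbound mailbox */
        unsigned int cmd = spu_read_in_mbox();
        return cmd;   /* e.g. CMD_SHUTDOWN or a new work-item index */
    }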
I've read all the documentation as well as discussed the architecture with many who have programmed it. One doesn't need to program an architecture in order to understand its issues and shortcomings.
As if someone who read a book about sex somehow became less of a virgin.
BTW, sex and the SPU have at least two similarities
Where you see issues, I see opportunities. CELL is the #1 most brilliant CPU design for me. You can color me a CELL fanboy.
I don't know about HPC, but in games the particular problem can usually be reformulated to map effectively onto the underlying hardware.
Andrew Lauritzen
Having a unified address space is barely useful since it's not even cached... not even for read-only data (like GPUs).
1) You think of the GPU as a set of CPU-initiated kernels. IMO the GPU should act like a separate CPU core: pick up data by itself and traverse CPU data structures directly, without additional fuss from the CPU.
The discrete GPU is a dead end because of the perpetual interconnect bottleneck.
2) It is not cached on PC GPUs yet because GPUs are still in their infancy.
The ARM Mali T6xx GPU has a unified address space and allows cache coherency.
Next-gen APUs will have the same functionality one day.
You don't want to be pointer-chasing on SPUs (or GPUs) anyways
Why not? As long as it is not on the critical path.
In the Local Store, pointer-chasing is free. Immediate DMA pointer-chasing is not, but it is less of a problem than the anemic PPU.
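For the DMA case, this is roughly what the chase looks like from the SPU side: pull each node into Local Store, read its next pointer (an effective address), and issue the next get. The node layout, the 128-byte size/alignment, and the summing are my own illustration.

    /* Chasing a linked list that lives in main memory from an SPU.
     * Assumes each node is 128 bytes and 128-byte aligned in memory.
     */
    #include <spu_mfcio.h>

    typedef struct node {
        unsigned long long next;   /* EA of the next node, 0 terminates */
        int payload[30];           /* pads the node to 128 bytes */
    } node_t;

    static node_t ls_node __attribute__((aligned(128)));

    long long sum_list(unsigned long long ea)
    {
        long long sum = 0;
        const unsigned int tag = 0;

        while (ea != 0) {
            mfc_get(&ls_node, ea, sizeof(ls_node), tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();

            sum += ls_node.payload[0];   /* do something with the node */
            ea = ls_node.next;           /* serial dependency: latency shows here */
        }
        return sum;
    }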
I really want to give Cell the benefit of the doubt but as I mentioned retrospectively the memory hierarchy choices they made are just too crippling.
The PPU, with its "proper" memory hierarchy, can hardly achieve 10% of memory bandwidth under Linux without manual prefetching.
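By manual prefetching I mean something like this trivial streaming loop using GCC's __builtin_prefetch; the distance of eight 128-byte lines ahead is just a guess you would tune per workload, not a measured optimum.

    /* Manual software prefetch in a streaming reduction. On an in-order
     * core like the PPU, hinting lines ~1 KiB ahead of the current read
     * makes a large difference to sustained bandwidth.
     */
    #include <stddef.h>

    double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* touch the cache line ~8 lines (1 KiB) ahead; a prefetch
             * past the end of the array is only a hint and cannot fault */
            __builtin_prefetch(&a[i + 128], 0, 0);
            sum += a[i];
        }
        return sum;
    }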
Thus I agree with Aaron... either a CPU or a GPU is more efficient at the vast majority of interesting workloads.
That depends on the CPU core. The SPU can be several times faster than the PPU on the same unoptimised scalar code.
The problem with the SPU is the lack of an adequate programming environment and compiler. GCC completely sucks at this task. SPU programming is still a serious time sink, but at least it is fun.
Furthermore it came out around the same time that G80 did, which really is where the GPUs begun to put the nails in the coffin.
And the G80 was a gigantic GPU with three times more transistors than CELL.
CELL is not a GPU. So why not compare the GPU to x86? Did the GPU put the nails in the x86 coffin?
The fact that CELL is used as an RSX accelerator is a consequence of less-than-required GPU performance.
so I think it's pretty fair to conclude that Cell is going to look really bad in that comparison today.
7-billion-transistor 2012 GPUs vs. the 230-million-transistor 2005 CELL?
I'm writing this after a sleepless night of SPU coding over the weekend =)