CELL Docs: "Translated" for the common folk

Acert93

http://www.tomshardware.com/hardnews/20050825_190713.html

You will want to go to their site to read the entire article.

But here is where Cell's architecture becomes truly unique: No SPE has a view into system memory. In Intel's multicore technology, for instance, all processing cores operate as fully-capable CPUs unto themselves, with equivalent access to system memory whose arbiter is a memory controller looking over the front-side bus. In Cell architecture, only the PowerPC element (PPE) has a view of system memory, and there can be as few as one of these elements within a processor. The PPE is the only conventional processing element, with complete access to system functions (or sharing that access with another PPE, when present). Those system functions include directing another processing element -- which hasn't been discussed in detail until today -- with the more familiar-sounding name of Memory Interface Controller. The MIC fetches swatches of memory for the SPEs, providing them with a shared, collective "sandbox."

Here is where cache organization plays a critical role. Each PPE has its own L1 cache, as you might expect, which is not shared with other PPEs. Performance is boosted -- as with the Power processors we've seen to date -- by an L2 cache, the size of which appears not to be limited by the spec. For the SPEs, there is a separate and new type of cache called the SL1. All SPEs within a group share a single SL1 cache. This cache is the only world they know. In conventional caching, the processor addresses data in memory by its absolute address, but caches provide that memory as though it were being provided directly from system RAM. But SPEs are little computers, and the SL1 cache is their system RAM. The memory controller acquires the products of their work like a teacher picking up after her students at the end of class.

Another unique revelation of the Cell 1.0 specification is an apparent second order of element grouping -- or, translated into an Intel context, a "multi-multicore" possibility. A CBEA-compliant processor package can contain groups of PPE elements and separate groups of SPE elements. Judging from the algebra IBM uses to describe the interaction between elements, there need not necessarily be as many PPE groups as SPE groups. This is important because it indicates that grouping isn't necessarily the product of simply sandwiching multiple Cell processors together, although the specification deals entirely with logic and not packaging. It's therefore conceivable that a Cell processor vendor could create multiple performance tiers by integrating any number of SPEs (probably in multiples of two) with one, two, or three PPEs.

I found this part interesting for CELL's future. I was thinking that the PS4 could have ended up with 4 CELL processors... which is 32 cores/elements. That is NUTS to program for :oops:

But it seems that the SPEs are interchangeable and upgradable, so the PS4 may contain, say, 2 CELLs, each CELL having, let's say, 2 beefed-up PPEs with more cache, and the SPEs likewise beefed up in performance and workflow, with more memory (let's say 512K or even 1MB).

That way you do not have to end up with an insane number of cores. You would still have more but could scale the chip to how it best works for the designated task.

Out of the entire article I found this bit most interesting, if only because there has been a lot of discussion about the independence and abilities of the SPEs. The "master-slave" arrangement was an issue of much debate 8 months ago.

Like the co-processor of ancient days, an SPE is subordinate to the PowerPC element, and performs no system management functions whatsoever. Instead, it can be delegated user-specific tasks, especially graphics processing, which can take advantage of the SPE's Single Instruction/Multiple Data (SIMD) architecture.
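That delegation is easy to picture in code. On the PPE side it boils down to: create an SPE context, load a program into it, run it. A minimal sketch, assuming IBM's later libspe2 API (spe_context_create / spe_program_load / spe_context_run) rather than anything in the 2005 spec itself:

    /* PPE "master" side sketch, assuming the later libspe2 API:
       create an SPE context, load an SPE program into its local
       store, and run the delegated task to completion. */
    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t my_spe_task; /* embedded SPE binary (name assumed) */

    int main(void)
    {
        spe_context_ptr_t spe = spe_context_create(0, NULL);
        if (!spe || spe_program_load(spe, &my_spe_task)) {
            fprintf(stderr, "failed to set up SPE\n");
            return 1;
        }
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop;
        /* Blocks until the SPE program stops; argp/envp would pass work parameters. */
        spe_context_run(spe, &entry, 0, NULL, NULL, &stop);
        spe_context_destroy(spe);
        return 0;
    }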

Again, an interesting read.
 
For the SPEs, there is a separate and new type of cache called the SL1. All SPEs within a group share a single SL1 cache. This cache is the only world they know. In conventional caching, the processor addresses data in memory by its absolute address, but caches provide that memory as though it were being provided directly from system RAM. But SPEs are little computers, and the SL1 cache is their system RAM.
Looks like a misinterpretation of the spec here. The author has confused the SL1 cache with the SPE Local Store. SL1 doesn't appear to be in the current implementation of Cell, and is only meant to cache data for DMA transfers.
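To make the distinction concrete: the Local Store is the only memory an SPE can address directly, and everything else has to be pulled in by explicit DMA. A rough SPE-side sketch, assuming the mfc_get / tag-wait intrinsics from IBM's SPU toolchain (spu_mfcio.h); the buffer size and tag are illustrative:

    /* SPE-side sketch: DMA a chunk of main memory into the 256 KB
       Local Store, wait for the transfer, then compute on it locally. */
    #include <spu_mfcio.h>

    #define CHUNK 16384
    static volatile char buf[CHUNK] __attribute__((aligned(128)));

    void fetch_and_process(unsigned long long ea /* effective address in main memory */)
    {
        unsigned int tag = 0;
        mfc_get(buf, ea, CHUNK, tag, 0, 0);  /* DMA: main memory -> Local Store */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();           /* wait for the transfer to finish */
        /* ... work on buf entirely out of Local Store ... */
    }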

Acert93 said:
But it seems that the SPEs are interchangeable and upgradable, so the PS4 may contain, say, 2 CELLs, each CELL having, let's say, 2 beefed-up PPEs with more cache, and the SPEs likewise beefed up in performance and workflow, with more memory (let's say 512K or even 1MB).
Even though SPEs have a specified addressable range of 32 bits for the LS, it seems the fixed-length 32-bit instruction encoding effectively forces a 256K segmented addressing scheme (the load immediate address instruction, ila, takes one 18-bit immediate argument and one register argument). I wonder if it would be more beneficial to have more SPEs than to boost the LS of each SPE, given more die area.
 
Silkworm said:
Even though SPEs have a specified addressable range of 32 bits for the LS, it seems the fixed-length 32-bit instruction encoding effectively forces a 256K segmented addressing scheme (the load immediate address instruction, ila, takes one 18-bit immediate argument and one register argument). I wonder if it would be more beneficial to have more SPEs than to boost the LS of each SPE, given more die area.

By that argument MIPS would only be able to access 64K of memory, as its load-immediate-equivalent instruction has a 16-bit field... (OK, 96K to be pedantic, by selecting addi or ori.)
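For the non-assembly folk, what both ISAs do is build a full 32-bit constant out of two 16-bit immediates (lui+ori on MIPS; the SPU has the analogous ilhu/iohl pair). A sketch of the arithmetic in plain C:

    #include <stdint.h>

    /* What a lui+ori pair (MIPS) or ilhu+iohl pair (SPU) computes: a full
       32-bit constant built from two 16-bit immediates. So the 18-bit field
       of ila limits single-instruction reach, not total addressability. */
    uint32_t make_const32(uint16_t upper, uint16_t lower)
    {
        uint32_t r = (uint32_t)upper << 16; /* lui / ilhu: load upper halfword */
        r |= lower;                         /* ori / iohl: OR in lower halfword */
        return r;
    }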
 
Silkworm said:
I wonder if it would be more beneficial to have more SPEs than to boost the LS of each SPE, given more die area.

Looking forward (say 5 years), I think the problem is having too many cores/elements. Not only do developers have the problem of breaking things up, they also run into the issue of keeping them busy. Having possibly dozens of cores idle or under-utilized is a lot of wasted transistor space.

Now maybe there will be a paradigm shift, and at some point parallelization like on GPUs will take off and a streaming design will chew through stuff like nobody's business. But a 4-way CELL (40 cores/elements!) sounds nuts. In theory, fewer but more productive cores seems like a better approach; e.g. for the same transistor budget a 4-PPC/1MB-cache chip could be made with ~150GFLOPs theoretical performance. While a 1:8 Cell may stomp on that, can programmers realistically find ways to keep a 40-core CELL fed? How would a 16-core PPC compare?
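For reference, the back-of-envelope formula behind numbers like these is peak = cores x SIMD width x FLOPs per lane per cycle x clock. A quick sketch with illustrative figures of my own (not spec numbers):

    /* Peak-FLOPs arithmetic behind comparisons like the one above.
       All figures here are illustrative assumptions, not spec numbers. */
    #include <stdio.h>

    static double peak_gflops(int cores, int simd_width, int flops_per_lane, double ghz)
    {
        return cores * simd_width * flops_per_lane * ghz;
    }

    int main(void)
    {
        /* e.g. 8 SPEs, 4-wide single precision, multiply-add = 2 FLOPs, 3.2 GHz */
        printf("%.1f GFLOPs peak\n", peak_gflops(8, 4, 2, 3.2)); /* 204.8 */
        return 0;
    }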

I could be wrong, but a streaming design seems to work well for GPU-type work. Increasing pipes (cores) on a GPU has a BIG benefit due to its workload. But a CPU won't see any benefit if programmers are unable to make their code scalable/parallel.

I guess my thoughts are more "How in the world are they going to make this work?" than anything. While CELL in the PS3 seems manageable with some elbow grease, time, and money, I have a hard time envisioning scenarios with 6 dozen cores being efficient.

I guess that is the root of my thought: do a new revision of Cell with more capable SPEs that can have more memory. Obviously developers would like more (especially in 5-6 years, when I can see 256KB really being a limitation), and it could also streamline their workload some. More powerful, more capable SPEs :D Sounds good to me... having 32 SPEs running around sounds... like a headache :?
 
Acert93 said:
Looking forward (say 5 years), I think the problem is having too many cores/elements. Not only do developers have the problem of breaking things up, they also run into the issue of keeping them busy. Having possibly dozens of cores idle or under-utilized is a lot of wasted transistor space.

Now maybe there will be a paradigm shift, and at some point parallelization like on GPUs will take off and a streaming design will chew through stuff like nobody's business. But a 4-way CELL (40 cores/elements!) sounds nuts. In theory, fewer but more productive cores seems like a better approach; e.g. for the same transistor budget a 4-PPC/1MB-cache chip could be made with ~150GFLOPs theoretical performance. While a 1:8 Cell may stomp on that, can programmers realistically find ways to keep a 40-core CELL fed? How would a 16-core PPC compare?

I think one way or another there's going to need to be some 'easy' way to thread code at some point if Cell - and future architectures like it - are really to take off. And in that sense, 40 cores or 400 cores, it shouldn't be much more difficult to code for. Probably inefficient as hell, though. Still, the way it's all set up as it stands, it seems there are only going to be two things for developers to worry about (and not that they're small worries) - coding for the PPE and coding for the SPEs. As for what core gets what workload, it seems the Cell(s) will determine that for itself/themselves.

It's all kind of 'holy grail' type talk for now anyway, but for Cell's part there's also this from EETimes today:

...IBM is still working on a revolutionary tool — a single-source compiler that can take in a single code stream and map threads onto processor resources using a SIMD model and the explicit management of the Cell synergistic processors' local storage.

The compiler can deal with the complexities of routing streaming data through the chip as well. The single-source compiler is still under investigation and not scheduled for release. However, it was the subject of a paper at this year's Cool Chips conference in Japan...

Intel is working on something that pushes the envelope even further: the 'speculative threading' model/architecture for OOO cores discussed at IDF.

Speculative Threading
 
As I remember from the initial ramblings about Cell, the original concept was apulets that could be shared across as many SPEs as the system has. Just as GPUs take a scene and divide the shading across all the GPU's pixel shaders, a Cell system would take a physics solution and divide it amongst all available SPEs. Remember the talk of connecting your Cell box to your Cell TV and having it help with whatever workloads there are?

More recent talk has the idea of single activities per SPE, passing data between them, which is an option. That is, set one SPE to physics, another to procedural textures, another to AI, another to sound. I'm not sure this is the right way. If those processes can be redesigned to be segmentable, much like 3D scene rendering has become, the software can scale with the hardware. If software needs to be targeted at a specific number of cores, then the dream of Cell is a no-hoper. Perhaps in the PS3's initial life the transition to streamed processing of all and sundry tasks will be rather limited, and SPEs will have targeted applications instead of dipping into the work pool to dig out whatever task currently needs doing. But the future surely must hold a paradigm shift, in the same way unified shaders are offering processors that dip into the work pool to do whatever tasks need doing.
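That "dip into the work pool" model is nothing exotic in software terms. A minimal sketch with ordinary POSIX threads standing in for SPEs pulling from a shared task queue (the task names are made up for illustration):

    /* Work-pool sketch: workers (standing in for SPEs) pull whatever
       task is next, rather than being hard-wired to physics, AI, etc. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS 8
    static const char *tasks[NTASKS] = {
        "physics", "textures", "AI", "sound",
        "physics", "physics", "AI", "sound"
    };
    static int next_task = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int i = next_task < NTASKS ? next_task++ : -1;
            pthread_mutex_unlock(&lock);
            if (i < 0) return NULL;              /* pool empty: stop */
            printf("worker %ld runs %s\n", id, tasks[i]);
        }
    }

    int main(void)
    {
        pthread_t w[4];
        for (long i = 0; i < 4; i++) pthread_create(&w[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; i++) pthread_join(w[i], NULL);
        return 0;
    }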

And as regards inefficiencies, they aren't a huge concern. The point of multiprocessing is to get past the single-core bottlenecks. If you can cram 400 cores onto a die the size of one megalithic core, then even if 200 of those cores sit idle in gross inefficiency, the multicore processor is still a worthwhile change if it outperforms the single core 10:1 overall.

If I were in charge of Cell I wouldn't look to modify the basics of the architecture (the 256KB LS, for example) for a long while, as all apulets need the same resources no matter the source if all programs are to be portable. A new Cell with a 512KB LS would have new programs incompatible with older Cell hardware, which fragments the system. I'd work on software tools to help apulet development and multi-Cell scaling, and maintain a base architecture that all Cell developers write for, to maintain portability.
 
Acert93 said:
The "master-slave" arrangement was an issue of much debate 8 months ago.
Was it really an issue of debate? I would think the numeric arrangement alone explains that there's one element that is mostly in charge of the others.

That said, this
Like the co-processor of ancient days, an SPE is subordinate to the PowerPC element, and performs no system management functions whatsoever.
is just plain wrong regardless.
 
I don't think Cell will evolve much; it is a good enough design for the here and now... but I think Bluegene/Cyclops will be more radical, and a better fit for the future.
 
How about having an evolved OS when they get to ridiculous levels of 40+ cores? Why not have a program that will spread the work out for you - a program for the programmers that does the really nasty work?

Not everyone would be able to handle programming for 40+ cores.
 
That's what the original idea behind Cell was. Programs are built out of smaller programs called apulets that run on SPEs, and these are spread around whatever SPEs are available.
 