The memory 'topology' of Cell

JF_Aidan_Pryde · May 6, 2005

As I currently understand it, Cell has a shared (SMP) main memory. That is to say, all 9 CPUs on the chip
- Share the main memory address space
- Have coherent access to it

In that sense, it's a 9 way SMP system with a single memory controller with 25GB/s of memory bandwidth.

The little SPE computers are still eluding me. Obviously they can communicate with each other through saving and loading from main memory, but they are better than that. The 8 SPEs are connected via a high speed internal bus (EIB) that should allow direct communication. The problem is the SPE local store is a private, non-coherent memory space. But it does have an alias in the system memory map.

Q1:
Now what does this main memory map cover?
- Main memory
- SPE alias (enough for n SPEs, since the number of SPEs on the network should be transparent to the app)
- Locking L2 cache?

Q2:
What's the most efficient way SPEs can talk to each other?
- Writing to main memory is the obvious way
- For streaming, load appulet to SPE0 LS, store temp results to SPE0 LS, then pass them to SPE1 LS via DMA?
- Can I write directly to SPE1's LS alias in code and let the hardware handle the DMA translation? (isn't this dangerous due to lack of coherence?)

Q3:
Is it accurate to describe Cell as a 9-way SMP system AND an 8-way message passing system with private memory? In other words, Cell is not NUMA, since NUMA requires:
1. Shared memory (SPEs don't share their LS)
2. Non-uniform access (main memory is uniform access).

Thanks!

overclocked · May 6, 2005

Can someone really answer that without having programmed for it?

Jawed · May 6, 2005

http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell

http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D

Jawed

version · May 6, 2005

JF_Aidan_Pryde said:
As I currently understand it, Cell has a shared (SMP) main memory. That is to say, all 9 CPUs on the chip
- Share the main memory address space
- Have coherent access to it

In that sense, it's a 9 way SMP system with a single memory controller with 25GB/s of memory bandwidth.

The little SPE computers are still eluding me. Obviously they can communicate with each other through saving and loading from main memory, but they are better than that. The 8 SPEs are connected via a high speed internal bus (EIB) that should allow direct communication. The problem is the SPE local store is a private, non-coherent memory space. But it does have an alias in the system memory map.

Q1:
Now what does this main memory map cover?
- Main memory
- SPE alias (enough for n SPEs, since the number of SPEs on the network should be transparent to the app)
- Locking L2 cache?

Q2:
What's the most efficient way SPEs can talk to each other?
- Writing to main memory is the obvious way
- For streaming, load appulet to SPE0 LS, store temp results to SPE0 LS, then pass them to SPE1 LS via DMA?
- Can I write directly to SPE1's LS alias in code and let the hardware handle the DMA translation? (isn't this dangerous due to lack of coherence?)

Q3:
Is it accurate to describe Cell as a 9-way SMP system AND an 8-way message passing system with private memory? In other words, Cell is not NUMA, since NUMA requires:
1. Shared memory (SPEs don't share their LS)
2. Non-uniform access (main memory is uniform access).

Thanks!

Q2:

1. stream from LS to LS
2. synchronous mode: LS to/from channel

JF_Aidan_Pryde · May 7, 2005

Jawed,
Thanks! The MPR pdf answered my question: "another SPE can use the DMA controller to move data to an address range that is mapped onto a local store of another SPE or even to itself."
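
To make that quote concrete, here is a minimal Python sketch of the idea: a DMA transfer names effective addresses, and the address map decides whether each one lands in main memory or in some SPE's local store. The `Machine` class, `dma_copy`, `LS_ALIAS_BASE` and the toy main memory size are all made up for illustration and are not taken from any Cell documentation; only the 256 KB local store size is a real figure.

```python
LS_SIZE = 256 * 1024          # each SPE local store is 256 KB
MAIN_SIZE = 1024 * 1024       # toy main memory (size is an assumption)
LS_ALIAS_BASE = 0x4000_0000   # hypothetical base of the LS alias window


class Machine:
    def __init__(self, num_spes=8):
        self.main = bytearray(MAIN_SIZE)
        self.ls = [bytearray(LS_SIZE) for _ in range(num_spes)]

    def ls_alias(self, spe):
        """Effective address at which the local store of `spe` is aliased."""
        return LS_ALIAS_BASE + spe * LS_SIZE

    def resolve(self, ea):
        """Map an effective address to (backing buffer, offset)."""
        if 0 <= ea < MAIN_SIZE:
            return self.main, ea
        index, offset = divmod(ea - LS_ALIAS_BASE, LS_SIZE)
        if ea >= LS_ALIAS_BASE and index < len(self.ls):
            return self.ls[index], offset
        raise ValueError(f"unmapped effective address {ea:#x}")

    def dma_copy(self, src_ea, dst_ea, size):
        """Copy `size` bytes between any two mapped effective addresses."""
        src, s = self.resolve(src_ea)
        dst, d = self.resolve(dst_ea)
        if s + size > len(src) or d + size > len(dst):
            raise ValueError("transfer crosses a region boundary")
        dst[d:d + size] = src[s:s + size]


m = Machine()
m.ls[0][0:4] = b"temp"                             # SPE0 leaves a result in its own LS
m.dma_copy(m.ls_alias(0), m.ls_alias(1) + 64, 4)   # SPE0 LS -> SPE1 LS, main memory untouched
```

In this toy model the LS-to-LS transfer never touches `m.main`, which is the point of the alias: the same DMA call works for main memory and for another SPE's local store.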

version said:
2. synchronous mode: LS to/from channel

Can you explain this more?

Jawed · May 7, 2005

JF_Aidan_Pryde said:
Jawed,
Thanks! The MPR pdf answered my question: "another SPE can use the DMA controller to move data to an address range that is mapped onto a local store of another SPE or even to itself."

My understanding of Cell is that SPE pipelining will be controlled by the OS. As the programmer it only makes sense to program "one" SPE per thread. The OS will decide that your program can be pipelined across, say, 4 SPEs. As the load on Cell varies, that pipeline may or may not change in length...

If you write a multi-threaded program then, again, the OS will decide how to assign each thread to SPE(s).

In other words I think it'll be like programming a GPU, where it's not possible to program meaningfully at the lowest level in terms of specific pipelines or memory spaces. All you'll be able to do is code in terms of threads, pipelines and data structures.

Jawed

JF_Aidan_Pryde · May 7, 2005

Jawed,
I am looking at the GDC 05 Cell slides, and they say SPEs can be virtualised by the OS, or applications can control SPE sharing. In the latter case, it "allows maximum utilization of the fixed resource" and "requires SPE management code in the application". It also says "each developer is free to create their own".

It's on the Japanese site PC Watch (Impress) somewhere, but I can't find it. The file name is kaigai028.jpg and it should be in some story regarding Cell info from GDC 2005.

one · May 7, 2005

JF_Aidan_Pryde said:
Jawed,
I am looking at the GDC 05 Cell slides, and they say SPEs can be virtualised by the OS, or applications can control SPE sharing. In the latter case, it "allows maximum utilization of the fixed resource" and "requires SPE management code in the application". It also says "each developer is free to create their own".

It's on the Japanese site PC Watch (Impress) somewhere, but I can't find it. The file name is kaigai028.jpg and it should be in some story regarding Cell info from GDC 2005.

http://www.research.scea.com/research/html/CellGDC05/index.html

JF_Aidan_Pryde · May 7, 2005

Found it:
http://pc.watch.impress.co.jp/docs/2005/0310/kaigai165.htm

aaaaa00 · May 8, 2005

JF_Aidan_Pryde said:
As I currently understand it, Cell has a shared (SMP) main memory. That is to say, all 9 CPUs on the chip
- Share the main memory address space
- Have coherent access to it

In that sense, it's a 9 way SMP system with a single memory controller with 25GB/s of memory bandwidth.

Just a minor note about terminology: you can't call it an SMP if the processors in it aren't identical. SMP also refers to the makeup of the processors in the machine.

randycat99 · May 8, 2005

I think two entirely different "SMP"s are being referred to here. In the context of the original statement, they aren't referring to symmetric multiprocessing, imo.

...but if a distinction is to be made, I would say that the presence of 8 nearly identical (if not identical) SPUs is far enough into SMP territory to carry the name satisfactorily. Whether or not work is symmetrically divided amongst them is really a matter of code implementation, rather than a hardware distinction.

aaaaa00 · May 8, 2005

randycat99 said:
I think two entirely different "SMP"s are being referred to here. In the context of the original statement, they aren't referring to symmetric multiprocessing, imo.

...but if a distinction is to be made, I would say that the presence of 8 nearly identical (if not identical) SPUs is far enough into SMP territory to carry the name satisfactorily. Whether or not work is symmetrically divided amongst them is really a matter of code implementation, rather than a hardware distinction.

The fact that there's a PPU in there intended to run the main OS and coordinate the SPUs pretty much ruins the definition of Cell as an SMP.

SMP means "all the CPUs in the machine are identical, they all have the same instruction set, they all see the same memory in the same way, and any thread running on one CPU in the machine can be scheduled on to any other CPU in the machine."

randycat99 · May 8, 2005

For the coming architectures, this strict definition will make itself obsolete quite quickly. Once you get down to the SPU level (where the real work will get done), it's all good with the definition, imo. If it makes you more comfortable, maybe SMcP (co-processing) is more apt? Either way, it's really splitting hairs.

...back to the topic, however, I believe the "SMP" that was referred to earlier was intended as "shared memory pool", maybe? Is that a problem to reconcile, as well?

JF_Aidan_Pryde · May 8, 2005

What I'm really not sure about is whether Cell falls into the NUMA category.

NUMA requires both shared memory and non-uniform access. Are the local stores of the SPEs shared? They are in the sense that they appear in the same address space. But they are not in the sense that they are private. Hum.

Npl · May 8, 2005

JF_Aidan_Pryde said:
What I'm really not sure about is whether Cell falls into the NUMA category.

NUMA requires both shared memory and non-uniform access. Are the local stores of the SPEs shared? They are in the sense that they appear in the same address space. But they are not in the sense that they are private. Hum.

Leave the SPEs out of the picture for a moment. If you connect multiple Cells together, each will have its own pool of memory through its XDR interface. But that memory is still shared and accessible from all Cells. = NUMA architecture. (It's possible, and IMHO likely, that the PS3 GPU will be using the same topology.)

Setting up an SPE, then, is something entirely different; it's rather a separate system in itself. You send the SPEs their "work", whether the SPE is on the same chip, in the same system or somewhere else in the network.

aaaaa00 · May 8, 2005

Randycat99, you can't just change the definition of SMP to suit yourself.

CELL is an asymmetric multi-processor. Because not all of the processors inside it are identical, by definition it cannot be called an SMP.

randycat99 · May 8, 2005

You can't be too rigid, either, or you end up with only a "single thing" matching the designation (where if that was the case, what was the point in coming up with a classification for a group, in the first place). I think the point is noted that you are very bothered with PS3 and SMP being associated together.

In any case, it seems clear that you had mistaken the context of SMP where it appeared first in an earlier comment. It was not referring to CPU configuration, anyway, rather shared-memory-pool.

aaaaa00 · May 8, 2005

randycat99 said:
In any case, it seems clear that you had mistaken the context of SMP where it appeared first in an earlier comment. It was not referring to CPU configuration, anyway, rather shared-memory-pool.

SMP refers to both memory configuration and CPUs.

For example:

A dual processor Xeon is an SMP, because it has two identical Xeon processors attached to one bank of memory. CPU1 can get to all memory at the same speed CPU2 can.

Another example:

A dual processor AMD64 is a NUMA, because it has two identical AMD64 processors, but it has two separate banks of memory attached to the two processors, i.e. CPU1 can get to memory attached to CPU1 quickly, but is slower getting to memory attached to CPU2.

CELL probably doesn't qualify as either SMP or NUMA.

The closest analog is probably a cluster or message passing architecture, since (AFAIK) DMA transactions do not preserve coherence of the local stores (unlike the L2 caches in SMP or NUMA architectures).

Example:

1. SPU1 DMAs the contents of memory location 10 into LS1.
2. SPU2 DMAs the contents of memory location 10 into LS2.
3. SPU2 performs a computation, then DMAs a new value into memory location 10.

The value of memory location 10 stored in LS1 is now stale, since local stores are not L2 caches (there is no cache coherence enforced by the hardware, unlike in a NUMA or SMP architecture).

Unless I am mistaken of course (which I could very well be).
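
The three steps above can be played out in a few lines of Python. This is only a toy model under my own assumptions: dictionaries stand in for main memory and the two local stores, `dma_get` and `dma_put` are hypothetical helpers and not a real Cell API, and keying a local store by main memory address is a simplification.

```python
main_memory = {10: 1}   # address -> value
ls1 = {}                # SPU1's local store (a private copy, not a cache)
ls2 = {}                # SPU2's local store


def dma_get(ls, addr):
    """Copy one value from main memory into a local store."""
    ls[addr] = main_memory[addr]


def dma_put(ls, addr):
    """Copy one value from a local store back to main memory."""
    main_memory[addr] = ls[addr]


dma_get(ls1, 10)        # 1. SPU1 DMAs location 10 into LS1
dma_get(ls2, 10)        # 2. SPU2 DMAs location 10 into LS2
ls2[10] += 41           # SPU2 performs a computation
dma_put(ls2, 10)        # 3. SPU2 DMAs the new value into location 10

# Nothing updated LS1 behind SPU1's back, so its copy is now stale.
stale_before_refetch = ls1[10] != main_memory[10]

# Software has to notice and refetch; no hardware does this for the LS.
dma_get(ls1, 10)
```

After step 3, `stale_before_refetch` is true: LS1 still holds the old value until SPU1 issues another transfer itself.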

Gubbi · May 9, 2005

aaaaa00 said:
Unless I am mistaken of course (which I could very well be).

No, you got it absolutely right. Local stores are not kept coherent with main memory. You'll have to maintain coherence in software.

CELL resembles existing baseband+DSP SOCs, with a host CPU (the Power core) and one or more DSPs (the SPEs). It's just a lot more ambitious, and hyped.

Cheers
Gubbi

Panajev2001a · May 9, 2005

Gubbi, when talking about coherency, what do you make of this Hofstee quote from his HPCA 2005 article

"Local Store addresses do have an alias in the Power processor address map and transfers to and from Local Store to memory at large (including other local stores) are coherent in the system. As a result a pointer to a data structure that has been created on the Power processor can be passed to an SPE and the SPE can use this pointer to issue a DMA command to bring the data structure into its local store in order to perform operations on it. If after operating on this data structure the SPE (or Power core) issues a DMA command to place it back in non-LS memory, the transfer is again coherent in the system according to the normal Power memory ordering rules."

?
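
For reference, the flow the quote describes (create a structure on the Power core, pass a pointer, DMA it into the local store, operate, DMA it back) looks roughly like this as a Python sketch. Everything here is an assumption for illustration: `ppe_create` and `spe_scale` are hypothetical names, a `bytearray` stands in for main memory, and the model only shows the data movement, not the memory ordering rules the quote refers to.

```python
import struct

main_memory = bytearray(1024)      # toy main memory
VEC4 = struct.Struct("<4f")        # the "data structure": four 32-bit floats


def ppe_create(ea, values):
    """Power core builds the structure in main memory and returns its address."""
    VEC4.pack_into(main_memory, ea, *values)
    return ea                      # this is the "pointer" handed to the SPE


def spe_scale(ea, factor):
    """SPE side: DMA in, operate on the local copy, DMA back out."""
    local_store = bytearray(256)   # toy local store
    # DMA get: main memory -> local store
    local_store[0:VEC4.size] = main_memory[ea:ea + VEC4.size]
    # Operate only on the local copy
    values = VEC4.unpack_from(local_store, 0)
    VEC4.pack_into(local_store, 0, *(v * factor for v in values))
    # DMA put: local store -> main memory
    main_memory[ea:ea + VEC4.size] = local_store[0:VEC4.size]


ptr = ppe_create(128, (1.0, 2.0, 3.0, 4.0))
spe_scale(ptr, 2.0)
result = VEC4.unpack_from(main_memory, ptr)
```

Between the get and the put, the SPE works on a private copy. The quote's claim concerns the transfers themselves, which is a separate question from whether that private copy stays up to date afterwards.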
