Why stream with Cell?

It's said that you can pass the computed results of one SPE to another and daisy chain them to 'stream' process.

Why would you want to do this?

The only reason why we want to pass on work is if it can be done by the next stage better than the current stage. But each SPE is identical. Why would I want to pass on work when I can just do it myself using multiple cycles?

I mean, it's not like the workload has to go from SPE0 -> SPE7 before it can be output. We can store results back into main memory from any SPE, at any time.

So why stream?

And also, what is *really* the difference between a vector CPU (eg. Cray) and a SIMD SPE?
 
JF_Aidan_Pryde said:
It's said that you can pass the computed results of one SPE to another and daisy chain them to 'stream' process.

Why would you want to do this?
Because SPEs don't have an infinite amount of resources such as local memory space, registers or even time.
As a simple example we could have a pipeline setup where a first SPE tessellates base meshes and a second SPE is used as a T&L engine on the tessellated vertices.
Another example: if a particular streaming kernel is so fast to process that a second SPE can't consume all the incoming data, the first SPE could broadcast the outgoing stream towards multiple SPEs.
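The daisy-chain idea nAo describes can be sketched as chained generators, where each "SPE" consumes the previous stage's output directly. This is only a toy model (the stage names, the 4-vertices-per-mesh expansion, and the data shapes are all made up for illustration), not Cell code:

```python
# Toy model of daisy-chained streaming stages, NOT real Cell/SPE code.
# Each "SPE" is a generator that pulls from the previous stage, mimicking
# SPE0 -> SPE1 forwarding without a main-memory round trip in between.

def tessellate(base_meshes):
    # Hypothetical stage 1: expand each base mesh into several vertices.
    for mesh in base_meshes:
        for i in range(4):          # pretend each mesh tessellates into 4 vertices
            yield (mesh, i)

def transform_and_light(vertices):
    # Hypothetical stage 2: T&L on the tessellated vertices as they arrive.
    for mesh, i in vertices:
        yield ("lit", mesh, i)

base = ["meshA", "meshB"]
out = list(transform_and_light(tessellate(base)))
print(len(out))  # 8 lit vertices flowed through the chain
```

Note how the intermediate tessellated vertices never accumulate anywhere: stage 2 consumes them as stage 1 produces them, which is the point of streaming.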
 
nAo said:
JF_Aidan_Pryde said:
It's said that you can pass the computed results of one SPE to another and daisy chain them to 'stream' process.

Why would you want to do this?
Because SPEs don't have an infinite amount of resources such as local memory space, registers or even time.
As a simple example we could have a pipeline setup where a first SPE tessellates base meshes and a second SPE is used as a T&L engine on the tessellated vertices.

I think I see it. There can be memory access savings.

So let's say our graphics pipeline starts at SPE0.

SPE0 loads vertex data and does the transformations. Since the vertex data is larger than the local store, if streaming is not used, it would have to store the data back into main memory before loading it again to do the lighting calculations.

With streaming, the transformed vertex mesh can be passed on to SPE1 through the EIB without going to main memory. For certain cases, SPE1 probably needn't access main memory at all. It just has to receive the half-computed data from SPE0, perform additional computation, and pass the results to SPE2.

In short, streaming changes the nature of the computation by moving the memory access to the beginning and end of the chain of processors, leaving the middle processors only for computation. 8)
 
Well, if you can break a task into modules that take similar time to process, streaming is a great way to spread the bigger task across multiple processors. You gain processor utilization. And when the hardware encourages that, you probably don't lose much performance by streaming it.
 
JF_Aidan_Pryde said:
In short, streaming changes the nature of the computation by moving the memory access to the beginning and end of the chain of processors, leaving the middle processors only for computation. 8)
Yes, in the general case, but on CELL you could do a lot of different things, like store your intermediate results in the L2 cache :)
Note that even SPE1 needs to access memory; AFAIK the only way SPE0 can pass data to SPE1 is by moving data out from its local memory (or its registers) into SPE1's memory. But I'm not 100% sure about that, so I could be wrong.
Moreover, in any reasonably long streaming kernel every data load from local memory can be made completely free, so it's not a big deal to pass data from local mem to local mem.
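The "loads can be made free" claim is usually achieved with double buffering: while one buffer is being processed, the DMA engine fills the other. A minimal sketch of the ping-pong pattern, with the DMA transfer faked as a plain assignment (on Cell this would be an mfc_get/mfc_put pair; everything here is a simplification):

```python
# Toy double-buffering loop, NOT real Cell code. The "prefetch" below stands in
# for an asynchronous DMA into the idle buffer; on real hardware it would
# overlap with the compute on the other buffer, hiding the transfer latency.

def double_buffered(chunks, process):
    buffers = [None, None]
    results = []
    it = iter(chunks)
    buffers[0] = next(it, None)               # initial fill of buffer 0
    i = 0
    while buffers[i % 2] is not None:
        buffers[(i + 1) % 2] = next(it, None) # "DMA" the next chunk into the idle buffer
        results.append(process(buffers[i % 2]))  # compute on the current buffer
        i += 1
    return results

print(double_buffered([1, 2, 3], lambda x: x * 2))  # [2, 4, 6]
```

Since the prefetch for chunk n+1 is issued before chunk n is processed, a real DMA engine would have the next buffer ready by the time the compute loop needs it.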
 
JF_Aidan_Pryde said:
Why would you want to do this?
Same reason CPUs pipeline things at the hardware level.
There are situations where pipeline-level parallelism is easier to exploit than simple process repetition (and it may be easier to manage too).
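The trade-off between the two organizations can be sketched with a simple timing model (unit stage times and the fill-latency formula are textbook pipeline idealizations, not Cell measurements): both scale with the processor count, so pipelining's real advantage is what the posts above describe, namely that each SPE only needs to hold one stage's code and working set:

```python
# Toy timing model: pipelining vs "process repetition" (each worker runs the
# whole kernel on its own share of the items).
import math

def pipeline_time(stage_times, items):
    # Fill the pipe once, then steady state is bounded by the slowest stage.
    return sum(stage_times) + (items - 1) * max(stage_times)

def repetition_time(stage_times, items, workers):
    # Each worker processes ceil(items/workers) items through every stage.
    return sum(stage_times) * math.ceil(items / workers)

stages = [1, 1, 1, 1]                       # 4 equal stages, 4 workers
print(pipeline_time(stages, 100))           # 103: fill latency + 99 cycles
print(repetition_time(stages, 100, 4))      # 100: perfectly balanced repetition
```

With perfectly balanced stages the throughput is nearly identical; repetition loses its edge as soon as the full kernel no longer fits in one SPE's local store, or when stages are easier to write and schedule as separate programs.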
 
nAo said:
JF_Aidan_Pryde said:
In short, streaming changes the nature of the computation by moving the memory access to the beginning and end of the chain of processors, leaving the middle processors only for computation. 8)
Yes, in the general case, but on CELL you could do a lot of different things, like store your intermediate results in the L2 cache :)

Isn't the L2 cache used only by the PPE?

As I understand, the SPEs use the 'three level' memory paradigm which does not have hardware cache:
Register <-> Local Store <-> Main Memory

Or do you mean storing intermediate results on the local store?
 
JF_Aidan_Pryde said:
Isn't the L2 cache used only by the PPE?
AFAIK SPEs can write into locked portions of the L2 cache.
As I understand, the SPEs use the 'three level' memory paradigm which does not have hardware cache:
Register <-> Local Store <-> Main Memory
Yeah... that's the paradigm... but let's say the L2 can be memory mapped via the MMU, and SPEs can issue DMA writes to that 'special' memory... ;)

Or do you mean storing intermediate results on the local store?
I mean both. Probably every DMA command an SPE can issue reads and writes data from and to local memory, not registers. So it wouldn't be possible to move data 'out' of the SPE without storing it into the local store first.
 
JF_Aidan_Pryde said:
It's said that you can pass the computed results of one SPE to another and daisy chain them to 'stream' process.

Why would you want to do this?

The only reason why we want to pass on work is if it can be done by the next stage better than the current stage. But each SPE is identical. Why would I want to pass on work when I can just do it myself using multiple cycles?
Because context changes are expensive?
 
Squeak said:
JF_Aidan_Pryde said:
It's said that you can pass the computed results of one SPE to another and daisy chain them to 'stream' process.

Why would you want to do this?

The only reason why we want to pass on work is if it can be done by the next stage better than the current stage. But each SPE is identical. Why would I want to pass on work when I can just do it myself using multiple cycles?
Because context changes are expensive?

That's exactly why. Think of it like an assembly line: you can have 8 workers each building one part of a car, or 8 workers each building completely separate cars.
 