Is the internal bandwidth in CELL enough?

Robbz

Newcomer
I've been looking at the bandwidth of the EIB, the 4x128-bit ring bus between the SPEs and the PPE. STI have said it's capable of transferring up to 96 bytes/cycle, and I've got a nagging feeling that's a bit on the low side.

By just looking at the bandwidth in absolute terms it seems huge, ~307 GB/sec @ 3.2GHz, but on the other hand this "only" translates to 6 loads/stores per clock cycle when one wants to do an SPE->SPE transfer, assuming that one can do loads/stores between the PPE and SPEs without going to main memory.
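
Just to spell out the arithmetic (a rough back-of-envelope sketch in C; the 96 bytes/cycle and 3.2 GHz figures are the quoted specs, the 16-byte/128-bit transfer size is my assumption):

    /* Back-of-envelope check of the EIB peak figures quoted above. */
    #include <stdio.h>

    int main(void)
    {
        const double clock_hz        = 3.2e9;  /* 3.2 GHz clock        */
        const double bytes_per_cycle = 96.0;   /* quoted EIB peak      */
        const double quadword        = 16.0;   /* one 128-bit transfer */

        double peak_gb_s   = bytes_per_cycle * clock_hz / 1e9;  /* ~307 GB/s   */
        double quads_cycle = bytes_per_cycle / quadword;        /* 6 per cycle */

        printf("EIB peak: %.1f GB/s, %.0f quadword transfers per cycle\n",
               peak_gb_s, quads_cycle);
        return 0;
    }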

This, at least to me, means that by just using all the SPEs at once you end up overloading the EIB if you have an algorithm that can consume/produce data more often than once every 3 clocks, assuming you can divide the work into 9 discrete steps, i.e. use two threads on the PPE and dual-issue to the full extent on each SPE. This is probably just a pathological case, but the thing I'm trying to get at is what happens when one tries to load a big chunk of data into the local store of an SPE while two others are already doing the same. Won't this essentially mean that the SPEs fight for the bandwidth and generate (potentially) big stalls, and hence that it'll be hard to achieve the rated GFLOPS in a "streaming" scenario?

This, in my mind, should also affect the FlexIO->EIB->XDR BW. This might generate stalls for the RSX when it tries to read/write to the XDR memory.

Anyway, one can argue that the BW on the EIB is so much better than anything else in modern processors that this is a moot point, but to me the CELL design relies on having enough BW to supply all the SPEs in a streaming fashion to be able to fully utilize them. Stalls on an in-order design are detrimental to performance.

Then again, I might just be dreaming and the BW on the EIB is more than enough.. :)

Any comments?

/Robbz
 
I think IBM may have implemented something similar to their token ring on FDDI networks. It's not the EIB that I expect to block the system (in case it gets blocked), but the XDR bus.
 
FDDI networks? How does that work and how would that apply to the EIB?

I was just noting the fact that 96 bytes/clock is the peak BW of the EIB, and to me it seems that one can easily saturate this when transferring blocks of memory between SPEs.

But then again, I might just be way off.. :)

/Robbz
 
The best way to solve that problem would be to have the CPU order the actions of the SPEs in sequence. You might lose a bit of the maximum possible throughput, but you would maximize the available bandwidth.
 
Robbz said:
FDDI networks? How does that work and how would that apply to the EIB?

Well, a token ring structure. A 4-way bus, with 2 of the channels going clockwise and the other 2 going anticlockwise. The SPE which owns the token is the one which communicates, and the PPE is the unit which assigns the token to the next unit (be it SPE, PPE or FlexIO).



I suppose I'm wrong, but that's the first thing I thought when I saw the Element Interconnect Bus diagram back in February.
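
For what it's worth, here's a minimal sketch of the kind of token-passing arbitration I mean (purely illustrative C of the FDDI-style idea; the unit names and counts are made up, and this is not the actual EIB arbiter, whose details we don't know):

    /* Illustrative token-passing arbitration on a ring of bus units.
       This models the FDDI-style idea described above, NOT the real EIB. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_UNITS 10  /* assumed: 8 SPEs + PPE + FlexIO */

    typedef struct {
        const char *name;
        bool        wants_to_send;  /* pending transfer? */
    } BusUnit;

    int main(void)
    {
        BusUnit units[NUM_UNITS] = {
            {"PPE", true},  {"SPE0", false}, {"SPE1", true},  {"SPE2", false},
            {"SPE3", true}, {"SPE4", false}, {"SPE5", false}, {"SPE6", true},
            {"SPE7", false}, {"FlexIO", true},
        };

        int token = 0;  /* index of the unit currently holding the token */
        for (int cycle = 0; cycle < 20; ++cycle) {
            if (units[token].wants_to_send) {
                printf("cycle %2d: %s transmits\n", cycle, units[token].name);
                units[token].wants_to_send = false;
            }
            token = (token + 1) % NUM_UNITS;  /* pass the token along the ring */
        }
        return 0;
    }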
 
Robbz said:
FDDI networks? How does that work and how would that apply to the EIB?

I was just noting the fact that 96 bytes/clock is the peak BW of the EIB, and to me it seems that one can easily saturate this when transferring blocks of memory between SPEs.

But then again, I might just be way off.. :)

/Robbz

Why would you ship data between SPEs? What can one SPE do that another can't? If it's a matter of the code that's on another SPE, why not just ship the code instead, since code size is often much smaller than data size?
 
The MDR pdf said that you can have up to 3 simultaneous transfers if the SPEs are sending to the next one over, so that involves 6 SPEs: 3 sending, 3 receiving.
 
phat said:
Why would you ship data between SPEs? What can one SPE do that another can't? If it's a matter of the code that's on another SPE, why not just ship the code instead, since code size is often much smaller than data size?
I think it's a question of how you're using your resources. Since individual SPE resources are limited (e.g. only 256k local store, and a fraction of the total FP performance quoted) you might have one SPE doing the first part of a function and then passing the results to a second one that finishes it (or does its part, etc.)

I don't think we know enough about the total system to say what part, if any, will be the bottleneck. I'm kind of skeptical that very many games will have all 7 SPEs crunching away at full throttle, though; it's going to take creative thinking to find good uses for all of the Cell resources.
 
Vaan said:
Well, a token ring structure. A 4-way bus, with 2 of the channels going clockwise and the other 2 going anticlockwise. The SPE which owns the token is the one which communicates, and the PPE is the unit which assigns the token to the next unit (be it SPE, PPE or FlexIO).

That makes sense to me.

Tacitblue said:
The MDR pdf said that you can have up to 3 simultaneous transfers if the SPEs are sending to the next one over, so that involves 6 SPEs: 3 sending, 3 receiving.

Well, I think this was one of the things I wanted to highlight. Doesn't this mean that these in-order SPEs can stall when trying to read/write memory, and hence won't get anywhere near their peak performance?

Btw, does the MDR pdf state whether one is limited to 1 load or store per cycle, or can you do both a load and a store per cycle per SPE (and per cycle for the PPE/FlexIO)?

chachi said:
I think it's a question of how you're using your resources. Since individual SPE resources are limited (e.g. only 256k local store, and a fraction of the total FP performance quoted) you might have one SPE doing the first part of a function and then passing the results to a second one that finishes it (or does its part, etc.)

I don't think we know enough about the total system to say what part, if any, will be the bottleneck. I'm kind of skeptical that very many games will have all 7 SPEs crunching away at full throttle, though; it's going to take creative thinking to find good uses for all of the Cell resources.

I agree with you here that "normal" things that one might want to do on the SPEs will have a much lower consume/produce frequency than I depicted in the first place. But I don't have _any_ problem coming up with a problem that I can easily split across all the Cell resources.

Say that I want to do a SW rasterizer. Write down some pixel shader code, split the "instructions" evenly over the SPEs and then off you go. Depending on how complex the "instructions" are, one should hit the BW limit in some cases and not in others. Or, say that you want to post-process the image from the RSX...
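
As a rough back-of-envelope for the post-processing case (all numbers here are my own assumptions, not published figures: a 1280x720 frame at 60 fps, reading and writing 16 bytes per pixel):

    /* Rough bandwidth estimate for post-processing an RSX frame on the SPEs.
       Frame size, rate and bytes/pixel are assumed, not published figures. */
    #include <stdio.h>

    int main(void)
    {
        const double pixels          = 1280.0 * 720.0;  /* assumed 720p frame   */
        const double fps             = 60.0;            /* assumed frame rate   */
        const double bytes_per_pixel = 16.0 + 16.0;     /* assumed read + write */
        const double eib_peak_gb_s   = 307.2;           /* 96 B/cycle @ 3.2 GHz */

        double needed_gb_s = pixels * fps * bytes_per_pixel / 1e9;  /* ~1.8 GB/s */
        printf("post-process traffic: %.2f GB/s (%.2f%% of EIB peak)\n",
               needed_gb_s, 100.0 * needed_gb_s / eib_peak_gb_s);
        return 0;
    }

So the sustained traffic is tiny compared to the EIB peak, which is why it's really the bursts I'm worried about, not the average.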

I think that my main concern here is not that the BW of the EIB is too low in the average case, but in the peak cases. As I see it, there can be BW usage peaks, and these can affect the performance of the SPEs when they stall on memory load/store operations, since they are in-order.

STI is mainly saying that since you have a "big" local store on each SPE with predictable access latency, the in-order design with no branch prediction will work perfectly well, since you can schedule the code accordingly. But is this true when one accounts for the loads/stores to and from the EIB?

/Robbz
 
I think focusing on sharing data between SPEs is a bad idea and goes against the overall design. I can imagine, say, one SPE producing data for one aspect of a process and passing it to another SPE to process further, but getting all of them to share data between them...

So it could be a bottleneck, but won't be in its application on PS3. E.g. why use a software rasterizer when you've got RSX? ;)
 
Shifty Geezer said:
I think focusing on sharing data between SPEs is a bad idea and goes against the overall design. I can imagine, say, one SPE producing data for one aspect of a process and passing it to another SPE to process further, but getting all of them to share data between them...

So it could be a bottleneck, but won't be in its application on PS3. E.g. why use a software rasterizer when you've got RSX? ;)
As one of the design goals of Cell was to support pipelining of SPUs, it supports it very well.

SPUa produces a chunk of data (N-buffered); it issues a DMA request to transfer the chunk it's just finished to SPUb, while it continues working on another chunk. SPUb is still working on the previous chunk while its next workload is DMAed in. The whole thing runs at the speed of the slowest part; it's down to the developer to balance the workloads for maximum performance (i.e. if SPUa takes much longer to produce its data than SPUb takes to process it, then it may be worth having two producer SPUs both supplying the same consumer SPUb).

The only real wrinkle in the whole system is the lack of memory ports on the SPU RAM, but that's why SPUs have such huge register files.

The key point is that as long as the consumption of data takes longer than the DMA transfer (and the DMA engine is fast) then it doesn't matter.

96 bytes is enough for 6 quads (4 floats per quad) per cycle, which means that each app-controlled SPU gets a quad per cycle for itself. That's not going to be a serious bottleneck if you're doing any real work (just think how long it's going to take just to read a quad from LS into registers and write it back into LS without actually doing any work).
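
To make the producer/consumer pattern concrete, here's a minimal sketch of the consumer side with double buffering, written against the SDK-style MFC intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all from spu_mfcio.h). The chunk size, the process_chunk() body, and the assumption that src_ea is a DMA-able effective address (e.g. the producer SPU's local store mapped into the effective-address space) are all placeholders of mine, not anything from the docs:

    /* Minimal double-buffered consumer SPU loop (sketch).
       While chunk i is being processed, chunk i+1 is DMAed in behind it. */
    #include <spu_mfcio.h>

    #define CHUNK_SIZE (16 * 1024)  /* assumed bytes per work chunk (<= 16KB DMA) */

    static volatile char buf[2][CHUNK_SIZE] __attribute__((aligned(128)));

    static void process_chunk(volatile char *data, unsigned size)
    {
        /* ... the real work on the chunk would go here ... */
        (void)data; (void)size;
    }

    void consume(unsigned long long src_ea, unsigned num_chunks)
    {
        unsigned cur = 0;

        /* Kick off the DMA for the first chunk (tag = buffer index). */
        mfc_get(buf[cur], src_ea, CHUNK_SIZE, cur, 0, 0);

        for (unsigned i = 0; i < num_chunks; ++i) {
            unsigned next = cur ^ 1;

            /* Start fetching the next chunk while we work on the current one. */
            if (i + 1 < num_chunks)
                mfc_get(buf[next],
                        src_ea + (unsigned long long)(i + 1) * CHUNK_SIZE,
                        CHUNK_SIZE, next, 0, 0);

            /* Wait for the current chunk's DMA to complete, then process it. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            process_chunk(buf[cur], CHUNK_SIZE);

            cur = next;
        }
    }

As long as process_chunk() takes longer than the transfer, the DMA latency is completely hidden, which is exactly the point above.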
 
DeanoC said:
96 bytes is enough for 6 quads (4 floats per quad) per cycle, which means that each app-controlled SPU gets a quad per cycle for itself. That's not going to be a serious bottleneck if you're doing any real work (just think how long it's going to take just to read a quad from LS into registers and write it back into LS without actually doing any work).

Interesting point. As I stated in my first post, I was merely looking at whether this could be a potential bottleneck in the quest of achieving 100% utilisation of the SPEs. When doing any serious work I guess it's going to be more than enough.

I'll just stop my ramblings here and be happy that every part of the Cell arch is well thought out.. :)

/Robbz
 
I think the weakness in PS3 at the moment is how well the graphics hold up with HDR and AA. That's one aspect that hasn't been explained yet, whereas we know XB360's solution.
 
chachi said:
I don't think we know enough about the total system to say what part, if any, will be the bottleneck. I'm kind of skeptical that very many games will have all 7 SPEs crunching away at full throttle, though; it's going to take creative thinking to find good uses for all of the Cell resources.

I kind of disagree. I agree that full utilisation will be hard, but I do expect all seven SPEs to be working most of the time for any reasonably well designed application.

IMO, the key to programming Cell is "lots of short threads". You don't want to think along the lines of "okay, I've got 7 SPEs, what kind of engine work can I map to seven hardware units?" Rather, it's more like "I have to be able to break my parallelisable workload into lots of short-lived threads and dispatch them to the scheduler." You want a lot more threads than 7, and you want enough that the granularity will be fine enough to use the 7 SPEs as a virtualised resource. Like tens or even hundreds of SPE threads. I think that's the way to go.
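
A rough sketch of the shape I have in mind, in plain C with pthreads standing in for the SPE worker contexts (the Job type, the counts and do_job() are made up purely for illustration; a real Cell scheduler would dispatch these to SPE threads rather than OS threads):

    /* Conceptual "lots of short jobs" scheduling: many fine-grained jobs,
       a handful of workers pulling from a shared queue until it is empty. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_WORKERS   7   /* one stand-in worker per available SPE */
    #define NUM_JOBS    200   /* many more jobs than workers           */

    typedef struct { int first; int count; } Job;   /* a small slice of work */

    static Job             jobs[NUM_JOBS];
    static int             next_job = 0;            /* next unclaimed job index */
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

    static void do_job(const Job *j)
    {
        /* ... a short, self-contained piece of the parallelisable workload ... */
        (void)j;
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            int mine = (next_job < NUM_JOBS) ? next_job++ : -1;  /* claim a job */
            pthread_mutex_unlock(&queue_lock);
            if (mine < 0)
                break;                      /* queue drained, worker retires */
            do_job(&jobs[mine]);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_WORKERS];
        for (int i = 0; i < NUM_JOBS; ++i)
            jobs[i] = (Job){ .first = i * 64, .count = 64 };  /* carve up the work */
        for (int i = 0; i < NUM_WORKERS; ++i)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_WORKERS; ++i)
            pthread_join(&tid[i], NULL);
        printf("all %d jobs done by %d workers\n", NUM_JOBS, NUM_WORKERS);
        return 0;
    }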
 
JF_Aidan_Pryde said:
IMO, the key to programming Cell is "lots of short threads". You don't want to think along the lines of "okay, I've got 7 SPEs, what kind of engine work can I map to seven hardware units?" Rather, it's more like "I have to be able to break my parallelisable workload into lots of short-lived threads and dispatch them to the scheduler." You want a lot more threads than 7, and you want enough that the granularity will be fine enough to use the 7 SPEs as a virtualised resource. Like tens or even hundreds of SPE threads. I think that's the way to go.

Agreed.
 
The audioblog from MS had the technical guys saying that to develop for XB360, you'd write code on one thread, and then take out a piece of that code to execute on another core. Because the cores in XeCPU are symmetric, this is easy.

Is that how to write multithreaded code though? Or should it be designed that way from the ground up? Are XeCPU and Cell going to be different in this respect, with Cell having lots of small threads and XeCPU having fewer, longer threads?
 
Shifty Geezer said:
The audioblog from MS had the technical guys saying that to develop for XB360, you'd write code on one thread, and then take out a piece of that code to execute on another core. Because the cores in XeCPU are symmetric, this is easy.

Essentially, yes. But it's not that easy to just split your game loop up into multiple threads. Things have to happen in sequence, be synchronized and cannot access or modify the same resources at the same time. It's actually quite hard to do well, especially for games.

Is that how to write multithreaded code though? Or should it be designed that way from the ground up? Are XeCPU and Cell going to be different in this respect, with Cell having lots of small threads and XeCPU having fewer, longer threads?

Yes, the model used will almost surely be totally different for the two machines. They're totally different architectures in this respect.

Cell can manipulate small blocks of data blazingly fast, and it can do many more different things at the same time than the Xbox 360. But it isn't very good at manipulating large blocks of data, and you need two different programming models for the two different kinds of processor cores.

The Xbox 360 is very good at running multiple independent threads or processes at the same time, and it can do everything that has to be done inside the same thread. But it isn't very good at parallelising many calculations at the same time, and there just aren't that many things that are easily split into multiple independent threads for a game.
 