I've been looking at the bandwidth of the EIB, the 4x128-bit ring bus between the SPEs and the PPE. STI have said it's capable of transferring up to 96 bytes/cycle, and I have a nagging feeling that this is a bit on the low side.
By just looking at the bandwidth in absolute terms it seems huge, ~307 GB/sec @ 3.2 GHz, but on the other hand this "only" translates to 6 loads/stores per clock cycle for SPE->SPE transfers, assuming one can do loads/stores between the PPE and the SPEs without going through main memory.
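Just to show where those numbers come from, here's a quick back-of-envelope sketch in C, using only the figures above (the 16-byte size for a single 128-bit transfer is my assumption):

```c
#include <stdio.h>

int main(void)
{
    /* Figures from the post: 96 bytes per cycle peak on the EIB,
       3.2 GHz clock, 128-bit (16-byte) bus transfers. */
    const double bytes_per_cycle = 96.0;
    const double clock_hz        = 3.2e9;
    const double xfer_bytes      = 16.0;   /* one 128-bit transfer */

    printf("peak EIB bandwidth  : %.1f GB/s\n",
           bytes_per_cycle * clock_hz / 1e9);   /* 307.2 GB/s  */
    printf("16-byte xfers/cycle : %.0f\n",
           bytes_per_cycle / xfer_bytes);       /* 6 per cycle */
    return 0;
}
```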
This, at least to me, means that by using all SPEs at once you end up overloading the EIB if you have an algorithm that can consume/produce data more often than once every 3 clocks, assuming you can divide the work into 9 discrete steps, i.e. run two threads on the PPE and dual-issue to the full extent on each SPE. This is probably a pathological case, but what I'm trying to get at is: what happens when one SPE tries to load a big chunk of data into its local memory while two others are already doing the same? Won't the SPEs essentially fight for the bandwidth and generate (potentially) big stalls, making it hard to achieve the rated GFLOPS in a "streaming" scenario?
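And here's how I get to the "once every 3 clocks" figure, assuming 9 units (8 SPEs plus the PPE, which is my reading of the scenario) each want one 16-byte load and one 16-byte store per cycle against the 6 transfer slots per cycle from above:

```c
#include <stdio.h>

int main(void)
{
    /* Contention sketch under the assumptions above: 9 units each
       wanting one 16-byte load and one 16-byte store per cycle,
       against the 6 transfer slots per cycle implied by the
       96 bytes/cycle figure. */
    const int units           = 9;
    const int xfers_per_unit  = 2;   /* one load + one store */
    const int slots_per_cycle = 6;

    int demand = units * xfers_per_unit;                 /* 18 per cycle */
    double cycles_per_round = (double)demand / slots_per_cycle;

    printf("demand : %d transfers/cycle vs. supply : %d\n",
           demand, slots_per_cycle);
    printf("each unit gets its load+store through roughly every "
           "%.1f cycles\n", cycles_per_round);           /* 3.0 cycles */
    return 0;
}
```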
This should, in my mind, also affect the FlexIO->EIB->XDR bandwidth, which might generate stalls for the RSX when it tries to read/write XDR memory.
Anyway, one can argue that the bandwidth of the EIB is so much better than anything else in modern processors that this is a moot point, but to me the CELL design relies on having enough bandwidth to feed all SPEs in a streaming fashion in order to fully utilize them. Stalls on an in-order design are detrimental to performance.
Then again, I might just be dreaming and the EIB bandwidth is more than enough...
Any comments?
/Robbz