Wasn't sure if I should make a new thread, but last week IBM put up a series of 5 tutorials on the concerns for compilers targeting Cell. It looks like pretty interesting reading if you want an appreciation of the challenges faced by compiler writers (and, more generally, by programmers tackling Cell without some of the compiler help discussed):
They also talk about an implementation of a software cache for SPEs - a last resort, as they put it - mentioning that hit latency in theirs is 20 cycles.
edit - Also, there's going to be a Cell Workshop held by IBM in March:
If a software cache gets a latency of 20 cycles (average, I guess), doesn't that sound rather phenomenal, considering doing a DMA operation across the ring bus has to be a major operation? I mean, how many cycles is it just to set up the DMAC?
The latency for a cache hit in their implementation is 20 cycles. There'd be no DMA in that case, because the data is already in the local store, in the software-controlled cache. If the data isn't there, then it's like a cache miss on any CPU: the miss handler issues DMAs to pull in the required data (plus some neighbouring data, depending on the cache policy).
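To make the hit/miss distinction concrete, here's a rough sketch of a direct-mapped software cache. This is not IBM's implementation: the line size, line count, the `sw_cache` type, and the `fetch_line` helper are all made up for illustration, and a plain `memcpy` from a host-side array stands in for the DMA transfer so the code runs anywhere. The point is only that a hit is a tag compare plus a load from local memory, while a miss has to go out and fetch the line.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 128u  /* bytes per cache line (assumed value) */
#define NUM_LINES 64u   /* lines kept in the "local store" (assumed value) */

/* Hypothetical software cache: one tag and one valid flag per line. */
typedef struct {
    uint32_t tags[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    uint8_t  data[NUM_LINES][LINE_SIZE];
    unsigned long hits, misses;
} sw_cache;

/* Stand-in for main memory; on real hardware it lives outside the SPE. */
static uint8_t main_memory[1u << 16];

static void sw_cache_init(sw_cache *c) {
    memset(c, 0, sizeof *c);
}

/* Stand-in for the DMA in the miss handler: copy one whole line in. */
static void fetch_line(uint8_t *dst, uint32_t line_addr) {
    memcpy(dst, &main_memory[line_addr], LINE_SIZE);
}

/* Read one byte through the cache. */
static uint8_t sw_cache_read(sw_cache *c, uint32_t addr) {
    assert(addr < sizeof main_memory);

    uint32_t line_addr = addr & ~(LINE_SIZE - 1u);      /* start of the line */
    uint32_t index     = (addr / LINE_SIZE) % NUM_LINES; /* direct-mapped slot */
    uint32_t offset    = addr % LINE_SIZE;

    if (c->valid[index] && c->tags[index] == line_addr) {
        /* Hit: the probe (tag compare) is the whole cost, no transfer. */
        c->hits++;
    } else {
        /* Miss: evict whatever was in the slot and pull the line in. */
        fetch_line(c->data[index], line_addr);
        c->tags[index]  = line_addr;
        c->valid[index] = 1;
        c->misses++;
    }
    return c->data[index][offset];
}
```

Reading address 300 and then 301 would give one miss followed by one hit, since both addresses fall in the same 128-byte line. A real SPE version would also need write-back handling and would overlap the transfer with other work where it can, which this sketch leaves out.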
The DMA operations in the miss handler take several hundreds of cycles, but this delay is roughly commensurate with L2 miss times on the PPE side of the CELL processor. The performance impact of the Software Cache is dominated by the cost of the cache probes, not by the miss cost.
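A back-of-the-envelope way to see why the probes could dominate: every access pays the probe cost, but only misses pay the DMA cost. The tiny model below just computes probe cost plus miss rate times miss penalty. The function name is made up, and the 400-cycle penalty and 1% miss rate in the usage note are guesses for illustration (the tutorial only says "several hundreds of cycles" and gives no miss rate); only the 20-cycle hit figure comes from the tutorial.

```c
#include <assert.h>

/* Average cycles per access under a simple model (an assumption, not
 * something stated in the tutorial): every access pays the probe, and a
 * fraction miss_rate of accesses additionally pays the miss penalty. */
static double avg_access_cycles(double probe_cycles,
                                double miss_penalty_cycles,
                                double miss_rate) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return probe_cycles + miss_rate * miss_penalty_cycles;
}
```

With the guessed numbers, `avg_access_cycles(20.0, 400.0, 0.01)` comes out at about 24 cycles, of which 20 are the probe. Under this model the miss term only overtakes the probe term once the miss rate goes above 20/400 = 5%.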