Fafalada said: I'd like to believe that - but I've yet to see a 256KB+ cache that has 4-cycle load/store latencies at 3.2GHz. Unless of course you're suggesting going with a much smaller cache - but then I'm no longer gonna agree the design would have any performance advantages.
I believe nAo meant they would be harder to hide, regardless of the few-cycle difference in total access time (given we're talking about at least 500 cycles for a memory access, those won't make much difference anyway).
But are you honestly trying to argue that design wouldn't have a significant overhead in terms of die area usage? I am more inclined to think nAo is right there when he said there'd only be room for 4 SPEs - heck, I'm questioning if it would even be that many.
Btw - if cache overheads are so low - how do you explain that the 512KB of L2 in Cell takes nearly the same area as two SPEs combined?
nAo said: I don't believe that, because with a local store an SPE is guaranteed to hit the data it's looking for every time it tries to access it, and it's not hard to imagine this makes the SPE design much simpler.
How many 256KB cache designs running at 4GHz with 4-cycle access latency do you know of?
Die area wise, if we look at CELL die photographs we see how the 512KB L2 cache takes almost the same space as a couple of SPEs, and I bet it has a much, much higher latency (40 cycles or so).
This is true for an unoptimized app; an optimized app (with an optimized/customized data set) would likely prefetch/load data via DMA once (per packet..) with burst accesses to external memory.
I seriously doubt this would be slower than using a multitude of single accesses (due to cache misses).
Moreover, a memory controller able to manage thousands of cache misses at the same time (we have 8 SPEs and one PPE!) would be more complex than the current CELL memory controller (which already handles 128 concurrent DMA queries).
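For illustration, the one-burst-per-packet pattern nAo describes (and the tagged DMA queries mentioned a bit further down) look roughly like this on an SPE. This is only a sketch using the Cell SDK's spu_mfcio.h intrinsics; the packet size, buffer and the effective address passed in are placeholders, not anyone's real code:

```c
/* Sketch only, assuming the Cell SDK's SPE intrinsics (spu_mfcio.h). The packet
 * size, buffer and effective address are placeholders. */
#include <spu_mfcio.h>
#include <stdint.h>

#define PACKET_BYTES (16 * 1024)           /* 16KB, the maximum for a single DMA */

static volatile uint8_t packet[PACKET_BYTES] __attribute__((aligned(128)));

void process_packet(uint64_t packet_ea)    /* effective address in main memory */
{
    const uint32_t tag = 1;                /* DMA tag group, 0..31 */

    /* One burst transfer per packet instead of many cache-miss-sized accesses. */
    mfc_get(packet, packet_ea, PACKET_BYTES, tag, 0, 0);

    /* ...independent work can overlap the transfer here... */

    mfc_write_tag_mask(1 << tag);          /* select this tag group */
    mfc_read_tag_status_all();             /* block until the DMA completes */

    /* packet[] is now in the local store; compute on it. */
}
```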
nondescript said: Deterministic processing times (at least for the SPEs) make it easier for developers to cycle-count and optimize their code. That rules out hit-or-miss caches, and out-of-order execution. I could see this being rather important for anyone trying to stream process.
The LS supports 1024-bit DMA transfers - which seems to me to be quite useful - rather than writing one cache line at a time, it can transfer (relatively) large blocks of memory at a time. Incredibly (to me, at least), all 1024 bits can be written in one write cycle, keeping the read/write port free for the SPE most of the time.
Link? Umm...go find Microprocessor Report...somehow.
According to a more recent presentation by Mr. Suzuoki (Sony), local stores have a 4-cycle load latency.
aaronspink said: I can only assume you mean 6 cycles, because according to the ISSCC paper, that's the load latency.
Ok, maybe the CELL L2 is not a good metric (I'm not a hw designer..) but why aren't we seeing huge, low-latency L1 caches on current CPUs?
I wouldn't use the L2 on the Cell as an example of either cache density or optimization. It is most likely dealing with a lot of issues that are orthogonal to a normal L2 or L1.
In some cases there could be some degradation, but I've arranged my data this way too many times not to know it's a winning choice.
The same can be done with a cache and pre-fetching. Also, you are assuming that there is no degradation involved in packaging the data so that it can be loaded in block form.
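As a rough illustration of what "packaging the data so it can be loaded in block form" means (and of the repacking cost aaronspink is pointing at), here is a hypothetical gather from a pointer-chased layout into one contiguous, DMA-friendly block; the struct names and sizes are invented for the example:

```c
/* Hypothetical example only: gathering the fields a kernel needs out of a
 * pointer-linked layout into one contiguous, aligned block that a DMA engine
 * (or hardware prefetcher) can stream in a single burst. */
#include <stddef.h>

struct particle {                    /* scattered, pointer-chased representation */
    float pos[4];
    float vel[4];
    struct particle *next;
};

struct particle_block {              /* contiguous block, friendly to burst loads */
    float pos[256][4];
    float vel[256][4];
} __attribute__((aligned(128)));

/* This repacking pass is the overhead being argued about: the data has to be
 * copied into block form before the single burst transfer can happen. */
size_t pack_block(const struct particle *p, struct particle_block *out)
{
    size_t n = 0;
    for (; p != NULL && n < 256; p = p->next, ++n) {
        for (int i = 0; i < 4; ++i) {
            out->pos[n][i] = p->pos[i];
            out->vel[n][i] = p->vel[i];
        }
    }
    return n;                        /* number of particles packed */
}
```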
There are several differences. Via DMA one can load/store much more data with a single command, and on an SPE one can issue tagged DMA queries and later check for completion of single or grouped DMA queries.
There is really no difference between the DMA requests and cache miss requests. The memory/coherency controller still has basically the same requirements.
Well, you'd want it shared so you wouldn't have to deal with snooping that many caches, and locality would be an issue - you'd have to rearrange things from the current design to keep some cores from having longer latency than others - but I think the simplest explanation is probably the right one: it's easier to do it this way. Any performance considerations are probably secondary; if it works out better then that's icing on the cake.
Jawed said: The LS architecture is targeted at pipelining data from one SPE to another - I don't see how a cache architecture would make sense in that case.
In fact it would be arse-backwards to implement the local store as cache in that scenario, as the data being sent to another SPE would, in a cached architecture, have to be written both to the local LS and to the destination LS - rather pointless when the originating LS is handing off that data for the next stage in the pipeline and has no further use for it. Even if the cache is configured as "write-through" in this scenario, you're still needlessly occupying cache lines in the originator's LS that have absolutely no use and will block incoming data.
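A sketch of the hand-off Jawed is describing: the producer SPE DMAs its finished block straight into the consumer SPE's local store and can then reuse the buffer. This assumes the consumer's LS has been mapped into the effective address space and its base EA (plus an agreed inbox offset) was passed to the producer, e.g. by the PPE; the names, offset and sizes are placeholders:

```c
/* Sketch only: assumes the consumer SPE's local store is mapped into the
 * effective address space and its base EA plus an inbox offset were handed
 * to this SPE; names and sizes are placeholders. */
#include <spu_mfcio.h>
#include <stdint.h>

#define BLOCK_BYTES (8 * 1024)

static volatile uint8_t out_block[BLOCK_BYTES] __attribute__((aligned(128)));

void push_to_next_spe(uint64_t next_ls_ea, uint32_t inbox_offset)
{
    const uint32_t tag = 2;

    /* Write the finished block straight into the next SPE's local store;
     * nothing needs to stay resident here once the put completes. */
    mfc_put(out_block, next_ls_ea + inbox_offset, BLOCK_BYTES, tag, 0, 0);

    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();       /* out_block can be reused after this */
}
```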
If it's that easy and cheap, why aren't we all using 512KB L1 caches? Why would anyone bother with slowass L2s when apparently we can have near-instant-access cache at the same cost and area as the SRAM used to build it?
aaronspink said: Well it doesn't matter if you've seen a 4-cycle 256KB cache, because the actual read latency is 6 cycles for the SPE's LS. The data plane of a cache would have the same access latency as the LS, and the tag portion should be less than the access time of the LS by a cycle or two. So the actual access time for a cache should be the same.
If you are willing to pepper your loops with more prefetch instructions than you have math ops in there.
Realistically, for data sets that can be easily DMA'd into the LS, the cache should do just as well.
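To make the trade-off concrete, a software-prefetched loop of the kind being joked about might look like this. GCC's __builtin_prefetch is real; the loop, prefetch distance and array names are purely illustrative:

```c
/* Illustrative only: one prefetch hint per element of real work. */
#include <stddef.h>

void scale(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) {
        /* Prefetch a fixed distance ahead - already as many "bookkeeping"
         * instructions as math ops in this loop, which is the point being
         * poked at. A one-shot DMA would instead pull the whole block up front. */
        if (i + 64 < n)
            __builtin_prefetch(&src[i + 64], 0 /* read */, 0 /* low locality */);
        dst[i] = src[i] * k;
    }
}
```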
That's what we'd use cache locking for - which apparently doesn't add a noticeable cost either, according to aaron.
Jawed said: The LS architecture is targeted at pipelining data from one SPE to another - I don't see how a cache architecture would make sense in that case.
Fafalada said: If it's that easy and cheap, why aren't we all using 512KB L1 caches? Why would anyone bother with slowass L2s when apparently we can have near-instant-access cache at the same cost and area as the SRAM used to build it?
Anyway, the 4-cycle latency comes from the latest updates on Cell; I suspect the change was made in the DD2 revision.
If you are willing to pepper your loops with more prefetch instructions than you have math ops in there.
But the argument about local stores stands - if IBM could have stuffed 8×256KB of 4-cycle cache in there at virtually no added cost, why wouldn't they? (Since we all agree they aren't incompetent.)
And why use only 64K of L1 on the PPE with 512K of SLOW L2 on top? Maybe in an OoOE design it wouldn't make a huge difference, but the speed of that L2 is a big hurdle for an in-order part.
aaronspink said:
Fafalada said: If it's that easy and cheap, why aren't we all using 512KB L1 caches? Why would anyone bother with slowass L2s when apparently we can have near-instant-access cache at the same cost and area as the SRAM used to build it?
Stop trying to create strawmen. The reason we don't have 512KB L1 caches is because smaller 2-3 cycle L1 caches provide better performance over the majority of workloads. Take some computer architecture classes and do some research.
In general, your average latency will be:
average latency = L1 latency * L1 hitrate + L2 latency * (1 - L1 hitrate) * L2 hitrate + main memory latency * (1 - L1 hitrate) * (1 - L2 hitrate)
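Plugging in the numbers floating around this thread (4-cycle L1/LS, a ~40-cycle L2, ~500-cycle main memory) with made-up hit rates, just to illustrate the formula:

```c
/* Toy plug-in of the formula above. The latencies echo numbers quoted in this
 * thread; the hit rates are invented purely for illustration. */
#include <stdio.h>

int main(void)
{
    double l1_lat = 4.0, l2_lat = 40.0, mem_lat = 500.0;   /* cycles */
    double l1_hit = 0.95, l2_hit = 0.90;                   /* assumed hit rates */

    double avg = l1_lat * l1_hit
               + l2_lat * (1.0 - l1_hit) * l2_hit
               + mem_lat * (1.0 - l1_hit) * (1.0 - l2_hit);

    printf("average load latency: %.1f cycles\n", avg);    /* prints 8.1 */
    return 0;
}
```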
Jawed said:Well it's in the MPR article.
Jawed
psurge said: Well, I must admit that at this point I'm stumped as to why they went with LS as opposed to cache. I see a couple of possibilities:
- design time/complexity ... seems unlikely as aaronspink says that cache design with locking + explicit DMA requests has been done and is well known.
- security: it seems like with the SPE + LS model, it would take considerably more programmer skill to get one SPE to trash the memory of another than with the SPE + cache model.
aaronspink - what kind of cache-coherency protocol are you thinking of for this hypothetical Cell with cache for SPEs?
aaronspink said: The L2 cache in the Cell design is an example of a very, very non-dense cache. Most likely the reason (because IBM isn't incompetent) is a lot of overhead in the L2 design related to handling the 9 separate load/store sources, as well as the logic to handle all the other memory model issues and the controller.
aaronspink said: As stated earlier, the die area overhead should be minimal, <5% at worst.
You can build a smaller cache, probably, but with a corresponding loss of capability.
myself said: It depends on how over/under-built that cache is. I can construct some pathological fully-associative cache that takes up ridiculous die space, or some cache that has so little overhead that it barely functions as a cache at all. 1-2%? I must admit, I really have no idea.
MPR says 6, Suzuoki says 4... I'll just take your word for it.
DeanoC said: From a purely software point of view (bah to all that complicated hardware gubbins), 6 is the magic number...
aaronspink said: Um, I'll pass on MPR. MPR lost its place as a credible source sometime in the early 90s. Want good press? Send a lot of people to MPF. And let's not get into all the "awards" they've given out over the years. Seriously, a lot of people don't consider them anything more than a PR outlet for sale.