Will L2 cache on X360's CPU be a major hurdle?

Fafalada said:
I'd like to believe that - but I've yet to see a 256KB+ cache that has 4-cycle load/store latencies at 3.2GHz. Unless of course you're suggesting to go with a much smaller cache - but then I'm no longer gonna agree the design would have any performance advantages.

Well, it doesn't matter if you've seen a 4-cycle 256KB cache, because the actual read latency of the SPE's LS is 6 cycles. The data plane of a cache would have the same access latency as the LS, and the tag lookup should take a cycle or two less than the LS access time. So the actual access time for a cache should be the same.

BTW, anyone know what load models are supported by the SPE?

I believe nAo meant they would be harder to hide, regardless of the few-cycle difference in total access time (given we're talking about at least 500 cycles for a memory access, those won't make much difference anyway).

Realistically, for data sets that can be easily DMA'd into the LS, the cache should do just as well.

But are you honestly trying to argue that the design wouldn't have a significant overhead in terms of die area usage? I am more inclined to think nAo is right there when he said there'd only be room for 4 SPEs; heck, I'm questioning whether there'd even be room for that many.

As stated earlier, the die area overhead should be minimal, <5% at worst.

Btw - if cache overheads are so low, how do you explain that the 512KB of L2 in Cell has nearly the same area as two SPEs combined?

The L2 cache in the Cell design is an example of a very, very non-dense cache. Most likely the reason (because IBM isn't incompetent) is a lot of overhead in the L2 design related to handling the 9 separate load/store sources, as well as the logic to handle all the other memory model issues and the controller.

Aaron Spink
speaking for myself inc.
 
nAo said:
I don't believe that, because with a local store an SPE is guaranteed to hit the data it's looking for every time it tries to access it, and it's not hard to imagine this would make the SPE design much simpler.

A little bit simpler; it depends on what other hazards the pipeline has to handle.

How many 256KB cache designs running at 4GHz with a 4-cycle access latency do you know of?

I can only assume you mean 6 cycles, because according to the ISSCC paper, that's the load latency.

Die area wise, if we look at CELL die photographs we see that the 512KB L2 cache takes almost the same space as a couple of SPEs, and I bet it has a much, much higher latency (~40 cycles).

I wouldn't use the L2 on the Cell as an example of either cache density or optimization. It is most likely dealing with a lot of issues that are orthogonal to a normal L2 or L1.



This is true for an unoptimized app; an optimized app (with an optimized/customized data set) would likely prefetch/load data via DMA one time (per packet) with burst accesses to the external memory.
I seriously doubt this would be slower than using a multitude of single accesses (due to cache misses).

The same can be done with a cache and pre-fetching. Also you are assuming that there is no degradation involved in packaging the data so that it can be loaded in block form.
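
(For illustration only - a minimal sketch of the cache-plus-prefetching pattern being described here, using GCC's __builtin_prefetch. The loop, the data layout and the prefetch distance are all made up for the example, not taken from any Cell documentation.)

```c
#include <stddef.h>

/* Made-up tuning knob: how many elements ahead to prefetch. In practice
 * you'd size this against the memory latency and the per-element work. */
#define PREFETCH_AHEAD 8

void scale_all(float *data, size_t count, float k)
{
    for (size_t i = 0; i < count; i++) {
        /* Ask the cache to start pulling in a line we'll touch soon. */
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 1 /* for write */, 3);
        data[i] *= k;
    }
}
```

Whether the prefetched lines actually arrive in time is, of course, exactly the tuning problem being argued about in this thread.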


Moreover, a memory controller able to manage thousands of cache misses at the same time (we have 8 SPEs and one PPE!) would be more complex than the actual CELL memory controller (which already handles 128 concurrent DMA requests).

There is really no difference between the DMA requests and cache miss requests. The memory/coherency controller still has basically the same requirements.

Aaron Spink
speaking for myself inc.
 
nondescript said:
Deterministic processing times (at least for the SPEs) - make it easier for developers to cycle-count and optimize their code. That rules out hit-or-miss caches, and out-of-order execution. I could see this being rather important for anyone trying to stream process.

Deterministic processing times can be somewhat overrated when they interfere with your performance in the first place.

The LS supports 1024-bit DMA transfers - which seems to me to be quite useful - rather than writing one cache line at a time, it can transfer (relatively) large blocks of memory at a time. Incredibly (to me, at least), all 1024 bits can be written in one write cycle, keeping the read/write port free for the SPE most of the time.

No real difference from a cache with a 128 byte line size.

Link? Umm...go find Microprocessor Report...somehow.

Um, I'll pass on MPR. MPR lost its place as a credible source sometime in the early '90s. Want good press? Send a lot of people to MPF. And let's not get into all the "awards" they've given out over the years. Seriously, a lot of people don't consider them anything more than a PR outlet for sale.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
I can only assume you mean 6 cycles, because according to the ISSCC paper, that's the load latency.
According to a more recent presentation by Mr. Suzuoki (Sony), local stores have a 4-cycle load latency.
It should be noted that the ISSCC presentation was about the CELL DD1 version, and Suzuoki's more recent presentation is probably about the CELL DD2 version.
http://www.rambus.co.jp/events/Main1_2_SCE_Suzuoki.pdf (Page 26)

I wouldn't use the L2 on the Cell as an example of either cache density or optimization. It is most likely dealing with a lot of issues that are orthogonal to a normal L2 or L1.
OK, maybe CELL L2 is not a good metric (I'm not a hw designer...), but why aren't we seeing huge, low-latency L1 caches on current CPUs?

The same can be done with a cache and pre-fetching. Also you are assuming that there is no degradation involved in packaging the data so that it can be loaded in block form
In some cases there could be some degradation, but I've arranged my data this way too many times not to know it's a winning choice ;)

There is really no difference between the DMA requests and cache miss requests. The memory/coherency controller still has basically the same requirements.
There are several differences. Via DMA one can load/store much more data with a single command, and on an SPE one can issue tagged DMA requests and later check for the completion of single or grouped requests.
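
(To make the tagged-DMA point concrete, a minimal SPU-side sketch of double-buffered streaming using the Cell SDK's spu_mfcio.h intrinsics. The chunk size, tag numbers, the effective address passed in and the do_work() kernel are all placeholders for the example.)

```c
#include <spu_mfcio.h>

#define CHUNK 16384   /* 16KB, the maximum size of a single MFC transfer */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void do_work(char *p, unsigned int n);   /* hypothetical compute kernel */

/* Double-buffered streaming: kick off the DMA for the next chunk under one
 * tag while working on the chunk that has already arrived under the other. */
void process_stream(unsigned long long ea, unsigned int nchunks)
{
    unsigned int tag[2] = { 0, 1 };
    int cur = 0;

    mfc_get(buf[cur], ea, CHUNK, tag[cur], 0, 0);

    for (unsigned int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;

        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (i + 1) * (unsigned long long)CHUNK,
                    CHUNK, tag[next], 0, 0);

        /* Wait only for the tag group covering the current buffer. */
        mfc_write_tag_mask(1 << tag[cur]);
        mfc_read_tag_status_all();

        do_work((char *)buf[cur], CHUNK);
        cur = next;
    }
}
```

The tag mask is what gives you the "grouped completion" behaviour mentioned above: you wait only on the transfers you actually need next, while the other tag's DMA keeps running.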

It's a good thing the CT forums have been re-opened; we're having an interesting discussion.
I'm still not that convinced that the SPE could have been designed around a local cache without dramatically impacting per-SPE die area though :)
As you wrote before, the IBM guys are not incompetent... I still can't understand why they haven't adopted a cache instead of a local store if it has such a small cost.
 
The LS architecture is targeted at pipelining data from one SPE to another - I don't see how a cache architecture would make sense in that case.

In fact it would be arse-backwards to implement the local store as cache in that scenario as the data being sent to another SPE would, in a cached architecture, have to be written both to local LS as well as destination LS - rather pointless when the originating LS is handing-off that data for the next stage in the pipeline and has no further use for it. Even if the cache is configured as "write-through" in this scenario, you're still needlessly occupying cache lines in the originator's LS that have absolutely no use and will block incoming data.
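
(A minimal sketch of that hand-off, assuming the PPE has already told this SPE the effective address at which the consumer SPE's local store is mapped; consumer_ls_ea, dst_offset and the notification step are placeholders for the example, not actual pipeline code.)

```c
#include <spu_mfcio.h>

/* Push a finished work unit straight into the next SPE's local store.
 * size is assumed to be <= 16KB and suitably aligned. */
void hand_off(volatile void *src, unsigned int size,
              unsigned long long consumer_ls_ea, unsigned int dst_offset)
{
    const unsigned int tag = 5;   /* arbitrary tag for this transfer */

    mfc_put(src, consumer_ls_ea + dst_offset, size, tag, 0, 0);

    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();    /* wait for the put to land */

    /* A real pipeline would then notify the consumer (mailbox, signal
     * register, or a flag it polls in its own LS) that the data is ready. */
}
```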

Jawed
 
Jawed said:
The LS architecture is targeted at pipelining data from one SPE to another - I don't see how a cache architecture would make sense in that case.

In fact it would be arse-backwards to implement the local store as cache in that scenario as the data being sent to another SPE would, in a cached architecture, have to be written both to local LS as well as destination LS - rather pointless when the originating LS is handing-off that data for the next stage in the pipeline and has no further use for it. Even if the cache is configured as "write-through" in this scenario, you're still needlessly occupying cache lines in the originator's LS that have absolutely no use and will block incoming data.
Well, you'd want it shared so you wouldn't have to deal with snooping that many caches. Locality would be an issue too - you'd have to rearrange things from the current design to keep some cores from having longer latency than others. But I think the simplest explanation is probably the right one: it's easier to do it this way. Any performance considerations are probably secondary; if it works out better, then that's icing on the cake.

I'm curious to see if they continue with this kind of design in the future. SMP seems like a better fit for most things unless you're looking ahead towards a one-chip solution (e.g. video, audio, etc.). It'd be good for embedded except it's too big/expensive right now. I guess we'll see. :)
 
aaronspink said:
Well, it doesn't matter if you've seen a 4-cycle 256KB cache, because the actual read latency of the SPE's LS is 6 cycles. The data plane of a cache would have the same access latency as the LS, and the tag lookup should take a cycle or two less than the LS access time. So the actual access time for a cache should be the same.
If it's that easy and cheap why aren't we all using 512KB L1 caches? Why would anyone bother with slowass L2s when apparently we can have near instant access cache at the same cost and area as the SRAM used to build it?
Anyway, the 4-cycle latency comes from the latest updates on Cell; I suspect the change was made in the DD2 revision.

Realistically for data sets that can be easily DMA'd into the LS, the cache should do as good.
If you are willing to pepper your loops with more prefetch instructions than you have math ops in there.

Anyway, I'll take your word on Cell L2 being area inefficient; I really wouldn't know.
But the argument with local stores stands - if IBM could have stuffed 256KB*8 of 4-cycle cache in there at virtually no added cost, why wouldn't they? (since we all agree they aren't incompetent).
And why use only 64K of L1 on the PPE and 512K of SLOW L2 on top? Maybe in an OoOE design it wouldn't make a huge difference, but the speed of that L2 is a big hurdle for an in-order part.

Jawed said:
The LS architecture is targeted at pipelining data from one SPE to another - I don't see how a cache architecture would make sense in that case.
That's what we'd use cache locking for - which apparently doesn't add a noticeable cost either according to aaron.
 
Fafalada said:
If it's that easy and cheap why aren't we all using 512KB L1 caches? Why would anyone bother with slowass L2s when apparently we can have near instant access cache at the same cost and area as the SRAM used to build it?

Stop trying to create strawmen. The reason we don't have 512KB L1 caches is that smaller 2-3 cycle L1 caches provide better performance over the majority of workloads. Take some computer architecture classes and do some research.

Anyway, the 4-cycle latency comes from the latest updates on Cell; I suspect the change was made in the DD2 revision.

I'm heavily inclined to believe the ISSCC presentation as far as the actual pipeline of the LS goes, versus a couple of numbers in a presentation without any real information about what is being presented or measured.

If you are willing to pepper your loops with more prefetch instructions than you have math ops in there.

What's a DMA engine but a really crappy prefetch engine?


But the argument with local stores stands - if IBM could have stuffed 256KB*8 of 4-cycle cache in there at virtually no added cost, why wouldn't they? (since we all agree they aren't incompetent).

Because certain pie-in-the-sky usage models can be somewhat easier with an LS, without thinking askew.

And why use only 64K of L1 on the PPE and 512K of SLOW L2 on top? Maybe in an OoOE design it wouldn't make a huge difference, but the speed of that L2 is a big hurdle for an in-order part.

Because I suspect that the L2 pipeline is handling a lot of complex tasks, including the memory controller and the 9 or 10 separate requesters.

Aaron Spink
speaking for myself inc.
 
Still haven't gotten an answer to my question on load/store address models for the SPEs. Can anyone who knows talk?

Aaron Spink
speaking for myself inc.
 
aaronspink said:
Fafalada said:
If it's that easy and cheap why aren't we all using 512KB L1 caches? Why would anyone bother with slowass L2s when apparently we can have near instant access cache at the same cost and area as the SRAM used to build it?

Stop trying to create strawmen. The reason we don't have 512KB L1 caches is that smaller 2-3 cycle L1 caches provide better performance over the majority of workloads. Take some computer architecture classes and do some research.

Ouch, let's keep it civil, guys.

I already posted this on the other console board:
In general, your average latency will be:

L1 latency * L1 hitrate + L2 latency * (1 - L1 hitrate) * L2 hitrate + (1 - L1 hitrate) * (1 - L2 hitrate) * main memory latency

If you take the Doom 3 cache usage patterns earlier in this thread on an A64 3500+ (2.2GHz) with 64KB D$ and 512KB L2 cache, you get:

L1 D$ hitrate: 98.36%, latency: 3 cycles
L2 hitrate: 64.55%, latency: 12 cycles
Main memory latency: ~200 cycles

So your average memory access latency for Doom 3 is
3 * 0.9836 + 12 * (1 - 0.9836) * 0.6455 + 200 * (1 - 0.9836) * (1 - 0.6455) ≈ 4.2 cycles
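
(Same arithmetic as a throwaway snippet, using only the Doom 3 numbers quoted above.)

```c
#include <stdio.h>

int main(void)
{
    /* Doom 3 figures quoted above: A64 3500+, 64KB D$, 512KB L2 */
    double l1_hit = 0.9836, l1_lat = 3.0;
    double l2_hit = 0.6455, l2_lat = 12.0;
    double mem_lat = 200.0;

    double avg = l1_lat * l1_hit
               + l2_lat * (1.0 - l1_hit) * l2_hit
               + mem_lat * (1.0 - l1_hit) * (1.0 - l2_hit);

    printf("average access latency: %.1f cycles\n", avg);   /* ~4.2 */
    return 0;
}
```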

Cheers
Gubbi
 
Jawed said:
Well it's in the MPR article.

Jawed

What's in the MPR article? Without a quote for reference it is hard to figure out what you are referring to.

If you are referring to the load/store address model question, then no, it is not in the MPR article. The article is about what I've come to expect from MPR over the past 5 years... fluff.

Aaron Spink
speaking for myself inc.
 
From a purely software point of view (bah to all that complicated hardware gubbins ;-) ), 6 is the magic number...

There are some seriously detailed docs on the SPU LS pipeline, but I just concentrate on the software side of things, so 6 is good enough for me...
 
Well I must admit that at this point I'm stumped as to why they went with LS as opposed to cache. I see a couple possibilities:

- design time/complexity ... seems unlikely as aaronspink says that cache design with locking + explicit DMA requests has been done and is well known.

- security: it seems like with the SPE + LS model, it would take considerably more programmer skill to get one SPE to trash the memory of another than with the SPE + cache model.

aaronspink - what kind of cache-coherency protocol are you thinking of for this hypothetical Cell with cache for SPEs?
 
psurge said:
Well I must admit that at this point I'm stumped as to why they went with LS as opposed to cache. I see a couple possibilities:

- design time/complexity ... seems unlikely as aaronspink says that cache design with locking + explicit DMA requests has been done and is well known.

- security: it seems like with the SPE + LS model, it would take considerably more programmer skill to get one SPE to trash the memory of another than with the SPE + cache model.

aaronspink - what kind of cache-coherency protocol are you thinking of for this hypothetical Cell with cache for SPEs?

FWIW I think it's largely a philosophical decision; looking at the Cell architecture, it's pretty clearly designed to scale to very large levels. This is evident in architectural features such as the ring bus.

It becomes impractical to share the memory pool at some point without making the snoop logic extremely complex and simply starving the processors; the obvious solution is a local store.
 
ERP, that makes sense... aaronspink did say that coherency doesn't have to be expensive (depending on how much of it is desired). I'm interested to hear exactly what this means.

I was thinking along the lines of caching only read-only pages and locally owned RW pages. A page would only be allowed a single writer at a time, responsible for releasing it to the other SPEs for reading (or change of ownership)...

I'll stop there as I honestly don't know enough about cache coherency to say anything really worthwhile, but I guess what I'm wondering is if there is a simple way to punt on the expensive snooping logic and force the programmer to deal with coherency (but not the memory transfers). Does it even make sense to do something like this?
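
(Purely to illustrate the single-writer idea above - nothing the hardware provides. A toy ownership table where each page has at most one writing SPE and becomes readable again only once released; the struct, page count and functions are all hypothetical, and a real version would need the table updates themselves to be atomic.)

```c
#include <stdbool.h>

#define NUM_PAGES 256
#define NO_OWNER  (-1)

/* One entry per shared page: either nobody owns it for writing (so read-only
 * cached copies are allowed) or exactly one SPE owns it and no one may cache it. */
struct page_state {
    int  writer;     /* SPE id that currently owns the page, or NO_OWNER */
    bool readable;   /* true once the last writer has released the page  */
};

static struct page_state pages[NUM_PAGES];

/* Claim a page for writing; fails if another SPE currently holds it. */
bool acquire_for_write(int page, int spe_id)
{
    if (pages[page].writer != NO_OWNER && pages[page].writer != spe_id)
        return false;
    pages[page].writer   = spe_id;
    pages[page].readable = false;   /* readers must drop/refetch their copies */
    return true;
}

/* Release a page so other SPEs may cache it read-only again. */
void release_for_read(int page, int spe_id)
{
    if (pages[page].writer == spe_id) {
        pages[page].writer   = NO_OWNER;
        pages[page].readable = true;
    }
}
```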

Regards,
Serge
 
aaronspink said:
The L2 cache in the Cell design is an example of a very, very non-dense cache. Most likely the reason (because IBM isn't incompetent) is a lot of overhead in the L2 design related to handling the 9 separate load/store sources, as well as the logic to handle all the other memory model issues and the controller.

9 separate load/store sources (!?) I'm pretty sure that's not true. If you're referring to 8 SPEs + 1 PPU = 9 sources, that is mistaken. The SPEs (and the rest of the CELL) access the L2 cache via the EIB. Only the PPU accesses the L2 cache directly (through the L1, of course).

aaronspink said:
as stated earlier, the die area overhead should be minimal, <5% at worst.

Can you give an example of a cache that is comparable to the SPE LS in terms of latency and size (on a similar fab process, of course)? I think the burden is on you to prove this assertion; as Faf and nAo have pointed out, there are some pretty obvious reasons to believe this is not true.

As I said:
myself said:
It depends on how over/under-built that cache is. I can construct some pathological fully-associative cache that takes up ridiculous die space, or some cache that has so little overhead that it barely functions as a cache at all. 1-2%? I must admit, I really have no idea.
You can build a smaller cache, probably, but with corresponding loss of capability.


DeanoC said:
From a purely software point of view (bah to all that complicated hardware gubbins ), 6 is the magic number...
MPR says 6, Suzuoki says 4...I'll just take your word for it.

aaronspink said:
Um, I'll pass on MPR. MPR lost its place as a credible source sometime in the early '90s. Want good press? Send a lot of people to MPF. And let's not get into all the "awards" they've given out over the years. Seriously, a lot of people don't consider them anything more than a PR outlet for sale.

If you say so. But they have a full page explaining the load/store of the LS that you want, including port sizes and number, access priorities, TLB and SLB specs, DMA and more. But hey, it's all PR. :!:
 