Local Store, L1, L2 caches and the Future of Processors: Compare, Contrast?

Acert93

This question may already have been answered, but maybe someone could summarize for me the similarities and dissimilarities between the SPE Local Store and the L1 and L2 caches on a typical processor (e.g. Core 2 Duo).

My understanding (correct me if I am wrong) is that Local Store is basically a fast local RAM. Very fast. It is a working space/scratch pad with very low latency. It isn't like L2 cache in that it doesn't reflect system memory, and you have to DMA to get data from system memory.

L2 appears, in general, to have higher latency than Local Store, but it maintains coherency with the objects in system memory. L2 requires little management from the programmer.

Level 1 cache, from my meager knowledge, tends to be much faster than L2 cache but also much smaller. I have seen it frequently broken up into 2 parts, instruction and data.

My main interest of discussion (besides getting corrected on my understanding of these technologies) is how they compare and contrast on a technical level, e.g. execution and latency issues, capabilities, etc.

The second part of interest is how we see the future unfolding in regards to LS and caches. Will there be some convergence? Could we see L1 caches grow in size to mimic LS? Could LS concepts be brought to PC CPUs, or does the business reality of higher-level coding make such an approach unrealistic?

IBM recently announced their roadmap with Cell2 on it, with 32 SPEs and 2 PPEs, but no details on the cache arrangement. My guess is that with SOI, IBM will be taking a serious look at ZRAM, which could result in inflated LS and cache sizes. Intel mentioned server chips in 2010 where there will be 3 levels of cache in a unique arrangement, with each processor having 32 kB data and 32 kB instruction cache, 512 kB of L2 cache (16 MB total), and clusters of 4 cores having a common pool of 3 MB of L3 cache (24 MB total). And then there is Terascale.

So there seems to be a lot of movement in the processor industry (a lot more than over the last 5 years it seems) and I am curious how caches and memory will play out in future designs and what we may, or may not, see in the 2010-2012 timeframe.
 
My understanding (correct me if I am wrong) is that Local Store is basically a fast local RAM. Very fast. It is a working space/scratch pad with very low latency. It isn't like L2 cache in that it doesn't reflect system memory, and you have to DMA to get data from system memory.
The important difference between the two is that an L2 cache is coherent with memory, as you mentioned, and invisible to the application. It affects performance, but there is (almost) no way the programmer can affect it directly, though one can always code with its size in mind or use hints or flushes. The application reads and writes memory, and the L2 (and L1) caches transparently keep the most-used data close for speed. A local store, on the other hand, behaves just like any other chunk of memory, except that it is very fast. The main difference is that if you want to use a local store to keep your most-used data, you have to do it explicitly by coding for it. From that point of view local stores are less flexible than caches, but they may have other advantages.
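
To make the contrast concrete, here is a minimal sketch of the explicit side of that trade-off. It assumes the Cell SDK's spu_mfcio.h MFC intrinsics; the chunk size and the effective address passed in are made up for illustration. On a cached CPU you would simply dereference a pointer and the hardware would fill the caches behind your back; on an SPE you issue the transfer yourself and wait for its tag before touching the data.

Code:
/* Minimal sketch, not production code. Assumes the Cell SDK's
   spu_mfcio.h intrinsics; CHUNK and ea_src are illustrative. */
#include <spu_mfcio.h>

#define CHUNK 16384  /* 16 KB, picked arbitrarily; must fit in the 256 KB LS */
static volatile char buf[CHUNK] __attribute__((aligned(128)));

void fetch_chunk(unsigned long long ea_src)
{
    /* Explicitly pull the data from main memory into the local store. */
    mfc_get(buf, ea_src, CHUNK, 0 /* DMA tag */, 0, 0);

    /* Explicitly wait for tag 0 to complete before using the data.
       On a cached CPU none of this exists: a plain load would miss,
       the cache would fetch the line, and later loads would hit. */
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();

    /* buf[] now behaves like ordinary, very fast local memory. */
}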

My main interest of discussion (besides getting corrected on my understanding of these technologies) is how they compare and contrast on a technical level, e.g. execution and latency issues, capabilities, etc.
Local stores are usually as big as or larger than an L1 cache and almost as fast. Load-to-use latencies are comparable to L1 access; in the Cell SPEs they are 6 cycles (not 100% sure, I should check), which is twice the cycles of an Athlon or Core (2) L1. However, they usually have a wider path to the execution units; this is possible because they are simpler in structure than a cache.

My guess is that with SOI, IBM will be taking a serious look at ZRAM, which could result in inflated LS and cache sizes.
Not very likely; current local stores are implemented with almost the fastest SRAM cells available in modern processes. ZRAM is faster than eDRAM, but even when optimized for speed it doesn't come close to the speed of the SRAM cells used for L2 caches. It could maybe be used for very big L3s, but certainly not for something as close to the core as a local store.
 
Concerning the Cell and local store: I think I read somewhere about an IBM engineer who considered a size of 1 MB to be ideal for the SPU local store, but I could not find a link to that.

But I found this quote by Kahle et al:
We increased the size of the local store twice: first in the initial concept phases from 64 KB to 128 KB, and later, when it was doubled again to 256 KB. In both of these cases, programmability was the driving factor.
http://www.research.ibm.com/journal/rd/494/kahle.pdf

Considering that the LS is such a large part of the SPU's die area, I would be a bit surprised if they quadrupled the size of the LS in the next iteration; adding more processing power may give a better return for the die space. But I take at least a doubling for granted. :)
 
My understanding (correct me if I am wrong) is that Local Store is basically a fast local RAM. Very fast. It is a working space/scratch pad with very low latency. It isn't like L2 cache in that it doesn't reflect system memory, and you have to DMA to get data from system memory.
Local store is an explicitly addressed memory space that is not kept coherent with outside memory. DMA is something present in Cell, and is in general a good idea with this kind of architecture, but it is not strictly required.

Cache transparently exploits the locality in memory accesses that becomes apparent in the dynamic behavior of the software, at the cost of complexity and a lack of predictability. Local store forces locality, which is simpler, but it works under the constraint that dynamic behavior cannot always be captured by static software.

Local store, being explicit in nature, cannot be expanded to improve performance of old code. Old code must be recompiled; otherwise it will never try to address the extra space. Cache is transparent: if it can help, it will help.

Coherency is a double-edged sword. It exists because on-chip storage is not readily visible to the system. Coherency traffic is not computationally useful, and if not managed it can impact performance.
LS omits a lot of extraneous coherency traffic, but this makes things difficult if coherence is needed.

Level 1 cache, from my meager knowledge, tends to be much faster than L2 cache but also much smaller. I have seen it frequently broken up into 2 parts, instruction and data.
This is part of the design of a Harvard Architecture. It reflects the reality that accesses for code are almost always separate from accesses to data. Separate partitions allow for two advantages: loading an instruction does not block loading data (and vice versa), and one can have two caches with half the ports each (which is simpler and faster than one cache with a full number of ports).

L1 is kept small so the average latency is low.
L2 is larger, and it has higher associativity to cover the cases where the L1 misses.

As a result, the average latency is close to that of an L1, and most accesses avoid main memory in applications that show good locality.

The second part of interest is how we see the future unfolding in regards to LS and caches. Will there be some convergence? Could we see L1 caches grow in size to mimic LS?
Capacity has a direct effect on latency; an L1 cache the size of an LS would offer no latency advantage.

Could LS concepts be brought to PC CPUs, or does the business reality of higher level coding make such an approach unrealistic?

Issues with broad software compatibility and legacy code make local store very difficult to justify. Because it is software-visible, it cannot be used by any existing program without a recompile and will in most cases require a redesign of the software.

PC code is also less of a fixed platform, compared to embedded situations like a console. Workloads cannot be expected to conform to what works best for LS, and without a dominant market-share, heavily optimizing for local store might not be worthwhile if everyone else goes without it.

This is a problem the PS3 may face when it comes to multi-platform titles.

So there seems to be a lot of movement in the processor industry (a lot more than over the last 5 years it seems) and I am curious how caches and memory will play out in future designs and what we may, or may not, see in the 2010-2012 timeframe.

Perhaps we can expect there to be an L1 cache of some kind in front of the LS in future SPEs. It would allow the local store to expand in size while the cache keeps average latency down. Due to the much simpler memory space inside the LS, a cache mapped to it could be even faster than a cache that must be mapped to a more complex memory space.

On the other hand, it may be possible that future CPUs will emulate some aspects of an LS by allowing more explicit control of cache locking, coherency, and access. It won't have all the benefits of a local store, but it will be an opt-in kind of improvement. Even if the emulation is not used, the extra cache will still be useful.
 
Capacity has a direct effect on latency; an L1 cache the size of an LS would offer no latency advantage.
Would a large LS not suffer from any latency penalties versus cache? If I'm right in thinking that cache latencies are caused by checking whether a memory address is cached or not, then latencies increase with more cache, as there's more to search through. As LS access is explicit, the size of the LS is irrelevant to the direct addressing of the LS. Thus a 3 MB LS should still maintain the 6-cycle (aren't later revisions 4-cycle?) latency of Cell's LS, no? Would that offer a huge benefit to large dataset/instruction-set applications over L2 cache? Or are pipelined processors, especially OOOe ones, effective enough at hiding these latencies that the locality of LS wouldn't be very apparent?
 
LS has a 6-cycle latency in part because it takes that long to propagate signals through something that big.

It is faster than a comparable L2, in part because there is no need to check tags, but it can't be as fast as an L1 for signal and power reasons.

An L1, being as small as it is, can be optimized to allow for more power consumption, and it has much shorter data and signal lines.
A 6 MB LS with the same optimizations will burn more power, and it will still have to take additional time to get signals to travel through all the extra data lines.

Whether that time amounts to additional cycles depends on just how long the target clock cycle will be, and just how much larger the LS will be.
 
Perhaps we can expect there to be an L1 cache of some kind in front of the LS in future SPEs. It would allow the local store to expand in size while the cache keeps average latency down. Due to the much simpler memory space inside the LS, a cache mapped to it could be even faster than a cache that must be mapped to a more complex memory space.
I don't really see how this would work at all. Would the cache pull from the LS or from main memory? The former makes no sense, as you would still have to explicitly address the LS, and thus still have to know the LS size. Plus you'd still have to DMA into the LS to begin with. The latter doesn't really make any sense either, as you're already holding down latency by storing the data very close by in the LS. How much closer would it be in yet another cache on top of the LS? In fact, at that point, what difference is there between a normal L1 cache and an L1 cache/LS combo? The LS seems useless there.

The whole point of the LS is to trade away flexibility, and take on increased complexity in the code, in exchange for speed. Adding another cache on top of it just doesn't compute.
On the other hand, it may be possible that future CPUs will emulate some aspects of an LS by allowing more explicit control of cache locking, coherency, and access. It won't have all the benefits of a local store, but it will be an opt-in kind of improvement. Even if the emulation is not used, the extra cache will still be useful.
This, OTOH, doesn't seem like a bad idea. You already see some of those ideas with cache locking. But again, it's not particularly platform agnostic, so if you move the code to a processor without this feature or without enough cache to allow this, it wouldn't work. So I doubt we'll ever see this feature on x86, although possibly on some other ISA.

Still, it may not be the greatest use of die space, since a normal cache also has a lot of hardware behind it to automatically handle cache misses as well as keep cache coherency across multiple cores or processors. Using the cache as LS would waste all that hardware. Plus, a direct mapped or n-way associative cache would probably be difficult to implement on that since the cache size could change. You might have to have a fully associative cache, which would either cost even more die space or suffer some performance penalty.
 
Concerning the Cell and local store: I think I read somewhere about an IBM engineer who considered a size of 1 MB to be ideal for the SPU local store, but I could not find a link to that.

But I found this quote by Kahle et al:

http://www.research.ibm.com/journal/rd/494/kahle.pdf

Considering that the LS is such a large part of the SPU's die area, I would be a bit surprised if they quadrupled the size of the LS in the next iteration; adding more processing power may give a better return for the die space. But I take at least a doubling for granted. :)

It's not just about die space. There is a size vs. speed trade-off: bigger is slower.

It's 6 cycles to do a load; that's slower than most other CPUs' L1s (Core 2's L1 is 3 cycles, for instance) but much faster than most L2s. Every time they increased the size of the LS, it got a little slower. If they had made it smaller, it would have been faster. 256 KB was the compromise that best fit the types of things they imagined the CPU would be used for.

If they increased it to 1 MB per SPE, it would be significantly slower than it is now. If they do eventually decide to have SPEs with a 1 MB or larger LS, very likely they would want the LS to have its own L1.

Smallish LS means that for some problems you might have to go to system memory more often than you like. System memory is going to be something like 1000 cycles away. But the DMA unit can help hide that latency and make system memory seem closer.

The DMA unit in each SPE is basically an independent processor itself that executes simple programs. With a single SPU instruction you can tell the DMA unit to execute a pre-prepared list of DMA commands from LS, and the SPU can then merrily continue executing other stuff while the DMA processor takes care of the business of bringing data into the LS. By the time the SPU actually needs that data, with any luck you have timed it so that it has been delivered to the LS before you actually hit a stall.

Peter Hofstee's analogy is the bucket brigade. The idea is to keep as many buckets on the move at the same time as possible, so that you never have to wait too long for the next bucket of water to throw on the fire.
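
For the curious, a rough sketch of that bucket brigade in code, with the same caveats as any forum pseudo-code: it assumes the SDK's spu_mfcio.h intrinsics, and the chunk size, effective address and compute() routine are placeholders. While the SPU chews on one buffer, the MFC is already hauling in the next one.

Code:
/* Double-buffered streaming sketch; CHUNK, ea and compute() are
   placeholders, and error handling is omitted. */
#include <spu_mfcio.h>

#define CHUNK 16384
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(volatile char *data, unsigned size);

void stream(unsigned long long ea, unsigned nchunks)
{
    unsigned cur = 0;

    /* Prime the pipeline: start fetching chunk 0 on tag 0. */
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);

    for (unsigned i = 0; i < nchunks; i++, cur ^= 1) {
        /* Kick off the *next* transfer on the other buffer and tag
           before waiting, so DMA and computation overlap. */
        if (i + 1 < nchunks)
            mfc_get(buf[cur ^ 1], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, cur ^ 1, 0, 0);

        /* Wait only for the buffer we are about to use. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        compute(buf[cur], CHUNK);  /* the other DMA is still in flight */
    }
}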
 
I don't really see how this would work at all. Would the cache pull from the LS or from main memory? The former makes no sense, as you would still have to explicitly address the LS, and thus still have to know the LS size. Plus you'd still have to DMA into the LS to begin with. The latter doesn't really make any sense either, as you're already holding down latency by storing the data very close by in the LS. How much closer would it be in yet another cache on top of the LS? In fact, at that point, what difference is there between a normal L1 cache and an L1 cache/LS combo? The LS seems useless there.
The cache would be a very small cache mapped to the LS. It wouldn't be helpful now, but in the future a descendant of the current SPE might be clocked much higher and have a much larger LS.

The cache would allow the LS to scale in size with more favorable average-case latency.

The whole point of the LS is to trade away flexibility, and take on increased complexity in the code, in exchange for speed. Adding another cache on top of it just doesn't compute.
The LS forces a 6-cycle latency 100% of the time. A future LS at double or quad size might force a 10-cycle penalty. With a small L1 catching frequently accessed portions of the LS, it might be half that, on average.
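
As a purely illustrative back-of-envelope check (none of these numbers come from a real design):

Code:
average latency = hit_rate * L1_latency + (1 - hit_rate) * LS_latency
                = 0.85 * 4 cycles + 0.15 * 10 cycles
                = 4.9 cycles   (versus a flat 10 cycles with no L1)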

This, OTOH, doesn't seem like a bad idea. You already see some of those ideas with cache locking. But again, it's not particularly platform agnostic, so if you move the code to a processor without this feature or without enough cache to allow this, it wouldn't work. So I doubt we'll ever see this feature on x86, although possibly on some other ISA.
It could go into SSE5 or 6 as an obscure extension; it's a more involved combination of a number of streaming loads and prefetches, plus setting the MESI status of a cache line.
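
Some of the ingredients already exist on x86 today. A hedged sketch of the sort of building blocks being described, using existing SSE intrinsics; note that this only hints and bypasses the cache, it does not lock lines or set MESI state, so it is an approximation of LS-like behavior rather than a real extension:

Code:
/* Sketch only. Assumes dst is 16-byte aligned and n is a multiple of 4. */
#include <immintrin.h>

void place_data(const char *src, float *dst, int n)
{
    /* Hint the hardware to pull src into the L1 ahead of use. */
    _mm_prefetch(src, _MM_HINT_T0);

    /* Non-temporal (streaming) stores bypass the cache entirely,
       so the results don't evict the working set we care about. */
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_set1_ps((float)src[i]);
        _mm_stream_ps(&dst[i], v);
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}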

Still, it may not be the greatest use of die space, since a normal cache also has a lot of hardware behind it to automatically handle cache misses as well as keep cache coherency across multiple cores or processors. Using the cache as LS would waste all that hardware.
In instances where it is undesirable to clog the memory bus with coherency traffic (a problem that worsens with more cores), it may be worthwhile to turn off coherency in situations where it isn't needed.

Plus, a direct mapped or n-way associative cache would probably be difficult to implement on that since the cache size could change. You might have to have a fully associative cache, which would either cost even more die space or suffer some performance penalty.
Reserve one way in an 8-way 1 MB L2 cache, and you get 128 KB of simulated LS. Reserve 2 ways, and you get 256 KB.

The rest of the cache can be managed normally. When it comes to eviction policy, the reserved lines are always set so that in an LRU algorithm, they are never considered least recently used.
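
A toy sketch of what excluding reserved ways from victim selection might look like. The structure and names are entirely hypothetical; real replacement logic is hardware, not C, but the idea reads the same:

Code:
/* Hypothetical victim selection for an 8-way set with per-way lock bits. */
#define WAYS 8

struct cache_set {
    unsigned age[WAYS];         /* higher = older (least recently used) */
    unsigned char locked[WAYS]; /* 1 = way reserved as pseudo local store */
};

int pick_victim(const struct cache_set *set)
{
    int victim = -1;
    unsigned oldest = 0;

    for (int w = 0; w < WAYS; w++) {
        if (set->locked[w])
            continue;               /* reserved ways are never evicted */
        if (set->age[w] >= oldest) {
            oldest = set->age[w];
            victim = w;
        }
    }
    return victim;  /* -1 only if every way is locked */
}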

An L2 might take longer latency-wise, but then the much faster L1 would bring average-case behavior much closer in line with an LS. Coupled with the OoO logic already in place, it would be a hack, but a serviceable one.
 
Adding an L1 cache to the LS has another drawback: your program gets less predictable execution times, since you add a dependency on previous access patterns. Predictability was an essential design objective of the Cell.
One design goal of the Cell processor was predictable execution times, so programmers could better estimate the processing time of their software to meet frame rates.
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf
The synergistic processors in Cell provide a highly deterministic operating environment. Since they do not use caches, cache misses do not factor into the performance.
http://www.research.ibm.com/journal/rd/494/kahle.pdf

However, a cache miss that falls back to the LS would not be that expensive compared to going to main memory, so in reality it may not be such a big issue if they chose to implement an L1 cache in a future design for the reasons 3dilettante mentioned.
 
Would a large LS not suffer from any latency penalties versus cache? If I'm right in thinking that cache latencies are caused by checking whether a memory address is cached or not, then latencies increase with more cache, as there's more to search through.
As 3dilettante already mentioned, what really bites you past a certain point is the physical size of the cache. For large L2/L3 caches on recent processors, most of the latency comes from driving the signals to the core, and that is getting worse at smaller process nodes, with the metal layers scaling worse than the logic. In theory you can always shave cycles by using bigger drivers on the L2<=>L1 bus, but power consumption will go through the roof.
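
A rough rule of thumb for why the physical size bites (standard distributed-RC reasoning, not a figure for any particular chip):

Code:
delay of an unrepeated wire  ≈  0.4 * R_wire * C_wire
R_wire and C_wire both grow roughly linearly with length,
so delay grows roughly with the square of the distance across the array.
Repeaters and bigger drivers pull that back toward linear, at a power cost.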
 
Trying to get round those dratted laws of thermodynamics that are always dragging us backwards, would not the best solution for a large LS be a dual-LS system rather than LS+L1?
Perhaps a faster 256 KB LS, and beyond that a slower 1 MB LS. It would befall the developer to juggle content between the stores, though I expect the compiler could provide quite a lot of help there. It would give the benefit of a faster LS plus a larger LS at a slight penalty in speed, with a DMA engine between the two that could be set up to handle the data flow. Of course, if you have a fast DMA engine and can structure your data enough to fit that design, you could probably manage the smaller store with predetermined DMAs, as is already the case in Cell!

I guess the question is how much LS capacity is really needed in the applications, and in STI's development, 256 KB was the best compromise. Most tasks can work on data chunks that can be buffered+streamed into that storage (two lots of 64 KB data, plus code and work space). I can't really see many cases where you'd be craving more storage so much that you'd take on a higher LS latency to keep more data nearby instead of prefetching it. I think a larger LS is more a matter of convenience, and perhaps one that won't matter once devs are used to the 256 KB limit(?)
 
A cache miss that falls back to the LS would not be that expensive compared to going to main memory, so in reality it may not be such a big issue if they chose to implement an L1 cache in a future design for the reasons 3dilettante mentioned.
But what would a cache hit gain you? You should be able to keep the pipeline fully populated most of the time even with the LS latency. It's not like processors only run as fast as their nearest cache latency, executing only every third to sixth clock cycle! The data and instructions are streamed to the execution units, and in the SPE's case it's an absolute certainty (designers doing their job!) that the data is 6 cycles away, so that can easily be factored into the pipelining.
 
I guess the question is how much LS capacity is really needed in the applications, and in STI's development, 256 KB was the best compromise. Most tasks can work on data chunks that can be buffered+streamed into that storage (two lots of 64 KB data, plus code and work space). I can't really see many cases where you'd be craving more storage so much that you'd take on a higher LS latency to keep more data nearby instead of prefetching it. I think a larger LS is more a matter of convenience, and perhaps one that won't matter once devs are used to the 256 KB limit(?)

Yeah. I think future progress is going to come from just using a lot more SPEs, really, and optimising routines to scale to as many SPEs as are available.
 
But what would a cache hit gain you? You should be able to keep the pipeline fully populated most of the time even with the LS latency. It's not like processors only run as fast as their nearest cache latency, executing only every third to sixth clock cycle! The data and instructions are streamed to the execution units, and in the SPE's case it's an absolute certainty (designers doing their job!) that the data is 6 cycles away, so that can easily be factored into the pipelining.
That is a valid question. A cache would probably only serve a purpose if the LS latency grows a certain amount beyond 6 cycles.

I guess the ideal would be to increase the LS size without introducing higher latencies. Isn't that possible as you move to finer processes?

Here is a little more history and background to the SPE LS design, which may help in understanding it.
Sony and Toshiba had experience with the Emotion Engine** [1] processor with an accelerator processor whose memory could be accessed only via DMA. Since a large fraction of the power and chip area on conventional processors is associated with caches, it appeared that this model could provide for a more efficient computational element. The size of the private store of such an accelerator processor, which must hold its code and data, is a critical factor in the programmability of that processor. If the store is very small, the only programming models that seem to work are the deeply vectorized and streaming programming models popular on some of the more recent graphics processors. We increased the size of the local store twice: first, in the initial concept phases from 64 KB to 128 KB, and later, when it was doubled again to 256 KB. In both of these cases, programmability was the driving factor. A second crucial design decision is the organization of the dataflow. A number of options were discussed, including deep vector and other indirect forms of register addressing. The SIMD model was chosen because it had become the dominant model in media units in both x86 and PowerPC* processors as well as the Emotion Engine (MIPS) [2]. This allowed the reuse and adaptation of existing compiler technology.
http://www.research.ibm.com/journal/rd/494/kahle.pdf

I think the comparison to GPUs is a little interesting.

I remember reading at the Cell forums that if they had chosen to split data and program memory into separate memory spaces they might have been able to reduce the latency to about 4 cycles, but the flexibility of having one common address space for program and data was more valuable.
 
That is a valid question. A cache would probably only serve a purpose if the LS latency grows a certain amount beyond 6 cycles.

I guess the ideal would be to increase the LS size without introducing higher latencies. Isn't that possible as you move to finer processes?

That can't be assumed; signal delay hasn't scaled as well as transistor density or switching speed.
Also, on-chip memory is a prime target for power savings, such as using thicker gate oxides, gating, and sleep transistors. These can all slow the LS or add a cycle outright.

A larger LS would have to deal with that, or it will leak current like a sieve and burn a larger proportion of the core's power budget.

Even if the LS's wall-clock latency doesn't grow too much, a clock-speed increase will decrease the amount of leeway in terms of cycles that the LS can manage.

There could even be some code reuse between generations of Cell, since the cache could hide extra latency that the older code was not profiled to handle.

I remember reading at the Cell forums that if they had chosen to split data and program memory into separate memory spaces they might have been able to reduce the latency to about 4 cycles, but the flexibility of having one common address space for program and data was more valuable.

With a cache, that could be approximated while still maintaining flexibility.
 
Just an innocent and potentially stupid question...

what would happen if SPU II had 2 "flat" LSes instead of 1 bigger LS?
 
Just an innocent and potentially stupid question...

what would happen if SPU II had 2 "flat" LSes instead of 1 bigger LS?

In terms of latency? You need to have the LS as close to the load-store unit as possible. Having two LSes increases latency.

Harvard architecture CPUs with separate I and D caches exploit the fact that the text (code) segment of a program is usually not intermingled with the stack and heap segments. On top of that, different units address the two caches: instruction fetch reads from the I$, the load-store unit from the D$. That is what enables the separation.

Registers, caches and local stores are all attempts to put parts of the memory system as close to the computational resources as possible. In a traditional CPU you have:
1. Registers, explicitly controlled coherence
2. Caches, implicit coherence

In CELL's SPEs you have
1. Registers, explicitly controlled coherence
2. Local store, explicitly controlled coherence.

In both you actually have an implicit level 0 formed by the result forwarding network.

Explicitly controlled memory (like registers and LS) becomes part of the CPU core state. This makes life harder for an SPE supervisor preempting the core, because the supervisor has to pessimistically swap the entire LS (unless some form of managed code is used). This takes time and, worse, uses precious main memory bandwidth.
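
A rough back-of-envelope figure, assuming the commonly quoted ~25.6 GB/s of XDR main memory bandwidth and ignoring register state and contention:

Code:
save + restore of one 256 KB LS = 2 * 256 KB = 512 KB moved
512 KB / 25.6 GB/s              ≈ 20 microseconds per context switch,
and that bandwidth is stolen from every other SPE in the meantime.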

On a traditional CPU, only the core state is saved on a context switch. The implicit context stored in the caches is evicted (replaced) in a demand-based fashion as the new active thread uses more and more resources. The advantage of this approach is that you can have multiple thread contexts co-exist in the cache as long as their footprints don't stomp all over each other.

Of course, when you get cache thrashing you're in a world of pain; these are some of the most painful pathological performance cases to analyze and solve.

The answer to the high price of pre-emption on the SPEs is "don't do that". DeanoC from NT already detailed how they use a (massive) job queue and run these jobs to completion on the SPEs sequentially. You can view this as co-operative multitasking, or a throwback to the sixties with batch-job processing instead of the multiprogramming paradigm that is used everywhere else today.

Because the local store is essentially part of the CPU state, and thus makes context switching crazy expensive, I don't think we'll see larger local stores in future versions of CELL. Instead we might see a huge shared cache (i.e. an implicitly controlled memory pool) optimized for density and bandwidth (i.e. super-wide slow buses, high latency) where LS content can be dumped.

Cheers
 
That can't be assumed; signal delay hasn't scaled as well as transistor density or switching speed.
Also, on-chip memory is a prime target for power savings, such as using thicker gate oxides, gating, and sleep transistors. These can all slow the LS or add a cycle outright.

A larger LS would have to deal with that, or it will leak current like a sieve and burn a larger proportion of the core's power budget.

Even if the LS's wall-clock latency doesn't grow too much, a clock-speed increase will decrease the amount of leeway in terms of cycles that the LS can manage.

There could even be some code reuse between generations of Cell, since the cache could hide extra latency that the older code was not profiled to handle.

With a cache, that could be approximated while still maintaining flexibility.
The SRAM used in Cell is a core component; they had a separate paper about it at ISSCC '05, and they mention the dual power supply as a means of keeping stability and performance in their ISSCC '07 paper.
The 65 nm CELL BE design features a dual power supply, which enhances SRAM stability and performance using an elevated array-specific power supply, while reducing the logic power consumption.
http://www.isscc.org/isscc/2007/ap/isscc2007.advanceprogram110306.pdf

I think they want to stick with the current setup of the LS as long as possible.
 
The SRAM used in Cell is a core component; they had a separate paper about it at ISSCC '05, and they mention the dual power supply as a means of keeping stability and performance in their ISSCC '07 paper.

That's nothing new. Desktop CPUs have been doing that for a while.

I think they want to stick with the current setup of the LS as long as possible.

If they keep the exact same setup, the latency in cycles for the LS will be going up in future versions of Cell.
 