Local Store: Possible with a "traditional" cache CPU?

Acert93

As developers have come to grips with the new consoles, we have heard a lot about the Cell processor, both positive and negative. But in general all agree that Cell has a ton of potential. One reason Cell has such impressive performance (among many reasons) is the presence of Local Store on the Cell SPEs.

The one hurdle I see for Local Store as a mainstream concept (and I could be wrong here) is that the concept isn't as accessible and needs more hand-holding, hence it has a hard time penetrating markets where more managed solutions (like cache) thrive. Code complexity is also increasing, which may be a hurdle for mass market penetration (i.e. outside of consoles, supercomputers, render farms, workstations, and other specialty systems).

So my question(s):

Will we see CPUs with both cache and a "local store" (or SPRAM, or whatnot)? If possible, would the footprint of both be unreasonable? Or will 256K Local Stores be here for a while and thus be an option (e.g. 2MB cache and 256K LS)?

Will we see Local Stores designed (possibly with a software layer?) that can mimic a traditional L2 cache? Or vice-versa, caches that can revert to a local memory/scratchpad?

How likely are these developments? What hurdles are there? Is such an advance unnecessary? Would a LS/L2 hybrid solution be nice in concept but poor in execution?

And what is the future of Local Stores? Will we see them migrate into Intel's and AMD's processors or will they remain in the realm of specialty "closed" systems like consoles and super computers?


PS: Thanks to Panajav for reminding me of this question :) Different context, but it reminded me of the idea.
 
Some time ago I was googling for software caches on SPE local memory; IIRC there is even an implementation for texture caching in the PS3 SDK. Obviously it cannot be as fast as a hardware cache, but it should simplify coding.
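To make the idea concrete, here is a minimal sketch of a direct-mapped software cache sitting in SPE local store, with misses serviced by DMA. It assumes the Cell SDK's spu_mfcio.h interface (mfc_get and the tag-status calls are real SDK intrinsics); the line size, line count, and single-tag scheme are illustrative choices, not how the SDK's actual texture cache works.

```c
/* Sketch: direct-mapped, read-only software cache in SPE local store.
 * A miss DMAs a whole line in from main memory and blocks until done. */
#include <spu_mfcio.h>
#include <stdint.h>

#define LINE_BYTES 128               /* one line = one aligned DMA transfer */
#define NUM_LINES  64                /* 8 KB of local store spent on lines  */
#define DMA_TAG    0                 /* illustrative: any tag 0..31 works   */

static uint8_t  lines[NUM_LINES][LINE_BYTES] __attribute__((aligned(128)));
static uint64_t tags[NUM_LINES];     /* effective address of each cached line */
static char     valid[NUM_LINES];

/* Return an LS pointer to the byte at main-memory effective address ea. */
static void *sw_cache_read(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)(LINE_BYTES - 1);
    unsigned idx     = (unsigned)(line_ea / LINE_BYTES) % NUM_LINES;

    if (!valid[idx] || tags[idx] != line_ea) {
        mfc_get(lines[idx], line_ea, LINE_BYTES, DMA_TAG, 0, 0);
        mfc_write_tag_mask(1 << DMA_TAG);
        mfc_read_tag_status_all();   /* a real cache would overlap this wait */
        tags[idx]  = line_ea;
        valid[idx] = 1;
    }
    return &lines[idx][ea & (LINE_BYTES - 1)];
}
```

Every access pays a tag check in software, which is exactly the overhead a hardware cache hides; hence it can't match hardware speed, but it spares you hand-rolled DMA at every access site.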

I highly doubt we will see any PC CPU that forces local store programming, even if it's limited to the OS.
I also don't think having both local memory and a same-level, similarly sized cache makes much sense. If I had to guess, I would say fully programmable caches are the future in the console space.
 
Just my two (uneducated) cents.
Wouldn't adding a software cache defeat the purpose of LS?
LS is fast because it needs very little extra control logic, it's completely transparent so devs can easily profile what's wrong, and it's plenty fast relative to its size.

For the future I could see a huge shared LS (like an L3 for a system that supports cache of any kind).

This huge memory pool could help to alleviate some bandwidth problems, if the EIB doesn't ramp up evenly with the number of SPEs (internal bandwidth limitation) or if the chip is slightly bandwidth limited to main RAM.
This could help while leaving the devs in complete control and making up for some possible bandwidth limitations.

IMHO, IMVHO.
 
Just my two (uneducated) cents.
Wouldn't adding a software cache defeat the purpose of LS?
LS is fast because it needs very little extra control logic, it's completely transparent so devs can easily profile what's wrong, and it's plenty fast relative to its size.
I think the idea is providing devs with choices. For normal operation you can rely on the cache, and for performance tasks you can manage the LS. In a dual-core setup, perhaps you could assign a core to one or the other, or maybe have one L2 cache for one core and an L2/LS switchable store for the other, that you can set up for SIMD work.

However, I personally think it unlikely because the established coding base wouldn't very willingly make the switch. Existing code won't benefit from LS, and no one will write future code for it if no one's got a processor to use it. That's for established x86, PPC and conventional designs. You may get a switch in specialist processors, perhaps, although I imagine people interested in specialist processors would more happily accept an LS-only solution like Cell.
 
The advantage of putting memory without hardwired associativity (scratchpad, local store, etc.) on the die is space, and nothing but space (well, maybe heat also).
But that is a huge advantage if you build the software around it.
Traditional cache can often be locked, so as to "mimic" a scratchpad, but the advantage of taking up less space (thereby being able to fit more in the same space) is lost of course, making the usability of that function relatively limited.
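To illustrate the locking pattern (not any specific CPU's interface): the lock/unlock helpers below are hypothetical stand-ins for a platform's cache-locking mechanism, which is usually privileged and vendor-specific.

```c
/* Conceptual sketch of using cache locking as a poor man's scratchpad.
 * l2_lock_range()/l2_unlock_range() are HYPOTHETICAL helpers standing in
 * for whatever locking hook a given platform actually exposes. */
#include <stddef.h>
#include <stdint.h>

extern void l2_lock_range(void *addr, size_t len);   /* hypothetical */
extern void l2_unlock_range(void *addr, size_t len); /* hypothetical */

static uint8_t workspace[64 * 1024] __attribute__((aligned(128)));

void run_kernel_with_pinned_workspace(void (*kernel)(uint8_t *, size_t))
{
    /* Pin the workspace so every access hits at cache latency, like a
     * scratchpad -- but unlike a true scratchpad, these lines still carry
     * full tag/associativity hardware and steal capacity from other code. */
    l2_lock_range(workspace, sizeof workspace);
    kernel(workspace, sizeof workspace);
    l2_unlock_range(workspace, sizeof workspace);
}
```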
 
Another name for Local Store is Scratchpad RAM which has been in processors for ages.

The PS2's MIPS R5900 has Scratchpad that can be hardware-switched into D-cache (at least half of it, anyway) so the software can decide which is best.

On the PS1, I remember remapping our stack to scratchpad to make function calls faster...

Explicit local fast RAM is a great idea as long as you get enough. The question of what counts as enough is something that I expect changes over time. I remember being impressed with the 16K (IIRC) of PS2 Scratchpad over the PS1's 1K.

Scratchpad RAM/Local Store over the generations
PS1 - 1KB
PS2 - 40KB (16KB + 16KB VU1 + 4KB VU0)
PS3 - 1792KB (7 * 256KB)
Extrapolated PS4 - 70MB (assuming a ~x40 increase) ;-)
 
As LS size grows, doesn't latency grow with it?

Btw, thanks for all the great comments!

For the future I could see a huge shared LS (like an L3 for a system that supports cache of any kind).

It does seem quite a few designs are going to tiered memory/cache systems. So this seems likely.

I think the idea is providing devs with choices.

Yeah, that was my thought. I doubt any upcoming console will go without at least one CPU with a cache (the PPE has a cache, after all). The "traditional" model has some appeal and a lot of resources (people and tools) that work great with it. Opening the door to the performance benefits of a scratchpad on a "normal" CPU could be a benefit, though (maybe?). It may not be SIMD or as streamlined as an SPE, but having a very fast local memory pool with low latencies has some advantages?

For normal operation you can rely on the cache, and for performance tasks you can manage the LS.

Cookies for Shifty for getting what I was "hinting at". So, now is it possible?

You may get a switch in specialist processors, perhaps, although I imagine people interested in specialist processors would more happily accept an LS-only solution like Cell.

Which is an argument for future consoles to have a mix of SPE- and PPE-like processors... but I am wondering if maybe a company like MS or Nintendo, who isn't quite as tied to Cell, may be able to have "their cake and ice cream" to a degree.

Traditional cache can often be locked, so as to "mimic" a scratchpad

I thought L2 cache also came with much bigger penalties. When it is locked, can it mimic the general performance of a Local Store of similar size in KB (not in footprint)?

Extrapolated PS4 - 70MB (assuming a ~x40 increase)

MEGATON!

"PS4 Cell to have 70MB of on chip memory." -- DeanoC, PS3 uber dev.

You heard it here first folks! :smile:
 
I would like to ask the honorable armchair CPU designers to stop throwing around numbers for a while and suggest how these LS would be utilized by real-world applications: ones that have already been written, and ones that will be written in the future by normal software teams (i.e. without a mad genius locked in the basement).
 
Just my two (uneducated) cents.
Wouldn't adding a software cache defeat the purpose of LS?
Not unless the purpose of LS is to hinder programming.
Echoing others, the point of a software cache is to simplify programming, especially when performance is not an issue.
Unfortunately, memory access through a software cache is not "fully" transparent to the coder, unlike with a hardware cache.
LS is fast because it needs very little extra control logic,
The latency introduced by extra logic is not necessarily significant compared to other sources of latency.
As LS size grows, doesn't latency grow with it?
Yes, generally speaking.
I thought L2 cache also came with much bigger penalties. When it is locked, can it mimic the general performance of a Local Store of similar size in KB (not in footprint)?
In the case of Xenon, I believe the answer is no. Not that I know much about Xenon, but I've read that locking is for FIFO-style streaming (to the GPU?) only, as opposed to more general-purpose scratchpad usage. I hope some 360 developers can shed light on this.
 
MEGATON!

"PS4 Cell to have 70MB of on chip memory." -- DeanoC, PS3 uber dev.

You heard it here first folks! :smile:
Is this really so far-fetched? Maybe your smiley wasn't intended that way?

We will soon (Q4 '07) have Intel desktop CPUs with 12 MB of cache; if we extrapolate that development to 2012, Intel CPUs could certainly reach those numbers.

I have another question for people with some LSI knowledge concerning CPUs. How much more area-efficient is LS compared to conventional cache memory (say, 4-way associative)? I.e., how many more MB per unit of die area do you get with LS compared to cache: 10%, 20%, 30% more?
 
Squeak, indeed, the R5900 had 56K of "local store" mem.
But accounting for the split into D/I memory on the VUs, the increase is still pretty much 40x. ;)

Joshua said:
I thought L2 cache also came with much bigger penalties. When it is locked, can it mimic the general performance of a Local Store of similar size in KB (not in footprint)?
Performance doesn't change with locking, but cache locking isn't exclusive to L2 (e.g. Gekko has it for L1).
Not all memories are created the same either: some L2s are actually fast, unlike those in the HD consoles, and some scratchpads can be butt slow too...

And btw, 70MB of on-chip memory wasn't so far from happening even this gen.
 
Wouldn't it be difficult to utilize an LS in a PC, considering you typically run many different programs simultaneously?

Wouldn't task switching introduce absolutely devastating overhead for any LS larger than a trivial/useless size?

It's bad enough switching contexts already, it seems, or at least that's the impression I get from listening to you programmer types discussing draw calls and the Windows XP DX9 driver model and such...

Then imagine adding one or even several megabytes of extra data to switch in/out. Major kick in the nads performance-wise.

So unless the LS is a finite resource that only one program at a time is allowed to utilize, it wouldn't do us much good, it seems to me. But then I'm just a layperson. :cool:

Peace.
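A rough back-of-the-envelope (hedged, and assuming Cell-like figures: a 256 KB local store and on the order of 25 GB/s to main memory): swapping one LS out and back in moves about 512 KB, i.e. roughly 512 KB / 25 GB/s ≈ 20 µs per switch, before you even count registers and in-flight DMA state. That is orders of magnitude more data movement than a conventional register-only context switch.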
 
Wouldn't it be difficult to utilize an LS in a PC, considering you typically run many different programs simultaneously?

Wouldn't task switching introduce absolutely devastating overhead for any LS larger than a trivial/useless size?
Yep.
It's bad enough switching contexts already, it seems, or at least that's the impression I get from listening to you programmer types discussing draw calls and the Windows XP DX9 driver model and such...
The idea is that you don't use those SPUs (or whatever you call them) for switching contexts, but rather see them as servers. You send a packet of data from your process to a server and get it back later (and hopefully can do something else in between).
For example, if you had a dedicated "DX9 server", it would never have to switch contexts; various processes would just send data to it. That of course would mean you won't run anything else on it.

Using it as just another CPU would cause the problems you stated; Cell has the PPU for a reason.
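A minimal sketch of that server pattern on an SPU, assuming the Cell SDK's mailbox and DMA intrinsics from spu_mfcio.h. The fixed packet size, the address-in-two-mailbox-words convention, and process_packet() are illustrative assumptions, not an SDK protocol.

```c
/* Sketch: an SPU running as a service that never context-switches.
 * Clients post the effective address of a work packet via the mailbox;
 * the SPU pulls it in, processes it, and writes the result back. */
#include <spu_mfcio.h>
#include <stdint.h>

#define PACKET_BYTES 4096            /* illustrative fixed packet size */
static uint8_t packet[PACKET_BYTES] __attribute__((aligned(128)));

extern void process_packet(uint8_t *buf, uint32_t len);  /* illustrative */

int main(void)
{
    const unsigned tag = 0;
    for (;;) {
        /* Block until a client posts a packet address (illustrative
         * convention: two 32-bit mailbox words, high half then low). */
        uint64_t ea = ((uint64_t)spu_read_in_mbox() << 32)
                    | spu_read_in_mbox();
        if (ea == 0)
            break;                   /* a zero address means shut down */

        /* Pull the packet into local store and wait for the DMA. */
        mfc_get(packet, ea, PACKET_BYTES, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        process_packet(packet, PACKET_BYTES);

        /* Push the result back and signal completion. */
        mfc_put(packet, ea, PACKET_BYTES, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
        spu_write_out_mbox(1);
    }
    return 0;
}
```

The SPU's local store is never swapped; clients queue work to it, at the cost of dedicating the core to one service.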
 
There's a reasonably interesting paper at ISCA this year on the topic of trade-offs between caches and local store. (Basically, it argues that caches are the winning approach all around and local store isn't worth the trouble.)

http://csl.stanford.edu/~christos/publications/2007.pmarch.isca.pdf

Comparing memory systems for chip multiprocessors
Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis

Abstract:
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
 
There have already been a lot of processors which can use cache as local memory, by allowing you to lock a region of memory in the cache... I think Cyrix even had it in some of their x86 processors, although I don't know if it was exposed.
 
But as has already been said, local store/scratchpad memory is hard to virtualize in a timesliced multiprocessing environment (I assume that "mainstream" meant PCs and servers in the first post).

CPUs with cache, better implementation of prefetches via hints (and maybe a small dedicated buffer for prefetches that won't pollute caches), and proper use of non-temporal stores with relaxed write semantics so that they can bypass store queues and caches (use a memory fence if you need to order your non-temporal stores) are the way forward IMO.
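On today's x86 that combination already exists in a basic form via SSE intrinsics. A small sketch, with some assumptions: 16-byte-aligned buffers, n a multiple of 4, and the prefetch distance is an illustrative guess rather than a tuned value.

```c
/* Sketch: software prefetch hints plus non-temporal (cache-bypassing)
 * stores, fenced at the end to order the streaming writes. */
#include <xmmintrin.h>   /* SSE: _mm_prefetch, _mm_stream_ps, _mm_sfence */

void scale_stream(float *dst, const float *src, long n, float k)
{
    __m128 kv = _mm_set1_ps(k);
    for (long i = 0; i < n; i += 4) {
        /* Hint data ~64 floats ahead into cache before we need it. */
        _mm_prefetch((const char *)(src + i + 64), _MM_HINT_T0);
        __m128 v = _mm_mul_ps(_mm_load_ps(src + i), kv);
        /* Stream the result past the cache: the output has no reuse,
         * so don't let it evict data that does. */
        _mm_stream_ps(dst + i, v);
    }
    _mm_sfence();        /* order the non-temporal stores before continuing */
}
```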

Local stores are not only hard to virtualize; they are also hard to share in a multicore environment (unlike lower-level caches).

Cheers
 
Sharing lower-level caches is also hard. Even with a CMP, hardware cache coherency is hardly low cost if you allow writable data to be replicated (personally, I would prefer writable data never be replicated between caches).
 
There's a reasonably interesting paper at ISCA this year on the topic of trade-offs between caches and local store. (Basically, it argues that caches are the winning approach all around and local store isn't worth the trouble.)

http://csl.stanford.edu/~christos/publications/2007.pmarch.isca.pdf

Comparing memory systems for chip multiprocessors
Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis

This paper is really interesting, thanks for the link.
I was surprised to see that local store based architectures don't have, at least according to this work, a significant edge over cache-based architectures.
I'd like to see more studies like this, but if other independent works confirm this one then yes... it seems caches are the winning approach (less hassle for the coder), even though you still need to make an effort and code as if you were working on a local store based architecture to extract maximum performance from your cache-based architecture.
I only have one doubt: is it realistic to assign the same amount of local memory to both architectures (32 KB) when the one based on a local store should be able to have more local memory than the one employing caches? (I assume a local store type of memory is denser than a cache...)
 