Local Store, L1, L2 Caches and the Future of Processors: Compare, Contrast?

Adding an L1 cache to the LS has another drawback: your program gets less predictable execution times, because execution becomes dependent on previous access patterns. Predictability was an essential design objective of the Cell.

http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/D9439D04EA9B080B87256FC00075CC2D/$file/MPR-Cell-details-article-021405.pdf

http://www.research.ibm.com/journal/rd/494/kahle.pdf

However, a cache miss to the LS would not be that expensive compared to going to main memory, so in reality it may not be such a big issue if they chose to implement an L1 cache in a future design, for the reasons 3dilettante mentioned.

Truth be told, basically all of the CBEA hardware documents available online show an L1 cache shared by all SPEs (it sits between the EIB and the LS, though).
 
If they keep the exact same setup, the latency in cycles for the LS will go up in future versions of Cell.

Do you think that has happened in the move to 65 nm?

They claim the 65 nm chip is a 6 GHz chip, compared to the current version, which was rated as a 4 GHz chip. 6 GHz sounds pretty remarkable if they have been able to keep the 6-cycle latency.

From the PS3 point of view, clock-by-clock (or very close to it) backward compatibility is very important, so Sony has a pretty strong incentive to keep it low.
 
If 65nm is the only transition Cell goes through, and it's only going in the PS3, that would be interesting.

It might be that the current overall design for Cell was really targeted at 65 nm all along.

They are going to change things eventually, either for another transition, or to fit Cell into markets not served by a one-size-fits-all solution.
 
If 65nm is the only transition Cell goes through, and it's only going in the PS3, that would be interesting.

I am not sure I follow you; Sony et al. have been open about the fact that they will move to finer processes than 65 nm. What do you mean?

Edit: Of course there will be different versions of Cell targeting different applications; e.g., Kutaragi has been talking about micro-Cells. IBM will present an enhanced double-precision version; maybe the 6 GHz chip presented at ISSCC is that chip, I don't know.

Anyways, I maintain that code and timing compatibility will be key features of future Cells for the PS3. There are probably many different ways to achieve that.
 
I am not sure I follow you; Sony et al. have been open about the fact that they will move to finer processes than 65 nm. What do you mean?

Edit: Of course there will be different versions of Cell targeting different applications; e.g., Kutaragi has been talking about micro-Cells. IBM will present an enhanced double-precision version; maybe the 6 GHz chip presented at ISSCC is that chip, I don't know.

Anyways, I maintain that code and timing compatibility will be key features of future Cells for the PS3. There are probably many different ways to achieve that.

My view was longer-term than the PS3. Cell could stand more improvement with the extra transistors offered by finer geometries.
I agree they wouldn't change much for a PS3 version.
On the other hand, it's not like an LS cache would necessarily damage anything, if the worst-case latency is kept to 6 cycles.

The worst that can happen in such situations is that the SPE finishes early and has to idle.
 
They claim the 65 nm chip is a 6 GHz chip, compared to the current version, which was rated as a 4 GHz chip. 6 GHz sounds pretty remarkable if they have been able to keep the 6-cycle latency.
IBM has significantly improved its ability to develop high-speed ICs over the last two years. Both Cell and Xenon were great occasions to use external funds for advanced research on custom logic design, and it seems to have paid off. If POWER6 comes out in the 4-5 GHz range as advertised, with a much shorter pipeline than Cell, then it is not so unrealistic that a 65 nm shrink could hit 6 GHz. That's not taking into account power consumption, obviously, which might go way up.
 
IBM has significantly improved its ability to develop high-speed ICs over the last two years. Both Cell and Xenon were great occasions to use external funds for advanced research on custom logic design, and it seems to have paid off. If POWER6 comes out in the 4-5 GHz range as advertised, with a much shorter pipeline than Cell, then it is not so unrealistic that a 65 nm shrink could hit 6 GHz. That's not taking into account power consumption, obviously, which might go way up.

I'm expecting the 65 nm Cell to plummet in power use (at least at 3.2 GHz).
It's currently rated at 110 W at 1.1 V. Going dual-voltage will allow them to drop the logic voltage; even at 90 nm that would be a 30+% saving. 65 nm should allow a further saving of around 30%. In total I expect the saving to be 50% or better at the same clock speed.

This should enable Cell to run at higher frequencies in uses other than the PS3; with fewer constraints on power, expect some pretty high-frequency parts. I don't know if they'll get anything like 6 GHz, though that may be possible if they decouple the frequency of the PPE: it uses more power than an SPE, so if the PPE were clocked lower, the SPEs could be clocked relatively higher.

LS Vs Cache
On the topic of the thread...

Each has relative advantages and disadvantages, but an LS is higher-bandwidth and lower-latency than a cache. L1s may be lower latency, but that's because they are small; a 256 KB L1 would run slower than a 256 KB LS.

A cache represents memory; its purpose is to pretend to the CPU that the memory is faster than it really is. A cache needs to be kept in sync with any other caches. This can be avoided on dual-core chips by using larger shared caches, but that increases latency. As the number of cores increases, coherence will become a big problem, as it will increase cache latency. Expect to see complex cache arrangements like AMD's Barcelona or Sun's Rock.

An LS does not represent memory, so it does not require redirection logic; it also doesn't need to be kept in sync (coherent) with other LSs, so increasing the number of SPEs will not cause LS latency to increase. LSs are also smaller than caches and use less power.

The problem with LSs is unpredictable code or data structures. A cache is better for working with this type of code, which is why the PPE uses a cache; control code is much more likely to be like this.

High-compute code can use more predictable data structures, which are more LS-friendly. When it does, data can be double-buffered so the processor does not stall on memory. Processors without an LS generally cannot do this very well (if at all), so SPEs can run this sort of code faster.

If the data cannot be made LS-friendly, the LS can be made to act like a cache. This has fairly high latency, but it enables more application types to work on SPEs.

The LS is like many decisions in Cell: they've traded hardware complexity for software complexity. The result is a fast processor which uses relatively little power but takes more thought (or a complete change of thought) to program.
 
On a traditional CPU, only the core state is saved on a context switch. The implicit context stored in the caches is evicted (replaced) in a demand-based fashion as the new active thread uses more and more resources. The advantage of this approach is that multiple thread contexts can coexist in the cache as long as their footprints don't stomp all over each other.

Of course, when you get cache thrashing you're in a world of pain; these are some of the most painful pathological performance cases to analyze and solve.

The answer to the high price of pre-emption on the SPEs is "don't do that". DeanoC from NT already detailed how they use a (massive) job queue and run these jobs to completion on the SPEs sequentially. You can view this as co-operative multitasking, or as a throwback to the sixties, with batch-job processing instead of the multiprogramming paradigm that is used everywhere else today.

Because the local store is essentially part of the CPU state, and thus makes context switching crazy expensive, I don't think we'll see larger local stores in future versions of CELL. Instead we might see a huge shared cache (i.e. an implicitly controlled memory pool) optimized for density and bandwidth (i.e. super-wide slow busses, high latency), where LS content can be dumped.
As you point out, one weakness of the LS is that there is no continuously updated image of its contents in main memory. (From the bandwidth POV it's an advantage, but I'll leave that discussion out here.)
And it really manifests itself if you are forced to switch in/out processes that take advantage of all SPEs, as it literally means swapping megabytes of data in and out.
As you point out, you can avoid that by using batch job queues that run to completion, but this is a fairly old style of handling things compared to how pre-emptive multitasking is handled on modern computers.

I have this maybe crazy idea that you could somehow combine them (the job queue with a modern multitasking OS). First you have some multitasking processes running on the PPE, just like on any ordinary multitasking OS, but each process has an associated job queue feeding the SPEs. The PPE process spawns jobs to the job queue, and perhaps even the SPE jobs themselves would be allowed to spawn new jobs, just like in DeanoC's implementation. The SPEs run batch jobs fed by the queue of the active process, and when one process is swapped out for another, the queue is swapped as well, so each SPE starts picking jobs from the job queue of the new process once its present job has run to completion. The OS monitors the SPEs, and if some jobs do not run to completion within some predefined time, those jobs get explicitly swapped out until their process is swapped in again.
For this to be a viable solution, the memory controllers of the SPEs must allow memory protection to be associated with each SPE individually. Has anyone seen any information about this?

Of course this would impose a new programming paradigm with a lot of asynchronously running jobs, but I am not sure it would add much complexity compared to the alternatives. Already today, programmers are striving to create loosely coupled asynchronous processes/threads to have them running on multiple cores. This would be an explicit way of implementing that, achieving pretty good load balancing at the same time.
 
Wasn't one of the reasons Sony and IBM used a local store instead of a cache to totally get rid of cache misses?

All LS does is replace cache misses with manually scheduled DMAs.

It's a tradeoff either way, you get more control with DMAs, but more work and things to debug for the developer.
 
The crucial reason for local stores is scaling to many cores. Maintaining coherence across multiple caches is a headache that grows with their number, not only from a chip-complexity POV but also from a performance POV (see: Intel's old FSB approach to multiprocessor systems vs. AMD's approach vs. stuff on Intel's own roadmap). Snooping eight peers to find out who has the most recent version of a piece of memory is a lot of extra latency.
 
All LS does is replace cache misses with manually scheduled DMAs.

It's a tradeoff either way, you get more control with DMAs, but more work and things to debug for the developer.

It is more work, but I would not underestimate the importance of having more control and more debug info for the developer. In consoles, where you want precise control of your hardware's performance, there are some nice abstractions that would make programming easier but would make it difficult to finely control performance (one reason why garbage-collected programming languages are still not the first choice when developing performance-oriented game engines).

I'd say that a common SPE L1 cache sitting between the LSs and the EIB (you have to go through it to access XDR memory) would be a possible implementation for a CELL Broadband Engine 2 processor. The problem could be the L1's size, depending on the number of SPUs, but we might decide to have separate L1 caches for clusters of SPUs (say, 1 cluster = max 8 SPUs) to keep each L1 cache's size down.

Coherency traffic between L1 caches would take away some of our precious EIB bandwidth though.

How to prevent issues?

1.) We could have DMA transactions (originated by the SPE's MFC) which are marked specifically as cacheable, so that normal DMA operations (useful for backward compatibility) would just be streamed through the L1 cache and not overwrite its contents. Sometimes we do not want to make random accesses to XDR or take advantage of temporal locality, but rather stream chunks of data, and we can schedule these transfers early enough, and have enough of them in flight simultaneously, to hide the latency of the DMA operation pretty well. This could help reduce L1 thrashing.

and (optional)

2.) We could have an inclusive L2 cache shared between the L1 caches (one per SPU cluster): this would take coherency traffic off the EIB, but it would add more cycles to a cache miss that blows through the L1 and L2 and HAS to go to XDR memory.

I also think we might see the LS's size increase (after all, it would not cause backward-compatibility problems, and code could adapt pretty quickly to the new size, as you can even query it via appropriate instructions today, IIRC).
 
1.) We could have DMA transactions (originated by the SPE's MFC) which are marked specifically as cacheable, so that normal DMA operations (useful for backward compatibility) would just be streamed through the L1 cache and not overwrite its contents. Sometimes we do not want to make random accesses to XDR or take advantage of temporal locality, but rather stream chunks of data, and we can schedule these transfers early enough, and have enough of them in flight simultaneously, to hide the latency of the DMA operation pretty well. This could help reduce L1 thrashing.

Or you could have certain memory pages marked as cacheable or non-cacheable, which is the case for the PPE's cache, IIRC.
Program code and common global data would be located at cacheable addresses, and streamed data at non-cacheable addresses.
 

How big a cache were you thinking of between the LS and the EIB? Smaller or larger than the LS itself?

It seems like a weird place for a cache. A buffer maybe, but there is probably already a small buffer there. A small cache (much smaller than the LS) doesn't seem to fit the design paradigm of the Cell. And if you were thinking of a cache bigger than the LS, why would you put it between the LS and the EIB? Wouldn't a large shared cache between the XDR bus and the EIB be the obvious choice?

How about, instead of a cache, an L2 LS! A slower but large LS shared by multiple SPUs that is fully addressable. It would be an L2 scratch space that you could very quickly load prebuffered code/data from. The interface to it could be DMA, the same way you can DMA data from other SPUs' LSs.

I could be way off, but in my mind the memory operation that is going to be the most expensive, and that needs to be addressed, is the one requiring a full context switch for the SPU and LS, not the streaming memory operations during actual computation; those are going to be largely misses anyway. It's the latency of switching from one task or job to another that I'm trying to think of how to speed up. I don't see how a small cache between the LS and EIB will help much.
 
How about, instead of a cache, an L2 LS! A slower but large LS shared by multiple SPUs that is fully addressable. It would be an L2 scratch space that you could very quickly load prebuffered code/data from. The interface to it could be DMA, the same way you can DMA data from other SPUs' LSs.

But which SPU(s) decide what goes in the shared LS?

The only thing that makes sense is to place a big cache (coherent memory pool) before the bus interface.

Cheers
 
But which SPU(s) decide what goes in the shared LS?

The only thing that makes sense is to place a big cache (coherent memory pool) before the bus interface.

Cheers

The programmer would decide. But I guess the advantages of having full control are probably undone by the fact that large teams of programmers would face nightmarish concurrency issues, etc.
 
The only thing that makes sense is to place a big cache (coherent memory pool) before the bus interface.

Why does everyone see the need for a cache?

If data is cacheable, it may end up in the PPE cache, in which case the SPE will read it from there. However, this is both high latency and not terribly fast.
A dedicated cache for SPEs isn't going to be any better; probably worse, as it needs to be coherent with the PPE cache.
 
Why does everyone see the need for a cache?

If data is cacheable, it may end up in the PPE cache, in which case the SPE will read it from there. However, this is both high latency and not terribly fast.

Not to mention you are polluting the EIB and stealing the EIB's bandwidth: if you can afford an L1 cache between the SPU's LS and the EIB (to speed up random accesses to XDR memory, for example), that is the best place to put it. It sounds rational, and its being rational MIGHT also be the reason why IBM itself outlines, in the CELL Broadband Engine Architecture paper, the optional presence of an SPU L1 cache, placed between the SPE's LS and the EIB ;).
 
Not to mention you are polluting the EIB and stealing the EIB's bandwidth: if you can afford an L1 cache between the SPU's LS and the EIB (to speed up random accesses to XDR memory, for example), that is the best place to put it. It sounds rational, and its being rational MIGHT also be the reason why IBM itself outlines, in the CELL Broadband Engine Architecture paper, the optional presence of an SPU L1 cache, placed between the SPE's LS and the EIB ;).

If it's between the LS and the EIB, wouldn't it be more accurate to call it an L2 cache? Are "L" nomenclatures even valid for such a cache?
 
If data is cacheable, it may end up in the PPE cache, in which case the SPE will read it from there. However, this is both high latency and not terribly fast.
SPE DMAs don't go through the PPE cache, do they? They can't, as they'd saturate the cache and render it useless for the PPE.
 