Local Store, L1, L2 caches and the Future of Processors: Compare, Contrast?

Sethamin · Jan 25, 2007

ADEX said:
Why does everyone see the need for a cache?

Because memory access is very, very, very slow compared to processor speeds, and has historically been falling further behind every year in that regard. The fact that you have LS at all is because main memory access is so slow - or, to phrase it another way, if main memory were blazingly fast, then LS would probably not be necessary at all. Even if LS alone is sufficient today, it's likely that it won't be as you scale up the number of SPEs.

Personally I doubt we'll see a traditional cache grafted anywhere onto the SPE subsystem because it would not scale at all with more cores. SPEs working on wildly different things would constantly evict each other's data and cause massive cache thrashing. You could make the cache very, very large, but that requires a lot of power and transistors and trades off latency. This is all in addition to the extra traffic on the main memory bus to keep all the data coherent, which I'm not certain is all that important for SPEs anyway.
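To put a rough number on the thrashing argument, here is a minimal sketch of one LRU cache shared by several cores that each loop over their own private data. It is a toy model, not a description of any real Cell cache: the function name, the line counts and the strictly interleaved access order are all assumptions chosen for illustration.

```python
from collections import OrderedDict

def hit_rate(num_cores, working_set, cache_lines, passes=4):
    """Hit rate of one shared LRU cache (toy model, hypothetical sizes).

    Every core loops over its own private working set, and the cores'
    accesses are interleaved one line at a time.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    accesses = 0
    for _ in range(passes):
        for i in range(working_set):
            for core in range(num_cores):
                line = (core, i)  # private data: no line is shared between cores
                accesses += 1
                if line in cache:
                    hits += 1
                    cache.move_to_end(line)  # mark as most recently used
                else:
                    cache[line] = True
                    if len(cache) > cache_lines:
                        cache.popitem(last=False)  # evict the least recently used line
    return hits / accesses

alone = hit_rate(num_cores=1, working_set=64, cache_lines=256)
shared = hit_rate(num_cores=8, working_set=64, cache_lines=256)
print(f"one core alone: {alone:.2f}, eight cores sharing: {shared:.2f}")
```

In this model a single core misses only on its first pass, while eight cores whose combined working set is twice the cache size evict each other's lines before they are ever reused.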

If anything I think we'll see a sort of "L2 LS" or "shared LS" between all SPEs that can be DMA'd into and out of independent of the other SPEs and that could be used to cut down on the traffic of data to and from main memory. It might be useful as a buffer for data that needs to be processed by multiple cores, or just as a preloading area for data that might be needed in the very near future but not immediately.

Crossbar · Jan 25, 2007

Sethamin said:
Because memory access is very, very, very slow compared to processor speeds, and has historically been falling further behind every year in that regard. The fact that you have LS at all is because main memory access is so slow - or, to phrase it another way, if main memory were blazingly fast, then LS would probably not be necessary at all. Even if LS alone is sufficient today, it's likely that it won't be as you scale up the number of SPEs.

Why do you think the LS will not be sufficient as you scale the number of SPEs?

Sethamin said:
Personally I doubt we'll see a traditional cache grafted anywhere onto the SPE subsystem because it would not scale at all with more cores. SPEs working on wildly different things would constantly evict each other's data and cause massive cache thrashing. You could make the cache very, very large, but that requires a lot of power and transistors and trades off latency. This is all in addition to the extra traffic on the main memory bus to keep all the data coherent, which I'm not certain is all that important for SPEs anyway.

The SPEs will in a lot of cases be working with streamed data, and you do not want to cache that kind of data. Maybe you restrict the cache to only carry program code, which would increase the speed when swapping in a new program in an SPE. That would have some benefits:

Latency, you get the SPE program in place faster so it can start executing faster.
You reduce the bandwidth consumption to main memory, leaving more bandwidth for data fetches/writes.
You may have a much higher bandwidth between the LS and the cache, than what the DMA transfer from main memory can offer, which also would reduce the time to get the program in place.

When some SPE programs grow and shrink in number of executing instances over time, such a cache may be quite useful, and increase the efficiency.

Sethamin said:
If anything I think we'll see a sort of "L2 LS" or "shared LS" between all SPEs that can be DMA'd into and out of independent of the other SPEs and that could be used to cut down on the traffic of data to and from main memory. It might be useful as a buffer for data that needs to be processed by multiple cores, or just as a preloading area for data that might be needed in the very near future but not immediately.

This is a scenario that I find highly unlikely, because it would require a significant change in the program paradigm. We have already heard about the complaints of the painful Cell. Adding dependency between the SPEs in the way you suggest through some explicitly addressable common LS, would be more or less madness IMHO. An anonymous cache would be a better choice as I see it.

Shifty Geezer · Jan 25, 2007

Sethamin said:
Because memory access is very, very, very slow compared to processor speeds, and has historically been falling further behind every year in that regard. The fact that you have LS at all is because main memory access is so slow - or, to phrase it another way, if main memory were blazingly fast, then LS would probably not be necessary at all. Even if LS alone is sufficient today, it's likely that it won't be as you scale up the number of SPEs.

RAM is slow in terms of access times. If you can prefetch ahead of time, those latencies can be nullified. That's the principle of the LS. The only time a cache will be useful is if you want to access RAM without a suitable prefetch time before access, and end up waiting on RAM. Then a cache may have the data ready for you, saving a wait of a few hundred cycles. For the tasks SPEs are primarily doing, a cache won't help. And for the cost, I don't think adding better random-access memory buffering is worth it. Add better caches to the PPEs and use them for the random-access data processing, and keep SPEs for the structured memory accessing and its managed prefetches.
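The prefetch principle can be sketched as a simple cycle count. This is only a back-of-the-envelope model under stated assumptions: one fetch in flight at a time, a fixed DMA latency per block, and made-up numbers (500 cycles of latency, 2000 cycles of compute per block). The function names are hypothetical.

```python
def cycles_blocking(blocks, dma_latency, compute):
    """Fetch a block, wait for it to arrive, process it, then repeat."""
    return blocks * (dma_latency + compute)

def cycles_double_buffered(blocks, dma_latency, compute):
    """Start fetching block n+1 into a second buffer when work on block n begins."""
    if blocks == 0:
        return 0
    # Only the first fetch is fully exposed. After that, each step costs
    # whichever is longer: processing a block or fetching the next one.
    return dma_latency + (blocks - 1) * max(dma_latency, compute) + compute

blocking = cycles_blocking(1000, dma_latency=500, compute=2000)
overlapped = cycles_double_buffered(1000, dma_latency=500, compute=2000)
print(blocking, overlapped)
```

As long as the compute time per block is at least as long as the fetch, the memory latency shows up once at the start and then disappears behind the work.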

Panajev2001a · Jan 25, 2007

Shifty Geezer said:
RAM is slow in terms of access times. If you can prefetch ahead of time, those latencies can be nullified. That's the principle of the LS. The only time a cache will be useful is if you want to access RAM without a suitable prefetch time before access, and end up waiting on RAM. Then a cache may have the data ready for you, saving a wait of a few hundred cycles. For the tasks SPEs are primarily doing, a cache won't help. And for the cost, I don't think adding better random-access memory buffering is worth it. Add better caches to the PPEs and use them for the random-access data processing, and keep SPEs for the structured memory accessing and its managed prefetches.

If the cost of the shared cache is reasonable, I do think that the added flexibility (executing random access-focused code) and independence (from the PPE, which is something nice in and of itself as it makes them less bound to PPE's performance) would be well received by developers (always depending on exactly what they would not receive because of this, it would surely change their thought if it were to be something important).

Sethamin · Jan 25, 2007

Crossbar said:
Why do you think the LS will not be sufficient as you scale the number of SPEs?

As you add more and more SPEs, you need more and more main memory bandwidth to move data into and out of their LS. Having a bit of closer storage may help a little if there's any redundant data going across the main memory bus.
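As a rough illustration of the scaling concern, the sketch below divides a fixed main-memory bandwidth between a growing number of SPEs. The 25.6 GB/s input is the commonly quoted peak XDR figure for the PS3's Cell and is used here only as a sample value; the "redundant fraction" served from closer shared storage is purely hypothetical, as are the function names.

```python
def per_spe_bandwidth(total_gb_per_s, num_spes):
    """Main-memory bandwidth available to each SPE if all are streaming at once."""
    return total_gb_per_s / num_spes

def with_shared_buffer(total_gb_per_s, num_spes, redundant_fraction):
    """Data rate each SPE can consume when a fraction of its traffic is
    served from closer shared storage instead of main memory."""
    if not 0.0 <= redundant_fraction < 1.0:
        raise ValueError("redundant_fraction must be in [0, 1)")
    return total_gb_per_s / (num_spes * (1.0 - redundant_fraction))

for n in (8, 64, 128):
    print(n, per_spe_bandwidth(25.6, n), with_shared_buffer(25.6, n, 0.5))
```

With the bus held constant, going from 8 to 64 SPEs cuts each SPE's share by a factor of eight, and a shared buffer only helps in proportion to how much of the traffic really is redundant.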

Crossbar said:
The SPEs will in a lot of cases be working with streamed data, and you do not want to cache that kind of data.

I agree. But while you can easily imagine processing streamed data using 8 SPEs, what about 64? Or 128? At some point you reach an upper bound on how big a "pipeline" of SPEs you can string together, as there is a finite limit on the number of steps you can split a task into. Also, streamed data is great for some applications (say, image processing or video decoding), but if you're concerned about latency then, again, there's an upper bound, as each SPE you string together is going to add a bit of depth to the "pipeline" and a bit more latency. It just depends on what use scenario they want to target.

Crossbar said:
Maybe you restrict the cache to only carry program code, which would increase the speed when swapping in a new program in an SPE. That would have some benefits:

Latency, you get the SPE program in place faster so it can start executing faster.

You reduce the bandwidth consumption to main memory, leaving more bandwidth for data fetches/writes.

You may have a much higher bandwidth between the LS and the cache, than what the DMA transfer from main memory can offer, which also would reduce the time to get the program in place.

When some SPE programs grow and shrink in number of executing instances over time, such a cache may be quite useful, and increase the efficiency.

Agreed, restricting the cache to code segments doesn't seem like a bad idea. But then again, how often will you be moving code on and off an SPE? Seems likely that once a program is loaded onto an SPE it will stay there for a long time. Preemptive multitasking seems unlikely, since you'd have to store the entire LS to accurately save the state of the running program. So I question how much performance this sort of cache would actually gain.

Crossbar said:
This is a scenario that I find highly unlikely, because it would require a significant change in the program paradigm. We have already heard about the complaints of the painful Cell. Adding dependency between the SPEs in the way you suggest through some explicitly addressable common LS, would be more or less madness IMHO. An anonymous cache would be a better choice as I see it.

I don't see any change in the paradigm. Developers would be free to ignore the "L2 LS" if desired and nothing would be different. Think of it more as an SPE with only LS and no execution units. It's just a bit of storage closer to the SPEs that can be used for extra buffering space.

Shifty Geezer said:
RAM is slow in terms of access times. If you can prefetch ahead of time, those latencies can be nullified. That's the principle of the LS. The only time a cache will be useful is if you want to access RAM without a suitable prefetch time before access, and end up waiting on RAM. Then a cache may have the data ready for you, saving a wait of a few hundred cycles. For the tasks SPEs are primarily doing, a cache won't help. And for the cost, I don't think adding better random-access memory buffering is worth it. Add better caches to the PPEs and use them for the random-access data processing, and keep SPEs for the structured memory accessing and its managed prefetches.

Um, so we agree - a cache doesn't make sense. I was just advocating an extra bit of shared LS that all SPEs could share if they needed some extra space.

Panajev2001a said:
If the cost of the shared cache is reasonable, I do think that the added flexibility (executing random access-focused code) and independence (from the PPE, which is something nice in and of itself as it makes them less bound to PPE's performance) would be well received by developers (always depending on exactly what they would not receive because of this, it would surely change their thought if it were to be something important).

SPEs are not made for branchy, random-access code. They don't have dynamic hardware prediction and they are far from main memory. I don't think trying to shoehorn them into something they weren't designed for is the best idea. Hence, no cache.

Shifty Geezer · Jan 25, 2007

Panajev2001a said:
If the cost of the shared cache is reasonable, I do think that the added flexibility (executing random access-focused code) and independence (from the PPE, which is something nice in and of itself as it makes them less bound to PPE's performance) would be well received by developers (always depending on exactly what they would not receive because of this, it would surely change their thought if it were to be something important).

You know, I think perhaps better would be another class of SPE. You'd have PPE, SPE+LS, and SPE+cache+LS or somesuch as an optimized random-access processor. Nope, still can't see much point! Just shove another PPE in there! The moment you hit random access, the SPE's performance advantage is totally lost.

inefficient · Jan 26, 2007

Shifty Geezer said:
SPE DMA's don't go through PPE cache, do they? They can't do, as they'd saturate the cache and render it useless for the PPE.

I was pretty sure DMAs don't go through the PPE cache either.

If you look at the diagram of the interface in IBM's documents, the L2 cache sits between the PPU and the EIB bus, not between the EIB bus and the XDR memory interface.

patsu · Jan 26, 2007

Sethamin said:
SPEs [are] not made for branchy, random-access code. They don't have dynamic hardware prediction and they are far from main memory. I don't think trying to shoehorn them into something they weren't designed for is the best idea. Hence, no cache.

I agree with this principle the most... except that I hope there is a way to overcome the LS size limitation.

Throwing in another PPE will help somewhat (for branchy code), but it's actually solving a different problem.

Panajev2001a · Jan 26, 2007

Shifty Geezer said:
You know, I think perhaps better would be another class of SPE. You'd have PPE, SPE+LS, and SPE+cache+LS or somesuch as an optimized random-access processor. Nope, still can't see much point! Just shove another PPE in there! The moment you hit random access, the SPE's performance advantage is totally lost.

We might say that adding a shared L1 for the SPEs is a waste of silicon, and that is an argument we might agree with (depending on what we sacrifice if we do end up cutting things back).

Saying that we should add more PPEs (which are fairly HUGE and power-hungry as it is and would take up a lot of space; rather than changing the actual PPE-to-SPE ratio, it would be better to optimize the PPEs for both power and performance [somehow developers are publicly, on forums, suggesting that something better than PPE/PPX could be realized]) or optimize random-access processors misses the point.

If the SPEs were nothing but glorified vector co-processors for the PPE it would be one thing, but they are not: we value their independence and the amount of code they can run fast (the more time developers spend writing SPE code, the more they seem to enjoy the speed at which these buggers execute what you throw at them).

Would it improve their performance if we could accelerate small, cacheable memory transactions, and thus general-purpose code, without changing the way developers access the SPU's register file and the SPU's Local Storage?

Surely, even with a good LS-L1 cache-EIB hierarchy, performance with general-purpose code might not be as fast as, say, a Core 2 Duo (and the worst-case scenarios might even produce an increased performance hit, as in the case of a cache miss we would have some cycles thrown away), but let's not over-estimate the PPE's performance: despite its huge size and power consumption (compared to the SPEs), it does not seem that many developers are THAT happy about the kind of code that runs quickly on it versus the code that is faster or just as fast on the SPEs (basing this on things publicly said on forums such as this one, and IIRC something was said in this forum indeed).

If for a reasonable price we could buy a few percentage points of performance in general purpose processing for SPE's we could render them even more independent from the PPE's and faster over-all (especially when the developer was not able to optimize an SPU application well enough and left in some scalar/random-access happy code that constitutes a bottleneck in performance critical areas of the SPU application) not to mention better adaptable to all kinds of processing (which still takes advantage of the fact that on a CELL processor we have MANY SPE's and they are all pretty FAST even though they might be not as optimal as someone could want them to be in all kinds of code) it might be a win.

Evidently this optional L1 cache was not implemented for the PLAYSTATION 3 CELL Broadband Engine, but the game is not closed yet for PLAYSTATION 4.

Crossbar · Jan 26, 2007

Sethamin said:
As you add more and more SPEs, you need more and more main memory bandwidth to move data into and out of their LS. Having a bit of closer storage may help a little if there's any redundant data going across the main memory bus.

Yeah, the bandwidth to main memory must scale as well if we increase the number of SPEs.

Sethamin said:
I agree. But while you can easily imagine processing streamed data using 8 SPEs, what about 64? Or 128? At some point you reach an upper bound on how big a "pipeline" of SPEs you can string together, as there is a finite limit on the number of steps you can split a task into. Also, streamed data is great for some applications (say, image processing or video decoding), but if you're concerned about latency then, again, there's an upper bound, as each SPE you string together is going to add a bit of depth to the "pipeline" and a bit more latency. It just depends on what use scenario they want to target.

Streaming data over several SPEs in series is very nice, but I was not thinking of that in particular when I referred to streamed data. I am not concerned about latency for streamed data; hiding it is the whole point of double buffering and DMA transfers.

Sethamin said:
Agreed, restricting the cache to code segments doesn't seem like a bad idea. But then again, how often will you be moving code on and off an SPE? Seems likely that once a program is loaded onto an SPE it will stay there for a long time. Preemptive multitasking seems unlikely, since you'd have to store the entire LS to accurately save the state of the running program. So I question how much performance this sort of cache would actually gain.

Yeah, pre-emptive multitasking at the SPE level is really bad; you should not do that. And yeah, you are right that you should not load programs that often if you do it the right way. DeanoC mentioned a minimum size of 100,000 cycles for each job, which gives 32,000 jobs/s per SPE, and you are using a number of SPEs as well. The point is that if you could reduce the penalty for stalling an SPE while swapping in a new program, by using some kind of cache, you could reduce that minimum job size, have a finer granularity for the jobs you assign to the SPEs, and thereby achieve higher efficiency.
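The job-size arithmetic can be worked through in a few lines. The 3.2 GHz clock is what the 32,000 jobs/s figure implies for 100,000-cycle jobs; the swap penalties (10,000 and 2,500 cycles) and the 90% efficiency target are hypothetical values picked only to show the trend, and the function names are made up for this sketch.

```python
CLOCK_HZ = 3_200_000_000  # 3.2 GHz: 100,000-cycle jobs then give 32,000 jobs/s

def jobs_per_second(job_cycles, clock_hz=CLOCK_HZ):
    """Jobs one SPE can run per second, ignoring any swap penalty."""
    return clock_hz // job_cycles

def efficiency(job_cycles, swap_cycles):
    """Fraction of time spent on useful work when each job pays a swap penalty."""
    return job_cycles / (job_cycles + swap_cycles)

def min_job_cycles(swap_cycles, target_efficiency):
    """Smallest job size that still reaches the target efficiency."""
    return swap_cycles * target_efficiency / (1.0 - target_efficiency)

print(jobs_per_second(100_000))
print(min_job_cycles(10_000, 0.9))  # hypothetical swap penalty without a code cache
print(min_job_cycles(2_500, 0.9))   # hypothetical swap penalty with a code cache
```

In this model the minimum worthwhile job size scales linearly with the swap penalty, so cutting the penalty to a quarter allows jobs a quarter of the size at the same efficiency.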

Sethamin said:
I don't see any change in the paradigm. Developers would be free to ignore the "L2 LS" if desired and nothing would be different. Think of it more as an SPE with only LS and no execution units. It's just a bit of storage closer to the SPEs that can be used for extra buffering space.

Let me see, what does a common addressable LS mean? For that to be useful you need to introduce some memory allocation mechanisms and some synchronisation mechanisms, like semaphores, critical sections or whatever you want to call them, specifically for that piece of memory. These are mechanisms that you likely already have at the PPE level, and now you would introduce another set of them at the SPE level. If that isn't a change of programming paradigm, I don't know what is.
Not to mention that by using those mechanisms you basically nullify all the strengths of the SPE, which is to run code and process data in its little LS bubble at break-neck speed, independent of the world outside of the SPE.
It would also effectively prevent SPEs from running in different protected memory spaces.

Sethamin said:
SPEs are not made for branchy, random-access code. They don't have dynamic hardware prediction and they are far from main memory.

I don't really see the connection between branching and cache from the SPE's point of view.

3dilettante · Jan 26, 2007

There is another maybe possible way of implementing a different level of LS without breaking compatibility, but it would be tricky for unwary programmers.

Since the address space is so much larger than the current amount of LS, a future version could implicitly define the range that corresponds to the current LS as one range of addresses, while the new level of LS is another range. Old code will only be addressing the original range, while newer code could be made aware of the additional sections.

One block of addresses would then be used for low-latency operation, while the other range would correspond to a larger but slower L2 LS.
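A minimal sketch of that address split might look like the following. The 256 KB figure is the size of the current LS; the 2 MB second level, the region names and the `decode` helper are all hypothetical and exist only to show how old and new code would see the same low range.

```python
FAST_LS_BYTES = 256 * 1024       # the range that matches today's LS
SLOW_LS_BYTES = 2 * 1024 * 1024  # hypothetical size of a second-level LS

def decode(address):
    """Return (region, offset) for an LS address in the extended layout."""
    if address < 0 or address >= FAST_LS_BYTES + SLOW_LS_BYTES:
        raise ValueError(f"address {address:#x} is outside both LS ranges")
    if address < FAST_LS_BYTES:
        return ("fast", address)  # old code only ever lands here
    return ("slow", address - FAST_LS_BYTES)  # only newer code addresses this range

print(decode(0x3FFFF))
print(decode(0x40000))
```

Old binaries never generate an address at or above 0x40000, so they would behave exactly as before, while the trap for unwary programmers is that two adjacent addresses can have very different latencies.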

It reminds me of the extended/expanded/upper memory hassle of the DOS days, though.

ADEX · Jan 27, 2007

inefficient said:
I was pretty sure DMAs don't go through the PPE cache either.

If you look at the diagram of the interface in IBM's documents, the L2 cache sits between the PPU and the EIB bus, not between the EIB bus and the XDR memory interface.

They don't go "through the L2" as such, but they do have to be checked against its contents. If the data is present there, it will be read from there. You are advised against this, however, as bandwidth from XDR RAM is higher.

Why else did you think the L2 latency is so high?

If this wasn't done, it wouldn't be coherent. I don't know how it's done, but coherence can be switched off, cutting latency to RAM by about a third.

Crossbar said:
Let me see, what does a common addressable LS mean?

All the LSs are commonly addressable; SPEs can all read and write to each other's LSs. The only time this isn't allowed is when the security mechanism is activated.

Crossbar · Jan 27, 2007

ADEX said:
All the LSs are commonly addressable, SPEs can all read and write to each others LSs. The only time this isn't allowed is when the security mechanism is activated.

Sure, but that is through DMA transfers.

inefficient · Jan 27, 2007

"The SPU accesses its LS with load and store instructions, and it performs no address translation for such accesses. Privileged software on the PPE can assign effective-address aliases to an LS. This enables the PPE and other SPEs to access the LS in the main-storage domain. The PPE performs such accesses with load and store instructions, without the need for DMA transfers. However other SPEs must use DMA transfers to access the LS in the main-storage domain."

So:
PPE <-> XDR: load/store.
PPE <-> LS : load/store or DMA.
SPU <-> LS : load/store
SPU <-> Another SPU's LS : DMA
SPU <-> XDR: DMA
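The same summary can be written as a small lookup table, which makes the asymmetry easy to query. This is just a restatement of the list above in code; the key names ("own LS", "other LS") and the `needs_dma` helper are labels invented for this sketch.

```python
# Access methods between initiators and storage, as summarised in the list above.
ACCESS = {
    ("PPE", "XDR"): {"load/store"},
    ("PPE", "LS"): {"load/store", "DMA"},
    ("SPU", "own LS"): {"load/store"},
    ("SPU", "other LS"): {"DMA"},
    ("SPU", "XDR"): {"DMA"},
}

def needs_dma(initiator, target):
    """True when DMA is the only way for the initiator to reach the target."""
    return ACCESS[(initiator, target)] == {"DMA"}

for (initiator, target), methods in ACCESS.items():
    print(f"{initiator} <-> {target}: {' or '.join(sorted(methods))}")
```

The only storage an SPU can touch with plain loads and stores is its own LS; everything else, including another SPU's LS, has to go through DMA.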

deathkiller · Jan 27, 2007

I don't think that SPUs are bad for random-access problems; they are only bad if the problem is not parallel and you can only work on it sequentially. If you can do lots of random DMA loads from main memory at the same time to solve the problem, you shift from being latency limited to being bandwidth limited or computation limited.

Cache won't help you with pure random access, I think.

A bigger Local Store is needed as latency increases, because you need a bigger buffer. The cost of adding a cache system to the SPUs would be too big for the performance increase in the target workload, and remember that SPUs accessing other SPUs' LS have a very high latency.
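Both points, overlapping many random loads and needing a bigger buffer as latency grows, can be sketched with a simple model. All numbers are assumptions for illustration: 128-byte loads, 500 cycles of latency, and 8 bytes per cycle of bandwidth. The function names are hypothetical, and the model ignores queueing effects and any hardware limit on outstanding transfers.

```python
import math

def transfer_cycles(loads, bytes_each, latency, bytes_per_cycle, in_flight):
    """Cycles to complete independent random loads with `in_flight` overlapped."""
    latency_bound = math.ceil(loads / in_flight) * latency
    bandwidth_bound = loads * bytes_each / bytes_per_cycle
    return max(latency_bound, bandwidth_bound)

def buffer_bytes_needed(latency, bytes_per_cycle):
    """Bytes that must be in flight to keep the memory bus busy (latency x bandwidth)."""
    return latency * bytes_per_cycle

for k in (1, 16, 64):
    print(k, transfer_cycles(1024, 128, latency=500, bytes_per_cycle=8, in_flight=k))
print(buffer_bytes_needed(500, 8), buffer_bytes_needed(1000, 8))
```

With one load at a time the total is pure latency; with enough loads in flight the latency term drops below the bandwidth term and the bus becomes the limit. The second function shows why doubling the latency doubles the buffer space needed to hide it.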

Deleted member 7537 · Feb 1, 2007

Now that you are talking about this, is 1MB of L2 Cache enough for the 3 cores (using the 2 threads) in the 360?

ATI-liens · Feb 3, 2007

3dilettante said:
If 65nm is the only transition Cell goes through, and it's only going in the PS3, that would be interesting.

It might be that the current overall design for Cell was really targeted for 65nm all along.

They are going to change things eventually, either for another transition, or to fit Cell into markets not served by a one size fits all solution.

The X360 is also going 65nm and I wouldn't be surprised if the Wii jumped on the bandwagon.

It doesn't mean much at all for gaming.

Maybe quieter consoles, especially for the X360, and less power consumption for the PS3; maybe they can tone down the PSU to 320 Watts with 65nm.

If the Wii goes 65nm it won't overheat in standby.

I'm assuming there will be the same number of transistors on the CPUs.

Games will stay the same; it won't be legal for Sony to take advantage of some PS3s and not others unless they advertise it as (PS3.2), which is not likely.
