Will L2 cache on X360's CPU be a major hurdle?

AlgebraicRing said:
Can we get back to the topic at hand? I think the thread starter is asking whether or not it was a design mistake to have a shared cache for the three processors. Particularly I would like to know if there is a way to partition the cache so that each processor can get a specified chunk of cache space. If there is no way to partition the cache, then is there a way to prevent one processor from filling the cache with its pages of memory which then push the pages of memory out which belong to the other two processors?

You can lock portions of the cache. This is a well-known fact.
 
In essence the 7 SPEs will be fighting with the PPE over the 512kb of actual cache.
That doesn't sound right to me at all. The SPEs have their own "cache", which is their local memory.

Edit:
the 256k high speed local memory of each cell is like a programmer controlled cache. There is no cache hardware to keep it synchronized with main memory, it's a scratchpad that the programmer controls. It's better than regular cache.
 
PC-Engine said:
inefficient said:
PC-Engine said:
Correct me if I'm wrong, but doesn't the 512 KB of cache in the PPE in CELL have direct access to main memory, while the LS of the SPEs has to go through the cache to get to main RAM?

"through the cache" you say that as if that were a bad thing :)

And you can just lock the L2 cache if you want to actually just bypass it.

The point I'm making is: how can LS function as cache if it doesn't have a direct connection to main RAM? If it can function as cache, then why does it need to connect to another pool of cache to get to main RAM? AFAIR LS can't bypass the 512 KB and go straight to RAM. Isn't the point of cache to cache data from main RAM so that you don't have to go offchip, since that's a lot slower? If LS doesn't have a direct path to main RAM, then how can it fill its LS and operate as cache? In essence the 7 SPEs will be fighting with the PPE over the 512 KB of actual cache.

It seems to me that each SPE has direct access to RAM through the EIB -- the PPE is no more special than an SPE to the internal bus (EIB). Everything has equal access to the RAM as far as I know... they are all just ports on the EIB "ring". Does it even make sense that the 7 (or 8) SPEs would need to go through the PPE's cache to get anything from RAM?

Maybe you have some actual info; what you said goes against everything I've heard about how the Cell works.
 
cobragt said:
the 256k high speed local memory of each cell is like a programmer controlled cache. There is no cache hardware to keep it synchronized with main memory, it's a scratchpad that the programmer controls. It's better than regular cache.
Not only is that a huge can of worms, but it's not really on topic. The guy asked about the XeCPU, not Cell.


Anyway, I know the CPU can lock L2 cache lines for the GPU to grab data from. I don't know that the CPU can lock the cache in any other circumstance, though many people seem to presume it's true.
 
Acert93 said:
At least that is how I remember each design. (I am sure Shifty can correct me if I am wrong).
Me?! I'm the last person in the world to trust for accurate stats!

Cell's LS provides data to the SPEs at the same rate as data is available from L1 cache. What it doesn't do is fetch data itself - the local store has to be managed. The SPU sends requests for data to be fetched from RAM without getting in the way of any other storage, so in effect you have 7 LS and 1 L2 cache independently accessing RAM, as I understand it.

Looking at XeCPU, we've got 1 MB L2 cache. Between three cores that's ~333 KB each, which isn't bad - more than the SPUs get (though really the figures aren't comparable as they're used differently). Adding another 3 hardware threads takes that down a lot.

However, I don't know if that's really gonna be too bad. Yeah, more cache is great, but the cache can be managed similarly to SPU LS with prefetching commands etc. Also, threads and cores can share data by writing it to the unified cache.

I think it quite possible that a couple of threads lock down, say, 64 KB for 32+32 KB of buffered streamed data, while another thread works on some material stored locally to feed a fourth... three or four threads could probably be quite happily fed on 256 KB, leaving 384 KB each for two major threads (generic processing), which shouldn't be too bad a limiting factor.
 
PC-Engine said:
The point I'm making is that how can LS function as cache if it doesn't have a direct connection to main RAM?
Your point is based on a false premise. Each SPE does have a direct connection to main RAM through its DMA controller.

If it can function as cache then why does it need to connect to another pool of cache to get to main RAM?
You're confused; local store CAN'T function as a cache because it ISN'T a cache. It's just a quite normal piece of random-access memory.

Caches are divided up into what is known as cache lines, each line equipped with a "tag" which tells the cache controller logic which sequence of addresses in memory is stored in that particular cache line. When a read request comes in, the cache controller reads through its tags to see if that address is stored or not and acts accordingly. If it is, the data is delivered straight to the CPU. If it isn't, a request for it is generated and then the CPU has to wait for it to come in. It all works automatically, and the CPU cannot differentiate between cache and main memory. It's completely transparent, and typically there is no way of telling if a piece of data is present in cache or not, as cache isn't addressable as memory; it's a MIRROR of the memory it caches.

As local store is just memory, the program has to decide what is stored in the store and what isn't. This doesn't make it into some sort of "software controlled" cache; it ISN'T CACHE plain and simple. It's just a quarter megabyte SRAM memory, that's it. Think of the SPE as a computer of its own equipped with 256 kB memory attached to its motherboard and an I/O controller to bring data in and out of that memory.

Oh, and by the way, as someone else brought it up: the local store SRAM isn't zero wait-state. No SRAM running at 3.2 GHz is going to be zero wait-state; it isn't physically possible, or at least not with current technology. Even SRAM running at a fraction of that speed has wait-states of a couple of cycles.
 
Guden Oden said:
Oh, and by the way, as someone else brought it up: the local store SRAM isn't zero wait-state. No SRAM running at 3.2 GHz is going to be zero wait-state; it isn't physically possible, or at least not with current technology. Even SRAM running at a fraction of that speed has wait-states of a couple of cycles.

The SPEs have 6 cycles of load-to-use latency (5 "wait-state" cycles).

Cheers
Gubbi
 
I'm very much a layman...

Am I getting this straight?

local store = scratch pad

Cache = scratch pad with labels

?????????
 
Gubbi said:
The SPEs have 6 cycles of load-to-use latency (5 "wait-state" cycles).
It seems local store load/write ops now have a smaller latency (4 cycles), according to Mr Suzuoki's presentation at the Rambus conference.
 
nAo said:
Gubbi said:
The SPEs have 6 cycles of load-to-use latency (5 "wait-state" cycles).
It seems local store load/write ops now have a smaller latency (4 cycles), according to Mr Suzuoki's presentation at the Rambus conference.

Link?

The MPR presentation shows the LS unit of the SPE to provide data after stage FW06.

Cheers
Gubbi
 
LunchBox said:
Cache = scratch pad with labels
No. You cannot deliberately write to cache. The cache is just like an HDD cache or CD cache. The idea is that it stores recently used data from the storage device (RAM, HDD, CD) so that if that information is accessed again, rather than having to obtain it from the slow source, it can be retrieved from the fast cache.

I don't know how effective it is in real terms. I always thought it very good, but Deano's blog, posted here by SanGreal, pointed out investigations showing that accessing the same data again doesn't happen very often, so cache isn't too useful. Managed storage doesn't have that problem but has the faff of the dev having to manage it. XeCPU can prefetch data this way, I believe.
 
Shifty Geezer said:
I don't know how effective it is in real terms. I always thought it very good, but Deano's blog, posted here by SanGreal, pointed out investigations showing that accessing the same data again doesn't happen very often, so cache isn't too useful. Managed storage doesn't have that problem but has the faff of the dev having to manage it. XeCPU can prefetch data this way, I believe.

I believe it was Ars Technica (could be wrong), but I think they hit on the fact that cache is important for a diverse working environment (like using Word, Excel, Outlook, browsing, etc.) but less so for gaming. He used Quake 3 and the old cacheless Celerons to make the point. Quake 3 ran just as well on a Celeron, even though the CPU did T&L. Why? Much of the gaming data was streamed geometry and not reused.

Obviously games are more complex these days, but then again we are talking about closed box systems as well.

i.e. One of those "Desktop PCs and Game Consoles are different animals".

More cache would not hurt (and probably would help) but there is always that tradeoff point. Would 2 cores and 2MB of cache have been better than 3 cores and 1MB of cache? IBM did not think so. Will developers agree? We will find out sooner or later ;)
 
The problem with SPEs can be related to the story of four people examining an elephant in a dark room. They can't figure out what the heck it is, one thinks it's a pillar, one thinks it's a tree, one thinks they feel the top of a roof. Having local memory to one processing unit in a game is essentially forcing you to get those four people to not care about the big picture - they only examine one little piece, do their thing on it, and move on, even if what they conclude is relatively dumb and limited.
 
Embedded Sea said:
Having local memory to one processing unit in a game is essentially forcing you to get those four people to not care about the big picture
Most of the times this could be a good thing (tm) :)
 
I think 1 MB is just too little. BTW, anyone know why they didn't use a higher-clocked dual-processor G5 model, since those go up to 2.7 GHz?
 
The SPEs aren't forced to work in isolation. They can share information. And in a dark room where instead of considering 1 elephant there is a sock, a basketball, an apple and a piano, all four can work on their own without any problem, which is kinda the point of multithreading.
 