Will L2 cache on X360's CPU be a major hurdle?

As I understand it, as shown in the Suzuoki PDF on Deano's site, the reason why LS was chosen over cache is because the access patterns observed for modern "hot-spots" (to borrow Suzuoki's terminology) go something like this:

Directly taken from slide 15 said:
Instruction is small and reused
- Loop Intensive
- Good news for cache system

Data is large and not to be revisited again
- Sometimes larger than L2 cache
- Bad news for cache system

I'm a computer architecture guy (actually, more of a semiconductor process guy), but I can easily imagine that most gaming code requires a small amount of instruction code to handle your-favorite-example-algorithm (collision, procedural synthesis, whatever), and vast amounts of data that constitute the objects in the scene/game world.

In that case, it would make sense to manually load the instructions (which are heavily accessed, and small), and then go to main mem for the data, doing away with the cache altogether, since a cache would be worthless for large, one-use data (for example, examining every object in the world once per frame).
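A minimal sketch of that streaming pattern, assuming the Cell SDK's MFC intrinsics from spu_mfcio.h; the chunk size and process_chunk() are made up for illustration:

```c
/* Minimal SPU-side streaming sketch (would build with spu-gcc against the
 * Cell SDK). The whole program lives in LS; the large data set is pulled
 * through one small buffer, touched once, and overwritten on the next pass. */
#include <spu_mfcio.h>

#define CHUNK 16384                                 /* 16KB per DMA transfer */
static volatile char buf[CHUNK] __attribute__((aligned(128)));

extern void process_chunk(volatile char *p, unsigned n); /* your-favorite-algorithm */

void stream_all(unsigned long long ea, unsigned long long bytes)
{
    const unsigned tag = 0;
    for (unsigned long long off = 0; off < bytes; off += CHUNK) {
        mfc_get(buf, ea + off, CHUNK, tag, 0, 0);   /* DMA: main mem -> LS  */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                  /* block until it lands */
        process_chunk(buf, CHUNK);                  /* use the data once... */
    }                                               /* ...then overwrite it */
}
```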

Several reasons for this:

- As mentioned before, cache is more expensive (tags, logic, the works). A 256KB cache will be bigger in die size than a 256KB LS.

- Latency. This hurts in two ways. Not only is cache slower than an equivalent-sized local store (because of logic overhead due to n-way associativity) when it hits, but it is also slower when it misses - the mem request goes to the cache, the cache misses, then it goes to main mem. The SPE, which has a programmer-controlled LS, would simply know what is in LS and what is not, and would just read main mem. Now, the 10-cycle lag of a cache is insignificant next to the 100-1000 cycle lag of memory, so this is not such a big deal. But the point remains. (See the worked example after this list.)

And let's not forget the I/O controller! That was the best-case scenario, with a perfect I/O controller that makes sure waiting read/write requests from the cache are executed in an optimal way that prevents additional waiting. This is usually not true, so you have additional latency from keeping the cache and memory coherent.

- When it comes to cache, bigger != better. The Pentium 4 is a case in point. There was a big hoopla about the Pentium 4 L2 cache going from 1MB to 2MB. The problem is, the 2MB P4 is actually slower in most scenarios. Going from 1MB to 2MB meant a few more cache hits, but each hit was slower, since the cache was bigger, so overall performance was actually lower. The point of cache is to be fast, and to be fast, it must be small. As mentioned before, the A64 does quite well with a 512KB L2.
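To put rough numbers on the latency bullet above, here is a back-of-the-envelope average-memory-access-time (AMAT) comparison in C. Every cycle count in it is an illustrative guess, not a measured CELL figure:

```c
/* Back-of-the-envelope AMAT comparison. All cycle counts are illustrative
 * guesses, not measured figures. */
#include <stdio.h>

int main(void)
{
    double hit_cache = 10.0, hit_ls = 6.0;  /* access latency, cycles        */
    double mem       = 500.0;               /* main memory latency, cycles   */
    double miss_rate = 0.05;                /* 5% of accesses miss the cache */

    /* Cache: every access pays the lookup; misses additionally pay memory. */
    double amat_cache = hit_cache + miss_rate * mem;

    /* LS: the program knows what is resident, so the "miss" is an explicit
     * DMA issued up front (whose cost can be overlapped with compute). */
    double amat_ls = hit_ls;

    printf("cache: %.1f cycles/access, LS: %.1f cycles/access\n",
           amat_cache, amat_ls);            /* -> 35.0 vs 6.0 */
    return 0;
}
```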

In summary, a cache is only superior to a LS when it can ensure significantly more hits than a LS. Which, thankfully, is most of the time. But for some scenarios (the one in the Suzuoki PDF), the LS is superior. It is questionable whether the scenarios used to design CELL are characteristic of computer game loads, but I for one don't doubt that this model has at least some merit.
 
Thanks! Your explanation is clear and concise. 8)
...although I think we have gone very far off-topic. *Scratch head*
heh heh.
 
DeanoC said:
I'm not a hardware guy; I always assumed that LS is a lot simpler/faster than equivalent amounts of cache. If it is, then maybe it makes sense to have more manual LS rather than less automatic cache; however, if it's not, then I'm lost...

There is some overhead, but for the applications you're likely to run on something like an SPE you can get away with fairly large line sizes (256-512B), which will greatly minimize the overheads.

Even with 64B line sizes you are looking at roughly 20 bits of overhead per line, which is under 5%. With 256B line sizes, you would be at roughly 1% overhead.
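The arithmetic behind those percentages, as a quick C sanity check; the 32-bit physical address and 4-way associativity are illustrative assumptions, not CELL specifics:

```c
/* Tag-overhead arithmetic for the numbers above: ~20 bits of tag+state per
 * line. Assumes a 32-bit physical address and a 256KB 4-way cache, both
 * illustrative choices. Build with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int cache_bytes = 256 * 1024, ways = 4;
    const int state = 4;                     /* valid/dirty/LRU-ish bits */

    for (int line = 64; line <= 256; line *= 4) {
        int sets   = cache_bytes / (line * ways);
        int offset = (int)log2(line);
        int index  = (int)log2(sets);
        int tag    = 32 - offset - index;
        printf("%3dB lines: %2d bits/line -> %.1f%% overhead\n",
               line, tag + state, 100.0 * (tag + state) / (line * 8));
    }
    return 0;   /* prints ~3.9% for 64B lines and ~1.0% for 256B lines */
}
```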

Aaron Spink
speaking for myself inc.
 
psurge said:
aaronspink - would it be possible to have a combination of the two, i.e. a cache that still allows the programmer to explicitly submit DMA requests (possibly locking/unlocking cache lines so that certain lines are not evicted unless you give the OK)?

Yes. This has been done on a variety of embedded designs.

When you say that a cache has small overhead compared to the local store, does this include tags + logic that stalls the execution pipeline on a cache miss? What about cache coherency?

Yes. It depends on how much coherency you require.

Aaron Spink
speaking for myself inc.
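For the curious, the lock/unlock hybrid psurge asked about might look roughly like this. Every function here is hypothetical - the no-op stubs just name the operations; they are not any real embedded API:

```c
/* Hypothetical sketch of a cache with line locking plus explicit DMA.
 * None of these functions exist as a real API; the stubs only name the
 * operations such a design would expose. */
#include <stdint.h>

static void cache_lock_range(void *p, uint32_t n)      { (void)p; (void)n; } /* pin lines   */
static void cache_unlock_range(void *p, uint32_t n)    { (void)p; (void)n; } /* unpin lines */
static void dma_fill(void *d, uint64_t ea, uint32_t n) { (void)d; (void)ea; (void)n; } /* prefetch */

void process_hot_table(uint64_t table_ea, char *scratch, uint32_t len)
{
    dma_fill(scratch, table_ea, len);   /* bring the hot data in up front...  */
    cache_lock_range(scratch, len);     /* ...and guarantee it stays resident */

    /* hot loop: every access to scratch[0..len) now hits, deterministically */

    cache_unlock_range(scratch, len);   /* lines become evictable again */
}
```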
 
nAo said:
A local store takes fewer transistors, reduces load/store latency, and even if it doesn't cost dramatically fewer transistors than a cache, it makes the design of an in-order CPU much simpler.

Doesn't really make the design any simpler. In-order CPUs have been designed and built with caches since the mid-'80s. It's a known design.


In the end it's a matter of hw design choices; there's nothing in the CELL architecture that works against having an SPE + local cache, but if STI had made that choice, we'd now have a CELL processor with fewer than 8 SPEs and with a higher load/store latency.

The load/store latency wouldn't have been affected. The additional logic overhead for the tags is more than overcome in the multi-cycle access time for the data array at the frequency the SPE runs at.


Unoptimized code would run faster than on the current SPE design; optimized code would run slower or would require even more work to hide the higher memory access latencies.

Unoptimized code would run faster, and optimized code would likely run faster due to the easier programming and memory environment. The memory latencies would be less with a cache-based design because you don't have to wait for the DMA engine to finish before using the data.

Aaron Spink
speaking for myself inc.
 
patsu said:
What would people call out in Rambus's Cell presentation (in particular slides 12, 15, 31 and 34)? It implies why STI went for 8 SPEs, and has some data to support their choice of LS instead of cache. The engineers seem to think that modern-day instruction + data hotspots are typically up to 128K (hence they sized the LS at 256K). Did I understand it correctly?

The data they are presenting assumes that the working set is a nice linear region and that it is perfectly predictable ahead of runtime. It is not uncommon to have a smallish working set drawn from a much larger region of memory. Nor is it uncommon for the region of the working set to be somewhat runtime dependent.
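A concrete instance of that runtime-dependent case, for illustration - a plain C linked-list walk. The touched data is small, but which addresses it lives at is only discovered while chasing pointers, so there is nothing to DMA ahead of time, whereas a cache picks the nodes up automatically:

```c
/* A small working set scattered through a large region, discoverable only
 * at runtime: each node's address comes out of the previous node. */
#include <stddef.h>

struct node {
    struct node *next;   /* data-dependent: unknowable before the walk */
    int payload;
};

int sum_list(const struct node *n)
{
    int sum = 0;
    while (n != NULL) {
        sum += n->payload;
        n = n->next;     /* the next address is only known now */
    }
    return sum;
}
```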

The problem is there are lies, damn lies, and statistics.
What you don't specialize in is what kills you.

Aaron Spink
speaking for myself inc.
 
nondescript said:
In that case, it would make sense to manually load the instructions (which are heavily accessed, and small), and then go to main mem for the data, doing away with the cache altogether, since a cache would be worthless for large, one-use data (for example, examining every object in the world once per frame).

A local instruction store can make sense and be very useful. There are a variety of architectures that have employed local instruction stores for this reason.


- As mentioned before, cache is more expensive (tags, logic, the works). A 256KB cache will be bigger in die size than a 256KB LS.
The size difference should be minimal. Realistic overheads are in the range of 1-2%.

- Latency. This hurts in two ways. Not only is cache slower than an equivalent-sized local store (because of logic overhead due to n-way associativity) when it hits, but it is also slower when it misses - the mem request goes to the cache, the cache misses, then it goes to main mem.
The access speed should be roughly the same since for the size and speed of the structures we are talking about, the primary delay is the data access. Even with associativity, the data path overhead will be a mux. The LS has the same issues with DMA latencies.

The SPE, which has a programmer-controlled LS, would simply know what is in LS and what is not, and would just read main mem. Now, the 10-cycle lag of a cache is insignificant next to the 100-1000 cycle lag of memory, so this is not such a big deal. But the point remains.

Can the SPEs even directly read main memory? You are neglecting the DMA overheads and latencies as well.


And let's not forget the I/O controller! That was the best-case scenario, with a perfect I/O controller that makes sure waiting read/write requests from the cache are executed in an optimal way that prevents additional waiting. This is usually not true, so you have additional latency from keeping the cache and memory coherent.

This is an orthogonal issue and affects both the DMA into the LS and cache misses equally.

- When it comes to cache, bigger != better. The Pentium 4 is a case in point. There was a big hoopla about the Pentium 4 L2 cache going from 1MB to 2MB. The problem is, the 2MB P4 is actually slower in most scenarios. Going from 1MB to 2MB meant a few more cache hits, but each hit was slower, since the cache was bigger, so overall performance was actually lower. The point of cache is to be fast, and to be fast, it must be small. As mentioned before, the A64 does quite well with a 512KB L2.

This is a non sequitur to the discussion at hand.

In summary, a cache is only superior to a LS when it can ensure significantly more hits than a LS. Which, thankfully, is most of the time. But for some scenarios (the one in the Suzuoki PDF), the LS is superior. It is questionable whether the scenarios used to design CELL are characteristic of computer game loads, but I for one don't doubt that this model has at least some merit.

It is questionable whether the data presented by Suzuoki even supports his thesis. That the working set is small does not mean that a LS is sufficient. In addition, if the data fits within a LS, it will fit within an equivalent cache at the same performance.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
The load/store latency wouldn't have been affected.
I'd like to believe that - but I've yet to see a 256KB+ cache that has 4-cycle load/store latency at 3.2GHz. Unless of course you're suggesting going with a much smaller cache - but then I'm no longer gonna agree the design would have any performance advantages.

The memory latencies would be less with a cache-based design because you don't have to wait for the DMA engine to finish before using the data.
I believe nAo meant they would be harder to hide, regardless of the few-cycle difference in total access time (given we're talking about at least 500 cycles for memory access, those won't make much difference anyway).
Anyway - I'm with you here - having SPEs with 256KB of 4-cycle L1 each would be great, especially with allowance for cache locking and DMA access to the locked regions.
But are you honestly trying to argue that design wouldn't have a significant overhead in terms of die area usage? I am more inclined to think nAo is right there when he said there'd only be room for 4 SPEs - heck, I'm questioning if it would even be that much.

Btw - if cache overheads are so low - how do you explain that the 512KB of L2 in Cell has nearly the same area as two SPEs combined?
 
aaronspink said:
Doesn't really make the design any simpler.
I don't believe that, 'cause with a local store an SPE is guaranteed to hit the data it's looking for every time it tries to access it, and it's not hard to imagine this would make the SPE design much simpler.
In-order CPUs have been designed and built with caches since the mid-'80s. It's a known design.
I'm not disputing that.


The load/store latency wouldn't have been affected. The additional logic overhead for the tags is more than overcome in the multi-cycle access time for the data array at the frequency the SPE runs at.
How many 256KB cache designs running at 4GHz with 4-cycle access latency do you know of?
Load/store latency from the local store is a key point in the SPE design, and I doubt a cache would have the same latency.
Die-area wise, if we look at CELL die photographs we see how the 512KB L2 cache takes almost the same space as a couple of SPEs, and I bet it has a much, much higher latency (40 cycles...).
[EDIT: Oops... Faf had the same thought :) ]
The memory latencies would be less with a cache based design because you don't have to wait for the DMA engine to finish before using the data.
This is true for an unoptimized app; an optimized app (with an optimized/customized data set) would likely prefetch/load data via DMA one time (per packet...) with burst accesses to the external mem.
I seriously doubt this would be slower than using a multitude of single accesses (due to cache misses).
Moreover, a mem controller able to manage thousands of cache misses at the same time (we have 8 SPEs and one PPE!) would be more complex than the actual CELL mem controller (which already handles 128 concurrent DMA queries).
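The usual way to get those per-packet burst accesses while hiding the DMA latency is double buffering: DMA chunk N+1 while crunching chunk N. A hedged sketch, again assuming the SDK's spu_mfcio.h intrinsics and made-up sizes:

```c
/* Double-buffered SPU streaming: the burst transfer for the next chunk
 * overlaps with computation on the current one. Sizes are illustrative. */
#include <spu_mfcio.h>

#define CHUNK 16384
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(volatile char *p, unsigned n);  /* placeholder */

void stream_double_buffered(unsigned long long ea, unsigned long long bytes)
{
    unsigned cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* kick off first burst  */

    for (unsigned long long off = CHUNK; off < bytes; off += CHUNK) {
        unsigned nxt = cur ^ 1;
        mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0);  /* prefetch next chunk   */

        mfc_write_tag_mask(1 << cur);                   /* wait only for current */
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);                 /* overlaps nxt's DMA    */
        cur = nxt;
    }
    mfc_write_tag_mask(1 << cur);                       /* drain the last chunk  */
    mfc_read_tag_status_all();
    process_chunk(buf[cur], CHUNK);
}
```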
 
One point people are forgetting with the Local Stores in Cell is that they are not only used to bring data in from main memory, but each SPE can effectively move data to other SPEs for pipelined applications.

It seems to me that a "cache" architecture to support this kind of pipelining would grow exponentially with the number of SPEs implemented (whether on the current Cell, or conceivably on another Cell). LS has access to two data spaces, main memory and the next LS (or multiple LSs if data branching is used) in the pipeline.

The LS architecture is, quite deliberately in my view, programmer managed for optimal performance. SPEs are single-threaded. The pipeline of multiple SPEs would, itself, be single-threaded (whether 2 SPEs or 7). The programmer is forced to split his algorithm in such a way as to maximise the use of available bandwidth (EIB bandwidth, effectively) while also tuning the pipeline length (number of SPEs) to hit the parallelism sweet spot for that algorithm.
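What that stage-to-stage handoff could look like in code: each LS is mapped into the effective-address space, so a stage can DMA its output straight into the next SPE's LS. A sketch assuming the SDK's MFC intrinsics; how next_ls_ea reaches this SPE (say, from the PPE via a mailbox) is left out:

```c
/* SPE-to-SPE pipelining: DMA this stage's output buffer directly to the
 * effective address at which the next SPE's local store is mapped. */
#include <spu_mfcio.h>

#define PACKET 4096
static volatile char out[PACKET] __attribute__((aligned(128)));

void push_downstream(unsigned long long next_ls_ea)
{
    const unsigned tag = 2;
    mfc_put(out, next_ls_ea, PACKET, tag, 0, 0);  /* LS -> next stage's LS */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();   /* in practice you'd signal the consumer too */
}
```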

Jawed
 
Starting with what I agree with, working from there...

It is questionable whether the data presented by Suzuoki even supports his thesis. That the working set is small does not mean that a LS is sufficient.

Right, that's what I said. I can only trust what Deano says in his blog - he thinks the data is representative. Far be it from me, a lowly hardware-EE, to contradict a PS3 developer on PS3 development.

This is a non sequitur to the discussion at hand.

You're right. I got excited, started blabbering.

Can the SPEs even directly read main memory? You are neglecting the DMA overheads and latencies as well.

You're right again - when I see "local store", I immediately think in terms of scratchpads and embedded systems - and totally forgot about CELL's memory architecture. But there is still the extra delay of going to the cache and finding out that it's a miss.

This is an orthogonal issue and affects both the DMA into the LS and cache misses equally.

I agree it's not as easy as I originally said to get the SPE to read mem, but the fact remains that one less level of cache means one less level of coherency maintenance to worry about, and one less source of latency. How significant this source is compared to the others, I dunno.

The access speed should be roughly the same since for the size and speed of the structures we are talking about, the primary delay is the data access. Even with associativity, the data path overhead will be a mux. The LS has the same issues with DMA latencies.

I'm not sure it's just a mux - you still need to find which of the cache lines (blocks? lines? help me if my terminology is wrong...) actually has the data you're looking for. A cleverly designed system will hide the latency by executing that logic in parallel with the actual SRAM read, asserting the correct control signal to the mux just as the data gets ready, but is it possible to TOTALLY hide that logic latency... I dunno.

And for cache writes, there is clearly more latency, because you have to implement your replacement algorithm somehow...and you obviously can't do it in parallel with the actual SRAM write. But write latency isn't really a problem for instruction cache anyhow.
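For reference, a toy C model of the hit path being discussed: one 4-way set, where the tag compares (run in parallel with the data-array read in real hardware) end up as nothing more than the select for the output mux:

```c
/* Toy model of a 4-way set-associative lookup. In hardware the four tag
 * compares happen in parallel with the data-array read; their one-hot
 * result is just the select signal for the final output mux. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS 4
#define LINE 64

struct way {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE];
};

/* Returns the hitting line's data, or NULL on a miss. */
const uint8_t *lookup(const struct way set[WAYS], uint32_t addr_tag)
{
    for (int w = 0; w < WAYS; w++)          /* sequential here, parallel in HW */
        if (set[w].valid && set[w].tag == addr_tag)
            return set[w].data;             /* this is the "mux select"        */
    return NULL;                            /* miss: go to the next level      */
}
```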

The size difference should be minimal. Realistic overheads are in the range of 1-2%.

It depends on how over/under-built that cache is. I can construct some pathological fully-associative cache that takes up ridiculous die space, or some cache that has so little overhead that it barely functions as a cache at all. 1-2%? I must admit, I really have no idea.

But what Faf said makes sense - "Btw - if cache overheads are so low - how do you explain that the 512KB of L2 in Cell has nearly the same area as two SPEs combined?" It seems clear that the cost is more than 1-2%.
 
All this talk of LS vs. cache - as I understand it you're talking L1 cache? Not L2? L1 cache is normally fairly small, like 32KB data/instruction.

Now let's say you've got an apulet with a 64KB program that works on sets of 64KB data. You prefetch your data and the instructions are ready to hand. How does a conventional CPU compare? The entire program can't fit in the instruction cache, so there'll be fetches to the rest of the program in the L2 cache. I guess access is predictable, so this isn't a huge lag? Same for data. It won't all fit, so you have to prefetch. Again, I guess this isn't a major problem and the cache is smart enough to handle it.

But what happens when you go multicore? Let's say 4 cores with 32KB data/instruction L1 caches, all doing the same. All will be sharing the L2 for the extra 32KB of program + 128KB of data each = 640KB, and the cache will have to manage all these accesses. That's got to be a lot slower than each core having its own local space without any conflicts with other resources (save DMA data input to the CPU).

It seems to an uneducated me that the only area where LS looks bad (apart from developers having to manually control it) is scattered data access without cohesive lumps of data to crunch. That'll mean huge lag as the SPE requests missing data from all over RAM. I would have thought, though, that that's the point of the PPU: to handle those aspects of a game that need these types of access patterns, while the SPEs are dedicated to systems designed for them. In such circumstances it would look like LS has a much more economical speed/die ratio than cache.
 
Adding another post, because my previous one was directed specifically at aaron, and this is a more general one...

I started reading Microprocessor Report's Feb 14, 2005 newsletter about CELL, and there seem to be a few other reasons for LS over cache.

Deterministic processing times (at least for the SPEs) make it easier for developers to cycle-count and optimize their code. That rules out hit-or-miss caches and out-of-order execution. I could see this being rather important for anyone trying to stream process.

Microprocessor Report said:
...local store memory for the SPEs do not use hardware cache-coherency snooping protocols avoiding the indeterminate nature of cache misses.

The LS supports 1024-bit DMA transfers - which seems to me to be quite useful - rather than writing one cache line at a time, it can transfer (relatively) large blocks of memory at a time. Incredibly (to me, at least), all 1024 bits can be written in one write cycle, keeping the read/write port free for the SPE most of the time.

Link? Umm...go find Microprocessor Report...somehow.
 
nondescript said:
The LS supports 1024-bit DMA transfers - which seems to me to be quite useful - rather than writing one cache line at a time, it can transfer (relatively) large blocks of memory at a time. Incredibly (to me, at least), all 1024 bits can be written in one write cycle, keeping the read/write port free for the SPE most of the time.
Yeah... DMA reads/writes and instruction fetch run at 400 GBytes per second... per SPE :D Stunning. (We could make up some number à la Xenos eDRAM. What about 3.2 terabytes/s aggregated bandwidth from local stores? :) It's fun to feed fanb0ys.)
Some weeks ago I asked why instruction fetch on the SPEs didn't consume half the SPE local store bandwidth... MPR gave us the answer :)
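For what it's worth, the arithmetic behind both numbers checks out, assuming the 1024-bit port moves 128 bytes every cycle at 3.2GHz:

```c
/* The arithmetic behind 400 GB/s per SPE and 3.2 TB/s aggregate. */
#include <stdio.h>

int main(void)
{
    double per_spe = 128.0 * 3.2e9;                       /* bytes/cycle x Hz */
    printf("per SPE: %.1f GB/s\n", per_spe / 1e9);        /* -> 409.6 GB/s    */
    printf("8 SPEs : %.2f TB/s\n", 8 * per_spe / 1e12);   /* -> ~3.28 TB/s    */
    return 0;
}
```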
 
nAo said:
(We could make up some number à la Xenos eDRAM. What about 3.2 terabytes/s aggregated bandwidth from local stores? :) It's fun to feed fanb0ys.)

Haha, absolutely. ;) Their brains would cook instantly.
 
Jawed said:
Sadly EIB can't keep up with your magic numbers, nAo.
Sadly you don't understand I'm kidding, Jawed. (So we just proved the feeding works... thank you, nondescript ;) )
 