Will L2 cache on X360's CPU be a major hurdle?

Gubbi said:
L1 latency * L1 hitrate + L2 latency * (1 - L1 hitrate) * L2 hitrate + Main memory latency * (1 - L1 hitrate) * (1 - L2 hitrate)
:) Haven't seen that equation in a while, brings back memories of the good old days when all I had to do was read Hennessy for computer architecture, and life was simple.
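A quick numeric sketch of that formula, for anyone who wants to see it in action; the latencies and hit rates below are made-up placeholder values, not figures for any chip discussed in this thread.

```c
/* Minimal numeric sketch of the average-access-time formula quoted above.
 * The latencies and hit rates are hypothetical placeholders. */
#include <stdio.h>

int main(void)
{
    double l1_lat = 4.0, l2_lat = 15.0, mem_lat = 200.0; /* cycles, hypothetical */
    double l1_hit = 0.95, l2_hit = 0.80;                 /* hit rates, hypothetical */

    double avg = l1_lat * l1_hit
               + l2_lat * (1.0 - l1_hit) * l2_hit
               + mem_lat * (1.0 - l1_hit) * (1.0 - l2_hit);

    printf("average access time = %.2f cycles\n", avg); /* 6.40 with these numbers */
    return 0;
}
```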
 
aaronspink said:
The size difference should be minimal. Realistic overheads are in the range of 1-2%.

I believe they simply don't have the budget for that kind of overhead. Any 1-2% given over to these SPE overheads has to be taken from somewhere else, which may compromise performance even more.
 
nondescript said:
9 separate load/store sources (!?) I'm pretty sure that's not true. If you're referring to 8 SPEs + 1 PPU = 9 sources, that is mistaken. The SPEs (and the rest of the CELL) access the L2 cache via the EIB. Only the PPU accesses the L2 cache directly (through the L1, of course).

You are assuming that the SPE issues requests to the EIB, which consolidates them and issues requests to the L2/memory subsystem. My understanding is that the EIB is merely transport and therefore logically transparent. In addition, there appears to be an exact split in the outstanding load/store issues from the SPEs of 16 per SPE, making a total of 128.

Can you give an example of a cache that is comparable to the SPE LS in terms of latency and size? (on similar fab processes, of course...) I think the burden is on you to prove this assertion; as Faf and nAo have pointed out, there are some pretty obvious reasons to believe this is not true.

What have they proven, and what obvious reasons? You want an example? The example is the LS itself. The LS is the data array for the cache. Then we only need to calculate the tag array and associated logic overhead. With 128 or 256 byte cache lines, the data storage for the tag array is minuscule: on the order of 16-20 bits per line, with only 1024 or 2048 lines. The match logic is also minor. The tag array should have much better access time than the data array.
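To make that arithmetic concrete, here is a back-of-the-envelope sketch using the figures above (256 KB of data, 128-byte lines, the upper end of the 16-20 bits of tag and state per line). These are the thread's estimates, not anything from an IBM document.

```c
/* Rough check of the tag-array overhead estimate: 256 KB of data, 128 B lines,
 * 20 bits of tag/state per line (the thread's figures, not IBM's). */
#include <stdio.h>

int main(void)
{
    const unsigned data_bytes    = 256 * 1024;               /* LS-sized data array   */
    const unsigned line_bytes    = 128;                      /* cache line size       */
    const unsigned lines         = data_bytes / line_bytes;  /* 2048 lines            */
    const unsigned bits_per_line = 20;                       /* tag + state, estimate */

    unsigned tag_bits = lines * bits_per_line;                /* 40960 bits = 5 KB     */
    double   overhead = 100.0 * tag_bits / (data_bytes * 8.0);

    printf("%u lines -> %u tag bits (%u KB), %.2f%% overhead vs. the data array\n",
           lines, tag_bits, tag_bits / 8 / 1024, overhead);
    return 0;
}
```

With those assumptions it comes out to roughly 2%, in line with the 1-2% figure quoted earlier in the thread.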


If you say so. But they have a full page explaining the load/store behavior of the LS that you want, including port sizes and counts, access priorities, TLB and SLB specs, DMA and more. But hey, it's all PR. :!:


MPR had a full page of fluff, leaving out any details that are useful as far as memory models and load/store addressing are concerned. What is presented in the MPR piece is only a little more than useless for figuring out the interaction between the load/stores and the LS, as well as the coherence and ordering of the DMA engine with respect to main memory and the PPE.
 
I'll assume by your silence that you agree with the points that you have not directly addressed.

aaronspink said:
You are assuming that the SPE issues requests to the EIB, which consolidates them and issues requests to the L2/memory subsystem. My understanding is that the EIB is merely transport and therefore logically transparent. In addition, there appears to be an exact split in the outstanding load/store issues from the SPEs of 16 per SPE, making a total of 128.
I don't think the EIB is logically transparent, since it is capable of 96B/cycle, while its interface to the L2 cache is 16B/cycle. Each SPE has a 16B/cycle link to the EIB. Clearly, some logic is involved. Also, following your counting, you would need at least 11 ports, to include the I/O controllers to RSX and main mem, which are also connected to the EIB.

aaronspink said:
What have they proven and what obvious reasons?
The obvious reasons are: 1. IBM's not dumb, 2. If cache is that cheap, why is the 512KB L2 so big? You posted some rebuttals, one of which I am disputing - the one above - and the rest I understand but find unconvincing. But rather than going in circles, let's move on.

aaronspink said:
You want an example? The example is the LS itself. The LS is the data array for the cache. Then we only need to calculate the tag array and associated logic overhead. With 128 or 256 byte cache lines, the data storage for the tag array is minuscule: on the order of 16-20 bits per line, with only 1024 or 2048 lines. The match logic is also minor. The tag array should have much better access time than the data array.
I read this the first time you posted it, no need to repeat yourself. My question was, "Can you give an example of a cache that is comparable to the SPE LS in terms of latency and size?" - meaning a real implementation in silicon. If such a great cache is possible, someone surely must have made it. It would also end this debate quite easily. Just saying "The match logic is also minor" is rather unconvincing. Like Guy Kawasaki (the VC guy) said, "Ideas are cheap, implementation is hard."

aaronspink said:
MPR had a full page of fluff, leaving out any details that are useful as far as memory models and load/store addressing are concerned. What is presented in the MPR piece is only a little more than useless for figuring out the interaction between the load/stores and the LS, as well as the coherence and ordering of the DMA engine with respect to main memory and the PPE.
If you're saying MPR doesn't read like a spec sheet (more like super-thick spec book, these days), sure, I agree with you.
 
nondescript said:
I don't think the EIB is logically transparent, since it is capable of 96B/cycle, while its interface to the L2 cache is 16B/cycle. Each SPE has a 16B/cycle link to the EIB. Clearly, some logic is involved. Also, following your counting, you would need at least 11 ports, to include the I/O controllers to RSX and main mem, which are also connected to the EIB.

The 96B/cycle figure is a bit of sleight of hand the way some people are using it. The actual bandwidth available on the EIB at any given point is 32B/cycle. This is derived from the 4 separate 8B counter-rotating rings. Each ring is broken up into several sections, with a maximum occupancy of 3 on any given ring. No information is currently available on whether the occupancy limitation is electrical or logical, nor on whether the occupancy restrictions are source-neutral or source-specific (i.e. can 3 SPE data be on a ring at the same time?).

Actually, it's not even known whether the bandwidth at any given point is even 32B/cycle, since there is no information on whether a given point is allowed to receive or send on more than 1 ring at a time.
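For reference, the arithmetic behind the two numbers above, taking this post's description at face value (4 rings, 8 bytes wide, at most 3 transfers in flight per ring); whether those parameters are accurate is exactly what is being debated.

```c
/* Arithmetic behind the 96B/cycle and 32B/cycle figures, using the ring
 * parameters as described in this post (not confirmed specs). */
#include <stdio.h>

int main(void)
{
    const unsigned rings          = 4;  /* counter-rotating rings        */
    const unsigned bytes_per_ring = 8;  /* ring width, per the post      */
    const unsigned max_occupancy  = 3;  /* concurrent transfers per ring */

    printf("aggregate peak      : %u B/cycle\n", rings * bytes_per_ring * max_occupancy); /* 96 */
    printf("at any single point : %u B/cycle\n", rings * bytes_per_ring);                 /* 32 */
    return 0;
}
```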

The actual L2 cache array may only support 16B per cycle but the interface into the L2 could possibly support more. We still don't know the coherence between a DMA request from an SPE and the L2 cache.

And you are right, I should have said 11 requestors.


The obvious reasons are: 1. IBM's not dumb, 2. If cache is that cheap, why is the 512KB L2 so big? You posted some rebuttals, one of which I am disputing - the one above - and the rest I understand but find unconvincing. But rather than going in circles, let's move on.

Compare the 512KB L2 to other L2s, from, say, Banias, which has 512 KB of cache in less area and is 1 process generation behind. The L2 in CELL is very much un-dense, and it is that way for reasons that do not apply to, say, a large L1 cache in an SPE.

But you don't even need to make that comparison: you can look at the actual die photo and see the 4 sub-arrays and 4 tag arrays that make up the data storage of the L2. Pretty much all the other logic in the L2 block is bookkeeping logic dealing with requests in the L2, etc.

I read this the first time you posted it, no need to repeat yourself. My question was, "Can you give an example of a cache that is comparable to the SPE LS in terms of latency and size?" - meaning a real implementation in silicon. If such a great cache is possible, someone surely must have made it. It would also end this debate quite easily. Just saying "The match logic is also minor" is rather unconvincing. Like Guy Kawasaki (the VC guy) said, "Ideas are cheap, implementation is hard."

As for an example? The L2 of Dothan has a ~7 cycle access latency for a 2 MB cache and seems to run fine at 2.4+ GHz.

The match logic is minor. It's always nice to quote Kawasaki, but this has been implemented in designs for the last 20 years.

Aaron Spink
speaking for myself inc.
 
aaronspink said:
The 96B/cycle figure is a bit of sleight of hand the way some people are using it. The actual bandwidth available on the EIB at any given point is 32B/cycle. This is derived from the 4 separate 8B counter-rotating rings. Each ring is broken up into several sections, with a maximum occupancy of 3 on any given ring.
Sounds like 96B/cycle to me...granted, that's 96B/cycle peak bandwidth, but since when have hardware manufacturers been honest about real performance?

aaronspink said:
No information is currently available on whether the occupancy limitation is electrical or logical, nor on whether the occupancy restrictions are source-neutral or source-specific (i.e. can 3 SPE data be on a ring at the same time?).

Actually, it's not even known whether the bandwidth at any given point is even 32B/cycle, since there is no information on whether a given point is allowed to receive or send on more than 1 ring at a time.

The actual L2 cache array may only support 16B per cycle but the interface into the L2 could possibly support more. We still don't know the coherence between a DMA request from an SPE and the L2 cache.
Interesting...but irrelevant to the question of whether the L2 has 11 read/write ports. Do you seriously believe the L2 has 11 read/write ports? If anything, your reservations about whether "a given point is allowed to receive or send on more than 1 ring at a time" seem to show otherwise. What's the point of having 11 read/write ports if they cannot be used simultaneously? (We agreed IBM is not dumb...)

The ring bus topology clearly suggests that the number of ports in the L2 is proportional to the number of rings and not the number of devices on the ring - how could it be otherwise and still be a ring bus?

Discussing the other question (does the EIB have logic) seems to be moot now. How are you going to manage three occupants on one ring bus without some logic to partition it?

Compare the 512KB L2 to other L2s, from, say, Banias, which has 512 KB of cache in less area and is 1 process generation behind. The L2 in CELL is very much un-dense, and it is that way for reasons that do not apply to, say, a large L1 cache in an SPE.

But you don't even need to make that comparison: you can look at the actual die photo and see the 4 sub-arrays and 4 tag arrays that make up the data storage of the L2. Pretty much all the other logic in the L2 block is bookkeeping logic dealing with requests in the L2, etc.
I did look at the dies, and you're right, the L2 for CELL is bigger than most L2s in Intel chips.

As for an example? The L2 of Dothan has a ~7 cycle access latency for a 2 MB cache and seems to run fine at 2.4+ GHz.

L2 of Dothan - 10 cycles @ 2.3 GHz max = 4.35 ns
LS of SPU - 6 cycles @ 3.2 GHz (5.2 GHz with higher power) = 1.88 ns (1.15 ns)
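The cycle-to-nanosecond conversions behind those two lines, for anyone who wants to check the arithmetic (latency in ns = cycles / clock in GHz):

```c
/* Cycle-to-nanosecond conversions for the comparison above,
 * using the clock figures as quoted in the thread. */
#include <stdio.h>

static double ns(double cycles, double ghz) { return cycles / ghz; }

int main(void)
{
    printf("Dothan L2: %.2f ns (10 cycles @ 2.3 GHz)\n", ns(10, 2.3)); /* ~4.35 ns */
    printf("SPU LS   : %.2f ns (6 cycles @ 3.2 GHz)\n",  ns(6, 3.2));  /* ~1.88 ns */
    printf("SPU LS   : %.2f ns (6 cycles @ 5.2 GHz)\n",  ns(6, 5.2));  /* ~1.15 ns */
    return 0;
}
```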

To be fair, the L2 of Dothan is 2 MB, which is eight times bigger, so this is hardly a fair comparison. Looking at the die pictures and scaling them wrt die size, I think the Dothan L2 is about 33% less dense than the SPE LS. (Meaning, per unit capacity, the Dothan L2 is about 50% bigger than the SPE LS) That's a paper-pencil-ruler-on-the-screen calculation - if someone wants to do it rigorously, be my guest.
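The two phrasings are consistent, for what it's worth: if density drops by about a third, area per unit capacity goes up by about half. A one-line check, using the 33% estimate above (which is itself only a ruler-on-the-screen figure):

```c
/* Checking that "33% less dense" and "~50% more area per unit capacity"
 * describe the same ratio. */
#include <stdio.h>

int main(void)
{
    double density_ratio = 1.0 - 0.33;          /* Dothan L2 vs. SPE LS, per the estimate */
    double area_ratio    = 1.0 / density_ratio; /* area per unit capacity */

    printf("area per unit capacity: %.0f%% bigger\n", (area_ratio - 1.0) * 100.0); /* ~49% */
    return 0;
}
```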
 
Anyways, I think we're hitting diminishing returns here... if after three (or is it four?) pages you (aaronspink) still believe that it would be possible to squeeze a 256KB cache of similar latency into a similarly-sized space, I don't know how three or four more pages would help. Besides space and latency, the LS has other advantages such as deterministic processing times and simplified I/O management. In any case, the design decision has already been made; it's just a matter of time now to see if it was right.
 
nondescript said:
Sounds like 96B/cycle to me...granted, that's 96B/cycle peak bandwidth, but since when have hardware manufacturers been honest about real performance?

For quite a long time. You spend too much time in the console mud and you think everyone is dishonest.

Interesting...but irrelevant to the question of whether the L2 has 11 read/write ports. Do you seriously believe the L2 has 11 read/write ports? If anything, your reservations about whether "a given point is allowed to receive or send on more than 1 ring at a time" seem to show otherwise. What's the point of having 11 read/write ports if they cannot be used simultaneously? (We agreed IBM is not dumb...)

I don't believe that I've ever stated that the L2 has 11 read or write ports. I have merely stated that it is likely that the L2 has support for handling more than 1 requestor, with the total number of requestors it can handle potentially being 11. This says nothing about the number of read or write ports, but about the logic needed around the cache to handle the ordering, arbitration, and tracking of accesses to the L2.

The ring bus topology clearly suggests that the number of ports in the L2 is proportional to the number of rings and not the number of devices on the ring - how could it be otherwise and still be a ring bus?

Actually, the number of ports in the L2 is most likely 1 read and 1 write port, which is orthogonal to the number of packets/requests coming from the ring interface each cycle.
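A sketch of the distinction being drawn here (ports on the array versus requestors tracked by the surrounding logic), purely for illustration; the structure, names, and the count of 11 requestors are hypothetical, not a description of the real CELL L2 control logic.

```c
/* Illustrative only: a cache with a single read port and a single write port
 * on the array can still track many outstanding requestors in its control
 * logic. Names and counts below are hypothetical. */
#include <stdio.h>

#define MAX_REQUESTORS 11  /* e.g. 8 SPEs + PPU + 2 I/O controllers */

enum req_type { REQ_READ, REQ_WRITE };

struct l2_request {
    int           requestor; /* which unit issued the request */
    enum req_type type;
    unsigned long addr;
};

int main(void)
{
    /* Requests queued from several different sources... */
    struct l2_request pending[MAX_REQUESTORS] = {
        { 0, REQ_READ,  0x1000 }, /* SPE 0 */
        { 8, REQ_WRITE, 0x2000 }, /* PPU   */
        { 9, REQ_READ,  0x3000 }, /* I/O   */
    };
    int count = 3;

    /* ...but the array itself services at most one read and one write per cycle. */
    for (int cycle = 0; count > 0; cycle++) {
        struct l2_request r = pending[--count];
        printf("cycle %d: %s from requestor %d, addr 0x%lx\n",
               cycle, r.type == REQ_READ ? "read" : "write", r.requestor, r.addr);
    }
    return 0;
}
```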


Discussing the other question (does the EIB have logic) seems to be moot now. How are you going to manage three occupants on one ring bus without some logic to partition it?

I believe the term I used was logically transparent. The ring bus would be logically transparent to the SPE if it acted solely as a transport mechanism. If, on the other hand, it handled the aggregation and combination of requests, as well as the ordering of said requests, then it wouldn't be logically transparent. Nothing has been revealed that would make me assume that the rings are anything but logically transparent.


L2 of Dothan - 10 cycles @ 2.3 GHz max = 4.35 ns
LS of SPU - 6 cycles @ 3.2 GHz (5.2 GHz with higher power) = 1.88 ns (1.15 ns)

The load-to-use latency of the L2 in Dothan is 10 cycles, but this includes the 3-cycle miss latency of the L1. The access latency is 7 cycles.
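In other words, using the cycle counts stated here:

```c
/* The breakdown being argued: the 10-cycle figure is load-to-use,
 * i.e. the L1 miss latency plus the L2 access itself. */
#include <stdio.h>

int main(void)
{
    const int l1_miss_cycles   = 3; /* L1 miss latency, per the post */
    const int l2_access_cycles = 7; /* L2 access latency             */

    printf("load-to-use latency = %d cycles\n", l1_miss_cycles + l2_access_cycles); /* 10 */
    return 0;
}
```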

Also, the LS doesn't run at 5.2 GHz, unless you want to acquiesce and state that the Dothan L2 runs at 4+ GHz. The shmoo plots of the LS were done at Tjs of <50 degrees.

Or we can look at the PA-8900, which has a 3-cycle 1.5 MB L1 at 1.1 GHz on a 0.13 µm process.

To be fair, the L2 of Dothan is 2 MB, which is eight times bigger, so this is hardly a fair comparison. Looking at the die pictures and scaling them wrt die size, I think the Dothan L2 is about 33% less dense than the SPE LS. (Meaning, per unit capacity, the Dothan L2 is about 50% bigger than the SPE LS) That's a paper-pencil-ruler-on-the-screen calculation - if someone wants to do it rigorously, be my guest.

Dothan fits ~256KB of cache in ~6 mm², which is roughly the same as the LS. I think your scaling is off.
 