aaronspink said:
The 96B/cycle is a bit of sleight of hand the way some people are using it. The actual bandwidth available on the EIB at any given point is 32B/cycle. This is derived from the 4 separate 8B counter-rotating rings. Each ring is broken up into several sections, with a maximum occupancy of 3 on any given ring.
Sounds like 96B/cycle to me...granted, that's 96B/cycle peak bandwidth, but since when have hardware manufacturers been honest about real performance?
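For what it's worth, here is the arithmetic behind both numbers as a quick sketch. The constant and function names are mine, and the occupancy of 3 per ring is simply the figure quoted above, not anything confirmed by IBM.

```python
# Figures taken from the quoted post; names are illustrative only.
RINGS = 4            # counter-rotating rings on the EIB
BYTES_PER_RING = 8   # width of each ring, in bytes per cycle
MAX_OCCUPANCY = 3    # assumed max simultaneous transfers per ring


def per_point_bandwidth(rings=RINGS, width=BYTES_PER_RING):
    """Bytes/cycle at one point, assuming it can use every ring at once."""
    return rings * width


def peak_aggregate_bandwidth(rings=RINGS, width=BYTES_PER_RING,
                             occupancy=MAX_OCCUPANCY):
    """Bytes/cycle summed over the whole bus with every ring fully occupied."""
    return rings * width * occupancy


if __name__ == "__main__":
    print("per point:     ", per_point_bandwidth(), "B/cycle")
    print("peak aggregate:", peak_aggregate_bandwidth(), "B/cycle")
```

So 32B/cycle and 96B/cycle are both "right"; they just measure different things (one point on the bus versus the whole bus).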
aaronspink said:
No information is currently available on whether the occupancy limitation is electrical or logical, nor on whether the occupancy restrictions are source neutral or source specific (i.e. can 3 SPE datums be on a ring at the same time).
Actually, it's not even known whether the bandwidth at any given point is even 32B/cycle, since there is no information on whether a point is allowed to receive from or send to more than 1 ring at a time.
The actual L2 cache array may only support 16B per cycle but the interface into the L2 could possibly support more. We still don't know the coherence between a DMA request from an SPE and the L2 cache.
Interesting... but irrelevant to the question of whether the L2 has 11 read/write ports. Do you seriously believe the L2 has 11 read/write ports? If anything, your reservations about whether "it is allowed for a point to receive or send from more than 1 ring at a time" seem to show otherwise. What's the point of having 11 read/write ports if they cannot be used simultaneously? (We agreed IBM is not dumb...)
The ring bus topology clearly suggests that the number of ports in the L2 is proportional to the number of rings and not the number of devices on the ring - how could it be otherwise and still be a ring bus?
Discussing the other question (does the EIB have logic) seems to be moot now. How are you going to manage three occupants on one ring bus without some logic to partition it?
Compare the 512KB L2 to other L2s, say from Banias, which has 512 KB of cache in less area and is 1 process generation behind. The L2 in CELL is not very dense, and it is that way for reasons that do not apply to, say, a large L1 cache in an SPE.
But you don't even need to make that comparison: you can look at the actual die photo and see the 4 sub arrays and 4 tag arrays that make up the data storage of the L2. Pretty much all the other logic in the L2 block is bookkeeping logic dealing with requests in the L2, etc.
I did look at the dies, and you're right, the L2 for CELL is bigger than most L2s in Intel chips.
As for an example? The L2 of Dothan has a ~7 cycle access latency for a 2 MB cache and seems to run fine at 2.4+ GHz.
L2 of Dothan - 10 cycles @ 2.3 GHz max = 4.35 ns
LS of SPU - 6 cycles @ 3.2 GHz (5.2 GHz with higher power) = 1.88 ns (1.15 ns)
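If anyone wants to check those numbers, the conversion is just cycles divided by clock frequency. A minimal sketch (the function name is mine; the cycle counts and clocks are the ones listed above):

```python
def latency_ns(cycles, clock_ghz):
    """Access latency in nanoseconds: cycles / (cycles per nanosecond)."""
    return cycles / clock_ghz


if __name__ == "__main__":
    # (label, access latency in cycles, clock in GHz), as quoted in this post
    cases = [
        ("Dothan L2 @ 2.3 GHz", 10, 2.3),
        ("SPU LS    @ 3.2 GHz", 6, 3.2),
        ("SPU LS    @ 5.2 GHz", 6, 5.2),
    ]
    for label, cycles, ghz in cases:
        print(f"{label}: {latency_ns(cycles, ghz):.2f} ns")
```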
To be fair, the L2 of Dothan is 2 MB, which is eight times bigger than the 256 KB LS, so this is hardly an apples-to-apples comparison. Looking at the die pictures and scaling them with respect to die size, I think the Dothan L2 is about 33% less dense than the SPE LS. (Meaning, per unit capacity, the Dothan L2 is about 50% bigger than the SPE LS.) That's a paper-pencil-ruler-on-the-screen calculation; if someone wants to do it rigorously, be my guest.
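In case the "33% less dense" versus "50% bigger" bit looks inconsistent, here is how one turns into the other. This is only a sketch of the conversion: the 33% is my ruler estimate from above, not a measured value, and the function names are made up for illustration.

```python
def capacity_ratio(big_kb, small_kb):
    """How many times larger one memory is than the other, by capacity."""
    return big_kb / small_kb


def area_per_capacity_ratio(density_deficit):
    """Area per unit capacity relative to the denser array.

    density_deficit = 0.33 means "33% less dense", i.e. 0.67x the density,
    so each unit of capacity takes 1 / 0.67 times the area.
    """
    return 1.0 / (1.0 - density_deficit)


if __name__ == "__main__":
    print("Dothan L2 / SPE LS capacity:", capacity_ratio(2048, 256))
    ratio = area_per_capacity_ratio(0.33)  # 0.33 is the rough estimate above
    print(f"Dothan L2 area per unit capacity: {ratio:.2f}x the SPE LS")
```

That comes out to roughly 1.5x, which is where the "about 50% bigger" figure comes from.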