Large L3 cache worth it?

ninelven

So this article over at Anandtech got me to thinking...

http://anandtech.com/cpuchipsets/showdoc.aspx?i=3572&p=1

The Athlon II with 1 MB of L2 per core comes in at 117 mm^2. From the die shot, I'd estimate a 4-core variant would be ~185 mm^2. That is substantially smaller than the Phenom II at 258 mm^2: about 72% of its area.

In Anand's review comparing the 3.1 GHz Phenom II to the 3.0 GHz Athlon II, the Athlon II certainly delivers more than 72% of the Phenom II's performance, even with its 100 MHz clock disadvantage.

Looking at it another way, they could fit 5 Athlon II cores in a smaller die than the Phenom II or 6 cores in one a few mm^2 larger than the Phenom II.
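For anyone who wants to check the arithmetic, here is a rough sketch using only the figures quoted in this post. The linear per-core extrapolation and the helper name `estimated_area` are assumptions for illustration, not something taken from the die shot, so the 6-core number lands a few mm^2 away from the eyeballed estimate above.

```python
# Die areas in mm^2, as quoted in the post.
ATHLON_II_X2_AREA = 117.0  # dual core, 1 MB L2 per core
ATHLON_II_X4_AREA = 185.0  # poster's estimate from the die shot
PHENOM_II_X4_AREA = 258.0  # quad core with L3

# Assumption: every additional core (plus its L2) costs the same area.
AREA_PER_CORE = (ATHLON_II_X4_AREA - ATHLON_II_X2_AREA) / 2


def estimated_area(cores):
    """Linear extrapolation from the dual-core die (hypothetical helper)."""
    return ATHLON_II_X2_AREA + (cores - 2) * AREA_PER_CORE


area_ratio = ATHLON_II_X4_AREA / PHENOM_II_X4_AREA

print(f"4-core Athlon II vs Phenom II area: {area_ratio:.0%}")
for cores in (4, 5, 6):
    print(f"{cores} cores: ~{estimated_area(cores):.0f} mm^2")
```

Under that assumption each extra core costs about 34 mm^2, which is where the "5 or 6 cores in roughly a Phenom II die" idea comes from.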

I'm probably missing something... so what is it...
/boggle?
 
From this clock-for-clock comparison it seems that a large L3 is still not a complete overkill for dual-core K10.
The PhII's L3 cache is still good to provide low-latency communication between the running threads where it matters, as long as the working set fits its size. Bandwidth-wise, it's not much better than a moderately fast DDR3 interface.
The doubled L2 in Regor seems to come down to the die having enough leftover space to fill with more stuff (pad limitation or whatever else, you name it). The exclusive cache relationship hardly calls for big L2s in this architecture (the uber-fast L1D is still holding its place).
 
Yes, it averages ~5% faster, but in that space (just the L3) one could put two more Athlon II cores.

So, AMD could put out a 6-core Athlon II with pretty much the same die size as a 4-core Phenom II.
 
Sure they can -- the question is, will those 4 (or 6) cores scale in performance well enough, without L3 cache.
 
L1D latency for PhII is what it was for every AMD architecture since the very first K7 -- 3 cycles access time.
L2 latency is 15 cycles, the same as the first 65nm K10 implementation and similar to Penryn's L2 value.
L3 figures are a bit of a mystery, but it is measured at ~50 cycles on average, depending on the test pattern.
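To put those cycle counts in wall-clock terms, here is a quick conversion sketch. The 3.0 GHz clock is an assumption borrowed from the chips discussed in this thread, and the L3 figure is only the rough ~50 cycle average mentioned above.

```python
CLOCK_GHZ = 3.0  # assumed core clock, not a measured value

# Load-to-use latencies in core cycles, as quoted above.
LATENCY_CYCLES = {"L1D": 3, "L2": 15, "L3": 50}


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert core cycles to nanoseconds: 1 GHz means 1 cycle per ns."""
    return cycles / clock_ghz


for level, cycles in LATENCY_CYCLES.items():
    print(f"{level}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
```

At that clock the L1D sits at about 1 ns, the L2 at 5 ns, and the L3 somewhere around 17 ns.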
 
Yes, it averages ~5% faster, but in that space (just the L3) one could put two more Athlon II cores.

So, AMD could put out a 6-core Athlon II with pretty much the same die size as a 4-core Phenom II.

Not really! Everyone seems to forget about power consumption - cache is relatively power friendly and CPU cores the opposite!

No?
EDIT:
I guess Istanbul blows that theory out of the water... hmmm.
 
Yes, it averages ~5% faster, but in that space (just the L3) one could put two more Athlon II cores.

So, AMD could put out a 6-core Athlon II with pretty much the same die size as a 4-core Phenom II.
Don't forget the advantage the L3 cache gives should grow in theory with more cores (since the cores have to fight for access to memory).
I also think the Athlon II gets some die-size advantage because it only has one HT link (afaik Phenom/Opteron use the same die); maybe someone can actually find that in the die shot.
 
Looking at it another way, they could fit 5 Athlon II cores in a smaller die than the Phenom II or 6 cores in one a few mm^2 larger than the Phenom II.

It may be smaller, but it's not necessarily cheaper. A small defect in an SRAM array can be corrected easily, which makes its yield rate much better.

Also, multiple cores have their limits in scaling. You really can't expect normal applications to just perform better with double the number of cores when you already have, say, four cores.
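One common way to put numbers on that scaling limit is Amdahl's law. This is a minimal sketch, not a measurement: the 90% parallel fraction is purely an assumed value, and real applications will vary a lot.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup over one core when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


PARALLEL_FRACTION = 0.9  # assumption for illustration only

for cores in (2, 4, 6, 8):
    speedup = amdahl_speedup(PARALLEL_FRACTION, cores)
    print(f"{cores} cores: {speedup:.2f}x")
```

With that assumed fraction, going from four to six cores adds roughly one more "core's worth" of speedup, and each further pair of cores adds less.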
 
I guess Istanbul blows that theory out of the water... hmmm.
Well, power doesn't really seem to be an issue from the benchmarks I've seen. Comparing the PII X3 vs PII X2, the power draw increase is minimal.

Don't forget the advantage the L3 cache gives should grow in theory with more cores
This is true, however I have to wonder how significant it would be. (seems to be worth ~6% @ 2 cores)

You really can't expect normal applications to just perform better with double the number of cores when you already have, say, four cores.
The Athlon II is certainly slower in single-threaded apps, and more cores won't help there. But do people really care if Word opens 1ms faster these days? I was thinking the extra cores would help out where people care most: long, processor-intensive tasks and processor-intensive multitasking. For example, video transcoding could be done on four cores, with the user still left with 2 free cores to work on.

I will say, I don't think it would be viable to take this much/any further than 6 cores, but I do think it could offer the user a meaningful alternative.
 
From this clock-for-clock comparison it seems that a large L3 is still not a complete overkill for dual-core K10.
The PhII's L3 cache is still good to provide low-latency communication between the running threads where it matters, as long as the working set fits its size. Bandwidth-wise, it's not much better than a moderately fast DDR3 interface.
The doubled L2 in Regor seems to come down to the die having enough leftover space to fill with more stuff (pad limitation or whatever else, you name it). The exclusive cache relationship hardly calls for big L2s in this architecture (the uber-fast L1D is still holding its place).

Can the cores communicate over the HyperTransport crossbar, or does that still require going out to main memory?

Too bad AMD couldn't redesign the dual-core Athlons to have a shared L2 cache like the Core 2 Duos.
I'd imagine that performance is close enough that the majority of AMD's lineup will be sans L3 cache though, for dual and quad cores. Only the highest-end quad cores, and cut-down chips with failed cores, will have L3 cache.

As for the 6-core idea with no L3 cache... I wonder how it would perform when paired with the fastest DDR3 memory? Memory performance should matter more when the chip is cache-impaired.
 
Can the cores communicate over the HyperTransport crossbar, or does that still require going out to main memory?
The crossbar interface here provides a shortcut for broadcasting the MOESI flags, i.e. the state of the data in all the caches in the system. When it comes to actually reading, writing, or modifying the contents of a cache, main memory is used as the "medium" to pass the data around, if required.
 