Why does AMD's L2 cache max out at 2MB?

mailrapid

Newcomer
Even on its high-end products, AMD's L2 cache tops out at 2MB.
Intel's L2 cache, on the other hand, can reach 12MB.

Is this a disadvantage for AMD?
 
L2 cache size isn't a specific advantage or disadvantage; it's entirely dependent on the architecture of the chip. If AMD had 64MB of L2 cache tomorrow, would that solve their performance issues?

For an example answer to this question, look at the older Intel Pentiums that had 1MB of L2 cache: were they always faster than the AMD parts that had 512KB of cache at the time? The answer was nearly always a resounding "no".

Even the current E4200 chips, with their comparatively small caches, are quite fast compared to equally clocked AMD parts. The performance difference isn't down to L2 cache specifically; it's architecture.

Would an AMD chip benefit from more cache? If it were a significant benefit, I'd have enough faith in their engineering staff to figure it out. As it stands, it sounds like AMD feels that large mound of transistors is better spent on something more interesting: more floating-point units, more register space, an on-board memory controller, squeezing three cores onto the same piece of silicon, and so on.
 
It certainly isn't an advantage.

Intel has always had leading cache density.
AMD's cache density is actually better than it used to be, but it doesn't get near Intel's.

The usual argument is that the on-die memory controller makes cache less important, but if AMD could get more cache on-die, it would.

The capacity disparity is so large between AMD and Intel chips that the IMC is not enough to make up for it.

The overall cache capacity issue is worse with Barcelona. Thanks to the paltry and slow L3, memory latency is now merely on par with Core 2's, and each core's L2 is half the size.
 
The Intel CPUs use a large cache (in addition to very sophisticated usage of said cache) to offset the liability of having to use the FSB to communicate with the external memory controller.
 
That, and large caches don't hurt performance, all else being equal.

Intel's caches are faster than AMD's and several times larger.

Some games, like Half Life 2, are affected by cache sizes smaller than 1 MiB.

Large caches reduce memory traffic on laptops, which allows slower bus and memory speeds to save power.

A lot of server workloads like large caches, because their access patterns are too irregular to predict or stream.

Xeon MPs based on the Pentium 4 with large caches sold for much longer than the desktop variants, because there are workloads where the cache matters.


As such, Intel's large caches enabled better performance across 3 market segments.
AMD's caches are an increasing liability in the general server and desktop markets.
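The cache-size sensitivity described above is easy to observe with a pointer-chasing microbenchmark: time per access jumps each time the working set outgrows a cache level. Here is a rough sketch in Python (a C version would give much cleaner numbers, since the interpreter adds fixed overhead per step; all sizes are illustrative):

```python
# Sketch: pointer-chasing latency vs. working-set size. Each load depends on
# the previous one, which defeats hardware prefetch and out-of-order overlap.
import random
import time

def make_chain(n):
    """Build a random single-cycle permutation: chain[i] is the next index,
    and following it from any start visits all n slots before repeating."""
    order = list(range(n))
    random.shuffle(order)
    chain = [0] * n
    for i in range(n):
        chain[order[i]] = order[(i + 1) % n]
    return chain

def ns_per_access(chain, steps=200_000):
    """Chase the chain for `steps` loads and return nanoseconds per access."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]
    return (time.perf_counter() - start) * 1e9 / steps

if __name__ == "__main__":
    # Sizes in list elements; a CPython list slot is pointer-sized, so these
    # only roughly correspond to cache footprints in KB/MB.
    for n in (1 << 12, 1 << 16, 1 << 20, 1 << 22):
        print(f"{n:>8} elements: {ns_per_access(make_chain(n)):6.1f} ns/access")
```

On period hardware the step up is visible when the working set spills out of a 512KB L2 long before it spills out of a 4MB one, which is exactly the effect the cache-sensitive workloads above are hitting.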
 

Isn't Shanghai supposedly going back to 1MB L2s? Even if not, Shanghai and Nehalem are going to have very similar L2/L3 cache capacities.
 
Part of it is the cache architecture.
Core 2 uses a big shared L2 that gets aggressively filled by prefetchers in both the chip and the northbridge.
As noted previously, Intel has lots of room to chuck in bits of data that might come in handy.

AMD uses 512KB of L2 per core, filled by lines evicted from that core's L1, and now with Barcelona there is a shared L3 that takes both lines dumped from the L2s and, I believe, prefetched data (or does that go to L2?).
Future cores were supposed to go for a bigger L3.

But well, they mostly need to actually start getting decent numbers of chips out the door...
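The two fill policies described above can be sketched as a toy simulation: an inclusive L2 that shadows every L1 fill (Core 2 style) versus an exclusive "victim" L2 filled only by L1 cast-outs (K8/K10 style). Everything here is illustrative, not real cache geometry: fully associative LRU, sizes counted in lines.

```python
# Toy model of inclusive vs. exclusive (victim) L2 fill policy.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.lines = capacity, OrderedDict()

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)    # refresh LRU position
            return True
        return False

    def insert(self, addr):
        """Insert addr; return the evicted victim line, or None."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            return self.lines.popitem(last=False)[0]
        return None

def access(l1, l2, addr, exclusive):
    if l1.lookup(addr):
        return "L1 hit"
    hit_l2 = l2.lookup(addr)
    if exclusive and hit_l2:
        del l2.lines[addr]                  # line moves up, freeing L2 space
    victim = l1.insert(addr)
    if exclusive:
        if victim is not None:
            l2.insert(victim)               # L2 holds only L1 cast-outs
    else:
        l2.insert(addr)                     # inclusive: L2 shadows every fill
        # (a real inclusive L2 also back-invalidates L1 on eviction; omitted)
    return "L2 hit" if hit_l2 else "miss"

def distinct_lines_cached(l1_size, l2_size, exclusive):
    l1, l2 = LRUCache(l1_size), LRUCache(l2_size)
    for addr in range(1000):                # stream enough lines to fill both
        access(l1, l2, addr, exclusive)
    return len(set(l1.lines) | set(l2.lines))

if __name__ == "__main__":
    # Exclusive caches ~L1+L2 distinct lines; inclusive only ~L2.
    print("exclusive:", distinct_lines_cached(8, 64, exclusive=True))   # 72
    print("inclusive:", distinct_lines_cached(8, 64, exclusive=False))  # 64
```

This is part of why AMD's small per-core L2 was defensible: the exclusive hierarchy doesn't waste L2 capacity duplicating L1 contents, at the cost of extra traffic shuffling victims around.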
 
I haven't seen any indication that Shanghai will have a larger L2, but its L3 size will triple compared to Barcelona.

I can half remember seeing it discussed, strictly as a rumor, by Dave Kanter on Ace's, with suitable acerbic interjections from Paul DeMone. I've had a quick scan but can't find it. Unless I'm again remembering things which didn't actually happen, the context was a maybe for Shanghai and a likely for the follow-on 45nm product. In any case, take it with a suitable gigatonnage of salt.
 
I can half remember seeing it discussed, strictly as a rumor, by Dave Kanter on Ace's, with suitable acerbic interjections from Paul DeMone.

:LOL:
Good old Paul, what's he up to these days anyway? I never see him on RWT anymore, and now that Ace's is essentially dead...

Edit: a quick check via Google shows he's still active at the spun-off Ace's @ freeforums.org.


I've had a quick scan but can't find it. Unless I'm again remembering things which didn't actually happen, the context was a maybe for Shanghai and a likely for the follow-on 45nm product. In any case, take it with a suitable gigatonnage of salt.

k, I'll keep it in mind. Would be nice to see a larger L2 accompany the larger L3 with Shanghai and its descendants; I just don't think AMD has the transistor density + manufacturing capacity to handle it, even @ 45nm.
 
:LOL:
Good old Paul, what's he up to these days anyway? Never see him on RWT anymore and now that Ace's is essentially dead...

Yes, he's still active on Ace's liferaft. He was either temp-banned or warned for monstering people at RWT, so he had a bit of a dummy spit and left for good, taking his articles with him.


ShaidarHaran said:
k, I'll keep it in mind. Would be nice to see a larger L2 accompany the larger L3 with Shanghai and its descendants; I just don't think AMD has the transistor density + manufacturing capacity to handle it, even @ 45nm.

I'm sure if it's needed (which appears to be a yes) their engineers will manage.
 
Yes, he's still active on Ace's liferaft. He was either temp-banned or warned for monstering people at RWT, so he had a bit of a dummy spit and left for good, taking his articles with him.

:LOL: Sounds like Paul. I think I talked to him once on Ace's (or maybe it was RWT, I don't really remember): I made some rather flattering remarks about him personally while disagreeing with his opinion on a particular subject, and he spat some witty remark back at me. It was fun. Reminds me of me :p

I'm sure if it's needed (which appears to be a yes) their engineers will manage.

Those benchmarks are rather fishy. I think I'll wait until both Phenom and Crysis are officially available and some more reputable sites like B3D, EB, TR, X-bit, digit-life, FS, and numerous others can catalog the performance for us. I don't doubt that K10 could benefit from larger L2s per core, I just question AMD's ability to make it happen in a volume product. Perhaps the next FX based on Shanghai/Montreal will feature beefed-up L2s?
 
<OFFTOPIC>

I've had a few replies from Paul to my dumb-arse questions, and all of them have been pretty good, but this one just floored me (so much so that I saved a copy and go back occasionally for a ponder).
Paul DeMone said:
There are three major elements to microprocessor yield.

1) Defectivity-based yield. Very simple: manufacturing imperfections cause point defects like shorts or opens in devices or interconnect. Unless the defect lands in a structure like SRAM that can be repaired by direct substitution of redundant elements, the device is useless. Also, a dual-CPU device may be salvageable as a single-CPU device.

2) Parametric-based yield. As process feature size falls, the basic electrical characteristics of transistors become more variable. Some circuits are sensitive to device variation under certain operating corners and may fail. Some percentage of devices may draw excessive amounts of leakage current. With increasing variability of both transistors and interconnect within a single device, the use of statistically based timing closure means otherwise functional devices may fail because of one or more hold-time failures.

3) Commercial yield. The mobile, desktop and server MPU markets are extremely competitive. Maximum power is a crucial factor in defining commercial yield (i.e. the fraction of all functional and parametrically acceptable devices saleable without hurting the competitive image of the device). The best devices that come out of the fab combine low leakage and a high maximum clock rate vs. voltage curve. These devices are the basis of BOTH the highest-clocked SKUs within a realistic power limit and also the lowest-power SKUs at any given frequency. Unfortunately those golden devices are relatively rare, because the processing variations that produce high speed also tend to produce high leakage. In contrast, devices with high leakage power (but still within spec, or else the devices would be parametric losses) and a low maximum clock rate vs. voltage curve are the least desirable parts. These typically sell as the slowest SKUs in high-power (desktop and server) segments. In some cases the maximum frequency these devices achieve while still falling within the device's spec for maximum voltage and power may not be commercially viable, and thus they are basically unusable, i.e. a form of yield loss. Fortunately the garbage devices (leaky and slow) are also a minority. The majority of production falls on a continuum from low leakage/low speed to high leakage/high speed. The trick for MPU vendors is having designs for which they can sell even the garbage devices (even if only in significantly discounted SKUs). OTOH, if only the golden-ish devices are attractive to buyers, then the vendor is up the creek: low commercial yield.

Up until about 10 or 15 years ago, defectivity yield was just about the entire story. Occasionally an entire wafer batch would be excessively leaky etc. and would have to be scrapped, but that was a relatively rare event. For small devices (100 to 150 mm²) in mature processes it wasn't exceptional for yields to exceed 90%. Device frequency binning was pretty much a bell curve defined by circuit speed. Maximum power was specified for each speed grade and was typically set so loosely that practically every last device could meet it.

These days, for most high-end MPUs, parametric yield loss is probably much more significant than classic defectivity yield loss. Maximum power specs and speed grades are determined together, to trade off dynamic and static power variation and make the device family as competitive as possible without giving rise to risks of significant commercial yield loss. That is why you now see 1) a single TDP spec for a range of frequency bins, and 2) anecdotal reports of one MPU sample using much less power than its datasheet TDP while another runs very close to TDP.
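The three yield components DeMone describes can be sketched as a toy Monte Carlo. All the numbers below (defect rate, the speed-leakage correlation, and the cut-off thresholds) are made up for illustration; only the structure (three sequential filters, with speed and leakage positively correlated) follows his description.

```python
# Toy Monte Carlo of the three yield components: defectivity, parametric,
# and commercial. Illustrative numbers only, not real process data.
import random

def simulate(dies=100_000, defect_rate=0.10, seed=1):
    random.seed(seed)
    sellable = 0
    for _ in range(dies):
        # 1) Defectivity yield: a point defect kills the die outright
        #    (a crude Bernoulli stand-in for a Poisson defect model).
        if random.random() < defect_rate:
            continue
        # Process variation: fast transistors also leak more, so speed and
        # leakage are positively correlated, as in the "continuum" above.
        speed = random.gauss(0.0, 1.0)
        leakage = 0.7 * speed + 0.7 * random.gauss(0.0, 1.0)
        # 2) Parametric yield: leakage beyond spec is scrapped.
        if leakage > 2.5:
            continue
        # 3) Commercial yield: slow AND leaky dies fit no viable SKU.
        if speed < -1.5 and leakage > 1.0:
            continue
        sellable += 1
    return sellable / dies

if __name__ == "__main__":
    print(f"overall sellable fraction: {simulate():.3f}")
```

Because the filters are sequential, the overall yield is roughly the product of the three survival rates, which is why a vendor that can sell even the slow/leaky corner (in discounted SKUs) has a real commercial-yield advantage.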

True, he has a prickly side, but jesus, he knows his shit and isn't backwards about sharing the knowledge. I have the utmost respect for him.

</OFFTOPIC>
 