AMD Bulldozer Core Patent Diagrams

Does anyone feel like humoring a layman/outside observer?

I wonder if there is any way for performance to improve over time through "easily" implemented code optimization, such as compilers and (I guess) libraries tuned for the Bulldozer uArch. Could microcode be updated (really out of my depth here), and would that have any meaningful impact on performance?

The AnandTech review mentions that Windows 8 ought to have a better scheduler that takes the modular CPU architecture into account, which should improve performance somewhat. That's what made me think about it, as it suggests that some of the problems could stem from how the CPU is seen, and thus used, by software.

No doubt there are serious flaws in the design that will have to be rectified; I just wonder how much of the performance penalty stems from the architecture directly and how much is simply due to its novelty.
There is room for that, but do not expect improvements of this kind to exceed 10% at best. And these sorts of improvements are not exclusive to AMD.
 
I've tried to calculate some of the overheads more explicitly, but some of my numbers may not be correct. Regardless, using just the arrays as a floor value is the best case for AMD's funny math. Any elaboration makes the margin left in the 1.2B figure worse.

For cache tags, I am assuming the following: 6T SRAM, 2^23 bytes (8MB) for the L3 with 32-way associativity.
With 64-byte lines, that leaves 2^17 cache lines and cache tags.
For cache tags, I am assuming a 48-bit address space.
2^17 / 2^5 = 2^12 sets.
Tag length = 48 - 12 - 6 = 30

30 bits per tag x 2^17 lines x 6 transistors per bit is roughly 23.6M transistors for the L3 tags.

For the L2, it's 2^15 lines which leads to 2^11 sets with 16-way associativity, and I'm getting 31 bits in the tag.
2^15 x 31 x 6 = ~6.1M per L2.
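
In case anyone wants to redo the arithmetic, here's a quick Python sketch of the tag estimate, using the same assumptions as above (6T cells, 64-byte lines, 48-bit address space); the function is just mine for illustration:

```python
# Rough sketch of the tag-array estimate, assuming 6T SRAM, 64-byte lines
# and a 48-bit address space (all assumptions from the post above).
def tag_transistors(size_bytes, ways, line_bytes=64, addr_bits=48, t_per_bit=6):
    lines = size_bytes // line_bytes                # one tag per cache line
    sets = lines // ways                            # sets for the given associativity
    offset_bits = (line_bytes - 1).bit_length()     # 6 bits for 64-byte lines
    index_bits = (sets - 1).bit_length()            # bits used to pick the set
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, lines * tag_bits * t_per_bit

print(tag_transistors(8 * 2**20, 32))   # 8MB L3, 32-way -> (30, ~23.6M)
print(tag_transistors(2 * 2**20, 16))   # 2MB L2, 16-way -> (31, ~6.1M)
```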


For ECC, I'm assuming the array would use 6T SRAM for the ECC bits as well, but I'm not sure.
If it's implemented with the same scheme as Opteron, that's 2^15 lines x 64 ECC bits x 6 = roughly 12.6M transistors per L2.

I'm not sure what the L3 would have for ECC. It would be another 50.3M transistors if the overhead is the same as what I've calculated for the L2.
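
Same deal for the ECC guess in sketch form: 64 extra bits per 64-byte line, also in 6T SRAM (both the per-line ECC width and the cell type are the assumptions above, not confirmed numbers):

```python
# ECC overhead guess: 64 ECC bits per 64-byte line, 6T SRAM cells (assumed).
def ecc_transistors(size_bytes, line_bytes=64, ecc_bits_per_line=64, t_per_bit=6):
    lines = size_bytes // line_bytes
    return lines * ecc_bits_per_line * t_per_bit

print(ecc_transistors(2 * 2**20))   # per 2MB L2: ~12.6M
print(ecc_transistors(8 * 2**20))   # 8MB L3:     ~50.3M
```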

402.7M for L3 arrays
100.6M for each L2, which is then x4
~805M for L2 + L3
The tags for L2 and L3 add up to another ~50M
The ECC could add up to ~100M more.

That's close to a full billion (~955M) in L2+L3 cache and associated arrays, leaving roughly 200-250 million for everything else.
If all the other controllers and IO took 0 transistors, that leaves only 50-60 million for the cores in each module.
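
Summing it all up against the 1.2B figure, with nothing here being an official number, just the rough estimates from this post:

```python
# Adding up the rough estimates above and comparing against AMD's 1.2B figure.
T = 6                                    # transistors per SRAM bit (assumed 6T)
l3_data = 8 * 2**20 * 8 * T              # ~402.7M for the L3 data array
l2_data = 4 * (2 * 2**20 * 8 * T)        # ~402.7M across the four 2MB L2s
tags    = 23.6e6 + 4 * 6.1e6             # ~48M of L3 + L2 tags
ecc     = 50.3e6 + 4 * 12.6e6            # ~101M of ECC, if the guess holds

cache_total = l3_data + l2_data + tags + ecc
print(f"cache + arrays: {cache_total / 1e6:.0f}M")            # ~955M
print(f"left of 1.2B:   {(1.2e9 - cache_total) / 1e6:.0f}M")  # ~245M for everything else
```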

I'm thinking there are inconsistencies still in AMD's counts, and that 1.2B is too low.
 
On top of that, I remember AMD saying they moved to 8T SRAM at the 32nm process, at least for some of their cache structures.

That alone would add a few more transistors to your math :???:
 
That was for Llano's L1 caches, but Llano is an energy-efficient architecture, unlike Bulldozer.
 
This article:

http://semiaccurate.com/2010/02/10/amd-finally-outs-32nm-llano-core/

says that L1 in Llano is a new architecture which is also used in BD.

4 modules at 211M plus 477M for L3 seems to make 1321M, and looking at a die picture, the stuff in the centre looks about the same size as the non-L2 portion of a module, i.e. around another 100M, for a total of ~1.4 billion transistors.

Lower-overhead ECC would save around 50M transistors, say, so it would not make much of a dent in the excess.
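
The same back-of-the-envelope sum in sketch form; note the ~100M for the centre of the die is only my guess from the die shot, not anything AMD has stated:

```python
# Reconciling the per-module and L3 counts quoted above with the die shot;
# the ~100M uncore figure is only a guess from the photo.
module = 211e6
l3     = 477e6
uncore = 100e6                      # assumed: centre area ~ non-L2 part of a module

total = 4 * module + l3 + uncore
print(f"estimated total: {total / 1e6:.0f}M")                  # ~1421M
print(f"with ~50M saved on ECC: {(total - 50e6) / 1e6:.0f}M")  # still well above 1.2B
```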
 
So we have:
- 2BT previously claimed by AMD themselves
- 1.4BT calculated from the info in AMD's ISSCC documents plus some very good guesses
- 1.2BT, the new figure given by AMD


:?:
 
 

And it's been recalled. I can already hear the AMD apologists / standard MS haters crying how "M$" is broken and obviously wants to crater Christmas sales and how there's like 40% performance bumps just waiting to happen except that MS is inept and will never let it be that good and blah blah.

Followed by tons of blog and forum posts of "OMG my benchmarks went up 10% but it's SOOO MUCH SMOOOOTHER that you can't just measure how awesome it now is..."

:D
 