AMD Bulldozer Core Patent Diagrams

Does anyone feel like humoring a layman/outside observer?

I wonder if there is any way for performance to improve over time through "easily" implemented code optimization, such as compilers and (I guess) libraries tuned for the Bulldozer uArch. Could microcode be updated (really out of my depth here), and would that have any meaningful impact on performance?

The AnandTech review mentions that Windows 8 ought to have a better scheduler that takes the modular CPU architecture into account, which should improve performance somewhat. That's what made me think about it, as it suggests that some of the problems could stem from how the CPU is seen, and thus used, by software.

No doubt there are serious flaws in the design that will have to be rectified; I just wonder how much of the performance penalty stems from the architecture directly and how much is simply due to its novelty.
There is room for that, but do not expect improvements of this kind to exceed 10% at best. And these sorts of improvements are not exclusive to AMD.
 
I've tried to calculate some of the overheads more explicitly, but some of my numbers may not be correct. Regardless, using just the arrays as a floor value is the best case for AMD's funny math. Any elaboration makes the margin left in the 1.2B figure worse.

For cache tags, I am assuming the following: 6T SRAM, 2^23 bytes (8MB) for the L3 with 32-way associativity.
With 64-byte lines, that leaves 2^17 cache lines and cache tags.
For cache tags, I am assuming a 48-bit address space.
2^17 / 2^5 = 2^12 sets.
Tag length = 48 - 12 - 6 = 30

30 bits per tag x 2^17 lines x 6 transistors per bit is roughly 23.6M transistors for the L3 tags.

For the L2, it's 2^15 lines which leads to 2^11 sets with 16-way associativity, and I'm getting 31 bits in the tag.
2^15 x 31 x 6 = ~6.1M per L2.
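
In case anyone wants to redo the arithmetic, here's a quick Python sketch of the tag estimate, using the same assumptions as above (6T cells, 64-byte lines, 48-bit address space); the function is just mine for illustration:

```python
# Rough sketch of the tag-array estimate, assuming 6T SRAM, 64-byte lines
# and a 48-bit address space (all assumptions from the post above).
def tag_transistors(size_bytes, ways, line_bytes=64, addr_bits=48, t_per_bit=6):
    lines = size_bytes // line_bytes                # one tag per cache line
    sets = lines // ways                            # sets for the given associativity
    offset_bits = (line_bytes - 1).bit_length()     # 6 bits for 64-byte lines
    index_bits = (sets - 1).bit_length()            # bits used to pick the set
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, lines * tag_bits * t_per_bit

print(tag_transistors(8 * 2**20, 32))   # 8MB L3, 32-way -> (30, ~23.6M)
print(tag_transistors(2 * 2**20, 16))   # 2MB L2, 16-way -> (31, ~6.1M)
```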


For ECC, I'm assuming the array would use 6T SRAM for the ECC bits as well, but I'm not sure.
If it's implemented with the same scheme as Opteron, that's 2^15 lines x 64 ECC bits x 6 = roughly 12.6M transistors per L2.

I'm not sure what the L3 would have for ECC. It would be another 50.3M transistors if the overhead is the same as what I've calculated for the L2.
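
Same deal for the ECC guess in sketch form: 64 extra bits per 64-byte line, also in 6T SRAM (both the per-line ECC width and the cell type are the assumptions above, not confirmed numbers):

```python
# ECC overhead guess: 64 ECC bits per 64-byte line, 6T SRAM cells (assumed).
def ecc_transistors(size_bytes, line_bytes=64, ecc_bits_per_line=64, t_per_bit=6):
    lines = size_bytes // line_bytes
    return lines * ecc_bits_per_line * t_per_bit

print(ecc_transistors(2 * 2**20))   # per 2MB L2: ~12.6M
print(ecc_transistors(8 * 2**20))   # 8MB L3:     ~50.3M
```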

402.7M for L3 arrays
100.6M for each L2, which is then x4
~805M for L2 + L3
The tags for L2 and L3 add up to another ~50M
The ECC could add up to ~100M more.

That's close to a full billion (~955M) in L2+L3 cache and associated arrays, leaving roughly 200-250 million for everything else.
If all the other controllers and IO took 0 transistors, that leaves only 50-60 million for the cores in each module.
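
Summing it all up against the 1.2B figure, with nothing here being an official number, just the rough estimates from this post:

```python
# Adding up the rough estimates above and comparing against AMD's 1.2B figure.
T = 6                                    # transistors per SRAM bit (assumed 6T)
l3_data = 8 * 2**20 * 8 * T              # ~402.7M for the L3 data array
l2_data = 4 * (2 * 2**20 * 8 * T)        # ~402.7M across the four 2MB L2s
tags    = 23.6e6 + 4 * 6.1e6             # ~48M of L3 + L2 tags
ecc     = 50.3e6 + 4 * 12.6e6            # ~101M of ECC, if the guess holds

cache_total = l3_data + l2_data + tags + ecc
print(f"cache + arrays: {cache_total / 1e6:.0f}M")            # ~955M
print(f"left of 1.2B:   {(1.2e9 - cache_total) / 1e6:.0f}M")  # ~245M for everything else
```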

I'm thinking there are inconsistencies still in AMD's counts, and that 1.2B is too low.
 
On top of that, I remember AMD saying they moved to 8T SRAM at the 32nm process, at least for some of their cache structures.

That alone would add a few more transistors to your math :???:
 
That was for Llano's L1 caches, but Llano is an energy-efficient architecture, unlike Bulldozer.
 
This article:

http://semiaccurate.com/2010/02/10/amd-finally-outs-32nm-llano-core/

says that L1 in Llano is a new architecture which is also used in BD.

4 modules at 211M plus 477M for L3 seems to make 1321M, and looking at a die picture, the stuff in the centre looks about the same size as the non-L2 portion of a module, i.e. around another 100M, for a total of ~1.4 billion transistors.

Lower-overhead ECC would save around 50M transistors, say, so it would not make much of a dent in the excess.
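
The same back-of-the-envelope sum in sketch form; note the ~100M for the centre of the die is only my guess from the die shot, not anything AMD has stated:

```python
# Reconciling the per-module and L3 counts quoted above with the die shot;
# the ~100M uncore figure is only a guess from the photo.
module = 211e6
l3     = 477e6
uncore = 100e6                      # assumed: centre area ~ non-L2 part of a module

total = 4 * module + l3 + uncore
print(f"estimated total: {total / 1e6:.0f}M")                  # ~1421M
print(f"with ~50M saved on ECC: {(total - 50e6) / 1e6:.0f}M")  # still well above 1.2B
```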
 
So we have:
- 2BT previously claimed by AMD themselves
- 1.4BT calculated from the info in AMD's ISSCC documents plus some very good guesses
- 1.2BT, the new figure given by AMD


:?:
 
 

And it's been recalled. I can already hear the AMD apologists / standard MS haters crying how "M$" is broken and obviously wants to crater Christmas sales and how there's like 40% performance bumps just waiting to happen except that MS is inept and will never let it be that good and blah blah.

Followed by tons of blog and forum posts of "OMG my benchmarks went up 10% but it's SOOO MUCH SMOOOOTHER that you can't just measure how awesome it now is..."

:D
 