I've tried to calculate some of the overheads more precisely, but some of my numbers may be off. Regardless, counting only the arrays as a floor is the best case for AMD's funny math; any further detail only shrinks the margin left within 1.2B.
For the caches, I am assuming 6T SRAM throughout and a 2^23-byte (8 MB) L3 with 32-way associativity.
With 64-byte lines, that gives 2^17 cache lines, and therefore 2^17 tags.
For the tags, I am assuming a 48-bit address space.
2^17 lines / 2^5 ways = 2^12 sets.
Tag length = 48 - 12 (index bits) - 6 (offset bits) = 30 bits.
30 tag bits x 2^17 lines x 6 transistors per bit is roughly 23.6M transistors for the L3 tags.
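To double-check the L3 tag numbers, here's a small Python sketch that redoes the arithmetic under the assumptions stated above (8 MB L3, 64-byte lines, 32-way, 48-bit addresses, 6T cells). The helper name `tag_transistors` is just something I made up for illustration, and it counts address-tag bits only, ignoring valid/dirty/LRU state bits.

```python
def tag_transistors(cache_bytes, line_bytes, ways, addr_bits, t_per_bit=6):
    """Return (lines, sets, tag_bits, transistors) for a set-associative tag array.

    Only address-tag bits are counted; state bits (valid, dirty, LRU) are ignored.
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power-of-two line size
    index_bits = sets.bit_length() - 1         # log2 of a power-of-two set count
    tag_bits = addr_bits - index_bits - offset_bits
    return lines, sets, tag_bits, lines * tag_bits * t_per_bit


# Assumed: 8 MB L3, 64-byte lines, 32-way, 48-bit address space, 6T SRAM
l3_lines, l3_sets, l3_tag_bits, l3_tag_t = tag_transistors(2**23, 64, 32, 48)
print(l3_lines, l3_sets, l3_tag_bits, l3_tag_t)  # 131072 4096 30 23592960
```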
For the L2 (2^21 bytes, 2 MB), it's 2^15 lines, which with 16-way associativity gives 2^11 sets and 48 - 11 - 6 = 31 tag bits.
2^15 x 31 x 6 = 6.1M per L2.
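Same idea for the L2 tags, as a quick Python check. It assumes a 2 MB L2 per module (2^15 lines x 64 bytes), 16-way, the same 48-bit address space and 6T cells, and four L2s in total; tag bits only.

```python
import math

# Assumed per-module L2: 2 MB, 64-byte lines, 16-way, 48-bit addresses, 6T SRAM
L2_BYTES, LINE_BYTES, WAYS, ADDR_BITS, T_PER_BIT = 2**21, 64, 16, 48, 6
L2_COUNT = 4  # one L2 per module, four modules

lines = L2_BYTES // LINE_BYTES  # 2^15 lines
sets = lines // WAYS            # 2^11 sets
# tag = address bits - index bits - line-offset bits
tag_bits = ADDR_BITS - int(math.log2(sets)) - int(math.log2(LINE_BYTES))
per_l2 = lines * tag_bits * T_PER_BIT  # tag transistors in one L2
all_l2 = per_l2 * L2_COUNT             # across all four L2s
print(tag_bits, per_l2, all_l2)        # 31 6094848 24379392
```

With 30 tag bits the same product would be 5,898,240 (~5.9M), so the ~6.1M per-L2 figure corresponds to 31 bits.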
For ECC, I'm assuming the check bits are also stored in 6T SRAM, but I'm not sure.
If it uses the same scheme as Opteron (64 ECC bits per line), that's 2^15 lines x 64 bits x 6 = roughly 12.6M transistors per L2.
I'm not sure what ECC the L3 would use. If its per-line overhead is the same as the L2's, that's another 50.3M transistors (2^17 x 64 x 6).
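A Python sketch of the ECC estimate. The breakdown of the 64 bits per line is my assumption: SECDED-style coding with 8 check bits per 64-bit data word, which works out to 64 check bits per 64-byte line, all stored in 6T SRAM. The L3 figure just reuses the L2's per-line overhead, as above.

```python
# Assumption: SECDED-style ECC, 8 check bits per 64-bit data word, stored in 6T SRAM
LINE_DATA_BITS = 64 * 8       # 512 data bits per 64-byte line
CHECK_BITS_PER_WORD = 8
T_PER_BIT = 6

ecc_bits_per_line = (LINE_DATA_BITS // 64) * CHECK_BITS_PER_WORD  # 64 check bits per line

l2_ecc = 2**15 * ecc_bits_per_line * T_PER_BIT  # one L2 (2^15 lines)
l3_ecc = 2**17 * ecc_bits_per_line * T_PER_BIT  # L3 (2^17 lines), same per-line overhead assumed
total_ecc = 4 * l2_ecc + l3_ecc                 # four L2s plus the L3
print(l2_ecc, l3_ecc, total_ecc)                # 12582912 50331648 100663296
```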
L3 data array: 2^23 bytes x 8 bits x 6 = ~402.7M
L2 data array: 2^21 bytes x 8 bits x 6 = ~100.7M each, x4 for the four L2s
~805M for the L2 + L3 data arrays
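The data-array floor is just bytes x 8 bits x 6 transistors. A short Python check, assuming an 8 MB L3 and four 2 MB L2s in 6T SRAM:

```python
T_PER_BIT = 6  # 6T SRAM cell

l3_data = 2**23 * 8 * T_PER_BIT     # 8 MB L3 data array
l2_data = 2**21 * 8 * T_PER_BIT     # one 2 MB L2 data array
data_total = l3_data + 4 * l2_data  # L3 plus four L2s
print(f"{l3_data/1e6:.1f}M {l2_data/1e6:.1f}M {data_total/1e6:.1f}M")  # 402.7M 100.7M 805.3M
```

Note that 402.7M + 4 x 100.7M comes to about 805M, not 809M.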
The tags for L2 and L3 add up to another ~48M (23.6M + 4 x 6.1M).
The ECC could add up to ~100M more (50.3M + 4 x 12.6M).
That's roughly 954M, approaching a full billion, in L2+L3 cache and associated arrays, leaving about 246M for everything else.
Even if all the other controllers and I/O took 0 transistors, that leaves only about 61M for the cores in each of the four modules.
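Tallying the exact products instead of the rounded figures, here's a Python sketch of the whole budget. It reuses the assumptions above: 6T SRAM, 30-bit L3 tags, 31-bit L2 tags, 64 ECC bits per line for both levels, four modules, and AMD's 1.2B total. Everything that isn't L2/L3 (controllers, I/O, L1s, etc.) is treated as part of the "remaining" pool.

```python
BUDGET = 1_200_000_000  # AMD's stated transistor count
MODULES = 4             # one 2 MB L2 per module
T = 6                   # transistors per bit (6T SRAM)

data = (2**23 + MODULES * 2**21) * 8 * T             # L3 + L2 data arrays
tags = 2**17 * 30 * T + MODULES * 2**15 * 31 * T     # L3 tags (30 bits) + L2 tags (31 bits)
ecc = (2**17 + MODULES * 2**15) * 64 * T             # 64 ECC bits per line, L3 + L2
cache_total = data + tags + ecc
remaining = BUDGET - cache_total                     # left for everything that isn't L2/L3
per_module = remaining // MODULES
print(cache_total, remaining, per_module)            # 953942016 246057984 61514496
```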
I still think there are inconsistencies in AMD's counts, and that 1.2B is too low.