AMD Bulldozer Core Patent Diagrams

Besides the high latency we can guess at, one thing is interesting: the eight cores in Bulldozer are numbered
[01][27][34][56]
respectively, where the "[]" represent the modules.
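That pairing can be written out as a lookup table. A minimal sketch, assuming the [01][27][34][56] grouping described above (the map and helper are hypothetical names, not from any real API):

```python
# Hypothetical core-to-module map from the pairing above: [01][27][34][56].
MODULE_OF = {0: 0, 1: 0, 2: 1, 7: 1, 3: 2, 4: 2, 5: 3, 6: 3}

def share_module(a, b):
    """True if two logical cores sit in the same Bulldozer module."""
    return MODULE_OF[a] == MODULE_OF[b]

print(share_module(2, 7))  # True under this numbering
print(share_module(2, 3))  # False, despite being adjacent numbers
```

A scheduler that assumes adjacent core numbers share a module would get pairs like (2, 3) wrong under this layout.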

That's so weird..

No wonder the Windows scheduler is confused.

Cheers
 
More results, this time from stock FX8120:
Probably the low idle clocks and TurboCORE are stretching the timings here. The test loads each pair of cores one at a time until all non-repetitive permutations are exhausted.
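"All non-repetitive permutations" of core pairs is just the set of unordered combinations; for eight cores that works out to 28 pairs. A quick sketch:

```python
from itertools import combinations

cores = range(8)
pairs = list(combinations(cores, 2))  # unordered pairs: no (a, a), no repeats
print(len(pairs))  # 28 pairs to test on an 8-core FX-8120
```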
 
Shouldn't the cores pick up speed quite fast once they get any load? The test loads each core for several seconds, isn't that enough time?
 
Probably the low idle clocks and TurboCORE are stretching the timings here. The test loads each pair of cores one at a time until all non-repetitive permutations are exhausted.
That sounds reasonable... This software was written at a time when there was no turbo, and Cool'n'Quiet as well as EIST were only for notebooks. But it's still too high compared to others. My friend's X5570 with EIST on reaches only about 70 ns.
 
PII x6 1055T @ 3.7 & 2.4NB, turbo off
CPU0<->CPU1: 94.4nS per ping-pong
CPU0<->CPU2: 91.6nS per ping-pong
CPU0<->CPU3: 91.6nS per ping-pong
CPU0<->CPU4: 93.6nS per ping-pong
CPU0<->CPU5: 93.3nS per ping-pong
CPU1<->CPU2: 93.8nS per ping-pong
CPU1<->CPU3: 93.3nS per ping-pong
CPU1<->CPU4: 95.2nS per ping-pong
CPU1<->CPU5: 96.7nS per ping-pong
CPU2<->CPU3: 91.2nS per ping-pong
CPU2<->CPU4: 91.8nS per ping-pong
CPU2<->CPU5: 92.3nS per ping-pong
CPU3<->CPU4: 92.3nS per ping-pong
CPU3<->CPU5: 95.0nS per ping-pong
CPU4<->CPU5: 95.4nS per ping-pong
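Pasted output like the list above can be folded into a pair-to-latency table to eyeball the spread. A sketch, assuming the exact "CPUa<->CPUb: X nS per ping-pong" line format (only a few lines reproduced here for brevity):

```python
import re

raw = """CPU0<->CPU1: 94.4nS per ping-pong
CPU2<->CPU3: 91.2nS per ping-pong
CPU4<->CPU5: 95.4nS per ping-pong"""

pattern = re.compile(r"CPU(\d+)<->CPU(\d+): ([\d.]+)nS")
latency = {}
for line in raw.splitlines():
    m = pattern.match(line)
    a, b, ns = int(m.group(1)), int(m.group(2)), float(m.group(3))
    latency[(a, b)] = ns

best = min(latency, key=latency.get)  # lowest-latency pair
print(best, latency[best])            # (2, 3) 91.2
```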
 
Nehalem didn't really integrate the whole "north bridge"; PCI-E wasn't part of it until Lynnfield was released.

The relevant portions of the memory controller and core arbitration logic were moved on-die with Nehalem.
The northbridge has been a shadow of its former self since.
 
The relevant portions of the memory controller and core arbitration logic were moved on-die with Nehalem.
The northbridge has been a shadow of its former self since.
Speaking of which, is AMD's uncore structure not the same as Intel's?
And what is the core arbitration logic?
 
How is it even possible that they initially "mistook" the number of transistors by that much?

Could this have been a reason for some layoffs in the marketing department?
 
AFAIK, the exact count of planar elements for any IC comes from the manufacturing foundry first. But some miscommunication between AMD departments could probably carry the blame.
 
Llano's density is heavily skewed by the presence of a highly compact structure like the IGP, which takes a hefty chunk of the transistor budget.
 
AMD still isn't maintaining a consistent count. The per-module transistor count it disclosed earlier is 213M, so that x4 plus 400M in the L3 is already enough to hit 1.2B on its own, so something still seems off.
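The arithmetic behind that mismatch, using only the figures quoted above:

```python
per_module = 213              # M transistors, AMD's earlier per-module figure
l3 = 400                      # M transistors quoted for the shared L3
total = 4 * per_module + l3
print(total)                  # 1252M -- already past the revised 1.2B die total,
                              # with nothing left for the uncore
```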

Going by 1.2B, the density scaling is notably inferior to Intel, probably due to that bloated uncore.

The Anandtech count for SB may not be comparable to AMD's wonky count. They are using the schematic count of 995M, while physically it has 1.16B.
 
Pretty much wherever AMD's revised count is showing up, it's being disputed by people who can perform arithmetic.

If AMD is using a schematic-level count for BD at 1.2B as opposed to the physical count, perhaps it is not comparable to the count given for each module, which may have been using a physical count.
That could open up a little leeway in the totals per die, but since 100M of each module must be just L2 cache cells, it's not leaving much room for the logic and everything else on the die that's not cache.
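The "100M of each module must be just L2" figure follows from Bulldozer's 2 MB of L2 per module built from standard 6T SRAM cells; a back-of-the-envelope sketch (ignoring tags, ECC bits, and peripheral logic):

```python
l2_bytes = 2 * 1024 * 1024    # 2 MB of L2 per Bulldozer module
transistors_per_bit = 6       # classic 6T SRAM cell
data_array = l2_bytes * 8 * transistors_per_bit
print(data_array / 1e6)       # ~100.7M transistors for the data array alone
```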
 
How many bits of error detection and correction are there per cache line? How many bits in a cache line?
 
From what I know, up to K10 AMD used ECC for the L1D in an 8:1 ratio, i.e. eight ECC bits for every 64 bits of protected data, using a 64-bit Hamming SEC/DED scheme. The ECC bits were organized in separate banks alongside the main L1D array. For the lower cache levels I don't have any reliable information about the protection implementations.

p.s.: In Bulldozer, AMD removed the ECC protection from the L1D caches due to the inclusive relation to the L2, so now any error in the L1D triggers a data reload from the L2, which is ECC-protected.
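The 8:1 figure above falls straight out of the Hamming bound: 64 data bits need 7 check bits for single-error correction, plus one overall parity bit for double-error detection. A sketch of that arithmetic:

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SEC/DED code over a data_bits-wide word."""
    r = 0
    while 2 ** r < data_bits + r + 1:  # Hamming bound for single-error correction
        r += 1
    return r + 1                       # overall parity bit adds double-error detection

print(secded_check_bits(64))  # 8 -> eight ECC bits per 64 data bits, as above
```

The same formula gives the familiar (72, 64) DRAM ECC layout: 72 stored bits for every 64 data bits.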
 
Does anyone feel like humoring a layman/outside observer?

I wonder if there is any way for performance to improve over time through "easily" implemented code optimizations, such as compilers and/or (I guess) libraries tuned for the Bulldozer uArch. Could (really out of my depth here) microcode be updated, if that has any meaningful impact on performance?

The Anandtech review mentions that Windows 8 ought to have a better scheduler that takes the modular CPU architecture into account which ought to improve performance somewhat. That's what made me think about it as it sort of suggested that some problems could stem from how the CPU is seen, and thus used, by software.

No doubt there are serious flaws in design that will have to be rectified, I just wonder how much of the performance penalty stems from the architecture directly and how much is due to simple novelty.
 
It is possible to get some performance boost once the OS manages to distribute threads equally over modules, but it only helps as long as you don't load all the cores, and even then the benefit is often tiny.
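On Linux, the "one thread per module" idea can be approximated by hand with CPU affinity. A sketch, assuming the usual adjacent pairing (cores 0-1, 2-3, ... per module), which, as noted earlier in the thread, may not match what the OS actually enumerates:

```python
import os

def one_core_per_module(num_cores=8, cores_per_module=2):
    """Pick one logical core from each module, assuming adjacent pairing."""
    return {m * cores_per_module for m in range(num_cores // cores_per_module)}

target = one_core_per_module()
print(target)  # {0, 2, 4, 6}

# Pin the current process to those cores so each module's sibling core
# stays unloaded (os.sched_setaffinity is Linux-only; needs >= 8 CPUs).
if hasattr(os, "sched_setaffinity") and os.cpu_count() >= 8:
    os.sched_setaffinity(0, target)
```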

The biggest problem seems to be the godawful cache architecture, and the only thing that will fix it is a redesigned chip, which is not going to happen for at least a couple of years.
 