AMD Bulldozer Core Patent Diagrams

Speaking of thread management, if someone here has access to a Bulldozer system, I would be grateful for a test run of the little console application attached to this post. It measures the sync latency between all the CPU/core pairs (physical/virtual/local/remote) present in the system.
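The attached tool itself isn't reproduced here, but for anyone wondering roughly what such a tester does, this is a hypothetical minimal sketch of the setup side on Windows: enumerate the logical CPUs, pin the current thread, and walk every pair. It is not the actual attachment and assumes at most 64 logical CPUs.
Code:
/* Hypothetical sketch (not the attached tool): enumerate logical CPUs,
 * pin the calling thread, and list the pairs a real tester would time. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    int n = (int)si.dwNumberOfProcessors;   /* logical CPUs, assumed <= 64 */

    /* Pin the current thread to CPU0 (one half of a measured pair). */
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << 0);

    /* List every pair that would get a ping-pong measurement. */
    for (int a = 0; a < n; a++)
        for (int b = a + 1; b < n; b++)
            printf("CPU%d<->CPU%d\n", a, b);
    return 0;
}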

It's quite... slow: 30ns within the same module, 116ns across modules. At 3.6GHz that translates into about 108 cycles within a module and a whopping 417.6 cycles across modules. Now, I don't know how the test is structured (if there are any details, please share), but if it's doing some message passing (Send->RSVP, for example), there'll be some overhead associated with OS messaging - still, it looks rather sub-mediocre. This is under Win 8, by the way.
 
It's quite... slow: 30ns within the same module, 116ns across modules. At 3.6GHz that translates into about 108 cycles within a module and a whopping 417.6 cycles across modules. Now, I don't know how the test is structured (if there are any details, please share), but if it's doing some message passing (Send->RSVP, for example), there'll be some overhead associated with OS messaging - still, it looks rather sub-mediocre. This is under Win 8, by the way.
Hmm, so the sync latency is higher than on Core 2, which was only connected via the FSB?
 
It's quite... slow: 30ns within the same module, 116ns across modules. At 3.6GHz that translates into about 108 cycles within a module and a whopping 417.6 cycles across modules.
Damn. The inter-module latency is actually worse than the Phenom I numbers (with the TLB fix), as far as I can remember. But the latency inside a module is puzzling too. Is it possible that the rather small coalescing cache (WCC) is wrecking something here (on top of other things)?
 
There were similar problems with Barcelona when it launched, though I think more in-depth testing showed that this was a worst-case result. This poor showing belied the stupid "native quad core" marketing AMD put out, since an inferior FSB setup had no problem keeping up with it.

It seems to indicate that the Bulldozer uncore has not moved very far from Family 10h. I had posted earlier that I expected some kind of improvement over 10h with regard to the L3 and uncore. It doesn't seem I was right.

I wish that, for once, AMD would surprise me by doing better than expected.
 
AMD doesn't actually need a super-fast L3, since their implementation is [mostly] exclusive of the higher levels and the L2 caches are the ones truly burdened with the coherency traffic. But still, at least BD manages to bump the read bandwidth of its L3 to more "modern" levels, probably thanks to the new bank-interleaved organisation. The access latency took a hit, though.
It's the L2's overall performance that's the troubling factor here -- at least they got the size right, as compensation. The L3 is doing its job, and this time AMD managed to improve its SRAM density by ~20% over the L2.
 
AMD doesn't actually need a super-fast L3, since their implementation is [mostly] exclusive of the higher levels and the L2 caches are the ones truly burdened with the coherency traffic.
Is that certain? The ability to prefetch directly to L1 and bypass the L2 means there is probably a snoop to the L2 and then to both L1 caches.
 
Is that certain? The ability to prefetch directly to L1 and bypass the L2 means there is probably a snoop to the L2 and then to both L1 caches.
Don't forget that the L1d caches are mostly inclusive in the L2, with a write-through policy. That doesn't guarantee the L2 always holds the most recent valid data from the L1, but it still has to be updated as well.
 
Some guy ran the timing test on an FX-4100 at default clocks:
Code:
CPU0<->CPU1:       33.8nS per ping-pong 
CPU0<->CPU2:      136.7nS per ping-pong 
CPU0<->CPU3:      135.9nS per ping-pong 
CPU1<->CPU2:      136.1nS per ping-pong 
CPU1<->CPU3:      136.4nS per ping-pong 
CPU2<->CPU3:       33.6nS per ping-pong
 
Some guy ran the timing test on an FX-4100 at default clocks:
Code:
CPU0<->CPU1:       33.8nS per ping-pong 
CPU0<->CPU2:      136.7nS per ping-pong 
CPU0<->CPU3:      135.9nS per ping-pong 
CPU1<->CPU2:      136.1nS per ping-pong 
CPU1<->CPU3:      136.4nS per ping-pong 
CPU2<->CPU3:       33.6nS per ping-pong
More than 100ns? How? That's about the same level as some MCM designs.
 
I think it's going to be a problem that ALL AMD CPUs share:

Phenom II X6 1055T -- 2800MHz

CPU0<->CPU1: 112.2nS per ping-pong
CPU0<->CPU2: 111.9nS per ping-pong
CPU0<->CPU3: 111.6nS per ping-pong
CPU0<->CPU4: 111.7nS per ping-pong
CPU0<->CPU5: 112.1nS per ping-pong
CPU1<->CPU2: 112.6nS per ping-pong
CPU1<->CPU3: 112.6nS per ping-pong
CPU1<->CPU4: 112.7nS per ping-pong
CPU1<->CPU5: 111.7nS per ping-pong
CPU2<->CPU3: 112.6nS per ping-pong
CPU2<->CPU4: 113.8nS per ping-pong
CPU2<->CPU5: 112.5nS per ping-pong
CPU3<->CPU4: 113.2nS per ping-pong
CPU3<->CPU5: 112.8nS per ping-pong
CPU4<->CPU5: 112.8nS per ping-pong
=============================================
Core 2 Duo E6700 -- 2670MHz

CPU0<->CPU1: 33.9nS per ping-pong


These results were run by two of my friends.
 
Is that certain? The ability to prefetch directly to L1 and bypass the L2 means there is probably a snoop to the L2 and then to both L1 caches.

Maybe they do something stupid, like waiting for store queues to drain before reading data from the L2 when the tags indicate a hit in a foreign L1 cache. You'd end up with something like this (sketched in code below the list):

Miss own L1 cache
Miss own L2 cache
Broadcast read-request through uncore.
Tags in other modules are checked.
Hit in foreign L1
Wait for foreign L1 store queue to drain
Read data from foreign L2.
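
Purely as an illustration of how serialised that chain would be, here is the same hypothesised path written out as C. The function, its arguments, and the whole flow are my own assumptions for illustration, not anything from AMD documentation:
Code:
/* Illustrative only: the lookup order hypothesised above, written out as a
 * function. Every name and the entire flow are assumptions. */
enum data_source { OWN_L1, OWN_L2, FOREIGN_L1_VIA_L2, MEMORY };

enum data_source service_load(int hit_own_l1, int hit_own_l2, int hit_foreign_l1)
{
    if (hit_own_l1) return OWN_L1;      /* fast path, stays in the core     */
    if (hit_own_l2) return OWN_L2;      /* still inside the module          */

    /* Broadcast a read-request through the uncore; the tags in the other
     * modules are checked. */
    if (hit_foreign_l1) {
        /* Hypothesised stall: wait for the foreign L1's store queue to
         * drain so the write-through copy in its L2 is current, then read
         * the line from that foreign L2. Every step here is serialised. */
        return FOREIGN_L1_VIA_L2;
    }
    return MEMORY;                      /* otherwise fall through to DRAM   */
}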

Cheers
 
The 30ns time period for the in-module transfer may be due to the draining of queues and traffic to the WCC, L2 and L1.
The weird part is when it goes across the interconnect, where the latency shoots up for all AMD chips.
Is it querying the memory controller, or is it the SRQ?

Are there outputs for dual/quad/hex AMD cores for comparison?
 
Are there outputs for dual/quad/hex AMD cores for comparison?
Unfortunately, the people in the other forum where I asked for runs of the app haven't really done much testing. The only semi-interesting thing I have is from a Socket 939 X2:
CPU0<->CPU1: 120.9nS per ping-pong

Not sure what specific model or at what speed it was.
 
That big chunk of time spent in the uncore puzzles me.
Perhaps it is waiting on the SRQ to work through its entries, or on buffers in the IMC to empty.
I wonder what Intel is doing differently, since it has had an integrated north bridge since Nehalem.
 
My Phenom II 940 @ 3000MHz
Code:
CPU0<->CPU1:      113.6nS per ping-pong
CPU0<->CPU2:      113.5nS per ping-pong
CPU0<->CPU3:      113.4nS per ping-pong
CPU1<->CPU2:      114.2nS per ping-pong
CPU1<->CPU3:      112.8nS per ping-pong
CPU2<->CPU3:      113.0nS per ping-pong
 
Now, I don't know how the test is structured (if there are any details, please share), but if it's doing some message passing (Send->RSVP, for example), there'll be some overhead associated with OS messaging - still, it looks rather sub-mediocre. This is under Win 8, by the way.
From this article: http://www.anandtech.com/show/1910/3
Michael S. started this extremely interesting thread at the Ace's Hardware technical forum. The result was a little program coded by Michael S. himself, which could measure the latency of cache-to-cache data transfers between two cores or CPUs. In his own words: "it is a tool for comparison of the relative merits of different dual-cores."

Cache2Cache measures the propagation time from a store by one processor to a load by the other processor. The results that we publish are approximately twice the propagation time. For those interested, the source code is available here.
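
Going by that description, the inner loop is presumably something along these lines. This is a rough sketch under those assumptions, not Michael S.'s actual source; the CPU pair and iteration count are arbitrary, and one iteration covers a full round trip, i.e. roughly twice the one-way propagation time:
Code:
/* Sketch of a cache-to-cache "ping-pong" (assumed from the description
 * above). Two threads, pinned to the two CPUs under test, bounce a counter
 * through one shared cache line. */
#include <windows.h>
#include <stdio.h>

#define ITERS 1000000L

/* Keep the flag on its own cache line so nothing else shares it. */
static __declspec(align(64)) volatile LONG flag = 0;

static DWORD WINAPI pong(LPVOID cpu)
{
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << (int)(INT_PTR)cpu);
    for (LONG i = 1; i <= ITERS; i++) {
        while (flag != 2 * i - 1) ;   /* load until the "ping" is visible  */
        flag = 2 * i;                 /* store the reply (the "pong")      */
    }
    return 0;
}

int main(void)
{
    int cpu_a = 0, cpu_b = 1;         /* the pair being measured           */
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << cpu_a);

    HANDLE t = CreateThread(NULL, 0, pong, (LPVOID)(INT_PTR)cpu_b, 0, NULL);

    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    for (LONG i = 1; i <= ITERS; i++) {
        flag = 2 * i - 1;             /* store the "ping"                  */
        while (flag != 2 * i) ;       /* load until the reply is visible   */
    }
    QueryPerformanceCounter(&t1);
    WaitForSingleObject(t, INFINITE);

    double ns = (double)(t1.QuadPart - t0.QuadPart) * 1e9 / (double)f.QuadPart;
    /* One iteration = store + remote load + store + local load,
     * i.e. roughly twice the one-way propagation time. */
    printf("CPU%d<->CPU%d: %8.1fnS per ping-pong\n", cpu_a, cpu_b, ns / ITERS);
    return 0;
}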
 
That big chunk of time spent in the uncore puzzles me.
Perhaps it is waiting on the SRQ to work through its entries, or on buffers in the IMC to empty.
I wonder what Intel is doing differently, since it has had an integrated north bridge since Nehalem.
Nehalem didn't really integrate the whole north bridge;
PCIe wasn't part of it until Lynnfield was released.

AMD doesn't actually need a super-fast L3, since their implementation is [mostly] exclusive of the higher levels and the L2 caches are the ones truly burdened with the coherency traffic. But still, at least BD manages to bump the read bandwidth of its L3 to more "modern" levels, probably thanks to the new bank-interleaved organisation. The access latency took a hit, though.
It's the L2's overall performance that's the troubling factor here -- at least they got the size right, as compensation. The L3 is doing its job, and this time AMD managed to improve its SRAM density by ~20% over the L2.
Doesn't an exclusive design require lower latency in order to sync more quickly?
 
I just traded my 5GHz Phenom II X6 for a Sandy Bridge. If I had checked this a few days ago, I could have tested the latency at various clocks.

Guys, when testing the latency, do one run at default and then run it again with the HT link and northbridge overclocked.
 
More results, this time from a stock FX-8120:
fx-8120_cache2cache_latency.jpg
 