Old 22-Nov-2011, 08:27   #1051
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,569
Default

Quote:
Originally Posted by 3dilettante View Post
Is that certain? The ability to prefetch directly to L1 and bypass the L2 means there is probably a snoop to the L2 and then to both L1 caches.
Maybe they do something stupid, like wait for store queues to drain before reading data from L2 when tags indicate a hit in a foreign L1 cache. You end up with something like this:

Miss own L1 cache
Miss own L2 cache
Broadcast read-request through uncore.
Tags in other modules are checked.
Hit in foreign L1
Wait for foreign L1 store queue to drain
Read data from foreign L2.

Cheers
__________________
I'm pink, therefore I'm spam
Gubbi is offline   Reply With Quote
Old 22-Nov-2011, 15:28   #1052
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,134
Default

The 30ns time period for the in-module transfer may be due to the draining of queues and traffic to the WCC, L2 and L1.
The weird part is when it goes across the interconnect, where the latency shoots up for all AMD chips.
Is it querying the memory controller, is it the SRQ?

Are there outputs for dual/quad/hex AMD cores for comparison?
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 22-Nov-2011, 15:38   #1053
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218
Default

Quote:
Originally Posted by 3dilettante View Post
Are there outputs for dual/quad/hex AMD cores for comparison?
Unfortunately, the people in the other forum where I asked for runs of the app haven't really done much testing. The only semi-interesting result I have is from an s939 X2:
CPU0<->CPU1: 120.9nS per ping-pong

Not sure what specific model or at what speed it was.
hoho is offline   Reply With Quote
Old 22-Nov-2011, 15:48   #1054
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,134
Default

That big chunk of time spent in the uncore puzzles me.
Perhaps it is waiting on the SRQ to process through or buffers in the IMC to empty.
I wonder what Intel is doing differently, since it has had an integrated north bridge since Nehalem.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 22-Nov-2011, 17:24   #1055
DeF
Member
 
Join Date: May 2007
Posts: 131
Default

My PII 940 @ 3000MHz
Code:
CPU0<->CPU1:      113.6nS per ping-pong
CPU0<->CPU2:      113.5nS per ping-pong
CPU0<->CPU3:      113.4nS per ping-pong
CPU1<->CPU2:      114.2nS per ping-pong
CPU1<->CPU3:      112.8nS per ping-pong
CPU2<->CPU3:      113.0nS per ping-pong
DeF is offline   Reply With Quote
Old 22-Nov-2011, 17:26   #1056
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

Quote:
Originally Posted by AlexV View Post
Now, I don't know how the test is structured (if there are any details please share), but if it's doing some message passing (Send->RSVP, for example), there'll be some overhead associated with OS messaging - still, it looks rather sub-mediocre. This is under Win 8, by the way.
From this article: http://www.anandtech.com/show/1910/3
Quote:
Michael S. started this extremely interesting thread at the Ace's hardware Technical forum. The result was a little program coded by Michael S. himself, which could measure the latency of cache-to-cache data transfer between two cores or CPUs. In his own words: "it is a tool for comparison of the relative merits of different dual-cores."

Cache2Cache measures the propagation time from a store by one processor to a load by the other processor. The results that we publish are approximately twice the propagation time. For those interested, the source code is available here.
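
As a rough illustration of how to read the tool's output: the article says the published figure is approximately twice the store-to-load propagation time, so halving a ping-pong value gives a one-way estimate. The Python sketch below is not part of Cache2Cache. The names `parse_results` and `one_way_estimate` are made up for this example, and the line format is assumed to match the results posted in this thread.

```python
import re

# Assumed line format, taken from the results posted in this thread, e.g.
# "CPU0<->CPU1:      113.6nS per ping-pong"
LINE_RE = re.compile(r"CPU(\d+)<->CPU(\d+):\s*([\d.]+)nS per ping-pong")


def parse_results(text):
    """Return {(cpu_a, cpu_b): ping_pong_ns} for every matching line."""
    results = {}
    for match in LINE_RE.finditer(text):
        cpu_a, cpu_b = int(match.group(1)), int(match.group(2))
        results[(cpu_a, cpu_b)] = float(match.group(3))
    return results


def one_way_estimate(ping_pong_ns):
    """Approximate one-way propagation time: half of the round trip."""
    return ping_pong_ns / 2.0


if __name__ == "__main__":
    sample = "CPU0<->CPU1:      113.6nS per ping-pong"
    for pair, ns in parse_results(sample).items():
        print(pair, ns, "ns round trip,", one_way_estimate(ns), "ns one way (approx.)")
```

The halving is only as good as the "approximately twice" statement in the article, so treat the one-way number as a ballpark rather than a measurement.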
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 23-Nov-2011, 04:01   #1057
denev2004
Member
 
Join Date: Apr 2010
Location: China
Posts: 143
Default

Quote:
Originally Posted by 3dilettante View Post
That big chunk of time spent in the uncore puzzles me.
Perhaps it is waiting on the SRQ to process through or buffers in the IMC to empty.
I wonder what Intel is doing differently, since it has had an integrated north bridge since Nehalem.
Nehalem doesn't really have an integrated "North Bridge".
PCI-E wasn't part of it until Lynnfield was released.

Quote:
Originally Posted by fellix View Post
AMD doesn't actually need a super-fast L3, since their implementation is [mostly] exclusive to the higher levels and the L2 caches are the ones being truly burdened with the coherency traffic. But still, at least BD manages to bump read bandwidth from its L3 to more "modern" levels, probably thanks to the new bank-interleaved organisation. The access latency took a hit, though.
It's the L2 overall performance that's the troubling factor here -- at least they got the size right, as a compensation. The L3 is doing its job, and this time AMD managed to improve its SRAM density by a ~20% over the L2.
Doesn't an exclusive design require lower latency in order to sync more quickly?
__________________
Well I'm not a native English speaker so there might be misuse through my words. I just hope it won't cause too much misunderstanding.
denev2004 is offline   Reply With Quote
Old 23-Nov-2011, 06:52   #1058
almighty
Senior Member
 
Join Date: Dec 2006
Posts: 2,290
Default

I just traded my 5GHz Phenom II X6 for Sandy Bridge. If I had checked this a few days ago, I could have tested the latency at varying clocks.

Guys, when testing the latency, do one run at default and then run it again with an overclock on the HT and northbridge.
__________________
(\__/)
(='.'=) This is Bunny. Put Bunny into your sig to help him take over the world.
(")_(")
almighty is online now   Reply With Quote
Old 23-Nov-2011, 06:53   #1059
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218
Default

More results, this time from stock FX8120:
hoho is offline   Reply With Quote
Old 23-Nov-2011, 07:06   #1060
denev2004
Member
 
Join Date: Apr 2010
Location: China
Posts: 143
Default

Quote:
Originally Posted by hoho View Post
More results, this time from stock FX8120:
Besides the high latency, which we could have guessed, one thing is interesting.
The numbering of the eight cores in Bulldozer is
[01][27][34][56]
respectively, where each "[]" represents a module.

That's so weird..
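
To make the grouping concrete, here is a small Python sketch that takes the numbering reported here at face value and answers whether two core IDs land in the same module. The grouping is only what this post reports, not something verified against AMD documentation, and `module_of` / `same_module` are names invented for the example.

```python
# Core numbering as reported in this post (not independently verified):
# each inner tuple is one Bulldozer module.
REPORTED_MODULES = [(0, 1), (2, 7), (3, 4), (5, 6)]


def module_of(core, modules=REPORTED_MODULES):
    """Return the index of the module that contains `core`."""
    for index, cores in enumerate(modules):
        if core in cores:
            return index
    raise ValueError(f"core {core} is not in any module")


def same_module(core_a, core_b, modules=REPORTED_MODULES):
    """True if both cores sit in the same module (and so share an L2)."""
    return module_of(core_a, modules) == module_of(core_b, modules)


if __name__ == "__main__":
    print(same_module(2, 7))  # adjacent IDs are not required to share a module
    print(same_module(1, 2))
```

With this grouping, cores 2 and 7 share a module while cores 1 and 2 do not, which is the oddity being pointed out.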
__________________
Well I'm not a native English speaker so there might be misuse through my words. I just hope it won't cause too much misunderstanding.
denev2004 is offline   Reply With Quote
Old 23-Nov-2011, 08:12   #1061
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,569
Default

Quote:
Originally Posted by denev2004 View Post
Besides the high latency, which we could have guessed, one thing is interesting.
The numbering of the eight cores in Bulldozer is
[01][27][34][56]
respectively, where each "[]" represents a module.

That's so weird..
No wonder the Windows scheduler is confused.

Cheers
__________________
I'm pink, therefore I'm spam
Gubbi is offline   Reply With Quote
Old 23-Nov-2011, 08:30   #1062
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

Quote:
Originally Posted by hoho View Post
More results, this time from stock FX8120:
Probably the low idle clocks and TurboCORE are dragging the timings out here. The test loads one pair of cores at a time until all non-repeating pairings are exhausted.
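
For reference, the set of pairs the test walks through can be sketched in a few lines of Python. This is an illustration of the pairing only, not the tool's actual code, and `core_pairs` is a name made up for the example.

```python
from itertools import combinations


def core_pairs(num_cores):
    """All unordered pairs (a, b) with a < b, in ascending order."""
    return list(combinations(range(num_cores), 2))


if __name__ == "__main__":
    for cores in (4, 6, 8):
        print(cores, "cores ->", len(core_pairs(cores)), "pairs")
```

That gives n*(n-1)/2 pairs: 6 for a quad-core, 15 for a hex-core and 28 for an eight-core part, which matches the number of lines in the quad and hex results posted in this thread.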
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 23-Nov-2011, 09:02   #1063
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218
Default

Shouldn't the cores pick up speed quite fast once they get any load? The test loads each core for several seconds, isn't that enough time?
hoho is offline   Reply With Quote
Old 23-Nov-2011, 12:05   #1064
denev2004
Member
 
Join Date: Apr 2010
Location: China
Posts: 143
Default

Quote:
Originally Posted by fellix View Post
Probably the low idle clocks and TurboCORE are dragging the timings out here. The test loads one pair of cores at a time until all non-repeating pairings are exhausted.
That sounds reasonable... This software was written at a time when there was no turbo, and CnQ as well as EIST were just for notebook computers. But it's still too high compared to the others. My friend's X5570 with EIST on only reaches about 70ns.
__________________
Well I'm not a native English speaker so there might be misuse through my words. I just hope it won't cause too much misunderstanding.
denev2004 is offline   Reply With Quote
Old 23-Nov-2011, 13:26   #1065
hoom
Senior Member
 
Join Date: Sep 2003
Posts: 2,076
Default

PII x6 1055T @ 3.7 & 2.4NB, turbo off
CPU0<->CPU1: 94.4nS per ping-pong
CPU0<->CPU2: 91.6nS per ping-pong
CPU0<->CPU3: 91.6nS per ping-pong
CPU0<->CPU4: 93.6nS per ping-pong
CPU0<->CPU5: 93.3nS per ping-pong
CPU1<->CPU2: 93.8nS per ping-pong
CPU1<->CPU3: 93.3nS per ping-pong
CPU1<->CPU4: 95.2nS per ping-pong
CPU1<->CPU5: 96.7nS per ping-pong
CPU2<->CPU3: 91.2nS per ping-pong
CPU2<->CPU4: 91.8nS per ping-pong
CPU2<->CPU5: 92.3nS per ping-pong
CPU3<->CPU4: 92.3nS per ping-pong
CPU3<->CPU5: 95.0nS per ping-pong
CPU4<->CPU5: 95.4nS per ping-pong
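
To put those numbers in perspective, here is a minimal Python sketch that summarizes the spread and converts nanoseconds to core cycles. It assumes the "3.7" in the post header is the core clock in GHz. `summarize` and `ns_to_cycles` are names invented for this example.

```python
# Ping-pong times in ns, copied from the 1055T results in this post.
PING_PONG_NS = [94.4, 91.6, 91.6, 93.6, 93.3, 93.8, 93.3, 95.2,
                96.7, 91.2, 91.8, 92.3, 92.3, 95.0, 95.4]

# Assumption: "3.7" in the post header is the core clock in GHz.
CORE_CLOCK_GHZ = 3.7


def summarize(samples_ns):
    """Return (min, max, mean) of a list of latencies in ns."""
    return min(samples_ns), max(samples_ns), sum(samples_ns) / len(samples_ns)


def ns_to_cycles(ns, clock_ghz):
    """Convert a time in ns to clock cycles at the given frequency in GHz."""
    return ns * clock_ghz


if __name__ == "__main__":
    low, high, mean = summarize(PING_PONG_NS)
    print(f"min {low} ns, max {high} ns, mean {mean:.1f} ns")
    print(f"mean round trip is roughly {ns_to_cycles(mean, CORE_CLOCK_GHZ):.0f} core cycles")
```

The spread across all 15 pairs is only a few nanoseconds, so on this chip the result does not seem to depend much on which two cores are talking.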
__________________
But it's DOUBLE CONFIRMED
hoom is offline   Reply With Quote
Old 23-Nov-2011, 14:45   #1066
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,134
Default

Quote:
Originally Posted by denev2004 View Post
Nehalem doesn't really have an integrated "North Bridge".
PCI-E wasn't part of it until Lynnfield was released.
The relevant portions of the memory controller and core arbitration logic were moved on-die with Nehalem.
The northbridge has been a shadow of its former self since.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 24-Nov-2011, 06:17   #1067
denev2004
Member
 
Join Date: Apr 2010
Location: China
Posts: 143
Default

Quote:
Originally Posted by 3dilettante View Post
The relevant portions of the memory controller and core arbitration logic were moved on-die with Nehalem.
The northbridge has been a shadow of its former self since.
Speaking of which, isn't AMD's uncore structure different from Intel's?
What's the core arbitration logic?
__________________
Well I'm not a native English speaker so there might be misuse through my words. I just hope it won't cause too much misunderstanding.
denev2004 is offline   Reply With Quote
Old 24-Nov-2011, 06:19   #1068
denev2004
Member
 
Join Date: Apr 2010
Location: China
Posts: 143
Default

Quote:
Originally Posted by hoho View Post
Shouldn't the cores pick up speed quite fast once they get any load? The test loads each core for several seconds, isn't that enough time?
Maybe we can turn CnQ & turbo off and see what happens.
__________________
Well I'm not a native English speaker so there might be misuse through my words. I just hope it won't cause too much misunderstanding.
denev2004 is offline   Reply With Quote
Old 02-Dec-2011, 11:23   #1069
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

It's official now: AMD Revises Bulldozer Transistor Count: 1.2B, not 2B
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 02-Dec-2011, 11:31   #1070
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 2,160
Default

How is it even possible that they initially "mistook" the number of transistors by that much?

Could this have been a reason for some layoffs in the marketing department?
ToTTenTranz is offline   Reply With Quote
Old 02-Dec-2011, 11:56   #1071
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

AFAIK, the exact count of the planar elements for any IC comes from the manufacturing foundry first. But some miscommunication between AMD departments is probably to blame.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 02-Dec-2011, 13:31   #1072
hoho
Senior Member
 
Join Date: Aug 2007
Location: Estonia
Posts: 1,218
Default

I'd love to know how AMD's transistor density grew going from 65 to 32nm, if those numbers are correct.
hoho is offline   Reply With Quote
Old 02-Dec-2011, 14:19   #1073
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

Llano's density is heavily skewed by the presence of a highly compact structure like the IGP, which takes a hefty chunk of the transistor budget.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 02-Dec-2011, 14:54   #1074
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,134
Default

AMD still isn't maintaining a consistent count. The number of transistors per module it disclosed earlier is 213M, so that x4 plus 400M in the L3 is already enough to hit 1.2B on its own, so something still seems off.

Going by 1.2B, the density scaling is notably inferior to Intel, probably due to that bloated uncore.

The Anandtech count for SB may not be comparable to AMD's wonky count. They are using the schematic count of 995M, while physically it has 1.16B.
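
The arithmetic behind those figures, as a quick Python check. It uses only the numbers quoted in this thread (in millions of transistors); the variable names are just for illustration.

```python
# All counts in millions of transistors, as quoted in this thread.
PER_MODULE = 213
MODULES = 4
L3 = 400
REVISED_TOTAL = 1200    # the revised 1.2B figure
ORIGINAL_TOTAL = 2000   # the original 2B figure

# Four modules plus the L3, with nothing else counted yet.
modules_plus_l3 = PER_MODULE * MODULES + L3

# What the revised total leaves for everything else on the die.
left_for_rest = REVISED_TOTAL - modules_plus_l3

# Relative drop from the original to the revised figure.
reduction = (ORIGINAL_TOTAL - REVISED_TOTAL) / ORIGINAL_TOTAL

# Sandy Bridge: physical count versus schematic count.
SB_SCHEMATIC = 995
SB_PHYSICAL = 1160
sb_ratio = SB_PHYSICAL / SB_SCHEMATIC

if __name__ == "__main__":
    print("modules + L3:", modules_plus_l3, "M")
    print("left for the rest of the die:", left_for_rest, "M")
    print(f"revision cut the count by {reduction:.0%}")
    print(f"SB physical/schematic ratio: {sb_ratio:.2f}")
```

Modules plus L3 come to 1252M, which already exceeds the revised 1.2B before anything else on the die is counted. That is why the figures still look inconsistent.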
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 02-Dec-2011, 14:57   #1075
fehu
Member
 
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 634
Default

Bulldozer!
Now with 40% less transistor!
fehu is offline   Reply With Quote
